-
Notifications
You must be signed in to change notification settings - Fork 106
Enhance & Refactor SDG #895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance & Refactor SDG #895
Conversation
- Adds orchestrator module, synthetic data generation class, strategies for generating data - Renames sdg-config to sdg-metadata to be more apt - Removes old tightly-coupled modules - Makes counting bytes more accurate by using pickle for serialization - Ensures that more data is produced than less - Revamps unittests and brings coverage up to 80+% - Gives users options to remove conflicting generated files before generation - Moves Dask instantiation right before its needed (within Synthetic Data Generator class) Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have to review the tests and one "large" diff, but submitting this meanwhile.
existing_files = [] | ||
|
||
for file in os.listdir(output_path): | ||
if (file.startswith(index_name) and file.endswith(".json")) or (file.startswith(index_name) and file.endswith('_record.json')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there some expected characters between the prefix and suffix? If so, should they be checked for? If not, the pattern can be made more precise.
osbenchmark/synthetic_data_generator/synthetic_data_generator.py
Outdated
Show resolved
Hide resolved
osbenchmark/synthetic_data_generator/synthetic_data_generator.py
Outdated
Show resolved
Hide resolved
osbenchmark/synthetic_data_generator/synthetic_data_generator.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completed commenting. Can re-approve if any updates are pushed.
osbenchmark/synthetic_data_generator/strategies/mapping_strategy.py
Outdated
Show resolved
Hide resolved
self.logger = logging.getLogger(__name__) | ||
# TODO: Set self.mapping_config to automatically point to MappingSyntheticDataGenerator | ||
self.mapping_config = mapping_config or {} | ||
self.mapping_config = mapping_generation_values if mapping_generation_values else {} | ||
|
||
self.generic = Generic(locale=Locale.EN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this tied to the English locale? Perhaps document text could be in other languages as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm planning to enhance mapping SDG in a subsequent PR and will include expanding to different locales.
osbenchmark/synthetic_data_generator/strategies/mapping_strategy.py
Outdated
Show resolved
Hide resolved
document = { | ||
"dog_driver_id": f"DD{generic.numeric_string.generate(length=4)}", | ||
"dog_name": random_mimesis.choice(custom_lists['dog_names']), | ||
"dog_breed": random_mimesis.choice(custom_lists['dog_breeds']), | ||
"license_number": f"{random_mimesis.choice(custom_lists['license_plates'])}{generic.numeric_string.generate(length=4)}", | ||
"favorite_treats": random_mimesis.choice(custom_lists['treats']), | ||
"preferred_tip": random_mimesis.choice(custom_lists['tips']), | ||
"vehicle_type": random_mimesis.choice(custom_lists['vehicle_types']), | ||
"vehicle_make": random_mimesis.choice(custom_lists['vehicle_makes']), | ||
"vehicle_model": random_mimesis.choice(custom_lists['vehicle_models']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be a good example for the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add this to documentation
…c_document and others Signed-off-by: Ian Hoang <[email protected]>
Addressed @gkamat's feedback and have confirmed E2E tests work. Different inputs that have been tested:
Ensured that:
|
Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
Signed-off-by: Ian Hoang <[email protected]>
Description
These are a combination of changes that focus on enhancing and refactoring SDG.
Issues Resolved
#836
#833
Testing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.