IanHoang
Collaborator

Description

This PR combines several changes that enhance and refactor synthetic data generation (SDG).

  • Adds an orchestrator module, a synthetic data generation class, and strategies for generating data
  • Renames sdg-config to sdg-metadata, which describes its purpose more accurately
  • Removes the old tightly coupled modules
  • Makes byte counting more accurate by using pickle for serialization
  • Errs on the side of producing more data than requested rather than less
  • Revamps the unit tests and brings coverage to 80% or higher
  • Gives users the option to remove conflicting generated files before generation
  • Moves Dask instantiation to right before it's needed (within the Synthetic Data Generator class)
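
The byte-counting and over-production bullets can be illustrated with a minimal sketch. It is not the PR's implementation: `serialized_size`, `generate_at_least`, and the `make_document` callback are hypothetical names, and the sketch simply assumes that each document's size is measured as the length of its pickled form and that generation stops only once the target is met or exceeded.

```python
import pickle


def serialized_size(document: dict) -> int:
    """Return the number of bytes in the pickled form of a document (assumed measure)."""
    return len(pickle.dumps(document, protocol=pickle.HIGHEST_PROTOCOL))


def generate_at_least(target_bytes: int, make_document) -> list:
    """Generate documents until the running byte count meets or exceeds the target.

    Because the loop checks the total before each new document, the result
    overshoots the target by at most one document rather than falling short.
    """
    documents = []
    total = 0
    while total < target_bytes:
        doc = make_document()  # hypothetical callback that returns one document
        documents.append(doc)
        total += serialized_size(doc)
    return documents
```

Stopping only after the target is crossed is what makes the output lean toward slightly more data rather than less.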

Issues Resolved

#836
#833

Testing

  • E2E testing with basic and complex (nested and with objects) OpenSearch mappings
  • E2E testing with custom Python modules
  • Added new unit tests to raise coverage

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ian Hoang <[email protected]>
@IanHoang IanHoang added the 2.0.0 label Jul 23, 2025
Collaborator

@gkamat gkamat left a comment

Still have to review the tests and one "large" diff, but submitting this meanwhile.

existing_files = []

for file in os.listdir(output_path):
    if (file.startswith(index_name) and file.endswith(".json")) or (file.startswith(index_name) and file.endswith('_record.json')):
Collaborator

Are there some expected characters between the prefix and suffix? If so, should they be checked for? If not, the pattern can be made more precise.
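
A stricter match along the lines suggested here might look like the sketch below. The naming scheme is an assumption made only for illustration (`<index_name>_<digits>.json` or `<index_name>_record.json`); the PR does not state the actual file name format, and `matches_generated_file` is a hypothetical helper.

```python
import re


def matches_generated_file(filename: str, index_name: str) -> bool:
    """Return True only if the whole filename fits the assumed naming scheme."""
    # Assumed (hypothetical) scheme: <index_name>_<digits>.json or <index_name>_record.json
    pattern = re.compile(rf"{re.escape(index_name)}_(\d+|record)\.json")
    return pattern.fullmatch(filename) is not None
```

Unlike a `startswith`/`endswith` pair, `fullmatch` rejects names such as `logs2_0.json` or `logs-notes.json` when the index name is `logs`, so unrelated files that share the prefix are not picked up.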

Collaborator

@gkamat gkamat left a comment

Completed commenting. Can re-approve if any updates are pushed.

self.logger = logging.getLogger(__name__)
# TODO: Set self.mapping_config to automatically point to MappingSyntheticDataGenerator
self.mapping_config = mapping_config or {}
self.mapping_config = mapping_generation_values if mapping_generation_values else {}

self.generic = Generic(locale=Locale.EN)
Collaborator

Is this tied to the English locale? Perhaps document text could be in other languages as well.

Collaborator Author

@IanHoang IanHoang Jul 29, 2025

I'm planning to enhance mapping SDG in a subsequent PR and will include expanding to different locales.

Comment on lines +67 to +76
document = {
    "dog_driver_id": f"DD{generic.numeric_string.generate(length=4)}",
    "dog_name": random_mimesis.choice(custom_lists['dog_names']),
    "dog_breed": random_mimesis.choice(custom_lists['dog_breeds']),
    "license_number": f"{random_mimesis.choice(custom_lists['license_plates'])}{generic.numeric_string.generate(length=4)}",
    "favorite_treats": random_mimesis.choice(custom_lists['treats']),
    "preferred_tip": random_mimesis.choice(custom_lists['tips']),
    "vehicle_type": random_mimesis.choice(custom_lists['vehicle_types']),
    "vehicle_make": random_mimesis.choice(custom_lists['vehicle_makes']),
    "vehicle_model": random_mimesis.choice(custom_lists['vehicle_models']),
Collaborator

Would be a good example for the documentation.

Collaborator Author

Will add this to the documentation.

@IanHoang
Collaborator Author

Addressed @gkamat's feedback and confirmed that the E2E tests work.

Different inputs that have been tested:

  • basic index mappings
  • basic index mappings with sdg-config
  • complex index mappings
  • complex index mappings with sdg-config
  • custom module
  • custom module with sdg-config

Ensured that:

  • No duplicates were found in the corpora
  • The generated corpora are the expected size
  • The pseudofile produces a byte count similar to writing to a temporary file
  • Rollovers happen when the max file size is exceeded
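
The rollover and duplicate checks listed above can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `write_with_rollover` and `has_duplicates` are hypothetical helpers, documents are assumed to be written as JSON lines, the `<index_name>_<n>.json` naming is invented for the example, and the sketch rolls over just before a file would exceed the limit.

```python
import json
import os
import tempfile


def write_with_rollover(documents, output_dir, index_name, max_file_bytes):
    """Write documents as JSON lines, opening a new file before the limit is exceeded."""
    paths = []
    current = None
    written = 0
    for doc in documents:
        line = (json.dumps(doc) + "\n").encode("utf-8")
        # Roll over when the next line would push a non-empty file past the limit.
        if current is None or (written > 0 and written + len(line) > max_file_bytes):
            if current is not None:
                current.close()
            # Hypothetical naming scheme: <index_name>_<n>.json
            path = os.path.join(output_dir, f"{index_name}_{len(paths)}.json")
            current = open(path, "wb")
            paths.append(path)
            written = 0
        current.write(line)
        written += len(line)
    if current is not None:
        current.close()
    return paths


def has_duplicates(paths):
    """Return True if any line appears more than once across the given files."""
    seen = set()
    for path in paths:
        with open(path, "rb") as handle:
            for line in handle:
                if line in seen:
                    return True
                seen.add(line)
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = write_with_rollover(({"id": i} for i in range(100)), tmp, "example", 200)
        print(len(files), has_duplicates(files))
```

Keeping every line in a set is fine for a small verification run; checking a large corpus would more likely store hashes of lines to bound memory use.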

@IanHoang IanHoang merged commit 3475c06 into opensearch-project:2.0-develop Jul 29, 2025
10 checks passed
@IanHoang IanHoang deleted the 2.0-develop branch July 29, 2025 21:56
gkamat pushed a commit that referenced this pull request Aug 1, 2025
gkamat pushed a commit to gkamat/opensearch-benchmark that referenced this pull request Aug 6, 2025
gkamat pushed a commit that referenced this pull request Aug 19, 2025
IanHoang added a commit that referenced this pull request Aug 20, 2025