Enhance & Refactor SDG #895

IanHoang · 2025-07-17T22:06:38Z

Description

This PR combines several changes that enhance and refactor synthetic data generation (SDG).

Adds an orchestrator module, a synthetic data generation class, and strategies for generating data
Renames sdg-config to sdg-metadata, which describes its purpose more accurately
Removes old, tightly coupled modules
Makes byte counting more accurate by using pickle for serialization
Ensures that the generator errs toward producing more data rather than less
Revamps unit tests and brings coverage to 80% or higher
Gives users the option to remove conflicting generated files before generation
Moves Dask instantiation to right before it's needed (within the Synthetic Data Generator class)
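
For readers unfamiliar with the byte-counting change, here is a minimal sketch of the idea, not the code in this PR. It assumes each document is a plain dict; `serialized_size` and `generate_until` are hypothetical names used only for illustration. The loop stops only once the running total reaches the target, so it can overshoot but never undershoot.

```python
import pickle
from typing import Callable, List, Tuple


def serialized_size(document: dict) -> int:
    """Return the number of bytes the document occupies once pickled."""
    return len(pickle.dumps(document))


def generate_until(target_bytes: int,
                   make_document: Callable[[], dict]) -> Tuple[List[dict], int]:
    """Generate documents until the running byte total reaches the target.

    Hypothetical helper: the total may exceed target_bytes by less than
    the size of the last document, but it is never below the target.
    """
    documents: List[dict] = []
    total = 0
    while total < target_bytes:
        doc = make_document()
        documents.append(doc)
        total += serialized_size(doc)
    return documents, total
```

The overshoot is bounded by the size of a single document, which seems a reasonable trade-off when the goal is to guarantee at least the requested amount of data.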

Issues Resolved

Testing

E2E testing with basic and complex (nested and with objects) OpenSearch mappings
E2E testing with custom python modules
Added new unit tests to raise coverage

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ian Hoang <[email protected]>

gkamat

Still have to review the tests and one "large" diff, but submitting this meanwhile.

osbenchmark/benchmark.py

osbenchmark/synthetic_data_generator/helpers.py

gkamat · 2025-07-25T06:31:32Z

osbenchmark/synthetic_data_generator/helpers.py

+    existing_files = []
+
+    for file in os.listdir(output_path):
+        if (file.startswith(index_name) and file.endswith(".json")) or (file.startswith(index_name) and file.endswith('_record.json')):


Are there some expected characters between the prefix and suffix? If so, should they be checked for? If not, the pattern can be made more precise.
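
One way to tighten the match, sketched under an assumption: that generated files are named either `<index_name>_<number>.json` or `<index_name>_record.json`. That naming scheme is a guess made for illustration and is not confirmed by this PR; `find_generated_files` is a hypothetical helper, not the function under review.

```python
import os
import re
from typing import List


def find_generated_files(output_path: str, index_name: str) -> List[str]:
    """List files that look like generated output for index_name.

    Assumed (hypothetical) naming: "<index_name>_<number>.json" or
    "<index_name>_record.json". Anything else is ignored.
    """
    # re.escape keeps characters such as "." in the index name literal.
    pattern = re.compile(rf"^{re.escape(index_name)}_(\d+|record)\.json$")
    return sorted(name for name in os.listdir(output_path) if pattern.match(name))
```

Anchoring the pattern at both ends avoids matching files for a different index that merely shares the prefix, which a `startswith`/`endswith` pair cannot rule out.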

osbenchmark/synthetic_data_generator/helpers.py

osbenchmark/synthetic_data_generator/synthetic_data_generator.py

gkamat

Completed commenting. Can re-approve if any updates are pushed.

osbenchmark/synthetic_data_generator/strategies/mapping_strategy.py

gkamat · 2025-07-25T22:09:33Z

osbenchmark/synthetic_data_generator/strategies/mapping_strategy.py

        self.logger = logging.getLogger(__name__)
-        # TODO: Set self.mapping_config to automatically point to MappingSyntheticDataGenerator
-        self.mapping_config = mapping_config or {}
+        self.mapping_config = mapping_generation_values if mapping_generation_values else {}

        self.generic = Generic(locale=Locale.EN)


Is this tied to the English locale? Perhaps document text could be in other languages as well.

I'm planning to enhance mapping SDG in a subsequent PR and will include expanding to different locales.

osbenchmark/synthetic_data_generator/strategies/mapping_strategy.py

gkamat · 2025-07-25T22:21:43Z

tests/synthetic_data_generator/sample_custom_module.py

+    document = {
+        "dog_driver_id": f"DD{generic.numeric_string.generate(length=4)}",
+        "dog_name": random_mimesis.choice(custom_lists['dog_names']),
+        "dog_breed": random_mimesis.choice(custom_lists['dog_breeds']),
+        "license_number": f"{random_mimesis.choice(custom_lists['license_plates'])}{generic.numeric_string.generate(length=4)}",
+        "favorite_treats": random_mimesis.choice(custom_lists['treats']),
+        "preferred_tip": random_mimesis.choice(custom_lists['tips']),
+        "vehicle_type": random_mimesis.choice(custom_lists['vehicle_types']),
+        "vehicle_make": random_mimesis.choice(custom_lists['vehicle_makes']),
+        "vehicle_model": random_mimesis.choice(custom_lists['vehicle_models']),


Would be a good example for the documentation.

Will add this to the documentation.

tests/synthetic_data_generator/incorrect_sample_custom_module.py

IanHoang · 2025-07-29T21:22:07Z

Addressed @gkamat's feedback and confirmed that the E2E tests pass.

Different inputs that have been tested:

basic index mappings
basic index mappings with sdg-config
complex index mappings
complex index mappings with sdg-config
custom module
custom module with sdg-config

Ensured that:

No duplicates are found in the corpora
The generated corpora match the expected size
The pseudofile produces a byte count similar to writing to a temporary file
Rollovers happen when the max file size is exceeded
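
A duplicate check like the first item above could be sketched as follows. This is an illustration only, assuming the corpus is newline-delimited JSON with one document per line; `count_duplicate_documents` is a hypothetical helper and not part of this PR.

```python
import hashlib
import json
from typing import Iterable


def count_duplicate_documents(lines: Iterable[str]) -> int:
    """Count documents that repeat an earlier document in the corpus.

    Assumes one JSON document per line (an assumption, not confirmed here).
    """
    seen = set()
    duplicates = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Re-serialize with sorted keys so key order does not hide duplicates.
        canonical = json.dumps(json.loads(line), sort_keys=True)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates
```

Passing an open file handle works because iterating over a file yields its lines; storing digests instead of full documents keeps memory use modest for large corpora.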

Signed-off-by: Ian Hoang <[email protected]>

IanHoang requested review from gkamat, beaioun, rishabh6788, VijayanB and OVI3D0 as code owners July 17, 2025 22:06

IanHoang added 4 commits July 17, 2025 17:10

Quick fix for unittests

4b753ef

Signed-off-by: Ian Hoang <[email protected]>

Remove pin for setuptools in Makefile

d85d8c9

Signed-off-by: Ian Hoang <[email protected]>

Address pylint errors

c4e1976

Signed-off-by: Ian Hoang <[email protected]>

Correct __all__ in init

e7805bb

Signed-off-by: Ian Hoang <[email protected]>

IanHoang added the 2.0.0 label Jul 23, 2025

gkamat reviewed Jul 25, 2025

gkamat approved these changes Jul 25, 2025

Address Gkamat feedback: generate_fake_document --> generate_synthetic_document and others

864bfab

Signed-off-by: Ian Hoang <[email protected]>

IanHoang mentioned this pull request Jul 29, 2025

[SDG] Improve generator complexity in Mapping SDG #905

Open

IanHoang merged commit 3475c06 into opensearch-project:2.0-develop Jul 29, 2025
10 checks passed

IanHoang deleted the 2.0-develop branch July 29, 2025 21:56

gkamat pushed a commit that referenced this pull request Aug 1, 2025

Enhance & Refactor SDG (#895)

29ab40b

Signed-off-by: Ian Hoang <[email protected]>

gkamat pushed a commit to gkamat/opensearch-benchmark that referenced this pull request Aug 6, 2025

Enhance & Refactor SDG (opensearch-project#895)

4caaa5f

Signed-off-by: Ian Hoang <[email protected]>

gkamat pushed a commit that referenced this pull request Aug 19, 2025

Enhance & Refactor SDG (#895)

0f42f00

Signed-off-by: Ian Hoang <[email protected]>

IanHoang added a commit that referenced this pull request Aug 20, 2025

Enhance & Refactor SDG (#895)

f178768

Signed-off-by: Ian Hoang <[email protected]>
