feat(data): normalize HF dataset types and enable lazy image sampling #63


Merged: 8 commits into main, Aug 11, 2025

Conversation

CatherineSue
Collaborator

Motivations

When executing the benchmark below, ImageDatasetLoader spends a long time processing the dataset.

(Screenshot: benchmark output, 2025-08-08 4:53 PM)

The dataset config is:

{
    "source": {
        "type": "huggingface",
        "path": "zhang0jhon/Aesthetic-4K",
        "huggingface_kwargs": {
            "split": "train"
        }
    },
    "prompt_column": "text",
    "image_column": "image",
    "unsafe_allow_large_images": true
}

Without commit 7895404, the benchmark appears to be stuck. This is because with a large dataset like zhang0jhon/Aesthetic-4K (around 1.5k rows), the for loop in ImageSampler._process_loaded_data takes O(n) time over the full dataset before any sampling can begin.

This PR resolves the stall by moving the image and prompt extraction logic into ImageSampler, so rows are only processed when they are actually sampled.
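The shift can be sketched as follows (a minimal sketch with hypothetical function names; the real logic lives in ImageSampler). An HF datasets.Dataset supports len() and integer indexing, which is what makes the lazy variant possible:

```python
import random

def eager_load(dataset, prompt_column, image_column):
    # Old behavior: materialize every row before the benchmark starts,
    # which is O(n) over the whole dataset (the observed stall).
    return [(row[prompt_column], row[image_column]) for row in dataset]

def lazy_sample(dataset, prompt_column, image_column, k=1):
    # New behavior: pick row indices with random.choices and extract the
    # prompt/image for only those rows, on demand.
    indices = random.choices(range(len(dataset)), k=k)
    return [(dataset[i][prompt_column], dataset[i][image_column]) for i in indices]
```

The sketch works with any sequence of row dicts, so a plain list can stand in for a datasets.Dataset when testing.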

Modifications

  • Image loader: stop eager iteration; return datasets.Dataset as-is; for
    DatasetDict auto-select a split (prefer 'train', else first) and warn;
    raise on streaming (IterableDataset*) instructing to disable streaming.
  • Sampler: accept and store dataset_config in base; ImageSampler now
    lazily samples rows with random.choices and extracts prompt/image using
    config (no pre-materialization).
  • Utils: add safe_eval_prompt and use it from ImageSampler for
    prompt_lambda.
  • CLI: pass dataset_config when constructing the sampler.

This eliminates startup stalls from full-dataset iteration and supports
random-access sampling on HF datasets.
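The loader's normalization rules above can be sketched roughly like this (a hypothetical, dependency-free sketch: DatasetDict subclasses dict and maps split names to Dataset objects, and streaming variants are detected by class name here rather than imported):

```python
import warnings

def normalize_hf_dataset(ds):
    # Reject streaming datasets: they do not support the random access
    # that lazy sampling needs.
    if type(ds).__name__ in ("IterableDataset", "IterableDatasetDict"):
        raise ValueError(
            "Streaming datasets are not supported for image sampling; "
            "reload with streaming disabled (streaming=False)."
        )
    # A DatasetDict maps split names to datasets: prefer 'train',
    # otherwise fall back to the first split, and warn about the choice.
    if isinstance(ds, dict):
        split = "train" if "train" in ds else next(iter(ds))
        warnings.warn(f"Auto-selecting split '{split}' from DatasetDict.")
        return ds[split]
    return ds  # a plain datasets.Dataset is returned as-is
```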

@CatherineSue CatherineSue requested a review from slin1237 as a code owner August 8, 2025 23:58
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

I've implemented a significant performance improvement for handling large image datasets, particularly those from HuggingFace. The core change involves shifting from eager dataset iteration, which caused long startup stalls, to a lazy sampling approach. This means the system no longer loads and processes the entire dataset into memory upfront. Instead, data is now sampled and processed on demand, drastically reducing initial processing times and enabling more efficient benchmarking with large-scale image datasets. This also includes better handling of HuggingFace dataset types and improved prompt/image extraction logic.

Highlights

  • Lazy Dataset Loading: The image dataset loader (ImageDatasetLoader) no longer eagerly iterates through entire datasets upon loading. Instead, it now returns HuggingFace datasets.Dataset objects directly, which supports lazy access and significantly reduces startup times for large datasets.
  • HuggingFace DatasetDict Handling: The ImageDatasetLoader now intelligently handles datasets.DatasetDict objects by automatically selecting a split (preferring 'train' if available, otherwise the first one) and issues a warning if multiple splits are found without an explicit choice.
  • Streaming Dataset Restriction: Streaming datasets (IterableDataset and IterableDatasetDict) are now explicitly disallowed for image sampling, with an error message guiding users to disable streaming, ensuring compatibility and preventing unexpected behavior.
  • Lazy Image and Prompt Sampling: The ImageSampler has been refactored to perform lazy sampling of individual rows using random.choices. It now extracts image and prompt data on-the-fly based on the provided dataset_config, eliminating the need to pre-materialize the entire dataset into memory.
  • Safe Prompt Evaluation Utility: A new utility function, safe_eval_prompt, has been introduced to safely evaluate prompt templates, including lambda expressions and direct column access, centralizing and securing prompt generation logic.
  • CLI Integration: The command-line interface (CLI) now correctly passes the dataset_config object to the sampler, ensuring that the new lazy loading and sampling mechanisms are properly configured and utilized during benchmark execution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a performance bottleneck by refactoring the image data loading process to be lazy instead of eager. Moving the prompt and image extraction logic into the ImageSampler and handling Hugging Face Dataset objects without immediate iteration is a solid architectural improvement. The code is well-structured, and the changes to support different dataset types in ImageDatasetLoader are robust.

My main feedback concerns the new tests added for the ImageSampler. They currently pass scenario=None to the sample method, which will cause the tests to fail with an AttributeError. I've provided suggestions to fix this by passing a valid ImageModality scenario. I also have a medium-severity suggestion for improving error feedback in the new safe_eval_prompt utility function.

@CatherineSue
Collaborator Author

CI is failing. Need to rebase after #62 is merged.

… log panel

- Attach console handler alongside panel handler during benchmark logging init so
  logs appear immediately in the terminal and in the dashboard’s log panel.
- Add LoggingManager.enter_live_mode() to swap console handler for delayed handler
  right before entering `dashboard.live`, preventing UI interference and buffering
  logs until the dashboard exits.
- Update benchmark flow to call `enter_live_mode()` just before `with dashboard.live:`.

This reduces perceived hanging during tokenizer/HF dataset loading by surfacing
progress logs in the terminal while still updating the dashboard log panel.
- Image loader: stop eager iteration; return `datasets.Dataset` as-is; for
  `DatasetDict` auto-select a split (prefer 'train', else first) and warn;
  raise on streaming (`IterableDataset*`) instructing to disable streaming.
- Sampler: accept and store `dataset_config` in base; `ImageSampler` now
  lazily samples rows with `random.choices` and extracts prompt/image using
  config (no pre-materialization).
- Utils: add `safe_eval_prompt` and use it from `ImageSampler` for
  `prompt_lambda`.
- CLI: pass `dataset_config` when constructing the sampler.

This eliminates startup stalls from full-dataset iteration and supports
random-access sampling on HF datasets.
…irror to log panel"

This reverts commit ad3871e668536b815767e2b4ba11bdd061f00c10.
@CatherineSue CatherineSue merged commit 4a42ed4 into main Aug 11, 2025
7 checks passed
@CatherineSue CatherineSue deleted the chang/fix-slow-loa branch August 11, 2025 16:55