feat(data): normalize HF dataset types and enable lazy image sampling #63


Merged: 8 commits into main, Aug 11, 2025

Conversation

CatherineSue
Collaborator

Motivations

When executing the benchmark below, ImageDatasetLoader spends a long time processing the dataset.

(Screenshot: benchmark output, 2025-08-08 4:53 PM)

The dataset config is:

{
    "source": {
        "type": "huggingface",
        "path": "zhang0jhon/Aesthetic-4K",
        "huggingface_kwargs": {
            "split": "train"
        }
    },
    "prompt_column": "text",
    "image_column": "image",
    "unsafe_allow_large_images": true
}

Without commit 7895404, the benchmark appears to be stuck. This is because with a large dataset like zhang0jhon/Aesthetic-4K (around 1.5k rows), the for loop in ImageSampler._process_loaded_data takes O(n) time over the full dataset before any sampling can begin.

This PR resolves the stall by moving the image and prompt extraction logic into ImageSampler, so rows are only processed when they are actually sampled.
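The shift can be sketched as follows (a minimal sketch with hypothetical function names; the real logic lives in ImageSampler). An HF datasets.Dataset supports len() and integer indexing, which is what makes the lazy variant possible:

```python
import random

def eager_load(dataset, prompt_column, image_column):
    # Old behavior: materialize every row before the benchmark starts,
    # which is O(n) over the whole dataset (the observed stall).
    return [(row[prompt_column], row[image_column]) for row in dataset]

def lazy_sample(dataset, prompt_column, image_column, k=1):
    # New behavior: pick row indices with random.choices and extract the
    # prompt/image for only those rows, on demand.
    indices = random.choices(range(len(dataset)), k=k)
    return [(dataset[i][prompt_column], dataset[i][image_column]) for i in indices]
```

The sketch works with any sequence of row dicts, so a plain list can stand in for a datasets.Dataset when testing.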

Modifications

  • Image loader: stop eager iteration; return datasets.Dataset as-is; for
    DatasetDict auto-select a split (prefer 'train', else first) and warn;
    raise on streaming (IterableDataset*) instructing to disable streaming.
  • Sampler: accept and store dataset_config in base; ImageSampler now
    lazily samples rows with random.choices and extracts prompt/image using
    config (no pre-materialization).
  • Utils: add safe_eval_prompt and use it from ImageSampler for
    prompt_lambda.
  • CLI: pass dataset_config when constructing the sampler.

This eliminates startup stalls from full-dataset iteration and supports
random-access sampling on HF datasets.
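The loader's normalization rules above can be sketched roughly like this (a hypothetical, dependency-free sketch: DatasetDict subclasses dict and maps split names to Dataset objects, and streaming variants are detected by class name here rather than imported):

```python
import warnings

def normalize_hf_dataset(ds):
    # Reject streaming datasets: they do not support the random access
    # that lazy sampling needs.
    if type(ds).__name__ in ("IterableDataset", "IterableDatasetDict"):
        raise ValueError(
            "Streaming datasets are not supported for image sampling; "
            "reload with streaming disabled (streaming=False)."
        )
    # A DatasetDict maps split names to datasets: prefer 'train',
    # otherwise fall back to the first split, and warn about the choice.
    if isinstance(ds, dict):
        split = "train" if "train" in ds else next(iter(ds))
        warnings.warn(f"Auto-selecting split '{split}' from DatasetDict.")
        return ds[split]
    return ds  # a plain datasets.Dataset is returned as-is
```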

@CatherineSue CatherineSue requested a review from slin1237 as a code owner August 8, 2025 23:58
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

I've implemented a significant performance improvement for handling large image datasets, particularly those from HuggingFace. The core change involves shifting from eager dataset iteration, which caused long startup stalls, to a lazy sampling approach. This means the system no longer loads and processes the entire dataset into memory upfront. Instead, data is now sampled and processed on demand, drastically reducing initial processing times and enabling more efficient benchmarking with large-scale image datasets. This also includes better handling of HuggingFace dataset types and improved prompt/image extraction logic.

Highlights

  • Lazy Dataset Loading: The image dataset loader (ImageDatasetLoader) no longer eagerly iterates through entire datasets upon loading. Instead, it now returns HuggingFace datasets.Dataset objects directly, which supports lazy access and significantly reduces startup times for large datasets.
  • HuggingFace DatasetDict Handling: The ImageDatasetLoader now intelligently handles datasets.DatasetDict objects by automatically selecting a split (preferring 'train' if available, otherwise the first one) and issues a warning if multiple splits are found without an explicit choice.
  • Streaming Dataset Restriction: Streaming datasets (IterableDataset and IterableDatasetDict) are now explicitly disallowed for image sampling, with an error message guiding users to disable streaming, ensuring compatibility and preventing unexpected behavior.
  • Lazy Image and Prompt Sampling: The ImageSampler has been refactored to perform lazy sampling of individual rows using random.choices. It now extracts image and prompt data on-the-fly based on the provided dataset_config, eliminating the need to pre-materialize the entire dataset into memory.
  • Safe Prompt Evaluation Utility: A new utility function, safe_eval_prompt, has been introduced to safely evaluate prompt templates, including lambda expressions and direct column access, centralizing and securing prompt generation logic.
  • CLI Integration: The command-line interface (CLI) now correctly passes the dataset_config object to the sampler, ensuring that the new lazy loading and sampling mechanisms are properly configured and utilized during benchmark execution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a performance bottleneck by refactoring the image data loading process to be lazy instead of eager. Moving the prompt and image extraction logic into the ImageSampler and handling Hugging Face Dataset objects without immediate iteration is a solid architectural improvement. The code is well-structured, and the changes to support different dataset types in ImageDatasetLoader are robust.

My main feedback concerns the new tests added for the ImageSampler. They currently pass scenario=None to the sample method, which will cause the tests to fail with an AttributeError. I've provided suggestions to fix this by passing a valid ImageModality scenario. I also have a medium-severity suggestion for improving error feedback in the new safe_eval_prompt utility function.

@CatherineSue
Collaborator Author

CI is failing. Need to rebase after #62 is merged.

… log panel

- Attach console handler alongside panel handler during benchmark logging init so
  logs appear immediately in the terminal and in the dashboard’s log panel.
- Add LoggingManager.enter_live_mode() to swap console handler for delayed handler
  right before entering `dashboard.live`, preventing UI interference and buffering
  logs until the dashboard exits.
- Update benchmark flow to call `enter_live_mode()` just before `with dashboard.live:`.

This reduces perceived hanging during tokenizer/HF dataset loading by surfacing
progress logs in the terminal while still updating the dashboard log panel.
- Image loader: stop eager iteration; return `datasets.Dataset` as-is; for
  `DatasetDict` auto-select a split (prefer 'train', else first) and warn;
  raise on streaming (`IterableDataset*`) instructing to disable streaming.
- Sampler: accept and store `dataset_config` in base; `ImageSampler` now
  lazily samples rows with `random.choices` and extracts prompt/image using
  config (no pre-materialization).
- Utils: add `safe_eval_prompt` and use it from `ImageSampler` for
  `prompt_lambda`.
- CLI: pass `dataset_config` when constructing the sampler.

This eliminates startup stalls from full-dataset iteration and supports
random-access sampling on HF datasets.
…irror to log panel"

This reverts commit ad3871e668536b815767e2b4ba11bdd061f00c10.
@CatherineSue CatherineSue merged commit 4a42ed4 into main Aug 11, 2025
7 checks passed
@CatherineSue CatherineSue deleted the chang/fix-slow-loa branch August 11, 2025 16:55