feat(data): normalize HF dataset types and enable lazy image sampling #63
Conversation
Summary of Changes
Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a significant performance improvement for handling large image datasets, particularly those from HuggingFace. The core change involves shifting from eager dataset iteration, which caused long startup stalls, to a lazy sampling approach. This means the system no longer loads and processes the entire dataset into memory upfront. Instead, data is now sampled and processed on demand, drastically reducing initial processing times and enabling more efficient benchmarking with large-scale image datasets. This also includes better handling of HuggingFace dataset types and improved prompt/image extraction logic.
Highlights
- Lazy Dataset Loading: The image dataset loader (`ImageDatasetLoader`) no longer eagerly iterates through entire datasets upon loading. Instead, it now returns HuggingFace `datasets.Dataset` objects directly, which support lazy access and significantly reduce startup times for large datasets.
- HuggingFace DatasetDict Handling: `ImageDatasetLoader` now handles `datasets.DatasetDict` objects by automatically selecting a split (preferring 'train' if available, otherwise the first one) and issues a warning if multiple splits are found without an explicit choice.
- Streaming Dataset Restriction: Streaming datasets (`IterableDataset` and `IterableDatasetDict`) are now explicitly disallowed for image sampling, with an error message guiding users to disable streaming, ensuring compatibility and preventing unexpected behavior.
- Lazy Image and Prompt Sampling: `ImageSampler` has been refactored to perform lazy sampling of individual rows using `random.choices`. It now extracts image and prompt data on the fly based on the provided `dataset_config`, eliminating the need to pre-materialize the entire dataset into memory.
- Safe Prompt Evaluation Utility: A new utility function, `safe_eval_prompt`, has been introduced to safely evaluate prompt templates, including lambda expressions and direct column access, centralizing and securing prompt generation logic.
- CLI Integration: The command-line interface (CLI) now correctly passes the `dataset_config` object to the sampler, ensuring that the new lazy loading and sampling mechanisms are properly configured and utilized during benchmark execution.
Code Review
This pull request effectively addresses a performance bottleneck by refactoring the image data loading process to be lazy instead of eager. Moving the prompt and image extraction logic into the `ImageSampler` and handling Hugging Face `Dataset` objects without immediate iteration is a solid architectural improvement. The code is well-structured, and the changes to support different dataset types in `ImageDatasetLoader` are robust.
My main feedback concerns the new tests added for the `ImageSampler`. They currently pass `scenario=None` to the `sample` method, which will cause the tests to fail with an `AttributeError`. I've provided suggestions to fix this by passing a valid `ImageModality` scenario. I also have a medium-severity suggestion for improving error feedback in the new `safe_eval_prompt` utility function.
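As a rough illustration of what a helper in the spirit of `safe_eval_prompt` might do, the sketch below resolves a prompt from either a lambda-expression string or a direct column name. This is an assumption-laden sketch: the real function's signature and hardening strategy are not shown in this thread, and evaluating with empty builtins is just one common containment pattern.

```python
def safe_eval_prompt(prompt_spec, row):
    """Resolve a prompt from a lambda-expression string or a column name.

    Hypothetical sketch; the PR's actual implementation may differ.
    """
    if prompt_spec.strip().startswith("lambda"):
        try:
            # Evaluate with empty builtins so the expression cannot reach
            # arbitrary globals (a common, if partial, hardening pattern).
            fn = eval(prompt_spec, {"__builtins__": {}}, {})
            return fn(row)
        except Exception as exc:
            raise ValueError(f"Failed to evaluate prompt lambda: {exc}") from exc
    # Otherwise treat the spec as a direct column name in the dataset row.
    if prompt_spec in row:
        return row[prompt_spec]
    raise KeyError(f"Prompt column '{prompt_spec}' not found in row")
```

Raising a descriptive `ValueError`/`KeyError` rather than silently returning a default is one way to provide the clearer error feedback the review asks for.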
CI is failing. Need to rebase after #62 is merged.
… log panel

- Attach console handler alongside panel handler during benchmark logging init so logs appear immediately in the terminal and in the dashboard's log panel.
- Add `LoggingManager.enter_live_mode()` to swap the console handler for a delayed handler right before entering `dashboard.live`, preventing UI interference and buffering logs until the dashboard exits.
- Update benchmark flow to call `enter_live_mode()` just before `with dashboard.live:`.

This reduces perceived hanging during tokenizer/HF dataset loading by surfacing progress logs in the terminal while still updating the dashboard log panel.
- Image loader: stop eager iteration; return `datasets.Dataset` as-is; for `DatasetDict` auto-select a split (prefer 'train', else first) and warn; raise on streaming (`IterableDataset*`) instructing to disable streaming.
- Sampler: accept and store `dataset_config` in base; `ImageSampler` now lazily samples rows with `random.choices` and extracts prompt/image using config (no pre-materialization).
- Utils: add `safe_eval_prompt` and use it from `ImageSampler` for `prompt_lambda`.
- CLI: pass `dataset_config` when constructing the sampler.

This eliminates startup stalls from full-dataset iteration and supports random-access sampling on HF datasets.
…irror to log panel

This reverts commit ad3871e668536b815767e2b4ba11bdd061f00c10.
Force-pushed from 711097b to d94954c.
Motivations

When trying to execute the benchmark below, the `ImageDatasetLoader` processes for a long time. The dataset config is:

Without commit 7895404, it appears as if the benchmark is stuck. This is because, with a large dataset like `zhang0jhon/Aesthetic-4K` (which has around 1.5k rows), the for loop in `ImageSampler._process_loaded_data` takes O(n) time. This PR resolves that by moving the image and prompt extraction logic into `ImageSampler`.

Modifications
- Image loader: stop eager iteration; return `datasets.Dataset` as-is; for `DatasetDict`, auto-select a split (prefer 'train', else first) and warn; raise on streaming (`IterableDataset*`), instructing users to disable streaming.
- Sampler: accept and store `dataset_config` in the base class; `ImageSampler` now lazily samples rows with `random.choices` and extracts prompt/image using the config (no pre-materialization).
- Utils: add `safe_eval_prompt` and use it from `ImageSampler` for `prompt_lambda`.
- CLI: pass `dataset_config` when constructing the sampler.

This eliminates startup stalls from full-dataset iteration and supports random-access sampling on HF datasets.
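The lazy sampling step above can be sketched as follows. This is an illustrative outline, not the PR's code: `sample_rows` is a hypothetical name, and a plain list stands in for a `datasets.Dataset`, which likewise supports `len()` and integer indexing.

```python
import random


def sample_rows(dataset, num_requests, seed=None):
    """Draw `num_requests` row indices with replacement, fetching rows lazily.

    Only the k sampled rows are materialized, so startup cost is O(k)
    rather than O(n) over the whole dataset.
    """
    rng = random.Random(seed)
    indices = rng.choices(range(len(dataset)), k=num_requests)
    return [dataset[i] for i in indices]


rows = sample_rows(
    [{"prompt": "a"}, {"prompt": "b"}, {"prompt": "c"}],
    num_requests=5,
    seed=0,
)
```

Sampling with replacement via `random.choices` is what allows `num_requests` to exceed the dataset size, which matters for benchmarking small datasets at high request counts.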