Description
Required prerequisites
- I have searched the Issue Tracker and Discussions to confirm that this hasn't already been reported. (+1 or comment there if it has.)
- Consider asking first in a Discussion.
Motivation
The current datagen pipeline interface has usability issues that hinder efficiency:
- File Path Dependency: Requiring file paths at initialization adds rigidity and complicates workflows that use programmatically generated data.
- Inconvenient Input Format: Expecting inputs as file paths instead of direct jsonl data creates unnecessary overhead.
- Lack of Flexibility: The design limits adaptability and increases boilerplate code.
We need to redesign the interface to:
- Support Direct Data Input: Allow inputs as jsonl strings or Python objects.
- Streamline API: Align with best practices from projects like Distilabel and Curator, emphasizing modularity and ease of use.
- Unified Interface: Provide a unified interface by adding a BaseDataGenPipeline module.
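As a rough illustration of the goals above, here is a minimal sketch of what such a base class could look like. Only the name `BaseDataGenPipeline` comes from this issue; `load_records`, `run`, `generate`, and `UppercasePipeline` are hypothetical names chosen for the example, not an existing API. The sketch assumes a `str` is always jsonl text and a `pathlib.Path` is always a file, which avoids guessing whether a string is a path.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Union

Record = Dict[str, Any]
DataInput = Union[str, Path, List[Record]]


def load_records(data: DataInput) -> List[Record]:
    """Normalize a jsonl string, a file path, or a list of dicts into records.

    Assumption for this sketch: str means jsonl text, Path means a jsonl file.
    """
    if isinstance(data, list):
        # Copy each record so the pipeline never mutates the caller's objects.
        return [dict(item) for item in data]
    if isinstance(data, Path):
        text = data.read_text(encoding="utf-8")
    elif isinstance(data, str):
        text = data
    else:
        raise TypeError(f"Unsupported input type: {type(data).__name__}")
    # One JSON object per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


class BaseDataGenPipeline(ABC):
    """Hypothetical base class: subclasses only implement generate()."""

    @abstractmethod
    def generate(self, records: List[Record]) -> List[Record]:
        """Produce output records from already-loaded input records."""

    def run(self, data: DataInput) -> List[Record]:
        # Input handling lives here once, so no pipeline needs a file path
        # at initialization time.
        return self.generate(load_records(data))


class UppercasePipeline(BaseDataGenPipeline):
    """Toy pipeline used only to exercise the interface."""

    def generate(self, records: List[Record]) -> List[Record]:
        return [{**r, "text": str(r["text"]).upper()} for r in records]


if __name__ == "__main__":
    pipeline = UppercasePipeline()
    print(pipeline.run('{"text": "a"}\n{"text": "b"}\n'))
    print(pipeline.run([{"text": "c"}]))
```

With this shape, existing file-based workflows would still work by passing a `Path`, while programmatically generated data can be passed directly as a list of dicts or a jsonl string.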
Solution
No response
Alternatives
No response
Additional context
No response