[Feature Request] Polish Interface of data generation pipeline #1511

@Wendong-Fan

Description

Required prerequisites

Motivation

The current datagen pipeline interface has usability issues that hinder efficiency:

File Path Dependency: Requiring file paths at initialization adds rigidity and complicates workflows with programmatically generated data.
Inconvenient Input Format: Expecting inputs as file paths instead of direct JSONL data creates unnecessary overhead.
Lack of Flexibility: The design limits adaptability and increases boilerplate code.

We need to redesign the interface to:
Support Direct Data Input: Allow inputs as JSONL strings or Python objects.
Streamline API: Align with best practices from projects like Distilabel and Curator, emphasizing modularity and ease of use.

Provide a unified interface by adding a BaseDataGenPipeline module.
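
To make the proposal concrete, here is a minimal sketch of what such a base class could look like. Only the name BaseDataGenPipeline comes from this issue; `load_records`, `generate`, `EchoPipeline`, and the input-handling rules are hypothetical and do not describe the project's actual API. The sketch assumes a `str` is always JSONL content and a `pathlib.Path` is always a file, so the two cases are never ambiguous.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Union

Record = Dict[str, Any]
DataInput = Union[str, Path, List[Record]]


def load_records(data: DataInput) -> List[Record]:
    """Normalize Python objects, a JSONL string, or a file path into records.

    Hypothetical helper: a list is used as-is (shallow-copied), a str is
    parsed as JSONL content, and a Path is read from disk and then parsed.
    """
    if isinstance(data, list):
        return [dict(record) for record in data]
    if isinstance(data, Path):
        text = data.read_text(encoding="utf-8")
    elif isinstance(data, str):
        text = data
    else:
        raise TypeError(f"Unsupported input type: {type(data).__name__}")
    # One JSON document per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


class BaseDataGenPipeline(ABC):
    """Hypothetical shared base: subclasses implement only `generate`."""

    def __init__(self, data: DataInput) -> None:
        self.records = load_records(data)

    @abstractmethod
    def generate(self) -> List[Record]:
        """Produce generated records from `self.records`."""


class EchoPipeline(BaseDataGenPipeline):
    """Toy subclass for illustration: tags each input record."""

    def generate(self) -> List[Record]:
        return [{**record, "generated": True} for record in self.records]


# Direct inputs, no file on disk required.
from_jsonl = EchoPipeline('{"question": "a"}\n{"question": "b"}\n')
from_objects = EchoPipeline([{"question": "c"}])
```

File-based workflows would still work by passing `Path("data.jsonl")`, so existing callers would only need to wrap their path rather than change how they store data.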

Solution

No response

Alternatives

No response

Additional context

No response

Metadata

Labels

P0 (Task with high level priority), enhancement (New feature or request)
