open-agentinstruct

An open-source recreation of the AgentInstruct agentic workflow.

open-agentinstruct is a project aimed at recreating the AgentInstruct agentic workflow. It supports any LiteLLM model to be used in the agentic synthetic data generation worflow. The AgentInstruct workflow involves three agentic step for synthetic data generation based on "seed" data:

Content Transformation: Transforms text content using various agent configurations.
Instruction Generation: Generates instructions based on transformed content.
Instruction Refinement: Refines generated instructions to enhance complexity and challenge.

Supported tasks

The AgentInstruct paper implements the following tasks which are all implemented in open-agentinstruct:

Reading Comprehension
Open Domain Question Answering
Text Modification
Web Agent
Brain Teaser
Analytical Reasoning
Multiple Choice Questions
Data To Text
Fermi
Coding
Text Extraction
Text Classification
Retrieval Augmented Generation
Tool Use
Creative Content Generation
Few Shot Reasoning
Conversation

Supported seed datasets

Any HF datasets:
- The AgentInstruct paper uses the following:
  - Knowledge Pile
  - AutoMathText
  - subset of openstax
  - subset Apache 2.0 from codeparrot/github-code-clean
Any set of user-provided seed .pdfs

Features

LiteLLM compatible LLMs

Installation

Option 1: Install from PyPI (Recommended for users)

Once the package is published, you can install it directly using pip:

pip install open-agentinstruct

Option 2: Install from source (For developers)

Clone the repository:

git clone https://github.com/ThomasRochefortB/open-agentinstruct.git
cd open-agentinstruct

Create a virtual environment (recommended):

python -m venv .venv
source .venv/bin/activate # On Windows use `.venv\Scripts\activate`

Install the package in editable mode along with development dependencies:
```
pip install -e ".[dev]"
```
Set up your API keys necessary to use the desired LiteLLM model(s):
- Create a .env file in the root directory (or wherever you run the command).
- Add your API key(s) to the .env file (the library uses python-dotenv to load them):
```
# Example for OpenAI
OPENAI_API_KEY=your_openai_api_key
# Add other keys as needed (e.g., COHERE_API_KEY, ANTHROPIC_API_KEY)
# ...
```

Usage

The primary way to use the data generation workflow is through the command-line interface:

# Basic usage with a Hugging Face dataset
open-agentinstruct-generate --dataset-names <hf/datasetname> --task-name <your_task_name>

# Example: Generate reading comprehension data from the first 100 chunks of openstax
open-agentinstruct-generate --dataset-names "crumb/openstax-text" --task-name reading_comprehension --max-chunks 100

# Generate data for all tasks from the specified dataset, processing max 100 chunks, skipping refinement, including content
open-agentinstruct-generate --dataset-names "crumb/openstax-text:text:train:20000" --model gemini/gemini-2.0-flash --max-chunks 100 --output-dir ./output

# Example: Generate data for all tasks from a PDF directory, including original content
open-agentinstruct-generate --pdf-dir path/to/your/pdfs --all-tasks --include-content

# See all available options
open-agentinstruct-generate --help

Generated data will be saved to ./data/generated_data/<task_name>.jsonl by default.

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.github/workflows		.github/workflows
open_agentinstruct		open_agentinstruct
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

open-agentinstruct

Table of Contents

Supported tasks

Supported seed datasets

Features

Installation

Usage

About

Uh oh!

Releases 1

Packages

Uh oh!

Languages

License

ThomasRochefortB/open-agentinstruct

Folders and files

Latest commit

History

Repository files navigation

open-agentinstruct

Table of Contents

Supported tasks

Supported seed datasets

Features

Installation

Usage

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Languages

Packages