ThomasRochefortB/open-agentinstruct

open-agentinstruct

An open-source recreation of the AgentInstruct agentic workflow.

open-agentinstruct is a project that recreates the AgentInstruct agentic workflow. Any model supported by LiteLLM can be used in the agentic synthetic data generation workflow. The AgentInstruct workflow involves three agentic steps for synthetic data generation based on "seed" data:

  • Content Transformation: Transforms text content using various agent configurations.
  • Instruction Generation: Generates instructions based on transformed content.
  • Instruction Refinement: Refines generated instructions to enhance complexity and challenge.
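The three steps above can be sketched as a chain of LLM calls. This is a minimal illustration rather than the project's actual implementation: the function name `run_pipeline`, the prompts, and the `refinement_rounds` parameter are all hypothetical, and the model is passed in as a plain callable so the sketch runs without any API key.

```python
from typing import Callable, Dict

# Hypothetical type alias: any function that maps a prompt to completion text.
LLM = Callable[[str], str]


def run_pipeline(seed_text: str, llm: LLM, refinement_rounds: int = 1) -> Dict[str, str]:
    """Illustrative three-step flow: transform, generate, refine."""
    # Step 1: Content Transformation (the prompt wording is an assumption).
    transformed = llm(
        "Rewrite the following text as a study passage:\n\n" + seed_text
    )

    # Step 2: Instruction Generation based on the transformed content.
    instruction = llm(
        "Write one question that can be answered from this passage:\n\n" + transformed
    )

    # Step 3: Instruction Refinement, repeated to increase difficulty.
    for _ in range(refinement_rounds):
        instruction = llm(
            "Make this question more challenging while keeping it answerable:\n\n"
            + instruction
        )

    return {"content": transformed, "instruction": instruction}


if __name__ == "__main__":
    # Stand-in model that only echoes the prompt length, so the sketch runs offline.
    def echo_llm(prompt: str) -> str:
        return f"<response to {len(prompt)} chars>"

    print(run_pipeline("Photosynthesis converts light into chemical energy.", echo_llm))
```

In practice the callable would wrap a real model call (for example through LiteLLM); keeping it as a parameter makes each step easy to test in isolation.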

Supported tasks

The AgentInstruct paper defines the following tasks, all of which are implemented in open-agentinstruct:

  • Reading Comprehension
  • Open Domain Question Answering
  • Text Modification
  • Web Agent
  • Brain Teaser
  • Analytical Reasoning
  • Multiple Choice Questions
  • Data To Text
  • Fermi
  • Coding
  • Text Extraction
  • Text Classification
  • Retrieval Augmented Generation
  • Tool Use
  • Creative Content Generation
  • Few Shot Reasoning
  • Conversation

Supported seed datasets

Features

  • LiteLLM compatible LLMs

Installation

Option 1: Install from PyPI (Recommended for users)

Once the package is published, you can install it directly using pip:

pip install open-agentinstruct

Option 2: Install from source (For developers)

  1. Clone the repository:

    git clone https://github.com/ThomasRochefortB/open-agentinstruct.git
    cd open-agentinstruct
  2. Create a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
  3. Install the package in editable mode along with development dependencies:

    pip install -e ".[dev]"
  4. Set up the API keys required by the LiteLLM model(s) you want to use:

    • Create a .env file in the root directory (or wherever you run the command).
    • Add your API key(s) to the .env file (the library uses python-dotenv to load them):
      # Example for OpenAI
      OPENAI_API_KEY=your_openai_api_key
      # Add other keys as needed (e.g., COHERE_API_KEY, ANTHROPIC_API_KEY)
      # ...
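Before starting a long generation run, it can help to confirm that the expected keys are actually visible to the process. The check below is a small sketch, not part of the package: `missing_api_keys` is a hypothetical helper, and the list of required key names depends on which provider you use.

```python
import os
from typing import Iterable, List, Mapping


def missing_api_keys(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # Load .env from the current directory if python-dotenv is installed.
    try:
        from dotenv import load_dotenv

        load_dotenv()
    except ImportError:
        pass

    # Assumption: only the OpenAI key is needed; adjust for your provider.
    missing = missing_api_keys(["OPENAI_API_KEY"])
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```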

Usage

The primary way to use the data generation workflow is through the command-line interface:

# Basic usage with a Hugging Face dataset
open-agentinstruct-generate --dataset-names <hf/datasetname> --task-name <your_task_name>

# Example: Generate reading comprehension data from the first 100 chunks of openstax
open-agentinstruct-generate --dataset-names "crumb/openstax-text" --task-name reading_comprehension --max-chunks 100

# Example: Generate data with an explicit model and output directory, processing at most 100 chunks
open-agentinstruct-generate --dataset-names "crumb/openstax-text:text:train:20000" --model gemini/gemini-2.0-flash --max-chunks 100 --output-dir ./output

# Example: Generate data for all tasks from a PDF directory, including original content
open-agentinstruct-generate --pdf-dir path/to/your/pdfs --all-tasks --include-content

# See all available options
open-agentinstruct-generate --help

Generated data will be saved to ./data/generated_data/<task_name>.jsonl by default.
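Because the output is JSON Lines, each line of the file is one JSON object and can be loaded with the standard library. The sketch below assumes nothing about the record schema, which is not described here, so it only prints the keys of the first record; `read_jsonl` is a hypothetical helper, and the file name follows the reading comprehension example above.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Default output location for the reading_comprehension task.
    output_file = Path("./data/generated_data/reading_comprehension.jsonl")
    if output_file.exists():
        records = list(read_jsonl(output_file))
        print(f"Loaded {len(records)} records")
        if records:
            # The schema is not documented here, so inspect the keys first.
            print("Fields:", sorted(records[0].keys()))
    else:
        print(f"No output found at {output_file}")
```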
