Skip to content

Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training

Notifications You must be signed in to change notification settings

Danau5tin/tbench-agentic-data-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ€– Terminal Bench Agentic Data Pipeline

TL;DR: I built a multi-agent system with 20+ Claude Code instances working in parallel to generate high-quality training data for terminal-based coding tasks, producing 331+ validated datapoints for my scalable RL training project.

Why build an agentic data pipeline?

Because generating quality training data for coding agents requires creativity, validation, and scaleβ€”tasks perfectly suited for AI collaboration.

🎬 See It In Action

Watch 20+ agents working in parallel to generate training data at scale:

20x agents in parallel

20x_agents_better.mov

6x agents in parallel (easier to follow!)

6x_agents_clear_view.mov

Pipeline Overview


πŸ“š Table of Contents


πŸ—οΈ High-Level Architecture

The pipeline transforms Terminal Bench evaluation tasks into diverse training datapoints through three specialized agent stages:

  1. Seed tasks from Terminal Bench β†’ Idea Agents β†’ Creative variations
  2. Draft ideas β†’ Builder Agents β†’ Complete executable datapoints
  3. Built datapoints β†’ Review Agents β†’ Production-ready training data

All agents work independently in parallel, coordinated by a central Task Manager that prevents duplication and handles failures.


πŸ“ˆ Pipeline Results

Production Output

  • 331 validated datapoints generated
  • 20+ agents working concurrently
  • 3 specialized agent types with distinct roles
  • 100% Docker-validated environments

Datapoint Categories Generated

Category Count Description
Software Engineering 97 API development, CLI tools, code architecture
System Administration 59 Server config, process management, monitoring
Security 42 Vulnerability fixes, authentication, encryption
Data Processing 37 ETL pipelines, data parsing, transformations
Debugging 28 Fix race conditions, memory leaks, logic errors
Machine Learning 17 Model training, data preprocessing, evaluation
File Operations 15 File parsing, I/O optimization, format conversion
Scientific Computing 15 Numerical methods, simulations, data analysis

Most Common Technologies

  • Python: Used in 196 datapoints (59%)
  • CLI Tools: 47 datapoints
  • APIs: 30 datapoints
  • C: 22 datapoints

πŸ”„ Agent Roles & Workflow

Single DP Journey

Each agent type operates independently with specialized tools and clear responsibilities:

🎨 Idea Generation Agents

  • Input: Seed tasks from Terminal Bench
  • Process: Analyze core skills β†’ Generate nΓ—multiplier variations β†’ Select best ideas
  • Output: Draft specifications in shared workspace

Key Innovation: Refinement criteria provided only AFTER brainstorming to maximize creativity

πŸ”¨ Datapoint Builder Agents

  • Input: Draft specifications from idea agents
  • Process: Build complete scenarios β†’ Validate via software script β†’ Iterate until passing
  • Output: Executable datapoints with all required components

Validation Requirements:

  • βœ… Dockerfile builds successfully
  • βœ… Tests fail before agent intervention
  • βœ… All dependencies present
  • βœ… Test weights sum to 1.0

πŸ” Quality Review Agents

  • Input: Validated datapoints from builders
  • Process: Check quality standards β†’ Edit if needed β†’ Re-validate β†’ Categorize
  • Output: Approved datapoints with metadata or rejection with reasons

πŸ“Š Task Manager System

The Task Manager enables parallel agent coordination without complex handoffs:

# Agent claims work atomically
task = tm.get_next_task("idea-agent-07", task_types=["generate_idea"])

# Process independently
draft_ideas = generate_creative_variations(task["seed_data"])

# Complete with results
tm.complete_task(task["id"], "idea-agent-07", {"drafts": draft_ideas})

Features:

  • Atomic task claiming (no collisions)
  • Automatic timeout recovery
  • Parent-child task tracking
  • Real-time status monitoring

πŸ“ Datapoint Structure

Each training datapoint contains:

draft_001_a/
β”œβ”€β”€ prompt.md       # 1-3 sentence task (e.g., "Auth times out with 100+ users. Fix it.")
β”œβ”€β”€ dockerfile      # Ubuntu 24.04 (or similar) environment setup
β”œβ”€β”€ tests.py        # Pytest verification functions
β”œβ”€β”€ weights.json    # Test importance distribution
└── files/          # Additional resources
    β”œβ”€β”€ app.py      # Broken code to fix
    └── config.json # Configuration files

βœ… Validation Pipeline

Shared validation tools ensure quality across all agents:

  1. Docker Build: Environment must build successfully
  2. Test Discovery: Pytest must find all test functions
  3. Fail-First: Tests must fail in initial state
  4. Dependency Check: All required packages present
  5. Weight Validation: Test weights sum to exactly 1.0

πŸ› οΈ Infrastructure & Tools

Workspace Organization

agents/
β”œβ”€β”€ idea_agent_workspace/      # Idea generation tools & instructions
β”œβ”€β”€ dp_builder_workspace/      # Building tools & staging areas
└── review_agent_workspace/    # Review tools & quality checks

shared_workspace/              # Common filesystem for all agents
shared_tools/                  # Validation, patching, utilities
task_manager/                  # Coordination & state management

Agent-Specific Tools

  • Idea Agents: get_task_parameters.py, get_idea_refinement_details.py
  • Builder Agents: create_dp.py, add_dp_to_review.py
  • Review Agents: approve_datapoint.py, cancel_datapoint.py, show_categories_tags.py

Shared Tools

  • validate_datapoint.py - Complete validation suite
  • patch_dp.py - Update datapoint components
  • patch_additional_files.py - Manage resource files

πŸš€ Getting Started

# Clone repository
git clone https://github.com/Danau5tin/tbench-agentic-data-pipeline.git
cd tbench-agentic-data-pipeline

# Install dependencies
uv sync

# Initialize seed tasks
python init_seed_tasks.py <path_to_terminal_bench_tasks>

# Launch agents with Claude Code
# Idea Agent:
"See @agents/idea_agent_workspace/workflow_instructions.md - you are the idea generation agent, go!"

# Builder Agent:
"See @agents/dp_builder_workspace/workflow_instructions.md - you are the datapoint builder agent, go!"

# Review Agent:
"See @agents/review_agent_workspace/workflow_instructions.md - you are the quality review agent, go!"

πŸ’‘ Key Design Decisions

Why Multiple Agent Types?

  • Separation of concerns: Each agent excels at one task
  • Parallel scaling: Multiple instances per type
  • Quality gates: Three-stage validation ensures high standards

Why Shared Filesystem?

  • Simplicity: No complex message passing
  • Reliability: File operations are atomic
  • Debugging: Easy to inspect intermediate states

Why Task Manager?

  • Coordination: Prevents duplicate work
  • Recovery: Handles agent failures gracefully
  • Monitoring: Real-time pipeline visibility

Built with Claude Code πŸ€– - This entire multi-agent system was developed using Claude Code, demonstrating the power of AI agents building infrastructure for other AI agents.

About

Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages