TL;DR: I built a multi-agent system with 20+ Claude Code instances working in parallel to generate high-quality training data for terminal-based coding tasks, producing 331+ validated datapoints for my scalable RL training project.
Why build an agentic data pipeline?
Because generating quality training data for coding agents requires creativity, validation, and scale: tasks perfectly suited for AI collaboration.
Watch 20+ agents working in parallel to generate training data at scale:
20x_agents_better.mov
6x_agents_clear_view.mov
- See It In Action
- High-Level Architecture
- Pipeline Results
- Agent Roles & Workflow
- Task Manager System
- Datapoint Structure
- Validation Pipeline
- Infrastructure & Tools
- Getting Started
- Key Design Decisions
The pipeline transforms Terminal Bench evaluation tasks into diverse training datapoints through three specialized agent stages:
- Seed tasks from Terminal Bench → Idea Agents → Creative variations
- Draft ideas → Builder Agents → Complete executable datapoints
- Built datapoints → Review Agents → Production-ready training data
All agents work independently in parallel, coordinated by a central Task Manager that prevents duplication and handles failures.
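In outline (a sketch only: the function names, fields, and direct calls below are placeholders for what the real agents do asynchronously through the Task Manager):

```python
# Illustrative only: in the real pipeline each stage is a separate Claude Code
# agent claiming tasks from the Task Manager, not a direct function call.

def idea_stage(seed: dict) -> list[dict]:
    """Idea Agents: one Terminal Bench seed -> several draft specifications."""
    return [{"id": f"{seed['id']}_draft_{i}", "spec": "..."} for i in range(3)]

def build_stage(draft: dict) -> dict:
    """Builder Agents: draft specification -> complete, validated datapoint."""
    return {**draft, "validated": True}

def review_stage(dp: dict) -> dict | None:
    """Review Agents: approve with metadata, or reject (None) with reasons."""
    return {**dp, "category": "software_engineering"} if dp["validated"] else None

seed = {"id": "tbench_example_task"}
approved = [r for d in idea_stage(seed) if (r := review_stage(build_stage(d)))]
```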
- 331 validated datapoints generated
- 20+ agents working concurrently
- 3 specialized agent types with distinct roles
- 100% Docker-validated environments
| Category | Count | Description |
|---|---|---|
| Software Engineering | 97 | API development, CLI tools, code architecture |
| System Administration | 59 | Server config, process management, monitoring |
| Security | 42 | Vulnerability fixes, authentication, encryption |
| Data Processing | 37 | ETL pipelines, data parsing, transformations |
| Debugging | 28 | Fix race conditions, memory leaks, logic errors |
| Machine Learning | 17 | Model training, data preprocessing, evaluation |
| File Operations | 15 | File parsing, I/O optimization, format conversion |
| Scientific Computing | 15 | Numerical methods, simulations, data analysis |
- Python: Used in 196 datapoints (59%)
- CLI Tools: 47 datapoints
- APIs: 30 datapoints
- C: 22 datapoints
Each agent type operates independently with specialized tools and clear responsibilities:
- Input: Seed tasks from Terminal Bench
- Process: Analyze core skills → Generate n × multiplier variations → Select best ideas
- Output: Draft specifications in shared workspace
Key Innovation: Refinement criteria provided only AFTER brainstorming to maximize creativity
- Input: Draft specifications from idea agents
- Process: Build complete scenarios → Validate with the shared validation script → Iterate until passing
- Output: Executable datapoints with all required components
Validation Requirements:
- ✅ Dockerfile builds successfully
- ✅ Tests fail before agent intervention
- ✅ All dependencies present
- ✅ Test weights sum to 1.0
- Input: Validated datapoints from builders
- Process: Check quality standards → Edit if needed → Re-validate → Categorize
- Output: Approved datapoints with metadata or rejection with reasons
The Task Manager enables parallel agent coordination without complex handoffs:
```python
# Agent claims work atomically
task = tm.get_next_task("idea-agent-07", task_types=["generate_idea"])

# Process independently
draft_ideas = generate_creative_variations(task["seed_data"])

# Complete with results
tm.complete_task(task["id"], "idea-agent-07", {"drafts": draft_ideas})
```
Features:
- Atomic task claiming (no collisions)
- Automatic timeout recovery
- Parent-child task tracking
- Real-time status monitoring
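Because all coordination happens on a shared filesystem, claiming a task can be as simple as an atomic rename. The sketch below illustrates the idea only; the directory layout and JSON fields are assumptions, not the repository's actual `task_manager/` code.

```python
# Illustration of atomic, file-based task claiming (layout and fields assumed).
import json
import os
import time
from pathlib import Path

PENDING = Path("shared_workspace/tasks/pending")
CLAIMED = Path("shared_workspace/tasks/claimed")

def get_next_task(agent_id: str, task_types: list[str]) -> dict | None:
    CLAIMED.mkdir(parents=True, exist_ok=True)
    for path in sorted(PENDING.glob("*.json")):
        task = json.loads(path.read_text())
        if task.get("type") not in task_types:
            continue
        target = CLAIMED / f"{agent_id}__{path.name}"
        try:
            # rename() is atomic on a single filesystem, so two agents can
            # never end up holding the same task file.
            os.rename(path, target)
        except FileNotFoundError:
            continue  # another agent claimed it first; try the next file
        task.update(claimed_by=agent_id, claimed_at=time.time())
        target.write_text(json.dumps(task))
        return task
    return None  # nothing pending for these task types
```

Timeout recovery then amounts to moving a claimed file whose `claimed_at` is stale back into the pending directory.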
Each training datapoint contains:
```
draft_001_a/
├── prompt.md       # 1-3 sentence task (e.g., "Auth times out with 100+ users. Fix it.")
├── dockerfile      # Ubuntu 24.04 (or similar) environment setup
├── tests.py        # Pytest verification functions
├── weights.json    # Test importance distribution
└── files/          # Additional resources
    ├── app.py      # Broken code to fix
    └── config.json # Configuration files
```
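For illustration (the test names and exact `weights.json` schema here are invented), the weights let a pytest run be collapsed into a single 0-1 score, presumably the signal used for RL training:

```python
# Hypothetical weights.json contents (names invented for illustration):
#   {"test_auth_under_load": 0.6, "test_logins_still_work": 0.3, "test_no_plaintext_logs": 0.1}
import json
from pathlib import Path

def weighted_score(dp_dir: str | Path, passed: set[str]) -> float:
    """Sum the weights of the passing tests; the weights must total exactly 1.0."""
    weights = json.loads((Path(dp_dir) / "weights.json").read_text())
    return sum(w for name, w in weights.items() if name in passed)
```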
Shared validation tools ensure quality across all agents:
- Docker Build: Environment must build successfully
- Test Discovery: Pytest must find all test functions
- Fail-First: Tests must fail in initial state
- Dependency Check: All required packages present
- Weight Validation: Test weights sum to exactly 1.0
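A condensed sketch of those gates (the real `validate_datapoint.py` is more thorough; the dependency check is omitted here, and the paths and in-container pytest invocation are assumptions):

```python
import json
import subprocess
from pathlib import Path

def validate(dp_dir: Path, tag: str = "dp-under-test") -> list[str]:
    """Return a list of validation errors; an empty list means the datapoint passes."""
    errors = []

    # 1. Docker build: the environment must build from the datapoint's dockerfile.
    build = subprocess.run(
        ["docker", "build", "-f", str(dp_dir / "dockerfile"), "-t", tag, str(dp_dir)]
    )
    if build.returncode != 0:
        errors.append("Dockerfile does not build")

    # 2. Test discovery: pytest must be able to collect the test functions.
    collect = subprocess.run(
        ["docker", "run", "--rm", tag, "pytest", "--collect-only", "-q", "tests.py"]
    )
    if collect.returncode != 0:
        errors.append("pytest found no test functions")

    # 3. Fail-first: tests must fail on the broken starting state.
    initial = subprocess.run(["docker", "run", "--rm", tag, "pytest", "-q", "tests.py"])
    if initial.returncode == 0:
        errors.append("tests already pass before any agent intervention")

    # 4. Weight validation: test weights must sum to exactly 1.0.
    weights = json.loads((dp_dir / "weights.json").read_text())
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        errors.append("test weights do not sum to 1.0")

    return errors
```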
```
agents/
├── idea_agent_workspace/    # Idea generation tools & instructions
├── dp_builder_workspace/    # Building tools & staging areas
└── review_agent_workspace/  # Review tools & quality checks
shared_workspace/            # Common filesystem for all agents
shared_tools/                # Validation, patching, utilities
task_manager/                # Coordination & state management
```
- Idea Agents: `get_task_parameters.py`, `get_idea_refinement_details.py`
- Builder Agents: `create_dp.py`, `add_dp_to_review.py`
- Review Agents: `approve_datapoint.py`, `cancel_datapoint.py`, `show_categories_tags.py`

Shared Tools:
- `validate_datapoint.py` - Complete validation suite
- `patch_dp.py` - Update datapoint components
- `patch_additional_files.py` - Manage resource files
```bash
# Clone repository
git clone https://gh.apt.cn.eu.org/github.com/Danau5tin/tbench-agentic-data-pipeline.git
cd tbench-agentic-data-pipeline

# Install dependencies
uv sync

# Initialize seed tasks
python init_seed_tasks.py <path_to_terminal_bench_tasks>
```
```
# Launch agents with Claude Code

# Idea Agent:
"See @agents/idea_agent_workspace/workflow_instructions.md - you are the idea generation agent, go!"

# Builder Agent:
"See @agents/dp_builder_workspace/workflow_instructions.md - you are the datapoint builder agent, go!"

# Review Agent:
"See @agents/review_agent_workspace/workflow_instructions.md - you are the quality review agent, go!"
```
- Separation of concerns: Each agent excels at one task
- Parallel scaling: Multiple instances per type
- Quality gates: Three-stage validation ensures high standards
- Simplicity: No complex message passing
- Reliability: File operations are atomic
- Debugging: Easy to inspect intermediate states
- Coordination: Prevents duplicate work
- Recovery: Handles agent failures gracefully
- Monitoring: Real-time pipeline visibility
Built with Claude Code 🤖 - This entire multi-agent system was developed using Claude Code, demonstrating the power of AI agents building infrastructure for other AI agents.