The `AgentManager` coordinates ephemeral agents that exist only for the duration of a single training example, replacing `vllm_client.generate()` in the GRPO trainer with an orchestration layer that captures full conversation histories for more effective reinforcement learning.
### Current Implementation
1. **API Middleware Proxy**
   - Lightweight FastAPI server that intercepts API calls between agents and vLLM
   - Injects and tracks conversations via a custom `X-Agent-ID` header
   - Maintains a thread-safe conversation history per agent
   - Captures complete request/response pairs as the RL training signal
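The thread-safe per-agent history could be sketched as follows. The class name matches the `ConversationTracker` mentioned later in this document, but the method names and the record shape are illustrative assumptions, not the fork's actual API:

```python
import threading
from collections import defaultdict

class ConversationTracker:
    """Sketch: thread-safe per-agent conversation storage.

    The proxy would call `record` for every intercepted
    request/response pair, keyed by the `X-Agent-ID` header.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._histories = defaultdict(list)  # agent_id -> list of exchanges

    def record(self, agent_id, request_messages, response_message):
        # One lock guards all histories; contention stays low because
        # appends are rare relative to generation time.
        with self._lock:
            self._histories[agent_id].append(
                {"request": request_messages, "response": response_message}
            )

    def get_history(self, agent_id):
        with self._lock:
            # Return a copy so callers cannot mutate tracked state.
            return list(self._histories[agent_id])
```

Taking the lock on reads as well as writes keeps retrieval consistent even while other agents are still mid-conversation.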
2. **Multiprocessing Approach**
   - Uses `multiprocessing.Pool` for parallel agent execution
   - Each agent runs in an isolated process to prevent state contamination
   - Monkey-patches the `requests` library to inject agent-identification headers
   - Process isolation ensures a clean environment for each agent instance
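The header-injection patch might follow this pattern. A stand-in object replaces the real `requests` module here so the sketch runs without network access; in the fork, `requests.request` itself is patched inside each worker process:

```python
class FakeRequestsModule:
    """Stand-in for the real `requests` module, so this sketch runs
    offline; the fork patches `requests` itself in each worker."""

    @staticmethod
    def request(method, url, headers=None):
        # Echo back what would be sent, for inspection.
        return {"method": method, "url": url, "headers": dict(headers or {})}


def monkey_patch_agent_headers(requests_module, agent_id):
    """Wrap requests_module.request so every outgoing call carries an
    X-Agent-ID header, letting the proxy attribute the call."""
    original = requests_module.request

    def patched(method, url, headers=None, **kwargs):
        headers = dict(headers or {})
        headers["X-Agent-ID"] = agent_id
        return original(method, url, headers=headers, **kwargs)

    requests_module.request = patched
```

Because each agent lives in its own process, the patch never leaks across agents, which is the main reason monkey-patching is tolerable here at all.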
3. **Agent Deployment Flow**

```
GRPO Training Step
└── AgentManager.deploy(prompts)
    ├── Generate a unique agent_id for each prompt
    ├── Deploy agents via multiprocessing.Pool
    │   ├── Each process runs _process_one(agent_id, prompt)
    │   │   ├── Monkey-patch requests to add X-Agent-ID
    │   │   └── Call process_one() (e.g., an Aider instance)
    │   └── Multiple agents run in parallel
    ├── API proxy tracks all vLLM interactions
    ├── Await completion with a timeout
    ├── Collect conversation histories for all agent_ids
    └── Return structured completions to the GRPO trainer
```
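The flow above can be sketched minimally as follows. The `deploy` and `_process_one` signatures are assumptions; the real fork launches full agents and routes their traffic through the proxy, whereas this version just echoes to show the fan-out and collection:

```python
import multiprocessing as mp
import uuid

def _process_one(args):
    # In the fork this patches `requests` and runs the real agent
    # (e.g. an Aider instance); here it echoes to show the data flow.
    agent_id, prompt = args
    return agent_id, f"completion for: {prompt}"

def deploy(prompts, timeout=300):
    """Fan prompts out to worker processes, one ephemeral agent each,
    and collect results keyed by a fresh agent_id."""
    jobs = [(uuid.uuid4().hex, prompt) for prompt in prompts]
    # fork context: workers inherit the parent's state, so locally
    # defined helpers resolve without re-importing __main__.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=min(len(jobs), 8)) as pool:
        # map_async + get(timeout) bounds how long a stuck agent
        # can stall the whole training step.
        results = pool.map_async(_process_one, jobs).get(timeout=timeout)
    return dict(results)
```

The timeout on `get()` is what turns a hung agent into a recoverable exception instead of a deadlocked training step.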
4. **GRPO Integration**
   - The GRPO trainer uses `AgentManager.deploy()` to generate completions
   - Agent completions must be properly converted to token IDs for the training loop
   - Maintains compatibility with both direct vLLM and agent-based generation
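The conversion from captured conversations back to trainer-ready token IDs might look like the following. The function name, the history shape, and the `encode` callable are all assumptions; in practice `encode` would be the trainer's tokenizer:

```python
def completions_to_token_ids(histories, encode):
    """Sketch (shape assumed): join each agent's captured assistant
    turns into one completion string, then tokenize it with the
    trainer's `encode` callable (a tokenizer in the real fork)."""
    token_ids = []
    for history in histories:
        # Concatenate the assistant side of every exchange in order.
        text = "".join(turn["response"]["content"] for turn in history)
        token_ids.append(encode(text))
    return token_ids
```

Keeping this step separate from generation means the same conversion works whether completions came from direct vLLM calls or from deployed agents.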
## Challenges and Solutions

### Conversation Tracking Challenges
1. **Asynchronous API Calls**
   - Agents make varying numbers of API calls at unpredictable times
   - Solution: thread-safe conversation tracking with unique agent IDs
   - Locking ensures proper history capture even with concurrent requests
2. **Process Management**
   - Challenge: ensuring clean process termination and resource cleanup
   - Solution: pool-based multiprocessing with timeout handling
   - Cleanup in `finally` blocks ensures resources are released
3. **Proxy Synchronization**
   - Challenge: background tasks in FastAPI may create race conditions
   - Solution: consider making conversation tracking synchronous in the API endpoint
   - Production environments will need more robust synchronization mechanisms
4. **Conversation Continuity**
   - Challenge: ensuring continuous context across multiple API calls
   - Solution: implement validation in the ConversationTracker
   - Track and report discontinuities that could indicate information loss
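The continuity validation could be as simple as a prefix check: in a well-formed multi-turn exchange, each request should extend the previous one. The helper name and history shape are illustrative assumptions:

```python
def find_discontinuity(history):
    """Sketch of the validation described above: return the index of
    the first API call whose request does not start with the previous
    request as a prefix (context dropped or rewritten), else None."""
    for i in range(1, len(history)):
        prev = history[i - 1]["request"]
        curr = history[i]["request"]
        if curr[: len(prev)] != prev:
            return i  # context was lost between calls i-1 and i
    return None
```

A non-None result flags the exact call where the agent's context diverged, which is exactly the information-loss signal the tracker should report.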
### Technical Considerations
1. **Monkey-Patching Approach**
   - Current: patch `requests.request` in each worker process to add custom headers
   - Pros: isolated impact, minimal invasiveness to agent frameworks
   - Alternative: require direct configuration of the agent framework
2. **Conversation Collection**
   - Current: the API proxy collects all conversations by agent_id
   - Challenge: ensuring all API calls are captured before retrieving history
   - Solution: consider a small delay or a synchronization primitive before retrieval
3. **Error Handling**
   - Challenge: individual agent failures shouldn't crash the entire batch
   - Solution: improved error handling in `AgentManager.deploy()`
   - Graceful degradation for failed agents while allowing others to continue
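Graceful degradation reduces to a per-agent error boundary: a failure is returned as data rather than propagated, so one crashed agent cannot take down the batch. The helper name and result shape are assumptions for illustration:

```python
def run_agent_safely(agent_fn, agent_id, prompt):
    """Sketch of a per-agent error boundary: always return a result
    record, marking failures instead of raising."""
    try:
        return {"agent_id": agent_id, "ok": True, "completion": agent_fn(prompt)}
    except Exception as exc:
        # Degrade gracefully: record the failure so the trainer can
        # skip or zero-reward this agent while the rest continue.
        return {"agent_id": agent_id, "ok": False, "error": repr(exc)}
```

`AgentManager.deploy()` would wrap each worker's entry point this way, then filter or down-weight the `ok=False` records when building the training batch.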
## Conclusions and Next Steps
The current implementation successfully achieves:

1. **Process Isolation**: Clean separation of agent environments
2. **Conversation Tracking**: Complete history capture for RL training
3. **Parallel Execution**: Efficient handling of multiple agents
4. **Resource Management**: Proper cleanup of temporary resources
Next development priorities:

1. **Implement `ConversationTracker.get_completion_history()`**: Properly extract and format the complete history
2. **Address race conditions**: Ensure background tasks complete before history retrieval
3. **Enhance error handling**: Improve robustness to individual agent failures
4. **Performance optimization**: Evaluate and optimize the latency introduced by the proxy
5. **Testing**: Develop comprehensive tests for conversation tracking accuracy
This fork enhances the TRL (Transformer Reinforcement Learning) library with agentic capabilities, focusing on training and reinforcing multi-turn coding agents:
- **OpenAI-Compatible vLLM Endpoint**: Drop-in replacement for the OpenAI API, enabling seamless integration with existing tools and agents
- **Direct Agent Integration**: Use existing agent scaffolding and applications directly in the training loop without modification
- **Enterprise-Ready Solutions**: Leverage production-ready agentic frameworks rather than building custom implementations
- **Parallel Agent Execution**: Run multiple instances of the same agent architecture in parallel during training