
Commit 38ef847

sergiopaniego and BjarniHaukur authored and committed
💎 Gemma 3 VLM SFT example script for single-image and multi-image (huggingface#3131)
Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>

Squashed commit messages:
- log answer key to wandb
- all Table HTML logging table
- bump patch
- hmm formatting
- html esacape
- reward isnt string
- [Liger] Liger KTO support (huggingface#2812) Co-authored-by: Kashif Rasul <[email protected]> Co-authored-by: Quentin Gallouédec <[email protected]>
- 🏃 Migrate CI to self-hosted runners (huggingface#3174)
- ❤️‍🩹 [CI] fix transformers dev CI failure (huggingface#3176) Co-authored-by: Quentin Gallouédec <[email protected]>
- ⏯️ Fix: handle None inputs when resuming GRPO Trainer from checkpoint (huggingface#3148) Co-authored-by: Quentin Gallouédec <[email protected]>
- 📎 Fix is_clipped to compute the effective clip_ratio (huggingface#3175) Co-authored-by: Quentin Gallouédec <[email protected]> Co-authored-by: Quentin Gallouédec <[email protected]>
- Fix breaking typo for flash_attention reducing_memory_usage.md (huggingface#3190)
- Show unique prompts in GRPO WandB tables (huggingface#3191)
- 🐗 [CI] Fix trufflehog false positives (huggingface#3192)
- [GRPO] Improve completion length logging (huggingface#3188)
- preliminary openai compatible endpoint
- early concept, needs refining
- dedupe debug print
- some slop to work on
- unslop, missing hist
- almost valid pseudocode
- middle-ware monkey patch in mp.Pool()...
- remove unused
- More accurate .md
- need gpu renting lambda again
- much nicer
- small aider-chat and datasets conflict
- risky reqs change
- should work, but hacky
- some insights, but monkeypatching probably wont suffice
- refactor: Rewrite test script to use SWE-bench dataset with MultiProcessAider
- refactor: Remove logging statements from test.py
- one step closer
- finally, the correct abstraction
- doc todo
- unslop
- unslop
- undo accidental black
- cleaner abstraction
- new abstraction
1 parent d625c55 commit 38ef847

File tree

8 files changed (+514, -36 lines changed)


PLAN.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# PLAN.md

## Agent Manager Architecture

### Core Concept

The `AgentManager` coordinates ephemeral agents that exist only for the duration of a single training example, replacing `vllm_client.generate()` in the GRPO trainer with an orchestration layer that captures full conversation histories for more effective reinforcement learning.

### Current Implementation

1. **API Middleware Proxy**
   - Lightweight FastAPI server that intercepts API calls between agents and vLLM
   - Injects and tracks conversation via custom `X-Agent-ID` headers
   - Maintains thread-safe conversation history per agent
   - Captures complete request/response pairs for RL training signal

2. **Multiprocessing Approach**
   - Uses `multiprocessing.Pool` for parallel agent execution
   - Each agent runs in an isolated process to prevent state contamination
   - Monkey-patches the `requests` library to inject agent identification headers
   - Process isolation ensures clean environment for each agent instance

3. **Agent Deployment Flow**

   ```
   GRPO Training Step
   └── AgentManager.deploy(prompts)
       ├── Generate unique agent_id for each prompt
       ├── Deploy agents via multiprocessing.Pool
       │   ├── Each process runs _process_one(agent_id, prompt)
       │   │   ├── Monkey-patch requests to add X-Agent-ID
       │   │   └── Call process_one() (e.g., Aider instance)
       │   └── Multiple agents run in parallel
       ├── API Proxy tracks all vLLM interactions
       ├── Await completion with timeout
       ├── Collect conversation histories for all agent_ids
       └── Return structured completions to GRPO trainer
   ```

4. **GRPO Integration**
   - GRPO trainer uses AgentManager.deploy() for generating completions
   - Should properly convert agent completions to token IDs for the training loop
   - Maintains compatibility with both direct vLLM and agent-based generation
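
To make the flow above concrete, here is a minimal sketch of the deploy path. It is not the committed implementation: the helper names, the placeholder agent call, and the timeout default are illustrative assumptions, and the patch targets `requests.Session.request` so that both module-level helpers and `Session`-based clients pick up the `X-Agent-ID` header.

```python
import functools
import multiprocessing
import uuid

import requests


def _patch_requests(agent_id: str) -> None:
    """Stamp every outgoing HTTP call from this process with the agent's ID."""
    original = requests.Session.request

    @functools.wraps(original)
    def tagged(self, method, url, **kwargs):
        headers = kwargs.pop("headers", None) or {}
        headers["X-Agent-ID"] = agent_id  # lets the proxy attribute the call
        return original(self, method, url, headers=headers, **kwargs)

    requests.Session.request = tagged


def _process_one(job):
    """Runs inside an isolated worker process: patch, then drive one agent."""
    agent_id, prompt = job
    _patch_requests(agent_id)
    # Placeholder for the real agent invocation (e.g. launching an Aider run).
    return agent_id, f"finished: {prompt[:40]}"


class AgentManager:
    """Deploys one ephemeral agent per prompt and collects their results."""

    def deploy(self, prompts: list[str], timeout: float = 600.0) -> dict[str, str]:
        jobs = [(str(uuid.uuid4()), prompt) for prompt in prompts]
        with multiprocessing.Pool(processes=len(jobs)) as pool:
            results = pool.map_async(_process_one, jobs).get(timeout=timeout)
        # In the real flow, conversation histories would now be fetched from
        # the proxy by agent_id and converted into trainer-ready completions.
        return dict(results)


if __name__ == "__main__":
    print(AgentManager().deploy(["Fix the failing test in utils.py"]))
```

Running one worker process per prompt keeps each agent's monkey-patch from leaking into its siblings, which is the point of the process isolation described above.
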
## Challenges and Solutions

### Conversation Tracking Challenges

1. **Asynchronous API Calls**
   - Agents make varying numbers of API calls at unpredictable times
   - Solution: Thread-safe conversation tracking with unique agent IDs
   - Thread-safe locking ensures proper history capture even with concurrent requests

2. **Process Management**
   - Challenge: Ensuring clean process termination and resource cleanup
   - Solution: Pool-based multiprocessing with timeout handling
   - Proper cleanup in finally blocks ensures resources are released

3. **Proxy Synchronization**
   - Challenge: Background tasks in FastAPI may create race conditions
   - Solution: Consider making conversation tracking synchronous in the API endpoint
   - More robust synchronization mechanisms for production environments

4. **Conversation Continuity**
   - Challenge: Ensuring continuous context across multiple API calls
   - Solution: Implement validation in the ConversationTracker
   - Track and report potential discontinuities that could indicate information loss
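
`ConversationTracker` is referenced here but is not part of this commit; the following is a minimal thread-safe sketch under that assumption, recording synchronously, keyed by `agent_id`, and with `get_completion_history()` (listed later as a priority) returning a defensive copy.

```python
import threading
from collections import defaultdict
from typing import Any


class ConversationTracker:
    """Thread-safe store of request/response pairs, keyed by agent_id."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._histories: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def record(self, agent_id: str, request: dict, response: dict) -> None:
        # Called synchronously from the proxy endpoint to avoid race conditions.
        with self._lock:
            self._histories[agent_id].append({"request": request, "response": response})

    def get_completion_history(self, agent_id: str) -> list[dict[str, Any]]:
        # Return a copy so callers cannot mutate the shared state.
        with self._lock:
            return list(self._histories[agent_id])
```
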
### Technical Considerations

1. **Monkey-Patching Approach**
   - Current: Patch `requests.request` in each worker process to add custom headers
   - Pros: Isolated impact, minimal invasiveness to agent frameworks
   - Alternative: Require direct configuration of agent framework

2. **Conversation Collection**
   - Current: API Proxy collects all conversations by agent_id
   - Challenge: Ensuring all API calls are captured before retrieving history
   - Solution: Consider small delay or synchronization primitive before retrieval

3. **Error Handling**
   - Challenge: Individual agent failures shouldn't crash the entire batch
   - Solution: Improved error handling in AgentManager.deploy()
   - Graceful degradation for failed agents while allowing others to continue
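
On the proxy side, a hedged sketch of what the middleware endpoint could look like: it forwards OpenAI-style chat-completion requests to the vLLM backend and records each exchange synchronously under the caller's `X-Agent-ID`. The backend URL, the route names, and the in-process history store are assumptions for illustration, not the code in this commit.

```python
import threading
from collections import defaultdict

import httpx
from fastapi import FastAPI, Request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed backend address

app = FastAPI()
_lock = threading.Lock()
_histories: dict[str, list[dict]] = defaultdict(list)  # agent_id -> exchanges


@app.post("/v1/chat/completions")
async def proxy_chat_completions(request: Request) -> dict:
    payload = await request.json()
    agent_id = request.headers.get("X-Agent-ID", "unknown")

    # Forward the request unchanged to the vLLM OpenAI-compatible server.
    async with httpx.AsyncClient(timeout=600.0) as client:
        upstream = await client.post(VLLM_URL, json=payload)
        upstream.raise_for_status()
        body = upstream.json()

    # Record synchronously (not as a background task) so the exchange is stored
    # before the agent's next call can arrive.
    with _lock:
        _histories[agent_id].append({"request": payload, "response": body})
    return body


@app.get("/history/{agent_id}")
async def get_history(agent_id: str) -> list[dict]:
    with _lock:
        return list(_histories[agent_id])
```
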
## Conclusions and Next Steps

The current implementation successfully achieves:

1. **Process Isolation**: Clean separation of agent environments
2. **Conversation Tracking**: Complete history capture for RL training
3. **Parallel Execution**: Efficient handling of multiple agents
4. **Resource Management**: Proper cleanup of temporary resources

Next development priorities:

1. **Implement ConversationTracker.get_completion_history()**: Properly extract and format the complete history
2. **Address race conditions**: Ensure background tasks complete before history retrieval
3. **Enhance error handling**: Improve robustness to individual agent failures
4. **Performance optimization**: Evaluate and optimize latency introduced by the proxy
5. **Testing**: Develop comprehensive tests for conversation tracking accuracy

README.md

Lines changed: 13 additions & 0 deletions
@@ -1,3 +1,16 @@
+# TRL Fork: Agent-In-The-Loop Reinforcement Trainer (AITLRT)
+
+This fork enhances the TRL (Transformer Reinforcement Learning) library with agentic capabilities, focusing on training and reinforcing multi-turn coding agents:
+
+- **OpenAI-Compatible vLLM Endpoint**: Drop-in replacement for OpenAI API enabling seamless integration with existing tools and agents
+- **Direct Agent Integration**: Use existing agent scaffolding and applications directly in the training loop without modification
+- **Enterprise-Ready Solutions**: Leverage production-ready agentic frameworks rather than building custom implementations
+- **Parallel Agent Execution**: Run multiple instances of the same agent architecture in parallel during training
+
+For original TRL documentation, see below.
+
+---
+
 # TRL - Transformer Reinforcement Learning
 
 <div style="text-align: center">
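
Because the endpoint speaks the OpenAI chat-completions protocol, it can be exercised with the standard `openai` client. A minimal sketch, assuming the server is listening on `localhost:8000` and accepts any bearer token; the `deployed_model` name matches what `SimpleClient` in this commit sends.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM-backed endpoint.
# The base URL and API key are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="deployed_model",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a haiku about reinforcement learning."},
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```
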

trl/cli.py

Lines changed: 5 additions & 2 deletions
@@ -25,8 +25,8 @@
 from .scripts.kto import make_parser as make_kto_parser
 from .scripts.sft import make_parser as make_sft_parser
 from .scripts.utils import TrlParser
-from .scripts.vllm_serve import main as vllm_serve_main
-from .scripts.vllm_serve import make_parser as make_vllm_serve_parser
+from .scripts.vllm_serve_sync import main as vllm_serve_main
+from .scripts.vllm_serve_sync import make_parser as make_vllm_serve_parser
 
 
 def main():
@@ -93,7 +93,10 @@ def main():
     elif args.command == "vllm-serve":
         (script_args,) = parser.parse_args_and_config()
         vllm_serve_main(script_args)
+
+    # Make the vllm-serve-openai-endpoint subparser
 
 
 if __name__ == "__main__":
     main()
+

trl/extras/vllm_client.py

Lines changed: 82 additions & 31 deletions
@@ -15,7 +15,8 @@
 import atexit
 import logging
 import time
-from typing import Optional
+from typing import Any, Optional
+from abc import ABC, abstractmethod
 
 import torch
 from torch import nn
@@ -36,7 +37,7 @@
 logger = logging.getLogger(__name__)
 
 
-class VLLMClient:
+class VLLMClient(ABC):
     """
     A client class to interact with a vLLM server.
 
@@ -131,24 +132,21 @@ def check_server(self, total_timeout: float = 0.0, retry_interval: float = 2.0):
 
     def generate(
         self,
-        prompts: list[str],
-        n: int = 1,
+        data: list[dict[str, Any]],
         repetition_penalty: float = 1.0,
         temperature: float = 1.0,
         top_p: float = 1.0,
         top_k: int = -1,
         min_p: float = 0.0,
         max_tokens: int = 16,
         guided_decoding_regex: Optional[str] = None,
-    ) -> list[list[int]]:
+    ) -> list[dict[str, Any]]:
         """
-        Generates model completions for the provided prompts.
+        Generates model completions for the provided data.
 
         Args:
-            prompts (`list[str]`):
-                List of text prompts for which the model will generate completions.
-            n (`int`, *optional*, defaults to `1`):
-                Number of completions to generate for each prompt.
+            data (`list[dict[str, Any]]`):
+                List of dataset entries.
             repetition_penalty (`float`, *optional*, defaults to `1.0`):
                 Parameter for repetition penalty. 1.0 means no penalty.
             temperature (`float`, *optional*, defaults to `1.0`):
@@ -165,28 +163,10 @@ def generate(
                 Regular expression to guide the decoding process.
 
         Returns:
-            `list[list[int]]`:
-                List of lists of token IDs representing the model-generated completions for each prompt.
+            `list[dict[str, Any]]`:
+                List of dataset entries with the generated completions added.
         """
-        url = f"http://{self.host}:{self.server_port}/generate/"
-        response = self.session.post(
-            url,
-            json={
-                "prompts": prompts,
-                "n": n,
-                "repetition_penalty": repetition_penalty,
-                "temperature": temperature,
-                "top_p": top_p,
-                "top_k": top_k,
-                "min_p": min_p,
-                "max_tokens": max_tokens,
-                "guided_decoding_regex": guided_decoding_regex,
-            },
-        )
-        if response.status_code == 200:
-            return response.json()["completion_ids"]
-        else:
-            raise Exception(f"Request failed: {response.status_code}, {response.text}")
+        pass
 
     def init_communicator(self):
         """
@@ -269,6 +249,79 @@ def close_communicator(self):
         else:
             if response.status_code != 200:
                 raise Exception(f"Request failed: {response.status_code}, {response.text}")
+
+import concurrent.futures
+
+
+class SimpleClient(VLLMClient):
+    def generate(
+        self,
+        data: list[dict[str, Any]],
+        repetition_penalty: float = 1.0,
+        temperature: float = 1.0,
+        top_p: float = 1.0,
+        top_k: int = -1,
+        min_p: float = 0.0,
+        max_tokens: int = 16,
+        guided_decoding_regex: Optional[str] = None,
+    ) -> list[dict[str, Any]]:
+        """
+        Generates model completions for the provided data.
+
+        Args:
+            data (`list[dict[str, Any]]`):
+                List of dataset entries.
+            repetition_penalty (`float`, *optional*, defaults to `1.0`):
+                Parameter for repetition penalty. 1.0 means no penalty.
+            temperature (`float`, *optional*, defaults to `1.0`):
+                Temperature parameter for sampling. Higher values increase diversity.
+            top_p (`float`, *optional*, defaults to `1.0`):
+                Top-p sampling parameter. `1.0` means no truncation.
+            top_k (`int`, *optional*, defaults to `-1`):
+                Top-k sampling parameter. `-1` means no truncation.
+            min_p (`float`, *optional*, defaults to `0.0`):
+                Minimum probability for sampling.
+            max_tokens (`int`, *optional*, defaults to `16`):
+                Maximum number of tokens to generate for each prompt.
+            guided_decoding_regex (`str` or `None`, *optional*, defaults to `None`):
+                Regular expression to guide the decoding process.
+
+        Returns:
+            `list[dict[str, Any]]`:
+                List of dataset entries with the generated completions added.
+        """
+        url = f"http://{self.host}:{self.server_port}/v1/chat/completions"
+        headers = {"Authorization": "Bearer dummy"}
+
+        def get_answer(item):
+            messages = [
+                {"role": "system", "content": "You are a helpful AI assistant."},
+                {"role": "user", "content": item["prompt"]},
+            ]
+            payload = {
+                "model": "deployed_model",
+                "messages": messages,
+                "temperature": temperature,
+                "max_tokens": max_tokens,
+                "repetition_penalty": repetition_penalty,
+                "top_p": top_p,
+                "top_k": top_k,
+                "min_p": min_p,
+                "stream": False,
+            }
+            if guided_decoding_regex is not None:
+                payload["guided_decoding_regex"] = guided_decoding_regex
+
+            resp = requests.post(url, json=payload, headers=headers, timeout=600)  # per-request timeout in seconds
+            resp.raise_for_status()
+            resp_data = resp.json()
+            return resp_data["choices"][0]["message"]["content"]
+
+        with concurrent.futures.ThreadPoolExecutor() as executor:
+            futures = [executor.submit(get_answer, item) for item in data]
+            for item, future in zip(data, futures):
+                item["answer"] = future.result()
+
+        return data
 
 
 # Example usage
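
A hedged sketch of how the new `SimpleClient` might be driven from user code, assuming the synchronous OpenAI-compatible server is already running and reachable at the host and port defaults inherited from `VLLMClient`; the prompts are illustrative.

```python
from trl.extras.vllm_client import SimpleClient

# Assumes the server started via `trl vllm-serve` (wired to the synchronous
# endpoint in trl/cli.py above) is already listening on the default host/port.
client = SimpleClient()
batch = [
    {"prompt": "Summarize what GRPO optimizes in one sentence."},
    {"prompt": "List two failure modes of multi-turn coding agents."},
]
results = client.generate(batch, temperature=0.7, max_tokens=128)
for item in results:
    print(item["answer"])
```
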

trl/import_utils.py

Lines changed: 11 additions & 2 deletions
@@ -36,6 +36,7 @@
 _rich_available = _is_package_available("rich")
 _unsloth_available = _is_package_available("unsloth")
 _uvicorn_available = _is_package_available("uvicorn")
+_uvloop_available = _is_package_available("uvloop")
 _vllm_available = _is_package_available("vllm")
 _joblib_available = _is_package_available("joblib")
 
@@ -84,6 +85,10 @@ def is_uvicorn_available() -> bool:
     return _uvicorn_available
 
 
+def is_uvloop_available() -> bool:
+    return _uvloop_available
+
+
 def is_vllm_available() -> bool:
     return _vllm_available
 
@@ -99,15 +104,19 @@ class _LazyModule(ModuleType):
 
     # Very heavily inspired by optuna.integration._IntegrationModule
    # https://github.com/optuna/optuna/blob/master/optuna/integration/__init__.py
-    def __init__(self, name, module_file, import_structure, module_spec=None, extra_objects=None):
+    def __init__(
+        self, name, module_file, import_structure, module_spec=None, extra_objects=None
+    ):
        super().__init__(name)
        self._modules = set(import_structure.keys())
        self._class_to_module = {}
        for key, values in import_structure.items():
            for value in values:
                self._class_to_module[value] = key
        # Needed for autocompletion in an IDE
-        self.__all__ = list(import_structure.keys()) + list(chain(*import_structure.values()))
+        self.__all__ = list(import_structure.keys()) + list(
+            chain(*import_structure.values())
+        )
        self.__file__ = module_file
        self.__spec__ = module_spec
        self.__path__ = [os.path.dirname(module_file)]
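
The new `is_uvloop_available()` helper mirrors the existing availability checks. A small sketch of the usual guard pattern at a call site; the call site itself is an assumption, not part of this commit.

```python
from trl.import_utils import is_uvloop_available

if is_uvloop_available():
    # uvloop swaps in a faster asyncio event loop; fall back silently otherwise.
    import uvloop

    uvloop.install()
```
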
