[RFC] ActorEnvironment implementation to support tool calling #899
Replies: 11 comments 9 replies
-
Hi! Do you think the feature is like
-
@UbeCc I think the linked feature is different. You may want to have a look at https://github.com/cfpark00/verl/tree/multi_turn_rollout
-
Let me see. Thank you!
-
My initial thought on this is that there should be a new main and trainer function that reuses the underlying rollout and computation engine.
-
Is there any update on this new feature? It would be super useful to officially support this.
-
@alexanderhanboli Part 1/2 was not accepted. I am working on streaming outputs using vLLM SPMD with AsyncLLMEngine.
-
Curious about streaming outputs. Does it help balance uneven workloads during generation?
-
An implementation is underway. It is 90% done; I just need to preprocess a dataset and provide an example.
-
I think async LLM is worth exploring first; the sampling config can probably be worked around and should not be the most important factor. I also agree with vermouth that even if we introduce such an actor env, a new trainer is recommended instead of replacing the existing one.
-
FYI @casper-hansen, have you seen the comments here: #176 (comment)? That issue seems very related.
-
I want to thank the maintainers and Bytedance for their implementation. #1138 delivers the functionality this RFC intended to create. #1297 is one example of how it can be used.
-
This is an issue to request comments (RFC) from veRL maintainers @eric-haibin-lin @PeterSH6 @vermouth1992 on an implementation that would fundamentally support tool calling in the roadmap #354 and close #344. veRL has seen great adoption, but a lot of interesting work lives in forks like RAGEN or Search-R1. This would help bring more use-cases directly into veRL instead of leaving them in unmaintained, modified implementations elsewhere that cannot benefit from the new features veRL keeps adding.
Specifically, I am undecided on which implementation is best for the actual use-case: reusing RAGEN/Search-R1 code or redesigning the vLLM SPMD path for async streamed outputs.
Background
The actor in GRPO generates many completions. During generation, the model can learn interleaved tool calling. However, this requires stopping on certain tokens like `</tool_call>` to execute the tool and return its results back into the model.
Which use-cases could interleaved tool-calling enable?
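Whatever the concrete use-case, the mechanism is the same: generate until a stop string, run the tool, append the result, and continue. Below is a minimal, hypothetical sketch of that loop; `generate` and `run_tool` are illustrative placeholders, not existing veRL or vLLM APIs.

```python
# Hypothetical sketch of interleaved tool calling during rollout.
# `generate` and `run_tool` are placeholders for the rollout engine and the
# tool executor; neither exists in veRL under these names.

def generate(prompt: str, stop: list[str]) -> str:
    """Placeholder for a rollout call that stops when any of `stop` is produced."""
    raise NotImplementedError

def run_tool(tool_call_block: str) -> str:
    """Placeholder that parses a <tool_call>...</tool_call> block and executes it."""
    raise NotImplementedError

def rollout_with_tools(prompt: str, max_tool_turns: int = 4) -> str:
    completion = ""
    for _ in range(max_tool_turns):
        # Stop generation as soon as the model closes a tool call.
        chunk = generate(prompt + completion, stop=["</tool_call>"])
        completion += chunk
        if "<tool_call>" not in chunk:
            break  # no tool requested, the completion is finished
        # Execute the tool and feed its output back into the context.
        result = run_tool(chunk + "</tool_call>")
        completion += f"</tool_call>\n<tool_result>{result}</tool_result>\n"
    return completion
```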
Rough idea of implementation
The main idea is to create a class `ActorEnvironment` that will be the default in `RayPPOTrainer`. In the initial implementation, `ActorEnvironment` will only include refactored code from the `fit` method, but it will allow for extensibility by accepting a custom `ActorEnvironment` as an input to `RayPPOTrainer`.
Class outline
- step method: verl/verl/trainer/ppo/ray_trainer.py, lines 901 to 926 (commit 4a291fa)
- update method: verl/verl/trainer/ppo/ray_trainer.py, lines 989 to 995 (commit 4a291fa)
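Since the embedded snippets above may not render outside GitHub, here is a rough, hypothetical outline of the interface I have in mind. The method and attribute names are placeholders that mirror what `fit` currently does; none of this is an existing veRL API.

```python
# Hypothetical outline only; names and signatures are placeholders.
class ActorEnvironment:
    """Encapsulates the rollout/update portion of RayPPOTrainer.fit."""

    def __init__(self, actor_rollout_wg, tokenizer):
        self.actor_rollout_wg = actor_rollout_wg  # actor/rollout worker group
        self.tokenizer = tokenizer

    def step(self, gen_batch):
        """Generate sequences for a batch; a custom environment could interleave
        tool calls here before handing the batch back to the trainer."""
        return self.actor_rollout_wg.generate_sequences(gen_batch)

    def update(self, batch):
        """Run the actor update for the batch and return its output/metrics."""
        return self.actor_rollout_wg.update_actor(batch)
```

A custom subclass (for example the DuckDuckGo one below) would override `step` to stop on `</tool_call>`, call the tool, and continue generation before returning the batch.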
Use-case: DuckDuckGo ActorEnvironment
I have chosen DuckDuckGo for this example because (1) it requires no setup in the form of API keys and (2) DuckDuckGo has an easy method, `DDGS.text()`, which returns summaries of webpages. This will simplify the implementation.
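For reference, the search tool could look roughly like this with the `duckduckgo_search` package; argument names and result keys may differ between package versions, so treat this as a sketch.

```python
# Sketch of a DuckDuckGo search tool built on the duckduckgo_search package.
from duckduckgo_search import DDGS

def search_tool(query: str, max_results: int = 5) -> str:
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=max_results)
    # Each result is a dict with keys such as "title", "href" and "body".
    return "\n".join(f"{r['title']}: {r['body']}" for r in results)

if __name__ == "__main__":
    print(search_tool("GRPO reinforcement learning"))
```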
Existing implementations
This use-case could draw on inspiration from both RAGEN and Search-R1 to make an implementation that works. Here is a summary of the features needed:
- `max_tool_turns`. This parameter will help control how many times a tool can be triggered per completion.
- `max_tool_result_tokens`. This parameter will help truncate tool results that are too long so we avoid overwhelming the context window (a truncation sketch follows after this list).
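As a sketch of how `max_tool_result_tokens` could be enforced before a tool result is appended to the completion (the helper name is made up; `tokenizer` is any HuggingFace-style tokenizer):

```python
# Hypothetical helper enforcing max_tool_result_tokens on a tool result string.
def truncate_tool_result(result: str, tokenizer, max_tool_result_tokens: int) -> str:
    token_ids = tokenizer.encode(result, add_special_tokens=False)
    if len(token_ids) <= max_tool_result_tokens:
        return result
    # Keep only the first max_tool_result_tokens tokens of the tool output.
    return tokenizer.decode(token_ids[:max_tool_result_tokens])
```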
Alternative implementation
An alternative implementation would use the `AsyncLLMEngine` to stream partial outputs as part of the vLLM SPMD implementation. The implementation should abort requests when certain tokens are generated, such as `</tool_call>`, and submit a new request with the retrieved results from the tool call. The total overhead is only the time it takes to execute a tool plus the time to launch a new request, given that `enable_prefix_caching` is already implemented. However, the hardest part of this implementation would be the request management, with detokenization + tokenization potentially being necessary.
I will refrain from providing further implementation details here, as these will only become clear with a lot of experimentation. Pulling this off successfully would likely still involve reusing code from RAGEN and Search-R1.
Future work
Once `ActorEnvironment` has been implemented, a trivial extension could be a `CriticEnvironment` to more easily integrate the split placement example for PPO, since the main modification needed in the `fit` function is to implement a `.get()` call on the critic and actor.
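For illustration, the split-placement pattern boils down to launching both updates concurrently and only blocking on the results afterwards. The sketch below uses plain Ray tasks with placeholder update functions rather than veRL's worker groups.

```python
# Minimal illustration of the split-placement idea with plain Ray tasks.
# The update functions are placeholders, not veRL's critic/actor workers.
import ray

@ray.remote
def update_critic(batch):
    return {"critic/loss": 0.0}  # placeholder metrics

@ray.remote
def update_actor(batch):
    return {"actor/loss": 0.0}  # placeholder metrics

if __name__ == "__main__":
    ray.init()
    batch = {}
    critic_ref = update_critic.remote(batch)  # launched without blocking
    actor_ref = update_actor.remote(batch)    # runs concurrently with the critic update
    critic_metrics, actor_metrics = ray.get([critic_ref, actor_ref])
```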