[RFC] ActorEnvironment implementation to support tool calling #899
Replies: 11 comments 9 replies
-
Hi! Do you think the feature is like
-
@UbeCc I think the linked feature is different. You may want to have a look at https://github.com/cfpark00/verl/tree/multi_turn_rollout
-
Let me see. Thank you!
-
My initial thought on this is that there should be a new main and trainer function that reuses the underlying rollout and computation engine.
-
Is there any update on this new feature? It would be super useful to officially support this.
-
@alexanderhanboli Part 1/2 was not accepted. I am working on streaming outputs using vLLM SPMD with AsyncLLMEngine.
-
Curious about streaming outputs. Does it help balance uneven workloads during generation?
-
An implementation is underway. It is 90% done; I just need to preprocess a dataset and provide an example.
-
I think async LLM is worth exploring first; the sampling config can probably be worked around and should not be the most important factor. I also agree with vermouth that even if we introduce such an actor env, a new trainer is recommended instead of replacing the existing one.
-
FYI @casper-hansen, have you seen the comments here: #176 (comment)? That issue seems very related.
-
I want to thank the maintainers and Bytedance for their implementation. #1138 delivers the functionality this RFC intended to create. #1297 is one example of how it can be used.
-
This is an issue to request comments (RFC) from veRL maintainers @eric-haibin-lin @PeterSH6 @vermouth1992 on an implementation that would fundamentally support tool calling in the roadmap #354 and close #344. veRL has seen great adoption, but a lot of interesting work lives in forks like RAGEN or Search-R1. This would help bring more use-cases directly into veRL instead of leaving them in unmaintained, modified implementations elsewhere that cannot benefit from the new features veRL keeps adding.
Specifically, I am undecided on which implementation is best for the actual use-case: reusing RAGEN/Search-R1 code or redesigning the vLLM SPMD path for async streamed outputs.
Background
The actor in GRPO generates many completions. During generation, the model can learn interleaved tool calling. However, this requires stopping on certain tokens like `</tool_call>` to execute the tool and return its results back into the model.
Which use-cases could interleaved tool-calling enable?
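Whatever the concrete use-case, the mechanism is the same: generate until a stop string, run the tool, append the result, and continue. Below is a minimal, hypothetical sketch of that loop; `generate` and `run_tool` are illustrative placeholders, not existing veRL or vLLM APIs.

```python
# Hypothetical sketch of interleaved tool calling during rollout.
# `generate` and `run_tool` are placeholders for the rollout engine and the
# tool executor; neither exists in veRL under these names.

def generate(prompt: str, stop: list[str]) -> str:
    """Placeholder for a rollout call that stops when any of `stop` is produced."""
    raise NotImplementedError

def run_tool(tool_call_block: str) -> str:
    """Placeholder that parses a <tool_call>...</tool_call> block and executes it."""
    raise NotImplementedError

def rollout_with_tools(prompt: str, max_tool_turns: int = 4) -> str:
    completion = ""
    for _ in range(max_tool_turns):
        # Stop generation as soon as the model closes a tool call.
        chunk = generate(prompt + completion, stop=["</tool_call>"])
        completion += chunk
        if "<tool_call>" not in chunk:
            break  # no tool requested, the completion is finished
        # Execute the tool and feed its output back into the context.
        result = run_tool(chunk + "</tool_call>")
        completion += f"</tool_call>\n<tool_result>{result}</tool_result>\n"
    return completion
```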
Rough idea of implementation
The main idea is to create a class `ActorEnvironment` that will be the default in `RayPPOTrainer`. In the initial implementation, `ActorEnvironment` will only include refactored code from the `fit` method, but it will allow for extensibility by accepting a custom `ActorEnvironment` as an input to `RayPPOTrainer`.
Class outline
- step method: verl/verl/trainer/ppo/ray_trainer.py, lines 901 to 926 (commit 4a291fa)
- update method: verl/verl/trainer/ppo/ray_trainer.py, lines 989 to 995 (commit 4a291fa)
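Since the embedded snippets above may not render outside GitHub, here is a rough, hypothetical outline of the interface I have in mind. The method and attribute names are placeholders that mirror what `fit` currently does; none of this is an existing veRL API.

```python
# Hypothetical outline only; names and signatures are placeholders.
class ActorEnvironment:
    """Encapsulates the rollout/update portion of RayPPOTrainer.fit."""

    def __init__(self, actor_rollout_wg, tokenizer):
        self.actor_rollout_wg = actor_rollout_wg  # actor/rollout worker group
        self.tokenizer = tokenizer

    def step(self, gen_batch):
        """Generate sequences for a batch; a custom environment could interleave
        tool calls here before handing the batch back to the trainer."""
        return self.actor_rollout_wg.generate_sequences(gen_batch)

    def update(self, batch):
        """Run the actor update for the batch and return its output/metrics."""
        return self.actor_rollout_wg.update_actor(batch)
```

A custom subclass (for example the DuckDuckGo one below) would override `step` to stop on `</tool_call>`, call the tool, and continue generation before returning the batch.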
Use-case: DuckDuckGo ActorEnvironment
I have chosen DuckDuckGo for this example because (1) it requires no setup in the form of API keys and (2) DuckDuckGo has an easy method, `DDGS.text()`, which returns summaries of webpages. This will simplify the implementation.
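For reference, the search tool could look roughly like this with the `duckduckgo_search` package; argument names and result keys may differ between package versions, so treat this as a sketch.

```python
# Sketch of a DuckDuckGo search tool built on the duckduckgo_search package.
from duckduckgo_search import DDGS

def search_tool(query: str, max_results: int = 5) -> str:
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=max_results)
    # Each result is a dict with keys such as "title", "href" and "body".
    return "\n".join(f"{r['title']}: {r['body']}" for r in results)

if __name__ == "__main__":
    print(search_tool("GRPO reinforcement learning"))
```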
Existing implementations
This use-case could draw on inspiration from both RAGEN and Search-R1 to make an implementation that works. Here is a summary of the features needed:
- `max_tool_turns`. This parameter will help control how many times a tool can be triggered per completion.
- `max_tool_result_tokens`. This parameter will help truncate tool results that are too long so we avoid overwhelming the context window (a truncation sketch follows after this list).
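As a sketch of how `max_tool_result_tokens` could be enforced before a tool result is appended to the completion (the helper name is made up; `tokenizer` is any HuggingFace-style tokenizer):

```python
# Hypothetical helper enforcing max_tool_result_tokens on a tool result string.
def truncate_tool_result(result: str, tokenizer, max_tool_result_tokens: int) -> str:
    token_ids = tokenizer.encode(result, add_special_tokens=False)
    if len(token_ids) <= max_tool_result_tokens:
        return result
    # Keep only the first max_tool_result_tokens tokens of the tool output.
    return tokenizer.decode(token_ids[:max_tool_result_tokens])
```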
Alternative implementation
An alternative implementation would use the `AsyncLLMEngine` to stream partial outputs as part of the vLLM SPMD implementation. The implementation should abort requests when certain tokens are generated, such as `</tool_call>`, and submit a new request with the retrieved results from the tool call. The total overhead is only the time it takes to execute a tool plus the time to launch a new request, given that `enable_prefix_caching` is already implemented. However, the hardest part of this implementation would be the request management, with detokenization + tokenization potentially being necessary.
I will refrain from providing further implementation details here, as these will only become clear with a lot of experimentation. Pulling this off successfully would likely still involve reusing code from RAGEN and Search-R1.
Future work
Once `ActorEnvironment` has been implemented, a trivial extension could be a `CriticEnvironment` to more easily integrate the split placement example for PPO, since the main modification needed in the `fit` function is to implement a `.get()` call on the critic and actor.
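For illustration, the split-placement pattern boils down to launching both updates concurrently and only blocking on the results afterwards. The sketch below uses plain Ray tasks with placeholder update functions rather than veRL's worker groups.

```python
# Minimal illustration of the split-placement idea with plain Ray tasks.
# The update functions are placeholders, not veRL's critic/actor workers.
import ray

@ray.remote
def update_critic(batch):
    return {"critic/loss": 0.0}  # placeholder metrics

@ray.remote
def update_actor(batch):
    return {"actor/loss": 0.0}  # placeholder metrics

if __name__ == "__main__":
    ray.init()
    batch = {}
    critic_ref = update_critic.remote(batch)  # launched without blocking
    actor_ref = update_actor.remote(batch)    # runs concurrently with the critic update
    critic_metrics, actor_metrics = ray.get([critic_ref, actor_ref])
```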