[algo] Add SPO (Single-stream Policy Optimization) recipe implementation #3503
Conversation
- Add SPO algorithm implementation with KL-adaptive value tracker
- Implement single-stream architecture eliminating group synchronization
- Add prioritized sampling and global advantage normalization
- Include comprehensive README with performance results and usage guide
- Add configuration files and training scripts
- Achieve +3.4 pp improvement on math benchmarks vs GRPO
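For readers skimming the diff, the sketch below illustrates the two ideas the bullets above refer to: a persistent per-prompt value baseline in place of a group mean, and advantage normalization over the whole batch rather than within each group. All names here (ValueTracker, spo_advantages) and the KL-adaptive update rule are illustrative assumptions, not the recipe's actual code.

```python
# Illustrative only -- ValueTracker, spo_advantages, and the update rule are
# hypothetical stand-ins, not the API added by this PR.
from collections import defaultdict

import numpy as np


class ValueTracker:
    """Persistent per-prompt value estimate (the single-stream baseline)."""

    def __init__(self, init_value=0.5):
        self.values = defaultdict(lambda: init_value)

    def update(self, prompt_id, reward, kl, base_lr=0.1):
        # Hypothetical KL-adaptive step: a larger policy shift discounts the
        # stale estimate more aggressively.
        lr = min(1.0, base_lr * (1.0 + kl))
        self.values[prompt_id] += lr * (reward - self.values[prompt_id])

    def baseline(self, prompt_id):
        return self.values[prompt_id]


def spo_advantages(rewards, prompt_ids, tracker, eps=1e-6):
    """One sample per prompt, no group statistics: baseline from the tracker,
    normalization over the whole batch."""
    adv = np.array([r - tracker.baseline(p) for r, p in zip(rewards, prompt_ids)])
    return (adv - adv.mean()) / (adv.std() + eps)
```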
Remove Chinese language comments from spo_ray_trainer.py to improve code readability and maintain English-only codebase standards.
…990407/verl_spo_dev into feature/spo-implementation
Code Review
This pull request introduces the Single-stream Policy Optimization (SPO) algorithm, a novel reinforcement learning method for Large Language Models. The changes primarily consist of new files for the SPO recipe, including configuration, the main training script, a run script, and the core Ray trainer implementation. My review has identified two critical issues. First, the run_spo.sh script uses an undefined variable, which will cause training to fail at launch. Second, spo_ray_trainer.py contains unsafe exception handling during data resampling, which could lead to silent data corruption and hard-to-debug training failures. Addressing these issues is crucial for the correctness and stability of the new algorithm's implementation.
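To make the second point concrete, the snippet below contrasts the risky pattern with a safer one: catch only the error you expect during resampling, log it, and let everything else propagate. The function and variable names are hypothetical and are not taken from spo_ray_trainer.py.

```python
import logging

logger = logging.getLogger(__name__)


def resample_batch(make_iterator, max_retries=3):
    """Hypothetical resampling helper.

    Only the expected, recoverable error (an exhausted iterator) is handled;
    any other exception propagates, so a corrupted batch cannot silently
    slip into training the way a bare `except Exception: pass` would allow.
    """
    it = make_iterator()
    for attempt in range(1, max_retries + 1):
        try:
            return next(it)
        except StopIteration:
            logger.warning("Data iterator exhausted (attempt %d/%d); rebuilding",
                           attempt, max_retries)
            it = make_iterator()
    raise RuntimeError(f"Resampling failed after {max_retries} attempts")
```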
Could you pin the verl commit in your readme?
recipe/spo/README.md
Outdated
```bash
# Enable SPO training mode
export SPO_ENABLE=True
export SPO_OFFLINE_VALUES="/path/to/offline/values.json"
```
what is the purpose of this file?
see Appendix A
This file is an offline value estimate (Appendix A); I have added a Hugging Face link in the README.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
I have updated it in the README file.
- Switch offline values from local JSON file to HuggingFace dataset loading
- Update README with offline value generation instructions
- Add debug mode support with RAY_DEBUG flag in config
- Fix config name reference from ppo_trainer to spo_trainer
- Update batch sizes and paths to use environment variables
- Change custom module paths from retool to spo directory
- Switch multi-turn format from retool_paper to hermes
- Adjust offline value threshold from 0 to 0.5 for binary classification

This improves the SPO training pipeline by using centralized dataset storage and providing better configuration flexibility through environment variables.
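As a rough illustration of the Hugging Face loading path and the 0.5 binarization threshold mentioned above, one possible shape of the loader is sketched below; the dataset id and column names are placeholders, not the repository actually linked from the README.

```python
# Placeholder dataset id and column names -- substitute the repo linked in the README.
from datasets import load_dataset


def load_offline_values(dataset_id="your-org/spo-offline-values", split="train",
                        threshold=0.5):
    """Fetch per-prompt offline value estimates and binarize them at `threshold`."""
    ds = load_dataset(dataset_id, split=split)
    return {row["prompt_id"]: 1.0 if row["value"] >= threshold else 0.0 for row in ds}
```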
@wuxibin89 @vermouth1992 @tongyx361 @PeterSH6 Thanks for all the great feedback! I have updated the code based on the review comments and pushed the changes. Please take a quick look when you have a chance. Thanks!
…dataset processing and execution
What does this PR do?
This PR implements Single-stream Policy Optimization (SPO) as proposed in the paper https://arxiv.org/abs/2509.13232.
Checklist Before Starting
- PR title format: [{modules}] {type}: {description} (this will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, written like [megatron, fsdp, doc] when multiple modules are touched
- {type} is in feat, fix, refactor, chore, test
- If the PR breaks any API, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Once the PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)