Add remote reward #2
base: run_sensecore_add_req_sched
Conversation
# MODEL_NAME = "Qwen2.5-32B-Instruct"
MODEL_NAME = "Qwen3-30B-A3B"
Let's add BASE_URL, API_KEY, and MODEL_NAME to the training parameter config (with default values); the rest can stay hard-coded for now.
verl/utils/reward_score/__init__.py
Outdated
from . import remote_reward
# Call the judge model
# r"\\boxed\s*{([^}]*)}" matches the pred in the response; see /afs/chatrl/users/hwq/code/verl-req-sched/verl/utils/reward_score/remote_reward/__init__.py
# The dataset must provide the question in extra_info
Why can't we take the prompt directly from `messages` here?
Output:
'''.strip()

SAVE_PATH = "/afs/chatrl/users/hwq/code/verl-remote-reward/examples_sensecore/grpo_remote_reward/output.jsonl"  # path of the JSONL file to save
Let's move SAVE_PATH into the config as well, and read it from there.
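A hedged sketch of reading the save path from a config dict and appending JSONL records. The `save_path` key, the default, and the `save_record` helper are assumptions, not the PR's actual code:

```python
import json

def save_record(record: dict, cfg: dict) -> None:
    """Append one record to the JSONL file named in the config."""
    save_path = cfg.get("save_path", "output.jsonl")  # assumed key and default
    with open(save_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```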
completion = self.client.chat.completions.create(
    model=model_name,
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # no thinking
Is this field compatible with models outside the Qwen family?
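One defensive option is to pass `extra_body` only when the model is known to support `enable_thinking`. A minimal sketch; the model-name check and helper name below are assumptions, not verl's actual logic:

```python
def build_request_kwargs(model_name: str, messages: list) -> dict:
    """Build chat-completion kwargs, adding enable_thinking only for Qwen3 models."""
    kwargs = {"model": model_name, "messages": messages}
    # Assumption: only Qwen3-family chat templates accept enable_thinking;
    # other servers may reject unknown chat_template_kwargs.
    if "qwen3" in model_name.lower():
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs
```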
import vllm
from transformers import AutoTokenizer

from harpy.tools.base import Tool, SortTool
The unused harpy Tool imports can all be removed; the vLLM-served RM likely isn't needed for now either, so drop that too and keep only the remote RM part.
)


def build_rm(model_name, **kwargs):
This part can be removed as well.
from verl.utils.reward_score.remote_reward.tools.base import Tool, GenerateTool


class VllmTool(GenerateTool):
If we go with the remote RM, the vLLM part can be removed too.
Co-authored-by: Bihan Rana <[email protected]> Co-authored-by: peterschmidt85 <[email protected]>
…engine#2365)

### What does this PR do?

Fix a regression from volcengine#1911: that PR did not update the sglang async branch. CI did not catch the error because it only runs 1 step, while the error occurs on the second step, so the test cases are updated to run 2 steps.

To reproduce the bug, run:

```
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh
```

It fails with:

```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).