Skip to content

Conversation

@Tyizhanshen
Copy link
Collaborator

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Add one-line overview of what this PR aims to achieve or accomplish.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if neccessary.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以新增一个 H100 任务的scripts,就可以不在sensecore 的 script 上改了

# logger.log(data=val_metrics, step=self.global_steps)
if self.config.trainer.get('val_only', False):
return
# if self.val_reward_fn is not None and self.config.trainer.get('val_before_train', True):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

把训前 valid 打开

@HyperdriveHustle
Copy link
Owner

LGTM

@HyperdriveHustle HyperdriveHustle merged commit df2ecd5 into req_sched_token_even_kk Jun 27, 2025
1 of 2 checks passed
HyperdriveHustle pushed a commit that referenced this pull request Aug 20, 2025
…engine#2365)

### What does this PR do?

Fix a regression from volcengine#1911,
because the PR did not change the sglang async branch.

CI did not catch this error because it only run 1 step, but this error
happen in the second test. So I update the testcases to run 2 steps.

To reproduce the bug, run test:
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash
tests/special_e2e/ppo_trainer/run_function_reward.sh

It fail with:
```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on [192.168.111.48:40451](http://192.168.111.48:40451/)
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/[libc10.so](http://libc10.so/))
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/[libc10.so](http://libc10.so/))
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```



### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here:
https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test
```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```
### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants