forked from volcengine/verl
    
        
Verl kk with profiling #1
          
Merged: HyperdriveHustle merged 6 commits into req_sched_token_even_kk from verl_kk_with_profiling on Jun 27, 2025
      
    
                
You could add a separate script for the H100 job, so the sensecore script would not need to be modified.
    # logger.log(data=val_metrics, step=self.global_steps)
    if self.config.trainer.get('val_only', False):
        return
    # if self.val_reward_fn is not None and self.config.trainer.get('val_before_train', True):
Turn pre-training validation back on.
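For context, the gate the reviewer refers to can be sketched roughly as follows. This is a minimal illustration, not the trainer's actual code: only the config keys `val_before_train` and `val_only` and the `val_reward_fn` check come from the diff above, while the function name `run_pre_training_validation`, the `validate` callable, and the plain-dict config are assumptions made for the example.

```python
from typing import Any, Callable, Mapping, Optional, Tuple


def run_pre_training_validation(
    trainer_cfg: Mapping[str, Any],
    val_reward_fn: Optional[Callable[..., Any]],
    validate: Callable[[], dict],
) -> Tuple[Optional[dict], bool]:
    """Hypothetical helper: return (metrics, stop_after_validation).

    Validation runs only when a reward function exists and
    `val_before_train` is enabled (assumed default: True). When
    `val_only` is set, the caller is told to skip training.
    """
    metrics = None
    if val_reward_fn is not None and trainer_cfg.get("val_before_train", True):
        metrics = validate()
        if trainer_cfg.get("val_only", False):
            return metrics, True
    return metrics, False


# Usage with stand-in objects (all values here are illustrative).
metrics, stop = run_pre_training_validation(
    {"val_before_train": True},
    val_reward_fn=lambda batch: 0.0,
    validate=lambda: {"val/score": 0.5},
)
```

With the `if self.val_reward_fn is not None and ...` line commented out, as in the diff, validation before training would no longer be controlled by `val_before_train`, which appears to be what the reviewer is asking to restore.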
LGTM
df2ecd5 into req_sched_token_even_kk (1 of 2 checks passed)
      
      
    
HyperdriveHustle pushed a commit that referenced this pull request on Aug 20, 2025:
    
    
      
  
    
      
    
  
…engine#2365)

### What does this PR do?

Fix a regression from volcengine#1911, which did not change the sglang async branch. CI did not catch this error because it only runs 1 step, but the error happens in the second test, so the test cases are updated to run 2 steps.

To reproduce the bug, run the test:

`TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh`

It fails with:

```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
  
  
    
  
    