
Conversation

SumanthRH (Member) commented Jul 9, 2025

What does this PR do?

Upgrades to torch 2.7. The upgrade is somewhat implicit: we upgrade the inference engines and then let them pin the required torch versions (vllm uses torch 2.7.0 and sglang uses 2.7.1).

  • This PR also upgrades CUDA to 12.8.

TODO:

  • Test sglang after upgrade
  • Publish new docker image to dockerhub

Signed-off-by: SumanthRH <[email protected]>
@SumanthRH SumanthRH requested a review from Copilot July 9, 2025 19:08
Copilot AI left a comment


Pull Request Overview

This PR upgrades PyTorch to 2.7 (via inference engines vllm and sglang), bumps CUDA to 12.8, and adjusts code and environment settings to support the new versions.

  • Replace raw torch.dtype usage in weight broadcasting with string conversions (torch_dtype_to_str / str_to_torch_dtype; sketched below).
  • Add VLLM_ALLOW_INSECURE_SERIALIZATION for vllm ≥0.9.0 and adjust initialize_ray.
  • Update dependency pins in pyproject.toml, installation docs, and Dockerfile to CUDA 12.8 and new package versions.
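
For context, a minimal sketch of what these string conversion helpers can look like; the actual implementations in skyrl-train may differ:

```python
import torch

def torch_dtype_to_str(dtype: torch.dtype) -> str:
    # torch.bfloat16 -> "bfloat16"; assumes the standard torch repr
    return str(dtype).removeprefix("torch.")

def str_to_torch_dtype(name: str) -> torch.dtype:
    # "bfloat16" -> torch.bfloat16
    dtype = getattr(torch, name)
    if not isinstance(dtype, torch.dtype):
        raise ValueError(f"{name!r} is not a torch dtype")
    return dtype
```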

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| skyrl-train/.../fsdp/fsdp_worker.py | Use torch_dtype_to_str when sending dtype over RPC |
| skyrl-train/.../deepspeed/deepspeed_worker.py | Same dtype conversion update for the DeepSpeed worker |
| skyrl-train/utils/utils.py | Add VLLM_ALLOW_INSECURE_SERIALIZATION and split the inner if |
| skyrl-train/inference_engines/vllm/vllm_engine.py | Import str_to_torch_dtype and convert incoming dtype strings |
| skyrl-train/pyproject.toml | Bump flash-attn, vllm, sglang, torch and add the pytorch-cu128 index |
| skyrl-train/docs/getting-started/installation.rst | Update CUDA recommendations and Docker image tags |
| docker/Dockerfile | Switch to the CUDA 12.8 installer, consolidate cleanup steps |
Comments suppressed due to low confidence (1)

skyrl-train/skyrl_train/utils/utils.py:255

  • The logger.info call should be indented under the inner if not os.environ.get("VLLM_USE_V1") so it only runs when VLLM_USE_V1 is unset. Currently it executes for all cfg.generator.backend == "vllm" cases.
            logger.info(
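
A hedged sketch of the control flow Copilot is describing; the surrounding code and the log message are assumptions, not the exact source:

```python
import os
from loguru import logger  # assumption: skyrl-train logs via loguru

def setup_vllm_env(cfg):
    # Assumed shape of the code around utils.py:255 (illustrative only).
    if cfg.generator.backend == "vllm":
        os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
        if not os.environ.get("VLLM_USE_V1"):
            # Copilot's point: this log belongs inside the inner `if`, so it
            # only fires when VLLM_USE_V1 is unset.
            logger.info("VLLM_USE_V1 not set, defaulting to the V1 engine")
```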

Comment on lines 3 to +20
```diff
 RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential libnuma-dev

 # the cuda compiler here is needed for deepspeed
-RUN wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
-RUN sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
+RUN wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run \
+&& sudo sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit && rm -rf cuda_12.8.0_570.86.10_linux.run

 RUN curl -LsSf https://astral.sh/uv/install.sh | sh
 RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc

 RUN sudo apt-get update \
 && sudo apt-get install -y openssh-server iputils-ping net-tools iproute2 traceroute netcat \
-libopenexr-dev libxi-dev libglfw3-dev libglew-dev libomp-dev libxinerama-dev libxcursor-dev tzdata
-RUN sudo apt update && sudo apt install --fix-broken && sudo apt install -y default-jre-headless openjdk-8-jdk
+libopenexr-dev libxi-dev libglfw3-dev libglew-dev libomp-dev libxinerama-dev libxcursor-dev tzdata \
+&& sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*

+RUN sudo apt update && sudo apt install --fix-broken && sudo apt install -y default-jre-headless openjdk-8-jdk \
+&& sudo apt-get clean \
+&& sudo rm -rf /var/lib/apt/lists/*
```

Copilot AI Jul 9, 2025

[nitpick] Multiple RUN apt-get update and cleanup layers can be consolidated into a single RUN block to reduce image layers and overall size.


x
Signed-off-by: SumanthRH <[email protected]>
Comment on lines 82 to 83
"sglang[srt,openai]==0.4.8.post1",
"torch-memory-saver>=0.0.5",
Collaborator

I think you can do sglang[srt,openai,torch_memory_saver]==0.4.8.post1

Member Author

Thanks!

SumanthRH added 5 commits July 9, 2025 20:45
x
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
```diff
@@ -172,7 +172,7 @@ generator:
 # whether to use a conversation based format for multi-turn generations
 # if false, append multi-turn model responses and env observations to the original assistant response
 # if true, each multi-turn model response and env observations is stored in a separate assistant/user message respectively
-use_conversation_multi_turn: false
+use_conversation_multi_turn: true
```
Member Author

This should be true by default, because the standard way to handle observations should be via multi-turn conversations. We had noticed some perf degradation for Qwen 7B models when using true instead of false, but that seems model specific.

sglang also has some issues with this flag: I think their /completions endpoint behaves differently in some way, because of which I got empty responses.

I will make sure to add this caveat with sglang to the docs as well.
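
To illustrate the difference between the two formats (message contents below are made up):

```python
# use_conversation_multi_turn=true: each model response / env observation
# becomes its own assistant / user message.
conversation_format = [
    {"role": "user", "content": "Solve the task."},
    {"role": "assistant", "content": "First attempt..."},
    {"role": "user", "content": "[env observation] attempt failed"},
    {"role": "assistant", "content": "Second attempt..."},
]

# use_conversation_multi_turn=false: responses and observations are appended
# to the original assistant response as one growing string.
appended_format = [
    {"role": "user", "content": "Solve the task."},
    {"role": "assistant",
     "content": "First attempt...[env observation] attempt failed Second attempt..."},
]
```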

Member

I am tracking this and other sglang feature support tasks here: #82

If there's more to say about this issue and what you observed, could you dump it into the issue?

SumanthRH (Member Author) Jul 14, 2025

Yes, sounds good. Let me in fact dive into this separately. The main issue is probably due to some difference in the implementation of the /v1/completions API in sglang vs vllm.

Member Author

The empty response comes up due to our defaults (which might have to change as well).

x
Signed-off-by: SumanthRH <[email protected]>
Member Author

This patch is no longer needed after the sglang upgrade.

x
Signed-off-by: SumanthRH <[email protected]>
@@ -120,7 +118,7 @@ async def update_named_weight(self, request: NamedWeightUpdateRequest):
f"{self.url}/{weight_update_method}",
json={
"name": request["name"],
"dtype": torch_dtype_to_str(request["dtype"]),
Member Author

Minor edit here: since the inference engines expect this to be a string anyway, I have changed the datatype to a string for consistency.
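
For illustration, the weight-update payload after this change might look like the following (the weight name here is hypothetical):

```python
# dtype now travels as a plain string, so the JSON payload needs no
# torch-specific serialization on either side.
request = {
    "name": "model.layers.0.self_attn.q_proj.weight",  # hypothetical name
    "dtype": "bfloat16",  # previously a torch.dtype converted at this call site
}
```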

x
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
```diff
@@ -127,7 +126,7 @@ async def broadcast_to_inference_engines(self, inference_engine_client):
 inference_engine_client.update_named_weight(
     {
         "name": name,
-        "dtype": generator_dtype,
+        "dtype": self.cfg.generator.model_dtype,
```
Member Author

Switched to directly providing the string version.

x
Signed-off-by: SumanthRH <[email protected]>
SumanthRH (Member Author)

This PR should be ready for review now.

I've tested:

  • vLLM with colocation and remote server (with FSDP2)
  • vLLM with colocation with DeepSpeed
  • Sglang remote server with FSDP2 and DeepSpeed

and things have worked.

I haven't done careful perf benchmarking, given there are no major changes.

@SumanthRH SumanthRH marked this pull request as ready for review July 10, 2025 01:12
```diff
-uv run --isolated --extra vllm -m skyrl_train.inference_engines.vllm.vllm_server \
+# NOTE (sumanthrh): Currently, there's an issue with distributed executor backend ray for vllm 0.9.2.
+# For standalone server, we use mp for now.
+CUDA_VISIBLE_DEVICES=4,5,6,7 uv run --isolated --extra vllm -m skyrl_train.inference_engines.vllm.vllm_server \
```
Member Author

This fails rn with vllm 0.9.2:

```bash
uv run --isolated --extra vllm -m vllm.entrypoints.api_server \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --tensor-parallel-size 4 \
    --host 127.0.0.1 \
    --port 8001 \
    --seed 42 \
    --max-model-len 4096 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --enable-sleep-mode \
    --max-num_batched_tokens 8192 \
    --max-num-seqs 1024 \
    --trust-remote-code \
    --distributed-executor-backend ray
```

There's some issue here that we can dig into later. As such, the remote server can use any backend, so I'm just using the mp backend.

"flashinfer-python@https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.5/flashinfer_python-0.2.5+cu124torch2.6-cp38-abi3-linux_x86_64.whl#sha256=43d767b912c0c43a04be99595e0123eab9385fc72530a2874b5fb08e3145c0be",
"flashinfer-python@https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl",
"torch==2.7.0",
"torchvision"
Member Author

Added explicit torch and torchvision dependencies.

Torch versions need not be specified, but it's helpful to know what we're dealing with, and it doesn't hurt since uv dependency resolution will catch issues.

The main reason is being able to use torch compiled for CUDA 12.8; by default, the torch wheels used are built for CUDA 12.6.

"""Broadcast weight to all vllm workers from source rank 0 (actor model)"""
dtype = str_to_torch_dtype(dtype)
Member Author

This is a vllm 0.9.2 change AFAIK. Previously we passed a string directly here, but now model_config.dtype is of type torch.dtype.

Comment on lines +203 to +204
if cfg.generator.backend == "sglang" and not cfg.generator.use_conversation_multi_turn:
raise NotImplementedError("`use_conversation_multi_turn=False` is not supported for SGLang backend")
Member Author

Encountered an issue with sglang and use_conversation_multi_turn=False, so it's not implemented right now.

CharlieFRuan (Collaborator)

Should we bump sglang to 0.4.9.post1? It seems to have some RL-related fixes:

I could also do it in #68

SumanthRH (Member Author) commented Jul 10, 2025

@CharlieFRuan Interesting point!

Regarding the version bump:

For this PR, I want to simply use a torch 2.7 compatible sglang version. The fixes for RL training you mentioned are for colocated training, and should not affect non-colocated training AFAIK. We only support non-colocated rn so this should be fine.

> I could also do it in #68

This is best, since you are introducing colocated training. Btw, another issue with going from sglang 0.4.8 -> 0.4.9 is that the flashinfer version gets upgraded from 0.2.6 -> 0.2.7. flashinfer is switching over to not releasing CUDA-specific wheels from 0.2.7 onwards: flashinfer-ai/flashinfer#1139 (comment), so we'll need to carefully test how slow JIT compilation is (it was pretty bad in previous versions). Ideally we avoid JIT compilation for as long as possible.

x
Signed-off-by: SumanthRH <[email protected]>
```diff
@@ -52,6 +52,7 @@ uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
 generator.async_engine=true \
 generator.batched=false \
 environment.env_class=text2sql \
+generator.use_conversation_multi_turn=false \
```
Member Author

Since #65 has landed, the same fix needs to be made for the search example.

tyler-griggs (Member) left a comment

I ran colocated vllm and remote sglang, no issues!


@@ -0,0 +1,12 @@
# Launches sglang server for Meta-Llama-3-8B-Instruct on 2 GPUs.
Member

nit: it's Qwen2.5-1.5B-Instruct (not llama), and defaults to 4 gpus (not 2)

Member Author

Fixed, thanks!

Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
@SumanthRH SumanthRH merged commit b539ee9 into main Jul 14, 2025
3 checks passed
@SumanthRH SumanthRH deleted the sumanthrh/update-to-2.7 branch July 16, 2025 22:12
@SumanthRH SumanthRH restored the sumanthrh/update-to-2.7 branch July 16, 2025 22:13
@SumanthRH SumanthRH deleted the sumanthrh/update-to-2.7 branch July 16, 2025 22:13
erictang000 added a commit that referenced this pull request Jul 25, 2025
… L4/L40S after #73 upgrade to cuda 12.8 (#108)

# Overview
After #73, the main code path no longer runs on GPUs without P2P support
(potentially due to cuda 12.8 upgrade?) - an error would be thrown like

```bash
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
```

This PR adds a check for whether peer access is supported (using
torch/cuda) between all GPUs on a node to the ray initialization, and
sets relevant NCCL env vars to allow the code to run on these machine
types.

```python
if not peer_access_supported():
    logger.info("Peer access is not supported, disabling P2P and SHM")
    env_vars["NCCL_P2P_DISABLE"] = "1"
    env_vars["NCCL_SHM_DISABLE"] = "1"
```
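
For reference, a minimal sketch of such a peer-access check using torch.cuda; the real `peer_access_supported()` helper in skyrl-train may differ in signature and details:

```python
import torch

def peer_access_supported(num_gpus=None) -> bool:
    # True only if every ordered pair of visible GPUs reports P2P capability.
    n = num_gpus if num_gpus is not None else torch.cuda.device_count()
    return all(
        torch.cuda.can_device_access_peer(i, j)
        for i in range(n)
        for j in range(n)
        if i != j
    )
```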

Example running on L40S:
<img width="1854" height="227" alt="image"
src="https://github.com/user-attachments/assets/1cca46b5-6e16-4ae7-9a33-df52d138bdeb"
/>
vinid added a commit to vinid/SkyRL that referenced this pull request Aug 11, 2025
* [Trainer] Support per-token rewards in trainer (NovaSky-AI#109)

* Add check for whether p2p access is supported - allows code to run on L4/L40S after NovaSky-AI#73 upgrade to cuda 12.8 (NovaSky-AI#108)

# Overview
After NovaSky-AI#73, the main code path no longer runs on GPUs without P2P support
(potentially due to cuda 12.8 upgrade?) - an error would be thrown like

```bash
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
```

This PR adds a check for whether peer access is supported (using
torch/cuda) between all GPUs on a node to the ray initialization, and
sets relevant NCCL env vars to allow the code to run on these machine
types.

```python
if not peer_access_supported():
    logger.info("Peer access is not supported, disabling P2P and SHM")
    env_vars["NCCL_P2P_DISABLE"] = "1"
    env_vars["NCCL_SHM_DISABLE"] = "1"
```

Example running on L40S:
<img width="1854" height="227" alt="image"
src="https://github.com/user-attachments/assets/1cca46b5-6e16-4ae7-9a33-df52d138bdeb"
/>

* [dependencies] Upgrade ray to 2.48.0 (NovaSky-AI#106)

# What does this PR do
Upgrades ray to 2.48.0, which allows us to remove the pip install vllm
in the Dockerfile as a fallback for when uv + vllm does not resolve
dependencies with the vllm + ray backend correctly.

We leave the previous Dockerfile in `docker/Dockerfile.ray244` for
backwards compatibility

---------

Co-authored-by: Sumanth R Hegde <[email protected]>

* fix issue with NovaSky-AI#108 that broke gpu ci (NovaSky-AI#112)

missed an argument in `gpu_ci/conftest.py` for `peer_access_supported()`
- fix for gpu ci to run

Passing now with update:
<img width="1811" height="861" alt="image"
src="https://github.com/user-attachments/assets/70011c54-1e33-44b5-83a0-616029f891d2"
/>


And main runs (and disables p2p access) correctly:
<img width="2067" height="203" alt="image"
src="https://github.com/user-attachments/assets/399aff67-cc51-4588-a632-47698073593c"
/>

* Add warning for certain uv versions due to `uv run --with` regression (NovaSky-AI#113)

# What does this PR do?

Adds a warning for uv versions 0.8.0, 0.8.1 and 0.8.2 due to a bug in
the uv run --with flag for "Running in ray cluster" section. These are
relatively new versions and thus it's better to have this detail in the
documentation for users.


<img width="692" height="458" alt="Screenshot 2025-07-25 at 6 09 15 PM"
src="https://github.com/user-attachments/assets/f1997eac-2867-4552-8ef7-eea8741e32b6"
/>
<img width="779" height="568" alt="Screenshot 2025-07-25 at 6 09 19 PM"
src="https://github.com/user-attachments/assets/5080d328-c934-4864-91a8-932902dea934"
/>

---------

Signed-off-by: SumanthRH <[email protected]>

* [GPU CI] Only trigger workflow for relevant changes in `skyrl-train` (NovaSky-AI#114)

* [bug] Loading saved HF weights errors (NovaSky-AI#118)

Addresses NovaSky-AI#97

* [DAPO] Add support for overlong filtering (NovaSky-AI#111)

## What does this PR do? 

Adds `apply_overlong_filtering` to the generator config, and provides a
generator utility method `apply_overlong_filtering()` for
post-processing the loss mask.

I originally implemented this using the `stop_reasons` to determine
whether the sequence was truncated, but instead switched to looking for
`eos_token` in the response IDs for a more general approach.
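
A hedged sketch of the idea (names and signature here are assumptions, not the exact skyrl-train API):

```python
from typing import List

def apply_overlong_filtering(
    loss_masks: List[List[int]],
    response_ids: List[List[int]],
    eos_token_id: int,
) -> List[List[int]]:
    # Zero out the loss mask for truncated responses (no EOS token), so
    # overlong samples do not contribute to the policy loss.
    return [
        mask if eos_token_id in resp else [0] * len(mask)
        for mask, resp in zip(loss_masks, response_ids)
    ]
```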

## Tests
Added CPU tests for the utility method and for SkyRL Gym Generator's use
of the utility method.

* [skyrl-gym] GSM8k - LLM Judge example (NovaSky-AI#74)

* Fix MLFlow logging (NovaSky-AI#121)

This is a small change to make the MLFlow integration work. Currently
this fails with a Pandas error when trying to flatten an Omega dict; we
need to convert to a regular Python dictionary.

Can confirm this works on our MLFlow setup:
<img width="1406" height="683" alt="image"
src="https://github.com/user-attachments/assets/fcee526a-815e-4f08-bf25-d2709779ced7"
/>

* [Trainer] Support registering custom advantage estimators (NovaSky-AI#115)

## What does this PR do? 

Adds an `AdvantageEstimatorRegistry` to support custom advantage
estimation methods without modifying the skyrl-train package.

Added `examples/algorithm/custom_advantage_estimator` folder to give
quick example of how to register a custom adv est function.

## Tests
Adding cpu test to ensure registration works.

* [checkpointing] Add HF model config and tokenizer config to checkpoint folder  (NovaSky-AI#124)

# Overview
Adds the HF model config and tokenizer config to `ckpt_path/huggingface`
for deepspeed and fsdp. So now the checkpoint directory will be:

```
{ckpt_path}/
├── latest_ckpt_global_step.txt           # Holds the global step of the latest checkpoint
├── global_step_10/                       # Checkpoint at training step 10
│   ├── policy/                          # Policy model checkpoint directory
│   │   ├── fsdp_config.json      # stores fsdp version and world size
│   │   ├── huggingface/
│   │       ├── config.json                 # model config
│   │       ├── tokenizer_config.json       # tokenizer config
│   │       ├── generation_config.json      # generation config
│   │       ├── ...                         # other tokenizer config files
│   │   ├── model_state.pt               # Model parameters
│   │   ├── optimizer_state.pt           # Optimizer state
│   │   └── lr_scheduler_state.pt        # Learning rate scheduler state
```

For deepspeed it will be similar but without `fsdp_config.json`

```
{ckpt_path}/
├── latest_ckpt_global_step.txt           # Holds the global step of the latest checkpoint
├── global_step_10/                       # Checkpoint at training step 10
│   ├── policy/                          # Policy model checkpoint directory
│   │   ├── huggingface/
│   │       ├── config.json                 # model config
│   │       ├── tokenizer_config.json       # tokenizer config
│   │       ├── generation_config.json      # generation config
│   │       ├── ...                         # other tokenizer config files
│   │   ├── ...               # deepspeed checkpointing files
```

* Fix discord link (NovaSky-AI#125)

* Fix broken link (NovaSky-AI#128)

* [Trainer/Algorithm] Support registering custom policy loss functions + refactor adv estimator registry to allow registration outside ray workers (NovaSky-AI#126)

# Overview
- Adds support for registering custom policy loss functions, similar to
NovaSky-AI#115,
- Refactors the policy loss to be a function in `ppo_utils.py` instead
of an `nn.Module` in `worker.py`
- Introduces a breaking change in renaming
`trainer.algorithm.ppo_loss_type` to
`trainer.algorithm.policy_loss_type`
- Addresses Issue NovaSky-AI#116 by creating a new `BaseFunctionRegistry` class
that uses a [named
actor](https://docs.ray.io/en/latest/ray-core/actors/named-actors.html)
to support the following pattern:

```python
# Example of custom policy loss: "simple_baseline"
def compute_simple_baseline_policy_loss(
    log_probs: torch.Tensor,
    ...
):
    return torch.randn(1, device=log_probs.device), 0.0

# Register the custom policy loss - outside of the ray worker
PolicyLossRegistry.register("simple_baseline", compute_simple_baseline_policy_loss)


@ray.remote(num_cpus=1)
def skyrl_entrypoint(cfg: DictConfig):
    exp = BasePPOExp(cfg)
    exp.run()


@hydra.main(config_path=config_dir, config_name="ppo_base_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # validate the arguments
    validate_cfg(cfg)

    initialize_ray(cfg)

    ray.get(skyrl_entrypoint.remote(cfg))
```
this change was necessary for `PolicyLossRegistry` to be accessible,
since the worker `actor_loss_fn` attribute is set in `init_model` within
the `worker` actor, which is a ray actor created from within the
skyrl_entrypoint ray task (and registering within the entrypoint
wouldn't propagate down another layer).
- updates AdvantageEstimatorRegistry to extend the same
`BaseFunctionRegistry` class


Example runs:
Custom advantage (mean of reward)
<img width="956" height="326" alt="image"
src="https://github.com/user-attachments/assets/1b7222bc-fbb9-49b1-876d-265b71201087"
/>

Custom policy loss (reinforce - just (-logprobs * advantages).mean())
<img width="939" height="330" alt="image"
src="https://github.com/user-attachments/assets/cbed7ef5-b3e7-4e32-beba-b52b80879f47"
/>

* [SkyAgent] Upload initial refactored code (NovaSky-AI#131)

# What does this PR do?

Uploading our initial refactored code for SkyAgent

---------

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Shiyi Cao <[email protected]>
Co-authored-by: Dacheng Li <[email protected]>

* [trainer] add more robust generation output validation (NovaSky-AI#132)

# Overview
Adds a `validate_generation_output` function in `trainer_utils.py` with
more robust validation of generation output format. Specifically, given
```
class GeneratorOutput(TypedDict):
    prompt_token_ids: List[List[int]]
    response_ids: List[List[int]]
    rewards: Union[List[float], List[List[float]]]
    loss_masks: List[List[int]]
    stop_reasons: Optional[List[str]]
    rollout_metrics: Optional[Dict[str, Any]]
```

We expect
- all list attributes should have the same length and be the same length
as the input batch of prompts at dim=0
- non zero length lists
- response_ids, loss masks, and rewards (if token level rewards) should
be the same length
- the sum of loss masks should be non-zero (logging a warning if it is
not)
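
A hedged sketch of these checks (the actual `validate_generation_output` may differ):

```python
def validate_generation_output(num_prompts: int, out: dict) -> None:
    # All list attributes must match the input batch size at dim=0 and be non-empty.
    for key in ("prompt_token_ids", "response_ids", "rewards", "loss_masks"):
        assert len(out[key]) == num_prompts > 0, f"{key} length mismatch at dim=0"
    for resp, mask, rew in zip(out["response_ids"], out["loss_masks"], out["rewards"]):
        assert len(resp) == len(mask), "response/loss-mask length mismatch"
        if isinstance(rew, list):  # token-level rewards
            assert len(rew) == len(resp), "token-level rewards length mismatch"
        if sum(mask) == 0:
            print("warning: loss mask sums to zero for a sample")
```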

verified gsm8k run still works:
<img width="563" height="330" alt="image"
src="https://github.com/user-attachments/assets/eeefebcb-d5fc-486d-b906-f4344b1e2779"
/>

---------

Co-authored-by: Sumanth R Hegde <[email protected]>

* [Trainer] GSPO support (NovaSky-AI#120)

This PR adds support for [Group Sequence Policy Optimization
(GSPO)](https://arxiv.org/abs/2507.18071), the hotness du jour from
Alibaba Qwen. The implementation in this PR is loosely based on [this
one](huggingface/trl#3775) from TRL. It adds an
`importance_sampling_level` config option which can be `token`
(PPO/GRPO) or `sequence` (GSPO).
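
Roughly, the option controls the granularity at which the importance ratio is computed; a hedged sketch (tensor names and shapes are assumptions):

```python
import torch

def importance_ratio(log_probs, old_log_probs, loss_mask, level="token"):
    # log_probs, old_log_probs, loss_mask: [batch, seq_len]
    log_ratio = log_probs - old_log_probs
    if level == "sequence":  # GSPO: one length-normalized ratio per sequence
        seq = (log_ratio * loss_mask).sum(-1) / loss_mask.sum(-1).clamp(min=1)
        log_ratio = seq.unsqueeze(-1).expand_as(log_probs)
    return torch.exp(log_ratio)  # feeds the usual clipped PPO objective
```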

I ran a short/small GSM8k run with Qwen2.5-0.5B and the loss curves look
okay:
<img width="314" height="240" alt="image"
src="https://github.com/user-attachments/assets/f52d7c64-416c-4419-aa96-4a03c9048007"
/>

However, I had to hack a few things to get this to run on Datadog's
cloud infra (including changing some dependency versions) so I'd
encourage one of the maintainers to reproduce these results locally
before merging.

* [SkyAgent] Add initial docs (NovaSky-AI#134)

# What does this PR do?

Adds initial documentation for SkyAgent. 

We are still actively cleaning this package up, but I thought initial
documentation will be helpful for anyone who stumbles across this.


The documentation folder is still in `skyrl-train`, and much of the docs
also refer to "SkyRL" when they are really referring to "SkyRL-train",
so to avoid any confusion, I have just added this as a simple page on
the sidebar. We need to make the docs be mono-repo wide and structure it
better but I'm leaving it for a future PR.

---------

Signed-off-by: SumanthRH <[email protected]>

* [trainer/algorithm] Implement DAPO and Polaris style dynamic sampling + add DAPO docs + example (NovaSky-AI#130)

# Overview
This PR introduces filter (DAPO) and replace (Polaris/WebSailor) style
dynamic sampling strategies. The dynamic sampling strategy can be
configured as below:

```yaml
# dynamic sampling parameters
dynamic_sampling:
  type: null # filter (DAPO), replace (POLARIS/WebSailor), or null
  max_sample_batches: 30 # sample at most this many batches before stopping, -1 to sample forever
  min_replace_ratio: 0.3 # minimum proportion of good samples with which to replace bad samples (for replace strategy only)
```
This PR also adds a docs page describing how to enable all DAPO
features, and adds an example GSM8K script where all these features are
used.
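
A toy sketch of the filter (DAPO) strategy under these knobs; the reward generator and all names below are illustrative:

```python
import random

def generate_reward_groups(num_prompts=8, group_size=4):
    # Stand-in for a rollout: one reward per sample, grouped by prompt.
    return [[random.choice([0.0, 1.0]) for _ in range(group_size)]
            for _ in range(num_prompts)]

def filter_groups(reward_groups):
    # DAPO filter: drop prompts whose rewards are all identical, since their
    # group-normalized advantages would be zero.
    return [g for g in reward_groups if len(set(g)) > 1]

train_batch_size, max_sample_batches = 16, 30
kept = []
for _ in range(max_sample_batches):
    kept += filter_groups(generate_reward_groups())
    if len(kept) >= train_batch_size:
        break
```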

## Minor Changes
Some minor changes to make this dynamic sampling implementation clean:
- the utils `Timer` class now updates the dict instead of overwriting in
order to correctly track generation time w/ dynamic sampling, which
means we need to make sure to reset `all_timings` in any trainer
- The use of `self.weights_manager` is a little tricky for the dynamic
sampling; introduced the `ConditionalWeightsManager` to make the
added code in the training loop as clean as possible


## Example runs
<img width="413" height="264" alt="image"
src="https://github.com/user-attachments/assets/072f716a-3632-42bb-a5f7-5f9d6064bd93"
/>

Generation time for dapo style filtering increases as the training run
goes on, while it is stable for polaris and the baseline.

<img width="419" height="265" alt="image"
src="https://github.com/user-attachments/assets/887df550-e4b9-4623-b578-b4809a9f403f"
/>

We can see that the training pass @ n metric is 1 for both polaris and
dapo style filtering as expected.

<img width="421" height="259" alt="image"
src="https://github.com/user-attachments/assets/bb63af77-1fbb-4d89-9216-b028f1551ea7"
/>

For GSM8k + Qwen 1.5B, the sampling strategy (as well as the full DAPO
run) results in minimal gains - need larger models/harder dataset to
test more fully

DAPO sampling Example Run:
```bash
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:13.439 | INFO     | skyrl_train.trainer:train:245 - Started: 'step'
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:13.737 | INFO     | skyrl_train.weights_manager:__enter__:76 - Started: 'sync_weights_to_inference_engines'
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:16.401 | INFO     | skyrl_train.weights_manager:__enter__:76 - Finished: 'sync_weights_to_inference_engines', time cost: 2.66s
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:16.401 | INFO     | skyrl_train.weights_manager:__enter__:80 - Started: 'offload_policy_model_to_cpu'
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:16.842 | INFO     | skyrl_train.weights_manager:__enter__:80 - Finished: 'offload_policy_model_to_cpu', time cost: 0.44s
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:16.888 | INFO     | skyrl_train.trainer:train:261 - Started: 'generate'
(AsyncVLLMInferenceEngine pid=223856) INFO 08-04 23:13:13 [executor_base.py:227] It took 0.243244 seconds to wake up tags ['weights']. [repeated 4x across cluster]
(AsyncVLLMInferenceEngine pid=223854) INFO 08-04 23:13:16 [executor_base.py:227] It took 0.040547 seconds to wake up tags ['kv_cache'].
(AsyncVLLMInferenceEngine pid=223856) INFO 08-04 23:13:16 [block_pool.py:316] Successfully reset prefix cache [repeated 7x across cluster]
(AsyncVLLMInferenceEngine pid=223855) INFO 08-04 23:13:16 [executor_base.py:227] It took 0.041721 seconds to wake up tags ['kv_cache'].
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.378 | INFO     | skyrl_train.trainer:train:261 - Finished: 'generate', time cost: 17.49s
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.395 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:433 - ============= Dynamic sampling filter =============
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.395 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:434 - Dynamic sampling: 460 < 1024 prompts
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.395 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:435 - Resample batch 1, continue sampling...
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.395 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:436 - ==================================================
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.395 | INFO     | skyrl_train.trainer:train:245 - Finished: 'step', time cost: 20.96s
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.407 | INFO     | skyrl_train.trainer:train:245 - Started: 'step'
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:34.445 | INFO     | skyrl_train.trainer:train:261 - Started: 'generate'
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.014 | INFO     | skyrl_train.trainer:train:261 - Finished: 'generate', time cost: 17.57s
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.029 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:433 - ============= Dynamic sampling filter =============
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.029 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:434 - Dynamic sampling: 941 < 1024 prompts
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.029 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:435 - Resample batch 2, continue sampling...
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.029 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:436 - ==================================================
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.030 | INFO     | skyrl_train.trainer:train:245 - Finished: 'step', time cost: 17.62s
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.033 | INFO     | skyrl_train.trainer:train:245 - Started: 'step'
(skyrl_entrypoint pid=222117) 2025-08-04 23:13:52.074 | INFO     | skyrl_train.trainer:train:261 - Started: 'generate'
(skyrl_entrypoint pid=222117) 2025-08-04 23:14:08.380 | INFO     | skyrl_train.trainer:train:261 - Finished: 'generate', time cost: 16.31s
(skyrl_entrypoint pid=222117) 2025-08-04 23:14:08.396 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:439 - ============= Dynamic sampling filter =============
(skyrl_entrypoint pid=222117) 2025-08-04 23:14:08.396 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:440 - Dynamic sampling: collected 1467 >= 1024 prompts
(skyrl_entrypoint pid=222117) 2025-08-04 23:14:08.397 | INFO     | skyrl_train.utils.trainer_utils:handle_filter_sampling:443 - ==================================================
(AsyncVLLMInferenceEngine pid=223856) INFO 08-04 23:13:12 [gpu_worker.py:98] Sleep mode freed 61.88 GiB memory, 4.98 GiB memory is still in use. [repeated 3x across cluster]
(AsyncVLLMInferenceEngine pid=223856) INFO 08-04 23:13:12 [executor_base.py:211] It took 1.264572 seconds to fall asleep. [repeated 3x across cluster]
```

Polaris Style example run:
```bash
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:01.648 | INFO     | skyrl_train.trainer:train:261 - Started: 'generate'
(AsyncVLLMInferenceEngine pid=308521) INFO 08-05 00:29:58 [executor_base.py:227] It took 0.240372 seconds to wake up tags ['weights']. [repeated 4x across cluster]
(AsyncVLLMInferenceEngine pid=308520) INFO 08-05 00:30:01 [executor_base.py:227] It took 0.040980 seconds to wake up tags ['kv_cache'].
(AsyncVLLMInferenceEngine pid=308521) INFO 08-05 00:30:00 [block_pool.py:316] Successfully reset prefix cache [repeated 7x across cluster]
(AsyncVLLMInferenceEngine pid=308518) INFO 08-05 00:30:01 [executor_base.py:227] It took 0.041175 seconds to wake up tags ['kv_cache'].
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.663 | INFO     | skyrl_train.trainer:train:261 - Finished: 'generate', time cost: 15.01s
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.679 | INFO     | skyrl_train.utils.trainer_utils:handle_replace_sampling:316 - Replace sampling: 629 good UIDs out of 1024 total prompts
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.680 | INFO     | skyrl_train.utils.trainer_utils:handle_replace_sampling:320 - ============= Dynamic sampling replace ===========
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.680 | INFO     | skyrl_train.utils.trainer_utils:handle_replace_sampling:321 - Number of good prompts: 629
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.680 | INFO     | skyrl_train.utils.trainer_utils:handle_replace_sampling:322 - Number of bad prompts: 395
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.694 | INFO     | skyrl_train.utils.trainer_utils:handle_replace_sampling:352 - After replacement - Replaced 395 bad prompts
(skyrl_entrypoint pid=306764) 2025-08-05 00:30:16.694 | INFO     | skyrl_train.utils.trainer_utils:handle_replace_sampling:353 - ==================================================
(AsyncVLLMInferenceEngine pid=308520) INFO 08-05 00:29:57 [gpu_worker.py:98] Sleep mode freed 62.14 GiB memory, 6.28 GiB memory is still in use. [repeated 3x across cluster]
(AsyncVLLMInferenceEngine pid=308520) INFO 08-05 00:29:57 [executor_base.py:211] It took 1.331663 seconds to fall asleep.
```

## Full DAPO example run 
From example script
<img width="417" height="262" alt="image"
src="https://github.com/user-attachments/assets/2592a06f-8b8a-4cf1-a29e-321bff819eb0"
/>
<img width="909" height="325" alt="image"
src="https://github.com/user-attachments/assets/50922afd-1424-4183-9329-4f1f340287eb"
/>

---------

Co-authored-by: Sumanth R Hegde <[email protected]>

* [algorithm] Support Dr. GRPO + refactor where policy/critic loss functions are set (NovaSky-AI#133)

# Overview
## Dr GRPO
Adds `loss_reduction`: `seq_mean_token_sum_norm ` option, and
`grpo_norm_by_std` option to support Dr. GRPO

So to run Dr. GRPO, set: 

```yaml
trainer:
 algorithm:
   grpo_norm_by_std: false
   loss_reduction: "seq_mean_token_sum_norm"
...
```
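
A hedged sketch of what `seq_mean_token_sum_norm` denotes, per the Dr. GRPO recipe: per-sequence token losses are summed and divided by a fixed constant rather than the per-sequence token count (`max_seq_len` and tensor names here are assumptions):

```python
import torch

def seq_mean_token_sum_norm(token_loss, loss_mask, max_seq_len=1024):
    # token_loss, loss_mask: [batch, seq_len]
    per_seq = (token_loss * loss_mask).sum(dim=-1) / max_seq_len
    return per_seq.mean()
```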

Example run:
<img width="906" height="317" alt="image"
src="https://github.com/user-attachments/assets/ce9db2ef-253e-45c8-adba-1ef8a270bbd9"
/>

Reward looks similar

<img width="419" height="263" alt="image"
src="https://github.com/user-attachments/assets/a4bc4d8c-f3c1-4bad-a497-0297dc30bc27"
/>

Magnitude of policy loss is lower as expected (since we are normalizing
by a larger constant rather than taking the mean)

## Refactor where Critic/Policy Loss are set
Changes ppo critic `ValueLoss` to just a function instead of a
`nn.Module` for consistency with `policy_loss`, and adds new algorithm
fields to the cfg that require evaluating field values in
`utils::validate_cfg` (this runs before entrypoint code, allowing users
to modify the cfg further by subclassing `BasePPOExp`)

PPO example still running after this refactor:
<img width="421" height="262" alt="image"
src="https://github.com/user-attachments/assets/88985da3-1403-49c6-8cb5-f1434151fd9e"
/>

* [fix] move algorithm folder -> algorithms (NovaSky-AI#136)

left the algorithm folder in NovaSky-AI#133, move it over

* [Logging] Forward mlflow env vars to ray runtime env (NovaSky-AI#135)

This PR forwards the `MLFLOW_TRACKING_URI` and `MLFLOW_TRACKING_TOKEN`
environment variables to the ray runtime env during its initialization.

This will enable users to simply provide the above env vars at the driver and be able to use MLFlow for experiment tracking.

* data folder

* some stuff

* updates

---------

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
Co-authored-by: Eric Tang <[email protected]>
Co-authored-by: Tyler Griggs <[email protected]>
Co-authored-by: Shu Liu <[email protected]>
Co-authored-by: Ben Cohen <[email protected]>
Co-authored-by: Shiyi Cao <[email protected]>
Co-authored-by: Dacheng Li <[email protected]>
Co-authored-by: Etienne Brodu <[email protected]>
fannie1208 pushed a commit to vinid/SkyRL that referenced this pull request Aug 19, 2025
# What does this PR do?

Upgrades to torch 2.7. This PR also makes the torch versions used explicit for the different inference backends (vllm uses torch 2.7.0 and sglang uses 2.7.1). Deepspeed performs JIT compilation and is magically not dependent on a torch version.

This PR also upgrades CUDA to 12.8. 

TODO: 
- [x] Test sglang after upgrade 
- [x] Publish new docker image to dockerhub

---------

Signed-off-by: SumanthRH <[email protected]>
fannie1208 pushed a commit to vinid/SkyRL that referenced this pull request Aug 19, 2025