Thanks for building such a great framework!
I had a question while using GRPO with sequence parallelism (SP). I was training on 1,000 data samples on 2 GPUs with per_device_train_batch_size=1, and I noticed something that confused me:
With SP=2 → training steps = 250
With SP=1 → training steps = 500
I initially thought that with SP each sequence is split across the two GPUs, so processing the same number of sequences should take more steps, not fewer. But the opposite is happening.
Am I misunderstanding something here? I'd love your help!
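To make my confusion concrete, here is a minimal sketch of the step arithmetic I had in mind. The formula steps = dataset_size / samples_consumed_per_step is my own assumption about how the trainer counts steps, not something I verified in the swift source:

```python
# Hypothetical step arithmetic (my mental model, not confirmed framework behavior).
dataset_size = 1000
num_gpus = 2

def train_steps(samples_per_step: int) -> int:
    # Assumes one optimizer-visible step consumes this many dataset samples.
    return dataset_size // samples_per_step

# With SP=1, each GPU is a data-parallel rank, so 2 samples are consumed per step:
sp1_steps = train_steps(num_gpus)   # 500 -- matches what I observed

# With SP=2, both GPUs share one split sequence, so I expected 1 sample per step:
expected_sp2_steps = train_steps(1)  # 1000 -- my expectation
observed_sp2_steps = 250             # what actually happened
```

The observed 250 steps would correspond to 4 samples consumed per step, which is what puzzles me.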
Here is the command line I used to train the model:
NPROC_PER_NODE=2 \
PYTORCH_CUDA_ALLOC_CONF='' \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B \
--train_type full \
--use_vllm true \
--vllm_mode colocate \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 1024 \
--vllm_tensor_parallel_size 1 \
--dataset AI-MO/NuminaMath-TIR@1000 \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--max_length 1024 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--eval_steps 1000 \
--save_steps 1000 \
--learning_rate 1e-6 \
--save_total_limit 2 \
--logging_steps 5 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--max_completion_length 1024 \
--reward_funcs accuracy format \
--num_generations 8 \
--system examples/train/grpo/prompt.txt \
--deepspeed zero3_offload \
--temperature 1.0 \
--top_p 1.0 \
--top_k 80 \
--attn_impl flash_attn \
--log_completions true \
--async_generate false \
--offload_optimizer true \
--offload_model true \
--padding_free true \
--sequence_parallel_size 2 \
--gc_collect_after_offload true \
--dataloader_drop_last true \
--sleep_level 1 \
--split_dataset_ratio 0