Thanks for building such a great framework!
I had a question while using GRPO with sequence parallelism (SP). I was training on 1,000 data samples on 2 GPUs with per_device_train_batch_size=1, and I noticed something that confused me:
With SP=2 → training steps = 250
With SP=1 → training steps = 500
I initially thought that with SP each sequence is split across the two GPUs, so processing the same number of sequences should take more steps, not fewer. But the opposite is happening.
Am I misunderstanding something here? I'd love your help!
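To make my confusion concrete, here is a minimal sketch of the step arithmetic I had in mind. The formula steps = dataset_size / samples_consumed_per_step is my own assumption about how the trainer counts steps, not something I verified in the swift source:

```python
# Hypothetical step arithmetic (my mental model, not confirmed framework behavior).
dataset_size = 1000
num_gpus = 2

def train_steps(samples_per_step: int) -> int:
    # Assumes one optimizer-visible step consumes this many dataset samples.
    return dataset_size // samples_per_step

# With SP=1, each GPU is a data-parallel rank, so 2 samples are consumed per step:
sp1_steps = train_steps(num_gpus)   # 500 -- matches what I observed

# With SP=2, both GPUs share one split sequence, so I expected 1 sample per step:
expected_sp2_steps = train_steps(1)  # 1000 -- my expectation
observed_sp2_steps = 250             # what actually happened
```

The observed 250 steps would correspond to 4 samples consumed per step, which is what puzzles me.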
Here is the command line I used to train the model:
NPROC_PER_NODE=2 \
PYTORCH_CUDA_ALLOC_CONF='' \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B \
--train_type full \
--use_vllm true \
--vllm_mode colocate \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 1024 \
--vllm_tensor_parallel_size 1 \
--dataset AI-MO/NuminaMath-TIR@1000 \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--max_length 1024 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--eval_steps 1000 \
--save_steps 1000 \
--learning_rate 1e-6 \
--save_total_limit 2 \
--logging_steps 5 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--max_completion_length 1024 \
--reward_funcs accuracy format \
--num_generations 8 \
--system examples/train/grpo/prompt.txt \
--deepspeed zero3_offload \
--temperature 1.0 \
--top_p 1.0 \
--top_k 80 \
--attn_impl flash_attn \
--log_completions true \
--async_generate false \
--offload_optimizer true \
--offload_model true \
--padding_free true \
--sequence_parallel_size 2 \
--gc_collect_after_offload true \
--dataloader_drop_last true \
--sleep_level 1 \
--split_dataset_ratio 0