Hi,
Thanks for your excellent work. During training I encountered lots of errors and some garbled output in the samples. Is this expected? On 8 A100 GPUs, it shows that training 444 steps will take 40 hours, or roughly 5.4 minutes per step. Is that a normal training speed?
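For reference, here is the quick calculation behind the per-step figure. It is only a sketch that assumes the 444 steps and 40 hours reported above are accurate and that step time stays roughly constant over the run; the variable names are mine, not from the training code.

```python
# Per-step time derived from the figures reported above.
# Assumption: step time is roughly constant across the whole run.
total_steps = 444       # total training steps reported
estimated_hours = 40    # estimated total training time reported

minutes_per_step = estimated_hours * 60 / total_steps
seconds_per_step = minutes_per_step * 60

print(f"{minutes_per_step:.1f} min/step ({seconds_per_step:.0f} s/step)")
# prints "5.4 min/step (324 s/step)"
```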
Qwen2.5-Math-1.5B-raft-plusplus-numina_math-n4.log