After feeding input into the Mamba layer, the magnitude of the outputs keeps growing and eventually explodes as training progresses. How can this be resolved? Does Mamba have specific precision requirements? The whole model is trained in FP16, with only the Mamba layer cast to FP32.
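For concreteness, this is roughly the mixed-precision setup being described: a minimal sketch, assuming a standard PyTorch autocast (FP16) training loop; `FP32Block` is a hypothetical wrapper name, and the wrapped module stands in for whatever Mamba implementation is in use:

```python
import torch
import torch.nn as nn


class FP32Block(nn.Module):
    """Run the wrapped module in FP32 even when the surrounding
    model executes under FP16 autocast, then cast the output back
    to the incoming dtype so the rest of the network stays in FP16."""

    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module.float()  # keep the wrapped parameters in FP32

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        # Disable autocast locally so the block's math runs in full precision.
        with torch.autocast(device_type=x.device.type, enabled=False):
            y = self.module(x.float())
        return y.to(in_dtype)


# Usage (hypothetical): wrap only the Mamba layer, leave the rest in FP16.
# mamba_layer = FP32Block(Mamba(d_model=512))
```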