### System Info

NA

### Who can help?

@StudyingShao

### Information

- [x] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

NA

### Expected behavior

INT4 AWQ works correctly with an extremely large input sequence length.

### Actual behavior

Launching the `apply_per_channel_scale` kernel fails when `grid.y` in the CUDA launch configuration exceeds 65535, the maximum CUDA allows for the y-dimension of a grid.

### Additional notes

NA
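
To illustrate the `grid.y` limit described above, here is a minimal Python sketch of the launch arithmetic. It assumes, hypothetically, that the kernel assigns one tile of input rows to each `blockIdx.y`; the issue does not say how `apply_per_channel_scale` actually maps work to the grid. All function names are made up for illustration. One common workaround is to clamp `grid.y` to 65535 and let each block stride over the remaining tiles, which the sketch simulates on the host.

```python
# Assumption: one tile of rows per blockIdx.y. The real kernel's mapping
# is not given in the issue, so this only illustrates the arithmetic.
MAX_GRID_Y = 65535  # CUDA limit for gridDim.y (and gridDim.z)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


def naive_grid_y(num_rows: int, rows_per_block: int) -> int:
    """grid.y when every row tile gets its own block; may exceed the limit."""
    return ceil_div(num_rows, rows_per_block)


def clamped_grid_y(num_rows: int, rows_per_block: int) -> int:
    """grid.y capped at the CUDA limit; blocks must then loop over tiles."""
    return min(naive_grid_y(num_rows, rows_per_block), MAX_GRID_Y)


def tile_visits(num_rows: int, rows_per_block: int, grid_y: int) -> list[int]:
    """Simulate a grid-stride loop over row tiles.

    Returns how many times each tile is processed; a correct scheme
    visits every tile exactly once.
    """
    num_tiles = ceil_div(num_rows, rows_per_block)
    visits = [0] * num_tiles
    for block_y in range(grid_y):   # stands in for blockIdx.y
        tile = block_y
        while tile < num_tiles:     # stride by gridDim.y
            visits[tile] += 1
            tile += grid_y
    return visits


if __name__ == "__main__":
    rows, per_block = 70000, 1      # hypothetical long-sequence input
    print("naive grid.y:", naive_grid_y(rows, per_block))
    print("exceeds limit:", naive_grid_y(rows, per_block) > MAX_GRID_Y)
    gy = clamped_grid_y(rows, per_block)
    print("clamped grid.y:", gy)
    print("all tiles visited once:",
          all(v == 1 for v in tile_visits(rows, per_block, gy)))
```

An alternative with the same effect would be to move the large dimension to `grid.x`, whose limit is 2^31 - 1 on current GPUs; which option fits best depends on the kernel's indexing, which is not shown in this report.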