**yuxiaoguo** commented:

The TMA descriptor for `attn_lse_intermediates` is initialized from the hardware's raw SM count in `make_globals` (latency/scheduler.py). However, the buffer itself is later allocated with the SM count rounded up to a multiple of 16 (demos/low-latency-llama/attention_reduction.cu). Because the descriptor's declared extent no longer matches the allocated size, creating the TMA descriptor fails.
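A minimal sketch of the size mismatch, assuming an SM count of 132 (e.g., an H100); the `round_up` helper and the variable names are illustrative stand-ins, not the project's actual code:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (hypothetical helper,
    mirroring the rounding done in attention_reduction.cu)."""
    return ((x + multiple - 1) // multiple) * multiple

num_sms = 132  # assumed SM count; e.g., an H100 reports 132 SMs

# Extent the TMA descriptor is built with in make_globals:
descriptor_rows = num_sms               # 132

# Extent the buffer is actually allocated with:
allocated_rows = round_up(num_sms, 16)  # 144

# Descriptor creation fails because the two extents disagree:
assert descriptor_rows == allocated_rows, (
    f"descriptor extent {descriptor_rows} != allocation {allocated_rows}"
)
```

Presumably the fix is to apply the same rounding on both sides, either rounding the SM count in `make_globals` as well, or allocating exactly `num_sms` rows in attention_reduction.cu, so the descriptor and the allocation agree.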
