-
1, in the legacy code: |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
Ah, I somewhat remember that. You are right that because |
Beta Was this translation helpful? Give feedback.
-
1, in the legacy code: |
Beta Was this translation helpful? Give feedback.
-
Ah, I somewhat remember that. You are right that because |
Beta Was this translation helpful? Give feedback.
Ah, I somewhat remember that. You are right that because
contig_per_thread=1
, shared memory ops aren't well vectorized on V100. I don't think there is any deep reason why; this happened at a time during which I was very busy with OpenAI stuff, and probably I just wanted to get good A100 perf ASAP without risking to break the V100 codegen. :D