Ah, I somewhat remember that. You are right that because contig_per_thread=1, shared memory ops aren't well vectorized on V100. I don't think there is any deep reason why; this happened at a time when I was very busy with OpenAI stuff, and I probably just wanted to get good A100 perf ASAP without risking breaking the V100 codegen. :D
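To make the vectorization point concrete, here is a rough sketch of why a contiguous run of one element per thread rules out wide shared memory accesses. It is not the actual codegen logic: the function names are hypothetical, alignment is ignored, and the 128-bit cap is assumed as the widest single load/store a thread can issue.

```python
def max_vector_bits(contig_per_thread: int, elem_bits: int, hw_max_bits: int = 128) -> int:
    """Widest single access (in bits) a thread could issue for its contiguous run.

    Simplified model: ignores alignment and assumes power-of-two access widths
    capped at hw_max_bits (128 bits is an assumption of this sketch).
    """
    run_bits = contig_per_thread * elem_bits
    limit = min(run_bits, hw_max_bits)
    width = elem_bits
    # Grow the access width in powers of two while it still fits in the run.
    while width * 2 <= limit:
        width *= 2
    return width


def accesses_per_thread(elems_per_thread: int, contig_per_thread: int, elem_bits: int) -> int:
    """Number of shared memory instructions needed to move elems_per_thread elements."""
    elems_per_access = max_vector_bits(contig_per_thread, elem_bits) // elem_bits
    # Ceiling division: a partial final access still costs one instruction.
    return -(-elems_per_thread // elems_per_access)


if __name__ == "__main__":
    # fp16 elements, 8 elements owned by each thread.
    for contig in (1, 2, 4, 8):
        print(contig, max_vector_bits(contig, 16), accesses_per_thread(8, contig, 16))
```

Under this model, a thread holding 8 fp16 values with a contiguous run of 1 needs 8 scalar 16-bit accesses, while a contiguous run of 8 would allow a single 128-bit access.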

Answer selected by goostavz