
Enable F8E4M3 conversions on Nvidia GPUs with sm < 89, and fix F8E5M2 conversions #7904

Open
wants to merge 1 commit into main

Conversation

@woct0rdho (Contributor) commented Aug 19, 2025

Motivation

Nvidia GPUs with sm < 89 are still widely used; see e.g. the Steam hardware survey. When running large AI models, a common practice is to store the parameters in fp8 and cast them to fp16 for computation on hardware without native fp8 support. This reduces the memory requirement, even though it brings no speed advantage. This PR aims to enable torch.compile for this usage.
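To make the pattern concrete, here is a minimal PyTorch sketch of the usage described above (illustration only, not code from this PR; the Fp8Linear module name is made up for the example):

    import torch

    class Fp8Linear(torch.nn.Module):
        """Weights stored in fp8 to save memory, upcast to fp16 right before use."""

        def __init__(self, weight_fp16: torch.Tensor):
            super().__init__()
            # Storage-only fp8: the cast back to fp16 below is the conversion
            # this PR wants Triton to be able to emit on sm < 89.
            self.register_buffer("weight_fp8", weight_fp16.to(torch.float8_e4m3fn))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = self.weight_fp8.to(torch.float16)  # dequantize on the fly
            return x @ w.t()

    # Under torch.compile, the generated Triton kernels need the fp8 -> fp16 cast.
    layer = torch.compile(Fp8Linear(torch.randn(256, 256, dtype=torch.float16)))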

We may refer to XLA's fallback mechanism for fp8 operations (see openxla/xla#23124), although I think we only need to support the conversions rather than all arithmetic operations.

Implementation

Before #2105, there was some PTX code for converting F8E4M3/F8E5M2 <-> F16/BF16, but it did not correctly handle denormalized values or round-to-nearest-even (RTNE). I've fixed these cases and added code for F32 -> F8E4M3/F8E5M2.

I've verified that for all 2^8 F8E4M3/F8E5M2 values, all 2^16 F16/BF16 values, and all 2^32 F32 values, the conversion results are bitwise identical to the PyTorch implementation, except for some corner cases involving inf and nan (see the comments). The tests in test_conversions.py pass.
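For reference, a check along these lines takes only a few lines of PyTorch (a sketch of the idea, not the PR's actual test code; kernel_out stands for the F16 results produced by the Triton conversion kernel):

    import torch

    def check_e4m3_to_f16(kernel_out: torch.Tensor) -> bool:
        # Enumerate all 2^8 F8E4M3 bit patterns, convert them with PyTorch,
        # and compare bit patterns rather than values so nan encodings count too.
        bits = torch.arange(256, dtype=torch.uint8)
        ref = bits.view(torch.float8_e4m3fn).to(torch.float16)
        return torch.equal(kernel_out.view(torch.int16), ref.view(torch.int16))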

I've checked that all unit tests pass on an RTX 3080 (sm86). There is no IR change for sm >= 90. For sm89, there is a minor change: previously F32 -> F8E4M3/F8E5M2 was implemented as F32 -> F16 -> F8E4M3/F8E5M2 without correct RTNE; now it is implemented directly with RTNE.

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@ThomasRaoux (Collaborator) left a comment:

The main reason we haven't been supporting e4m3 on those targets is that it is inefficient, so we don't want to give users the impression that it is natively supported or has an efficient conversion.
Conversion can be done at the kernel level for users who want to support this format, so I'm not sure there is a strong motivation to have the emulation in Triton.

PTX conversion code under review:

    "add.u32 a0, a0, 0x00800080; \n" // a0 += 0x00800080
    "add.u32 a1, a1, 0x00800080; \n" // (round to nearest)
    "prmt.b32 $0, a0, a1, 0x7531; \n\t" // output = a1a0
    ".reg .b32 a<2>, b<2>; \n"
A collaborator commented:
Changing the conversion of e5m2 will have a large impact on backward compatibility, precision, and performance. This is going to be a problem for OAI internally; I don't think we want to change it, at least not in this PR.

@woct0rdho (Contributor, Author) commented Aug 19, 2025

Thank you, and I understand your concern. Do you think there is a way to 'inject' these kernels when doing triton.jit or torch.compile, or to add a custom pass, without modifying Triton itself? (And preferably avoid graph breaks in torch.compile and retain graph-level optimizations?)

Also, I'd say non-standard rounding may cause surprising compatibility issues, such as 'why do I suddenly get noise when enabling torch.compile', although I can't yet show a concrete case.

@ThomasRaoux (Collaborator) replied:

It should be easy to write a pass on ttir or ttgir that transforms it into supported IR using elementwise_inline_asm.
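As a rough illustration of the kernel-level workaround mentioned above (not part of this PR, and using plain Triton bit manipulation rather than elementwise_inline_asm): the widening E5M2 -> F16 direction is exact because E5M2 is a truncated binary16, so it reduces to a byte shift, assuming the fp8 tensor is handed to the kernel as uint8 storage (e.g. via x.view(torch.uint8)):

    import triton
    import triton.language as tl

    @triton.jit
    def e5m2_to_f16_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        # Raw E5M2 bytes, loaded from a uint8 view of the fp8 tensor.
        b = tl.load(x_ptr + offs, mask=mask, other=0)
        # E5M2 is the top byte of an IEEE binary16, so widening is a byte shift.
        h = (b.to(tl.uint16) << 8).to(tl.float16, bitcast=True)
        tl.store(y_ptr + offs, h, mask=mask)

The narrowing direction (F16/F32 -> E5M2/E4M3) is where the RTNE and denormal handling discussed in this PR come in, and E4M3 needs exponent rebiasing even for widening, so a sketch like this only covers the easy case.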
