[NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) #17280
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
…m-project#16515) Signed-off-by: kaln27 <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
I have verified that this PR works on FP8 models and it has the same speed as the TRT-LLM FP8 |
After merging today's vllm-project/vllm main into https://github.com/kaln27/vllm/tree/main (I did it on my current https://github.com/cyril23/vllm/tree/main/), I've built it.
The PyTorch wheel size is still < 400 MB although I built with default settings. You can try it out on Docker Hub (based on main from 27th June 2025, after cyril23#2).
Strangely, performance is not as good as some older builds from 19th May 2025 (after cyril23#1).
BF16 performance has degraded since the older build, too. Furthermore, the very first request after starting up vLLM takes 30-60 seconds; it feels like PTX being compiled or something. This only happens on my June builds. However, I don't think it has anything to do with your code @kaln27, but rather with some recent changes to the vLLM main branch. Maybe I'm missing some important runtime flags or built it wrong? Edit:
Apparently #19336 is why this happened
Unfortunately we need that PR until the PyPI max wheel size has been increased. Furthermore, according to #19336 (comment), I just need more warmup time. Edit: more warmup does not help in my case, which is strange.
Hi @tlrmchlsmth, may I know when this PR is expected to be merged? I've also verified @kaln27's changes on an RTX 5090 (sm120), and they work well with the LLaMA 3.2 11B Vision model. Thanks!
Please check and merge the PR ASAP; this is very useful for people using 50-series and other Blackwell GPUs.
Apologies for missing this PR, thanks for the kernel support! This looks reasonable to me, but could you share an e2e accuracy eval to make sure the kernel runs properly? Typically we use gsm8k on lm-eval
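For reference, the reply below runs this through the lm-eval CLI; an equivalent programmatic call is sketched here, assuming lm-eval's `simple_evaluate` entry point and the same local model path used in the next comment.

```python
# Sketch: programmatic gsm8k eval of the FP8 checkpoint via the vLLM backend.
# Model path mirrors the CLI invocation below; adjust to your environment.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic,"
        "dtype=auto,gpu_memory_utilization=0.9,add_bos_token=True,"
        "max_model_len=4096,tensor_parallel_size=1"
    ),
    tasks=["gsm8k"],
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```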
@mgoin thanks for your reply. I have downloaded [Qwen2.5-3B-FP8-dynamic](https://huggingface.co/RedHatAI/Qwen2.5-3B-FP8-dynamic) and run the benchmark on gsm8k. The vLLM I used was built yesterday.

nvidia-smi
Tue Jul 1 10:38:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5070 Ti Off | 00000000:01:00.0 Off | N/A |
| 53% 32C P0 22W / 300W | 0MiB / 16303MiB | 6% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Results

lm_eval \
--model vllm \
--model_args pretrained="/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.9,add_bos_token=True,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
--tasks gsm8k \
--batch_size auto
INFO 07-01 10:26:54 [__init__.py:244] Automatically detected platform cuda.
2025-07-01:10:27:19 INFO [__main__:440] Selected Tasks: ['gsm8k']
2025-07-01:10:27:19 INFO [evaluator:185] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-07-01:10:27:19 INFO [evaluator:223] Initializing vllm model, with arguments: {'pretrained': '/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic', 'dtype': 'auto', 'gpu_memory_utilization': 0.9, 'add_bos_token': True, 'max_model_len': 4096, 'enable_chunked_prefill': True, 'tensor_parallel_size': 1}
INFO 07-01 10:27:28 [config.py:831] This model supports multiple tasks: {'embed', 'generate', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
INFO 07-01 10:27:28 [config.py:1444] Using max model len 4096
INFO 07-01 10:27:29 [config.py:2197] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-01 10:27:30 [core.py:460] Waiting for init message from front-end.
INFO 07-01 10:27:30 [core.py:70] Initializing a V1 LLM engine (v0.1.dev7202+g7414eb0.d20250630) with config: model='/data/models/RedHatAI/Qwen2.5-3B-FP8-dynami
c', speculative_config=None, tokenizer='/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_
neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_paralle
l_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=
cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, ser
ved_model_name=/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill
_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["no
ne"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_
auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,45
6,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,1
36,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
2025-07-01 10:27:30,718 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 07-01 10:27:31 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7416a73657e0>
INFO 07-01 10:27:32 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-01 10:27:32 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-01 10:27:32 [gpu_model_runner.py:1633] Starting to load model /data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic...
INFO 07-01 10:27:32 [gpu_model_runner.py:1638] Loading model from scratch...
INFO 07-01 10:27:33 [cuda.py:259] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:29<00:00, 29.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:29<00:00, 29.10s/it]
INFO 07-01 10:28:02 [default_loader.py:272] Loading weights took 29.20 seconds
INFO 07-01 10:28:02 [gpu_model_runner.py:1662] Model loading took 3.2290 GiB and 29.655233 seconds
INFO 07-01 10:28:17 [backends.py:508] Using cache directory: /data/liaojuncheng/.cache/vllm/torch_compile_cache/892cdfc123/rank_0_0/backbone for vLLM's torch.compile
INFO 07-01 10:28:17 [backends.py:519] Dynamo bytecode transform time: 14.21 s
INFO 07-01 10:28:20 [backends.py:181] Cache the graph of shape None for later use
INFO 07-01 10:28:51 [backends.py:193] Compiling a graph for general shape takes 33.59 s
INFO 07-01 10:29:06 [monitor.py:34] torch.compile takes 47.79 s in total
2025-07-01 10:29:06,431 - INFO - flashinfer.jit: Loading JIT ops: sampling
/data/liaojuncheng/miniconda3/envs/llm50xx/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/data/liaojuncheng/miniconda3/envs/llm50xx/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
2025-07-01 10:29:06,672 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
INFO 07-01 10:29:07 [gpu_worker.py:232] Available KV cache memory: 9.87 GiB
INFO 07-01 10:29:07 [kv_cache_utils.py:716] GPU KV cache size: 287,376 tokens
INFO 07-01 10:29:07 [kv_cache_utils.py:720] Maximum concurrency for 4,096 tokens per request: 70.16x
WARNING 07-01 10:29:07 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
Capturing CUDA graphs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:20<00:00, 3.35it/s]
INFO 07-01 10:29:27 [gpu_model_runner.py:2092] Graph capturing finished in 20 secs, took 0.72 GiB
INFO 07-01 10:29:27 [core.py:173] init engine (profile, create kv cache, warmup model) took 85.11 seconds
2025-07-01:10:29:41 INFO [evaluator:286] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2025-07-01:10:29:41 INFO [api.task:434] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:03<00:00, 350.59it/s]
2025-07-01:10:29:45 INFO [evaluator:559] Running generate_until requests
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 7570.54it/s]
Processed prompts: 100%|██████████████████████████████████████████| 1319/1319 [02:57<00:00, 7.43it/s, est. speed input: 7377.40 toks/s, output: 909.31 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:57<00:00, 7.42it/s]
2025-07-01:10:32:49 INFO [loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic,dtype=auto,gpu_memory_utilization=0.9,add_bos_token=True,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7324|± |0.0122|
| | |strict-match | 5|exact_match|↑ |0.6603|± |0.0130|

You can see that the vLLM version is v0.1.dev7202+g7414eb0.d20250630.
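For a quicker smoke test than a full gsm8k run, a minimal offline-generation sketch using vLLM's Python API (assuming the same local checkpoint path as above) could look like this:

```python
# Sketch: quick generation sanity check of the sm120 CUTLASS FP8 path,
# using the same FP8 (compressed-tensors) checkpoint as the eval above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/models/RedHatAI/Qwen2.5-3B-FP8-dynamic",
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(
    ["Question: If there are 3 cars and each car has 4 wheels, how many wheels in total? Answer:"],
    params,
)
print(out[0].outputs[0].text)
```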
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM120=1")
# Let scaled_mm_c2x know it doesn't need to build these arches
I think this comment is incorrect
I just copied it from the block above. If you think it's incorrect, you can delete it.
I found that the other cutlass_scaled_mm targets (which use CUTLASS 3.0) in the CMake file also have this comment.
LGTM
…llm-project#17280) Signed-off-by: kaln27 <[email protected]> Co-authored-by: mgoin <[email protected]>
…llm-project#17280) Signed-off-by: kaln27 <[email protected]> Co-authored-by: mgoin <[email protected]> Signed-off-by: avigny <[email protected]>
…llm-project#17280) Signed-off-by: kaln27 <[email protected]> Co-authored-by: mgoin <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Add CUTLASS w8a8 FP8 support for Blackwell GeForce GPUs (sm120).
Currently, using the sm100 kernel on these GPUs causes an internal error; I don't know the reason.
It works well on an RTX 5070 Ti with the Qwen2.5-VL-7B-Instruct-FP8-Dynamic model, which was quantized using llm-compressor.
FIX #16515
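For context on how such an FP8-Dynamic checkpoint is produced with llm-compressor, here is a minimal sketch. It is shown for a text-only Qwen2.5 model for simplicity (the VL model above needs its own model class), the model ID and save path are illustrative, and exact import paths can differ between llm-compressor releases.

```python
# Sketch: FP8 dynamic (w8a8) quantization with llm-compressor, producing a
# compressed-tensors checkpoint that vLLM can load on sm120 with this PR.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases expose llmcompressor.transformers.oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"  # illustrative model ID
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: per-channel FP8 weights, dynamic per-token FP8 activations;
# this scheme needs no calibration data.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen2.5-3B-Instruct-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```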