[GGUF] support Qwen3 architecture #2273
Conversation
Pull Request Overview
This PR adds support for the Qwen3 architecture by updating model configuration, weight loading, and head-splitting logic. Key changes include:
- Adding Qwen3 to the list of supported architectures in gguf_modeling.cpp.
- Updating the head_size configuration to use the key_length metadata when available, and loading the attn_q_norm/attn_k_norm weights in gguf.cpp.
- Introducing a new split_heads helper in building_blocks.cpp that applies RMS normalization for Qwen3, and updating multi_head_attention to use it (see the sketch below).
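For context, Qwen3 normalizes each query and key head with RMS norm over the head dimension, using a learned weight of length head_dim. A minimal plain-C++ sketch of that math follows; the PR builds the equivalent OpenVINO subgraph inside split_heads, so every name below is illustrative rather than the repo's actual API.

```cpp
// Conceptual sketch of the per-head RMS normalization Qwen3 applies to Q and K
// after the heads are split. Plain C++ over flat buffers, for illustration only.
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize each head's head_dim-sized slice in place:
//   x_i <- x_i / sqrt(mean(x^2) + eps) * w_i
// `weight` has head_dim elements and is shared across heads and tokens.
void rms_norm_per_head(std::vector<float>& x,            // [num_tokens * num_heads * head_dim]
                       const std::vector<float>& weight, // [head_dim]
                       std::size_t head_dim,
                       float eps) {
    for (std::size_t offset = 0; offset + head_dim <= x.size(); offset += head_dim) {
        float sum_sq = 0.0f;
        for (std::size_t i = 0; i < head_dim; ++i)
            sum_sq += x[offset + i] * x[offset + i];
        const float inv_rms = 1.0f / std::sqrt(sum_sq / head_dim + eps);
        for (std::size_t i = 0; i < head_dim; ++i)
            x[offset + i] *= inv_rms * weight[i];
    }
}
```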
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/cpp/src/gguf_utils/gguf_modeling.cpp | Expanded architecture support by including Qwen3 in the model creation logic. |
| src/cpp/src/gguf_utils/gguf.cpp | Adjusted the head_size calculation and added loading of Qwen3-specific normalization weights (see the sketch below). |
| src/cpp/src/gguf_utils/building_blocks.cpp | Added a new split_heads helper with RMS-norm handling and updated multi_head_attention to use it. |
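To make the head_size change above concrete, here is a hedged sketch of the fallback logic, assuming the GGUF metadata has been parsed into a plain map. The key names follow the usual GGUF convention (e.g. qwen3.attention.key_length), but the helper and the map type are hypothetical, not the repo's actual API.

```cpp
// Illustrative sketch of the head_size resolution described in the table above.
#include <cstdint>
#include <map>
#include <string>

std::uint32_t resolve_head_size(const std::map<std::string, std::uint32_t>& metadata,
                                const std::string& arch) {
    // Prefer the explicit per-head key length when the model provides it
    // (Qwen3 GGUF files do); otherwise fall back to the classic derivation.
    auto it = metadata.find(arch + ".attention.key_length");
    if (it != metadata.end())
        return it->second;
    return metadata.at(arch + ".embedding_length") /
           metadata.at(arch + ".attention.head_count");
}
```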
Comments suppressed due to low confidence (1)
src/cpp/src/gguf_utils/building_blocks.cpp:357
- The key used for v-split normalization ("self_attn.v_norm") does not appear to be loaded anywhere, unlike the Qwen3 q_norm and k_norm weights. Please confirm whether the absence of a v_norm weight is intentional or whether the weight loading logic should be updated accordingly.
auto v_split = split_heads(value, num_heads_kv, head_dim, rms_norm_eps, key_name + ".self_attn.v_norm", consts);
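One plausible reading is that split_heads treats the norm weight as optional and skips normalization when the named constant was never loaded, which would make the unused v_norm key a harmless no-op. A sketch of that lookup pattern, under exactly that assumption and with hypothetical types and names:

```cpp
// Sketch of an "optional norm" lookup: apply the per-head RMS norm only when
// the named weight was actually loaded. Whether split_heads really behaves
// this way is exactly what the review comment asks to confirm.
#include <map>
#include <string>
#include <vector>

using ConstMap = std::map<std::string, std::vector<float>>;  // hypothetical

bool maybe_norm_weight(const ConstMap& consts,
                       const std::string& key,
                       const std::vector<float>** out) {
    auto it = consts.find(key);  // e.g. "...self_attn.v_norm" is simply absent
    if (it == consts.end())
        return false;            // no weight loaded -> caller skips the norm
    *out = &it->second;
    return true;
}
```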
build_jenkins
Please add a test case for Qwen3, e.g. Qwen/Qwen3-0.6B-GGUF.
It seems that Qwen3 GGUF is not supported by transformers yet: ggml.py
Please add a separate test without comparison against HF results. Just compare with some hard-coded expected result for the input prompt, and add a comment that this is a temporary testing solution until transformers starts to support Qwen3 in GGUF format.
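A sketch of what such a hard-coded-expectation check could look like, written in C++ for illustration (the repo's tests live in Python). It assumes LLMPipeline accepts a .gguf path directly; the prompt and the EXPECTED string are placeholders, not real reference output.

```cpp
// Temporary testing approach until transformers can load Qwen3 GGUF,
// at which point the check can be replaced with an HF comparison.
#include <cassert>
#include <string>
#include "openvino/genai/llm_pipeline.hpp"

int main() {
    ov::genai::LLMPipeline pipe("Qwen3-0.6B-f16.gguf", "CPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 16;
    config.do_sample = false;  // greedy decoding, so the output is deterministic
    const std::string out = pipe.generate("Why is the Sun yellow?", config);
    const std::string EXPECTED = "<hard-coded reference output goes here>";
    assert(out == EXPECTED);
    return 0;
}
```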
Revert to default value
build_jenkins
Fix qwen3 test case
build_jenkins
Qwen3 GGUF-related test added:
CI test run & passed: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15546210687/job/43773618720?pr=2273#step:8:151
build_jenkins
24493e9
**Detail:**
1. Load the attn_q_norm and attn_k_norm weights for the Qwen3 architecture (sketch below).
2. Add rms_norm in split_heads after the Q/K head split for the Qwen3 architecture.

**Validation Scope:**
- [Qwen3-0.6B-f16](https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-f16.gguf) (CPU/GPU)
- [Qwen3-0.6B-Q8_0](https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-Q8_0.gguf) (CPU*)
- [Qwen3-4B-Q4_K_M](https://huggingface.co/Qwen/Qwen3-4B-GGUF/blob/main/Qwen3-4B-Q4_K_M.gguf) (CPU/GPU)

*Q8_0 has an accuracy issue on GPU: CVS-166108

Co-authored-by: Xiake Sun <[email protected]>
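As a companion to point 1 above, a hedged sketch of fetching the per-layer norm tensors by their conventional GGUF names (blk.&lt;i&gt;.attn_q_norm.weight and blk.&lt;i&gt;.attn_k_norm.weight); the tensor-map type and helper are hypothetical stand-ins for what gguf.cpp does.

```cpp
// Sketch of the weight-loading side: for each transformer block, fetch the
// Qwen3 Q/K norm tensors by their conventional GGUF names.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using TensorMap = std::map<std::string, std::vector<float>>;  // hypothetical

void load_qk_norms(const TensorMap& gguf_tensors,
                   std::uint32_t num_layers,
                   TensorMap& out) {
    for (std::uint32_t i = 0; i < num_layers; ++i) {
        const std::string prefix = "blk." + std::to_string(i) + ".";
        // Conventional GGUF names for Qwen3's per-head norm weights.
        for (const char* name : {"attn_q_norm.weight", "attn_k_norm.weight"})
            out[prefix + name] = gguf_tensors.at(prefix + name);
    }
}
```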