Conversation
@TianmengChen TianmengChen commented May 27, 2025

Detail:

  1. Load attn_q_norm and attn_k_norm weights for the Qwen3 architecture.
  2. Apply rms_norm in split_heads to the q and k heads for the Qwen3 architecture.
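
The per-head normalization described above can be sketched roughly as follows. This is a pure-Python illustration of the idea, not the actual OpenVINO graph-building code; the `rms_norm_eps` default and the flat `[num_heads * head_dim]` layout are assumptions based on the PR description:

```python
import math

def rms_norm(vec, weight, eps=1e-6):
    # Root-mean-square norm: x / sqrt(mean(x^2) + eps), scaled by a learned weight.
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(vec, weight)]

def split_heads(states, num_heads, head_dim, norm_weight=None):
    # Slice a flat [num_heads * head_dim] vector into per-head slices;
    # for Qwen3, RMS norm is applied to each q/k head right after the split.
    heads = [states[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]
    if norm_weight is not None:
        heads = [rms_norm(h, norm_weight) for h in heads]
    return heads
```

Note that the norm is applied per head (over `head_dim` elements), not over the whole projection, which is why it has to happen inside the head-splitting step rather than before it.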

Validation Scope:
Qwen3-0.6B-f16 (CPU/GPU)
Qwen3-0.6B-Q8_0 (CPU*)
Qwen3-4B-Q4_K_M (CPU/GPU)

*Q8_0 has accuracy issue on GPU: CVS-166108

@github-actions github-actions bot added the category: GGUF GGUF file reader label May 27, 2025
@sammysun0711 sammysun0711 requested a review from Copilot May 27, 2025 03:39
@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds support for the Qwen3 architecture by updating model configuration, weight loading, and head-splitting logic. Key changes include:

  • Adding Qwen3 to the list of supported architectures in gguf_modeling.cpp.
  • Updating head_size configuration to use the key_length metadata if available and loading attn_q_norm/attn_k_norm weights in gguf.cpp.
  • Introducing a new split_heads helper function in building_blocks.cpp that applies RMS normalization for Qwen3 and updating multi_head_attention to use it.
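
The head_size fallback mentioned in the second bullet can be sketched as below. The metadata key names here are illustrative, based on common GGUF conventions, not the exact keys used in gguf.cpp:

```python
def resolve_head_size(metadata):
    # Prefer an explicit attention key_length from the GGUF metadata when
    # present (Qwen3 GGUFs set it, and it differs from the derived value);
    # otherwise fall back to the conventional embedding_length / head_count.
    key_length = metadata.get("attention.key_length")
    if key_length is not None:
        return key_length
    return metadata["embedding_length"] // metadata["head_count"]
```

This matters for Qwen3 because its head dimension is not simply `embedding_length / head_count`, so relying on the derived value would mis-shape the q/k/v splits.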

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/cpp/src/gguf_utils/gguf_modeling.cpp | Expanded architecture support by including Qwen3 in model creation logic. |
| src/cpp/src/gguf_utils/gguf.cpp | Adjusted head_size calculation and added loading for Qwen3-specific normalization weights. |
| src/cpp/src/gguf_utils/building_blocks.cpp | Added a new split_heads helper function with RMS norm handling and updated multi_head_attention to use it. |
Comments suppressed due to low confidence (1)

src/cpp/src/gguf_utils/building_blocks.cpp:357

  • The key used for v-split normalization ("self_attn.v_norm") does not appear to be loaded anywhere, unlike the Qwen3 q_norm and k_norm weights. Please confirm if the absence of a v_norm weight is intentional or if the weight loading logic should be updated accordingly.
`auto v_split = split_heads(value, num_heads_kv, head_dim, rms_norm_eps, key_name + ".self_attn.v_norm", consts);`

@sammysun0711

build_jenkins

@sammysun0711

Please add a test case for Qwen3: e.g. Qwen/Qwen3-0.6B-GGUF

@sammysun0711

It seems that Qwen3 GGUF is not supported by transformers yet: ggml.py
Hence the test failed here: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15337247120/job/43157146883?pr=2273
`GGUF model with architecture qwen3 is not supported yet.`
@rkazants, can we skip this test until qwen3 GGUF is supported by transformers?

@rkazants

rkazants commented May 30, 2025

> It seems that Qwen3 GGUF is not supported by transformers yet: ggml.py Hence the test failed here: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15337247120/job/43157146883?pr=2273 `GGUF model with architecture qwen3 is not supported yet.` @rkazants, can we skip this test until qwen3 GGUF is supported by transformers?

Please add a separate test without comparison against HF results; just compare with a hard-coded expected result for an input prompt. And add a comment that this is a temporary testing solution until transformers starts to support qwen3 in GGUF format.
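
The suggested approach, comparing pipeline output against a frozen reference instead of a HuggingFace run, might look roughly like this. The function name and the stubbed pipeline are hypothetical, not the actual test later added to the repo:

```python
def check_hardcoded_output(generate_fn, prompt, expected):
    # Temporary testing approach until transformers supports Qwen3 GGUF:
    # instead of cross-checking against a HuggingFace reference generation,
    # compare the pipeline's (deterministic, greedy) output with a frozen
    # expected string captured once from a known-good run.
    actual = generate_fn(prompt)
    assert actual == expected, f"got {actual!r}, expected {expected!r}"

# Hypothetical usage with a stubbed pipeline standing in for the real one:
fake_pipeline = lambda prompt: "The capital of France is Paris."
check_hardcoded_output(fake_pipeline,
                       "What is the capital of France?",
                       "The capital of France is Paris.")
```

The trade-off is that the hard-coded string must be refreshed whenever sampling defaults or the model artifact change, which is why the comment marking it as temporary matters.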

@github-actions github-actions bot added category: LLM LLM pipeline (stateful, static) category: LLM samples GenAI LLM samples and removed no-match-files labels May 30, 2025
@github-actions github-actions bot removed the category: LLM samples GenAI LLM samples label May 30, 2025
@sammysun0711

build_jenkins

@sammysun0711 sammysun0711 requested a review from rkazants June 4, 2025 04:51
@sammysun0711

build_jenkins

@sammysun0711

> It seems that Qwen3 GGUF is not supported by transformers yet: ggml.py Hence the test failed here: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15337247120/job/43157146883?pr=2273 `GGUF model with architecture qwen3 is not supported yet.` @rkazants, can we skip this test until qwen3 GGUF is supported by transformers?

> Please add a separate test without comparison against HF results; just compare with a hard-coded expected result for an input prompt. And add a comment that this is a temporary testing solution until transformers starts to support qwen3 in GGUF format.

Qwen3 GGUF related test added:

`def test_full_gguf_qwen3_pipeline(pipeline_type, model_ids):`

CI test run & passed: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15546210687/job/43773618720?pr=2273#step:8:151

@sammysun0711

build_jenkins

@rkazants rkazants added this to the 2025.3 milestone Jun 10, 2025
@Wovchena Wovchena added this pull request to the merge queue Jun 10, 2025
Merged via the queue into openvinotoolkit:master with commit 24493e9 Jun 10, 2025
165 of 176 checks passed
sammysun0711 pushed a commit to sammysun0711/openvino.genai that referenced this pull request Jun 20, 2025
**Detail:**

1. Load attn_q_norm and attn_k_norm weights for the Qwen3 architecture.
2. Apply rms_norm in split_heads to the q and k heads for the Qwen3 architecture.

**Validation Scope:** 

[Qwen3-0.6B-f16](https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-f16.gguf)
(CPU/GPU)

[Qwen3-0.6B-Q8_0](https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-Q8_0.gguf)
(CPU*)

[Qwen3-4B-Q4_K_M](https://huggingface.co/Qwen/Qwen3-4B-GGUF/blob/main/Qwen3-4B-Q4_K_M.gguf)
(CPU/GPU)

*Q8_0 has accuracy issue on GPU: CVS-166108

---------

Co-authored-by: Xiake Sun <[email protected]>
Labels
category: GGUF GGUF file reader category: LLM LLM pipeline (stateful, static)
4 participants