I used the following YAML optimization rules:

```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeRMSNorm
  replace:
    class: ktransformers.operators.layernorm.KQwen3MoeRMSNorm
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeRMSNorm
  replace:
    class: ktransformers.operators.layernorm.KQwen3MoeRMSNorm
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeRMSNorm
  replace:
    class: ktransformers.operators.layernorm.KQwen3MoeRMSNorm
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\.|^model.norm"
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeRMSNorm
  replace:
    class: ktransformers.operators.layernorm.KQwen3MoeRMSNorm
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeMLP
  replace:
    class: ktransformers.operators.mlp.KQwen2MoeMLP
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeMLP
  replace:
    class: ktransformers.operators.mlp.KQwen2MoeMLP
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeMLP
  replace:
    class: ktransformers.operators.mlp.KQwen2MoeMLP
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\."
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeMLP
  replace:
    class: ktransformers.operators.mlp.KQwen2MoeMLP
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\."
    class: ktransformers.models.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.RotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\."
    class: ktransformers.models.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.RotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\."
    class: ktransformers.models.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.RotaryEmbedding
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\."
    class: ktransformers.models.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.RotaryEmbedding
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\.(?!.*mlp\\.shared_expert_gate).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\.(?!.*mlp\\.shared_expert_gate).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\.(?!.*mlp\\.shared_expert_gate).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\.(?!.*mlp\\.shared_expert_gate).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\.mlp$"
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeSparseMoeBlock
  replace:
    class: ktransformers.operators.experts.KQwen3MoeSparseMoeBlockV2  # mlp module with custom forward function
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\.mlp$"
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeSparseMoeBlock
  replace:
    class: ktransformers.operators.experts.KQwen3MoeSparseMoeBlockV2  # mlp module with custom forward function
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\.mlp$"
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeSparseMoeBlock
  replace:
    class: ktransformers.operators.experts.KQwen3MoeSparseMoeBlockV2  # mlp module with custom forward function
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\.mlp$"
    class: ktransformers.models.modeling_qwen3_moe.Qwen3MoeSparseMoeBlock
  replace:
    class: ktransformers.operators.experts.KQwen3MoeSparseMoeBlockV2  # mlp module with custom forward function
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False  # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False  # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:2"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:2"
  recursive: False  # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:3"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:3"
  recursive: False  # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.([0-9]|1[0-9]|2[0-2])\\.self_attn$"
  replace:
    class: ktransformers.operators.balance_serve_attention.KQwen3MoeAttention  # optimized attention implementation
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(2[3-9]|3[0-9]|4[0-5])\\.self_attn$"
  replace:
    class: ktransformers.operators.balance_serve_attention.KQwen3MoeAttention  # optimized attention implementation
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(4[6-9]|5[0-9]|6[0-8])\\.self_attn$"
  replace:
    class: ktransformers.operators.balance_serve_attention.KQwen3MoeAttention  # optimized attention implementation
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
- match:
    name: "^model\\.layers\\.(6[9]|7[0-9]|8[0-9]|9[0-3])\\.self_attn$"
  replace:
    class: ktransformers.operators.balance_serve_attention.KQwen3MoeAttention  # optimized attention implementation
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KQwen2MoeModel"
    kwargs:
      per_layer_prefill_intput_threshold: 0  # 0 disables layer-wise prefill
      transfer_map:
        23: "cuda:1"
        46: "cuda:2"
        69: "cuda:3"
- match:
    name: "^lm_head$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
      generate_op: "VLinearMarlin"
      prefill_op: "KLinearTorch"
```
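For context, the four layer ranges in the rules above are meant to split the model's 94 layers across four GPUs at the same boundaries as `transfer_map` (0–22 on cuda:0, 23–45 on cuda:1, 46–68 on cuda:2, 69–93 on cuda:3). A small standalone sketch (plain `re`, with a hypothetical module name) that checks the ranges assign each layer to exactly one device:

```python
import re

# Self-check sketch: the four layer-range regexes from the rules above should
# assign each of layers 0..93 to exactly one GPU, with boundaries matching
# transfer_map (23 -> cuda:1, 46 -> cuda:2, 69 -> cuda:3).
ranges = {
    "cuda:0": r"^model\.layers\.([0-9]|1[0-9]|2[0-2])\.",
    "cuda:1": r"^model\.layers\.(2[3-9]|3[0-9]|4[0-5])\.",
    "cuda:2": r"^model\.layers\.(4[6-9]|5[0-9]|6[0-8])\.",
    "cuda:3": r"^model\.layers\.(6[9]|7[0-9]|8[0-9]|9[0-3])\.",
}
for i in range(94):
    name = f"model.layers.{i}.input_layernorm"  # hypothetical module name
    hits = [dev for dev, pat in ranges.items() if re.match(pat, name)]
    assert len(hits) == 1, f"layer {i} matched {hits}"
print("layers 0-93 each map to exactly one device")
```

The ranges do partition layers 0–93 cleanly, so the regexes themselves agree with the `transfer_map` boundaries at layers 23, 46, and 69.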
However, during the warmup phase it errors out:

```
capturing cuda graph 1 1
2025-06-13 11:44:07,720 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-06-13 11:44:07,746 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-06-13 11:44:08,128 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-06-13 11:44:08,152 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-06-13 11:44:08,262 - INFO - flashinfer.jit: Loading JIT ops: page
2025-06-13 11:44:08,285 - INFO - flashinfer.jit: Finished loading JIT ops: page
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 279, in run_engine
engine.model_runner.warmup()
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 148, in warmup
self.outputs_buf[i] = self.model(self.input[i], self.features_buf[i], self.bsz_tensor_buf, self.num_tokens_tensor_buf, self.page_idx_buf[i], self.page_offset_buf[i], cuda_graph_idx=i)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/ktransformers/models/custom_modeling_qwen3_moe.py", line 100, in forward
hidden_states, residual = decode_layer.input_layernorm(hidden_states, num_tokens_tensors, residual)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/ktransformers/operators/layernorm.py", line 150, in forward
fused_add_rmsnorm(x, residual, self.weight.data, batch_size_tensor, self.variance_epsilon)
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/flashinfer/norm.py", line 145, in fused_add_rmsnorm
get_module_attr("fused_add_rmsnorm")(input_tensor, residual, weight, batch_size_tensor, eps, enable_pdl)
File "/home/localhost/miniconda3/envs/kt311/lib/python3.11/site-packages/torch/_ops.py", line 756, in call
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CHECK_EQ(residual.device(), device) failed. cuda:0 vs cuda:1
```

Why does this happen?
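If I'm reading the check correctly, the residual passed into the `input_layernorm` of the first layer placed on cuda:1 (layer 23 per `transfer_map`) is still on cuda:0. A minimal torch illustration of the kind of mismatch the CHECK_EQ reports (assumed semantics only, not the actual ktransformers call path; needs two visible GPUs):

```python
import torch

# fused_add_rmsnorm expects hidden_states, residual, and the norm weight on
# one device; this just reproduces the flavor of the failing device check.
hidden_states = torch.randn(1, 4096, device="cuda:1")  # already moved to cuda:1
residual = torch.randn(1, 4096, device="cuda:0")       # residual left on cuda:0
assert residual.device == hidden_states.device, (
    f"CHECK_EQ(residual.device(), device) failed. "
    f"{residual.device} vs {hidden_states.device}"
)
```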