Conversation

@cyrusd98

Summary: GPT OSS uses a different activation function than Llama4. This diff removes the hardcoded activation function to allow general fused MoE support.

Test Plan: CI Passes

Differential Revision: D86720793

Summary: No functional changes

Test Plan: CI Passes

Differential Revision: D86799190
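
For context, the change replaces the reference MoE path's hardcoded SiLU-and-mul step with a lookup keyed by an activation string. A minimal sketch of that idea follows; the names and the set of supported activations are illustrative assumptions, not the exact vLLM helpers.

    import torch
    import torch.nn.functional as F

    # Illustrative sketch only: resolve an "activation-and-mul" function from
    # a string instead of hardcoding SiLU.
    def get_act_and_mul_fn(activation: str):
        def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
            # Split the fused gate/up projection and apply act(gate) * up.
            gate, up = x.chunk(2, dim=-1)
            return F.silu(gate) * up

        def gelu_and_mul(x: torch.Tensor) -> torch.Tensor:
            gate, up = x.chunk(2, dim=-1)
            return F.gelu(gate) * up

        fns = {"silu": silu_and_mul, "gelu": gelu_and_mul}
        if activation not in fns:
            raise ValueError(f"Unsupported MoE activation: {activation}")
        return fns[activation]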

@gemini-code-assist bot left a comment

Code Review

This pull request aims to remove the hardcoded 'silu' activation in the MoE expert implementation to support more general activation functions. The changes are in the right direction, but I've identified a critical issue in torch_moe where a call to the updated torch_experts function is now incorrect, which will lead to a runtime error. I've provided a fix for this. Additionally, I've suggested a small performance optimization in torch_experts to avoid redundant function calls inside a loop.

Comment on lines +951 to +953
def apply_moe_activation(act_str, act_input):
    act_fn = get_act_and_mul_fn(act_str)
    return act_fn(act_input)

Severity: high

The apply_moe_activation helper function repeatedly calls get_act_and_mul_fn inside the loop over experts. This is inefficient. It's better to fetch the activation function once before the loop.

I suggest removing this helper function and instead initializing the activation function before the loop:

    act_fn = get_act_and_mul_fn(activation)

    for i in range(num_experts):
        mask = topk_ids == i
        if mask.sum():
            if quant_dtype is None:
                # ...
                tmp2 = act_fn(tmp1)
                # ...

This change should be applied to all calls to apply_moe_activation within the loop.
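
For illustration, here is a self-contained sketch of the hoisted pattern applied to a reference expert loop; the shapes and names are simplified assumptions (routing weights are omitted) rather than the actual torch_experts code.

    import torch

    def reference_experts(hidden, w1, w2, topk_ids, num_experts,
                          activation: str = "silu"):
        # Resolve the activation-and-mul function once, before the expert
        # loop, instead of re-resolving it for every expert.
        act_fn = get_act_and_mul_fn(activation)  # e.g. the sketch above
        out = torch.zeros_like(hidden)
        for i in range(num_experts):
            # Tokens routed to expert i (routing weights omitted for brevity).
            mask = (topk_ids == i).any(dim=-1)
            if mask.sum():
                tmp1 = hidden[mask] @ w1[i].t()  # fused gate/up projection
                tmp2 = act_fn(tmp1)              # activation-and-mul
                out[mask] = tmp2 @ w2[i].t()     # down projection
        return out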

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1045 to 1051
    topk_weight,
    topk_ids,
    global_num_experts,
    b_bias1,
    b_bias2,
    w1_bias,
    w2_bias,
    expert_map,
)

P0: Provide the activation argument when calling torch_experts

The new activation parameter in torch_experts is required, but torch_moe still calls the helper with the old positional layout. As written, the integer global_num_experts is passed into get_act_and_mul_fn as the activation name and the bias tensors are shifted into the wrong parameters, so any test that reaches this code will raise before exercising the kernel. Please either give activation a default (e.g., "silu") or add an activation argument to torch_moe and forward it correctly.
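
One hedged sketch of what the suggested fix could look like, keeping a "silu" default so existing call sites stay valid; the parameter lists here are illustrative and not the actual signatures in vLLM's test utilities.

    # Illustrative only; the real torch_experts/torch_moe signatures differ.
    def torch_experts(a, w1, w2, topk_weight, topk_ids, global_num_experts,
                      w1_bias=None, w2_bias=None, expert_map=None,
                      activation="silu"):  # default keeps old callers working
        ...

    def torch_moe(a, w1, w2, topk_weight, topk_ids, global_num_experts,
                  w1_bias=None, w2_bias=None, expert_map=None,
                  activation="silu"):
        # Forward activation by keyword so positional arguments cannot shift
        # into the wrong parameters.
        return torch_experts(a, w1, w2, topk_weight, topk_ids,
                             global_num_experts, w1_bias, w2_bias,
                             expert_map, activation=activation)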

Summary:
Pull Request resolved: vllm-project#28503

GPT OSS uses a different activation function than Llama4. This diff removes the hardcoded activation function to allow general fused MoE support.

Test Plan: CI Passes

Differential Revision: D86720793