
add qwen 3 support #118


Merged: 4 commits into main from q3_check on May 13, 2025

Conversation


@saum7800 saum7800 commented May 13, 2025

Qwen 3 works now!

  • Update vLLM and Unsloth.
  • Update patches, specifically the get_lora_tokenizer one and the wake_up() call.
  • Add an example with Qwen 3. It's important to set thinking to true or false properly (see the sketch below)!

Have not tested the MoE models yet, but that can be a future PR (if they don't already start working).

Worked with Python 3.12; not tested with 3.10/3.11 yet.
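For context, roughly how the thinking toggle works for Qwen 3 (a minimal sketch, not the exact example added in this PR; the model name and prompt are placeholders): the Qwen3 chat template accepts an `enable_thinking` flag through `apply_chat_template`.

```python
# Minimal sketch: toggling Qwen3 "thinking" mode through the chat template.
# Assumes the Hugging Face tokenizer for a Qwen3 checkpoint; names are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# enable_thinking=True: the prompt asks the model to emit a <think>...</think>
# block before its final answer.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# enable_thinking=False: the template closes the think block up front, so the
# model answers directly.
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
print(with_thinking != without_thinking)  # the two prompts differ only around the think block
```

When serving through an OpenAI-compatible vLLM endpoint, the same flag can typically be passed per request via `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`, though that depends on the vLLM version in use.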

@saum7800 saum7800 requested review from bradhilton and corbt May 13, 2025 08:29
pyproject.toml Outdated
@@ -54,8 +54,8 @@ dev-dependencies = [
 ]
 override-dependencies = [
 "bitsandbytes; sys_platform == 'linux'",
-"unsloth==2025.3.19 ; sys_platform == 'linux'",
-"unsloth-zoo==2025.3.17 ; sys_platform == 'linux'",
+"unsloth==2025.5.1 ; sys_platform == 'linux'",
Contributor

Can we remove these from override-dependencies and just keep them in dependencies now?

@corbt corbt left a comment

Nice! You've confirmed training appears to converge?

@bradhilton bradhilton left a comment

Looks good. My main concern is that when I previously tested vLLM v0.8.3, inference was notably slower, at least for LoRA or quantized models. Maybe they've since addressed that, but it's something I'd be worried about.

@corbt corbt commented May 13, 2025

This is a fair concern: in our internal testing of vLLM 0.8.4, LoRA inference was 4x slower than non-LoRA. Seems like there may have been a regression recently.

@bradhilton

I wonder if it is possible to upgrade Unsloth to support Qwen3 without upgrading vLLM

@saum7800 saum7800 commented May 13, 2025

vLLM does need to be >=0.8.5 for Qwen3, I believe: https://qwen.readthedocs.io/en/latest/deployment/vllm.html. Will still test. Currently running quick benchmark tests with ART to see how much slower generation is.
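A tiny guard along these lines (purely illustrative, not part of this PR) can make that requirement explicit at runtime; it assumes `packaging` is available, which it normally is alongside vLLM:

```python
# Fail fast if the installed vLLM is too old for Qwen3 (per the >=0.8.5 claim above).
from importlib.metadata import version
from packaging.version import Version

installed = version("vllm")
if Version(installed) < Version("0.8.5"):
    raise RuntimeError(f"Qwen3 support needs vllm>=0.8.5, found {installed}; please upgrade.")
```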

@corbt, yeah, checked convergence, which looks good!
[Screenshot: training convergence, 2025-05-13]

@saum7800

Inference benchmark results v0.8.5:
Iterations: 10
Concurrency: 5
Total time: 52.77 s
Min iteration: 5.18 s
Max iteration: 6.06 s
Avg iteration: 5.28 s
Std dev iter: 0.28 s
Avg per req: 1.06 s/request

Inference benchmark results v0.7.3:
Iterations: 10
Concurrency: 5
Total time: 49.77 s
Min iteration: 4.93 s
Max iteration: 5.02 s
Avg iteration: 4.98 s
Std dev iter: 0.02 s
Avg per req: 1.00 s/request

Each run is 10 iterations, where each iteration is 5 concurrent requests, each with 200 output tokens. You're right @bradhilton, it's definitely slower, but not unusably slower imo. In general, we will have to keep moving with vLLM and Unsloth versions, so I am in favour of going ahead. Thoughts, @corbt and @bradhilton?
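For reference, a rough sketch of this kind of benchmark loop (the actual ART benchmark script may differ; the endpoint URL, model name, and prompt are placeholders):

```python
# Rough benchmark sketch: `iterations` rounds, each firing `concurrency` chat
# requests at an OpenAI-compatible vLLM endpoint and timing the round.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> None:
    await client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": "Write a short story about a robot."}],
        max_tokens=200,  # 200 output tokens per request, as in the run above
    )

async def benchmark(iterations: int = 10, concurrency: int = 5) -> None:
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        await asyncio.gather(*(one_request() for _ in range(concurrency)))
        durations.append(time.perf_counter() - start)
    print(f"Total time: {sum(durations):.2f} s")
    print(f"Avg iteration: {statistics.mean(durations):.2f} s")
    print(f"Std dev iter: {statistics.stdev(durations):.2f} s")
    print(f"Avg per req: {sum(durations) / (iterations * concurrency):.2f} s/request")

asyncio.run(benchmark())
```

The same harness covers the larger run discussed below by raising `concurrency` to 800 and `max_tokens` (and the prompt length) to roughly 1,000 tokens.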

@bradhilton bradhilton commented May 13, 2025

I don't know if this is the best example to benchmark because the requests are exceptionally small. When we get a chance, I'd like to see benchmarks for Temporal Clue or a similarly inference-intensive task.

@saum7800

What's the ballpark input tokens, output tokens, and concurrency for Temporal Clue, would you say? Shouldn't be too hard to update the benchmark script.

@bradhilton

800 concurrent requests with about 1,000 input and 1,000 output tokens, Qwen2.5 16B, on an H100.

@saum7800

We're looking good; in fact, better in the high-concurrency, high input/output-token run.

Inference benchmark results v0.8.5:
Iterations: 1
Concurrency: 800
Total time: 245.20 s
Avg iteration: 245.20 s
Avg per req: 0.31 s/request
Avg prompt tokens: 1538.00
Avg completion tokens: 813.50

Inference benchmark results v0.7.3:
Iterations: 1
Concurrency: 800
Total time: 274.60 s
Avg iteration: 274.60 s
Avg per req: 0.34 s/request
Avg prompt tokens: 1538.00
Avg completion tokens: 794.20

Seems to be on par. We can merge right now... we can always revert if we see massive performance losses. Will wait for your opinion before merging, @corbt @bradhilton.

@bradhilton

Okay, awesome. LGTM.

@bradhilton bradhilton self-requested a review May 13, 2025 23:44
@saum7800 saum7800 merged commit 76199b6 into main May 13, 2025
1 check passed
@saum7800 saum7800 deleted the q3_check branch May 13, 2025 23:48