add qwen 3 support #118
Conversation
pyproject.toml (Outdated)
@@ -54,8 +54,8 @@ dev-dependencies = [
 ]
 override-dependencies = [
     "bitsandbytes; sys_platform == 'linux'",
-    "unsloth==2025.3.19 ; sys_platform == 'linux'",
-    "unsloth-zoo==2025.3.17 ; sys_platform == 'linux'",
+    "unsloth==2025.5.1 ; sys_platform == 'linux'",
Can we remove these from override-dependencies and just keep them in dependencies now?
Nice! You've confirmed training appears to converge?
Looks good. My main concern is that when I previously tested vLLM v0.8.3, inference was notably slower, at least for LoRA or quantized models. Maybe they've since addressed that, but it's something I'd be worried about.
This is a fair concern: in our internal testing of vLLM 0.8.4, LoRA inference was 4x slower than non-LoRA inference. It seems like there may have been a regression recently.
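For context on how a comparison like that might be reproduced, here is a minimal sketch that times base vs. LoRA generation with vLLM's offline `LLM` API; the model name, adapter path, and prompt are placeholders rather than the actual setup used in our testing:

```python
# Rough sketch: compare base vs. LoRA generation latency with vLLM.
# Model, adapter path, and prompts below are placeholders.
import time

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)
params = SamplingParams(max_tokens=200)
prompts = ["Summarize the plot of Hamlet."] * 5

# Base model pass
start = time.perf_counter()
llm.generate(prompts, params)
base_s = time.perf_counter() - start

# Same prompts routed through a (hypothetical) LoRA adapter
start = time.perf_counter()
llm.generate(prompts, params, lora_request=LoRARequest("adapter", 1, "/path/to/lora"))
lora_s = time.perf_counter() - start

print(f"base: {base_s:.2f}s, lora: {lora_s:.2f}s, slowdown: {lora_s / base_s:.1f}x")
```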
I wonder if it is possible to upgrade Unsloth to support Qwen3 without upgrading vLLM.
vLLM does need to be >=0.8.5 for Qwen3, I believe: https://qwen.readthedocs.io/en/latest/deployment/vllm.html. Will still test. Currently running quick benchmark tests with ART to see how much slower generation is. @corbt, yeah, checked convergence, which looks good!
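As a quick sanity check that a Qwen3 checkpoint loads under vLLM >= 0.8.5, something like the following sketch can be used (the checkpoint name and sampling settings are just examples, not what was benchmarked here):

```python
# Smoke test: load a Qwen3 checkpoint with vLLM >= 0.8.5 and generate once.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B", max_model_len=4096)
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Briefly explain what a LoRA adapter is."], params)
print(outputs[0].outputs[0].text)
```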
Inference benchmark results for v0.8.5 and v0.7.3: each run is 10 iterations, where each iteration is 5 concurrent requests, each with 200 output tokens. You're right @bradhilton, it's definitely slower, but not unusably slower imo. In general, we will have to keep moving with vLLM and Unsloth versions, so I am in favour of going ahead. Thoughts, @corbt and @bradhilton?
I don't know if this is the best example to benchmark because the requests are exceptionally small. When we get a chance, I'd like to see benchmarks for Temporal Clue or a similarly inference-intensive task.
What's the ballpark input tokens, output tokens, and concurrency for Temporal Clue, would you say? Shouldn't be too hard to update the benchmark script.
800 concurrent requests with about 1,000 input and 1,000 output tokens, Qwen2.5 16B, on an H100.
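A rough sketch of a load test at that scale against an OpenAI-compatible vLLM server might look like the following; the endpoint, model name, and prompt are placeholders, and the repeated `"word "` string is only a crude stand-in for roughly 1,000 input tokens:

```python
# Hedged sketch of a high-concurrency load test against a local vLLM
# OpenAI-compatible endpoint. Endpoint, model, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "word " * 1000  # crude stand-in for ~1,000 input tokens


async def one_request():
    return await client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=1000,
    )


async def main(concurrency: int = 800):
    start = time.perf_counter()
    results = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    total_out = sum(r.usage.completion_tokens for r in results)
    print(f"{concurrency} requests in {elapsed:.1f}s, "
          f"{total_out / elapsed:.0f} output tok/s")


if __name__ == "__main__":
    asyncio.run(main())
```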
We're looking good, in fact better in the high-concurrency, high input/output token run. Inference benchmark results for v0.8.5 and v0.7.3 seem to be on par. We can merge right now... can always revert if we see massive performance losses. Will wait for your opinion before merging @corbt @bradhilton.
Okay, awesome. LGTM.
Qwen 3 works now!
Have not tested the MoE models yet, but that can be a future PR (if they don't already start working).
Worked with Python 3.12; not tested with 3.10/3.11 yet.