Enable Numba for FFD packing algorithm #3524
Conversation
Nice! My only concern is that numba is not a dependency of trl, and I don't think adding it just for this makes sense.
Hm, honestly I haven't installed it in a fresh venv; let me check whether it's a dependency of some library that is part of trl's requirements.
Okay, so it seems like numba is part of vllm's dependencies, so if you install …
@qgallouedec this needs to be re-opened against the main branch now, thanks!
Hi, I just opened a PR with a faster FFD implementation that doesn't introduce new dependencies and doesn't require warmup.
It could be safe, but here's what I think: adding a dependency would make sense if FFD were prohibitively slow, because then we would actually need it. Here, it only takes us from “very fast” to “very, very fast,” so adding a dependency doesn't seem justified to me.
What does this PR do?
Related to #3521: enables Numba compilation for FFD packing.
Improves baseline_ffd packing speed by about 12.5%; tested on 10K-, 100K-, and 1M-sample datasets.
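To illustrate what Numba-compiling First Fit Decreasing involves, here is a minimal sketch. It is not the PR's actual diff or trl's packing code. `ffd_assign` is a hypothetical name, and the sketch assumes token lengths arrive as an `int64` NumPy array and that every sequence already fits within `capacity` (i.e. has been truncated). The `try`/`except` import keeps Numba optional, in line with the dependency concern raised in the discussion: without Numba, the same function runs as plain Python.

```python
import numpy as np

try:
    from numba import njit  # optional accelerator; assumed NOT to be a hard dependency
except ImportError:
    def njit(fn):
        # Fallback: a no-op decorator so the function still runs as plain Python
        return fn


@njit
def ffd_assign(lengths, capacity):
    """First Fit Decreasing: return (bin index per sequence, number of bins).

    Hypothetical helper. Assumes lengths is an int64 array and each
    length <= capacity.
    """
    n = lengths.shape[0]
    order = np.argsort(-lengths)              # visit the longest sequences first
    bin_of = np.empty(n, dtype=np.int64)      # bin chosen for each sequence
    remaining = np.empty(n, dtype=np.int64)   # free space per bin (at most n bins)
    n_bins = 0
    for k in range(n):
        i = order[k]
        size = lengths[i]
        if size > capacity:
            raise ValueError("sequence longer than capacity; truncate it first")
        placed = False
        # First fit: scan open bins in creation order, take the first with room
        for b in range(n_bins):
            if remaining[b] >= size:
                remaining[b] -= size
                bin_of[i] = b
                placed = True
                break
        if not placed:
            # No open bin has room, so open a new one
            remaining[n_bins] = capacity - size
            bin_of[i] = n_bins
            n_bins += 1
    return bin_of, n_bins
```

For lengths `[5, 3, 4, 2, 6]` and a capacity of 8, this yields 3 bins holding {6, 2}, {5, 3}, and {4}. With Numba installed, the first call triggers JIT compilation (the warmup cost mentioned in the discussion), and later calls reuse the compiled code. The inner first-fit scan is O(n · bins): simple, but not the fastest possible approach.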
Code to test for speed
📊 BENCHMARK SUMMARY (Before)
10K Dataset Results:
baseline_ffd : 0.67s ( 44,914 samples/sec)
fixed : 0.02s (1,668,828 samples/sec)
100K Dataset Results:
baseline_ffd : 6.72s ( 44,646 samples/sec)
fixed : 0.26s (1,149,577 samples/sec)
1M Dataset Results:
baseline_ffd : 67.98s ( 44,132 samples/sec)
fixed : 3.42s ( 876,439 samples/sec)
📊 BENCHMARK SUMMARY (After)
10K Dataset Results:
baseline_ffd : 0.60s ( 50,087 samples/sec)
fixed : 0.02s (1,718,645 samples/sec)
100K Dataset Results:
baseline_ffd : 5.98s ( 50,152 samples/sec)
fixed : 0.26s (1,140,803 samples/sec)
1M Dataset Results:
baseline_ffd : 60.42s ( 49,650 samples/sec)
fixed : 3.45s ( 870,240 samples/sec)
Speedup: 12.5%
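The headline figure can be checked against the baseline_ffd timings above. This sketch assumes speedup is defined as the ratio of wall-clock times minus one, which is one common convention; the PR doesn't state its formula. Under that definition, the 12.5% matches the 1M-sample rows, and the smaller datasets show slightly lower gains (about 11.7% for 10K and 12.4% for 100K).

```python
# baseline_ffd wall-clock times in seconds, copied from the benchmark summary
before = {"10K": 0.67, "100K": 6.72, "1M": 67.98}
after = {"10K": 0.60, "100K": 5.98, "1M": 60.42}


def speedup_pct(t_before: float, t_after: float) -> float:
    """Throughput gain in percent, assuming speedup = t_before / t_after - 1."""
    return (t_before / t_after - 1.0) * 100.0


for size in before:
    print(f"{size:>5}: {speedup_pct(before[size], after[size]):.1f}%")
```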
Code to verify correctness
Usage:
Currently, the number of packed samples matches, but packing on this branch is slightly tighter: one additional trailing sequence gets placed into some of the bins.
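The verification script itself isn't included here. The following is a minimal, hypothetical checker for this kind of comparison. It assumes each packing is represented as a per-sequence list of bin indices; `bin_loads` and `check_packing` are illustrative names, not trl APIs.

```python
from collections import Counter
from typing import List, Sequence


def bin_loads(lengths: Sequence[int], bin_of: Sequence[int]) -> List[int]:
    """Tokens per bin, sorted descending (hypothetical helper).

    bin_of[i] is assumed to be the bin assigned to sequence i.
    """
    if len(lengths) != len(bin_of):
        raise ValueError("every sequence needs exactly one bin assignment")
    load = Counter()
    for length, b in zip(lengths, bin_of):
        load[b] += length
    return sorted(load.values(), reverse=True)


def check_packing(lengths: Sequence[int], bin_of: Sequence[int], capacity: int) -> int:
    """Raise if any bin overflows; otherwise return the number of packed samples (bins)."""
    loads = bin_loads(lengths, bin_of)
    if loads and loads[0] > capacity:
        raise ValueError(f"a bin holds {loads[0]} tokens, above capacity {capacity}")
    return len(loads)
```

If two implementations return the same count from `check_packing`, they match in number of packed samples. Differing `bin_loads` then reveal whether one of them fits an extra trailing sequence into some bins.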
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.