Implement LRU eviction policy for LoRA adapters #11041
Conversation
Summary of Changes

Hello @ConnorLi96, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances memory management for LoRA adapters within SGLang by introducing a configurable Least Recently Used (LRU) eviction policy. The new policy aims to optimize cache efficiency by prioritizing frequently accessed adapters, keeping them in memory longer than less used ones. The changes involve a new modular framework for eviction policies, integration into the LoRA memory pool, and a command-line option for users to select their preferred policy, all while ensuring backward compatibility with the existing FIFO behavior.
Code Review

This pull request introduces a configurable LRU eviction policy for LoRA adapters, which is a great enhancement for managing memory more intelligently. The implementation is well structured, introducing a new eviction-policy framework and integrating it cleanly into the existing LoRAMemoryPool and LoRAManager. The changes maintain backward compatibility by defaulting to the existing FIFO policy. My review includes a minor suggestion to improve code conciseness in the eviction logic.
python/sglang/srt/lora/mem_pool.py
```python
candidates = set()
pinned_uids = set()

for buffer_id in range(self.max_loras_per_batch):
    uid = self.buffer_id_to_uid[buffer_id]
    if uid not in cur_uids and uid is not None:
        candidates.add(uid)
        lora_ref = lora_refs.get(uid)
        if lora_ref is not None and lora_ref.pinned:
            pinned_uids.add(uid)
```
The logic for collecting eviction candidates can be made more concise and readable. Using a comprehension to build a list of candidate info first, then creating the candidates and pinned_uids sets from it, can make the code more declarative and easier to follow.
```python
all_candidates = [
    (uid, lora_refs.get(uid))
    for uid in self.buffer_id_to_uid
    if uid not in cur_uids and uid is not None
]
candidates = {uid for uid, _ in all_candidates}
pinned_uids = {uid for uid, ref in all_candidates if ref and ref.pinned}
```
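The suggested refactor can be exercised standalone. The stand-in data below is illustrative only: adapter uids are invented, and a SimpleNamespace approximates the pool's LoRARef objects (only the pinned attribute from the diff is modeled).

```python
from types import SimpleNamespace

# Stand-in pool state mimicking the diff above; a None slot represents
# an empty buffer, and "a" is in the current batch so it is skipped.
buffer_id_to_uid = ["a", "b", None, "c"]
cur_uids = {"a"}
lora_refs = {
    "b": SimpleNamespace(pinned=True),
    "c": SimpleNamespace(pinned=False),
}

all_candidates = [
    (uid, lora_refs.get(uid))
    for uid in buffer_id_to_uid
    if uid not in cur_uids and uid is not None
]
candidates = {uid for uid, _ in all_candidates}
pinned_uids = {uid for uid, ref in all_candidates if ref and ref.pinned}

print(candidates)   # {"b", "c"}
print(pinned_uids)  # {"b"}
```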
LGTM. Great work!
Nice, thank you so much for the guidance all the way! Can we add the run-ci label for this PR? Or we can just merge it directly.
Motivation
This PR addresses the feature request [Feature] (2/2) Support LRU cache for LoRA eviction.
This PR implements a configurable LRU (Least Recently Used) eviction policy for LoRA adapters to provide more intelligent memory management. Currently, SGLang only supports FIFO eviction, which may not be optimal for workloads where certain LoRA adapters are accessed more frequently than others. The LRU policy ensures that frequently used adapters remain in memory while less recently used ones are evicted first, potentially improving cache hit rates and overall performance.
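The difference between the two policies can be seen on a short access trace. The sketch below is not SGLang code; the tracker class and adapter names are invented purely to illustrate how recency of use, rather than load order, drives LRU victim selection.

```python
from collections import OrderedDict

class LRUTracker:
    """Toy recency tracker: OrderedDict keeps uids from oldest to newest use."""

    def __init__(self):
        self._order = OrderedDict()

    def mark_used(self, uid):
        # Insert (or re-insert) and move to the end: most recently used.
        self._order[uid] = None
        self._order.move_to_end(uid)

    def select_victim(self, candidates):
        # Walk from least to most recently used; evict the first candidate.
        for uid in self._order:
            if uid in candidates:
                return uid
        return None

tracker = LRUTracker()
for uid in ["A", "B", "C", "A"]:   # "A" is reused, so "B" is now least recent
    tracker.mark_used(uid)

print(tracker.select_victim({"A", "B", "C"}))  # -> B
# A plain FIFO policy would instead evict "A", the first adapter loaded,
# even though it was just used.
```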
Modifications
- Added an eviction_policy.py module with an abstract EvictionPolicy class
- Implemented LRUEvictionPolicy using OrderedDict for O(1) access tracking
- Implemented FIFOEvictionPolicy for backward compatibility
- Added a --lora-eviction-policy argument to ServerArgs with choices ["fifo", "lru"]
- Updated LoRAMemoryPool to use configurable eviction policies
- Updated LoRAManager to pass the eviction policy to the memory pool
- Updated SRTRunner to accept an eviction policy parameter

All changes maintain full backward compatibility with default FIFO behavior.
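The framework described above might be sketched roughly as follows. The class names follow the PR description, but the method signatures and victim-selection interface are assumptions for illustration, not SGLang's actual API.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict

class EvictionPolicy(ABC):
    """Illustrative base class: track adapter use, pick an eviction victim."""

    @abstractmethod
    def mark_used(self, uid): ...

    @abstractmethod
    def select_victim(self, candidates, pinned): ...

class FIFOEvictionPolicy(EvictionPolicy):
    def __init__(self):
        self._arrival = []              # load order only; reuse is ignored

    def mark_used(self, uid):
        if uid not in self._arrival:
            self._arrival.append(uid)

    def select_victim(self, candidates, pinned):
        for uid in self._arrival:       # oldest load first
            if uid in candidates and uid not in pinned:
                return uid
        raise RuntimeError("no evictable adapter")

class LRUEvictionPolicy(EvictionPolicy):
    def __init__(self):
        self._recency = OrderedDict()   # oldest -> newest access

    def mark_used(self, uid):
        self._recency[uid] = None
        self._recency.move_to_end(uid)  # O(1) recency update

    def select_victim(self, candidates, pinned):
        for uid in self._recency:       # least recently used first
            if uid in candidates and uid not in pinned:
                return uid
        raise RuntimeError("no evictable adapter")

fifo, lru = FIFOEvictionPolicy(), LRUEvictionPolicy()
for uid in ["A", "B", "A"]:             # "A" is reused after "B" arrives
    fifo.mark_used(uid)
    lru.mark_used(uid)

print(fifo.select_victim({"A", "B"}, set()))  # -> A (first loaded)
print(lru.select_victim({"A", "B"}, set()))   # -> B (least recently used)
```

Keeping both policies behind one abstract interface is what lets a flag like --lora-eviction-policy swap them without touching the memory pool's eviction call sites.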
Accuracy Tests
This PR does not affect model outputs or inference accuracy.
Benchmarking and Profiling
The LRU eviction policy is designed to improve cache efficiency for workloads with non-uniform adapter access patterns, and its performance impact is expected to be minimal.
Detailed benchmarking will be conducted with realistic workloads in future testing.