This enhancement proposes integrating a Query-Aware Cache Selection mechanism into Volcengine's LLM serving stack. The idea is to dynamically adjust cache usage based on each query's characteristics instead of applying uniform cache reuse. By profiling the query and selecting an appropriate cache strategy (aggressive reuse, partial reuse, or recomputation), the system can significantly reduce serving latency, especially long-tail delays, while maintaining generation quality. This approach has been validated across large-scale workloads and diverse LLM architectures, as presented in our ACM Multimedia 2025 Oral paper *TinyServe: Query-Aware Cache Selection for Efficient LLM Serving*.
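To make the proposal concrete, the per-query dispatch could look roughly like the sketch below. The profiling signals (`prefix_overlap`, `query_entropy`), their threshold values, and all function names are illustrative assumptions for this issue, not the actual heuristics from the TinyServe paper or Volcengine's serving code:

```python
from enum import Enum

class CacheStrategy(Enum):
    AGGRESSIVE_REUSE = "aggressive_reuse"  # reuse the full cached KV prefix
    PARTIAL_REUSE = "partial_reuse"        # reuse only a subset of cache entries
    RECOMPUTE = "recompute"                # recompute attention states from scratch

def select_cache_strategy(prefix_overlap: float, query_entropy: float,
                          overlap_hi: float = 0.8, overlap_lo: float = 0.3,
                          entropy_hi: float = 0.7) -> CacheStrategy:
    """Pick a per-query cache policy from two cheap profiling signals
    (both hypothetical, normalized to [0, 1]):
      - prefix_overlap: fraction of the prompt matching a cached prefix
      - query_entropy: uncertainty of the query's attention pattern;
        high entropy suggests stale cache entries may hurt quality
    """
    if prefix_overlap >= overlap_hi and query_entropy < entropy_hi:
        # Long matching prefix and a predictable query: reuse everything.
        return CacheStrategy.AGGRESSIVE_REUSE
    if prefix_overlap >= overlap_lo:
        # Moderate overlap: reuse the matching portion, recompute the rest.
        return CacheStrategy.PARTIAL_REUSE
    # Little overlap (or too risky): fall back to full recomputation.
    return CacheStrategy.RECOMPUTE
```

The point of structuring it this way is that the strategy decision stays a cheap, stateless function on the request path, so it can run per query without adding measurable scheduling overhead.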