
Conversation


@NoakLiu commented Aug 16, 2025

This enhancement proposes integrating a Query-Aware Cache Selection mechanism into Volcengine's LLM serving stack. Instead of applying uniform cache reuse to every request, the system profiles each incoming query and dynamically selects an appropriate cache strategy (aggressive reuse, partial reuse, or full recomputation) based on the query's characteristics. This significantly reduces serving latency, especially long-tail delays, while maintaining generation quality. The approach has been validated across large-scale workloads and diverse LLM architectures, as presented in our ACM Multimedia 2025 Oral paper, TinyServe: Query-Aware Cache Selection for Efficient LLM Serving.
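
To make the selection step concrete, below is a minimal, hypothetical sketch of the kind of policy such a mechanism could apply at admission time. All names here (`QueryProfile`, `select_cache_strategy`), the prefix-overlap heuristic, and the thresholds are illustrative assumptions for this description, not the actual TinyServe implementation or any existing API in the serving stack.

```python
# Hypothetical sketch of query-aware cache selection.
# Names, heuristic, and thresholds are illustrative only.
from dataclasses import dataclass
from enum import Enum, auto


class CacheStrategy(Enum):
    AGGRESSIVE_REUSE = auto()   # reuse the full cached KV prefix
    PARTIAL_REUSE = auto()      # reuse only the matching portion of the prefix
    RECOMPUTE = auto()          # ignore the cache and recompute from scratch


@dataclass
class QueryProfile:
    prompt_tokens: list[int]         # tokenized prompt of the incoming query
    cached_prefix_tokens: list[int]  # tokens whose KV entries are already cached


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_cache_strategy(profile: QueryProfile,
                          reuse_threshold: float = 0.8,
                          partial_threshold: float = 0.3) -> tuple[CacheStrategy, int]:
    """Pick a cache strategy from the fraction of the prompt covered by cached KV.

    Returns the chosen strategy and the number of prefix tokens to reuse.
    The thresholds are placeholders; a real policy would be tuned per model
    and workload.
    """
    overlap = shared_prefix_len(profile.prompt_tokens, profile.cached_prefix_tokens)
    coverage = overlap / max(len(profile.prompt_tokens), 1)

    if coverage >= reuse_threshold:
        return CacheStrategy.AGGRESSIVE_REUSE, overlap
    if coverage >= partial_threshold:
        return CacheStrategy.PARTIAL_REUSE, overlap
    return CacheStrategy.RECOMPUTE, 0


if __name__ == "__main__":
    profile = QueryProfile(
        prompt_tokens=[1, 5, 9, 12, 7, 3],
        cached_prefix_tokens=[1, 5, 9, 12, 8, 2, 4],
    )
    strategy, reuse_len = select_cache_strategy(profile)
    print(strategy, reuse_len)  # PARTIAL_REUSE with 4 reusable prefix tokens
```

In an actual integration, the per-query profile would likely incorporate additional signals (for example query length or attention-sparsity estimates), and the returned reuse length would feed into the engine's KV-cache manager.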


CLAassistant commented Aug 16, 2025

CLA assistant check
All committers have signed the CLA.

