
Conversation


@NoakLiu commented Aug 16, 2025

This enhancement proposes integrating a Query-Aware Cache Selection mechanism into Volcengine's LLM serving stack. Instead of applying uniform cache reuse to every request, the system profiles each incoming query and dynamically selects an appropriate cache strategy (aggressive reuse, partial reuse, or full recomputation) based on the query's characteristics. This significantly reduces serving latency, especially long-tail delays, while maintaining generation quality. The approach has been validated across large-scale workloads and diverse LLM architectures, as presented in our ACM Multimedia 2025 Oral paper, TinyServe: Query-Aware Cache Selection for Efficient LLM Serving.
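
To make the selection step concrete, below is a minimal, hypothetical sketch of the kind of policy such a mechanism could apply at admission time. All names here (`QueryProfile`, `select_cache_strategy`), the prefix-overlap heuristic, and the thresholds are illustrative assumptions for this description, not the actual TinyServe implementation or any existing API in the serving stack.

```python
# Hypothetical sketch of query-aware cache selection.
# Names, heuristic, and thresholds are illustrative only.
from dataclasses import dataclass
from enum import Enum, auto


class CacheStrategy(Enum):
    AGGRESSIVE_REUSE = auto()   # reuse the full cached KV prefix
    PARTIAL_REUSE = auto()      # reuse only the matching portion of the prefix
    RECOMPUTE = auto()          # ignore the cache and recompute from scratch


@dataclass
class QueryProfile:
    prompt_tokens: list[int]         # tokenized prompt of the incoming query
    cached_prefix_tokens: list[int]  # tokens whose KV entries are already cached


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_cache_strategy(profile: QueryProfile,
                          reuse_threshold: float = 0.8,
                          partial_threshold: float = 0.3) -> tuple[CacheStrategy, int]:
    """Pick a cache strategy from the fraction of the prompt covered by cached KV.

    Returns the chosen strategy and the number of prefix tokens to reuse.
    The thresholds are placeholders; a real policy would be tuned per model
    and workload.
    """
    overlap = shared_prefix_len(profile.prompt_tokens, profile.cached_prefix_tokens)
    coverage = overlap / max(len(profile.prompt_tokens), 1)

    if coverage >= reuse_threshold:
        return CacheStrategy.AGGRESSIVE_REUSE, overlap
    if coverage >= partial_threshold:
        return CacheStrategy.PARTIAL_REUSE, overlap
    return CacheStrategy.RECOMPUTE, 0


if __name__ == "__main__":
    profile = QueryProfile(
        prompt_tokens=[1, 5, 9, 12, 7, 3],
        cached_prefix_tokens=[1, 5, 9, 12, 8, 2, 4],
    )
    strategy, reuse_len = select_cache_strategy(profile)
    print(strategy, reuse_len)  # PARTIAL_REUSE with 4 reusable prefix tokens
```

In an actual integration, the per-query profile would likely incorporate additional signals (for example query length or attention-sparsity estimates), and the returned reuse length would feed into the engine's KV-cache manager.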


CLAassistant commented Aug 16, 2025

CLA assistant check
All committers have signed the CLA.

