[Feature]: Add Character-Level Prefix Routing for Cache-Aware Load Balancers (SGLang)

### The Feature

Add support for prefix-based prompt caching in LiteLLM's router.


### Motivation, pitch

I am currently using LiteLLM as an API gateway/management layer in front of SGLang Router to help with rate limiting (TPM) and routing to the appropriate endpoints (Router instances). SGLang Router itself supports a cache-aware load balancing strategy.

If my current understanding is correct, LiteLLM's current prompt caching implementation uses message-level exact hash matching. This is a bit different than SGLang which uses a[ character level prefix matching](https://github.com/sgl-project/sglang/blob/main/sgl-router/src/router.rs) strategy for cache routing.

I'm wondering if there's any support, or plans to support, a prefix-aware cache routing strategy to help maintain global cache affinity in this scenario. Without this mechanism, requests that would normally result in a cache hit (e.g., from one router to an SGLang worker) might instead be routed to a different router instance, therefore missing the cache.

To address this, is it possible to add support for prefix-aware prompt caching/routing in LiteLLM's router? Instead of only matching exact hashes, the router could:

Use the longest shared prefix of the prompt (character based) to select a worker.
Optionally, expose configuration for prefix match thresholds (as in SGLang).
This would allow better cache affinity and performance when using LiteLLM with SGLang or any prefix-caching backend.

Currently, LiteLLM has a fallback mechanism that, on a cache miss, [removes up to the last 3 messages and tries again](https://github.com/BerriAI/litellm/blob/3a13d5419a35a2dbf97676e74f51b45a8d120d11/litellm/router_utils/prompt_caching_cache.py). I think this may not handle cases where there are structural changes in the conversation (e.g., inserted or deleted in the middle, or shared long system prompts across different sessions).

Having some prefix-aware routing logic however, can help in this scenario.

### LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?

No

### Twitter / LinkedIn details

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Feature]: Add Character-Level Prefix Routing for Cache-Aware Load Balancers (SGLang) #16144

The Feature

Motivation, pitch

LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?

Twitter / LinkedIn details

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Feature]: Add Character-Level Prefix Routing for Cache-Aware Load Balancers (SGLang) #16144

Description

The Feature

Motivation, pitch

LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?

Twitter / LinkedIn details

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions