
Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Nov 27, 2025

Overview:

Fixes LMCache metrics visibility when PROMETHEUS_MULTIPROC_DIR is explicitly set by users (K8s deployments). Previously, lmcache:* metrics were not exposed due to Prometheus registry conflicts when the environment variable was set before process initialization.

Details:

  • Implemented dual-registry approach in vLLM main.py to handle both deployment scenarios
  • Added setup_metrics_collection() helper function that tries MultiProcessCollector(REGISTRY) first and falls back to a separate registry on conflict (a minimal sketch follows this list)
  • Moved prometheus_client imports to top of file for proper multiprocess mode initialization
  • Added aggregated_lmcache_multiproc test scenario with explicit PROMETHEUS_MULTIPROC_DIR
  • Updated agg_lmcache.sh to explicitly unset the variable for local dev testing
  • Both test scenarios produce identical metrics (492 lines: 338 vllm, 101 lmcache) with no duplicates
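
For orientation, the shape of that dual-registry logic looks roughly like the sketch below. It is a minimal, illustrative version built only on the public prometheus_client API; the helper name collect_registries and its return convention are placeholders, not the exact signature in main.py, which also registers the vLLM/LMCache engine metrics callbacks.

```python
# Minimal sketch of the dual-registry fallback (illustrative only; the real
# setup_metrics_collection() in components/src/dynamo/vllm/main.py does more,
# e.g. registering engine metrics callbacks for vLLM and LMCache).
import os

from prometheus_client import REGISTRY, CollectorRegistry, multiprocess


def collect_registries() -> list[CollectorRegistry]:
    """Return the registries a /metrics endpoint should scrape."""
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        # Local/dev path: no multiprocess directory, so the default in-memory
        # registry already holds everything.
        return [REGISTRY]
    try:
        # Preferred path: attach the multiprocess collector, which reads the
        # per-process .db files under PROMETHEUS_MULTIPROC_DIR, to the default
        # registry so all metrics are exposed from one place.
        multiprocess.MultiProcessCollector(REGISTRY)
        return [REGISTRY]
    except ValueError:
        # Conflict ("Duplicated timeseries in CollectorRegistry"): keep the
        # default registry untouched and collect the .db-file metrics through
        # a separate registry, exposing both so no metrics are lost.
        multiproc_registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(multiproc_registry)
        return [REGISTRY, multiproc_registry]
```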

Where should the reviewer start?

components/src/dynamo/vllm/main.py - Review the setup_metrics_collection() function (lines 110-190) which implements the dual-registry logic with detailed comments explaining the approach.

Related Issues:

Relates to DIS-1071

/coderabbit profile chill

Summary by CodeRabbit

Release Notes

  • New Features

    • Added multi-process Prometheus monitoring setup for enhanced metrics collection in distributed scenarios.
  • Refactor

    • Centralized metrics initialization for improved reliability and consistency across different deployment modes.
    • Enhanced multiprocess metrics handling with better conflict resolution.
  • Tests

    • Added test configuration for multi-process Prometheus monitoring verification.


@keivenchang keivenchang self-assigned this Nov 27, 2025
@github-actions github-actions bot added the fix label Nov 27, 2025
@keivenchang keivenchang force-pushed the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch from f67133f to 09b6c68 on November 28, 2025 23:20
@keivenchang keivenchang force-pushed the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch from 09b6c68 to eedd70a on December 2, 2025 02:09
@keivenchang keivenchang marked this pull request as ready for review December 2, 2025 17:10
@keivenchang keivenchang requested review from a team as code owners December 2, 2025 17:10
@coderabbitai
Contributor

coderabbitai bot commented Dec 2, 2025

Walkthrough

The changes introduce centralized Prometheus metrics collection for vLLM and LMCache with multiprocess support. A new setup_metrics_collection function consolidates scattered initialization logic. Bash scripts enable multi-process Prometheus testing, and test infrastructure validates the configuration with environment-based multiprocess directory setup.

Changes

  • Metrics infrastructure refactoring (components/src/dynamo/vllm/main.py): Adds setup_metrics_collection() to centralize Prometheus metrics initialization; replaces ad-hoc inline MultiProcessCollector usage with unified logic; imports prometheus_client modules (REGISTRY, CollectorRegistry, multiprocess) and register_engine_metrics_callback; implements conditional multiprocess handling with fallback to an in-memory registry on collision.
  • Bash launch scripts (examples/backends/vllm/launch/agg_lmcache.sh, examples/backends/vllm/launch/agg_lmcache_multiproc.sh): Updates agg_lmcache.sh to explicitly unset PROMETHEUS_MULTIPROC_DIR; introduces agg_lmcache_multiproc.sh to launch a multi-process Prometheus setup with an isolated temp directory, trap-based cleanup, and environment variable wiring for the frontend and worker processes.
  • Test configuration (tests/serve/test_vllm.py): Adds an aggregated_lmcache_multiproc test config with script reference, GPU-1 marks, a randomized PROMETHEUS_MULTIPROC_DIR path, and expanded request payloads (chat, completion, vllm metric, lmcache metric); adds a random import.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Main focus areas:
    • Logic in setup_metrics_collection() for multiprocess collision handling and conditional registry initialization: verify the error handling path and ensure the REGISTRY fallback behavior is correct (a standalone repro of the conflict is sketched after this list)
    • Interaction between inline metrics removal and centralized initialization across init_prefill, init, and init_decode code paths—confirm no missed callsites
    • Shell script trap/cleanup logic in agg_lmcache_multiproc.sh—ensure proper resource cleanup on exit
    • Test randomization of PROMETHEUS_MULTIPROC_DIR—verify isolation and cleanup in test teardown
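
To make the first focus area concrete, the conflict that triggers the fallback can be reproduced in isolation. The snippet below is hypothetical and not part of the PR: FakeDbCollector simply stands in for a MultiProcessCollector re-reporting metric names that the default registry already owns.

```python
# Hypothetical, self-contained repro of the registration conflict that drives
# the fallback path; FakeDbCollector imitates a collector that re-reports
# metric names already registered with the default REGISTRY.
from prometheus_client import REGISTRY, Counter
from prometheus_client.core import CounterMetricFamily


class FakeDbCollector:
    def collect(self):
        # Re-report a name that the in-memory Counter below already owns.
        yield CounterMetricFamily("demo_requests_total", "duplicate of an in-memory metric")


Counter("demo_requests", "in-memory metric already registered with REGISTRY")

try:
    REGISTRY.register(FakeDbCollector())
except ValueError as exc:
    # prometheus_client raises "Duplicated timeseries in CollectorRegistry: ...",
    # which is the condition the separate-registry fallback is meant to absorb.
    print(f"conflict detected, falling back to a separate registry: {exc}")
```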

Poem

🐰 Metrics dance in process threads,
Registries collect their spreads,
Old chaos swept, new order placed,
Prometheus metrics now embraced! 📊

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 71.43%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly and specifically describes the main change: enabling LMCache metrics visibility when PROMETHEUS_MULTIPROC_DIR is set, which aligns with the core fix implemented across the changeset.
  • Description check ✅ Passed: The description follows the template structure with comprehensive Overview, Details, and "Where should the reviewer start?" sections. It clearly explains the problem, solution approach, and files modified, though the Related Issues section uses "Relates to" rather than the template's action keywords.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/main.py (1)

697-768: Add missing setup_metrics_collection call for multimodal workers.

The init_multimodal_worker function omits the setup_metrics_collection call that is present in both init_prefill (line 416) and init (line 529). This means multimodal workers lack critical metrics setup:

  • MultiProcessCollector registration for prometheus multiprocess mode
  • Engine metrics callbacks for vLLM and LMCache metrics
  • Dual-registry handling for metric conflicts

Add setup_metrics_collection(config, generate_endpoint, logger) after the KV publisher setup and before the asyncio.gather call, following the same pattern as other worker initialization functions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9fdc2e and 10b5231.

📒 Files selected for processing (4)
  • components/src/dynamo/vllm/main.py (5 hunks)
  • examples/backends/vllm/launch/agg_lmcache.sh (1 hunks)
  • examples/backends/vllm/launch/agg_lmcache_multiproc.sh (1 hunks)
  • tests/serve/test_vllm.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-11-14T01:09:35.244Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.

Applied to files:

  • examples/backends/vllm/launch/agg_lmcache.sh
  • examples/backends/vllm/launch/agg_lmcache_multiproc.sh
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • components/src/dynamo/vllm/main.py
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (1)
components/src/dynamo/common/utils/prometheus.py (1)
  • register_engine_metrics_callback (30-78)
🪛 Ruff (0.14.7)
tests/serve/test_vllm.py

75-75: Probable insecure usage of temporary file or directory: "/tmp/prometheus_multiproc_test_" (S108)

75-75: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
examples/backends/vllm/launch/agg_lmcache.sh (1)

7-8: LGTM! Clean separation of test scenarios.

The explicit unset ensures this script tests the non-multiprocess metrics path, complementing the new agg_lmcache_multiproc.sh that tests the multiprocess path.

examples/backends/vllm/launch/agg_lmcache_multiproc.sh (2)

6-10: LGTM! Unique directory creation prevents test conflicts.

The combination of $$ (PID) and $RANDOM ensures uniqueness across concurrent test runs. The rm -rf before mkdir handles any stale directories from previous failed runs.


12-18: LGTM! Cleanup approach is appropriate for test scripts.

The trap ensures directory removal and process termination on exit. Based on learnings, temp directory cleanup is considered low priority for production containerized deployments, but having it in test scripts aids local development.
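
For reference while reading those two comments, the pattern under review condenses to something like the bash sketch below; the actual agg_lmcache_multiproc.sh additionally launches the frontend and worker processes that inherit the variable.

```bash
#!/usr/bin/env bash
# Condensed sketch of the unique-directory + trap-cleanup pattern discussed above;
# the real agg_lmcache_multiproc.sh also starts the frontend and worker processes.
set -euo pipefail

# Unique per-run directory: PID ($$) plus $RANDOM avoids collisions between
# concurrent runs; rm -rf clears any stale directory left by a failed run.
PROMETHEUS_MULTIPROC_DIR="/tmp/prometheus_multiproc_$$_${RANDOM}"
rm -rf "$PROMETHEUS_MULTIPROC_DIR"
mkdir -p "$PROMETHEUS_MULTIPROC_DIR"
export PROMETHEUS_MULTIPROC_DIR

cleanup() {
    # Remove the metrics directory and stop any background jobs on exit.
    rm -rf "$PROMETHEUS_MULTIPROC_DIR"
    kill $(jobs -p) 2>/dev/null || true
}
trap cleanup EXIT

# ... launch frontend and worker processes here; they inherit PROMETHEUS_MULTIPROC_DIR ...
```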

components/src/dynamo/vllm/main.py (3)

12-12: LGTM! Correct placement for multiprocess mode.

Moving prometheus_client imports to the top ensures proper multiprocess mode initialization before any metrics are created.


110-124: LGTM! Clear explanation of the dual-registry approach.

The docstring effectively explains the multiprocess metrics challenge and the fallback strategy for handling registry conflicts in Kubernetes deployments.


416-416: LGTM! Correct placement of metrics setup.

Both init_prefill and init properly call setup_metrics_collection after vLLM engine initialization and before serving endpoints.

Also applies to: 529-529

tests/serve/test_vllm.py (1)

68-83: LGTM! Test config appropriately covers the multiprocess metrics path.

The new test configuration enables verification that both metrics paths (with and without PROMETHEUS_MULTIPROC_DIR) produce identical outputs. The use of /tmp and random.randint provides adequate isolation for concurrent test runs.

Note: Static analysis warnings (S108, S311) are false positives—cryptographic-grade randomness and secure temp directories are unnecessary for test isolation.
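
Purely as an illustration of the isolation approach being endorsed here, the per-test environment override looks roughly like the following; the key name matches PROMETHEUS_MULTIPROC_DIR, but the path prefix and random bounds are placeholders rather than values copied from tests/serve/test_vllm.py.

```python
# Hypothetical illustration of per-test isolation: each test config gets its own
# randomized multiprocess directory so concurrent runs do not collide.
import random

env_overrides = {
    "PROMETHEUS_MULTIPROC_DIR": f"/tmp/prometheus_multiproc_test_{random.randint(0, 1_000_000)}",
}
print(env_overrides)
```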

When PROMETHEUS_MULTIPROC_DIR is set before the process starts, lmcache:*
metrics were not visible due to a 'Duplicated timeseries' error when trying
to collect metrics from the .db files.

Implemented dual-registry approach:
- Try adding MultiProcessCollector(REGISTRY) first
- On conflict, create separate CollectorRegistry for .db file metrics
- Register both registries to ensure all metrics collected

Added test scenarios:
- aggregated_lmcache: Without PROMETHEUS_MULTIPROC_DIR set
- aggregated_lmcache_multiproc: With PROMETHEUS_MULTIPROC_DIR set

Both scenarios produce identical metrics (492 lines: 338 vllm, 101 lmcache)
with no duplicates.

Signed-off-by: Keiven Chang <[email protected]>

tidy up comments

Clarify PROMETHEUS_MULTIPROC_DIR is optional in documentation

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang force-pushed the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch from dafc575 to ecceb45 on December 2, 2025 20:50
@keivenchang keivenchang merged commit c655585 into main Dec 3, 2025
46 of 52 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch December 3, 2025 02:47
keivenchang added a commit that referenced this pull request Dec 3, 2025