
Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Nov 27, 2025

Overview:

Fixes LMCache metrics visibility when PROMETHEUS_MULTIPROC_DIR is explicitly set by users (K8s deployments). Previously, lmcache:* metrics were not exposed due to Prometheus registry conflicts when the environment variable was set before process initialization.

Details:

  • Implemented dual-registry approach in vLLM main.py to handle both deployment scenarios
  • Added setup_metrics_collection() helper function that tries MultiProcessCollector(REGISTRY) first and falls back to a separate registry on conflict (a minimal sketch follows this list)
  • Moved prometheus_client imports to top of file for proper multiprocess mode initialization
  • Added aggregated_lmcache_multiproc test scenario with explicit PROMETHEUS_MULTIPROC_DIR
  • Updated agg_lmcache.sh to explicitly unset the variable for local dev testing
  • Both test scenarios produce identical metrics (492 lines: 338 vllm, 101 lmcache) with no duplicates
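
For orientation, the shape of that dual-registry logic looks roughly like the sketch below. It is a minimal, illustrative version built only on the public prometheus_client API; the helper name collect_registries and its return convention are placeholders, not the exact signature in main.py, which also registers the vLLM/LMCache engine metrics callbacks.

```python
# Minimal sketch of the dual-registry fallback (illustrative only; the real
# setup_metrics_collection() in components/src/dynamo/vllm/main.py does more,
# e.g. registering engine metrics callbacks for vLLM and LMCache).
import os

from prometheus_client import REGISTRY, CollectorRegistry, multiprocess


def collect_registries() -> list[CollectorRegistry]:
    """Return the registries a /metrics endpoint should scrape."""
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        # Local/dev path: no multiprocess directory, so the default in-memory
        # registry already holds everything.
        return [REGISTRY]
    try:
        # Preferred path: attach the multiprocess collector, which reads the
        # per-process .db files under PROMETHEUS_MULTIPROC_DIR, to the default
        # registry so all metrics are exposed from one place.
        multiprocess.MultiProcessCollector(REGISTRY)
        return [REGISTRY]
    except ValueError:
        # Conflict ("Duplicated timeseries in CollectorRegistry"): keep the
        # default registry untouched and collect the .db-file metrics through
        # a separate registry, exposing both so no metrics are lost.
        multiproc_registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(multiproc_registry)
        return [REGISTRY, multiproc_registry]
```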

Where should the reviewer start?

components/src/dynamo/vllm/main.py - Review the setup_metrics_collection() function (lines 110-190) which implements the dual-registry logic with detailed comments explaining the approach.

Related Issues:

Relates to DIS-1071

/coderabbit profile chill

Summary by CodeRabbit

Release Notes

  • New Features

    • Added multi-process Prometheus monitoring setup for enhanced metrics collection in distributed scenarios.
  • Refactor

    • Centralized metrics initialization for improved reliability and consistency across different deployment modes.
    • Enhanced multiprocess metrics handling with better conflict resolution.
  • Tests

    • Added test configuration for multi-process Prometheus monitoring verification.


@keivenchang keivenchang self-assigned this Nov 27, 2025
@github-actions github-actions bot added the fix label Nov 27, 2025
@keivenchang keivenchang force-pushed the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch from f67133f to 09b6c68 on November 28, 2025 23:20
@keivenchang keivenchang force-pushed the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch from 09b6c68 to eedd70a on December 2, 2025 02:09
@keivenchang keivenchang marked this pull request as ready for review December 2, 2025 17:10
@keivenchang keivenchang requested review from a team as code owners December 2, 2025 17:10
@coderabbitai
Contributor

coderabbitai bot commented Dec 2, 2025

Walkthrough

The changes introduce centralized Prometheus metrics collection for vLLM and LMCache with multiprocess support. A new setup_metrics_collection function consolidates scattered initialization logic. Bash scripts enable multi-process Prometheus testing, and test infrastructure validates the configuration with environment-based multiprocess directory setup.

Changes

  • Metrics infrastructure refactoring (components/src/dynamo/vllm/main.py): Adds setup_metrics_collection() to centralize Prometheus metrics initialization; replaces ad-hoc inline MultiProcessCollector usage with unified logic; imports prometheus_client modules (REGISTRY, CollectorRegistry, multiprocess) and register_engine_metrics_callback; implements conditional multiprocess handling with fallback to an in-memory registry on collision.
  • Bash launch scripts (examples/backends/vllm/launch/agg_lmcache.sh, examples/backends/vllm/launch/agg_lmcache_multiproc.sh): Updates agg_lmcache.sh to explicitly unset PROMETHEUS_MULTIPROC_DIR; introduces agg_lmcache_multiproc.sh to launch a multi-process Prometheus setup with an isolated temp directory, trap-based cleanup, and environment variable wiring for the frontend and worker processes.
  • Test configuration (tests/serve/test_vllm.py): Adds an aggregated_lmcache_multiproc test config with script reference, GPU-1 marks, a randomized PROMETHEUS_MULTIPROC_DIR path, and expanded request payloads (chat, completion, vllm metric, lmcache metric); adds a random import.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Main focus areas:
    • Logic in setup_metrics_collection() for multiprocess collision handling and conditional registry initialization: verify the error handling path and ensure the REGISTRY fallback behavior is correct (a standalone repro of the conflict is sketched after this list)
    • Interaction between inline metrics removal and centralized initialization across init_prefill, init, and init_decode code paths—confirm no missed callsites
    • Shell script trap/cleanup logic in agg_lmcache_multiproc.sh—ensure proper resource cleanup on exit
    • Test randomization of PROMETHEUS_MULTIPROC_DIR—verify isolation and cleanup in test teardown
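
To make the first focus area concrete, the conflict that triggers the fallback can be reproduced in isolation. The snippet below is hypothetical and not part of the PR: FakeDbCollector simply stands in for a MultiProcessCollector re-reporting metric names that the default registry already owns.

```python
# Hypothetical, self-contained repro of the registration conflict that drives
# the fallback path; FakeDbCollector imitates a collector that re-reports
# metric names already registered with the default REGISTRY.
from prometheus_client import REGISTRY, Counter
from prometheus_client.core import CounterMetricFamily


class FakeDbCollector:
    def collect(self):
        # Re-report a name that the in-memory Counter below already owns.
        yield CounterMetricFamily("demo_requests_total", "duplicate of an in-memory metric")


Counter("demo_requests", "in-memory metric already registered with REGISTRY")

try:
    REGISTRY.register(FakeDbCollector())
except ValueError as exc:
    # prometheus_client raises "Duplicated timeseries in CollectorRegistry: ...",
    # which is the condition the separate-registry fallback is meant to absorb.
    print(f"conflict detected, falling back to a separate registry: {exc}")
```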

Poem

🐰 Metrics dance in process threads,
Registries collect their spreads,
Old chaos swept, new order placed,
Prometheus metrics now embraced! 📊

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 71.43%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly and specifically describes the main change: enabling LMCache metrics visibility when PROMETHEUS_MULTIPROC_DIR is set, which aligns with the core fix implemented across the changeset.
  • Description check ✅ Passed: The description follows the template structure with comprehensive Overview, Details, and "Where should the reviewer start?" sections. It clearly explains the problem, solution approach, and files modified, though the Related Issues section uses "Relates to" rather than the template's action keywords.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/main.py (1)

697-768: Add missing setup_metrics_collection call for multimodal workers.

The init_multimodal_worker function omits the setup_metrics_collection call that is present in both init_prefill (line 416) and init (line 529). This means multimodal workers lack critical metrics setup:

  • MultiProcessCollector registration for prometheus multiprocess mode
  • Engine metrics callbacks for vLLM and LMCache metrics
  • Dual-registry handling for metric conflicts

Add setup_metrics_collection(config, generate_endpoint, logger) after the KV publisher setup and before the asyncio.gather call, following the same pattern as other worker initialization functions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9fdc2e and 10b5231.

📒 Files selected for processing (4)
  • components/src/dynamo/vllm/main.py (5 hunks)
  • examples/backends/vllm/launch/agg_lmcache.sh (1 hunks)
  • examples/backends/vllm/launch/agg_lmcache_multiproc.sh (1 hunks)
  • tests/serve/test_vllm.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-11-14T01:09:35.244Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.

Applied to files:

  • examples/backends/vllm/launch/agg_lmcache.sh
  • examples/backends/vllm/launch/agg_lmcache_multiproc.sh
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • components/src/dynamo/vllm/main.py
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (1)
components/src/dynamo/common/utils/prometheus.py (1)
  • register_engine_metrics_callback (30-78)
🪛 Ruff (0.14.7)
tests/serve/test_vllm.py

75-75: Probable insecure usage of temporary file or directory: "/tmp/prometheus_multiproc_test_" (S108)

75-75: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
examples/backends/vllm/launch/agg_lmcache.sh (1)

7-8: LGTM! Clean separation of test scenarios.

The explicit unset ensures this script tests the non-multiprocess metrics path, complementing the new agg_lmcache_multiproc.sh that tests the multiprocess path.

examples/backends/vllm/launch/agg_lmcache_multiproc.sh (2)

6-10: LGTM! Unique directory creation prevents test conflicts.

The combination of $$ (PID) and $RANDOM ensures uniqueness across concurrent test runs. The rm -rf before mkdir handles any stale directories from previous failed runs.


12-18: LGTM! Cleanup approach is appropriate for test scripts.

The trap ensures directory removal and process termination on exit. Based on learnings, temp directory cleanup is considered low priority for production containerized deployments, but having it in test scripts aids local development.
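
For reference while reading those two comments, the pattern under review condenses to something like the bash sketch below; the actual agg_lmcache_multiproc.sh additionally launches the frontend and worker processes that inherit the variable.

```bash
#!/usr/bin/env bash
# Condensed sketch of the unique-directory + trap-cleanup pattern discussed above;
# the real agg_lmcache_multiproc.sh also starts the frontend and worker processes.
set -euo pipefail

# Unique per-run directory: PID ($$) plus $RANDOM avoids collisions between
# concurrent runs; rm -rf clears any stale directory left by a failed run.
PROMETHEUS_MULTIPROC_DIR="/tmp/prometheus_multiproc_$$_${RANDOM}"
rm -rf "$PROMETHEUS_MULTIPROC_DIR"
mkdir -p "$PROMETHEUS_MULTIPROC_DIR"
export PROMETHEUS_MULTIPROC_DIR

cleanup() {
    # Remove the metrics directory and stop any background jobs on exit.
    rm -rf "$PROMETHEUS_MULTIPROC_DIR"
    kill $(jobs -p) 2>/dev/null || true
}
trap cleanup EXIT

# ... launch frontend and worker processes here; they inherit PROMETHEUS_MULTIPROC_DIR ...
```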

components/src/dynamo/vllm/main.py (3)

12-12: LGTM! Correct placement for multiprocess mode.

Moving prometheus_client imports to the top ensures proper multiprocess mode initialization before any metrics are created.


110-124: LGTM! Clear explanation of the dual-registry approach.

The docstring effectively explains the multiprocess metrics challenge and the fallback strategy for handling registry conflicts in Kubernetes deployments.


416-416: LGTM! Correct placement of metrics setup.

Both init_prefill and init properly call setup_metrics_collection after vLLM engine initialization and before serving endpoints.

Also applies to: 529-529

tests/serve/test_vllm.py (1)

68-83: LGTM! Test config appropriately covers the multiprocess metrics path.

The new test configuration enables verification that both metrics paths (with and without PROMETHEUS_MULTIPROC_DIR) produce identical outputs. The use of /tmp and random.randint provides adequate isolation for concurrent test runs.

Note: Static analysis warnings (S108, S311) are false positives—cryptographic-grade randomness and secure temp directories are unnecessary for test isolation.
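
Purely as an illustration of the isolation approach being endorsed here, the per-test environment override looks roughly like the following; the key name matches PROMETHEUS_MULTIPROC_DIR, but the path prefix and random bounds are placeholders rather than values copied from tests/serve/test_vllm.py.

```python
# Hypothetical illustration of per-test isolation: each test config gets its own
# randomized multiprocess directory so concurrent runs do not collide.
import random

env_overrides = {
    "PROMETHEUS_MULTIPROC_DIR": f"/tmp/prometheus_multiproc_test_{random.randint(0, 1_000_000)}",
}
print(env_overrides)
```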

When PROMETHEUS_MULTIPROC_DIR is set before the process starts, lmcache:*
metrics were not visible due to a 'Duplicated timeseries' error when trying
to collect metrics from the .db files.

Implemented dual-registry approach:
- Try adding MultiProcessCollector(REGISTRY) first
- On conflict, create separate CollectorRegistry for .db file metrics
- Register both registries to ensure all metrics collected

Added test scenarios:
- aggregated_lmcache: Without PROMETHEUS_MULTIPROC_DIR set
- aggregated_lmcache_multiproc: With PROMETHEUS_MULTIPROC_DIR set

Both scenarios produce identical metrics (492 lines: 338 vllm, 101 lmcache)
with no duplicates.

Signed-off-by: Keiven Chang <[email protected]>

tidy up comments

Clarify PROMETHEUS_MULTIPROC_DIR is optional in documentation

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang force-pushed the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch from dafc575 to ecceb45 on December 2, 2025 20:50
@keivenchang keivenchang merged commit c655585 into main Dec 3, 2025
46 of 52 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS-1071__fix-Pinterest-lmcache-PROMETHEUS_MULTIPROC_DIR branch December 3, 2025 02:47
keivenchang added a commit that referenced this pull request Dec 3, 2025