-
Notifications
You must be signed in to change notification settings - Fork 732
fix: enable LMCache metrics visibility with PROMETHEUS_MULTIPROC_DIR #4654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: enable LMCache metrics visibility with PROMETHEUS_MULTIPROC_DIR #4654
Conversation
f67133f to
09b6c68
Compare
09b6c68 to
eedd70a
Compare
WalkthroughThe changes introduce centralized Prometheus metrics collection for vLLM and LMCache with multiprocess support. A new setup_metrics_collection function consolidates scattered initialization logic. Bash scripts enable multi-process Prometheus testing, and test infrastructure validates the configuration with environment-based multiprocess directory setup. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/main.py (1)
697-768: Add missingsetup_metrics_collectioncall for multimodal workers.The
init_multimodal_workerfunction omits thesetup_metrics_collectioncall that is present in bothinit_prefill(line 416) andinit(line 529). This means multimodal workers lack critical metrics setup:
- MultiProcessCollector registration for prometheus multiprocess mode
- Engine metrics callbacks for vLLM and LMCache metrics
- Dual-registry handling for metric conflicts
Add
setup_metrics_collection(config, generate_endpoint, logger)after the KV publisher setup and before theasyncio.gathercall, following the same pattern as other worker initialization functions.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
components/src/dynamo/vllm/main.py(5 hunks)examples/backends/vllm/launch/agg_lmcache.sh(1 hunks)examples/backends/vllm/launch/agg_lmcache_multiproc.sh(1 hunks)tests/serve/test_vllm.py(2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-11-14T01:09:35.244Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.
Applied to files:
examples/backends/vllm/launch/agg_lmcache.shexamples/backends/vllm/launch/agg_lmcache_multiproc.sh
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
Applied to files:
components/src/dynamo/vllm/main.py
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (1)
components/src/dynamo/common/utils/prometheus.py (1)
register_engine_metrics_callback(30-78)
🪛 Ruff (0.14.7)
tests/serve/test_vllm.py
75-75: Probable insecure usage of temporary file or directory: "/tmp/prometheus_multiproc_test_"
(S108)
75-75: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
examples/backends/vllm/launch/agg_lmcache.sh (1)
7-8: LGTM! Clean separation of test scenarios.The explicit
unsetensures this script tests the non-multiprocess metrics path, complementing the newagg_lmcache_multiproc.shthat tests the multiprocess path.examples/backends/vllm/launch/agg_lmcache_multiproc.sh (2)
6-10: LGTM! Unique directory creation prevents test conflicts.The combination of
$$(PID) and$RANDOMensures uniqueness across concurrent test runs. Therm -rfbeforemkdirhandles any stale directories from previous failed runs.
12-18: LGTM! Cleanup approach is appropriate for test scripts.The trap ensures directory removal and process termination on exit. Based on learnings, temp directory cleanup is considered low priority for production containerized deployments, but having it in test scripts aids local development.
components/src/dynamo/vllm/main.py (3)
12-12: LGTM! Correct placement for multiprocess mode.Moving
prometheus_clientimports to the top ensures proper multiprocess mode initialization before any metrics are created.
110-124: LGTM! Clear explanation of the dual-registry approach.The docstring effectively explains the multiprocess metrics challenge and the fallback strategy for handling registry conflicts in Kubernetes deployments.
416-416: LGTM! Correct placement of metrics setup.Both
init_prefillandinitproperly callsetup_metrics_collectionafter vLLM engine initialization and before serving endpoints.Also applies to: 529-529
tests/serve/test_vllm.py (1)
68-83: LGTM! Test config appropriately covers the multiprocess metrics path.The new test configuration enables verification that both metrics paths (with and without
PROMETHEUS_MULTIPROC_DIR) produce identical outputs. The use of/tmpandrandom.randintprovides adequate isolation for concurrent test runs.Note: Static analysis warnings (S108, S311) are false positives—cryptographic-grade randomness and secure temp directories are unnecessary for test isolation.
When PROMETHEUS_MULTIPROC_DIR is set before process starts, lmcache:* metrics were not visible due to 'Duplicated timeseries' error when trying to collect metrics from .db files. Implemented dual-registry approach: - Try adding MultiProcessCollector(REGISTRY) first - On conflict, create separate CollectorRegistry for .db file metrics - Register both registries to ensure all metrics collected Added test scenarios: - aggregated_lmcache: Without PROMETHEUS_MULTIPROC_DIR set - aggregated_lmcache_multiproc: With PROMETHEUS_MULTIPROC_DIR set Both scenarios produce identical metrics (492 lines: 338 vllm, 101 lmcache) with no duplicates. Signed-off-by: Keiven Chang <[email protected]> tidy up comments Clarify PROMETHEUS_MULTIPROC_DIR is optional in documentation Signed-off-by: Keiven Chang <[email protected]>
dafc575 to
ecceb45
Compare
…4654) Signed-off-by: Keiven Chang <[email protected]> Co-authored-by: Keiven Chang <[email protected]>
Overview:
Fixes LMCache metrics visibility when PROMETHEUS_MULTIPROC_DIR is explicitly set by users (K8s deployments). Previously, lmcache:* metrics were not exposed due to Prometheus registry conflicts when the environment variable was set before process initialization.
Details:
Where should the reviewer start?
components/src/dynamo/vllm/main.py - Review the setup_metrics_collection() function (lines 110-190) which implements the dual-registry logic with detailed comments explaining the approach.
Related Issues:
Relates to DIS-1071
/coderabbit profile chill
Summary by CodeRabbit
Release Notes
New Features
Refactor
Tests
✏️ Tip: You can customize this high-level summary in your review settings.