Improve prometheus truncation #955
Walkthrough

Reduces the default MAX_GRAPH_POINTS from 300 to 200. Enhances Prometheus range query step handling with an optional step, auto-calculation, and a duration parsing utility. Adds a large-result summarization path for Prometheus queries. Updates instructions for high-cardinality handling. Introduces new high-cardinality test fixtures, adds localhost passthrough in tests, and adjusts related tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant PT as PrometheusToolset
  participant PU as Utils (parse_duration_to_seconds)
  participant P as Prometheus API
  U->>PT: ExecuteRangeQuery(start, end, step?, query)
  PT->>PU: parse_duration_to_seconds(step)
  PU-->>PT: step_seconds or None
  PT->>PT: adjust_step_for_max_points(start, end, step_seconds?)<br/>- auto-calc if None<br/>- cap by MAX_GRAPH_POINTS (200)
  PT->>P: GET /api/v1/query_range?query...&step=adjusted
  P-->>PT: data (result, size)
  alt data_size_chars > 80k
    PT->>PT: create_data_summary_for_large_result(is_range_query=True)
    PT-->>U: summary (series_count, est_points, sample_data, suggestion)
  else
    PT-->>U: full data
  end
```
```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant PT as PrometheusToolset
  participant P as Prometheus API
  U->>PT: ExecuteInstantQuery(query)
  PT->>P: GET /api/v1/query?query
  P-->>PT: data (result, size)
  alt data_size_chars > 80k
    PT->>PT: create_data_summary_for_large_result(is_range_query=False)
    PT-->>U: summary (result_count, type, sample_data, suggestion)
  else
    PT-->>U: full data
  end
```
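The diagrams above reference a duration parsing utility (`parse_duration_to_seconds`) that the walkthrough says was added in this PR. The PR's actual implementation is not shown here, so the following is a minimal illustrative sketch of what such a helper typically does — accept either a bare number of seconds or a Prometheus-style duration string and return seconds, or None when the input can't be parsed. The unit subset and exact behavior are assumptions, not the PR's code.

```python
import re
from typing import Optional, Union

# Prometheus-style duration units (illustrative subset; "ms"/"y" omitted)
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration_to_seconds(value: Union[str, float, int, None]) -> Optional[float]:
    """Best-effort parse of a duration like "5m" or "1h30m", or bare seconds."""
    if value is None:
        return None
    try:
        return float(value)  # bare number of seconds, e.g. 15 or "15"
    except (TypeError, ValueError):
        pass
    total, pos = 0.0, 0
    for match in re.finditer(r"(\d+)([smhdw])", str(value)):
        if match.start() != pos:
            return None  # unexpected characters between components
        total += int(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    # Require the whole string to be consumed by duration components
    return total if pos == len(str(value)) and pos > 0 else None
```

With this shape, a missing or unparsable step simply yields None, which feeds the auto-calculation branch shown in the diagram.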
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs
 Suggested reviewers
Pre-merge checks (1 passed, 1 warning, 1 inconclusive)
❌ Failed checks (1 warning, 1 inconclusive)
 ✅ Passed checks (1 passed)
    
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (2)
holmes/plugins/toolsets/prometheus/prometheus.py (2)
931-935: Make `step` optional in the tool schema to match auto-step behavior.

Runtime accepts a missing step, but the schema still requires it. Aligning the two avoids conflicting signals to the LLM/UI.

```diff
-            "step": ToolParameter(
-                description="Query resolution step width in duration format or float number of seconds",
-                type="number",
-                required=True,
-            ),
+            "step": ToolParameter(
+                description="Optional: resolution step in seconds (float). Omit to auto-calculate.",
+                type="number",
+                required=False,
+            ),
```
33-33: Correct the docstring's env var name and default.

In holmes/plugins/toolsets/prometheus/prometheus.py (line 291), replace "Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable (default: 300)." with "Max points is controlled by the MAX_GRAPH_POINTS environment variable (default: 200)."
🧹 Nitpick comments (12)
holmes/common/env_vars.py (1)
76-76: Use an integer type and consistent default parsing for MAX_GRAPH_POINTS.

Points are a count; store as int and use a string default for consistency with other env vars.

```diff
-MAX_GRAPH_POINTS = float(os.environ.get("MAX_GRAPH_POINTS", 200))
+MAX_GRAPH_POINTS: int = int(os.environ.get("MAX_GRAPH_POINTS", "200"))
```

holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 (1)
24-29: Prefer topk over an aggregated expression to rank by the intended label.

Recommend demonstrating `sum by (<label>) (...)` inside `topk` so users pick "top pods/containers," not arbitrary series.

```diff
-* Standard pattern for high-cardinality queries:
-  - Use `topk(5, <your_query>)` to get the top 5 series
-  - Example: `topk(5, rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))`
+* Standard pattern for high-cardinality queries:
+  - Use `topk(5, sum by (pod) (<your_query>))` (replace `pod` with the label you care about)
+  - Example: `topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="default"}[5m])))`
```

holmes/plugins/toolsets/prometheus/prometheus.py (5)
291-293: Docstring: wrong env var name and default.

Update to reflect MAX_GRAPH_POINTS (default 200).

```diff
-    Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable (default: 300).
+    Max points are controlled by the MAX_GRAPH_POINTS environment variable (default: 200).
```
335-406: Make topk/"other" suggestions accurate and label-agnostic.

The current suggestion strings alternate between `pod` and `instance` and use `label_replace` patterns that may be invalid. Provide a generic, correct template that nudges users to aggregate before topk and compute "other" via `sum without(<label>) (...)`.

```diff
-"suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results to the top {min(5, num_items)} series. To also capture remaining data as \'other\': topk({min(5, num_items)}, {query}) or label_replace((sum({query}) - sum(topk({min(5, num_items)}, {query}))), "pod", "other", "", "")',
+"suggestion": (
+    "Consider limiting series with topk(5, sum by (<label>) (<expr>)) "
+    "where <label> is your grouping (e.g., pod). "
+    "To include an 'other' bucket: "
+    "sum without(<label>) (sum by (<label>) (<expr>)) - "
+    "sum without(<label>) (topk(5, sum by (<label>) (<expr>)))."
+),
```

```diff
-"suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results. To also capture remaining data as \'other\': topk({min(5, num_items)}, {query}) or label_replace((sum({query}) - sum(topk({min(5, num_items)}, {query}))), "instance", "other", "", "")',
+"suggestion": (
+    "Consider limiting series with topk(5, sum by (<label>) (<expr>)) "
+    "(pick <label> relevant to your query). "
+    "For an 'other' bucket, use the same pattern shown above."
+),
```
827-852: Avoid JSON re-serialization to measure payload size.

Use the HTTP payload size to decide summarization, then parse for samples only when needed. This saves CPU/memory on huge responses.

```diff
-        # Estimate the size of the data
-        data_str_preview = json.dumps(result_data)
-        data_size_chars = len(data_str_preview)
+        # Estimate size from HTTP payload
+        data_size_chars = int(response.headers.get("Content-Length", 0)) or len(response.content)
```
1014-1036: Same optimization as the instant query: use the HTTP payload size, not a JSON dump, for thresholding.

```diff
-        # Estimate the size of the data
-        data_str_preview = json.dumps(result_data)
-        data_size_chars = len(data_str_preview)
+        # Estimate size from HTTP payload
+        data_size_chars = int(response.headers.get("Content-Length", 0)) or len(response.content)
```
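The two comments above suggest preferring the declared Content-Length over re-serializing parsed JSON. A minimal sketch of that idea, with a fallback to the raw body length when the header is missing or malformed (the helper name `estimate_payload_size` is hypothetical, not from the PR):

```python
def estimate_payload_size(response) -> int:
    """Prefer the declared HTTP payload size; fall back to the raw body length.

    Avoids re-serializing an already parsed JSON document just to measure it.
    """
    declared = response.headers.get("Content-Length")
    if declared:
        try:
            return int(declared)
        except ValueError:
            pass  # malformed header; fall through to the body length
    return len(response.content)
```

Note that with compressed responses, requests exposes the decompressed body in `response.content`, so the header and body lengths can legitimately disagree; either is fine as a summarization heuristic.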
284-301: Optional: early-return guard for degenerate ranges.

If end <= start, return step=1 to avoid division anomalies.

```diff
     time_range_seconds = (end_dt - start_dt).total_seconds()
+    if time_range_seconds <= 0:
+        logging.debug("Non-positive time range; defaulting step=1s")
+        return 1.0
```

tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml (2)
31-36: Increase the readiness wait to reduce flakiness (45+5 pods may take >120s).

Bump to 240-300s.

```diff
-    kubectl wait --for=condition=Ready pod -l test=high-cardinality -n app-159 --timeout=120s
+    kubectl wait --for=condition=Ready pod -l test=high-cardinality -n app-159 --timeout=300s
```
24-27: Don't mask Prometheus readiness failures unless intentional.

`|| true` hides real readiness issues. If that's desired, leave it; otherwise, remove it to fail fast.

```diff
-    kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=prometheus -n default --timeout=60s || true
+    kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=prometheus -n default --timeout=60s
```

tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/manifest.yaml (3)
53-71: Security context hardening (test fixture, optional).

Add non-root and no-priv-esc settings to address the Checkov findings (CKV_K8S_20, CKV_K8S_23).

```diff
       - name: app
         image: alpine:latest
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          readOnlyRootFilesystem: true
```
88-108: Apply the same securityContext to the high-CPU pods (example for z-cpu-heavy-1).

Replicate to z-cpu-heavy-2..5 as well.

```diff
       - name: app
         image: alpine:latest
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          readOnlyRootFilesystem: true
```
53-53: Pin the image tag (avoid "latest" for determinism).

Use a fixed Alpine (e.g., 3.20) to reduce pull drift across CI runs.

```diff
-        image: alpine:latest
+        image: alpine:3.20
```

Also applies to: 90-90, 121-121, 152-152, 183-183, 214-214
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- holmes/common/env_vars.py (1 hunks)
- holmes/plugins/toolsets/prometheus/prometheus.py (7 hunks)
- holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 (1 hunks)
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/manifest.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
tests/llm/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place LLM evaluation tests under tests/llm/
Files:
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/manifest.yaml
tests/llm/fixtures/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
LLM test scenarios should use YAML configs under tests/llm/fixtures/
Files:
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/manifest.yaml
tests/llm/**/toolsets.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/**/toolsets.yaml: In evals, all toolset-specific configuration must be under a config field in toolsets.yaml
Only the allowed top-level fields are permitted in toolsets.yaml (enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url for MCP only)
Files:
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Tests should mirror the source structure under tests/
Files:
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/manifest.yaml
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Ruff for formatting and linting
Type hints are required (checked by mypy)
Always place Python imports at the top of the file, not inside functions or methods
Files:
- holmes/plugins/toolsets/prometheus/prometheus.py
- holmes/common/env_vars.py
tests/llm/**/test_case.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/**/test_case.yaml: Each LLM test must use a dedicated namespace app- defined in test_case.yaml
Do not put toolsets configuration in test_case.yaml; it must be in a separate toolsets.yaml file
Files:
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml
🧠 Learnings (3)
📚 Learning: 2025-09-08T11:12:28.846Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-08T11:12:28.846Z
Learning: Applies to tests/llm/**/test_case.yaml : Do not put toolsets configuration in test_case.yaml; it must be in a separate toolsets.yaml file
Applied to files:
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
📚 Learning: 2025-09-08T11:12:28.846Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-08T11:12:28.846Z
Learning: Applies to tests/llm/**/toolsets.yaml : Only the allowed top-level fields are permitted in toolsets.yaml (enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url for MCP only)
Applied to files:
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
📚 Learning: 2025-08-05T00:42:23.792Z
Learnt from: vishiy
PR: robusta-dev/holmesgpt#782
File: config.example.yaml:31-49
Timestamp: 2025-08-05T00:42:23.792Z
Learning: In robusta-dev/holmesgpt config.example.yaml, the azuremonitorlogs toolset configuration shows "enabled: true" as an example of how to enable the toolset, not as a default setting. The toolset is disabled by default and requires explicit enablement in user configurations.
Applied to files:
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml
🪛 Checkov (3.2.334)
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/manifest.yaml
[medium] 34-76: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 34-76: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 78-108: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 78-108: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 109-139: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 109-139: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 140-170: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 140-170: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 171-201: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 171-201: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 202-231: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 202-231: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pre-commit checks
 - GitHub Check: llm_evals
 - GitHub Check: Pre-commit checks
 
🔇 Additional comments (7)
holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 (1)
23-30: Great addition: clear high-cardinality guidance and topk pattern. This will materially reduce result sizes and keep graphs readable.
holmes/plugins/toolsets/prometheus/prometheus.py (1)
307-314: Smart default step sizing — LGTM. Auto-targeting ~60 points keeps charts dense enough without overfetching.
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/toolsets.yaml (4)
4-4: Verify the toolsets.yaml schema (top-level "toolsets" key). Guidelines say only specific top-level fields are allowed in toolsets.yaml, and "toolsets:" isn't in that list. If the test runner expects "toolsets:", ignore this; otherwise, restructure accordingly.
8-15: Prometheus config looks correct for the fixture. Port-forward URL, metadata/labels toggle, and returning data align with the scenario.
5-7: Enabling kubernetes/core for this test is appropriate. Nothing to change.
16-17: Enabling kubernetes/logs is fine. No action needed.
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml (1)
1-8: Ensure graphing won't be summarized away. With many series, results might cross the summarization threshold added in this PR. If flakes appear, consider explicitly guiding topk()/aggregation in the prompt or accepting summarized output in the expectations.
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (1)
577-580: Fix comment: MAX_GRAPH_POINTS value typo.

The comment says 300 but computes with 200. Adjust to avoid confusion.

```diff
-    # With 86400 seconds and MAX_GRAPH_POINTS=300, min step = ceil(86400/200) = 432
+    # With 86400 seconds and MAX_GRAPH_POINTS=200, min step = ceil(86400/200) = 432
```

holmes/plugins/toolsets/prometheus/prometheus.py (3)
284-326: Unify step adjustment with utils, fix the default-points doc, and return an int step.

There's already a canonical `adjust_step_for_max_points` in `holmes.plugins.toolsets.utils`. Re-implementing it here risks divergence. Recommend delegating to the utils function after computing `time_range_seconds` and your default step, fixing the doc (the default is 200 now), and returning an integer step (Prometheus accepts seconds as an integer; avoiding floats improves determinism).

Apply:

```diff
 def adjust_step_for_max_points(
     start_timestamp: str,
     end_timestamp: str,
     step: Optional[float] = None,
 ) -> float:
     """
     Adjusts the step parameter to ensure the number of data points doesn't exceed max_points.
-    Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable (default: 300).
+    Max points is controlled by the MAX_GRAPH_POINTS environment variable (default: 200).
@@
-    # If no step provided, calculate a reasonable default
-    # Aim for ~60 data points across the time range (1 per minute for hourly, etc)
-    if step is None:
-        step = max(1, time_range_seconds / 60)
-        logging.debug(
-            f"No step provided, defaulting to {step}s for {time_range_seconds}s range"
-        )
-
-    current_points = time_range_seconds / step
-
-    # If current points exceed max, adjust the step
-    if current_points > MAX_GRAPH_POINTS:
-        adjusted_step = time_range_seconds / MAX_GRAPH_POINTS
-        logging.info(
-            f"Adjusting step from {step}s to {adjusted_step}s to limit points from {current_points:.0f} to {MAX_GRAPH_POINTS}"
-        )
-        return adjusted_step
-
-    return step
+    # If no step provided, calculate a reasonable default (~1/min resolution)
+    if step is None:
+        step = max(1.0, time_range_seconds / 60.0)
+        logging.debug(
+            f"No step provided, defaulting to {step}s for {time_range_seconds}s range"
+        )
+
+    # Delegate final adjustment to the shared helper to enforce the cap
+    from holmes.plugins.toolsets import utils as _utils  # local import to avoid cycles
+    adjusted = _utils.adjust_step_for_max_points(
+        time_range_seconds=int(time_range_seconds),
+        max_points=MAX_GRAPH_POINTS,
+        step=int(step) if step is not None else None,
+    )
+    return float(adjusted)
```
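The capping behavior the reviewers keep circling around can be reduced to a few lines: the smallest integer step that keeps `range / step <= MAX_GRAPH_POINTS` is `ceil(range / MAX_GRAPH_POINTS)`. A standalone sketch of that math (the function name `capped_step_seconds` and the degenerate-range guard are illustrative, not the repo's actual helper):

```python
import math
from typing import Optional

MAX_GRAPH_POINTS = 200  # this PR's default; configurable via the MAX_GRAPH_POINTS env var


def capped_step_seconds(time_range_seconds: int, step: Optional[int] = None) -> int:
    """Smallest integer step that keeps the point count at or below the cap."""
    if time_range_seconds <= 0:
        return 1  # guard for degenerate ranges (end <= start)
    # ceil(range / max_points): any smaller step would exceed the cap
    min_step = max(1, math.ceil(time_range_seconds / MAX_GRAPH_POINTS))
    if step is None:
        return min_step
    # A caller-provided step is honored unless it would produce too many points
    return max(min_step, step)
```

For a 1-hour range this yields 18s (matching the tempo test expectations), and a user-supplied 60s step passes through untouched because it already stays under the cap.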
335-406: Large-result summary: good guardrail; refine the suggestion text and bound the work.

- The topk/label_replace suggestion assumes specific labels ("pod", "instance"). That can mislead on other datasets. Suggest making the example label-agnostic or omitting the label_replace pattern.
- Consider limiting JSON work further: you already pass `data_size_chars` in; keep sampling to metadata only (as you do) and avoid any full-data serialization in this helper.

```diff
-"suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results to the top {min(5, num_items)} series. To also capture remaining data as \'other\': topk({min(5, num_items)}, {query}) or label_replace((sum({query}) - sum(topk({min(5, num_items)}, {query}))), "pod", "other", "", "")',
+"suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results. If you need totals, combine with sum()/avg() by the labels you care about.',
@@
-"suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results. To also capture remaining data as \'other\': topk({min(5, num_items)}, {query}) or label_replace((sum({query}) - sum(topk({min(5, num_items)}, {query}))), "instance", "other", "", "")',
+"suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results. If you need totals, combine with sum()/avg() by the labels you care about.',
```
1190-1194: Nit: redundant exception context.

`logging.exception` already includes the traceback; `exc_info=True` is redundant.

```diff
-        logging.exception("Failed to initialize Prometheus", exc_info=True)
+        logging.exception("Failed to initialize Prometheus")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- conftest.py (1 hunks)
- holmes/plugins/toolsets/prometheus/prometheus.py (8 hunks)
- tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Ruff for formatting and linting
Type hints are required (checked by mypy)
Always place Python imports at the top of the file, not inside functions or methods
Files:
- conftest.py
- tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
- holmes/plugins/toolsets/prometheus/prometheus.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Do not use pytest markers/tags that are not declared; only use markers listed in pyproject.toml
Files:
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Tests should mirror the source structure under tests/
Files:
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
🧬 Code graph analysis (1)
holmes/plugins/toolsets/prometheus/prometheus.py (1)
holmes/plugins/toolsets/utils.py (1)
adjust_step_for_max_points(199-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: llm_evals
 - GitHub Check: Pre-commit checks
 
🔇 Additional comments (4)
conftest.py (1)
167-168: Scope localhost passthrough to an explicit allowlist of localhost ports used by tests (don't restrict to :3000).

File: conftest.py, lines 167-168. Tests reference many localhost ports (found: 15672, 3000, 3003, 3004, 3100, 3101, 3200, 5000, 8000, 8080, 9093, 9099, 9153, 9159, 9200), so changing the passthru to only :3000 would break other tests.

Apply this patch instead (or maintain an allowlist derived from tests / environment):

```diff
-        rsps.add_passthru("http://localhost")
+        # Allow only explicitly required localhost ports (enumerated from tests)
+        import re  # noqa: PLC0415 - test helper import
+        rsps.add_passthru(
+            re.compile(
+                r"^http://localhost:(?:15672|3000|3003|3004|3100|3101|3200|5000|8000|8080|9093|9099|9153|9159|9200)(?:/|$)"
+            )
+        )
```

If the fixture is meant only for Grafana (:3000), move other tests to use their own fixtures or explicitly add per-test passthru entries.

Likely an incorrect or invalid review comment.
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (3)
333-346: LGTM: step updated to match MAX_GRAPH_POINTS=200 auto-step behavior.

The explicit "8m" expectation aligns with the new cap and auto-step math.

Also applies to: 339-346
552-555: Comment explains the 18s derivation clearly. It accurately documents ceil(3600/200) = 18s.
635-637: LGTM: expectations for the 1h and 1w ranges reflect the 200-point cap. The 18s and 50m24s expectations are consistent.
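The step expectations discussed above all follow from the same formula, `ceil(range_seconds / 200)`. A quick check of the three cases mentioned in this review (1h, 1d, 1w):

```python
import math

MAX_POINTS = 200  # reduced default from this PR


def min_step(range_seconds: int) -> int:
    """Minimum step that keeps a range query at or under MAX_POINTS samples."""
    return math.ceil(range_seconds / MAX_POINTS)


# 1h -> 18s, 1d -> 432s, 1w -> 3024s (which is the "50m24s" in the tests)
checks = {3600: 18, 86400: 432, 604800: 3024}
for range_seconds, expected in checks.items():
    assert min_step(range_seconds) == expected

assert 3024 == 50 * 60 + 24  # 3024s renders as "50m24s"
```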
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
holmes/plugins/toolsets/prometheus/prometheus.py (1)
287-326: Unify auto-step logic with MAX_GRAPH_POINTS and fix the docstring.

The current defaulting to ~60 points diverges from the repo's shared policy (limit to MAX_GRAPH_POINTS). Also, the docstring still mentions 300. Recommend computing the smallest allowed step directly from MAX_GRAPH_POINTS (ceil) and using integer seconds to keep payloads stable.

Apply this diff:

```diff
 def adjust_step_for_max_points(
     start_timestamp: str,
     end_timestamp: str,
     step: Optional[float] = None,
 ) -> float:
     """
     Adjusts the step parameter to ensure the number of data points doesn't exceed max_points.
-    Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable (default: 300).
+    Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable (see MAX_GRAPH_POINTS).
@@
-    start_dt = dateutil.parser.parse(start_timestamp)
-    end_dt = dateutil.parser.parse(end_timestamp)
-
-    time_range_seconds = (end_dt - start_dt).total_seconds()
-
-    # If no step provided, calculate a reasonable default
-    # Aim for ~60 data points across the time range (1 per minute for hourly, etc)
-    if step is None:
-        step = max(1, time_range_seconds / 60)
-        logging.debug(
-            f"No step provided, defaulting to {step}s for {time_range_seconds}s range"
-        )
-
-    current_points = time_range_seconds / step
-
-    # If current points exceed max, adjust the step
-    if current_points > MAX_GRAPH_POINTS:
-        adjusted_step = time_range_seconds / MAX_GRAPH_POINTS
-        logging.info(
-            f"Adjusting step from {step}s to {adjusted_step}s to limit points from {current_points:.0f} to {MAX_GRAPH_POINTS}"
-        )
-        return adjusted_step
-
-    return step
+    start_dt = dateutil.parser.parse(start_timestamp)
+    end_dt = dateutil.parser.parse(end_timestamp)
+    time_range_seconds = int((end_dt - start_dt).total_seconds())
+    # smallest allowed to keep points <= MAX_GRAPH_POINTS
+    min_step = max(1, (time_range_seconds + MAX_GRAPH_POINTS - 1) // MAX_GRAPH_POINTS)
+    if step is None:
+        return float(min_step)
+    return float(max(min_step, int(step)))
```

Also consider returning a Prometheus duration string (e.g., "18s") for readability and parity with other toolsets. See the next comment.
🧹 Nitpick comments (3)
conftest.py (1)
167-168: Constrain localhost passthrough to avoid non-hermetic tests.

File: conftest.py, lines 167-168. Replace the blanket passthru with a tight regex that matches localhost and 127.0.0.1 (optional port).

```diff
-        rsps.add_passthru("http://localhost")
+        rsps.add_passthru(re.compile(r"^http://(localhost|127\.0\.0\.1)(:\d+)?(/|$)"))
```

Add at the top of the file: `import re`

If a blanket passthru is intentional, gate it behind an env var (e.g., ENABLE_LOCAL_PASSTHRU) so CI remains hermetic.
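The proposed regex is worth sanity-checking, since the string-prefix form `add_passthru("http://localhost")` would also match lookalike hosts such as `localhost.evil.com`. A quick demonstration of what the tightened pattern does and does not match:

```python
import re

# The regex from the suggestion above: localhost or 127.0.0.1,
# an optional port, then either a path separator or end of string.
LOCALHOST_PASSTHRU = re.compile(r"^http://(localhost|127\.0\.0\.1)(:\d+)?(/|$)")

# Matches: plain localhost, with a port, with a path
assert LOCALHOST_PASSTHRU.match("http://localhost:3000/api/health")
assert LOCALHOST_PASSTHRU.match("http://127.0.0.1")
assert LOCALHOST_PASSTHRU.match("http://localhost/metrics")

# Does not leak to lookalike hosts or other schemes
assert not LOCALHOST_PASSTHRU.match("http://localhost.evil.com/")
assert not LOCALHOST_PASSTHRU.match("https://localhost:3000/")
```

The `(/|$)` suffix is the part that prevents the lookalike-host leak a bare prefix match would allow.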
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (1)
577-580: Fix typo in comment: MAX_GRAPH_POINTS value.

The comment says MAX_GRAPH_POINTS=300 but computes using 200. Update the comment to avoid confusion.

Apply this diff:

```diff
-    # With 86400 seconds and MAX_GRAPH_POINTS=300, min step = ceil(86400/200) = 432
+    # With 86400 seconds and MAX_GRAPH_POINTS=200, min step = ceil(86400/200) = 432
```

holmes/plugins/toolsets/prometheus/prometheus.py (1)
335-406: Large-result summarization: solid start; expose the threshold and avoid label assumptions.

- The 80K char cutoff is hardcoded. Consider promoting it to a constant or env var to tune per-deployment.
- Suggestions assume labels like "pod"/"instance" exist; mention "e.g., pod/instance" or detect a common label from sample_metrics to avoid misleading advice.

Example tweak:

```diff
 if is_range_query:
     ...
-    return {
+    common_label = next(iter(sample_metrics[0].keys()), "instance") if sample_metrics else "instance"
+    return {
         "message": ...,
-        "suggestion": ...,
+        "suggestion": f'Consider using topk({min(5, num_items)}, {query}) ... label_replace((sum({query}) - sum(topk({min(5, num_items)}, {query}))), "{common_label}", "other", "", "")',
     }
```

Optionally add a small unit test covering the summary path with synthetic large results.
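The first bullet above (promote the hardcoded 80K cutoff to a constant or env var) can be sketched in a few lines. Note the env var name `PROMETHEUS_SUMMARY_THRESHOLD_CHARS` is hypothetical — it is the reviewer's idea, not something that exists in the PR:

```python
import os

# Hypothetical tunable; the PR itself hardcodes an ~80K character cutoff.
PROMETHEUS_SUMMARY_THRESHOLD_CHARS = int(
    os.environ.get("PROMETHEUS_SUMMARY_THRESHOLD_CHARS", "80000")
)


def should_summarize(data_size_chars: int) -> bool:
    """Return True when a Prometheus payload is large enough to summarize."""
    return data_size_chars > PROMETHEUS_SUMMARY_THRESHOLD_CHARS
```

Reading the value once at import time mirrors how MAX_GRAPH_POINTS is handled in holmes/common/env_vars.py, keeping the two tunables consistent.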
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- conftest.py (1 hunks)
- holmes/plugins/toolsets/prometheus/prometheus.py (8 hunks)
- tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Ruff for formatting and linting
Type hints are required (checked by mypy)
Always place Python imports at the top of the file, not inside functions or methods
Files:
- conftest.py
- tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
- holmes/plugins/toolsets/prometheus/prometheus.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Do not use pytest markers/tags that are not declared; only use markers listed in pyproject.toml
Files:
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Tests should mirror the source structure under tests/
Files:
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
🧬 Code graph analysis (2)
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (1)
holmes/core/tools.py (1)
ToolResultStatus(51-75)
holmes/plugins/toolsets/prometheus/prometheus.py (1)
holmes/plugins/toolsets/utils.py (1)
adjust_step_for_max_points(199-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Pre-commit checks
 - GitHub Check: llm_evals
 
🔇 Additional comments (5)
tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (3)
333-342: Updated expected step to 8m — LGTM. Matches the reduced MAX_GRAPH_POINTS behavior.
552-555: Comment explains the new min-step math — LGTM. The rationale for 18s on a 1h range with MAX_GRAPH_POINTS=200 is correct.
635-637: Adjusted expectations — LGTM. The new expected steps ("18s", "50m24s") align with MAX_GRAPH_POINTS=200.
holmes/plugins/toolsets/prometheus/prometheus.py (2)
1189-1194: Broadened exception handling in health check — LGTM. Catching all exceptions and logging with exc_info avoids masking non-RequestException failures during detection.
975-980: No change required — /api/v1/query_range accepts numeric seconds or duration strings for "step". Prometheus accepts step as either a duration string (e.g. "15s") or a float number of seconds in POST bodies, so keep `"step": step` as-is; optionally serialize to `f"{step}s"` for readability (note: converting to int would drop sub-second precision).

File: holmes/plugins/toolsets/prometheus/prometheus.py, lines 975-980

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
holmes/plugins/toolsets/prometheus/prometheus.py (1)
285-327: Avoid duplicating step-adjust logic; reuse the shared utility.There’s already
adjust_step_for_max_points(time_range_seconds, max_points, step)inholmes/plugins/toolsets/utils.py. Duplicating here risks drift and inconsistent behavior/logging.-from holmes.plugins.toolsets.utils import ( +from holmes.plugins.toolsets.utils import ( get_param_or_raise, process_timestamps_to_rfc3339, standard_start_datetime_tool_param_description, toolset_name_for_one_liner, + adjust_step_for_max_points as _adjust_step_secs, ) @@ -def adjust_step_for_max_points( - start_timestamp: str, - end_timestamp: str, - step: Optional[float] = None, -) -> float: +def adjust_step_for_max_points( + start_timestamp: str, + end_timestamp: str, + step: Optional[float] = None, +) -> float: @@ - time_range_seconds = (end_dt - start_dt).total_seconds() + time_range_seconds = (end_dt - start_dt).total_seconds() @@ - if step is None: - step = max(1, time_range_seconds / 60) - logging.debug( - f"No step provided, defaulting to {step}s for {time_range_seconds}s range" - ) + if step is None: + step = max(1.0, time_range_seconds / 60.0) + logging.debug( + "No step provided, defaulting to %.3fs for %.0fs range", + step, + time_range_seconds, + ) @@ - current_points = time_range_seconds / step - - # If current points exceed max, adjust the step - if current_points > MAX_GRAPH_POINTS: - adjusted_step = time_range_seconds / MAX_GRAPH_POINTS - logging.info( - f"Adjusting step from {step}s to {adjusted_step}s to limit points from {current_points:.0f} to {MAX_GRAPH_POINTS}" - ) - return adjusted_step - - return step + # Delegate to shared util (returns int seconds) + adjusted = _adjust_step_secs( + time_range_seconds=int(time_range_seconds), + max_points=MAX_GRAPH_POINTS, + step=int(step) if step else None, + ) + return float(adjusted)
♻️ Duplicate comments (1)
holmes/plugins/toolsets/prometheus/prometheus.py (1)
932-936: Schema mismatch: step accepts duration strings but type=number.
Clients will reject `"5m"` while the code now supports it. Make `type="string"` and keep `required=False`.

```diff
-            "step": ToolParameter(
-                description="Query resolution step width in duration format or float number of seconds",
-                type="number",
-                required=False,
-            ),
+            "step": ToolParameter(
+                description="Query resolution step width (e.g., 30s, 5m, 1h, or numeric seconds). Omit to auto-calculate.",
+                type="string",
+                required=False,
+            ),
```
🧹 Nitpick comments (9)
holmes/plugins/toolsets/prometheus/utils.py (2)
7-15: Tighten typing guard for mypy.
After excluding int/float, assert string before calling `strip()` to keep mypy happy.

```diff
 def parse_duration_to_seconds(v: Optional[Union[str, float, int]]) -> Optional[float]:
     if v is None:
         return None
     if isinstance(v, (int, float)):
         return float(v)
-    s = v.strip().lower()
+    assert isinstance(v, str)
+    s = v.strip().lower()
```
19-30: Option: reject trailing garbage in partial formats.Right now
"1h30mx"would be accepted (ignores trailingx). Consider full-match validation so only unit tokens are allowed.- pattern = r"(\d+(?:\.\d+)?)(d|h|m|s)" - matches = re.findall(pattern, s) + token = r"(\d+(?:\.\d+)?)(d|h|m|s)" + pattern = rf"^{token}+$" + matches = re.findall(token, s) if re.fullmatch(pattern, s) else []tests/plugins/toolsets/prometheus/test_utils.py (1)
65-77: Add a couple of edge cases.
Consider adding:
- Mixed tokens with spaces: "1h 30m" → 5400.0
- Trailing junk should fail: "1h30mx" → ValueError
- Negative durations (decide on behavior): "-5m"

holmes/plugins/toolsets/prometheus/prometheus.py (6)
308-315: Optional: ceil default step to avoid overshooting max points.
`time_range/60` can be fractional; ceiling reduces the risk of exceeding MAX_GRAPH_POINTS by a few points.

```diff
-    if step is None:
-        step = max(1, time_range_seconds / 60)
+    if step is None:
+        step = max(1.0, math.ceil(time_range_seconds / 60.0))
```
375-382: Make guidance label-agnostic.
Hard-coding labels like `pod`/`instance` may not exist for many metrics. Suggest generic hints without specific label names.

```diff
-            "suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results to the top {min(5, num_items)} series. To also capture remaining data as \'other\': topk({min(5, num_items)}, {query}) or label_replace((sum({query}) - sum(topk({min(5, num_items)}, {query}))), "pod", "other", "", "")',
+            "suggestion": f'Consider using topk({min(5, num_items)}, {query}) to limit results. To capture the remainder as "other", aggregate by your primary label, e.g., sum by (<label>) ({query}) - sum by (<label>)(topk({min(5, num_items)}, {query})).',
```

(and similarly in the instant-query branch)
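To make the label-agnostic wording concrete, here is a hypothetical helper (name and signature invented for illustration; the PR builds this string inline) that produces the suggested guidance:

```python
def topk_suggestion(query: str, num_items: int) -> str:
    # Hypothetical helper illustrating the label-agnostic phrasing suggested
    # above; no specific label names are hard-coded.
    k = min(5, num_items)
    return (
        f"Consider using topk({k}, {query}) to limit results. "
        f"To capture the remainder as 'other', aggregate by your primary label, "
        f"e.g., sum by (<label>) ({query}) - sum by (<label>) (topk({k}, {query}))."
    )

suggestion = topk_suggestion("container_memory_usage_bytes", 120)
```

The `min(5, num_items)` clamp mirrors the existing code, so queries with fewer than five series suggest a correspondingly smaller `topk`.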
828-853: Compute size from the raw response to avoid re-serialization cost.
`json.dumps(result_data)` can be expensive. Use `len(response.text)` as a quick threshold, then parse or summarize as needed.

```diff
-        data_str_preview = json.dumps(result_data)
-        data_size_chars = len(data_str_preview)
+        data_size_chars = len(response.text)
```
1015-1037: Reuse raw response length here too.Same optimization as instant query: prefer
Same optimization as instant query: prefer `len(response.text)` over serializing `result_data`.

```diff
-        data_str_preview = json.dumps(result_data)
-        data_size_chars = len(data_str_preview)
+        data_size_chars = len(response.text)
```
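A self-contained sketch of the two sizing approaches; the payload here is a stand-in, not the toolset's actual response object:

```python
import json

# Stand-in for an HTTP response body we already hold as text.
raw_text = json.dumps(
    {"status": "success",
     "data": {"result": [{"metric": {"pod": f"p{i}"}} for i in range(1000)]}}
)
parsed = json.loads(raw_text)

# Cheap: the string is already in memory.
size_from_raw = len(raw_text)
# Costly for large payloads: re-serializes the whole structure.
size_from_dump = len(json.dumps(parsed))
```

Because `raw_text` was produced with default `json.dumps` settings, both measurements agree here; the point is that the first one avoids a second full serialization pass over a potentially multi-megabyte result.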
737-752: Use logging.warning instead of deprecated logging.warn.
Minor cleanup for consistency and to satisfy linters.

```diff
-        logging.warn("Timeout while fetching prometheus metrics", exc_info=True)
+        logging.warning("Timeout while fetching prometheus metrics", exc_info=True)
@@
-        logging.warn("Failed to fetch prometheus metrics", exc_info=True)
+        logging.warning("Failed to fetch prometheus metrics", exc_info=True)
@@
-        logging.warn("Failed to process prometheus metrics", exc_info=True)
+        logging.warning("Failed to process prometheus metrics", exc_info=True)
```
1191-1195: Nit: logging.exception already includes exc_info.
`logging.exception(..., exc_info=True)` is redundant. Drop `exc_info=True`.

```diff
-        logging.exception("Failed to initialize Prometheus", exc_info=True)
+        logging.exception("Failed to initialize Prometheus")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- holmes/plugins/toolsets/prometheus/prometheus.py (10 hunks)
- holmes/plugins/toolsets/prometheus/utils.py (1 hunks)
- tests/plugins/toolsets/prometheus/test_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Ruff for formatting and linting
Type hints are required (checked by mypy)
Always place Python imports at the top of the file, not inside functions or methods
Files:
- tests/plugins/toolsets/prometheus/test_utils.py
- holmes/plugins/toolsets/prometheus/utils.py
- holmes/plugins/toolsets/prometheus/prometheus.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Do not use pytest markers/tags that are not declared; only use markers listed in pyproject.toml
Files:
tests/plugins/toolsets/prometheus/test_utils.py
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Tests should mirror the source structure under tests/
Files:
tests/plugins/toolsets/prometheus/test_utils.py
🧬 Code graph analysis (2)
tests/plugins/toolsets/prometheus/test_utils.py (1)
holmes/plugins/toolsets/prometheus/utils.py (1)
parse_duration_to_seconds(7-30)
holmes/plugins/toolsets/prometheus/prometheus.py (2)
holmes/plugins/toolsets/prometheus/utils.py (1)
parse_duration_to_seconds (7-30)
holmes/plugins/toolsets/utils.py (1)
adjust_step_for_max_points (199-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Pre-commit checks
 - GitHub Check: llm_evals
 
🔇 Additional comments (5)
tests/plugins/toolsets/prometheus/test_utils.py (1)
1-78: Great coverage and clear expectations. The tests thoroughly cover None, numerics, units, partials, casing, whitespace, truncation, and invalids. LGTM.
holmes/plugins/toolsets/prometheus/prometheus.py (4)
25-25: Good: centralizes duration parsing. Importing the shared parser keeps range-query step handling consistent.
336-407: Helpful large-result summarization. Good metadata-only sampling, point estimation, and clear messaging.
965-972: LGTM: robust step parsing + auto-adjust. Parsing flexible inputs and letting the adjuster cap points is the right flow.
162-172: AMP path: propagate timeout to AWSPrometheusConnect.signed_request.
The AMP branch calls client.signed_request without timeout — pass the timeout to avoid hangs.
holmes/plugins/toolsets/prometheus/prometheus.py (do_request, AMP branch)

```diff
-        return client.signed_request(  # type: ignore
+        return client.signed_request(  # type: ignore
             method=method,
             url=url,
             data=data,
             params=params,
             verify=verify,
             headers=headers,
+            timeout=timeout,
         )
```

If AWSPrometheusConnect.signed_request doesn't accept timeout, enforce a per-call timeout in the SigV4 client implementation or wrap the call with a session that applies the timeout.
        
          
holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 (outdated; resolved)
        
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
holmes/plugins/toolsets/prometheus/prometheus.py (1)
285-327: Fix ZeroDivisionError, update docstring default, and round step to whole seconds.
Guard against non-positive step (avoids ZeroDivisionError), update the docstring to reflect the actual env var/default, and align rounding with the utils helper.
File: holmes/plugins/toolsets/prometheus/prometheus.py (function around lines ~285–327)
- Change the docstring line to: "Max points is controlled by the MAX_GRAPH_POINTS environment variable (default: 200)."
- Add a guard: `elif step <= 0: logging.warning(...); step = max(1, time_range_seconds / 60)` (treat non-positive like None).
- After computing adjusted_step, round up to whole seconds and return as float: `adjusted_step = max(1.0, float(math.ceil(adjusted_step)))` and `return max(1.0, float(math.ceil(step)))`.
- Add `import math` to module imports.

Note: holmes/common/env_vars.py defines `MAX_GRAPH_POINTS` defaulting to 200; holmes/plugins/toolsets/utils.py already uses `math.ceil` and returns an int — make prometheus behavior consistent (float of the ceiling).
MAX_GRAPH_POINTSdefaulting to 200; holmes/plugins/toolsets/utils.py already usesmath.ceiland returns an int — make prometheus behavior consistent (float of the ceiling).Run tests: tests/plugins/toolsets/test_prometheus_unit.py and tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py to confirm behavior.
♻️ Duplicate comments (4)
holmes/plugins/toolsets/prometheus/prometheus.py (4)
828-853: Avoid hard-coded 80K cutoff; derive from global context window budget and skip full dumps for sizing.
- Tie the threshold to the general TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT so Prometheus truncation is always below the global cap (prior comment).
 - Computing size via json.dumps duplicates memory for large payloads; estimate via counts or cap-serialize a slice instead.
 Please locate the global budget to wire into both instant/range paths:
```shell
#!/bin/bash
rg -n 'TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT|CONTEXT_WINDOW|MAX_OUTPUT_CHARS|RESULT_MAX_CHARS' -C2
```
1015-1037: Same 80K cutoff concern for range queries.
Mirror the instant-query advice: derive limit from the global context window budget and avoid full json.dumps for size estimation.
932-936: Make “step” truly optional and accept duration strings in schema.
You already parse durations; schema still declares type=number. Switch to string and clarify the text to reflect durations and seconds.

```diff
-            "step": ToolParameter(
-                description="Query resolution step width in duration format or float number of seconds",
-                type="number",
-                required=False,
-            ),
+            "step": ToolParameter(
+                description="Query resolution step width. Accepts Prometheus duration (e.g., '30s','5m','1h') or seconds (e.g., '30'). Omit to auto-calculate.",
+                type="string",
+                required=False,
+            ),
```
965-972: Sanitize parsed step before adjustment.
If a user passes 0/negative, we still hit division by zero inside adjust. Either keep the defensive fix in adjust (preferred) or also guard here.

```diff
-        step = parse_duration_to_seconds(params.get("step"))
+        step = parse_duration_to_seconds(params.get("step"))
+        if step is not None and step <= 0:
+            step = None  # trigger auto-calc
@@
-            step=step,
+            step=step,
```
🧹 Nitpick comments (2)
holmes/plugins/toolsets/prometheus/prometheus.py (2)
336-407: Large-result summarization: helpful; consider adaptive label and safer suggestion text.
Current suggestion hardcodes label names ("pod"/"instance"). Prefer selecting a present label from sampled metrics (e.g., pod > instance > container > node > job) and adapt the label_replace accordingly to reduce user confusion.
977-982: Consider sending whole-second step to the API.
Some Prometheus-compatible backends are strict about step granularity. Casting to a ceil'd int of seconds avoids ambiguity and guarantees points <= MAX_GRAPH_POINTS after adjustment.

```diff
-            "step": step,
+            "step": int(math.ceil(step)),
```

Add at top if not present: `import math`
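The suggested cast can be checked in isolation; a small sketch assuming the default cap of 200 points:

```python
import math

MAX_GRAPH_POINTS = 200  # assumed default after this PR

def step_for_api(time_range_seconds: float) -> int:
    # Ceil to whole seconds so the point count never exceeds the cap,
    # even when the raw step is fractional.
    raw_step = time_range_seconds / MAX_GRAPH_POINTS
    return int(math.ceil(raw_step))

assert step_for_api(3600) == 18
assert step_for_api(3601) == 19  # fractional step rounds up, not down
assert 3601 / step_for_api(3601) <= MAX_GRAPH_POINTS
```

Rounding down instead (`int(raw_step)`) would yield 18 for the 3601s range and slightly overshoot the cap, which is why ceiling is the safer direction.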
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- holmes/common/env_vars.py (1 hunks)
- holmes/plugins/toolsets/prometheus/prometheus.py (10 hunks)
- tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/plugins/toolsets/grafana/test_grafana_tempo_tools.py
 - holmes/common/env_vars.py
 
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Ruff for formatting and linting
Type hints are required (checked by mypy)
Always place Python imports at the top of the file, not inside functions or methods
Files:
holmes/plugins/toolsets/prometheus/prometheus.py
🧬 Code graph analysis (1)
holmes/plugins/toolsets/prometheus/prometheus.py (2)
holmes/plugins/toolsets/prometheus/utils.py (1)
parse_duration_to_seconds (7-30)
holmes/plugins/toolsets/utils.py (1)
adjust_step_for_max_points (199-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: llm_evals
 - GitHub Check: Pre-commit checks
 
🔇 Additional comments (2)
holmes/plugins/toolsets/prometheus/prometheus.py (2)
25-25: Good import placement; enables flexible step parsing.
Import is at top-level and aligns with mypy/ruff rules.
1191-1195: Expanded exception context in health check: LGTM.
Using logging.exception with exc_info improves traceability without overexposing details to users.
Actionable comments posted: 2
♻️ Duplicate comments (2)
holmes/plugins/toolsets/prometheus/utils.py (1)
1-2: Fixed: correct typing import.
The bad `Union` import from a previous revision is resolved. Good catch.
holmes/plugins/toolsets/prometheus/prometheus.py (1)
939-943: Make "step" a string to match description and allow durations.
Schema says "duration format or float" but `type="number"`. Switch to `"string"`; your parser already supports both.
Apply:

```diff
-            "step": ToolParameter(
-                description="Query resolution step width in duration format or float number of seconds",
-                type="number",
-                required=False,
-            ),
+            "step": ToolParameter(
+                description="Query resolution step width (e.g., 30s, 5m, 1h, or seconds as a number). Omit to auto-calculate.",
+                type="string",
+                required=False,
+            ),
```
🧹 Nitpick comments (7)
holmes/plugins/toolsets/prometheus/utils.py (3)
5-13: Add a short docstring to define accepted formats and rounding.
Clarify inputs (e.g., "5m", "1h30m", "123.45") and that unit-bearing values are floored to whole seconds.
Apply:

```diff
 def parse_duration_to_seconds(v: Optional[Union[str, float, int]]) -> Optional[float]:
+    """
+    Parse durations to seconds.
+    Accepts: None, numbers (int/float), numeric strings, or Prometheus-like
+    unit strings/compositions (e.g., "30s", "5m", "1h30m", "2.5s").
+    Note: unit-bearing values are floored to whole seconds (e.g., "2.9s" -> 2.0).
+    """
```
17-25: Avoid silently accepting partial matches (e.g., "1h30mX").
Current `findall` will parse valid tokens and ignore trailing garbage. Prefer full-string validation.
Apply:

```diff
-    pattern = r"(\d+(?:\.\d+)?)(d|h|m|s)"
-    matches = re.findall(pattern, s)
-
-    if matches:
-        total_seconds = 0.0
-        for value_str, unit in matches:
-            value = float(value_str)
-            total_seconds += value * units[unit]
-        return float(int(total_seconds))
+    token_pattern = r"(\d+(?:\.\d+)?)(d|h|m|s)"
+    full_pattern = rf"(?:\s*{token_pattern}\s*)+"
+    if re.fullmatch(full_pattern, s):
+        total_seconds = 0.0
+        for m in re.finditer(token_pattern, s):
+            value = float(m.group(1))
+            unit = m.group(2)
+            total_seconds += value * units[unit]
+        return float(int(total_seconds))
```
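The full-match approach can be sketched standalone. This is a hypothetical strict variant for illustration, not the module's actual code (and it skips the floor-to-whole-seconds step the real parser applies):

```python
import re

UNITS = {"d": 86400.0, "h": 3600.0, "m": 60.0, "s": 1.0}
TOKEN = r"(\d+(?:\.\d+)?)(d|h|m|s)"

def strict_parse(s: str) -> float:
    # Hypothetical strict variant: the whole string must consist of unit
    # tokens, so trailing garbage like "1h30mx" is rejected, not ignored.
    s = s.strip().lower()
    if re.fullmatch(rf"(?:{TOKEN})+", s):
        return sum(float(v) * UNITS[u] for v, u in re.findall(TOKEN, s))
    return float(s)  # plain number of seconds

assert strict_parse("1h30m") == 5400.0
assert strict_parse("45") == 45.0
```

With `fullmatch`, `"1h30mx"` falls through to `float(...)` and raises ValueError instead of silently parsing the valid prefix.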
27-28: Guard against NaN/Infinity in fallback.
`float(s)` will accept "nan"/"inf", which later breaks step math. Reject non-finite values.
Apply:

```diff
+import math
@@
-    return float(s)
+    val = float(s)
+    if not math.isfinite(val):
+        raise ValueError("Duration must be a finite number")
+    return val
```

tests/plugins/toolsets/prometheus/test_utils.py (1)
5-59: Solid coverage. Consider adding strictness/edge cases.
Add cases to define behavior for mixed-valid-with-garbage and non-finite inputs.
Apply:

```diff
 @@ class TestParseDurationToSeconds:
     def test_parse_duration_to_seconds(self, input_value, expected):
         result = parse_duration_to_seconds(input_value)
         assert result == expected
+
+    @pytest.mark.parametrize(
+        "invalid_input",
+        [
+            "1h30mX",  # garbage suffix
+            "2.5 m",   # space between number and unit not supported
+            "NaN",
+            "Infinity",
+        ],
+    )
+    def test_parse_duration_to_seconds_strictness(self, invalid_input):
+        with pytest.raises(ValueError):
+            parse_duration_to_seconds(invalid_input)
```

holmes/plugins/toolsets/prometheus/prometheus.py (3)
59-61: Cap response size: align with overall context budget.
Ensure `query_response_size_limit` is always strictly below the general truncation threshold (TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT-based). Otherwise this path may still truncate downstream.
Would you like a helper that reads the global limit and clamps this value at startup?
288-296: Docstring mentions 300 but default changed to 200.
Update text to avoid drift or reference the env var instead of a number.
Apply:

```diff
-    Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable (default: 300).
+    Max points is controlled by the PROMETHEUS_MAX_GRAPH_POINTS environment variable.
```
1202-1206: Minor: `logging.exception` already includes exc_info.
Drop `exc_info=True` to avoid redundancy.
Apply:

```diff
-        logging.exception("Failed to initialize Prometheus", exc_info=True)
+        logging.exception("Failed to initialize Prometheus")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- holmes/plugins/toolsets/prometheus/prometheus.py (11 hunks)
- holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 (1 hunks)
- holmes/plugins/toolsets/prometheus/utils.py (1 hunks)
- tests/plugins/toolsets/prometheus/test_prometheus_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2
 
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Ruff for formatting and linting
Type hints are required (checked by mypy)
Always place Python imports at the top of the file, not inside functions or methods
Files:
- holmes/plugins/toolsets/prometheus/prometheus.py
- tests/plugins/toolsets/prometheus/test_prometheus_utils.py
- holmes/plugins/toolsets/prometheus/utils.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Do not use pytest markers/tags that are not declared; only use markers listed in pyproject.toml
Files:
tests/plugins/toolsets/prometheus/test_prometheus_utils.py
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Tests should mirror the source structure under tests/
Files:
tests/plugins/toolsets/prometheus/test_prometheus_utils.py
🧬 Code graph analysis (2)
holmes/plugins/toolsets/prometheus/prometheus.py (2)
holmes/plugins/toolsets/prometheus/utils.py (1)
parse_duration_to_seconds (5-28)
holmes/plugins/toolsets/utils.py (1)
adjust_step_for_max_points (199-222)
tests/plugins/toolsets/prometheus/test_prometheus_utils.py (1)
holmes/plugins/toolsets/prometheus/utils.py (1)
parse_duration_to_seconds(5-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Pre-commit checks
 - GitHub Check: llm_evals
 
🔇 Additional comments (4)
holmes/plugins/toolsets/prometheus/prometheus.py (4)
25-25: Good reuse of shared parser. Importing `parse_duration_to_seconds` centralizes duration handling.
339-409: Large-result summarization: pragmatic and safe. Sampling labels only and estimating total points is a good trade-off for huge responses.
831-860: Large-result guard on instant queries looks good. The size check avoids flooding outputs; logging is informative.
1022-1048: Range query summarization path: LGTM. Consistent with instant path; helpful logs and compact summary.
No description provided.