Improvements to loki and datadog logging #1016

aantn · 2025-09-30T04:33:01Z

No description provided.

coderabbitai · 2025-09-30T04:33:08Z

Walkthrough

Adds wildcard-aware pod-name handling and centralized query-building for Datadog and Loki log tools. Introduces new PodLoggingTool subclasses for both providers, updates healthcheck behavior and error context, and adds Loki-specific instructions template. Refactors Datadog query construction into a helper and enhances Loki label matching to support wildcards.

Changes

Cohort / File(s)	Summary of Changes
Datadog logs toolset refactor & wildcard support `holmes/plugins/toolsets/datadog/toolset_datadog_logs.py`	Added build_datadog_query helper; replaced inline query construction; introduced DatadogPodLoggingTool with wildcard-aware pod parameter; switched toolset to use it; expanded healthcheck to treat NO_DATA as success and log context (time range, query, URL); added ToolParameter import.
Loki API wildcard label matching & error context `holmes/plugins/toolsets/grafana/loki_api.py`	Added wildcard handling for label values (exact, `*`, and simple patterns via regex); improved execute_loki_query error with URL/query/time/limit context; adjusted control flow for query construction with wildcards.
Loki toolset customization & instructions `holmes/plugins/toolsets/grafana/toolset_grafana_loki.py`, `holmes/plugins/toolsets/grafana/loki_instructions.jinja2`	Introduced LokiPodLoggingTool overriding parameter description to enable pod-name wildcards (via ToolParameter); updated GrafanaLokiToolset to use it; added _reload_instructions to load new loki_instructions.jinja2; imported os; added instructional template (documentation only).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant DatadogToolset as DatadogLogsToolset
  participant PodTool as DatadogPodLoggingTool
  participant DD as Datadog API

  rect rgba(230,240,255,0.5)
    note right of PodTool: Centralized query construction
    User->>DatadogToolset: fetch_pod_logs(params)
    DatadogToolset->>DatadogToolset: build_datadog_query(params, dd_config)
    DatadogToolset->>DD: fetch_paginated_logs(query, time range)
    DD-->>DatadogToolset: logs / NO_DATA
    DatadogToolset-->>User: logs / empty result + context
  end

  rect rgba(240,255,230,0.5)
    note right of DatadogToolset: Healthcheck treats NO_DATA as reachable
    User->>DatadogToolset: healthcheck()
    DatadogToolset->>DatadogToolset: build_datadog_query(wildcard-capable)
    DatadogToolset->>DD: query (short time range)
    alt Data or NO_DATA
      DD-->>DatadogToolset: response
      DatadogToolset-->>User: success (with URL, time range, query)
    else Error
      DD-->>DatadogToolset: error
      DatadogToolset-->>User: failure (with context)
    end
  end

sequenceDiagram
  autonumber
  participant User
  participant LokiToolset as GrafanaLokiToolset
  participant PodTool as LokiPodLoggingTool
  participant Loki as Loki API

  rect rgba(255,245,230,0.6)
    User->>LokiToolset: fetch_pod_logs(params)
    LokiToolset->>LokiToolset: _reload_instructions()
    note over PodTool,LokiToolset: pod_name supports wildcards (exact, "*", simple patterns)
    LokiToolset->>Loki: query_loki_logs_by_label(pod=pattern)
    alt Exact label
      Loki-->>LokiToolset: logs
    else Wildcard pattern
      LokiToolset->>Loki: regex-matched query
      Loki-->>LokiToolset: logs
    end
    LokiToolset-->>User: logs or error (with context)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Improve historical data querying for logs and metrics #905: Also modifies Datadog logs toolset for wildcard/historical querying; overlaps query construction and tool initialization.
Datadog improvements #973: Touches the same Datadog log query paths (including URL generation) now centralized via build_datadog_query.
ROB-1723: datadog logs toolset #636: Earlier changes to Datadog toolset and query logic in the same module.

Suggested reviewers

moshemorad
arikalon1

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	There is no pull request description provided, so there is no contextual information explaining the purpose, scope, or details of the changeset.	Please add a brief description summarizing the objectives of the pull request, including the key improvements made to the Loki and Datadog logging toolsets and any relevant usage or behavior changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title “Improvements to loki and datadog logging” succinctly captures the primary focus of the pull request by highlighting enhancements to both the Grafana Loki and Datadog logging toolsets without unnecessary detail or noise, making it clear and specific enough for a teammate to understand the main change at a glance.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch logging-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

Public repositories are always opted into early access features.
You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

holmes/plugins/toolsets/datadog/toolset_datadog_logs.py (2)
438-469: Consider using logging.debug() for debug messages.

The healthcheck correctly uses wildcards and the timestamp calculations are accurate. However, lines 459-468 and 471-473 use logging.info() for messages prefixed with "DEBUG:". These should use logging.debug() for proper log level semantics.

Apply this diff:
-            logging.info(
-                f"DEBUG: Running Datadog healthcheck with params: namespace={healthcheck_params.namespace}, pod_name={healthcheck_params.pod_name}, start_time={healthcheck_params.start_time}"
-            )
-            logging.info(
-                f"DEBUG: Healthcheck time range: from {from_dt.isoformat()} to {to_dt.isoformat()}"
-            )
+            logging.debug(
+                f"Running Datadog healthcheck with params: namespace={healthcheck_params.namespace}, pod_name={healthcheck_params.pod_name}, start_time={healthcheck_params.start_time}"
+            )
+            logging.debug(
+                f"Healthcheck time range: from {from_dt.isoformat()} to {to_dt.isoformat()}"
+            )
             if self.dd_config:
-                logging.info(
-                    f"DEBUG: Healthcheck query will be sent to: {self.dd_config.site_api_url}"
-                )
+                logging.debug(
+                    f"Healthcheck query will be sent to: {self.dd_config.site_api_url}"
+                )
470-485: LGTM! Sensible healthcheck semantics.

The change to treat NO_DATA as a successful healthcheck is correct—the healthcheck validates API connectivity and credentials, not data availability. The additional debug logging (though consider using logging.debug() instead of logging.info() for lines 471-473 and 482-484) provides helpful diagnostic context.

Consider applying this diff for consistent log levels:
-            logging.info(
-                f"DEBUG: Healthcheck result status: {result.status}, error: {result.error}, data length: {len(result.data or '')}"
-            )
+            logging.debug(
+                f"Healthcheck result status: {result.status}, error: {result.error}, data length: {len(result.data or '')}"
+            )
And:
-                logging.info(
-                    "Datadog healthcheck completed successfully (no data found, but API is accessible)"
-                )
+                logging.debug(
+                    "Datadog healthcheck completed successfully (no data found, but API is accessible)"
+                )
holmes/plugins/toolsets/grafana/loki_api.py (1)

86-96: LGTM! Wildcard support correctly implemented.

The query construction properly handles three cases:

"*" → regex match any non-empty value

Contains "*" → convert to regex pattern (e.g., "payment-*" → "payment-.*")

No wildcards → exact match

The implementation aligns with the Loki instructions template and will work correctly for typical Kubernetes pod names.

Optional edge case consideration:
If label_value contains other regex special characters (., +, ?, etc.), they'll be interpreted as regex metacharacters. For example, "my.pod" would match "myzpod". While Kubernetes pod names rarely have such characters, you could escape them with re.escape() before replacing "*" if stricter matching is desired.

holmes/plugins/toolsets/grafana/toolset_grafana_loki.py (1)

46-58: Consider more specific return type for better type safety.

The return type dict is correct and matches the parent class signature, but dict[str, ToolParameter] would provide better type hints for mypy. However, this would require updating the parent class signature in logging_api.py as well.

The implementation logic is sound and properly extends the base class to add wildcard support.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d00778f and 199e083.

📒 Files selected for processing (4)

holmes/plugins/toolsets/datadog/toolset_datadog_logs.py (7 hunks)
holmes/plugins/toolsets/grafana/loki_api.py (2 hunks)
holmes/plugins/toolsets/grafana/loki_instructions.jinja2 (1 hunks)
holmes/plugins/toolsets/grafana/toolset_grafana_loki.py (5 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

holmes/plugins/toolsets/grafana/loki_api.py
holmes/plugins/toolsets/grafana/toolset_grafana_loki.py
holmes/plugins/toolsets/datadog/toolset_datadog_logs.py

🧬 Code graph analysis (2)

holmes/plugins/toolsets/grafana/toolset_grafana_loki.py (2)

holmes/core/tools.py (3)

StructuredToolResult (78-102)

ToolParameter (154-159)

_load_llm_instructions (730-735)

holmes/plugins/toolsets/logging_utils/logging_api.py (3)

PodLoggingTool (77-205)

_get_tool_parameters (113-176)

BasePodLoggingToolset (60-74)

holmes/plugins/toolsets/datadog/toolset_datadog_logs.py (4)

holmes/core/tools.py (2)

ToolParameter (154-159)

StructuredToolResultStatus (51-75)

holmes/plugins/toolsets/logging_utils/logging_api.py (4)

FetchPodLogsParams (42-57)

PodLoggingTool (77-205)

_get_tool_parameters (113-176)

BasePodLoggingToolset (60-74)

holmes/plugins/toolsets/grafana/toolset_grafana_loki.py (1)

_get_tool_parameters (46-58)

holmes/plugins/toolsets/utils.py (1)

process_timestamps_to_int (90-136)

🪛 Ruff (0.13.1)

holmes/plugins/toolsets/grafana/loki_api.py

68-68: Use explicit conversion flag

Replace with conversion flag

(RUF010)

70-70: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

70-70: Create your own exception

(TRY002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Pre-commit checks
GitHub Check: llm_evals

🔇 Additional comments (8)

holmes/plugins/toolsets/grafana/loki_instructions.jinja2 (1)

1-76: LGTM! Comprehensive Loki usage guide.

This template provides clear, actionable guidance for querying Loki logs with appropriate emphasis on wildcard usage for pod names. The examples and investigation patterns are well-structured and should help users effectively retrieve logs.

holmes/plugins/toolsets/datadog/toolset_datadog_logs.py (6)

31-31: LGTM!

Import properly placed at the top with other imports and is used by DatadogPodLoggingTool.

72-94: LGTM! Good refactor centralizing query construction.

The helper function properly handles wildcards by excluding single "" from filters, correctly escapes quotes in filter strings, and provides a sensible fallback to "" for empty queries. This centralization reduces duplication across multiple call sites.

226-241: LGTM! Consistent wildcard support implementation.

The DatadogPodLoggingTool class correctly extends PodLoggingTool and adds wildcard support to the pod_name parameter, mirroring the pattern used in LokiPodLoggingTool.

113-114: LGTM! Successful refactor to use centralized query builder.

Good use of the new build_datadog_query helper to eliminate duplication.

195-196: LGTM! Consistent use of centralized query builder.

Properly uses build_datadog_query to maintain consistency with fetch_paginated_logs.

265-266: LGTM! Enables wildcard support for Datadog logs.

Correctly switches to DatadogPodLoggingTool to provide wildcard-aware parameter descriptions.

holmes/plugins/toolsets/grafana/toolset_grafana_loki.py (1)

77-79: LGTM! Clean integration of custom tool and instruction loading.

The instantiation of LokiPodLoggingTool properly replaces the base tool with wildcard-aware functionality, and calling _reload_instructions() at the end of __init__ ensures Loki-specific guidance is loaded.

coderabbitai · 2025-09-30T04:37:24Z

holmes/plugins/toolsets/grafana/loki_api.py

+        error_details = (
+            f"Failed to query Loki logs:\n"
+            f"  URL: {url}\n"
+            f"  Query: {query}\n"
+            f"  Time range: {start} to {end}\n"
+            f"  Limit: {limit}\n"
+            f"  Error: {str(e)}"
+        )
+        raise Exception(error_details)


⚠️ Potential issue | 🟠 Major

Add exception chaining to preserve context.

The enhanced error message with query details is excellent for debugging. However, line 70 should use raise ... from e to preserve the exception chain, which is important for debugging and traceback clarity.

Apply this diff:

error_details = ( f"Failed to query Loki logs:\n" f" URL: {url}\n" f" Query: {query}\n" f" Time range: {start} to {end}\n" f" Limit: {limit}\n" - f" Error: {str(e)}" + f" Error: {e!s}" ) - raise Exception(error_details) + raise Exception(error_details) from e

Optional improvements:

Line 68: Use {e!s} instead of {str(e)} for cleaner f-string formatting.

Line 70: Consider creating a custom LokiQueryError exception class instead of generic Exception for better error handling downstream.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

error_details = (

f"Failed to query Loki logs:\n"

f" URL: {url}\n"

f" Query: {query}\n"

f" Time range: {start} to {end}\n"

f" Limit: {limit}\n"

f" Error: {str(e)}"

)

raise Exception(error_details)

error_details = (

f"Failed to query Loki logs:\n"

f" URL: {url}\n"

f" Query: {query}\n"

f" Time range: {start} to {end}\n"

f" Limit: {limit}\n"

f" Error: {e!s}"

)

raise Exception(error_details) from e

🧰 Tools

🪛 Ruff (0.13.1)

68-68: Use explicit conversion flag

Replace with conversion flag

(RUF010)

70-70: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

70-70: Create your own exception

(TRY002)

🤖 Prompt for AI Agents

In holmes/plugins/toolsets/grafana/loki_api.py around lines 62 to 70, the code builds a detailed error_details string and raises a plain Exception without chaining; update the raise to preserve the original exception by using "raise Exception(error_details) from e". While editing, replace "{str(e)}" with "{e!s}" in the f-string for cleaner formatting, and optionally consider defining and raising a custom LokiQueryError instead of a generic Exception for clearer downstream handling.

coderabbitai · 2025-09-30T04:37:24Z

holmes/plugins/toolsets/grafana/toolset_grafana_loki.py

+    def _reload_instructions(self):
+        """Load Loki specific instructions."""
+        template_file_path = os.path.abspath(
+            os.path.join(os.path.dirname(__file__), "loki_instructions.jinja2")
+        )
+        self._load_llm_instructions(jinja_template=f"file://{template_file_path}")


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Add return type hint for mypy compliance.

The method is missing a return type hint, which is required by the coding guidelines. Add -> None to the method signature.

Apply this diff:

- def _reload_instructions(self): + def _reload_instructions(self) -> None: """Load Loki specific instructions."""

Verify the loki_instructions.jinja2 template file exists.

Ensure the template file is present in the same directory as this file, as the path construction expects it to be co-located.

Run this script to verify the template file exists:

🏁 Script executed:

#!/bin/bash # Verify that loki_instructions.jinja2 exists in the same directory fd -t f "loki_instructions.jinja2" holmes/plugins/toolsets/grafana/

Length of output: 56

Add missing return type hint for mypy compliance.

Apply this diff:

- def _reload_instructions(self): + def _reload_instructions(self) -> None: """Load Loki specific instructions."""

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _reload_instructions(self):

"""Load Loki specific instructions."""

template_file_path = os.path.abspath(

os.path.join(os.path.dirname(__file__), "loki_instructions.jinja2")

)

self._load_llm_instructions(jinja_template=f"file://{template_file_path}")

def _reload_instructions(self) -> None:

"""Load Loki specific instructions."""

template_file_path = os.path.abspath(

os.path.join(os.path.dirname(__file__), "loki_instructions.jinja2")

)

self._load_llm_instructions(jinja_template=f"file://{template_file_path}")

🤖 Prompt for AI Agents

In holmes/plugins/toolsets/grafana/toolset_grafana_loki.py around lines 138 to 143, the _reload_instructions method is missing a return type hint which fails mypy; add an explicit return type hint (-> None) to the def signature so it reads def _reload_instructions(self) -> None:, leaving the body unchanged.

github-actions · 2025-09-30T04:43:32Z

Results of HolmesGPT evals

ask_holmes: 33/36 test cases were successful, 1 regressions, 1 setup failures

Test suite	Test case	Status
ask	01_how_many_pods	✅
ask	02_what_is_wrong_with_pod	✅
ask	04_related_k8s_events	✅
ask	05_image_version	✅
ask	09_crashpod	✅
ask	10_image_pull_backoff	✅
ask	110_k8s_events_image_pull	✅
ask	11_init_containers	✅
ask	13a_pending_node_selector_basic	✅
ask	14_pending_resources	✅
ask	15_failed_readiness_probe	✅
ask	17_oom_kill	✅
ask	19_detect_missing_app_details	✅
ask	20_long_log_file_search	✅
ask	24_misconfigured_pvc	✅
ask	24a_misconfigured_pvc_basic	✅
ask	28_permissions_error	🚧
ask	33_cpu_metrics_discovery	❌
ask	39_failed_toolset	✅
ask	41_setup_argo	✅
ask	42_dns_issues_steps_new_tools	⚠️
ask	43_current_datetime_from_prompt	✅
ask	45_fetch_deployment_logs_simple	✅
ask	51_logs_summarize_errors	✅
ask	53_logs_find_term	✅
ask	54_not_truncated_when_getting_pods	✅
ask	59_label_based_counting	✅
ask	60_count_less_than	✅
ask	61_exact_match_counting	✅
ask	63_fetch_error_logs_no_errors	✅
ask	79_configmap_mount_issue	✅
ask	83_secret_not_found	✅
ask	86_configmap_like_but_secret	✅
ask	93_calling_datadog[0]	✅
ask	93_calling_datadog[1]	✅
ask	93_calling_datadog[2]	✅

Legend

✅ the test was successful
:minus: the test was skipped
⚠️ the test failed but is known to be flaky or known to fail
🚧 the test had a setup failure (not a code regression)
🔧 the test failed due to mock data issues (not a code regression)
🚫 the test was throttled by API rate limits/overload
❌ the test failed and should be fixed before merging the PR

WIP

199e083

coderabbitai bot reviewed Sep 30, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improvements to loki and datadog logging #1016

Improvements to loki and datadog logging #1016

aantn commented Sep 30, 2025

Uh oh!

coderabbitai bot commented Sep 30, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 30, 2025

Uh oh!

coderabbitai bot Sep 30, 2025

Uh oh!

github-actions bot commented Sep 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Improvements to loki and datadog logging #1016

Are you sure you want to change the base?

Improvements to loki and datadog logging #1016

Conversation

aantn commented Sep 30, 2025

Uh oh!

coderabbitai bot commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Sep 30, 2025

Results of HolmesGPT evals

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai bot commented Sep 30, 2025 •

edited

Loading