
Conversation

@nherment

No description provided.


coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds a ToolInvokeContext model and threads the LLM and a per-tool max_token_count through the tool invocation paths; introduces token-counting and per-tool max-token helpers; changes Tool.invoke/_invoke signatures across many plugins to accept the context; integrates truncation and updates tests/mocks to construct and pass contexts.
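For orientation, a minimal sketch of what the new context model likely looks like, based on the fields listed in the changes below; the pydantic config and the default value are assumptions, not the actual definition in holmes/core/tools.py:

```python
from pydantic import BaseModel, ConfigDict

from holmes.core.llm import LLM


class ToolInvokeContext(BaseModel):
    # arbitrary_types_allowed is an assumption so the non-pydantic LLM object can be a field
    model_config = ConfigDict(arbitrary_types_allowed=True)

    llm: LLM                     # used by tools for token counting and truncation
    tool_number: int             # ordinal of the tool call within the investigation
    user_approved: bool = False  # replaces the old user_approved: bool parameter
    max_token_count: int         # per-tool budget derived from the context window
```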

Changes

  • Core tool-calling entry (holmes/core/tool_calling_llm.py): Compute per-tool max_token_count via get_max_token_count_for_single_tool(llm), build a ToolInvokeContext, pass it to tool.invoke(...), and apply token-aware counting/truncation while preserving error paths.
  • Tool base API (holmes/core/tools.py): Add ToolInvokeContext (pydantic) and change the Tool.invoke, Tool._invoke, and YAMLTool._invoke signatures to accept context: ToolInvokeContext; propagate tool_number, user_approved, llm, and max_token_count.
  • Tool executor (holmes/core/tools_utils/tool_executor.py): Update ToolExecutor.invoke(tool_name, params, context) to accept and forward the ToolInvokeContext to Tool.invoke.
  • Token utilities & limiter (holmes/core/tools_utils/tool_context_window_limiter.py, holmes/core/tools_utils/token_counting.py): Add get_pct_token_count, get_max_token_count_for_single_tool, and count_tool_response_tokens; centralize token-based max computation and standardize oversized-tool-result handling and reporting.
  • Models / formatting (holmes/core/models.py): Move params-prefixing into format_tool_result_data; ToolCallResult.as_tool_call_message delegates to the centralized formatter.
  • Logging truncation (holmes/plugins/toolsets/logging_utils/logging_api.py): Add truncation constants and truncate_logs(...); change PodLoggingTool._invoke to accept a ToolInvokeContext and call truncate_logs(result, context.llm, context.max_token_count).
  • Prometheus toolset (holmes/plugins/toolsets/prometheus/prometheus.py): Add MetricsBasedResponse, create_structured_tool_result, token-based sizing, and large-result handling; switch _invoke signatures to accept ToolInvokeContext.
  • Git toolset & helpers (holmes/plugins/toolsets/git.py): Switch _invoke signatures to accept ToolInvokeContext; add branch/PR tracking helpers, _sanitize_error, prerequisites_callable, and get_example_config.
  • Plugin signature updates, many modules (holmes/plugins/toolsets/...): Replace user_approved: bool with context: ToolInvokeContext across many plugin modules (e.g., atlas_mongodb, azure_sql, bash, datadog, grafana, internet, kafka, mcp, newrelic, opensearch, rabbitmq, robusta, runbook, servicenow, investigator, etc.); add imports and adjust internal usage where applicable.
  • Logging / truncation integration (holmes/plugins/toolsets/logging_utils/*): Integrate token counting/truncation into logging tool flows and expose the truncate_logs utility.
  • Tests, mocks, fixtures (tests/conftest.py, tests/llm/utils/mock_toolset.py, tests/mocks/toolset_mocks.py, tests/**): Add MockLLM and create_mock_tool_invoke_context; update MockableToolWrapper, mocks, and numerous tests to construct and pass ToolInvokeContext to invoke/_invoke; adjust test call sites and expectations accordingly.
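
A hedged sketch of the two token helpers named above; the allocation percentage and the exact counting call (count_tokens_for_message) are assumptions inferred from elsewhere in this thread, not the real implementations:

```python
from holmes.core.llm import LLM
from holmes.core.tools import StructuredToolResult

# Assumed allocation: each tool result may use at most this share of the context window.
TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT = 10


def get_max_token_count_for_single_tool(llm: LLM) -> int:
    # Derive the per-tool budget from the model's context window rather than a fixed constant.
    return llm.get_context_window_size() * TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT // 100


def count_tool_response_tokens(llm: LLM, structured_tool_result: StructuredToolResult) -> int:
    # Count the tokens the serialized tool result would occupy in the conversation.
    return llm.count_tokens_for_message(
        [{"role": "tool", "content": structured_tool_result.get_stringified_data()}]
    )
```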

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Caller as ToolCallingLLM
  participant LLM
  participant Limiter as WindowLimiter
  participant Executor as ToolExecutor
  participant ToolImpl as Tool

  User->>Caller: request tool call
  Caller->>LLM: get_context_window_size()/get_maximum_output_token()
  Caller->>Limiter: get_max_token_count_for_single_tool(llm)
  Limiter-->>Caller: max_token_count
  Note right of Caller #DFF2E1: build ToolInvokeContext {llm, tool_number, user_approved, max_token_count}

  Caller->>Executor: invoke(tool_name, params, context)
  Executor->>ToolImpl: invoke(params, context)
  ToolImpl->>ToolImpl: _invoke(params, context)
  ToolImpl-->>Executor: StructuredToolResult
  Executor-->>Caller: StructuredToolResult
  Caller->>LLM: count_tool_response_tokens(llm, result)
  alt result too large
    Caller->>Caller: truncate_logs(result, llm, max_token_count)
  end
  Caller-->>User: tool result (possibly truncated)
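A sketch of the caller-side flow the diagram describes, using the names shown above; the real holmes/core/tool_calling_llm.py also handles approvals, error paths, and message assembly, so treat this as illustrative only:

```python
from holmes.core.llm import LLM
from holmes.core.tools import StructuredToolResult, ToolInvokeContext
from holmes.core.tools_utils.token_counting import count_tool_response_tokens
from holmes.core.tools_utils.tool_context_window_limiter import (
    get_max_token_count_for_single_tool,
)
from holmes.core.tools_utils.tool_executor import ToolExecutor


def invoke_tool_with_budget(
    llm: LLM,
    tool_executor: ToolExecutor,
    tool_name: str,
    params: dict,
    tool_number: int,
    user_approved: bool = False,
) -> StructuredToolResult:
    # Steps 1-4: derive the per-tool budget and build the invocation context.
    max_token_count = get_max_token_count_for_single_tool(llm)
    context = ToolInvokeContext(
        llm=llm,
        tool_number=tool_number,
        user_approved=user_approved,
        max_token_count=max_token_count,
    )
    # Steps 5-9: the executor forwards the context down to Tool._invoke.
    result = tool_executor.invoke(tool_name, params, context)
    # Steps 10-12: re-count the result; oversized results are handled centrally
    # (truncation or an explanatory error) before being returned to the model.
    token_count = count_tool_response_tokens(llm=llm, structured_tool_result=result)
    if token_count > max_token_count:
        pass  # handled by the central limiter in the real code
    return result
```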

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • moshemorad
  • arikalon1

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 31.03%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description Check (❓ Inconclusive): There is no author-provided pull request description, so the check cannot tell whether a description relates to the changeset. Add a brief summary of the key changes, such as the new ToolInvokeContext, the token-counting utilities, and the signature updates across tools and tests.

✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The title "ROB-2136 link tool limit to global truncation limit" concisely reflects the core change of tying the tool output limit to the global truncation mechanism; including the issue identifier does not detract from clarity.

@nherment nherment marked this pull request as ready for review September 23, 2025 10:54

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
holmes/plugins/toolsets/datadog/toolset_datadog_traces.py (3)

303-317: Include query/time/index details in NO_DATA response

Per plugin guidelines, specify what was searched and where; also include invocation (url/payload).

-            if not formatted_output:
-                return StructuredToolResult(
-                    status=StructuredToolResultStatus.NO_DATA,
-                    params=params,
-                    data="No matching traces found.",
-                )
+            if not formatted_output:
+                return StructuredToolResult(
+                    status=StructuredToolResultStatus.NO_DATA,
+                    params=params,
+                    data=(
+                        f"No matching traces found for query='{query}', "
+                        f"time_range_ms=[{from_time_ms},{to_time_ms}], "
+                        f"indexes={self.toolset.dd_config.indexes}"
+                    ),
+                    invocation=json.dumps({"url": url, "payload": payload}),
+                )

441-447: Include time window and invocation in NO_DATA response

Make clear what was searched (trace_id + time range + indexes) and where.

-            if not formatted_output:
-                return StructuredToolResult(
-                    status=StructuredToolResultStatus.NO_DATA,
-                    params=params,
-                    data=f"No trace found for trace_id: {trace_id}",
-                )
+            if not formatted_output:
+                return StructuredToolResult(
+                    status=StructuredToolResultStatus.NO_DATA,
+                    params=params,
+                    data=(
+                        f"No trace found for trace_id='{trace_id}' within "
+                        f"time_range_ms=[{from_time_ms},{to_time_ms}] "
+                        f"on indexes={self.toolset.dd_config.indexes}"
+                    ),
+                    invocation=json.dumps({"url": url, "payload": payload}),
+                )

648-656: Include query/time/index details in NO_DATA response

Return exactly what was searched and where; add invocation details too.

-            if not formatted_output:
-                return StructuredToolResult(
-                    status=StructuredToolResultStatus.NO_DATA,
-                    params=params,
-                    data="No matching spans found.",
-                )
+            if not formatted_output:
+                return StructuredToolResult(
+                    status=StructuredToolResultStatus.NO_DATA,
+                    params=params,
+                    data=(
+                        f"No matching spans found for query='{query}', "
+                        f"time_range_ms=[{from_time_ms},{to_time_ms}], "
+                        f"indexes={self.toolset.dd_config.indexes}"
+                    ),
+                    invocation=json.dumps({"url": url, "payload": payload}),
+                )
tests/llm/utils/test_mock_toolset.py (3)

249-255: Fix assertion to expect context forwarding to the underlying tool.

wrapper.invoke({}, context) should forward the same context to tool.invoke; update the assertion accordingly.

Apply this diff:

-            tool.invoke.assert_called_once_with({})
+            tool.invoke.assert_called_once_with({}, context)

305-307: Pass context to wrapper.invoke to match new signature.

This callsite still omits context.

Apply this diff:

-            with pytest.raises(MockDataNotFoundError) as exc_info:
-                wrapper.invoke({})
+            with pytest.raises(MockDataNotFoundError) as exc_info:
+                wrapper.invoke({}, create_mock_tool_invoke_context())

475-477: Update ToolExecutor.invoke callsites to pass a ToolInvokeContext.

ToolExecutor.invoke now requires a context; these callsites will fail without it.

Apply these diffs:

-                result = tool_executor.invoke("kubectl_describe", params)
+                context = create_mock_tool_invoke_context()
+                result = tool_executor.invoke("kubectl_describe", params, context)
-                result = tool_executor.invoke("kubectl_describe", params)
+                context = create_mock_tool_invoke_context()
+                result = tool_executor.invoke("kubectl_describe", params, context)
-                with pytest.raises(MockDataNotFoundError):
-                    tool_executor.invoke("kubectl_describe", params)
+                with pytest.raises(MockDataNotFoundError):
+                    tool_executor.invoke("kubectl_describe", params, create_mock_tool_invoke_context())
-                        result = tool_executor.invoke(
-                            "kubectl_describe", {"foo": "bar"}
-                        )
+                        context = create_mock_tool_invoke_context()
+                        result = tool_executor.invoke(
+                            "kubectl_describe", {"foo": "bar"}, context
+                        )

Also applies to: 531-533, 592-595, 637-639

holmes/plugins/toolsets/grafana/toolset_grafana.py (1)

47-56: Add an HTTP timeout to prevent request-thread hangs.

External calls without timeouts can hang and cause cascading issues.

Apply this diff:

-        try:
-            response = requests.get(url, headers=headers)
+        try:
+            # Consume context to satisfy linters and for future use
+            _ = context
+            timeout = getattr(self._toolset._grafana_config, "request_timeout", 30)
+            response = requests.get(url, headers=headers, timeout=timeout)
holmes/plugins/toolsets/internet/internet.py (1)

190-216: Return ERROR when scrape reports failure; add invocation (uses context).

Currently, scrape returns an error string which is treated as SUCCESS data. Fix by detecting and surfacing it as an error; also add invocation to resolve Ruff ARG002.

@@
-        content, mime_type = scrape(url, additional_headers)
+        content, mime_type = scrape(url, additional_headers)
+
+        # Detect and surface upstream fetch errors returned by scrape()
+        if content and mime_type is None and not looks_like_html(content) and content.lower().startswith("failed to load"):
+            return StructuredToolResult(
+                status=StructuredToolResultStatus.ERROR,
+                error=content,
+                params=params,
+                invocation=f"{self.get_parameterized_one_liner(params)} (tool_no={context.tool_number})",
+            )
@@
-        if not content:
+        if not content:
             logging.error(f"Failed to retrieve content from {url}")
             return StructuredToolResult(
                 status=StructuredToolResultStatus.ERROR,
                 error=f"Failed to retrieve content from {url}",
-                params=params,
+                params=params,
+                invocation=f"{self.get_parameterized_one_liner(params)} (tool_no={context.tool_number})",
             )
@@
-        return StructuredToolResult(
-            status=StructuredToolResultStatus.SUCCESS,
-            data=content,
-            params=params,
-        )
+        return StructuredToolResult(
+            status=StructuredToolResultStatus.SUCCESS,
+            data=content,
+            params=params,
+            invocation=f"{self.get_parameterized_one_liner(params)} (tool_no={context.tool_number})",
+        )

Optional: consider truncating content based on context.max_token_count if available.

holmes/plugins/toolsets/opensearch/opensearch.py (1)

60-74: Wrong client selection logic; may return the first client even when host belongs to a different client.

found = any(host in client.hosts for client in clients) checks across all clients, not the current one, causing incorrect matches. Return the client whose hosts contain the requested host.

Apply:

 def get_client(clients: List[OpenSearchClient], host: Optional[str]):
     if len(clients) == 1:
         return clients[0]
 
     if not host:
         raise Exception("Missing host to resolve opensearch client")
 
     for client in clients:
-        found = any(host in client.hosts for client in clients)
+        found = host in client.hosts
         if found:
             return client
tests/llm/utils/mock_toolset.py (1)

442-470: Fix missing context when calling live/mock during GENERATE mode.

Both calls currently omit context and will raise TypeError.

 if self._mode == MockMode.GENERATE:
     try:
-        return self._call_mock_invoke(params)
+        return self._call_mock_invoke(params, context)
     except MockDataNotFoundError as e:
         logging.debug(str(e))

-    result = self._call_live_invoke(params)
+    result = self._call_live_invoke(params, context)
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py (1)

199-346: Move import logging to top-level to comply with project import policy.

Inline import at Line 241 violates the “imports at top of file” rule. Remove the inline import and add a top-level import.

-            # Debug log the query (useful for troubleshooting)
-            import logging
-
+            # Debug log the query (useful for troubleshooting)
             logger = logging.getLogger(__name__)

Also add at file top:

+import logging


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tests/llm/utils/test_mock_toolset.py (4)

249-255: Fix assertion to match new Tool.invoke(params, context) signature

MockableToolWrapper should call tool.invoke(params, context). Update the assertion.

Apply this diff:

-            tool.invoke.assert_called_once_with({})
+            tool.invoke.assert_called_once_with({}, context)

328-334: Fix assertion to match new Tool.invoke(params, context) signature

In generate mode, the real tool is called with context. Update the assertion.

Apply this diff:

-            tool.invoke.assert_called_once_with({})
+            tool.invoke.assert_called_once_with({}, context)

592-599: Update ToolExecutor.invoke call to include context

This call uses the old signature and will raise a TypeError. Create a context and pass it.

Apply this diff:

-                # In mock mode, calling with non-matching params should raise MockDataNotFoundError
-                with pytest.raises(MockDataNotFoundError):
-                    tool_executor.invoke("kubectl_describe", params)
+                # In mock mode, calling with non-matching params should raise MockDataNotFoundError
+                context = create_mock_tool_invoke_context()
+                with pytest.raises(MockDataNotFoundError):
+                    tool_executor.invoke("kubectl_describe", params, context)

624-644: Remove duplicate local import; rely on top-level import

Consolidate imports at module scope; drop this local import.

Apply this diff:

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

The subsequent context creation and invocation are already correct.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6ea5bf and 5b8e55a.

📒 Files selected for processing (1)
  • tests/llm/utils/test_mock_toolset.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)

Files:

  • tests/llm/utils/test_mock_toolset.py
tests/llm/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Always run LLM tests with RUN_LIVE=true to ensure results match real-world behavior

Files:

  • tests/llm/utils/test_mock_toolset.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags

Files:

  • tests/llm/utils/test_mock_toolset.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test layout should mirror the source structure under tests/

Files:

  • tests/llm/utils/test_mock_toolset.py
🧬 Code graph analysis (1)
tests/llm/utils/test_mock_toolset.py (3)
tests/conftest.py (1)
  • create_mock_tool_invoke_context (107-133)
holmes/core/tools_utils/tool_executor.py (1)
  • invoke (52-62)
holmes/core/tools.py (1)
  • invoke (235-261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.11)
  • GitHub Check: llm_evals
🔇 Additional comments (6)
tests/llm/utils/test_mock_toolset.py (6)

14-14: Context import LGTM

Importing the factory aligns tests with the new context-aware API.


283-289: LGTM: context passed through wrapper in mock mode

Invocation correctly supplies context; assertion of no real call is valid.


292-311: LGTM: error path validates context flow

Context is created and passed; exception assertions look correct.


478-480: LGTM: ToolExecutor.invoke now correctly receives context

The invocation is aligned with the new invoke(tool_name, params, context) API.


535-537: LGTM: matches new ToolExecutor signature

Context creation and passing are correct.


641-644: LGTM: generate mode invocation passes context

Aligned with updated executor API.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/llm/utils/test_mock_toolset.py (2)

249-254: Fix assertion: include context in call to tool.invoke

Now that wrapper invokes tool.invoke(params, context), the assertion must include the context.

Apply this diff:

-            tool.invoke.assert_called_once_with({})
+            tool.invoke.assert_called_once_with({}, context)

328-334: Fix assertion: include context in call to tool.invoke (generate mode)

The real tool is called with (params, context); update the assertion accordingly.

Apply this diff:

-            tool.invoke.assert_called_once_with({})
+            tool.invoke.assert_called_once_with({}, context)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b8e55a and 80bfe84.

📒 Files selected for processing (1)
  • tests/llm/utils/test_mock_toolset.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)

Files:

  • tests/llm/utils/test_mock_toolset.py
tests/llm/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Always run LLM tests with RUN_LIVE=true to ensure results match real-world behavior

Files:

  • tests/llm/utils/test_mock_toolset.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags

Files:

  • tests/llm/utils/test_mock_toolset.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test layout should mirror the source structure under tests/

Files:

  • tests/llm/utils/test_mock_toolset.py
🧬 Code graph analysis (1)
tests/llm/utils/test_mock_toolset.py (4)
tests/conftest.py (1)
  • create_mock_tool_invoke_context (107-133)
holmes/core/tools_utils/tool_executor.py (1)
  • invoke (52-62)
holmes/core/tools.py (1)
  • invoke (235-261)
tests/llm/utils/mock_toolset.py (1)
  • MockDataNotFoundError (45-48)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
🔇 Additional comments (10)
tests/llm/utils/test_mock_toolset.py (10)

14-14: LGTM: importing the context factory

Importing create_mock_tool_invoke_context at module scope aligns with the new context-aware API.


283-289: LGTM: context passed through in mock mode

Passing context to wrapper.invoke is correct; assert_not_called remains valid.


292-309: LGTM: context passed to wrapper.invoke in error path

Good propagation of context for the error case.


8-16: Move ToolExecutor import to module top (and deduplicate local imports)

Per guidelines, keep imports at module scope. Add ToolExecutor here and remove local imports below.

Apply this diff to add the import at the top:

 from holmes.core.tools import (
     Tool,
     StructuredToolResult,
     StructuredToolResultStatus,
     ToolsetStatusEnum,
 )
+from holmes.core.tools_utils.tool_executor import ToolExecutor
 from tests.conftest import create_mock_tool_invoke_context

474-476: Remove local import; rely on top-level import

Keep imports at module scope; this local import should be deleted.

Apply this diff:

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

478-480: LGTM: context threaded into ToolExecutor.invoke

Invocation now correctly passes (tool_name, params, context).


532-536: Remove duplicate local import; keep context usage

Use the top-level ToolExecutor import and keep the context‑aware invocation.

Apply this diff:

-                from holmes.core.tools_utils.tool_executor import ToolExecutor
@@
-                tool_executor = ToolExecutor(mock_toolsets.toolsets)
-                context = create_mock_tool_invoke_context()
-                result = tool_executor.invoke("kubectl_describe", params, context)
+                tool_executor = ToolExecutor(mock_toolsets.toolsets)
+                context = create_mock_tool_invoke_context()
+                result = tool_executor.invoke("kubectl_describe", params, context)

592-600: Remove duplicate local import; rely on module‑level import

Delete the local import; invocation with context is correct.

Apply this diff:

-                from holmes.core.tools_utils.tool_executor import ToolExecutor
@@
-                context = create_mock_tool_invoke_context()
-                with pytest.raises(MockDataNotFoundError):
-                    tool_executor.invoke("kubectl_describe", params, context)
+                context = create_mock_tool_invoke_context()
+                with pytest.raises(MockDataNotFoundError):
+                    tool_executor.invoke("kubectl_describe", params, context)

625-627: Remove duplicate local import; rely on module‑level import

Same refactor here.

Apply this diff:

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

642-645: LGTM: context threaded in generate mode path

Correctly passing context to ToolExecutor.invoke.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/llm/utils/mock_toolset.py (1)

416-416: Fix Ruff ARG002: unused parameter context in mock path.

Rename the parameter to _context and add a return type for consistency.

-    def _call_mock_invoke(self, params: Dict, context: ToolInvokeContext):
+    def _call_mock_invoke(self, params: Dict, _context: ToolInvokeContext) -> StructuredToolResult:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80bfe84 and 94982af.

📒 Files selected for processing (2)
  • tests/llm/utils/mock_toolset.py (4 hunks)
  • tests/llm/utils/test_mock_toolset.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)

Files:

  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_mock_toolset.py
tests/llm/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Always run LLM tests with RUN_LIVE=true to ensure results match real-world behavior

Files:

  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_mock_toolset.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags

Files:

  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_mock_toolset.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test layout should mirror the source structure under tests/

Files:

  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_mock_toolset.py
🧬 Code graph analysis (2)
tests/llm/utils/mock_toolset.py (1)
holmes/core/tools.py (5)
  • ToolInvokeContext (163-169)
  • StructuredToolResult (79-103)
  • invoke (235-261)
  • _invoke (351-360)
  • _invoke (413-445)
tests/llm/utils/test_mock_toolset.py (3)
tests/conftest.py (1)
  • create_mock_tool_invoke_context (107-133)
holmes/core/tools_utils/tool_executor.py (1)
  • invoke (52-62)
holmes/core/tools.py (1)
  • invoke (235-261)
🪛 Ruff (0.13.1)
tests/llm/utils/mock_toolset.py

416-416: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (16)
tests/llm/utils/mock_toolset.py (3)

17-17: Good: threading ToolInvokeContext into the wrapper.

Import aligns with the updated Tool API.


409-415: Good: live path now forwards context to the real tool.

This keeps tests consistent with the new signature.


442-471: Good: _invoke consistently forwards context across all modes.

Behavior in GENERATE mode (prefer existing mocks, generate only when missing) matches the documented guidance elsewhere.

tests/llm/utils/test_mock_toolset.py (13)

14-14: Good: centralized helper for creating ToolInvokeContext.

Keeps tests uniform with the new interface.


249-254: Good: live mode test now asserts context propagation.

Assertion matches the new call signature.


283-285: Good: mock mode uses context-aware wrapper.invoke.


292-292: Context prepared up-front.

Looks fine.


306-308: Good: error path uses context-aware invoke.


328-333: Good: generate mode passes context end-to-end and asserts real tool call.


474-475: Move local import to module top (duplicate of prior review).

Keep imports at module scope per guidelines; remove this local import and add a single top-level import.

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

Add at the top of the file with other imports:

+from holmes.core.tools_utils.tool_executor import ToolExecutor

478-480: Good: ToolExecutor.invoke now receives and forwards context.


532-533: Remove duplicate local import; rely on the top-level import (duplicate of prior review).

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

535-537: Good: context forwarded to executor invocation (params-any).


596-600: Good: negative path asserts with context-aware executor invocation.


625-625: Remove this local import as well; use the module-level import.

Consistent with earlier refactor.

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

642-645: Good: generate-mode flow uses context with executor.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
holmes/plugins/toolsets/logging_utils/logging_api.py (2)

81-87: Minor: clarify the chars-per-token comment (optional).
The note reads as if picking a lower value reduces the risk of over-truncation; it's the opposite—using a lower chars-per-token will keep fewer characters and may over-truncate. Consider rewording to avoid confusion.


250-255: Confirm truncation conditions (avoid chopping error payloads and zero/negative limits).

  • Consider skipping truncation when result.error is set to preserve full error detail, per tools guidelines.
  • Also guard against zero/negative token limits to avoid unnecessary looping.

If desired, wrap the call:

if context.max_token_count > 0:
    truncate_logs(result, context.llm, context.max_token_count)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa2fcf and d9f6702.

📒 Files selected for processing (2)
  • holmes/plugins/toolsets/logging_utils/logging_api.py (4 hunks)
  • tests/plugins/toolsets/test_logging_api.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/plugins/toolsets/test_logging_api.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)

Files:

  • holmes/plugins/toolsets/logging_utils/logging_api.py
holmes/plugins/toolsets/**

📄 CodeRabbit inference engine (CLAUDE.md)

Toolsets must be organized under holmes/plugins/toolsets/ as either {name}.yaml files or directories

Files:

  • holmes/plugins/toolsets/logging_utils/logging_api.py
holmes/{core,plugins}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All tools must return detailed error messages from underlying APIs, including executed command/query, time ranges/parameters, and full API error response; 'no data' responses must specify what was searched and where

Files:

  • holmes/plugins/toolsets/logging_utils/logging_api.py
🧬 Code graph analysis (1)
holmes/plugins/toolsets/logging_utils/logging_api.py (3)
holmes/core/llm.py (1)
  • LLM (38-66)
holmes/core/tools.py (4)
  • StructuredToolResult (79-103)
  • Tool (172-364)
  • ToolInvokeContext (163-169)
  • ToolParameter (155-160)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (2)
holmes/plugins/toolsets/logging_utils/logging_api.py (2)

4-4: LGTM: imports added are necessary for the new truncation flow.

Also applies to: 10-10, 14-14, 18-18


235-235: LGTM: _invoke signature updated to accept ToolInvokeContext.
This aligns with the new tool execution flow.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
holmes/plugins/toolsets/logging_utils/logging_api.py (2)

81-89: Use or remove the unused char↔token ratio constant.

APPROXIMATE_CHAR_TO_TOKEN_RATIO is defined but not used. Either leverage it in truncation math (see next comment) or drop it to avoid confusion.


233-253: Guard truncation when the tool context limit is non-positive.

Small safety net: skip truncation if context.max_token_count <= 0.

Apply this diff:

-        truncate_logs(
-            logging_structured_tool_result=result,
-            llm=context.llm,
-            token_limit=context.max_token_count,
-        )
+        if context.max_token_count > 0:
+            truncate_logs(
+                logging_structured_tool_result=result,
+                llm=context.llm,
+                token_limit=context.max_token_count,
+            )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01543c0 and 3cb77fb.

📒 Files selected for processing (1)
  • holmes/plugins/toolsets/logging_utils/logging_api.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)

Files:

  • holmes/plugins/toolsets/logging_utils/logging_api.py
holmes/plugins/toolsets/**

📄 CodeRabbit inference engine (CLAUDE.md)

Toolsets must be organized under holmes/plugins/toolsets/ as either {name}.yaml files or directories

Files:

  • holmes/plugins/toolsets/logging_utils/logging_api.py
holmes/{core,plugins}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All tools must return detailed error messages from underlying APIs, including executed command/query, time ranges/parameters, and full API error response; 'no data' responses must specify what was searched and where

Files:

  • holmes/plugins/toolsets/logging_utils/logging_api.py
🧬 Code graph analysis (1)
holmes/plugins/toolsets/logging_utils/logging_api.py (3)
holmes/core/llm.py (1)
  • LLM (38-66)
holmes/core/tools.py (4)
  • StructuredToolResult (79-103)
  • Tool (172-364)
  • ToolInvokeContext (163-169)
  • get_stringified_data (90-103)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (2)
holmes/plugins/toolsets/logging_utils/logging_api.py (2)

4-4: LLM/context threading and imports look correct.

Top-level imports comply with style; LLM/context are correctly threaded for token counting/truncation.

Also applies to: 10-10, 14-14, 18-18


91-130: Avoid duplicating the truncation prefix across iterations; add return type and trim by overflow tokens.

  • Prefix is prepended on every loop iteration, producing repeated banners and extra tokens.
  • Missing return type annotation.
  • Prefer trimming ≈ overflow tokens worth of characters to converge faster and use the defined ratio constant.

Apply this diff:

-def truncate_logs(
-    logging_structured_tool_result: StructuredToolResult, llm: LLM, token_limit: int
-):
-    token_count = count_tool_response_tokens(
-        llm=llm, structured_tool_result=logging_structured_tool_result
-    )
-    text = None
-    while token_count > token_limit:
-        # Loop because we are counting tokens but trimming characters. This means we try to trim a number of
-        # characters proportional to the number of tokens but we may still have too many tokens
-        if not text:
-            text = logging_structured_tool_result.get_stringified_data()
-        if not text:
-            # Weird scenario where the result exceeds the token allowance but there is not data.
-            # Exit and do nothing because I don't know how to handle such scenario.
-            logging.warning(
-                f"The calculated token count for logs is {token_count} but the limit is {token_limit}. However the data field is empty so there are no logs to truncate."
-            )
-            return
-        ratio = token_count / token_limit
-        character_count = len(text)
-        number_of_characters_to_truncate = character_count - ceil(
-            character_count / ratio
-        )
-        number_of_characters_to_truncate = max(
-            MIN_NUMBER_OF_CHARACTERS_TO_TRUNCATE, number_of_characters_to_truncate
-        )
-
-        if len(text) <= number_of_characters_to_truncate:
-            logging.warning(
-                f"The calculated token count for logs is {token_count} (max allowed tokens={token_limit}) but the logs are only {len(text)} characters which is below the intended truncation of {number_of_characters_to_truncate} characters. Logs will no longer be truncated"
-            )
-            return
-        else:
-            text = TRUNCATION_PROMPT_PREFIX + text[number_of_characters_to_truncate:]
-            logging_structured_tool_result.data = text
-            token_count = count_tool_response_tokens(
-                llm=llm, structured_tool_result=logging_structured_tool_result
-            )
+def truncate_logs(
+    logging_structured_tool_result: StructuredToolResult, llm: LLM, token_limit: int
+) -> None:
+    token_count = count_tool_response_tokens(
+        llm=llm, structured_tool_result=logging_structured_tool_result
+    )
+    original_text = logging_structured_tool_result.get_stringified_data()
+    if not original_text:
+        if token_count > token_limit:
+            logging.warning(
+                f"The calculated token count for logs is {token_count} but the limit is {token_limit}. However the data field is empty so there are no logs to truncate."
+            )
+        return
+
+    start_idx = 0  # cumulative number of original characters dropped (from the front)
+    while token_count > token_limit:
+        overflow = token_count - token_limit
+        # Drop enough characters to reduce approximately 'overflow' tokens.
+        number_of_characters_to_truncate = max(
+            MIN_NUMBER_OF_CHARACTERS_TO_TRUNCATE,
+            ceil(overflow * APPROXIMATE_CHAR_TO_TOKEN_RATIO),
+        )
+
+        remaining_len = len(original_text) - start_idx
+        if remaining_len <= number_of_characters_to_truncate:
+            logging.warning(
+                f"The calculated token count for logs is {token_count} (max allowed tokens={token_limit}) but the logs are only {remaining_len} characters which is below the intended truncation of {number_of_characters_to_truncate} characters. Logs will no longer be truncated"
+            )
+            return
+
+        start_idx += number_of_characters_to_truncate
+        truncated = TRUNCATION_PROMPT_PREFIX + original_text[start_idx:]
+        logging_structured_tool_result.data = truncated
+        token_count = count_tool_response_tokens(
+            llm=llm, structured_tool_result=logging_structured_tool_result
+        )

@nherment nherment requested a review from moshemorad September 25, 2025 07:19

@aantn aantn left a comment


Looking good, left some comments.

In accordance with the ticket, can you also please replace DEFAULT_QUERY_RESPONSE_SIZE_LIMIT (used in Prometheus) with a percentage of tokens, instead of a hardcoded character count.

Given the vast number of metric queries holmes tends to run in a single investigation, I'd suggest using 2% as the actual value. (The current value is 5000 tokens which is 2.5% of a 200K context window. But in practice our 200K window is exposed to the LLM as a smaller window, so I'm reducing to 2%. The rationale is that the current threshold works OK so we can preserve it.)
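
A minimal sketch of what this suggestion could look like; get_pct_token_count is named in the change summary, but its signature, the constant name, and the wiring into the Prometheus toolset are assumptions:

```python
from holmes.core.llm import LLM

# Suggested in this comment: cap Prometheus query responses at ~2% of the context window.
PROMETHEUS_RESPONSE_SIZE_LIMIT_PCT = 2


def get_pct_token_count(pct_of_context_window: float, llm: LLM) -> int:
    # Convert a percentage of the model's context window into an absolute token budget.
    return int(llm.get_context_window_size() * pct_of_context_window / 100)


# Replacing the hardcoded DEFAULT_QUERY_RESPONSE_SIZE_LIMIT character count might then be:
# max_response_tokens = get_pct_token_count(PROMETHEUS_RESPONSE_SIZE_LIMIT_PCT, context.llm)
```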


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
holmes/plugins/toolsets/logging_utils/logging_api.py (1)

90-92: Add return type annotation.

The function signature is missing a return type annotation. Per coding guidelines, type hints are required for mypy compatibility.

Apply this diff:

 def truncate_logs(
     logging_structured_tool_result: StructuredToolResult, llm: LLM, token_limit: int
-):
+) -> None:

As per coding guidelines.

🧹 Nitpick comments (6)
holmes/core/tools_utils/tool_context_window_limiter.py (3)

9-19: Consider removing redundant int() wrapper.

The floor division operator // already returns an integer, making the int() wrapper on line 16 redundant.

Apply this diff:

-        return int(context_window_size * TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT // 100)
+        return context_window_size * TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT // 100

21-21: Add return type hint for completeness.

The function lacks an explicit return type hint. Add -> None to satisfy the type-checking requirement per coding guidelines.

Apply this diff:

-def prevent_overly_big_tool_response(tool_call_result: ToolCallResult, llm: LLM):
+def prevent_overly_big_tool_response(tool_call_result: ToolCallResult, llm: LLM) -> None:

As per coding guidelines


43-43: Consider smarter truncation for error messages.

The fixed 100-character truncation may cut off mid-word or mid-character (especially with multi-byte UTF-8). Consider using a word-boundary-aware truncation or adding an ellipsis indicator.

Example with ellipsis:

-            truncated_error = original_error[:100]
+            truncated_error = original_error[:100] + ("..." if len(original_error) > 100 else "")
holmes/plugins/toolsets/logging_utils/logging_api.py (1)

29-36: Consider grouping truncation constants with other module constants.

The truncation constants are well-placed after imports, but a past review suggested placing constants at the top of the file. Consider moving these constants up to group them with DEFAULT_LOG_LIMIT, SECONDS_PER_DAY, etc. (lines 21-28) for better organization.

Based on past review comments.

tests/plugins/toolsets/test_logging_api.py (2)

199-202: Consider using MockLLM for more reliable testing.

The test currently requires a real LLM via the MODEL environment variable, which makes it dependent on external services and potentially slow or flaky. Consider using MockLLM() from conftest (which is used by create_mock_tool_invoke_context) for faster, more reliable unit tests. However, if accurate token counting is critical for this test, the current approach with a skip guard is acceptable.

Example using MockLLM:

def test_truncate_logs_nominal_scenario(self):
    """Test that truncate_logs correctly truncates logs when they exceed the token limit."""
    from tests.conftest import MockLLM
    
    llm = MockLLM()
    # ... rest of test

232-239: Consider using ≤ for token limit check.

Line 235 uses < to check the token count is under the limit. Consider using <= instead, as being exactly at the token limit is acceptable:

-        assert truncated_token_count < token_limit
+        assert truncated_token_count <= token_limit

The current assertion is slightly stricter but functionally correct.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cb77fb and 3187728.

📒 Files selected for processing (4)
  • holmes/core/tools_utils/tool_context_window_limiter.py (1 hunks)
  • holmes/core/tools_utils/tool_executor.py (2 hunks)
  • holmes/plugins/toolsets/logging_utils/logging_api.py (5 hunks)
  • tests/plugins/toolsets/test_logging_api.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • holmes/core/tools_utils/tool_executor.py
🧰 Additional context used
📓 Path-based instructions (3)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • holmes/core/tools_utils/tool_context_window_limiter.py
  • tests/plugins/toolsets/test_logging_api.py
  • holmes/plugins/toolsets/logging_utils/logging_api.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags

Files:

  • tests/plugins/toolsets/test_logging_api.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test files should mirror the source structure under tests/

Files:

  • tests/plugins/toolsets/test_logging_api.py
🧬 Code graph analysis (3)
holmes/core/tools_utils/tool_context_window_limiter.py (5)
holmes/core/llm.py (1)
  • LLM (38-66)
holmes/core/tools.py (1)
  • StructuredToolResultStatus (52-76)
holmes/core/models.py (2)
  • ToolCallResult (23-60)
  • as_tool_call_message (30-36)
tests/conftest.py (2)
  • get_context_window_size (74-75)
  • count_tokens_for_message (80-83)
holmes/utils/sentry_helper.py (1)
  • capture_toolcall_contains_too_many_tokens (22-34)
tests/plugins/toolsets/test_logging_api.py (5)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
holmes/plugins/toolsets/logging_utils/logging_api.py (2)
  • truncate_logs (90-145)
  • _invoke (249-270)
holmes/core/tools.py (4)
  • StructuredToolResult (79-103)
  • StructuredToolResultStatus (52-76)
  • _invoke (351-360)
  • _invoke (413-445)
holmes/core/llm.py (1)
  • DefaultLLM (69-371)
tests/conftest.py (1)
  • create_mock_tool_invoke_context (107-133)
holmes/plugins/toolsets/logging_utils/logging_api.py (3)
holmes/core/llm.py (1)
  • LLM (38-66)
holmes/core/tools.py (5)
  • StructuredToolResult (79-103)
  • Tool (172-364)
  • ToolInvokeContext (163-169)
  • ToolParameter (155-160)
  • get_stringified_data (90-103)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
🔇 Additional comments (8)
holmes/core/tools_utils/tool_context_window_limiter.py (2)

1-6: LGTM: Imports are correctly placed at the top.

All imports are properly positioned at the top of the file and appear necessary for the functions below.


34-48: Review NO_DATA handling logic for oversized responses.

When a NO_DATA response is too large, the code sets error_message = None (line 35) and then converts the status to ERROR with error = None (lines 46-48). This creates an ERROR status with no error message, which may be confusing.

Consider whether NO_DATA should remain NO_DATA (since no data is still the accurate state) or if a clearer error message should be provided explaining that the "no data" response itself was too large.

Can you clarify the intended behavior when a NO_DATA response exceeds the token limit? Should it:

  1. Remain as NO_DATA status with data=None (current state already satisfies this)?
  2. Become ERROR with a descriptive message?
  3. Become ERROR with error=None (current behavior)?
holmes/plugins/toolsets/logging_utils/logging_api.py (4)

4-4: LGTM: Imports follow coding guidelines.

All new imports are properly placed at the top of the file as required.

Also applies to: 10-10, 14-14, 18-18


96-146: Truncation logic looks good.

The implementation correctly addresses most concerns from past reviews:

  • Uses get_stringified_data() to avoid "None" stringification
  • Keeps most recent logs by slicing the tail of the text
  • Handles line boundaries to avoid cutting log lines mid-line
  • Includes clear comments explaining the iterative truncation approach

The truncation math differs from the past review suggestion but is reasonable and will converge via the while loop.

Based on past review comments.


249-249: LGTM: Signature updated to accept ToolInvokeContext.

The signature change correctly threads the invocation context through the tool, aligning with the PR objectives.


264-268: Consider skipping truncation for non-success results.

The truncation is applied regardless of the result status. Should error results also be truncated? If errors are typically short and important for debugging, consider applying truncation only when result.status == StructuredToolResultStatus.SUCCESS.

# Example conditional truncation
if result.status == StructuredToolResultStatus.SUCCESS:
    truncate_logs(
        logging_structured_tool_result=result,
        llm=context.llm,
        token_limit=context.max_token_count,
    )
tests/plugins/toolsets/test_logging_api.py (2)

3-3: LGTM: Test imports properly added.

All new imports follow coding guidelines and are correctly placed at the top of the file.

Also applies to: 6-7, 8-8, 10-10, 15-15, 18-19


44-44: LGTM: Test invocations updated correctly.

All test calls to _invoke have been properly updated to pass the ToolInvokeContext parameter using the helper function from conftest.

Also applies to: 83-83, 115-115, 140-140, 175-175


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
holmes/plugins/toolsets/newrelic/newrelic.py (1)

78-108: Implement token-aware truncation in NewRelicTool._invoke

The context parameter is unused—after api.execute_nrql_query(query), you must enforce context.max_token_count via token-aware truncation using context.llm. For example, call count_tool_response_tokens(llm=context.llm, structured_tool_result=result) and trim result.data in a loop (mirroring the logic in truncate_logs in holmes/plugins/toolsets/logging_utils/logging_api.py:90–140) until the token count ≤ context.max_token_count. Extract this into a shared helper in holmes/core/tools_utils if needed.
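
A hedged sketch of the kind of guard this comment asks for, mirroring the truncate_logs loop discussed earlier; the helper name and the halving strategy are illustrative assumptions, not the project's API:

```python
from holmes.core.tools import StructuredToolResult, ToolInvokeContext
from holmes.core.tools_utils.token_counting import count_tool_response_tokens


def enforce_token_budget(result: StructuredToolResult, context: ToolInvokeContext) -> None:
    # Re-count after each trim because we cut characters while budgeting in tokens.
    token_count = count_tool_response_tokens(llm=context.llm, structured_tool_result=result)
    text = result.get_stringified_data()
    while text and token_count > context.max_token_count:
        text = text[len(text) // 2:]  # keep the most recent half of the payload
        result.data = text
        token_count = count_tool_response_tokens(
            llm=context.llm, structured_tool_result=result
        )
```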

♻️ Duplicate comments (1)
tests/llm/utils/test_mock_toolset.py (1)

480-480: Move local ToolExecutor imports to module top.

The local imports of ToolExecutor inside test functions violate the coding guideline requiring all Python imports at the top of the file. This issue was flagged in previous reviews but remains unresolved.

As per coding guidelines, add to the top of the file with other imports:

+from holmes.core.tools_utils.tool_executor import ToolExecutor

Then remove the local import statements at lines 480, 538, 598, and 645:

-                from holmes.core.tools_utils.tool_executor import ToolExecutor

Also applies to: 538-538, 598-598, 645-645

🧹 Nitpick comments (1)
tests/llm/utils/mock_toolset.py (1)

440-492: Consider documenting the intentionally unused context parameter.

The context parameter is accepted for signature consistency with _call_live_invoke but is not used in mock mode (since mocks are pre-recorded data). While this is reasonable for maintaining a uniform interface, consider adding a brief docstring or suppressing the Ruff warning to clarify the intent.

Apply this diff to suppress the Ruff warning:

-    def _call_mock_invoke(self, params: Dict, context: ToolInvokeContext):
+    def _call_mock_invoke(
+        self, params: Dict, context: ToolInvokeContext  # noqa: ARG002
+    ):

Or add a docstring:

     def _call_mock_invoke(self, params: Dict, context: ToolInvokeContext):
+        """Read mock data from disk. Context is unused in mock mode but accepted for interface consistency."""
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3187728 and 1279783.

📒 Files selected for processing (8)
  • holmes/core/models.py (2 hunks)
  • holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py (7 hunks)
  • holmes/plugins/toolsets/investigator/core_investigation.py (2 hunks)
  • holmes/plugins/toolsets/newrelic/newrelic.py (2 hunks)
  • holmes/plugins/toolsets/runbook/runbook_fetcher.py (2 hunks)
  • holmes/plugins/toolsets/servicenow/servicenow.py (4 hunks)
  • tests/llm/utils/mock_toolset.py (4 hunks)
  • tests/llm/utils/test_mock_toolset.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • holmes/core/models.py
🧰 Additional context used
📓 Path-based instructions (3)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py
  • holmes/plugins/toolsets/servicenow/servicenow.py
  • holmes/plugins/toolsets/newrelic/newrelic.py
  • holmes/plugins/toolsets/runbook/runbook_fetcher.py
  • tests/llm/utils/test_mock_toolset.py
  • tests/llm/utils/mock_toolset.py
  • holmes/plugins/toolsets/investigator/core_investigation.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags

Files:

  • tests/llm/utils/test_mock_toolset.py
  • tests/llm/utils/mock_toolset.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test files should mirror the source structure under tests/

Files:

  • tests/llm/utils/test_mock_toolset.py
  • tests/llm/utils/mock_toolset.py
🧬 Code graph analysis (7)
holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py (2)
holmes/core/tools.py (2)
  • ToolInvokeContext (163-169)
  • StructuredToolResult (79-103)
holmes/plugins/toolsets/runbook/runbook_fetcher.py (1)
  • _invoke (69-190)
holmes/plugins/toolsets/servicenow/servicenow.py (1)
holmes/core/tools.py (2)
  • ToolInvokeContext (163-169)
  • StructuredToolResult (79-103)
holmes/plugins/toolsets/newrelic/newrelic.py (2)
holmes/core/tools.py (2)
  • ToolInvokeContext (163-169)
  • StructuredToolResult (79-103)
holmes/plugins/toolsets/logging_utils/logging_api.py (1)
  • _invoke (249-270)
holmes/plugins/toolsets/runbook/runbook_fetcher.py (1)
holmes/core/tools.py (2)
  • ToolInvokeContext (163-169)
  • StructuredToolResult (79-103)
tests/llm/utils/test_mock_toolset.py (3)
tests/conftest.py (1)
  • create_mock_tool_invoke_context (107-133)
holmes/core/tools_utils/tool_executor.py (1)
  • invoke (50-63)
holmes/core/tools.py (1)
  • invoke (235-261)
tests/llm/utils/mock_toolset.py (1)
holmes/core/tools.py (5)
  • ToolInvokeContext (163-169)
  • StructuredToolResult (79-103)
  • invoke (235-261)
  • _invoke (351-360)
  • _invoke (413-445)
holmes/plugins/toolsets/investigator/core_investigation.py (2)
holmes/core/tools.py (4)
  • ToolInvokeContext (163-169)
  • _invoke (351-360)
  • _invoke (413-445)
  • StructuredToolResult (79-103)
holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py (6)
  • _invoke (122-135)
  • _invoke (147-160)
  • _invoke (180-194)
  • _invoke (207-244)
  • _invoke (264-298)
  • _invoke (318-337)
🪛 Ruff (0.13.1)
holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py

122-122: Unused method argument: context

(ARG002)


147-147: Unused method argument: context

(ARG002)


180-180: Unused method argument: context

(ARG002)


207-207: Unused method argument: context

(ARG002)


264-264: Unused method argument: context

(ARG002)


318-318: Unused method argument: context

(ARG002)

holmes/plugins/toolsets/servicenow/servicenow.py

119-119: Unused method argument: context

(ARG002)


162-162: Unused method argument: context

(ARG002)


194-194: Unused method argument: context

(ARG002)

holmes/plugins/toolsets/newrelic/newrelic.py

78-78: Unused method argument: context

(ARG002)

holmes/plugins/toolsets/runbook/runbook_fetcher.py

69-69: Unused method argument: context

(ARG002)

tests/llm/utils/mock_toolset.py

440-440: Unused method argument: context

(ARG002)

holmes/plugins/toolsets/investigator/core_investigation.py

78-78: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
🔇 Additional comments (9)
holmes/plugins/toolsets/servicenow/servicenow.py (1)

8-8: LGTM! Signature updates align with PR refactor.

The addition of ToolInvokeContext import and the signature changes to _invoke methods are consistent with the broader PR objective of introducing context-based tool invocation. The unused context parameter is acceptable for this incremental refactor stage.

Also applies to: 119-119, 162-162, 194-194

holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py (1)

7-7: LGTM! Consistent signature updates across all MongoDB Atlas tools.

The ToolInvokeContext import and signature updates across all six _invoke methods follow the same pattern as other toolsets in this PR. The unused context parameter is expected at this stage of the refactor.

Also applies to: 122-122, 147-147, 180-180, 207-207, 264-264, 318-318

holmes/plugins/toolsets/runbook/runbook_fetcher.py (1)

9-9: LGTM! Signature updated consistently with other tools.

The ToolInvokeContext import and signature change align with the PR's refactor pattern. The unused context parameter is acceptable for this incremental change.

Also applies to: 69-69

holmes/plugins/toolsets/investigator/core_investigation.py (1)

11-11: LGTM! TodoWriteTool signature updated consistently.

The ToolInvokeContext import and signature change are consistent with the broader refactor. The unused context parameter is expected at this stage.

Also applies to: 78-78

tests/llm/utils/test_mock_toolset.py (1)

14-14: LGTM! Test updates correctly propagate ToolInvokeContext.

The addition of create_mock_tool_invoke_context import and its usage throughout the tests correctly exercises the new context-based tool invocation API. All tool and executor invocations now properly create and pass context objects.

Also applies to: 248-249, 282-283, 291-291, 306-306, 334-335, 484-485, 541-542, 603-606, 662-664
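
For reference, a call site under the new API might look roughly like the sketch below. This is illustrative only: the fixture name some_tool and the params dict are placeholders, and it assumes create_mock_tool_invoke_context can be called without arguments, as the referenced tests appear to do.

from tests.conftest import create_mock_tool_invoke_context

def test_invoke_with_mock_context(some_tool):
    # Build a ToolInvokeContext backed by the mock LLM from conftest
    context = create_mock_tool_invoke_context()
    # Tools now receive the context alongside their params
    result = some_tool.invoke(params={"pod_name": "example"}, context=context)
    assert result is not None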

tests/llm/utils/mock_toolset.py (3)

18-18: LGTM!

The ToolInvokeContext import is correctly added and follows the coding guidelines for placing imports at the top of the file.


433-438: LGTM!

The signature update correctly threads ToolInvokeContext through to the underlying tool's invoke method. The context is properly used at line 438.


494-527: LGTM!

The _invoke method signature is correctly updated to accept ToolInvokeContext, and all internal call sites (lines 499, 503, 524, 526) properly propagate the context to both _call_mock_invoke and _call_live_invoke.

holmes/plugins/toolsets/newrelic/newrelic.py (1)

7-7: LGTM: Import addition aligns with PR changes.

The import of ToolInvokeContext is consistent with the broader refactoring across the codebase.

@nherment nherment requested review from aantn and arikalon1 October 1, 2025 10:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
tests/plugins/toolsets/test_logging_api.py (1)

225-230: Typo fix confirmed (“without” variable).

The rename to truncated_log_data_without_prefix is correct and consistent in the slice and assertion.

🧹 Nitpick comments (5)
tests/plugins/toolsets/test_logging_api.py (2)

199-203: Avoid env-coupled DefaultLLM in tests; use MockLLM to keep tests hermetic.

Relying on MODEL and DefaultLLM can make tests flaky and slow. Use the mock LLM from create_mock_tool_invoke_context() instead.

-        model = os.environ.get("MODEL")
-        if not model:
-            pytest.skip("Missing MODEL env var.")
-        llm = DefaultLLM(model=model)
+        # Use a mock LLM for deterministic token counting in tests
+        llm = create_mock_tool_invoke_context().llm

238-239: Tight 95% threshold may be brittle across tokenizers.

Tokenization varies by model/version. Consider relaxing to ~90% or gating via a small epsilon to reduce false negatives.
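
For instance, the assertion could take a tolerance instead of a fixed cutoff. The sketch below assumes the test checks that truncated logs fill most of the token budget without exceeding it; the 10% slack value is arbitrary.

def assert_fills_budget(token_count: int, token_limit: int, epsilon: float = 0.10) -> None:
    # Never exceed the budget
    assert token_count <= token_limit
    # But expect truncation to use most of it, with slack for tokenizer differences
    assert token_count >= token_limit * (1 - epsilon), (
        f"{token_count} tokens is below {(1 - epsilon):.0%} of the {token_limit} token budget"
    )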

holmes/core/tools_utils/tool_context_window_limiter.py (1)

12-16: Clamp invalid percentages to sane bounds.

If percent_of_total_context_window <= 0 or > 100, returning the full context window may defeat the limit. Prefer clamping to [1, 100].

-    if 0 < percent_of_total_context_window and percent_of_total_context_window <= 100:
-        return int(context_window_size * percent_of_total_context_window // 100)
-    else:
-        return context_window_size
+    pct = max(1.0, min(100.0, percent_of_total_context_window))
+    return int(context_window_size * pct // 100)
holmes/plugins/toolsets/prometheus/prometheus.py (2)

442-455: Silence Ruff ARG002: explicitly mark unused context.

These _invoke methods accept context but don’t use it. Add a single del context to satisfy Ruff without altering the signature (callers pass context by name).

 def _invoke(self, params: dict, context: ToolInvokeContext) -> StructuredToolResult:
+        del context  # context is required by interface but unused here

Apply to:

  • ListPrometheusRules._invoke
  • GetMetricNames._invoke
  • GetLabelValues._invoke
  • GetAllLabels._invoke
  • GetSeries._invoke
  • GetMetricMetadata._invoke


Also applies to: 560-575, 679-694, 793-806, 898-913, 1000-1013

422-430: Consider aligning StructuredToolResult.status with response outcome.

create_structured_tool_result always sets SUCCESS even when response.status is “Failed” with error_message. Upstream logic (e.g., limiters, UI) may rely on tool-level status.

Would you like a patch to map:

  • success+data -> SUCCESS
  • success+no data -> NO_DATA
  • non-success/error_message -> ERROR?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1279783854e18c712d22074e81493201068864ab and 515b33c82e13d100663370045bcf096d1967f8cd.

📒 Files selected for processing (3)
  • holmes/core/tools_utils/tool_context_window_limiter.py (1 hunks)
  • holmes/plugins/toolsets/prometheus/prometheus.py (15 hunks)
  • tests/plugins/toolsets/test_logging_api.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • tests/plugins/toolsets/test_logging_api.py
  • holmes/core/tools_utils/tool_context_window_limiter.py
  • holmes/plugins/toolsets/prometheus/prometheus.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags

Files:

  • tests/plugins/toolsets/test_logging_api.py
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test files should mirror the source structure under tests/

Files:

  • tests/plugins/toolsets/test_logging_api.py
🧬 Code graph analysis (3)
tests/plugins/toolsets/test_logging_api.py (4)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
holmes/plugins/toolsets/logging_utils/logging_api.py (2)
  • truncate_logs (90-145)
  • _invoke (249-270)
holmes/core/llm.py (1)
  • DefaultLLM (70-379)
tests/conftest.py (1)
  • create_mock_tool_invoke_context (107-133)
holmes/core/tools_utils/tool_context_window_limiter.py (4)
holmes/core/llm.py (1)
  • LLM (39-67)
holmes/core/tools.py (1)
  • StructuredToolResultStatus (52-76)
holmes/core/models.py (2)
  • ToolCallResult (23-60)
  • as_tool_call_message (30-36)
holmes/utils/sentry_helper.py (1)
  • capture_toolcall_contains_too_many_tokens (22-34)
holmes/plugins/toolsets/prometheus/prometheus.py (4)
holmes/core/tools.py (5)
  • ToolInvokeContext (163-169)
  • StructuredToolResultStatus (52-76)
  • StructuredToolResult (79-103)
  • _invoke (351-360)
  • _invoke (413-445)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
holmes/core/tools_utils/tool_context_window_limiter.py (1)
  • get_pct_token_count (9-15)
holmes/utils/keygen_utils.py (1)
  • generate_random_key (5-6)
🪛 Ruff (0.13.1)
holmes/plugins/toolsets/prometheus/prometheus.py

442-442: Unused method argument: context

(ARG002)


560-560: Unused method argument: context

(ARG002)


679-679: Unused method argument: context

(ARG002)


793-793: Unused method argument: context

(ARG002)


898-898: Unused method argument: context

(ARG002)


1000-1000: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
🔇 Additional comments (1)
holmes/plugins/toolsets/prometheus/prometheus.py (1)

1149-1167: Token cap logic LGTM; respects global limit and optional pct override.

Good use of context.max_token_count with get_pct_token_count to optionally tighten limits.

Also applies to: 1392-1411

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
holmes/core/tools_utils/tool_context_window_limiter.py (1)

37-55: Preserve NO_DATA status instead of coercing to ERROR.

Lines 37-39 handle the NO_DATA case by setting error_message = None, but line 49 unconditionally overwrites the status to ERROR, which loses the semantic distinction between "no data found" and "error occurred."

This was flagged in a previous review but remains unaddressed. When a tool returns NO_DATA (a valid, non-error state), the status should be preserved.

Apply this diff to preserve NO_DATA status:

         if tool_call_result.result.status == StructuredToolResultStatus.NO_DATA:
             error_message = None
-            # tool_call_result.result.data is set to None below which is expected to fix the issue
         elif tool_call_result.result.status == StructuredToolResultStatus.ERROR:
             original_error = (
                 tool_call_result.result.error
                 or tool_call_result.result.data
                 or "Unknown error"
             )
             truncated_error = str(original_error)[:100]
             error_message = f"The tool call returned an error it is too large to return\nThe following original error is truncated:\n{truncated_error}"
 
-        tool_call_result.result.status = StructuredToolResultStatus.ERROR
+        # Only set status to ERROR when it wasn't NO_DATA
+        if tool_call_result.result.status != StructuredToolResultStatus.NO_DATA:
+            tool_call_result.result.status = StructuredToolResultStatus.ERROR
         tool_call_result.result.data = None
         tool_call_result.result.error = error_message
holmes/plugins/toolsets/prometheus/prometheus.py (1)

407-420: Type mismatch: data field should accept dict, not str.

Line 410 declares data: Optional[str] = None, but the code at lines 1147, 1391, 1187, and 1428 assigns result_data (a dict from Prometheus responses) directly to response_data.data. This violates the type annotation and will fail mypy type checking as required by the coding guidelines.

This was flagged in a previous review but remains unaddressed.

As per coding guidelines.

Apply this diff to fix the type mismatch:

 class MetricsBasedResponse(BaseModel):
     status: str
     error_message: Optional[str] = None
-    data: Optional[str] = None
+    data: Optional[dict[str, Any]] = None
     random_key: str
     tool_name: str
     description: str
     query: str
     start: Optional[str] = None
     end: Optional[str] = None
     step: Optional[float] = None
     output_type: Optional[str] = None
     data_summary: Optional[dict[str, Any]] = None
🧹 Nitpick comments (1)
holmes/core/tools_utils/tool_context_window_limiter.py (1)

9-21: Edge case handling could be improved.

The condition on line 12 allows values between 0 and 100, but falls back to returning the full context window for values ≤ 0 or > 100. This could lead to unexpected behavior:

  • If percent_of_total_context_window is 0, returning the full context window size seems counterintuitive.
  • If percent_of_total_context_window is > 100, capping at 100% would be clearer than returning the full window.

Consider adding explicit guards or logging for these edge cases to make the behavior more predictable.

 def get_pct_token_count(percent_of_total_context_window: float, llm: LLM) -> int:
     context_window_size = llm.get_context_window_size()
 
-    if 0 < percent_of_total_context_window and percent_of_total_context_window <= 100:
+    if 0 < percent_of_total_context_window <= 100:
         return int(context_window_size * percent_of_total_context_window // 100)
+    elif percent_of_total_context_window > 100:
+        # Values above 100% are capped at the full context window
+        return context_window_size
     else:
+        # Zero or negative values also fall back to the full window; consider logging a warning or raising instead
         return context_window_size
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d24e9b and fe92ae3.

📒 Files selected for processing (2)
  • holmes/core/tools_utils/tool_context_window_limiter.py (1 hunks)
  • holmes/plugins/toolsets/prometheus/prometheus.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • holmes/core/tools_utils/tool_context_window_limiter.py
  • holmes/plugins/toolsets/prometheus/prometheus.py
🧬 Code graph analysis (2)
holmes/core/tools_utils/tool_context_window_limiter.py (5)
holmes/core/llm.py (1)
  • LLM (39-67)
holmes/core/tools.py (1)
  • StructuredToolResultStatus (52-76)
holmes/core/models.py (2)
  • ToolCallResult (23-60)
  • as_tool_call_message (30-36)
tests/conftest.py (2)
  • get_context_window_size (74-75)
  • count_tokens_for_message (80-83)
holmes/utils/sentry_helper.py (1)
  • capture_toolcall_contains_too_many_tokens (22-34)
holmes/plugins/toolsets/prometheus/prometheus.py (4)
holmes/core/tools.py (3)
  • ToolInvokeContext (163-169)
  • StructuredToolResultStatus (52-76)
  • StructuredToolResult (79-103)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
holmes/core/tools_utils/tool_context_window_limiter.py (1)
  • get_pct_token_count (9-15)
holmes/utils/keygen_utils.py (1)
  • generate_random_key (5-6)
🪛 Ruff (0.13.1)
holmes/plugins/toolsets/prometheus/prometheus.py

442-442: Unused method argument: context

(ARG002)


560-560: Unused method argument: context

(ARG002)


679-679: Unused method argument: context

(ARG002)


793-793: Unused method argument: context

(ARG002)


898-898: Unused method argument: context

(ARG002)


1000-1000: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (3)
holmes/core/tools_utils/tool_context_window_limiter.py (1)

41-47: LGTM: TypeError risk addressed.

Line 46 now wraps original_error with str() before slicing, which prevents the TypeError that would occur if result.data were a dict or other non-string type. This addresses the previous review comment.

holmes/plugins/toolsets/prometheus/prometheus.py (2)

1386-1434: Uninitialized variable when tool_calls_return_data is False.

Line 1386 declares structured_tool_result: StructuredToolResult without initialization. It's only assigned inside the if self.toolset.config.tool_calls_return_data: block (lines 1392-1394), but is then used unconditionally at lines 1430-1434.

If tool_calls_return_data is False, structured_tool_result will be uninitialized when accessed, causing an UnboundLocalError at runtime.

Apply this diff to ensure the variable is always initialized:

                 response_data = MetricsBasedResponse(
                     status=status,
                     error_message=error_message,
                     random_key=generate_random_key(),
                     tool_name=self.name,
                     description=description,
                     query=query,
                     start=start,
                     end=end,
                     step=step,
                     output_type=output_type,
                 )
-
-                structured_tool_result: StructuredToolResult
 
                 # Check if data should be included based on size
                 if self.toolset.config.tool_calls_return_data:
                     result_data = data.get("data", {})
                     response_data.data = result_data
                     structured_tool_result = create_structured_tool_result(
                         params=params, response=response_data
                     )
 
                     token_count = count_tool_response_tokens(
                         llm=context.llm, structured_tool_result=structured_tool_result
                     )
 
                     token_limit = context.max_token_count
                     if self.toolset.config.query_response_size_limit_pct:
                         custom_token_limit = get_pct_token_count(
                             percent_of_total_context_window=self.toolset.config.query_response_size_limit_pct,
                             llm=context.llm,
                         )
                         if custom_token_limit < token_limit:
                             token_limit = custom_token_limit
 
                     # Provide summary if data is too large
                     if token_count > token_limit:
                         response_data.data = None
                         response_data.data_summary = (
                             create_data_summary_for_large_result(
                                 result_data, query, token_count, is_range_query=True
                             )
                         )
                         logging.info(
                             f"Prometheus range query returned large dataset: "
                             f"{response_data.data_summary.get('series_count', 0)} series, "
                             f"{token_count:,} tokens (limit: {token_limit:,}). "
                             f"Returning summary instead of full data."
                         )
                         # Also add character info to the summary for debugging
                         response_data.data_summary["_debug_info"] = (
                             f"Data size: {token_count:,} tokens exceeded limit of {token_limit:,} tokens"
                         )
                     else:
                         response_data.data = result_data
 
                 structured_tool_result = create_structured_tool_result(
                     params=params, response=response_data
                 )
 
                 return structured_tool_result

Likely an incorrect or invalid review comment.


1143-1192: Uninitialized variable when tool_calls_return_data is False.

Line 1143 declares structured_tool_result: StructuredToolResult without initialization. It's only assigned inside the if self.toolset.config.tool_calls_return_data: block (lines 1149-1151), but is then used unconditionally at lines 1189-1192.

If tool_calls_return_data is False, structured_tool_result will be uninitialized when accessed, causing an UnboundLocalError at runtime.

Apply this diff to ensure the variable is always initialized:

                 response_data = MetricsBasedResponse(
                     status=status,
                     error_message=error_message,
                     random_key=generate_random_key(),
                     tool_name=self.name,
                     description=description,
                     query=query,
                 )
-                structured_tool_result: StructuredToolResult
                 # Check if data should be included based on size
                 if self.toolset.config.tool_calls_return_data:
                     result_data = data.get("data", {})
                     response_data.data = result_data
 
                     structured_tool_result = create_structured_tool_result(
                         params=params, response=response_data
                     )
                     token_count = count_tool_response_tokens(
                         llm=context.llm, structured_tool_result=structured_tool_result
                     )
 
                     token_limit = context.max_token_count
                     if self.toolset.config.query_response_size_limit_pct:
                         custom_token_limit = get_pct_token_count(
                             percent_of_total_context_window=self.toolset.config.query_response_size_limit_pct,
                             llm=context.llm,
                         )
                         if custom_token_limit < token_limit:
                             token_limit = custom_token_limit
 
                     # Provide summary if data is too large
                     if token_count > token_limit:
                         response_data.data = None
                         response_data.data_summary = (
                             create_data_summary_for_large_result(
                                 result_data,
                                 query,
                                 token_count,
                                 is_range_query=False,
                             )
                         )
                         logging.info(
                             f"Prometheus instant query returned large dataset: "
                             f"{response_data.data_summary.get('result_count', 0)} results, "
                             f"{token_count:,} tokens (limit: {token_limit:,}). "
                             f"Returning summary instead of full data."
                         )
                         # Also add token info to the summary for debugging
                         response_data.data_summary["_debug_info"] = (
                             f"Data size: {token_count:,} tokens exceeded limit of {token_limit:,} tokens"
                         )
                     else:
                         response_data.data = result_data
 
                 structured_tool_result = create_structured_tool_result(
                     params=params, response=response_data
                 )
                 return structured_tool_result

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py (1)

198-198: Consider implementing token-based truncation for large responses.

All eight _invoke methods now accept a ToolInvokeContext but don't use it. Given the PR's objective to link tool limits to global truncation limits, and that several tools (especially FetchTracesSimpleComparison, QueryTraceById, and QueryMetricsRange) can return very large YAML-serialized datasets, you should consider implementing token counting and truncation similar to the Prometheus toolset.

Reference implementation from holmes/plugins/toolsets/prometheus/prometheus.py (lines 1088-1229):

  • Use count_tool_response_tokens(llm=context.llm, structured_tool_result=...) to measure response size
  • Compare against context.max_token_count
  • Provide summaries or truncate data when limits are exceeded

This would prevent context window overflow when trace data or time series results are large.

If token limiting is intentionally deferred, you may want to add TODO comments or track this work separately.

Also applies to: 412-412, 492-492, 554-554, 629-629, 705-705, 783-783, 867-867
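
To illustrate, a guard shared by these _invoke methods could look roughly like the sketch below. It only shows the budget check: count_tool_response_tokens, ToolInvokeContext, and the context fields are the helpers referenced above, while the idea of replacing the payload with a short notice (rather than a proper summary) is a simplification.

from holmes.core.tools import StructuredToolResult, ToolInvokeContext
from holmes.core.tools_utils.token_counting import count_tool_response_tokens

def cap_response_to_budget(result: StructuredToolResult, context: ToolInvokeContext) -> StructuredToolResult:
    # Measure the serialized result against the per-tool budget threaded through the context
    token_count = count_tool_response_tokens(llm=context.llm, structured_tool_result=result)
    if token_count > context.max_token_count:
        # Too large for the context window: replace the payload with a short notice.
        # A real implementation would likely summarize the trace or time-series data instead.
        result.data = (
            f"Result omitted: {token_count} tokens exceeds the {context.max_token_count} token budget. "
            "Narrow the query or time range."
        )
    return result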

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe92ae3 and b8eda3d.

📒 Files selected for processing (1)
  • holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py
🧬 Code graph analysis (1)
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py (2)
holmes/core/tools.py (4)
  • ToolInvokeContext (163-169)
  • _invoke (351-360)
  • _invoke (413-445)
  • StructuredToolResult (79-103)
holmes/plugins/toolsets/prometheus/prometheus.py (8)
  • _invoke (442-512)
  • _invoke (560-632)
  • _invoke (679-750)
  • _invoke (793-854)
  • _invoke (898-968)
  • _invoke (1000-1047)
  • _invoke (1089-1230)
  • _invoke (1303-1470)
🪛 Ruff (0.13.1)
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py

198-198: Unused method argument: context

(ARG002)


412-412: Unused method argument: context

(ARG002)


492-492: Unused method argument: context

(ARG002)


554-554: Unused method argument: context

(ARG002)


629-629: Unused method argument: context

(ARG002)


705-705: Unused method argument: context

(ARG002)


783-783: Unused method argument: context

(ARG002)


867-867: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (1)
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py (1)

10-10: LGTM!

The ToolInvokeContext import is necessary for the updated method signatures throughout this file.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
holmes/plugins/toolsets/prometheus/prometheus.py (4)

321-404: LGTM! Token terminology is now consistent.

The function correctly uses "tokens" throughout both the messages and field names. The previous issue about inconsistent units has been resolved.


422-435: LGTM! Status mapping correctly implemented.

The function properly maps response status to StructuredToolResultStatus, handling error conditions and missing data appropriately. The previous concern about always returning SUCCESS has been addressed.


407-420: Fix type mismatch: data field should be dict, not str.

The data field is typed as Optional[str] but is assigned dictionary values at lines 1153, 1173, 1193, 1397, 1417, and 1434. This breaks mypy type checking.

Apply this diff:

 class MetricsBasedResponse(BaseModel):
     status: str
     error_message: Optional[str] = None
-    data: Optional[str] = None
+    data: Optional[dict[str, Any]] = None
     random_key: str
     tool_name: str
     description: str
     query: str
     start: Optional[str] = None
     end: Optional[str] = None
     step: Optional[float] = None
     output_type: Optional[str] = None
     data_summary: Optional[dict[str, Any]] = None

95-97: Set a default value equivalent to 2% as previously discussed.

Per the previous review comment, this field should have a default value equivalent to 2% to maintain backward compatibility with the previously used threshold.

Apply this diff:

-    query_response_size_limit_pct: Optional[int] = (
-        None  # Limit the max number of tokens that a query result can take to proactively prevent token limit issues. Expressed in % of the model's context window
-    )
+    query_response_size_limit_pct: Optional[int] = 2  # Limit the max number of tokens that a query result can take to proactively prevent token limit issues. Expressed in % of the model's context window
🧹 Nitpick comments (2)
holmes/plugins/toolsets/prometheus/prometheus.py (2)

1095-1198: Token-based sizing logic correctly implemented.

The implementation properly:

  • Accepts ToolInvokeContext parameter
  • Uses token counting via count_tool_response_tokens
  • Respects both global context.max_token_count and per-tool query_response_size_limit_pct
  • Takes the minimum of the two limits when both are configured
  • Generates data summaries when limits are exceeded

Optional refactor: Consider eliminating duplicate structured result creation.

The code creates structured_tool_result twice (lines 1155-1160 and 1195-1197). The first creation is solely for token counting. While functionally correct, this could be optimized.

Consider extracting the response serialization for counting:

# After line 1153
response_data.data = result_data

# Count tokens directly from response_data
token_count = count_tool_response_tokens(
    llm=context.llm,
    structured_tool_result=StructuredToolResult(
        status=StructuredToolResultStatus.SUCCESS,
        data=response_data.model_dump_json(indent=2),
        params=params,
    )
)

token_limit = context.max_token_count
# ... rest of limit logic ...

# Single final creation
structured_tool_result = create_structured_tool_result(
    params=params, response=response_data
)
return structured_tool_result

1309-1440: Token-based sizing logic correctly implemented.

The range query implementation correctly mirrors the instant query pattern:

  • Accepts ToolInvokeContext parameter
  • Uses token counting and respects both global and per-tool limits
  • Generates data summaries when limits are exceeded
  • Properly handles range-specific fields (start, end, step, output_type)

Optional refactor: Same double-creation optimization applies.

Similar to the instant query, this method creates structured_tool_result twice (lines 1398-1400 and 1436-1438). The same optimization suggested for ExecuteInstantQuery could be applied here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8eda3d and faf71ba.

📒 Files selected for processing (1)
  • holmes/plugins/toolsets/prometheus/prometheus.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • holmes/plugins/toolsets/prometheus/prometheus.py
🧬 Code graph analysis (1)
holmes/plugins/toolsets/prometheus/prometheus.py (4)
holmes/core/tools.py (5)
  • ToolInvokeContext (163-169)
  • StructuredToolResultStatus (52-76)
  • Toolset (533-748)
  • ToolsetTag (143-146)
  • StructuredToolResult (79-103)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
holmes/core/tools_utils/tool_context_window_limiter.py (1)
  • get_pct_token_count (9-15)
holmes/utils/keygen_utils.py (1)
  • generate_random_key (5-6)
🪛 Ruff (0.13.2)
holmes/plugins/toolsets/prometheus/prometheus.py

448-448: Unused method argument: context

(ARG002)


566-566: Unused method argument: context

(ARG002)


685-685: Unused method argument: context

(ARG002)


799-799: Unused method argument: context

(ARG002)


904-904: Unused method argument: context

(ARG002)


1006-1006: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
🔇 Additional comments (3)
holmes/plugins/toolsets/prometheus/prometheus.py (3)

18-18: LGTM! Import additions are correct.

The new imports for ToolInvokeContext, count_tool_response_tokens, and get_pct_token_count are properly placed at the top of the file and support the token-based sizing functionality.

Also applies to: 24-25


448-448: LGTM! Context parameter added for interface consistency.

The context parameter is required by the base class interface even though it's not used in this specific method. This is acceptable for maintaining a consistent interface across all tool implementations.


566-566: LGTM! Context parameter added for interface consistency.

These methods (GetMetricNames, GetLabelValues, GetAllLabels, GetSeries, GetMetricMetadata) all accept the context parameter to maintain interface consistency, even though they don't use token counting. This is acceptable design.

Also applies to: 685-685, 799-799, 904-904, 1006-1006

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
holmes/plugins/toolsets/prometheus/prometheus.py (1)

407-420: Type mismatch: MetricsBasedResponse.data should not be Optional[str].

The data field is typed as Optional[str] (line 410), but the code assigns dictionaries to it (lines 1153, 1193, 1397, 1434). This breaks type checking with mypy and can cause issues with Pydantic serialization.

Apply this diff to fix the type:

 class MetricsBasedResponse(BaseModel):
     status: str
     error_message: Optional[str] = None
-    data: Optional[str] = None
+    data: Optional[dict[str, Any]] = None
     random_key: str
     tool_name: str
     description: str
     query: str
     start: Optional[str] = None
     end: Optional[str] = None
     step: Optional[float] = None
     output_type: Optional[str] = None
     data_summary: Optional[dict[str, Any]] = None
holmes/plugins/toolsets/logging_utils/logging_api.py (1)

86-91: Add return type annotation.

The function signature is missing a return type annotation. Per coding guidelines requiring type hints for all code, add -> None to the signature.

Apply this diff:

 def truncate_logs(
     logging_structured_tool_result: StructuredToolResult,
     llm: LLM,
     token_limit: int,
     structured_params: FetchPodLogsParams,
-):
+) -> None:
🧹 Nitpick comments (1)
holmes/plugins/toolsets/logging_utils/logging_api.py (1)

29-33: Consider moving these constants up to group with other module constants.

While the constants are appropriately defined at module level, they would be more discoverable if placed immediately after line 27 (POD_LOGGING_TOOL_NAME) to group all module constants together before class definitions.

Based on past review comment from aantn requesting constants at top of file.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faf71ba and 150f090.

📒 Files selected for processing (3)
  • holmes/plugins/toolsets/logging_utils/logging_api.py (5 hunks)
  • holmes/plugins/toolsets/prometheus/prometheus.py (15 hunks)
  • tests/plugins/toolsets/test_logging_api.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/plugins/toolsets/test_logging_api.py
🧰 Additional context used
📓 Path-based instructions (1)
{holmes,tests}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{holmes,tests}/**/*.py: Use Ruff for formatting and linting for all Python code
Require type hints (code must be type-checkable with mypy)
ALWAYS place Python imports at the top of the file; do not place imports inside functions or methods

Files:

  • holmes/plugins/toolsets/prometheus/prometheus.py
  • holmes/plugins/toolsets/logging_utils/logging_api.py
🧬 Code graph analysis (2)
holmes/plugins/toolsets/prometheus/prometheus.py (4)
holmes/core/tools.py (3)
  • ToolInvokeContext (163-169)
  • StructuredToolResultStatus (52-76)
  • StructuredToolResult (79-103)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
holmes/core/tools_utils/tool_context_window_limiter.py (1)
  • get_pct_token_count (9-15)
holmes/utils/keygen_utils.py (1)
  • generate_random_key (5-6)
holmes/plugins/toolsets/logging_utils/logging_api.py (3)
holmes/core/llm.py (1)
  • LLM (39-67)
holmes/core/tools.py (4)
  • StructuredToolResult (79-103)
  • Tool (172-364)
  • ToolInvokeContext (163-169)
  • get_stringified_data (90-103)
holmes/core/tools_utils/token_counting.py (1)
  • count_tool_response_tokens (6-13)
🪛 Ruff (0.13.2)
holmes/plugins/toolsets/prometheus/prometheus.py

448-448: Unused method argument: context

(ARG002)


566-566: Unused method argument: context

(ARG002)


685-685: Unused method argument: context

(ARG002)


799-799: Unused method argument: context

(ARG002)


904-904: Unused method argument: context

(ARG002)


1006-1006: Unused method argument: context

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (11)
holmes/plugins/toolsets/prometheus/prometheus.py (7)

18-18: LGTM!

The new imports are correctly placed at the top of the file and are used in the token-counting logic throughout the file.

Also applies to: 24-25


95-97: LGTM!

The config field is properly renamed to indicate percentage-based limiting, and the 2% default aligns with the reviewer's request and previous threshold behavior.


322-322: LGTM!

The function has been updated to use token-based sizing consistently. The parameter name, return values, and messages all correctly reference "tokens" instead of "characters", which aligns with the token-counting approach introduced in this PR.

Also applies to: 365-365, 368-368, 398-398, 401-401


422-435: LGTM!

The create_structured_tool_result function now correctly maps response status to appropriate StructuredToolResultStatus values, addressing the previous concern about always returning SUCCESS. The logic properly handles ERROR, NO_DATA, and SUCCESS cases.
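
A rough sketch of the three-way mapping being described (not the actual body of create_structured_tool_result; the field names follow MetricsBasedResponse as shown earlier in this review):

def map_response_status(response: MetricsBasedResponse) -> StructuredToolResultStatus:
    # Any non-success status or explicit error message is surfaced as a tool-level error
    if response.status != "success" or response.error_message:
        return StructuredToolResultStatus.ERROR
    # A successful query that returned nothing maps to NO_DATA rather than ERROR
    if not response.data and not response.data_summary:
        return StructuredToolResultStatus.NO_DATA
    return StructuredToolResultStatus.SUCCESS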


448-448: LGTM!

The _invoke method signatures have been updated to accept context: ToolInvokeContext, maintaining interface consistency across all tools. While the context parameter is not used in these metadata-focused tools (static analysis warning ARG002), it is intentionally included for a uniform tool invocation API and is actively used in query tools (ExecuteInstantQuery, ExecuteRangeQuery).

Also applies to: 566-566, 685-685, 799-799, 904-904, 1006-1006


1095-1198: LGTM!

The token-counting and truncation logic is well-structured and correctly integrates with the ToolInvokeContext. The implementation:

  • Properly uses context.llm for token counting via count_tool_response_tokens
  • Correctly applies the token limit from context.max_token_count
  • Supports optional per-tool override via query_response_size_limit_pct (choosing the smaller limit)
  • Replaces large responses with data_summary when exceeding the limit
  • Includes debug information in the summary for troubleshooting

The flow is logical and handles both the normal and truncated-response paths appropriately.


1309-1440: LGTM!

The token-counting and truncation logic for range queries mirrors the instant query implementation and is equally well-structured. The implementation correctly:

  • Integrates with ToolInvokeContext for token counting
  • Applies token limits with optional per-tool override
  • Distinguishes range queries by passing is_range_query=True to create_data_summary_for_large_result
  • Replaces large responses with detailed summaries including series count and debug info

The consistency between instant and range query implementations ensures maintainability.

holmes/plugins/toolsets/logging_utils/logging_api.py (4)

4-4: LGTM!

The new imports are correctly placed at the top of the file and properly support the truncation functionality.

Also applies to: 10-10, 14-14, 18-18


92-145: LGTM!

The truncation logic correctly addresses previous review feedback:

  • Uses get_stringified_data() to avoid stringifying None
  • Keeps most recent logs by slicing from the tail
  • Preserves line boundaries with linefeed detection
  • Logs a single informative message only when truncation occurs
  • Handles edge cases appropriately

The iterative approach compensates for the character-to-token ratio approximation.
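
A loose illustration of that approach follows. This is not the actual truncate_logs body: the 4-characters-per-token starting guess, the 10% shrink factor, and the helper name are assumptions, but count_tool_response_tokens and get_stringified_data are the helpers referenced above.

from holmes.core.llm import LLM
from holmes.core.tools import StructuredToolResult
from holmes.core.tools_utils.token_counting import count_tool_response_tokens

def truncate_tail_to_budget(result: StructuredToolResult, llm: LLM, token_limit: int) -> None:
    # Work on the stringified payload so None is never stringified
    text = result.get_stringified_data()
    keep_chars = max(1, token_limit * 4)  # optimistic chars-per-token starting guess
    while text and count_tool_response_tokens(llm=llm, structured_tool_result=result) > token_limit:
        if keep_chars <= 1:
            result.data = ""  # nothing fits within the budget
            break
        # Keep the most recent logs: slice from the tail, then snap to the next full line
        text = text[-keep_chars:]
        newline = text.find("\n")
        if newline != -1:
            text = text[newline + 1:]
        result.data = text
        keep_chars = int(keep_chars * 0.9)  # shrink the guess and re-check the token count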


249-249: LGTM!

The signature change to accept ToolInvokeContext correctly aligns with the PR objectives to thread LLM and token limits through tool invocations.


264-269: LGTM!

The integration of truncate_logs correctly uses the context to access the LLM and token limit, and appropriately passes the structured parameters for logging.

@nherment nherment enabled auto-merge (squash) October 2, 2025 05:46
@nherment nherment requested a review from aantn October 2, 2025 05:56
@github-actions
Contributor

github-actions bot commented Oct 2, 2025

Results of HolmesGPT evals

  • ask_holmes: 33/35 test cases were successful, 0 regressions, 1 setup failure
Test suite Test case Status
ask 01_how_many_pods
ask 02_what_is_wrong_with_pod
ask 04_related_k8s_events
ask 05_image_version
ask 09_crashpod
ask 10_image_pull_backoff
ask 110_k8s_events_image_pull
ask 11_init_containers
ask 13a_pending_node_selector_basic
ask 14_pending_resources
ask 15_failed_readiness_probe
ask 17_oom_kill
ask 19_detect_missing_app_details
ask 20_long_log_file_search
ask 24_misconfigured_pvc
ask 24a_misconfigured_pvc_basic
ask 28_permissions_error 🚧
ask 39_failed_toolset
ask 41_setup_argo
ask 42_dns_issues_steps_new_tools ⚠️
ask 43_current_datetime_from_prompt
ask 45_fetch_deployment_logs_simple
ask 51_logs_summarize_errors
ask 53_logs_find_term
ask 54_not_truncated_when_getting_pods
ask 59_label_based_counting
ask 60_count_less_than
ask 61_exact_match_counting
ask 63_fetch_error_logs_no_errors
ask 79_configmap_mount_issue
ask 83_secret_not_found
ask 86_configmap_like_but_secret
ask 93_calling_datadog[0]
ask 93_calling_datadog[1]
ask 93_calling_datadog[2]

Legend

  • ✅ the test was successful
  • :minus: the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • 🚫 the test was throttled by API rate limits/overload
  • ❌ the test failed and should be fixed before merging the PR

Contributor

@aantn aantn left a comment

LGTM

@nherment nherment merged commit 3cb2846 into master Oct 2, 2025
7 checks passed
@nherment nherment deleted the ROB-2136_link_tool_limit_to_global_truncation_limit branch October 2, 2025 06:59
