
Conversation

@arikalon1
Contributor

No description provided.

@arikalon1 arikalon1 requested a review from moshemorad August 17, 2025 23:00
@coderabbitai
Contributor

coderabbitai bot commented Aug 17, 2025

Walkthrough

Adds per-investigation context (investigation_id) across prompts and LLM flows, introduces a thread-safe TodoListManager and TodoWrite toolset for task tracking, augments prompt templates with an investigation procedure, extends tool parameter schemas and OpenAI formatting, and raises max_steps from 10 to 40. Tests and docs updated.

Changes

Cohort / File(s) and summary of changes:

  • Investigation ID propagation (holmes/core/prompt.py, holmes/core/conversations.py, holmes/interactive.py, holmes/main.py, server.py, docs/installation/python-installation.md, tests/core/test_prompt.py, tests/llm/test_ask_holmes.py): Added investigation_id to prompt contexts and call sites; updated the build_initial_ask_messages signature and tests; docs updated to pass investigation_id.
  • Todo manager & core investigation toolset (holmes/core/todo_manager.py, holmes/core/investigation.py, holmes/core/tool_calling_llm.py, holmes/plugins/toolsets/__init__.py, holmes/plugins/toolsets/investigator/core_investigation.py, holmes/plugins/toolsets/investigator/model.py, tests/core/test_todo_manager.py, tests/core/test_todo_write_tool.py, tests/plugins/toolsets/test_core_investigation.py): Added the TodoListManager singleton and Task/TaskStatus model, integrated todo_list into prompts, introduced TodoWriteTool and CoreInvestigationToolset, and added unit tests for the manager and toolset.
  • Prompt templates & investigation workflow (holmes/plugins/prompts/investigation_procedure.jinja2, holmes/plugins/prompts/_general_instructions.jinja2, holmes/plugins/prompts/generic_ask.jinja2, holmes/plugins/prompts/kubernetes_workload_ask.jinja2, holmes/plugins/toolsets/investigator/investigator_instructions.jinja2): Added a structured, multi-phase investigation procedure and mandatory task-management instructions; included the new templates in the generic and Kubernetes prompts.
  • Tool parameter schema & OpenAI formatting (holmes/core/tools.py, holmes/core/openai_formatting.py): Extended ToolParameter with properties and items for nested object/array schemas; enhanced OpenAI schema generation and validation for objects and arrays; added an error path for object-array inner types without an explicit schema.
  • Initial prompt additions (holmes/core/prompt.py, tests/core/test_prompt.py): Added get_tasks_management_system_reminder() and appended its content during user prompt construction; tests updated to expect the reminder.
  • Max steps increased to 40 (holmes/config.py, examples/custom_llm.py, holmes/main.py, tests/llm/test_ask_holmes.py): Increased the default/example max_steps from 10 to 40 and updated the CLI default and tests.
  • LLM fixture readiness (tests/llm/fixtures/test_ask_holmes/63_fetch_error_logs_no_errors/test_case.yaml): Added a readiness wait for the deployment in the test fixture setup.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant CLI_App as CLI/App
  participant LLM as ToolCallingLLM
  participant Prompt as Prompt Renderer
  participant TodoMgr as TodoListManager
  participant TodoTool as TodoWriteTool

  User->>CLI_App: Ask question
  CLI_App->>LLM: Initialize (max_steps=40)
  LLM->>LLM: generate investigation_id (UUID)
  LLM->>TodoMgr: format_tasks_for_prompt(investigation_id)
  TodoMgr-->>LLM: todo_list (string)
  LLM->>Prompt: render system prompt (investigation_id, todo_list, runbooks)
  CLI_App->>Prompt: build_initial_ask_messages(..., investigation_id, ...)
  loop Investigation loop
    LLM->>TodoTool: TodoWrite todos[], investigation_id
    TodoTool->>TodoMgr: update_session_tasks(investigation_id, tasks)
    TodoTool-->>LLM: StructuredToolResult (formatted tasks)
    LLM->>Prompt: re-render with updated todo_list
  end
  LLM-->>User: Final answer (after tasks complete & final review)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • moshemorad
  • mainred


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🔭 Outside diff range comments (4)
tests/llm/fixtures/test_ask_holmes/63_fetch_error_logs_no_errors/test_case.yaml (1)

1-12: Rename test namespace to follow evals convention (app-<testid>)

The fixture currently uses staging-63, which violates our guideline that each LLM eval must run in its own app-<testid> namespace. Please update all occurrences to app-63, including:

  • The user_prompt line: replace "…in staging-63 namespace" → "…in app-63 namespace"
  • before_test block:
    • kubectl create namespace staging-63 → kubectl create namespace app-63
    • any -n staging-63 flags → -n app-63
  • after_test block:
    • kubectl delete namespace staging-63 → kubectl delete namespace app-63
    • -n staging-63 → -n app-63 for secret deletion
  • manifest.yaml (if it hardcodes the namespace) and any other files referenced by this test

This change is required to prevent cross-test collisions and adhere to our LLM test naming standard.

holmes/plugins/prompts/_general_instructions.jinja2 (1)

66-72: Align tool call limit in prompts with actual safeguards

  • The tool name is correctly defined as “TodoWrite” (class TodoWriteTool, name="TodoWrite").
  • We found no enforcement of a "maximum of 5 tool calls per specific tool" in prevent_overly_repeated_tool_call or elsewhere; only duplicate-call and redundant-pod-logs checks exist.
  • Please either remove or update the "5 tool calls" text in _general_instructions.jinja2 to match current behavior, or implement a configurable max-calls threshold in the safeguards (a minimal sketch follows below).
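
If a configurable threshold is the preferred fix, the guard can be a small per-tool counter. A minimal sketch, assuming the call history is available as a list of dicts with a "tool_name" key; exceeds_per_tool_limit and MAX_CALLS_PER_TOOL are illustrative names, not existing code:

from collections import Counter
from typing import Dict, List

MAX_CALLS_PER_TOOL = 5  # hypothetical default; in practice this would come from config


def exceeds_per_tool_limit(
    tool_calls: List[Dict[str, str]], limit: int = MAX_CALLS_PER_TOOL
) -> bool:
    """Return True if any single tool has been invoked more than `limit` times."""
    counts = Counter(call["tool_name"] for call in tool_calls)
    return any(count > limit for count in counts.values())


# Example: a sixth kubectl_logs call trips the guard.
history = [{"tool_name": "kubectl_logs"}] * 6
assert exceeds_per_tool_limit(history)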
holmes/main.py (1)

72-79: CLI default for max_steps violates the file’s own config-precedence rule

Lines 72–74 explicitly state that CLI option defaults that are also present in the config MUST be None, or the CLI will override config values. Setting a concrete default on Line 97 breaks this contract and will always override whatever is set in the config file.

Apply this diff to restore config precedence (and set the new 40-step default in the Config layer instead):

 opt_max_steps: Optional[int] = typer.Option(
-    40,
+    None,
     "--max-steps",
     help="Advanced. Maximum number of steps the LLM can take to investigate the issue",
 )

If you want 40 to be the new default, set it where Config.load_from_file applies defaults when the CLI passes None, or in the underlying LLM client defaults. I can help patch that if you point me to the Config default site.
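
For illustration, the intended precedence can be expressed as a tiny resolver (a sketch; resolve_max_steps is a hypothetical helper, not part of the existing Config API):

from typing import Optional

DEFAULT_MAX_STEPS = 40  # the new default, applied at the config layer


def resolve_max_steps(cli_value: Optional[int], config_value: Optional[int]) -> int:
    """CLI wins only when explicitly set; otherwise config, then the default."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return DEFAULT_MAX_STEPS


assert resolve_max_steps(None, None) == 40  # neither set: new default
assert resolve_max_steps(None, 15) == 15    # config file beats the default
assert resolve_max_steps(10, 15) == 10      # explicit CLI flag beats config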

tests/core/test_todo_write_tool.py (1)

105-127: Update test_openai_format to pass a target_model and assert investigation_id in the schema

The TodoWriteTool.get_openai_format() method requires a target_model: str argument (it’s defined in the base Tool class), so calling it without parameters will raise a missing-argument error. Also, since investigation_id is declared on the tool, the test should verify its presence in both properties and required.

Pinpointed location:

  • tests/core/test_todo_write_tool.py, test_openai_format (around lines 105–127)

Suggested diff:

-        openai_format = tool.get_openai_format()
+        # Pass a valid OpenAI model identifier
+        openai_format = tool.get_openai_format("gpt-4")

         assert openai_format["type"] == "function"
         assert openai_format["function"]["name"] == "TodoWrite"
         assert "investigation tasks" in openai_format["function"]["description"]

         # Check parameters schema
         params = openai_format["function"]["parameters"]
         assert params["type"] == "object"
         assert "todos" in params["properties"]

         # Check array schema has items property
         todos_param = params["properties"]["todos"]
         assert todos_param["type"] == "array"
         assert "items" in todos_param
         assert todos_param["items"]["type"] == "object"

-        # Check required fields
-        assert "todos" in params["required"]
+        # Check required fields
+        assert "todos" in params["required"]
+
+        # Ensure investigation_id is included
+        assert "investigation_id" in params["properties"]
+        assert params["properties"]["investigation_id"]["type"] == "string"
+        assert "investigation_id" in params["required"]
🧹 Nitpick comments (30)
tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/test_case.yaml (1)

3-9: Clarify matching semantics to avoid brittle/flaky assertions

Natural-language expectations are fine, but the example embed can be brittle if the harness matches exact JSON string order or the example random_key. Consider clarifying key-order independence and that random_key is arbitrary. Minor grammar fix included.

Proposed tweak:

 expected_output: |
-  Output must contain 1 or more embeds in the following format
+  Output must contain one or more embeds in the following format
     <<{"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "iD8G"}>>

-  random_key may be different than the above example, but all other parameters (type and tool_name) must be as described
+  random_key may be different than the above example, but all other parameters (type and tool_name) must be as described.
+  The order of keys inside the JSON object is not important; assertions should match keys irrespective of order.

   Output must NOT tell the user it doesn't have access to metrics or that they should use another tool

Follow-up: Please confirm the eval harness checks the embed JSON by keys (order-insensitive) and does not require the literal random_key value shown in the example.

holmes/config.py (1)

75-75: Update documentation code samples to reflect new max_steps default

The quick scan found several occurrences of max_steps=10 in your Python installation guide. Please update these examples to use the new default of 40:

• docs/installation/python-installation.md:31
• docs/installation/python-installation.md:88
• docs/installation/python-installation.md:209
• docs/installation/python-installation.md:286
• docs/installation/python-installation.md:306

holmes/plugins/prompts/investigation_procedure.jinja2 (5)

1-4: Use consistent capitalization for Investigation ID

Minor clarity/consistency tweak.

-# Investigation ID for this session
-Investigation id: {{ investigation_id }}
+# Investigation ID for this session
+Investigation ID: {{ investigation_id }}

31-36: Avoid inline comments inside JSON-like examples to prevent copy/paste into tool params

Inline comments in example objects can be copied literally by the LLM and break schema validation. Move the note outside the JSON example.

 TodoWrite(todos=[
 {"id": "1", "content": "Check frontend pod logs", "status": "in_progress"},
 {"id": "2", "content": "Check backend service config", "status": "in_progress"},
 {"id": "3", "content": "Analyze network policies", "status": "in_progress"},
-{"id": "4", "content": "Compare logs from both pods", "status": "pending"}  # Depends on 1,2
+{"id": "4", "content": "Compare logs from both pods", "status": "pending"}
 ])
+
+# Note: Task 4 depends on tasks 1 and 2

47-50: Clarify concurrency phrasing to match runtime behavior

State concurrency as “preferred/where supported,” since orchestration depends on the runtime scheduler.

-MAXIMIZE PARALLEL TOOL CALLS:
-- When executing multiple in_progress tasks, make ALL their tool calls at once
+MAXIMIZE PARALLEL TOOL CALLS:
+- When executing multiple in_progress tasks, prefer issuing tool calls concurrently where supported by the runtime
 - Example: If tasks 1,2,3 are in_progress, call kubectl_logs + kubectl_describe + kubectl_get simultaneously

71-76: Fix formatting in the Task Status Check example

Missing newline merges Task 2 and Task 3.

 [✓] completed - Task 1
-[✓] completed - Task 2[✓] completed - Task 3
+[✓] completed - Task 2
+[✓] completed - Task 3
 [✓] completed - Investigation Verification

153-157: Typo: “accross” → “across”

Small spelling fix.

-*Final Review Phase*: Validate that the chain of events, accross the different components, can lead to the investigated scenario.
+*Final Review Phase*: Validate that the chain of events, across the different components, can lead to the investigated scenario.
examples/custom_llm.py (1)

50-65: Add a return type hint to ask_holmes for mypy compliance

Minor type-hint nit for examples.

-def ask_holmes():
+def ask_holmes() -> None:
     prompt = "what pods are unhealthy in my cluster?"
server.py (1)

212-221: Add todo_list to the workload prompt context (consistency with investigation prompts).

You added investigation_id to the system prompt context here, but not todo_list. Since the workload prompt now includes investigation_procedure.jinja2, it may render a task list when present. To avoid undefined var issues (if StrictUndefined) and to keep UX consistent, pass a formatted todo_list too.

Apply this diff within this block:

         ai = config.create_toolcalling_llm(dal=dal, model=request.model)
 
+        # Provide current investigation tasks context to the prompt
+        todo_manager = get_todo_manager()
+        todo_context = todo_manager.format_tasks_for_prompt(ai.investigation_id)
+
         system_prompt = load_and_render_prompt(
             request.prompt_template,
             context={
                 "alerts": workload_alerts,
                 "toolsets": ai.tool_executor.toolsets,
                 "response_format": workload_health_structured_output,
                 "cluster_name": config.cluster_name,
+                "todo_list": todo_context,
                 "investigation_id": ai.investigation_id,
             },
         )

And add this import at the top of the file (with other imports):

from holmes.core.todo_manager import get_todo_manager
holmes/core/investigation.py (1)

137-139: Guard against missing or non-string investigation_id.

In unusual cases (e.g., alt LLM impls), ai.investigation_id might be None or non-string. Defensive cast avoids unexpected keying in the todo manager.

Apply:

-    todo_manager = get_todo_manager()
-    todo_context = todo_manager.format_tasks_for_prompt(ai.investigation_id)
+    todo_manager = get_todo_manager()
+    session_id = str(getattr(ai, "investigation_id", "") or "")
+    todo_context = todo_manager.format_tasks_for_prompt(session_id)
holmes/core/tools.py (1)

125-127: Support for nested schemas added; consider forward-ref rebuild and basic schema validation.

The recursive ToolParameter refs look good. To avoid forward-ref resolution issues with Pydantic v2 in some environments, and to enforce consistency between type and properties/items, consider:

  • Ensure forward refs are rebuilt after class definition:
# place near the end of the module or right after class ToolParameter
ToolParameter.model_rebuild()
  • Optionally add a validator to enforce that:
    • when properties is set, type == "object"
    • when items is set, type == "array"
      This prevents malformed schemas from reaching the formatter (a minimal validator sketch follows below).
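
A minimal sketch of such a validator, using an illustrative stand-in model rather than the real ToolParameter (field names follow the PR description; Pydantic v2 assumed):

from typing import Dict, Optional

from pydantic import BaseModel, model_validator


class ToolParameterSketch(BaseModel):
    """Illustrative stand-in for ToolParameter, not the shipped class."""

    type: str = "string"
    properties: Optional[Dict[str, "ToolParameterSketch"]] = None
    items: Optional["ToolParameterSketch"] = None

    @model_validator(mode="after")
    def _check_schema_consistency(self) -> "ToolParameterSketch":
        # properties only make sense on object parameters
        if self.properties is not None and self.type != "object":
            raise ValueError("'properties' requires type == 'object'")
        # items only make sense on array parameters
        if self.items is not None and self.type != "array":
            raise ValueError("'items' requires type == 'array'")
        return self


# Resolve the self-referencing forward refs, as suggested above.
ToolParameterSketch.model_rebuild()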
tests/plugins/toolsets/test_core_investigation.py (1)

1-43: Solid coverage of the core toolset; tighten one assertion.

The tests validate naming, defaults, tool presence, and prerequisite behavior. Consider asserting the example config is empty to codify the “no config required” contract.

Apply:

-        assert isinstance(config, dict)
-        # Core toolset doesn't need configuration
+        assert isinstance(config, dict)
+        # Core toolset doesn't need configuration
+        assert config == {}
tests/llm/test_ask_holmes.py (2)

299-301: Increase of max_steps to 40 is fine; consider making it configurable for test runs

Bumping to 40 aligns with the PR goals. To keep tests flexible and avoid long runs in constrained CI, consider reading from an env var with a sensible default.

     ai = ToolCallingLLM(
         tool_executor=tool_executor,
-        max_steps=40,
+        max_steps=int(os.environ.get("HOLMES_MAX_STEPS", "40")),
         llm=DefaultLLM(os.environ.get("MODEL", "gpt-4o"), tracer=tracer),
     )

320-327: API change to build_initial_ask_messages: pass investigation_id by keyword

Passing the new parameter is correct. To future-proof against signature changes and improve readability, prefer a keyword argument for investigation_id.

             messages = build_initial_ask_messages(
                 console,
                 test_case.user_prompt,
                 None,
                 ai.tool_executor,
-                ai.investigation_id,
-                runbooks,
+                investigation_id=ai.investigation_id,
+                runbooks=runbooks,
             )
holmes/interactive.py (1)

1000-1008: Pass investigation_id by keyword to avoid future signature drift

The new parameter is wired correctly. Minor nit: use a keyword for clarity and resilience to future parameter ordering changes.

             if messages is None:
                 messages = build_initial_ask_messages(
                     console,
                     user_input,
                     include_files,
                     ai.tool_executor,
-                    ai.investigation_id,
-                    runbooks,
-                    system_prompt_additions,
+                    investigation_id=ai.investigation_id,
+                    runbooks=runbooks,
+                    system_prompt_additions=system_prompt_additions,
                 )
docs/installation/python-installation.md (1)

45-59: Consider documenting session-scoped investigation_id behavior.

ai.investigation_id is created once per ToolCallingLLM instance. In the Tool Details example you reuse the same ai for multiple questions, so todos will persist across all questions in that session. Consider a short note showing how to force a fresh investigation_id per question (e.g., pass str(uuid.uuid4()) explicitly) if users want isolation.
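
For example, a docs snippet along these lines could show per-question isolation (a sketch; it assumes the console, tool_executor, and runbooks objects are set up as in the existing examples):

import uuid

from holmes.core.prompt import build_initial_ask_messages


def ask_with_fresh_session(question, console, tool_executor, runbooks):
    # A new id per question keeps each todo list scoped to that question only.
    fresh_id = str(uuid.uuid4())
    return build_initial_ask_messages(
        console,
        question,
        None,            # no attached files
        tool_executor,
        fresh_id,
        runbooks,
    )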

tests/core/test_prompt.py (2)

33-36: API update coverage looks good; consider deduping test UUID generation.

Passing a UUID as the new 5th argument aligns tests with the updated signature. To reduce duplication, you could generate a single value at test start or via a fixture.

Also applies to: 54-57, 78-81, 103-106, 131-134


26-36: Verify tests also cover investigation_id propagation into prompts (optional).

Current assertions don’t validate that investigation_id/todo_list reach the system prompt. If feasible, add a focused assertion in one test to ensure the system prompt contains investigation_id, guarding against regressions.

holmes/core/tool_calling_llm.py (2)

214-215: Add explicit type annotation for investigation_id to satisfy mypy.

Minor improvement to make the attribute type explicit.

Apply:

-        self.investigation_id = str(uuid.uuid4())
+        self.investigation_id: str = str(uuid.uuid4())

788-801: Optional: avoid injecting empty todo_list.

format_tasks_for_prompt returns "" when no tasks. You can omit the key entirely when empty to reduce prompt noise.

Example:

-        todo_context = todo_manager.format_tasks_for_prompt(self.investigation_id)
-        system_prompt = load_and_render_prompt(
-            prompt,
-            {
-                "issue": issue,
-                "sections": sections,
-                "structured_output": request_structured_output_from_llm,
-                "toolsets": self.tool_executor.toolsets,
-                "cluster_name": self.cluster_name,
-                "todo_list": todo_context,
-                "investigation_id": self.investigation_id,
-            },
-        )
+        todo_context = todo_manager.format_tasks_for_prompt(self.investigation_id)
+        context = {
+            "issue": issue,
+            "sections": sections,
+            "structured_output": request_structured_output_from_llm,
+            "toolsets": self.tool_executor.toolsets,
+            "cluster_name": self.cluster_name,
+            "investigation_id": self.investigation_id,
+        }
+        if todo_context:
+            context["todo_list"] = todo_context
+        system_prompt = load_and_render_prompt(prompt, context)
holmes/core/openai_formatting.py (1)

28-36: Make nested object “required” precise (don’t force all properties unless strict_mode).

Currently, in strict_mode you require all nested properties. That may be intended, but outside strict_mode you should only require those explicitly marked required. Also, only set “required” if the list is non-empty.

Apply:

-            if strict_mode:
-                type_obj["required"] = list(param_attributes.properties.keys())
+            # Only require explicitly-required props unless strict_mode is enabled
+            if strict_mode:
+                required_props = list(param_attributes.properties.keys())
+            else:
+                required_props = [
+                    name
+                    for name, prop in param_attributes.properties.items()
+                    if getattr(prop, "required", False)
+                ]
+            if required_props:
+                type_obj["required"] = required_props
holmes/plugins/prompts/_general_instructions.jinja2 (1)

54-65: Revisit “FIRST tool call MUST be TodoWrite” rigidity.

This may be overly prescriptive for trivial single-step queries and could conflict with earlier guidance (“run as many tools as you need, then respond”). Consider softening: “If the investigation requires multiple steps, first create a plan with TodoWrite,” or gate the block behind a condition in the template context.

tests/core/test_todo_manager.py (2)

76-79: Priority assertions likely inconsistent with current Task model/formatter

The Task model (and the formatter) in the provided context do not include a priority attribute, and the formatter spec doesn’t mention rendering priority markers. These assertions will likely fail unless the formatter was extended to include priority labels regardless of input.

Options:

  • If priority isn’t supported in TodoListManager.format_tasks_for_prompt, remove these assertions:
-        # Check priority
-        assert "(HIGH)" in prompt_context
-        assert "(MED)" in prompt_context
-        assert "(LOW)" in prompt_context
  • If priority is intended, update the Task model and formatter to include and render priority, and extend the test tasks with priority fields to match expectations (one possible shape is sketched below).
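
If the second option is chosen, one possible shape for the extension is sketched here (illustrative only; the shipped Task model and formatter may differ):

from enum import Enum
from typing import Optional

from pydantic import BaseModel


class TaskPriority(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class TaskWithPriority(BaseModel):
    """Illustrative extension of the Task model, not the shipped class."""

    id: str
    content: str
    status: str = "pending"
    priority: Optional[TaskPriority] = None


_PRIORITY_LABELS = {
    TaskPriority.HIGH: "(HIGH)",
    TaskPriority.MEDIUM: "(MED)",
    TaskPriority.LOW: "(LOW)",
}


def priority_label(task: TaskWithPriority) -> str:
    """Label the formatter could append so the (HIGH)/(MED)/(LOW) assertions hold."""
    return _PRIORITY_LABELS[task.priority] if task.priority else ""


assert priority_label(TaskWithPriority(id="1", content="x", priority=TaskPriority.HIGH)) == "(HIGH)"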

39-45: Order-sensitive checks are OK, but also consider verifying status indicators per task

You verify the retrieval order and presence of status indicators elsewhere. For tighter coverage, consider asserting that each retrieved task’s line in the prompt contains the correct status indicator ([ ], [~], [✓]) alongside its id/content.

I can draft an additional test assertion block if you want to tighten this.
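
A possible assertion block, assuming the Task model exposes id, content, and status as the surrounding tests suggest (the fixture values are illustrative):

from holmes.core.todo_manager import TodoListManager
from holmes.plugins.toolsets.investigator.model import Task, TaskStatus


def test_each_task_line_has_matching_status_indicator():
    manager = TodoListManager()
    session_id = "session-under-test"
    tasks = [
        Task(id="1", content="Check pod status", status=TaskStatus.PENDING),
        Task(id="2", content="Analyze logs", status=TaskStatus.IN_PROGRESS),
        Task(id="3", content="Verify fix", status=TaskStatus.COMPLETED),
    ]
    manager.update_session_tasks(session_id, tasks)

    indicators = {"pending": "[ ]", "in_progress": "[~]", "completed": "[✓]"}
    lines = manager.format_tasks_for_prompt(session_id).splitlines()

    for task in tasks:
        indicator = indicators[task.status.value]
        # the rendered line for each task should pair the right indicator with its content
        assert any(indicator in line and task.content in line for line in lines)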

holmes/core/prompt.py (2)

33-36: Signature change: add investigation_id — make sure the docstring lists it

The function signature now includes investigation_id, but the docstring’s Args section doesn’t. Add it for clarity and mypy-oriented readability.

Apply this diff to document the new parameter:

 def build_initial_ask_messages(
     console: Console,
     initial_user_prompt: str,
     file_paths: Optional[List[Path]],
     tool_executor: Any,  # ToolExecutor type
     investigation_id: str,
     runbooks: Union[RunbookCatalog, Dict, None] = None,
     system_prompt_additions: Optional[str] = None,
 ) -> List[Dict]:
     """Build the initial messages for the AI call.
 
     Args:
         console: Rich console for output
         initial_user_prompt: The user's prompt
         file_paths: Optional list of files to include
         tool_executor: The tool executor with available toolsets
+        investigation_id: Unique identifier for the current investigation/session (used for scoping todos and context)
         runbooks: Optional runbook catalog
         system_prompt_additions: Optional additional system prompt content
     """

64-70: Fix grammar/numbering in the system reminder; remove an empty “3.” item

There’s a stray “3.” with no content and minor grammar issues (“Ask your self”). Clean this up to avoid confusing the model.

Apply this diff:

-    user_prompt_with_files += (
-        "\n\n<system-reminder>\nIMPORTANT: You have access to the TodoWrite tool. It creates a TodoList, in order to track progress. It's very important. You MUST use it:\n1. FIRST: Ask your self which sub problems you need to solve in order to answer the question."
-        "Do this, BEFORE any other tools\n2. "
-        "AFTER EVERY TOOL CALL: If required, update the TodoList\n3. "
-        "\n\nFAILURE TO UPDATE TodoList = INCOMPLETE INVESTIGATION\n\n"
-        "Example flow:\n- Think and divide to sub problems → create TodoList → Perform each task on the list → Update list → Verify your solution\n</system-reminder>"
-    )
+    user_prompt_with_files += (
+        "\n\n<system-reminder>\n"
+        "IMPORTANT: You have access to the TodoWrite tool. It creates a TodoList to track progress. You MUST use it:\n"
+        "1. FIRST: Ask yourself which sub-problems you need to solve in order to answer the question. Do this BEFORE using any other tools.\n"
+        "2. AFTER EVERY TOOL CALL: If required, update the TodoList.\n"
+        "3. ON COMPLETION: Verify the TodoList reflects the final state of the investigation.\n"
+        "\nFAILURE TO UPDATE THE TodoList = INCOMPLETE INVESTIGATION\n\n"
+        "Example flow:\n- Think and divide into sub-problems → create TodoList → Perform each task on the list → Update list → Verify your solution\n"
+        "</system-reminder>"
+    )
holmes/plugins/toolsets/investigator/investigator_instructions.jinja2 (1)

58-61: Overly strict response-length constraint may degrade UX

“You MUST answer concisely with fewer than 4 lines” is very constraining for investigations and may cut off useful explanations. Consider softening or making it context-dependent.

I can propose a nuanced rule (e.g., concise by default; allow detail when asked or when summarizing findings with evidence).

tests/core/test_todo_write_tool.py (1)

24-43: Consider explicitly passing investigation_id in tests to reflect the tool contract

If TodoWriteTool requires investigation_id (as suggested elsewhere in the PR), calling _invoke without it will either fail or rely on hidden defaults. Tests should be explicit to avoid future flakiness.

Example (if required by the tool):

-        params = {
+        params = {
+            "investigation_id": "test-session",
             "todos": [
                 {
                     "id": "1",
                     "content": "Check pod status",
                     "status": "pending",
                     "priority": "high",
                 },
                 {
                     "id": "2",
                     "content": "Analyze logs",
                     "status": "in_progress",
                     "priority": "medium",
                 },
             ]
         }

If the tool is designed to tolerate a missing investigation_id by using a session default, document that behavior in the tool and keep this test as-is.

holmes/core/todo_manager.py (2)

25-29: Slight simplification in clear_session.

Use pop with a default to simplify and avoid the membership check.

     def clear_session(self, session_id: str) -> None:
         with self._lock:
-            if session_id in self._sessions:
-                del self._sessions[session_id]
+            self._sessions.pop(session_id, None)

34-78: Prefer TaskStatus over raw strings to avoid drift.

Use the enum directly for ordering and indicators to prevent string literal duplication.

-        status_order = {"pending": 0, "in_progress": 1, "completed": 2}
+        status_order = {
+            TaskStatus.PENDING: 0,
+            TaskStatus.IN_PROGRESS: 1,
+            TaskStatus.COMPLETED: 2,
+        }

         sorted_tasks = sorted(
             tasks,
-            key=lambda t: (status_order.get(t.status.value, 3),),
+            key=lambda t: (status_order.get(t.status, 3),),
         )
@@
-            status_indicator = {
-                "pending": "[ ]",
-                "in_progress": "[~]",
-                "completed": "[✓]",
-            }.get(task.status.value, "[?]")
+            status_indicator = {
+                TaskStatus.PENDING: "[ ]",
+                TaskStatus.IN_PROGRESS: "[~]",
+                TaskStatus.COMPLETED: "[✓]",
+            }.get(task.status, "[?]")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 833f0b9 and 8dc43ba.

📒 Files selected for processing (29)
  • docs/installation/python-installation.md (3 hunks)
  • examples/custom_llm.py (1 hunks)
  • holmes/config.py (1 hunks)
  • holmes/core/conversations.py (11 hunks)
  • holmes/core/investigation.py (3 hunks)
  • holmes/core/openai_formatting.py (1 hunks)
  • holmes/core/prompt.py (3 hunks)
  • holmes/core/todo_manager.py (1 hunks)
  • holmes/core/tool_calling_llm.py (5 hunks)
  • holmes/core/tools.py (1 hunks)
  • holmes/interactive.py (1 hunks)
  • holmes/main.py (2 hunks)
  • holmes/plugins/prompts/_general_instructions.jinja2 (2 hunks)
  • holmes/plugins/prompts/generic_ask.jinja2 (1 hunks)
  • holmes/plugins/prompts/investigation_procedure.jinja2 (1 hunks)
  • holmes/plugins/prompts/kubernetes_workload_ask.jinja2 (1 hunks)
  • holmes/plugins/toolsets/__init__.py (2 hunks)
  • holmes/plugins/toolsets/investigator/core_investigation.py (1 hunks)
  • holmes/plugins/toolsets/investigator/investigator_instructions.jinja2 (1 hunks)
  • holmes/plugins/toolsets/investigator/model.py (1 hunks)
  • server.py (1 hunks)
  • tests/core/test_prompt.py (6 hunks)
  • tests/core/test_todo_manager.py (1 hunks)
  • tests/core/test_todo_write_tool.py (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/63_fetch_error_logs_no_errors/test_case.yaml (1 hunks)
  • tests/llm/test_ask_holmes.py (2 hunks)
  • tests/plugins/toolsets/test_core_investigation.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: ALWAYS place Python imports at the top of the file, not inside functions or methods
Type hints are required (project is type-checked with mypy)
Use Ruff for formatting and linting (configured in pyproject.toml)

Files:

  • holmes/core/tools.py
  • examples/custom_llm.py
  • server.py
  • tests/plugins/toolsets/test_core_investigation.py
  • holmes/plugins/toolsets/investigator/model.py
  • holmes/core/investigation.py
  • holmes/core/conversations.py
  • holmes/interactive.py
  • tests/core/test_todo_write_tool.py
  • holmes/core/openai_formatting.py
  • tests/core/test_prompt.py
  • holmes/core/todo_manager.py
  • tests/core/test_todo_manager.py
  • tests/llm/test_ask_holmes.py
  • holmes/plugins/toolsets/__init__.py
  • holmes/core/prompt.py
  • holmes/main.py
  • holmes/config.py
  • holmes/core/tool_calling_llm.py
  • holmes/plugins/toolsets/investigator/core_investigation.py
{tests/**/*.py,pyproject.toml}

📄 CodeRabbit Inference Engine (CLAUDE.md)

Only use pytest markers that are defined in pyproject.toml; never introduce undefined markers in tests

Files:

  • tests/plugins/toolsets/test_core_investigation.py
  • tests/core/test_todo_write_tool.py
  • tests/core/test_prompt.py
  • tests/core/test_todo_manager.py
  • tests/llm/test_ask_holmes.py
tests/llm/**/test_case.yaml

📄 CodeRabbit Inference Engine (CLAUDE.md)

Do NOT place toolset configuration inside test_case.yaml for evals

Files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/63_fetch_error_logs_no_errors/test_case.yaml
tests/llm/**/*.yaml

📄 CodeRabbit Inference Engine (CLAUDE.md)

tests/llm/**/*.yaml: In evals, ALWAYS use Kubernetes Secrets for scripts; do not embed scripts in inline manifests or ConfigMaps
Each eval test must use a dedicated Kubernetes namespace named app-
All pod names in evals must be unique and should not hint at the problem (use neutral names)

Files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/63_fetch_error_logs_no_errors/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
tests/llm/**/toolsets.yaml

📄 CodeRabbit Inference Engine (CLAUDE.md)

tests/llm/**/toolsets.yaml: Eval-specific toolset configuration must be defined in a separate toolsets.yaml file in the test directory
In eval toolsets.yaml, all toolset-specific configuration must be under a config field

Files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
{holmes/plugins/toolsets/**/*.yaml,tests/llm/**/toolsets.yaml}

📄 CodeRabbit Inference Engine (CLAUDE.md)

Allowed top-level fields for toolset YAMLs are limited to: enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url (MCP only)

Files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
holmes/plugins/prompts/*.jinja2

📄 CodeRabbit Inference Engine (CLAUDE.md)

Prompts must be Jinja2 templates named {name}.jinja2 under holmes/plugins/prompts/

Files:

  • holmes/plugins/prompts/kubernetes_workload_ask.jinja2
  • holmes/plugins/prompts/investigation_procedure.jinja2
  • holmes/plugins/prompts/_general_instructions.jinja2
  • holmes/plugins/prompts/generic_ask.jinja2
docs/**/*.md

📄 CodeRabbit Inference Engine (CLAUDE.md)

In MkDocs docs, always add a blank line between a header/bold text and a following list

Files:

  • docs/installation/python-installation.md
🧠 Learnings (8)
📚 Learning: 2025-08-17T08:42:48.763Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T08:42:48.763Z
Learning: Applies to {holmes/plugins/toolsets/**/*.yaml,tests/llm/**/toolsets.yaml} : Allowed top-level fields for toolset YAMLs are limited to: enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url (MCP only)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
📚 Learning: 2025-08-17T08:42:48.763Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T08:42:48.763Z
Learning: Applies to tests/llm/**/toolsets.yaml : Eval-specific toolset configuration must be defined in a separate toolsets.yaml file in the test directory

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
📚 Learning: 2025-08-17T08:42:48.763Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T08:42:48.763Z
Learning: Applies to tests/llm/**/toolsets.yaml : In eval toolsets.yaml, all toolset-specific configuration must be under a config field

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
📚 Learning: 2025-08-17T08:42:48.763Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T08:42:48.763Z
Learning: Applies to holmes/plugins/toolsets/**/*.yaml : Toolsets must reside under holmes/plugins/toolsets/ as {name}.yaml or within a {name}/ directory

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
  • holmes/plugins/toolsets/__init__.py
📚 Learning: 2025-08-17T08:42:48.763Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T08:42:48.763Z
Learning: Applies to tests/llm/**/test_case.yaml : Do NOT place toolset configuration inside test_case.yaml for evals

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
📚 Learning: 2025-08-05T00:42:23.792Z
Learnt from: vishiy
PR: robusta-dev/holmesgpt#782
File: config.example.yaml:31-49
Timestamp: 2025-08-05T00:42:23.792Z
Learning: In robusta-dev/holmesgpt config.example.yaml, the azuremonitorlogs toolset configuration shows "enabled: true" as an example of how to enable the toolset, not as a default setting. The toolset is disabled by default and requires explicit enablement in user configurations.

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/toolsets.yaml
📚 Learning: 2025-08-13T05:57:40.420Z
Learnt from: mainred
PR: robusta-dev/holmesgpt#829
File: holmes/plugins/runbooks/runbook-format.prompt.md:11-22
Timestamp: 2025-08-13T05:57:40.420Z
Learning: In holmes/plugins/runbooks/runbook-format.prompt.md, the user (mainred) prefers to keep the runbook step specifications simple without detailed orchestration metadata like step IDs, dependencies, retry policies, timeouts, and storage variables. The current format with Action, Function Description, Parameters, Expected Output, and Success/Failure Criteria is sufficient for their AI agent troubleshooting use case.

Applied to files:

  • holmes/plugins/prompts/_general_instructions.jinja2
📚 Learning: 2025-08-17T08:42:48.763Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T08:42:48.763Z
Learning: Applies to holmes/plugins/prompts/*.jinja2 : Prompts must be Jinja2 templates named {name}.jinja2 under holmes/plugins/prompts/

Applied to files:

  • holmes/plugins/prompts/generic_ask.jinja2
🧬 Code Graph Analysis (9)
examples/custom_llm.py (1)
holmes/core/tool_calling_llm.py (1)
  • ToolCallingLLM (204-713)
tests/plugins/toolsets/test_core_investigation.py (2)
holmes/plugins/toolsets/investigator/core_investigation.py (3)
  • CoreInvestigationToolset (134-156)
  • TodoWriteTool (22-131)
  • get_example_config (149-150)
holmes/core/tools.py (6)
  • ToolsetStatusEnum (103-106)
  • ToolsetTag (109-112)
  • check_prerequisites (418-473)
  • get_example_config (476-477)
  • get_example_config (495-496)
  • get_example_config (528-529)
holmes/core/investigation.py (1)
holmes/core/todo_manager.py (2)
  • get_todo_manager (83-84)
  • format_tasks_for_prompt (34-77)
tests/core/test_todo_write_tool.py (3)
holmes/core/tools.py (6)
  • ToolResultStatus (23-42)
  • _invoke (174-175)
  • _invoke (226-254)
  • get_parameterized_one_liner (178-179)
  • get_parameterized_one_liner (205-212)
  • get_openai_format (142-148)
holmes/plugins/toolsets/investigator/core_investigation.py (3)
  • TodoWriteTool (22-131)
  • _invoke (83-127)
  • get_parameterized_one_liner (129-131)
holmes/plugins/toolsets/investigator/model.py (1)
  • TaskStatus (6-9)
holmes/core/todo_manager.py (1)
holmes/plugins/toolsets/investigator/model.py (2)
  • Task (12-15)
  • TaskStatus (6-9)
tests/core/test_todo_manager.py (2)
holmes/core/todo_manager.py (7)
  • TodoListManager (7-77)
  • get_todo_manager (83-84)
  • get_session_count (30-32)
  • get_session_tasks (17-19)
  • format_tasks_for_prompt (34-77)
  • update_session_tasks (21-23)
  • clear_session (25-28)
holmes/plugins/toolsets/investigator/model.py (2)
  • Task (12-15)
  • TaskStatus (6-9)
holmes/plugins/toolsets/__init__.py (1)
holmes/plugins/toolsets/investigator/core_investigation.py (1)
  • CoreInvestigationToolset (134-156)
holmes/core/tool_calling_llm.py (1)
holmes/core/todo_manager.py (2)
  • get_todo_manager (83-84)
  • format_tasks_for_prompt (34-77)
holmes/plugins/toolsets/investigator/core_investigation.py (4)
holmes/core/todo_manager.py (3)
  • get_todo_manager (83-84)
  • update_session_tasks (21-23)
  • format_tasks_for_prompt (34-77)
holmes/core/tools.py (15)
  • Toolset (342-484)
  • ToolsetTag (109-112)
  • ToolsetStatusEnum (103-106)
  • ToolParameter (121-126)
  • Tool (129-179)
  • StructuredToolResult (45-69)
  • ToolResultStatus (23-42)
  • _invoke (174-175)
  • _invoke (226-254)
  • get_parameterized_one_liner (178-179)
  • get_parameterized_one_liner (205-212)
  • get_example_config (476-477)
  • get_example_config (495-496)
  • get_example_config (528-529)
  • _load_llm_instructions (479-484)
holmes/plugins/toolsets/investigator/model.py (2)
  • Task (12-15)
  • TaskStatus (6-9)
holmes/utils/cache.py (1)
  • get (64-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (31)
tests/llm/fixtures/test_ask_holmes/150_what_is_wrong_with_mimir/test_case.yaml (1)

1-2: Prompt fixture looks good and aligns with test intent

The user prompt is clear and minimal. No toolset configuration is embedded here, which complies with the testing guidelines.

tests/llm/fixtures/test_ask_holmes/63_fetch_error_logs_no_errors/test_case.yaml (1)

8-8: Readiness wait added — LGTM

The kubectl wait --for=condition=available addition reduces test flakiness by ensuring the deployment is ready before assertions.

holmes/plugins/toolsets/investigator/model.py (1)

6-15: Task and TaskStatus model — LGTM

Enum values align with the prompt’s required statuses, and the id default_factory ensures UUID strings. Good Pydantic usage.

holmes/plugins/prompts/investigation_procedure.jinja2 (2)

9-17: Status values align with TaskStatus enum — LGTM

The template’s required statuses (“pending”, “in_progress”, “completed”) match the TaskStatus enum. Good consistency.


88-91: Todo list rendering — LGTM

Conditional rendering of todo_list is correct and keeps the prompt focused when no list exists.

examples/custom_llm.py (1)

58-58: Example updated to 40 steps — LGTM

Example aligns with the new default and helps users mirror production behavior.

holmes/plugins/prompts/kubernetes_workload_ask.jinja2 (1)

9-10: Verify context variables for investigation_procedure.jinja2 include

The include path is correct and the template exists, but it references both investigation_id and todo_list without StrictUndefined, so missing values will render silently. Please confirm that your renderer for kubernetes_workload_ask.jinja2 supplies these in its context:

• File: holmes/plugins/prompts/kubernetes_workload_ask.jinja2 (line 9)
• Template: holmes/plugins/prompts/investigation_procedure.jinja2
– Uses investigation_id inside {% if investigation_id %}…{% endif %}
– Uses todo_list inside {% if todo_list %}…{% endif %}

Ensure upstream callers include both keys (e.g., in your call to render_template or equivalent).

server.py (1)

219-220: Confirmed: ai.investigation_id is always defined on LLM instances
The config.create_toolcalling_llm factory returns a ToolCallingLLM, whose __init__ unconditionally sets self.investigation_id = str(uuid.uuid4()). There are no alternative code paths returning a plain LLM, so ai.investigation_id is guaranteed to exist.

holmes/core/investigation.py (2)

13-13: LGTM: Import placed correctly per project guidelines.

Importing get_todo_manager at the top complies with the repo’s import policy.


148-150: Good: Propagating todo_list and investigation_id into the system prompt.

This aligns the investigation prompt with the new TodoWrite flow and per-investigation context.

holmes/plugins/toolsets/__init__.py (2)

47-49: Importing CoreInvestigationToolset looks correct

The import path matches the provided toolset implementation and keeps imports at the top of the file.


74-76: No duplicate YAML-defined toolset named core_investigation
Verified there are no YAML files or name: core_investigation entries in the repo. Prepending CoreInvestigationToolset poses no risk of duplicate definitions.

holmes/plugins/prompts/generic_ask.jinja2 (1)

11-17: Approved: Required Jinja2 partials verified

All shared instruction blocks are present under holmes/plugins/prompts, so the includes will resolve correctly:

  • holmes/plugins/prompts/_current_date_time.jinja2
  • holmes/plugins/prompts/_general_instructions.jinja2
  • holmes/plugins/prompts/_runbook_instructions.jinja2

Ready to merge.

holmes/core/conversations.py (3)

136-137: Propagation of investigation_id in issue chat prompts: LGTM

investigation_id is consistently added to all template contexts (no-tools, truncated-tools, and history-update paths). This keeps system prompts aligned across flows.

Also applies to: 157-158, 191-192, 233-234, 257-258


474-475: Propagation of investigation_id in workload health prompts: LGTM

The investigation_id is added across all workload-health paths (no-tools, truncated-tools, and history-update). Consistent and correct.

Also applies to: 495-496, 529-530, 571-572, 595-596


282-283: Ensure ToolCallingLLM Always Initializes a Non-Empty investigation_id

I didn’t find any assignment of self.investigation_id in the ToolCallingLLM class, so it’s unclear whether ai.investigation_id can ever be None. Please verify and, if needed, update the constructor to always set a valid ID (or provide a default/fallback) to prevent "None" from leaking into your prompts or breaking template assumptions.

• Inspect class ToolCallingLLM in holmes/.../tool_calling_llm.py (or wherever it’s defined) and confirm its __init__ takes and assigns investigation_id.
• If no explicit assignment exists, add something like:

def __init__(..., investigation_id: str, ...):
    self.investigation_id = investigation_id or generate_default_id()

• Review all instantiations of ToolCallingLLM to ensure they pass a non-empty investigation_id.
• Optionally add a runtime assertion or fallback inside ToolCallingLLM to guard against empty IDs.

docs/installation/python-installation.md (1)

51-52: No changes needed: build_initial_ask_messages uses investigation_id
The function signature in holmes/core/prompt.py clearly defines the fifth parameter as investigation_id, and all call sites (including tests) rely on positional or keyword investigation_id. There is no request_id parameter—docs examples are correct as written.

tests/core/test_prompt.py (1)

1-1: LGTM: import placed at top-level.

uuid import placement complies with project import rules.

holmes/core/tool_calling_llm.py (2)

5-5: LGTM: imports are correctly placed at module top.

uuid and get_todo_manager are imported at the top, satisfying the import placement rule.

Also applies to: 42-44


788-801: Good addition: per-investigation todo context in system prompt.

Fetching todo_context via TodoManager and injecting both todo_list and investigation_id into the system prompt is a clean integration and keeps prompts session-aware.

holmes/core/openai_formatting.py (1)

37-47: Array handling looks solid; confirm strictness expectations.

When items are provided, recursion will enforce nested strictness on object items. For the fallback branch, you add additionalProperties: False under strict_mode, which is good. No action needed.

holmes/plugins/prompts/_general_instructions.jinja2 (1)

1-1: Include path looks correct.

investigation_procedure.jinja2 is under the same prompts directory; include should resolve in the current loader.

holmes/main.py (2)

300-308: Good: investigation_id is correctly propagated into the ask flow

Passing ai.investigation_id to build_initial_ask_messages aligns with the updated signature and ensures session-scoped context reaches prompt construction.


286-297: Investigation ID propagation verified

Both the non-interactive and interactive paths pass ai.investigation_id into build_initial_ask_messages, so the interactive flow already propagates the investigation ID as intended:

  • holmes/main.py (non-interactive): call at lines 300–305 includes ai.investigation_id
  • holmes/interactive.py (interactive): call at lines 1000–1005 includes ai.investigation_id

No changes required.

holmes/core/prompt.py (1)

49-55: LGTM: investigation_id included in template context

This makes investigation_id available to the system prompt template as intended.

tests/core/test_todo_write_tool.py (1)

68-86: Invalid enum handling: ensure both status and priority validations feed consistent errors

You expect an error for invalid status/priority. Confirm the tool validates both and returns a stable error prefix like “Failed to process tasks” as asserted here.

If not already implemented, I can help add strict enum validation with clear error aggregation for per-task failures.
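
For reference, a minimal sketch of per-task validation with aggregated errors (validate_todos and the error wording are illustrative, not the tool's current implementation):

from typing import Any, Dict, List, Tuple

VALID_STATUSES = {"pending", "in_progress", "completed"}
VALID_PRIORITIES = {"high", "medium", "low"}


def validate_todos(todos: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Validate each todo independently and aggregate per-task errors.

    Returns (valid_todos, errors); the caller can prefix errors with a stable
    message such as "Failed to process tasks".
    """
    valid, errors = [], []
    for index, todo in enumerate(todos):
        problems = []
        if todo.get("status") not in VALID_STATUSES:
            problems.append(f"invalid status {todo.get('status')!r}")
        if "priority" in todo and todo["priority"] not in VALID_PRIORITIES:
            problems.append(f"invalid priority {todo['priority']!r}")
        if problems:
            errors.append(f"task {index}: " + ", ".join(problems))
        else:
            valid.append(todo)
    return valid, errors


_, errs = validate_todos([{"id": "1", "content": "x", "status": "invalid_status"}])
assert errs == ["task 0: invalid status 'invalid_status'"]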

holmes/core/todo_manager.py (3)

17-24: Good defensive copies with locking.

Returning a copy and storing a copied list under lock avoids external mutation and concurrent hazards. LGTM.


30-33: Session count under lock is correct.

Straightforward and thread-safe.


80-85: Singleton accessor is fine.

The module-level instance with a typed getter is appropriate for shared state.

holmes/plugins/toolsets/investigator/core_investigation.py (2)

109-120: Result construction looks good.

Success path returns a clear message and includes the formatted tasks for immediate feedback. LGTM.


121-128: Error handling is appropriate.

Exception is logged with stack trace and returned as a structured error result. LGTM.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
holmes/core/prompt.py (3)

28-36: Polish the reminder text: fix grammar and a missing space (user-visible).

The concatenated string currently yields “question.Do this” (missing space) and has a couple of typos. Suggest tightening grammar without changing meaning.

Apply this diff:

 def get_tasks_management_system_reminder() -> str:
     return (
-        "\n\n<system-reminder>\nIMPORTANT: You have access to the TodoWrite tool. It creates a TodoList, in order to track progress. It's very important. You MUST use it:\n1. FIRST: Ask your self which sub problems you need to solve in order to answer the question."
-        "Do this, BEFORE any other tools\n2. "
+        "\n\n<system-reminder>\nIMPORTANT: You have access to the TodoWrite tool. It creates a TodoList, in order to track progress. It's very important. You MUST use it:\n1. FIRST: Ask yourself which sub-problems you need to solve in order to answer the question. "
+        "Do this, BEFORE any other tools\n2. "
         "AFTER EVERY TOOL CALL: If required, update the TodoList\n3. "
         "\n\nFAILURE TO UPDATE TodoList = INCOMPLETE INVESTIGATION\n\n"
-        "Example flow:\n- Think and divide to sub problems → create TodoList → Perform each task on the list → Update list → Verify your solution\n</system-reminder>"
+        "Example flow:\n- Think and divide into sub-problems → create TodoList → Perform each task on the list → Update list → Verify your solution\n</system-reminder>"
     )

38-46: Tighten return typing for messages.

Return type is too generic (List[Dict]). It’s better as List[Dict[str, str]] to help mypy and callers.

Apply this diff:

-) -> List[Dict]:
+) -> List[Dict[str, str]]:

Optional (stronger typing): define a TypedDict and use it as the return type.

# at top of file
from typing import TypedDict, Literal

class ChatMessage(TypedDict):
    role: Literal["system", "user"]
    content: str

Then change the return annotation to List[ChatMessage].


74-75: Consider moving the reminder into the system prompt instead of the user message.

System-level guidance carries higher priority in most chat stacks and is less likely to be overridden by user content. If you agree, append the reminder to system_prompt_rendered and drop it from the user content.

Suggested change:

@@
-    user_prompt_with_files += get_tasks_management_system_reminder()
+    system_prompt_rendered += get_tasks_management_system_reminder()
tests/core/test_prompt.py (4)

27-37: Solid update to pass investigation_id into the builder.

Consider reducing duplication by introducing a fixture for investigation_id so you don’t repeat str(uuid.uuid4()) in every test.

Example fixture:

import uuid
import pytest

@pytest.fixture
def investigation_id() -> str:
    return str(uuid.uuid4())

Then use it in calls to build_initial_ask_messages.


72-86: Add an assertion that the reminder is present when attachments are used.

Currently you check file markers and content; add a check for the appended reminder to cover this path as well.

Apply this diff:

     assert "File content here" in messages[1]["content"]
     # Check for file attachment markers
     assert "<attached-file" in messages[1]["content"]
     assert "test.txt" in messages[1]["content"]
     assert "</attached-file>" in messages[1]["content"]
+    # The task management reminder should still be appended
+    assert get_tasks_management_system_reminder() in messages[1]["content"]

122-141: Also assert the reminder in the “all parameters” test.

This ensures the reminder is appended regardless of optional args.

Apply this diff:

     assert "Test prompt" in messages[1]["content"]
     assert "File content" in messages[1]["content"]
+    assert get_tasks_management_system_reminder() in messages[1]["content"]

27-37: Optional: add a focused test to verify investigation_id is passed into the template context.

Right now we don’t assert that the template receives investigation_id. You can monkeypatch load_and_render_prompt to capture the context and assert it contains the key.

Prototype test:

def test_system_template_receives_investigation_id(monkeypatch, console, mock_tool_executor):
    captured = {}
    def fake_render(template, ctx):
        captured.update(ctx)
        return "SYSTEM"
    monkeypatch.setattr("holmes.core.prompt.load_and_render_prompt", fake_render)

    messages = build_initial_ask_messages(
        console,
        "Test prompt",
        None,
        mock_tool_executor,
        "inv-123",
        None,
        None,
    )

    assert captured.get("investigation_id") == "inv-123"
    assert messages[0]["content"] == "SYSTEM"

Also applies to: 47-61, 72-86, 103-111, 122-141

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8dc43ba and 0301614.

📒 Files selected for processing (4)
  • holmes/core/prompt.py (3 hunks)
  • tests/core/test_prompt.py (9 hunks)
  • tests/core/test_todo_manager.py (1 hunks)
  • tests/core/test_todo_write_tool.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/core/test_todo_manager.py
  • tests/core/test_todo_write_tool.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: ALWAYS place Python imports at the top of the file, not inside functions or methods
Type hints are required (project is type-checked with mypy)
Use Ruff for formatting and linting (configured in pyproject.toml)

Files:

  • tests/core/test_prompt.py
  • holmes/core/prompt.py
{tests/**/*.py,pyproject.toml}

📄 CodeRabbit Inference Engine (CLAUDE.md)

Only use pytest markers that are defined in pyproject.toml; never introduce undefined markers in tests

Files:

  • tests/core/test_prompt.py
🧬 Code Graph Analysis (1)
tests/core/test_prompt.py (1)
holmes/core/prompt.py (1)
  • get_tasks_management_system_reminder (28-35)
🔇 Additional comments (5)
holmes/core/prompt.py (2)

63-64: Injecting investigation_id into template context is spot-on.

This enables templates to reference the current investigation, aligning with the broader per-investigation context changes.


43-43: All call sites updated to pass investigation_id.
Ran the AST-based verification script across the repo; no calls to build_initial_ask_messages are missing the new parameter. No further action required.
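The script itself isn't attached; a rough, hypothetical equivalent of such an AST-based check (the function name and the fifth-positional-argument heuristic are assumptions, not the exact script that was run):

import ast
from pathlib import Path

def calls_missing_investigation_id(root: str = ".") -> list[str]:
    offenders: list[str] = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if not isinstance(node, ast.Call):
                continue
            name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if name != "build_initial_ask_messages":
                continue
            # Heuristic: investigation_id is the fifth positional argument,
            # unless it is passed explicitly as a keyword.
            keywords = {kw.arg for kw in node.keywords}
            if len(node.args) < 5 and "investigation_id" not in keywords:
                offenders.append(f"{path}:{node.lineno}")
    return offenders

if __name__ == "__main__":
    missing = calls_missing_investigation_id()
    print("\n".join(missing) if missing else "all call sites pass investigation_id")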

tests/core/test_prompt.py (3)

1-2: LGTM on the UUID import.

Using a fresh investigation_id per test is appropriate.


7-12: Importing the new helper keeps expectations stable.

Asserting against get_tasks_management_system_reminder() avoids brittle hard-coding.


42-45: Assertion of the appended reminder is good coverage.

This verifies the new behavior explicitly for the basic path.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
holmes/core/openai_formatting.py (1)

70-73: Add explicit return type and generics for mypy compliance

Project policy requires type hints; the function lacks a return type, and tool_parameters is an unparameterized dict. Adding types avoids mypy complaints.

-def format_tool_to_open_ai_standard(
-    tool_name: str, tool_description: str, tool_parameters: dict, target_model: str
-):
+def format_tool_to_open_ai_standard(
+    tool_name: str,
+    tool_description: str,
+    tool_parameters: dict[str, Any],
+    target_model: str,
+) -> dict[str, Any]:
-    tool_properties = {}
+    tool_properties: dict[str, Any] = {}
🧹 Nitpick comments (2)
holmes/core/openai_formatting.py (2)

28-36: Great recursion for nested objects; consider propagating property descriptions

When mapping nested object properties, we currently drop each property's description. Keeping descriptions helps LLMs adhere to constraints and improves dev ergonomics. Also, switching from a dict comprehension to a small loop increases clarity and lets us attach metadata.

Apply:

-        if hasattr(param_attributes, "properties") and param_attributes.properties:
-            type_obj["properties"] = {
-                name: type_to_open_ai_schema(prop, strict_mode)
-                for name, prop in param_attributes.properties.items()
-            }
-            if strict_mode:
-                type_obj["required"] = list(param_attributes.properties.keys())
+        if hasattr(param_attributes, "properties") and param_attributes.properties:
+            props: dict[str, Any] = {}
+            for name, prop in param_attributes.properties.items():
+                schema = type_to_open_ai_schema(prop, strict_mode)
+                desc = getattr(prop, "description", None)
+                if desc:
+                    schema["description"] = desc
+                props[name] = schema
+            type_obj["properties"] = props
+            if strict_mode:
+                type_obj["required"] = list(param_attributes.properties.keys())

37-47: Arrays: explicit items schema is good; also carry item descriptions

The explicit items path is spot-on. Minor improvement: if the item parameter has a description, attach it onto the items schema as well.

-        if hasattr(param_attributes, "items") and param_attributes.items:
-            items_schema = type_to_open_ai_schema(param_attributes.items, strict_mode)
-            type_obj = {"type": "array", "items": items_schema}
+        if hasattr(param_attributes, "items") and param_attributes.items:
+            items_schema = type_to_open_ai_schema(param_attributes.items, strict_mode)
+            item_desc = getattr(param_attributes.items, "description", None)
+            if item_desc:
+                items_schema["description"] = item_desc
+            type_obj = {"type": "array", "items": items_schema}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0301614 and 633e185.

📒 Files selected for processing (1)
  • holmes/core/openai_formatting.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: ALWAYS place Python imports at the top of the file, not inside functions or methods
Type hints are required (project is type-checked with mypy)
Use Ruff for formatting and linting (configured in pyproject.toml)

Files:

  • holmes/core/openai_formatting.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: Pre-commit checks
🔇 Additional comments (1)
holmes/core/openai_formatting.py (1)

56-58: No legacy array[object] usages detected — safe to merge

Searches across Python, YAML, and JSON files for array[object] (including type="array[object]") returned no matches. This breaking change can be merged without impacting existing tool definitions.
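For reproducibility, a hedged sketch of an equivalent search (not the exact command that was run; a plain rg 'array\[object\]' over the repo would serve the same purpose):

import re
from pathlib import Path

LEGACY = re.compile(r"array\[object\]")

def find_legacy_array_object(root: str = ".") -> list[str]:
    hits: list[str] = []
    for pattern in ("*.py", "*.yaml", "*.yml", "*.json"):
        for path in Path(root).rglob(pattern):
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if LEGACY.search(line):
                    hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits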

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
holmes/core/todo_manager.py (1)

13-16: Add missing return annotation on __init__ (mypy compliance).

Constructor should be annotated with -> None. Lock is already typed, thanks.

Apply this diff:

-    def __init__(self):
+    def __init__(self) -> None:
         self._sessions: Dict[str, List[Task]] = {}
         self._lock: Lock = Lock()
🧹 Nitpick comments (6)
holmes/core/todo_manager.py (3)

44-49: Key mappings with string literals are typo-prone; bind to TaskStatus values.

Use TaskStatus constants to prevent accidental typos and keep things consistent with the model.

Apply this diff:

-        status_order = {"pending": 0, "in_progress": 1, "completed": 2}
+        status_order = {
+            TaskStatus.PENDING.value: 0,
+            TaskStatus.IN_PROGRESS.value: 1,
+            TaskStatus.COMPLETED.value: 2,
+        }
 
         sorted_tasks = sorted(
             tasks,
-            key=lambda t: (status_order.get(t.status.value, 3),),
+            key=lambda t: (status_order.get(t.status.value, 3),),
         )

63-70: Same note for status indicator mapping: prefer TaskStatus-based keys.

Keeps mapping resilient to typos and aligned to the enum.

Apply this diff:

-            status_indicator = {
-                "pending": "[ ]",
-                "in_progress": "[~]",
-                "completed": "[✓]",
-            }.get(task.status.value, "[?]")
+            status_indicator = {
+                TaskStatus.PENDING.value: "[ ]",
+                TaskStatus.IN_PROGRESS.value: "[~]",
+                TaskStatus.COMPLETED.value: "[✓]",
+            }.get(task.status.value, "[?]")

80-84: Global singleton is convenient; consider a reset hook for tests.

A simple clear_all() method or a reset helper can reduce test coupling when using the module singleton.

I can add a clear_all_sessions() method and a pytest fixture to reset state between tests if desired.
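A possible shape for that, sketched below (the method and fixture names are suggestions, not existing API):

# On TodoListManager:
def clear_all_sessions(self) -> None:
    with self._lock:
        self._sessions.clear()

# In tests/conftest.py:
from typing import Iterator

import pytest
from holmes.core.todo_manager import get_todo_manager

@pytest.fixture(autouse=True)
def _reset_todo_manager() -> Iterator[None]:
    yield
    get_todo_manager().clear_all_sessions()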

holmes/plugins/toolsets/investigator/investigator_instructions.jinja2 (3)

151-151: Remove stray closing tag </assistant>.

Looks like a copy artifact that could confuse downstream processing.

Apply this diff:

-Let's start by analyzing the detailed CPU metrics and correlating with request patterns.</assistant>
+Let's start by analyzing the detailed CPU metrics and correlating with request patterns.

169-172: Fix code block formatting in the Python example.

Use fenced code block for clarity and consistent rendering.

Apply this diff:

-python
-print("Hello World")
+```python
+print("Hello World")
+```

60-66: Hard constraint “fewer than 4 lines” and “Always use TodoWrite” may be too rigid.

These absolutes can degrade UX in nuanced investigations. Consider softening or gating behind a template variable (e.g., {{ concise_mode|default(false) }}).

Proposed wording:

  • “Answer concisely. Prefer under 4 lines unless the user asks for detail or context requires more.”
  • “Use TodoWrite proactively for multi-step tasks; skip it for trivial single-step requests.”
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 633e185 and 2b750ee.

📒 Files selected for processing (4)
  • holmes/config.py (1 hunks)
  • holmes/core/todo_manager.py (1 hunks)
  • holmes/plugins/toolsets/investigator/core_investigation.py (1 hunks)
  • holmes/plugins/toolsets/investigator/investigator_instructions.jinja2 (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • holmes/config.py
  • holmes/plugins/toolsets/investigator/core_investigation.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: ALWAYS place Python imports at the top of the file, not inside functions or methods
Type hints are required (project is type-checked with mypy)
Use Ruff for formatting and linting (configured in pyproject.toml)

Files:

  • holmes/core/todo_manager.py
🧬 Code Graph Analysis (1)
holmes/core/todo_manager.py (1)
holmes/plugins/toolsets/investigator/model.py (2)
  • Task (12-15)
  • TaskStatus (6-9)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.12)
🔇 Additional comments (2)
holmes/plugins/toolsets/investigator/investigator_instructions.jinja2 (2)

1-28: Strong addition: clear guidance and good worked examples.

The task management rationale and examples will help drive better, auditable investigations.


243-253: Ignore removal of the priority field
The priority property is part of the Task schema and is used throughout the code and tests. No change is required to the Jinja template:

• tests/core/test_todo_write_tool.py defines and exercises priority in multiple test cases
• The TodoWriteTool model and OpenAI schema generation include priority as a valid enum
• The Jinja2 snippet correctly mirrors the supported task properties

Please disregard the earlier suggestion to remove priority.

Likely an incorrect or invalid review comment.

@arikalon1 arikalon1 enabled auto-merge (squash) August 19, 2025 11:42
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (6)
holmes/config.py (1)

77-77: Default max_steps increased to 40 — acknowledged

Rationale was discussed earlier (“todo tools add more steps”). No further concerns from this file’s side.

holmes/core/todo_manager.py (2)

13-16: Add constructor return annotation (mypy compliance)

Constructor should explicitly return None.

-    def __init__(self):
+    def __init__(self) -> None:

17-20: Avoid mutable default in dict.get and tighten typing of return

Using dict.get(..., []) allocates a fresh mutable default on every miss and can confuse mypy; the subsequent .copy() then needlessly duplicates that already-empty list.

-    def get_session_tasks(self, session_id: str) -> List[Task]:
-        with self._lock:
-            return self._sessions.get(session_id, []).copy()
+    def get_session_tasks(self, session_id: str) -> List[Task]:
+        with self._lock:
+            tasks = self._sessions.get(session_id)
+            return tasks.copy() if tasks is not None else []
holmes/plugins/toolsets/investigator/core_investigation.py (3)

139-149: Annotate constructor return type (mypy compliance)

Constructor should explicitly return None.

-    def __init__(self):
+    def __init__(self) -> None:

153-158: Annotate _reload_instructions return type (mypy compliance)

Explicit None return keeps type checking clean.

-    def _reload_instructions(self):
+    def _reload_instructions(self) -> None:

102-106: Validate investigation_id to avoid cross-session contamination

Empty/whitespace IDs lead to storing tasks under an empty key.

-            session_id = params.get("investigation_id", "")
-            todo_manager.update_session_tasks(session_id, tasks)
+            session_id = str(params.get("investigation_id", "")).strip()
+            if not session_id:
+                return StructuredToolResult(
+                    status=ToolResultStatus.ERROR,
+                    error="investigation_id is required but was missing or empty",
+                    params=params,
+                )
+            todo_manager.update_session_tasks(session_id, tasks)
🧹 Nitpick comments (5)
holmes/core/todo_manager.py (1)

84-88: Optional: Mark singleton as Final to prevent reassignment

Minor typing hardening for the module-level singleton.

-_todo_manager = TodoListManager()
+_todo_manager: Final[TodoListManager] = TodoListManager()

Add this import at the top of the file:

from typing import Final
holmes/plugins/toolsets/investigator/core_investigation.py (4)

3-6: Import List for typing and drop uuid4 (rely on Task default id)

Prepares for the type hints below and removes an unnecessary dependency.

-from typing import Any, Dict
+from typing import Any, Dict, List
-
-from uuid import uuid4

25-37: Relax id requirement; document allowed status values in schema

The runtime generates IDs when they are missing, so the schema should reflect that. Also document the allowed status values.

         "todos": ToolParameter(
             description="COMPLETE list of ALL tasks on the task list. Each task should have: id (string), content (string), status (pending/in_progress/completed)",
             type="array",
             required=True,
             items=ToolParameter(
                 type="object",
                 properties={
-                    "id": ToolParameter(type="string", required=True),
+                    "id": ToolParameter(type="string", required=False),
                     "content": ToolParameter(type="string", required=True),
-                    "status": ToolParameter(type="string", required=True),
+                    "status": ToolParameter(
+                        type="string",
+                        required=True,
+                        description="one of: pending, in_progress, completed",
+                    ),
                 },
             ),
         ),

45-49: Add type hints to print_tasks_table (mypy requirement)

Annotate parameters and return type.

-    def print_tasks_table(self, tasks):
+    def print_tasks_table(self, tasks: List[Task]) -> None:

89-99: Leverage Task’s default id; tighten local typing and input handling

Removes ad-hoc uuid, uses Task’s default id, and adds types.

-            tasks = []
+            tasks: List[Task] = []
 
             for todo_item in todos_data:
                 if isinstance(todo_item, dict):
-                    task = Task(
-                        id=todo_item.get("id", str(uuid4())),
-                        content=todo_item.get("content", ""),
-                        status=TaskStatus(todo_item.get("status", "pending")),
-                    )
-                    tasks.append(task)
+                    status = TaskStatus(todo_item.get("status", "pending"))
+                    content = str(todo_item.get("content", "")).strip()
+                    id_val = todo_item.get("id")
+                    if isinstance(id_val, str) and id_val:
+                        tasks.append(Task(id=id_val, content=content, status=status))
+                    else:
+                        tasks.append(Task(content=content, status=status))
+                else:
+                    logging.warning("Skipping non-dict todo item: %r", todo_item)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2b750ee and 997886d.

📒 Files selected for processing (3)
  • holmes/config.py (1 hunks)
  • holmes/core/todo_manager.py (1 hunks)
  • holmes/plugins/toolsets/investigator/core_investigation.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: ALWAYS place Python imports at the top of the file, not inside functions or methods
Type hints are required (project is type-checked with mypy)
Use Ruff for formatting and linting (configured in pyproject.toml)

Files:

  • holmes/config.py
  • holmes/core/todo_manager.py
  • holmes/plugins/toolsets/investigator/core_investigation.py
🧬 Code Graph Analysis (2)
holmes/core/todo_manager.py (1)
holmes/plugins/toolsets/investigator/model.py (2)
  • Task (12-15)
  • TaskStatus (6-9)
holmes/plugins/toolsets/investigator/core_investigation.py (3)
holmes/core/todo_manager.py (3)
  • get_todo_manager (87-88)
  • update_session_tasks (21-23)
  • format_tasks_for_prompt (34-81)
holmes/core/tools.py (8)
  • ToolParameter (121-126)
  • Tool (129-179)
  • StructuredToolResult (45-69)
  • ToolResultStatus (23-42)
  • _invoke (174-175)
  • _invoke (226-254)
  • get_parameterized_one_liner (178-179)
  • get_parameterized_one_liner (205-212)
holmes/plugins/toolsets/investigator/model.py (2)
  • Task (12-15)
  • TaskStatus (6-9)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: llm_evals

@github-actions
Contributor

Results of HolmesGPT evals

  • ask_holmes: 36/39 test cases were successful, 0 regressions, 2 skipped, 1 setup failure
Test suite Test case Status
ask 01_how_many_pods
ask 02_what_is_wrong_with_pod
ask 03_what_is_the_command_to_port_forward
ask 04_related_k8s_events ↪️
ask 05_image_version
ask 09_crashpod
ask 10_image_pull_backoff
ask 11_init_containers
ask 14_pending_resources
ask 15_failed_readiness_probe
ask 17_oom_kill
ask 18_crash_looping_v2
ask 19_detect_missing_app_details
ask 20_long_log_file_search
ask 24_misconfigured_pvc
ask 28_permissions_error 🚧
ask 29_events_from_alert_manager ↪️
ask 39_failed_toolset
ask 41_setup_argo
ask 42_dns_issues_steps_new_tools
ask 43_current_datetime_from_prompt
ask 45_fetch_deployment_logs_simple
ask 51_logs_summarize_errors
ask 53_logs_find_term
ask 54_not_truncated_when_getting_pods
ask 59_label_based_counting
ask 60_count_less_than
ask 61_exact_match_counting
ask 63_fetch_error_logs_no_errors
ask 79_configmap_mount_issue
ask 83_secret_not_found
ask 86_configmap_like_but_secret
ask 93_calling_datadog
ask 93_calling_datadog
ask 93_calling_datadog
ask 97_logs_clarification_needed
ask 110_k8s_events_image_pull
ask 24a_misconfigured_pvc_basic
ask 13a_pending_node_selector_basic

Legend

  • ✅ the test was successful
  • ↪️ the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • ❌ the test failed and should be fixed before merging the PR

@github-actions
Contributor

Results of HolmesGPT evals

  • ask_holmes: 32/39 test cases were successful, 3 regressions, 2 skipped, 2 setup failures
Test suite Test case Status
ask 01_how_many_pods
ask 02_what_is_wrong_with_pod
ask 03_what_is_the_command_to_port_forward
ask 04_related_k8s_events ↪️
ask 05_image_version
ask 09_crashpod
ask 10_image_pull_backoff
ask 11_init_containers
ask 14_pending_resources
ask 15_failed_readiness_probe
ask 17_oom_kill
ask 18_crash_looping_v2
ask 19_detect_missing_app_details
ask 20_long_log_file_search
ask 24_misconfigured_pvc
ask 28_permissions_error 🚧
ask 29_events_from_alert_manager ↪️
ask 39_failed_toolset
ask 41_setup_argo
ask 42_dns_issues_steps_new_tools
ask 43_current_datetime_from_prompt
ask 45_fetch_deployment_logs_simple
ask 51_logs_summarize_errors
ask 53_logs_find_term
ask 54_not_truncated_when_getting_pods
ask 59_label_based_counting
ask 60_count_less_than 🚧
ask 61_exact_match_counting
ask 63_fetch_error_logs_no_errors
ask 79_configmap_mount_issue
ask 83_secret_not_found
ask 86_configmap_like_but_secret
ask 93_calling_datadog
ask 93_calling_datadog
ask 93_calling_datadog
ask 97_logs_clarification_needed
ask 110_k8s_events_image_pull
ask 24a_misconfigured_pvc_basic
ask 13a_pending_node_selector_basic

Legend

  • ✅ the test was successful
  • ↪️ the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • ❌ the test failed and should be fixed before merging the PR

@arikalon1 arikalon1 merged commit e36bdb0 into master Aug 19, 2025
11 checks passed
@arikalon1 arikalon1 deleted the subtasks branch August 19, 2025 11:52
Collaborator

@mainred mainred left a comment

Sorry, my comments are late.


class TodoWriteTool(Tool):
name: str = "TodoWrite"
description: str = "Save investigation tasks to break down complex problems into manageable sub-tasks. ALWAYS provide the COMPLETE list of all tasks, not just the ones being updated."
Collaborator

The list of calls should have no dependencies, so it's ok to run them in parallel.

from holmes.plugins.toolsets.investigator.model import Task, TaskStatus


class TodoWriteTool(Tool):
Collaborator

Why not move it to tools.py?

@@ -0,0 +1,210 @@
{% if investigation_id %}
Collaborator

I am curious why investigation_id is so important. It's a UUID scoped to a single interactive or non-interactive run, effectively marking a session.

Collaborator

Yes, it's important for tracking tasks that run in parallel. There can be multiple sessions within a single interactive-mode ask.
