ROB-2111: limit max output tokens #1005
Conversation
Important: Review skipped due to path filters.

⛔ Files ignored due to path filters (1)

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters.

Walkthrough

Introduces a computed max_output_tokens in holmes/core/llm.py using floor(min(64000, context_window_size/5)) and refines it by checking litellm.model_cost per model variant, lowering the value when the model's own cap is smaller. Replaces the previous hard-coded 4096 default in logic and warnings. Adds a math.floor import.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Caller
    participant LLM as LLM.get_max_output_tokens
    participant Cost as litellm.model_cost
    Caller->>LLM: request max_output_tokens(model, context_window_size)
    activate LLM
    Note over LLM: Compute baseline = floor(min(64000, context_window_size/5))
    LLM->>Cost: Iterate model variants to read max_output_tokens
    activate Cost
    Cost-->>LLM: per-variant max_output_tokens (if any)
    deactivate Cost
    Note over LLM: Cap the baseline at the per-variant value when it is smaller
    LLM-->>Caller: return computed max_output_tokens
    deactivate LLM
```
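For orientation, here is a minimal sketch of the computation described above. It is not the exact holmes/core/llm.py code: the function name, the name_variants parameter, and the return-on-first-match behavior are reconstructions from the review comments further down.

```python
from math import floor

import litellm


def compute_max_output_tokens(context_window_size: int, name_variants: list[str]) -> int:
    # Baseline: a fifth of the context window, capped at 64k output tokens.
    baseline = floor(min(64000, context_window_size / 5))
    for name in name_variants:
        entry = litellm.model_cost.get(name)  # litellm's per-model limits table
        cap = entry.get("max_output_tokens") if entry else None
        if cap is not None:
            return min(baseline, cap)  # never exceed the model's own cap
    return baseline  # no variant matched; the real code logs a warning here
```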
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (4)
holmes/core/tool_calling_llm.py (4)
4-4: Drop unnecessary math.floor import

Switch to integer arithmetic in limit_max_output_tokens and remove this import.

```diff
-from math import floor
```
137-142: Clamp by context (>=), use integer math, and add a brief docstring

Covers misconfigurations where maximum_output_token >= context, avoids float + floor, and clarifies behavior.

```diff
-def limit_max_output_tokens(maximum_output_token: int, max_context_size: int) -> int:
-    if maximum_output_token == max_context_size:
-        return floor(min(64000, maximum_output_token / 5))
-    else:
-        return maximum_output_token
+def limit_max_output_tokens(maximum_output_token: int, max_context_size: int) -> int:
+    """
+    Clamp max output tokens when set to be >= the model's context window.
+    Returns 1/5 of context (integer), capped at 64k. Otherwise returns the input.
+    """
+    if maximum_output_token >= max_context_size:
+        return min(64_000, max_context_size // 5)
+    return maximum_output_token
```
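As a quick sanity check of the suggested version, with hypothetical numbers for a 128k-context model (the behavioral change is the third case, which the old == check let pass through unclamped):

```python
# baseline for a 128k context: min(64_000, 128_000 // 5) == 25_600
assert limit_max_output_tokens(25_600, 128_000) == 25_600    # sane value: unchanged
assert limit_max_output_tokens(128_000, 128_000) == 25_600   # equals context: clamped
assert limit_max_output_tokens(200_000, 128_000) == 25_600   # exceeds context: now clamped too
```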
438-446: Use reserved output tokens for the pre-truncation budget check

The truncation logic reserves min(max_output, MAX_OUTPUT_TOKEN_RESERVATION), but the pre-check uses maximum_output_token, which can trigger unnecessary truncation. Align the check with the actual reserved value.

```diff
         maximum_output_token = limit_max_output_tokens(
             maximum_output_token=maximum_output_token,
             max_context_size=max_context_size,
         )
-        if (total_tokens + maximum_output_token) > max_context_size:
+        reserved_output_tokens = min(
+            maximum_output_token, MAX_OUTPUT_TOKEN_RESERVATION
+        )
+        if (total_tokens + reserved_output_tokens) > max_context_size:
             logging.warning("Token limit exceeded. Truncating tool responses.")
             truncated_res = self.truncate_messages_to_fit_context(
                 messages, max_context_size, maximum_output_token
             )
```
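A small numeric sketch of the premature truncation this avoids; 4_096 is a hypothetical stand-in for MAX_OUTPUT_TOKEN_RESERVATION, whose real value lives in the module:

```python
MAX_OUTPUT_TOKEN_RESERVATION = 4_096  # hypothetical stand-in
max_context_size = 128_000
maximum_output_token = 64_000
total_tokens = 70_000  # tokens already used by the messages

# Old pre-check: 70_000 + 64_000 = 134_000 > 128_000, so it truncates...
assert (total_tokens + maximum_output_token) > max_context_size

# ...even though truncation only reserves 4_096 tokens, which fits comfortably.
reserved_output_tokens = min(maximum_output_token, MAX_OUTPUT_TOKEN_RESERVATION)
assert (total_tokens + reserved_output_tokens) <= max_context_size
```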
914-924: Same alignment for streaming path: check against reserved output tokens

Keeps call_stream consistent with call and avoids premature truncation.

```diff
         maximum_output_token = limit_max_output_tokens(
             maximum_output_token=maximum_output_token,
             max_context_size=max_context_size,
         )
-
-        if (total_tokens + maximum_output_token) > max_context_size:
+        reserved_output_tokens = min(
+            maximum_output_token, MAX_OUTPUT_TOKEN_RESERVATION
+        )
+        if (total_tokens + reserved_output_tokens) > max_context_size:
             logging.warning("Token limit exceeded. Truncating tool responses.")
             truncated_res = self.truncate_messages_to_fit_context(
                 messages, max_context_size, maximum_output_token
             )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
holmes/core/tool_calling_llm.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)
Files:
holmes/core/tool_calling_llm.py
holmes/{core,plugins}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
All tools must return detailed error messages from underlying APIs, including executed command/query, time ranges/parameters, and full API error response; 'no data' responses must specify what was searched and where
Files:
holmes/core/tool_calling_llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Pre-commit checks
- GitHub Check: llm_evals
Actionable comments posted: 2
🧹 Nitpick comments (1)
holmes/core/tool_calling_llm.py (1)
4-4: Remove unnecessary floor import

You can avoid float math and floor by using integer division. If you apply the refactor below, drop this import.

Apply:

```diff
-from math import floor
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
holmes/core/tool_calling_llm.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)
Files:
holmes/core/tool_calling_llm.py
holmes/{core,plugins}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
All tools must return detailed error messages from underlying APIs, including executed command/query, time ranges/parameters, and full API error response; 'no data' responses must specify what was searched and where
Files:
holmes/core/tool_calling_llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: llm_evals
🔇 Additional comments (1)
holmes/core/tool_calling_llm.py (1)
918-922: Mirror provider-side enforcement in streaming path

Apply the same provider-side cap in call_stream by passing the clamped max tokens to self.llm.completion.

Example:

```python
full_response = self.llm.completion(
    messages=parse_messages_tags(messages),
    tools=tools,
    tool_choice=tool_choice,
    response_format=response_format,
    temperature=TEMPERATURE,
    stream=False,
    max_tokens=maximum_output_token,  # enforce clamp here too
    drop_params=True,
)
```
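Worth noting: the call already passes drop_params=True, which in litellm drops parameters the target provider does not support, so adding max_tokens here should degrade gracefully on providers that reject it rather than erroring.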
Actionable comments posted: 0
🧹 Nitpick comments (2)
holmes/core/llm.py (2)
4-4: Drop math.floor and use integer division for clarity and zero-alloc math

Avoid float math and an extra import; // is clearer and sufficient since tokens are integers.

Apply:

```diff
-from math import floor
```

```diff
-        max_output_tokens = floor(min(64000, self.get_context_window_size() / 5))
+        cw = self.get_context_window_size()
+        max_output_tokens = min(64000, cw // 5)
```

Also applies to: 296-297
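A one-line check of the equivalence for non-negative token counts, if reassurance is wanted:

```python
from math import floor

assert all(floor(n / 5) == n // 5 for n in range(0, 1_000_000, 7))
```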
305-314: Narrow exception handling and avoid try/except/continue pattern in lookup

Catching Exception and returning inside try hurts readability and can mask errors. Use .get() and simple type checks; log only when no variant matched. This also addresses Ruff TRY300/BLE001 hints.

Apply:

```diff
-        # Try each name variant
-        for name in self._get_model_name_variants_for_lookup():
-            try:
-                litellm_max_output_tokens = litellm.model_cost[name][
-                    "max_output_tokens"
-                ]
-                if litellm_max_output_tokens < max_output_tokens:
-                    max_output_tokens = litellm_max_output_tokens
-                return max_output_tokens
-            except Exception:
-                continue
+        # Try each name variant
+        for name in self._get_model_name_variants_for_lookup():
+            entry = litellm.model_cost.get(name)
+            if not entry:
+                continue
+            litellm_max_output_tokens = entry.get("max_output_tokens")
+            if isinstance(litellm_max_output_tokens, int):
+                max_output_tokens = min(max_output_tokens, litellm_max_output_tokens)
+            return max_output_tokens
```

```diff
-        # Log which lookups we tried
-        logging.warning(
-            f"Couldn't find model {self.model} in litellm's model list (tried: {', '.join(self._get_model_name_variants_for_lookup())}), "
-            f"using {max_output_tokens} tokens for max_output_tokens. "
-            f"To override, set OVERRIDE_MAX_OUTPUT_TOKEN environment variable to the correct value for your model."
-        )
-        return max_output_tokens
+        # Log which lookups we tried
+        logging.warning(
+            f"Couldn't find model {self.model} in litellm.model_cost "
+            f"(tried: {', '.join(self._get_model_name_variants_for_lookup())}); "
+            f"falling back to computed default {max_output_tokens} (min(64000, context_window/5)). "
+            f"To override, set OVERRIDE_MAX_OUTPUT_TOKEN env var."
+        )
+        return max_output_tokens
```

Based on static analysis hints

Also applies to: 316-323
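As a standalone illustration (not project code) of why the blind except was risky: .get() simply returns None for an unknown model, while bracket access raises KeyError, which the old except Exception caught along with any unrelated bug in the block:

```python
import litellm

# .get() fails soft only on the missing-key case: no exception to swallow.
assert litellm.model_cost.get("no-such-model") is None

# Bracket access raises KeyError; `except Exception` would also have hidden
# typos, type errors, or anything else raised inside the try block.
try:
    litellm.model_cost["no-such-model"]["max_output_tokens"]
except KeyError:
    pass
```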
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
holmes/core/llm.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Always place Python imports at the top of the file, not inside functions or methods
All Python code must include type hints (mypy enforced)
Files:
holmes/core/llm.py
holmes/{core,plugins}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
All tools must return detailed error messages from underlying APIs, including executed command/query, time ranges/parameters, and full API error response; 'no data' responses must specify what was searched and where
Files:
holmes/core/llm.py
🪛 Ruff (0.13.1)
holmes/core/llm.py
312-312: Consider moving this statement to an else block
(TRY300)
313-314: try-except-continue detected, consider logging the exception
(S112)
313-313: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: llm_evals
- GitHub Check: Pre-commit checks
nice work
No description provided.