fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter #32205
Conversation
CodSpeed WallTime Performance Report: Merging #32205 will not alter performance.
CodSpeed Instrumentation Performance Report: Merging #32205 will not alter performance.
Could you add a unit test?
Pull Request Overview
This PR adds validation to the split_text_on_tokens function to prevent infinite loops and empty chunks. The changes address two specific issues: preventing infinite loops when tokens_per_chunk <= chunk_overlap, and avoiding empty decoded chunks in the output.
- Adds validation to ensure tokens_per_chunk > chunk_overlap to prevent infinite loops
- Filters out empty decoded chunks from the result list
- Includes test coverage for the empty decode scenario (a sketch of the patched function follows this list)
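The PR diff itself is not reproduced on this page. As a point of reference, here is a minimal sketch of what split_text_on_tokens in libs/text-splitters/langchain_text_splitters/base.py could look like with both changes applied; the exact exception type and message, and the placement of the empty-chunk filter, are assumptions rather than the PR's actual code.

```python
from langchain_text_splitters.base import Tokenizer


def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    """Split incoming text into chunks using the given tokenizer."""
    # Assumed form of the new validation: a non-positive stride
    # (tokens_per_chunk - chunk_overlap <= 0) means start_idx can never
    # advance, so the loop below would never terminate.
    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
        raise ValueError("tokens_per_chunk must be greater than chunk_overlap")
    splits: list[str] = []
    input_ids = tokenizer.encode(text)
    start_idx = 0
    cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
    chunk_ids = input_ids[start_idx:cur_idx]
    while start_idx < len(input_ids):
        decoded = tokenizer.decode(chunk_ids)
        # Assumed form of the second fix: skip empty decoded chunks
        # instead of appending "" to the result list.
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break
        start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        chunk_ids = input_ids[start_idx:cur_idx]
    return splits
```

This sketch keeps the structure of the current implementation, including the cur_idx/chunk_ids computation both before and inside the loop, which the review comments below pick up on.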
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description
---|---
libs/text-splitters/langchain_text_splitters/base.py | Adds validation and empty chunk filtering to the split_text_on_tokens function
libs/text-splitters/tests/unit_tests/test_text_splitters.py | Adds a test case for the empty decode scenario
```python
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
```
This calculation is duplicated from line 348. Consider calculating cur_idx and chunk_ids once at the beginning of each loop iteration to avoid redundancy.
Suggested change:

```diff
-cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-chunk_ids = input_ids[start_idx:cur_idx]
```
```python
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
```
This line duplicates the calculation from before the while loop. The redundant calculations of cur_idx and chunk_ids should be removed to improve code clarity.
Suggested change:

```diff
-cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-chunk_ids = input_ids[start_idx:cur_idx]
```
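Both Copilot comments point at the same refactor: compute cur_idx and chunk_ids exactly once, at the top of each loop iteration. A minimal sketch of that shape (not the PR's actual diff; the empty-chunk filter from this PR is kept in):

```python
splits: list[str] = []
input_ids = tokenizer.encode(text)
start_idx = 0
while start_idx < len(input_ids):
    # Chunk bounds computed once per iteration, instead of before the
    # loop and again at the bottom of the loop body.
    cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
    chunk_ids = input_ids[start_idx:cur_idx]
    decoded = tokenizer.decode(chunk_ids)
    if decoded:
        splits.append(decoded)
    if cur_idx == len(input_ids):
        break
    start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
```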
```diff
@@ -2715,6 +2715,21 @@ def test_split_text_on_tokens() -> None:
     assert output == expected_output


+def test_decode_returns_no_chunks() -> None:
+    """Test that whitespace-only input results in empty output, not ['']."""
```
The docstring is inaccurate. The test input 'foo bar baz 123' is not whitespace-only, and the test is not about whitespace handling but about filtering empty decoded strings.
"""Test that whitespace-only input results in empty output, not [''].""" | |
"""Test that when decode returns only empty strings, output is empty, not [''].""" |
Description
The splitter now requires tokenizer.tokens_per_chunk > tokenizer.chunk_overlap: when that invariant is violated, the window stride (tokens_per_chunk - chunk_overlap) is zero or negative, so start_idx never advances and split_text_on_tokens loops forever.
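A quick illustration of the failure mode the new check guards against (values chosen for illustration only):

```python
# Hypothetical configuration violating the invariant:
tokens_per_chunk, chunk_overlap = 4, 4

# This is how far start_idx advances on each loop iteration.
stride = tokens_per_chunk - chunk_overlap  # 0

# With a zero (or negative) stride, `start_idx < len(input_ids)` stays
# true forever, so the pre-fix split_text_on_tokens never terminates.
print(stride)  # 0
```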