fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter #32205


Open · wants to merge 7 commits into base: master
13 changes: 10 additions & 3 deletions libs/text-splitters/langchain_text_splitters/base.py
```diff
@@ -338,13 +338,20 @@ def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
     splits: list[str] = []
     input_ids = tokenizer.encode(text)
     start_idx = 0
+    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
+        msg = "tokens_per_chunk must be greater than chunk_overlap"
+        raise ValueError(msg)
     cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
     chunk_ids = input_ids[start_idx:cur_idx]
```
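For context on the new guard: the loop shown further down advances start_idx by tokens_per_chunk - chunk_overlap on each iteration, so whenever that difference is zero or negative, start_idx can never reach len(input_ids). A minimal sketch of the arithmetic (the values here are illustrative, not from the PR):

```python
# Why tokens_per_chunk <= chunk_overlap used to hang: the window only
# advances by the difference between the two settings.
tokens_per_chunk = 5
chunk_overlap = 5  # overlap equal to (or larger than) the chunk size

step = tokens_per_chunk - chunk_overlap
print(step)  # 0, so "start_idx += step" makes no progress: infinite loop
```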
Comment on lines 348 to 349 (Copilot AI, Aug 11, 2025):

This calculation is duplicated from line 348. Consider calculating cur_idx and chunk_ids once at the beginning of each loop iteration to avoid redundancy.

Suggested change (the empty replacement deletes these lines):

```diff
-cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-chunk_ids = input_ids[start_idx:cur_idx]
```
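Applied to the patched function, the reviewer's suggestion would leave the window computed in exactly one place. A sketch of what that might look like (not the committed code; it assumes the Tokenizer dataclass already defined in base.py):

```python
def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    """Split incoming text and return chunks using tokenizer."""
    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
        msg = "tokens_per_chunk must be greater than chunk_overlap"
        raise ValueError(msg)
    splits: list[str] = []
    input_ids = tokenizer.encode(text)
    start_idx = 0
    while start_idx < len(input_ids):
        # Window computed once per iteration, as the review suggests.
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        chunk_ids = input_ids[start_idx:cur_idx]
        if not chunk_ids:
            break
        decoded = tokenizer.decode(chunk_ids)
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break
        start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    return splits
```

Dropping the pre-loop pair would be behavior-preserving, since the first loop iteration recomputes the same values from start_idx = 0.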

Comment on lines 348 to 349 (Copilot AI, Aug 11, 2025):

This line duplicates the calculation from before the while loop. The redundant calculations of cur_idx and chunk_ids should be removed to improve code clarity.

Suggested change (the empty replacement deletes these lines):

```diff
-cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-chunk_ids = input_ids[start_idx:cur_idx]
```

```diff
     while start_idx < len(input_ids):
-        splits.append(tokenizer.decode(chunk_ids))
+        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
+        chunk_ids = input_ids[start_idx:cur_idx]
+        if not chunk_ids:
+            break
+        decoded = tokenizer.decode(chunk_ids)
+        if decoded:
+            splits.append(decoded)
         if cur_idx == len(input_ids):
             break
         start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
-        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-        chunk_ids = input_ids[start_idx:cur_idx]
     return splits
```
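A rough usage sketch of the patched function, assuming Tokenizer and split_text_on_tokens are importable from langchain_text_splitters.base; the character-level encode/decode lambdas are stand-ins for a real tokenizer:

```python
from langchain_text_splitters.base import Tokenizer, split_text_on_tokens

# Character-level stand-in: one token per character.
tok = Tokenizer(
    chunk_overlap=2,
    tokens_per_chunk=5,
    decode=lambda ids: "".join(chr(i) for i in ids),
    encode=lambda text: [ord(c) for c in text],
)

print(split_text_on_tokens(text="hello world", tokenizer=tok))
# ['hello', 'lo wo', 'world']: adjacent chunks share a 2-token overlap

# With this PR, an overlap >= the chunk size fails fast instead of looping:
bad = Tokenizer(
    chunk_overlap=5,
    tokens_per_chunk=5,
    decode=tok.decode,
    encode=tok.encode,
)
# split_text_on_tokens(text="hello world", tokenizer=bad)  # raises ValueError
```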