fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter #32205

Open
wants to merge 7 commits into
base: master

with1015
Contributor

Description

  1. Add validation that tokenizer.tokens_per_chunk > tokenizer.chunk_overlap, because the splitter loops forever when tokens_per_chunk <= chunk_overlap
  2. Skip empty decoded chunks instead of appending them to the splitter output
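
For illustration, the two changes could look roughly like the sketch below. This is not the PR's actual diff: `SketchTokenizer` and `split_on_tokens_sketch` are hypothetical stand-ins that only mirror the attribute names mentioned above (`tokens_per_chunk`, `chunk_overlap`), and the error message is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SketchTokenizer:
    """Hypothetical stand-in for the splitter's tokenizer configuration."""

    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]


def split_on_tokens_sketch(text: str, tokenizer: SketchTokenizer) -> list[str]:
    # Change 1: reject configurations whose stride would be zero or negative,
    # since the start index would then never move forward.
    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
        raise ValueError("tokens_per_chunk must be greater than chunk_overlap")

    splits: list[str] = []
    input_ids = tokenizer.encode(text)
    start_idx = 0
    while start_idx < len(input_ids):
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        decoded = tokenizer.decode(input_ids[start_idx:cur_idx])
        # Change 2: skip chunks that decode to an empty string.
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break
        start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    return splits


# Character-level toy tokenizer, used only to exercise the sketch.
char_tokenizer = SketchTokenizer(
    chunk_overlap=1,
    tokens_per_chunk=3,
    decode=lambda ids: "".join(chr(i) for i in ids),
    encode=lambda s: [ord(c) for c in s],
)
print(split_on_tokens_sketch("abcdefg", char_tokenizer))
```

With the toy tokenizer the window advances by two tokens per step, so neighbouring chunks share one character.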

vercel bot commented Jul 23, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Project | Deployment | Preview | Comments | Updated (UTC)
langchain | ⬜️ Ignored | Preview | | Aug 11, 2025 10:33pm

@with1015 with1015 changed the title from "chore(langchain): add validation to prevent infinite loop and prevent empty token splitter" to "fix(langchain): add validation to prevent infinite loop and prevent empty token splitter" Jul 23, 2025

codspeed-hq bot commented Jul 23, 2025

CodSpeed WallTime Performance Report

Merging #32205 will not alter performance

Comparing with1015:chore/text-splitter-validation (50cdebc) with master (8b663ed)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

✅ 13 untouched benchmarks

codspeed-hq bot commented Jul 23, 2025

CodSpeed Instrumentation Performance Report

Merging #32205 will not alter performance

Comparing with1015:chore/text-splitter-validation (50cdebc) with master (8b663ed)

Summary

✅ 14 untouched benchmarks

Collaborator

@eyurtsev eyurtsev left a comment

Could you add a unit test?

@eyurtsev eyurtsev changed the title from "fix(langchain): add validation to prevent infinite loop and prevent empty token splitter" to "fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter" Jul 23, 2025
@eyurtsev eyurtsev self-assigned this Jul 23, 2025
@with1015 with1015 marked this pull request as draft July 27, 2025 04:41
@with1015 with1015 marked this pull request as ready for review July 31, 2025 14:25
@with1015 with1015 requested a review from eyurtsev August 2, 2025 15:29
@with1015
Contributor Author

with1015 commented Aug 6, 2025

@eyurtsev @mdrxy
Could you review this PR?

@mdrxy mdrxy requested a review from Copilot August 11, 2025 22:33
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds validation to the split_text_on_tokens function to prevent infinite loops and empty chunks. The changes address two specific issues: preventing infinite loops when tokens_per_chunk <= chunk_overlap, and avoiding empty decoded chunks in the output.

  • Adds validation to ensure tokens_per_chunk > chunk_overlap to prevent infinite loops
  • Filters out empty decoded chunks from the result list
  • Includes test coverage for the empty decode scenario
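
To see why `tokens_per_chunk <= chunk_overlap` never terminates, it helps to trace the window positions. The sketch below is a hypothetical helper (`chunk_windows` is not part of the library) that assumes the splitter advances its start index by `tokens_per_chunk - chunk_overlap` each step, and caps the number of iterations so the non-terminating case can be observed safely.

```python
def chunk_windows(
    n_tokens: int,
    tokens_per_chunk: int,
    chunk_overlap: int,
    max_iterations: int = 10,
) -> tuple[list[tuple[int, int]], bool]:
    """Return the (start, end) windows visited and whether the loop finished.

    The iteration cap exists only so that a non-advancing configuration
    can be demonstrated without hanging.
    """
    windows: list[tuple[int, int]] = []
    start_idx = 0
    for _ in range(max_iterations):
        if start_idx >= n_tokens:
            return windows, True
        cur_idx = min(start_idx + tokens_per_chunk, n_tokens)
        windows.append((start_idx, cur_idx))
        if cur_idx == n_tokens:
            return windows, True
        # The stride: zero or negative when tokens_per_chunk <= chunk_overlap.
        start_idx += tokens_per_chunk - chunk_overlap
    return windows, False


# Positive stride (4 - 1 = 3): the window reaches the end of the input.
print(chunk_windows(10, tokens_per_chunk=4, chunk_overlap=1))
# Zero stride (4 - 4 = 0): the same window repeats until the cap is hit.
print(chunk_windows(10, tokens_per_chunk=4, chunk_overlap=4, max_iterations=5))
```

With a zero stride the start index stays at 0, so without the cap (or the new validation) the loop would keep emitting the first window indefinitely.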

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
libs/text-splitters/langchain_text_splitters/base.py | Adds validation and empty chunk filtering to split_text_on_tokens function
libs/text-splitters/tests/unit_tests/test_text_splitters.py | Adds test case for empty decode scenario

Comment on lines 348 to 349
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]

Copilot AI Aug 11, 2025

This calculation is duplicated from line 348. Consider calculating cur_idx and chunk_ids once at the beginning of each loop iteration to avoid redundancy.

Suggested change
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]

Copilot uses AI. Check for mistakes.

Comment on lines 348 to 349
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]

Copilot AI Aug 11, 2025

This line duplicates the calculation from before the while loop. The redundant calculations of cur_idx and chunk_ids should be removed to improve code clarity.

Suggested change
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
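
One way to act on both comments is to compute the window once, at the top of each iteration. The sketch below compares the two loop shapes on plain integer lists. The function names are hypothetical, the "duplicated" shape is an assumption inferred from the quoted lines rather than the PR's exact code, and both functions assume `tokens_per_chunk > chunk_overlap`.

```python
def slices_duplicated(
    input_ids: list[int], tokens_per_chunk: int, chunk_overlap: int
) -> list[list[int]]:
    """Window computed before the loop and again at the end of each iteration."""
    chunks: list[list[int]] = []
    start_idx = 0
    cur_idx = min(start_idx + tokens_per_chunk, len(input_ids))
    chunk_ids = input_ids[start_idx:cur_idx]
    while start_idx < len(input_ids):
        chunks.append(chunk_ids)
        if cur_idx == len(input_ids):
            break
        start_idx += tokens_per_chunk - chunk_overlap
        cur_idx = min(start_idx + tokens_per_chunk, len(input_ids))
        chunk_ids = input_ids[start_idx:cur_idx]
    return chunks


def slices_single(
    input_ids: list[int], tokens_per_chunk: int, chunk_overlap: int
) -> list[list[int]]:
    """Window computed once, at the top of each iteration."""
    chunks: list[list[int]] = []
    start_idx = 0
    while start_idx < len(input_ids):
        cur_idx = min(start_idx + tokens_per_chunk, len(input_ids))
        chunks.append(input_ids[start_idx:cur_idx])
        if cur_idx == len(input_ids):
            break
        start_idx += tokens_per_chunk - chunk_overlap
    return chunks


ids = list(range(10))
print(slices_single(ids, tokens_per_chunk=4, chunk_overlap=1))
print(slices_duplicated(ids, 4, 1) == slices_single(ids, 4, 1))
```

Both shapes visit the same windows, so the single-computation form changes readability only, not behavior.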

@@ -2715,6 +2715,21 @@ def test_split_text_on_tokens() -> None:
    assert output == expected_output


def test_decode_returns_no_chunks() -> None:
    """Test that whitespace-only input results in empty output, not ['']."""

Copilot AI Aug 11, 2025

The docstring is inaccurate. The test input 'foo bar baz 123' is not whitespace-only, and the test is not about whitespace handling but about filtering empty decoded strings.

Suggested change
"""Test that whitespace-only input results in empty output, not ['']."""
"""Test that when decode returns only empty strings, output is empty, not ['']."""

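
A test matching the corrected docstring might look like the sketch below. It is self-contained and does not reproduce the PR's test body: `split_filtering_empty` is a hypothetical stand-in for the patched splitter, and the character-level `encode` and always-empty `decode` are assumptions chosen to trigger the scenario. Only the input string 'foo bar baz 123' comes from the review comment.

```python
from typing import Callable


def split_filtering_empty(
    text: str,
    encode: Callable[[str], list[int]],
    decode: Callable[[list[int]], str],
    tokens_per_chunk: int,
    chunk_overlap: int,
) -> list[str]:
    """Hypothetical splitter that drops chunks whose decoded text is empty."""
    if tokens_per_chunk <= chunk_overlap:
        raise ValueError("tokens_per_chunk must be greater than chunk_overlap")
    input_ids = encode(text)
    splits: list[str] = []
    start_idx = 0
    while start_idx < len(input_ids):
        cur_idx = min(start_idx + tokens_per_chunk, len(input_ids))
        decoded = decode(input_ids[start_idx:cur_idx])
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break
        start_idx += tokens_per_chunk - chunk_overlap
    return splits


def test_decode_returns_no_chunks_sketch() -> None:
    """When decode returns only empty strings, the output is [], not ['']."""
    output = split_filtering_empty(
        "foo bar baz 123",
        encode=lambda text: [ord(char) for char in text],
        decode=lambda ids: "",  # every chunk decodes to an empty string
        tokens_per_chunk=10,
        chunk_overlap=0,
    )
    assert output == []


test_decode_returns_no_chunks_sketch()
```

The input is deliberately not whitespace-only, which is the point of the docstring correction: the empty output comes from the decoder, not from the text.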
3 participants