fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter #32205
Conversation
CodSpeed WallTime Performance Report: Merging #32205 will not alter performance.
CodSpeed Instrumentation Performance Report: Merging #32205 will not alter performance.
Could you add a unit test?
Pull Request Overview
This PR adds validation to the split_text_on_tokens function to prevent infinite loops and empty chunks. The changes address two specific issues: preventing infinite loops when tokens_per_chunk <= chunk_overlap, and avoiding empty decoded chunks in the output.
- Adds validation to ensure tokens_per_chunk > chunk_overlap to prevent infinite loops
- Filters out empty decoded chunks from the result list
- Includes test coverage for the empty decode scenario (a sketch of the patched function follows this list)
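The PR diff itself is not reproduced on this page. As a point of reference, here is a minimal sketch of what split_text_on_tokens in libs/text-splitters/langchain_text_splitters/base.py could look like with both changes applied; the exact exception type and message, and the placement of the empty-chunk filter, are assumptions rather than the PR's actual code.

```python
from langchain_text_splitters.base import Tokenizer


def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    """Split incoming text into chunks using the given tokenizer."""
    # Assumed form of the new validation: a non-positive stride
    # (tokens_per_chunk - chunk_overlap <= 0) means start_idx can never
    # advance, so the loop below would never terminate.
    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
        raise ValueError("tokens_per_chunk must be greater than chunk_overlap")
    splits: list[str] = []
    input_ids = tokenizer.encode(text)
    start_idx = 0
    cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
    chunk_ids = input_ids[start_idx:cur_idx]
    while start_idx < len(input_ids):
        decoded = tokenizer.decode(chunk_ids)
        # Assumed form of the second fix: skip empty decoded chunks
        # instead of appending "" to the result list.
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break
        start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        chunk_ids = input_ids[start_idx:cur_idx]
    return splits
```

This sketch keeps the structure of the current implementation, including the cur_idx/chunk_ids computation both before and inside the loop, which the review comments below pick up on.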
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description
---|---
libs/text-splitters/langchain_text_splitters/base.py | Adds validation and empty chunk filtering to the split_text_on_tokens function
libs/text-splitters/tests/unit_tests/test_text_splitters.py | Adds a test case for the empty decode scenario
```python
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
```
This calculation is duplicated from line 348. Consider calculating cur_idx and chunk_ids once at the beginning of each loop iteration to avoid redundancy.
Suggested change:

```diff
-cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-chunk_ids = input_ids[start_idx:cur_idx]
```
```python
cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
```
This line duplicates the calculation from before the while loop. The redundant calculations of cur_idx and chunk_ids should be removed to improve code clarity.
Suggested change:

```diff
-cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
-chunk_ids = input_ids[start_idx:cur_idx]
```
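Both Copilot comments point at the same refactor: compute cur_idx and chunk_ids exactly once, at the top of each loop iteration. A minimal sketch of that shape (not the PR's actual diff; the empty-chunk filter from this PR is kept in):

```python
splits: list[str] = []
input_ids = tokenizer.encode(text)
start_idx = 0
while start_idx < len(input_ids):
    # Chunk bounds computed once per iteration, instead of before the
    # loop and again at the bottom of the loop body.
    cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
    chunk_ids = input_ids[start_idx:cur_idx]
    decoded = tokenizer.decode(chunk_ids)
    if decoded:
        splits.append(decoded)
    if cur_idx == len(input_ids):
        break
    start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
```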
```diff
@@ -2715,6 +2715,21 @@ def test_split_text_on_tokens() -> None:
     assert output == expected_output


+def test_decode_returns_no_chunks() -> None:
+    """Test that whitespace-only input results in empty output, not ['']."""
```
The docstring is inaccurate. The test input 'foo bar baz 123' is not whitespace-only, and the test is not about whitespace handling but about filtering empty decoded strings.
"""Test that whitespace-only input results in empty output, not [''].""" | |
"""Test that when decode returns only empty strings, output is empty, not [''].""" |
Description
The splitter now requires tokenizer.tokens_per_chunk > tokenizer.chunk_overlap: when that invariant is violated, the window stride (tokens_per_chunk - chunk_overlap) is zero or negative, so start_idx never advances and split_text_on_tokens loops forever.
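A quick illustration of the failure mode the new check guards against (values chosen for illustration only):

```python
# Hypothetical configuration violating the invariant:
tokens_per_chunk, chunk_overlap = 4, 4

# This is how far start_idx advances on each loop iteration.
stride = tokens_per_chunk - chunk_overlap  # 0

# With a zero (or negative) stride, `start_idx < len(input_ids)` stays
# true forever, so the pre-fix split_text_on_tokens never terminates.
print(stride)  # 0
```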