
Conversation


@dcdmm dcdmm commented Aug 19, 2025

Description: Fix non-ASCII character handling in semantic search and add comprehensive test coverage

This PR addresses an issue where non-ASCII characters (Chinese, Japanese, Korean, etc.) were not properly handled during semantic search operations due to ASCII encoding in the text extraction process.

Root Cause:
The get_text_at_path() function in langgraph/store/base/embed.py was using json.dumps(obj, sort_keys=True) which defaults to ensure_ascii=True. This caused non-ASCII characters to be escaped as Unicode sequences (e.g., "这是中文" became "\u8fd9\u662f\u4e2d\u6587"), leading to poor semantic search accuracy for multilingual content.
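
The behavior is easy to reproduce with the standard library alone:

```python
import json

# Default ensure_ascii=True escapes non-ASCII characters:
print(json.dumps({"text": "这是中文"}, sort_keys=True))
# {"text": "\u8fd9\u662f\u4e2d\u6587"}

# ensure_ascii=False preserves the original characters:
print(json.dumps({"text": "这是中文"}, sort_keys=True, ensure_ascii=False))
# {"text": "这是中文"}
```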

Fix:
Modified the json.dumps() calls in get_text_at_path() to include ensure_ascii=False, preserving the original Unicode characters during text extraction for embedding generation.

Changes:

  • Updated json.dumps(obj, sort_keys=True) to json.dumps(obj, sort_keys=True, ensure_ascii=False) in get_text_at_path() function
  • Added comprehensive test test_Non_ASCII_semantic_search() to validate semantic search accuracy across multiple languages
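
In diff form, the change from the first bullet is essentially (surrounding context elided):

```diff
- json.dumps(obj, sort_keys=True)
+ json.dumps(obj, sort_keys=True, ensure_ascii=False)
```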

Test Coverage:
The new test validates:

  • English text semantic search: "This is English"
  • Chinese text semantic search: "这是中文"
  • Japanese text semantic search: "これは日本語です"
  • Korean text semantic search: ["남극 대륙", "남미"]
  • Verification that the top result for each language's query is correct, with a similarity score greater than 0.999 (see the sketch below)
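
A condensed sketch of the shape of such a test. The function body and the deterministic character-level embedding here are illustrative stand-ins, not the PR's actual code:

```python
from langgraph.store.memory import InMemoryStore


def char_embed(texts: list[str]) -> list[list[float]]:
    # Deterministic character-frequency vectors: no external API needed,
    # and identical texts always map to identical unit-norm vectors.
    out = []
    for text in texts:
        vec = [0.0] * 128
        for ch in text:
            vec[ord(ch) % 128] += 1.0
        norm = sum(v * v for v in vec) ** 0.5 or 1.0
        out.append([v / norm for v in vec])
    return out


def test_Non_ASCII_semantic_search() -> None:
    store = InMemoryStore(index={"dims": 128, "embed": char_embed, "fields": ["text"]})
    samples = ["This is English", "这是中文", "これは日本語です", "남극 대륙"]
    for i, text in enumerate(samples):
        store.put(("user_123", "memories"), str(i), {"text": text}, index=["text"])
    for text in samples:
        results = store.search(("user_123", "memories"), query=text, limit=3)
        # An exact match should come back first with near-perfect similarity.
        assert results[0].value["text"] == text
        assert results[0].score is not None and results[0].score > 0.999
```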

Impact:
This fix ensures that multilingual applications using LangGraph's semantic search functionality will now correctly handle non-ASCII characters, providing accurate search results for global content without degradation in similarity scoring.

Testing:
All test cases return the correct top semantic search result with a similarity score greater than 0.999, confirming the fix resolves the Unicode handling issue.

Issue: #5946

Dependencies:
None

- Add comprehensive test for non-ASCII semantic search functionality
- Verify high similarity scores (>0.999) for exact matches in multiple languages
- Test cases include English, Chinese, and Japanese text embeddings
- Ensure proper handling of Unicode characters in vector search operations

This fix addresses issues with semantic search accuracy when processing
non-ASCII text content, particularly for multilingual applications.


@dcdmm dcdmm changed the title from "fix(checkpoint): Fixed semantic search failure after converting non-ASCII characters to Unicode encoding. #5946" to "fix(checkpoint): Fixed semantic search failure after converting non-ASCII characters to Unicode encoding." Aug 19, 2025

dcdmm commented Aug 20, 2025

Hopefully this can be merged soon.


dcdmm commented Aug 20, 2025

```python
store.put(("user_123", "memories"), "3", {"text": ['汉朝很强大', '汉朝很伟大']}, index=["text"])

store.search(("user_123", "memories"), query="汉朝很强大", limit=3)
```

Without the fix, the example above returns an abnormally low similarity score.


dcdmm commented Aug 20, 2025

Only string values work properly; list and dict values produce incorrect results.
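
For illustration (assuming, as described above, that non-string values are serialized with json.dumps inside get_text_at_path()), the list from the earlier example was embedded in escaped form before the fix, so a Chinese query shares almost no characters with the stored text:

```python
import json

# Before the fix, a list field was serialized like this for embedding:
print(json.dumps(['汉朝很强大', '汉朝很伟大'], sort_keys=True))
# ["\u6c49\u671d\u5f88\u5f3a\u5927", "\u6c49\u671d\u5f88\u4f1f\u5927"]
```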

```diff
@@ -137,6 +137,34 @@ def test_get_text_at_path() -> None:
     assert get_text_at_path(nested_data, "{unclosed") == []
     assert get_text_at_path(nested_data, "nested[{invalid}]") == []

+def test_Non_ASCII_semantic_search():
+    from langgraph.store.memory import InMemoryStore
```


Please move this import to the top of the file with the other imports for better readability and maintainability.

Author:

This is my first Pull Request, I really like langgraph, thank you very much for the suggestion!

Author:

I have refined and optimized my test function and hope to get your approval soon.


```python
load_dotenv()

embeddings = OpenAIEmbeddings(
```


This dependency should be mocked in the test to ensure we're testing in isolation and not relying on external services.


Consider using CharacterEmbeddings.
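
CharacterEmbeddings appears to be a deterministic test helper in the repo; as a hypothetical sketch of the idea (not the repo's actual helper), assuming it follows the LangChain Embeddings interface:

```python
from langchain_core.embeddings import Embeddings


class DeterministicCharEmbeddings(Embeddings):
    # Hypothetical stand-in: hashes characters into a fixed-size
    # frequency vector, so tests need no network access and identical
    # texts always embed identically.
    def __init__(self, dims: int = 128) -> None:
        self.dims = dims

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        vec = [0.0] * self.dims
        for ch in text:
            vec[ord(ch) % self.dims] += 1.0
        norm = sum(v * v for v in vec) ** 0.5 or 1.0
        return [v / norm for v in vec]
```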

Author:

Received.


Author:

Hello, I have completed the changes and the tests pass. Can you merge it?

…expand test coverage

Replace OpenAIEmbeddings with mock CharacterEmbeddings in non-ASCII
semantic search tests to remove external API dependency.