
Conversation


@dcdmm dcdmm commented Aug 19, 2025

Description: Fix non-ASCII character handling in semantic search and add comprehensive test coverage

This PR addresses an issue where non-ASCII characters (Chinese, Japanese, Korean, etc.) were not properly handled during semantic search operations due to ASCII encoding in the text extraction process.

Root Cause:
The get_text_at_path() function in langgraph/store/base/embed.py was using json.dumps(obj, sort_keys=True) which defaults to ensure_ascii=True. This caused non-ASCII characters to be escaped as Unicode sequences (e.g., "这是中文" became "\u8fd9\u662f\u4e2d\u6587"), leading to poor semantic search accuracy for multilingual content.
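
The behavior is easy to reproduce with the standard library alone:

```python
import json

# Default ensure_ascii=True escapes non-ASCII characters:
print(json.dumps({"text": "这是中文"}, sort_keys=True))
# {"text": "\u8fd9\u662f\u4e2d\u6587"}

# ensure_ascii=False preserves the original characters:
print(json.dumps({"text": "这是中文"}, sort_keys=True, ensure_ascii=False))
# {"text": "这是中文"}
```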

Fix:
Modified the json.dumps() calls in get_text_at_path() to include ensure_ascii=False, preserving the original Unicode characters during text extraction for embedding generation.

Changes:

  • Updated json.dumps(obj, sort_keys=True) to json.dumps(obj, sort_keys=True, ensure_ascii=False) in get_text_at_path() function
  • Added comprehensive test test_Non_ASCII_semantic_search() to validate semantic search accuracy across multiple languages
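
In diff form, the change from the first bullet is essentially (surrounding context elided):

```diff
- json.dumps(obj, sort_keys=True)
+ json.dumps(obj, sort_keys=True, ensure_ascii=False)
```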

Test Coverage:
The new test validates:

  • English text semantic search: "This is English"
  • Chinese text semantic search: "这是中文"
  • Japanese text semantic search: "これは日本語です"
  • Korean text semantic search: ["남극 대륙", "남미"]
  • Verification that the top result for each language's query is correct, with a similarity score greater than 0.999 (see the sketch below)
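
A condensed sketch of the shape of such a test. The function body and the deterministic character-level embedding here are illustrative stand-ins, not the PR's actual code:

```python
from langgraph.store.memory import InMemoryStore


def char_embed(texts: list[str]) -> list[list[float]]:
    # Deterministic character-frequency vectors: no external API needed,
    # and identical texts always map to identical unit-norm vectors.
    out = []
    for text in texts:
        vec = [0.0] * 128
        for ch in text:
            vec[ord(ch) % 128] += 1.0
        norm = sum(v * v for v in vec) ** 0.5 or 1.0
        out.append([v / norm for v in vec])
    return out


def test_Non_ASCII_semantic_search() -> None:
    store = InMemoryStore(index={"dims": 128, "embed": char_embed, "fields": ["text"]})
    samples = ["This is English", "这是中文", "これは日本語です", "남극 대륙"]
    for i, text in enumerate(samples):
        store.put(("user_123", "memories"), str(i), {"text": text}, index=["text"])
    for text in samples:
        results = store.search(("user_123", "memories"), query=text, limit=3)
        # An exact match should come back first with near-perfect similarity.
        assert results[0].value["text"] == text
        assert results[0].score is not None and results[0].score > 0.999
```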

Impact:
This fix ensures that multilingual applications using LangGraph's semantic search functionality will now correctly handle non-ASCII characters, providing accurate search results for global content without degradation in similarity scoring.

Testing:
All test cases return the correct top semantic search result with a similarity score greater than 0.999, confirming the fix resolves the Unicode handling issue.

Issue: #5946

Dependencies:
None

- Add comprehensive test for non-ASCII semantic search functionality
- Verify high similarity scores (>0.999) for exact matches in multiple languages
- Test cases include English, Chinese, and Japanese text embeddings
- Ensure proper handling of Unicode characters in vector search operations

This fix addresses issues with semantic search accuracy when processing
non-ASCII text content, particularly for multilingual applications.


@dcdmm dcdmm changed the title from "fix(checkpoint): Fixed semantic search failure after converting non-ASCII characters to Unicode encoding. #5946" to "fix(checkpoint): Fixed semantic search failure after converting non-ASCII characters to Unicode encoding." Aug 19, 2025

dcdmm commented Aug 20, 2025

Hopefully this can be merged soon.


dcdmm commented Aug 20, 2025

```python
store.put(("user_123", "memories"), "3", {"text": ['汉朝很强大', '汉朝很伟大']}, index=["text"])

store.search(("user_123", "memories"), query="汉朝很强大", limit=3)
```

Without the fix, the example above returns an abnormally low similarity score.


dcdmm commented Aug 20, 2025

Only string values work properly; list and dict values produce incorrect results.
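
For illustration (assuming, as described above, that non-string values are serialized with json.dumps inside get_text_at_path()), the list from the earlier example was embedded in escaped form before the fix, so a Chinese query shares almost no characters with the stored text:

```python
import json

# Before the fix, a list field was serialized like this for embedding:
print(json.dumps(['汉朝很强大', '汉朝很伟大'], sort_keys=True))
# ["\u6c49\u671d\u5f88\u5f3a\u5927", "\u6c49\u671d\u5f88\u4f1f\u5927"]
```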

```diff
@@ -137,6 +137,34 @@ def test_get_text_at_path() -> None:
     assert get_text_at_path(nested_data, "{unclosed") == []
     assert get_text_at_path(nested_data, "nested[{invalid}]") == []

+def test_Non_ASCII_semantic_search():
+    from langgraph.store.memory import InMemoryStore
```


Please move this import to the top of the file with the other imports for better readability and maintainability.

Author:

This is my first Pull Request, I really like langgraph, thank you very much for the suggestion!

Author:

I have refined and optimized my test function and hope to get your approval soon.


```python
load_dotenv()

embeddings = OpenAIEmbeddings(
```


This dependency should be mocked in the test to ensure we're testing in isolation and not relying on external services.


Consider using CharacterEmbeddings.
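
CharacterEmbeddings appears to be a deterministic test helper in the repo; as a hypothetical sketch of the idea (not the repo's actual helper), assuming it follows the LangChain Embeddings interface:

```python
from langchain_core.embeddings import Embeddings


class DeterministicCharEmbeddings(Embeddings):
    # Hypothetical stand-in: hashes characters into a fixed-size
    # frequency vector, so tests need no network access and identical
    # texts always embed identically.
    def __init__(self, dims: int = 128) -> None:
        self.dims = dims

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        vec = [0.0] * self.dims
        for ch in text:
            vec[ord(ch) % self.dims] += 1.0
        norm = sum(v * v for v in vec) ** 0.5 or 1.0
        return [v / norm for v in vec]
```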

Author:

Received.


Author:

Hello, I have completed the changes and the tests pass. Can you merge it?

…expand test coverage

Replace OpenAIEmbeddings with mock CharacterEmbeddings in non-ASCII
semantic search tests to remove external API dependency.