fix(checkpoint): fix semantic search failure caused by non-ASCII characters being escaped to Unicode sequences #5947
base: main
Conversation
- Add comprehensive test for non-ASCII semantic search functionality
- Verify high similarity scores (>0.999) for exact matches in multiple languages
- Test cases include English, Chinese, and Japanese text embeddings
- Ensure proper handling of Unicode characters in vector search operations

This fix addresses issues with semantic search accuracy when processing non-ASCII text content, particularly for multilingual applications.
Hopefully it can be merged and fixed soon.
`store.put(("user_123", "memories"), "3", {"text": ['汉朝很强大', '汉朝很伟大']}, index=["text"])`
`store.search(("user_123", "memories"), query="汉朝很强大", limit=3)`

Without the fix, the example above returns an abnormally small similarity score.
Only string values work properly; list and dict values produce wrong results.
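For reference, here is a minimal, self-contained way to reproduce the problem without an external embedding service. The `fake_embed` function below is a hypothetical stand-in, not part of langgraph; it only guarantees that identical strings map to identical vectors:

```python
from langgraph.store.memory import InMemoryStore


def fake_embed(texts: list[str]) -> list[list[float]]:
    # Hypothetical deterministic embedding: bucket character codes into a
    # fixed-size frequency vector, so identical strings get identical vectors.
    dims = 64
    vectors = []
    for text in texts:
        vec = [0.0] * dims
        for ch in text:
            vec[ord(ch) % dims] += 1.0
        vectors.append(vec)
    return vectors


store = InMemoryStore(index={"embed": fake_embed, "dims": 64})
store.put(("user_123", "memories"), "3", {"text": ['汉朝很强大', '汉朝很伟大']}, index=["text"])
results = store.search(("user_123", "memories"), query="汉朝很强大", limit=3)

# Before the fix, the stored list is serialized with escaped \uXXXX sequences,
# so its embedding shares almost nothing with the raw CJK query and the score
# is abnormally low; after the fix the score is high.
print(results[0].score)
```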
libs/checkpoint/tests/test_store.py (Outdated)

```diff
@@ -137,6 +137,34 @@ def test_get_text_at_path() -> None:
     assert get_text_at_path(nested_data, "{unclosed") == []
     assert get_text_at_path(nested_data, "nested[{invalid}]") == []

+def test_Non_ASCII_semantic_search():
+    from langgraph.store.memory import InMemoryStore
```
Please move this import to the top of the file with the other imports for better readability and maintainability.
This is my first pull request, and I really like langgraph. Thank you very much for the suggestion!
I have refined and optimized my test function and hope to get your approval soon.
libs/checkpoint/tests/test_store.py (Outdated)

```python
    load_dotenv()

    embeddings = OpenAIEmbeddings(
```
This dependency should be mocked in the test to ensure we're testing in isolation and not relying on external services.
Consider using `CharacterEmbeddings`.
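For context, a deterministic, network-free test embedding along these lines would serve. This is a sketch under the assumption that `CharacterEmbeddings` is a character-frequency test double; the repo's actual helper may be implemented differently:

```python
from langchain_core.embeddings import Embeddings


class CharacterEmbeddings(Embeddings):
    """Deterministic character-frequency embeddings for tests (sketch)."""

    def __init__(self, dims: int = 500) -> None:
        self.dims = dims

    def _embed(self, text: str) -> list[float]:
        # Bucket each character's code point into a fixed-size vector, so
        # identical strings always produce identical embeddings.
        vec = [0.0] * self.dims
        for ch in text:
            vec[ord(ch) % self.dims] += 1.0
        return vec

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._embed(text)
```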
Received.
I have refined and optimized my test function and hope to get your approval soon.
Hello, I have completed the modification and the tests pass. Could you merge it?
…expand test coverage

Replace OpenAIEmbeddings with a mock CharacterEmbeddings in the non-ASCII semantic search tests to remove the external API dependency.
Description: Fix non-ASCII character handling in semantic search and add comprehensive test coverage
This PR addresses an issue where non-ASCII characters (Chinese, Japanese, Korean, etc.) were not properly handled during semantic search operations due to ASCII encoding in the text extraction process.
Root Cause:
The `get_text_at_path()` function in `langgraph/store/base/embed.py` was using `json.dumps(obj, sort_keys=True)`, which defaults to `ensure_ascii=True`. This caused non-ASCII characters to be escaped as Unicode sequences (e.g., "这是中文" became "\u8fd9\u662f\u4e2d\u6587"), leading to poor semantic search accuracy for multilingual content.
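To make the escaping concrete, the behavior can be seen directly with the standard-library `json` module (nothing here is langgraph-specific):

```python
import json

# Default: non-ASCII characters are escaped to \uXXXX sequences.
print(json.dumps({"text": "这是中文"}, sort_keys=True))
# -> {"text": "\u8fd9\u662f\u4e2d\u6587"}

# With ensure_ascii=False, the original characters are preserved.
print(json.dumps({"text": "这是中文"}, sort_keys=True, ensure_ascii=False))
# -> {"text": "这是中文"}
```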
Fix:
Modified the `json.dumps()` calls in `get_text_at_path()` to include `ensure_ascii=False`, preserving the original Unicode characters during text extraction for embedding generation.
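In isolation, the change amounts to the following. This is a simplified sketch with a hypothetical helper name; the real `get_text_at_path()` also handles nested paths and filter expressions:

```python
import json


def _serialize_for_embedding(obj) -> str:
    # Strings pass through untouched; everything else is JSON-serialized
    # without escaping non-ASCII characters, so the embedding model sees
    # the original multilingual text.
    if isinstance(obj, str):
        return obj
    return json.dumps(obj, sort_keys=True, ensure_ascii=False)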
Changes:
- Changed `json.dumps(obj, sort_keys=True)` to `json.dumps(obj, sort_keys=True, ensure_ascii=False)` in the `get_text_at_path()` function
- Added `test_Non_ASCII_semantic_search()` to validate semantic search accuracy across multiple languages

Test Coverage:
The new test validates:
- Exact-match similarity scores above 0.999 for English, Chinese, and Japanese text
- Proper handling of Unicode characters in vector search operations

A hedged sketch of what such a test can look like follows below.
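This sketch reuses the hypothetical `CharacterEmbeddings` double sketched earlier; the PR's actual test may differ in names and details:

```python
from langgraph.store.memory import InMemoryStore


def test_Non_ASCII_semantic_search() -> None:
    embeddings = CharacterEmbeddings(dims=500)  # deterministic test double
    store = InMemoryStore(index={"embed": embeddings, "dims": 500})

    samples = {
        "en": "hello world",
        "zh": "这是中文",
        "ja": "これは日本語です",
    }
    for key, text in samples.items():
        store.put(("user", "memories"), key, {"text": text}, index=["text"])
    # A list value exercises the json.dumps() path that the fix changes.
    store.put(("user", "memories"), "zh_list",
              {"text": ["汉朝很强大", "汉朝很伟大"]}, index=["text"])

    # Exact-match queries should score near 1.0 in every language.
    for key, text in samples.items():
        results = store.search(("user", "memories"), query=text, limit=1)
        assert results[0].key == key
        assert results[0].score > 0.999

    # Before the fix, the list value was embedded as escaped \uXXXX text and
    # scored abnormally low; after the fix it is the top hit for a CJK query.
    results = store.search(("user", "memories"), query="汉朝很强大", limit=1)
    assert results[0].key == "zh_list"
```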
Impact:
This fix ensures that multilingual applications using LangGraph's semantic search functionality will now correctly handle non-ASCII characters, providing accurate search results for global content without degradation in similarity scoring.
Testing:
All tested value structures return correct semantic search results with similarity scores greater than 0.999, confirming that the fix resolves the Unicode handling issue.
Issue: #5946
Dependencies:
None