Implement typeahead search #1289
Merged
Commits
83 commits
| Commit | Message | Author |
|---|---|---|
| cdfcd64 | Add typeahead endpoints | farshidz |
| f857323 | Add API tests | farshidz |
| 356eed5 | Add stats endpoint | farshidz |
| f668ca9 | Implement typeahead | farshidz |
| d45eb57 | Working mvp | farshidz |
| b05939f | Rename endpoints | farshidz |
| fddb170 | Update tests | farshidz |
| ccc1ded | Improve error handling | farshidz |
| 4ba127e | Fix delete endpoint | farshidz |
| 138058f | Update query profiles | farshidz |
| 1e54727 | Improve fuzzy search | farshidz |
| 6df5be6 | Fix yql and improve the schema | farshidz |
| 8e4a812 | Use query hash as ID to avoid duplicate queries | farshidz |
| 5be71da | Implement delete endpoint | farshidz |
| da8fdd5 | Minor improvements | farshidz |
| 94bb181 | Store query as a string for bm25 | farshidz |
| cafdddb | Tokenize the query so we match different parts of the query | farshidz |
| cde876b | Add rank: filter for suffixes | farshidz |
| d6e5260 | Rename API parameters | farshidz |
| 60c0ab6 | Index words not suffixes | farshidz |
| d552b4b | Rename rank to popularity | farshidz |
| ab579fd | Add score modifiers | farshidz |
| 1cbcd57 | Change relevance to _score in response | farshidz |
| 348674f | Rename query to suggestion in suggestions response | farshidz |
| 46a5961 | Fix non-fuzzy query to be prefix | farshidz |
| d12e0c2 | Index all prefixes | farshidz |
| 188dc5c | Rename TypeaheadHandler to Typeahead | farshidz |
| 47304af | Fix bug in delete endpoint | farshidz |
| ca371cc | Add pydantic v2 base classes | farshidz |
| 155d834 | Refactor to use pydantic | farshidz |
| deee5c0 | More refactoring | farshidz |
| 75c4877 | Add file missed before for models | farshidz |
| 9290e6f | Store typeahead schema name in MarqoIndex | farshidz |
| 7532bb4 | Minor refactoring | farshidz |
| f76496e | Fix issues related to typeahead schema name | farshidz |
| 31d7a26 | Simplify | farshidz |
| 16c569c | Refactor create index | farshidz |
| ebdce4b | Remove inline imports | farshidz |
| 25fb86d | Improve validation | farshidz |
| f7d3db4 | Add TODOs | farshidz |
| 948489d | Fix circular import | farshidz |
| ec04d23 | Use blake3 for hashing | farshidz |
| 834e11e | Remove duplicate models | farshidz |
| 2224d47 | Add todo, update throttling config | farshidz |
| a06bc24 | Remove uninteded changes to marqo test base class | farshidz |
| bd8d60d | add metadata fields for display and filtering | papa99do |
| 43c99a0 | moved test_services_xml.py | papa99do |
| 0804351 | Add test for text normalisation | papa99do |
| 62586da | Use Pydantic models for type validation | papa99do |
| 1972fe9 | Use pydantic model for typeahead stats | papa99do |
| 27f4eac | Unit test index and query of typeahead | papa99do |
| fc08159 | add unit test for vespa_application_package change | papa99do |
| 36f830e | unit test typeahead_vespa_schema.py | papa99do |
| bbf22b6 | unit test index_management | papa99do |
| cc76fdd | test other methods in typeahead.py | papa99do |
| 1ed2bbb | add unit test for api method and pydantic models | papa99do |
| 59a7d2c | Support getting a list of queries from typeahead schema | papa99do |
| 5e0d77b | batch size limit of 128 | papa99do |
| ab5f71c | Add integration test | papa99do |
| 4992de1 | add integration test | papa99do |
| 91f6d91 | fix the integration test | papa99do |
| 27ca83e | api tests | papa99do |
| 619d3c1 | Merge branch 'mainline' into farshid/type-ahead | papa99do |
| a6ab39a | fix unit test | papa99do |
| a470de6 | fix the tests, there's still one failing. | papa99do |
| 4ee4474 | fix an integration test failing locally | papa99do |
| cbc9076 | bump the version | papa99do |
| 8ed078d | Create typeahead schema during Marqo bootstrapping. | papa99do |
| fea2cee | Merge branch 'mainline' into farshid/type-ahead | papa99do |
| d12c323 | fix the unit test | papa99do |
| 7207677 | fix the unit test | papa99do |
| aef3751 | address PR comments | papa99do |
| e8f6fdf | Add version check | papa99do |
| 76e5e98 | address review comments | papa99do |
| 6634e15 | address review comments | papa99do |
| 1cd5abd | extract feature support check logic to an annotation | papa99do |
| fc25342 | address final comments | papa99do |
| 77598e0 | fix unit tests | papa99do |
| 092c50e | support wildcard query | papa99do |
| 08e75fd | support empty query for typeahead | papa99do |
| 1210474 | Merge branch 'mainline' into farshid/type-ahead | wanliAlex |
| a673bed | Merge branch 'mainline' into farshid/type-ahead | papa99do |
| 011dc8c | Merge branch 'mainline' into farshid/type-ahead | papa99do |
    @@ -9,4 +9,5 @@
     opentelemetry-api==1.33.1
     opentelemetry-sdk==1.33.1
     
    -cachetools==6.1.0
    +cachetools==6.1.0
    +blake3==1.0.5
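The commit list mentions using a hash of the query as its ID to avoid duplicate queries, and this dependency change adds `blake3` for that purpose. The PR's actual hashing code is not visible in this view, so the following is only a rough sketch of the idea. It assumes a hypothetical helper named `query_id`, and it uses `hashlib.blake2b` from the standard library as a stand-in for BLAKE3 so that it runs without extra packages.

```python
import hashlib


def query_id(query: str) -> str:
    """Derive a deterministic ID from a query string (hypothetical helper)."""
    # Stand-in hash: blake2b from the standard library, not BLAKE3 itself.
    return hashlib.blake2b(query.encode("utf-8"), digest_size=16).hexdigest()


# The same query always maps to the same ID, so indexing it twice
# targets the same entry instead of creating a duplicate.
store = {}
for q in ["red shoes", "blue shirt", "red shoes"]:
    store[query_id(q)] = q

print(len(store))  # 2
```

Because the ID is a pure function of the query text, re-submitting a query behaves like an update rather than an insert.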
    @@ -0,0 +1,129 @@
    +"""Pydantic models for typeahead API requests and responses."""
    +
    +from typing import List, Optional, Dict
    +
    +from pydantic import Field, field_validator
    +
    +from marqo.base_model import ImmutableStrictBaseModelV2
    +from marqo.core.exceptions import InvalidArgumentError
    +from marqo.tensor_search.enums import EnvVars
    +from marqo.tensor_search.utils import read_env_vars_and_defaults_ints
    +
    +
    +class TypeaheadRequest(ImmutableStrictBaseModelV2):
    +    """Request model for typeahead suggestions."""
    +
    +    q: str = Field(..., description="Partial user search input")
    +    limit: int = Field(default=10, ge=0, description="Maximum number of suggestions to return")
    +    fuzzy_edit_distance: int = Field(
    +        default=2,
    +        ge=0,
    +        alias="fuzzyEditDistance",
    +        description="Maximum edit distance for fuzzy matching"
    +    )
    +    min_fuzzy_match_length: int = Field(
    +        default=3,
    +        ge=0,
    +        alias="minFuzzyMatchLength",
    +        description="Minimum length to switch to fuzzy matching"
    +    )
    +    popularity_weight: Optional[float] = Field(
    +        default=None,
    +        alias="popularityWeight",
    +        description="Weight for popularity score in ranking"
    +    )
    +    bm25_weight: Optional[float] = Field(
    +        default=None,
    +        alias="bm25Weight",
    +        description="Weight for BM25 score in ranking"
    +    )
    +
    +    @field_validator('q')
    +    def validate_q(cls, v: str) -> str:
    +        if not v or not v.strip():
    +            raise ValueError("q is required")
    +        return v.strip()
    +
    +
    +class TypeaheadSuggestion(ImmutableStrictBaseModelV2):
    +    """Individual suggestion in typeahead response."""
    +
    +    suggestion: str = Field(..., description="The suggested query text")
    +    score: float = Field(..., alias="_score", description="Relevance score for the suggestion")
    +    metadata: Optional[dict] = Field(default=None, description="Additional metadata")
    +
    +
    +class TypeaheadResponse(ImmutableStrictBaseModelV2):
    +    """Response model for typeahead suggestions."""
    +
    +    suggestions: List[TypeaheadSuggestion] = Field(..., description="List of suggestions")
    +    processing_time_ms: Optional[float] = Field(
    +        default=None,
    +        alias="processingTimeMs",
    +        description="Processing time in milliseconds"
    +    )
    +
    +
    +class TypeaheadAddQueryRequest(ImmutableStrictBaseModelV2):
    +    query: str = Field(..., description="User search query")
    +    # Please note that popularity is not mandatory. This is to support multiple popularity values in metadata for future
    +    popularity: float = Field(default=0.0, description="Popularity score")
    +    metadata: Dict[str, float] = Field(default_factory=dict, description="Additional metadata")
    +
    +    @field_validator('query')
    +    def validate_q(cls, v: str) -> str:
    +        if not v or not v.strip():
    +            raise ValueError("query is required")
    +        return v.strip()
    +
    +
    +class TypeaheadIndexRequest(ImmutableStrictBaseModelV2):
    +    queries: List[TypeaheadAddQueryRequest]
    +
    +    @field_validator('queries')
    +    def validate_queries_batch_size(cls, queries):
    +        query_count = len(queries)
    +        max_queries = read_env_vars_and_defaults_ints(EnvVars.MARQO_MAX_DOCUMENTS_BATCH_SIZE)
    +        if query_count == 0:
    +            raise InvalidArgumentError("Received empty index queries request")
    +        elif query_count > max_queries:
    +            raise InvalidArgumentError(
    +                f"Number of queries in index request ({query_count}) exceeds limit of {max_queries}. "
    +                f"Please break up your request into smaller batches."
    +            )
    +        return queries
    +
    +
    +class TypeaheadIndexError(ImmutableStrictBaseModelV2):
    +    query: Optional[str] = None
    +    message: str
    +    code: int = 400
    +
    +
    +class TypeaheadIndexResponse(ImmutableStrictBaseModelV2):
    +    indexed: int = Field(..., description="Indexed queries")
    +    errors: List[TypeaheadIndexError] = Field(default_factory=list, description="Index Errors")
    +    processing_time_ms: float = Field(
    +        alias="processingTimeMs",
    +        description="Processing time in milliseconds"
    +    )
    +
    +
    +class TypeaheadStatsResponse(ImmutableStrictBaseModelV2):
    +    indexed_queries: int = Field(
    +        alias="indexedQueries",
    +        description="Number of indexed queries"
    +    )
    +
    +
    +class TypeaheadQuery(ImmutableStrictBaseModelV2):
    +    """Represents a query from the typeahead schema."""
    +    query: str = Field(..., description="The query string")
    +    popularity: float = Field(..., description="Popularity score")
    +    metadata: Dict[str, float] = Field(..., description="Additional metadata")
    +    last_updated_at: Optional[int] = Field(None, alias="lastUpdatedAt", description="Last updated timestamp")
    +
    +
    +class TypeaheadGetQueriesResponse(ImmutableStrictBaseModelV2):
    +    """Response model for getting typeahead queries."""
    +    queries: List[TypeaheadQuery] = Field(..., description="List of retrieved queries")
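`TypeaheadIndexRequest` rejects both empty requests and requests larger than `MARQO_MAX_DOCUMENTS_BATCH_SIZE`, and its error message asks callers to break the request into smaller batches. A client could do that along the following lines. This is a sketch only: `build_index_payloads` is a hypothetical helper, and the default of 128 is an assumption taken from the "batch size limit of 128" commit message, since the real limit is read from the environment on the server. The keys `queries`, `query`, and `popularity` come from the models in this diff.

```python
from typing import Dict, Iterator, List


def build_index_payloads(
    queries: List[Dict], max_batch_size: int = 128
) -> Iterator[Dict]:
    """Yield request bodies with at most max_batch_size queries each."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    # The server-side validator rejects blank queries, so filter them
    # out (and strip whitespace) before building any batch.
    cleaned = [
        {**item, "query": item["query"].strip()}
        for item in queries
        if item.get("query", "").strip()
    ]
    # No batch is yielded for an empty input, which avoids sending a
    # request that would fail the "empty index queries" check.
    for start in range(0, len(cleaned), max_batch_size):
        yield {"queries": cleaned[start:start + max_batch_size]}


items = [{"query": f"query {i}", "popularity": float(i)} for i in range(300)]
items.append({"query": "   "})  # blank, dropped before batching
payloads = list(build_index_payloads(items))
print([len(p["queries"]) for p in payloads])  # [128, 128, 44]
```

Each yielded dictionary has the shape of one `TypeaheadIndexRequest` body; how it is sent to the server is outside what this diff shows.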
File renamed without changes.
    @@ -0,0 +1,35 @@
    +import unicodedata
    +from typing import List
    +
    +
    +def normalize_text(text: str) -> str:
    +    """
    +    Normalize text by removing accents and converting to lowercase.
    +
    +    Args:
    +        text: Input text to normalize
    +
    +    Returns:
    +        Normalized text with accents removed and lowercased
    +    """
    +    if not text:
    +        return ""
    +
    +    # Normalize to NFKD form and remove accents
    +    normalized = unicodedata.normalize('NFKD', text)
    +    # Filter out combining characters (accents)
    +    without_accents = ''.join(c for c in normalized if not unicodedata.combining(c))
    +    # Convert to lowercase
    +    return without_accents.lower()
    +
    +
    +def generate_prefixes(text: str) -> List[str]:
    +    result = []
    +    prefix = ""
    +    for ch in text:
    +        if ch.isspace():
    +            prefix = ""  # reset when hitting whitespace
    +        else:
    +            prefix += ch
    +            result.append(prefix)
    +    return result
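To make the behaviour of these two helpers concrete, here is a small worked example. The function bodies are restated from the diff above. The indentation is a reconstruction, and placing `result.append(prefix)` inside the `else` branch is an assumption (otherwise an empty string would be emitted at every space). The sample inputs are invented for illustration.

```python
import unicodedata
from typing import List


def normalize_text(text: str) -> str:
    """Strip accents and lowercase, as in the diff above."""
    if not text:
        return ""
    # NFKD splits accented letters into a base letter plus combining marks.
    normalized = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks leaves only the base letters.
    without_accents = "".join(c for c in normalized if not unicodedata.combining(c))
    return without_accents.lower()


def generate_prefixes(text: str) -> List[str]:
    """Return every prefix of every whitespace-separated word."""
    result = []
    prefix = ""
    for ch in text:
        if ch.isspace():
            prefix = ""  # start a new word
        else:
            prefix += ch
            result.append(prefix)
    return result


print(normalize_text("Crème Brûlée"))  # creme brulee
print(generate_prefixes("red shoe"))
# ['r', 're', 'red', 's', 'sh', 'sho', 'shoe']
```

Prefixes restart at each word boundary, so a partial input such as `sho` can match the second word of `red shoe`, not just the start of the whole query.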