-
Notifications
You must be signed in to change notification settings - Fork 4.3k
.Net MEVD: LINQ-based filtering #10273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dotnet/src/Connectors/Connectors.Memory.Qdrant/QdrantFilterTranslator.cs
Outdated
Show resolved
Hide resolved
Task<VectorSearchResults<TRecord>> VectorizableTextSearchAsync( | ||
string searchText, | ||
VectorSearchOptions? options = default, | ||
VectorSearchOptions<TRecord>? options = default, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making VectorSearchOptions generic over TRecord isn't trivial... Modern C# users can use target-typed new, and in that case this doesn't interfere in the method calling. But people not doing target-typed new would have to explicitly specify the full type every time (even if they don't use filtering at all), which maybe isn't ideal. The alternative would be to move the LINQ filter to an optional parameter directly on the method (no strong feelings from my side here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are thinking of also making property selection an expression, that may result in a few more optional parameters on the method signature, to avoid making options generic. For hybrid search it would be 3 additional, which isn't great, especially considering that 2 of those would very rarely be used, i.e. only if there is more than one dense vector or more than one fulltextsearch property on the record.
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/VectorSearchOptions.cs
Show resolved
Hide resolved
dotnet/src/VectorDataTests/VectorDataSpecificationTests/Filter/BasicFilterTestsBase.cs
Outdated
Show resolved
Hide resolved
dotnet/src/VectorDataTests/QdrantTests/Support/TestContainer/QdrantBuilder.cs
Show resolved
Hide resolved
dotnet/src/VectorDataTests/QdrantTests/Support/TestContainer/QdrantConfiguration.cs
Show resolved
Hide resolved
dotnet/src/VectorDataTests/QdrantTests/Support/TestContainer/QdrantContainer.cs
Show resolved
Hide resolved
@roji, a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those. |
dotnet/src/VectorDataTests/VectorDataSpecificationTests/Filter/BasicFilterTestsBase.cs
Outdated
Show resolved
Hide resolved
@westey-m thanks - will do that, it should be easily representible in LINQ by doing |
|
||
private readonly StringBuilder _sql = new(); | ||
|
||
internal (string Clause, List<object> Parameters) Translate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think there's room for SQL injection, considering that the resulting string is merged into a SQL statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely (and that's very important). There's a test exercising an input with a single quote to check exactly for this. I have a note to do a general comprehensive pass to make sure everything is airtight - stuff like property names isn't currently safe (those generally aren't user input, but I still need to make them airtight).
Could you confirm that NOT works also with Contains? example
thanks! |
await this._container.StartAsync(); | ||
var redis = await ConnectionMultiplexer.ConnectAsync($"{this._container.Hostname}:{this._container.GetMappedPublicPort(6379)},connectTimeout=60000,connectRetry=5"); | ||
this._database = redis.GetDatabase(); | ||
this._defaultVectorStore = new(this._database); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure to test both instances of redis: Json and Hashset. It's possible to choose the flavour by setting the StorageType option on RedisVectorStoreOptions. They do differ in some non-obvious ways, especially with regards to how you specify property names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally missed that, thanks... I pushed tests for both. Depending on the exact differences, it might make sense to just treat these as two databases test-wise and have two projects (but for now fixtures and tests are duplicated in the same projects).
Query-wise there are some unsupported scenarios in HashSet compared to JSON (array fields, null values), but otherwise the tests pass.
VectorStoreGenericDataModel is basically a way to use this abstraction without a strongly-typed .NET type (e.g. for dynamic scenarios). This is already supported in MEVD, but this PR doesn't add support for it in the new filtering mechanism. This is absolutely something I intend to do.
No - the LINQ filter expression references properties on the user's .NET type, and so references regular .NET properties. For example, in
For all of these scenarios, your .NET type would simply have a regular
Yes - it does in general. Specific database support varies sometimes (e.g. MongoDB has some general issues with NOT in vector search pre-filters), but it works fine on most databases. |
BTW @dluc we're definitely interested in feedback about any other specific filtering features you might be interested, don't hesitate to let us know. |
And the obsoleted Filter to OldFilter. Continues microsoft#10273.
And the obsoleted Filter to OldFilter. Continues microsoft#10273.
And the obsoleted Filter to OldFilter. Continues microsoft#10273.
And the obsoleted Filter to OldFilter. Continues #10273.
@roji this is a scenario I'm about to test. I'm curious to see this in action since I don't have classes modeling data in storage, every definition is set at runtime. I guess internal field names and external field names will match in every scenario, thus it should just work™️
|
@dluc if I understand correctly, does that mean you using VectorStoreGenericDataModel? If so, note that the LINQ-based filteirng hasn't yet been implemented for that - #10468 tracks that. Full support for dynamic scenarios is absolutely high our list, but some fixes probably need to happen first (see #10802). So long story short, for dynamic scenarios you'll likely need to wait a bit longer until everything is fully in place. |
…ace (#10456) (#13175) # Add generic ITextSearch<TRecord> interface with LINQ filtering support **Addresses Issue #10456**: Modernize ITextSearch to use LINQ-based vector search filtering > ** Multi-PR Strategy Context** > This is **PR 1 of multiple** in a structured implementation approach for Issue #10456. This PR targets the `feature/issue-10456-linq-filtering` branch for incremental review and testing before the final submission to Microsoft's main branch. > This approach enables focused code review, easier debugging, and safer integration of the comprehensive ITextSearch modernization. ### Motivation and Context **Why is this change required?** The current ITextSearch interface uses legacy `TextSearchFilter` which requires conversion to obsolete `VectorSearchFilter`, creating technical debt and performance overhead. Issue #10456 requests modernization to use type-safe LINQ filtering with `Expression<Func<TRecord, bool>>`. **What problem does it solve?** - Eliminates runtime errors from property name typos in filters - Removes performance overhead from obsolete filter conversions - Provides compile-time type safety and IntelliSense support - Modernizes the API to follow .NET best practices for LINQ-based filtering **What scenario does it contribute to?** This enables developers to write type-safe text search filters like: ```csharp var options = new TextSearchOptions<Article> { Filter = article => article.Category == "Technology" && article.PublishedDate > DateTime.Now.AddDays(-30) }; ``` **Issue Link:** #10456 ### Description This PR introduces foundational generic interfaces to enable LINQ-based filtering for text search operations. The implementation follows an additive approach, maintaining 100% backward compatibility while providing a modern, type-safe alternative. **Overall Approach:** - Add generic `ITextSearch<TRecord>` interface alongside existing non-generic version - Add generic `TextSearchOptions<TRecord>` with LINQ `Expression<Func<TRecord, bool>>? Filter` - Update `VectorStoreTextSearch` to implement both interfaces - Preserve all existing functionality while enabling modern LINQ filtering **Underlying Design:** - **Zero Breaking Changes**: Legacy interfaces remain unchanged and fully functional - **Gradual Migration**: Teams can adopt generic interfaces at their own pace - **Performance Optimization**: Eliminates obsolete VectorSearchFilter conversion overhead - **Type Safety**: Compile-time validation prevents runtime filter errors ### Engineering Approach: Following Microsoft's Established Patterns This solution was not created from scratch but carefully architected by **studying and extending Microsoft's existing patterns** within the Semantic Kernel codebase: **1. Pattern Discovery: VectorSearchOptions<TRecord> Template** Found the exact migration pattern Microsoft established in PR #10273: ```csharp public class VectorSearchOptions<TRecord> { [Obsolete("Use Filter instead")] public VectorSearchFilter? OldFilter { get; set; } // Legacy approach public Expression<Func<TRecord, bool>>? Filter { get; set; } // Modern LINQ approach } ``` **2. Existing Infrastructure Analysis** Discovered that `VectorStoreTextSearch.cs` already had the implementation infrastructure: ```csharp // Modern LINQ filtering method (already existed!) private async IAsyncEnumerable<VectorSearchResult<TRecord>> ExecuteVectorSearchAsync( string query, TextSearchOptions<TRecord>? searchOptions, // Generic options CancellationToken cancellationToken) { var vectorSearchOptions = new VectorSearchOptions<TRecord> { Filter = searchOptions.Filter, // Direct LINQ filtering - no conversion! }; } ``` **3. Microsoft's Additive Migration Strategy** Followed the exact pattern used across the codebase: - Keep legacy interface unchanged for backward compatibility - Add generic interface with modern features alongside - Use `[Experimental]` attributes for new features - Provide gradual migration path **4. Consistency with Existing Filter Translators** All vector database connectors (AzureAISearch, Qdrant, MongoDB, Weaviate) use the same pattern: ```csharp internal Filter Translate(LambdaExpression lambdaExpression, CollectionModel model) { // All work with Expression<Func<TRecord, bool>> // All provide compile-time safety // All follow the same LINQ expression pattern } ``` **5. Technical Debt Elimination** The existing problematic code that this PR enables fixing in PR #2: ```csharp // Current technical debt in VectorStoreTextSearch.cs #pragma warning disable CS0618 // VectorSearchFilter is obsolete OldFilter = searchOptions.Filter?.FilterClauses is not null ? new VectorSearchFilter(searchOptions.Filter.FilterClauses) : null, #pragma warning restore CS0618 ``` This will be replaced with direct LINQ filtering: `Filter = searchOptions.Filter` **Result**: This solution extends Microsoft's established patterns consistently rather than introducing new conventions, ensuring seamless integration with the existing ecosystem. ## Summary This PR introduces the foundational generic interfaces needed to modernize text search functionality from legacy `TextSearchFilter` to type-safe LINQ `Expression<Func<TRecord, bool>>` filtering. This is the first in a series of PRs to completely resolve Issue #10456. ## Key Changes ### New Generic Interfaces - **`ITextSearch<TRecord>`**: Generic interface with type-safe LINQ filtering - `SearchAsync<TRecord>(string query, TextSearchOptions<TRecord> options, CancellationToken cancellationToken)` - `GetTextSearchResultsAsync<TRecord>(string query, TextSearchOptions<TRecord> options, CancellationToken cancellationToken)` - `GetSearchResultsAsync<TRecord>(string query, TextSearchOptions<TRecord> options, CancellationToken cancellationToken)` - **`TextSearchOptions<TRecord>`**: Generic options class with LINQ support - `Expression<Func<TRecord, bool>>? Filter` property for compile-time type safety - Comprehensive XML documentation with usage examples ### Enhanced Implementation - **`VectorStoreTextSearch<TValue>`**: Now implements both generic and legacy interfaces - Maintains full backward compatibility with existing `ITextSearch` - Adds native support for generic `ITextSearch<TValue>` with direct LINQ filtering - Eliminates technical debt from `TextSearchFilter` → obsolete `VectorSearchFilter` conversion ## Benefits ### **Type Safety & Developer Experience** - **Compile-time validation** of filter expressions - **IntelliSense support** for record property access - **Eliminates runtime errors** from property name typos ### **Performance Improvements** - **Direct LINQ filtering** without obsolete conversion overhead - **Reduced object allocations** by eliminating intermediate filter objects - **More efficient vector search** operations ### **Zero Breaking Changes** - **100% backward compatibility** - existing code continues to work unchanged - **Legacy interfaces preserved** - `ITextSearch` and `TextSearchOptions` untouched - **Gradual migration path** - teams can adopt generic interfaces at their own pace ## Implementation Strategy This PR implements **Phase 1** of the Issue #10456 resolution across 6 structured PRs: 1. **[DONE] PR 1 (This PR)**: Core generic interface additions - Add `ITextSearch<TRecord>` and `TextSearchOptions<TRecord>` interfaces - Update `VectorStoreTextSearch` to implement both legacy and generic interfaces - Maintain 100% backward compatibility 2. **[TODO] PR 2**: VectorStoreTextSearch internal modernization - Remove obsolete `VectorSearchFilter` conversion overhead - Use LINQ expressions directly in internal implementation - Eliminate technical debt identified in original issue 3. **[TODO] PR 3**: Modernize BingTextSearch connector - Update `BingTextSearch.cs` to implement `ITextSearch<TRecord>` - Adapt LINQ expressions to Bing API filtering capabilities - Ensure feature parity between legacy and generic interfaces 4. **[TODO] PR 4**: Modernize GoogleTextSearch connector - Update `GoogleTextSearch.cs` to implement `ITextSearch<TRecord>` - Adapt LINQ expressions to Google API filtering capabilities - Maintain backward compatibility for existing integrations 5. **[TODO] PR 5**: Modernize remaining connectors - Update `TavilyTextSearch.cs` and `BraveTextSearch.cs` - Complete connector ecosystem modernization - Ensure consistent LINQ filtering across all text search providers 6. **[TODO] PR 6**: Tests and samples modernization - Update 40+ test files identified in impact assessment - Modernize sample applications to demonstrate LINQ filtering - Validate complete feature parity and performance improvements ## Verification Results ### **Microsoft Official Pre-Commit Compliance** ```bash [PASS] dotnet build --configuration Release # 0 warnings, 0 errors [PASS] dotnet test --configuration Release # 1,574/1,574 tests passed (100%) [PASS] dotnet format SK-dotnet.slnx --verify-no-changes # 0/10,131 files needed formatting ``` ### **Test Coverage** - **VectorStoreTextSearch**: 19/19 tests passing (100%) - **TextSearch Integration**: 82/82 tests passing (100%) - **Full Unit Test Suite**: 1,574/1,574 tests passing (100%) - **No regressions detected** ### **Code Quality** - **Static Analysis**: 0 compiler warnings, 0 errors - **Formatting**: Perfect adherence to .NET coding standards - **Documentation**: Comprehensive XML docs with usage examples ## Example Usage ### Before (Legacy) ```csharp var options = new TextSearchOptions { Filter = new TextSearchFilter().Equality("Category", "Technology") }; var results = await textSearch.SearchAsync("AI advances", options); ``` ### After (Generic with LINQ) ```csharp var options = new TextSearchOptions<Article> { Filter = article => article.Category == "Technology" }; var results = await textSearch.SearchAsync("AI advances", options); ``` ## Files Modified ``` dotnet/src/SemanticKernel.Abstractions/Data/TextSearch/ITextSearch.cs dotnet/src/SemanticKernel.Abstractions/Data/TextSearch/TextSearchOptions.cs dotnet/src/SemanticKernel.Core/Data/TextSearch/VectorStoreTextSearch.cs ``` ### Contribution Checklist - [x] The code builds clean without any errors or warnings - [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [x] All unit tests pass, and I have added new tests where possible - [x] I didn't break anyone **Verification Evidence:** - **Build**: `dotnet build --configuration Release` - 0 warnings, 0 errors - **Tests**: `dotnet test --configuration Release` - 1,574/1,574 tests passed (100%) - **Formatting**: `dotnet format SK-dotnet.slnx --verify-no-changes` - 0/10,131 files needed formatting - **Compatibility**: All existing tests pass, no breaking changes introduced --- **Issue**: #10456 **Type**: Enhancement (Feature Addition) **Breaking Changes**: None **Documentation**: Updated with comprehensive XML docs and usage examples Co-authored-by: Alexander Zarei <[email protected]>
- Implement equality (==), inequality (!=), Contains(), and AND (&&) operators - Map LINQ expressions to Bing Web Search API advanced operators - Support negation syntax for inequality (-operator:value) - Maintain full backward compatibility Addresses microsoft#10456 Aligns with PR microsoft#10273 Tests: 38/38 pass (100%) Breaking changes: None
LINQ-based filtering
This implements LINQ-based filtering for all current implementations of IVectorStore (with the exception of Pinecone, blocking on #10415). This implements support for:
r => r.Tags.Contains("foo")
)r => new[] { "foo", "bar" }.Contains(r.String)
)Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.
API naming
Vector databases consistency refer to this feature as filter/filtering; in some cases (e.g. Chroma),
where
is used, following SQL nomenclature:where
is used in the API, but filtering is also used in the docs)It seems the choice of exposing this via a
VectorSearchOptions.Filter
is indeed the correct one. Since we want to maintain backwards compatibility for existing users and preserve the current legacy filtering mechanism, this PR must introduce a separate, temporaryNewFilter
for the new LINQ-based filteirng. When releasing the stable version, we'll need to remove the oldFilter
property renameNewFilter
toFilter
(breaking users which switched toNewFilter
).Custom mapping
The samples currently contain a MappingVectorStoreRecordCollection, which is a decorator transforming a public, user-facing record type to another internal one - the user provides two delegates to handle the conversion from the public record type to the internal, and vice versa. Such conversions become considerably more complex, since the user provides an expression tree referencing public record properties.
To make this work, the user would need to provide some logic to translate property accesses on the public record type to the internal one. For flat records, this could be a simple API where the user simply implements some interface translating public to internal; but for hierarchical records this gets a bit more complicated. I'd suggest deferring such advanced, arbitrary mapping scenarios to post-release. In the meantime, I've commented out MappingVectorStoreRecordCollection (and also Step4_NonStringKey_VectorStore) from the samples.
Note that the problem here doesn't come from LINQ per se, but from the fact that the user specifies model property names (the legacy search API had them specify storage property names instead, side-stepping the problem but causing an inconsistent API where sometimes model properties are used, and sometimes storage properties).
Integration test suite
In addition, this implements new integration test projects for all IVectorStore connectors, implementing most of #10194 (comment):
/cc @westey-m @dmytrostruk
Closes #10156
Does most of #10194