Skip to content

Conversation

roji
Copy link
Member

@roji roji commented Jan 23, 2025

LINQ-based filtering

This implements LINQ-based filtering for all current implementations of IVectorStore (with the exception of Pinecone, blocking on #10415). This implements support for:

  • Equality and inequality
  • Comparison (GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual)
  • Logical operators (and, or, not)
  • Contains over collection field (r => r.Tags.Contains("foo"))
  • Contains over inline collection (r => new[] { "foo", "bar" }.Contains(r.String))

Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.

API naming

Vector databases consistency refer to this feature as filter/filtering; in some cases (e.g. Chroma), where is used, following SQL nomenclature:

It seems the choice of exposing this via a VectorSearchOptions.Filter is indeed the correct one. Since we want to maintain backwards compatibility for existing users and preserve the current legacy filtering mechanism, this PR must introduce a separate, temporary NewFilter for the new LINQ-based filteirng. When releasing the stable version, we'll need to remove the old Filter property rename NewFilter to Filter (breaking users which switched to NewFilter).

Custom mapping

The samples currently contain a MappingVectorStoreRecordCollection, which is a decorator transforming a public, user-facing record type to another internal one - the user provides two delegates to handle the conversion from the public record type to the internal, and vice versa. Such conversions become considerably more complex, since the user provides an expression tree referencing public record properties.

To make this work, the user would need to provide some logic to translate property accesses on the public record type to the internal one. For flat records, this could be a simple API where the user simply implements some interface translating public to internal; but for hierarchical records this gets a bit more complicated. I'd suggest deferring such advanced, arbitrary mapping scenarios to post-release. In the meantime, I've commented out MappingVectorStoreRecordCollection (and also Step4_NonStringKey_VectorStore) from the samples.

Note that the problem here doesn't come from LINQ per se, but from the fact that the user specifies model property names (the legacy search API had them specify storage property names instead, side-stepping the problem but causing an inconsistent API where sometimes model properties are used, and sometimes storage properties).

Integration test suite

In addition, this implements new integration test projects for all IVectorStore connectors, implementing most of #10194 (comment):

  • A shared FiltersTestBase class defines all LINQ filter tests, and is extended and used by all providers (link). This can serve as the basic "spec" for what filters are supported across connector (subject to database support, of course).
  • Each connector has its own separate project
  • Testcontainers are used for Mongo, PostgreSQL, Qdrant, Redis, Sqlite, and Weaviate. Azure AI Search, CosmosNoSQL and CosmosMongoDB require configuration (environment variables, settings.json) to provide endpoint information, and tests are skipped if that's not provided.

/cc @westey-m @dmytrostruk

Closes #10156
Does most of #10194

@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Jan 23, 2025
@github-actions github-actions bot changed the title WIP on LINQ-based criteria filtering .Net: WIP on LINQ-based criteria filtering Jan 23, 2025
Task<VectorSearchResults<TRecord>> VectorizableTextSearchAsync(
string searchText,
VectorSearchOptions? options = default,
VectorSearchOptions<TRecord>? options = default,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making VectorSearchOptions generic over TRecord isn't trivial... Modern C# users can use target-typed new, and in that case this doesn't interfere in the method calling. But people not doing target-typed new would have to explicitly specify the full type every time (even if they don't use filtering at all), which maybe isn't ideal. The alternative would be to move the LINQ filter to an optional parameter directly on the method (no strong feelings from my side here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are thinking of also making property selection an expression, that may result in a few more optional parameters on the method signature, to avoid making options generic. For hybrid search it would be 3 additional, which isn't great, especially considering that 2 of those would very rarely be used, i.e. only if there is more than one dense vector or more than one fulltextsearch property on the record.

@westey-m
Copy link
Contributor

@roji, a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those.
We have received consistent feedback that this is a popular scenario for vector search filtering.

@roji
Copy link
Member Author

roji commented Jan 23, 2025

a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those.
We have received consistent feedback that this is a popular scenario for vector search filtering.

@westey-m thanks - will do that, it should be easily representible in LINQ by doing b => b.Tags.Contains("foo"). Maybe let's talk tomorrow about how that's represented e.g. in Qdrant (just because I'm currently focusing on it).

@roji roji changed the base branch from main to feature-vector-data-preb1 January 24, 2025 18:18

private readonly StringBuilder _sql = new();

internal (string Clause, List<object> Parameters) Translate(
Copy link
Contributor

@dluc dluc Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think there's room for SQL injection, considering that the resulting string is merged into a SQL statement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely (and that's very important). There's a test exercising an input with a single quote to check exactly for this. I have a note to do a general comprehensive pass to make sure everything is airtight - stuff like property names isn't currently safe (those generally aren't user input, but I still need to make them airtight).

@dluc
Copy link
Contributor

dluc commented Feb 11, 2025

Could you confirm that NOT works also with Contains? example

  • NOT Contains over collection field (r => !r.Tags.Contains("foo"))

thanks!

await this._container.StartAsync();
var redis = await ConnectionMultiplexer.ConnectAsync($"{this._container.Hostname}:{this._container.GetMappedPublicPort(6379)},connectTimeout=60000,connectRetry=5");
this._database = redis.GetDatabase();
this._defaultVectorStore = new(this._database);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure to test both instances of redis: Json and Hashset. It's possible to choose the flavour by setting the StorageType option on RedisVectorStoreOptions. They do differ in some non-obvious ways, especially with regards to how you specify property names.

Copy link
Member Author

@roji roji Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally missed that, thanks... I pushed tests for both. Depending on the exact differences, it might make sense to just treat these as two databases test-wise and have two projects (but for now fixtures and tests are duplicated in the same projects).

Query-wise there are some unsupported scenarios in HashSet compared to JSON (array fields, null values), but otherwise the tests pass.

@roji
Copy link
Member Author

roji commented Feb 11, 2025

@dluc

Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.

I could not understand the implications, where/when is VectorStoreGenericDataModel used? perhaps an old abstraction that could be deprecated?

VectorStoreGenericDataModel is basically a way to use this abstraction without a strongly-typed .NET type (e.g. for dynamic scenarios). This is already supported in MEVD, but this PR doesn't add support for it in the new filtering mechanism. This is absolutely something I intend to do.

I noticed a comment and discussion about this in the PR: if I understand correctly LINQ filters are expressed using internal storage field names, could you confirm?

No - the LINQ filter expression references properties on the user's .NET type, and so references regular .NET properties.

For example, in collection.VectorizedSearchAsync(vector, new() { NewFilter = r => r.Age > 8 }), Age is a .NET property on a real .NET type (represented here by the lambda parameter r). MEVD allows mapping .NET properties to arbitrary storage names (the names that you actually see in the database), and that's fully compatible with the new LINQ filtering (the translator goes through an internal translator table from user-facing property names to storage properties).

If needed, is there a possibility for mapping external names to internal names? For example, let's assume I build a solution that works with TAGS, and they are stored in different ways, e.g.

a list of strings in Postgres, field name "tags"
a list of string fields in Redis, field names "tag1" "tag2" "tag3" ...
a JSON property in MongoDb, field name "doc.props.tags"
In this scenario would I be able to transform the incoming LINQ filter on "tags" to another LINQ filter targeting my storage of choice?

For all of these scenarios, your .NET type would simply have a regular List<string> property called Tags; referencing it in the LINQ filter and e.g. calling Contains on that property should work simply work on all these databases. One note though: AFAIK in Redis, this would not be mapped to multiple string fields, but rather with a single Tag field; this already fully works with the Redis connector. This scenario is exercised by this test.

Could you confirm that NOT works also with Contains? example

Yes - it does in general. Specific database support varies sometimes (e.g. MongoDB has some general issues with NOT in vector search pre-filters), but it works fine on most databases.

@roji
Copy link
Member Author

roji commented Feb 11, 2025

BTW @dluc we're definitely interested in feedback about any other specific filtering features you might be interested, don't hesitate to let us know.

@roji roji merged commit 2434faf into microsoft:feature-vector-data-preb1 Feb 11, 2025
10 checks passed
@roji roji deleted the LinqFiltering branch February 11, 2025 23:02
roji added a commit to roji/semantic-kernel that referenced this pull request Feb 15, 2025
roji added a commit to roji/semantic-kernel that referenced this pull request Feb 16, 2025
roji added a commit to roji/semantic-kernel that referenced this pull request Feb 16, 2025
roji added a commit to roji/semantic-kernel that referenced this pull request Feb 16, 2025
roji added a commit that referenced this pull request Feb 18, 2025
roji added a commit to roji/semantic-kernel that referenced this pull request Mar 4, 2025
And the obsoleted Filter to OldFilter. Continues microsoft#10273.
roji added a commit to roji/semantic-kernel that referenced this pull request Mar 4, 2025
And the obsoleted Filter to OldFilter. Continues microsoft#10273.
roji added a commit to roji/semantic-kernel that referenced this pull request Mar 4, 2025
And the obsoleted Filter to OldFilter. Continues microsoft#10273.
roji added a commit that referenced this pull request Mar 4, 2025
And the obsoleted Filter to OldFilter. Continues #10273.
@dluc
Copy link
Contributor

dluc commented Mar 13, 2025

@roji this is a scenario I'm about to test.

I'm curious to see this in action since I don't have classes modeling data in storage, every definition is set at runtime. I guess internal field names and external field names will match in every scenario, thus it should just work™️

I noticed a comment and discussion about this in the PR: if I understand correctly LINQ filters are expressed using internal storage field names, could you confirm?

No - the LINQ filter expression references properties on the user's .NET type, and so references regular .NET properties.

@roji
Copy link
Member Author

roji commented Mar 13, 2025

I'm curious to see this in action since I don't have classes modeling data in storage, every definition is set at runtime. I guess internal field names and external field names will match in every scenario, thus it should just work

@dluc if I understand correctly, does that mean you using VectorStoreGenericDataModel? If so, note that the LINQ-based filteirng hasn't yet been implemented for that - #10468 tracks that. Full support for dynamic scenarios is absolutely high our list, but some fixes probably need to happen first (see #10802).

So long story short, for dynamic scenarios you'll likely need to wait a bit longer until everything is fully in place.

markwallace-microsoft pushed a commit that referenced this pull request Sep 25, 2025
…ace (#10456) (#13175)

# Add generic ITextSearch<TRecord> interface with LINQ filtering support

**Addresses Issue #10456**: Modernize ITextSearch to use LINQ-based
vector search filtering

> ** Multi-PR Strategy Context**  
> This is **PR 1 of multiple** in a structured implementation approach
for Issue #10456. This PR targets the
`feature/issue-10456-linq-filtering` branch for incremental review and
testing before the final submission to Microsoft's main branch.

> This approach enables focused code review, easier debugging, and safer
integration of the comprehensive ITextSearch modernization.

### Motivation and Context

**Why is this change required?**
The current ITextSearch interface uses legacy `TextSearchFilter` which
requires conversion to obsolete `VectorSearchFilter`, creating technical
debt and performance overhead. Issue #10456 requests modernization to
use type-safe LINQ filtering with `Expression<Func<TRecord, bool>>`.

**What problem does it solve?**

- Eliminates runtime errors from property name typos in filters
- Removes performance overhead from obsolete filter conversions
- Provides compile-time type safety and IntelliSense support
- Modernizes the API to follow .NET best practices for LINQ-based
filtering

**What scenario does it contribute to?**
This enables developers to write type-safe text search filters like:

```csharp
var options = new TextSearchOptions<Article>
{
    Filter = article => article.Category == "Technology" && article.PublishedDate > DateTime.Now.AddDays(-30)
};
```

**Issue Link:**
#10456

### Description

This PR introduces foundational generic interfaces to enable LINQ-based
filtering for text search operations. The implementation follows an
additive approach, maintaining 100% backward compatibility while
providing a modern, type-safe alternative.

**Overall Approach:**

- Add generic `ITextSearch<TRecord>` interface alongside existing
non-generic version
- Add generic `TextSearchOptions<TRecord>` with LINQ
`Expression<Func<TRecord, bool>>? Filter`
- Update `VectorStoreTextSearch` to implement both interfaces
- Preserve all existing functionality while enabling modern LINQ
filtering

**Underlying Design:**

- **Zero Breaking Changes**: Legacy interfaces remain unchanged and
fully functional
- **Gradual Migration**: Teams can adopt generic interfaces at their own
pace
- **Performance Optimization**: Eliminates obsolete VectorSearchFilter
conversion overhead
- **Type Safety**: Compile-time validation prevents runtime filter
errors

### Engineering Approach: Following Microsoft's Established Patterns

This solution was not created from scratch but carefully architected by
**studying and extending Microsoft's existing patterns** within the
Semantic Kernel codebase:

**1. Pattern Discovery: VectorSearchOptions<TRecord> Template**

Found the exact migration pattern Microsoft established in PR #10273:

```csharp
public class VectorSearchOptions<TRecord>
{
    [Obsolete("Use Filter instead")]
    public VectorSearchFilter? OldFilter { get; set; }  // Legacy approach

    public Expression<Func<TRecord, bool>>? Filter { get; set; }  // Modern LINQ approach
}
```

**2. Existing Infrastructure Analysis**

Discovered that `VectorStoreTextSearch.cs` already had the
implementation infrastructure:

```csharp
// Modern LINQ filtering method (already existed!)
private async IAsyncEnumerable<VectorSearchResult<TRecord>> ExecuteVectorSearchAsync(
    string query,
    TextSearchOptions<TRecord>? searchOptions,  // Generic options
    CancellationToken cancellationToken)
{
    var vectorSearchOptions = new VectorSearchOptions<TRecord>
    {
        Filter = searchOptions.Filter,  // Direct LINQ filtering - no conversion!
    };
}
```

**3. Microsoft's Additive Migration Strategy**

Followed the exact pattern used across the codebase:

- Keep legacy interface unchanged for backward compatibility
- Add generic interface with modern features alongside
- Use `[Experimental]` attributes for new features
- Provide gradual migration path

**4. Consistency with Existing Filter Translators**

All vector database connectors (AzureAISearch, Qdrant, MongoDB,
Weaviate) use the same pattern:

```csharp
internal Filter Translate(LambdaExpression lambdaExpression, CollectionModel model)
{
    // All work with Expression<Func<TRecord, bool>>
    // All provide compile-time safety
    // All follow the same LINQ expression pattern
}
```

**5. Technical Debt Elimination**

The existing problematic code that this PR enables fixing in PR #2:

```csharp
// Current technical debt in VectorStoreTextSearch.cs
#pragma warning disable CS0618 // VectorSearchFilter is obsolete
OldFilter = searchOptions.Filter?.FilterClauses is not null
    ? new VectorSearchFilter(searchOptions.Filter.FilterClauses)
    : null,
#pragma warning restore CS0618
```

This will be replaced with direct LINQ filtering: `Filter =
searchOptions.Filter`

**Result**: This solution extends Microsoft's established patterns
consistently rather than introducing new conventions, ensuring seamless
integration with the existing ecosystem.

## Summary

This PR introduces the foundational generic interfaces needed to
modernize text search functionality from legacy `TextSearchFilter` to
type-safe LINQ `Expression<Func<TRecord, bool>>` filtering. This is the
first in a series of PRs to completely resolve Issue #10456.

## Key Changes

### New Generic Interfaces

- **`ITextSearch<TRecord>`**: Generic interface with type-safe LINQ
filtering

- `SearchAsync<TRecord>(string query, TextSearchOptions<TRecord>
options, CancellationToken cancellationToken)`
- `GetTextSearchResultsAsync<TRecord>(string query,
TextSearchOptions<TRecord> options, CancellationToken
cancellationToken)`
- `GetSearchResultsAsync<TRecord>(string query,
TextSearchOptions<TRecord> options, CancellationToken
cancellationToken)`

- **`TextSearchOptions<TRecord>`**: Generic options class with LINQ
support
- `Expression<Func<TRecord, bool>>? Filter` property for compile-time
type safety
  - Comprehensive XML documentation with usage examples

### Enhanced Implementation

- **`VectorStoreTextSearch<TValue>`**: Now implements both generic and
legacy interfaces
  - Maintains full backward compatibility with existing `ITextSearch`
- Adds native support for generic `ITextSearch<TValue>` with direct LINQ
filtering
- Eliminates technical debt from `TextSearchFilter` → obsolete
`VectorSearchFilter` conversion

## Benefits

### **Type Safety & Developer Experience**

- **Compile-time validation** of filter expressions
- **IntelliSense support** for record property access
- **Eliminates runtime errors** from property name typos

### **Performance Improvements**

- **Direct LINQ filtering** without obsolete conversion overhead
- **Reduced object allocations** by eliminating intermediate filter
objects
- **More efficient vector search** operations

### **Zero Breaking Changes**

- **100% backward compatibility** - existing code continues to work
unchanged
- **Legacy interfaces preserved** - `ITextSearch` and
`TextSearchOptions` untouched
- **Gradual migration path** - teams can adopt generic interfaces at
their own pace

## Implementation Strategy

This PR implements **Phase 1** of the Issue #10456 resolution across 6
structured PRs:

1. **[DONE] PR 1 (This PR)**: Core generic interface additions

- Add `ITextSearch<TRecord>` and `TextSearchOptions<TRecord>` interfaces
- Update `VectorStoreTextSearch` to implement both legacy and generic
interfaces
   - Maintain 100% backward compatibility

2. **[TODO] PR 2**: VectorStoreTextSearch internal modernization

   - Remove obsolete `VectorSearchFilter` conversion overhead
   - Use LINQ expressions directly in internal implementation
   - Eliminate technical debt identified in original issue

3. **[TODO] PR 3**: Modernize BingTextSearch connector

   - Update `BingTextSearch.cs` to implement `ITextSearch<TRecord>`
   - Adapt LINQ expressions to Bing API filtering capabilities
   - Ensure feature parity between legacy and generic interfaces

4. **[TODO] PR 4**: Modernize GoogleTextSearch connector

   - Update `GoogleTextSearch.cs` to implement `ITextSearch<TRecord>`
   - Adapt LINQ expressions to Google API filtering capabilities
   - Maintain backward compatibility for existing integrations

5. **[TODO] PR 5**: Modernize remaining connectors

   - Update `TavilyTextSearch.cs` and `BraveTextSearch.cs`
   - Complete connector ecosystem modernization
   - Ensure consistent LINQ filtering across all text search providers

6. **[TODO] PR 6**: Tests and samples modernization
   - Update 40+ test files identified in impact assessment
   - Modernize sample applications to demonstrate LINQ filtering
   - Validate complete feature parity and performance improvements

## Verification Results

### **Microsoft Official Pre-Commit Compliance**

```bash
[PASS] dotnet build --configuration Release         # 0 warnings, 0 errors
[PASS] dotnet test --configuration Release          # 1,574/1,574 tests passed (100%)
[PASS] dotnet format SK-dotnet.slnx --verify-no-changes  # 0/10,131 files needed formatting
```

### **Test Coverage**

- **VectorStoreTextSearch**: 19/19 tests passing (100%)
- **TextSearch Integration**: 82/82 tests passing (100%)
- **Full Unit Test Suite**: 1,574/1,574 tests passing (100%)
- **No regressions detected**

### **Code Quality**

- **Static Analysis**: 0 compiler warnings, 0 errors
- **Formatting**: Perfect adherence to .NET coding standards
- **Documentation**: Comprehensive XML docs with usage examples

## Example Usage

### Before (Legacy)

```csharp
var options = new TextSearchOptions
{
    Filter = new TextSearchFilter().Equality("Category", "Technology")
};
var results = await textSearch.SearchAsync("AI advances", options);
```

### After (Generic with LINQ)

```csharp
var options = new TextSearchOptions<Article>
{
    Filter = article => article.Category == "Technology"
};
var results = await textSearch.SearchAsync("AI advances", options);
```

## Files Modified

```
dotnet/src/SemanticKernel.Abstractions/Data/TextSearch/ITextSearch.cs
dotnet/src/SemanticKernel.Abstractions/Data/TextSearch/TextSearchOptions.cs
dotnet/src/SemanticKernel.Core/Data/TextSearch/VectorStoreTextSearch.cs
```

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution
Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone

**Verification Evidence:**

- **Build**: `dotnet build --configuration Release` - 0 warnings, 0
errors
- **Tests**: `dotnet test --configuration Release` - 1,574/1,574 tests
passed (100%)
- **Formatting**: `dotnet format SK-dotnet.slnx --verify-no-changes` -
0/10,131 files needed formatting
- **Compatibility**: All existing tests pass, no breaking changes
introduced

---

**Issue**: #10456  
**Type**: Enhancement (Feature Addition)  
**Breaking Changes**: None  
**Documentation**: Updated with comprehensive XML docs and usage
examples

Co-authored-by: Alexander Zarei <[email protected]>
alzarei added a commit to alzarei/semantic-kernel that referenced this pull request Oct 3, 2025
- Implement equality (==), inequality (!=), Contains(), and AND (&&) operators
- Map LINQ expressions to Bing Web Search API advanced operators
- Support negation syntax for inequality (-operator:value)
- Maintain full backward compatibility

Addresses microsoft#10456
Aligns with PR microsoft#10273

Tests: 38/38 pass (100%)
Breaking changes: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kernel.core kernel Issues or pull requests impacting the core kernel memory connector memory msft.ext.vectordata Related to Microsoft.Extensions.VectorData .NET Issue or Pull requests regarding .NET code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants