Skip to content

Conversation

@PSeitz
Copy link
Collaborator

@PSeitz PSeitz commented Jul 16, 2025

Similar fix to #2651, but more lightweight (no change of Score)

thx @stuhood

@PSeitz PSeitz merged commit 4e84c70 into quickwit-oss:main Jul 16, 2025
3 checks passed
maksym-iv-ef pushed a commit to elastiflow/tantivy that referenced this pull request Jul 24, 2025
Comment on lines 993 to 1000
if let Some(last_median) = self.threshold.clone() {
if feature < last_median {
if !REVERSE_ORDER && feature > last_median {
return;
}
if REVERSE_ORDER && feature < last_median {
return;
}
}
Copy link
Collaborator

@stuhood stuhood Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, after further consideration, I think that the previous version was slightly incorrect, and so this one is too.

There are probably multiple factors, but one reason why #2651 would be slower is that it is comparing both the Score and the DocId. Before and after this PR, main is only comparing the Score to the threshold, which means that it will not tiebreak using the DocId.

Tiebreaking with the DocId is necessary for deterministic ordering of results: this will eliminate docs with equal Scores even if they have higher DocIds, which mismatches the behavior of ComparableDoc (and thus truncate_top_n/into_sorted_vec).

Copy link
Collaborator

@stuhood stuhood Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiebreaking with the DocId is necessary for deterministic ordering of results...

And if it isn't, then we should remove it / make it optional, because it's not free! 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiebreaking with the DocId is necessary for deterministic ordering of results...

And if it isn't, then we should remove it / make it optional, because it's not free! 😅

See #2681 (comment) on this topic: I think that we could introduce optional DocId/DocAddress tiebreaking with that API, and then remove it from TopN consumers who don't need it.

Copy link
Collaborator Author

@PSeitz PSeitz Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2651 compares Docid for the threshold, but that's not necessary. It should be a very fast pre-filtering. We still compare Docids via ComparableDoc during the sorts

The check does not filter docs with equal scores only with smaller ones

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, true. And the ComparableDoc comparator should be doing the same thing and exiting early without comparing the docid.

Sorry for the noise.

stuhood added a commit to paradedb/tantivy that referenced this pull request Aug 8, 2025
## What

Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

## How

Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
    * To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.  
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
    * This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
    * It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".

This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).

## Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

----

Upstream at quickwit-oss#2681
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Oct 21, 2025
## What

Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

## How

Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
    * To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.  
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
    * This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
    * It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".

This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).

## Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

----

Upstream at quickwit-oss#2681
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Oct 22, 2025
## What

Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

## How

Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
    * To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.  
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
    * This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
    * It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".

This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).

## Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

----

Upstream at quickwit-oss#2681
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Oct 22, 2025
## What

Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

## How

Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
    * To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.  
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
    * This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
    * It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".

This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).

## Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

----

Upstream at quickwit-oss#2681
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Dec 3, 2025
## What

Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

## How

Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
    * To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.  
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
    * This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
    * It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".

This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).

## Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

----

Upstream at quickwit-oss#2681
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Dec 3, 2025
## What

Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

## How

Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
    * To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.  
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
    * This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
    * It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".

This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).

## Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

----

Upstream at quickwit-oss#2681
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants