-
-
Notifications
You must be signed in to change notification settings - Fork 826
Fix TopNComputer for reverse order #2672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Pascal Seitz <[email protected]>
| if let Some(last_median) = self.threshold.clone() { | ||
| if feature < last_median { | ||
| if !REVERSE_ORDER && feature > last_median { | ||
| return; | ||
| } | ||
| if REVERSE_ORDER && feature < last_median { | ||
| return; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, after further consideration, I think that the previous version was slightly incorrect, and so this one is too.
There are probably multiple factors, but one reason why #2651 would be slower is that it is comparing both the Score and the DocId. Before and after this PR, main is only comparing the Score to the threshold, which means that it will not tiebreak using the DocId.
Tiebreaking with the DocId is necessary for deterministic ordering of results: this will eliminate docs with equal Scores even if they have higher DocIds, which mismatches the behavior of ComparableDoc (and thus truncate_top_n/into_sorted_vec).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiebreaking with the
DocIdis necessary for deterministic ordering of results...
And if it isn't, then we should remove it / make it optional, because it's not free! 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiebreaking with the
DocIdis necessary for deterministic ordering of results...And if it isn't, then we should remove it / make it optional, because it's not free! 😅
See #2681 (comment) on this topic: I think that we could introduce optional DocId/DocAddress tiebreaking with that API, and then remove it from TopN consumers who don't need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#2651 compares Docid for the threshold, but that's not necessary. It should be a very fast pre-filtering. We still compare Docids via ComparableDoc during the sorts
The check does not filter docs with equal scores only with smaller ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, true. And the ComparableDoc comparator should be doing the same thing and exiting early without comparing the docid.
Sorry for the noise.
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
## What
Add a `TopDocs::order_by` method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the `TopOrderable` trait. The `TopOrderable` trait is implemented (by a macro) for tuples of length 1 through 3 (for now).
## How
Add:
* a `TopOrderable` trait which is implemented for tuples, and a `TopOrderableCollector` to collect for it.
* a `Feature` trait which is implemented for `Score`s, and for fast fields.
* To allow for boxing/dynamic dispatch of `Features` (which reduces code generation when the sort columns are not known until runtime), `Arc<dyn Feature>` is implemented via `ErasedFeature`.
* a `TopNCompare` trait which can be used together with a `LazyTopNComputer` to lazily fetch columns during TopN.
* This new interface is necessary because `TopNComputer` does not allow for lazily fetching additional fields for the comparison tuple, which can eliminate a lot of IO when tiebreakers are only rarely actually coming into play in the comparison (because most values are being eliminated by earlier columns).
* It could also allow for making `DocId`/`DocAddress` tiebreaking optional ([see](quickwit-oss#2672 (comment))), via something like a "`DocIdFeature`".
This interface additionally could not use the `CustomScorer` APIs because it does not allow segments to Top-N a different type than their final output type (which is essential for ordering by `String`s).
## Note
This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!
----
Upstream at quickwit-oss#2681
Similar fix to #2651, but more lightweight (no change of
Score)thx @stuhood