Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 73 additions & 5 deletions src/collector/top_score_collector.rs
Original file line number Diff line number Diff line change
Expand Up @@ -970,7 +970,7 @@ impl<Score, D, const R: bool> From<TopNComputerDeser<Score, D, R>> for TopNCompu
}
}

impl<Score, D, const R: bool> TopNComputer<Score, D, R>
impl<Score, D, const REVERSE_ORDER: bool> TopNComputer<Score, D, REVERSE_ORDER>
where
Score: PartialOrd + Clone,
D: Ord,
Expand All @@ -991,7 +991,10 @@ where
#[inline]
pub fn push(&mut self, feature: Score, doc: D) {
if let Some(last_median) = self.threshold.clone() {
if feature < last_median {
if !REVERSE_ORDER && feature > last_median {
return;
}
if REVERSE_ORDER && feature < last_median {
return;
}
}
Comment on lines 993 to 1000
Copy link
Collaborator

@stuhood stuhood Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, after further consideration, I think that the previous version was slightly incorrect, and so this one is too.

There are probably multiple factors, but one reason why #2651 would be slower is that it is comparing both the Score and the DocId. Before and after this PR, main is only comparing the Score to the threshold, which means that it will not tiebreak using the DocId.

Tiebreaking with the DocId is necessary for deterministic ordering of results: this will eliminate docs with equal Scores even if they have higher DocIds, which mismatches the behavior of ComparableDoc (and thus truncate_top_n/into_sorted_vec).

Copy link
Collaborator

@stuhood stuhood Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiebreaking with the DocId is necessary for deterministic ordering of results...

And if it isn't, then we should remove it / make it optional, because it's not free! 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiebreaking with the DocId is necessary for deterministic ordering of results...

And if it isn't, then we should remove it / make it optional, because it's not free! 😅

See #2681 (comment) on this topic: I think that we could introduce optional DocId/DocAddress tiebreaking with that API, and then remove it from TopN consumers who don't need it.

Copy link
Collaborator Author

@PSeitz PSeitz Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2651 compares Docid for the threshold, but that's not necessary. It should be a very fast pre-filtering. We still compare Docids via ComparableDoc during the sorts

The check does not filter docs with equal scores only with smaller ones

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, true. And the ComparableDoc comparator should be doing the same thing and exiting early without comparing the docid.

Sorry for the noise.

Expand Down Expand Up @@ -1026,7 +1029,7 @@ where
}

/// Returns the top n elements in sorted order.
pub fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, D, R>> {
pub fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, D, REVERSE_ORDER>> {
if self.buffer.len() > self.top_n {
self.truncate_top_n();
}
Expand All @@ -1037,7 +1040,7 @@ where
/// Returns the top n elements in stored order.
/// Useful if you do not need the elements in sorted order,
/// for example when merging the results of multiple segments.
pub fn into_vec(mut self) -> Vec<ComparableDoc<Score, D, R>> {
pub fn into_vec(mut self) -> Vec<ComparableDoc<Score, D, REVERSE_ORDER>> {
if self.buffer.len() > self.top_n {
self.truncate_top_n();
}
Expand All @@ -1047,9 +1050,11 @@ where

#[cfg(test)]
mod tests {
use proptest::prelude::*;

use super::{TopDocs, TopNComputer};
use crate::collector::top_collector::ComparableDoc;
use crate::collector::Collector;
use crate::collector::{Collector, DocSetCollector};
use crate::query::{AllQuery, Query, QueryParser};
use crate::schema::{Field, Schema, FAST, STORED, TEXT};
use crate::time::format_description::well_known::Rfc3339;
Expand Down Expand Up @@ -1144,6 +1149,44 @@ mod tests {
}
}

proptest! {
#[test]
fn test_topn_computer_asc_prop(
limit in 0..10_usize,
docs in proptest::collection::vec((0..100_u64, 0..100_u64), 0..100_usize),
) {
let mut computer: TopNComputer<_, _, false> = TopNComputer::new(limit);
for (feature, doc) in &docs {
computer.push(*feature, *doc);
}
let mut comparable_docs = docs.into_iter().map(|(feature, doc)| ComparableDoc { feature, doc }).collect::<Vec<_>>();
comparable_docs.sort();
comparable_docs.truncate(limit);
prop_assert_eq!(
computer.into_sorted_vec(),
comparable_docs,
);
}

#[test]
fn test_topn_computer_desc_prop(
limit in 0..10_usize,
docs in proptest::collection::vec((0..100_u64, 0..100_u64), 0..100_usize),
) {
let mut computer: TopNComputer<_, _, true> = TopNComputer::new(limit);
for (feature, doc) in &docs {
computer.push(*feature, *doc);
}
let mut comparable_docs = docs.into_iter().map(|(feature, doc)| ComparableDoc { feature, doc }).collect::<Vec<_>>();
comparable_docs.sort();
comparable_docs.truncate(limit);
prop_assert_eq!(
computer.into_sorted_vec(),
comparable_docs,
);
}
}

#[test]
fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> {
let index = make_index()?;
Expand Down Expand Up @@ -1645,4 +1688,29 @@ mod tests {
);
Ok(())
}

#[test]
fn test_topn_computer_asc() {
let mut computer: TopNComputer<u32, u32, false> = TopNComputer::new(2);

computer.push(1u32, 1u32);
computer.push(2u32, 2u32);
computer.push(3u32, 3u32);
computer.push(2u32, 4u32);
computer.push(4u32, 5u32);
computer.push(1u32, 6u32);
assert_eq!(
computer.into_sorted_vec(),
&[
ComparableDoc {
feature: 1u32,
doc: 1u32,
},
ComparableDoc {
feature: 1u32,
doc: 6u32,
}
]
);
}
}
Loading