Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies

Hi,

This concerns ranking features:

      (skip[8]?0:(*vit)->pairs1_rank) +

      (skip[10]?0:(*vit)->pairs_combined_rank) +

This is a request for a better-informed ranking feature. It may be added as a new feature or, preferably, replace the existing pairs_combined one.

The pairs1 ranking feature currently counts, for each anagram confusion value, the pairs transferred from LDcalc to rank. Within the set of Correction Candidates (CCs) for a particular variant, the highest number of transferred pairs is ranked highest.

In the current situation, this does not always give the highest rank to the most likely CC. Quite spurious confusions, particularly over shorter words, may be ranked higher than confusions that ostensibly recur often in the particular corpus being corrected.

After some experimentation it seems that weighing the frequencies of the CCs proposed for a particular confusion might help. We have tried the mean of the frequencies, but this results in pretty much the same ranking as we currently get in pairs1.

The median of the CCs' frequencies, however, appears more likely to deliver the better ranking.
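To illustrate why the two statistics can behave differently, here is a minimal sketch of both, assuming the CC frequencies for one confusion are available as a plain vector. The function names mean_freq and median_freq are hypothetical and not taken from the existing code. With frequencies 1, 2, 3 and 1000, the mean is 251.5 while the median is 2.5: a single very frequent CC dominates the mean but barely moves the median.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: arithmetic mean of the CC frequencies of one confusion.
double mean_freq(const std::vector<std::size_t>& freqs) {
  if (freqs.empty()) return 0.0;
  return std::accumulate(freqs.begin(), freqs.end(), 0.0) / freqs.size();
}

// Hypothetical helper: median of the CC frequencies of one confusion.
// Takes the vector by value because it has to sort it.
double median_freq(std::vector<std::size_t> freqs) {
  if (freqs.empty()) return 0.0;
  std::sort(freqs.begin(), freqs.end());
  const std::size_t n = freqs.size();
  if (n % 2 == 1) return static_cast<double>(freqs[n / 2]);
  // Even count: average the two middle values.
  return (freqs[n / 2 - 1] + freqs[n / 2]) / 2.0;
}
```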

This will probably have to be implemented at the end of rank.

So, given the overall set of pairs in rank that share a particular character confusion value, this new feature needs to calculate the median of the CCs' frequencies (their own frequencies, not the summed frequency of their capitalised versions). Here too, the highest median wins, i.e. is accorded rank 1.

I would very much like to be able to experiment with this soon.
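A rough sketch of what is meant, not a proposal for the actual implementation: group the pairs by confusion value, take the median of the CC frequencies per group, and hand out ranks by descending median. The PairRecord struct, its field names and the function names are all hypothetical and do not reflect the real data structures in rank. Giving equal medians the same rank is also just an assumption made for the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical record for one pair as seen at the end of rank.
struct PairRecord {
  unsigned long long confusion;  // character confusion value of the pair
  std::size_t cc_freq;           // the CC's own frequency (not summed over capitalised forms)
};

// Median of a non-empty list of frequencies; returns 0.0 for an empty list.
double median_of(std::vector<std::size_t> v) {
  if (v.empty()) return 0.0;
  std::sort(v.begin(), v.end());
  const std::size_t n = v.size();
  return n % 2 == 1 ? static_cast<double>(v[n / 2])
                    : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Maps each confusion value to its rank; the highest median gets rank 1.
// Assumption: confusions with equal medians share a rank.
std::map<unsigned long long, int> median_pairs_rank(const std::vector<PairRecord>& pairs) {
  // Collect the CC frequencies per confusion value.
  std::map<unsigned long long, std::vector<std::size_t>> by_confusion;
  for (const auto& p : pairs) by_confusion[p.confusion].push_back(p.cc_freq);

  // Compute one median per confusion value.
  std::vector<std::pair<double, unsigned long long>> medians;
  for (const auto& kv : by_confusion) medians.emplace_back(median_of(kv.second), kv.first);

  // Sort by descending median.
  std::sort(medians.begin(), medians.end(),
            [](const std::pair<double, unsigned long long>& a,
               const std::pair<double, unsigned long long>& b) { return a.first > b.first; });

  // Assign ranks, moving to the next rank only when the median changes.
  std::map<unsigned long long, int> ranks;
  int rank = 0;
  for (std::size_t i = 0; i < medians.size(); ++i) {
    if (i == 0 || medians[i].first != medians[i - 1].first) ++rank;
    ranks[medians[i].second] = rank;
  }
  return ranks;
}
```

Each CC for a variant would then receive the rank of the confusion value of its pair, in the same way the other rank features are assigned.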

Thanks!

M.


Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies #33
