
Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies #33

@martinreynaert


Hi,

This concerns ranking features:

  (skip[8]?0:(*vit)->pairs1_rank) +

  (skip[10]?0:(*vit)->pairs_combined_rank) +

This is a request for a more informed ranking feature. It may be added as a new feature or, preferably, replace the existing pairs_combined one.

The pairs1 ranking feature currently counts, for each anagram confusion value, the pairs transferred from LDcalc to rank. Within the set of Correction Candidates (CCs) for a particular variant, the confusion with the highest number of transferred pairs receives the highest rank.

In the current situation this does not always give the highest rank to the most likely CC. Rather spurious confusions, particularly over shorter words, may be ranked higher than confusions that evidently recur often in the particular corpus being corrected.
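
To make the current behaviour concrete, here is a minimal sketch of count-based ranking as described above. It is not the actual code in rank: the `Pair` struct and the function name are hypothetical, and letting ties share a rank (dense ranking) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for one pair transferred from LDcalc to rank.
struct Pair {
  std::uint64_t confusion;  // anagram (character) confusion value
};

// Rank confusion values by how many pairs share them:
// the highest count gets rank 1. Ties share a rank (assumption).
std::map<std::uint64_t, int> rank_by_pair_count(const std::vector<Pair>& pairs) {
  // Count the pairs per confusion value.
  std::map<std::uint64_t, std::size_t> counts;
  for (const auto& p : pairs) ++counts[p.confusion];

  // Collect the distinct counts, largest first.
  std::set<std::size_t, std::greater<std::size_t>> distinct;
  for (const auto& kv : counts) distinct.insert(kv.second);

  // A confusion's rank is the 1-based position of its count in that ordering.
  std::map<std::uint64_t, int> ranks;
  for (const auto& kv : counts) {
    ranks[kv.first] =
        1 + static_cast<int>(std::distance(distinct.begin(), distinct.find(kv.second)));
  }
  return ranks;
}
```
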

After some experimentation it seems that weighing the frequencies of the CCs proposed for a particular confusion might help. We have tried the mean of the frequencies, but this results in pretty much the same ranking as we currently get in pairs1.

The median of the CCs' frequencies, however, appears more likely to deliver the better ranking.

This will probably have to be implemented at the end of rank.

So, given the overall set of pairs in rank that share a particular character confusion value, this new feature needs to calculate the median of the CCs' frequencies (each CC's own frequency, not the summed frequency of its capitalised versions). Here too the highest median wins, i.e. is accorded rank 1.
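
A rough sketch of what I have in mind, to be adapted to the real data structures in rank. Everything named here (`PairRecord`, `median`, `rank_by_median_cc_freq`) is hypothetical. It also assumes that every pair contributes its CC's frequency once, that an even number of values takes the average of the two middle ones, and that equal medians share a rank.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <map>
#include <set>
#include <vector>

// Hypothetical record for one variant/CC pair as available at the end of rank.
struct PairRecord {
  std::uint64_t confusion;  // character confusion value
  std::uint64_t cc_freq;    // the CC's own frequency, not summed with capitalised forms
};

// Median of a non-empty list. For an even number of values the two
// middle values are averaged (assumption).
double median(std::vector<std::uint64_t> v) {
  assert(!v.empty());
  std::sort(v.begin(), v.end());
  const std::size_t mid = v.size() / 2;
  if (v.size() % 2 == 1) return static_cast<double>(v[mid]);
  return (static_cast<double>(v[mid - 1]) + static_cast<double>(v[mid])) / 2.0;
}

// Rank confusion values by the median frequency of their CCs:
// the highest median gets rank 1. Equal medians share a rank (assumption).
std::map<std::uint64_t, int> rank_by_median_cc_freq(const std::vector<PairRecord>& pairs) {
  // Group the CC frequencies by confusion value.
  std::map<std::uint64_t, std::vector<std::uint64_t>> freqs;
  for (const auto& p : pairs) freqs[p.confusion].push_back(p.cc_freq);

  // Compute one median per confusion and collect the distinct medians, largest first.
  std::map<std::uint64_t, double> medians;
  std::set<double, std::greater<double>> distinct;
  for (const auto& kv : freqs) {
    const double m = median(kv.second);
    medians[kv.first] = m;
    distinct.insert(m);
  }

  // A confusion's rank is the 1-based position of its median in that ordering.
  std::map<std::uint64_t, int> ranks;
  for (const auto& kv : medians) {
    ranks[kv.first] =
        1 + static_cast<int>(std::distance(distinct.begin(), distinct.find(kv.second)));
  }
  return ranks;
}
```

With made-up frequencies 1, 2 and 1000 for one confusion and 50 and 60 for another, the first has the higher mean but a median of only 2, so the second confusion (median 55) is ranked above it.
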

I would very much like to be able to experiment with this soon.

Thanks!

M.
