Effect of quantization on search metrics #35

@lannelin

Description

Suggest that we compare document retrieval performance with and without quantization. We can do this in isolation from the tiptoe protocol, running it in a similar way to the TF-IDF benchmark: searching over a single index.
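
As a rough sketch of what such a comparison could look like, the snippet below ranks documents by inner product using float embeddings and again using quantized ones, then reports how much the two top-k lists overlap. Everything here is an assumption for illustration: the `quantize` helper is a hypothetical scale-and-round stand-in (not the quantizer used in preprocessing), the data is random, and overlap@k is just one possible metric.

```python
import numpy as np


def top_k(queries: np.ndarray, docs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest inner-product docs for each query."""
    scores = queries @ docs.T
    # argsort is ascending: take the last k columns, then reverse to descending
    return np.argsort(scores, axis=1)[:, -k:][:, ::-1]


def overlap_at_k(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Mean fraction of the reference top-k that also appears in the candidate top-k."""
    k = reference.shape[1]
    hits = [len(set(r) & set(c)) for r, c in zip(reference, candidate)]
    return float(np.mean(hits)) / k


def quantize(x: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Hypothetical stand-in quantizer: scale to a signed integer range and round."""
    max_int = 2 ** (n_bits - 1) - 1
    scale = max_int / np.abs(x).max()
    return np.rint(x * scale).astype(np.int64)


# Random stand-ins for the real embeddings and queries (sizes are arbitrary)
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 192))
queries = rng.normal(size=(20, 192))

baseline = top_k(queries, docs, k=10)
quantized = top_k(quantize(queries), quantize(docs), k=10)
overlap = overlap_at_k(baseline, quantized)
print(f"overlap@10 = {overlap:.3f}")
```

With real data, `docs` would come from the npy files and the quantized rows from the cluster files, and the overlap could be replaced by whatever relevance metric the TF-IDF benchmark already reports.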

We should be able to pick up embeddings from existing pre-processing. Each run in the data/ dir should have an embeddings/embeddings subdir with:

  • embeddings.npy (dim reduced)
  • embeddings_original.npy (original)

Unfortunately, the quantized embeddings are only in the clusters_<n>.txt files (in data/<run>/clusters/).
These can be loaded (and matched to the npy files 🤞 - needs to be checked!) with something like the following (hopefully; writing this on the fly in the issue):

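import glob

from arc_tiptoe.preprocessing.utils import parse_file

# replace <run> with the run ID; matches the clusters_<n>.txt files
cluster_fpaths = sorted(glob.glob("data/<run>/clusters/clusters_*.txt"))

# TODO more efficient to do a pre-allocation
all_inds = []
all_quantized_embs = []

for cluster_fpath in cluster_fpaths:
    # assumes parse_file yields 3-tuples: (index, quantized embedding, unused)
    indices, quantized_embs, _ = zip(*parse_file(cluster_fpath))
    all_inds.extend(indices)
    all_quantized_embs.extend(quantized_embs)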
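
One way the "needs to be checked" part might be approached: reorder the quantized rows by their indices and confirm that each row points in roughly the same direction as the corresponding row of embeddings.npy. This is only a sketch under assumptions that are not verified here: that the indices are 0-based row positions into the npy file, and that a document listed in more than one cluster file has the same quantized embedding each time. The helper names and the synthetic data are made up for illustration.

```python
import numpy as np


def align_by_index(indices, rows, n_docs):
    """Return rows ordered by document index 0..n_docs-1."""
    indices = np.asarray(indices)
    rows = np.asarray(rows)
    # Keep the first occurrence of each index in case a document is listed twice
    unique_inds, first_pos = np.unique(indices, return_index=True)
    if not np.array_equal(unique_inds, np.arange(n_docs)):
        raise ValueError("indices do not cover 0..n_docs-1 exactly")
    return rows[first_pos]


def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den


# Synthetic stand-ins: `reduced` plays the role of embeddings.npy, and the
# shuffled lists play the role of all_inds / all_quantized_embs
rng = np.random.default_rng(0)
reduced = rng.normal(size=(50, 8))
quantized_true = np.rint(reduced * 4).astype(np.int64)
perm = rng.permutation(50)
all_inds = perm.tolist()
all_quantized_embs = quantized_true[perm].tolist()

aligned = align_by_index(all_inds, all_quantized_embs, n_docs=50)
cos = rowwise_cosine(aligned, reduced)
print(f"min row-wise cosine = {cos.min():.3f}")
```

If the indices do match the npy rows, the row-wise cosine should be consistently high; values scattered around zero would suggest the ordering assumption is wrong.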

I believe data/c1970a4c-46c4-8fbe-cf94-d099e24ba206-2000-192.zip can be pulled down locally and used for development. Actual testing should be done over the entire corpus.
