Description
I suggest that we compare the performance of document retrieval with and without quantization. We can do this in isolation from the tiptoe protocol, running it in a similar way to the TFIDF benchmark: searching over a single index.
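A minimal sketch of what such a comparison could look like, assuming brute-force inner-product search over a single index and measuring how much of the float top-k survives quantization. The names `top_k`, `recall_at_k`, and `toy_quantize` are hypothetical, the data is random, and `toy_quantize` is a stand-in uniform scalar quantizer, not the quantization used in the existing pre-processing.

```python
import numpy as np


def top_k(doc_embs: np.ndarray, query_embs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest inner-product docs for each query."""
    scores = query_embs @ doc_embs.T  # shape (n_queries, n_docs)
    # argsort is ascending, so negate the scores to get the best docs first
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]


def recall_at_k(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Mean fraction of each reference top-k list that also appears in the candidate list."""
    overlaps = [len(set(r) & set(c)) / len(r) for r, c in zip(reference, candidate)]
    return float(np.mean(overlaps))


def toy_quantize(embs: np.ndarray, n_bits: int = 5) -> np.ndarray:
    """Stand-in quantizer (assumption): scale to a signed n_bits integer range and round."""
    max_int = 2 ** (n_bits - 1) - 1
    scale = max_int / np.abs(embs).max()
    return np.round(embs * scale).astype(np.int64)


# Random data standing in for real document and query embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
queries = rng.normal(size=(20, 64))

float_top = top_k(docs, queries, k=10)
quant_top = top_k(toy_quantize(docs), toy_quantize(queries), k=10)
print(f"recall@10 of quantized vs float search: {recall_at_k(float_top, quant_top):.3f}")
```

For the real benchmark, the float search would run over the npy embeddings and the quantized search over the vectors parsed from the cluster files, with the same queries and the same k for both.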
We should be able to pick up embeddings from existing pre-processing. Each run in the data/ dir should have an embeddings/embeddings subdir with:
- embeddings.npy (dim reduced)
- embeddings_original.npy (original)
Unfortunately, the quantized embeddings are only in the clusters_<n>.txt files (in data/<run>/clusters/).
These can be loaded (and matched to the npy files 🤞, needs to be checked!) with something like the following (hopefully; writing this on the fly in the issue):
from arc_tiptoe.preprocessing.utils import parse_file

# cluster_fpaths: paths to the data/<run>/clusters/clusters_<n>.txt files
# TODO more efficient to do a pre-allocation
all_inds = []
all_quantized_embs = []
for cluster_fpath in cluster_fpaths:
    # assumes parse_file yields (index, quantized_embedding, _) tuples
    indices, quantized_embs, _ = zip(*parse_file(cluster_fpath))
    all_inds.extend(indices)
    all_quantized_embs.extend(quantized_embs)
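Since the match between the cluster files and the npy files still needs to be checked, one possible sanity check is sketched below. It assumes the parsed indices are 0-based row positions into embeddings.npy, which is unverified. `align_to_rows` is a hypothetical helper, and a document appearing in more than one cluster is assumed to carry the same quantized vector each time.

```python
import numpy as np


def align_to_rows(all_inds, all_quantized_embs, n_rows):
    """Place each quantized embedding at the row given by its index.

    Assumes (unverified) that indices are 0-based row positions into
    embeddings.npy. Raises if an index is out of range or a row is missing.
    """
    inds = np.asarray(all_inds, dtype=np.int64)
    embs = np.asarray(all_quantized_embs)
    if inds.min() < 0 or inds.max() >= n_rows:
        raise ValueError("index outside the range of the npy rows")
    missing = np.setdiff1d(np.arange(n_rows), inds)
    if missing.size:
        raise ValueError(f"{missing.size} rows have no quantized embedding")
    aligned = np.empty((n_rows, embs.shape[1]), dtype=embs.dtype)
    aligned[inds] = embs  # repeated indices (docs in several clusters) overwrite each other
    return aligned


# Toy data standing in for the lists built from the parsed cluster files
all_inds = [2, 0, 1, 2]
all_quantized_embs = [[5, 6], [1, 2], [3, 4], [5, 6]]
aligned = align_to_rows(all_inds, all_quantized_embs, n_rows=3)
print(aligned)
```

With real data, `n_rows` would be the first dimension of the array loaded from embeddings.npy. A failure here would suggest the indices in the cluster files are not plain row positions.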
I believe data/c1970a4c-46c4-8fbe-cf94-d099e24ba206-2000-192.zip can be pulled down locally and used for development. Actual testing should be done over the entire corpus.