Effect of quantization on search metrics

I suggest we compare document retrieval performance with and without quantization. We can do this in isolation from the tiptoe protocol, running it in a similar way to the TF-IDF benchmark: searching over a single index.
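As a rough sketch of what the comparison could look like: brute-force inner-product search over a single index, run once with float embeddings and once with quantized ones, then measuring how much the top-k results overlap. The function names (`top_k`, `overlap_at_k`), the synthetic data, and the scale-and-round quantization are all placeholders for illustration, not what the pipeline actually does.

```python
import numpy as np


def top_k(doc_embs: np.ndarray, query_embs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest inner-product documents per query."""
    scores = query_embs @ doc_embs.T  # shape: (n_queries, n_docs)
    # argsort is ascending, so take the last k columns and reverse them
    return np.argsort(scores, axis=1)[:, -k:][:, ::-1]


def overlap_at_k(baseline: np.ndarray, candidate: np.ndarray) -> float:
    """Mean fraction of baseline top-k documents also found in candidate top-k."""
    k = baseline.shape[1]
    hits = [len(set(b) & set(c)) for b, c in zip(baseline, candidate)]
    return float(np.mean(hits)) / k


# Synthetic stand-in data (hypothetical; real runs would load the saved arrays)
rng = np.random.default_rng(0)
docs = rng.normal(size=(500, 32)).astype(np.float32)
queries = rng.normal(size=(20, 32)).astype(np.float32)

# Stand-in quantization: scale and round to integers (an assumption for the demo)
docs_q = np.round(docs * 16).astype(np.int32)
queries_q = np.round(queries * 16).astype(np.int32)

baseline = top_k(docs, queries, k=10)
quantized = top_k(docs_q, queries_q, k=10)
print(f"overlap@10: {overlap_at_k(baseline, quantized):.3f}")
```

Overlap against the unquantized ranking only tells us how much quantization perturbs results; if relevance labels are available, the same `top_k` output could feed the metrics used in the TF-IDF benchmark instead.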

We should be able to pick up embeddings from existing pre-processing. Each run in the `data/` dir should have an `embeddings/embeddings` subdir with:
- `embeddings.npy` (dim reduced)
- `embeddings_original.npy` (original)
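Loading the two arrays for a run might look something like this. `load_run_embeddings` is a hypothetical helper, and the check that both files have the same number of rows is an assumption about how the pre-processing writes them.

```python
from pathlib import Path

import numpy as np


def load_run_embeddings(run_dir) -> tuple[np.ndarray, np.ndarray]:
    """Load the dim-reduced and original embeddings for one run directory."""
    emb_dir = Path(run_dir) / "embeddings" / "embeddings"
    # mmap_mode="r" avoids reading the whole array into memory up front
    reduced = np.load(emb_dir / "embeddings.npy", mmap_mode="r")
    original = np.load(emb_dir / "embeddings_original.npy", mmap_mode="r")
    # Assumption: both files hold one row per document, in the same order
    if reduced.shape[0] != original.shape[0]:
        raise ValueError(
            f"row count mismatch: {reduced.shape[0]} vs {original.shape[0]}"
        )
    return reduced, original
```

Usage would be along the lines of `reduced, original = load_run_embeddings("data/<run>")`, with `<run>` replaced by the run directory name.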

Unfortunately, the quantized embeddings are only stored in the `clusters_<n>.txt` files (in `data/<run>/clusters/`).
These can be loaded (and matched to the npy files 🤞, this needs to be checked!) with something like the following (hopefully; writing this on the fly in the issue):
```python
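from pathlib import Path

from arc_tiptoe.preprocessing.utils import parse_file

# Assumption: cluster files live in data/<run>/clusters/; replace <run> with the run directory
cluster_fpaths = sorted(Path("data/<run>/clusters").glob("clusters_*.txt"))

# TODO: pre-allocating arrays would be more efficient than growing lists
all_inds = []
all_quantized_embs = []

for cluster_fpath in cluster_fpaths:
    # Assumes parse_file yields (index, quantized_embedding, <unused>) tuples; needs checking
    indices, quantized_embs, _ = zip(*parse_file(cluster_fpath))
    all_inds.extend(indices)
    all_quantized_embs.extend(quantized_embs)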
```
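To match the quantized embeddings to the npy rows, one option is to scatter them into a pre-allocated array keyed by document index. This is only a sketch: it assumes the indices in the cluster files are row positions in the npy files (exactly the thing that still needs checking), and `align_by_index` is a hypothetical helper, not something in the repo.

```python
import numpy as np


def align_by_index(indices, quantized_embs, n_docs):
    """Place each quantized embedding at the row given by its document index.

    Returns the aligned array and a boolean mask of rows that were filled.
    If a document appears in more than one cluster, the last copy wins.
    """
    indices = np.asarray(indices, dtype=np.int64)
    embs = np.asarray(quantized_embs)
    if indices.min() < 0 or indices.max() >= n_docs:
        raise ValueError("document index out of range for the npy files")
    aligned = np.zeros((n_docs, embs.shape[1]), dtype=embs.dtype)
    seen = np.zeros(n_docs, dtype=bool)
    aligned[indices] = embs
    seen[indices] = True
    return aligned, seen


# Tiny worked example with made-up values
demo_inds = [2, 0, 1]
demo_embs = [[20, 21], [0, 1], [10, 11]]
aligned, seen = align_by_index(demo_inds, demo_embs, n_docs=4)
print(aligned)
print(seen)
```

With the real data this would be called as `align_by_index(all_inds, all_quantized_embs, n_docs)` where `n_docs` is the row count of `embeddings.npy`. Any `False` entries in `seen` would point to documents missing from the cluster files, which is worth checking before running the comparison.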

I believe `data/c1970a4c-46c4-8fbe-cf94-d099e24ba206-2000-192.zip` can be pulled down locally and used for development. Actual testing should be done over the entire corpus.
