J-Dymond
This pull request introduces a preprocessing utility for generating query embeddings from IR datasets:

New Script: preprocess_queries.py:

Usage: python arc-exps/preprocess_queries.py --json_config_path [config path].json

The config json file should contain:

{
    "embed_model": "msmarco-distilbert-base-tas-b",
    "data": {
        "dataset": "msmarco",
        "query_set": "msmarco-document/trec-dl-2019"
    }
}
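
A minimal sketch of how a config of this shape could be loaded and checked before any embedding work starts. The `load_config` helper and its error messages are hypothetical and not taken from preprocess_queries.py; only the three field names come from the example above. Note that `json.loads` is strict, so stray brackets or trailing commas in the file cause a parse error.

```python
import json

# Keys expected under "data", per the example config in this PR description.
REQUIRED_DATA_KEYS = ("dataset", "query_set")


def load_config(text: str) -> dict:
    """Parse config JSON and check that the documented fields are present.

    Hypothetical helper: the real script may validate differently or not at all.
    """
    config = json.loads(text)  # strict JSON: trailing commas raise an error
    if "embed_model" not in config:
        raise KeyError("missing 'embed_model'")
    data = config.get("data", {})
    missing = [key for key in REQUIRED_DATA_KEYS if key not in data]
    if missing:
        raise KeyError(f"missing data keys: {missing}")
    return config


example = """
{
    "embed_model": "msmarco-distilbert-base-tas-b",
    "data": {
        "dataset": "msmarco",
        "query_set": "msmarco-document/trec-dl-2019"
    }
}
"""

config = load_config(example)
print(config["data"]["query_set"])
```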

Outputs are saved with the following file structure:

processed_queries/
├── {dataset}/
│   ├── {query_set}/
│   │   └── {model_name}.csv
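
The layout above can be sketched with `pathlib`. This is an illustration under assumptions: `output_path` is a hypothetical helper, and the PR description does not say whether the script keeps the slash in a query set such as `msmarco-document/trec-dl-2019` or sanitizes it. If the slash is kept, as here, the query set contributes two directory levels.

```python
from pathlib import Path


def output_path(root: str, dataset: str, query_set: str, model_name: str) -> Path:
    """Build {root}/{dataset}/{query_set}/{model_name}.csv (hypothetical helper)."""
    return Path(root) / dataset / query_set / f"{model_name}.csv"


path = output_path(
    "processed_queries",
    "msmarco",
    "msmarco-document/trec-dl-2019",
    "msmarco-distilbert-base-tas-b",
)
# A real script would likely create the parent directories before writing:
# path.parent.mkdir(parents=True, exist_ok=True)
print(path.as_posix())
```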

The output is a CSV file with three fields: query ID, query text, and the embedding vector.
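
One way such a three-field file could be written and read back is sketched below. The column names (`query_id`, `text`, `embedding`) and the choice to store each vector as a JSON list in a single cell are assumptions, not details confirmed by this PR. The encoder is passed in as a callable; `toy_encode` is a stand-in so the example runs without downloading a model. In a real run, something like the `encode` method of a sentence-transformers model could fill that role, assuming that library is what loads `embed_model`.

```python
import csv
import io
import json


def write_query_embeddings(handle, queries, encode):
    """Write one row per query: ID, text, and the embedding as a JSON list.

    queries: iterable of (query_id, text) pairs.
    encode:  callable mapping a list of texts to a sequence of vectors.
    """
    queries = list(queries)
    vectors = encode([text for _, text in queries])
    writer = csv.writer(handle)
    writer.writerow(["query_id", "text", "embedding"])  # assumed column names
    for (query_id, text), vector in zip(queries, vectors):
        writer.writerow([query_id, text, json.dumps([float(x) for x in vector])])


def read_query_embeddings(handle):
    """Read the file back into (query_id, text, vector) tuples."""
    reader = csv.DictReader(handle)
    return [
        (row["query_id"], row["text"], json.loads(row["embedding"]))
        for row in reader
    ]


def toy_encode(texts):
    """Stand-in encoder: 2-d vectors from text length and space count."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


buffer = io.StringIO()
write_query_embeddings(
    buffer,
    [("q1", "what is bm25"), ("q2", "dense retrieval, explained")],
    toy_encode,
)
buffer.seek(0)
rows = read_query_embeddings(buffer)
print(rows[0])
```

Serializing the vector as JSON keeps the file at exactly three columns regardless of embedding dimension, and the `csv` module quotes the cell so the commas inside it do not split the row.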

Slurm Scripts:

This PR also adds Slurm scripts for running jobs on computing clusters that use Slurm job scheduling.

scripts/slurm/pre-process-embeddings.sh can be used to run preprocess_queries.py and takes the path to the config as an argument.
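
As a rough illustration, the submission could be driven from Python by building an `sbatch` command, which forwards any arguments placed after the script path to the script itself. The `sbatch_command` helper and the config path `configs/example.json` are hypothetical; any resource options the script expects are not described in this PR and are omitted.

```python
import shlex


def sbatch_command(config_path, script="scripts/slurm/pre-process-embeddings.sh"):
    """Build the sbatch invocation; arguments after the script go to the script."""
    return ["sbatch", script, config_path]


cmd = sbatch_command("configs/example.json")  # hypothetical config path
print(shlex.join(cmd))

# On a machine with Slurm installed, the job could then be submitted with:
# import subprocess
# subprocess.run(cmd, check=True)
```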

@J-Dymond J-Dymond linked an issue Oct 10, 2025 that may be closed by this pull request
@eddableheath eddableheath self-requested a review October 10, 2025 13:54
@eddableheath eddableheath left a comment

All LGTM!

@eddableheath eddableheath merged commit 25227ca into main Oct 10, 2025
@eddableheath eddableheath deleted the 36-pre-compute-query-embeddings branch October 10, 2025 13:55
Development

Successfully merging this pull request may close these issues.

Pre-compute query embeddings
