Add query preprocessing script for embedding generation #39

J-Dymond · 2025-10-10T10:45:26Z

This pull request introduces a preprocessing utility for generating query embeddings from IR datasets:

New Script: `preprocess_queries.py`:

### Usage: `python arc-exps/preprocess_queries.py --json_config_path [config path].json`

The JSON config file should contain:

{
    "embed_model": "msmarco-distilbert-base-tas-b",
    "data": {
        "dataset": "msmarco",
        "query_set": "msmarco-document/trec-dl-2019"
    }
}

Outputs are saved with the following file structure:

processed_queries/
├── {dataset}/
│   ├── {query_set}/
│   │   └── {model_name}.csv
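The layout above can be sketched with `pathlib`. This is an illustration only: the helper name `output_path` is hypothetical, and it assumes that `{model_name}` is the `embed_model` value from the config and that the root directory is relative to the working directory.

```python
from pathlib import Path


def output_path(config, root="processed_queries"):
    # Hypothetical helper mirroring the tree above:
    # {root}/{dataset}/{query_set}/{model_name}.csv
    data = config["data"]
    return Path(root) / data["dataset"] / data["query_set"] / f"{config['embed_model']}.csv"


config = {
    "embed_model": "msmarco-distilbert-base-tas-b",
    "data": {"dataset": "msmarco", "query_set": "msmarco-document/trec-dl-2019"},
}
print(output_path(config).as_posix())
```

Under these assumptions, a `query_set` containing a `/` (as in the example config) produces an extra directory level, so the parent directories would need to be created with something like `path.parent.mkdir(parents=True, exist_ok=True)` before writing.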

The output is a CSV file with three fields: the query ID, the query text, and the embedding vector.
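One way to store a vector in a single CSV field is to serialize it as a JSON string. The sketch below shows that approach with the standard library; the column names in `FIELDS`, the helper names, and the JSON encoding are all assumptions for illustration, and the script itself may serialize embeddings differently.

```python
import csv
import json

# Hypothetical column names; the real header written by the script may differ.
FIELDS = ["query_id", "text", "embedding"]


def write_rows(path, rows):
    # rows is an iterable of (query_id, text, vector) tuples.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for query_id, text, vector in rows:
            writer.writerow(
                {"query_id": query_id, "text": text, "embedding": json.dumps(list(vector))}
            )


def read_rows(path):
    # Decode the embedding column back into a list of floats.
    with open(path, newline="", encoding="utf-8") as f:
        return [
            (row["query_id"], row["text"], json.loads(row["embedding"]))
            for row in csv.DictReader(f)
        ]
```

Encoding the vector as JSON keeps each query on one row and lets the `csv` module handle quoting of the commas inside the vector.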

Slurm Scripts:

This PR also adds Slurm scripts for running jobs on compute clusters that use Slurm job scheduling.

`scripts/slurm/pre-process-embeddings.sh` runs `preprocess_queries.py` and takes the path to the config as an argument.


eddableheath

All LGTM!

J-Dymond and others added 4 commits October 8, 2025 16:19

added a script which takes a config and saves embedded queries, for the model and dataset combination

9d79ce6

added get_device() function to choose appropriate device

d57ac10

adding slurm scripts to repo

1ba39ba

minor changes to the preprocess queries script, as well as a slurm script to run it. Also added a QoL slurm script for monitoring slurm job outputs

a420451

J-Dymond linked an issue Oct 10, 2025 that may be closed by this pull request

Pre-compute query embeddings #36

Closed

eddableheath self-requested a review October 10, 2025 13:54

eddableheath approved these changes Oct 10, 2025

eddableheath merged commit 25227ca into main Oct 10, 2025

eddableheath deleted the 36-pre-compute-query-embeddings branch October 10, 2025 13:55
