Add query preprocessing script for embedding generation #39
This pull request introduces a preprocessing utility for generating query embeddings from IR datasets:
New Script: preprocess_queries.py
### Usage:
python arc-exps/preprocess_queries.py --json_config_path [config path].json
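For readers unfamiliar with this invocation style, here is a minimal sketch of how a script could accept the `--json_config_path` flag and load the file it points to. This is an illustration only, not the code in this PR: the function names `parse_args` and `load_config` are hypothetical, and no config keys are assumed.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; only --json_config_path comes from the PR description."""
    parser = argparse.ArgumentParser(
        description="Generate query embeddings from a JSON config (illustrative sketch)."
    )
    parser.add_argument(
        "--json_config_path",
        type=Path,
        required=True,
        help="Path to the JSON config file.",
    )
    return parser.parse_args(argv)


def load_config(path):
    """Read the config file and return its contents as a plain Python object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.json_config_path)
    # The real script would use the config here; this sketch only reports what it loaded.
    print(f"Loaded config from {args.json_config_path} with {len(config)} top-level entries")
```

Passing `argv=None` lets `argparse` fall back to `sys.argv`, while still allowing the parser to be exercised with an explicit argument list.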
The config JSON file should contain:
Outputs are saved with the following file structure:
The output is a CSV file with three fields: query ID, query text, and the embedding vector.
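To make the three-field layout concrete, here is a small sketch of writing and reading such a file with Python's standard library. The column names (`query_id`, `query_text`, `embedding`) and the choice to store the vector as a JSON-encoded list inside one CSV cell are assumptions made for illustration; the actual header names and vector serialization used by the script may differ.

```python
import csv
import io
import json

# Hypothetical column names; the real script's header may differ.
FIELDS = ["query_id", "query_text", "embedding"]


def write_rows(handle, rows):
    """Write (query_id, query_text, vector) tuples as CSV rows with a header."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for query_id, query_text, vector in rows:
        writer.writerow(
            {
                "query_id": query_id,
                "query_text": query_text,
                # Assumption: the vector is stored as a JSON list in a single cell.
                "embedding": json.dumps(vector),
            }
        )


def read_rows(handle):
    """Yield (query_id, query_text, vector) tuples, decoding the vector from JSON."""
    for row in csv.DictReader(handle):
        yield row["query_id"], row["query_text"], json.loads(row["embedding"])


if __name__ == "__main__":
    # Round-trip two made-up rows through an in-memory buffer.
    buffer = io.StringIO(newline="")
    write_rows(buffer, [("q1", "what is dense retrieval?", [0.1, -0.25, 0.5])])
    buffer.seek(0)
    for query_id, query_text, vector in read_rows(buffer):
        print(query_id, query_text, len(vector))
```

Encoding the vector as JSON keeps each embedding in one cell, since the `csv` module quotes the commas inside it automatically.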
Slurm Scripts:
Also added are Slurm scripts for running jobs on computing clusters that use Slurm job scheduling.
scripts/slurm/pre-process-embeddings.sh can be used to run preprocess_queries.py and takes the path to the config as an argument.