This repository contains the code and resources for our paper:
- Xuanang Chen, Jian Luo, Ben He, Le Sun, Yingfei Sun. Towards Robust Dense Retrieval via Local Ranking Alignment. In IJCAI 2022.
Our code is developed based on the Tevatron DR training toolkit.
We recommend creating a new conda environment with `conda create -n rodr python=3.7`,
activating it with `conda activate rodr`, and then installing the following packages:
`torch==1.8.1`, `faiss-cpu==1.7.1`, `transformers==4.9.2`, `datasets==1.11.0`.
Note: In this repo, we mainly use the MS MARCO passage ranking dataset as an example. Before running the experiments,
you can refer to the `download_raw_data.sh` script to download and process the raw data, which will be saved in the `data/msmarco_passage/raw` folder, e.g., the `train.negatives.tsv` file that contains the negatives of each train query for constructing the training data.
Dev Query: All query variation sets for the MS MARCO small Dev set used in our paper are provided
in the `data/msmarco_passage/query/dev` folder. You can directly use these query variation
sets to test the robustness of your DR model, or use the `query_variation_generation.py`
script to generate a query variation set yourself:
qv_type=MisSpell
python query_variation_generation.py \
    --original_query_file ./msmarco_passage/raw/queries.dev.small.tsv \
    --query_variation_file ./msmarco_passage/process/query/dev/queries.dev.small.${qv_type}.tsv \
    --variation_type ${qv_type}
You need to specify the type of query variation (namely, `qv_type`) from the eight pre-defined types of query variations:
MisSpell, ExtraPunc, BackTrans, SwapSyn_Glove, SwapSyn_WNet, TransTense, NoStopword, SwapWords.
Note that a few queries may be kept in their original form in a given query variation set.
For example, if a query does not contain any stopword, the NoStopword variation is
not applicable. Besides, before using the `query_variation_generation.py` script, you may need to install
the TextFlint, TextAttack, and NLTK toolkits.
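For convenience, you can also generate all eight Dev variation sets in one pass. Below is a minimal sketch that simply loops over the pre-defined types and calls the script shown above; it assumes you run it from the `data` directory, matching the relative paths in the example.

```python
# Sketch: generate all eight Dev query variation sets by looping over the
# pre-defined variation types and invoking query_variation_generation.py.
# Assumes the working directory matches the relative paths used above.
import subprocess

QV_TYPES = ["MisSpell", "ExtraPunc", "BackTrans", "SwapSyn_Glove",
            "SwapSyn_WNet", "TransTense", "NoStopword", "SwapWords"]

for qv_type in QV_TYPES:
    subprocess.run([
        "python", "query_variation_generation.py",
        "--original_query_file", "./msmarco_passage/raw/queries.dev.small.tsv",
        "--query_variation_file",
        f"./msmarco_passage/process/query/dev/queries.dev.small.{qv_type}.tsv",
        "--variation_type", qv_type,
    ], check=True)
```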
Train Query: We also need to generate variations for the train queries to enhance the DR model.
Similar to the Dev set, we first generate eight variation sets for the train query set, and then merge
them uniformly to obtain the final train query variation set (our generated train query variation file
is available in the `data/msmarco_passage/query/train` folder). This file is used to insert variations
into the training data by adding a 'query_variation' field to each training example.
You can refer to the `construct_train_query_variations.py` script after you obtain the train variation sets
and the original training data.
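Conceptually, inserting variations into the training data only attaches the variation text of each train query to its training example. The following sketch illustrates that idea for JSONL-style training files; the file names, the `qid` key, and the TSV layout of the variation file are assumptions made for illustration, so please follow `construct_train_query_variations.py` for the actual format used in this repo.

```python
# Illustrative sketch only: attach a 'query_variation' field to each JSONL
# training example by looking it up in the merged train query variation file.
# File names, the 'qid' key, and the "qid \t variation" TSV layout are assumed.
import csv
import json

# qid -> variation text
variations = {}
with open("queries.train.variation.tsv", encoding="utf-8") as f:
    for qid, query in csv.reader(f, delimiter="\t"):
        variations[qid] = query

with open("train.oq.jsonl", encoding="utf-8") as fin, \
     open("train.qv.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        example = json.loads(line)
        qid = str(example["qid"])  # assumed key holding the query id
        # A few queries may keep their original form if no variation exists.
        example["query_variation"] = variations.get(qid, example.get("query"))
        fout.write(json.dumps(example) + "\n")
```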
Standard DR: To obtain a standard DR model, like DR_OQ in our paper, you need to
construct the training data first:
- OQ: the training data with original train queries, generated by the `bulid_train.py` script.
- QV: the training data with train query variations, obtained by inserting the variation version of the original train queries into the OQ training data.
After that, you can refer to the `train_standard_dpr.sh` script to train the
DR_OQ, DR_QV, and DR_OQ->QV models using the OQ and QV training data,
as described in our paper.
RoDR:
As for our proposed RoDR model, to achieve better alignment, you need to collect nearer neighbors
for each query. Specifically, you can update the negatives in the OQ training data by sampling from
the top candidates returned by the DR_OQ model. After that, you can refer to the `bulid_train_nn.py`
script, whose `--query_variation` argument requires the generated train query variation file.
Alternatively, you can also add the variation version of the train queries after constructing
the training data, similar to QV, using the `construct_training_data_with_variations` function
available in the `construct_train_query_variations.py` script.
After that, you can refer to the `train_rodr_dpr.sh` script to train a RoDR w/ DR_OQ model
on top of the DR_OQ model. Compared to standard DR training, you need to change `--training_mode`
to the `oq.qv.lra` mode, provide the initial DR model path via the `--model_name_or_path` argument, and set
the loss weights in Eq. 8, as described in our paper.
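For intuition only, the objective can be thought of as the usual contrastive retrieval loss for the original query, a second contrastive loss for its variation, and a local ranking alignment term that pulls the variation query's score distribution over shared candidates towards that of the original query. The sketch below is a loose PyTorch illustration of this idea, not the exact Eq. 8; the KL-divergence form and the weight names `alpha`/`beta` are assumptions, so please refer to the paper and the training code for the precise definition.

```python
# Loose illustration (not the exact Eq. 8): contrastive losses for the original
# query and its variation, plus a KL-style local ranking alignment term that
# aligns the two queries' score distributions over the same candidates.
import torch
import torch.nn.functional as F

def rodr_style_loss(q_emb, qv_emb, cand_emb, alpha=1.0, beta=1.0):
    """q_emb / qv_emb: [d] embeddings of the original / variation query;
    cand_emb: [n, d] candidate passages, index 0 assumed to be the positive."""
    scores_q = cand_emb @ q_emb    # [n] similarity scores for the original query
    scores_qv = cand_emb @ qv_emb  # [n] similarity scores for the variation query
    target = torch.zeros(1, dtype=torch.long)  # positive sits at index 0

    loss_oq = F.cross_entropy(scores_q.unsqueeze(0), target)
    loss_qv = F.cross_entropy(scores_qv.unsqueeze(0), target)
    # Local ranking alignment: make the variation's ranking distribution over
    # the candidates match the original query's distribution.
    loss_lra = F.kl_div(F.log_softmax(scores_qv, dim=-1),
                        F.softmax(scores_q, dim=-1), reduction="sum")
    return loss_oq + alpha * loss_qv + beta * loss_lra
```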
After training a DR model, you can use it to carry out dense retrieval as follows:
- Tokenizing: use the `tokenize_passages.py` and `tokenize_queries.py` scripts to tokenize all passages in the corpus, as well as the original queries and query variations.
- Encoding and Retrieval: refer to `encode_retrieve_dpr.sh` to first encode passages and queries into vectors, and then use Faiss to index and retrieve (a rough Faiss sketch follows this list).
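As a rough picture of the Faiss step in `encode_retrieve_dpr.sh`, the sketch below builds a flat inner-product index over passage vectors and retrieves the top candidates for each query vector; the random arrays and the top-1000 cutoff are placeholders, not the script's actual I/O.

```python
# Minimal Faiss retrieval sketch: flat inner-product index over passage
# embeddings, then top-k search for query embeddings. Shapes and I/O simplified.
import faiss
import numpy as np

passage_emb = np.random.rand(1000, 768).astype("float32")  # stand-in for encoded passages
query_emb = np.random.rand(8, 768).astype("float32")       # stand-in for encoded queries

index = faiss.IndexFlatIP(passage_emb.shape[1])  # inner-product (dot-product) index
index.add(passage_emb)
scores, ids = index.search(query_emb, 1000)      # top-1000 passage ids per query
print(ids.shape)                                 # (num_queries, 1000)
```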
As for zero-shot retrieval on ANTIQUE, where all DR models are trained only on the MS MARCO passage dataset,
please refer to the `run_antique_zeroshot.sh` script.
For the evaluation on the MS MARCO passage ranking dataset, such as MRR@10, Recall, and the statistical t-test,
we provide the `variations_avg_tt_test.py` script to compute the metrics for all paired run files
from the two DR models under comparison. You can use it like this:
# for single run file
python variations_avg_tt_test.py qrels run_file1 run_file2
# for all run files
python variations_avg_tt_test.py qrels run_dir1 run_dir2 fusion
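For reference, the significance test is a standard paired t-test over per-query scores of the two runs. The sketch below shows that core computation with SciPy on hypothetical per-query metric dictionaries; the real qrels parsing, metric computation, averaging over variation sets, and the `fusion` mode are handled by `variations_avg_tt_test.py` itself.

```python
# Sketch of the paired significance test behind variations_avg_tt_test.py:
# compare per-query scores (e.g. reciprocal rank @10) of two runs.
from scipy import stats

def paired_t_test(per_query_a, per_query_b):
    """per_query_a / per_query_b: dicts mapping qid -> metric value for run A / run B."""
    qids = sorted(set(per_query_a) & set(per_query_b))  # only queries present in both runs
    a = [per_query_a[q] for q in qids]
    b = [per_query_b[q] for q in qids]
    t_stat, p_value = stats.ttest_rel(a, b)
    return t_stat, p_value
```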
- Query variations:
  - Passage-Dev: available in the `data/msmarco_passage/query` folder, for both `dev` and `train` query sets.
  - Document-Dev: available in the `data/msmarco_doc/query` folder, for both `dev` and `train` query sets.
  - ANTIQUE: available in the `data/antique/query` folder, which is collected from five types of manually validated query variations.
- Models:

  | MS MARCO Passage | MS MARCO Document |
  | --- | --- |
  | DR_OQ | DR_OQ |
  | DR_QV | DR_QV |
  | DR_OQ->QV | DR_OQ->QV |
  | RoDR w/ DR_OQ | RoDR w/ DR_OQ |
- Retrieval files\*:

  | Dataset | DR_OQ | RoDR w/ DR_OQ |
  | --- | --- | --- |
  | Passage-Dev | Download | Download |
  | Document-Dev | Download | Download |
  | ANTIQUE | Download | Download |

  \* Due to the large size of the run files on Passage-Dev, we only provide the run files of the DR_OQ and RoDR w/ DR_OQ models. If you want to obtain the run files of the DR_QV and DR_OQ->QV models, please feel free to contact us.
If you want to apply RoDR to publicly available DR models, such as ANCE, TAS-Balanced, and ADORE+STAR, which are enhanced in our paper, you need to make some minor changes at the model level, such as adding the pooler in ANCE and using separate query and passage encoders in ADORE+STAR. Herein, we provide the model checkpoints and retrieval files for the reproducibility of our experiments and other research uses.
- Models:

  | Original | RoDR |
  | --- | --- |
  | ANCE | RoDR w/ ANCE |
  | TAS-Balanced | RoDR w/ TAS-Balanced |
  | ADORE+STAR | RoDR w/ ADORE+STAR |
- Retrieval files\*\*:

  | Model | Passage-Dev | ANTIQUE |
  | --- | --- | --- |
  | RoDR w/ ANCE | Download | Download |
  | RoDR w/ TAS-Balanced | Download | Download |
  | RoDR w/ ADORE+STAR | Download | Download |

  \*\* Due to the large size of the run files on Passage-Dev, we only provide the run files of the RoDR models. If you want to obtain the run files of the original existing DR models, please feel free to contact us.
If you find our paper/resources useful, please cite:
@inproceedings{chen_ijcai2022-275,
title = {Towards Robust Dense Retrieval via Local Ranking Alignment},
author = {Xuanang Chen and
Jian Luo and
Ben He and
Le Sun and
Yingfei Sun},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI-22}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
pages = {1980--1986},
year = {2022}
}
