This project includes utilities and scripts for automatic dataset generation. It is used in the following papers:
- Warstadt, A., Cao, Y., Grosu, I., Peng, W., Blix, H., Nie, Y., Alsop, A., Bordia, S., Liu, H., Parrish, A., Wang, S. F., & Bowman, S. R. (2019). Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs. arXiv preprint arXiv:1909.02597.
- Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. F., & Bowman, S. R. (2019). BLiMP: A Benchmark of Linguistic Minimal Pairs for English. arXiv preprint arXiv:1912.00582.
- Jeretic, P., Warstadt, A., Bhooshan, S., & Williams, A. (2020). Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition. arXiv preprint arXiv:2004.03066.
To run a sample data generation script, navigate to the data_generation directory and run the following command:

`python -m generation_projects.blimp.adjunct_island`

If all dependencies are installed in your environment, this will generate the adjunct_island dataset from BLiMP. Generation takes about a minute to begin, after which progress can be monitored in `outputs/benchmark/adjunct_island.jsonl`.
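Once lines start appearing, you can inspect the generated minimal pairs directly. Here is a minimal sketch, assuming the file follows the released BLiMP schema (one JSON object per line with `sentence_good` and `sentence_bad` fields; check the file itself for the exact schema this version emits):

```python
# Minimal sketch: print each generated minimal pair.
# The field names follow the released BLiMP data and are an assumption here.
import json

with open("outputs/benchmark/adjunct_island.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        print(pair.get("sentence_good"), "||", pair.get("sentence_bad"))
```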
- With the exception of BLiMP, all project-specific code is kept in separate branches. BLiMP appears in master as a helpful exemplar.
- Major branches include:
- blimp
- imppres
- npi
- msgs
- structure_dependence
- qp
- The project contains the following packages:
  - `generation_projects`: scripts for generating data, organized into subdirectories by research project.
  - `mturk_qc`: code for carrying out Amazon Mechanical Turk quality control.
  - `outputs`: generated data, organized into subdirectories by research project.
  - `results`: experiment results files.
  - `results_processing`: scripts for analyzing results and producing figures.
  - `utils`: shared code for generation projects. Includes utilities for processing the vocabulary, generating constituents, manipulating generated strings, etc.
- It also contains a vocabulary file and documentation of the vocabulary:
  - `vocabulary.csv`: the vocab file.
  - `vocab_documentation.md`: the vocab documentation.
- The vocabulary file is `vocabulary.csv`.
- Each row in the .csv is a lexical item. Each column is a feature encoding grammatical information about the lexical item. Detailed documentation of the columns can be found in `vocab_documentation.md`.
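For quick one-off inspection outside the generation code, the file can be read with the standard library. A minimal sketch, assuming a standard CSV header row (the `animate` column is one of those used in the examples below):

```python
# Minimal sketch: collect every lexical item whose "animate" feature is 1.
import csv

with open("vocabulary.csv", newline="") as f:
    animate_items = [row for row in csv.DictReader(f) if row.get("animate") == "1"]
```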
- The following notation is used to define selectional restrictions in the `arg_1`, `arg_2`, and `arg_3` columns:
  - `<DISJUNCTION> := <CONJUNCTION> | <CONJUNCTION>;<DISJUNCTION>`
  - `<CONJUNCTION> := <CONDITION> | <CONDITION>^<CONJUNCTION>`
  - `<CONDITION> := <COLUMN>=<VALUE>`
- In other words, the entire restriction is written in disjunctive normal form, where `;` is used for disjunction and `^` is used for conjunction (see the sketch after the examples below).
- Example 1: `arg_1` of the lexical item *breaking* is `animate=1`. This means any noun appearing as the subject of *breaking* must have value `1` in the column `animate`.
- Example 2: `arg_1` of the lexical item *buys* is `institution=1^sg=1;animate=1^sg=1`. This means any noun appearing as the subject of *buys* must meet one of the following conditions:
  - have value `1` in column `institution` and value `1` in column `sg`, or
  - have value `1` in column `animate` and value `1` in column `sg`.
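This notation is mechanical to evaluate. A minimal sketch of such a check (the `satisfies` helper is hypothetical, not part of `utils`; it assumes a row is a mapping from column names to string values, as `csv.DictReader` produces above):

```python
# Hypothetical helper: does a vocab row satisfy a selectional restriction
# written in the ';' (disjunction) / '^' (conjunction) notation above?
def satisfies(row, restriction):
    for conjunction in restriction.split(";"):
        conditions = (cond.split("=") for cond in conjunction.split("^"))
        if all(row.get(column) == value for column, value in conditions):
            return True
    return False

# The arg_1 restriction on "buys" from Example 2:
row = {"institution": "0", "animate": "1", "sg": "1"}
print(satisfies(row, "institution=1^sg=1;animate=1^sg=1"))  # True (second disjunct)
```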
- Disclaimer: As this project is under active development, data generated with different versions of the vocabulary may differ slightly.
- The `utils` package contains the shared code for the various generation projects.
  - `utils.conjugate` includes functions which conjugate verbs and add selecting auxiliaries/modals.
  - `utils.constituent_building` includes functions which "do syntax". The following are especially useful:
    - `verb_args_from_verb`: gathers all arguments of a verb into a dictionary.
    - `V_to_VP_mutate`: given a verb, modifies the expression to contain the string corresponding to a full VP.
    - `N_to_DP_mutate`: given a noun, gathers all arguments and a determiner, and modifies the expression to contain the string corresponding to a full DP.
  - `utils.data_generator` defines general classes that are instantiated by a particular generation project. The classes contain metadata fields, the main loop for generating a dataset (`generate_paradigm`), and functions for logging and exception handling.
  - `utils.data_type` contains the data type necessary for the numpy structured array used to store the vocabulary. If the columns of the vocabulary file are ever modified, this file must be modified to match.
  - `utils.string_utils` contains functions for cleaning up generated strings (removing extra whitespace, fixing capitalization, etc.).
  - `utils.vocab_sets` contains constants for accessing commonly used sets of vocab entries. Building these constants takes about a minute at the start of a generation script, but speeds up generation of large datasets.
  - `utils.vocab_table` contains functions for creating and accessing the vocabulary table (see the usage sketch below):
    - `get_all`: gathers all vocab items with a given restriction.
    - `get_all_conjunctive`: gathers all vocab items with the given restrictions.
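A usage sketch for the `utils.vocab_table` accessors: the exact signatures are not documented here, so the argument shapes below (a column/value pair for `get_all`, a list of such pairs for `get_all_conjunctive`) are assumptions; consult `utils/vocab_table.py` for the real ones.

```python
# Assumed signatures, for illustration only.
from utils.vocab_table import get_all, get_all_conjunctive

animate_nouns = get_all("animate", "1")  # one condition
singular_institutions = get_all_conjunctive(
    [("institution", "1"), ("sg", "1")]  # conjunction of conditions
)
```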
If you use the data generation project in your work, please cite the BLiMP paper:
@article{warstadt2019blimp,
title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
journal={arXiv preprint arXiv:1912.00582},
year={2019}
}