A comprehensive, high-quality, human-annotated plain-text dataset for SQL AI tasks across diverse domains and complexity levels.
SpiderMan is an improved version of the Spider 1.0 dataset.
- The databases are made available in plain-text format instead of a set of SQLite files. This makes it easy for you to load the dataset into any database of your choice.
- The schema has been standardized: table ordering, column data types, and primary and foreign key constraints have been corrected.
- Data has been corrected so that it passes schema-based validation.
- Queries have been fixed so that they execute successfully.
The dataset comprises 157 databases. Each one comes with its own schema, data, and queries. By default, schemas and queries are in the MySQL dialect and can be transpiled to other dialects using the transpiler script. At present, no query spans multiple databases. Each query within a database is assigned to either the training set or the test set, never both.
Split | Queries | Tables | Databases |
---|---|---|---|
Train | 6726 | 699 | 137 |
Test | 1034 | 80 | 20 |
Total | 7760 | 779 | 157 |
The following commands are for macOS.
conda create --name spiderman-env python=3.12.2
conda activate spiderman-env
pip install -r requirements.txt
MySQL was chosen as the default dialect because it is one of the most widely used, can be set up quickly, and comes with various validation mechanisms.
docker run --name spiderman-mysql -e MYSQL_ROOT_PASSWORD=PeterParker -p 3306:3306 -d mysql:9.0.0
python scripts/load_dataset.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'
The script creates schemas for all the databases and inserts their data into the target database system. It accepts one argument: a SQLAlchemy 2.0 compatible URL to the target database. More details on the URL format are available in the SQLAlchemy documentation. If the target database is not MySQL, the script will try to transpile the schema before loading it.
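As a minimal sketch of how such a URL might be assembled, the snippet below builds the same string used in the command above from its parts. The helper name `build_db_url` is hypothetical and not part of this repository; the only assumption about SQLAlchemy is its documented requirement that special characters in the user name or password be URL-encoded.

```python
from urllib.parse import quote_plus


def build_db_url(dialect, driver, user, password, host, port):
    """Assemble a SQLAlchemy-style URL: dialect+driver://user:password@host:port.

    Hypothetical helper for illustration only. The user name and password
    are URL-encoded so characters such as '@' or '/' do not break parsing.
    """
    return (
        f"{dialect}+{driver}://"
        f"{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"
    )


# Reproduces the URL passed to load_dataset.py in the command above.
url = build_db_url("mysql", "mysqlconnector", "root", "PeterParker", "localhost", 3306)
print(url)

# A password containing reserved characters is escaped before being embedded.
escaped = build_db_url("mysql", "mysqlconnector", "root", "p@ss/word", "localhost", 3306)
print(escaped)
```

The resulting string can be passed as the single argument to the script; quoting it in the shell, as in the command above, keeps the shell from interpreting any special characters.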
python scripts/validate_queries.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'
Once the dataset is loaded, you can run this script to execute the queries. It checks that every query completes successfully. Query results are not verified at this point.
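The idea of execution-only validation can be sketched as follows. This is not the code of `validate_queries.py`: it is an illustrative stand-in that uses Python's built-in `sqlite3` module and an in-memory database instead of MySQL, and the table, rows, and queries are made up for the example.

```python
import sqlite3


def validate_queries(conn, queries):
    """Run each query and return (query, error message) pairs for failures.

    Only successful completion is checked; returned rows are discarded.
    """
    failures = []
    for query in queries:
        try:
            # fetchall() forces the query to run to completion.
            conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            failures.append((query, str(exc)))
    return failures


# Made-up schema and data, standing in for a loaded database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO singer VALUES (1, 'Ada')")

queries = [
    "SELECT name FROM singer",
    "SELECT age FROM singer",  # fails: the column does not exist
]
failures = validate_queries(conn, queries)
print(f"{len(queries) - len(failures)} passed, {len(failures)} failed")
for query, error in failures:
    print(f"FAILED: {query} ({error})")
```

A check like this catches schema mismatches and syntax errors, but a query that runs and returns the wrong rows would still pass.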
python scripts/scan_dataset.py
This script goes through the dataset and aggregates various details.
If you find this dataset useful, please consider citing:
@inproceedings{SpiderMan,
title = {SpiderMan: A Comprehensive Human-Annotated Dataset for SQL AI Tasks Across Diverse Domains and Complexity Levels},
author = {Sreenath Somarajapuram and Athira},
year = 2024
}
@inproceedings{Yu&al.18c,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
year = 2018
}
- Dataset license: CC BY-SA 4.0
- Scripts license: Apache 2.0