So far, this is a refactoring of a notebook from CurationCorp's amazing curation-corpus repository, adapted for training on GPU clusters to fine-tune BART for abstractive summarization of scientific literature. It is part of the CoronaWhy project.
Currently, the dataset is sourced as follows:
- Text-abstract pairs from Arxiv and the Semantic Scholar Corpus as provided by Santosh-Gupta's ScientificSummarizationDataSets repo
- Text-headline pairs from WikiHow, provided by mahnazkoupaee's WikiHow-Dataset repo
- Curation Corpus
To create a new dataset from scratch:
- Download the ArXiv and Semantic Scholar Corpus datasets from gdrive (as described here) and unzip them into `raw_data/ArxivStructuredAbstractSectionalSummaries` and `raw_data/SemanticScholarAbstractSectionSummaryDataSet`
- Download `wikihowAll.csv` (as described here) into `raw_data/wikihow`
- Scrape the Curation Corpus dataset as explained in the repo, then move `curation-corpus-base-with-articles.csv` to `raw_data/curation_corpus`
- Run `python src/data/create_dataset.py`. This will create a new folder called `data` with ~40 compressed parquet files (see the loading sketch after this list)
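
The shards can then be concatenated back into one DataFrame. A minimal sketch, assuming the files keep a plain `.parquet` extension (the actual file names depend on how `create_dataset.py` writes them):

```python
import glob

import pandas as pd

# Hypothetical glob pattern: adjust to however create_dataset.py names the shards.
shard_paths = sorted(glob.glob("data/*.parquet"))

# Read each shard and stack them into a single DataFrame.
df = pd.concat((pd.read_parquet(path) for path in shard_paths), ignore_index=True)

print(df.shape)
print(df.columns.tolist())
```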
The current dataset is stored in a single pandas dataframe with the following schema:
| Column name | Column Type | Description |
|---|---|---|
| text | str | Original text on which the summary is based |
| summary | str | Summary of the original text |
| data_src | str | Directory name of the original dataset in `raw_data` |
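
These columns map directly onto a sequence-to-sequence fine-tuning setup. Below is a rough sketch of how the text/summary pairs could be tokenized for BART with Hugging Face transformers; the checkpoint, shard path, and length limits are illustrative assumptions, not the repo's actual training configuration:

```python
import pandas as pd
from transformers import BartTokenizer

# Illustrative checkpoint; the project may use a different BART variant.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

# Hypothetical shard path: any of the parquet files produced above will do.
df = pd.read_parquet("data/part-0.parquet")
batch = df.head(8)

# Encoder inputs come from `text`, decoder labels from `summary`.
inputs = tokenizer(
    batch["text"].tolist(),
    max_length=1024,      # BART's maximum encoder length
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
labels = tokenizer(
    batch["summary"].tolist(),
    max_length=256,       # assumed summary length budget
    truncation=True,
    padding="longest",
    return_tensors="pt",
).input_ids
```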