So far, this is a refactoring of a notebook from CurationCorp's amazing curation-corpus repository, adapted for training on GPU clusters to fine-tune BART for abstractive summarization of scientific literature. It is part of the CoronaWhy project.
Currently, the dataset is sourced as follows:
- Text-abstract pairs from arXiv and the Semantic Scholar Corpus, as provided by Santosh-Gupta's ScientificSummarizationDataSets repo
- Text-headline pairs from WikiHow, provided by mahnazkoupaee's WikiHow-Dataset repo
- Article-summary pairs from the Curation Corpus itself
To create a new dataset from scratch:
- Download the arXiv and Semantic Scholar Corpus datasets from Google Drive (as described here) and unzip them into `raw_data/ArxivStructuredAbstractSectionalSummaries` and `raw_data/SemanticScholarAbstractSectionSummaryDataSet`
- Download `wikihowAll.csv` (as described here) into `raw_data/wikihow`
- Scrape the Curation Corpus dataset as explained in the repo, then move `curation-corpus-base-with-articles.csv` to `raw_data/curation_corpus`
- Run `python src/data/create_dataset.py`. This will create a new folder called `data` containing ~40 compressed parquet files, which can be loaded back as shown in the sketch after this list
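
The generated shards can be read back into a single dataframe with pandas. A minimal sketch, assuming the default `data/` output directory, a `.parquet` file extension on the shards, and pyarrow installed (all of which are assumptions, not guarantees of what `create_dataset.py` writes):

```python
from pathlib import Path

import pandas as pd

# Collect every parquet shard produced by create_dataset.py (extension is an assumption).
shard_paths = sorted(Path("data").glob("*.parquet"))

# Read each shard and concatenate them into one dataframe.
df = pd.concat((pd.read_parquet(path) for path in shard_paths), ignore_index=True)

print(f"Loaded {len(df)} text-summary pairs from {len(shard_paths)} shards")
```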
The current dataset is stored in a single pandas dataframe with the following schema:
| Column name | Column type | Description |
|---|---|---|
| `text` | `str` | Original text on which the summary is based |
| `summary` | `str` | Summary of the original text |
| `data_src` | `str` | Directory name of the original dataset in `raw_data` |
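
With the dataframe loaded, the schema can be sanity-checked and sliced by source. A short sketch continuing from the loading snippet above; the column order and the exact `"wikihow"` label are assumptions based on the table and the `raw_data` directory names, not verified output:

```python
# Confirm the schema matches the table above (column order is an assumption).
assert list(df.columns) == ["text", "summary", "data_src"]

# Count how many examples each raw_data source contributed.
print(df["data_src"].value_counts())

# Select only the WikiHow text-headline pairs, as an example.
wikihow = df[df["data_src"] == "wikihow"]
print(wikihow[["text", "summary"]].head())
```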