A list of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources. Available online at https://shubhanshu.com/awesome-scholarly-data-analysis/
- Awesome Scholarly Data Analysis
- Table of Contents
- Datasets
- Tools
- Publication Venues
- Summer Schools
- Associations & Community
- Contributions
- Arnet Miner
- Microsoft Academic Graph
- Open Academic Graph - MAG + AMiner
- OpenAIRE Research Graph - More info here
- Semantic Scholar Corpus
- CiteSeer
- PubMed
- CORA datasets for citation string parsing
- Humanities and multilingual citation string parsing: Flux-CiM and ICONIP (see the Neural ParsCit paper for details)
- Citation string parsing data for social sciences for English and German citations - comparison with Grobid and Cermine
- CrossRef DOI URLs
- DOIboost (CrossRef + MAG + ORCID + Unpaywall)
- DBLP Citation dataset
- DBLP XML data
- NBER Patent Citations
- Scopus Citation Database
- Papers, patents, and grants from Indiana University
- Small Network Data - Mark Newman's Lab
- The Koblenz Network Collection
- Google Scholar citation relations
- Google Scholar Citations data set direct-download
- Open citations project
- Wikicite Project
- Economic Papers
- ArXiv data dump
- ArXiv data on Kaggle
- EuropePMC
- Complete ACL anthology as bibtex file
- ACL Anthology Reference Corpus
- Astrophysics data system (ADS) - All physics papers
- CORE 37M full text open access papers
- Inspire database for high energy physics articles
- Scholarly Data of workshops and conferences in RDF triplets
- The Collection of Computer Science Bibliographies
- OpenCitations corpus
- COCI Doi-Doi citation data
- DOAJ API (Directory of Open Access Journals)
- ROAD (Directory of Open Access Scholarly Resources)
- Sherpa/Romeo (Publisher copyright policies & self-archiving)
- OpenAPC (fees paid for open access journal articles)
- OSF API (Open Science Framework)
- Digital tools for researchers
- Fatcat - versioned, publicly-editable catalog of research publications
- Microsoft Academic Knowledge Graph - RDF dump
- arXiv CS citation in context
- arXiv fulltext + citations dataset
- Self-citation analysis data based on PubMed Central subset (2002-2005)
- Unpaywalled Corpus - PDF to 23M DOIs Data Schema
- A dataset of publication records for Nobel laureates - paper
- OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects - About the data
- Manually annotated citation data from the ACL Anthology into uses, motivation, future, extends, compare or contrast, and background
- iCite - NIH Open Citation Collection
- MEDLINE/PubMed Baseline Repository (MBR) - All MEDLINE abstracts and paper metadata in XML
- American Physical Society Data Sets for Research
- Co-citation networks of all Nature papers
- Semantic Scholar Graph of References in Context (GORC) dataset
- Multiple journal publication datasets
- Structured citations in the English Wikipedia
- ICSR Lab (free for researchers) for Scopus and PlumX use
- COVID-19 Open Research Dataset (CORD-19)
- PaperRobot - includes PubMed Paper Reading Dataset
- SciMag - Microsoft Academic Linked to SciMago Journals - WebPage
- SciGraph Springer Nature
- Citations to scholarly data in Wikipedias of various languages - Code
- 800K publications matched from CrossRef, CORE, and Mendeley with data on publication and open access dates
- Coronavirus Open Citations Dataset
- Crossref dumps DOI meta-data
- S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
- Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia
- Microsoft Academic Data for conducting covid-19 research
- Initiative for Open Abstracts
- Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers
- Mathematics Genealogy Project
- Academic Tree - Cross discipline academic genealogies
- MPACT project - Library Sciences
- PhDTree
- Chemistry Genealogy - curated at UIUC
- Notre Dame Genealogy Project
- UIUC Chemistry, Chemical Engineering, and Biochemistry
- Software Engineering Academic Genealogy
- Other lists of genealogy projects
- Wikipedia - Computer Science Genealogy
- Wikipedia - Theoretical Physicists Genealogy
- Wikipedia - Chemists Genealogy
- SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
- Economic Genealogy Text Format
- Temporal profiles of PubMed authors
- ORCID data dump
- National Library of Medicine Profiles
- UIUC Professors database - Publications, Affiliations
- Author Profiles of scholarly authors in Wikipedia
- Career Transitions of CS students
- Author name gender and ethnicity dataset based on PubMed
- MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide
- Conceptual novelty scores for PubMed articles
- Database of 100,000 top scientists providing standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator
- Canadian PhD career survey - Science report
- Data from the CVs of over 150 assistant professors in psychology in top-ranked research universities and small liberal art colleges in the US - Used in this blog
- Wikidata Author Disambiguation Dataset
- The 4 Universities Data Set - Web pages of CS departments classified for author role (faculty, student, etc.)
- Journal editors dataset
- INSPIRE dataset
- Lee Giles dataset
- Cleaner version of Lee Giles dataset
- DBLP Korean Authors
- Arnet Miner
- Arnet Miner - Manual Name Disambiguation data 210 authors
- DBLP Name disambiguation dataset - Error corrected version
- rexa-coref-data
- Deduped author names on IEEE Vis papers 1990-2018
- Author-ity dataset for PubMed 2009
- ACL Anthology dataset
- Base data for estimating precision and recall of Author-ity among NIH-funded scientists
- ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale
- Open Access Theses and Dissertations
- The Networked Digital Library of Theses and Dissertations (NDLTD)
- PhD Dissertations in the Area of Software Engineering
- ProQuest Dissertations & Theses Global
- Citation Parsing
- Citation Parsing in humanities
- Sentences tagged for Drug Disease pairs
- Document Summarization and citation span identification
- ACL Anthology human summaries for 1000 papers
- Keyphrase Extraction
- Related Work Summarization
- Biomedical NLP annotated datasets
- Chemical compound and drug name recognition task
- Semantic Scholar Dataset
- ScienceIE
- ACL RD TEC 2.0 also at @CLARIN
- SEPID Corpus - Segmented ACL ARC 1.0
- PubMed Central Open Access - BioC
- PubMed Fulltext - protein-protein and genetic interactions
- BioNLP - Argo
- Biomedical NLP - Stav
- GENIA - BioNLP 2011
- Genia Treebank used for SciSpacy training - SciSpacy link
- Full GENIA corpus
- Anatomical Entity Mention (AnEM) corpus
- CellFinder - Entity detection
- Multi-Level Event Extraction (MLEE)
- Biomedical sentence simplification
- PubMed - Colorado Richly Annotated Full-Text
- Biomedical NER datasets related publication
- BioVerbNet
- Lunar and Planetary Science abstracts for NER and Relations
- ACM data affiliations
- ACM - DBLP database entry matching
- Colorado Richly Annotated Full-Text - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
- CLEF datasets for multilingual Biomedical NLP+IE
- MedMentions - UMLS entities in PubMed
- Coleridge Initiative - Rich Context Competition
- SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts
- PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions
- NER, Parsing, Classification datasets from SciBert
- ACA Wiki - Paper summaries of more than 1600 papers
- SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers
- A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web
- Medical Information Extraction from PubMed abstracts
- Corpus of 40 scientific papers manually annotated by multiple scientific discourse facets
- PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track - Train - Dev - Test - Background Test set
- Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction
- Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus
- The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation
- TalkSumm - Summary of papers via alignment to talks
- SeminalSurveyDBLP - Classification of seminal or survey papers
- A Dataset of Peer Reviews (PeerRead)
- CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations
- Supp.ai - PubMed supplement-drug interactions and supplement-supplement interactions
- GENETAG - More recent versions Publication and Download 2005
- MedTag: A Collection of Biomedical Annotations - Download
- Open Biomedical corpora
- Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext - Also see other NLM curated biomedical resources
- SciDTB: Discourse Dependency TreeBank for Scientific Abstracts
- SciDTB corpus annotated for argumentation mining - Paper
- Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets
- ART corpus - 225 papers manually annotated with the CISP labels (i.e. "Goal", "Method", "Result") - Browse files - Project details
- Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence. - Project details
- PubMedQA - Question answering on PubMed
- Corposaurus - Collection of biomedical corpora for NER
- BioNER corpus
- NeuroQuery - 14,000 full-text publications and 400,000 peak activations - NeuroQuery website
- Medical Information Extraction dataset
- A Large Parallel Corpus of Full-Text Scientific Articles
- Annotated Corpus of Scientific Conference's Homepages for Information Extraction
- Chi QA - Health Question Answering dataset from NLM
- Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine - Includes wikification data
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- Open Research Knowledge Graph project - Website
- Academic PhraseBank
- SciKG - Statement extraction datasets
- A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
- A manual corpus of annotated main findings of clinical case reports
- TREC Precision Medicine / Clinical Decision Support Track
- Lots of biomedical entity linking and entity identification datasets
- Materials Science Named Entity Recognition: train/development/test sets
- Entities in 3.27 million materials science abstracts
- Normalized entities in material science papers
- Named Entity Recognition for Bacterial Type IV Secretion Systems - Paper
- Annotating and detecting phenotypic information for chronic obstructive pulmonary disease
- MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction
- Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics
- Multiple PUBMED annotated corpora from iProLink project
- Mars Target Encyclopedia - LPSC abstracts labeled data set
- Annotation of phenotypes using ontologies
- The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text - SPECIES Direct Download - ORGANISMS Direct Download
- Entity mention in articles used for benchmark
- RAMBO 800+: A Corpus for the Development of Gene/Protein Recognition from Rare and Ambiguous Abbreviations
- Medical Relation Extraction - CrowdTruth
- KP20k - Keyphrase extraction on 20k abstracts
- Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition
- Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction
- Question Answering: (5.23 MB), 3 datasets on biomedical question answering task
- SciREX : A Challenge Dataset for Document-Level Information Extraction
- Papers with Code - Links between papers and repositories and extraction of SOTA results
- Citation Context Classification based on purpose
- Citation Context Classification based on influence
- PubMed knowledge graph (PKG) Figshare
- Citation and Header Datasets
- Grobid-NER data
- Multiple NER and Entity Linking data for science
- Citation Context Classification
- S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
- EuropePMC annotations for entities and relationships
- PeerRead - paper drafts, reviews, and accept/reject decision
- SciGraph Springer Nature
- Medical Subject Headings maintained by the National Library of Medicine of the United States
- Computer Science Ontology maintained by Scholarly Knowledge: Modeling, Mining and Sense Making
- Physics Subject Headings (PhySH) maintained by American Physical Society (APS) GitHub
- Open Biological and Biomedical Ontology (OBO) maintained by the OBO Foundry
- ACM Computing Classification System maintained by the Association for Computing Machinery
- Physics and Astronomy Classification Scheme (PACS) maintained by the American Institute of Physics (AIP); discontinued in 2010 and replaced by Physics Subject Headings
- Mathematics Subject Classification (MSC) maintained by Mathematical Reviews and zbMATH
- Journal of Economic Literature (JEL) maintained by the American Economic Association
- STW Thesaurus for Economics maintained by ZBW - Leibniz Information Centre for Economics
- Australian and New Zealand Standard Research Classification (ANZSRC) maintained by the Australian Bureau of Statistics; it consists of 3 sub-classification schemes:
  - Fields of Research (FoR) classification
  - Research Fields, Courses and Disciplines (RFCD) classification
  - Socio-Economic Objective (SEO) classification
- Library of Congress Classification (LCC) maintained by Library of Congress
- Fields of Study (FoS) maintained by Microsoft Academic
- CrossRef Open Funder's Registry
- Altmetrics API
- Dimensions.ai API - documentation, example
- Core Conference Rankings
- China Computer Federation Conference Rankings
- Google Scholar
- Semantic Scholar
- Microsoft Academic Graph
- OpenAIRE Explore
- AceMap
- GitXiv
- ACL Anthology
- NIPS papers
- Abel tools for PubMed data
- infolis: linking research data and publications
- Metrics toolkit
- Rcrossref (R library)
- Rscopus (R library)
- Scholar (R library)
- Bibliometrix (R library)
- CITAN (R library)
- BibeR (BibeR: A Web-based tool for bibliometric analysis in scientific literature)
- scihub.py (Python library)
- SoPaper (Python library)
- CiteSeer tools
- Novelty quantification in PubMed articles
- TidyPMC - R based PMC XML parser
- PublicationHarvester - Download PubMed publications of an author
- Publish or Perish - retrieves and analyzes academic citations from Microsoft Academic and Google Scholar
- Affiliation string parser
- CiteSeerX
- ContentMine - getpapers
- rcoreoa - CORE API R client
- metaknowledge - A Python library for doing bibliometric and network analysis in science and health policy research
- PubMedPortable - PubMed to Postgres
- medic - Parsing MEDLINE and storing into a DB
- Biomedical - BioSentVec Embeddings
- Biomedical embeddings - CambridgeLTL
- NIH scientific paper pre-processing
- SciSpacy - spaCy models for Biomedical NLP from AllenAI
- Multitask Biomedical NER
- SciBERT - BERT LM for Biomedical and CS papers
- CERMINE
- Grobid
- EXCITE (Extraction of Citations from PDF Documents)
- Science-Parse
- unarXiv (Citation in context from arXiv)
- Biblio-Glutton
- PDF/LaTeX to JSON
- CrossRef Reference Matching code and evaluation data
- Citation style classifier and evaluation data
- Frontiers in Research Metrics and Analytics
- Scientometrics
- Journal of Informetrics
- Quantitative Science Studies (Open Access)
- Science, Technology, & Human Values
- Social Studies of Science
- Science and Public Policy
- Joint Conference on Digital Libraries (JCDL)
- International Conference on Theory and Practice of Digital Libraries (TPDL)
- European Semantic Web Conference (ESWC), Research of Research Track
- STI Conference series (Science and Technology indicators, e.g., 2018)
- ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)
- SIGMET - Metrics workshop
- International Workshop on Mining Scientific Publications
- Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)
- Workshop on Reframing Research (RefResh)
- Enabling Open Semantic Science (SemSci)
- Workshop on Scholarly Document Processing
- International Society for Informetrics and Scientometrics (ISSI)
- European Network of Indicator Designers (ENID)
- 4S (Society for Social Studies of Science)
- SIG/MET - Special Interest Group for the measurement of information production and use
The following people have contributed to the items on this list.
- Shubhanshu Mishra - Maintainer of the list.
- Angelo Antonio Salatino
- Philipp Zumstein
- Ali (Aliakbar Akbaritabar)
- Andrea Mannocci