This repository includes all necessary code and data to reproduce the experiments detailed in the paper The Nature of NLP: Analyzing Contributions in NLP Papers. We release the code under an Apache 2.0 license and the dataset under a CC-BY-SA-4.0 license.
This repository contains the code for fine-tuning pre-trained models to detect contribution statements in NLP research papers and classify them by type (for details on the types, please consult our paper). The trained models can be applied to any NLP research paper to identify its contributions.
Contact person: Aniket Pramanick
The entire pre-processed ACL Events dataset from the ACL Anthology will soon be available on TUdatalib. Additional details about the data are available here.
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
Natural Language Processing (NLP) is a dynamic, interdisciplinary field that integrates intellectual traditions from computer science, linguistics, social science, and more. Despite its established presence, the definition of what constitutes NLP research remains debated. In this work, we quantitatively investigate what constitutes NLP by examining research papers. For this purpose, we propose a taxonomy and introduce NLPContributions, a dataset of nearly $2k$ research paper abstracts, expertly annotated to identify scientific contributions and classify their types according to this taxonomy. We also propose a novel task to automatically identify these elements, for which we train a strong baseline on our dataset. We present experimental results from this task and apply our model to $\sim29k$ NLP research papers to analyze their contributions, aiding in the understanding of the nature of NLP research. Our findings reveal a rising involvement of machine learning in NLP since the early nineties, alongside a declining focus on adding knowledge about language or people; again, in post-2020, there has been a resurgence of focus on language and people. We hope this work will spark discussions on our community norms and inspire efforts to consciously shape the future.
Follow the instructions below to create the Python environment for the experiments.
$ conda create -n nlpcontributions pip python=3.9
$ conda activate nlpcontributions
$ pip install -r requirements.txt
To use the dataset, download the data from the link above and place it inside the data folder.
To train the models, you will need to split the data into train-val-test splits. Use the following script to preprocess the data.
python code/finetune_data_preparation.py
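As an illustration of what a train-val-test split involves, here is a minimal sketch using only the Python standard library. It is not the logic of finetune_data_preparation.py: the function name split_dataset, the 80/10/10 ratios, and the seed are all assumptions chosen for the example.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle examples deterministically and return (train, val, test) lists.

    Hypothetical helper: the fractions and seed are illustrative defaults,
    not the values used in the paper.
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("val_fraction and test_fraction must be >= 0 and sum to < 1")

    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
```

Fixing the seed means repeated runs produce the same partition, which keeps evaluation numbers comparable across fine-tuning runs.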
To fine-tune the models and evaluate their performance, use the following script.
python code/limit_classifier.py --model_name_or_path {local or huggingface model path}
We use the following models: BERT, BiomedBERT, SciBERT, RoBERTa, and Flan-T5.
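To run the fine-tuning script over several models in turn, a small driver along the following lines may help. This is a sketch under assumptions: the Hugging Face Hub identifiers below are common public checkpoints for these model families and are not confirmed to be the ones used in the paper, BiomedBERT is left out because its exact checkpoint is not stated here, and build_finetune_command is a hypothetical helper. Only the script path and the --model_name_or_path flag come from this README.

```python
import shlex
import sys

# Assumed Hub identifiers; the checkpoints used in the paper may differ.
CANDIDATE_MODELS = {
    "BERT": "bert-base-uncased",
    "SciBERT": "allenai/scibert_scivocab_uncased",
    "RoBERTa": "roberta-base",
    "Flan-T5": "google/flan-t5-base",
}


def build_finetune_command(model_path, script="code/limit_classifier.py"):
    """Return the argument list for one fine-tuning run (hypothetical helper)."""
    return [sys.executable, script, "--model_name_or_path", model_path]


if __name__ == "__main__":
    for name, path in CANDIDATE_MODELS.items():
        # Print the commands instead of executing them, so they can be
        # reviewed first; pass the list to subprocess.run to launch a run.
        print(f"{name}: {shlex.join(build_finetune_command(path))}")
```

A local directory containing a saved model can be passed in place of a Hub identifier, as the placeholder in the command above indicates.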
To run the trained model on the test split for inference, use the following script.
python code/inference_merged_labels.py
We used Tableau for Students to analyze the data and create all the plots; any other visualization software would work equally well.
Please use the following citation:
@article{pramanick2024nlpcontributions,
title={The Nature of NLP: Analyzing Contributions in NLP Papers},
author={Pramanick, Aniket and Hou, Yufang and Mohammad, Saif and Gurevych, Iryna},
journal={arXiv preprint arXiv:2409.19505},
year={2024},
url={https://arxiv.org/abs/2409.19505}
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.