Relation extraction (RE) aims to identify the semantic relations between named entities in text. Recent years have witnessed it raised to the document level, which requires complex reasoning with entities and mentions throughout an entire document. In this paper, we propose a novel model to document-level RE, by encoding the document information in terms of entity global and local representations as well as context relation representations. Entity global representations model the semantic information of all entities in the document, entity local representations aggregate the contextual information of multiple mentions of specific entities, and context relation representations encode the topic information of other relations. Experimental results demonstrate that our model achieves superior performance on two public datasets for document-level RE. It is particularly effective in extracting relations between entities of long distance and having multiple mentions.
GLRE/
├─ configs/
├── cdr_basebert.yaml: config file for CDR dataset under "Train" setting
├── cdr_basebert_train+dev.yaml: config file for CDR dataset under "Train+Dev" setting
├── docred_basebert.yaml: config file for DocRED dataset under "Train" setting
├─ data/: raw data and preprocessed data about CDR and DocRED dataset
├── CDR/
├── DocRED/
├─ data_processing/: data preprocessing scripts
├─ results/: pre-trained models and results
├─ scripts/: running scripts
├─ src/
├── data/: read data and convert to batch
├── models/: core module to implement GLRE
├── nnet/: sub-layers to implement GLRE
├── utils/: utility function
├── main.py
- python (>=3.6)
- pytorch (>=1.5)
- numpy (>=1.13.3)
- recordtype (>=1.3)
- yamlordereddictloader (>=0.4.0)
- tabulate (>=0.8.7)
- transformers (>=2.8.0)
- scipy (>=1.4.1)
- scikit-learn (>=0.22.1)
The datasets include CDR and DocRED. The data are located in data/CDR
directory and data/DocRED
directory, respectively.
The pre-processing scripts are located in the data_processing
directory, and the pre-processing results are located in the data/CDR/processed
directory and data/DocRED/processed
directory, respectively.
The pre-trained models are in the results
directory.
Specifically, we pre-processed the CDR dataset following edge-oriented graph:
Download the GENIA Tagger and Sentence Splitter:
$ cd data_processing
$ mkdir common && cd common
$ wget http://www.nactem.ac.uk/y-matsu/geniass/geniass-1.00.tar.gz && tar xvzf geniass-1.00.tar.gz
$ cd geniass/ && make && cd ..
$ git clone https://github.com/bornabesic/genia-tagger-py.git
$ cd genia-tagger-py
Here, you should modify the Makefile inside genia-tagger-py and replace line 3 with `wget http://www.nactem.ac.uk/GENIA/tagger/geniatagger-3.0.2.tar.gz`
$ make
$ cd ../../
In order to process the datasets, they should first be transformed into the PubTator format. The run the processing scripts as follows:
$ sh process_cdr.sh
Then, please use the following code to preprocess the DocRED dataset:
python docRedProcess.py --input_file ../data/DocRED/train_annotated.json \
--output_file ../data/DocRED/processed/train_annotated.data \
First, you should download biobert_base and bert_base from figshare and place them in the GLRE directory.
The default hyper-parameters are in the configs
directory and the train&test scripts are in the scripts
directory.
Besides, the run_cdr_train+dev.py
script corresponds to the CDR under traing + dev
setting.
python scripts/run_cdr.py
python scripts/run_cdr_train+dev.py
python scripts/run_docred.py
For CDR, you can evaluate the results using the evaluation script as follows:
python utils/evaluate_cdr.py --gold ../data/CDR/processed/test.gold --pred ../results/cdr-dev/cdr_basebert_full/test.preds --label 1:CID:2
For DocRED, you can submit the result.json
to Codalab.
This project is licensed under the GPL License - see the LICENSE file for details.
If you use this work or code, please kindly cite the following paper:
@inproceedings{GLRE,
author = {Difeng Wang and Wei Hu and Ermei Cao and Weijian Sun},
title = {Global-to-Local Neural Networks for Document-Level Relation Extraction},
booktitle = {EMNLP},
year = {2020},
}
If you have any questions, please feel free to contact Difeng Wang, we will reply it as soon as possible.