This repository contains the source code of CorPipe 23, which is available under the MPL-2.0 license. The architecture of CorPipe 23 is described in the following paper:
Milan Straka
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské nám. 25, Prague, Czech Republic
Abstract: We present CorPipe, the winning entry to the CRAC 2023 Shared Task
on Multilingual Coreference Resolution. Our system is an improved version of our
earlier multilingual coreference pipeline, and it surpasses other participants
by a large margin of 4.5 percentage points. CorPipe first performs mention
detection, followed by coreference linking via an antecedent-maximization
approach on the retrieved spans. Both tasks are trained jointly on all available
corpora using a shared pretrained language model. Our main improvements comprise
inputs larger than 512 subwords and changing the mention decoding to support
ensembling.
- The directory `data` is for the CorefUD 1.1 data, and the preprocessed and tokenized version needed for training.
  - The script `data/get.sh` downloads and extracts the CorefUD 1.1 training and development data, plus the unannotated test data of the CRAC 2023 shared task (see the setup sketch after this list).
- The `corpipe23.py` is the complete CorPipe 23 source file.
- The `corefud-score.sh` is an evaluation script used by `corpipe23.py`, which
  - performs evaluation (using the official evaluation script from the `corefud-scorer` submodule),
  - optionally (when `-v` is passed), it also:
    - runs validation (using the official UD validator from the `validator` submodule) on the output data,
    - performs evaluation with singletons,
    - performs evaluation with exact match.
- The `res.py` is our script for visualizing performance of running and finished experiments, and for comparing two experiments. It was developed for our needs, and we provide it as-is without documentation.
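For example, a typical setup might look as follows. This is a sketch only: the clone URL and directory name are placeholders, `get.sh` is assumed to be runnable from the repository root, and the gold-then-system argument order of `corefud-score.sh` is an assumption.

```sh
# Clone together with the corefud-scorer and validator submodules mentioned above.
git clone --recurse-submodules <repository-url>
cd <repository-directory>

# Download and extract the CorefUD 1.1 data and the CRAC 2023 test data.
sh data/get.sh

# Evaluate a prediction; -v additionally runs validation and scores with
# singletons and exact match. Gold-then-system argument order is an assumption.
sh corefud-score.sh -v gold.conllu system.conllu
```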
The `corpipe23-corefud1.1-231206` is an `mT5-large`-based multilingual model.
It is released at https://hdl.handle.net/11234/1-5369 under the CC BY-NC-SA 4.0 license.

The model is language agnostic (there is no corpus id on the input), so it can be used to
predict coreference in any `mT5` language (for zero-shot evaluation, see the
paper). However, note that the empty nodes must already be present in the input;
they are not predicted (the same setting as in the CRAC 2023 shared task).

See the `corpipe23-corefud1.1-231206` directory for more information.
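As an illustration, prediction with the released model might look like the following; the checkpoint path is hypothetical, and the arguments are explained in the prediction section below.

```sh
# Assumes the model archive from the handle above has been downloaded and
# extracted into corpipe23-corefud1.1-231206/ (hypothetical layout).
corpipe23.py --load corpipe23-corefud1.1-231206/model_checkpoint --exp predictions --epoch 0 --test input.conllu
```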
To train a single multilingual model on all the data using mT5-large, you should

- run the `data/get.sh` script to download the CorefUD 1.1 data,
- create a Python environment with the packages listed in `requirements.txt` (a possible setup is sketched after this list),
- train the model itself using the `corpipe23.py` script.
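One way to create the environment (a minimal sketch using the standard `venv` module; any virtual-environment tool works equally well):

```sh
python3 -m venv venv
venv/bin/pip install -r requirements.txt
```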
tb="ca_ancora cs_pcedt cs_pdt de_parcorfull de_potsdamcc en_gum en_parcorfull es_ancora fr_democrat hu_korkor hu_szegedkoref lt_lcc no_bokmaalnarc no_nynorsknarc pl_pcc ru_rucor tr_itcc" ratios_sqrt="8.4 14.0 11.7 1.4 2.4 5.6 1.4 8.8 6.9 2.0 4.6 2.5 6.5 6.0 9.5 5.1 3.1" corpipe23.py --train --dev --treebanks $(for c in $tb; do echo data/$c/$c-corefud-train.conllu; done) --resample 8000 $ratios_sqrt --epochs=15 --batch_size=8 --adafactor --learning_rate=6e-4 --learning_rate_decay --encoder=google/mt5-large --segment=512 --right=50 --label_smoothing=0.2 --exp=mt5-large
To predict with a trained model, use the following arguments:

```sh
corpipe23.py --load model_checkpoint_path --exp target_directory --epoch 0 --test input1.conllu input2.conllu
```

- the directory with the model checkpoint must also contain the `options.json` and `tags.txt` files;
- the outputs are generated in the target directory, with a `.00.conllu` suffix;
- if you want to also evaluate the predicted files, you can use the `--dev` option instead of `--test`;
- optionally, you can pass `--segment 2560` to specify a longer context size, which very likely produces better results, but needs more GPU memory (see the example after this list).
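For example, the following hypothetical invocation (the checkpoint path and the choice of treebank are placeholders) predicts and evaluates a development file with the longer context:

```sh
corpipe23.py --load model_checkpoint_path --exp target_directory --epoch 0 \
  --dev data/cs_pdt/cs_pdt-corefud-dev.conllu --segment 2560
```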
```
@inproceedings{straka-2023-ufal,
    title = "{{\'U}FAL} {C}or{P}ipe at {CRAC} 2023: Larger Context Improves Multilingual Coreference Resolution",
    author = "Straka, Milan",
    editor = "{\v{Z}}abokrtsk{\'y}, Zden{\v{e}}k and Ogrodniczuk, Maciej",
    booktitle = "Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.crac-sharedtask.4",
    doi = "10.18653/v1/2023.crac-sharedtask.4",
    pages = "41--51",
}
```