
Publication parser

This repository contains code for generating an HTML publication list based on unique identifiers. It can be used with loosely structured input.

Running

Installation

Create a virtual environment using your preferred method, and install required packages.
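For example, using the built-in venv module:

python -m venv .venv
source .venv/bin/activate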

pip install -r requirements.txt

Configuration

Some APIs suggest including an e-mail address in queries (in headers or parameters), or require it to join a polite pool. The e-mail address is read from a configuration file. Create userconfig.toml with the following content and save it in the code directory:

[user]
  email = "[email protected]"

SFB project members are highlighted in the generated list. Matching is currently performed on a last-name basis. Create sfb-authors.txt with one name per line (as below), and save it in the code directory:

Doe
Mustermann

A script for generating the SFB member list by scraping the SFB website is provided in scrape_names.py, but some names need to be edited or added manually afterwards.
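A minimal sketch of what this last-name matching could look like (illustrative names, not the script's actual code):

from pathlib import Path

# one last name per line, as in the example above
sfb_names = {line.strip() for line in Path("sfb-authors.txt").read_text().splitlines() if line.strip()}

def is_sfb_member(family_name):
    return family_name in sfb_names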

Usage

Run the parsepapers.py script with the input file (publication info) as the only argument.

python parsepapers.py publications.txt

Note: some shortcuts were taken, and the script currently relies on the working directory being the code directory. The code will generate the following files in the same directory:

  • HTML output will be written to publications.html.
  • Structured publication data will be written to tmpdata.json.
  • query_cache.sqlite and pubmed_cache.sqlite will be created for caching GET requests.

Input data

Format specification

The input file is a text file composed of blocks (paragraphs) separated by empty lines.

  • The file is divided into sections, corresponding to projects. Sections are defined by a single line, starting with * and containing a project name.
  • A publication is described by a block of up to 3 lines:
    • The first line is a free-form citation (ideally, but not necessarily, including an identifier), or the identifier alone.
    • Lines 2 and 3 are an identifier and a comment, both optional and in any order:
      • if a line starts with “http(s)”, it will be treated as a URL (if no identifier is found in the first line, the URL will be used to find one by pattern matching)
      • otherwise it will be treated as a comment, to be displayed in the generated page (after the generated citation)
      • there can be at most one URL and one comment

Note that it is sufficient to include only the identifier (i.e. less is more); see below for identifier formats. Anything else (URL, citation text) will be used only to find the identifier (by pattern matching, or by searching the Crossref database).

Such a file can be generated with minimal hand-editing from a copy-paste of an existing Word document, but it should be just as easy to write “by hand” from scratch.
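For illustration, a minimal sketch of splitting such a file into project sections and publication blocks (the actual parsing in parsepapers.py may differ):

import re

def read_blocks(path):
    text = open(path, encoding="utf-8").read()
    project = None
    for block in re.split(r"\n\s*\n", text):        # paragraphs separated by empty lines
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if not lines:
            continue
        if lines[0].startswith("*"):                # section line: new project
            project = lines[0].lstrip("* ")
        else:
            yield project, lines                    # publication block of 1-3 lines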

Example

The example below describes 5 publications from 2 projects. Each of them contains an identifier that would be picked up.

* Project XYZ

John Doe, Max Mustermann, Some Paper Title, Some Journal (2023)
in collaboration with project QWE
https://pubmed.ncbi.nlm.nih.gov/123456

Mustermann M., Doe J., Some Other Paper Title, Some Other Journal (2023) doi: 10.nnnnnn/example1

PMID: 123123
in collaboration with ZYX

* Project ABC

https://doi.org/10.nnnnnn/example2

PMCID: 654321

Recognized identifiers

The script recognizes the following unique article identifiers (when included on the first line), as long as they are given in the following formats:

  • PubMed ID
    • PMID: 12345678
  • PubMed Central ID
    • PMCID: 12345678
  • DOI
    • doi: 10.nnnnnn/example
    • https://doi.org/10.nnnnnn/example
    • https://dx.doi.org/10.nnnnnn/example

Additionally, the following URL patterns (on the 2nd or 3rd line) will be recognized:

  • pubmed.ncbi.nlm.nih.gov/12345678 (PubMed ID)
  • ncbi.nlm.nih.gov/pmc/articles/PMC12345678 (PMC ID)
  • several content URL patterns from known publishers (DOI)
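For illustration, regular expressions along these lines could extract the identifiers (these are assumptions, not the script's actual patterns):

import re

PMID_RE  = re.compile(r"PMID:\s*(\d+)")
PMCID_RE = re.compile(r"PMCID:\s*(?:PMC)?(\d+)")
# loosely based on Crossref's recommended DOI regex
DOI_RE   = re.compile(r"\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)")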

Design principles & implementation details

Identifiers are key

All formatted citations are generated from structured metadata obtained through API queries. The provided input (citation text, content URL) is parsed with regular expressions to extract identifiers. Only if no identifier can be found is the citation text used to perform a bibliographic query against Crossref. This means that the citation text and content URL are never used directly.

The regex patterns used for matching content URLs (which often embed a DOI in one of their parts) are not perfect, but should sufficiently cover the link formats used by several large publishers.

The input format and processing approach (pattern matching combined with a bibliographic query fallback) were designed to allow working with historic SFB data (largely unstructured citations accompanied by various kinds of content URLs) while allowing new citations to be added easily with identifiers only.
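A minimal sketch of the Crossref fallback, using the public works endpoint and the mailto parameter from userconfig.toml (error handling omitted):

import requests

def crossref_fallback(citation_text, email):
    response = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation_text, "rows": 5, "mailto": email},
    )
    response.raise_for_status()
    return response.json()["message"]["items"]  # candidates, ordered by relevance score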

Structured metadata

Citation Style Language (CSL) is used as the internal representation of structured metadata. CSL JSON is the preferred API response format.
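For reference, a trimmed CSL JSON item for one of the example publications would look roughly like this (shown as a Python literal):

item = {
    "type": "article-journal",
    "title": "Some Other Paper Title",
    "author": [{"family": "Mustermann", "given": "M."},
               {"family": "Doe", "given": "J."}],
    "container-title": "Some Other Journal",
    "issued": {"date-parts": [[2023]]},
    "DOI": "10.nnnnnn/example1",
}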

APIs

If several identifiers can be found for a citation item, only one will be used (with the matching API), with the following priority: PMID, PMCID, DOI.

The script uses one API per identifier type, plus a Crossref bibliographic query as a fallback for citations without an identifier.

A bibliographic query requires additional heuristics to disambiguate cases where e.g. a preprint and the published article score similarly for a given citation. Its result is not available in CSL directly, so the DOI included in the result is used for a follow-up query.

API requests are implemented as basic GET requests, following the documentation and examples from the respective services. The requests_cache and requests_ratelimiter libraries are used to cache and throttle requests, respectively.
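The two libraries compose via mixins; a minimal sketch following the requests_ratelimiter documentation (the rate limit shown is an assumption, not the script's actual setting):

from requests import Session
from requests_cache import CacheMixin
from requests_ratelimiter import LimiterMixin

class CachedLimiterSession(CacheMixin, LimiterMixin, Session):
    """Session with cached, rate-limited GET requests."""

# e.g. at most 5 requests per second, cached in query_cache.sqlite
session = CachedLimiterSession(per_second=5, cache_name="query_cache")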

Results formatting

A Jinja template is used to generate the formatted HTML page.
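A minimal sketch of this rendering step, assuming the structured data from tmpdata.json is passed to the template (the template filename and context variable are illustrative):

import json
from jinja2 import Environment, FileSystemLoader

data = json.load(open("tmpdata.json", encoding="utf-8"))
env = Environment(loader=FileSystemLoader("."))
template = env.get_template("template.html")  # hypothetical template name

with open("publications.html", "w", encoding="utf-8") as f:
    f.write(template.render(publications=data))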