This repository contains code for generating an html publication list based on unique identifiers. It can be used with a loosely defined input.
Create a virtual environment using your preferred method, and install required packages.
pip install -r requirements.txt
Some APIs suggest including e-mail address in queries (headers or params), or require it to get into a polite pool.
The e-mail will be read from a configuration file.
Create userconfig.toml
with the following content and save it in the code directory:
[user]
email = "[email protected]"
SFB project members are highlighted in the generated list.
Matching is currently perform on a last name basis.
Create sfb-authors.txt
with one name per line (like below), and save it in the code directory:
Doe Mustermann
A script for generating the SFB member list by reading the SFB website is in scrape_names.py
,
but some names need to be edited or added manually afterwards.
Run the parsepapers.py
script with the input file (publication info) as the only argument.
python parsepapers.py publications.txt
Note: some shortcuts were made, and currently we rely on working directory being the code directory. The code will generate the following files in the same directory:
- Html output will be written to
publications.html
. - structured publication data will be written to
tmpdata.json
query_cache.sqlite
andpubmed_cache.sqlite
will be created for caching get requests
The input file is a text file composed of blocks (paragraphs) separated by empty lines.
- The file is divided into sections, corresponding to projects.
Sections are defined by a single line, starting with
*
and containing a project name. - A publication is described by a block of up to 3 lines:
- First line is a free-form citation (ideally, but not necessarily including an identifier), or the identifier alone.
- Lines 2 and 3 are an identifier and comment, both optional, and in a freely chosen order:
- if a line starts with “http(s)”, it will be treated as URL (if an identifier is not found in the first line, the URL will be used to find an identifier by pattern matching)
- otherwise it will be treated as a comment, to be displayed in the generated page (after generated citation)
- there can be at most one url and one comment
Note that it is sufficient to only include the identifier (i.e. less is more) - see below for identifier formats. Anything else (URL, citation text) will be used only to find the identifier (by pattern matching, or by searching crossref database).
Such file can be generated with minimal hand-editing from a copy-paste of an existing word document, but it should be just as easy to edit “by hand” from scratch.
Example below describes 5 different publications from 2 projects. Each of them contains an identifier that would be picked up.
* Project XYZ John Doe, Max Mustermann, Some Paper Title, Some Journal (2023) in collaboration with project QWE https://pubmed.ncbi.nlm.nih.gov/123456 Mustermann M., Doe J., Some Other Paper Title, Some Other Journal (2023) doi: 10.nnnnnn/example1 PMID: 123123 in collaboration with ZYX * Project ABC https://doi.org/10.nnnnnn/example2 PMCID: 654321
The script recognizes the following unique identifiers for articles (when included on the 1st line), as long as they are given in the following formats:
- PubMed ID
PMID: 12345678
- PubMed Central ID
PMCID: 12345678
- DOI
doi: 10.nnnnnn/example
https://doi.org/10.nnnnnn/example
https://dx.doi.org/10.nnnnnn/example
Additionally, the following URL patterns (2nd / 3rd line) will be recognized:
pubmed.ncbi.nlm.nih.gov/12345678
(PubMed id)ncbi.nlm.nih.gov/pmc/articles/PMC12345678
(PMC id)- several content url patterns from known publishers (doi)
All formatted citations are generated based on structured metadata obtained from API queries. Provided input (citation text, content URL) are parsed with regular expressions to extract identifiers. Only if an identifier cannot be found, the citation text is used to perform a bibliographic query with Crossref. This means that the citation text or content URL are never used directly.
The regex patterns used for matching content url (which often contains a doi in some of its parts) are not perfect, but should sufficiently cover link formats used by several large publishers.
The input format and processing approach (pattern matching combined with bibliographic query fallback) was designed to allow working with historic SFB data (largely unstructured citations accompanied by different kinds of content URLs) while allowing easy additions of new citations with identifiers only.
Citation Style Language is used as the internal representation of structured metadata. CSL+json is the preferred API response format.
If several identifiers can be found for a citation item, only one will be used (with a matching API), with the following priority: pmid, pmcid, doi.
The script uses the following APIs:
- PubMed Literature Citation Exporter (for pmid, pmcid),
- https://doi.org with content negotiation (for doi)
- Crossref REST API (for bibliographic queries)
A bibliographic query requires additional heuristics to disambiguate some cases when e.g. preprint and published article score similarly for a given citation. Its result is not available in csl directly, so the doi included in the result is then used for another query.
API requests are implemented as basic GET requests, following documentation and examples from respective services. The requests_cache and requests_ratelimiter libraries are used to cache and throttle requests, respectively.
A Jinja template is used to generate a formatted html page.