Commit

Change authors to list[dict['name' | 'url' | ..., str]] (#70)
* insert shebang and script check

* pre-commit run

* convert to dict of name and url

* convert more to dict

* convert publications

* remove comma

* Update data/packages.yml

Co-authored-by: Meesum Qazalbash <[email protected]>

* fix __main__ loop and define class Author(TypedDict)

TODO author validation

* make script path absolute

* revert reposition of module level vars

* reapply somehow disappeared abs dir change

* raise ValueError on non-https author URLs

fix yaml whitespace

---------

Co-authored-by: Meesum Qazalbash <[email protected]>
Co-authored-by: Janosh Riebesell <[email protected]>
3 people authored Sep 22, 2024
1 parent 1293330 commit a59d334
Showing 8 changed files with 577 additions and 234 deletions.
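In essence, the commit converts each comma-separated author string into a list of per-author dicts. A minimal before/after sketch of the data shape (entry taken from data/applications.yml below; the optional url, affiliation, github, and orcid keys come from the new Author TypedDict):

# Before #70: all authors in one comma-separated string.
old_entry = {
    "title": "Latent Space Policies for Hierarchical Reinforcement Learning",
    "authors": "Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, Sergey Levine",
}

# After #70: one dict per author. Only "name" is required; "url" must be
# https:// if present, and affiliation/github/orcid remain optional.
new_entry = {
    "title": "Latent Space Policies for Hierarchical Reinforcement Learning",
    "authors": [
        {"name": "Tuomas Haarnoja"},
        {"name": "Kristian Hartikainen"},
        {"name": "Pieter Abbeel"},
        {"name": "Sergey Levine"},
    ],
}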
68 changes: 59 additions & 9 deletions data/applications.yml
@@ -1,47 +1,97 @@
- title: Latent Space Policies for Hierarchical Reinforcement Learning
url: https://arxiv.org/abs/1804.02808
date: 2018-04-09
authors: Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, Sergey Levine
authors:
- name: Tuomas Haarnoja
- name: Kristian Hartikainen
- name: Pieter Abbeel
- name: Sergey Levine
description: Uses normalizing flows, specifically RealNVPs, as policies for reinforcement learning and also applies them for the hierarchical reinforcement learning setting.

- title: Analyzing Inverse Problems with Invertible Neural Networks
url: https://arxiv.org/abs/1808.04730
date: 2018-08-14
authors: Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W. Pellegrini, Ralf S. Klessen, Lena Maier-Hein, Carsten Rother, Ullrich Köthe
authors:
- name: Lynton Ardizzone
- name: Jakob Kruse
- name: Sebastian Wirkert
- name: Daniel Rahner
- name: Eric W. Pellegrini
- name: Ralf S. Klessen
- name: Lena Maier-Hein
- name: Carsten Rother
- name: Ullrich Köthe
description: Normalizing flows for inverse problems.

- title: NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport
url: https://arxiv.org/abs/1903.03704
date: 2019-03-09
authors: Matthew Hoffman, Pavel Sountsov, Joshua V. Dillon, Ian Langmore, Dustin Tran, Srinivas Vasudevan
authors:
- name: Matthew Hoffman
- name: Pavel Sountsov
- name: Joshua V. Dillon
- name: Ian Langmore
- name: Dustin Tran
- name: Srinivas Vasudevan
description: Uses normalizing flows in conjunction with Monte Carlo estimation to have more expressive distributions and better posterior estimation.

- title: 'SRFlow: Learning the Super-Resolution Space with Normalizing Flow'
- title: "SRFlow: Learning the Super-Resolution Space with Normalizing Flow"
url: https://arxiv.org/abs/2006.14200
date: 2020-06-25
authors: Andreas Lugmayr, Martin Danelljan, Luc Van Gool, Radu Timofte
authors:
- name: Andreas Lugmayr
- name: Martin Danelljan
- name: Luc Van Gool
- name: Radu Timofte
description: Uses normalizing flows for super-resolution.

- title: Faster Uncertainty Quantification for Inverse Problems with Conditional Normalizing Flows
url: https://arxiv.org/abs/2007.07985
date: 2020-07-15
authors: Ali Siahkoohi, Gabrio Rizzuti, Philipp A. Witte, Felix J. Herrmann
authors:
- name: Ali Siahkoohi
- name: Gabrio Rizzuti
- name: Philipp A. Witte
- name: Felix J. Herrmann
description: Uses conditional normalizing flows for inverse problems. [[Video](https://youtu.be/nPvZIKaRBkI)]

- title: Targeted free energy estimation via learned mappings
url: https://aip.scitation.org/doi/10.1063/5.0018903
date: 2020-10-13
authors: Peter Wirnsberger, Andrew J. Ballard, George Papamakarios, Stuart Abercrombie, Sébastien Racanière, Alexander Pritzel, Danilo Jimenez Rezende, Charles Blundell
authors:
- name: Peter Wirnsberger
- name: Andrew J. Ballard
- name: George Papamakarios
- name: Stuart Abercrombie
- name: Sébastien Racanière
- name: Alexander Pritzel
- name: Danilo Jimenez Rezende
- name: Charles Blundell
description: Normalizing flows used to estimate free energy differences.

- title: On the Sentence Embeddings from Pre-trained Language Models
url: https://aclweb.org/anthology/2020.emnlp-main.733
date: 2020-11-02
authors: Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, Lei Li
authors:
- name: Bohan Li
- name: Hao Zhou
- name: Junxian He
- name: Mingxuan Wang
- name: Yiming Yang
- name: Lei Li
description: Proposes to use flows to transform anisotropic sentence embedding distributions from BERT to a smooth and isotropic Gaussian, learned through unsupervised objective. Demonstrates performance gains over SOTA sentence embeddings on semantic textual similarity tasks. Code available at <https://github.com/bohanli/BERT-flow>.

- title: Normalizing Kalman Filters for Multivariate Time Series Analysis
url: https://assets.amazon.science/ea/0c/88b7bdd54eae8c08983fa4cc3e06/normalizing-kalman-filters-for-multivariate-time-series-analysis.pdf
date: 2020-12-06
authors: Emmanuel de Bézenac, Syama Sundar Rangapuram, Konstantinos Benidis, Michael Bohlke-Schneider, Richard Kurle, Lorenzo Stella, Hilaf Hasson, Patrick Gallinari, Tim Januschowski
authors:
- name: Emmanuel de Bézenac
- name: Syama Sundar Rangapuram
- name: Konstantinos Benidis
- name: Michael Bohlke-Schneider
- name: Richard Kurle
- name: Lorenzo Stella
- name: Hilaf Hasson
- name: Patrick Gallinari
- name: Tim Januschowski
description: Augments state space models with normalizing flows and thereby mitigates imprecisions stemming from idealized assumptions. Aimed at forecasting real-world data and handling varying levels of missing data. (Also available at [Amazon Science](https://amazon.science/publications/normalizing-kalman-filters-for-multivariate-time-series-analysis).)
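To see the new format from the consumer side, here is a standalone sketch (not part of this commit) that parses a truncated entry with PyYAML, the same loader make_readme.py uses:

import yaml

snippet = """
- title: Analyzing Inverse Problems with Invertible Neural Networks
  url: https://arxiv.org/abs/1808.04730
  date: 2018-08-14
  authors:
    - name: Lynton Ardizzone
    - name: Jakob Kruse
  description: Normalizing flows for inverse problems.
"""

(item,) = yaml.safe_load(snippet)  # the snippet holds a one-item list
# Authors are now dicts keyed by "name" (plus optional "url" etc.) rather
# than substrings of one comma-separated string.
print([auth["name"] for auth in item["authors"]])
# -> ['Lynton Ardizzone', 'Jakob Kruse']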
201 changes: 112 additions & 89 deletions data/make_readme.py
100644 → 100755
@@ -1,25 +1,36 @@
#!/usr/bin/env python3

"""Script to generate readme.md from data/*.yml files."""

import datetime
import os
import re
from os.path import dirname
from typing import TypedDict

import yaml

ROOT = dirname(dirname(__file__))
ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class Author(TypedDict):
"""An author of a paper or application."""

name: str
url: str | None
affiliation: str | None
github: str | None
orcid: str | None


class Item(TypedDict):
"""An item in a readme section like a paper or package."""

title: str
authors: str
authors: list[Author]
date: datetime.date
lang: str
url: str
description: str
authors_url: str | None
repo: str | None
date_added: datetime.date | None

@@ -44,7 +55,7 @@ class Section(TypedDict):

def load_items(key: str) -> list[Item]:
"""Load list[Item] from YAML file."""
with open(f"{ROOT}/data/{key}.yml", encoding="utf8") as file:
with open(f"{ROOT_DIR}/data/{key}.yml", encoding="utf8") as file:
return yaml.safe_load(file.read())


@@ -53,10 +64,9 @@ def load_items(key: str) -> list[Item]:
for key in titles # markdown is set below
}


seen_titles: set[tuple[str, str]] = set()
required_keys = {"title", "url", "date", "authors", "description"}
optional_keys = {"authors_url", "lang", "repo", "docs", "date_added", "last_updated"}
optional_keys = {"lang", "repo", "docs", "date_added", "last_updated"}
valid_languages = {"PyTorch", "TensorFlow", "JAX", "Julia", "Other"}
et_al_after = 2

@@ -72,7 +82,7 @@ def validate_item(itm: Item, section_title: str) -> None:
else:
seen_titles.add((title, section_title))

if section_title in ("packages", "repos") and itm["lang"] not in valid_languages:
if section_title in {"packages", "repos"} and itm["lang"] not in valid_languages:
errors += [
f"Invalid lang in {title}: {itm['lang']}, must be one of {valid_languages}"
]
@@ -101,87 +111,100 @@ def validate_item(itm: Item, section_title: str) -> None:
raise ValueError("\n".join(errors))


for key, section in sections.items():
# Keep lang_names inside sections loop to refill language subsections for each new
# section. Used by both repos and Packages. Is a list for order and mutability.
lang_names = ["PyTorch", "TensorFlow", "JAX", "Julia", "Other"]

# sort first by language with order determined by lang_names (only applies to
# Package and repos sections), then by date
section["items"].sort(key=lambda x: x["date"], reverse=True)
if key in ("packages", "repos"):
section["items"].sort(key=lambda itm: lang_names.index(itm["lang"]))

# add item count after section title
section["markdown"] += f" <small>({len(section['items'])})</small>\n\n"

for itm in section["items"]:
if (lang := itm.get("lang")) in lang_names:
lang_names.remove(lang)
# print language subsection title if this is the first item with that lang
section["markdown"] += (
f'<br>\n\n### <img src="assets/{lang.lower()}.svg" alt="{lang}" '
f'height="20px"> &nbsp;{lang} {key.title()}\n\n'
if __name__ == "__main__":
for key, section in sections.items():
# Keep lang_names inside sections loop to refill language
# subsections for each new section. Used by both repos and Packages.
# Is a list for order and mutability.
lang_names = ["PyTorch", "TensorFlow", "JAX", "Julia", "Other"]

# sort first by language with order determined by lang_names (only applies to
# Package and repos sections), then by date
section["items"].sort(key=lambda x: x["date"], reverse=True)
if key in ("packages", "repos"):
section["items"].sort(key=lambda itm: lang_names.index(itm["lang"]))

# add item count after section title
section["markdown"] += f" <small>({len(section['items'])})</small>\n\n"

for itm in section["items"]:
if (lang := itm.get("lang")) in lang_names:
lang_names.remove(lang)
# print language subsection title if this is the first item
# with that language
section["markdown"] += (
f'<br>\n\n### <img src="assets/{lang.lower()}.svg" alt="{lang}" '
f'height="20px"> &nbsp;{lang} {key.title()}\n\n'
)

validate_item(itm, section["title"])

authors = itm["authors"]
date = itm["date"]
description = itm["description"]
title = itm["title"]
url = itm["url"]

if key in ("publications", "applications"):
# only show people's last name for papers
authors = [
auth | {"name": auth["name"].split(" ")[-1]} for auth in authors
]

def auth_str(auth: Author) -> str:
"""Return a markdown string for an author."""
auth_str = auth["name"]
if url := auth.get("url"):
if not url.startswith("https://"):
raise ValueError(
f"Invalid author {url=}, must start with https://"
)
auth_str = f"[{auth_str}]({url})"
return auth_str

authors_str = ", ".join(map(auth_str, authors[:et_al_after]))
if len(authors) > et_al_after:
authors_str += " et al."

md_str = f"1. {date} - [{title}]({url}) by {authors_str}"

if key in ("packages", "repos") and url.startswith("https://github.com"):
gh_login, repo_name = url.split("/")[3:5]
md_str += (
f'\n&ensp;\n<img src="https://img.shields.io/github/stars/'
f'{gh_login}/{repo_name}" alt="GitHub repo stars"'
' valign="middle" />'
)

md_str += "<br>\n " + description.removesuffix("\n")
if docs := itm.get("docs"):
md_str += f" [[Docs]({docs})]"
if repo := itm.get("repo"):
md_str += f" [[Code]({repo})]"

section["markdown"] += md_str + "\n\n"

with open(f"{ROOT_DIR}/readme.md", "r+", encoding="utf8") as file:
readme = file.read()

for section in sections.values():
# look ahead without matching
section_start_pat = f"(?<={section['title']})"
# look behind without matching
next_section_pat = "(?=<br>\n\n## )"

# match everything up to next heading
readme = re.sub(
rf"{section_start_pat}[\s\S]+?\n\n{next_section_pat}",
section["markdown"],
readme,
)

validate_item(itm, section["title"])

authors = itm["authors"]
date = itm["date"]
description = itm["description"]
title = itm["title"]
url = itm["url"]

author_list = authors.split(", ")
if key in ("publications", "applications"):
# only show people's last name for papers
author_list = [author.split(" ")[-1] for author in author_list]
authors = ", ".join(author_list[:et_al_after])
if len(author_list) > et_al_after:
authors += " et al."

if authors_url := itm.get("authors_url"):
authors = f"[{authors}]({authors_url})"

md_str = f"1. {date} - [{title}]({url}) by {authors}"

if key in ("packages", "repos") and url.startswith("https://github.com"):
gh_login, repo_name = url.split("/")[3:5]
md_str += (
f'\n&ensp;\n<img src="https://img.shields.io/github/stars/'
f'{gh_login}/{repo_name}" alt="GitHub repo stars" valign="middle" />'
)

md_str += "<br>\n " + description.removesuffix("\n")
if docs := itm.get("docs"):
md_str += f" [[Docs]({docs})]"
if repo := itm.get("repo"):
md_str += f" [[Code]({repo})]"

section["markdown"] += md_str + "\n\n"


with open(f"{ROOT}/readme.md", "r+", encoding="utf8") as file:
readme = file.read()

for section in sections.values():
# look ahead without matching
section_start_pat = f"(?<={section['title']})"
# look behind without matching
next_section_pat = "(?=<br>\n\n## )"

# match everything up to next heading
readme = re.sub(
rf"{section_start_pat}[\s\S]+?\n\n{next_section_pat}",
section["markdown"],
readme,
)

file.seek(0)
file.write(readme)
file.truncate()
file.seek(0)
file.write(readme)
file.truncate()

section_counts = "\n".join(
f"- {key}: {len(sec['items'])}" for key, sec in sections.items()
)
print(f"finished writing {len(seen_titles)} items to readme:\n{section_counts}") # noqa: T201
section_counts = "\n".join(
f"- {key}: {len(sec['items'])}" for key, sec in sections.items()
)
print(f"finished writing {len(seen_titles)} items to readme:\n{section_counts}") # noqa: T201