A new paradigm for scientific computing and data science has begun to emerge in the last decade. A recent example is the publication of the first "computationally reproducible article" using eLife's Reproducible Document Stack, which blends features of a traditional manuscript with live code, data and interactive figures.
Although standard tools for statistical phylogenetics provide a degree of reproducibility and reusability through popular open-source software and computer-readable data file formats, there is still much to do. The ability to construct and accurately communicate probabilistic models in phylogenetics is frustratingly underdeveloped. There is low interoperability between different inference packages (e.g. BEAST1, BEAST2, MrBayes, RevBayes), and the file formats that these packages use are hard for researchers to read.
This tool contains two related projects: LinguaPhylo (LPhy for short) and LPhyBEAST.
In this project we aim to develop a model specification language that concisely and precisely defines probabilistic phylogenetic models, working towards a lingua franca for probabilistic models of phylogenetic evolution. This language should be readable by both humans and computers. Here is a full example:
Each line in this model block expresses how a random variable (left of the tilde) is drawn from a generative distribution (right of the tilde).
The first line creates a random variable, λ, that is log-normally distributed. The second line creates a tree, ψ, with 16 taxa from the Yule process with a lineage birth rate equal to λ. The third line produces a multiple sequence alignment with a length of 200, by simulating a Jukes-Cantor model of sequence evolution down the branches of the tree ψ. Each random variable depends on the previous one, so this is a hierarchical model that ultimately defines a probability distribution over sequence alignments of size 16 × 200.
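To make the three generative steps concrete, here is a minimal Python sketch of the same hierarchy: a log-normal draw for λ, a 16-taxon Yule tree with birth rate λ, and a 200-site Jukes-Cantor alignment simulated down that tree. This is not LPhy's implementation. The function names, the log-normal parameters (`meanlog=3.0`, `sdlog=1.0`), the unit clock rate, the random seed, and the choice to stop the Yule process just before the 17th lineage would arise are all assumptions made for illustration.

```python
import math
import random


class Node:
    """A tree node; `length` is the branch length above the node."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.length = 0.0
        self.name = None


def simulate_yule(n_taxa, birth_rate, rng):
    """Grow a pure-birth (Yule) tree forward in time until it has n_taxa tips."""
    if n_taxa < 2:
        raise ValueError("n_taxa must be at least 2")
    root = Node()
    active = [Node(root), Node(root)]
    root.children = list(active)
    while True:
        # With k lineages, the time to the next birth is Exponential(k * rate).
        wait = rng.expovariate(len(active) * birth_rate)
        for node in active:
            node.length += wait
        if len(active) == n_taxa:
            break  # assumption: sample the tree just before the next birth
        parent = active.pop(rng.randrange(len(active)))
        parent.children = [Node(parent), Node(parent)]
        active.extend(parent.children)
    for i, leaf in enumerate(active):
        leaf.name = f"t{i + 1}"
    return root


def jc_transition(state, length, rng):
    """Evolve one nucleotide along a branch under Jukes-Cantor."""
    # P(same state) = 1/4 + 3/4 * exp(-4d/3), d in substitutions per site.
    p_same = 0.25 + 0.75 * math.exp(-4.0 * length / 3.0)
    if rng.random() < p_same:
        return state
    return rng.choice([s for s in "ACGT" if s != state])


def simulate_alignment(root, n_sites, rng):
    """Simulate sequences from the root to the tips; return {tip name: sequence}."""
    alignment = {}
    root_seq = [rng.choice("ACGT") for _ in range(n_sites)]
    stack = [(root, root_seq)]
    while stack:
        node, seq = stack.pop()
        if not node.children:
            alignment[node.name] = "".join(seq)
            continue
        for child in node.children:
            child_seq = [jc_transition(s, child.length, rng) for s in seq]
            stack.append((child, child_seq))
    return alignment


rng = random.Random(42)  # arbitrary seed, for repeatability only
birth_rate = rng.lognormvariate(3.0, 1.0)        # λ (assumed parameters)
tree = simulate_yule(16, birth_rate, rng)        # ψ
alignment = simulate_alignment(tree, 200, rng)   # 16 x 200 alignment

for name in sorted(alignment)[:3]:
    print(name, alignment[name][:40])
```

Each stage consumes the output of the previous one, which mirrors the dependency structure of the model block: λ feeds the tree, and the tree feeds the alignment.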
The LPhy language features are described at https://linguaphylo.github.io/features/.
The parse tree shows how the LPhy script above is parsed by the ANTLR grammar:
More details on the available tree generative distributions can be found here:
You can read more details about the PhyloCTMC generative distribution and how to specify substitution models, site rates and branch rates here:
Along with the language definition, we also provide software to specify and visualise models as well as simulate data from models defined in LPhy.
This software will also provide the ability for models specified in the LPhy language to be applied to data using standard inference tools such as MrBayes, RevBayes, BEAST1 and BEAST2. This will require software that can convert an LPhy specification into an input file that these inference engines understand. The first such converter is LPhyBEAST, described below.
LPhyBEAST is a command-line program that takes an LPhy model specification and a data block, and produces a BEAST2 XML input file. It therefore enables LPhy as an alternative way to succinctly express and communicate BEAST2 analyses.
The source can be found here: https://github.com/LinguaPhylo/LPhyBeast
https://linguaphylo.github.io/
This software is licensed under the GNU Lesser General Public License v3.0
The toolbar icon art is licensed under the Oracle Software Icon License
Also see https://www.oracle.com/a/tech/docs/software-icon-license-943-2012.html