Skip to content

xKDR/CRRao.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CRRao

Stable Dev Build Status Build Status Coverage Milestones

To install:

For version 0.1.1 (stable)

using Pkg; Pkg.add("CRRao")

For version 0.1.2 (under development)

using Pkg; Pkg.add(url = "https://github.com/xKDR/CRRao.jl.git")

CRRao: A single API for diverse statistical models

Many statistical models can be estimated in Julia, and the diversity of the model ecosystem is steadily improving. Drawing inspiration from the Zelig package in the R world, the CRRao package gives a simple and consistent API to the end user. The end-user then faces the fixed cost of getting a hang of this, once, and after that a wide array of models and associated capabilities become available with a consistent syntax. We hope others developing statistical models will build within this framework.

Here's an example of estimating the linear regression

MPG = β0 + β1 HP + β2 WT + β3 Gear + ϵ

   using CRRao, RDatasets, StatsModels
   df = dataset("datasets", "mtcars")
   model = fit(@formula(MPG ~ HP + WT+Gear), df, LinearRegression())
   coeftable(model)

   ────────────────────────────────────────────────────────────────────────────
                     Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
   ────────────────────────────────────────────────────────────────────────────
   (Intercept)  32.0137     4.63226      6.91    <1e-06  22.5249     41.5024
   HP           -0.0367861  0.00989146  -3.72    0.0009  -0.0570478  -0.0165243
   WT           -3.19781    0.846546    -3.78    0.0008  -4.93188    -1.46374
   Gear          1.01998    0.851408     1.20    0.2410  -0.72405     2.76401
   ────────────────────────────────────────────────────────────────────────────

This calls the generic function fit(), where you supply a formula, a dataset, and pick the model.

Present capabilities

We have implemented four regression models:

  1. Linear
  2. Logistic (with four link functions)
  3. Poisson
  4. Negative binomial

In all cases, we have traditional frequentist models and Bayesian versions with six kinds of priors :

  1. Gauss
  2. Ridge
  3. Laplace
  4. Cauchy
  5. T-Distributed
  6. Horse shoe

All these models are built out of foundations in the Julia package ecosystem, such as GLM.jl and Turing.jl. Here in CRRao.jl, we are not building additional models; we are only building the scaffolding for the consistent API to a diverse array of models.

The traditional frequentist models estimates MLE using the GLM.jl and the Bayesian models uses the Hamiltonian Monte Carlo (HMC) methods using Turing.jl.

Future capabilities

  1. Gaussian Process Regression
  2. Gaussian Process Classification
  3. Gaussian Process Time Series
  4. Linear Discriminant Analysis
  5. Hierarchical Bayesian Models

Help us build this

  • Please use CRRao and tell us what is not good about it.

  • We have exploited Julia capabilities to make it convenient to build additional functionality within CRRao, and for multiple developers to build new models.

  • We want to build out CRRao into a simple and consistent approach to the statistical modelling workflow. Please help us plan and build this.

  • As a developer, you can begin contributing by adding the features requested in the milestones section of the repository.

  • When you are developing keep in mind that your application is capable of handling the following mandatory features:

    1. Formula,
    2. DataFrame (optional - you may choose other data types),
    3. modelClass,
    4. It follows StatsAPI.jl requirement (like it has atleast fit, others capabilities like predict,aic,bic would be great things to have).
    5. Provide test cases
    6. Check for performance gain against R and Python.

Performance gains

The efficiency gains of the Julia language and the package ecosystem accrue to the end-user of CRRao. (CRRao is just a thin layer, and the heavy lifting is all done by the underlying packages). Here is some measurement of the above model, done through four alternative systems.

R

> attach(datasets::mtcars)
> library(microbenchmark)
> microbenchmark(lm(mpg~hp+wt))
Unit: microseconds
              expr     min      lq     mean   median      uq      max neval
 lm(mpg ~ hp + wt) 290.534 311.209 380.1334 325.9485 395.288 2223.736   100

Julia

using RDatasets, CRRao, BenchmarkTools, StatsModels
df = dataset("datasets", "mtcars")
@benchmark fit(@formula(MPG ~ HP + WT), df, LinearRegression())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):   90.092 μs   34.761 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     127.941 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.215 μs ± 559.192 μs  ┊ GC (mean ± σ):  4.54% ± 3.30%

To summarise the performance across four alternatives:

Language Package/Function Mean time taken
Python statsmodes/ols 2106.6 μs
Python sklearn/fit 559.9 μs
R stats/lm 380.13 μs
Julia CRRao/fit 160.22 μs
----------- -------------------- ------------------

where we emphasise that the performance of fit() here is a tiny overhead on top of the implementation of the linear regression in GLM.jl.

Support

We gratefully acknowledge the JuliaLab at MIT for financial support for this project.