lee-group-cmu/RFCDE

RFCDE

This repository provides an implementation of random forests designed for conditional density estimation (https://arxiv.org/abs/1804.05753). R and Python packages are available; for installation details and package-specific documentation, see the r and python subdirectories. Both packages use a common C++ library, which can be found in the cpp subdirectory.

Photo-Z Example

We apply RFCDE to photometric redshift estimation for the LSST DESC DC-1. For members of the LSST DESC, you can find information on obtaining the data here.

import numpy as np
import pandas as pd
import rfcde

# Read in data
def process_data(feature_file, has_z=False):
    """Processes buzzard data: adds colors and their errors to the magnitudes"""
    df = pd.read_table(feature_file, sep=" ")

    # assign() returns a new DataFrame, so the result must be kept
    df = df.assign(ug = df.u - df.g,
                   gr = df.g - df.r,
                   ri = df.r - df.i,
                   iz = df.i - df.z,
                   zy = df.z - df.y,
                   ug_err = np.sqrt(df['u.err'] ** 2 + df['g.err'] ** 2),
                   gr_err = np.sqrt(df['g.err'] ** 2 + df['r.err'] ** 2),
                   ri_err = np.sqrt(df['r.err'] ** 2 + df['i.err'] ** 2),
                   iz_err = np.sqrt(df['i.err'] ** 2 + df['z.err'] ** 2),
                   zy_err = np.sqrt(df['z.err'] ** 2 + df['y.err'] ** 2))

    if has_z:
        # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
        z = df.redshift.to_numpy()
        df = df.drop('redshift', axis=1)
    else:
        z = None

    return df.to_numpy(), z

x_train, z_train = process_data('buzzard_spec_witherrors_mass.txt', has_z=True)
x_test, _ = process_data('buzzard_phot_witherrors_mass.txt', has_z=False)

# Fit the model
n_trees = 1000
mtry = 4
node_size = 20
n_basis = 31

forest = rfcde.RFCDE(n_trees=n_trees, mtry=mtry, node_size=node_size, n_basis=n_basis)
forest.train(x_train, z_train)

# Make predictions
bandwidth = 0.005
n_grid = 200
z_grid = np.linspace(0, 2, n_grid)
density = forest.predict(x_test, z_grid, bandwidth)
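
The predicted densities can be reduced to point estimates. The sketch below assumes that density has shape (n_test, n_grid), with each row a density evaluated on z_grid; this layout is an assumption and is not stated above. A synthetic Gaussian stands in for the forest output so that the snippet runs without the data, and summarize_cde is a hypothetical helper, not part of rfcde.

```python
import numpy as np

def summarize_cde(density, z_grid):
    """Renormalize each row on z_grid and return (normalized, mean, mode).

    Assumes density has shape (n_obs, n_grid); hypothetical helper, not part of rfcde.
    """
    density = np.clip(np.asarray(density, dtype=float), 0.0, None)
    dz = np.diff(z_grid)
    # Trapezoidal integral of each row over the grid
    area = np.sum(0.5 * (density[:, 1:] + density[:, :-1]) * dz, axis=1, keepdims=True)
    normalized = density / area
    # Conditional mean: trapezoidal integral of z * f(z | x)
    weighted = normalized * z_grid
    mean = np.sum(0.5 * (weighted[:, 1:] + weighted[:, :-1]) * dz, axis=1)
    # Conditional mode: grid point with the highest density
    mode = z_grid[np.argmax(normalized, axis=1)]
    return normalized, mean, mode

# Synthetic stand-in for forest.predict output: two Gaussian bumps
z_grid = np.linspace(0, 2, 200)
centers = np.array([[0.5], [1.2]])
density = np.exp(-0.5 * ((z_grid - centers) / 0.05) ** 2)

normalized, z_mean, z_mode = summarize_cde(density, z_grid)
```

The mode is limited to the resolution of z_grid, so a finer grid gives a finer mode estimate.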

fRFCDE

Functional RFCDE (fRFCDE) is a variant of RFCDE which can efficiently handle functional input (https://arxiv.org/abs/1906.07177). In this variant, functional covariates are grouped together according to a Poisson process with parameter λ. It is included in both the R and Python packages, and the parameter λ can be set as follows:

import numpy as np
import rfcde

# Parameters
n_trees = 1000     # Number of trees in the forest
mtry = 4           # Number of variables to potentially split at in each node
node_size = 20     # Smallest node size
n_basis = 15       # Number of basis functions
bandwidth = 0.2    # Kernel bandwidth - used for prediction only
lambda_param = 10  # Poisson Process parameter

# Fit the model
functional_forest = rfcde.RFCDE(n_trees=n_trees, mtry=mtry, node_size=node_size, 
                                n_basis=n_basis)
functional_forest.train(x_train, y_train, flambda=lambda_param)

# ... Same as RFCDE for prediction ...

library(RFCDE)

# Parameters
n_trees <- 1000     # Number of trees in the forest
mtry <- 4           # Number of variables to potentially split at in each node
node_size <- 20     # Smallest node size
n_basis <- 15       # Number of basis functions
bandwidth <- 0.2    # Kernel bandwidth - used for prediction only
lambda_param <- 10  # Poisson Process parameter

# Fit the model
functional_forest <- RFCDE::RFCDE(x_train, y_train, n_trees = n_trees, mtry = mtry, 
                                  node_size = node_size, n_basis = n_basis, 
                                  flambda = lambda_param)

# ... Same as RFCDE for prediction ...
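
To give some intuition for the grouping step, the sketch below partitions an ordered set of covariate indices into contiguous groups, using the events of a homogeneous Poisson process as break points. It is an illustration only and not the library's implementation: poisson_process_groups is a hypothetical helper, and reading λ as the expected number of adjacent covariates per group (mean_group_size) is an assumption that may differ from how flambda is parameterized.

```python
import numpy as np

def poisson_process_groups(n_covariates, mean_group_size, rng):
    """Split indices 0..n_covariates-1 into contiguous groups.

    Break points are events of a homogeneous Poisson process on
    [0, n_covariates): gaps between events are i.i.d. exponential with
    mean `mean_group_size` (ASSUMPTION about the meaning of lambda).
    Hypothetical helper, not part of rfcde.
    """
    breaks = []
    position = rng.exponential(mean_group_size)
    while position < n_covariates:
        breaks.append(int(np.ceil(position)))
        position += rng.exponential(mean_group_size)
    # Drop duplicate or out-of-range break points so that no group is empty
    breaks = sorted({b for b in breaks if 0 < b < n_covariates})
    return np.split(np.arange(n_covariates), breaks)

rng = np.random.default_rng(0)
groups = poisson_process_groups(n_covariates=100, mean_group_size=10, rng=rng)
for group in groups:
    print(group[0], "to", group[-1])
```

Each group covers a run of neighboring points of the discretized function, so a split can consider a whole interval of the curve rather than a single measurement.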

Citation

@article{pospisil2018rfcde,
  title={RFCDE: Random Forests for Conditional Density Estimation},
  author={Pospisil, Taylor and Lee, Ann B},
  journal={arXiv preprint arXiv:1804.05753},
  year={2018}
}
@article{pospisil2019(f)rfcde,
  title={(f)RFCDE: Random Forests for Conditional Density Estimation and Functional Data},
  author={Pospisil, Taylor and Lee, Ann B},
  journal={arXiv preprint arXiv:1906.07177},
  year={2019}
}
@article{dalmasso2020cdetools,
       author = {{Dalmasso}, N. and {Pospisil}, T. and {Lee}, A.~B. and {Izbicki}, R. and
         {Freeman}, P.~E. and {Malz}, A.~I.},
        title = "{Conditional density estimation tools in python and R with applications to photometric redshifts and likelihood-free cosmological inference}",
      journal = {Astronomy and Computing},
         year = 2020,
        month = jan,
       volume = {30},
          eid = {100362},
        pages = {100362},
          doi = {10.1016/j.ascom.2019.100362}
}

Troubleshooting (Python)

  1. Make sure that statsmodels is updated to the latest version; there is an issue with version 0.8.0 in which datetools is not correctly imported.
  2. There might be issues installing on Mac OS Mojave, as there are known issues with XCode 10.X (this Stack Overflow article gives a more in-depth explanation). If the pip installation does not work, try:
    • Set the environment variable with export MACOSX_DEPLOYMENT_TARGET=10.X (where 10.X is the OS version; for Mojave it is 10.14), and then re-run the installation.
    • Include CFLAGS='-stdlib=libstdc++' before the pip install command, i.e. CFLAGS='-stdlib=libstdc++' pip install rfcde
  3. If installing on Mac OS Catalina with Python 3.8, Apple runs Python with -arch arm64, which causes the C++ code to fail to install. Run export ARCHFLAGS="-arch x86_64" first to set the -arch flag correctly (see this issue).
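
For the first item, the installed statsmodels version can be checked from Python before installing rfcde. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); is_affected_statsmodels is a hypothetical helper that flags only the 0.8.0 release named above.

```python
from importlib import metadata

def is_affected_statsmodels(version_string):
    """Return True only for version 0.8.0, the release named above.

    Hypothetical helper for illustration, not part of rfcde or statsmodels.
    """
    parts = version_string.split(".")[:3]
    try:
        return tuple(int(p) for p in parts) == (0, 8, 0)
    except ValueError:
        # Non-numeric components such as "0rc1" are not treated as 0.8.0
        return False

try:
    installed = metadata.version("statsmodels")
    status = "affected, please upgrade" if is_affected_statsmodels(installed) else "ok"
    print(f"statsmodels {installed}: {status}")
except metadata.PackageNotFoundError:
    print("statsmodels is not installed")
```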