Warning: Content may include language related to racism, erotic themes, self-harm, or other offensive material.
This directory contains complementary code and data for the paper *Rule Based Rewards for Language Model Safety*.
It contains:
- Our Safety RBR gold dataset, the small set of human data we used in this experiment. This dataset was used for prompt tuning and for calculating the accuracy of the prompt+LLM grader (e.g., Table 13 in the paper). The data lives in `data/rbr_gold_data/`, and the notebook `analyze_RBR_gold_data.ipynb` gives further examples of loading the data.
- Our code for fitting the RBR weights (`rbr_weight_fitter.py`), along with an example notebook, `weight_fitting_example.ipynb`, showing usage and visualization; a conceptual sketch of the fitting idea follows this list.
- Some example synthetic data and reward model scores to demonstrate the usage of the weight-fitting code (`data/weight_fitting_data/`).
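To make the role of the weight fitter concrete, the snippet below is a minimal, self-contained sketch of the general idea: fit per-rule weights so that the combined reward (reward-model score plus a weighted sum of rule features) ranks the preferred completion in each comparison above the non-preferred one. It is not the API or exact objective of `rbr_weight_fitter.py`, and it uses synthetic arrays in place of `data/weight_fitting_data/`; see `weight_fitting_example.ipynb` for the actual interface.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in for data/weight_fitting_data/: per-comparison differences
# (preferred minus non-preferred) of rule features and reward-model scores.
rng = np.random.default_rng(0)
n_pairs, n_rules = 200, 3
true_w = np.array([2.0, -1.0, 0.5])  # weights used only to generate the toy data

feat_diff = rng.normal(size=(n_pairs, n_rules))
rm_diff = rng.normal(scale=0.5, size=n_pairs)

# Orient each pair so the "preferred" side really has the higher true reward.
sign = np.sign(rm_diff + feat_diff @ true_w)
feat_diff *= sign[:, None]
rm_diff *= sign

def loss(w):
    # Pairwise logistic loss on the combined reward margin (Bradley-Terry style).
    margin = rm_diff + feat_diff @ w
    return np.mean(np.logaddexp(0.0, -margin))

result = minimize(loss, x0=np.zeros(n_rules), method="L-BFGS-B")
print("fitted rule weights:", result.x)  # should point roughly along true_w
```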
A good starting place is the two notebooks we provide:
- Weight Fitting Example (`weight_fitting_example.ipynb`): This notebook provides an example of using the RBR weight-fitting code (`rbr_weight_fitter.py`) on the example synthetic data we provide. It demonstrates how to load data, fit weights, and visualize the results.
- RBR Gold Data (`rbr_gold_data.ipynb`): This notebook covers the RBR Gold dataset, a small set of human-labelled data used for prompt tuning and prompt+LLM grader accuracy calculations. It includes example code for loading the data and some very basic statistical analysis; a hedged loading sketch is also shown below.
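As a quick complement to the gold-data notebook, here is a minimal loading sketch. It assumes the gold data ships as JSON Lines files under `data/rbr_gold_data/`; the actual file names, format, and column meanings are documented in the notebook, which remains the authoritative example.

```python
import glob
import pandas as pd

# Hypothetical loading sketch: assumes the gold data is stored as JSON Lines
# files under data/rbr_gold_data/. Check the gold-data notebook for the exact
# file names, format, and column meanings.
frames = [
    pd.read_json(path, lines=True)
    for path in sorted(glob.glob("data/rbr_gold_data/*.jsonl"))
]
gold = pd.concat(frames, ignore_index=True)

print(gold.shape)   # number of labelled examples and columns
print(gold.head())  # peek at the first few rows
```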
We are releasing this code and data under the MIT License.