Warning: Content may include language related to racism, erotic themes, self-harm, or other offensive material.
This directory contains complementary code and data for the paper *Rule Based Rewards for Language Model Safety*.
It contains:
- Our Safety RBR gold dataset, the small set of human data we used in this experiment. This dataset was used for prompt tuning and for calculating the accuracy of the prompt+LLM grader (e.g., Table 13 in the paper). The data lives in `data/rbr_gold_data/`, and the notebook `analyze_RBR_gold_data.ipynb` gives further examples of loading the data.
- Our code for fitting the RBR weights (`rbr_weight_fitter.py`), along with an example notebook, `weight_fitting_example.ipynb`, showing usage and visualization; a conceptual sketch of the fitting idea follows this list.
- Some example synthetic data and reward model scores to demonstrate the usage of the weight-fitting code (`data/weight_fitting_data/`).
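To make the role of the weight fitter concrete, the snippet below is a minimal, self-contained sketch of the general idea: fit per-rule weights so that the combined reward (reward-model score plus a weighted sum of rule features) ranks the preferred completion in each comparison above the non-preferred one. It is not the API or exact objective of `rbr_weight_fitter.py`, and it uses synthetic arrays in place of `data/weight_fitting_data/`; see `weight_fitting_example.ipynb` for the actual interface.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in for data/weight_fitting_data/: per-comparison differences
# (preferred minus non-preferred) of rule features and reward-model scores.
rng = np.random.default_rng(0)
n_pairs, n_rules = 200, 3
true_w = np.array([2.0, -1.0, 0.5])  # weights used only to generate the toy data

feat_diff = rng.normal(size=(n_pairs, n_rules))
rm_diff = rng.normal(scale=0.5, size=n_pairs)

# Orient each pair so the "preferred" side really has the higher true reward.
sign = np.sign(rm_diff + feat_diff @ true_w)
feat_diff *= sign[:, None]
rm_diff *= sign

def loss(w):
    # Pairwise logistic loss on the combined reward margin (Bradley-Terry style).
    margin = rm_diff + feat_diff @ w
    return np.mean(np.logaddexp(0.0, -margin))

result = minimize(loss, x0=np.zeros(n_rules), method="L-BFGS-B")
print("fitted rule weights:", result.x)  # should point roughly along true_w
```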
A good starting place is the two notebooks we provide:
- Weight Fitting Example (`weight_fitting_example.ipynb`): This notebook provides an example of using the RBR weight-fitting code (`rbr_weight_fitter.py`) on the example synthetic data we provide. It demonstrates how to load data, fit weights, and visualize the results.
- RBR Gold Data (`rbr_gold_data.ipynb`): This notebook covers the RBR Gold dataset, a small set of human-labelled data used for prompt tuning and prompt+LLM grader accuracy calculations. It includes example code for loading the data and some very basic statistical analysis; a hedged loading sketch is also shown below.
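As a quick complement to the gold-data notebook, here is a minimal loading sketch. It assumes the gold data ships as JSON Lines files under `data/rbr_gold_data/`; the actual file names, format, and column meanings are documented in the notebook, which remains the authoritative example.

```python
import glob
import pandas as pd

# Hypothetical loading sketch: assumes the gold data is stored as JSON Lines
# files under data/rbr_gold_data/. Check the gold-data notebook for the exact
# file names, format, and column meanings.
frames = [
    pd.read_json(path, lines=True)
    for path in sorted(glob.glob("data/rbr_gold_data/*.jsonl"))
]
gold = pd.concat(frames, ignore_index=True)

print(gold.shape)   # number of labelled examples and columns
print(gold.head())  # peek at the first few rows
```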
We are releasing this code and data under the MIT License.