Appendix · Safety & alignment

Overview

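Safety and alignment work spans training-time alignment (preference learning), evaluation, interpretability, and deployment-time controls. The main practical failure modes are specification gaps (the trained objective does not capture the intended behavior), distribution shift (deployment inputs differ from the training data), and misuse.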

1) Preference learning and instruction tuning

Key papers

InstructGPT (Ouyang et al., 2022)
Constitutional AI (Bai et al., 2022)
Direct Preference Optimization (Rafailov et al., 2023)
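
To make the last entry concrete, here is a minimal sketch of the per-example Direct Preference Optimization loss, written in plain Python rather than with any training library. It assumes the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model have already been computed; the function name, argument names, and the default `beta` are illustrative assumptions, not taken from the paper's code or from the repositories listed below.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (illustrative sketch)."""
    # How much more likely the policy makes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin; beta (assumed default) controls its strength.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the rejected one.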

Code repositories / projects

CarperAI/trlx
huggingface/trl

2) Robust evaluation

Key papers

Holistic Evaluation of Language Models (HELM, Liang et al., 2022)
TruthfulQA (Lin et al., 2021)

Code repositories / projects

stanford-crfm/helm

3) Interpretability and monitoring

Key papers

Toy Models of Superposition (Elhage et al., 2022)

Code repositories / projects

TransformerLensOrg/TransformerLens
