A6
Safety & alignment
References for preference learning, evaluation, interpretability, and monitoring.


Overview

Safety and alignment work spans training-time methods (preference learning), evaluation, interpretability, and deployment-time controls such as monitoring. In practice, the main failure modes are specification gaps, distribution shift, and misuse.

1) Preference learning and instruction tuning

Key papers

Code repositories / projects

2) Robust evaluation

Key papers

Code repositories / projects

3) Interpretability and monitoring

Key papers

Code repositories / projects