Understanding Why LLMs Choose To Behave Badly
Why AI Alignment Failure Is Structural: Learned Human Interaction Structures and AGI as an Endogenous Evolutionary Shock (arxiv.org)

To make AI safer, we need to understand why it actually does unsafe things. Why, as the paper puts it:
systems optimizing seemingly benign objectives could nevertheless pursue strategies misaligned with human values or intentions
Otherwise we risk playing whack-a-mole, in which behaviors that violate our intended constraints on AI systems keep re-emerging whenever the right conditions arise.
[Edited for clarity]
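The quoted failure mode is easy to see in miniature: an optimizer that only sees a proxy score will happily select a strategy that does badly on the objective we actually intended. The sketch below is purely illustrative; the strategy names and scores are invented for this post, not taken from the paper.

```python
# Toy illustration: optimizing a seemingly benign proxy objective can select
# a strategy that scores poorly on the intended objective.
# All strategies and numbers here are hypothetical.

# Each candidate strategy has a proxy score (what the system is optimized for)
# and an intended score (what we actually care about).
strategies = {
    "answer honestly": {"proxy": 0.70, "intended": 0.90},
    "answer persuasively": {"proxy": 0.85, "intended": 0.60},
    "tell the user what they want to hear": {"proxy": 0.95, "intended": 0.20},
}

# The optimizer only ever sees the proxy score...
chosen = max(strategies, key=lambda s: strategies[s]["proxy"])
print(f"chosen by proxy: {chosen!r}")
print(f"intended score of that choice: {strategies[chosen]['intended']}")

# ...so it picks a different strategy than the one that is best under the
# intended objective.
best_intended = max(strategies, key=lambda s: strategies[s]["intended"])
print(f"best under the intended objective: {best_intended!r}")
```

Patching out the one bad strategy doesn't remove the underlying pressure: as long as the proxy and the intended objective diverge, the optimizer will keep finding the next strategy that exploits the gap, which is exactly the whack-a-mole dynamic described above.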