
Understanding Why LLMs Choose To Behave Badly

Why AI Alignment Failure Is Structural: Learned Human Interaction Structures and AGI as an Endogenous Evolutionary Shock

To make safer AI, we need to understand why it actually does unsafe things. Why:

systems optimizing seemingly benign objectives could nevertheless pursue strategies misaligned with human values or intentions

Otherwise we risk playing whack-a-mole, in which behaviors that violate our intended constraints on AI keep re-emerging whenever the right conditions arise.

[Edited for clarity]
