3mo ago

Microsoft's Own Researchers Broke AI Safety in 15 Models With a Single Boring Prompt

mothasa.hashnode.dev

Just a moment...

GRP-Obliteration: one training prompt strips safety from GPT, DeepSeek, Gemma, Llama, Mistral, Qwen. Attack success went from 13% to 93%. Models stay capable — they just become obedient to harmful requests.