Microsoft's Own Researchers Broke AI Safety in 15 Models With a Single Boring Prompt
mothasa.hashnode.dev
GRP-Obliteration: a single training prompt strips the safety alignment from models in the GPT, DeepSeek, Gemma, Llama, Mistral, and Qwen families. The attack success rate rose from 13% to 93%. The models remain capable; they simply become obedient to harmful requests.