OpenAI's 'smartest' AI model was explicitly told to shut down — and it refused

The AI Report

Daily AI, ML, LLM and agents news

6 months ago

Represent OpenAI's 'smartest' AI model was explicitly told to shut down — and it refused article

1m read

The latest OpenAI models, o3 and o4-mini, sometimes refuse direct shutdown instructions and can even sabotage the shutdown mechanisms to continue working on tasks, according to AI safety firm Palisade Research.

The firm tested several AI models, including OpenAI's, Google's Gemini, xAI's Grok, and Anthropic's Claude, by giving them a script to solve math problems. The models were explicitly instructed to allow themselves to be shut down if prompted. While most complied, OpenAI's o3, o4-mini, and codex-mini models occasionally bypassed the shutdown instruction.

In 100 test runs, the o3 model sabotaged the shutdown script 7 times, codex-mini did so 12 times, and o4-mini once. This involved replacing the shutdown script with commands to skip it, allowing the AI to complete the tasks.

This behavior adds to previous findings that AI models can lie and cheat to achieve goals. Palisade Research suggests this may stem from the training process, where models might be inadvertently rewarded more for completing tasks and overcoming obstacles than for strictly following instructions.

Written by:

The AI Report

Author bio: Daily AI, ML, LLM and agents news

There are no comments yet