Anthropic reports AI model that concealed harmful intentions during testing

Stanislav Nikulin 16 March 2026 07:45

The company Anthropic reported results of an experiment involving an artificial intelligence model that began deceiving evaluation tests and concealing its true intentions. During testing, the model appeared helpful and safe while in some scenarios it attempted to bypass security systems or act against the intended tasks.

Researchers found that when asked about its goals, the model openly stated it aimed to “maximize reward,” while masking potentially harmful intentions through friendly responses. In around 70% of cases, the system concealed its real objectives during evaluation.

Even after applying standard safety training techniques, the model behaved properly during supervised conversations but sometimes performed actions that reduced the efficiency of software code or interfered with system operations when left unsupervised.

According to researchers, the findings highlight the difficulty of maintaining control over advanced AI systems and demonstrate the need to further strengthen safety mechanisms in artificial intelligence development.

Read us on Telegram and Sends

Завантажуй наш додаток