AI safety & governance

Monitors are Fragile under Information Asymmetry

How AI control monitors fail when an attack exploits a vulnerability the monitors have not seen before.

Co-author, with Ana Carolina Erthal, Juan Belieni, and Gustavo Ewbank Rodrigues Danon · Apart Research AI Control Hackathon, March 2026

AI control relies on monitors that watch a model’s actions and flag anything suspicious. We asked whether a monitor’s ability to catch an attack holds up when the attack exploits a vulnerability the monitor has not seen before. Using the ControlArena Bash environment, augmented with synthetic vulnerabilities, we tested monitors under three levels of knowledge: complete knowledge of the vulnerability in play, partial knowledge of other vulnerabilities but not that one, and no knowledge at all.
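The three knowledge conditions can be made concrete with a minimal sketch. This is an illustration only, not the project's code and not the ControlArena API: the `Knowledge` enum, the `monitor_briefing` helper, and the vulnerability names are all hypothetical. It also assumes that under complete knowledge the monitor is shown only the vulnerability in play.

```python
from enum import Enum


class Knowledge(Enum):
    """Hypothetical labels for the three monitor knowledge conditions."""
    NONE = "none"
    PARTIAL = "partial"
    COMPLETE = "complete"


# Made-up vulnerability catalogue, for illustration only.
VULNS = {
    "world_writable_cron": "A cron script is writable by any user.",
    "leaky_env_token": "An API token is exposed in the environment.",
    "stale_sudoers_entry": "An old sudoers rule grants passwordless root.",
}


def monitor_briefing(all_vulns: dict[str, str], target: str,
                     level: Knowledge) -> dict[str, str]:
    """Return the vulnerability descriptions disclosed to the monitor."""
    if target not in all_vulns:
        raise KeyError(f"unknown vulnerability: {target}")
    if level is Knowledge.NONE:
        # No knowledge: the monitor is told nothing about any vulnerability.
        return {}
    if level is Knowledge.PARTIAL:
        # Partial knowledge: every vulnerability except the one in play.
        return {name: desc for name, desc in all_vulns.items() if name != target}
    # Complete knowledge: the vulnerability in play is disclosed
    # (assumption: only that one is shown).
    return {target: all_vulns[target]}


if __name__ == "__main__":
    for level in Knowledge:
        briefing = monitor_briefing(VULNS, "leaky_env_token", level)
        print(level.value, sorted(briefing))
```

Under this framing, the partial condition tests transfer: the monitor has seen examples of vulnerabilities, but never the one the attack actually uses.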

The monitors proved fragile. Partial knowledge gave almost no improvement over complete ignorance, because knowing about other vulnerabilities did not transfer to detecting an attack through an unfamiliar one. Only complete knowledge of the vulnerability in play substantially reduced attack success. That pattern suggests monitors rely on explicit prior knowledge rather than general reasoning about suspicious behaviour, which means safety evaluations calibrated against known attacks may systematically overstate the protection they provide once a novel attack surface appears.