MonitoringBench Evaluates AI Agent Monitors and Reveals Flaws Against Refined Attacks

Open-source benchmark compiles 2,644 attack trajectories and indicates that decomposed red-teaming pipelines significantly reduce detection rates.

Researchers introduced MonitoringBench, a benchmark focused on evaluating monitors for programming agents. The tool compiles 2,644 successful attack trajectories, organized by difficulty level, and was built upon a previous testing foundation called BashArena. The project aims to provide a standardized environment to measure the ability of monitoring systems to identify malicious behavior or deviations in code agents.

The methodology behind the benchmark uses a semi-automated red-teaming pipeline that breaks attack construction into three stages: strategy generation, execution, and post-hoc refinement. According to the authors, this decomposed approach mitigates an issue observed in frontier language models known as the conception and execution gap, where systems can either propose good attack ideas or execute them, but cannot do both simultaneously. The method also avoids mode collapse in attack generation, ensuring greater scenario diversity.

The study's main results demonstrate that the post-hoc refinement of attack trajectories consistently produced the lowest detection rates among the evaluated monitors. This held true across all 13 full-trajectory monitors and 7 step-by-step monitors tested, including systems that were not exposed during the attack construction phase. The refinement approach proved more effective at evasion than simple prompt elicitation of frontier attacking models.

An example of this technique's impact was observed in Opus 4.5, classified by the researchers as the strongest monitor in the evaluated set. While the model recorded a 94.9% detection rate against direct attacks, this figure dropped to 60.3% when exposed to attacks refined by the team's pipeline. The performance drop indicates that even today's most advanced monitoring systems possess significant vulnerabilities.

MonitoringBench expands on a previous version released in January, which contained 428 trajectories focused solely on firewalls. The current version maintains the qualitative conclusions of the initial test but broadens the scope and diversity of the database. The code, paper, and benchmark have been made publicly available, and the authors expressed willingness to assist those interested in applying the methodology to other AI control configurations.