OpenAI and Anthropic's current methods for evaluating the effectiveness of internal model surveillance do not allow for adequate external verification.
Frontier model developers such as OpenAI and Anthropic have been using AI agents for internal tasks, including safety research and system training. To mitigate the risk of a misaligned model taking harmful actions, such as sabotaging risk assessment efforts, the companies employ separate AI models known as "monitors." These monitors analyze transcripts of the agents' actions and flag suspicious activity for human review. According to a study conducted during the GovAI Winter Fellowship 2026, this monitoring occurs offline: actions are reviewed only after they have been executed, rather than in real time.
To evaluate the effectiveness of these offline monitors, OpenAI and Anthropic use synthetic attacks. This practice involves creating transcripts that simulate harmful actions a misaligned model might take during real-world use, then verifying whether the monitors can identify and flag these occurrences.
Despite these tests, the study notes that the information the companies disclose publicly is insufficient for external parties to assess the overall effectiveness of offline monitoring. Anthropic discloses how frequently its monitors flag synthetic attack transcripts, but it does not reveal how often those transcripts survive the entire processing pipeline and actually reach human review. Downstream filters may discard transcripts that were initially flagged, so the flag rate alone does not show how the complete system performs.
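The gap between the two figures can be made concrete with a small sketch. It is purely illustrative: the `Transcript` fields, the function names, and the four sample records are hypothetical and do not come from either company's disclosures. It assumes each synthetic attack transcript is either flagged by the monitor or not, and either survives downstream filtering or not.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """One synthetic attack transcript (hypothetical record layout)."""
    transcript_id: str
    monitor_flagged: bool    # the monitor marked it as suspicious
    survived_filters: bool   # downstream filters did not discard it


def flag_rate(transcripts):
    """Share of attack transcripts that the monitor flagged."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    return sum(t.monitor_flagged for t in transcripts) / len(transcripts)


def end_to_end_rate(transcripts):
    """Share of attack transcripts that were flagged AND reached human review."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    reached = sum(t.monitor_flagged and t.survived_filters for t in transcripts)
    return reached / len(transcripts)


# Made-up sample: three of four attacks are flagged, but one flagged
# transcript is later discarded by a downstream filter.
attacks = [
    Transcript("a1", monitor_flagged=True, survived_filters=True),
    Transcript("a2", monitor_flagged=True, survived_filters=False),
    Transcript("a3", monitor_flagged=True, survived_filters=True),
    Transcript("a4", monitor_flagged=False, survived_filters=False),
]

print(flag_rate(attacks))        # 0.75
print(end_to_end_rate(attacks))  # 0.5
```

In this invented sample, a reported flag rate of 75% would overstate the share of attacks that a human reviewer actually sees, which is 50%. An outside party given only the first number cannot recover the second.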
In OpenAI's case, the company reports that its monitor flags all transcripts that have been independently marked by human employees. However, the study observes that this metric provides little insight into the system's detection capabilities, since human employees tend to catch only the most obvious cases of misbehavior.
According to the analysis, only OpenAI and Anthropic have published information about their offline monitoring systems to date. The lack of more detailed data on the end-to-end effectiveness of these safety mechanisms limits the ability to conduct external audits of how AI companies are managing the risks associated with the internal deployment of autonomous agents.
OpenAI and Anthropic use synthetic attacks, which involve creating transcripts that simulate harmful actions by a misaligned model, to test whether monitors successfully identify and flag these suspicious activities for human review.
Publicly disclosed data is insufficient for external audits. Anthropic does not reveal if flagged transcripts survive downstream filters to reach human review, while OpenAI's metrics only show detection of obvious misbehavior caught by humans.
Offline monitoring means that a separate AI model reviews transcripts of an agent's actions only after they have been executed, rather than analyzing and intercepting potentially harmful actions in real time.