A post on LessWrong advocates for an empirical infrastructure to reduce ambiguity in AI safety questions about deception, introspection, and latent capabilities.
A post published on LessWrong introduces a research agenda called “interpretability debate,” aimed at creating an epistemic infrastructure to investigate open questions about AI models iteratively and empirically. According to the authors, the proposal seeks to accumulate evidence to help resolve interpretive ambiguities or, when that is not possible, to better calibrate the degree of uncertainty surrounding them.
According to the post, the agenda builds on previous work on “performative misalignment,” which the authors treat as an initial demonstration of a round of this type of debate. The idea is to apply more structured methods to difficult AI safety questions, such as assessing whether a model is “scheming,” hiding capabilities, lying, being introspective, or responding to stimuli because it recognizes something about itself.
The text argues that these questions are more complex than interpretive problems associated with weaker models, such as identifying whether a system takes shortcuts in multiple-choice tests, uses lexical heuristics in inference tasks, or relies on irrelevant parts of an image to classify objects. According to the authors, concepts like deception and introspection are difficult to define precisely, but they influence predictions about how a model might generalize its behavior in new contexts.
The post also states that chains of thought can currently be monitored in many cases, but should not be treated as a complete record of the causes behind a model's decision. According to the text, behavioral tests may likewise fail to reveal latent knowledge or unobserved capabilities, which the authors take as a reason to develop more rigorous, scientific methods of interpretation.
In the authors' view, even when a question cannot be resolved conclusively, the research should produce better-calibrated uncertainty, for example an explicit estimate of the probability that a model is being deceptive. The post situates the proposal within a recent movement to address topics like scheming and model motivations with more systematic approaches, although the operational details of the agenda still depend on the development of evaluation tools and protocols.
Interpretability debate is a research agenda proposed on LessWrong that aims to create an empirical infrastructure to iteratively investigate open questions about AI models, such as deception, introspection, and latent capabilities.
Chains of thought do not always provide a complete record of the causes behind a model's decisions, and behavioral tests may not fully reveal latent knowledge or unobserved capabilities, necessitating more scientific interpretive methods.
When a question cannot be definitively resolved, the goal is to produce more calibrated uncertainties, such as providing an explicit estimate of the probability that a model is being deceptive.
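As a rough illustration of what an "explicit estimate of the probability" could look like, the sketch below applies a standard Bayesian update in log-odds form across several rounds of evidence. It is not taken from the post: the function name `update_probability`, the idea of summarizing each round as a likelihood ratio, and all numbers are hypothetical assumptions chosen only to show the arithmetic.

```python
import math


def update_probability(prior, likelihood_ratios):
    """Update a prior probability with a sequence of likelihood ratios.

    Hypothetical helper, not from the post. Each likelihood ratio is
    P(evidence | hypothesis) / P(evidence | not hypothesis) for one round
    of evidence, and the rounds are assumed to be independent.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        # Ratios above 1 favor the hypothesis, ratios below 1 count against it.
        log_odds += math.log(ratio)
    # Convert the accumulated log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up example: a 10% prior that a model is being deceptive, followed by
# three rounds of evidence with illustrative likelihood ratios.
posterior = update_probability(0.10, [4.0, 0.5, 3.0])
print(f"Estimated probability of deception: {posterior:.2f}")
```

The point of the sketch is only that evidence which fails to settle a question can still move an estimate in a traceable way. How the evidence from a real round of debate would be turned into numbers like these is left open here.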