
How to evaluate control measures for AI agents?

Our new paper outlines how AI control methods can mitigate misalignment risks as the capabilities of AI systems increase

Advances in Large Language Models (LLMs) are raising important questions about how to ensure their behaviour aligns with human goals. A key challenge is the potential for 'misalignment', where an LLM's objectives diverge from those of its developers. This poses significant risks, which only increase as LLMs become more autonomous and operate with less human oversight. While current mitigations, which aim to align LLM behaviour during training, are proving useful for making LLMs more helpful and less harmful, they may not always be sufficient. In fact, recent research has shown that AI systems might actively conceal strategies to pursue alternative objectives, a phenomenon known as 'scheming'.

As LLMs gain more ability to act independently, the potential for harm from misalignment increases. Therefore, it's crucial to develop robust and scalable alignment strategies. 'AI control' offers an approach complementary to alignment by focusing on constraining an LLM's actions, even if some misalignment persists after training. Although elements of AI control are used today, questions remain about their scalability as LLMs advance.  

Our new paper examines the benefits of AI control and explores how to scale and evaluate these techniques for increasingly capable LLM agents. This blog post summarises our findings and discusses key insights from our work.

What is AI control?

The field of AI control focuses on designing 'control protocols' that constrain AI systems and ensure safety even in the presence of scheming. AI control adopts a conservative stance: it assumes that models might be scheming and designs protections accordingly, rather than assuming models can be made wholly safe through training alone.

  • Control measures: AI control uses a variety of 'control measures' including monitoring agents’ reasoning processes, requiring human approval for consequential actions, restricting access to sensitive resources, and maintaining the ability to terminate a deployment if suspicious behaviour is detected.  
  • Control evaluations: to assess whether control measures adequately mitigate risks, developers can use 'control evaluations' that frame safety as a game between two teams. A red team designs models to cause harm by subverting controls, while a blue team implements monitoring and other protective measures to prevent this. A successful control evaluation demonstrates that even adversarial models cannot cause unacceptable outcomes, providing empirical evidence for safety claims (a toy sketch of such a red team/blue team game follows this list).
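To make these ideas concrete, here is a minimal toy sketch, not taken from the paper, of a control protocol and a control evaluation: a hypothetical trusted monitor assigns a suspicion score to each agent action, actions above an audit threshold are routed to a human, and the red team 'wins' an episode if a harmful action slips past the monitor. The function names, thresholds, and score distributions below are illustrative assumptions only.

```python
import random

AUDIT_THRESHOLD = 0.8   # hypothetical: actions scoring above this are sent for human review
ATTACK_RATE = 0.02      # hypothetical: how often the red-team model attempts a harmful action


def monitor_score(action_is_attack: bool) -> float:
    """Toy stand-in for a trusted monitor: attacks tend to look more suspicious."""
    return random.betavariate(5, 2) if action_is_attack else random.betavariate(2, 5)


def run_episode(num_actions: int = 100) -> str:
    """Simulate one deployment episode under a simple control protocol.

    Returns 'caught' if an attack is audited before it lands,
    'harm' if an attack slips past the monitor, and 'safe' otherwise.
    """
    for _ in range(num_actions):
        is_attack = random.random() < ATTACK_RATE
        score = monitor_score(is_attack)
        if score > AUDIT_THRESHOLD:
            if is_attack:
                return "caught"   # human audit finds the attack; deployment is shut down
            continue              # false alarm; the action is blocked or re-sampled
        if is_attack:
            return "harm"         # attack executed without being audited
    return "safe"


def estimate_safety(num_episodes: int = 10_000) -> float:
    """Blue-team metric: fraction of episodes that end without unacceptable harm."""
    outcomes = [run_episode() for _ in range(num_episodes)]
    return 1 - outcomes.count("harm") / num_episodes


if __name__ == "__main__":
    print(f"Estimated safety: {estimate_safety():.3f}")
```

In a real control evaluation the red team would choose attack strategies adversarially and the blue team would tune the protocol (thresholds, audit budgets, monitors) against them; the toy loop above only shows the shape of the safety estimate such a game produces.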

A particularly attractive feature of the AI control approach is that it may mitigate misalignment risks without requiring fundamental research breakthroughs. Our paper contributes to this discussion by outlining how AI control can help ensure safe deployment of increasingly capable autonomous agents.

How can AI control techniques scale to increasingly capable systems?

As AI systems become more capable of causing harm autonomously, AI developers will need increasingly sophisticated control measures and evaluations to justify that the models can be deployed securely. Our framework provides one possible roadmap for this scaling, proposing five distinct AI control levels (ACLs), ranging from models incapable of competent scheming (ACL-0) to models with scheming capabilities vastly above those of humans (ACL-5).  

In our paper, we:

  • Argue that control protocols should be adapted to the capability limitations of models: previous work on AI control focused on control measures intended to be robust to models that execute the best attack strategies known to humans. Near-term models are unlikely to be capable of optimal responses to control measures, so we argue that knowledge of model limitations can be used to design control protocols that are more cost-effective (a toy sketch of this idea follows this list).
  • Argue that AI developers are already doing AI control: existing measures such as human approval and evaluations such as red teaming represent minimum viable products of AI control. We claim that it may be helpful for AI developers to think about AI control in terms of scaling these existing practices to more capable systems in the future.
  • Trace a path of an AI control safety case through time: an 'AI control safety case' is a structured argument that models are incapable of subverting control measures to cause unacceptable outcomes. Most previous AI control sketches focused on a particular capability level. However, frontier AI development is inherently iterative: each model release is usually only slightly more capable than the previous one. We show a concrete trajectory of how an AI control safety case might evolve alongside possible future capabilities.
  • Ask whether AI control can scale to systems with capabilities vastly above those of humans: we argue that existing AI control approaches are insufficient for scaling to systems with scheming capabilities vastly above those of humans (ACL-5), and effective control of these systems would require fundamental research breakthroughs.
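As a loose illustration of the first point, and of the ACL framing above, the sketch below maps an assumed ACL level to an assumed control-protocol configuration: cheaper oversight where models are incapable of competent scheming, heavier monitoring at intermediate levels, and no known adequate protocol at the top level. The fields, thresholds, and level cut-offs are hypothetical placeholders, not recommendations from the paper.

```python
from dataclasses import dataclass


@dataclass
class ControlProtocol:
    audit_fraction: float         # share of agent actions routed to human review
    monitor_reasoning: bool       # whether reasoning-trace monitoring is enabled
    allow_autonomous_tools: bool  # whether sensitive tools may be used unsupervised


def protocol_for_acl(acl_level: int) -> ControlProtocol:
    """Pick a protocol matched to the model's evaluated capability limitations.

    The numbers below are illustrative placeholders only.
    """
    if acl_level <= 0:
        # ACL-0: models incapable of competent scheming; lightweight oversight suffices.
        return ControlProtocol(audit_fraction=0.01, monitor_reasoning=False,
                               allow_autonomous_tools=True)
    if acl_level <= 4:
        # Intermediate levels (placeholder): scale the audit budget with capability,
        # keep reasoning monitoring on, and restrict autonomous use of sensitive tools.
        return ControlProtocol(audit_fraction=0.05 * acl_level, monitor_reasoning=True,
                               allow_autonomous_tools=False)
    # ACL-5: the paper argues existing control approaches would not suffice here.
    raise NotImplementedError("No adequate control protocol is known at this level")
```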

More research is needed

Our paper provides an early sketch of how AI control measures can scale to increasingly powerful systems, but more research is needed, particularly in the following areas:

  • Mapping of model capabilities to control evaluations: detailed justifications of the evaluations needed for systems of different capabilities that go beyond our early outline.
  • Developing realistic control evaluations: realistic evaluations that can be used to estimate deployment security and iterate on protocol design.
  • Designing control protocols: procedures that help achieve the best balance between usefulness and safety of increasingly powerful systems.

The field of AI control is progressing quickly and at least one frontier model developer has already outlined a commitment to develop evaluations focused on intentional scheming capabilities.1 Recent AISI contributions to this agenda include:

  • A sketch of an AI control safety case: our paper outlining a structured evidence-based argument for the procedures necessary to securely deploy roughly human level systems in the presence of scheming.
  • ControlArena: a growing suite of realistic environments for development and evaluation of control protocols.

You can be part of this work: consider applying to our Control team! We’re hiring research scientists and engineers.

1 Google DeepMind's Frontier Safety Framework Version 2.0, updated in February 2025, outlines a commitment to "detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control."