As a complement to our empirical evaluations of frontier AI models, AISI is planning a series of collaborations and research projects sketching safety cases for more advanced models than exist today, focusing on risks from loss of control and autonomy. By a safety case, we mean a structured argument that an AI system is safe within a particular training or deployment context.
The benefits and risks posed by frontier AI models change over time as performance increases, requiring corresponding changes to how we mitigate those risks and to the evidence required to trust those mitigations. Existing models lack the capabilities required to pose certain severe risks, such as irreversible loss of human control, so arguments based on capabilities evaluations suffice. If future models are much more capable, different types of arguments and evidence may be required to understand whether such risks are present.
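To make this concrete, here is a minimal Python sketch of the skeleton of such a capability-based “inability” argument, in which a risk is argued to be absent because the model scores below pre-registered thresholds on the relevant dangerous-capability evaluations. The evaluation names and thresholds are hypothetical placeholders of our own, not real evaluations or decision rules.

```python
# A minimal sketch of an "inability" argument. The evaluation names and
# thresholds below are hypothetical placeholders, not real evaluations.

DANGEROUS_CAPABILITY_THRESHOLDS = {
    # evaluation name -> maximum score compatible with the "incapable" claim
    "autonomous_replication_suite": 0.2,
    "self_exfiltration_suite": 0.1,
}


def inability_argument_holds(eval_scores: dict[str, float]) -> bool:
    """Return True if every dangerous-capability score is below its threshold.

    This is only the skeleton of the argument: a full safety case would also
    need evidence that the evaluations are valid proxies for the risk and
    that the model was not under-elicited when they were run.
    """
    return all(
        eval_scores.get(name, float("inf")) < threshold
        for name, threshold in DANGEROUS_CAPABILITY_THRESHOLDS.items()
    )


# Example: scores comfortably below both thresholds, so the argument holds.
print(inability_argument_holds({"autonomous_replication_suite": 0.05,
                                "self_exfiltration_suite": 0.02}))  # True
```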
AI is a fast-moving field, with different developers taking different approaches to the safety of frontier AI models. One way to accommodate this is for AI developers to build safety cases tailored to their specific technical and deployment contexts. The UK Ministry of Defence’s Defence Standard 00-56, for example, defines a safety case as a structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given operating environment.
Two frontier AI developers have highlighted safety cases as part of their approaches to high-capability models: Anthropic includes “affirmative cases” in their sketch of ASL-4 policies, and Google DeepMind’s Deployment Mitigation Level 2 is “safety case with red team validation”. We view this as a positive development and believe that safety cases are an important (but not the only) part of a developer’s overall approach to AI safety.
We expect safety cases for AI systems to combine a variety of types of evidence, including empirical, conceptual, and mathematical arguments for why methods work, negative evidence such as the failure of a well-incentivised red team to break safety methods, and sociotechnical evidence about the deployment context, potential harms, and organisation within the AI company.
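As a purely illustrative sketch of how such a structured argument might be represented, the Python fragment below tags evidence items by type and attaches them to claims. The class and field names, and the example claim and evidence, are our own assumptions rather than any established safety case notation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class EvidenceType(Enum):
    EMPIRICAL = auto()       # e.g. evaluation results for a safety method
    CONCEPTUAL = auto()      # e.g. arguments for why a method should work
    MATHEMATICAL = auto()    # e.g. proofs or formal guarantees
    NEGATIVE = auto()        # e.g. a well-incentivised red team failing to break a method
    SOCIOTECHNICAL = auto()  # e.g. deployment context, potential harms, organisational controls


@dataclass
class Evidence:
    kind: EvidenceType
    description: str


@dataclass
class Claim:
    statement: str
    evidence: list[Evidence] = field(default_factory=list)
    subclaims: list["Claim"] = field(default_factory=list)


# A top-level claim supported by a mix of evidence types.
top_level = Claim(
    statement="Deployment of the model does not enable irreversible loss of human control",
    evidence=[
        Evidence(EvidenceType.NEGATIVE, "Red team failed to subvert the control measures"),
        Evidence(EvidenceType.SOCIOTECHNICAL, "Deployment is limited to monitored environments"),
    ],
)
```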
However, since our understanding of AI safety is nascent, it is not yet possible to build full safety cases that scale to risks posed by models significantly more advanced than those of today, such as risks due to loss of control or high levels of autonomy. Experience with safety cases for less capable models is helpful but insufficient, as many details will be different. Despite this, we believe we can significantly grow our understanding of what safety cases might look like by building safety case sketches that detail the arguments and evidence we expect for specific safety methods. These sketches could form indicative building blocks for full safety cases in the future.
Our goal is to act as a hub for research on safety case sketches, combining our own internal research with external collaborations across AI developers, academia, and civil society.
We believe it is crucial to start now, significantly ahead of when we can write full safety cases. Mapping the types of evidence and open research problems for different mitigation strategies will inform prioritisation and planning for models yet to arrive, including where additional research effort is required. Even if the details of mitigation strategies change, we expect many lessons about safety cases for AI to transfer, leaving us better prepared for decisions in the future.
Relative to existing industries that use safety cases, frontier AI involves both unique characteristics and a higher level of uncertainty due to a lack of established theory and practice. Obstacles include:
These obstacles mean we should not expect safety case sketches in 2024 to be high-confidence: our goal is to map the space of uncertainties, open problems, and current techniques, and learn lessons that will transfer to the frontier AI systems of the future even as specifics change.
Despite the uncertainties around full safety cases for frontier AI, we believe it is crucial for safety case sketches to go into the details of how a method might work and what evidence might be used, even if that requires significant guessing about those details. Detailed sketches of part of a safety case will be easier to inspect for holes than high-level arguments, and will better inform downstream improvements or estimates of success. However, the guessed details mean such a safety case sketch would be less generally applicable than the safety case templates previously applied to autonomous AI systems, and we expect a combination of the two approaches is ideal to cover both breadth and depth.
For example, a safety case for AI control might use a red team to attempt to break a control system used to contain an AI model. This involves many choices: specifying the deployment environment of the model, the types of control mechanisms used, and the limitations imposed on the red team, such as whether it can iteratively jailbreak the model. We believe it will be important in future safety case work to specify these parameters, to make analyses and countercases more concrete.
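As an illustration of the level of specificity we have in mind, the sketch below lists the kind of parameters such a control evaluation might need to pin down. The field names, mechanisms, and default values are hypothetical choices for illustration only, not a recommended protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ControlEvalSpec:
    """Hypothetical parameters a control-focused safety case sketch might specify."""

    # Deployment environment of the untrusted model
    deployment_environment: str
    allowed_tools: list[str] = field(default_factory=list)

    # Control mechanisms applied to the model's actions
    control_mechanisms: list[str] = field(
        default_factory=lambda: ["trusted monitoring", "human audit of flagged actions"]
    )
    audit_budget_fraction: float = 0.02  # fraction of actions routed to human audit

    # Affordances granted to the red team attacking the control measures
    red_team_can_finetune_model: bool = False
    red_team_can_iteratively_jailbreak: bool = True
    red_team_attempt_budget: int = 1000


# Example: a sandboxed coding agent with Python and shell tools enabled.
spec = ControlEvalSpec(
    deployment_environment="sandboxed code-execution agent",
    allowed_tools=["python", "shell"],
)
```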
Where possible, our goal is to publish safety case sketches that can be inspected and built upon by anyone, including AI companies, academics, and civil society. Where collaborations are more sensitive, our goal will be to facilitate evaluation and inspection by a variety of parties with different expertise, and to lift out components and lessons suitable for wider publication.
The need to cover a diversity of methods and get into the details implies a lot of work! We have started research collaborations on safety case sketches with two frontier AI labs, as well as with Apollo Research on interpretability-based evaluations, Redwood Research on AI control and associated threat models, the Centre for the Governance of AI on different types of evidence in AI safety cases, and another organisation on automated AI safety. We will be adding other collaborations, including on guaranteed safe AI, and are talking with a variety of researchers in system safety and related fields such as forecasting.
And we are hiring! We are starting with ML Research Scientists to work on the strategy, structure, and details of safety case sketches, and will be building up capacity in other sociotechnical aspects of safety cases as well. Please join us!