How can safety cases be used to help with frontier AI safety?

Our new papers show how safety cases can help AI developers turn plans in their safety frameworks into action

Marie Buhl, Benjamin Hilton, Tammy Masterson, Geoffrey Irving

Safety frameworks have become standard practice amongst frontier AI developers. In them, developers outline key risks, how they’ll measure them, and the steps they’ll take to mitigate them. But this is no easy task, particularly when the risks are novel, fast-changing, and hard to pin down.

Over 11 frameworks have now been released, and counting, spurring a wave of research on how to write, implement, and refine them. Two of our latest papers contribute to that conversation. The first provides developers with an overview of emerging practices in safety frameworks. The second proposes a method to help implement them: safety cases.

In this blog, we explain what safety cases are and how they can assist AI developers in determining whether an AI system meets the safety thresholds outlined in their safety framework.

What are safety cases?

Effectively implementing safety frameworks means demonstrating that an AI system is safe. To do that, three things are needed:

  1. A precise claim explaining what is meant by ‘safe’
  2. Evidence
  3. An argument linking the two

Safety cases are a widely used technique that brings all three of these together into a single clear, assessable argument (Favaro et al., 2023; Sujan et al., 2016; Bloomfield et al., 2012; Inge, 2007).
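As a rough illustration (not something from the paper), these three components can be thought of as a simple data structure. The class and field names below are hypothetical, chosen only to make the claim–evidence–argument structure concrete:

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """A single piece of evidence, e.g. the results of a capability evaluation."""
    source: str   # e.g. a named evaluation or audit (illustrative)
    summary: str  # what the evidence shows


@dataclass
class SafetyCase:
    """Minimal sketch: a precise claim, supporting evidence, and the argument linking them."""
    claim: str                                        # precise statement of what 'safe' means here
    evidence: list[Evidence] = field(default_factory=list)
    argument: str = ""                                # reasoning that the evidence supports the claim

    def is_complete(self) -> bool:
        # A case is only assessable once all three components are present.
        return bool(self.claim) and bool(self.evidence) and bool(self.argument)


# Hypothetical example, for illustration only.
case = SafetyCase(
    claim="The deployed system cannot meaningfully assist with a specified class of attacks.",
    evidence=[Evidence(source="capability evaluation", summary="No significant uplift over baseline tools.")],
    argument="If elicitation was adequate, the evaluation results bound the system's capability below the threshold of concern.",
)
assert case.is_complete()
```

In a real safety case the argument would itself be structured, with sub-claims and explicit assumptions, rather than a single string.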

We’ve previously written about why we’re working on safety cases at AISI, and what safety cases for frontier AI systems might look like. Our new paper looks at how safety cases can be used for frontier AI and why developers might find them useful.

The first few claims of a structured argument that could help form part of a safety case, from our previous paper on ‘inability’ arguments.

How can safety cases be used?

Safety cases can be used to inform organisational decision-making on the safety of frontier AI systems:

Diagram showing the use of a safety case. A writer sends a safety case to a red-team, who either rejects and provides feedback to the writer, or approves and sends to a decision-maker. The decision-maker either rejects and provides feedback to the writer, or approves to make a decision.
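To make the flow in the diagram concrete, here is a hedged sketch of that review loop in Python. The function signatures and the simple approve/reject return values are illustrative assumptions, not an API or process defined in the paper:

```python
from typing import Callable, Tuple

# Each reviewer takes the current safety case text and returns (approved, feedback).
# These signatures are illustrative only.
Review = Callable[[str], Tuple[bool, str]]


def run_review_loop(write_case: Callable[[str], str],
                    red_team: Review,
                    decision_maker: Review,
                    max_rounds: int = 5) -> bool:
    """Writer -> red-team -> decision-maker loop from the diagram above.

    Rejection at either stage sends feedback back to the writer; approval by
    the decision-maker ends the loop with a decision to proceed.
    """
    feedback = ""
    for _ in range(max_rounds):
        case = write_case(feedback)

        approved, feedback = red_team(case)
        if not approved:
            continue  # red-team feedback goes back to the writer

        approved, feedback = decision_maker(case)
        if approved:
            return True  # decision-maker accepts the safety case
    return False  # no acceptable case produced within the allotted rounds
```

In practice each of these steps is a human (or organisational) review rather than a function call; the sketch is only meant to show how rejection at either stage routes feedback back to the writer.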

They are broadly useful whenever decisions about safety are being made. At the moment, they are likely most useful for internal company decision-making. In the future, we can imagine safety cases being shared with third parties, and published in some format, much like model cards and capability evaluations are today.

Let’s look at some examples:

  • An AI developer could write a safety case to make sure that an upcoming system meets the commitments set out in their safety framework. They could send the safety case to a third party for red-teaming.
  • An engineering team might be deciding whether to fine-tune a model to refuse to answer requests about how to steal money from a bank. If the developer maintains an evolving safety case throughout the development cycle, the engineering team could refer to it to help inform their decision.
  • An incident response team might be responding to a jailbreak that has been posted to social media. This possibility may be covered by a safety case; the response team could review the safety case, written when the system was made public, and use it to decide whether existing safeguards are sufficient or whether they need to be improved.

Safety frameworks typically specify conditions for the safe development and deployment of frontier AI systems, based on a system’s capabilities and the safety measures implemented. Safety cases can help developers test a particular system against these conditions. They thereby complement safety frameworks, which set out broad policies and principles that apply across systems at the organisational level, with system-specific analysis. In the paper, we look in more detail at how safety cases can contribute to fulfilling the commitments made in safety frameworks. This builds on work outlined in our paper on emerging practices in safety frameworks.
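As a hedged illustration of how system-specific analysis might be checked against framework-level conditions, consider the sketch below. The risk names and threshold values are invented for the example and do not come from any real safety framework:

```python
# Framework-level commitments: capability thresholds that apply across all systems.
# Names and numbers are invented for illustration.
FRAMEWORK_THRESHOLDS = {
    "cyber_offense": 0.30,  # max tolerated score on a cyber-offense evaluation
    "bio_uplift": 0.20,     # max tolerated score on a biological-uplift evaluation
}


def commitments_breached(system_eval_scores: dict[str, float]) -> list[str]:
    """Return the framework commitments this particular system appears to breach.

    A safety case for the system would then need to argue either that the
    scores are trustworthy and below threshold, or that mitigations bring
    residual risk within the framework's limits.
    """
    return [
        risk for risk, limit in FRAMEWORK_THRESHOLDS.items()
        if system_eval_scores.get(risk, float("inf")) > limit
    ]


breaches = commitments_breached({"cyber_offense": 0.12, "bio_uplift": 0.25})
print(breaches)  # ['bio_uplift'] -> this system needs mitigations or a stronger argument
```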

More research is needed

We don’t yet know how to write robust arguments that frontier AI systems are safe – and this means that we can’t yet write full and correct safety cases.

There’s a whole host of open problems to solve before we reach that stage, both on methodology and on substance. For example, we currently don’t know:

  • How the top-level claim in a safety case should be specified
  • Which safety case notation schemes work best for frontier AI
  • How to quantify our confidence in various arguments
  • How generalisable arguments and evaluations are

There are also technical machine learning questions that come up time and time again when sketching safety cases. For example, how much can we rely on capability evaluations? Are we correctly eliciting capabilities? Could future models sandbag evaluations?

We’re optimistic that we can solve many of these problems by writing safety case sketches (our best guesses about how to write an argument for a particular system) and safety case templates (rough arguments that can be filled in for a particular system).

To learn more – including more details about how safety cases can be used, why we’re excited about them, and a more detailed list of open problems – take a look at the paper.