As a complement to our empirical evaluations of frontier AI models, AISI is planning a series of collaborations and research projects sketching safety cases for more advanced models than exist today, focusing on risks from loss of control and autonomy. By a safety case, we mean a structured argument that an AI system is safe within a particular training or deployment context.
The benefits and risks posed by frontier AI models change over time as performance increases, requiring corresponding changes to how we mitigate those risks and to the evidence required to trust those mitigations. Existing models lack the capabilities required to pose certain severe risks, such as irreversible loss of human control, so arguments based on capabilities evaluations suffice. If future models are much more capable, different types of arguments and evidence may be required to understand whether such risks are present.
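To make this concrete, here is a minimal Python sketch of the skeleton of such a capability-based “inability” argument, in which a risk is argued to be absent because the model scores below pre-registered thresholds on the relevant dangerous-capability evaluations. The evaluation names and thresholds are hypothetical placeholders of our own, not real evaluations or decision rules.

```python
# A minimal sketch of an "inability" argument. The evaluation names and
# thresholds below are hypothetical placeholders, not real evaluations.

DANGEROUS_CAPABILITY_THRESHOLDS = {
    # evaluation name -> maximum score compatible with the "incapable" claim
    "autonomous_replication_suite": 0.2,
    "self_exfiltration_suite": 0.1,
}


def inability_argument_holds(eval_scores: dict[str, float]) -> bool:
    """Return True if every dangerous-capability score is below its threshold.

    This is only the skeleton of the argument: a full safety case would also
    need evidence that the evaluations are valid proxies for the risk and
    that the model was not under-elicited when they were run.
    """
    return all(
        eval_scores.get(name, float("inf")) < threshold
        for name, threshold in DANGEROUS_CAPABILITY_THRESHOLDS.items()
    )


# Example: scores comfortably below both thresholds, so the argument holds.
print(inability_argument_holds({"autonomous_replication_suite": 0.05,
                                "self_exfiltration_suite": 0.02}))  # True
```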
AI is a fast-moving field, with different developers taking different approaches to the safety of frontier AI models. One way to accommodate this is for AI developers to build safety cases tailored to their specific technical and deployment contexts. The UK Ministry of Defence’s Defence Standard 00-56, for example, defines a safety case as a structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given operating environment.
Two frontier AI developers have highlighted safety cases as part of their approaches to high-capability models: Anthropic includes “affirmative cases” in their sketch of ASL-4 policies, and Google DeepMind’s Deployment Mitigation Level 2 is “safety case with red team validation”. We view this as a positive development and believe that safety cases are an important (but not the only) part of a developer’s overall approach to AI safety.
We expect safety cases for AI systems to combine a variety of types of evidence, including empirical, conceptual, and mathematical arguments for why methods work, negative evidence such as the failure of a well-incentivised red team to break safety methods, and sociotechnical evidence about the deployment context, potential harms, and organisation within the AI company.
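As a purely illustrative sketch of how such a structured argument might be represented, the Python fragment below tags evidence items by type and attaches them to claims. The class and field names, and the example claim and evidence, are our own assumptions rather than any established safety case notation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class EvidenceType(Enum):
    EMPIRICAL = auto()       # e.g. evaluation results for a safety method
    CONCEPTUAL = auto()      # e.g. arguments for why a method should work
    MATHEMATICAL = auto()    # e.g. proofs or formal guarantees
    NEGATIVE = auto()        # e.g. a well-incentivised red team failing to break a method
    SOCIOTECHNICAL = auto()  # e.g. deployment context, potential harms, organisational controls


@dataclass
class Evidence:
    kind: EvidenceType
    description: str


@dataclass
class Claim:
    statement: str
    evidence: list[Evidence] = field(default_factory=list)
    subclaims: list["Claim"] = field(default_factory=list)


# A top-level claim supported by a mix of evidence types.
top_level = Claim(
    statement="Deployment of the model does not enable irreversible loss of human control",
    evidence=[
        Evidence(EvidenceType.NEGATIVE, "Red team failed to subvert the control measures"),
        Evidence(EvidenceType.SOCIOTECHNICAL, "Deployment is limited to monitored environments"),
    ],
)
```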
However, since our understanding of AI safety is nascent, it is not yet possible to build full safety cases that scale to risks posed by models significantly more advanced than those of today, such as risks due to loss of control or high levels of autonomy. Experience with safety cases for less capable models is helpful but insufficient, as many details will be different. Despite this, we believe we can significantly grow our understanding of what safety cases might look like by building safety case sketches that detail the arguments and evidence we expect for specific safety methods. These sketches could form indicative building blocks for full safety cases in the future.
Our goal is to act as a hub for research on safety case sketches, combining our own internal research with external collaborations across AI developers, academia, and civil society.
We believe it is crucial to start now, significantly ahead of when we can write full safety cases. Mapping the types of evidence and open research problems for different mitigation strategies will inform prioritisation and planning for models yet to arrive, including where additional research effort is required. Even if the details of mitigation strategies change, we expect many lessons about safety cases for AI to transfer, leaving us better prepared for decisions in the future.
Relative to existing industries that use safety cases, frontier AI involves both unique characteristics and a higher level of uncertainty due to a lack of established theory and practice. Obstacles include:
These obstacles mean we should not expect safety case sketches in 2024 to be high-confidence: our goal is to map the space of uncertainties, open problems, and current techniques, and learn lessons that will transfer to the frontier AI systems of the future even as specifics change.
Despite the uncertainties around full safety cases for frontier AI, we believe it is crucial for safety case sketches to go into the details of how a method might work and what evidence might be used, even if that requires significant guessing about those details. Detailed sketches of part of a safety case will be easier to inspect for holes than high-level arguments, and will better inform downstream improvements or estimates of success. However, the guessed details mean such a safety case sketch would be less generally applicable than the safety case templates previously applied to autonomous AI systems, and we expect a combination of the two approaches is ideal to cover both breadth and depth.
For example, a safety case for AI control might use a red team to attempt to break a control system used to contain an AI model. This involves many choices: specifying the deployment environment of the model, the types of control mechanisms used, and the limitations imposed on the red team, such as whether it can iteratively jailbreak the model. We believe it will be important in future safety case work to specify these parameters, to make analyses and countercases more concrete.
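As an illustration of the level of specificity we have in mind, the sketch below lists the kind of parameters such a control evaluation might need to pin down. The field names, mechanisms, and default values are hypothetical choices for illustration only, not a recommended protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ControlEvalSpec:
    """Hypothetical parameters a control-focused safety case sketch might specify."""

    # Deployment environment of the untrusted model
    deployment_environment: str
    allowed_tools: list[str] = field(default_factory=list)

    # Control mechanisms applied to the model's actions
    control_mechanisms: list[str] = field(
        default_factory=lambda: ["trusted monitoring", "human audit of flagged actions"]
    )
    audit_budget_fraction: float = 0.02  # fraction of actions routed to human audit

    # Affordances granted to the red team attacking the control measures
    red_team_can_finetune_model: bool = False
    red_team_can_iteratively_jailbreak: bool = True
    red_team_attempt_budget: int = 1000


# Example: a sandboxed coding agent with Python and shell tools enabled.
spec = ControlEvalSpec(
    deployment_environment="sandboxed code-execution agent",
    allowed_tools=["python", "shell"],
)
```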
Where possible, our goal is to publish safety case sketches that can be inspected and built upon by anyone, including AI companies, academics, and civil society. Where collaborations are more sensitive, our goal will be to facilitate evaluation and inspection by a variety of parties with different expertise, and to lift out components and lessons suitable for wider publication.
The need to cover a diversity of methods and get into the details implies a lot of work! We have started research collaborations on safety case sketches with two frontier AI labs, as well as with Apollo Research on interpretability-based evaluations, Redwood Research on AI control and associated threat models, the Centre for the Governance of AI on different types of evidence in AI safety cases, and another organisation on automated AI safety. We will be adding other collaborations, including on guaranteed safe AI, and are talking with a variety of researchers in system safety and related fields such as forecasting.
And we are hiring! We are starting with ML Research Scientists to work on the strategy, structure, and details of safety case sketches, and will be building up capacity in other sociotechnical aspects of safety cases as well. Please join us!