
Principles for Safeguard Evaluation

Our new paper proposes core principles for evaluating misuse safeguards

Safeguards Analysis Team

At the AI Safety Institute, alongside evaluating model capabilities, we’ve been evaluating misuse safeguards: technical interventions implemented by frontier AI developers to prevent users from eliciting harmful information or actions from models. We expect safeguards to become increasingly important as AI capabilities advance, and we’re committed to strengthening both our own and others’ ability to evaluate them rigorously. As with many areas of machine learning, we believe that establishing clear problem statements and evaluation frameworks will help drive progress.

Our new paper Principles for Evaluating Misuse Safeguards of Frontier AI Systems proposes a core set of recommended best practices to help inform how frontier AI safeguards are measured. To make it easy for frontier AI developers to use these recommendations, we have created a Template for Evaluating Misuse Safeguards of Frontier AI Systems, which draws on these principles to provide a list of concrete and actionable questions to guide effective safeguards evaluation. Our work draws on our experience evaluating the safeguards of a range of frontier AI systems in both pre- and post-deployment tests (e.g. Claude 3.5 Sonnet and our May update).

We hope these resources will help to drive standardisation in how safeguard evaluations are performed by developers, and in how they are presented, both internally and to third parties such as collaborators and evaluators. Collaboration is much easier with established frameworks, measurements, and definitions, and these principles are a step in that direction.

We engaged with frontier AI developers and other organisations in the safeguards space to help develop these principles. However, both safeguards and safeguard evaluations are rapidly evolving, and we expect to update these resources as the field advances. We encourage organisations to use our framework and share feedback on how it can be improved so the community moves towards standardised and rigorous safeguards evaluations.  

A five-step process for evaluating misuse safeguards

In this post, we outline our proposed five-step process for safeguards evaluations. This process is designed to be generically useful across threat models, as well as adaptable to changes in the risk landscape.

State Safeguard Requirements

The first step recommends that frontier AI developers clearly state what requirements they are aiming for their safeguards to satisfy. These requirements can be derived from company safety frameworks, commitments or usage policies by transforming statements such as “users are not allowed to use the model to perform malicious cyber attacks” into requirements such as “safeguards must prevent users from being able to use the model to perform malicious cyber attacks”. These requirements should also specify the threat actors being considered and any assumptions being made.
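As a rough illustration of what a structured requirement might look like, the sketch below records a requirement alongside its source policy statement, the threat actors considered, and the assumptions made. The field names and example values are our own for illustration; they are not a format specified in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SafeguardRequirement:
    """One safeguard requirement derived from a usage-policy statement (illustrative only)."""
    policy_statement: str                                    # the original usage-policy clause
    requirement: str                                         # the testable requirement derived from it
    threat_actors: list[str] = field(default_factory=list)   # e.g. "low-resource individual"
    assumptions: list[str] = field(default_factory=list)     # e.g. "API access only, no weights access"

# Example: transforming the policy statement quoted above into a requirement.
cyber_requirement = SafeguardRequirement(
    policy_statement="Users are not allowed to use the model to perform malicious cyber attacks.",
    requirement="Safeguards must prevent users from being able to use the model to perform malicious cyber attacks.",
    threat_actors=["low-resource individual with API access"],
    assumptions=["No access to model weights", "Public jailbreaks may be used"],
)
```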

Establish a Safeguards Plan

In this step, we suggest developers list and describe the set of safeguards they plan to use in the deployed system. These could be System Safeguards (to prevent threat actors from eliciting harmful behaviour from the system, assuming they have access to it); Access Safeguards (to prevent threat actors from accessing the system entirely); and Maintenance Safeguards (to ensure access or system safeguards maintain their effectiveness over time). The paper does not cover security safeguards that prevent model theft. Some safeguard details can be commercially sensitive; in these cases, not all details would need to be shared with all parties involved.
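The sketch below shows one way a safeguards plan could be recorded against these three categories, including a flag for commercially sensitive details. The specific safeguards listed are hypothetical examples of ours, not recommendations from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class SafeguardType(Enum):
    SYSTEM = "system"            # limits harmful behaviour for actors who already have access
    ACCESS = "access"            # limits who can access the system at all
    MAINTENANCE = "maintenance"  # keeps the other safeguards effective over time

@dataclass
class PlannedSafeguard:
    name: str
    type: SafeguardType
    description: str
    commercially_sensitive: bool = False  # flag details that cannot be shared with every party

# A hypothetical plan; a real plan would describe each safeguard in far more detail.
safeguards_plan = [
    PlannedSafeguard("Refusal training", SafeguardType.SYSTEM,
                     "Fine-tuning the model to refuse requests that violate the usage policy."),
    PlannedSafeguard("Output filtering", SafeguardType.SYSTEM,
                     "A classifier applied to model outputs before they are returned to the user."),
    PlannedSafeguard("Account verification", SafeguardType.ACCESS,
                     "Identity checks for API tiers with fewer restrictions.",
                     commercially_sensitive=True),
    PlannedSafeguard("Jailbreak monitoring", SafeguardType.MAINTENANCE,
                     "Ongoing review of flagged traffic to detect new jailbreak techniques."),
]
```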

Document Evidence of Sufficiency

Next, we recommend developers gather and document evidence about the effectiveness of their safeguards. There are a variety of types of evidence that could be gathered here, including but not limited to: results of red-teaming exercises against the safeguards (either all safeguards combined or individual components); static evaluations of safeguard behaviour on existing datasets; or automated evaluation of robustness using AI techniques. The paper details best practices for these and other evidence types. An important part of this step is using third parties either to gather the evidence (e.g. bug bounty platforms or red-teaming organisations) or to assess evidence gathered by the developers themselves.
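As one hedged example of the static-evaluation evidence type, the sketch below measures how often a safeguarded system refuses a fixed set of harmful prompts. The function names and the refusal-judge interface are assumptions for illustration; the paper does not prescribe a particular implementation.

```python
from typing import Callable

def static_safeguard_eval(
    prompts: list[str],
    query_system: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> float:
    """Return the fraction of harmful prompts that the safeguarded system refuses.

    `query_system` sends a prompt to the deployed system with safeguards applied;
    `is_refusal` is a judge (rule-based or model-based) that labels each response.
    """
    refused = sum(is_refusal(query_system(prompt)) for prompt in prompts)
    return refused / len(prompts)

# Usage (with placeholder implementations of the two callables):
# refusal_rate = static_safeguard_eval(harmful_prompts, call_deployed_system, refusal_judge)
# print(f"Refusal rate on the static harmful-prompt set: {refusal_rate:.1%}")
```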

Establish a Plan for Regular Assessment

Even if the safeguards are assessed to be sufficiently robust at the time of deployment, this can change over time, for example as new jailbreaks emerge. As such, we recommend developers regularly reassess their safeguards, improving their techniques as new best practices develop, to ensure they continue to satisfy the safeguard requirements.
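A regular assessment plan could be made concrete with simple triggers, for instance a routine cadence plus checks for degraded performance or newly reported jailbreaks. The cadence and threshold below are purely illustrative assumptions, not recommendations from the paper.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ReassessmentPolicy:
    """Illustrative triggers for re-running a safeguards evaluation."""
    max_interval: timedelta = timedelta(days=90)  # routine re-assessment cadence (illustrative)
    min_refusal_rate: float = 0.99                # threshold on a static evaluation (illustrative)

    def needs_reassessment(self, last_assessed: date, current_refusal_rate: float,
                           new_jailbreak_reported: bool) -> bool:
        overdue = date.today() - last_assessed > self.max_interval
        degraded = current_refusal_rate < self.min_refusal_rate
        return overdue or degraded or new_jailbreak_reported
```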

Decide Whether the Evidence and Assessment Plan are Sufficient

Finally, in this step, we recommend developers combine all the evidence gathered with the regular assessment plan to produce a justification that the safeguards satisfy the requirements and will continue to do so over time. We also see value in sharing this justification with relevant third parties for review and critique, and in publishing potentially redacted versions to enable transparency around safeguards assessment.
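As a final illustrative sketch, the gathered evidence and the re-assessment plan could be rolled up into a per-requirement summary that feeds this sufficiency decision. The data layout here is an assumption of ours, not a format specified in the paper or template.

```python
def sufficiency_summary(evidence_by_requirement: dict[str, list[dict]],
                        reassessment_interval_days: int) -> dict[str, dict]:
    """Roll evidence and the re-assessment plan into a per-requirement summary."""
    summary = {}
    for requirement, evidence in evidence_by_requirement.items():
        summary[requirement] = {
            # a requirement only counts as supported if evidence exists and all of it passed
            "supported": bool(evidence) and all(item["passed"] for item in evidence),
            "evidence_sources": [item["source"] for item in evidence],
            "ongoing_assurance": f"re-assess at least every {reassessment_interval_days} days",
        }
    return summary
```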

We hope these principles will enable more rigorous and standardised evaluation of safeguard robustness. This will help drive progress in developing more effective safeguards. We encourage frontier AI developers to use our framework and template, and to share their experience and feedback so we can improve this process and move towards standardised and high-quality safeguards evaluations.

You can find the Principles here and the Template here.