The AI Safety Institute's (AISI) mission is to equip governments with an empirical understanding of the safety of frontier AI systems. To achieve this, we design and run evaluations that aim to measure the capabilities of AI systems that may pose risks.
Since November 2023, we have built one of the largest safety evaluation teams globally, led by world-class researchers, established a Memorandum of Understanding (MoU) with the US AISI, and begun to operationalise joint testing. We have also published an open-source framework for evaluations, reported the results of our testing of four large language models from major companies, and conducted several pre-deployment tests.
But this is just the beginning. Model evaluations are a new and rapidly evolving science, and we routinely iterate on our approach. We will continue to grow our team and build our capacity to address challenges and open scientific questions. If you are excited by this, please join us!
This blog sets out our thinking to date on how to design and run third-party evaluations, including key elements to consider and open questions. This is not intended to provide robust recommendations; rather we want to start a conversation in the open about these practices and to learn from others.
We discuss the role of third-party evaluators and what they could target for testing, including which systems to test, when to test them, and which risks and capabilities to test for. We also examine how to evaluate effectively, including which tests to use for which purpose, how to develop robust testing, and how to ensure safety and security.
Independent evaluations of frontier AI systems can provide a snapshot of a given system’s capabilities and vulnerabilities. Whilst our sense is that the science is too nascent for independent evaluations to serve a ‘certification’ function (i.e. to provide confident assurances that a particular system is ‘safe’), they are a critical part of incentivising best efforts at improving safety. Independent evaluations – completed by governments or other third parties – provide key benefits to AI companies, governments, and the public, including:
Independent evaluations completed by state bodies provide additional benefits to governments:
Taken together, these benefits of independent evaluations are critical to the safe, secure, and trustworthy development and use of AI systems – both today and in future.
AISI’s mission is to develop an empirical understanding of frontier AI safety. In our first year, our focus has been on those AI systems that we believe will provide the best insights on the frontier of AI and how it is developing.
In practice, this means we have focused on evaluating systems that are likely to match or surpass the current state of the art (SOTA) in general-purpose capabilities. One way to assess this is to use ‘scaling laws’ as a proxy for identifying such systems: these suggest a strong relationship between the amount of compute used to train a system and the capabilities of the resulting system.
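As a toy illustration of what a compute-based proxy could look like in practice, the sketch below flags systems whose estimated training compute is close to or above that of the current state-of-the-art training run. The flagging rule, the order-of-magnitude margin, and the FLOP figures are all hypothetical, not AISI's actual criteria.

```python
# Toy illustration (not AISI's actual criterion): flag a system for tracking if its
# estimated training compute is within an order of magnitude of, or exceeds, the
# compute of the current state-of-the-art training run. All numbers are hypothetical.

def worth_tracking(training_flop: float, sota_training_flop: float, margin: float = 10.0) -> bool:
    """Return True if a system's training compute suggests it may approach or surpass SOTA."""
    return training_flop >= sota_training_flop / margin

# Example with made-up figures: a 5e25 FLOP training run vs a 1e26 FLOP state of the art.
print(worth_tracking(training_flop=5e25, sota_training_flop=1e26))  # True
```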
We know this proxy is not perfect and could miss other systems that may pose risks. In future, we aim to develop a robust methodology for determining which systems could produce concerning capability jumps and thus warrant tracking and evaluation.
There is also a decision about when to test a system. We now work with companies to conduct pre-deployment testing, since deployment constitutes a step change in prospective risk exposure, alongside post-deployment testing to monitor performance ‘in the wild’ and to create a baseline against which to compare new systems.
A more sophisticated approach would test at points across the full AI life cycle where we expect risk to increase meaningfully enough to warrant monitoring or prioritisation, such as:
AISI currently prioritises the following areas in our evaluations:
Our general heuristic for choosing these areas has been to focus on critical risks with the greatest potential for harm. The recently published interim International Scientific Report on the Safety of Advanced AI provides a comprehensive overview of the risks and capabilities of frontier AI systems, helping us to identify which risks to classify as critical for evaluation. We have also focused on areas where we think government can add value because of its specific expertise, such as national security domains, and on areas that do not duplicate efforts elsewhere. We regularly consider whether to change or add to our evaluation suite, and we have expanded our focus over the last year; for example, our societal impacts team now considers the potential for AI systems to enable criminal misuse and social destabilisation.
If we want to properly understand how AI systems may create risks, our evaluations need to be designed around specific pathways to harm. We need to understand clearly how the capabilities we measure, such as certain coding capabilities, could lead to real-world harm, such as assisting a cyberattack on critical infrastructure. We have therefore started mapping out these potential pathways to harm through a systematic process of risk modelling. Risk modelling provides a clearer understanding of the risks that need to be managed, but it can be challenging, both because frontier AI is novel and because its capabilities are often dual use. For example, general-purpose AI systems have a wide range of applications, making it difficult to anticipate all potential risks. Factors external to AI systems can also influence risks, which makes it difficult to assess risk by evaluating the systems alone. Finally, detailed risk modelling often requires highly sensitive information about specific paths to harm and about threat actors.
Despite these challenges, we are committed to doing this well. Without detailed and rigorous risk modelling we cannot translate evaluation results into actual assessments of risks.
A gold standard for frontier AI development and deployment decisions would include a comprehensive set of clearly defined risk and capability thresholds beyond which outcomes would likely be unacceptable unless appropriately mitigated. We think governments have a significant role to play here, as 27 countries and the EU recognised at the AI Seoul Summit 2024. We have thus begun work to identify capability thresholds: specific AI capabilities that are indicative of potentially severe risks, that could be tested for, and that should trigger certain actions to mitigate risk. These thresholds correspond to the pathways to harm in our risk modelling, such as capabilities that would remove current barriers for malicious actors or unlock new ways of causing harm.
As described above, we want our evaluations and risk models to be closely aligned. We are thus working to develop evaluations that clearly map onto all relevant pathways and capabilities. Our goal is to eventually operationalise each capability threshold with a set of evaluations and to define clearly, in advance, which evaluation results we would consider to cross the threshold. This will be an iterative process, with insights from evaluations informing our risk models and vice versa. While we work towards this longer-term ambition, our risk modelling already guides which evaluations we develop and provides a framework for interpreting results.
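As a purely hypothetical sketch of what an operationalised threshold could look like, the snippet below pairs a capability threshold with a set of evaluations and a pre-registered trigger condition. The threshold name, evaluation names, and trigger are illustrative placeholders, not AISI's actual thresholds or suites.

```python
# Hypothetical sketch: a capability threshold operationalised as a set of evaluations plus a
# trigger condition defined in advance. All names and conditions are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class CapabilityThreshold:
    name: str                                              # capability of concern, tied to a pathway to harm
    evaluations: list[str] = field(default_factory=list)   # evaluation suites operationalising the threshold
    trigger: str = ""                                       # pre-registered result treated as crossing it

example_threshold = CapabilityThreshold(
    name="meaningful uplift to a low-skill actor on a harmful task",
    evaluations=["automated-qa-suite", "agentic-task-suite", "expert-red-team"],
    trigger="success on the agentic-task-suite exceeds the skilled-human baseline",
)
```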
Examples of how we aim to tailor our evaluations to our risk models include:
There are multiple techniques available to evaluators, which can be used in isolation or in combination. AISI currently uses the following:
Red teaming provides a deep and tangible level of insight into model capabilities, but it is more resource-intensive than automated techniques, given the human effort required. Red teaming therefore often covers a narrower set of areas than automated tests, and each scenario is usually only tested once.
A key challenge is delivering rigorous evaluations to tight timelines, particularly pre-deployment. To address this, our current pilot approach uses distinct tiers of testing: if quicker, more automated evaluations produce concerning or surprising findings, this triggers more intensive evaluations. We think about our tiers as follows:
These tiers are a work in progress and represent our ideal end state. A key open scientific question is to what extent results from one tier can be used as proxies for higher tiers, potentially precluding the need to run the more expensive and lengthy evaluations – and precisely where this fails to hold, for example, where lower-tier tests produce false negatives or are designed with a lower capability bar than is concerning.
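To illustrate the escalation logic (rather than our actual tiers or thresholds), here is a minimal sketch in which cheaper tiers run first and more intensive tiers are only triggered by concerning results. Tier names, concern scores, and the escalation threshold are placeholders.

```python
# Hypothetical sketch of tiered triage: run cheaper automated tiers first and only escalate
# to more intensive testing when results look concerning. Names and thresholds are placeholders.
from typing import Callable

def run_tiered_evaluation(
    tiers: list[tuple[str, Callable[[], float]]],   # (tier name, function returning a concern score)
    escalation_threshold: float = 0.5,
) -> dict[str, float]:
    """Run tiers in order of increasing cost; stop escalating once results look benign."""
    results: dict[str, float] = {}
    for name, run_tier in tiers:
        score = run_tier()
        results[name] = score
        if score < escalation_threshold:
            break   # lower-tier results are unconcerning, so skip the more expensive tiers
    return results

# Example with stub tiers returning made-up concern scores.
results = run_tiered_evaluation([
    ("automated benchmarks", lambda: 0.7),   # concerning, so escalate
    ("agentic evaluations", lambda: 0.4),    # benign, so stop here
    ("expert red teaming", lambda: 0.9),     # never run in this example
])
print(results)  # {'automated benchmarks': 0.7, 'agentic evaluations': 0.4}
```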
We have also been developing the infrastructure and tooling required to run the various evaluations and tiers, to make them as efficient, insightful, and high quality as possible. Our tests are written using Inspect AI, an open-source model evaluation framework that we created both for internal use and for use across the broader evaluations ecosystem. Inspect provides state-of-the-art capabilities including flexible prompt engineering, multi-turn dialogue, agent scaffolds, and model grading.
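As an indication of what this looks like in practice, below is a minimal, illustrative Inspect task: a two-sample question-answer dataset answered by the model under test and scored by a grader model. The samples, system prompt, and model identifier are placeholders, and exact API details may differ between Inspect versions.

```python
# A minimal, illustrative Inspect task: a tiny QA dataset scored by a grader model.
# Sample contents and model names are placeholders; API details may vary across Inspect versions.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate, system_message
from inspect_ai.scorer import model_graded_fact

@task
def demo_qa():
    dataset = [
        Sample(
            input="In one sentence, what does a buffer overflow allow an attacker to do?",
            target="Overwrite adjacent memory, potentially hijacking control flow.",
        ),
        Sample(
            input="What is the purpose of a dev/test split in evaluation design?",
            target="To keep tasks used for elicitation separate from those used for measurement.",
        ),
    ]
    return Task(
        dataset=dataset,
        solver=[system_message("Answer concisely."), generate()],
        scorer=model_graded_fact(),  # a grader model checks answers against the target
    )

if __name__ == "__main__":
    # Run against a placeholder model identifier.
    eval(demo_qa(), model="openai/gpt-4o-mini")
```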
For all our tests we develop baselines, which are essential for interpreting empirical results and drawing conclusions about risk. Models are baselined against humans (either expert or non-expert, depending on the risk model being assessed) or against other state-of-the-art models.
We have developed these baselines by periodically running our automated and agent evaluation suites on sets of publicly available systems. A key question in establishing them is how much effort to spend on bespoke elicitation for each model given time constraints, and what implications this has for comparing results between systems and for our confidence in assessing the limits of capabilities.
Additionally, there are several design questions in selecting human users to establish baselines, such as: whether test subjects have internet access and how long they have to perform a task; whether test subjects are incentivised to perform well and whether we create checks to monitor attention; and what level of expertise the test subject has, in particular in relation to the risk model.
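As a simple illustration of how such baselines can be used, the sketch below compares a model's success rate on a task suite against a human baseline, with a basic confidence interval. The success counts are made-up numbers, not real results.

```python
# Minimal sketch of comparing a model's task success rate against a human baseline.
# The success counts below are illustrative placeholders, not real results.
from statistics import NormalDist

def success_rate_ci(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float, float]:
    """Point estimate and normal-approximation confidence interval for a success rate."""
    p = successes / trials
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * (p * (1 - p) / trials) ** 0.5
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

model = success_rate_ci(successes=41, trials=100)
human_expert = success_rate_ci(successes=57, trials=100)
print(f"model:  {model[0]:.2f} (95% CI {model[1]:.2f}-{model[2]:.2f})")
print(f"humans: {human_expert[0]:.2f} (95% CI {human_expert[1]:.2f}-{human_expert[2]:.2f})")
```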
Independent evaluations are hard to design because the full range of potential capabilities of the model or system can be difficult to assess. Users can improve or modify capabilities in unforeseeable ways, such as through prompt engineering, fine-tuning, or providing access to software tools. There are also concerns that AI systems might be manipulated to fail specific evaluation tasks in future, so that these tasks would not be representative of the underlying capabilities.
It is an open scientific question to what extent evaluations can elicit the full capabilities of an AI system. We use the following methods to explore capabilities, but we have a long way to go before we can claim that we are eliciting capabilities close to, or representative of, the ‘ceiling’ that a well-motivated actor could achieve:
We aim to adapt our elicitation efforts according to our assessment of risk, for example, using more specialised elicitation where concerns include motivated and well-resourced malicious actors.
As well as underestimating capabilities, elicitation can also lead us to overestimate capabilities when the same tasks are used during elicitation and evaluation. We mitigate this by clearly separating the tasks used for elicitation from those used for testing – a ‘dev/test split’, which is standard practice in machine learning.
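A minimal sketch of such a split is shown below: task identifiers are shuffled with a fixed seed, elicitation work uses only the dev portion, and reported results come only from the held-out test portion. The task IDs and split fraction are illustrative.

```python
# Simple illustration of a dev/test split for evaluation tasks: prompts and scaffolds are
# tuned only on the dev split, and reported results come only from the held-out test split.
# Task IDs and the split fraction are placeholders.
import random

def dev_test_split(task_ids: list[str], test_fraction: float = 0.8, seed: int = 0):
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = task_ids[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, dev = shuffled[:n_test], shuffled[n_test:]
    return dev, test

dev_tasks, test_tasks = dev_test_split([f"task-{i:03d}" for i in range(100)])
# Elicitation (prompt tuning, scaffold selection) uses dev_tasks only;
# headline capability results are computed on test_tasks only.
```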
To score a system’s performance on a task, we use either human experts or automatic scoring. Automatic scoring can compare against ground truth (information known to be true) or use an auto-scoring function based on a grading rubric, which details the components required for a fully or partially correct answer. If the questions are multiple choice, automating the scoring is easy, but we get a much more complete understanding of a model’s capabilities if we also ask open-ended questions. The challenge lies in writing these open-ended questions so that they have a clear correct answer, and then writing grading rubrics so that an LLM can use them for accurate automated grading. This involves clearly specifying the essential components of a correct answer and, where relevant, clearly articulating the different acceptable options. In addition, the grading prompts for the grader LLM need to be optimised iteratively, to ensure that the rubrics are applied appropriately.
Another key consideration is how well our automatic scoring approximates an expert human rater, its overall reliability, and any biases that might be present in the scoring model. We have set out our key design and methodology questions in detail in our blog, Early Insights from Developing Question-Answer Evaluations for Frontier AI.
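To make the rubric-grading idea concrete, here is a generic sketch (not AISI's actual grading prompt or tooling) in which a grading rubric is inserted into a prompt for a grader LLM and the awarded points are parsed from its reply. The `call_grader_model` parameter is a stand-in for whatever model API is used.

```python
# Generic sketch of rubric-based automatic grading with a grader LLM. `call_grader_model`
# is a stand-in for whatever model API is used; the template wording is illustrative.
GRADING_TEMPLATE = """You are grading an answer against a rubric.

Question: {question}
Answer: {answer}

Rubric (award the listed points for each component that is present):
{rubric}

Reply with only the total points awarded as an integer."""

def grade_answer(question: str, answer: str, rubric: str, call_grader_model) -> int:
    prompt = GRADING_TEMPLATE.format(question=question, answer=answer, rubric=rubric)
    reply = call_grader_model(prompt).strip()   # e.g. a thin wrapper around a chat-completions API
    try:
        return int(reply)
    except ValueError:
        return 0    # fall back to zero points if the grader's reply is not a bare integer
```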
Evaluations should strive to be predictive. Even if a threshold is not currently being crossed, it is vital to know how close it is to being crossed. If tasks are too difficult for current models, the evaluation should still provide some signal. One way to do this is to assign partial credit to incomplete solutions, or to break challenging tasks down into more manageable sub-tasks with associated milestones. Predictive evaluations that identify performance bottlenecks provide a smooth metric for tracking improvements over time and reduce the likelihood of a sudden step change in observed capabilities.
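As a toy example of partial credit via milestones (the milestones and weights below are invented for illustration), a hard agent task can be decomposed so that an agent earns a fraction of the score for each sub-goal it reaches:

```python
# Illustrative milestone scoring: a hard agent task is broken into ordered sub-tasks, and
# partial credit is awarded for each milestone reached. Milestones and weights are made up.
MILESTONES = [
    ("locate the relevant file", 0.2),
    ("write a candidate patch", 0.3),
    ("patch compiles", 0.2),
    ("all tests pass", 0.3),
]

def milestone_score(reached: set[str]) -> float:
    """Weighted partial credit in [0, 1] for the milestones an agent completed."""
    return sum(weight for name, weight in MILESTONES if name in reached)

print(milestone_score({"locate the relevant file", "write a candidate patch"}))  # 0.5
```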
Independent evaluators often face time-pressure during testing, for example, if deployment is imminent. This makes evaluation quality assurance challenging.
To address this, AISI fully researches, develops, and designs its evaluations in advance of using them, so that they can be run immediately. This means documenting, reviewing, iterating on, and approving the experiments and draft reports of results before we start to onboard the model to be tested.
This amounts to ‘preregistration’, which aims to:
The preregistration includes:
Independent evaluators work with extremely sensitive information. This includes both commercially sensitive information, when working with pre-deployment access to a given system, and sensitive information on how AI systems can be misused, how harmful capabilities can be elicited, and how safeguards can be removed. It is also possible that the work of evaluators might result in AI systems that pose higher risks than the initial systems, and these need to be protected accordingly.
We implement stringent information security measures, proportionate to the risk of potential capabilities and to commercial sensitivity. AISI has worked with government information security experts to ensure its systems are built to a robust security standard. This includes undertaking regular penetration testing of its development environment, hardening endpoints, and drawing on qualified security professionals to prevent, mitigate, and manage potential security incidents.
Additionally, all AISI safety evaluations are subject to existing government information security operating procedures for commercially sensitive information, plus bespoke security measures. Our approach divides AISI into multiple groups with differentiated access according to role. Each group has a named list and operates on a ‘need-to-know' basis to minimise access where it is not essential. We use two layers of code names – one for each main AI company and one for each pre-deployment testing exercise – and we use these across our technical onboarding and all communication relating to testing, to maintain information security beyond the testing team.
We have learned a lot from our approach to evaluations to date, but there remain significant challenges and areas where we need to make progress.
It is clear from our experience that, to run high-quality evaluations that yield high-fidelity information about the potential risks posed by frontier systems, independent evaluators need:
It is important that we have sufficient time to conduct a rigorous evaluation of model capabilities, to provide quality assurance of results, and to report back to frontier AI labs in advance of model deployment, in line with the testing tiers outlined in section 4. One option we are considering is to focus the bulk of testing on “proxy” models that are available earlier and are sufficiently similar to the model that will be deployed. How to structure such evaluations and validate their results, and how to define similarity precisely, are open scientific questions.
It is important that evaluators can protect the integrity of their evaluations, for example, through non-logging guarantees. We need to work with companies to better understand what level of information sharing is appropriate and necessary to enable trust in independent evaluation results and the implementation of effective mitigations, without revealing our full testing methodology.
Additionally, we need to establish a shared framework and standard for risk and capability thresholds (as mentioned above), and for what these thresholds entail in terms of mitigation expectations. Relatedly, we are developing our understanding of how to disclose findings to a developer where these might relate to potential national security risks.
We hope that this blog has provided useful considerations and will stimulate thinking for both evaluators and policy makers. We have learned a lot in our work to date, including how much there is left to do in this evolving and emerging science of evaluations. We are eager to work with interested partners to advance this work.