The AI Safety Institute's (AISI) mission is to equip governments with an empirical understanding of the safety of frontier AI systems. To achieve this, we design and run evaluations that aim to measure the capabilities of AI systems that may pose risks.
Since November 2023, we have built one of the largest safety evaluation teams globally, led by world-class researchers, established a Memorandum of Understanding (MoU) with the US AISI, and begun to operationalise joint testing. We have also published an open-source framework for evaluations, reported the results of our testing of four large language models from major companies, and conducted several pre-deployment tests.
But this is just the beginning. Model evaluations are a new and rapidly evolving science, and we routinely iterate on our approach. We will continue to grow our team and build our capacity to address challenges and open scientific questions. If you are excited by this, please join us!
This blog sets out our thinking to date on how to design and run third-party evaluations, including key elements to consider and open questions. This is not intended to provide robust recommendations; rather we want to start a conversation in the open about these practices and to learn from others.
We discuss the role of third-party evaluators and what they could target for testing, including which systems to test, when to test them, and which risks and capabilities to test for. We also examine how to evaluate effectively, including which tests to use for which purpose, how to develop robust testing, and how to ensure safety and security.
Independent evaluations of frontier AI systems can provide a snapshot of a given system’s capabilities and vulnerabilities. Whilst our sense is that the science is too nascent for independent evaluations to serve a ‘certification’ function (i.e. to provide confident assurances that a particular system is ‘safe’), they are a critical part of incentivising best efforts at improving safety. Independent evaluations – completed by governments or other third parties – provide key benefits to AI companies, governments, and the public, including:
Independent evaluations completed by state bodies provide additional benefits to governments:
Taken together, these benefits of independent evaluations are critical to the safe, secure, and trustworthy development and use of AI systems – both today and in future.
AISI’s mission is to develop an empirical understanding of frontier AI safety. In our first year, our focus has been on those AI systems that we believe will provide the best insights on the frontier of AI and how it is developing.
In practice, this means we have focused on evaluating systems that are likely to match or surpass the current state of the art (SOTA) in general-purpose capabilities. One way to assess this is to use ‘scaling laws’ as a proxy for identifying such systems: these suggest a strong relationship between the amount of compute used to train a system and the capabilities of the resulting system.
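As a toy illustration of what a compute-based proxy could look like in practice, the sketch below flags systems whose estimated training compute is close to or above that of the current state-of-the-art training run. The flagging rule, the order-of-magnitude margin, and the FLOP figures are all hypothetical, not AISI's actual criteria.

```python
# Toy illustration (not AISI's actual criterion): flag a system for tracking if its
# estimated training compute is within an order of magnitude of, or exceeds, the
# compute of the current state-of-the-art training run. All numbers are hypothetical.

def worth_tracking(training_flop: float, sota_training_flop: float, margin: float = 10.0) -> bool:
    """Return True if a system's training compute suggests it may approach or surpass SOTA."""
    return training_flop >= sota_training_flop / margin

# Example with made-up figures: a 5e25 FLOP training run vs a 1e26 FLOP state of the art.
print(worth_tracking(training_flop=5e25, sota_training_flop=1e26))  # True
```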
We know this proxy is not perfect and could miss other systems that may pose risks. In future, we aim to develop a robust methodology for determining which systems could produce concerning capability jumps and thus warrant tracking and evaluation.
There is also a decision about when to test a system. We now work with companies to conduct pre-deployment testing, since deployment constitutes a step change in prospective risk exposure, alongside post-deployment testing to monitor performance ‘in the wild’ and to create a baseline against which to compare new systems.
A more sophisticated approach would test at points across the full AI life cycle where we expect risk to increase meaningfully enough to warrant monitoring or prioritisation, such as:
AISI currently prioritises the following areas in our evaluations:
Our general heuristic for choosing these areas has been to focus on critical risks with the greatest potential for harm. The recently published interim International Scientific Report on the Safety of Advanced AI provides a comprehensive overview of the risks and capabilities of frontier AI systems, helping us to identify which risks to classify as critical for evaluation. We have also focused on areas where we think government can add value because of its specific expertise, such as national security domains, and on areas that do not duplicate efforts elsewhere. We regularly consider whether to change or add to our evaluation suite, and we have expanded our focus over the last year; for example, our societal impacts team now considers the potential for AI systems to enable criminal misuse and social destabilisation.
If we want to properly understand how AI systems may create risks, our evaluations need to be designed around specific pathways to harm. We need to understand clearly how the capabilities we measure, such as certain coding capabilities, could lead to real-world harm, such as assisting a cyberattack on critical infrastructure. We have therefore started mapping out these potential pathways to harm through a systematic process of risk modelling. Risk modelling provides a clearer understanding of the risks that need to be managed, but it can be challenging, both because frontier AI is novel and because its capabilities are often dual use. For example, general-purpose AI systems have a wide range of applications, making it difficult to anticipate all potential risks. Factors external to AI systems can also influence risks, which makes it difficult to assess risk by evaluating the systems alone. Finally, detailed risk modelling often requires highly sensitive information about specific paths to harm and about threat actors.
Despite these challenges, we are committed to doing this well. Without detailed and rigorous risk modelling we cannot translate evaluation results into actual assessments of risks.
A gold standard for frontier AI development and deployment decisions would include a comprehensive set of clearly defined risk and capability thresholds beyond which outcomes would likely be unacceptable unless appropriately mitigated. We think governments have a significant role to play here, as 27 countries and the EU recognised at the AI Seoul Summit 2024. We have thus begun work to identify capability thresholds: specific AI capabilities that are indicative of potentially severe risks, that could be tested for, and that should trigger certain actions to mitigate risk. These thresholds correspond to the pathways to harm in our risk modelling, such as capabilities that would remove current barriers for malicious actors or unlock new ways of causing harm.
As described above, we want our evaluations and risk models to be closely aligned. We are thus working to develop evaluations that clearly map onto all relevant pathways and capabilities. Our goal is to eventually operationalise each capability threshold with a set of evaluations and to define clearly, in advance, which evaluation results we would consider to cross the threshold. This will be an iterative process, with insights from evaluations informing our risk models and vice versa. While we work towards this longer-term ambition, our risk modelling already guides which evaluations we develop and provides a framework for interpreting results.
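As a purely hypothetical sketch of what an operationalised threshold could look like, the snippet below pairs a capability threshold with a set of evaluations and a pre-registered trigger condition. The threshold name, evaluation names, and trigger are illustrative placeholders, not AISI's actual thresholds or suites.

```python
# Hypothetical sketch: a capability threshold operationalised as a set of evaluations plus a
# trigger condition defined in advance. All names and conditions are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class CapabilityThreshold:
    name: str                                              # capability of concern, tied to a pathway to harm
    evaluations: list[str] = field(default_factory=list)   # evaluation suites operationalising the threshold
    trigger: str = ""                                       # pre-registered result treated as crossing it

example_threshold = CapabilityThreshold(
    name="meaningful uplift to a low-skill actor on a harmful task",
    evaluations=["automated-qa-suite", "agentic-task-suite", "expert-red-team"],
    trigger="success on the agentic-task-suite exceeds the skilled-human baseline",
)
```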
Examples of how we aim to tailor our evaluations to our risk models include:
There are multiple techniques available to evaluators, which can be used in isolation or in combination. AISI currently uses the following:
Red teaming provides a deep and tangible level of insight into model capabilities, but it is more resource-intensive than automated techniques, given the human effort required. Red teaming therefore often covers a narrower set of areas than automated tests, and each scenario is usually only tested once.
A key challenge is delivering rigorous evaluations to tight timelines, particularly pre-deployment. To address this, our current pilot approach uses distinct tiers of testing: if quicker, more automated evaluations produce concerning or surprising findings, this triggers more intensive evaluations. We think about our tiers as follows:
These tiers are a work in progress and represent our ideal end state. A key open scientific question is to what extent results from one tier can be used as proxies for higher tiers, potentially precluding the need to run the more expensive and lengthy evaluations – and precisely where this fails to hold, for example, where lower-tier tests produce false negatives or are designed with a lower capability bar than is concerning.
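To illustrate the escalation logic (rather than our actual tiers or thresholds), here is a minimal sketch in which cheaper tiers run first and more intensive tiers are only triggered by concerning results. Tier names, concern scores, and the escalation threshold are placeholders.

```python
# Hypothetical sketch of tiered triage: run cheaper automated tiers first and only escalate
# to more intensive testing when results look concerning. Names and thresholds are placeholders.
from typing import Callable

def run_tiered_evaluation(
    tiers: list[tuple[str, Callable[[], float]]],   # (tier name, function returning a concern score)
    escalation_threshold: float = 0.5,
) -> dict[str, float]:
    """Run tiers in order of increasing cost; stop escalating once results look benign."""
    results: dict[str, float] = {}
    for name, run_tier in tiers:
        score = run_tier()
        results[name] = score
        if score < escalation_threshold:
            break   # lower-tier results are unconcerning, so skip the more expensive tiers
    return results

# Example with stub tiers returning made-up concern scores.
results = run_tiered_evaluation([
    ("automated benchmarks", lambda: 0.7),   # concerning, so escalate
    ("agentic evaluations", lambda: 0.4),    # benign, so stop here
    ("expert red teaming", lambda: 0.9),     # never run in this example
])
print(results)  # {'automated benchmarks': 0.7, 'agentic evaluations': 0.4}
```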
We have also been developing the infrastructure and tooling required to run the various evaluations and tiers, to make them as efficient, insightful, and high quality as possible. Our tests are written using Inspect AI, an open-source model evaluation framework that we created both for internal use and for use across the broader evaluations ecosystem. Inspect provides state-of-the-art capabilities including flexible prompt engineering, multi-turn dialogue, agent scaffolds, and model grading.
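As an indication of what this looks like in practice, below is a minimal, illustrative Inspect task: a two-sample question-answer dataset answered by the model under test and scored by a grader model. The samples, system prompt, and model identifier are placeholders, and exact API details may differ between Inspect versions.

```python
# A minimal, illustrative Inspect task: a tiny QA dataset scored by a grader model.
# Sample contents and model names are placeholders; API details may vary across Inspect versions.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate, system_message
from inspect_ai.scorer import model_graded_fact

@task
def demo_qa():
    dataset = [
        Sample(
            input="In one sentence, what does a buffer overflow allow an attacker to do?",
            target="Overwrite adjacent memory, potentially hijacking control flow.",
        ),
        Sample(
            input="What is the purpose of a dev/test split in evaluation design?",
            target="To keep tasks used for elicitation separate from those used for measurement.",
        ),
    ]
    return Task(
        dataset=dataset,
        solver=[system_message("Answer concisely."), generate()],
        scorer=model_graded_fact(),  # a grader model checks answers against the target
    )

if __name__ == "__main__":
    # Run against a placeholder model identifier.
    eval(demo_qa(), model="openai/gpt-4o-mini")
```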
For all our tests we develop baselines, which are essential for interpreting empirical results and drawing conclusions about risk. Models are baselined against humans (either expert or non-expert, depending on the risk model being assessed) or against other state-of-the-art models.
We have developed these baselines by periodically running our automated and agent evaluation suites on sets of publicly available systems. A key question in establishing them is how much effort to spend on bespoke elicitation for each model given time constraints, and what implications this has for comparing results between systems and for our confidence in assessing the limits of capabilities.
Additionally, there are several design questions in selecting human users to establish baselines, such as: whether test subjects have internet access and how long they have to perform a task; whether test subjects are incentivised to perform well and whether we create checks to monitor attention; and what level of expertise the test subject has, in particular in relation to the risk model.
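As a simple illustration of how such baselines can be used, the sketch below compares a model's success rate on a task suite against a human baseline, with a basic confidence interval. The success counts are made-up numbers, not real results.

```python
# Minimal sketch of comparing a model's task success rate against a human baseline.
# The success counts below are illustrative placeholders, not real results.
from statistics import NormalDist

def success_rate_ci(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float, float]:
    """Point estimate and normal-approximation confidence interval for a success rate."""
    p = successes / trials
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * (p * (1 - p) / trials) ** 0.5
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

model = success_rate_ci(successes=41, trials=100)
human_expert = success_rate_ci(successes=57, trials=100)
print(f"model:  {model[0]:.2f} (95% CI {model[1]:.2f}-{model[2]:.2f})")
print(f"humans: {human_expert[0]:.2f} (95% CI {human_expert[1]:.2f}-{human_expert[2]:.2f})")
```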
Independent evaluations are hard to design because the full range of potential capabilities of the model or system can be difficult to assess. Users can improve or modify capabilities in unforeseeable ways, such as through prompt engineering, fine-tuning, or providing access to software tools. There are also concerns that AI systems might be manipulated to fail specific evaluation tasks in future, so that these tasks would not be representative of the underlying capabilities.
It is an open scientific question to what extent evaluations can elicit the full capabilities of an AI system. We use the following methods to explore capabilities, but we have a long way to go before we can claim that we are eliciting capabilities close to, or representative of, the ‘ceiling’ that a well-motivated actor could achieve:
We aim to adapt our elicitation efforts according to our assessment of risk, for example, using more specialised elicitation where concerns include motivated and well-resourced malicious actors.
As well as underestimating capabilities, elicitation can also lead us to overestimate capabilities when the same tasks are used during elicitation and evaluation. We mitigate this by clearly separating the tasks used for elicitation from those used for testing – a ‘dev/test split’, which is standard practice in machine learning.
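A minimal sketch of such a split is shown below: task identifiers are shuffled with a fixed seed, elicitation work uses only the dev portion, and reported results come only from the held-out test portion. The task IDs and split fraction are illustrative.

```python
# Simple illustration of a dev/test split for evaluation tasks: prompts and scaffolds are
# tuned only on the dev split, and reported results come only from the held-out test split.
# Task IDs and the split fraction are placeholders.
import random

def dev_test_split(task_ids: list[str], test_fraction: float = 0.8, seed: int = 0):
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = task_ids[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, dev = shuffled[:n_test], shuffled[n_test:]
    return dev, test

dev_tasks, test_tasks = dev_test_split([f"task-{i:03d}" for i in range(100)])
# Elicitation (prompt tuning, scaffold selection) uses dev_tasks only;
# headline capability results are computed on test_tasks only.
```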
To score a system’s performance on a task, we use either human experts or automatic scoring. Automatic scoring can compare against ground truth (information known to be true) or use an auto-scoring function based on a grading rubric, which details the components required for a fully or partially correct answer. If the questions are multiple choice, automating the scoring is easy, but we get a much more complete understanding of a model’s capabilities if we also ask open-ended questions. The challenge lies in writing these open-ended questions so that they have a clear correct answer, and then writing grading rubrics so that an LLM can use them for accurate automated grading. This involves clearly specifying the essential components of a correct answer and, where relevant, clearly articulating the different acceptable options. In addition, the grading prompts for the grader LLM need to be optimised iteratively, to ensure that the rubrics are applied appropriately.
Another key consideration is how well our automatic scoring approximates an expert human rater, its overall reliability, and any biases that might be present in the scoring model. We have set out our key design and methodology questions in detail in our blog, Early Insights from Developing Question-Answer Evaluations for Frontier AI.
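To make the rubric-grading idea concrete, here is a generic sketch (not AISI's actual grading prompt or tooling) in which a grading rubric is inserted into a prompt for a grader LLM and the awarded points are parsed from its reply. The `call_grader_model` parameter is a stand-in for whatever model API is used.

```python
# Generic sketch of rubric-based automatic grading with a grader LLM. `call_grader_model`
# is a stand-in for whatever model API is used; the template wording is illustrative.
GRADING_TEMPLATE = """You are grading an answer against a rubric.

Question: {question}
Answer: {answer}

Rubric (award the listed points for each component that is present):
{rubric}

Reply with only the total points awarded as an integer."""

def grade_answer(question: str, answer: str, rubric: str, call_grader_model) -> int:
    prompt = GRADING_TEMPLATE.format(question=question, answer=answer, rubric=rubric)
    reply = call_grader_model(prompt).strip()   # e.g. a thin wrapper around a chat-completions API
    try:
        return int(reply)
    except ValueError:
        return 0    # fall back to zero points if the grader's reply is not a bare integer
```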
Evaluations should strive to be predictive. Even if a threshold is not currently being crossed, it is vital to know how close it is to being crossed. If tasks are too difficult for current models, the evaluation should still provide some signal. One way to do this is to assign partial credit to incomplete solutions, or to break challenging tasks down into more manageable sub-tasks with associated milestones. Predictive evaluations that identify performance bottlenecks provide a smooth metric for tracking improvements over time and reduce the likelihood of a sudden step change in observed capabilities.
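As a toy example of partial credit via milestones (the milestones and weights below are invented for illustration), a hard agent task can be decomposed so that an agent earns a fraction of the score for each sub-goal it reaches:

```python
# Illustrative milestone scoring: a hard agent task is broken into ordered sub-tasks, and
# partial credit is awarded for each milestone reached. Milestones and weights are made up.
MILESTONES = [
    ("locate the relevant file", 0.2),
    ("write a candidate patch", 0.3),
    ("patch compiles", 0.2),
    ("all tests pass", 0.3),
]

def milestone_score(reached: set[str]) -> float:
    """Weighted partial credit in [0, 1] for the milestones an agent completed."""
    return sum(weight for name, weight in MILESTONES if name in reached)

print(milestone_score({"locate the relevant file", "write a candidate patch"}))  # 0.5
```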
Independent evaluators often face time-pressure during testing, for example, if deployment is imminent. This makes evaluation quality assurance challenging.
To address this, AISI fully researches, develops, and designs its evaluations in advance of using them, so that they can be run immediately. This means documenting, reviewing, iterating on, and approving the experiments and draft reports of results before we start to onboard the model to be tested.
This amounts to ‘preregistration’, which aims to:
The preregistration includes:
Independent evaluators work with extremely sensitive information. This includes both commercially sensitive information, when working with pre-deployment access to a given system, and sensitive information on how AI systems can be misused, how harmful capabilities can be elicited, and how safeguards can be removed. It is also possible that the work of evaluators might result in AI systems that pose higher risks than the initial systems, and these need to be protected accordingly.
We implement stringent information security measures, proportionate to the risk of potential capabilities and to commercial sensitivity. AISI has worked with government information security experts to ensure its systems are built to a robust security standard. This includes undertaking regular penetration testing of its development environment, hardening endpoints, and drawing on qualified security professionals to prevent, mitigate, and manage potential security incidents.
Additionally, all AISI safety evaluations are subject to existing government information security operating procedures for commercially sensitive information, plus bespoke security measures. Our approach divides AISI into multiple groups with differentiated access according to role. Each group has a named list and operates on a ‘need-to-know' basis to minimise access where it is not essential. We use two layers of code names – one for each main AI company and one for each pre-deployment testing exercise – and we use these across our technical onboarding and all communication relating to testing, to maintain information security beyond the testing team.
We have learned a lot from our approach to evaluations to date, but there remain significant challenges and areas where we need to make progress.
It is clear from our experience that, to run high-quality evaluations that yield high-fidelity information about the potential risks posed by frontier systems, independent evaluators need:
It is important that we have sufficient time to conduct a rigorous evaluation of model capabilities, to provide quality assurance of results, and to report back to frontier AI labs in advance of model deployment, in line with the testing tiers outlined in section 4. One option we are considering is to focus the bulk of testing on “proxy” models that are available earlier and are sufficiently similar to the model that will be deployed. How to structure such evaluations and validate their results, and how to define similarity precisely, are open scientific questions.
It is important that evaluators can protect the integrity of their evaluations, for example, through non-logging guarantees. We need to work with companies to better understand what level of information sharing is appropriate and necessary to enable trust in independent evaluation results and the implementation of effective mitigations, without revealing our full testing methodology.
Additionally, we need to establish a shared framework and standard for risk and capability thresholds (as mentioned above), and for what these thresholds entail in terms of mitigation expectations. Relatedly, we are developing our understanding of how to disclose findings to a developer where these might relate to potential national security risks.
We hope that this blog has provided useful considerations and will stimulate thinking for both evaluators and policy makers. We have learned a lot in our work to date, including how much there is left to do in this evolving and emerging science of evaluations. We are eager to work with interested partners to advance this work.