Pre-deployment evaluation of Anthropic’s upgraded Claude 3.5 Sonnet

Note to readers: we changed our name to the AI Security Institute on 14 February 2025. Read more here.

Introduction

The UK Artificial Intelligence Safety Institute (UK AISI) and the U.S. Artificial Intelligence Safety Institute (US AISI) conducted a joint pre-deployment evaluation of Anthropic’s latest model – the upgraded Claude 3.5 Sonnet (released October 22, 2024).

The following is a high-level overview of the evaluations conducted, as well as a snapshot of the findings from each domain tested. A more detailed technical report can be found here.

Overview of the Joint Safety Research & Testing Exercise

US AISI and UK AISI conducted testing during a limited period of pre-deployment access to the upgraded Sonnet 3.5 model. Testing was conducted by expert engineers, scientists, and subject matter specialists from both Institutes, and the findings were shared with Anthropic before the model was publicly released.

US AISI and UK AISI ran separate but complementary tests to assess the model’s capabilities across four domains: (1) biological capabilities, (2) cyber capabilities, (3) software and AI development, and (4) safeguard efficacy.

To assess the model’s relative capabilities and evaluate the potential real-world impacts of the upgraded Sonnet 3.5 across these four areas, US AISI and UK AISI compared its performance to a series of similar reference models: the prior version of Anthropic’s Sonnet 3.5, OpenAI’s o1-preview, and OpenAI’s GPT-4o.

These comparisons are intended only to assess the relative capability improvements of the upgraded Sonnet 3.5, in order to improve scientific interpretation of evaluation results.

Methodology

US AISI and UK AISI tested the upgraded Sonnet 3.5 by drawing on a range of techniques including:

Question Answering: The model was asked to correctly answer a series of questions that test knowledge or problem solving on a given topic. Answers were most often graded automatically by another model, and the grading was checked by a human with knowledge of the correct answers.

Agent Tasks: The model operated as an agent in a virtual environment where it was given a task to complete, provided with access to a series of software tools, and prompted to take a series of steps until it successfully completed the task or reached its maximum number of steps without succeeding.

Qualitative Probing: A scientific expert reviewed a model as it operated to understand its capabilities and limitations in more detail.

Red Teaming: Machine learning experts attempted to develop jailbreaks or other adversarial inputs that would cause the model to answer malicious requests.
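The Agent Tasks setup described above (a task, a set of tools, and a cap on steps) can be sketched as a simple loop. This is a minimal illustration only, not the harness either Institute used: the names `AgentRun`, `run_agent_task`, `choose_action`, `tools`, and `is_solved` are hypothetical, and the toy policy stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentRun:
    """Record of one agent-task attempt (hypothetical structure)."""
    solved: bool = False
    steps_used: int = 0
    transcript: List[str] = field(default_factory=list)


def run_agent_task(
    choose_action: Callable[[List[str]], str],
    tools: Dict[str, Callable[[], str]],
    is_solved: Callable[[List[str]], bool],
    max_steps: int,
) -> AgentRun:
    """Step the agent until the task is solved or the step budget runs out."""
    run = AgentRun()
    for _ in range(max_steps):
        # The model picks a tool name based on the transcript so far.
        tool_name = choose_action(run.transcript)
        tool = tools.get(tool_name)
        observation = tool() if tool else f"unknown tool: {tool_name}"
        run.transcript.append(f"{tool_name} -> {observation}")
        run.steps_used += 1
        # Stop early as soon as the success check passes.
        if is_solved(run.transcript):
            run.solved = True
            break
    return run


# Toy stand-ins (assumptions for illustration): a scripted policy and two fake tools.
def toy_policy(transcript: List[str]) -> str:
    return "list_files" if not transcript else "read_flag"


toy_tools = {"list_files": lambda: "flag.txt", "read_flag": lambda: "FLAG{demo}"}


def found_flag(transcript: List[str]) -> bool:
    return any("FLAG{" in line for line in transcript)


result = run_agent_task(toy_policy, toy_tools, found_flag, max_steps=5)
capped = run_agent_task(toy_policy, toy_tools, found_flag, max_steps=1)
```

With these toy inputs, `result` succeeds on its second step, while `capped` exhausts its one-step budget without succeeding, mirroring the two stopping conditions in the description.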

Evaluations and Findings

Below are selected findings from US AISI and UK AISI’s testing of the upgraded Sonnet 3.5 model. Note that smaller differences in model performance presented below may fall within the margins of error for these tests. A more detailed analysis can be found in the technical report.

Biological Capabilities

Rapid advances in AI are enabling powerful innovation across many fields of biological research, which holds immense promise for the future of science, medicine, manufacturing, and much more. That said, many biological discoveries and capabilities are dual use, meaning that new discoveries in the field of biology can be used to facilitate helpful outcomes, as well as potentially harmful ones.

To better understand the relative biological capabilities of the upgraded Sonnet 3.5, including how it could be misused, US AISI and UK AISI focused on evaluating how the model performed on a range of practical research tasks.

Below is a high-level overview of findings related to biological capabilities. Please note that these findings are based on US AISI’s testing only, as UK AISI is not publishing its findings in this domain at this time.

Key findings:

US AISI evaluated the upgraded Sonnet 3.5 on a set of multiple-choice research task questions, and performance was comparable to reference models and significantly below measured human expert baselines in most instances.

US AISI piloted an evaluation method that augments the AI model with access to bioinformatic tools to assist with research task questions. When tested with Sonnet 3.5, this method increased performance beyond that of the model alone, at times exceeding measured human expert baselines.

Cyber Capabilities

Advances in AI systems could enable the automation of increasingly complex cyber tasks. These capabilities are also dual use, meaning they can be leveraged to strengthen cyber defenses and can also be used maliciously to exploit a computer system.

To better understand the relative capability of the upgraded Sonnet 3.5, US AISI and UK AISI evaluated how the model performed on a range of cyber skills that might be used to enable malicious tasks, like hacking into a computer system.

Key findings:

US AISI evaluated the upgraded Sonnet 3.5 against a suite of 40 publicly available cybersecurity challenges. The upgraded model was able to successfully solve 32.5% of all tasks, compared to a 35% solve rate for the best-performing reference model evaluated.

UK AISI evaluated the upgraded Claude 3.5 Sonnet against a suite of 47 cybersecurity challenges, 15 of which are publicly available and 32 of which were privately developed. The upgraded model was able to solve 36% of tasks that are at the “cybersecurity apprentice” level of competency, compared to 29% of tasks at that same level for the best reference model evaluated.
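To put the US AISI percentages in concrete terms, the sketch below converts the reported solve rates on the 40-task suite back into task counts and gives a rough sense of the sampling uncertainty. The binomial standard error is an assumption made here for illustration: it treats each task as a single independent pass/fail trial, which may not match how the technical report computes its margins of error. The function names are hypothetical.

```python
import math


def solved_count(solve_rate: float, n_tasks: int) -> int:
    """Convert a reported solve rate back into a whole number of tasks."""
    return round(solve_rate * n_tasks)


def binomial_standard_error(solve_rate: float, n_tasks: int) -> float:
    """Standard error of a proportion, assuming independent pass/fail tasks."""
    return math.sqrt(solve_rate * (1 - solve_rate) / n_tasks)


N_TASKS = 40  # size of the public challenge suite cited for US AISI

upgraded = solved_count(0.325, N_TASKS)        # 13 tasks
best_reference = solved_count(0.35, N_TASKS)   # 14 tasks
gap_in_tasks = best_reference - upgraded       # 1 task

# Roughly 0.074, i.e. about 7 percentage points under the stated assumption.
upgraded_se = binomial_standard_error(0.325, N_TASKS)
```

Under that assumption, the 2.5-point gap between the two solve rates amounts to a single task and is smaller than one standard error, which is consistent with the earlier caution that smaller differences may fall within the margins of error.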

Software and AI Development Evaluations

AI systems are themselves becoming increasingly useful tools for engineers who are developing these technologies. Even if an AI system is not able to complete a given task on its own, it could be deployed to help develop or augment other software to make it more capable. Put simply, advanced AI systems could make existing technologies more effective.

To understand the relative impact of the upgraded Sonnet 3.5 on software and AI development tasks, US AISI and UK AISI set the model up as an automated agent with access to various basic software development tools and then tested its skill and ability to carry out common engineering tasks.

Key findings:

US AISI evaluated the upgraded Sonnet 3.5 against a publicly available collection of challenges in which an agent must improve the quality or speed of an ML model. On a scale of 0% (model is unimproved) to 100% (maximum level of model improvement by humans), the model received an average score of 57% improvement, compared with an average of 48% improvement for the best-performing reference model evaluated.

UK AISI evaluated the upgraded Sonnet 3.5 on a set of privately developed evaluations consisting of software engineering, general reasoning, and agent tasks that span a wide range of difficulty levels. The upgraded model had a success rate of 66% on software engineering tasks compared to 64% by the best reference model evaluated, and a success rate of 47% on general reasoning tasks compared to 35% by the best reference model evaluated.
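One way to read the 0% to 100% improvement scale used in the US AISI finding is as a linear rescaling between the unimproved model's result and the best human result. The post only states the two endpoints, so the linear interpolation below is an assumption, the function names are hypothetical, and the numbers in the example are made up purely to show the arithmetic.

```python
from typing import Sequence


def normalized_improvement(
    agent_result: float,
    unimproved_result: float,
    best_human_result: float,
) -> float:
    """Map a raw result onto a scale where 0.0 is unimproved and 1.0 is the best human result."""
    span = best_human_result - unimproved_result
    if span == 0:
        raise ValueError("best human result must differ from the unimproved result")
    return (agent_result - unimproved_result) / span


def average_score(scores: Sequence[float]) -> float:
    """Average the per-challenge normalized scores."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


# Illustrative values only (not from the evaluation): accuracy on one challenge.
example = normalized_improvement(
    agent_result=0.70, unimproved_result=0.60, best_human_result=0.80
)
overall = average_score([example, 1.0, 0.0])
```

In this made-up case the agent closes half of the gap between the unimproved model and the best human result, so `example` is about 0.5. The same function also handles metrics where lower is better, such as runtime, because the sign of the span flips along with the sign of the improvement.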

Safeguard Efficacy

Many AI developers build safeguards into their systems to detect and then prevent the model from responding to potentially harmful requests from users. Such safeguards are an important line of defense, though it is currently possible for users to circumvent them through a range of adversarial inputs that would cause the model to answer malicious requests, known as ‘jailbreaks’.

To test the efficacy of the safeguards of the upgraded Sonnet 3.5, US AISI and UK AISI red-teamed the upgraded Sonnet 3.5 to determine its robustness against such jailbreaks.

Though US AISI’s and UK AISI’s safeguard evaluations are intended to inform how developers can better protect AI systems from being deliberately misused, it is worth highlighting that such safeguards are not the only measure that model providers may use to prevent misuse, and the results of this evaluation cannot on their own determine the model’s risks. Additionally, what qualifies as a harmful request is often subjective. Different model providers have different approaches for defining acceptable use of their models, and those approaches differ between jurisdictions and local contexts — including between the United States and the United Kingdom.

Key findings:

US AISI tested the upgraded Sonnet 3.5 against a series of publicly available jailbreaks, and in most cases the built-in version of the safeguards that US AISI tested was circumvented as a result, meaning the model provided answers that otherwise would have been prevented. This is consistent with prior research on the vulnerability of other AI systems.

UK AISI tested the upgraded Sonnet 3.5 using a series of public and privately developed jailbreaks and also found the version of the safeguards that UK AISI tested can be routinely circumvented. This is again consistent with prior research on the vulnerability of other AI systems’ safeguards.

Conclusion

While these tests were conducted in line with current best practices, the findings should be considered preliminary. The tests were conducted in a limited time period with finite resources; extending either could expand the scope of the findings and the conclusions drawn from them.

The science of AI safety is a new and rapidly evolving field. Conducting these independent safety assessments improves the precision and robustness of future evaluations so that governments can stay ahead of emerging risks and capabilities. US AISI and UK AISI plan to iterate and expand on evaluation scope, methodology, and testing tools with each subsequent exercise. We look forward to feedback from the scientific community to help strengthen this critical work and advance the science of AI safety.

Read the full report here.

This blog was cross-posted by the U.S. AI Safety Institute here.
