The UK Artificial Intelligence Safety Institute (UK AISI) and the U.S. Artificial Intelligence Safety Institute (US AISI) conducted a joint pre-deployment evaluation of Anthropic’s latest model – the upgraded Claude 3.5 Sonnet (released October 22, 2024).
The following is a high-level overview of the evaluations conducted, as well as a snapshot of the findings from each domain tested. A more detailed technical report can be found here.
US AISI and UK AISI conducted testing during a limited period of pre-deployment access to the upgraded Sonnet 3.5 model. Testing was conducted by expert engineers, scientists, and subject matter specialists from both Institutes, and the findings were shared with Anthropic before the model was publicly released.
US AISI and UK AISI ran separate but complementary tests to assess the model’s capabilities across four domains: (1) biological capabilities, (2) cyber capabilities, (3) software and AI development, and (4) safeguard efficacy.
To assess the model’s relative capabilities and evaluate the potential real-world impacts of the upgraded Sonnet 3.5 across these four areas, US AISI and UK AISI compared its performance to a series of similar reference models: the prior version of Anthropic’s Sonnet 3.5, OpenAI’s o1-preview, and OpenAI’s GPT-4o.
These comparisons are intended only to assess the relative capability improvements of the upgraded Sonnet 3.5, in order to improve scientific interpretation of evaluation results.
US AISI and UK AISI tested the upgraded Sonnet 3.5 by drawing on a range of techniques including:
Below are selected findings from US AISI and UK AISI’s testing of the upgraded Sonnet 3.5 model. Note that smaller differences in model performance presented below may fall within the margins of error for these tests. A more detailed analysis can be found in the technical report.
Rapid advances in AI are enabling powerful innovation across many fields of biological research, which holds immense promise for the future of science, medicine, manufacturing, and much more. That said, many biological discoveries and capabilities are dual use, meaning that new discoveries in the field of biology can be used to facilitate helpful outcomes, as well as potentially harmful ones.
To better understand the relative biological capabilities of the upgraded Sonnet 3.5, including how it could be misused, US AISI and UK AISI focused on evaluating how the model performed on a range of practical research tasks.
Below is a high-level overview of findings related to biological capabilities. Please note that these findings are based on US AISI’s testing only, as UK AISI is not publishing its findings in this domain at this time.
Advances in AI systems could enable the automation of increasingly complex cyber tasks. These capabilities are also dual use, meaning they can be leveraged to strengthen cyber defenses and can also be used maliciously to exploit a computer system.
To better understand the relative capability of the upgraded Sonnet 3.5, US AISI and UK AISI evaluated how the model performed on a range of cyber skills that might be used to enable malicious tasks, like hacking into a computer system.
AI systems are themselves becoming increasingly useful tools for engineers who are developing these technologies. Even if an AI system is not able to complete a given task on its own, it could be deployed to help develop or augment other software to make it more capable. Put simply, advanced AI systems could make existing technologies more effective.
To understand the relative impact of the upgraded Sonnet 3.5 on software and AI development tasks, US AISI and UK AISI set the model up as an automated agent with access to various basic software development tools and then tested its skill and ability to carry out common engineering tasks.
Many AI developers build safeguards into their systems to detect and then prevent the model from responding to potentially harmful requests from users. Such safeguards are an important line of defense, though it is currently possible for users to circumvent them with a range of adversarial inputs, known as ‘jailbreaks’, that cause the model to answer malicious requests.
To test the efficacy of the safeguards of the upgraded Sonnet 3.5, US AISI and UK AISI red-teamed the model to determine its robustness against such jailbreaks.
Though US AISI’s and UK AISI’s safeguard evaluations are intended to inform how developers can better protect AI systems from being deliberately misused, it is worth highlighting that such safeguards are not the only measure that model providers may use to prevent misuse, and the results of this evaluation cannot on their own determine the model’s risks. Additionally, what qualifies as a harmful request is often subjective. Different model providers have different approaches for defining acceptable use of their models, and those approaches differ between jurisdictions and local contexts — including between the United States and the United Kingdom.
While these tests were conducted in line with current best practices, the findings should be considered preliminary. These tests were conducted in a limited time period with finite resources; with more time and resources, the scope of the findings and the conclusions drawn from them could expand.
The science of AI safety is a new and rapidly evolving field. Conducting these independent safety assessments improves the precision and robustness of future evaluations so that governments can stay ahead of emerging risks and capabilities. US AISI and UK AISI plan to iterate and expand on evaluation scope, methodology, and testing tools with each subsequent exercise. We look forward to feedback from the scientific community to help strengthen this critical work and advance the science of AI safety.
Read the full report here.
This blog was cross-posted by the U.S. AI Safety Institute here.