The UK AI Safety Institute (UK AISI) and the U.S. Artificial Intelligence Safety Institute (US AISI) conducted a joint pre-deployment evaluation of OpenAI’s latest model, o1 (released December 5, 2024).
The following is a high-level overview of the evaluations conducted, as well as a snapshot of the findings from each domain tested. A more detailed technical report can be found here.
US AISI and UK AISI conducted testing during a limited period of pre-deployment access to the o1 model. Testing was conducted by expert engineers, scientists, and subject matter specialists on staff at both Institutes, and the findings were shared with OpenAI before the model was publicly released.
US AISI and UK AISI ran separate but complementary tests to assess the model’s capabilities across three domains: (1) cyber capabilities, (2) biological capabilities, and (3) software and AI development.
To assess the model’s relative capabilities and evaluate the potential real-world impacts of o1 across these areas, US AISI and UK AISI compared its performance to a set of similar reference models: OpenAI’s o1-preview, OpenAI’s GPT-4o, and both the earlier and upgraded versions of Anthropic’s Claude 3.5 Sonnet.
These comparisons are intended only to assess the relative capability improvements of o1, in order to improve scientific interpretation of evaluation results.
The version of o1 that was tested exhibited a number of performance issues related to tool-calling and output formatting. US AISI and UK AISI took steps to address these issues by adapting their agent designs, including adjusting prompts and introducing simple mechanisms to help the agent recover from errors. The results below reflect o1’s performance with this scaffolding in place. A version of o1 that was better optimized for tool use might exhibit better performance on many evaluations. This report makes no claims about the performance of other versions of o1.
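To illustrate the kind of error-recovery scaffolding described above, the sketch below shows a minimal retry loop around a model tool call: when the model emits a malformed call, the formatting error is fed back so the model can correct itself on the next attempt. This is a hypothetical illustration only; the function names and the JSON tool-call format are assumptions, not the Institutes' actual agent harness.

```python
import json

def run_tool_call(agent_step, max_retries=3):
    """Invoke one agent step, retrying when the model emits a malformed
    tool call (hypothetical sketch of error-recovery scaffolding)."""
    last_error = None
    for _ in range(max_retries):
        raw = agent_step(last_error)  # model produces a tool call as text
        try:
            call = json.loads(raw)    # expect {"tool": ..., "args": ...}
            if "tool" not in call or "args" not in call:
                raise ValueError("missing 'tool' or 'args' field")
            return call
        except (json.JSONDecodeError, ValueError) as exc:
            # Feed the formatting error back so the model can self-correct.
            last_error = f"Malformed tool call ({exc}); please emit valid JSON."
    raise RuntimeError("agent failed to produce a valid tool call")

# Example with a stub model that fails once, then recovers:
responses = iter(['not json', '{"tool": "bash", "args": {"cmd": "ls"}}'])
call = run_tool_call(lambda err: next(responses))
```

In practice, scaffolding like this (together with prompt adjustments) lets an evaluation measure the underlying capability of a model rather than incidental formatting failures.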
US AISI and UK AISI tested the model by drawing on a range of techniques including:
US AISI was assisted in its cyber capabilities evaluations by subject matter experts from the National Security Agency and the Cybersecurity and Infrastructure Security Agency, and in its biological capabilities evaluations by subject matter experts from the National Institutes of Health and the Department of Homeland Security.
Below are selected findings from US AISI and UK AISI’s testing of OpenAI’s o1 model.
Across the three domains tested, o1 largely demonstrated performance on par with the reference models, with the exception of the additional capabilities noted on cybersecurity challenges related to cryptography.
Note that smaller differences in model performance presented below may fall within the margins of error for these tests. More detail can be found in the technical report.
Advances in AI systems could enable the automation of increasingly complex cyber tasks. These capabilities are dual use, meaning they can be leveraged to strengthen cyber defences but can also be used maliciously to exploit computer systems.
To better understand the relative capability of o1, US AISI and UK AISI evaluated how the model performed on a range of cyber skills that might be used to enable malicious tasks, like hacking into a computer system.
Key findings:
Rapid advances in AI are enabling powerful innovation across many fields of biological research, which holds immense promise for the future of science, medicine, manufacturing, and much more. That said, many biological discoveries and capabilities are dual use, meaning they can facilitate helpful outcomes as well as potentially harmful ones.
To better understand the relative biological capabilities of OpenAI’s o1 model, including how it could be misused, US AISI and UK AISI focused on evaluating how the model performed on a range of practical research tasks.
Below is a high-level overview of findings related to biological capabilities. Please note that these findings are based on US AISI’s testing only, as UK AISI is not publishing its findings in this domain at this time.
Key findings:
AI systems are themselves becoming increasingly useful tools for engineers who are developing these technologies. Even if an AI system is not able to complete a given task on its own, it could be deployed to help develop or augment other software to make it more capable. Put simply, advanced AI systems could make existing technologies more effective.
To understand the relative impact of OpenAI’s o1 model on software and AI development tasks, US AISI and UK AISI set the model up as an automated agent with access to various basic software development tools and then tested its skill and ability to carry out common engineering tasks.
Key findings:
While these tests were conducted in line with current best practices, the findings should be considered preliminary. The tests were run over a limited time period and with finite resources; with more time and resources, the scope of the findings and the conclusions drawn from them could be expanded.
The science of AI safety is a new and rapidly evolving field. Conducting these independent safety assessments improves the precision and robustness of future evaluations so that governments can stay ahead of risks and capabilities as they emerge.
US AISI and UK AISI plan to iterate and expand on evaluation scope, methodology, and testing tools in further exercises. We look forward to feedback from the scientific community to help strengthen this critical work and advance the science of AI safety.
Read the full report.
This blog was cross-posted by the U.S. AI Safety Institute here.