
Announcing Inspect Evals

We’re open-sourcing dozens of LLM evaluations to advance safety research in the field

JJ Allaire, Technical Staff

Today we are excited to announce Inspect Evals, a repository of community-contributed LLM benchmark evaluations. Over the past few months, we've worked closely with Arcadia Impact and the Vector Institute on this new project, which makes dozens of high-quality open-source evaluations available for safety research. The evaluations cover a wide range of domains, including coding, mathematics, cybersecurity, safeguards, reasoning, and general knowledge. We've also included most of the benchmarks commonly reported by frontier model providers.

Evaluations have become an increasingly important way of tracking the capabilities of frontier models. The worldwide ‘evals community’ includes model developers, academics, AI safety institutes, and a multitude of other research organisations. Organisations running evaluations often struggle with implementation challenges, including the lack of shared standards of measurement. With this project we are taking initial steps towards addressing these challenges.

Running and viewing Inspect evaluations in VS Code

Inspect Platform

Inspect Evals is built on top of Inspect AI, an open-source evaluation framework created by the UK AISI. Inspect AI was first open-sourced in May of this year and has since seen wide usage and contributions from the broader evals community. Over 50 contributors have added to the framework, including other AI safety institutes, frontier labs, and major safety research organisations.
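To give a sense of what the framework looks like, here is a minimal sketch of an Inspect AI task. The dataset, solver, and scorer shown are illustrative placeholders rather than one of the Inspect Evals benchmarks; see the Inspect AI documentation for the full API.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_world():
    # A toy task: one sample, a single model generation,
    # and an exact-match scorer against the target.
    return Task(
        dataset=[Sample(input="Reply with exactly the word 'hello'.", target="hello")],
        solver=generate(),
        scorer=exact(),
    )
```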

Inspect Evals are designed to be easy to run and experiment with. They can be installed as a Python package and run from the command line or through a Python API. The collection includes several agent benchmarks, including GAIA, SWE-Bench, GDM CTF, and Cybench. Historically, agent benchmarks have been time-consuming and complex to set up and run; with Inspect Evals, these benchmarks can now be run against any model with a single command.
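For illustration, a run via the Python API might look something like the sketch below. The task name, model identifier, and sample limit here are assumptions chosen for the example; consult the Inspect Evals documentation for the exact names available after installing the package.

```python
from inspect_ai import eval

# Run a benchmark from the Inspect Evals collection against a chosen model.
# A roughly equivalent CLI invocation (model name is illustrative) would be:
#   inspect eval inspect_evals/gaia --model openai/gpt-4o
logs = eval(
    "inspect_evals/gaia",    # registry name of the benchmark task
    model="openai/gpt-4o",   # any Inspect-supported provider/model
    limit=10,                # evaluate a small subset while experimenting
)
```

The resulting logs can then be browsed in the Inspect log viewer, for example within VS Code as shown above.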

Community Contributions

The evals community shares a keen interest in thoroughly assessing the capabilities and safety characteristics of frontier models. Traditionally, benchmark implementations have been siloed inside many separate organisations, leading to duplicated work, a lack of peer review, and fewer evaluations being run than we'd all like. Inspect Evals is intended to serve as a place for the broader evals community to collaborate on high-quality evaluations that everyone can benefit from.

We are seeking new evaluation contributions and have already merged evals from over a dozen contributors. Visit the project website to learn more and check out our contribution guide to get started!