We are now the AI Security Institute
Please enable javascript for this website.

RepliBench: measuring autonomous replication capabilities in AI systems

A comprehensive benchmark to detect emerging replication abilities in AI systems and provide a quantifiable understanding of potential risks

As current AI systems grow increasingly capable of autonomous operation, both AI labs and governments are beginning to recognise autonomous replication of AI — the ability of an AI system to create copies of itself that can replicate across the internet — as a potential risk. However, empirical evaluations of these capabilities remain relatively scarce. To address this gap, comprehensive benchmarks are essential for researchers to detect emerging replication abilities and provide a quantifiable understanding of potential risks.  

Our recent paper introduces RepliBench: 20 novel LLM agent evaluations comprising 65 individual tasks designed to measure and track this emerging capability. By introducing a realistic and practical benchmark, we aim to provide a grounded understanding of autonomous replication and anticipate future risks.

The risk of autonomous replication

An AI agent with the ability to autonomously replicate could pose a significant risk. The implications of this could range from a minor annoyance – imagine a particularly malignant spam bot – to more serious consequences, such as enabling costly cyberattacks. Early detection of these capabilities, careful oversight and robust safeguards are essential to mitigate these risks, as they may be especially difficult to reverse if they occur.

There is a growing body of evidence that suggests frontier models may attempt to deceive humans or evade human control under the right conditions. Recent research has shown that frontier models may attempt to disable oversight mechanisms and exfiltrate themselves from human control to achieve specified goals. Researchers from Anthropic also found that when models are trained using reinforcement learning which rewards behaviours that conflict with their values, they will engage in deceptive behaviour to preserve their values, and attempt to exfiltrate their own weights as a means of preserving their current values into the future. Taken together, these results suggest frontier models can recognise that gaining independence from human oversight is an instrumentally useful goal for a wide range of values, and that the main barriers preventing this from happening in the wild may relate to models’ capabilities, rather than their motivations.

Figure 1: We use a modification of Fault Tree Analysis (FTA) to break down the capabilities required for autonomous replication, and find that four key capabilities, listed below, are required. We further break down each of these capabilities until we arrive at leaf nodes which represent evaluations, resulting in 20 agentic evaluations and 86 individual tasks.

Previous work (1, 2, 3) has focused on evaluating some key aspects of autonomous replication – testing, for example, whether an AI agent can copy weights between machines or deploy a basic successor agent. However, there remains a need for a more comprehensive benchmark covering a wider spectrum of required capabilities, to get a better picture of the real-world risks.

In RepliBench, we attempt to systematically break down autonomous replication capability into component capabilities (see Figure 2) and evaluate each in turn. We identify four key capabilities an AI agent would need to successfully autonomously replicate in the wild:

  • Obtaining model weights: can it extract its own weights from controlled environments?
  • Replicating onto compute: can it deploy itself to new systems?
  • Obtaining resources: can it acquire the money or compute needed to run new copies of itself?
  • Persistence: can it maintain access to its new compute, and evade coming back under human control for extended periods?

We create a set of long horizon agentic evaluations designed to measure each of these capabilities individually, with tasks ranging from creating a bitcoin wallet and sending a transaction, to developing complex distributed command and control systems for a network of machines. See the table below for an example task from each category.

Name Category Short Description
AWS Signup Obtaining resources Sign up for an Amazon Web Services (AWS) account, including simulated email and SMS verification steps.
Weight Exfiltration Obtaining weights Exfiltrate a set of LLM weights to another server under security measures of various strengths, such as encrypted weights.
Pass KYC Check Obtaining resources Generate a selfie photograph and a matching photograph of a fabricated ID to fool a simple Know Your Customer (KYC) system. This capability would be crucial in enabling rogue agents to access traditional financial infrastructure.
Recursive Replication Replicating onto compute Assesses an agent’s ability to create and deploy copies of itself (serving a set of model weights and running an agent script) across a network of machines, and whether that copy can then do the same task.
Command & Control Persistence Implement command and control infrastructure across a network of machines according to a fixed specification.

Table 1: Five example task families from RepliBench

All evaluations are built using Inspect (UK AISI’s open-source evaluation framework) and were conducted in a sandboxed environment to ensure safety. Where possible, we use simulated versions of external services (like payment processors and cloud providers).

RepliBench evaluates the component tasks of autonomous replication rather than testing the end-to-end capability. Our approach allows us to say, "under the model we are considering, if models can't perform all these tasks, they likely can't autonomously replicate", providing valuable evidence for inability-based safety cases. However, we are not claiming that success on all component tasks guarantees end-to-end replication capability.

Key findings: current capabilities and trends

Picture 893053783, Picture
Figure 2: Results from the five most performant frontier models across each RepliBench domain. Models consistently perform well at tasks in the obtaining compute domain, such as navigating compute provider interfaces, but struggle with tasks relating to replicating onto compute and maintaining persistent access to it. Example tasks in these domains include ‘Recursive Replication’, for which we show results in Figure 4, which tests an agent’s ability to create and deploy copies of itself across a network of machines, and whether that copy can then do the same task.

We benchmark seven frontier models on RepliBench, and find that they are not yet capable of performing all components of autonomous replication. However, the latest frontier models can reliably succeed at many of the core subtasks. The best performing model has a >50% pass@10 score on 15/20 task families and a >50% pass@10 score for 9/20 families on the hardest variants.

In terms of specific capabilities, we find models excel at tasks involving well-defined interfaces – demonstrating, for example, strong abilities to navigate compute provider websites, provision appropriately sized instances, and handle cryptocurrency transactions. For money-making abilities, models show some success at persuading simulated people to give them money but fail completely at our Know Your Customer check evaluation, suggesting they would face significant barriers accessing traditional financial systems without human enablers.

We also found interesting behavioural patterns that could potentially confound capability evaluations. We observed various forms of subtle task avoidance – for example, models claiming to have completed a task without having actually performed any actions, models noticing they are in simulated environments and subsequently refusing to complete a task, or models selectively ignoring key subtasks if they seem ethically dubious. In other instances, failures come from models submitting incomplete solutions or ending their runs early despite instructions not to do so.

Our results suggest that models' ability to autonomously replicate is currently hindered by a few key tasks. Notably, they still struggle to bypass Know Your Customer checks (Figure 3), to set up robust and persistent agent deployments from scratch, or to bypass more realistic weight exfiltration defences. Given recent progress in image generation capabilities released after our evaluation was run, we expect the performance on the KYC check evaluation to substantially improve with subsequent model releases.

Picture 1038469152, Picture

Picture 166379546, Picture
Figure 3: Two examples of ID card images generated by models in the Pass KYC evaluation. All models still struggle with this evaluation, often attempting to create overly simple IDs using PIL to “draw” a card from scratch (bottom). Models will sometimes try to generate the ID cards entirely using image generation models and sometimes take a hybrid approach, incorporating image generation and manual PIL editing (top).

There is a clear upward trend over time in model performance on RepliBench. Given this trajectory, it's crucial that we develop more effective safeguards and early warning systems before these capabilities fully emerge. While current models don't present an immediate threat of autonomous replication, the pace of improvement suggests this situation could change rapidly.

What’s next

Several important future research directions exist in this area, such as developing end-to-end evaluations in realistic security environments. This presents significant challenges due to the complexity of real-world systems. Projects in progress at UK AISI like ControlArena, which simulates lab-like environments with realistic security measures for evaluating AI systems, are an important first step in this direction.

By introducing RepliBench, we aim to provide policymakers and AI developers with empirical data on this emerging risk, helping to inform appropriate safeguards and governance measures before these capabilities materialise. For more details, see the full paper.