
How we’re addressing the gap between AI capabilities and mitigations

We outline our approach to technical solutions for misuse and loss of control.

As AI capabilities advance, they present security risks that the AI Security Institute (AISI) is working to mitigate and control. This work extends beyond our risk assessments to conducting and funding technical research that actively improves safety and security measures.

However, the gap between capabilities and mitigations is growing fast. This gap exists across different risks, including adversarial actors misusing AI systems to facilitate large-scale harms, the possibility of developers losing control of highly intelligent AI systems, and a range of other potentially destabilising effects AI could have on society. This blog outlines our approach to technical solutions for misuse and loss of control.

We're hiring research scientists and engineers across our Safeguards Analysis team (to counter misuse) as well as our Control and Alignment teams (to counter loss of control).  

To rapidly advance the science of AI safety and security, AISI aims to maximise the amount of high-quality research tackling these problems. This is why we recently established the Challenge Fund, which will provide grants of up to £200,000 to academics and non-profits.

Safeguards Analysis

As AI systems advance, so does the risk of attackers misusing them to cause large-scale harm, or disrupting their operation to damage critical systems. Research that seeks to understand, evaluate, and improve the technical measures designed to address these risks (safeguards) is urgently needed.

We’re excited about progress on three key challenges:

  • Defending hosted frontier AI systems against misuse. When AI systems are hosted by benign actors, a range of safeguards can be used to prevent attackers from using the AI to aid illegal activity. These include model-level safeguards like safety training; system-level safeguards like real-time monitors (a minimal monitoring sketch follows this list); access safeguards like giving vetted users or organisations access to less safeguarded or more advanced systems; and maintenance safeguards like rapid vulnerability remediation. We must also defend alternative routes to model access, such as fine-tuning APIs, and guard against insider threats.
  • Defending against third-party attacks. Exposure to attacker-controlled data during training (“data poisoning”) or inference (“prompt injection”), as well as direct adversarial manipulation of model weights (“model poisoning”), can let adversaries control the actions or goals of otherwise benign AI systems. This could lead to AI agents exfiltrating sensitive information or causing serious harm, or to broad attacks on the availability of critical AI systems.
  • Mitigating misuse of open-weight models. Open-weight models enable a range of defensive and beneficial use-cases. However, once AI model weights are made public, system-level safeguards, access safeguards, and maintenance safeguards are largely infeasible. Model-level safeguards become much more difficult, as users can fine-tune away refusal behaviour or design attacks in a full-information (white-box) setting. New research could look to prevent misuse of open-weight models, or to improve societal resilience to bad actors misusing them.
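
To make the system-level safeguards above concrete, here is a minimal sketch of a real-time monitor wrapping a hosted model. It is an illustration only: `call_model` and `score_harm` are placeholder functions we introduce here, and the keyword check and threshold stand in for a real harm classifier and a tuned policy.

```python
# Minimal sketch of a system-level safeguard: a real-time monitor that sits
# between the user and a hosted model. `call_model` and `score_harm` are
# placeholders (assumptions for illustration), not real APIs.

from dataclasses import dataclass

BLOCK_THRESHOLD = 0.8  # illustrative threshold; would be tuned empirically


@dataclass
class ModeratedResponse:
    text: str
    blocked: bool
    score: float


def call_model(prompt: str) -> str:
    """Placeholder for the hosted frontier model."""
    return f"Model answer to: {prompt}"


def score_harm(text: str) -> float:
    """Placeholder harm score in [0, 1]. A real monitor might be a
    fine-tuned classifier or a second model reviewing the text."""
    flagged_terms = ("synthesise the pathogen", "build the exploit")
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0


def moderated_generate(prompt: str) -> ModeratedResponse:
    # Screen the incoming request first...
    if score_harm(prompt) >= BLOCK_THRESHOLD:
        return ModeratedResponse("Request refused by policy.", True, 1.0)
    # ...then screen the model's output before returning it.
    answer = call_model(prompt)
    score = score_harm(answer)
    if score >= BLOCK_THRESHOLD:
        return ModeratedResponse("Response withheld for review.", True, score)
    return ModeratedResponse(answer, False, score)


if __name__ == "__main__":
    print(moderated_generate("How do Caesar ciphers work?"))
```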

Control

We cannot currently guarantee that highly capable AI systems will act in alignment with human intentions, and misuse safeguards are not designed to resist intentional subversion by misaligned AIs. The field of AI control attempts to design and evaluate control protocols: safety measures that are robust to intentional subversion by AI systems. Promising protocols vary widely, from applications of cyber security to AI monitors and human auditing. Protocols are evaluated empirically by a red team using proxy AIs and environments to simulate what would happen if a misaligned model were deployed.
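
As one concrete example, below is a minimal sketch of a simple control protocol of this kind, trusted monitoring with deferral: an untrusted model proposes an action, a weaker trusted monitor scores how suspicious it looks, and the protocol either executes the action, falls back to a trusted model, or escalates to a human audit. All functions and thresholds are placeholders, not a description of any deployed system.

```python
# Minimal sketch of a trusted-monitoring control protocol with deferral.
# `untrusted_propose`, `trusted_propose`, `monitor_suspicion`, and
# `human_audit` are placeholders (assumptions for illustration).

AUDIT_THRESHOLD = 0.9   # escalate to a human auditor above this suspicion
DEFER_THRESHOLD = 0.5   # fall back to the trusted model above this suspicion


def untrusted_propose(task: str) -> str:
    """Placeholder for the capable but possibly misaligned model."""
    return f"untrusted solution for {task}"


def trusted_propose(task: str) -> str:
    """Placeholder for a weaker but trusted model."""
    return f"trusted (weaker) solution for {task}"


def monitor_suspicion(action: str) -> float:
    """Trusted monitor scoring how suspicious the proposed action looks, in [0, 1]."""
    return 0.1  # placeholder score


def human_audit(action: str) -> bool:
    """Human auditor decides whether the action is safe to run."""
    return True  # placeholder decision


def control_protocol(task: str) -> str:
    action = untrusted_propose(task)
    suspicion = monitor_suspicion(action)
    if suspicion >= AUDIT_THRESHOLD:
        # The most suspicious actions are audited by a human before execution.
        return action if human_audit(action) else trusted_propose(task)
    if suspicion >= DEFER_THRESHOLD:
        # Moderately suspicious actions are replaced by the trusted model's output.
        return trusted_propose(task)
    return action


print(control_protocol("refactor the data pipeline"))
```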

Since AI control is a nascent field, there are many questions to be addressed before protocols and evaluations can be applied to advanced AI agents. We are particularly focused on the following areas:

  • Developing realistic evaluations for estimating safety is crucial for iterating on protocol design, and requires solving two challenges:
    • Developing proxy environments that closely reflect key deployments, especially environments for research-engineering AIs that mirror the code and infrastructure at frontier labs.
    • Creating more realistic red-team AIs, moving from stress-testing with human-written scaffolds to in-weight methods.
  • Designing control protocols that meet the challenges of more realistic deployment settings, more autonomous agents, and higher levels of capability. Future protocols might include white-box techniques such as linear probes (see the probe sketch after this list), ensembles of monitors, hierarchical auditing, and specific anti-collusion mitigations.
  • Investigating human-in-the-loop protocols, where humans are asked to audit AI behaviours flagged as suspicious by AI monitors. Control evaluations may need to estimate how accurately humans can audit unsafe behaviour under time constraints.
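
To illustrate one of the white-box techniques mentioned above, here is a minimal sketch of a linear probe used as a monitor: a logistic-regression classifier trained on internal activations to assign a suspicion score. The activations here are synthetic stand-ins; in practice they would be extracted from a model's hidden states.

```python
# Minimal sketch of a white-box monitor: a linear probe trained on model
# activations to flag suspicious behaviour. The activations are synthetic
# stand-ins (an assumption for illustration).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64        # assumed hidden-state dimensionality
n_examples = 2000

# Synthetic activations: "suspicious" examples are shifted along one direction.
labels = rng.integers(0, 2, size=n_examples)            # 1 = suspicious
direction = rng.normal(size=d_model)
activations = rng.normal(size=(n_examples, d_model)) + 0.8 * np.outer(labels, direction)

probe = LogisticRegression(max_iter=1000).fit(activations, labels)


def suspicion_score(activation: np.ndarray) -> float:
    """Probability the probe assigns to the 'suspicious' class."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])


print(suspicion_score(activations[0]), labels[0])
```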

Alignment

Our current understanding of AI control methods is that they do not reliably scale past a certain level of AI capability: beyond that point, the challenge becomes too great for human auditors without further scientific progress. Work to meaningfully align AI systems with human intentions is therefore important for future, more capable systems.

Our focus is on foundational work that decomposes the alignment problem into a systematically mapped set of sub-problems and research agendas, so that we can channel resources into them and leverage the collective effort of academia and industry. We are especially excited about funding research towards:

  • Alignment methods: We are interested in improvements over existing methods (including debate, formalisations of interpretability, and behavioural worst-case guarantees) as well as in developing novel alignment methods; a minimal debate sketch follows this list. In combination, alignment methods must address three problems:
    • Scalable oversight: How can we correctly reward desired behaviours beyond humans' ability to efficiently judge them?
    • Interpretability for worst-case guarantees: Given error bounds from training, how do we ensure that random errors do not have correlations or structure that could be exploited to subvert human intent when deployed?
    • Exploration and elicitation: To train AIs on tasks where human supervision is unavailable, we expect to use AI assistance to help humans supervise AI systems. One way to mitigate unforeseen errors by AI systems is to elicit the AI’s maximum performance at the assistance task.
  • Alignment evaluations: Measuring the effectiveness of alignment methods requires developing evaluations that do not rely on human annotation. We are interested in the development of new evaluation methodologies that model imperfect human judgement, e.g. via partial observability, sandwiching, or adversarial initialisations.
  • Automated alignment: Many researchers and AI developers are hoping to partially automate the search for and evaluation of alignment algorithms. We believe this may be feasible for some subproblems of alignment, but that obstacles such as automation collapse may block this approach on other subproblems. We are interested in research that maps or expands the set of subproblems that can be tackled, such as by studying long-horizon generalisation, reward hacking in the code agent setting, etc.
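
As a sketch of what a debate-style scalable-oversight protocol looks like, the snippet below has two debaters argue for opposing answers over a few rounds, while a weaker judge who only reads the transcript picks a winner. The debater and judge functions are placeholders we introduce for illustration, not any particular method.

```python
# Minimal sketch of a debate-style scalable-oversight protocol: two AI
# debaters argue for opposing answers over several rounds, and a weaker judge
# picks the winner. `debater_argue` and `judge_prefers_a` are placeholders.

from typing import List, Tuple


def debater_argue(answer: str, transcript: List[str]) -> str:
    """Placeholder debater producing the next argument for its assigned answer."""
    return f"argument for '{answer}' (round {len(transcript) // 2 + 1})"


def judge_prefers_a(question: str, transcript: List[str]) -> bool:
    """Placeholder weak judge who reads only the transcript, not solving the task."""
    return True


def debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> Tuple[str, List[str]]:
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + debater_argue(answer_a, transcript))
        transcript.append("B: " + debater_argue(answer_b, transcript))
    winner = answer_a if judge_prefers_a(question, transcript) else answer_b
    return winner, transcript


winner, transcript = debate("Is this proof correct?", "yes", "no")
print(winner)
```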

Join us

Over the last year, AISI has built world-class evaluations to assess model capabilities and risks. AISI will now scale up its work to conduct and fund technical research mitigating those risks.

You can be a part of this work: consider applying to our Solutions teams! We’re hiring research scientists and engineers across all of them.