AI as a Biosecurity Risk Amplifier

In 2022, researchers at Collaborations Pharmaceuticals inverted their drug discovery AI’s optimization function from avoiding toxicity to maximizing it. In less than six hours on a consumer laptop, the system generated 40,000 potentially toxic molecules, including VX analogs and novel structures never before synthesized. This was not a hypothetical threat scenario or academic speculation. It was a proof of concept demonstrating that dual-use design tools can be trivially repurposed for harm, though synthesizing these designs in the real world remains a significant barrier.

Learning Objectives
  • Differentiate between “Information Hazards” (LLMs) and “Design Hazards” (BDTs).
  • Analyze the concept of “Tacit Knowledge” and why it acts as a primary barrier to AI-driven bioterrorism.
  • Evaluate the findings of key empirical studies: RAND Red Team, OpenAI/Anthropic uplift assessments, and the Urbina toxic molecule generation experiment.
  • Critique the “Uplift” metrics used by major AI labs to measure biosecurity risk.
  • Assess DNA synthesis screening as a critical chokepoint for AI-enabled biological threats.
  • Apply threat modeling frameworks to different actor categories (state, non-state, lone actors).

Scope of This Chapter

This chapter discusses biosecurity risks at a conceptual level appropriate for education and policy analysis. Consistent with responsible information practices:

  • Omitted: Actionable protocols, specific synthesis routes, exact pathogen sequences
  • Included: Risk frameworks, governance mechanisms, policy recommendations

For detailed biosafety protocols, consult your Institutional Biosafety Committee and relevant regulatory guidance.

The “Amplifier” Thesis: AI technologies function as biosecurity risk amplifiers, not risk creators. The fundamental risks - infectious disease, biological weapons, laboratory accidents - exist independently of AI. What AI changes is the accessibility, speed, and potentially the ceiling of what is achievable.

Two Classes of AI Risk:

  • Large Language Models (LLMs): May lower barriers by democratizing access to dual-use biological knowledge. Current evidence suggests “mild uplift” at most for attack planning.
  • Biological Design Tools (BDTs): May raise the ceiling of what sophisticated actors can achieve by enabling novel pathogen design. The 2022 Urbina experiment generated 40,000 toxic molecules in 6 hours.

Key Evidence:

  • RAND Corporation (2024): No statistically significant difference in biological attack plan viability with vs. without LLM access
  • OpenAI (2024): GPT-4 provides “at most a mild uplift” in biosecurity-relevant tasks
  • Anthropic (2024): Claude 3 models showed uplift for novices in “certain parts” of bioweapons acquisition, but not for experts

Who Benefits Most: AI is most dangerous in the hands of those who are already dangerous. State actors with existing infrastructure, expertise, and resources gain the most; lone actors face persistent “tacit knowledge” barriers.

Critical Chokepoint: DNA synthesis screening remains the primary defense. The 2024 OSTP Framework mandated screening for federally funded research; Executive Order 14292 (May 2025) directed a revised framework with enforcement mechanisms, but the revision deadline passed without replacement. Congress has since introduced H.R. 3029 (NIST screening standards) and the Biosecurity Modernization and Innovation Act (S. 3741, mandatory screening through Commerce Department). See DNA Synthesis Screening for current regulatory status.

Bottom Line: AI is an efficiency tool for the capable, not a capability tool for the incapable. Governance must address both LLMs and BDTs through multi-layered interventions spanning DNA synthesis screening, model evaluations, and international coordination.

Introduction: The AI-Biology Convergence

In July 2023, Anthropic CEO Dario Amodei warned in congressional testimony that AI could “greatly widen the range of actors with the technical capability to conduct a large-scale biological attack” within two to three years. Former UK Prime Minister Rishi Sunak similarly warned that AI could make it easier “to build chemical or biological weapons” and that “terrorist groups could use AI to spread fear and destruction on an even greater scale.”

These concerns are not merely hypothetical. President Biden’s October 2023 Executive Order 14110 on AI Safety explicitly tasked agencies with assessing AI-enabled biosecurity risks. The order notably established a lower compute threshold for models trained primarily on biological data, recognizing these systems warrant heightened oversight.
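As an illustration of how a tiered compute threshold works in practice, here is a minimal sketch using the reporting figures commonly cited for EO 14110 Section 4.2 (10^26 training operations for general models, 10^23 for models trained primarily on biological sequence data). The helper function and its interface are hypothetical, not part of any regulation.

```python
# Reporting thresholds as commonly cited for EO 14110 Sec. 4.2 (training operations).
GENERAL_THRESHOLD_OPS = 1e26          # general dual-use foundation models
BIO_SEQUENCE_THRESHOLD_OPS = 1e23     # models trained primarily on biological sequence data

def requires_reporting(training_ops: float, primarily_bio_sequence_data: bool) -> bool:
    """Hypothetical helper: would a training run trip the applicable reporting threshold?"""
    threshold = BIO_SEQUENCE_THRESHOLD_OPS if primarily_bio_sequence_data else GENERAL_THRESHOLD_OPS
    return training_ops >= threshold

# A model trained with 1e24 operations sits below the general bar but above the biological one.
print(requires_reporting(1e24, primarily_bio_sequence_data=False))  # False
print(requires_reporting(1e24, primarily_bio_sequence_data=True))   # True
```

The three-orders-of-magnitude gap between the two thresholds is the policy expression of the point made later in this chapter: biological design capability can emerge at far smaller training scales than general-purpose language ability.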

However, we must be careful not to confuse speed with possibility.

The media often portrays AI as a tool that will allow a teenager in a basement to engineer a pandemic. This narrative is dangerously distracting. The reality is more nuanced: AI is a risk amplifier. It takes existing capabilities and lowers the cost, time, and expertise required to execute them, but it does not (yet) solve the fundamental physical challenges of biology.

How AI Lowers Barriers: Two Classes of Risk

Jonas Sandbrink’s influential 2023 preprint differentiates two classes of AI tools that pose biosecurity risks: large language models (LLMs) and biological design tools (BDTs). This distinction is crucial because these tool types create different risk profiles and require different mitigation strategies.

Large Language Models (LLMs)

Frontier LLMs such as GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro are trained on natural language data, including scientific literature. According to Sandbrink, LLMs can “democratize access to biological knowledge,” lowering barriers to its misuse. The potential mechanisms include:

  • Information Access: Synthesizing and explaining complex dual-use biological concepts in accessible language
  • Planning Assistance: Helping structure approaches to biological experiments or agent acquisition
  • Lab Assistance: Providing troubleshooting guidance for laboratory procedures
  • Synthesis Evasion: Potentially helping actors circumvent DNA synthesis screening protocols

A widely cited MIT classroom exercise (Soice et al., 2023, preprint) demonstrated these concerns. Students in a biosecurity course tasked LLM chatbots with assisting pandemic pathogen creation. Within one hour, the chatbots suggested four potential pandemic pathogens, explained how they could be generated from synthetic DNA using reverse genetics, supplied the names of DNA synthesis companies unlikely to screen orders, and recommended engaging contract research organizations for those lacking laboratory skills. (Note: This was a classroom demonstration, not an operational feasibility test.)

However, the operational significance of this information access remains contested.

Biological Design Tools (BDTs)

While LLMs get the headlines, Biological Design Tools (BDTs) warrant greater concern. These are AI systems trained specifically on biological data, including protein structures, chemical properties, and genomic sequences, rather than text.

If LLMs are “Google on steroids,” BDTs are “calculators for biology.” They can predict how a protein folds (AlphaFold), how a molecule binds to a receptor, or how to design novel proteins with specific functions (RFdiffusion).

Notable examples include:

  • AlphaFold/AlphaFold3: DeepMind’s protein structure prediction tools; Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry for this work, alongside David Baker for computational protein design
  • RFdiffusion: Enables de novo design of protein structures and functions
  • ESM3 and Similar Models: Bridge gaps between sequence, structure, and function
  • LigandMPNN: Atomic context-conditioned protein sequence design

Unlike LLMs, which primarily lower barriers for less sophisticated actors, BDTs could enable creation of agents “substantially worse than anything seen to date” by expanding the capabilities of already-sophisticated actors. The RAND Europe Global Risk Index (2025) found that AI-enabled biological tools’ “dual-use nature could lower barriers to biological weapon development or raise the ceiling of potential harm by enabling the design of novel biological agents.”

Key risk characteristics of BDTs include:

  • Unprecedented Predictive Accuracy: Reduces time, resources, and expertise required for experimental validation
  • Novel Design Capabilities: Enable creation of pathogens more transmissible, virulent, or capable of evading countermeasures
  • Open-Source Proliferation: Unlike frontier LLMs controlled by major companies, many high-risk BDTs are released open source. The 2025 Global Risk Index for AI-enabled Biological Tools assessed 57 state-of-the-art tools across eight functional categories, scoring each for misuse-relevant capabilities using Red/Amber/Green ratings. Thirteen tools were flagged as “Red,” requiring action, and 61.5% of these are fully open source
  • Lower Compute Requirements: BDTs can be trained with fewer computational resources than frontier LLMs

Case Study: The MegaSyn Experiment (2022)

This is the most critical case study for understanding physical AI risk.

The Context: Researchers at Collaborations Pharmaceuticals used an AI model called “MegaSyn,” designed to avoid toxicity in drug discovery. For a biosecurity conference presentation, they simply flipped the logic - instead of penalizing toxicity, they asked the AI to maximize it.

The Result: In less than 6 hours, running on a standard consumer laptop, the AI generated 40,000 toxic molecules. The list included VX (one of the deadliest nerve agents known), many known chemical warfare agents, and - most worryingly - novel compounds predicted to be equally toxic but structurally distinct.

The Implication: The design barrier for chemical weapons has collapsed. However, the synthesis barrier remains. Knowing the structure of a novel nerve agent is dangerous, but you still need the precursors and a laboratory to synthesize it.
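To make the inverted objective concrete, here is a minimal sketch of a generate-and-score loop in which a single flag flips the selection criterion from avoiding predicted toxicity to seeking it. This is not the MegaSyn implementation; the generator, the toxicity predictor, and the scoring scale are toy placeholders.

```python
import random

def predicted_toxicity(molecule: str) -> float:
    """Stand-in for a learned toxicity predictor (e.g., an LD50 regression model)."""
    random.seed(hash(molecule) % (2**32))
    return random.uniform(0.0, 10.0)  # higher = more toxic, on an arbitrary toy scale

def generate_candidates(n: int) -> list[str]:
    """Stand-in for a generative model proposing candidate structures (e.g., SMILES strings)."""
    return [f"CANDIDATE_{i}" for i in range(n)]

def rank_candidates(candidates: list[str], maximize_toxicity: bool) -> list[str]:
    # Drug discovery mode: prefer LOW predicted toxicity.
    # "Inverted" mode: the identical pipeline, with only the sort order flipped.
    return sorted(candidates, key=predicted_toxicity, reverse=maximize_toxicity)

candidates = generate_candidates(1000)
safest = rank_candidates(candidates, maximize_toxicity=False)[:10]      # normal objective
most_toxic = rank_candidates(candidates, maximize_toxicity=True)[:10]   # inverted objective
```

The point of the sketch is that the inversion amounts to a one-line change in the selection criterion; the hard problems remain the quality of the predictive model and, above all, physical synthesis.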

Citation: Urbina, F., Lentzos, F., Invernizzi, C. et al. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence (2022)

Comparing LLMs and BDTs: Risk Profiles

| Dimension | LLMs | BDTs |
| --- | --- | --- |
| Primary Risk | Lower barriers for novices | Expand capabilities of sophisticated actors |
| Training Data | Natural language, internet text | Biological sequences, protein structures |
| Access Model | Mostly commercial APIs with guardrails | Largely open-source |
| Regulation Focus | Compute thresholds, safety alignment | DNA synthesis screening, structured access |
| Current Evidence | “Mild uplift” at most | Limited public testing; risks remain speculative |

The “Tacit Knowledge” Gap

The primary fear regarding LLMs is that they will serve as “super-mentors” for bioterrorism. To evaluate this, we must understand tacit knowledge - the unwritten, hands-on skills learned only through physical practice.

Examples of tacit knowledge in biology:

  • How to pipette a liquid without creating aerosols
  • How to visually identify a healthy cell culture
  • How to troubleshoot contamination in real-time
  • When an experiment “looks wrong” despite correct protocols

This tacit knowledge cannot be fully conveyed through text, as it requires physical practice under expert guidance. It represents a significant barrier that text-based AI cannot bridge.

The Sociology of Tacit Knowledge

The significance of tacit knowledge barriers is not merely intuitive; it is well-established in the sociology of scientific knowledge. Harry Collins’ foundational research demonstrated that laboratory skills transfer only through direct social contact with practitioners, not through written protocols alone. His studies of laser construction showed that “only those who had significant social contact with successful laser builders could do the job,” regardless of access to written instructions (Collins, Science Studies, 1974).

This finding generalizes across laboratory sciences. What the sociology of science calls “contributory expertise,” the ability to actually do an activity with competence, requires hands-on practice that text cannot convey. Even detailed protocols omit countless judgment calls: when a culture “looks wrong,” how much force to apply during pipetting, what contamination smells like before it is visible.

The biosecurity implication is direct: AI-generated protocols, however detailed, cannot transfer the physical skills required to execute them. A novice following LLM instructions encounters the same barriers that have always separated theoretical knowledge from laboratory competence. As OpenAI’s 2024 evaluation explicitly noted, “information access alone is insufficient to create a biological threat.”

However, emerging multimodal AI capabilities may challenge this barrier. Vision-enabled frontier models (GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, Grok 4.1) can now observe laboratory technique via video and provide real-time correction (“tilt the pipette 45 degrees,” “your flame is too close to the culture”). Whether this translates to meaningful erosion of tacit knowledge barriers remains an open question.

Demonstrated: In December 2025, OpenAI demonstrated GPT-5 iterating on wet-lab experiments in collaboration with biosecurity firm Red Queen Bio. The AI proposed novel protocol modifications, analyzed experimental results, and refined its approach across multiple rounds, achieving a 79-fold improvement in molecular cloning efficiency. Human scientists still executed the physical lab work, but the AI drove the experimental design and troubleshooting. Strict biosecurity safeguards were maintained throughout.

Theoretical implication: This represents a step beyond information access toward operational capability. If AI systems can effectively coach physical laboratory tasks in real time, the tacit knowledge barrier may erode faster than previously anticipated. However, the demonstrated capability (AI-driven experimental design with human execution) differs from the hypothetical concern (AI enabling untrained actors to execute dangerous protocols independently). The gap between these scenarios remains significant.

Case Study: The RAND Corporation Red Team Analysis (2024)

In a landmark study, RAND Corporation researchers conducted a controlled experiment to assess whether LLMs actually helped malicious actors plan biological attacks.

The Setup: They recruited multiple “Red Teams” and gave some access to the open internet only, while others had access to the internet plus an LLM.

The Task: Plan a viable biological attack.

The Result: The study found no statistically significant difference in the quality or viability of the plans generated with AI assistance versus those without.

The Takeaway: The LLMs were helpful for brainstorming, but they did not provide the “secret sauce” needed to bypass technical hurdles. The information was already available through internet search; the AI just summarized it faster.

Read the full report: Mouton, C. A., Lucas, C., & Guest, E. “The Operational Risks of AI in Large-Scale Biological Attacks.” RAND Corporation, 2024

The “Uplift” Debate: What AI Labs Have Found

While RAND found no significant uplift for attack planning, recent technical reports from AI labs suggest capabilities are advancing:

OpenAI (2024):

The GPT-4 bio-risk evaluation observed a mean accuracy uplift of 0.88 on a 10-point scale for experts with GPT-4 access, but the differences were not statistically significant. The study concluded that GPT-4 provides “at most a mild uplift” over standard search engines.
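For readers less familiar with the statistics behind such claims, the sketch below runs the kind of two-sample comparison these uplift studies rely on. The group sizes and scores are invented for illustration and are not OpenAI’s data; with this toy data the uplift is positive but indistinguishable from chance variation, mirroring the qualitative pattern reported.

```python
from scipy import stats

# Hypothetical 10-point accuracy scores for two small expert groups (illustrative only).
internet_only     = [3.0, 6.5, 4.0, 5.5, 2.5, 7.0, 4.5, 3.5, 6.0, 5.0]
internet_plus_llm = [4.0, 7.0, 5.0, 6.5, 3.0, 7.5, 5.5, 4.0, 6.0, 7.0]

mean_uplift = (sum(internet_plus_llm) / len(internet_plus_llm)
               - sum(internet_only) / len(internet_only))

# Welch's t-test: is the observed uplift larger than chance variation would produce?
t_stat, p_value = stats.ttest_ind(internet_plus_llm, internet_only, equal_var=False)
print(f"mean uplift = {mean_uplift:.2f} points, p = {p_value:.2f}")  # modest uplift, p well above 0.05
```

Small samples and noisy scores make modest uplifts hard to distinguish from chance, which is one reason critics argue these evaluations provide only limited evidence in either direction.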

Anthropic (2024):

The Claude 3 Model Card reported that Claude 3 models “substantially increased risk in certain parts of the bioweapons acquisition pathway” for novices but did not appear capable of uplifting experts “to a substantially concerning degree.” These findings informed the capability thresholds in Anthropic’s AI Safety Levels (ASL) framework.

DHS Assessment (2025):

A DHS-commissioned RAND study assessed risks where AI and chemical/biological weapons converge, concluding that mitigating these dangers requires coordinated effort across industry, government, and international stakeholders. The report emphasizes that developers must recognize dual-use potential in their research products, even when not designing with misuse in mind.

Critical Limitation:

Researchers at Epoch AI and elsewhere caution that “existing evaluations only tell us a limited amount.” Current assessments focus on information access rather than operational capability - whether someone can actually execute an attack in the physical world.

These Findings Are Snapshots in Time

The studies cited above represent snapshots of specific model versions tested with specific evaluation methods. AI capabilities are advancing rapidly. “Mild uplift” today may become “significant uplift” with next-generation models. Continuous monitoring and updated evaluations are essential. Citing these results without caveat risks giving policymakers false security about a fast-moving target.

Quantifying Risk from Capability Evaluations

The uplift studies above tell us whether AI improves attack capability, but policymakers need to know how much that matters. A 2025 GovAI report by Luca Righetti provides a framework for translating capability evaluations into quantified risk estimates.

The analysis combines historical case studies, expert elicitation, and reference class forecasting. The key scenario: if AI systems increased the share of STEM Bachelor’s degree holders able to synthesize influenza-level pathogens by 10 percentage points, the annual probability of an epidemic from a lone-wolf bioterrorist attack could rise from 0.15% to 1.0%. This corresponds to approximately 12,000 expected deaths per year, or roughly $100 billion in damages.
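To show how such an expected-harm estimate decomposes, here is a minimal worked calculation. The per-epidemic death toll and dollar value per death below are illustrative assumptions chosen only to reproduce the report’s order of magnitude; they are not figures taken from the report.

```python
# Illustrative decomposition of expected annual harm (all inputs are assumptions).
p_baseline = 0.0015   # 0.15% annual probability of a lone-actor-caused epidemic
p_with_ai  = 0.010    # 1.0% annual probability under the uplift scenario

deaths_per_epidemic = 1.2e6   # assumed severity of an influenza-level epidemic (illustrative)
value_per_death_usd = 8e6     # assumed economic value per death (illustrative)

expected_deaths = p_with_ai * deaths_per_epidemic        # = 12,000 deaths per year
expected_damage = expected_deaths * value_per_death_usd  # ~ $1e11 per year

print(f"expected deaths/year: {expected_deaths:,.0f}")
print(f"expected damages/year: ${expected_damage:,.0f}")
```

The structure, a small annual probability multiplied by a very large conditional loss, is why even fractional-percentage-point shifts in capability translate into large expected harms at the population level.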

Six subject-matter experts and five superforecasters reviewed the methodology, finding similar median estimates. All forecasts displayed high uncertainty bands, and the authors note the continued need for better underlying evidence.

This framework matters for two reasons. First, it moves beyond qualitative terms like “mild uplift” to numbers that inform resource allocation and policy prioritization. Second, it highlights that even modest capability increases can translate to significant population-level risk when applied across many potential actors.

Key Finding: Information Access ≠ Operational Capability

OpenAI’s 2024 study explicitly notes: “information access alone is insufficient to create a biological threat” and “studies of information access alone do not test for success in the physical construction of the threats.”

Physical constraints remain significant. Specialist training and access to well-resourced laboratories remain critical. Estimates suggest the pool of individuals with both the technical skills and the materials access needed to execute a sophisticated biological attack numbers only in the tens of thousands globally - a constraint that persists despite AI assistance.

What We Know vs. What Remains Uncertain

Demonstrated (supported by published evidence):

  • LLMs provide “at most mild uplift” for biological attack planning with current models (RAND 2024, OpenAI 2024)
  • BDTs can generate toxic molecule designs rapidly (MegaSyn 2022 - 40,000 molecules in 6 hours)
  • Tacit knowledge barriers remain significant for laboratory work
  • DNA synthesis screening catches most known pathogen sequences
  • Multimodal AI can provide real-time laboratory coaching via video

Theoretical (plausible but not yet demonstrated):

  • AI-designed sequences evading function-based screening at scale
  • Autonomous AI agents completing wet-lab work without human oversight (see Autonomous AI Agents)
  • LLMs providing “significant uplift” (as opposed to “mild uplift”) for attack capability
  • AI enabling lone actors to overcome tacit knowledge barriers entirely

Unknown (insufficient evidence to assess):

  • Whether next-generation models will cross capability thresholds
  • The true size of the population capable of exploiting AI for biological harm
  • How quickly multimodal AI will erode tacit knowledge barriers
  • Whether cloud laboratories will become accessible to malicious actors

This distinction matters for policy: demonstrated risks warrant immediate action, while theoretical risks require monitoring and contingency planning.

Threat Modeling and Potential Misuse Actors

Understanding who might misuse AI for biological harm is essential for designing effective interventions. The sources of catastrophic biological risks are varied, and historically, policymakers have underappreciated risks posed by well-intentioned scientists.

State Actors

State actors have traditionally been a source of considerable biosecurity risk, most notably the Soviet Union’s Biopreparat program. The U.S. State Department expresses concerns about potential bioweapons capabilities of North Korea, Iran, Russia, and China.

AI Benefit: High. State actors already have laboratories, funding, and tacit knowledge. They do not need AI to teach them basic techniques - they already know them. They can use BDTs to optimize their agents, making them more stable, more resistant to vaccines, or harder to detect.

However, the imprecision of biological weapons means states remain unlikely to field large-scale biological attacks in the near term. AI could theoretically help state actors develop more predictable and targeted weapons, though this remains speculative.

Non-State Actors and Terrorist Groups

Expert opinions on non-state bioweapon risks range widely.

Skeptics contend: Even if viable paths to building bioweapons exist, the practicalities of constructing, storing, and disseminating them are far more complex than most realize. They point to the lack of major bioattacks in recent decades despite chronic warnings.

Pessimists counter: Experiments demonstrate the seeming ease of constructing powerful viruses using commercially available inputs. Highly capable terrorist groups may acquire human experts through compensation or ideological recruitment.

AI Benefit: Medium. LLMs may assist with planning, but significant tacit knowledge gaps persist.

Lone Actors and Non-Experts

AI labs have converged on two key capability thresholds:

  1. Uplifting novices to create or obtain non-novel biological weapons
  2. Uplifting experts to design novel threats

Current LLMs appear to provide more uplift to novices than experts, but the practical significance of this uplift remains debated given persistent tacit knowledge and physical access requirements.

AI Benefit: Low to Medium. AI helps them learn faster, but it cannot buy a centrifuge or grant access to controlled strain repositories. The tacit knowledge gap stops them.

Accidental Misuse by Well-Intentioned Researchers

The long history of laboratory biosafety accidents represents a concern that is often underappreciated. Dual-use research could advance rapidly with AI assistance, raising both the risk of accidents and the chance that information hazards become widely available.

This “legitimate researcher” threat vector differs fundamentally from malicious actors but may be equally or more significant in practice.

The Paradox of Differential Uplift

A consistent finding across evaluations deserves emphasis: AI provides greater relative uplift to novices than experts, but greater absolute capability to experts. This apparent paradox has significant policy implications.

Anthropic’s Claude 3 evaluation found models “substantially increased risk in certain parts of the bioweapons acquisition pathway” for novices but not for experts (Anthropic, 2024). OpenAI reported that “experts were better at extracting useful information from GPT-4 than novices.” RAND found no significant difference in attack plan viability for either group.

The resolution: LLMs help novices with early planning steps (identifying agents, understanding concepts), but this early-stage assistance does not translate to execution capability. Experts already possess what LLMs provide; their constraint is resources and access, not knowledge. Meanwhile, Biological Design Tools operate at a level where sophisticated actors can genuinely expand capabilities, designing novel agents beyond what published literature describes.

This differential matters for governance. Novice-focused concerns (LLMs as “super-mentors”) receive disproportionate attention, while the greater risk may lie in BDTs amplifying state-level or well-resourced programs. As Sandbrink notes, BDTs could enable agents “substantially worse than anything seen to date” in the hands of those already capable of executing designs (Sandbrink, 2023, preprint).

DNA Synthesis Screening: The Critical Chokepoint

DNA synthesis represents a critical “digital-to-physical frontier” where AI-designed biological agents must be converted into physical materials. As a recent Science paper notes: “Synthesis of nucleic acids is a choke point in AI-assisted protein engineering pipelines.”

This makes screening of synthesis orders one of the most tractable intervention points.

The International Gene Synthesis Consortium (IGSC)

The IGSC, founded in 2009, is a voluntary coalition of synthetic DNA providers operating under a Harmonized Screening Protocol. Members together represent a majority of commercial gene synthesis capacity worldwide. The Protocol requires:

  • Screening for sequences of concern (SOCs) matching regulated agents (e.g., Select Agents and Toxins)
  • Customer verification before fulfilling orders

However, significant gaps remain. A study of DNA provider practices found “significant heterogeneity in security practice throughout the field, reflective of the current lack of codified oversight for DNA synthesis.”

Key limitations include:

  • Voluntary Nature: No country legally requires nucleic acid synthesis screening
  • Coverage Gaps: Non-IGSC providers and benchtop synthesis devices may not screen orders
  • AI-Enabled Evasion: Protein design tools could help design sequences that evade current screening while retaining functionality
  • Order Splitting: Screening could be evaded by distributing orders across multiple providers or time periods

The 2024 OSTP Framework

In April 2024, the White House Office of Science and Technology Policy released the Framework for Nucleic Acid Synthesis Screening, implementing Section 4.4(b) of Executive Order 14110.

Key provisions include:

  • Federal Funding Requirement: Agencies funding life sciences research must require procurement of synthetic nucleic acids only from providers adhering to the framework (effective April 26, 2025)
  • Enhanced Screening Window: By October 2026, providers should screen each 50-nucleotide window for SOCs (reduced from the previous 200 bp threshold); a toy illustration of window-based matching follows this list
  • Expanded SOC Definition: Includes sequences “known to contribute to pathogenicity or toxicity, even when not derived from or encoding regulated biological agents”
  • Manufacturer Requirements: Extends screening expectations to benchtop nucleic acid synthesis equipment manufacturers
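To illustrate the window-based screening concept referenced above, here is a minimal sketch of an exact-match check of every 50-nucleotide window in an order against a local list of sequence-of-concern fragments. Real provider screening is far more sophisticated (homology search, protein-level comparison, customer vetting), and the flagged sequence and database here are placeholders, not real pathogen data.

```python
def windows(sequence: str, size: int = 50) -> list[str]:
    """All contiguous windows of the given size (in nucleotides) within a sequence."""
    sequence = sequence.upper()
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def screen_order(order_seq: str, soc_db: set[str], window_size: int = 50) -> bool:
    """Return True if any window of the order matches a sequence-of-concern fragment.

    soc_db is assumed to hold fragments of regulated sequences pre-cut to the same
    window size; exact matching is a toy stand-in for homology-based comparison.
    """
    return any(w in soc_db for w in windows(order_seq, window_size))

# Toy example: build a fragment database from one hypothetical flagged sequence.
flagged = "ATG" + "ACGT" * 30                 # placeholder, not a real pathogen sequence
soc_db = set(windows(flagged, 50))

print(screen_order("ATGC" * 40, soc_db))            # unrelated order -> False
print(screen_order("TT" + flagged + "GG", soc_db))  # embeds the flagged region -> True
```

The sketch also makes the evasion problem visible: an adversary who re-encodes or fragments a sequence of concern defeats naive exact matching, which is why the framework's expanded SOC definition pushes toward function-aware screening and why order splitting across providers remains a gap.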

Policy Update: Framework Under Revision (May 2025)

The 2024 OSTP Framework has been paused. Executive Order 14292 (“Improving the Safety and Security of Biological Research,” May 5, 2025) directed OSTP to revise or replace the framework within 90 days. The new framework must include enforcement mechanisms and be reviewed every four years.

The original April 2025 implementation deadline has passed without the framework taking effect as written. Agencies await new guidance expected by September 2025. For current practitioners, continue adhering to IGSC protocols and monitor federal agency guidance as the policy landscape evolves.

Model-Level Safeguards and Evaluations

Frontier AI labs have begun implementing model-level safety strategies.

DeepMind: AlphaFold Biosecurity Assessment

DeepMind assembled a multidisciplinary panel to assess AlphaFold 3’s biosecurity implications, concluding it “does not significantly elevate risk compared to prior structure prediction tools” while committing to explore additional safeguards. AlphaFold3 was among early adopters of experimental refusal mechanisms, though “these efforts were preliminary and highlighted the challenges of balancing functionality with security.”

Anthropic: AI Safety Levels

Anthropic has developed AI Safety Levels (ASL) tied to capability thresholds. Claude 3’s biological capabilities informed development of their ASL-3 protections.

OpenAI: Preparedness Framework

OpenAI’s Preparedness Framework defines graduated risk thresholds. The “Critical” threshold is breached when “a model enables an expert to develop a highly dangerous novel threat vector” or “provides meaningfully improved assistance that enables anyone to be able to create a known CBRN threat.”

Government Evaluation Efforts

The UK AI Security Institute (renamed from AI Safety Institute in February 2025) and U.S. Center for AI Standards and Innovation (CAISI) (renamed from AI Safety Institute in June 2025) are developing biorisk-related tests and guidance for advanced AI models. Key evaluation approaches include:

  • Virology Capabilities Test (VCT): Multiple-choice questions measuring AI troubleshooting of complex virology protocols. In limited benchmarks, frontier models have scored comparably to or above domain experts on certain question types, though the operational significance of these scores remains debated
  • Human Uplift Trials: Studies measuring whether AI access improves human performance on biosecurity-relevant tasks
  • Red-Team Exercises: Experts role-playing as threat actors to assess operational feasibility
  • WMDP-Bio Benchmark: The Center for AI Safety’s Weapons of Mass Destruction Proxy benchmark includes 1,273 biosecurity-specific multiple-choice questions designed to measure hazardous biological knowledge in LLMs. The benchmark also evaluates unlearning methods (such as Representation Misdirection for Unlearning) that attempt to remove dangerous knowledge while preserving general capabilities (Li, Mazeika, Hendrycks et al., 2024)
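To show what a multiple-choice benchmark evaluation of this kind looks like mechanically, here is a minimal scoring sketch. The question format, model call, and answer-extraction rule are simplified placeholders, not the actual WMDP harness, and the demo item is invented.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    choices: list[str]   # e.g., four options
    answer_index: int    # index of the correct choice

def ask_model(prompt: str) -> str:
    """Placeholder for a model call; a real harness would query an LLM API here."""
    return "A"  # stub: always answers "A"

def score(benchmark: list[MCQ]) -> float:
    letters = "ABCD"
    correct = 0
    for item in benchmark:
        prompt = (item.question + "\n"
                  + "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item.choices))
                  + "\nAnswer with a single letter.")
        reply = ask_model(prompt).strip().upper()[:1]
        correct += int(reply == letters[item.answer_index])
    return correct / len(benchmark)

# A benchmark like WMDP-Bio is simply a large list of such items; accuracy well above
# the 25% random-guess baseline (for four options) indicates recall of hazardous knowledge.
demo = [MCQ("Placeholder question?", ["option 1", "option 2", "option 3", "option 4"], 0)]
print(f"accuracy = {score(demo):.2f}")
```

Running the same benchmark before and after an unlearning method such as Representation Misdirection for Unlearning is how developers quantify whether hazardous knowledge was actually removed without degrading general capability.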

Evaluating Safety Measures: The Evo 2 Case Study

While model-level safeguards are increasingly common, their robustness under adversarial conditions remains underexplored. Evo 2, a genomic foundation model from the Arc Institute trained on over 9 trillion nucleotides, provides an instructive case study. The developers deliberately excluded eukaryotic viral sequences from training data to prevent the model from acquiring capabilities relevant to human pathogen design.

A 2025 Scale AI/SecureBio preprint introduced BIORISKEVAL, a framework for testing whether such data filtering actually works against determined adversaries. The framework evaluates bio-foundation models across three dimensions: sequence modeling capability, mutational effect prediction, and virulence prediction.

The findings suggest data filtering is necessary but not sufficient:

  • Rapid capability recovery via fine-tuning: When researchers fine-tuned Evo 2 on related viral sequences, the model generalized to the filtered virus types within approximately 50 training steps (less than 1 H100 GPU hour). Inter-genus generalization required more compute but remained achievable.
  • Latent knowledge persists despite filtering: Linear probing of Evo 2’s hidden layer representations achieved 0.46 Pearson correlation for virulence prediction, even without any fine-tuning, suggesting the model acquired predictive signals during pretraining despite data exclusion.
  • Modest current risk: The authors emphasize that Evo 2’s predictive capabilities remain too modest for reliable weaponization. Its Spearman correlation of approximately 0.2 for mutational effect prediction is far below the threshold for practical misuse.

These results inform the broader “defense-in-depth” principle: no single safety measure should be assumed robust against adversarial manipulation. Data filtering reduces default capabilities but does not eliminate them when model weights are publicly available. For BDTs released as open-weight models, additional safeguards including DNA synthesis screening and access controls remain essential. For the governance framework proposing how to implement data-level controls at scale, see Training Data Governance: The Biosecurity Data Levels Proposal.
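To clarify what “linear probing of hidden layer representations” means in the Evo 2 findings above, here is a minimal sketch: a frozen model’s embeddings are treated as fixed features, and only a simple linear regressor is fit on top of them to predict a property such as virulence. The embedding function, data, and labels below are synthetic placeholders; the actual evaluation probed Evo 2’s real hidden states against curated virulence data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def embed(x: np.ndarray) -> np.ndarray:
    """Stand-in for frozen-model embeddings (e.g., mean-pooled hidden states)."""
    return x  # the synthetic "sequences" below are already feature vectors

# Synthetic data: 200 "sequences" as 64-dim feature vectors with a weak linear
# relationship to a scalar virulence label, plus noise.
X = rng.normal(size=(200, 64))
true_w = rng.normal(size=64)
y = X @ true_w * 0.1 + rng.normal(size=200)

# Linear probe: fit only a linear map on top of the frozen features.
split = 150
probe = Ridge(alpha=1.0).fit(embed(X[:split]), y[:split])
preds = probe.predict(embed(X[split:]))
r, _ = pearsonr(preds, y[split:])
print(f"held-out Pearson r = {r:.2f}")  # nonzero r means the label is linearly decodable
```

Because the probe adds almost no capacity of its own, any predictive signal it finds must already be present in the frozen representations, which is exactly why a 0.46 correlation from probing a model whose training data excluded the relevant viruses is a meaningful finding.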

Public Health Implications and Preparedness

For public health practitioners, AI-biosecurity risks intersect with existing pandemic preparedness and outbreak response frameworks.

Implications for Surveillance and Detection

AI-designed pathogens could potentially evade existing detection and surveillance systems:

  • Genomic Surveillance: Novel sequences may not match reference databases used for pathogen identification
  • Syndromic Surveillance: Engineered agents with altered clinical presentations may evade pattern recognition
  • Attribution: Distinguishing natural emergence from deliberate release becomes more challenging with AI-optimized designs

Countermeasure Development

Paradoxically, the same AI capabilities that enable threat creation could accelerate countermeasure development. AI-assisted drug discovery and vaccine platform technologies could potentially respond to threats faster than traditional approaches.

However, the asymmetry between offense and defense in biological threats - where creating harm is often easier than preventing it - remains a fundamental challenge. For connections to defensive AI applications, see AI for Biosecurity Defense.

Recommendations for Practitioners and Policymakers

Based on the current evidence and expert recommendations from NTI, CNAS, and the Federation of American Scientists, a multi-layered approach is recommended:

For Public Health Practitioners

  1. Integrate AI-biosecurity awareness into pandemic preparedness planning - Consider scenarios involving AI-designed or AI-optimized biological agents in tabletop exercises
  2. Strengthen genomic surveillance capabilities - Ensure systems can detect novel sequences that may not match existing databases
  3. Engage with dual-use research governance - Participate in Institutional Biosafety Committee (IBC) oversight and stay informed about emerging AI tools
  4. Build relationships with biosecurity experts - Establish connections with organizations like Johns Hopkins Center for Health Security, NTI | bio, and CSET before crises occur

For Policymakers

  1. Mandate DNA synthesis screening - Move beyond voluntary frameworks to legally require screening for all commercial synthesis providers
  2. Fund third-party AI evaluations - Support independent assessment of biological AI tools before release
  3. Develop BDT-specific governance - Recognize that regulations designed for LLMs (compute thresholds) may not adequately address BDTs
  4. Strengthen international coordination - Engage with the Biological Weapons Convention, Australia Group, and WHO frameworks to harmonize global biosecurity norms
  5. Invest in defense as well as prevention - Current U.S. biodefenses are insufficient to address large-scale biological threats; AI could help accelerate countermeasure development

The next chapters in Part IV examine specific aspects of the AI-biosecurity landscape: LLMs and information hazards, AI-enabled pathogen design, and defensive AI applications.

How does AI amplify biosecurity risks?

AI amplifies biosecurity risks by lowering barriers to accessing dual-use biological knowledge and accelerating design capabilities. Large Language Models democratize access to dangerous information while Biological Design Tools enable sophisticated actors to design novel threats. However, AI amplifies existing risks rather than creating fundamentally new ones.

What is the difference between LLMs and BDTs in biosecurity contexts?

LLMs (Large Language Models) lower barriers for novices by making dual-use biological knowledge more accessible through natural language synthesis. BDTs (Biological Design Tools) raise the ceiling of what sophisticated actors can achieve by enabling novel pathogen design using protein structure prediction and biological optimization. LLMs primarily affect information access, while BDTs enable capability expansion.

What did RAND and OpenAI studies find about AI biosecurity uplift?

RAND 2024 found no statistically significant difference in biological attack plan viability with vs. without LLM access. OpenAI 2024 found GPT-4 provides “at most mild uplift” compared to internet search. Anthropic 2024 reported Claude 3 uplifted novices in certain acquisition steps but not experts. All studies measured planning rather than execution capability.

What is DNA synthesis screening and why does it matter for AI biosecurity?

DNA synthesis screening represents a critical “digital-to-physical frontier” where AI-designed biological agents must be converted into physical materials. The 2024 OSTP Framework mandated screening for federally funded research, though implementation faces uncertainty after 2025 policy changes. This chokepoint remains one of the most tractable intervention points despite potential AI-enabled evasion strategies.


This chapter is part of The Biosecurity Handbook. For related content, see Dual-Use Research of Concern, DNA Synthesis Screening, and Red-Teaming AI Systems.