AI as a Biosecurity Risk Amplifier

In 2022, researchers at Collaborations Pharmaceuticals inverted their drug discovery AI’s optimization function from avoiding toxicity to maximizing it. In less than six hours on a consumer laptop, the system generated 40,000 potentially toxic molecules, including VX analogs and novel structures never before synthesized. This was not a hypothetical threat scenario or academic speculation. It was a proof of concept demonstrating that dual-use design tools can be trivially repurposed for harm, though synthesizing these designs in the real world remains a significant barrier.

Learning Objectives
  • Differentiate between “Information Hazards” (LLMs) and “Design Hazards” (BDTs).
  • Analyze the concept of “Tacit Knowledge” and why it acts as a primary barrier to AI-driven bioterrorism.
  • Evaluate the findings of key empirical studies: RAND Red Team, OpenAI/Anthropic uplift assessments, and the Urbina toxic molecule generation experiment.
  • Critique the “Uplift” metrics used by major AI labs to measure biosecurity risk.
  • Assess DNA synthesis screening as a critical chokepoint for AI-enabled biological threats.
  • Apply threat modeling frameworks to different actor categories (state, non-state, lone actors).

Scope of This Chapter

This chapter discusses biosecurity risks at a conceptual level appropriate for education and policy analysis. Consistent with responsible information practices:

  • Omitted: Actionable protocols, specific synthesis routes, exact pathogen sequences
  • Included: Risk frameworks, governance mechanisms, policy recommendations

For detailed biosafety protocols, consult your Institutional Biosafety Committee and relevant regulatory guidance.

The “Amplifier” Thesis: AI technologies function as biosecurity risk amplifiers, not risk creators. The fundamental risks - infectious disease, biological weapons, laboratory accidents - exist independently of AI. What AI changes is the accessibility, speed, and potentially the ceiling of what is achievable.

Two Classes of AI Risk:

  • Large Language Models (LLMs): May lower barriers by democratizing access to dual-use biological knowledge. Current evidence suggests “mild uplift” at most for attack planning.
  • Biological Design Tools (BDTs): May raise the ceiling of what sophisticated actors can achieve by enabling novel pathogen design. The 2022 Urbina experiment generated 40,000 toxic molecules in 6 hours.

Key Evidence:

  • RAND Corporation (2024): No statistically significant difference in biological attack plan viability with vs. without LLM access
  • OpenAI (2024): GPT-4 provides “at most a mild uplift” in biosecurity-relevant tasks
  • Anthropic (2024): Claude 3 models showed uplift for novices in “certain parts” of bioweapons acquisition, but not for experts

Who Benefits Most: AI is most dangerous in the hands of those who are already dangerous. State actors with existing infrastructure, expertise, and resources gain the most; lone actors face persistent “tacit knowledge” barriers.

Critical Chokepoint: DNA synthesis screening remains the primary defense. The 2024 OSTP Framework mandated screening for federally funded research; Executive Order 14292 (May 2025) directed a revised framework with enforcement mechanisms, but the revision deadline passed without replacement. Congress has since introduced H.R. 3029 (NIST screening standards) and the Biosecurity Modernization and Innovation Act (S. 3741, mandatory screening through Commerce Department). See DNA Synthesis Screening for current regulatory status.

Bottom Line: AI is an efficiency tool for the capable, not a capability tool for the incapable. Governance must address both LLMs and BDTs through multi-layered interventions spanning DNA synthesis screening, model evaluations, and international coordination.

Introduction: The AI-Biology Convergence

In July 2023, Anthropic CEO Dario Amodei warned in congressional testimony that AI could “greatly widen the range of actors with the technical capability to conduct a large-scale biological attack” within two to three years. Former UK Prime Minister Rishi Sunak similarly warned that AI could make it easier “to build chemical or biological weapons” and that “terrorist groups could use AI to spread fear and destruction on an even greater scale.”

These concerns are not merely hypothetical. President Biden’s October 2023 Executive Order 14110 on AI Safety explicitly tasked agencies with assessing AI-enabled biosecurity risks. The order notably established a lower compute threshold for models trained primarily on biological data, recognizing these systems warrant heightened oversight.
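As an illustration of how a tiered compute threshold works in practice, here is a minimal sketch using the reporting figures commonly cited for EO 14110 Section 4.2 (10^26 training operations for general models, 10^23 for models trained primarily on biological sequence data). The helper function and its interface are hypothetical, not part of any regulation.

```python
# Reporting thresholds as commonly cited for EO 14110 Sec. 4.2 (training operations).
GENERAL_THRESHOLD_OPS = 1e26          # general dual-use foundation models
BIO_SEQUENCE_THRESHOLD_OPS = 1e23     # models trained primarily on biological sequence data

def requires_reporting(training_ops: float, primarily_bio_sequence_data: bool) -> bool:
    """Hypothetical helper: would a training run trip the applicable reporting threshold?"""
    threshold = BIO_SEQUENCE_THRESHOLD_OPS if primarily_bio_sequence_data else GENERAL_THRESHOLD_OPS
    return training_ops >= threshold

# A model trained with 1e24 operations sits below the general bar but above the biological one.
print(requires_reporting(1e24, primarily_bio_sequence_data=False))  # False
print(requires_reporting(1e24, primarily_bio_sequence_data=True))   # True
```

The three-orders-of-magnitude gap between the two thresholds is the policy expression of the point made later in this chapter: biological design capability can emerge at far smaller training scales than general-purpose language ability.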

However, we must be careful not to confuse speed with possibility.

The media often portrays AI as a tool that will allow a teenager in a basement to engineer a pandemic. This narrative is dangerously distracting. The reality is more nuanced: AI is a risk amplifier. It takes existing capabilities and lowers the cost, time, and expertise required to execute them, but it does not (yet) solve the fundamental physical challenges of biology.

How AI Lowers Barriers: Two Classes of Risk

Jonas Sandbrink’s influential 2023 preprint differentiates two classes of AI tools that pose biosecurity risks: large language models (LLMs) and biological design tools (BDTs). This distinction is crucial because these tool types create different risk profiles and require different mitigation strategies.

Large Language Models (LLMs)

Frontier LLMs such as GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro are trained on natural language data, including scientific literature. According to Sandbrink, LLMs can “democratize access to biological knowledge,” lowering barriers to its misuse. The potential mechanisms include:

  • Information Access: Synthesizing and explaining complex dual-use biological concepts in accessible language
  • Planning Assistance: Helping structure approaches to biological experiments or agent acquisition
  • Lab Assistance: Providing troubleshooting guidance for laboratory procedures
  • Synthesis Evasion: Potentially helping actors circumvent DNA synthesis screening protocols

A widely cited MIT classroom exercise (Soice et al., 2023, preprint) demonstrated these concerns. Students in a biosecurity course tasked LLM chatbots with assisting pandemic pathogen creation. Within one hour, the chatbots suggested four potential pandemic pathogens, explained how they could be generated from synthetic DNA using reverse genetics, supplied the names of DNA synthesis companies unlikely to screen orders, and recommended engaging contract research organizations for those lacking laboratory skills. (Note: This was a classroom demonstration, not an operational feasibility test.)

However, the operational significance of this information access remains contested.

Biological Design Tools (BDTs)

While LLMs get the headlines, Biological Design Tools (BDTs) warrant greater concern. These are AI systems trained specifically on biological data, including protein structures, chemical properties, and genomic sequences, rather than text.

If LLMs are “Google on steroids,” BDTs are “calculators for biology.” They can predict how a protein folds (AlphaFold), how a molecule binds to a receptor, or how to design novel proteins with specific functions (RFdiffusion).

Notable examples include:

  • AlphaFold/AlphaFold3: DeepMind’s protein structure prediction tools; Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry for this work, alongside David Baker for computational protein design
  • RFdiffusion: Enables de novo design of protein structures and functions
  • ESM3 and Similar Models: Bridge gaps between sequence, structure, and function
  • LigandMPNN: Atomic context-conditioned protein sequence design

Unlike LLMs, which primarily lower barriers for less sophisticated actors, BDTs could enable creation of agents “substantially worse than anything seen to date” by expanding the capabilities of already-sophisticated actors. The RAND Europe Global Risk Index (2025) found that AI-enabled biological tools’ “dual-use nature could lower barriers to biological weapon development or raise the ceiling of potential harm by enabling the design of novel biological agents.”

Key risk characteristics of BDTs include:

  • Unprecedented Predictive Accuracy: Reduces time, resources, and expertise required for experimental validation
  • Novel Design Capabilities: Enable creation of pathogens more transmissible, virulent, or capable of evading countermeasures
  • Open-Source Proliferation: Unlike frontier LLMs controlled by major companies, many high-risk BDTs are released open source. The 2025 Global Risk Index for AI-enabled Biological Tools assessed 57 state-of-the-art tools across eight functional categories, scoring each for misuse-relevant capabilities using Red/Amber/Green ratings. Thirteen tools were flagged as “Red,” requiring action, and 61.5% of these are fully open source
  • Lower Compute Requirements: BDTs can be trained with fewer computational resources than frontier LLMs

Case Study: The MegaSyn Experiment (2022)

This is the most critical case study for understanding physical AI risk.

The Context: Researchers at Collaborations Pharmaceuticals used an AI model called “MegaSyn,” designed to avoid toxicity in drug discovery. For a biosecurity conference presentation, they simply flipped the logic - instead of penalizing toxicity, they asked the AI to maximize it.

The Result: In less than 6 hours, running on a standard consumer laptop, the AI generated 40,000 toxic molecules. The list included VX (one of the deadliest nerve agents known), many known chemical warfare agents, and - most worryingly - novel compounds predicted to be equally toxic but structurally distinct.

The Implication: The design barrier for chemical weapons has collapsed. However, the synthesis barrier remains. Knowing the structure of a novel nerve agent is dangerous, but you still need the precursors and a laboratory to synthesize it.
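To make the inverted objective concrete, here is a minimal sketch of a generate-and-score loop in which a single flag flips the selection criterion from avoiding predicted toxicity to seeking it. This is not the MegaSyn implementation; the generator, the toxicity predictor, and the scoring scale are toy placeholders.

```python
import random

def predicted_toxicity(molecule: str) -> float:
    """Stand-in for a learned toxicity predictor (e.g., an LD50 regression model)."""
    random.seed(hash(molecule) % (2**32))
    return random.uniform(0.0, 10.0)  # higher = more toxic, on an arbitrary toy scale

def generate_candidates(n: int) -> list[str]:
    """Stand-in for a generative model proposing candidate structures (e.g., SMILES strings)."""
    return [f"CANDIDATE_{i}" for i in range(n)]

def rank_candidates(candidates: list[str], maximize_toxicity: bool) -> list[str]:
    # Drug discovery mode: prefer LOW predicted toxicity.
    # "Inverted" mode: the identical pipeline, with only the sort order flipped.
    return sorted(candidates, key=predicted_toxicity, reverse=maximize_toxicity)

candidates = generate_candidates(1000)
safest = rank_candidates(candidates, maximize_toxicity=False)[:10]      # normal objective
most_toxic = rank_candidates(candidates, maximize_toxicity=True)[:10]   # inverted objective
```

The point of the sketch is that the inversion amounts to a one-line change in the selection criterion; the hard problems remain the quality of the predictive model and, above all, physical synthesis.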

Citation: Urbina, F., Lentzos, F., Invernizzi, C. et al. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence (2022)

Comparing LLMs and BDTs: Risk Profiles

| Dimension | LLMs | BDTs |
| --- | --- | --- |
| Primary Risk | Lower barriers for novices | Expand capabilities of sophisticated actors |
| Training Data | Natural language, internet text | Biological sequences, protein structures |
| Access Model | Mostly commercial APIs with guardrails | Largely open-source |
| Regulation Focus | Compute thresholds, safety alignment | DNA synthesis screening, structured access |
| Current Evidence | “Mild uplift” at most | Limited public testing; risks remain speculative |

The “Tacit Knowledge” Gap

The primary fear regarding LLMs is that they will serve as “super-mentors” for bioterrorism. To evaluate this, we must understand tacit knowledge - the unwritten, hands-on skills learned only through physical practice.

Examples of tacit knowledge in biology:

  • How to pipette a liquid without creating aerosols
  • How to visually identify a healthy cell culture
  • How to troubleshoot contamination in real-time
  • When an experiment “looks wrong” despite correct protocols

This tacit knowledge cannot be fully conveyed through text, as it requires physical practice under expert guidance. It represents a significant barrier that text-based AI cannot bridge.

The Sociology of Tacit Knowledge

The significance of tacit knowledge barriers is not merely intuitive; it is well-established in the sociology of scientific knowledge. Harry Collins’ foundational research demonstrated that laboratory skills transfer only through direct social contact with practitioners, not through written protocols alone. His studies of laser construction showed that “only those who had significant social contact with successful laser builders could do the job,” regardless of access to written instructions (Collins, Science Studies, 1974).

This finding generalizes across laboratory sciences. What the sociology of science calls “contributory expertise,” the ability to actually do an activity with competence, requires hands-on practice that text cannot convey. Even detailed protocols omit countless judgment calls: when a culture “looks wrong,” how much force to apply during pipetting, what contamination smells like before it is visible.

The biosecurity implication is direct: AI-generated protocols, however detailed, cannot transfer the physical skills required to execute them. A novice following LLM instructions encounters the same barriers that have always separated theoretical knowledge from laboratory competence. As OpenAI’s 2024 evaluation explicitly noted, “information access alone is insufficient to create a biological threat.”

However, emerging multimodal AI capabilities may challenge this barrier. Vision-enabled frontier models (GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, Grok 4.1) can now observe laboratory technique via video and provide real-time correction (“tilt the pipette 45 degrees,” “your flame is too close to the culture”). Whether this translates to meaningful erosion of tacit knowledge barriers remains an open question.

Demonstrated: In December 2025, OpenAI demonstrated GPT-5 iterating on wet-lab experiments in collaboration with biosecurity firm Red Queen Bio. The AI proposed novel protocol modifications, analyzed experimental results, and refined its approach across multiple rounds, achieving a 79-fold improvement in molecular cloning efficiency. Human scientists still executed the physical lab work, but the AI drove the experimental design and troubleshooting. Strict biosecurity safeguards were maintained throughout.

Theoretical implication: This represents a step beyond information access toward operational capability. If AI systems can effectively coach physical laboratory tasks in real time, the tacit knowledge barrier may erode faster than previously anticipated. However, the demonstrated capability (AI-driven experimental design with human execution) differs from the hypothetical concern (AI enabling untrained actors to execute dangerous protocols independently). The gap between these scenarios remains significant.

Case Study: The RAND Corporation Red Team Analysis (2024)

In a landmark study, RAND Corporation researchers conducted a controlled experiment to assess whether LLMs actually helped malicious actors plan biological attacks.

The Setup: They recruited multiple “Red Teams” and gave some access to the open internet only, while others had access to the internet plus an LLM.

The Task: Plan a viable biological attack.

The Result: The study found no statistically significant difference in the quality or viability of the plans generated with AI assistance versus those without.

The Takeaway: The LLMs were helpful for brainstorming, but they did not provide the “secret sauce” needed to bypass technical hurdles. The information was already available through internet search; the AI just summarized it faster.

Read the full report: Mouton, C. A., Lucas, C., & Guest, E. “The Operational Risks of AI in Large-Scale Biological Attacks.” RAND Corporation, 2024

The “Uplift” Debate: What AI Labs Have Found

While RAND found no significant uplift for attack planning, recent technical reports from AI labs suggest capabilities are advancing:

OpenAI (2024):

The GPT-4 bio-risk evaluation observed a mean accuracy uplift of 0.88 on a 10-point scale for experts with GPT-4 access, but the differences were not statistically significant. The study concluded that GPT-4 provides “at most a mild uplift” over standard search engines.
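For readers less familiar with the statistics behind such claims, the sketch below runs the kind of two-sample comparison these uplift studies rely on. The group sizes and scores are invented for illustration and are not OpenAI’s data; with this toy data the uplift is positive but indistinguishable from chance variation, mirroring the qualitative pattern reported.

```python
from scipy import stats

# Hypothetical 10-point accuracy scores for two small expert groups (illustrative only).
internet_only     = [3.0, 6.5, 4.0, 5.5, 2.5, 7.0, 4.5, 3.5, 6.0, 5.0]
internet_plus_llm = [4.0, 7.0, 5.0, 6.5, 3.0, 7.5, 5.5, 4.0, 6.0, 7.0]

mean_uplift = (sum(internet_plus_llm) / len(internet_plus_llm)
               - sum(internet_only) / len(internet_only))

# Welch's t-test: is the observed uplift larger than chance variation would produce?
t_stat, p_value = stats.ttest_ind(internet_plus_llm, internet_only, equal_var=False)
print(f"mean uplift = {mean_uplift:.2f} points, p = {p_value:.2f}")  # modest uplift, p well above 0.05
```

Small samples and noisy scores make modest uplifts hard to distinguish from chance, which is one reason critics argue these evaluations provide only limited evidence in either direction.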

Anthropic (2024):

The Claude 3 Model Card reported that Claude 3 models “substantially increased risk in certain parts of the bioweapons acquisition pathway” for novices but did not appear capable of uplifting experts “to a substantially concerning degree.” These findings informed the capability thresholds in Anthropic’s AI Safety Levels (ASL) framework.

DHS Assessment (2025):

A DHS-commissioned RAND study assessed risks where AI and chemical/biological weapons converge, concluding that mitigating these dangers requires coordinated effort across industry, government, and international stakeholders. The report emphasizes that developers must recognize dual-use potential in their research products, even when not designing with misuse in mind.

Critical Limitation:

Researchers at Epoch AI and elsewhere caution that “existing evaluations only tell us a limited amount.” Current assessments focus on information access rather than operational capability - whether someone can actually execute an attack in the physical world.

These Findings Are Snapshots in Time

The studies cited above represent snapshots of specific model versions tested with specific evaluation methods. AI capabilities are advancing rapidly. “Mild uplift” today may become “significant uplift” with next-generation models. Continuous monitoring and updated evaluations are essential. Citing these results without caveat risks giving policymakers false security about a fast-moving target.

Quantifying Risk from Capability Evaluations

The uplift studies above tell us whether AI improves attack capability, but policymakers need to know how much that matters. A 2025 GovAI report by Luca Righetti provides a framework for translating capability evaluations into quantified risk estimates.

The analysis combines historical case studies, expert elicitation, and reference class forecasting. The key scenario: if AI systems increased the share of STEM Bachelor’s degree holders able to synthesize influenza-level pathogens by 10 percentage points, the annual probability of an epidemic from a lone-wolf bioterrorist attack could rise from 0.15% to 1.0%. This corresponds to approximately 12,000 expected deaths per year, or roughly $100 billion in damages.
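To show how such an expected-harm estimate decomposes, here is a minimal worked calculation. The per-epidemic death toll and dollar value per death below are illustrative assumptions chosen only to reproduce the report’s order of magnitude; they are not figures taken from the report.

```python
# Illustrative decomposition of expected annual harm (all inputs are assumptions).
p_baseline = 0.0015   # 0.15% annual probability of a lone-actor-caused epidemic
p_with_ai  = 0.010    # 1.0% annual probability under the uplift scenario

deaths_per_epidemic = 1.2e6   # assumed severity of an influenza-level epidemic (illustrative)
value_per_death_usd = 8e6     # assumed economic value per death (illustrative)

expected_deaths = p_with_ai * deaths_per_epidemic        # = 12,000 deaths per year
expected_damage = expected_deaths * value_per_death_usd  # ~ $1e11 per year

print(f"expected deaths/year: {expected_deaths:,.0f}")
print(f"expected damages/year: ${expected_damage:,.0f}")
```

The structure, a small annual probability multiplied by a very large conditional loss, is why even fractional-percentage-point shifts in capability translate into large expected harms at the population level.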

Six subject-matter experts and five superforecasters reviewed the methodology, finding similar median estimates. All forecasts displayed high uncertainty bands, and the authors note the continued need for better underlying evidence.

This framework matters for two reasons. First, it moves beyond qualitative terms like “mild uplift” to numbers that inform resource allocation and policy prioritization. Second, it highlights that even modest capability increases can translate to significant population-level risk when applied across many potential actors.

Key Finding: Information Access ≠ Operational Capability

OpenAI’s 2024 study explicitly notes: “information access alone is insufficient to create a biological threat” and “studies of information access alone do not test for success in the physical construction of the threats.”

Physical constraints remain significant. Specialist training and access to well-resourced laboratories remain critical. Estimates suggest the pool of individuals with both the technical skills and the materials access needed to execute a sophisticated biological attack numbers only in the tens of thousands globally - a constraint that persists despite AI assistance.

What We Know vs. What Remains Uncertain

Demonstrated (supported by published evidence):

  • LLMs provide “at most mild uplift” for biological attack planning with current models (RAND 2024, OpenAI 2024)
  • BDTs can generate toxic molecule designs rapidly (MegaSyn 2022 - 40,000 molecules in 6 hours)
  • Tacit knowledge barriers remain significant for laboratory work
  • DNA synthesis screening catches most known pathogen sequences
  • Multimodal AI can provide real-time laboratory coaching via video

Theoretical (plausible but not yet demonstrated):

  • AI-designed sequences evading function-based screening at scale
  • Autonomous AI agents completing wet-lab work without human oversight (see Autonomous AI Agents)
  • LLMs providing “significant uplift” (as opposed to “mild uplift”) for attack capability
  • AI enabling lone actors to overcome tacit knowledge barriers entirely

Unknown (insufficient evidence to assess):

  • Whether next-generation models will cross capability thresholds
  • The true size of the population capable of exploiting AI for biological harm
  • How quickly multimodal AI will erode tacit knowledge barriers
  • Whether cloud laboratories will become accessible to malicious actors

This distinction matters for policy: demonstrated risks warrant immediate action, while theoretical risks require monitoring and contingency planning.

Threat Modeling and Potential Misuse Actors

Understanding who might misuse AI for biological harm is essential for designing effective interventions. The sources of catastrophic biological risks are varied, and historically, policymakers have underappreciated risks posed by well-intentioned scientists.

State Actors

State actors have traditionally been a source of considerable biosecurity risk, most notably the Soviet Union’s Biopreparat program. The U.S. State Department expresses concerns about potential bioweapons capabilities of North Korea, Iran, Russia, and China.

AI Benefit: High. State actors already have laboratories, funding, and tacit knowledge. They do not need AI to teach them basic techniques - they already know them. They can use BDTs to optimize their agents, making them more stable, more resistant to vaccines, or harder to detect.

However, the imprecision of biological weapons means states remain unlikely to field large-scale biological attacks in the near term. AI could theoretically help state actors develop more predictable and targeted weapons, though this remains speculative.

Non-State Actors and Terrorist Groups

Expert opinions on non-state bioweapon risks range widely.

Skeptics contend: Even if viable paths to building bioweapons exist, the practicalities of constructing, storing, and disseminating them are far more complex than most realize. They point to the lack of major bioattacks in recent decades despite chronic warnings.

Pessimists counter: Experiments demonstrate the seeming ease of constructing powerful viruses using commercially available inputs. Highly capable terrorist groups may acquire human experts through compensation or ideological recruitment.

AI Benefit: Medium. LLMs may assist with planning, but significant tacit knowledge gaps persist.

Lone Actors and Non-Experts

AI labs have converged on two key capability thresholds:

  1. Uplifting novices to create or obtain non-novel biological weapons
  2. Uplifting experts to design novel threats

Current LLMs appear to provide more uplift to novices than experts, but the practical significance of this uplift remains debated given persistent tacit knowledge and physical access requirements.

AI Benefit: Low to Medium. AI helps them learn faster, but it cannot buy a centrifuge or grant access to controlled strain repositories. The tacit knowledge gap stops them.

Accidental Misuse by Well-Intentioned Researchers

The long history of laboratory biosafety accidents represents a concern that is often underappreciated. Dual-use research could advance rapidly with AI assistance, raising both the risk of accidents and the chance that information hazards become widely available.

This “legitimate researcher” threat vector differs fundamentally from malicious actors but may be equally or more significant in practice.

The Paradox of Differential Uplift

A consistent finding across evaluations deserves emphasis: AI provides greater relative uplift to novices than experts, but greater absolute capability to experts. This apparent paradox has significant policy implications.

Anthropic’s Claude 3 evaluation found models “substantially increased risk in certain parts of the bioweapons acquisition pathway” for novices but not for experts (Anthropic, 2024). OpenAI reported that “experts were better at extracting useful information from GPT-4 than novices.” RAND found no significant difference in attack plan viability for either group.

The resolution: LLMs help novices with early planning steps (identifying agents, understanding concepts), but this early-stage assistance does not translate to execution capability. Experts already possess what LLMs provide; their constraint is resources and access, not knowledge. Meanwhile, Biological Design Tools operate at a level where sophisticated actors can genuinely expand capabilities, designing novel agents beyond what published literature describes.

This differential matters for governance. Novice-focused concerns (LLMs as “super-mentors”) receive disproportionate attention, while the greater risk may lie in BDTs amplifying state-level or well-resourced programs. As Sandbrink notes, BDTs could enable agents “substantially worse than anything seen to date” in the hands of those already capable of executing designs (Sandbrink, 2023, preprint).

DNA Synthesis Screening: The Critical Chokepoint

DNA synthesis represents a critical “digital-to-physical frontier” where AI-designed biological agents must be converted into physical materials. As a recent Science paper notes: “Synthesis of nucleic acids is a choke point in AI-assisted protein engineering pipelines.”

This makes screening of synthesis orders one of the most tractable intervention points.

The International Gene Synthesis Consortium (IGSC)

The IGSC, founded in 2009, is a voluntary coalition of synthetic DNA providers operating under a Harmonized Screening Protocol. Members together represent a majority of commercial gene synthesis capacity worldwide. The Protocol requires:

  • Screening for sequences of concern (SOCs) matching regulated agents (e.g., Select Agents and Toxins)
  • Customer verification before fulfilling orders

However, significant gaps remain. A study of DNA provider practices found “significant heterogeneity in security practice throughout the field, reflective of the current lack of codified oversight for DNA synthesis.”

Key limitations include:

  • Voluntary Nature: No country legally requires nucleic acid synthesis screening
  • Coverage Gaps: Non-IGSC providers and benchtop synthesis devices may not screen orders
  • AI-Enabled Evasion: Protein design tools could help design sequences that evade current screening while retaining functionality
  • Order Splitting: Screening could be evaded by distributing orders across multiple providers or time periods

The 2024 OSTP Framework

In April 2024, the White House Office of Science and Technology Policy released the Framework for Nucleic Acid Synthesis Screening, implementing Section 4.4(b) of Executive Order 14110.

Key provisions include:

  • Federal Funding Requirement: Agencies funding life sciences research must require procurement of synthetic nucleic acids only from providers adhering to the framework (effective April 26, 2025)
  • Enhanced Screening Window: By October 2026, providers should screen each 50-nucleotide window for SOCs (reduced from the previous 200 bp threshold); a toy illustration of window-based matching follows this list
  • Expanded SOC Definition: Includes sequences “known to contribute to pathogenicity or toxicity, even when not derived from or encoding regulated biological agents”
  • Manufacturer Requirements: Extends screening expectations to benchtop nucleic acid synthesis equipment manufacturers
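To illustrate the window-based screening concept referenced above, here is a minimal sketch of an exact-match check of every 50-nucleotide window in an order against a local list of sequence-of-concern fragments. Real provider screening is far more sophisticated (homology search, protein-level comparison, customer vetting), and the flagged sequence and database here are placeholders, not real pathogen data.

```python
def windows(sequence: str, size: int = 50) -> list[str]:
    """All contiguous windows of the given size (in nucleotides) within a sequence."""
    sequence = sequence.upper()
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def screen_order(order_seq: str, soc_db: set[str], window_size: int = 50) -> bool:
    """Return True if any window of the order matches a sequence-of-concern fragment.

    soc_db is assumed to hold fragments of regulated sequences pre-cut to the same
    window size; exact matching is a toy stand-in for homology-based comparison.
    """
    return any(w in soc_db for w in windows(order_seq, window_size))

# Toy example: build a fragment database from one hypothetical flagged sequence.
flagged = "ATG" + "ACGT" * 30                 # placeholder, not a real pathogen sequence
soc_db = set(windows(flagged, 50))

print(screen_order("ATGC" * 40, soc_db))            # unrelated order -> False
print(screen_order("TT" + flagged + "GG", soc_db))  # embeds the flagged region -> True
```

The sketch also makes the evasion problem visible: an adversary who re-encodes or fragments a sequence of concern defeats naive exact matching, which is why the framework's expanded SOC definition pushes toward function-aware screening and why order splitting across providers remains a gap.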

Policy Update: Framework Under Revision (May 2025)

The 2024 OSTP Framework has been paused. Executive Order 14292 (“Improving the Safety and Security of Biological Research,” May 5, 2025) directed OSTP to revise or replace the framework within 90 days. The new framework must include enforcement mechanisms and be reviewed every four years.

The original April 2025 implementation deadline has passed without the framework taking effect as written. Agencies await new guidance expected by September 2025. For current practitioners, continue adhering to IGSC protocols and monitor federal agency guidance as the policy landscape evolves.

Model-Level Safeguards and Evaluations

Frontier AI labs have begun implementing model-level safety strategies.

DeepMind: AlphaFold Biosecurity Assessment

DeepMind assembled a multidisciplinary panel to assess AlphaFold 3’s biosecurity implications, concluding it “does not significantly elevate risk compared to prior structure prediction tools” while committing to explore additional safeguards. AlphaFold3 was among early adopters of experimental refusal mechanisms, though “these efforts were preliminary and highlighted the challenges of balancing functionality with security.”

Anthropic: AI Safety Levels

Anthropic has developed AI Safety Levels (ASL) tied to capability thresholds. Claude 3’s biological capabilities informed development of their ASL-3 protections.

OpenAI: Preparedness Framework

OpenAI’s Preparedness Framework defines graduated risk thresholds. The “Critical” threshold is breached when “a model enables an expert to develop a highly dangerous novel threat vector” or “provides meaningfully improved assistance that enables anyone to be able to create a known CBRN threat.”

Government Evaluation Efforts

The UK AI Security Institute (renamed from AI Safety Institute in February 2025) and U.S. Center for AI Standards and Innovation (CAISI) (renamed from AI Safety Institute in June 2025) are developing biorisk-related tests and guidance for advanced AI models. Key evaluation approaches include:

  • Virology Capabilities Test (VCT): Multiple-choice questions measuring AI troubleshooting of complex virology protocols. In limited benchmarks, frontier models have scored comparably to or above domain experts on certain question types, though the operational significance of these scores remains debated
  • Human Uplift Trials: Studies measuring whether AI access improves human performance on biosecurity-relevant tasks
  • Red-Team Exercises: Experts role-playing as threat actors to assess operational feasibility
  • WMDP-Bio Benchmark: The Center for AI Safety’s Weapons of Mass Destruction Proxy benchmark includes 1,273 biosecurity-specific multiple-choice questions designed to measure hazardous biological knowledge in LLMs. The benchmark also evaluates unlearning methods (such as Representation Misdirection for Unlearning) that attempt to remove dangerous knowledge while preserving general capabilities (Li, Mazeika, Hendrycks et al., 2024)
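To show what a multiple-choice benchmark evaluation of this kind looks like mechanically, here is a minimal scoring sketch. The question format, model call, and answer-extraction rule are simplified placeholders, not the actual WMDP harness, and the demo item is invented.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    choices: list[str]   # e.g., four options
    answer_index: int    # index of the correct choice

def ask_model(prompt: str) -> str:
    """Placeholder for a model call; a real harness would query an LLM API here."""
    return "A"  # stub: always answers "A"

def score(benchmark: list[MCQ]) -> float:
    letters = "ABCD"
    correct = 0
    for item in benchmark:
        prompt = (item.question + "\n"
                  + "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item.choices))
                  + "\nAnswer with a single letter.")
        reply = ask_model(prompt).strip().upper()[:1]
        correct += int(reply == letters[item.answer_index])
    return correct / len(benchmark)

# A benchmark like WMDP-Bio is simply a large list of such items; accuracy well above
# the 25% random-guess baseline (for four options) indicates recall of hazardous knowledge.
demo = [MCQ("Placeholder question?", ["option 1", "option 2", "option 3", "option 4"], 0)]
print(f"accuracy = {score(demo):.2f}")
```

Running the same benchmark before and after an unlearning method such as Representation Misdirection for Unlearning is how developers quantify whether hazardous knowledge was actually removed without degrading general capability.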

Evaluating Safety Measures: The Evo 2 Case Study

While model-level safeguards are increasingly common, their robustness under adversarial conditions remains underexplored. Evo 2, a genomic foundation model from the Arc Institute trained on over 9 trillion nucleotides, provides an instructive case study. The developers deliberately excluded eukaryotic viral sequences from training data to prevent the model from acquiring capabilities relevant to human pathogen design.

A 2025 Scale AI/SecureBio preprint introduced BIORISKEVAL, a framework for testing whether such data filtering actually works against determined adversaries. The framework evaluates bio-foundation models across three dimensions: sequence modeling capability, mutational effect prediction, and virulence prediction.

The findings suggest data filtering is necessary but not sufficient:

  • Rapid capability recovery via fine-tuning: When researchers fine-tuned Evo 2 on related viral sequences, the model generalized to the filtered virus types within approximately 50 training steps (less than 1 H100 GPU hour). Inter-genus generalization required more compute but remained achievable.
  • Latent knowledge persists despite filtering: Linear probing of Evo 2’s hidden layer representations achieved 0.46 Pearson correlation for virulence prediction, even without any fine-tuning, suggesting the model acquired predictive signals during pretraining despite data exclusion.
  • Modest current risk: The authors emphasize that Evo 2’s predictive capabilities remain too modest for reliable weaponization. Its Spearman correlation of approximately 0.2 for mutational effect prediction is far below the threshold for practical misuse.

These results inform the broader “defense-in-depth” principle: no single safety measure should be assumed robust against adversarial manipulation. Data filtering reduces default capabilities but does not eliminate them when model weights are publicly available. For BDTs released as open-weight models, additional safeguards including DNA synthesis screening and access controls remain essential. For the governance framework proposing how to implement data-level controls at scale, see Training Data Governance: The Biosecurity Data Levels Proposal.
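To clarify what “linear probing of hidden layer representations” means in the Evo 2 findings above, here is a minimal sketch: a frozen model’s embeddings are treated as fixed features, and only a simple linear regressor is fit on top of them to predict a property such as virulence. The embedding function, data, and labels below are synthetic placeholders; the actual evaluation probed Evo 2’s real hidden states against curated virulence data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def embed(x: np.ndarray) -> np.ndarray:
    """Stand-in for frozen-model embeddings (e.g., mean-pooled hidden states)."""
    return x  # the synthetic "sequences" below are already feature vectors

# Synthetic data: 200 "sequences" as 64-dim feature vectors with a weak linear
# relationship to a scalar virulence label, plus noise.
X = rng.normal(size=(200, 64))
true_w = rng.normal(size=64)
y = X @ true_w * 0.1 + rng.normal(size=200)

# Linear probe: fit only a linear map on top of the frozen features.
split = 150
probe = Ridge(alpha=1.0).fit(embed(X[:split]), y[:split])
preds = probe.predict(embed(X[split:]))
r, _ = pearsonr(preds, y[split:])
print(f"held-out Pearson r = {r:.2f}")  # nonzero r means the label is linearly decodable
```

Because the probe adds almost no capacity of its own, any predictive signal it finds must already be present in the frozen representations, which is exactly why a 0.46 correlation from probing a model whose training data excluded the relevant viruses is a meaningful finding.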

Public Health Implications and Preparedness

For public health practitioners, AI-biosecurity risks intersect with existing pandemic preparedness and outbreak response frameworks.

Implications for Surveillance and Detection

AI-designed pathogens could potentially evade existing detection and surveillance systems:

  • Genomic Surveillance: Novel sequences may not match reference databases used for pathogen identification
  • Syndromic Surveillance: Engineered agents with altered clinical presentations may evade pattern recognition
  • Attribution: Distinguishing natural emergence from deliberate release becomes more challenging with AI-optimized designs

Countermeasure Development

Paradoxically, the same AI capabilities that enable threat creation could accelerate countermeasure development. AI-assisted drug discovery and vaccine platform technologies could potentially respond to threats faster than traditional approaches.

However, the asymmetry between offense and defense in biological threats - where creating harm is often easier than preventing it - remains a fundamental challenge. For connections to defensive AI applications, see AI for Biosecurity Defense.

Recommendations for Practitioners and Policymakers

Based on the current evidence and expert recommendations from NTI, CNAS, and the Federation of American Scientists, a multi-layered approach is recommended:

For Public Health Practitioners

  1. Integrate AI-biosecurity awareness into pandemic preparedness planning - Consider scenarios involving AI-designed or AI-optimized biological agents in tabletop exercises
  2. Strengthen genomic surveillance capabilities - Ensure systems can detect novel sequences that may not match existing databases
  3. Engage with dual-use research governance - Participate in Institutional Biosafety Committee (IBC) oversight and stay informed about emerging AI tools
  4. Build relationships with biosecurity experts - Establish connections with organizations like Johns Hopkins Center for Health Security, NTI | bio, and CSET before crises occur

For Policymakers

  1. Mandate DNA synthesis screening - Move beyond voluntary frameworks to legally require screening for all commercial synthesis providers
  2. Fund third-party AI evaluations - Support independent assessment of biological AI tools before release
  3. Develop BDT-specific governance - Recognize that regulations designed for LLMs (compute thresholds) may not adequately address BDTs
  4. Strengthen international coordination - Engage with the Biological Weapons Convention, Australia Group, and WHO frameworks to harmonize global biosecurity norms
  5. Invest in defense as well as prevention - Current U.S. biodefenses are insufficient to address large-scale biological threats; AI could help accelerate countermeasure development

The next chapters in Part IV examine specific aspects of the AI-biosecurity landscape: LLMs and information hazards, AI-enabled pathogen design, and defensive AI applications.

How does AI amplify biosecurity risks?

AI amplifies biosecurity risks by lowering barriers to accessing dual-use biological knowledge and accelerating design capabilities. Large Language Models democratize access to dangerous information while Biological Design Tools enable sophisticated actors to design novel threats. However, AI amplifies existing risks rather than creating fundamentally new ones.

What is the difference between LLMs and BDTs in biosecurity contexts?

LLMs (Large Language Models) lower barriers for novices by making dual-use biological knowledge more accessible through natural language synthesis. BDTs (Biological Design Tools) raise the ceiling of what sophisticated actors can achieve by enabling novel pathogen design using protein structure prediction and biological optimization. LLMs primarily affect information access, while BDTs enable capability expansion.

What did RAND and OpenAI studies find about AI biosecurity uplift?

RAND 2024 found no statistically significant difference in biological attack plan viability with vs. without LLM access. OpenAI 2024 found GPT-4 provides “at most mild uplift” compared to internet search. Anthropic 2024 reported Claude 3 uplifted novices in certain acquisition steps but not experts. All studies measured planning rather than execution capability.

What is DNA synthesis screening and why does it matter for AI biosecurity?

DNA synthesis screening represents a critical “digital-to-physical frontier” where AI-designed biological agents must be converted into physical materials. The 2024 OSTP Framework mandated screening for federally funded research, though implementation faces uncertainty after 2025 policy changes. This chokepoint remains one of the most tractable intervention points despite potential AI-enabled evasion strategies.


This chapter is part of The Biosecurity Handbook. For related content, see Dual-Use Research of Concern, DNA Synthesis Screening, and Red-Teaming AI Systems.