HIP-92: Drug Discovery AI Pipeline Standard
Abstract
This proposal defines an end-to-end AI-powered drug discovery pipeline for the Hanzo ecosystem, covering target identification, molecular generation, protein structure prediction, molecular docking, ADMET profiling, and virtual screening. The pipeline unifies tools that are currently fragmented across dozens of academic codebases, proprietary platforms, and incompatible file formats into a single API-driven service.
The system integrates with Hanzo ML (HIP-0057) for model training on molecular datasets, Hanzo Object Storage (HIP-0032) for chemical library management (ZINC, ChEMBL, PubChem), and the Quantum Computing standard (HIP-0070) for quantum chemistry calculations (DFT, molecular dynamics). It exposes a REST API, a Python SDK, and a CLI for computational chemists and ML engineers alike.
Repository: github.com/hanzoai/pharma
Port: 8092 (API)
Binary: hanzo-pharma
Container: hanzoai/pharma:latest
Motivation
The Drug Discovery Problem
Bringing a new drug to market takes an average of 10-15 years and costs $2.6 billion (Tufts CSDD estimate, published 2016). The failure rate is staggering: for every drug that reaches patients, roughly 5,000-10,000 candidate molecules were screened, 250 entered preclinical testing, and 5 entered human clinical trials. The preclinical phase alone -- identifying a target protein, finding molecules that bind to it, and optimizing those molecules for safety -- consumes 3-6 years and hundreds of millions of dollars.
AI can compress the preclinical phase by an estimated 40-60%, and there is early evidence for it. Insilico Medicine reported taking a novel drug candidate from target identification to preclinical candidate nomination in about 18 months, and into Phase I clinical trials in under 30 months (versus the typical 4.5 years). Recursion Pharmaceuticals reports running on the order of 2 million experiments per week using automated biology and ML. Isomorphic Labs (a DeepMind spinoff) uses AlphaFold-derived structural biology to identify drug targets that were previously inaccessible.
The bottleneck is not algorithms -- it is infrastructure. The models exist (diffusion models for molecular generation, ESMFold for protein structure, graph neural networks for property prediction). The data exists (ChEMBL has 2.4 million bioactive compounds, PubChem has 110 million). What does not exist is a unified pipeline that connects these components without months of integration work.
Why Current Tools Are Insufficient
- Fragmentation. A typical computational chemistry workflow uses RDKit for cheminformatics (Python), AutoDock Vina for docking (C++, command-line), GROMACS for molecular dynamics (C/C++), SchNet for property prediction (PyTorch), and custom scripts to convert between file formats (SDF, PDB, MOL2, SMILES). Each tool has its own input format, its own configuration, and its own failure modes. There is no unified API, no shared data model, and no pipeline orchestration.
- GPU underutilization. Molecular docking is embarrassingly parallel -- you dock millions of molecules independently -- yet AutoDock Vina runs on CPU. AutoDock-GPU exists but requires CUDA expertise to deploy. Virtual screening campaigns that could finish in hours on a GPU cluster take weeks on CPU because the tooling was not designed for modern hardware.
- No experiment tracking. When a medicinal chemist asks "why was this molecule selected as a lead candidate?", the answer is buried in Jupyter notebooks, spreadsheets, and email chains. There is no audit trail from target selection through virtual screening through lead optimization. This is not just inconvenient -- it is a regulatory liability. The FDA increasingly expects computational evidence to be reproducible (FDA discussion paper on AI/ML in drug development, 2023).
- AI models are disconnected from the pipeline. A researcher trains a molecular generation model in PyTorch, evaluates generated molecules manually in RDKit, runs docking separately in AutoDock, and checks ADMET properties in yet another tool. The feedback loop -- where docking results inform the next round of generation -- requires manual intervention at every step.
Why Hanzo Cares
Hanzo operates GPU compute infrastructure for AI workloads. Drug discovery is one of the most compute-intensive applications of AI. At the per-GPU throughputs listed under Molecular Docking, a single pass over a 10-million-compound library takes roughly 100 GPU-hours with AutoDock-GPU and roughly 10,000 GPU-hours with DiffDock, before counting multiple conformers, rescoring, or repeated campaigns. Molecular dynamics simulations for lead optimization require sustained GPU allocation for days or weeks.
This is the same infrastructure pattern as LLM training and inference -- GPU scheduling, checkpoint management, experiment tracking, and cost metering -- applied to a different domain. Hanzo ML (HIP-0057) already handles GPU job scheduling. Hanzo Object Storage (HIP-0032) already stores large datasets. The pharma pipeline reuses this infrastructure with domain-specific models and file format support.
Design Philosophy
Chemistry for AI Engineers: A Primer
This section exists because drug discovery AI sits at the intersection of chemistry and machine learning. ML engineers building or operating this pipeline need enough chemistry to understand what the models are doing. This is not a chemistry textbook -- it is the minimum viable understanding.
Atoms and bonds. A molecule is a graph. Atoms are nodes; bonds are edges. Carbon (C), nitrogen (N), oxygen (O), sulfur (S), and hydrogen (H) are the primary atoms in drug molecules. Bonds come in types: single (one shared electron pair), double (two pairs), triple (three pairs), and aromatic (delocalized electrons shared across a ring). The atom types and bond types determine the molecule's chemical properties.
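The graph view can be made concrete with a few lines of Python. The sketch below is illustrative only: it hand-encodes acetic acid (SMILES `CC(=O)O`) as a node list and a weighted edge list, and the `VALENCE` table is a simplification that covers only neutral atoms in their most common valence state.

```python
from collections import Counter

# Typical valences for neutral atoms (a simplification; sulfur, for
# example, can also be tetra- or hexavalent).
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2}

# Acetic acid, SMILES CC(=O)O: atoms are nodes, bonds are weighted edges.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]  # (atom_i, atom_j, bond_order)


def implicit_hydrogens(atoms, bonds):
    """Return how many hydrogens each heavy atom needs to fill its valence."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [VALENCE[a] - u for a, u in zip(atoms, used)]


def molecular_formula(atoms, bonds):
    """Build a formula string with C first, H second, then the rest sorted."""
    counts = Counter(atoms)
    counts["H"] = sum(implicit_hydrogens(atoms, bonds))
    order = ["C", "H"] + sorted(k for k in counts if k not in ("C", "H"))
    return "".join(
        f"{el}{counts[el] if counts[el] > 1 else ''}" for el in order if counts[el]
    )


print(molecular_formula(atoms, bonds))  # C2H4O2
```

Graph neural networks for property prediction consume essentially this structure, with richer per-atom and per-bond features attached to the nodes and edges.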
SMILES notation. SMILES (Simplified Molecular Input Line Entry System) is a string encoding of a molecular graph. Aspirin is CC(=O)Oc1ccccc1C(=O)O. Hydrogens are implicit; matching digits mark ring closures; lowercase letters indicate aromatic atoms; = is a double bond; parentheses indicate branching. SMILES is compact and human-readable but not unique -- the same molecule can have multiple valid SMILES strings. Canonicalization algorithms (RDKit, OpenBabel) produce a single canonical form, but each toolkit has its own, so canonical SMILES from different toolkits should not be mixed.
SELFIES notation. SELFIES (Self-Referencing Embedded Strings) is a more recent alternative to SMILES designed for generative models. Every SELFIES string decodes to a valid molecule, whereas many random SMILES strings are chemically invalid. This property makes SELFIES preferable for language-model-based molecular generation because every output token sequence is guaranteed to be a valid molecule. No post-hoc validity filtering required.
3D structure matters. A SMILES string encodes the molecular graph (2D topology) but not the 3D shape. Drug molecules bind to proteins by fitting into a binding pocket -- a 3D cavity on the protein surface. The molecule's 3D conformation (the spatial arrangement of its atoms) determines whether it fits. A single molecule, described by one SMILES string, can adopt many 3D conformations, and only some of them bind well. This is why molecular docking operates in 3D, not on SMILES strings.
Proteins are the targets. Most drugs work by binding to a protein and either blocking its function (inhibitor) or enhancing it (agonist). A protein is a long chain of amino acids (20 types) that folds into a specific 3D structure. The binding site is a pocket or groove on the protein surface where a drug molecule can nestle. Knowing the protein's 3D structure is essential for structure-based drug design. Until 2020, experimental methods (X-ray crystallography, cryo-EM) were the only reliable way to determine protein structures. AlphaFold changed this by predicting structures from amino acid sequences with near-experimental accuracy.
ADMET. Before a drug candidate reaches human trials, it must pass ADMET screening:
- Absorption: Can the body absorb it? (Oral bioavailability, membrane permeability)
- Distribution: Does it reach the target tissue? (Blood-brain barrier penetration, plasma protein binding)
- Metabolism: How does the liver break it down? (CYP450 enzyme interactions, half-life)
- Excretion: How is it eliminated? (Renal clearance, biliary excretion)
- Toxicity: Is it safe? (hERG channel inhibition causing cardiac arrhythmia, hepatotoxicity, mutagenicity)
ADMET problems are a leading cause of drug candidate failure. Poor pharmacokinetics was historically blamed for roughly 40% of clinical failures; more recent analyses put lack of efficacy first, with toxicity and poor drug-like properties still accounting for a large share of the remainder. Predicting ADMET computationally before synthesizing a molecule saves years and millions of dollars.
Why Diffusion Models for Molecular Generation
Molecular generation is the task of creating novel molecules with desired properties. The dominant approaches are:
Variational Autoencoders (VAEs) encode molecules into a continuous latent space and decode back. They produce valid molecules but struggle with multi-objective optimization (simultaneously optimizing binding affinity, drug-likeness, and synthesizability).
Reinforcement Learning (RL) treats generation as a sequential decision process (adding one atom or bond at a time) with a reward function combining multiple objectives. RL produces high-scoring molecules but is sample-inefficient and prone to mode collapse (generating variations of the same scaffold).
Diffusion models learn to generate 3D molecular structures by reversing a noise process. Starting from random atomic coordinates, the model iteratively denoises to produce valid 3D conformations. The key advantage is that diffusion models natively generate 3D structures, not SMILES strings. This means the generated molecules have realistic 3D geometries that can be directly used for docking -- no separate conformer generation step needed.
For 3D pocket-conditioned generation (generating molecules that fit a specific protein binding site), diffusion models are state of the art. TargetDiff and DiffSBDD condition the diffusion process on the protein pocket structure, producing molecules that are geometrically complementary to the target; Pocket2Mol does the same with an autoregressive rather than a diffusion model. Pocket-conditioned diffusion is the approach we adopt for structure-based drug design.
For property-conditioned generation without a specific target structure, LLM-based generation on SELFIES strings is more practical. A fine-tuned language model generates SELFIES tokens conditioned on desired property profiles (molecular weight range, logP range, number of hydrogen bond donors). The SELFIES guarantee ensures every output is a valid molecule.
Why End-to-End Pipeline Over Best-of-Breed Tools
The alternative to an integrated pipeline is a "best-of-breed" approach: use RDKit for cheminformatics, AutoDock-GPU for docking, DeepChem for property prediction, and glue them together with scripts. This is what most computational chemistry groups do today.
The problem is the glue. Converting between file formats (SDF to PDB to PDBQT to MOL2) is error-prone and lossy. Tracking which molecules passed which filters requires a custom database. Scheduling GPU resources for docking and property prediction requires a custom scheduler. Reproducing a virtual screening campaign requires re-running every script in the right order with the right inputs.
The integrated pipeline eliminates the glue. Molecules flow through the pipeline as structured objects with a canonical representation. Each stage (generation, docking, ADMET, scoring) reads from and writes to the same data model. The pipeline scheduler (built on HIP-0057 job scheduling) handles GPU allocation. Experiment tracking (also from HIP-0057) records every parameter and result.
The tradeoff is flexibility. A researcher who wants to use a cutting-edge docking algorithm published last week cannot plug it in without wrapping it in our API. We mitigate this with a plugin architecture: any Docker container that reads molecules from stdin and writes scored molecules to stdout can be registered as a pipeline stage.
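A plugin stage under this contract could look like the sketch below. The proposal only fixes the stdin/stdout convention; the JSON Lines framing, the `smiles` and `score` field names, and the `score_molecule` placeholder are all assumptions made for illustration.

```python
import io
import json


def score_molecule(smiles: str) -> float:
    """Placeholder scorer: counts letters in the SMILES string.

    A real plugin would call its docking or scoring engine here.
    """
    return float(sum(ch.isalpha() for ch in smiles))


def run_stage(infile, outfile) -> int:
    """Read one JSON record per line, attach a score, and write it back out."""
    processed = 0
    for line in infile:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        record["score"] = score_molecule(record["smiles"])
        outfile.write(json.dumps(record) + "\n")
        processed += 1
    return processed


# Local dry run with in-memory streams instead of real stdin/stdout.
demo_in = io.StringIO('{"smiles": "CCO"}\n{"smiles": "c1ccccc1"}\n')
demo_out = io.StringIO()
run_stage(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Inside a container the entry point would call `run_stage(sys.stdin, sys.stdout)`; keeping the stream arguments explicit makes the stage testable without Docker.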
Specification
Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│                       Hanzo Pharma API (8092)                        │
│                                                                      │
│  ┌───────────┐ ┌───────────┐ ┌──────────┐  ┌────────┐ ┌───────────┐  │
│  │ Molecular │ │  Protein  │ │Molecular │  │ ADMET  │ │  Virtual  │  │
│  │ Generator │ │ Structure │ │ Docking  │  │Predict │ │ Screening │  │
│  └─────┬─────┘ └─────┬─────┘ └────┬─────┘  └───┬────┘ └─────┬─────┘  │
│        │             │            │            │            │        │
│  ┌─────┴─────────────┴────────────┴────────────┴────────────┴─────┐  │
│  │                     Pipeline Orchestrator                      │  │
│  │                (DAG execution, GPU scheduling)                 │  │
│  └─────┬──────────────┬───────────────────────┬───────────────────┘  │
└────────┼──────────────┼───────────────────────┼──────────────────────┘
         │              │                       │
┌────────┴────┐  ┌──────┴──────┐  ┌─────────────┴──────────────┐
│ Hanzo ML    │  │Hanzo Object │  │ Hanzo Quantum (HIP-0070)   │
│ (HIP-0057)  │  │Storage      │  │ DFT, Molecular Dynamics    │
│ GPU Jobs    │  │(HIP-0032)   │  │                            │
│ Experiments │  │Chemical Libs│  │                            │
└─────────────┘  └─────────────┘  └────────────────────────────┘
The pipeline is a stateless Go API backed by PostgreSQL for metadata, Object Storage for molecular data and model weights, and Kubernetes for GPU compute jobs. Each module (generation, structure prediction, docking, ADMET, screening) is an independent service that can run as a pipeline stage or be called directly via the API.
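To illustrate what "DAG execution" means for the orchestrator, the sketch below orders a small stage graph with Python's standard `graphlib`. The stage names and their dependencies are hypothetical, and the actual orchestrator is described as a Go service, so this shows the scheduling idea rather than the implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
stages = {
    "structure_prediction": set(),
    "generation": {"structure_prediction"},
    "docking": {"structure_prediction", "generation"},
    "admet": {"generation"},
    "scoring": {"docking", "admet"},
}


def execution_batches(graph):
    """Yield groups of stages whose dependencies have all completed.

    Stages within one batch are independent and could run concurrently.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


batches = list(execution_batches(stages))
for step, batch in enumerate(batches):
    print(step, batch)
```

Here docking and ADMET prediction land in the same batch, which is where a scheduler would decide how to split the available GPUs between them.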
Molecular File Formats
The pipeline must read and write the file formats that computational chemistry actually uses. These are not interchangeable -- each encodes different information.
Supported Formats:
  SMILES:
    description: Line notation for molecular graphs (2D topology)
    extension: .smi
    use_case: Database storage, text-based ML models, compact representation
    limitations: No 3D coordinates, no stereochemistry in basic form
    example: "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
  SELFIES:
    description: Self-referencing embedded strings (always valid molecules)
    extension: .selfies
    use_case: Generative language models (every token sequence is valid)
    limitations: Less human-readable than SMILES
    example: "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"
  SDF (Structure Data File):
    description: 2D/3D coordinates + properties for one or more molecules
    extension: .sdf, .mol
    use_case: Chemical databases, property storage, multi-molecule files
    encodes: Atom positions, bond types, charges, arbitrary property fields
    size: ~1-5 KB per molecule
  PDB (Protein Data Bank):
    description: 3D coordinates for proteins and protein-ligand complexes
    extension: .pdb
    use_case: Protein structures, docking results, molecular dynamics input
    encodes: Atom positions, residue names, chain IDs, B-factors
    source: RCSB PDB (rcsb.org), AlphaFold DB, ESMFold predictions
  MOL2 (Tripos):
    description: 3D coordinates with atom types and partial charges
    extension: .mol2
    use_case: Docking input (some engines require MOL2), force field assignment
    encodes: Sybyl atom types, partial charges, bond orders
  PDBQT:
    description: PDB format with partial charges and AutoDock atom types
    extension: .pdbqt
    use_case: AutoDock-GPU input/output
    note: Pipeline handles PDBQT conversion internally; users work with PDB/SDF
Canonical internal representation. Internally, the pipeline stores molecules as a structured object combining SMILES (for identity and deduplication), 3D coordinates (when available), and computed properties. File format conversion happens at API boundaries -- the user uploads SDF, the pipeline converts; the user requests PDB output, the pipeline converts.
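One way to picture the canonical object is the dataclass below. It is a minimal sketch, not the pipeline's schema: the field names are hypothetical, the SMILES is assumed to have been canonicalized upstream, and preferring records that carry 3D coordinates during deduplication is an illustrative policy choice.

```python
from dataclasses import dataclass, field
from typing import Optional

Coords = list[tuple[float, float, float]]


@dataclass
class Molecule:
    canonical_smiles: str            # identity key, assumed already canonical
    coords: Optional[Coords] = None  # one (x, y, z) per atom, when available
    properties: dict[str, float] = field(default_factory=dict)


def deduplicate(molecules):
    """Keep one record per canonical SMILES, preferring entries with 3D coords."""
    best = {}
    for mol in molecules:
        kept = best.get(mol.canonical_smiles)
        if kept is None or (kept.coords is None and mol.coords is not None):
            best[mol.canonical_smiles] = mol
    return list(best.values())


library = [
    Molecule("CCO"),
    Molecule("CCO", coords=[(0.0, 0.0, 0.0)]),
    Molecule("CC(=O)O", properties={"mw": 60.05}),
]
unique = deduplicate(library)
print([m.canonical_smiles for m in unique])
```

Because identity lives in one field, every stage can deduplicate and join results without caring which file format the molecule arrived in.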
Molecular Generation
Two generation modes serve different design strategies.
3D Diffusion Generation (Structure-Based)
For targets with known protein structures, the pipeline generates molecules conditioned on the binding pocket geometry.
3D Diffusion Configuration:
  model: "hanzo-pharma-diffdock-gen" # Pocket-conditioned diffusion model
  pocket:
    protein_pdb: "s3://pharma/targets/EGFR/4HJO.pdb"
    pocket_residues: [718, 719, 720, 790, 791, 792, 855, 856] # Binding site residues
    pocket_radius: 10.0 # Angstroms around pocket center
  generation:
    num_molecules: 1000 # Generate 1000 candidates
    temperature: 1.0 # Sampling temperature
    guidance_scale: 2.0 # Classifier-free guidance strength
    atom_types: [C, N, O, S, F, Cl] # Allowed atom types
    max_atoms: 50 # Maximum heavy atoms per molecule
  constraints:
    molecular_weight: [200, 500] # Daltons (Lipinski's Rule of Five)
    logP: [-0.4, 5.6] # Octanol-water partition coefficient
    hbd: [0, 5] # Hydrogen bond donors <= 5
    hba: [0, 10] # Hydrogen bond acceptors <= 10
    rotatable_bonds: [0, 10] # Flexibility constraint
    synthetic_accessibility: [1, 5] # SA score (1=easy, 10=impossible)
  output:
    format: sdf # 3D coordinates included
    deduplicate: true # Remove duplicate SMILES
    minimize: true # Energy-minimize generated conformations
The diffusion model operates on atomic point clouds in 3D space. Starting from Gaussian noise placed within the protein pocket, the model iteratively denoises atom positions and types over T timesteps. The pocket structure is provided as context (not denoised) and acts as a spatial constraint -- the generated atoms must form a molecule that fits the pocket geometry.
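The `constraints` block of the configuration above amounts to a set of range checks on computed properties. A minimal sketch of such a check follows; treating the bounds as inclusive and treating a missing property as a failure are assumptions, and the candidate values are made up for illustration.

```python
# Ranges copied from the constraints block of the 3D diffusion configuration.
CONSTRAINTS = {
    "molecular_weight": (200, 500),
    "logP": (-0.4, 5.6),
    "hbd": (0, 5),
    "hba": (0, 10),
    "rotatable_bonds": (0, 10),
    "synthetic_accessibility": (1, 5),
}


def violations(props, constraints=CONSTRAINTS):
    """Return the names of constraints that a property dict fails.

    Bounds are inclusive; a property that is absent counts as a failure.
    """
    failed = []
    for name, (low, high) in constraints.items():
        value = props.get(name)
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed


candidate = {
    "molecular_weight": 350.4,
    "logP": 2.5,
    "hbd": 2,
    "hba": 5,
    "rotatable_bonds": 4,
    "synthetic_accessibility": 3.1,
}
too_greasy = dict(candidate, molecular_weight=612.0, logP=6.3)

print(violations(candidate))
print(violations(too_greasy))
```

Returning the list of failed constraints, rather than a bare boolean, keeps the rejection reason available for the audit trail.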
SELFIES Language Model Generation (Ligand-Based)
For property-conditioned generation without a specific target structure.
SELFIES LM Configuration:
  model: "hanzo-pharma-selfies-gen" # Fine-tuned on ChEMBL actives
  conditioning:
    target_activity: "EGFR_inhibitor" # Activity class from ChEMBL
    property_profile:
      molecular_weight: 350 # Target MW (Daltons)
      logP: 2.5 # Target lipophilicity
      tpsa: 80 # Topological polar surface area
      qed: 0.7 # Quantitative Estimate of Drug-likeness
  generation:
    num_molecules: 10000
    max_tokens: 128 # Max SELFIES token length
    temperature: 0.8
    top_p: 0.95
    batch_size: 256 # Generate in batches on GPU
  filtering:
    validity_check: true # Verify SELFIES -> valid molecule
    novelty_check: true # Not in training set
    diversity_threshold: 0.3 # Tanimoto distance minimum between outputs
This mode leverages HIP-0057 for model training. The base model is a transformer trained on SELFIES representations of the ChEMBL database (~2.4 million molecules). Fine-tuning on subsets (e.g., known kinase inhibitors) conditions the model to generate molecules with target-relevant scaffolds.
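The `diversity_threshold` filter can be sketched as a greedy pass over fingerprints. The example below represents each fingerprint as a set of "on" bit indices and uses made-up bit sets; how the pipeline actually computes fingerprints and orders candidates is not specified, so both are assumptions here.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |A & B| / |A | B| for two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def diverse_subset(fingerprints, threshold=0.3):
    """Greedily keep indices at least `threshold` away from all kept items."""
    kept = []
    for idx, fp in enumerate(fingerprints):
        if all(tanimoto_distance(fp, fingerprints[k]) >= threshold for k in kept):
            kept.append(idx)
    return kept


fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 4, 5}),  # distance 0.2 from the first: dropped
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 9}),
]
print(diverse_subset(fps))  # [0, 2, 3]
```

The greedy pass is order-dependent, so sorting candidates by predicted quality first means the best member of each near-duplicate cluster is the one that survives.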
Protein Structure Prediction
When the target protein's experimental structure is unavailable (true for roughly 70% of human proteins), the pipeline predicts it from the amino acid sequence.
Structure Prediction Configuration:
  engine: "esmfold" # esmfold | openfold
  sequence: "MTEYKLVVVGAGGVGKSALTIQLIQ..." # Amino acid sequence
  options:
    num_recycles: 4 # Refinement iterations
    chunk_size: 128 # Sequence chunks for memory efficiency
Supported Engines:
  ESMFold:
    description: Single-sequence structure prediction from Meta AI
    speed: ~1 second per protein (GPU)
    accuracy: Comparable to AlphaFold for well-folded domains
    advantage: No MSA (multiple sequence alignment) required -- 60x faster
    gpu_memory: ~16 GB for proteins up to 1000 residues
    use_case: Rapid screening, large-scale structure prediction
  OpenFold:
    description: Open-source reimplementation of AlphaFold2
    speed: ~5-10 minutes per protein (with MSA generation)
    accuracy: Highest accuracy, matches AlphaFold2
    advantage: Full MSA pipeline for maximum structural accuracy
    gpu_memory: ~40 GB for large proteins with MSA
    use_case: High-confidence structure prediction for drug targets
Why ESMFold over AlphaFold directly? ESMFold uses a protein language model (ESM-2) to predict structures from single sequences without multiple sequence alignments (MSAs). AlphaFold2 requires MSAs computed by searching sequence databases (UniRef, BFD) -- a process that takes minutes to hours per protein and requires terabytes of database storage. For drug discovery pipelines where you need rapid structure prediction for hundreds of targets, ESMFold's single-sequence approach is 60x faster with comparable accuracy on well-folded domains. OpenFold (the open-source AlphaFold2 reimplementation) is available for cases where maximum accuracy justifies the time cost.
Predicted structure quality. Structure prediction models output a per-residue confidence score (pLDDT for ESMFold/AlphaFold). The pipeline uses pLDDT to assess binding site quality:
- pLDDT > 90: High confidence. Suitable for structure-based drug design.
- pLDDT 70-90: Moderate confidence. Binding site geometry may be approximate.
- pLDDT < 70: Low confidence. The binding site may be disordered; use ligand-based methods instead.
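The tiers above can be applied to a binding site with a small helper. This is a sketch under one assumption the proposal does not settle: per-residue scores are aggregated by taking the mean over the pocket residues (a stricter policy could use the minimum instead). The pLDDT values are invented for the example.

```python
def binding_site_confidence(plddt_by_residue, pocket_residues):
    """Classify a pocket by the mean pLDDT of its residues.

    Tiers follow the list above: > 90 high, 70-90 moderate, < 70 low.
    """
    scores = [plddt_by_residue[r] for r in pocket_residues]
    mean = sum(scores) / len(scores)
    if mean > 90:
        tier = "high"
    elif mean >= 70:
        tier = "moderate"
    else:
        tier = "low"
    return mean, tier


# Hypothetical per-residue confidence values keyed by residue number.
plddt = {718: 95.0, 719: 93.0, 720: 91.0, 790: 60.0}

print(binding_site_confidence(plddt, [718, 719, 720]))
print(binding_site_confidence(plddt, [718, 719, 720, 790]))
```

A single disordered residue pulls the second pocket from the high tier into the moderate one, which is the kind of signal that should steer a campaign toward ligand-based methods.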
Molecular Docking
Docking predicts how a small molecule binds to a protein and estimates binding strength (affinity). This is the core computational step in virtual screening.
Docking Configuration:
  engine: "autodock-gpu" # autodock-gpu | diffdock | vina
  protein:
    pdb_path: "s3://pharma/targets/EGFR/4HJO_prepared.pdbqt"
    center: [22.5, -14.3, 8.7] # Binding site center (Angstroms)
    box_size: [25.0, 25.0, 25.0] # Search box dimensions
  ligands:
    source: "s3://pharma/libraries/screening_set.sdf"
    count: 1000000 # 1M compounds to dock
    preparation:
      add_hydrogens: true
      generate_conformers: 5 # Multiple starting conformations
      assign_charges: "gasteiger" # Partial charge method
  scoring:
    exhaustiveness: 32 # Search thoroughness (higher = slower, better)
    num_poses: 5 # Top binding poses per molecule
    energy_range: 3.0 # kcal/mol range for reported poses
  gpu:
    batch_size: 65536 # Molecules per GPU batch
    devices: 4 # Number of GPUs
Supported Engines:
  AutoDock-GPU:
    description: GPU-accelerated AutoDock4 scoring function
    speed: ~100,000 molecules/hour/GPU (A100)
    accuracy: Well-validated physics-based scoring
    advantage: Fastest traditional docking engine on GPU
    use_case: Large-scale virtual screening (millions of compounds)
  DiffDock:
    description: Diffusion model for blind docking (no predefined box)
    speed: ~1,000 molecules/hour/GPU (generative process is slower)
    accuracy: State-of-the-art blind-docking pose accuracy on PDBBind at publication; poses should be checked for physical validity
    advantage: No manual binding site definition required
    use_case: Novel targets, allosteric sites, difficult binding sites
  Vina:
    description: AutoDock Vina -- CPU-based, widely used baseline
    speed: ~500 molecules/hour/CPU
    accuracy: Reasonable for pose prediction, less reliable for scoring
    advantage: Well-understood, no GPU required
    use_case: Quick validation, fallback when GPU unavailable
Why AutoDock-GPU as the default? Virtual screening requires docking millions of molecules. At 100,000 molecules/hour/GPU, an A100 can screen 1 million compounds in 10 hours. Four A100s finish in 2.5 hours. AutoDock-GPU achieves this throughput by running the Solis-Wets local search algorithm on CUDA cores, evaluating thousands of ligand poses in parallel. Vina, by comparison, would take 83 days on a single CPU core for the same campaign.
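The arithmetic behind these figures is a single division, reproduced below as a quick check. The sketch assumes throughput scales linearly with the number of GPUs, which ignores ligand preparation, I/O, and scheduling overhead.

```python
def screening_hours(num_molecules, rate_per_hour, workers=1):
    """Wall-clock hours to dock a library, assuming perfect linear scaling."""
    return num_molecules / (rate_per_hour * workers)


LIBRARY = 1_000_000

one_a100 = screening_hours(LIBRARY, 100_000)              # AutoDock-GPU, 1 GPU
four_a100 = screening_hours(LIBRARY, 100_000, workers=4)  # AutoDock-GPU, 4 GPUs
vina_days = screening_hours(LIBRARY, 500) / 24            # Vina, 1 CPU core

print(f"1x A100: {one_a100:.1f} h")    # 10.0 h
print(f"4x A100: {four_a100:.1f} h")   # 2.5 h
print(f"Vina:    {vina_days:.1f} days")  # 83.3 days
```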
DiffDock for difficult targets. Traditional docking requires the user to define a binding site (the "search box"). For novel targets without known ligands, or for allosteric sites (binding pockets away from the active site), defining the box is guesswork. DiffDock is a diffusion model that predicts the binding pose without a predefined box -- it scores the entire protein surface. This is slower but eliminates the binding site definition problem.
ADMET Prediction
ADMET models predict pharmacokinetic and safety properties from molecular structure. These are graph neural networks (GNNs) and transformer models trained on experimental assay data.
ADMET Prediction Configuration:
  models:
    absorption:
      - name: "caco2_permeability"
        description: "Caco-2 cell permeability (intestinal absorption proxy)"
        output: "log_papp (cm/s)"
        threshold: "> -5.15 (good absorption)"
        architecture: "AttentiveFP (graph attention network)"
      - name: "oral_bioavailability"
        description: "Fraction of drug reaching systemic circulation"
        output: "F% (0-100)"
        threshold: "> 20% (acceptable)"
    distribution:
      - name: "bbb_penetration"
        description: "Blood-brain barrier permeability"
        output: "probability (0-1)"
        threshold: "> 0.5 for CNS drugs, < 0.3 for peripheral drugs"
      - name: "plasma_protein_binding"
        description: "Fraction bound to plasma proteins"
        output: "fraction_bound (0-1)"
        threshold: "< 0.95 (highly bound drugs have low free fraction)"
    metabolism:
      - name: "cyp_inhibition"
        description: "Inhibition of CYP450 enzymes (drug-drug interaction risk)"
        output: "probability per isoform (1A2, 2C9, 2C19, 2D6, 3A4)"
        threshold: "< 0.5 for each isoform"
      - name: "half_life"
        description: "Plasma half-life prediction"
        output: "hours"
        threshold: "2-12h for daily oral dosing"
    excretion:
      - name: "clearance"
        description: "Hepatic and renal clearance rate"
        output: "mL/min/kg"
        threshold: "< 5 (low clearance)"
    toxicity:
      - name: "herg_inhibition"
        description: "hERG potassium channel block (cardiac arrhythmia risk)"
        output: "probability (0-1)"
        threshold: "< 0.3 (critical safety endpoint)"
      - name: "ames_mutagenicity"
        description: "Mutagenic potential (Ames test prediction)"
        output: "probability (0-1)"
        threshold: "< 0.5 (non-mutagenic)"
      - name: "hepatotoxicity"
        description: "Drug-induced liver injury risk"
        output: "probability (0-1)"
        threshold: "< 0.5"
      - name: "ld50"
        description: "Acute oral toxicity (lethal dose prediction)"
        output: "log(mg/kg)"
        threshold: "> 2.5 (EPA category IV, low toxicity)"
ADMET models are trained on public datasets (TDC - Therapeutics Data Commons, ChEMBL bioactivity data, EPA ToxCast) using HIP-0057. The pipeline ships pre-trained models and supports fine-tuning on proprietary assay data.
Ensemble scoring. Individual ADMET models have limited accuracy (~75-85% AUROC for classification tasks). The pipeline runs all applicable models and produces a composite "drug-likeness" score that weights each property by its clinical failure risk. hERG inhibition and hepatotoxicity are weighted highest because they cause the most expensive late-stage failures.
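One plausible form for the composite is a weighted mean, sketched below. The weights mirror the `ensemble_weights` values in the Configuration section; the idea of first mapping each model output to a desirability value in [0, 1] (1 meaning favorable), and the example values themselves, are assumptions not stated in the proposal.

```python
# Weights as given under admet.ensemble_weights in the service config.
WEIGHTS = {"herg": 2.0, "hepatotoxicity": 2.0, "oral_bioavailability": 1.5}
DEFAULT_WEIGHT = 1.0


def composite_score(desirability):
    """Weighted mean of per-property desirability values in [0, 1]."""
    total = 0.0
    weight_sum = 0.0
    for name, value in desirability.items():
        weight = WEIGHTS.get(name, DEFAULT_WEIGHT)
        total += weight * value
        weight_sum += weight
    return total / weight_sum


# Hypothetical desirabilities, e.g. 1 - P(toxic) for the toxicity models.
candidate = {
    "herg": 0.9,
    "hepatotoxicity": 0.8,
    "oral_bioavailability": 0.6,
    "ames_mutagenicity": 1.0,
}
print(round(composite_score(candidate), 3))
```

Normalizing by the sum of the weights keeps the score in [0, 1] regardless of which models were applicable to a given molecule.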
Virtual Screening Pipeline
Virtual screening combines all modules into an automated funnel that starts with millions of compounds and progressively filters to a handful of leads.
Virtual Screening Pipeline:
  name: "EGFR_inhibitor_screen"
  target:
    protein_pdb: "s3://pharma/targets/EGFR/4HJO.pdb"
    pocket_residues: [718, 719, 720, 790, 791, 792]
  stages:
    - name: "library_filter"
      type: "property_filter"
      input: "s3://pharma/libraries/zinc22_druglike.sdf" # 1.4 billion compounds
      filters:
        molecular_weight: [200, 500]
        logP: [-1, 5]
        hbd: [0, 5]
        hba: [0, 10]
        rotatable_bonds: [0, 10]
      expected_output: ~500M compounds
    - name: "pharmacophore_screen"
      type: "pharmacophore"
      input: "previous"
      pharmacophore:
        features:
          - type: "hydrogen_bond_acceptor"
            position: [22.1, -14.5, 9.2]
            radius: 1.5
          - type: "hydrophobic"
            position: [24.3, -12.1, 7.8]
            radius: 2.0
          - type: "aromatic_ring"
            position: [20.8, -15.7, 10.1]
            radius: 1.5
      expected_output: ~5M compounds
    - name: "rapid_docking"
      type: "docking"
      input: "previous"
      engine: "autodock-gpu"
      exhaustiveness: 8 # Low thoroughness for speed
      gpu_count: 8
      top_n: 50000 # Keep top 50K by docking score
      expected_runtime: "6 hours"
    - name: "precise_docking"
      type: "docking"
      input: "previous"
      engine: "autodock-gpu"
      exhaustiveness: 64 # High thoroughness
      gpu_count: 4
      num_poses: 10
      top_n: 5000
      expected_runtime: "4 hours"
    - name: "admet_filter"
      type: "admet"
      input: "previous"
      filters:
        herg_inhibition: "< 0.3"
        ames_mutagenicity: "< 0.5"
        oral_bioavailability: "> 20"
        caco2_permeability: "> -5.15"
      top_n: 500
    - name: "molecular_dynamics"
      type: "quantum" # Delegates to HIP-0070
      input: "previous"
      method: "mm_gbsa" # Molecular Mechanics with Generalized Born and Surface Area solvation
      simulation_time: "10ns" # 10 nanosecond simulation per complex
      gpu_count: 2
      top_n: 50 # Final lead candidates
  output:
    format: "sdf"
    include_scores: true
    include_poses: true
    report: "pdf" # Generate summary report
This funnel reduces 1.4 billion compounds to 50 lead candidates through progressive filtering. Each stage is a GPU job managed by the pipeline orchestrator. The entire campaign runs in 1-3 days on a modest GPU cluster (8x A100) versus months of manual work.
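The shape of the funnel is easier to see as per-stage retention ratios. The sketch below uses the compound counts from the stage definitions above (`expected_output` and `top_n`); it only does the bookkeeping and makes no claim about how the pipeline itself reports these numbers.

```python
# (stage name, compounds remaining after the stage); the first entry is the input.
FUNNEL = [
    ("zinc22_druglike", 1_400_000_000),
    ("library_filter", 500_000_000),
    ("pharmacophore_screen", 5_000_000),
    ("rapid_docking", 50_000),
    ("precise_docking", 5_000),
    ("admet_filter", 500),
    ("molecular_dynamics", 50),
]


def retention(funnel):
    """Fraction of compounds each stage keeps relative to the stage before it."""
    return {
        name: count / prev_count
        for (_, prev_count), (name, count) in zip(funnel, funnel[1:])
    }


overall_reduction = FUNNEL[0][1] / FUNNEL[-1][1]

for stage, kept in retention(FUNNEL).items():
    print(f"{stage:22s} keeps {kept:.2%}")
print(f"overall reduction: {overall_reduction:,.0f}x")
```

The two cheapest stages each discard 99% or close to it, which is what makes it affordable to spend hours of GPU time per batch in the later ones.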
Integration with Quantum Computing (HIP-0070)
For the final stages of lead optimization, classical force fields (used in standard docking) lack the accuracy needed to distinguish between closely ranked candidates. Quantum chemistry provides higher-fidelity energy calculations.
Quantum Chemistry Integration:
  methods:
    dft:
      description: "Density Functional Theory -- quantum mechanical energy calculation"
      use_case: "Accurate binding energy for top 50-100 candidates"
      compute_time: "1-4 hours per molecule-protein complex"
      accuracy: "Typically a few kcal/mol at B3LYP/6-31G*; chemical accuracy (~1 kcal/mol) generally needs larger basis sets and dispersion corrections"
      functional: "B3LYP"
      basis_set: "6-31G*"
    molecular_dynamics:
      description: "QM/MM molecular dynamics -- quantum core + classical surroundings"
      use_case: "Protein-ligand binding free energy estimation"
      compute_time: "Days per complex"
      accuracy: "Best available computational method"
    semi_empirical:
      description: "GFN2-xTB -- fast approximate quantum mechanics"
      use_case: "Geometry optimization, conformer ranking"
      compute_time: "Minutes per molecule"
      accuracy: "Sufficient for geometry, not for binding energies"
The pipeline delegates quantum calculations to the Quantum Computing service (HIP-0070). The pharma API submits a job with the protein-ligand complex coordinates and the desired method; the quantum service returns the computed energy. This decoupling allows the quantum backend to evolve independently (e.g., adding quantum hardware acceleration) without changing the pharma pipeline.
Integration with Chemical Libraries (HIP-0032)
Large chemical databases are stored in Hanzo Object Storage and indexed for rapid substructure and similarity search.
Chemical Library Management:
  libraries:
    zinc22:
      description: "ZINC22 -- 1.4 billion commercially available compounds"
      size: "~2 TB (compressed SDF)"
      storage: "s3://pharma/libraries/zinc22/"
      index: "PostgreSQL with RDKit cartridge for substructure search"
      update_frequency: "quarterly"
    chembl:
      description: "ChEMBL -- 2.4 million bioactive compounds with assay data"
      size: "~50 GB"
      storage: "s3://pharma/libraries/chembl34/"
      index: "Full-text + fingerprint index"
      update_frequency: "biannual (follows ChEMBL releases)"
    pubchem:
      description: "PubChem -- 110 million unique structures"
      size: "~500 GB"
      storage: "s3://pharma/libraries/pubchem/"
      index: "Fingerprint similarity index"
      update_frequency: "monthly"
    generated:
      description: "Hanzo-generated molecules (from diffusion/LM models)"
      size: "variable"
      storage: "s3://pharma/libraries/generated/{campaign_id}/"
      index: "Auto-indexed on creation"
  search_capabilities:
    substructure: "Find all molecules containing a benzimidazole ring"
    similarity: "Find 1000 molecules most similar to imatinib (Tanimoto > 0.7)"
    pharmacophore: "Find molecules matching a 3D pharmacophore query"
    property_range: "MW 300-500, logP 1-3, HBD <= 3"
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| Molecules | | |
| /v1/molecules | POST | Upload molecules (SDF, SMILES, SELFIES) |
| /v1/molecules/{id} | GET | Get molecule with computed properties |
| /v1/molecules/search | POST | Substructure/similarity/property search |
| /v1/molecules/convert | POST | Convert between formats (SDF, PDB, MOL2, SMILES) |
| Generation | | |
| /v1/generate/diffusion | POST | 3D pocket-conditioned generation |
| /v1/generate/selfies | POST | SELFIES language model generation |
| /v1/generate/status/{job_id} | GET | Generation job status |
| Structure | | |
| /v1/structure/predict | POST | Protein structure prediction (ESMFold/OpenFold) |
| /v1/structure/{id} | GET | Get predicted structure (PDB) |
| Docking | | |
| /v1/dock | POST | Submit docking job (single molecule or batch) |
| /v1/dock/{job_id} | GET | Get docking results (poses, scores) |
| /v1/dock/{job_id}/poses | GET | Download binding poses (PDB/SDF) |
| ADMET | | |
| /v1/admet/predict | POST | Predict ADMET properties for molecule(s) |
| /v1/admet/models | GET | List available ADMET models |
| Screening | | |
| /v1/screen | POST | Submit virtual screening pipeline |
| /v1/screen/{id} | GET | Get screening status and results |
| /v1/screen/{id}/report | GET | Download screening report (PDF) |
| Libraries | | |
| /v1/libraries | GET | List available chemical libraries |
| /v1/libraries/{name}/search | POST | Search within a library |
| Quantum | | |
| /v1/quantum/submit | POST | Submit quantum chemistry calculation |
| /v1/quantum/{job_id} | GET | Get quantum calculation results |
All endpoints require Hanzo IAM authentication. Billing is metered by GPU-hours consumed (docking, generation, quantum) and API calls (ADMET prediction, search).
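As a rough illustration of calling one of these endpoints, the sketch below builds an authenticated request to `/v1/admet/predict` using only the Python standard library. The proposal does not specify the request schema or the token scheme, so the `{"molecules": [...]}` body, the bearer-token header, the `localhost` host, and the helper name `build_admet_request` are all assumptions made for the example.

```python
import json
import urllib.request

# Port 8092 comes from this proposal; the host is an assumption for local testing.
BASE_URL = "http://localhost:8092"


def build_admet_request(smiles_list, token):
    """Build (but do not send) a POST request for ADMET prediction."""
    # Hypothetical request schema: a JSON object with a list of SMILES strings.
    body = json.dumps({"molecules": smiles_list}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/admet/predict",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed scheme: a Hanzo IAM access token passed as a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_admet_request(["CCO"], token="example-token")
print(req.get_method(), req.full_url)
# To send it against a running service: urllib.request.urlopen(req)
```

Separating request construction from sending keeps the example testable without a running service.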
Configuration
# /etc/hanzo-pharma/config.yaml
server:
  host: 0.0.0.0
  port: 8092
  workers: 4
database:
  url: "postgresql://hanzo:password@postgres:5432/hanzo_pharma"
  rdkit_extension: true # Enable RDKit PostgreSQL cartridge
storage:
  endpoint: "http://minio:9000"
  access_key: "${HANZO_STORAGE_ACCESS_KEY}"
  secret_key: "${HANZO_STORAGE_SECRET_KEY}"
  buckets:
    molecules: "pharma-molecules"
    libraries: "pharma-libraries"
    results: "pharma-results"
    models: "pharma-models"
ml:
  endpoint: "http://ml.hanzo.svc:8057" # HIP-0057
quantum:
  endpoint: "http://quantum.hanzo.svc:8070" # HIP-0070
auth:
  iam_url: "https://hanzo.id"
  verify_tokens: true
docking:
  default_engine: "autodock-gpu"
  gpu_types: ["nvidia-a100-80gb", "nvidia-h100-80gb"]
  max_concurrent_jobs: 16
generation:
  diffusion_model: "s3://pharma-models/diffdock-gen/latest/"
  selfies_model: "s3://pharma-models/selfies-gen/latest/"
structure:
  esmfold_weights: "s3://pharma-models/esmfold/latest/"
  openfold_weights: "s3://pharma-models/openfold/latest/"
  openfold_databases: "s3://pharma-libraries/openfold-dbs/"
admet:
  model_dir: "s3://pharma-models/admet/"
  ensemble_weights:
    herg: 2.0 # Critical safety -- double weight
    hepatotoxicity: 2.0
    oral_bioavailability: 1.5
    default: 1.0
metrics:
  enabled: true
  port: 9090
  path: /metrics
logging:
  level: info
  format: json
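The configuration does not say how `ensemble_weights` are combined. One plausible reading is a weighted mean over per-endpoint scores, with `default` applied to any endpoint not listed. The sketch below shows that interpretation only; the function name `weighted_admet_score` and the convention that scores lie in [0, 1] with higher meaning more favorable are assumptions.

```python
# Weights copied from the admet.ensemble_weights section of the config above.
ENSEMBLE_WEIGHTS = {
    "herg": 2.0,
    "hepatotoxicity": 2.0,
    "oral_bioavailability": 1.5,
    "default": 1.0,
}


def weighted_admet_score(scores, weights=ENSEMBLE_WEIGHTS):
    """Weighted mean of per-endpoint scores (assumed aggregation rule)."""
    if not scores:
        raise ValueError("at least one endpoint score is required")
    total = 0.0
    weight_sum = 0.0
    for endpoint, score in scores.items():
        # Endpoints without an explicit weight fall back to "default".
        weight = weights.get(endpoint, weights["default"])
        total += weight * score
        weight_sum += weight
    return total / weight_sum


# Hypothetical scores; "solubility" has no explicit weight, so it uses 1.0.
example = {"herg": 0.9, "hepatotoxicity": 0.8, "solubility": 0.5}
aggregate = weighted_admet_score(example)
print(round(aggregate, 2))
```

Under this rule the two safety endpoints contribute four of the five weight units, so a poor hERG or hepatotoxicity score pulls the aggregate down more than a poor solubility score would.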
Implementation
Deployment
Docker
docker run -p 8092:8092 -p 9090:9090 \
  -e HANZO_PHARMA_DATABASE_URL="postgresql://..." \
  -e HANZO_PHARMA_STORAGE_ENDPOINT="http://minio:9000" \
  --gpus all \
  hanzoai/pharma:latest
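The two `-e` flags above suggest that environment variables of the form `HANZO_PHARMA_<SECTION>_<KEY>` override the matching entry in `config.yaml`. The proposal does not spell out this rule, so the sketch below is one assumed interpretation: the name after the prefix is split at its first underscore into a section and a key, and only keys already present in the config are overridden. The function name `apply_env_overrides` is hypothetical.

```python
import copy

PREFIX = "HANZO_PHARMA_"


def apply_env_overrides(config, environ):
    """Return a copy of config with HANZO_PHARMA_<SECTION>_<KEY> values applied."""
    merged = copy.deepcopy(config)
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue
        # DATABASE_URL -> section "database", key "url" (split at first underscore).
        section, _, key = name[len(PREFIX):].lower().partition("_")
        if section in merged and key in merged[section]:
            merged[section][key] = value
    return merged


config = {
    "database": {"url": "postgresql://localhost/dev"},
    "storage": {"endpoint": "http://localhost:9000"},
}
environ = {
    "HANZO_PHARMA_DATABASE_URL": "postgresql://db/prod",
    "HANZO_PHARMA_STORAGE_ENDPOINT": "http://minio:9000",
    "PATH": "/usr/bin",  # unrelated variables are ignored
}
merged = apply_env_overrides(config, environ)
print(merged["database"]["url"], merged["storage"]["endpoint"])
```

In a real deployment `environ` would be `os.environ`; a plain dictionary is used here so the behavior is easy to inspect.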
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-pharma
  namespace: hanzo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hanzo-pharma
  template:
    metadata:
      labels:
        app: hanzo-pharma
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: hanzo-pharma
      containers:
        - name: hanzo-pharma
          image: hanzoai/pharma:latest
          ports:
            - containerPort: 8092
              name: api
            - containerPort: 9090
              name: metrics
          env:
            - name: HANZO_PHARMA_DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: hanzo-pharma-secrets
                  key: database-url
          readinessProbe:
            httpGet:
              path: /ready
              port: 8092
          livenessProbe:
            httpGet:
              path: /alive
              port: 8092
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-pharma
  namespace: hanzo
spec:
  selector:
    app: hanzo-pharma
  ports:
    - name: api
      port: 8092
    - name: metrics
      port: 9090
CLI Interface
# Generate molecules for a target
hanzo-pharma generate --target EGFR --pocket-pdb 4HJO.pdb --num 1000
# Predict protein structure
hanzo-pharma structure predict --sequence MTEYKLVVV... --engine esmfold
# Dock molecules against a target
hanzo-pharma dock --protein target.pdb --ligands candidates.sdf --engine autodock-gpu --gpus 4
# Predict ADMET properties
hanzo-pharma admet predict --molecules candidates.sdf --output results.csv
# Run a full virtual screening campaign
hanzo-pharma screen --config campaign.yaml --watch
# Search chemical libraries
hanzo-pharma library search --library zinc22 --substructure "c1ccc2[nH]cnc2c1" --limit 1000
Health and Metrics
Metrics (Prometheus):
Counters:
  hanzo_pharma_molecules_processed_total{stage}
  hanzo_pharma_docking_jobs_total{engine, status}
  hanzo_pharma_generation_jobs_total{method, status}
  hanzo_pharma_gpu_hours_total{stage, gpu_type}
Histograms:
  hanzo_pharma_docking_throughput{engine} # Molecules/hour
  hanzo_pharma_admet_latency_seconds{model}
  hanzo_pharma_api_request_duration_seconds{endpoint}
Gauges:
  hanzo_pharma_screening_progress{campaign_id} # Percentage complete
  hanzo_pharma_gpus_allocated{stage}
  hanzo_pharma_library_size{library_name} # Indexed compounds
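To make the label notation above concrete, the following sketch renders samples of the `hanzo_pharma_docking_jobs_total` counter in the Prometheus text exposition format using only the standard library. It is an illustration, not the service's metrics code: the helper names and the `status` values `success` and `failed` are assumptions.

```python
from collections import Counter


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample line, e.g. name{key="value"} 1."""
    if not labels:
        return f"{name} {value}"
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{rendered}}} {value}"


# In-memory counter keyed by (engine, status); the status values are hypothetical.
docking_jobs = Counter()
docking_jobs[("autodock-gpu", "success")] += 1
docking_jobs[("autodock-gpu", "success")] += 1
docking_jobs[("autodock-gpu", "failed")] += 1

lines = [
    format_sample("hanzo_pharma_docking_jobs_total", {"engine": engine, "status": status}, count)
    for (engine, status), count in sorted(docking_jobs.items())
]
print("\n".join(lines))
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels such as `campaign_id` are usually kept to gauges with a bounded number of active values.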
Implementation Roadmap
Phase 1: Core Infrastructure (Q1 2026)
- Molecular file format parsing (SDF, PDB, SMILES, SELFIES, MOL2)
- Chemical library ingestion and indexing (ChEMBL, ZINC subset)
- ADMET prediction models (pre-trained ensemble)
- REST API and Python SDK
- CLI tool
Phase 2: Docking and Screening (Q2 2026)
- AutoDock-GPU integration with GPU job scheduling
- Virtual screening pipeline orchestrator
- DiffDock integration for blind docking
- Screening report generation
Phase 3: Generation and Structure (Q3 2026)
- 3D diffusion molecular generation (pocket-conditioned)
- SELFIES language model generation
- ESMFold protein structure prediction
- OpenFold integration (full MSA pipeline)
Phase 4: Quantum and Optimization (Q4 2026)
- HIP-0070 quantum chemistry integration (DFT, QM/MM)
- Lead optimization feedback loops (generate -> dock -> score -> regenerate)
- Multi-objective optimization (affinity + ADMET + synthesizability)
- Full ZINC22 library indexing (1.4 billion compounds)
FDA Regulatory Considerations
AI-designed drug candidates must meet the same regulatory standards as traditionally discovered drugs. The FDA's discussion paper on AI/ML in drug development (2023) and the ICH M7 guideline on mutagenic impurities set expectations that this pipeline is designed to support.
Reproducibility. The pipeline records every parameter, model version, dataset version, and random seed for every computation, so a virtual screening campaign can be re-run from its configuration file with identical inputs. This is essential for regulatory submissions, where the FDA may ask for computational evidence to be re-run.
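One way to make such a record checkable is to hash a canonical serialization of the run's inputs, so that two runs can be compared by a single digest. The sketch below assumes a JSON manifest; the field names, the example values, and the function `manifest_digest` are hypothetical and only illustrate the idea.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """SHA-256 of the manifest serialized with sorted keys and fixed separators."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest covering the four items the pipeline records.
manifest = {
    "campaign_id": "example-campaign",
    "parameters": {"engine": "autodock-gpu", "num_poses": 10},
    "model_versions": {"admet": "1.0.0"},
    "dataset_versions": {"chembl": "example-release"},
    "random_seed": 42,
}
digest = manifest_digest(manifest)
print(digest)
```

Because keys are sorted before hashing, the digest does not depend on the order in which fields were written, while any change to a parameter, version, or seed produces a different digest.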
Model validation. ADMET prediction models must be validated against experimental data before use in regulatory submissions. The pipeline provides built-in benchmarking against standard validation sets (TDC leaderboards) and reports model performance metrics (AUROC, RMSE, enrichment factors) alongside predictions.
Audit trail. Every molecule's journey through the pipeline -- from initial library membership through filtering, docking, ADMET scoring, and lead selection -- is recorded with timestamps, model versions, and scores. This audit trail supports the "rationale for lead selection" section required in IND (Investigational New Drug) applications.
GxP compliance. For use in GLP (Good Laboratory Practice) and GMP (Good Manufacturing Practice) environments, the pipeline supports:
- Signed and versioned model artifacts (SHA-256 checksums in the model registry)
- Role-based access control (HIP-0026) restricting who can modify screening configurations
- Immutable result storage (results are append-only in Object Storage)
- Electronic signatures for pipeline approval (integrated with Hanzo IAM)
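The checksum item in the list above can be sketched as follows: compute the SHA-256 of an artifact file and compare it with the value stored in the model registry. The registry lookup is out of scope, so the expected digest is passed in directly, and the function names are hypothetical. A checksum detects modification but is not a signature on its own; signing would additionally need a key-based scheme, which this sketch does not cover.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large model artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Return True only if the file's SHA-256 matches the registered checksum."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Demo with a throwaway file standing in for a model artifact.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example model weights")
    artifact_path = tmp.name

registered = hashlib.sha256(b"example model weights").hexdigest()
ok = verify_artifact(artifact_path, registered)
tampered = verify_artifact(artifact_path, "0" * 64)
os.remove(artifact_path)
print(ok, tampered)
```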
Limitation disclosure. The pipeline's reports include explicit confidence intervals and known limitations for each computational method. Docking scores are binding energy estimates, not experimental measurements. ADMET predictions have defined applicability domains -- molecules outside the training distribution receive lower confidence scores. This transparency is essential for regulatory credibility.
Security Considerations
Data Classification
Drug discovery data spans multiple sensitivity levels:
- Public: Chemical library structures (ZINC, ChEMBL, PubChem)
- Confidential: Screening results, lead candidates, ADMET profiles
- Highly Confidential: Proprietary targets, novel scaffolds, patent-pending structures
The pipeline enforces data classification through Object Storage bucket policies (HIP-0032) and IAM role-based access (HIP-0026). Proprietary targets and lead candidates are encrypted at rest with customer-managed keys via KMS (HIP-0027).
Intellectual Property Protection
Novel molecules generated by the pipeline are potential patent candidates. The pipeline:
- Stores generated molecules in organization-scoped, encrypted buckets
- Logs all access to generated molecule data
- Supports export restrictions (preventing bulk download of lead candidates)
- Timestamps all generations for prior art documentation
Compute Isolation
Docking and generation jobs from different organizations run in separate Kubernetes namespaces with network policies preventing cross-tenant data access. GPU memory is cleared between jobs to prevent model weight or molecular data leakage.
References
- HIP-0032: Object Storage Standard
- HIP-0057: ML Pipeline & Training Standard
- HIP-0070: Quantum Computing Standard
- HIP-0026: Identity & Access Management Standard
- HIP-0027: Secrets Management Standard
- DiMasi, J.A., et al. "Innovation in the pharmaceutical industry: New estimates of R&D costs." Journal of Health Economics 47 (2016): 20-33.
- Corso, G., et al. "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking." ICLR 2023.
- Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379 (2023): 1123-1130.
- Schneider, P., et al. "Rethinking drug design in the artificial intelligence era." Nature Reviews Drug Discovery 19 (2020): 353-364.
- Santos-Martins, D., et al. "Accelerating AutoDock4 with GPUs and Gradient-Based Local Search." JCTC 17 (2021): 1060-1073.
- Krenn, M., et al. "Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation." Machine Learning: Science and Technology 1 (2020): 045024.
- Ahdritz, G., et al. "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization." Nature Methods 21 (2024): 1514-1524.
- FDA. "Using Artificial Intelligence and Machine Learning in the Development of Drug and Biological Products." Discussion Paper (2023).
- Huang, K., et al. "Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development." NeurIPS 2021 Datasets and Benchmarks Track.
- Xiong, Z., et al. "Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism." Journal of Medicinal Chemistry 63 (2020): 8749-8760.
Copyright
Copyright and related rights waived via CC0.