HIP-0019 | Draft | Standards Track | Core

Tensor Operations Standard

Hanzo AI Team
Created: 2025-01-09
Requires: HIP-0003

HIP-19: Tensor Operations Standard

Abstract

This proposal defines the tensor operations standard for all ML computations in the Hanzo ecosystem. It specifies the data types, device backends, operation primitives, model formats, quantization strategies, memory management, custom operations, WebAssembly compilation pipeline, and API surface that every inference workload MUST use. The reference implementation is Hanzo Candle, a fork of HuggingFace's Candle Rust ML framework extended with Apple Silicon Metal support, custom quantization kernels, Hamiltonian dynamics operations, and integration with the Hanzo model serving pipeline.

Repository: github.com/hanzoai/candle
Crates: candle-core, candle-nn, candle-transformers, candle-metal-kernels, candle-wasm
NPM Package: @hanzoai/candle-wasm

Motivation

ML inference in production has a set of concrete problems that existing tooling does not solve together:

  1. Runtime bloat: PyTorch requires a 2GB+ runtime. For a service that loads a 4-bit quantized 7B model into 4GB of RAM, the framework itself should not double the memory footprint. Inference servers should be small, fast to start, and free of dynamic library dependency chains.
  2. No single-binary deployment: Python-based inference requires a virtualenv, pip dependencies, CUDA toolkit installation, and careful version pinning. Deploying to a new machine means reproducing this environment exactly. A compiled binary that statically links its dependencies eliminates this class of failure.
  3. GIL contention: Python's Global Interpreter Lock serializes CPU-bound work across threads. For inference servers handling concurrent requests, this means either multiprocessing (with memory duplication) or yielding to C extensions (with complex FFI). Rust has no GIL.
  4. Memory safety at the hardware boundary: CUDA and Metal kernels operate on raw pointers. Buffer overflows, use-after-free, and data races in kernel launch code cause silent corruption or segfaults. Rust's ownership model catches these at compile time.
  5. No edge/browser deployment path: PyTorch cannot compile to WebAssembly. ONNX Runtime has a WASM build but it is large and limited. Candle compiles to WASM natively, enabling in-browser inference with no server round-trip.
  6. Fragmented hardware support: Most frameworks treat CUDA as first-class and everything else as an afterthought. Apple Silicon (Metal) and WebAssembly are second-class or unsupported. We need uniform performance across CPU, CUDA, Metal, and WASM.
  7. No custom operation pathway for scientific computing: Production ML frameworks are optimized for standard neural network operations. Applications that require physics-informed layers -- such as Hamiltonian dynamics for the HMM protocol (HIP-0008) or active inference for HLLM (HIP-0002) -- have no clean extension mechanism in PyTorch or ONNX. Candle's trait-based backend system makes custom operations first-class.

Hanzo Candle addresses all seven problems with a single codebase.

Design Philosophy

This section explains the reasoning behind every major architectural decision. These are not arbitrary choices -- each follows from a specific constraint.

Why Rust for ML

Python is the ML lingua franca. Every researcher writes Python. Every training framework (PyTorch, JAX, TensorFlow) has Python as its primary interface. So why would an ML infrastructure company build its inference layer in Rust?

The answer is that training and inference have fundamentally different requirements.

Training is exploratory. Researchers iterate on architectures, loss functions, and data pipelines. They need rapid prototyping, interactive debugging, and the ability to inspect intermediate tensors. Python excels at this. The GIL does not matter because training is GPU-bound and Python is just orchestrating CUDA kernels.

Inference is operational. A production inference server must:

  • Start in milliseconds (not seconds) for serverless cold starts
  • Handle thousands of concurrent requests without GIL contention
  • Use minimal memory so more RAM is available for model weights
  • Deploy as a single artifact without dependency chains
  • Run for months without memory leaks or GC pauses

For these requirements, Rust provides:

  • C-level performance: No interpreter overhead, no garbage collector, no JIT warmup. For CPU-bound operations (tokenization, beam search, KV cache management), Rust is typically 10-100x faster than the same logic written in pure Python.
  • Memory safety without GC: Ownership and borrowing eliminate use-after-free, double-free, and data races at compile time. No garbage collection pauses during inference.
  • Zero-cost abstractions: Traits, generics, and iterators compile to the same machine code as hand-written C. The device abstraction layer adds no runtime overhead.
  • Single-binary deployment: cargo build --release produces one statically-linked binary. Our inference server is ~50MB, starts in <100ms, and has zero runtime dependencies.
  • Fearless concurrency: Rust's type system prevents data races. Concurrent request handling requires no locks on read-only model weights.

The tradeoff is development velocity. Rust has a steeper learning curve than Python, and compile times are longer. We accept this tradeoff because inference code changes infrequently once correct -- the model architecture is fixed, and the serving logic is stable. The operational benefits compound over millions of inference requests.

Why Candle (HuggingFace Fork)

There are several Rust ML frameworks: tch-rs (PyTorch bindings), burn, ort (ONNX Runtime bindings), and Candle. We chose Candle for specific reasons:

tch-rs wraps libtorch via FFI. This gives you PyTorch's full operator set in Rust, but you are still shipping a 2GB libtorch dynamic library. The binary is not self-contained, and you inherit PyTorch's memory allocator behavior (including fragmentation under long-running inference). You also cannot compile to WASM.

burn is a pure-Rust framework with its own backend system. It is well-designed but young. Its model ecosystem is small -- loading a HuggingFace checkpoint requires manual weight mapping for each architecture. It does not yet support GGUF quantized formats.

ort wraps ONNX Runtime. ONNX is a good interchange format, but not all model architectures export cleanly to ONNX (notably: models with dynamic control flow, KV caching, or custom attention patterns). You are also limited to what ONNX operators support.

Candle is HuggingFace's native Rust ML framework. It provides:

  • PyTorch-like tensor API (Tensor::matmul, Tensor::softmax, etc.)
  • Direct loading of safetensors files (HuggingFace's standard model format)
  • Pre-built model architectures (LLaMA, Mistral, Phi, Whisper, Stable Diffusion)
  • CPU, CUDA, and Metal backends in a single crate
  • WASM compilation support
  • Active maintenance by HuggingFace engineers

We fork Candle to add:

  1. Enhanced Metal kernels: Custom quantized matmul kernels for Apple Silicon that outperform the upstream implementation by 2-3x on M-series chips
  2. GGUF quantization support: Full Q4_K_M and Q5_K_M quantization with optimized dequantization kernels
  3. KV cache management: Pre-allocated, ring-buffer KV caches for efficient autoregressive generation
  4. Hanzo model registry integration: Direct loading from Hanzo Object Storage (HIP-0032) and the Zen Gateway model cache
  5. Batched inference: Dynamic batching with padding-free attention for throughput optimization
  6. Hamiltonian dynamics operations: Symplectic integrators and energy-preserving transforms for the HMM protocol (HIP-0008) and active inference (HIP-0007)

By building on HuggingFace's work, we get broad model compatibility. Any model published to HuggingFace Hub in safetensors format can be loaded with minimal effort.

Why Not Just Use PyTorch or ONNX Runtime

This deserves explicit comparison because the question comes up frequently.

PyTorch is the correct choice when:

  • You are training a model
  • You are prototyping a new architecture
  • You need the full operator set (2000+ ops)
  • Your team only knows Python

PyTorch is the wrong choice when:

  • Your inference server must start in <1 second (PyTorch import alone takes 3-5 seconds)
  • You need single-binary deployment (PyTorch requires CUDA toolkit, cuDNN, NCCL)
  • You want to run in a browser or on edge devices (no WASM support)
  • Memory efficiency matters (PyTorch's allocator fragments under long-running workloads)
  • You serve on Apple Silicon (PyTorch's MPS backend has incomplete operator coverage and is slower than hand-written Metal kernels)

ONNX Runtime is the correct choice when:

  • Your model exports cleanly to ONNX
  • You need the widest hardware support (CPU, CUDA, TensorRT, DirectML, CoreML, NNAPI)
  • You do not need custom attention kernels or KV cache management

ONNX Runtime is the wrong choice when:

  • Your model uses dynamic control flow (if/else in forward pass)
  • You need custom quantization (GPTQ, AWQ) beyond ONNX's built-in INT8
  • You want WASM deployment without a 20MB+ runtime
  • You need to modify the inference loop (e.g., speculative decoding, guided generation)

Candle fills the gap: a minimal, embeddable ML runtime that compiles to native code or WASM, supports the model formats and quantization strategies that matter for LLM/diffusion inference, and gives the developer full control over the inference loop.

Why WebAssembly Support

Candle compiles to WASM via wasm32-unknown-unknown. This enables a deployment model that no other ML framework supports well: in-browser inference with zero server infrastructure.

Concrete use cases:

  • Privacy-sensitive applications: Medical text analysis, personal document summarization, journal entry processing. The data never leaves the user's device. There is no API call, no server log, no data retention policy to worry about.
  • Offline capability: A WASM model works without network connectivity. Mobile web apps, field tools, and disaster-response systems benefit from this.
  • Latency elimination: No network round-trip. For small models (1-3B parameters with INT4 quantization), in-browser inference can respond sooner than a cloud API call because it avoids 50-200ms of network latency per request.
  • Cost elimination: No GPU servers to provision, no API bills, no autoscaling complexity. The user's device provides the compute.

The tradeoff is model size. The WASM backend is CPU-only (it has no GPU code path) and limited by the device's RAM. In practice, this means models up to ~3B parameters with INT4 quantization (~1.5GB) work well on modern laptops and high-end phones. Larger models require server-side inference.

Why Tensor-Level Standardization

Multiple systems in the Hanzo ecosystem consume tensors: Jin (HIP-0003) for multimodal inference, Node (HIP-0020) for blockchain-verified inference, browser clients for edge ML, and the LLM Gateway (HIP-0004) for model serving. Without a shared standard, each system invents its own tensor format, memory layout, and operation semantics. This leads to:

  • Serialization overhead: Converting between formats (e.g., PyTorch tensors to ONNX to TFLite) introduces bugs and performance loss.
  • Correctness drift: Two implementations of softmax that differ in numerical precision produce different model outputs, making inference non-reproducible.
  • Duplicated effort: Every team writes their own matmul, attention, and normalization kernels.

By standardizing at the tensor level, all Hanzo systems share a single implementation of every operation. A tensor produced by Jin can be consumed by Node without conversion. Browser and server inference produce bit-identical results for the same model and input.
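
Correctness drift is easy to reproduce even without a framework. The sketch below uses plain Rust and the standard library only; the function names are illustrative and not part of Candle. It sums the same three f32 values in two orders, as two different reduction kernels might, and gets two different answers. This is why reproducible inference requires a fixed operation order in addition to a fixed data format.

```rust
/// Sum f32 values strictly left to right.
fn sum_in_order(values: &[f32]) -> f32 {
    values.iter().fold(0.0f32, |acc, v| acc + v)
}

/// The same three values visited in two different orders.
fn reduction_order_demo() -> (f32, f32) {
    // Near 1e8 the spacing between adjacent f32 values is 8,
    // so adding 1.0 to 1e8 rounds straight back to 1e8.
    let absorbed = sum_in_order(&[1.0e8, 1.0, -1.0e8]); // the 1.0 is lost
    let kept = sum_in_order(&[1.0e8, -1.0e8, 1.0]); // the 1.0 survives
    (absorbed, kept)
}
```

Calling `reduction_order_demo()` returns `(0.0, 1.0)`: identical inputs, different results, purely because of evaluation order.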

Specification

Architecture

Application Layer
├── Jin (HIP-0003) - Multimodal inference
├── Node (HIP-0020) - Blockchain inference
├── Browser Client - Edge ML via WASM
└── LLM Gateway (HIP-0004) - Model serving
        |
        v
Candle High-Level API
├── candle-transformers  - Pre-built model architectures
├── candle-nn            - Neural network layers
└── candle-hamiltonian   - Physics-informed operations
        |
        v
Candle Core
├── Tensor              - N-dimensional array
├── Op                  - Operation enum (for autograd)
├── Backend trait       - Device abstraction
└── Storage             - Typed memory buffer
        |
        v
Backend Implementations
├── CPU    (BLAS/LAPACK: OpenBLAS, MKL, Accelerate)
├── CUDA   (cuBLAS, cuDNN, custom kernels)
├── Metal  (MPS, custom compute shaders)
└── WASM   (Rust stdlib, SIMD128 where available)

Tensor Types

All tensor operations MUST support the following data types:

| DType | Size | Use Case | Backend Support |
|-------|------|----------|-----------------|
| f32 | 4 bytes | Default precision, training, CPU inference | CPU, CUDA, Metal, WASM |
| f16 | 2 bytes | GPU inference, reduced memory | CPU (emulated), CUDA, Metal |
| bf16 | 2 bytes | Training stability, GPU inference | CPU (emulated), CUDA, Metal |
| f64 | 8 bytes | High-precision numerical computation, Hamiltonian dynamics | CPU, CUDA |
| i64 | 8 bytes | Indices, token IDs, positions | CPU, CUDA, Metal, WASM |
| u32 | 4 bytes | Indices, masks, vocabulary IDs | CPU, CUDA, Metal, WASM |
| u8 | 1 byte | Quantized weights, raw data | CPU, CUDA, Metal, WASM |

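Type promotion rules follow NumPy conventions where NumPy defines them: u8 + f32 -> f32 and f16 + f32 -> f32. NumPy has no native bf16 type; bf16 + f16 -> f32 because neither 16-bit format can represent every value of the other, while f32 can represent both.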
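
The promotion rules can be written as a small lookup. This is a sketch under stated assumptions, not the Candle implementation: the `DType` enum and `promote` function are defined here for illustration, and any pairing beyond the three stated in the text (for example "the wider float wins", or u8 with f16) is an assumption of the sketch. Pairings the text does not cover return `None` rather than guessing.

```rust
/// The dtypes listed in the table above (illustrative enum, not candle_core::DType).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DType { U8, U32, I64, F16, BF16, F32, F64 }

impl DType {
    fn is_float(self) -> bool {
        matches!(self, DType::F16 | DType::BF16 | DType::F32 | DType::F64)
    }

    fn size_in_bytes(self) -> usize {
        match self {
            DType::U8 => 1,
            DType::F16 | DType::BF16 => 2,
            DType::U32 | DType::F32 => 4,
            DType::I64 | DType::F64 => 8,
        }
    }
}

/// Result dtype of a binary op, or None where this sketch defines no rule.
fn promote(a: DType, b: DType) -> Option<DType> {
    match (a, b) {
        _ if a == b => Some(a),
        // f16 and bf16 split their 16 bits differently; f32 holds both exactly.
        (DType::F16, DType::BF16) | (DType::BF16, DType::F16) => Some(DType::F32),
        // Assumption: between two other float types, the wider one wins.
        _ if a.is_float() && b.is_float() => {
            Some(if a.size_in_bytes() >= b.size_in_bytes() { a } else { b })
        }
        // u8 combined with a float takes the float type (u8 + f32 -> f32).
        (DType::U8, f) | (f, DType::U8) if f.is_float() => Some(f),
        // Integer/integer and wide-integer/float mixes are not specified in the text.
        _ => None,
    }
}
```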

Device Backends

pub enum Device {
    Cpu,
    Cuda(usize),    // GPU ordinal
    Metal(usize),   // Metal device ordinal
}

| Backend | Library | Matmul Dispatch | Notes |
|---------|---------|-----------------|-------|
| CPU | BLAS (OpenBLAS, MKL, Accelerate) | sgemm/dgemm | Universal fallback; vectorized with AVX2/NEON |
| CUDA | cuBLAS, cuDNN, custom kernels | cublasSgemm, flash attention | NVIDIA GPUs; requires CUDA toolkit >= 12.0 |
| Metal | Metal Performance Shaders, custom kernels | MPSMatrixMultiplication, custom Q4 matmul | Apple Silicon; M1/M2/M3/M4 |
| WASM | Rust stdlib, SIMD128 | Manual tiling | Browser/edge; no GPU, SIMD optional |

Backend selection is explicit at tensor creation time. Tensors on different devices cannot be combined -- the caller must explicitly transfer with tensor.to_device(&device).
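
The explicit-transfer rule can be illustrated with a toy tensor type. The `Tensor`, `Device`, and `TensorError` types below are simplified stand-ins written for this sketch (the "transfer" only relabels a host buffer), so this shows the contract and not how Candle moves data between devices.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Device { Cpu, Cuda(usize), Metal(usize) }

#[derive(Debug, PartialEq)]
enum TensorError {
    DeviceMismatch { lhs: Device, rhs: Device },
    LengthMismatch,
}

/// Toy 1-D tensor: a host buffer tagged with the device it notionally lives on.
#[derive(Clone, Debug, PartialEq)]
struct Tensor { data: Vec<f32>, device: Device }

impl Tensor {
    fn new(data: Vec<f32>, device: Device) -> Self { Tensor { data, device } }

    /// A real backend would copy the buffer to the target device here.
    fn to_device(&self, device: &Device) -> Tensor {
        Tensor { data: self.data.clone(), device: *device }
    }

    /// Binary ops refuse to mix devices instead of transferring silently.
    fn add(&self, rhs: &Tensor) -> Result<Tensor, TensorError> {
        if self.device != rhs.device {
            return Err(TensorError::DeviceMismatch { lhs: self.device, rhs: rhs.device });
        }
        if self.data.len() != rhs.data.len() {
            return Err(TensorError::LengthMismatch);
        }
        let data = self.data.iter().zip(&rhs.data).map(|(a, b)| a + b).collect();
        Ok(Tensor { data, device: self.device })
    }
}
```

With `a` on `Device::Cpu` and `b` on `Device::Metal(0)`, `a.add(&b)` returns a `DeviceMismatch` error, while `a.to_device(&Device::Metal(0)).add(&b)` succeeds. Making the transfer explicit keeps expensive host-to-device copies visible in the calling code.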

Memory Management

Efficient memory management is critical for inference workloads where model weights may consume most available RAM. Candle uses three strategies:

Zero-copy memory mapping for model weights. When loading safetensors files, Candle uses mmap(2) via the memmap2 crate. The OS maps the file into virtual address space without copying it into a heap buffer. This means:

  • Model loading is near-instant (page faults load data on demand)
  • Multiple processes serving the same model share physical memory pages
  • The OS can evict unused pages under memory pressure
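// Zero-copy loading: no heap allocation for weight data.
// In upstream Candle the mmap constructor is `unsafe` (the file must not change
// while it is mapped) and takes a single path; `multi` accepts a slice of paths.
let tensors = unsafe { candle_core::safetensors::MmapedSafetensors::new("model.safetensors")? };
let weight = tensors.load("model.layers.0.attention.wq.weight", &device)?;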

Arena allocation for intermediate tensors during inference. A single forward pass through a 7B-parameter model creates hundreds of intermediate tensors (attention scores, normalized activations, FFN outputs). Allocating and freeing each one individually causes memory fragmentation over millions of requests. Candle uses a per-request arena:

  • Pre-allocate a contiguous buffer sized for the worst-case forward pass
  • Intermediate tensors are bump-allocated within the arena
  • The entire arena is freed at once when the request completes
  • No fragmentation, no GC pauses, deterministic memory usage
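
The bump-allocation pattern described above can be sketched in a few lines of safe Rust. The `Arena` type here is hypothetical and is not Candle's allocator: it hands out index ranges into one pre-allocated buffer instead of raw pointers so that the example needs no `unsafe`, and it works in units of f32 elements for simplicity.

```rust
use std::ops::Range;

/// Bump allocator over a single pre-allocated f32 buffer.
struct Arena {
    buf: Vec<f32>,
    used: usize,
}

impl Arena {
    /// Allocate the backing buffer once, sized for the worst-case forward pass.
    fn with_capacity(elems: usize) -> Self {
        Arena { buf: vec![0.0; elems], used: 0 }
    }

    /// Reserve the next `len` elements; returns None when the arena is exhausted.
    fn alloc(&mut self, len: usize) -> Option<Range<usize>> {
        let end = self.used.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        let start = self.used;
        self.used = end; // "bump" the high-water mark
        Some(start..end)
    }

    /// Borrow the storage behind a previously returned range.
    fn slice_mut(&mut self, range: Range<usize>) -> &mut [f32] {
        &mut self.buf[range]
    }

    /// Free every allocation at once; the buffer is kept for the next request.
    fn reset(&mut self) {
        self.used = 0;
    }

    fn used(&self) -> usize {
        self.used
    }
}
```

Allocation is a bounds check plus an addition, and releasing a whole request is a single store, which is what gives the arena its deterministic memory profile.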

Ring-buffer KV cache for autoregressive generation. The key-value cache for transformer attention grows linearly with sequence length. Rather than reallocating on each token, Candle pre-allocates the full cache at the maximum sequence length and writes into it with a rotating index. This eliminates allocation during the generation loop entirely.

pub struct KvCache {
    k: Tensor,          // [batch, heads, max_seq, head_dim] -- pre-allocated
    v: Tensor,          // [batch, heads, max_seq, head_dim] -- pre-allocated
    current_len: usize, // current position in the ring buffer
}
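
A single-head version of the rotating-index idea, using plain `Vec<f32>` in place of `Tensor`, might look like the sketch below. The `KvRing` type and its methods are illustrative names. One behavior is an assumption of this sketch: once the buffer is full, the oldest position is overwritten (a sliding window); the text does not state what happens at `max_seq`.

```rust
/// Pre-allocated key/value storage for one attention head.
struct KvRing {
    k: Vec<f32>, // [max_seq * head_dim], allocated once
    v: Vec<f32>, // [max_seq * head_dim], allocated once
    max_seq: usize,
    head_dim: usize,
    next: usize, // slot the next token will be written to
    len: usize,  // number of valid positions, saturates at max_seq
}

impl KvRing {
    fn new(max_seq: usize, head_dim: usize) -> Self {
        assert!(max_seq > 0 && head_dim > 0);
        let n = max_seq * head_dim;
        KvRing { k: vec![0.0; n], v: vec![0.0; n], max_seq, head_dim, next: 0, len: 0 }
    }

    /// Write one token's key and value in place; no allocation happens here.
    fn append(&mut self, k_tok: &[f32], v_tok: &[f32]) {
        assert_eq!(k_tok.len(), self.head_dim);
        assert_eq!(v_tok.len(), self.head_dim);
        let off = self.next * self.head_dim;
        self.k[off..off + self.head_dim].copy_from_slice(k_tok);
        self.v[off..off + self.head_dim].copy_from_slice(v_tok);
        self.next = (self.next + 1) % self.max_seq;
        self.len = (self.len + 1).min(self.max_seq);
    }

    /// Map logical position `i` (0 = oldest retained token) to a physical slot.
    fn slot(&self, i: usize) -> usize {
        assert!(i < self.len);
        let oldest = if self.len < self.max_seq { 0 } else { self.next };
        (oldest + i) % self.max_seq
    }

    fn key(&self, i: usize) -> &[f32] {
        let s = self.slot(i) * self.head_dim;
        &self.k[s..s + self.head_dim]
    }

    fn value(&self, i: usize) -> &[f32] {
        let s = self.slot(i) * self.head_dim;
        &self.v[s..s + self.head_dim]
    }

    fn len(&self) -> usize {
        self.len
    }
}
```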

Core Operations

Every backend MUST implement the following operation set:

Element-wise Operations

// Arithmetic
fn add(&self, rhs: &Tensor) -> Result<Tensor>;
fn sub(&self, rhs: &Tensor) -> Result<Tensor>;
fn mul(&self, rhs: &Tensor) -> Result<Tensor>;
fn div(&self, rhs: &Tensor) -> Result<Tensor>;

// Unary
fn neg(&self) -> Result<Tensor>;
fn abs(&self) -> Result<Tensor>;
fn exp(&self) -> Result<Tensor>;
fn log(&self) -> Result<Tensor>;
fn sqrt(&self) -> Result<Tensor>;
fn recip(&self) -> Result<Tensor>;
fn relu(&self) -> Result<Tensor>;
fn gelu(&self) -> Result<Tensor>;
fn silu(&self) -> Result<Tensor>;
fn tanh(&self) -> Result<Tensor>;

// Comparison
fn eq(&self, rhs: &Tensor) -> Result<Tensor>;  // returns u8
fn lt(&self, rhs: &Tensor) -> Result<Tensor>;
fn gt(&self, rhs: &Tensor) -> Result<Tensor>;

Reduction Operations

fn sum(&self, dims: &[usize]) -> Result<Tensor>;
fn mean(&self, dims: &[usize]) -> Result<Tensor>;
fn max(&self, dim: usize) -> Result<Tensor>;
fn min(&self, dim: usize) -> Result<Tensor>;
fn argmax(&self, dim: usize) -> Result<Tensor>;
fn argmin(&self, dim: usize) -> Result<Tensor>;

Linear Algebra

fn matmul(&self, rhs: &Tensor) -> Result<Tensor>;
fn transpose(&self, dim1: usize, dim2: usize) -> Result<Tensor>;
fn contiguous(&self) -> Result<Tensor>;
fn broadcast_as(&self, shape: &[usize]) -> Result<Tensor>;
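
The relationship between `transpose` and `contiguous` is easiest to see in terms of strides. In the sketch below, which is a simplified model and not Candle's `Layout` type, transposing only swaps shape and stride entries and leaves the data untouched; materializing a contiguous copy is the step that actually moves elements. All names are illustrative, and the contiguity check ignores edge cases such as size-1 dimensions.

```rust
/// Row-major (C-order) strides, in elements, for a shape.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

#[derive(Clone, Debug, PartialEq)]
struct Layout {
    shape: Vec<usize>,
    strides: Vec<usize>,
}

impl Layout {
    fn contiguous(shape: &[usize]) -> Self {
        Layout { shape: shape.to_vec(), strides: contiguous_strides(shape) }
    }

    /// Transpose is metadata-only: swap two dims and their strides.
    fn transpose(&self, d1: usize, d2: usize) -> Layout {
        let mut t = self.clone();
        t.shape.swap(d1, d2);
        t.strides.swap(d1, d2);
        t
    }

    fn is_contiguous(&self) -> bool {
        self.strides == contiguous_strides(&self.shape)
    }
}

/// Copy a strided 2-D view into a fresh row-major buffer.
fn materialize_2d(data: &[f32], layout: &Layout) -> Vec<f32> {
    assert_eq!(layout.shape.len(), 2);
    let mut out = Vec::with_capacity(layout.shape[0] * layout.shape[1]);
    for r in 0..layout.shape[0] {
        for c in 0..layout.shape[1] {
            out.push(data[r * layout.strides[0] + c * layout.strides[1]]);
        }
    }
    out
}
```

A `[2, 3]` layout has strides `[3, 1]`; after `transpose(0, 1)` it has shape `[3, 2]` and strides `[1, 3]`, which is no longer contiguous. Kernels that assume row-major input therefore need the explicit `contiguous()` call.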

Neural Network Primitives

// Attention
fn scaled_dot_product_attention(
    query: &Tensor,   // [batch, heads, seq_len, head_dim]
    key: &Tensor,     // [batch, heads, kv_len, head_dim]
    value: &Tensor,   // [batch, heads, kv_len, head_dim]
    mask: Option<&Tensor>,
    scale: f64,
) -> Result<Tensor>;

// Normalization
fn layer_norm(x: &Tensor, weight: &Tensor, bias: Option<&Tensor>, eps: f64) -> Result<Tensor>;
fn rms_norm(x: &Tensor, weight: &Tensor, eps: f64) -> Result<Tensor>;

// Positional encoding
fn rotary_embedding(x: &Tensor, cos: &Tensor, sin: &Tensor) -> Result<Tensor>;

// Activation
fn softmax(x: &Tensor, dim: usize) -> Result<Tensor>;
fn log_softmax(x: &Tensor, dim: usize) -> Result<Tensor>;

// Embedding
fn embedding(ids: &Tensor, weight: &Tensor) -> Result<Tensor>;

// Convolution
fn conv2d(
    input: &Tensor,    // [batch, channels, height, width]
    kernel: &Tensor,   // [out_ch, in_ch, kh, kw]
    bias: Option<&Tensor>,
    padding: usize,
    stride: usize,
) -> Result<Tensor>;

// Pooling
fn avg_pool2d(input: &Tensor, kernel_size: usize, stride: usize) -> Result<Tensor>;
fn max_pool2d(input: &Tensor, kernel_size: usize, stride: usize) -> Result<Tensor>;
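
Two of these primitives are small enough to write out as reference semantics. The functions below operate on one row of `f32` values and use the standard library only; they are a sketch of the usual definitions (max-subtracted softmax, and RMSNorm as x / sqrt(mean(x^2) + eps) * weight) and are not the Candle kernels. Inputs are assumed to be non-empty.

```rust
/// Numerically stable softmax over one row: subtracting the row maximum
/// before exponentiating keeps exp() from overflowing on large logits.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// RMSNorm over one row: scale by the reciprocal root-mean-square, then by weight.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

A naive softmax would compute `exp(1000.0)`, which overflows to infinity in f32 and yields NaN; the max-subtracted form returns `[0.5, 0.5]` for the input `[1000.0, 1000.0]`.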

Shape Operations

fn reshape(&self, shape: &[usize]) -> Result<Tensor>;
fn squeeze(&self, dim: usize) -> Result<Tensor>;
fn unsqueeze(&self, dim: usize) -> Result<Tensor>;
fn narrow(&self, dim: usize, start: usize, len: usize) -> Result<Tensor>;
fn chunk(&self, chunks: usize, dim: usize) -> Result<Vec<Tensor>>;
fn cat(tensors: &[&Tensor], dim: usize) -> Result<Tensor>;
fn stack(tensors: &[&Tensor], dim: usize) -> Result<Tensor>;
fn gather(&self, indices: &Tensor, dim: usize) -> Result<Tensor>;
fn scatter(&self, indices: &Tensor, src: &Tensor, dim: usize) -> Result<Tensor>;
fn index_select(&self, indices: &Tensor, dim: usize) -> Result<Tensor>;

Hamiltonian Dynamics Operations

The HMM protocol (HIP-0008) and active inference framework (HIP-0007) require physics-informed tensor operations that preserve energy and symplectic structure. These are not available in standard ML frameworks. Candle extends its operation set with:

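/// Symplectic Euler integrator: advances (q, p) by one timestep dt
/// under Hamiltonian H(q, p). Preserves the symplectic 2-form exactly, provided
/// the gradients are taken at the mixed point (q, p_next): p_next = p - dt * dH/dq,
/// then q_next = q + dt * dH/dp. Gradients taken at (q, p) give explicit Euler instead.
fn symplectic_euler_step(
    q: &Tensor,               // generalized positions [batch, dim]
    p: &Tensor,               // generalized momenta   [batch, dim]
    grad_h_q: &Tensor,        // dH/dq evaluated at (q, p_next); for separable H it depends on q only
    grad_h_p: &Tensor,        // dH/dp evaluated at (q, p_next)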
    dt: f64,
) -> Result<(Tensor, Tensor)>; // (q_next, p_next)

/// Leapfrog (Stormer-Verlet) integrator: second-order symplectic method.
/// Used for Hamiltonian Monte Carlo sampling in HLLM (HIP-0002).
fn leapfrog_step(
    q: &Tensor,
    p: &Tensor,
    grad_h: impl Fn(&Tensor) -> Result<Tensor>,  // gradient of H w.r.t. q
    dt: f64,
    n_steps: usize,
) -> Result<(Tensor, Tensor)>;

/// Compute the Hamiltonian energy H(q, p) = kinetic(p) + potential(q).
/// Used to verify energy conservation in symplectic integrators.
fn hamiltonian_energy(
    q: &Tensor,
    p: &Tensor,
    potential: impl Fn(&Tensor) -> Result<Tensor>,
    mass_matrix: Option<&Tensor>,
) -> Result<Tensor>;  // scalar energy per batch element

These operations are critical for:

  • HMM (HIP-0008): Hamiltonian Market Maker uses symplectic integrators to evolve market state along energy-preserving trajectories, ensuring conservation laws that prevent arbitrage.
  • HLLM (HIP-0002): Hamiltonian LLMs use leapfrog integration for HMC sampling during inference, enabling Bayesian uncertainty quantification over token probabilities.
  • Active Inference (HIP-0007): Free energy minimization via variational inference requires Hamiltonian dynamics for efficient posterior sampling.
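
The practical meaning of "energy-preserving" shows up on the simplest possible system. The sketch below integrates a unit-mass harmonic oscillator, H(q, p) = p^2/2 + q^2/2, with scalar `f64` state in place of tensors; the function names and the choice of test system are assumptions made for illustration. It runs the leapfrog scheme next to a non-symplectic explicit Euler step so the difference in energy behavior can be checked directly.

```rust
/// Leapfrog (Stormer-Verlet) for H = p^2/2 + V(q) with unit mass.
fn leapfrog(mut q: f64, mut p: f64, grad_v: impl Fn(f64) -> f64, dt: f64, n_steps: usize) -> (f64, f64) {
    for _ in 0..n_steps {
        p -= 0.5 * dt * grad_v(q); // half kick with the old position
        q += dt * p;               // full drift with the half-updated momentum
        p -= 0.5 * dt * grad_v(q); // half kick with the new position
    }
    (q, p)
}

/// Explicit Euler: both gradients at the old point. Not symplectic.
fn explicit_euler(mut q: f64, mut p: f64, grad_v: impl Fn(f64) -> f64, dt: f64, n_steps: usize) -> (f64, f64) {
    for _ in 0..n_steps {
        let (dq, dp) = (p, -grad_v(q));
        q += dt * dq;
        p += dt * dp;
    }
    (q, p)
}

/// Energy of the unit-mass harmonic oscillator.
fn energy(q: f64, p: f64) -> f64 {
    0.5 * p * p + 0.5 * q * q
}
```

Starting from (q, p) = (1, 0) with dt = 0.1, the initial energy is 0.5. After 1000 leapfrog steps the energy remains within a small bounded error of 0.5, whereas explicit Euler multiplies q^2 + p^2 by (1 + dt^2) on every step, so its energy grows without bound.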

Model Formats

| Format | Extension | Use Case | Quantization | Loading |
|--------|-----------|----------|--------------|---------|
| safetensors | .safetensors | Primary format for all models | No (full precision) | Memory-mapped, zero-copy |
| GGUF | .gguf | Quantized models for CPU/Metal inference | Q4_0, Q4_K_M, Q5_K_M, Q8_0 | Streamed with on-the-fly dequant |
| ONNX | .onnx | Import from external frameworks | INT8 (limited) | Full graph import |

safetensors is the REQUIRED format for model storage and distribution. GGUF is REQUIRED for quantized deployment. ONNX is OPTIONAL for import compatibility.

Why safetensors, not pickle: Python pickle files (PyTorch's .pt/.bin format) can execute arbitrary code on load. A malicious model file can run os.system("rm -rf /") when loaded. safetensors is a simple binary format that contains only tensor metadata and raw bytes. It cannot execute code. For a system that loads models from external sources (HuggingFace Hub, user uploads), this is a non-negotiable security requirement.

Why safetensors as primary, not GGUF: GGUF is excellent for quantized deployment but limited as a storage format. Its layouts are designed around block-quantized tensors, and quantization is lossy: once weights are stored as Q4_K_M, the full-precision values cannot be recovered or re-quantized to a different scheme without further loss. safetensors stores full-precision weights that can be quantized to any target precision. The workflow is: store in safetensors, quantize to GGUF at deployment time for the target hardware.

Why not ONNX as primary: ONNX is a graph format, not a weight format. It encodes the model's computation graph alongside weights. This means ONNX files are specific to a model architecture version -- you cannot swap in new weights for the same architecture without re-exporting. safetensors stores weights independently of architecture, enabling weight-only updates.

Quantization

Quantization reduces model size and inference cost by representing weights in lower precision. The following quantization schemes MUST be supported:

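| Scheme | Bits | Block Size | Quality | Speed (vs f16) | Memory |
|--------|------|------------|---------|----------------|--------|
| f16 | 16 | - | Baseline | 1x | 50% of f32 |
| INT8 | 8 | per-channel | ~99% of f16 | 1.5x | 50% of f16 |
| GPTQ | 4 | 128 | ~97% of f16 | 2x | 25% of f16 |
| AWQ | 4 | 128 | ~97% of f16 | 2x | 25% of f16 |
| Q4_K_M | 4.5 (avg) | 256 (super-blocks) | ~96% of f16 | 2.5x (CPU) | 28% of f16 |
| Q5_K_M | 5.5 (avg) | 256 (super-blocks) | ~98% of f16 | 2x (CPU) | 34% of f16 |
| Q8_0 | 8 | 32 | ~99.5% of f16 | 1.8x (CPU) | 50% of f16 |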

GPTQ and AWQ are GPU-optimized (CUDA). Q4_K_M and Q5_K_M are CPU/Metal-optimized (GGUF format). For Apple Silicon deployment, Q4_K_M provides the best quality-per-TFLOP.
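
To make the "Block Size" and "Memory" columns concrete, the sketch below implements a symmetric absmax block quantizer in the style of Q8_0 (one scale shared by 32 weights) and a helper that converts bits per weight into bytes. It is an illustration under assumptions: the `BlockQ8` struct keeps its scale as f32 for simplicity and is not the GGUF on-disk layout, and all names are hypothetical.

```rust
/// One quantized block: 32 weights as i8 sharing a single scale.
struct BlockQ8 {
    scale: f32,
    qs: [i8; 32],
}

/// Quantize 32 weights so that the largest magnitude maps to +/-127.
fn quantize_block(x: &[f32; 32]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 }; // all-zero block
    let mut qs = [0i8; 32];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, qs }
}

/// Reconstruct approximate weights: each value is off by at most scale / 2.
fn dequantize_block(b: &BlockQ8) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, q) in out.iter_mut().zip(b.qs.iter()) {
        *o = *q as f32 * b.scale;
    }
    out
}

/// Weight storage in bytes for a parameter count and average bits per weight.
fn weight_bytes(params: u64, bits_per_weight: f64) -> u64 {
    (params as f64 * bits_per_weight / 8.0).round() as u64
}
```

As a worked figure, `weight_bytes(7_000_000_000, 4.5)` is 3,937,500,000 bytes, so a 7B-parameter model at 4.5 bits per weight needs roughly 3.9 GB for weights, consistent with the 4GB figure used in the Motivation section.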

Model Architectures

The following architectures MUST have reference implementations in candle-transformers:

| Architecture | Type | Models | Key Operations |
|--------------|------|--------|----------------|
| Transformer (decoder) | Autoregressive LLM | LLaMA, Mistral, Phi, Falcon, Qwen, Zen | Causal attention, RoPE, GQA, MoE routing |
| Transformer (encoder) | Embedding/classification | BERT, RoBERTa, BGE, E5 | Bidirectional attention, [CLS] pooling |
| Transformer (encoder-decoder) | Seq2seq | T5, Whisper | Cross-attention, encoder KV cache |
| Diffusion (UNet) | Image generation | Stable Diffusion, SDXL | Conv2d, GroupNorm, cross-attention |
| Diffusion (DiT) | Image generation | Flux, SD3 | AdaLN, patchify, unpatchify |
| Vision (ViT) | Image understanding | CLIP, SigLIP, DINOv2, BLIP, SAM | Patch embedding, CLS token |
| Audio (Conformer) | Speech processing | Whisper, Encodec | Conv1d, relative attention, mel spectrogram |

GPU Kernel Interface

Custom CUDA and Metal kernels are essential for competitive performance on quantized operations and fused attention. Candle defines a kernel interface that backend-specific code must implement:

CUDA Kernels (candle-kernels/):

// Quantized matrix multiplication: dequantize Q4_K_M blocks on-the-fly
// during matmul, avoiding a separate dequantization pass.
__global__ void q4_k_m_matmul(
    const void* __restrict__ a,      // quantized weights [M, K] in Q4_K_M
    const float* __restrict__ b,     // activations [K, N] in f32
    float* __restrict__ c,           // output [M, N] in f32
    int M, int K, int N
);

// Flash Attention v2: fused softmax(QK^T/sqrt(d))V with online softmax
// and tiling for O(N) memory instead of O(N^2).
__global__ void flash_attention_v2(
    const half* __restrict__ Q, const half* __restrict__ K,
    const half* __restrict__ V, half* __restrict__ O,
    int batch, int heads, int seq_len, int head_dim
);

// Fused RMSNorm + residual add: avoids writing intermediate
// residual tensor to global memory.
__global__ void fused_rms_norm_residual(
    const float* __restrict__ input,
    const float* __restrict__ residual,
    const float* __restrict__ weight,
    float* __restrict__ output,
    float eps, int hidden_size
);

Metal Kernels (candle-metal-kernels/):

// Custom Q4_K_M matmul for Apple Silicon: uses threadgroup memory
// for block dequantization, achieving 2-3x over naive implementation.
kernel void q4_k_m_matmul_metal(
    device const void* a       [[buffer(0)]],
    device const float* b      [[buffer(1)]],
    device float* c            [[buffer(2)]],
    constant uint& M           [[buffer(3)]],
    constant uint& K           [[buffer(4)]],
    constant uint& N           [[buffer(5)]],
    uint2 gid                  [[thread_position_in_grid]],
    uint2 lid                  [[thread_position_in_threadgroup]]
);

Custom kernels are registered via the CustomOp trait, which allows backend-specific dispatch without modifying core tensor code:

pub trait CustomOp: Send + Sync {
    fn name(&self) -> &str;
    fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape)>;
    fn cuda_fwd(&self, storage: &CudaStorage, layout: &Layout) -> Result<(CudaStorage, Shape)>;
    fn metal_fwd(&self, storage: &MetalStorage, layout: &Layout) -> Result<(MetalStorage, Shape)>;
}
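
A CPU-only reduction of this trait shows how dispatch through a trait object works. Everything below is a self-contained mock written for illustration: `CpuStorage`, `Layout`, and `Shape` are simplified stand-ins for the real types, the `cuda_fwd` and `metal_fwd` methods are omitted because they need device storage, and the `Clamp` op and `apply` helper are hypothetical.

```rust
/// Simplified stand-ins for the storage, layout, and shape types named above.
struct CpuStorage(Vec<f32>);
struct Layout { dims: Vec<usize> }
#[derive(Debug, PartialEq)]
struct Shape(Vec<usize>);

/// CPU-only version of the trait, for illustration.
trait CustomOp: Send + Sync {
    fn name(&self) -> &str;
    fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape), String>;
}

/// Hypothetical op: clamp every element to [-limit, limit].
struct Clamp { limit: f32 }

impl CustomOp for Clamp {
    fn name(&self) -> &str { "clamp" }

    fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape), String> {
        // Validate that the layout describes exactly the elements in storage.
        let expected: usize = layout.dims.iter().product();
        if expected != storage.0.len() {
            return Err(format!(
                "{}: layout has {} elements, storage has {}",
                self.name(), expected, storage.0.len()
            ));
        }
        let out = storage.0.iter().map(|&v| v.clamp(-self.limit, self.limit)).collect();
        Ok((CpuStorage(out), Shape(layout.dims.clone())))
    }
}

/// Core code only sees `&dyn CustomOp`; it never names the concrete op type.
fn apply(op: &dyn CustomOp, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape), String> {
    op.cpu_fwd(storage, layout)
}
```

Because `apply` depends only on the trait, new operations can be added in downstream crates without touching the dispatch code, which is the property the `CustomOp` design relies on.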

WASM Compilation Pipeline

The WASM pipeline compiles Candle to WebAssembly and packages it as an npm module for browser consumption. The full pipeline:

Rust source (candle-core, candle-nn, candle-transformers)
    |
    v
wasm-pack build --target web --release
    |  (compiles to wasm32-unknown-unknown, runs wasm-opt)
    v
pkg/
  candle_wasm_bg.wasm     (~2MB, gzipped ~800KB)
  candle_wasm.js          (JS bindings generated by wasm-bindgen)
  candle_wasm.d.ts        (TypeScript declarations)
    |
    v
npm publish @hanzoai/candle-wasm

Browser usage:

import init, { Model } from '@hanzoai/candle-wasm';

// Initialize WASM module
await init();

// Load a quantized model (fetched as ArrayBuffer)
const weights = await fetch('/models/phi-3-mini-q4.gguf')
  .then(r => r.arrayBuffer());

const model = new Model(new Uint8Array(weights));

// Run inference
const tokens = model.encode("What is the capital of France?");
const output = model.generate(tokens, { max_tokens: 100, temperature: 0.7 });
const text = model.decode(output);

WASM-specific constraints:

  • No GPU access: All computation runs on CPU with optional SIMD128
  • Memory limit: Browsers limit WASM linear memory to 4GB (in practice, 2GB is safer)
  • No filesystem: Model weights must be fetched via HTTP and loaded from ArrayBuffer
  • No threads by default: SharedArrayBuffer requires COOP/COEP headers; single-threaded fallback is mandatory

Crate Structure

candle/
  candle-core/           # Tensor type, DType, Device, Storage, ops
    src/
      tensor.rs          # Core Tensor struct and operations
      dtype.rs           # Data type definitions and conversions
      device.rs          # Device enum and backend dispatch
      storage.rs         # CpuStorage, CudaStorage, MetalStorage
      op.rs              # Operation enum for autograd graph
      error.rs           # CandleError type
      layout.rs          # Shape, stride, contiguity checks
      backprop.rs        # Backward pass (training only)
  candle-nn/             # Neural network layers
    src/
      linear.rs          # Linear (dense) layer
      conv.rs            # Conv1d, Conv2d
      embedding.rs       # Token/position embedding
      layer_norm.rs      # LayerNorm, RMSNorm
      activation.rs      # GELU, SiLU, ReLU
      attention.rs       # Multi-head attention, GQA, flash attention
      rotary.rs          # Rotary position embedding (RoPE)
      var_builder.rs     # Weight loading from safetensors
  candle-transformers/   # Pre-built model architectures
    src/models/
      llama.rs           # LLaMA 2/3 and derivatives
      mistral.rs         # Mistral, Mixtral (MoE)
      phi.rs             # Phi-2, Phi-3
      falcon.rs          # Falcon
      qwen.rs            # Qwen
      stable_diffusion/  # SD 1.5, SDXL, SD3
      whisper.rs         # Whisper (speech-to-text)
      clip.rs            # CLIP (vision-language)
      blip.rs            # BLIP (image captioning)
      sam.rs             # SAM (segmentation)
      bert.rs            # BERT, RoBERTa
      bge.rs             # BGE (embedding)
      quantized_llama.rs # GGUF quantized LLaMA
  candle-hamiltonian/    # Physics-informed operations
    src/
      integrators.rs     # Symplectic Euler, Leapfrog, Yoshida
      energy.rs          # Hamiltonian energy computation
      hmc.rs             # Hamiltonian Monte Carlo sampling
  candle-metal-kernels/  # Custom Metal compute shaders
    src/
      quantized.metal    # Q4/Q5 dequantization + matmul
      attention.metal    # Flash attention for Metal
      layernorm.metal    # Fused LayerNorm kernel
  candle-kernels/        # Custom CUDA kernels
    src/
      quantized.cu       # Q4/Q5 dequantization + matmul
      flash_attn.cu      # Flash attention v2
      fused_ops.cu       # Fused RMSNorm + residual
  candle-wasm/           # WASM build target and JS bindings
    src/
      lib.rs             # wasm-bindgen entry points
      model.rs           # High-level Model API for JS
    pkg/                 # wasm-pack output (gitignored)

API Surface

The public API is organized into three layers:

Layer 1: Tensor operations (candle-core)

use candle_core::{Device, DType, Tensor, Result};

// Create tensors
let device = Device::Metal(0);
let x = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
let w = Tensor::zeros((3, 4), DType::F32, &device)?;

// Operations
let y = x.matmul(&w)?;                    // [2, 4]
let y = y.relu()?;                         // element-wise ReLU
let y = candle_nn::ops::softmax(&y, 1)?;  // softmax over dim 1

// Device transfer
let y_cpu = y.to_device(&Device::Cpu)?;
let data: Vec<Vec<f32>> = y_cpu.to_vec2()?;  // rank-2 tensor -> nested rows

Layer 2: Neural network layers (candle-nn)

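use candle_core::{DType, Device};
use candle_nn::{layer_norm, linear, Module, VarBuilder};

// Load weights from safetensors. In upstream Candle the memory-mapped
// constructor is `unsafe` because the file must not change while mapped.
let vb = unsafe {
    VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F16, &Device::Cuda(0))?
};

// Build layers (upstream builder functions take dimensions first, then the VarBuilder)
let fc1 = linear(768, 3072, vb.pp("fc1"))?;
let norm = layer_norm(768, 1e-5, vb.pp("ln"))?;

// Forward pass; `input` is assumed to be a [batch, seq, 768] tensor on the same device
let x = norm.forward(&input)?;
let x = fc1.forward(&x)?;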

Layer 3: Complete models (candle-transformers)

use candle_core::{D, Tensor};
use candle_transformers::models::llama::{Llama, Config};

// Load model
let config = Config::from_file("config.json")?;
let model = Llama::load(&vb, &config)?;

// Generate
let tokens = tokenizer.encode("Hello, world")?;
let input = Tensor::new(tokens.as_slice(), &device)?;
let logits = model.forward(&input, 0)?;  // position offset = 0
let next_token = logits.argmax(D::Minus1)?;
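
The snippet above produces a single next token. A full generation loop repeats the forward pass, feeding back each new token with an increasing position offset. The sketch below illustrates that loop without any model dependency: `greedy_decode` and `argmax` are hypothetical names, and the `forward` closure is a stand-in for a model call that is assumed to return logits for the last position only.

```rust
/// Index of the largest logit; ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, v) in logits.iter().enumerate() {
        if *v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy autoregressive decoding.
/// `forward(tokens, offset)` receives only the tokens not yet processed plus
/// their position offset (a KV cache is assumed to hold the rest) and returns
/// logits for the last position.
fn greedy_decode<F>(prompt: &[u32], max_new: usize, eos: u32, mut forward: F) -> Vec<u32>
where
    F: FnMut(&[u32], usize) -> Vec<f32>,
{
    let mut generated = Vec::new();
    let mut pending: Vec<u32> = prompt.to_vec();
    let mut offset = 0usize;
    for _ in 0..max_new {
        let logits = forward(&pending, offset);
        offset += pending.len();
        let next = argmax(&logits);
        if next == eos {
            break;
        }
        generated.push(next);
        // After the prompt, only the newest token is fed to the model.
        pending = vec![next];
    }
    generated
}
```

The first call sees the whole prompt at offset 0; every later call sees one token at the offset of the tokens already processed, which matches the `position offset` argument in the example above.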

Implementation

Repository

github.com/hanzoai/candle -- fork of huggingface/candle with Hanzo extensions.

Build

# CPU only (all platforms)
cargo build --release

# CUDA (NVIDIA GPUs, Linux)
cargo build --release --features cuda

# Metal (Apple Silicon, macOS)
cargo build --release --features metal

# WASM (browser/edge)
wasm-pack build candle-wasm --target web --release

# All features (development)
cargo build --release --features "cuda metal"
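
Cargo features are resolved at compile time, so a binary built with one of the commands above contains only the backends that were enabled. The sketch below shows one way application code might report which backend it was compiled for. It assumes the application crate declares features named `cuda` and `metal` that mirror the flags above; `compiled_backend` is a hypothetical helper, not part of the Candle API.

```rust
/// Name of the preferred backend this binary was compiled with.
/// `cfg!` is evaluated at compile time, so the untaken branches are dead code.
fn compiled_backend() -> &'static str {
    if cfg!(feature = "cuda") {
        "cuda"
    } else if cfg!(feature = "metal") {
        "metal"
    } else {
        // No accelerator feature enabled: the CPU backend is always present.
        "cpu"
    }
}
```

In this sketch a build with both features enabled reports `cuda`; the priority order is an arbitrary choice for illustration.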

Integration Points

| System | Integration | Details |
|---|---|---|
| Zen Gateway (HIP-0039) | Edge inference | Quantized Zen models served via Candle on CPU/Metal for low-latency edge nodes |
| Studio (HIP-0035) | Diffusion inference | Stable Diffusion / Flux pipelines run on Candle Metal backend for Apple Silicon users |
| Jin (HIP-0003) | Multimodal backbone | Jin model architectures implemented as Candle modules in candle-transformers |
| LLM Gateway (HIP-0004) | Local model serving | Gateway routes to local Candle inference workers for on-premise deployments |
| Object Storage (HIP-0032) | Model distribution | Model weights stored in safetensors/GGUF format in Hanzo Object Storage, loaded by Candle |
| MCP (HIP-0010) | Tool inference | MCP tool servers embed Candle for classification, embedding, and small-model inference |
| Node (HIP-0020) | Verified inference | Blockchain nodes use Candle for deterministic inference with reproducible outputs |
| HMM (HIP-0008) | Market dynamics | Hamiltonian Market Maker uses candle-hamiltonian for symplectic market state evolution |

Execution Flow

User Request
    |
    v
Zen Gateway (HIP-0039) / LLM Gateway (HIP-0004)
    |
    v
Model Router (selects model + quantization)
    |
    v
Candle Inference Worker
    |
    +---> Tokenizer (HuggingFace tokenizers, Rust)
    |         |
    |         v
    +---> Model Forward Pass (candle-transformers)
    |         |
    |         +---> Embedding lookup
    |         +---> N x Transformer Block
    |         |         +---> RMSNorm
    |         |         +---> Multi-Head Attention (with KV cache)
    |         |         +---> RMSNorm
    |         |         +---> FFN (or MoE routing)
    |         +---> Final LayerNorm
    |         +---> LM Head (matmul to vocab)
    |         v
    +---> Sampling (temperature, top-p, top-k)
    |         |
    |         v
    +---> Detokenize
    |
    v
Response (streamed tokens)
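
The sampling stage in the flow above combines three filters. The sketch below shows one common way to compose them (temperature scaling, then top-k, then top-p, then renormalization); it is an illustration under that assumed ordering, not the worker's actual code, and `filtered_distribution` and `pick` are hypothetical names. The random draw is passed in as `u` so the example stays dependency-free.

```rust
/// Turn raw logits into a filtered, renormalized distribution.
/// Returns (token_id, probability) pairs sorted by descending probability.
fn filtered_distribution(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Temperature-scaled softmax (max-subtracted for numerical stability).
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    // Stable sort, so equal probabilities keep ascending token order.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Top-k: keep the k most likely tokens (0 disables the filter).
    if top_k > 0 {
        probs.truncate(top_k);
    }
    // Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cumulative = 0.0f32;
    let mut keep = probs.len();
    for (i, (_, p)) in probs.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            keep = i + 1;
            break;
        }
    }
    probs.truncate(keep);
    // Renormalize the survivors so they sum to 1.
    let total: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.iter().map(|&(id, p)| (id, p / total)).collect()
}

/// Inverse-CDF draw; `u` is a uniform sample in [0, 1) from any RNG.
fn pick(dist: &[(usize, f32)], u: f32) -> usize {
    let mut cumulative = 0.0f32;
    for &(id, p) in dist {
        cumulative += p;
        if u < cumulative {
            return id;
        }
    }
    // Guard against rounding leaving the cumulative sum just below 1.
    dist.last().map(|&(id, _)| id).unwrap_or(0)
}
```

Setting `top_k` to 1 reduces this to greedy decoding, and lower temperatures concentrate probability on the most likely tokens before the filters apply.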

Performance Characteristics

Benchmarks on Apple M3 Max (36GB unified memory), LLaMA-2 7B:

| Configuration | Tokens/sec | Memory | Startup |
|---|---|---|---|
| PyTorch f16 (MPS) | ~25 t/s | 14GB | 8s |
| llama.cpp Q4_K_M | ~55 t/s | 4.2GB | 0.3s |
| Candle f16 (Metal) | ~30 t/s | 13GB | 0.1s |
| Candle Q4_K_M (Metal) | ~50 t/s | 4.2GB | 0.1s |
| Candle Q4_K_M (CPU, Accelerate) | ~15 t/s | 4.2GB | 0.08s |

Benchmarks on NVIDIA A100 (80GB), LLaMA-2 7B:

| Configuration | Tokens/sec | Memory | Startup |
|---|---|---|---|
| vLLM f16 (CUDA) | ~120 t/s | 14GB | 15s |
| Candle f16 (CUDA) | ~95 t/s | 13.5GB | 0.2s |
| Candle GPTQ-4bit (CUDA) | ~140 t/s | 4GB | 0.2s |
| TensorRT-LLM INT4 | ~180 t/s | 4GB | 30s |

Benchmarks for WASM (Chrome 120, M3 MacBook Pro), Phi-3-mini 3.8B Q4:

| Configuration | Tokens/sec | WASM Size | Model Size |
|---|---|---|---|
| Candle WASM (SIMD128) | ~8 t/s | 2.1MB | 2.0GB |
| Candle WASM (no SIMD) | ~3 t/s | 1.9MB | 2.0GB |
| ONNX Runtime WASM | ~5 t/s | 18MB | 2.2GB (ONNX) |

Key observations:

  • Candle does not lead on raw GPU throughput: at f16, vLLM is faster (~120 vs ~95 t/s), and TensorRT-LLM INT4 is the fastest configuration measured (~180 t/s). Candle's advantages are startup time (0.2s vs 15-30s), comparable or lower memory use, and deployment simplicity.
  • On Apple Silicon, Candle with custom Metal kernels approaches llama.cpp throughput (~50 vs ~55 t/s at Q4_K_M) while offering a Rust-native API.
  • For WASM deployment, Candle outperforms ONNX Runtime WASM in this benchmark (~8 vs ~5 t/s with SIMD128), and its WASM binary is roughly 9x smaller (2.1MB vs 18MB).
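
The Memory columns above are dominated by weight storage, which can be estimated as parameters times bits per weight divided by 8. The sketch below works that arithmetic. The figure of 6.74 billion parameters for LLaMA-2 7B is an assumption taken from outside this document, and KV cache, activations, and runtime buffers are ignored; `weight_gb` and `effective_bits` are hypothetical helper names.

```rust
/// Approximate weight memory in gigabytes (10^9 bytes).
fn weight_gb(parameters: f64, bits_per_weight: f64) -> f64 {
    parameters * bits_per_weight / 8.0 / 1e9
}

/// Inverse view: bits per parameter implied by a measured footprint.
fn effective_bits(memory_gb: f64, parameters: f64) -> f64 {
    memory_gb * 1e9 * 8.0 / parameters
}
```

Under that assumption, `weight_gb(6.74e9, 16.0)` gives 13.48, in line with the 13-14GB f16 rows, and `effective_bits(4.2, 6.74e9)` gives about 4.99, so the 4.2GB rows correspond to roughly 5 bits per parameter once quantization scales and any runtime overhead are counted.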

Error Handling

All operations return candle_core::Result<T>, which is std::result::Result<T, candle_core::Error>. Error variants:

pub enum Error {
    // Shape errors
    ShapeMismatch { expected: Shape, got: Shape, op: &'static str },
    UnexpectedNumberOfDims { expected: usize, got: usize },

    // Device errors
    DeviceMismatch { lhs: Device, rhs: Device, op: &'static str },
    NotAvailable(&'static str),  // e.g., "CUDA not available"

    // Type errors
    DTypeMismatch { expected: DType, got: DType, op: &'static str },
    UnsupportedDTypeForOp { dtype: DType, op: &'static str },

    // Backend errors
    Cuda(CudaError),
    Metal(MetalError),

    // I/O
    SafetensorError(safetensors::Error),
    Io(std::io::Error),

    // Generic
    Msg(String),
    Wrapped(Box<dyn std::error::Error + Send + Sync>),
}

Each variant carries the context needed to diagnose the failure. A ShapeMismatch, for example, reports the expected shape, the actual shape, and the operation that failed.
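
To illustrate how such structured errors can be produced and consumed, the sketch below validates matmul operand shapes and returns a variant that names the failing operation. `TensorError` is a reduced, hypothetical stand-in for the enum above (shapes are plain vectors rather than `Shape`), and `matmul_shape` is not a candle-core function.

```rust
use std::fmt;

/// Reduced stand-in for the `Error` enum above; shapes are plain vectors here.
#[derive(Debug, PartialEq)]
enum TensorError {
    ShapeMismatch { expected: Vec<usize>, got: Vec<usize>, op: &'static str },
    UnexpectedNumberOfDims { expected: usize, got: usize },
}

impl fmt::Display for TensorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TensorError::ShapeMismatch { expected, got, op } => {
                write!(f, "{op}: expected shape {expected:?}, got {got:?}")
            }
            TensorError::UnexpectedNumberOfDims { expected, got } => {
                write!(f, "expected {expected} dims, got {got}")
            }
        }
    }
}

/// Output shape of a 2-D matmul, or a precise error describing the mismatch.
fn matmul_shape(lhs: &[usize], rhs: &[usize]) -> Result<Vec<usize>, TensorError> {
    for shape in [lhs, rhs] {
        if shape.len() != 2 {
            return Err(TensorError::UnexpectedNumberOfDims { expected: 2, got: shape.len() });
        }
    }
    if lhs[1] != rhs[0] {
        // The right-hand side must have lhs[1] rows; its column count is free.
        return Err(TensorError::ShapeMismatch {
            expected: vec![lhs[1], rhs[1]],
            got: rhs.to_vec(),
            op: "matmul",
        });
    }
    Ok(vec![lhs[0], rhs[1]])
}
```

Multiplying a [2, 3] matrix by a [5, 4] matrix in this sketch yields the message `matmul: expected shape [3, 4], got [5, 4]`, which a caller can either log or match on to recover.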

Security Considerations

  • Model file validation: safetensors format is designed to be safe against arbitrary code execution (unlike pickle-based formats). GGUF files contain only tensor data and metadata. Neither format can execute code on load.
  • Memory safety: Rust's ownership model prevents buffer overflows in tensor operations. CUDA and Metal kernels are the primary attack surface -- these are audited manually and fuzzed with random tensor shapes.
  • WASM sandboxing: Browser-based inference runs in the WASM sandbox. The model cannot access the filesystem, network, or other browser APIs unless explicitly granted by the host application.
  • Supply chain: The candle-core crate has minimal dependencies (num-traits, half, safetensors, memmap2). No transitive dependency on OpenSSL or other C libraries for the CPU backend.
  • Quantized weight integrity: GGUF files loaded from untrusted sources MUST have their header checksums verified before tensor data is accessed. Malformed block sizes or dimension metadata could cause out-of-bounds reads in dequantization kernels.
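
The structural part of validating an untrusted GGUF file can be sketched as follows. This assumes the publicly documented GGUF v2/v3 header layout (4-byte magic `GGUF`, then a little-endian u32 version, u64 tensor count, and u64 metadata key-value count); checksum verification itself is not shown. All function names and the `max_tensors` cap are hypothetical and are not taken from the reference implementation.

```rust
/// Fixed-size GGUF header fields (assumed v2/v3 layout, little-endian).
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn read_u32_le(bytes: &[u8], at: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[at..at + 4]);
    u32::from_le_bytes(buf)
}

fn read_u64_le(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Parse and sanity-check the first 24 bytes of a GGUF file.
/// `max_tensors` is a caller-chosen cap that rejects absurd counts before any
/// allocation is sized from untrusted input.
fn parse_gguf_header(bytes: &[u8], max_tensors: u64) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err(format!("header truncated: {} bytes", bytes.len()));
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic".to_string());
    }
    let version = read_u32_le(bytes, 4);
    if version != 2 && version != 3 {
        return Err(format!("unsupported version {version}"));
    }
    let tensor_count = read_u64_le(bytes, 8);
    let metadata_kv_count = read_u64_le(bytes, 16);
    if tensor_count > max_tensors {
        return Err(format!("tensor count {tensor_count} exceeds cap {max_tensors}"));
    }
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}

/// Check that a tensor's byte range lies inside the file, without overflow.
fn tensor_in_bounds(offset: u64, len: u64, file_len: u64) -> bool {
    offset.checked_add(len).map_or(false, |end| end <= file_len)
}
```

Running a bounds check like `tensor_in_bounds` for every tensor entry before dequantization is what prevents the out-of-bounds reads described above; `checked_add` matters because an attacker-chosen offset near `u64::MAX` would otherwise wrap around.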

Backwards Compatibility

This HIP establishes the initial tensor operations standard. Future changes to the operation set MUST be additive -- existing operations MUST NOT change signature or semantics. New DTypes and backends MAY be added. Deprecation of an operation requires a new HIP.

Test Vectors

Reference implementations MUST pass the following test cases:

#[test]
fn test_matmul_basic() {
    let a = Tensor::new(&[[1f32, 2.], [3., 4.]], &Device::Cpu).unwrap();
    let b = Tensor::new(&[[5f32, 6.], [7., 8.]], &Device::Cpu).unwrap();
    let c = a.matmul(&b).unwrap();
    assert_eq!(c.to_vec2::<f32>().unwrap(), &[[19., 22.], [43., 50.]]);
}

#[test]
fn test_softmax() {
    let x = Tensor::new(&[1f32, 2., 3.], &Device::Cpu).unwrap();
    let y = candle_nn::ops::softmax(&x, 0).unwrap();
    let vals = y.to_vec1::<f32>().unwrap();
    assert!((vals[0] - 0.0900).abs() < 1e-3);
    assert!((vals[1] - 0.2447).abs() < 1e-3);
    assert!((vals[2] - 0.6652).abs() < 1e-3);
}

#[test]
fn test_rope_embedding() {
    // Sanity check: an identity rotation (cos=1, sin=0) must leave the input unchanged
    let x = Tensor::ones((1, 1, 4, 64), DType::F32, &Device::Cpu).unwrap();
    let cos = Tensor::ones((4, 64), DType::F32, &Device::Cpu).unwrap();
    let sin = Tensor::zeros((4, 64), DType::F32, &Device::Cpu).unwrap();
    let y = candle_nn::rotary::apply_rotary_emb(&x, &cos, &sin).unwrap();
    // With cos=1, sin=0, output should equal input
    let diff = (&x - &y).unwrap().abs().unwrap().sum_all().unwrap();
    assert_eq!(diff.to_scalar::<f32>().unwrap(), 0.0);
}

#[test]
fn test_device_transfer() {
    let x = Tensor::randn(0f32, 1.0, (2, 3), &Device::Cpu).unwrap();
    // Transferring to the same device must preserve values exactly
    let x2 = x.to_device(&Device::Cpu).unwrap();
    let diff = (x - x2).unwrap().abs().unwrap().sum_all().unwrap();
    assert_eq!(diff.to_scalar::<f32>().unwrap(), 0.0);
}

#[test]
fn test_symplectic_energy_conservation() {
    // Symplectic integrator must conserve energy to within numerical precision
    let q = Tensor::new(&[1.0f64, 0.0], &Device::Cpu).unwrap();
    let p = Tensor::new(&[0.0f64, 1.0], &Device::Cpu).unwrap();
    let h0 = hamiltonian_energy(&q, &p, harmonic_potential, None).unwrap();
    let (q1, p1) = leapfrog_step(&q, &p, harmonic_grad, 0.01, 1000).unwrap();
    let h1 = hamiltonian_energy(&q1, &p1, harmonic_potential, None).unwrap();
    let drift = (h1 - h0).unwrap().abs().unwrap().to_scalar::<f64>().unwrap();
    assert!(drift < 1e-6, "Energy drift {} exceeds tolerance", drift);
}

#[test]
fn test_quantized_load() {
    // GGUF quantized model must load and produce valid logits
    let model = QuantizedLlama::load("test_model.gguf", &Device::Cpu).unwrap();
    let input = Tensor::new(&[1u32, 2, 3], &Device::Cpu).unwrap();
    let logits = model.forward(&input, 0).unwrap();
    assert_eq!(logits.dims(), &[1, 3, VOCAB_SIZE]);
    // Logits must be finite
    assert!(logits.to_vec3::<f32>().unwrap().iter()
        .all(|batch| batch.iter().all(|seq| seq.iter().all(|v| v.is_finite()))));
}
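
The energy-conservation vector can be checked independently of any tensor library. The sketch below is a plain-Rust kick-drift-kick leapfrog for unit-mass harmonic oscillators with the same initial conditions, step size, and step count as `test_symplectic_energy_conservation`. It assumes `harmonic_potential` is V(q) = q^2 / 2 per component, which this document does not spell out; `harmonic_energy` and `leapfrog` are hypothetical names and not the candle-hamiltonian API.

```rust
/// Total energy H = sum(p^2 / 2 + q^2 / 2) for unit-mass harmonic oscillators.
fn harmonic_energy(q: &[f64], p: &[f64]) -> f64 {
    q.iter().zip(p).map(|(qi, pi)| 0.5 * pi * pi + 0.5 * qi * qi).sum()
}

/// Kick-drift-kick leapfrog for V(q) = q^2 / 2, whose gradient is q.
/// Components are independent, so each one is advanced in place.
fn leapfrog(q: &mut [f64], p: &mut [f64], dt: f64, steps: usize) {
    for _ in 0..steps {
        for i in 0..q.len() {
            p[i] -= 0.5 * dt * q[i]; // half kick with the current position
            q[i] += dt * p[i];       // full drift with the half-step momentum
            p[i] -= 0.5 * dt * q[i]; // half kick with the updated position
        }
    }
}
```

Leapfrog energy error is bounded and oscillates at order dt^2 rather than vanishing. For this particular pair of initial conditions the leading error terms of the two components cancel, which is why a 1e-6 tolerance is reachable at dt = 0.01.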

References

  1. HIP-0002: Hamiltonian Large Language Models
  2. HIP-0003: Jin Multimodal AI Architecture
  3. HIP-0004: LLM Gateway
  4. HIP-0007: Active Inference Integration
  5. HIP-0008: HMM Hanzo Market Maker
  6. HIP-0010: Model Context Protocol
  7. HIP-0020: Blockchain Node Standard
  8. HIP-0032: Object Storage Standard
  9. HIP-0035: Image & Video Generation Standard
  10. HIP-0039: Zen Model Architecture
  11. HuggingFace Candle -- upstream repository
  12. safetensors format -- model serialization standard
  13. GGUF format -- quantized model format
  14. Leapfrog integration -- symplectic integrator theory

Copyright

Copyright and related rights waived via CC0.