HIP-67: Federated Learning Standard

Abstract

This proposal defines the Federated Learning standard for privacy-preserving distributed model training in the Hanzo ecosystem. Hanzo FL enables multiple participants --- enterprises, hospitals, mobile devices, edge nodes --- to collaboratively train machine learning models without centralizing their raw data. Each participant trains locally on its own data and shares only model updates (gradients or weight deltas) with a central aggregation server that combines them into a global model.

The system supports two deployment modes: cross-silo federated learning for enterprise and institutional participants (tens to hundreds of reliable nodes with large datasets), and cross-device federated learning for mobile and IoT participants (millions of unreliable nodes with small datasets). It implements multiple aggregation strategies (FedAvg, FedSGD, FedProx, SCAFFOLD), differential privacy via per-client gradient clipping and calibrated noise injection (DP-SGD), secure aggregation using multi-party computation so the server never observes individual gradients, and an optional homomorphic encryption pathway for gradient-level confidentiality during aggregation.

Hanzo FL integrates with the ML Pipeline (HIP-0057) for training orchestration, Zero Trust (HIP-0054) for authenticated participant enrollment, Decentralized Identity (HIP-0048) for pseudonymous participant identity, and the Hanzo L1 chain (HIP-0024) for on-chain training round attestation and provenance.

Repository: github.com/hanzoai/federated
Port: 8067 (API)
Binary: hanzo-fl
Container: hanzoai/fl:latest

Motivation

Data Cannot Always Move

The default approach to training a machine learning model is simple: collect all the data in one place and train on it. This works when the data is already centralized or when it can be freely copied. In practice, much of the world's most valuable training data cannot be moved:

Healthcare data is legally immovable. HIPAA in the United States, GDPR Article 9 in the EU, and PIPEDA in Canada impose strict restrictions on transferring patient health records. A hospital cannot upload its radiology images to a cloud training cluster without extensive legal agreements, de-identification pipelines, and audit trails --- and even then, the de-identification may be insufficient (research has shown that anonymized medical images can be re-identified through anatomical features). Federated learning lets the model come to the data, not the data to the model.
Financial data is regulated across jurisdictions. A multinational bank with operations in the US, EU, and Singapore faces three different data localization regimes. Customer transaction data generated in Singapore must, under MAS regulations, remain in Singapore. Transferring it to a US training cluster violates local law. Federated learning allows each regional branch to train locally and contribute gradient updates that contain no individual transaction records.
Enterprise data is competitively sensitive. When Hanzo approaches a Fortune 500 customer and proposes fine-tuning a Zen model on their internal documents, the first question is: "Does our data leave our network?" If the answer is yes, the deal is dead. Enterprise customers will not send their proprietary data to a third-party cloud, no matter how many NDAs are signed. Federated learning answers this question definitively: the raw data never leaves the customer's infrastructure.
Edge data is too voluminous to centralize. A fleet of 10 million mobile devices each generating 100 MB of behavioral data per day produces 1 PB/day. Uploading this to a central cluster is impractical --- the bandwidth cost alone is prohibitive, and the data is stale by the time it arrives. Federated learning trains on the data where it lives, using each device's idle compute.

Why Not Anonymize and Centralize

The natural objection is: "Just anonymize the data and then centralize it." This fails for three reasons:

Re-identification attacks are practical. Research consistently demonstrates that anonymized datasets can be re-identified by linking them with auxiliary information. Netflix's anonymized movie ratings were de-anonymized using public IMDb reviews. Anonymized location data from mobile carriers was re-identified using publicly available social media check-ins. Medical imaging data carries anatomical signatures that can survive standard anonymization. The "anonymize then centralize" approach provides a false sense of privacy.
Regulations prohibit it outright. GDPR's Article 17 (right to erasure) and Article 20 (right to portability) apply to pseudonymized data, not just identified data. If there is any reasonable possibility of re-identification, the data is still personal data under GDPR. Healthcare regulators increasingly treat anonymization as insufficient --- the UK's NHS, for example, requires federated approaches for multi-institution research collaborations.
Anonymization degrades data quality. Effective anonymization (differential privacy applied to the dataset, k-anonymity, l-diversity) distorts the data. The stronger the privacy guarantee, the more distortion is introduced, and the less useful the data becomes for training. Federated learning avoids this tradeoff: the raw data retains full fidelity for local training, and privacy is enforced at the gradient level, where the distortion has far less impact on model quality.

Why Hanzo Needs This

Hanzo's growth strategy depends on enterprise adoption. Enterprise customers want to fine-tune Zen models on their proprietary data. Without federated learning, this requires one of two unsatisfactory approaches:

Customer uploads data to Hanzo Cloud. Most enterprises refuse this. The security review alone takes months, and many industries (healthcare, finance, defense) cannot approve it at all.
Hanzo deploys training infrastructure inside the customer's network. This is expensive, operationally complex, and does not scale. Each customer deployment requires custom infrastructure, and the resulting models are isolated --- there is no way to benefit from collaborative training across customers.

Federated learning provides a third path: Hanzo operates the aggregation server and the global model; each customer runs a lightweight FL client inside their network. The customer's data never leaves their infrastructure. The global model improves from every customer's contribution without any single customer's data being exposed. This is the only approach that scales enterprise AI adoption while respecting data sovereignty.

Design Philosophy

Why Federated Averaging as the Default Strategy

Federated Averaging (FedAvg), introduced by McMahan et al. (2017), is the workhorse of federated learning. Each client runs several local epochs of training on its own data, then sends the resulting weight delta to the server, which averages the deltas weighted by dataset size. The alternative, Federated SGD (FedSGD), has each client compute the gradient for a single batch and send it to the server.

We default to FedAvg because:

Communication efficiency. FedSGD requires one server round-trip per gradient step. For a model with 7 billion parameters, each gradient is ~28 GB in FP32. Sending this every step is impractical even on fast networks. FedAvg reduces communication by 10-100x: the client trains for E local epochs, then sends one weight delta per round.
Client compute utilization. Cross-silo participants (enterprise GPU clusters) have significant local compute. FedSGD wastes it by limiting each client to a single gradient computation per round. FedAvg lets each client fully utilize its GPUs during local training.
Tolerance to heterogeneity. Real federated settings have heterogeneous data distributions (non-IID data). FedAvg, despite its simplicity, performs well on moderately heterogeneous data. For severely non-IID settings, we provide FedProx and SCAFFOLD as alternatives (described in the specification).

The tradeoff: FedAvg can diverge when data heterogeneity is extreme (each client has data from only one or two classes). FedProx addresses this by adding a proximal term that penalizes local model drift. SCAFFOLD goes further by using control variates to correct client drift. The system defaults to FedAvg but allows the operator to switch strategies per training job.
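
As a rough illustration of the server-side averaging step, the sketch below computes a dataset-size-weighted average of client weight deltas and applies it to the global weights. It uses plain Python lists in place of tensors, and the names `fedavg` and `apply_update` are illustrative, not part of the hanzo-fl API.

```python
from typing import List, Sequence


def fedavg(deltas: Sequence[Sequence[float]], dataset_sizes: Sequence[int]) -> List[float]:
    """Average client weight deltas, weighting each client by its dataset size."""
    if not deltas or len(deltas) != len(dataset_sizes):
        raise ValueError("need exactly one dataset size per client delta")
    total = sum(dataset_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    dim = len(deltas[0])
    aggregate = [0.0] * dim
    for delta, n_k in zip(deltas, dataset_sizes):
        if len(delta) != dim:
            raise ValueError("all deltas must have the same length")
        weight = n_k / total
        for i, value in enumerate(delta):
            aggregate[i] += weight * value
    return aggregate


def apply_update(global_weights: Sequence[float], aggregate: Sequence[float]) -> List[float]:
    """global_weights += aggregate delta."""
    return [w + d for w, d in zip(global_weights, aggregate)]


# Two clients: one with 300 local samples, one with 100.
aggregate = fedavg([[1.0, 0.0], [0.0, 4.0]], [300, 100])
new_weights = apply_update([10.0, 10.0], aggregate)
```

Because the first client holds three quarters of the samples, its delta receives three quarters of the weight in the average.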

Cross-Device vs Cross-Silo: Two Different Problems

Federated learning is often discussed as a single paradigm, but the engineering requirements differ dramatically between cross-device and cross-silo deployments:

Dimension	Cross-Silo	Cross-Device
Participants	2-500 organizations	10,000-10,000,000 devices
Availability	Always on, reliable	Intermittent, unreliable
Compute per client	Multi-GPU servers	Single mobile CPU/GPU
Data per client	Millions of records	Hundreds of records
Network	Datacenter links (1-10 Gbps)	Mobile networks (1-100 Mbps)
Identity	Named, authenticated	Pseudonymous, enrolled
Scheduling	All clients in every round	Random subset per round
Failure mode	Rare, handled by retry	Frequent, handled by redundancy

The system implements both modes with shared core infrastructure (aggregation logic, privacy mechanisms, model management) but different scheduling, communication, and fault-tolerance strategies.

Why Differential Privacy at the Gradient Level

Federated learning protects raw data by keeping it on-device, but gradient updates can still leak information. Gradient inversion attacks (Zhu et al., 2019) reconstruct training samples from observed gradients with surprising fidelity. For a model fine-tuned on text, an attacker observing gradients can recover individual training sentences.

Differential privacy (DP) provides a mathematically rigorous defense. The mechanism is DP-SGD adapted for the federated setting:

Per-client gradient clipping. Before a client sends its update, it clips the update's L2 norm to a maximum value C. This bounds the influence of any single client, and therefore of any single data point that client holds, on the aggregate.
Calibrated noise injection. The server adds Gaussian noise with standard deviation proportional to C/epsilon to the aggregated gradient, where epsilon is the privacy budget.
Privacy accounting. The system tracks cumulative privacy spend across training rounds using the Renyi Differential Privacy accountant, providing a tight bound on total privacy loss (epsilon, delta).

The result: even an adversary who observes every aggregated update cannot determine whether a specific data point was included in any client's training set with more confidence than the (epsilon, delta) privacy budget allows.
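
A minimal sketch of the clipping and noise steps, assuming updates are flat lists of floats. The noise scale uses the classic Gaussian-mechanism calibration, sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, whose textbook analysis covers epsilon below 1. The function names are illustrative, and privacy accounting across rounds is not shown.

```python
import math
import random
from typing import List, Sequence


def l2_norm(vector: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


def clip_update(delta: Sequence[float], max_norm: float) -> List[float]:
    """Scale the update down so its L2 norm is at most max_norm (the bound C)."""
    norm = l2_norm(delta)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in delta]


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Classic Gaussian-mechanism noise standard deviation."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("epsilon must be positive and delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def add_noise(aggregate: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent Gaussian noise to every coordinate of the aggregate."""
    return [x + rng.gauss(0.0, sigma) for x in aggregate]


clipped = clip_update([3.0, 4.0], max_norm=1.0)          # norm 5.0 is scaled down to 1.0
sigma = gaussian_sigma(sensitivity=1.0, epsilon=1.0, delta=1e-5)
noisy = add_noise(clipped, sigma, random.Random(0))
```

Updates already inside the bound pass through unchanged, so clipping only affects clients whose updates are unusually large.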

Why Secure Aggregation via MPC

Differential privacy protects against inference from the aggregated result. Secure aggregation protects individual updates during aggregation --- it ensures the server never sees any single client's gradient in the clear.

The protocol uses additive secret sharing (a form of multi-party computation):

Each client splits its gradient update into N random shares that sum to the original.
Each share is encrypted and sent to a different subset of peers (or to the server via different channels).
The server (or a set of non-colluding aggregation nodes) sums all shares, recovering only the aggregate gradient, never individual contributions.
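
The three steps above can be sketched for a single scalar coordinate as follows. This is an illustration under stated assumptions: values are fixed-point encoded into integers modulo a large prime, and the constants `MODULUS` and `SCALE` and all function names are hypothetical choices, not values taken from the protocol. Encryption of shares in transit is omitted.

```python
import secrets
from typing import List

MODULUS = 2**61 - 1   # illustrative working modulus for shares
SCALE = 10**6         # illustrative fixed-point scale for encoding floats


def encode(x: float) -> int:
    """Fixed-point encode a float as a residue modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(v: int) -> float:
    """Map a residue back to a signed float."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def split(value: int, n: int) -> List[int]:
    """Split value into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def aggregate(shares_per_client: List[List[int]]) -> int:
    """Aggregator j sums only the j-th share of every client; partial sums are then combined."""
    n = len(shares_per_client[0])
    partial_sums = [sum(client[j] for client in shares_per_client) % MODULUS for j in range(n)]
    return sum(partial_sums) % MODULUS


updates = [0.5, -1.25, 2.0]                               # one coordinate from three clients
shares_per_client = [split(encode(u), 3) for u in updates]
total = decode(aggregate(shares_per_client))
```

Each aggregator sees one uniformly random share per client, so no single aggregator learns an individual update; only the combined partial sums reveal the total.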

This is critical for cross-silo deployments where participants are competitors. A hospital contributing to a collaborative cancer detection model does not want the aggregation server (or other hospitals) to learn anything about its individual patient population from its gradient updates. Secure aggregation provides this guarantee cryptographically, not just by policy.

The tradeoff is communication overhead: secure aggregation increases the data transmitted per round by a factor proportional to the number of participants. For cross-silo (tens of participants), this is acceptable. For cross-device (millions), we use a more efficient protocol that tolerates client dropouts.

Specification

Architecture Overview

                    ┌──────────────────────────────────────────────┐
                    │        Hanzo FL Aggregation Server (8067)     │
                    │                                              │
                    │  ┌───────────┐ ┌───────────┐ ┌────────────┐ │
                    │  │ Round     │ │ Aggregation│ │ Privacy    │ │
                    │  │ Scheduler │ │ Engine     │ │ Accountant │ │
                    │  └─────┬─────┘ └─────┬─────┘ └──────┬─────┘ │
                    │        │             │              │       │
                    │  ┌─────┴─────────────┴──────────────┴────┐  │
                    │  │          Secure Aggregation Layer       │  │
                    │  └─────┬──────────┬──────────┬───────────┘  │
                    └────────┼──────────┼──────────┼──────────────┘
                             │          │          │
                    ┌────────┴──┐ ┌─────┴────┐ ┌──┴──────────┐
                    │  Client A │ │ Client B │ │  Client C   │
                    │  (Hosp 1) │ │ (Hosp 2) │ │  (Bank)     │
                    │  Local    │ │ Local    │ │  Local      │
                    │  Training │ │ Training │ │  Training   │
                    │  ┌──────┐ │ │ ┌──────┐ │ │  ┌──────┐  │
                    │  │ Data │ │ │ │ Data │ │ │  │ Data │  │
                    │  └──────┘ │ │ └──────┘ │ │  └──────┘  │
                    └───────────┘ └──────────┘ └────────────┘
                     Data stays    Data stays    Data stays
                     on-premise    on-premise    on-premise

The Aggregation Server (port 8067) is the control plane. It coordinates training rounds, selects clients, distributes the global model, collects encrypted gradient updates, performs secure aggregation, applies differential privacy, and publishes round attestations to the Hanzo L1 chain.

FL Clients run inside each participant's network. They download the global model at the start of each round, perform local training on private data, clip and encrypt gradient updates, and send them to the aggregation server. The client is a lightweight binary (hanzo-fl-client) that wraps PyTorch training loops.

Participant Enrollment

Before participating in federated training, each client must enroll through a two-phase process that integrates Zero Trust (HIP-0054) and Decentralized Identity (HIP-0048):

Participant                    FL Server (8067)           IAM / DID
     │                              │                        │
     │  1. POST /v1/enroll          │                        │
     │     { did, credentials }     │                        │
     ├─────────────────────────────►│                        │
     │                              │  2. Resolve DID        │
     │                              ├───────────────────────►│
     │                              │  3. Verify VCs         │
     │                              │◄───────────────────────┤
     │                              │                        │
     │                              │  4. Check ZT policy    │
     │                              │     (mTLS identity)    │
     │                              │                        │
     │  5. { participant_id,        │                        │
     │       client_cert,           │                        │
     │       training_config }      │                        │
     │◄─────────────────────────────┤                        │
     │                              │                        │
     │  6. POST /v1/heartbeat       │                        │
     │     (periodic)               │                        │
     ├─────────────────────────────►│                        │

DID-based identity (HIP-0048) provides pseudonymous participation. A hospital enrolls with did:hanzo:hospital-xyz, not with its legal name. The DID document attests to the participant's capabilities (GPU count, dataset size estimate, network bandwidth) via Verifiable Credentials. The FL server verifies these credentials without learning the participant's real identity.

Zero Trust authentication (HIP-0054) ensures that every connection from every client is mTLS-authenticated with SPIFFE identities. The FL server's access policy explicitly enumerates which SPIFFE IDs may participate in each training job.

Training Round Protocol

A single federated training round proceeds as follows:

Round Protocol:
  1_server_broadcasts_global_model:
    action: Server sends current global model weights to selected clients
    format: safetensors (compressed with zstd)
    optimization: Delta compression after round 1 (send only changed weights)

  2_client_performs_local_training:
    action: Each client trains on local data for E local epochs
    output: Weight delta = local_weights - global_weights
    clipping: Clip delta L2 norm to C (per-client gradient clipping)
    encryption: Encrypt clipped delta using secure aggregation shares

  3_client_sends_encrypted_update:
    action: Client sends encrypted weight delta + metadata to server
    metadata:
      dataset_size: integer     # Number of local training samples
      local_loss: float         # Final local training loss (optional, DP noise added)
      compute_time_s: float     # Wall-clock training time
      round_id: string          # Round identifier

  4_server_performs_secure_aggregation:
    action: Server reconstructs aggregate gradient from encrypted shares
    result: Weighted average of all client deltas (weighted by dataset_size)
    guarantee: Server never observes any individual client's delta

  5_server_applies_differential_privacy:
    action: Add calibrated Gaussian noise to aggregate gradient
    noise_scale: sensitivity * sqrt(2 * ln(1.25/delta)) / epsilon
    accounting: Update cumulative (epsilon, delta) via RDP accountant

  6_server_updates_global_model:
    action: global_weights += noisy_aggregate_delta
    checkpoint: Save global model to Object Storage (HIP-0032)

  7_server_publishes_round_attestation:
    action: Publish round metadata to Hanzo L1 (HIP-0024)
    attestation:
      round_id: string
      num_participants: integer
      aggregate_loss: float (DP noise added)
      privacy_budget_spent: { epsilon, delta }
      model_checkpoint_hash: sha256
      timestamp: ISO 8601

Aggregation Strategies

The system implements four aggregation strategies, selectable per training job:

Aggregation Strategies:

  fedavg:
    description: >
      Federated Averaging. Each client trains for E local epochs,
      sends weight delta. Server computes weighted average by dataset size.
    when_to_use: Default for most tasks. Works well with moderate data heterogeneity.
    parameters:
      local_epochs: 5           # Number of local training epochs per round
      learning_rate: 0.01       # Client-side learning rate
    reference: McMahan et al., 2017

  fedsgd:
    description: >
      Federated SGD. Each client computes the gradient for a single batch and sends it.
      Server averages gradients and applies a single update.
    when_to_use: When data is highly heterogeneous and local drift is unacceptable.
    parameters:
      batch_size: 64
      learning_rate: 0.1
    reference: McMahan et al., 2017

  fedprox:
    description: >
      FedAvg with a proximal term that penalizes local model divergence
      from the global model. Addresses heterogeneity by preventing clients
      from drifting too far during local training.
    when_to_use: Non-IID data where FedAvg diverges.
    parameters:
      local_epochs: 5
      learning_rate: 0.01
      mu: 0.1                  # Proximal term weight (higher = more regularization)
    reference: Li et al., 2020

  scaffold:
    description: >
      Uses control variates to correct client drift. Each client maintains
      a correction term that estimates the difference between its local
      gradient and the global gradient. More communication-efficient
      convergence than FedProx on severely non-IID data.
    when_to_use: Severely heterogeneous data, when convergence speed matters.
    parameters:
      local_epochs: 5
      learning_rate: 0.01
      # No mu: SCAFFOLD's correction is parameter-free once initialized
    reference: Karimireddy et al., 2020
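
To make the role of `mu` concrete, the sketch below applies FedProx-style local steps to a toy quadratic loss. The helper names, the toy loss, and the learning rate of 0.1 (larger than the configured default so the toy converges within a few hundred steps) are assumptions for illustration only.

```python
from typing import List, Sequence


def fedprox_step(weights: Sequence[float], global_weights: Sequence[float],
                 grads: Sequence[float], lr: float, mu: float) -> List[float]:
    """One SGD step on local_loss(w) + (mu / 2) * ||w - w_global||^2."""
    return [w - lr * (g + mu * (w - wg))
            for w, wg, g in zip(weights, global_weights, grads)]


def local_train(global_weights: Sequence[float], target: Sequence[float],
                steps: int, lr: float, mu: float) -> List[float]:
    """Toy local loss 0.5 * ||w - target||^2, whose gradient is (w - target)."""
    weights = list(global_weights)
    for _ in range(steps):
        grads = [w - t for w, t in zip(weights, target)]
        weights = fedprox_step(weights, global_weights, grads, lr, mu)
    return weights


# mu = 0 reduces to plain local SGD (FedAvg's local update).
plain = local_train([0.0], [10.0], steps=200, lr=0.1, mu=0.0)
# mu > 0 keeps the local model closer to the global weights.
prox = local_train([0.0], [10.0], steps=200, lr=0.1, mu=0.1)
```

With the proximal term, the local optimum shifts from the client's own target toward the global model, which is how FedProx limits client drift.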

Differential Privacy Configuration

Differential Privacy:
  mechanism: dp_sgd_federated

  per_client_clipping:
    max_grad_norm: 1.0         # L2 norm bound C for each client's update
    clip_strategy: flat        # flat | per_layer | adaptive

  noise_injection:
    noise_multiplier: 1.1      # sigma = noise_multiplier * C / num_clients
    noise_type: gaussian       # Gaussian mechanism for (epsilon, delta)-DP

  privacy_accounting:
    accountant: rdp            # Renyi DP accountant (tight composition bounds)
    target_epsilon: 8.0        # Total privacy budget
    target_delta: 1e-5         # Failure probability (should be < 1/dataset_size)
    auto_stop: true            # Stop training when budget is exhausted

  subsampling:
    client_sampling_rate: 0.1  # Fraction of clients selected per round
    amplification: true        # Apply privacy amplification by subsampling

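Privacy amplification by subsampling. When only a random fraction q of clients participates in each round, the effective privacy cost per round is reduced. A standard bound is that a mechanism with per-round cost epsilon costs at most ln(1 + q(e^epsilon - 1)) after subsampling, which is roughly q * epsilon when epsilon is small. For q=0.1 (10% of clients per round), the per-round cost therefore drops by close to 10x for small epsilon; how much noise this saves over a whole job depends on how many rounds are composed. The privacy accountant tracks this automatically.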
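
One commonly cited amplification-by-subsampling bound can be evaluated directly. The sketch below is an illustration of that single-round bound only; the RDP accountant named in the configuration performs a tighter, composed analysis that is not reproduced here, and the function name is hypothetical.

```python
import math


def amplified_epsilon(epsilon: float, q: float) -> float:
    """Per-round epsilon after sampling a fraction q of clients: ln(1 + q * (e^eps - 1))."""
    if not 0.0 < q <= 1.0:
        raise ValueError("sampling rate q must lie in (0, 1]")
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return math.log1p(q * math.expm1(epsilon))


# Compare the unamplified and amplified per-round cost at q = 0.1.
for eps in (0.1, 0.5, 1.0):
    print(f"epsilon={eps}  amplified={amplified_epsilon(eps, 0.1):.4f}")
```

For small epsilon the amplified value is close to q * epsilon, and with q = 1 (every client in every round) there is no amplification at all.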

Secure Aggregation Protocol

The secure aggregation protocol ensures the server learns only the sum of client updates, not individual contributions.

Secure Aggregation (Cross-Silo):
  protocol: additive_secret_sharing

  setup_phase:
    - Each pair of clients (i, j) establishes a shared secret s_ij via DH key agreement
    - Client i computes mask_i = sum_{j > i}(PRG(s_ij)) - sum_{j < i}(PRG(s_ij)), where s_ij = s_ji
    - Note: masks cancel out in the aggregate (sum of all mask_i = 0)

  upload_phase:
    - Client i sends: masked_update_i = update_i + mask_i
    - Server receives masked updates (individual updates are hidden by masks)

  aggregation_phase:
    - Server computes: sum(masked_update_i) = sum(update_i) + sum(mask_i)
    - Since sum(mask_i) = 0: server recovers sum(update_i) exactly
    - Individual update_i values are computationally hidden (security rests on the PRG and the DH key agreement)

  dropout_handling:
    - If client j drops out, clients that share secrets with j reveal s_ij
    - Server recomputes the PRG(s_ij) terms that survivors applied for j and removes them from the aggregate
    - Threshold: aggregation succeeds if >= 2/3 of selected clients complete

Secure Aggregation (Cross-Device):
  protocol: bonawitz_2017
  description: >
    Optimized for millions of clients with high dropout rates.
    Uses a two-round protocol with pairwise masking and
    threshold secret sharing for dropout recovery.
  dropout_tolerance: 50%       # Aggregation succeeds with 50% dropout
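
The cross-silo masking scheme can be simulated in a few lines. In this sketch the pairwise DH key agreement is replaced by random seeds, and `random.Random` stands in for a cryptographic PRG, so it demonstrates only the arithmetic (masks cancel, dropped clients can be unmasked) and offers no security. Updates are assumed to be fixed-point integers modulo 2^32, with the lower-numbered client of each pair adding the mask and the higher-numbered one subtracting it; all names are illustrative.

```python
import random
import secrets
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

MODULUS = 2**32
PairSeeds = Dict[Tuple[int, int], int]   # keys are (i, j) with i < j


def prg(seed: int, length: int) -> List[int]:
    """Stand-in PRG expanding a shared seed into a mask vector (NOT cryptographic)."""
    rng = random.Random(seed)
    return [rng.randrange(MODULUS) for _ in range(length)]


def mask_for(client: int, pair_seeds: PairSeeds, length: int) -> List[int]:
    """mask_i = sum over j > i of PRG(s_ij) minus sum over j < i of PRG(s_ij)."""
    mask = [0] * length
    for (i, j), seed in pair_seeds.items():
        if client not in (i, j):
            continue
        sign = 1 if client == i else -1
        mask = [(m + sign * x) % MODULUS for m, x in zip(mask, prg(seed, length))]
    return mask


def vector_sum(vectors: Iterable[List[int]], length: int) -> List[int]:
    total = [0] * length
    for v in vectors:
        total = [(t + x) % MODULUS for t, x in zip(total, v)]
    return total


def remove_dropped(partial: List[int], survivors: Set[int], dropped: Set[int],
                   pair_seeds: PairSeeds, length: int) -> List[int]:
    """Cancel mask terms that survivors applied for dropped peers, using revealed seeds."""
    result = list(partial)
    for (i, j), seed in pair_seeds.items():
        if i in survivors and j in dropped:
            sign = 1           # survivor i added PRG(s_ij)
        elif j in survivors and i in dropped:
            sign = -1          # survivor j subtracted PRG(s_ij)
        else:
            continue
        result = [(r - sign * x) % MODULUS for r, x in zip(result, prg(seed, length))]
    return result


updates = {0: [5, 1, 7], 1: [2, 2, 2], 2: [9, 0, 4]}
length = 3
pair_seeds = {pair: secrets.randbits(128) for pair in combinations(sorted(updates), 2)}
masked = {c: vector_sum([u, mask_for(c, pair_seeds, length)], length) for c, u in updates.items()}

total = vector_sum(masked.values(), length)                       # all three clients report
partial = vector_sum([masked[0], masked[1]], length)              # client 2 dropped out
recovered = remove_dropped(partial, {0, 1}, {2}, pair_seeds, length)
```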

Homomorphic Encryption Option

For participants requiring cryptographic confidentiality beyond secure aggregation, the system supports Partially Homomorphic Encryption (PHE) using the Paillier cryptosystem:

Homomorphic Encryption:
  scheme: paillier
  key_size: 2048              # Paillier key size in bits
  enabled: false              # Opt-in per training job

  protocol:
    - Server generates Paillier keypair (pk, sk)
    - Server distributes pk to all clients
    - Client encrypts gradient update: Enc(pk, update_i)
    - Server computes: Enc(pk, sum(update_i)) = product(Enc(pk, update_i))
      # Paillier is additively homomorphic
    - Server decrypts aggregate: Dec(sk, Enc(pk, sum(update_i))) = sum(update_i)

  tradeoffs:
    - Ciphertext expansion: ~128x (32-bit FP32 value -> 4096-bit ciphertext; with a 2048-bit key, ciphertexts are taken modulo n^2)
    - Computation overhead: ~100x slower than plaintext aggregation
    - Recommended only for small models or when regulatory requirements mandate it

The tradeoff is severe: unless several values are packed into one plaintext, homomorphic encryption introduces roughly 128x communication overhead and 100x computation overhead. For most deployments, secure aggregation via MPC (which has no ciphertext expansion) is sufficient. HE is provided for regulated environments where even masked gradients are considered unacceptable.
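
The additive property the protocol relies on can be demonstrated with a toy Paillier implementation. The sketch below uses tiny hard-coded primes and integer plaintexts, so it is insecure and purely illustrative; a real deployment would use a vetted library, a 2048-bit modulus, and a fixed-point encoding for gradient values. All function names are hypothetical.

```python
import math
import secrets
from typing import Tuple


def keygen(p: int, q: int) -> Tuple[int, Tuple[int, int]]:
    """Return (public modulus n, secret key (lam, mu)) for primes p and q, using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                                 # inverse of lam modulo n (Python 3.8+)
    return n, (lam, mu)


def encrypt(n: int, message: int) -> int:
    """Enc(m) = (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, secret_key: Tuple[int, int], ciphertext: int) -> int:
    """Dec(c) = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    lam, mu = secret_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


n, sk = keygen(1009, 1013)                      # toy primes, far too small for real use
client_updates = [1200, 345, 55]                # already fixed-point encoded integers
encrypted_sum = encrypt(n, 0)
for update in client_updates:
    encrypted_sum = add_encrypted(n, encrypted_sum, encrypt(n, update))
aggregate = decrypt(n, sk, encrypted_sum)
```

The aggregator only ever multiplies ciphertexts; the plaintext sum appears solely when the holder of the secret key decrypts the final product.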

Client Selection and Scheduling

Client Selection:

  cross_silo:
    strategy: all              # All enrolled clients participate every round
    min_clients_per_round: 2   # Minimum to proceed (otherwise round is skipped)
    timeout: 600s              # Max wait for client updates before proceeding

  cross_device:
    strategy: random_sample
    clients_per_round: 100     # Select 100 clients from pool
    eligibility_criteria:
      min_battery: 80%         # Device must have >= 80% battery
      charging: preferred      # Prefer devices on charger
      network: wifi_only       # Only select devices on WiFi
      idle: true               # Device must be idle (screen off, no active use)
    oversampling: 1.3          # Select 130 (30% above the 100 target) to absorb expected dropouts
    scheduling_window:
      start: "02:00"           # Schedule rounds during off-peak hours (local time)
      end: "06:00"

Heterogeneous Device Handling:
  stragglers:
    strategy: deadline         # Ignore updates arriving after deadline
    deadline: 2x_median        # Deadline = 2x the median client completion time
    min_updates: 0.7           # Require at least 70% of selected clients
  compute_scaling:
    strategy: adaptive_epochs  # Faster clients do more local epochs
    min_epochs: 1
    max_epochs: 10
    target_time: 120s          # Each client trains for ~120s regardless of speed

Model Poisoning Detection

Byzantine-robust aggregation detects and mitigates malicious or faulty client updates:

Byzantine Robustness:

  detection_methods:
    norm_check:
      description: Reject updates with L2 norm > threshold * median_norm
      threshold: 3.0           # 3x the median update norm

    cosine_similarity:
      description: Reject updates with low cosine similarity to the aggregate
      min_similarity: 0.1      # Updates nearly orthogonal or opposed to the aggregate are suspicious

    krum:
      description: >
        Multi-Krum selection. For each update, compute sum of distances
        to its n-f-2 nearest neighbors (f = number of Byzantine clients).
        Select the m updates with smallest scores.
      f: 1                     # Assumed number of Byzantine clients
      m: null                  # Default: num_clients - f (select all but worst)
      reference: Blanchard et al., 2017

    trimmed_mean:
      description: >
        For each model parameter, sort the values across clients,
        trim the top and bottom beta fraction, and average the rest.
      beta: 0.1                # Trim 10% from each tail
      reference: Yin et al., 2018

  response:
    action: exclude_and_flag   # Exclude suspicious update, flag participant
    consecutive_flags: 3       # After 3 flags, quarantine participant
    quarantine_duration: 24h   # Participant excluded for 24 hours
    alert: true                # Notify FL operator of quarantine event
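
Two of the detection methods above are simple enough to sketch directly: the coordinate-wise trimmed mean and the median-based norm check. The code assumes the aggregator can inspect individual updates as flat lists of floats; the function names are illustrative.

```python
import math
import statistics
from typing import List, Sequence


def trimmed_mean(updates: Sequence[Sequence[float]], beta: float) -> List[float]:
    """Per coordinate: sort across clients, drop the top and bottom beta fraction, average the rest."""
    n = len(updates)
    k = int(beta * n)                      # number of values trimmed from each tail
    if n == 0 or 2 * k >= n:
        raise ValueError("beta trims away every update")
    dim = len(updates[0])
    result = []
    for i in range(dim):
        column = sorted(u[i] for u in updates)
        kept = column[k:n - k]
        result.append(sum(kept) / len(kept))
    return result


def norm_check(updates: Sequence[Sequence[float]], threshold: float = 3.0) -> List[bool]:
    """Return True for updates whose L2 norm is at most threshold x the median norm."""
    norms = [math.sqrt(sum(x * x for x in u)) for u in updates]
    limit = threshold * statistics.median(norms)
    return [norm <= limit for norm in norms]


# Nine honest clients and one client submitting an extreme update.
updates = [[1.0, 1.0] for _ in range(9)] + [[100.0, -100.0]]
robust = trimmed_mean(updates, beta=0.1)
accepted = norm_check(updates, threshold=3.0)
```

With `beta: 0.1` and ten clients, one value is trimmed from each tail of every coordinate, which is enough to discard the single outlier here.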

Convergence Monitoring and Early Stopping

Convergence Monitoring:
  metrics:
    global_loss:
      description: Loss computed on a held-out validation set at the server
      frequency: every_round
      window: 10               # Smoothed over last 10 rounds

    client_loss_variance:
      description: Variance of local losses across clients (DP noise added)
      frequency: every_round
      alert_threshold: 0.5     # High variance indicates heterogeneity issues

    gradient_norm:
      description: L2 norm of the aggregated gradient
      frequency: every_round
      convergence_signal: <0.01 # Near-zero gradient norm signals convergence

  early_stopping:
    enabled: true
    patience: 20               # Stop if no improvement for 20 rounds
    min_delta: 0.001           # Minimum loss improvement to count as progress
    min_rounds: 50             # Train for at least 50 rounds regardless
    max_rounds: 1000           # Hard stop after 1000 rounds

  checkpointing:
    frequency: every_10_rounds
    storage: "s3://hanzo-fl/checkpoints/{job_id}/"
    keep_best: 3               # Retain top 3 checkpoints by validation loss
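
The early-stopping rules can be expressed as a small state machine. This sketch assumes the loss passed in each round is the (already smoothed) global validation loss; the class and method names are hypothetical, and the defaults mirror the configuration above.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    patience: int = 20          # rounds without improvement before stopping
    min_delta: float = 0.001    # minimum loss decrease that counts as progress
    min_rounds: int = 50        # never stop on patience before this many rounds
    max_rounds: int = 1000      # hard stop
    best_loss: float = float("inf")
    rounds_seen: int = 0
    rounds_since_improvement: int = 0

    def should_stop(self, loss: float) -> bool:
        """Record one round's loss and report whether training should stop."""
        self.rounds_seen += 1
        if self.best_loss - loss >= self.min_delta:
            self.best_loss = loss
            self.rounds_since_improvement = 0
        else:
            self.rounds_since_improvement += 1
        if self.rounds_seen >= self.max_rounds:
            return True
        if self.rounds_seen < self.min_rounds:
            return False
        return self.rounds_since_improvement >= self.patience


# Small limits so the behavior is visible in a handful of rounds.
stopper = EarlyStopper(patience=3, min_delta=0.001, min_rounds=5, max_rounds=100)
decisions = [stopper.should_stop(loss) for loss in [1.0, 0.9, 0.8, 0.8, 0.8, 0.8]]
```

The loss plateaus from the fourth round, so the stopper fires on the sixth round, once three rounds have passed without improvement and the minimum round count is met.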

On-Chain Training Round Attestation

Each completed training round publishes an attestation to the Hanzo L1 chain (HIP-0024, chain ID 36963), creating an immutable record of the training process:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.24;

interface IFederatedTrainingRegistry {
    event RoundCompleted(
        bytes32 indexed jobId,
        uint256 indexed roundNumber,
        uint256 numParticipants,
        bytes32 modelCheckpointHash,
        uint256 cumulativeEpsilon,   // Privacy budget spent (scaled by 1e6)
        uint256 timestamp
    );

    event JobCompleted(
        bytes32 indexed jobId,
        bytes32 finalModelHash,
        uint256 totalRounds,
        uint256 totalParticipants,
        uint256 finalEpsilon
    );

    /// @notice Record a completed training round.
    function attestRound(
        bytes32 jobId,
        uint256 roundNumber,
        uint256 numParticipants,
        bytes32 modelCheckpointHash,
        uint256 cumulativeEpsilon
    ) external;

    /// @notice Finalize a training job.
    function finalizeJob(
        bytes32 jobId,
        bytes32 finalModelHash,
        uint256 totalRounds
    ) external;

    /// @notice Verify training provenance for a model.
    function getProvenance(bytes32 jobId)
        external
        view
        returns (
            uint256 totalRounds,
            uint256 totalParticipants,
            bytes32 finalModelHash,
            uint256 finalEpsilon,
            bool finalized
        );
}

This on-chain attestation serves two purposes:

Training provenance. When a model is deployed, anyone can verify how it was trained: how many rounds, how many participants, what privacy budget was spent, and that the model checkpoint hash matches. This supports regulatory compliance; the EU AI Act, for example, imposes technical documentation requirements on high-risk AI systems that cover how they were trained.
Participant accountability. The round attestation records the number of participants (not their identities). Combined with the enrollment records, an auditor can verify that the training process involved the claimed participants without revealing individual contributions.
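
Off-chain, the arguments for `attestRound` have to be prepared from the round's results. The sketch below shows one way to do that with the standard library: a SHA-256 checkpoint hash and the epsilon value scaled by 1e6 into an integer. How `jobId` is derived is not specified in this document, so the function takes it as an opaque 32-byte value; the function names and the dictionary layout are assumptions, and submitting the transaction is out of scope.

```python
import hashlib
from typing import Dict, Union

EPSILON_SCALE = 10**6   # cumulativeEpsilon is stored "scaled by 1e6"


def checkpoint_hash(checkpoint_bytes: bytes) -> bytes:
    """SHA-256 digest of the serialized model checkpoint (32 bytes, fits bytes32)."""
    return hashlib.sha256(checkpoint_bytes).digest()


def scale_epsilon(epsilon: float) -> int:
    """Convert a floating-point epsilon into the scaled unsigned integer form."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return round(epsilon * EPSILON_SCALE)


def attest_round_args(job_id: bytes, round_number: int, num_participants: int,
                      checkpoint_bytes: bytes,
                      cumulative_epsilon: float) -> Dict[str, Union[str, int]]:
    """Build the attestRound arguments in the order declared by the interface."""
    if len(job_id) != 32:
        raise ValueError("jobId must be exactly 32 bytes (bytes32)")
    return {
        "jobId": "0x" + job_id.hex(),
        "roundNumber": round_number,
        "numParticipants": num_participants,
        "modelCheckpointHash": "0x" + checkpoint_hash(checkpoint_bytes).hex(),
        "cumulativeEpsilon": scale_epsilon(cumulative_epsilon),
    }


# Placeholder job id and checkpoint bytes, for illustration only.
args = attest_round_args(bytes(32), 7, 3, b"example checkpoint bytes", 1.25)
```

An auditor can recompute the same hash from the stored checkpoint and compare it with the value recorded on-chain.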

Integration with ML Pipeline (HIP-0057)

Federated training jobs are submitted through the ML Pipeline API (HIP-0057) with a type: federated specifier:

Job Submission:
  name: "zen-7b-medical-federated"
  type: federated
  image: "hanzoai/fl-pytorch:2.3-cuda12.4"

  model:
    base: "zenlm/zen-7b"
    method: lora
    lora_rank: 32
    lora_alpha: 64

  federated:
    mode: cross_silo
    aggregation: fedavg
    local_epochs: 5
    total_rounds: 200
    min_clients_per_round: 3

    privacy:
      differential_privacy: true
      target_epsilon: 8.0
      target_delta: 1e-5
      max_grad_norm: 1.0

    secure_aggregation: true

    byzantine_robustness:
      method: trimmed_mean
      beta: 0.1

  participants:
    - did: "did:hanzo:hospital-alpha"
      spiffe: "spiffe://hanzo.ai/fl-clients/hospital-alpha"
    - did: "did:hanzo:hospital-beta"
      spiffe: "spiffe://hanzo.ai/fl-clients/hospital-beta"
    - did: "did:hanzo:hospital-gamma"
      spiffe: "spiffe://hanzo.ai/fl-clients/hospital-gamma"

When the job completes, the resulting model is registered in the ML Pipeline's Model Registry (HIP-0057) with federated training metadata: participant count, total rounds, privacy budget consumed, and the on-chain attestation transaction hash.

API Endpoints

Endpoint	Method	Description
Enrollment
`/v1/enroll`	POST	Enroll a participant (DID + VCs + mTLS)
`/v1/participants`	GET	List enrolled participants (operator only)
`/v1/participants/{id}`	DELETE	Remove a participant
Jobs
`/v1/jobs`	POST	Create a federated training job
`/v1/jobs/{id}`	GET	Get job status, round progress, privacy budget
`/v1/jobs/{id}/cancel`	POST	Cancel a running job
`/v1/jobs/{id}/rounds`	GET	List completed rounds with metrics
`/v1/jobs/{id}/provenance`	GET	Get on-chain attestation proof
Client Protocol
`/v1/rounds/current`	GET	Get current round info and global model URL
`/v1/rounds/{id}/update`	POST	Submit encrypted gradient update
`/v1/rounds/{id}/status`	GET	Round aggregation status
Monitoring
`/v1/jobs/{id}/metrics`	GET	Convergence metrics (loss, gradient norm)
`/v1/jobs/{id}/privacy`	GET	Privacy budget consumption over time
`/v1/jobs/{id}/participants/health`	GET	Client health and participation rates
`/health`	GET	Service health check
`/metrics`	GET	Prometheus metrics
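
As a rough illustration of how a tool might address these endpoints, the sketch below builds (but does not send) requests with Python's standard library. Only the paths and methods come from the table above; the base URL, the bearer-token header, the JSON request bodies, and the helper names are assumptions. The mTLS setup that enrollment requires is not shown.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_request(base_url: str, method: str, path: str,
                  body: Optional[dict] = None,
                  token: Optional[str] = None) -> urllib.request.Request:
    """Build an HTTP request object for an FL API endpoint without sending it."""
    headers = {"Accept": "application/json"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if token is not None:
        # Assumption: bearer-token auth in addition to mTLS.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def create_job_request(base_url: str, job: dict) -> urllib.request.Request:
    """POST /v1/jobs"""
    return build_request(base_url, "POST", "/v1/jobs", body=job)


def job_status_request(base_url: str, job_id: str) -> urllib.request.Request:
    """GET /v1/jobs/{id}"""
    quoted = urllib.parse.quote(job_id, safe="")
    return build_request(base_url, "GET", f"/v1/jobs/{quoted}")


def cancel_job_request(base_url: str, job_id: str) -> urllib.request.Request:
    """POST /v1/jobs/{id}/cancel"""
    quoted = urllib.parse.quote(job_id, safe="")
    return build_request(base_url, "POST", f"/v1/jobs/{quoted}/cancel")
```

Passing a returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the URL and method mapping easy to test without a running server.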

Configuration

# /etc/hanzo-fl/config.yaml

server:
  host: 0.0.0.0
  port: 8067
  workers: 4

database:
  url: "postgresql://hanzo:password@postgres:5432/hanzo_fl"

storage:
  endpoint: "http://minio:9000"
  access_key: "${HANZO_STORAGE_ACCESS_KEY}"
  secret_key: "${HANZO_STORAGE_SECRET_KEY}"
  bucket: "hanzo-fl"

chain:
  rpc_url: "https://api.hanzo.ai/ext/bc/hanzo/rpc"
  chain_id: 36963
  registry_address: "0xFLRegistryContractAddress"
  attestation_key: "${FL_ATTESTATION_KEY}"   # From KMS (HIP-0027)

auth:
  iam_url: "https://hanzo.id"
  verify_tokens: true
  spire_socket: "/run/spire/sockets/agent.sock"

privacy:
  default_epsilon: 8.0
  default_delta: 1e-5
  max_epsilon: 20.0            # Hard cap on privacy budget per job
  accountant: rdp

secure_aggregation:
  enabled: true
  dropout_threshold: 0.67      # Minimum fraction of clients that must finish for aggregation to complete

metrics:
  enabled: true
  port: 9090
  path: /metrics

logging:
  level: info
  format: json
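
To make the `dropout_threshold` setting concrete, here is a small worked check. It assumes the threshold is compared against completed clients divided by clients selected for the round; that reading, and the helper name `can_aggregate`, are assumptions rather than defined behavior. Exact fractions are used to avoid floating-point edge cases at the boundary.

```python
from fractions import Fraction


def can_aggregate(selected: int, completed: int, threshold: float) -> bool:
    """Return True if enough selected clients completed the round.

    Assumption: the threshold applies to completed / selected.
    """
    if selected <= 0:
        raise ValueError("selected must be positive")
    if not 0 <= completed <= selected:
        raise ValueError("completed must be between 0 and selected")
    # Recover the decimal the operator wrote (e.g. 0.67 -> 67/100).
    limit = Fraction(threshold).limit_denominator(10_000)
    return Fraction(completed, selected) >= limit
```

Under this reading, a three-client round with one dropout has a completion fraction of 2/3, which is just below 0.67, so `can_aggregate(3, 2, 0.67)` is false. Operators of small cross-silo deployments may want to choose the threshold with that in mind.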

Implementation

Client Binary

The FL client is a standalone binary that runs inside the participant's network:

# Install
curl -fsSL https://get.hanzo.ai/fl | sh

# Enroll
hanzo-fl-client enroll \
  --server https://fl.hanzo.ai:8067 \
  --did "did:hanzo:hospital-alpha" \
  --credential ./safety-eval.vc.json \
  --data-path /data/radiology/ \
  --gpu-count 2

# Join a training job
hanzo-fl-client train \
  --job "zen-7b-medical-federated" \
  --local-epochs 5 \
  --batch-size 16

# Check status
hanzo-fl-client status

The client binary:

Downloads the global model at the start of each round
Runs local training using PyTorch with the specified configuration
Clips gradient norms per the job's DP configuration
Encrypts the update using the secure aggregation protocol
Uploads the encrypted update to the FL server
Reports local metrics (with DP noise) for monitoring
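
The clipping and masking steps above can be sketched in a simplified form. This is not the client's actual implementation or the full secure aggregation protocol: it assumes pairwise seeds have already been agreed between clients, uses fixed-point arithmetic modulo 2^32, omits dropout recovery, and uses Python's non-cryptographic `random` module purely for illustration. All names are hypothetical.

```python
import math
import random
from typing import Dict, List

MODULUS = 1 << 32   # assumption: masked arithmetic is done modulo 2^32
SCALE = 1 << 16     # assumption: fixed-point scale used for quantization


def clip_update(update: List[float], max_norm: float) -> List[float]:
    """Scale the update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * factor for x in update]


def quantize(update: List[float]) -> List[int]:
    """Map floats to fixed-point integers modulo MODULUS."""
    return [round(x * SCALE) % MODULUS for x in update]


def mask_update(client_id: int, quantized: List[int],
                pair_seeds: Dict[int, int]) -> List[int]:
    """Add one mask per peer: +mask if our id is smaller, -mask otherwise.

    pair_seeds maps peer id -> seed shared with that peer.
    """
    masked = list(quantized)
    for peer_id, seed in pair_seeds.items():
        rng = random.Random(seed)  # NOT cryptographically secure; illustration only
        sign = 1 if client_id < peer_id else -1
        for i in range(len(masked)):
            masked[i] = (masked[i] + sign * rng.randrange(MODULUS)) % MODULUS
    return masked


def unmask_sum(masked_updates: List[List[int]]) -> List[float]:
    """Server side: sum the masked vectors; the pairwise masks cancel."""
    dim = len(masked_updates[0])
    totals = [sum(u[i] for u in masked_updates) % MODULUS for i in range(dim)]
    # Interpret the upper half of the range as negative values.
    return [((t - MODULUS) if t >= MODULUS // 2 else t) / SCALE for t in totals]
```

Each client sends only its masked vector. Because every pairwise mask is added by one client and subtracted by the other, the server recovers the sum of the clipped updates without seeing any individual update in the clear.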

Deployment

Docker

docker run -p 8067:8067 -p 9090:9090 \
  -e HANZO_FL_DATABASE_URL="postgresql://..." \
  -e HANZO_FL_STORAGE_ENDPOINT="http://minio:9000" \
  -e HANZO_FL_CHAIN_RPC_URL="https://api.hanzo.ai/ext/bc/hanzo/rpc" \
  hanzoai/fl:latest

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-fl
  namespace: hanzo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hanzo-fl
  template:
    metadata:
      labels:
        app: hanzo-fl
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: hanzo-fl
      containers:
        - name: hanzo-fl
          image: hanzoai/fl:latest
          ports:
            - containerPort: 8067
              name: api
            - containerPort: 9090
              name: metrics
          env:
            - name: HANZO_FL_DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: hanzo-fl-secrets
                  key: database-url
          readinessProbe:
            httpGet:
              path: /health
              port: 8067
          livenessProbe:
            httpGet:
              path: /health
              port: 8067
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-fl
  namespace: hanzo
spec:
  selector:
    app: hanzo-fl
  ports:
    - name: api
      port: 8067
    - name: metrics
      port: 9090

Prometheus Metrics

Metrics:
  Counters:
    hanzo_fl_rounds_total{job_id, status}            # Completed rounds
    hanzo_fl_updates_received_total{job_id}           # Client updates received
    hanzo_fl_updates_rejected_total{job_id, reason}   # Rejected updates (norm, byzantine)
    hanzo_fl_attestations_total{job_id}               # On-chain attestations published

  Histograms:
    hanzo_fl_round_duration_seconds{job_id}           # Round wall-clock time
    hanzo_fl_aggregation_duration_seconds{job_id}     # Secure aggregation compute time
    hanzo_fl_client_update_size_bytes{job_id}          # Update payload size

  Gauges:
    hanzo_fl_active_jobs                              # Currently running FL jobs
    hanzo_fl_enrolled_participants{job_id}             # Enrolled clients per job
    hanzo_fl_privacy_epsilon_remaining{job_id}         # Remaining privacy budget
    hanzo_fl_global_loss{job_id}                       # Current global validation loss
    hanzo_fl_client_participation_rate{job_id}         # Fraction of clients completing rounds

Implementation Roadmap

Phase 1: Cross-Silo Core (Q2 2026)

FedAvg aggregation with weighted averaging
Participant enrollment with DID and mTLS
Per-client gradient clipping (DP-SGD clipping component)
Basic convergence monitoring and early stopping
CLI client binary for Linux and macOS
Integration with ML Pipeline for job submission

Phase 2: Privacy and Security (Q3 2026)

Differential privacy with RDP accounting and auto-stop
Secure aggregation via additive secret sharing
Byzantine-robust aggregation (Krum, trimmed mean)
On-chain round attestation on Hanzo L1
Model poisoning detection and participant quarantine

Phase 3: Cross-Device Support (Q4 2026)

Cross-device client selection and scheduling
Dropout-tolerant secure aggregation (Bonawitz protocol)
Adaptive local epochs for heterogeneous devices
Mobile client SDK (iOS, Android)
Delta compression for bandwidth efficiency

Phase 4: Advanced Strategies and HE (Q1 2027)

FedProx and SCAFFOLD aggregation strategies
Homomorphic encryption option (Paillier)
Federated hyperparameter optimization
Privacy amplification by subsampling (formal integration)
Multi-task federated learning (multiple models per job)

Security Considerations

Threat Model

Threat	Mitigation
Gradient inversion (reconstructing data from gradients)	Differential privacy (DP-SGD) + secure aggregation
Model poisoning (malicious client sends corrupted updates)	Byzantine-robust aggregation (Krum, trimmed mean, norm checks)
Free-riding (client claims participation without training)	Minimum update norm check + local loss consistency verification
Sybil attack (attacker enrolls many fake clients)	DID-based enrollment with Verifiable Credentials + rate limiting
Server compromise (attacker controls aggregation server)	Secure aggregation ensures server never sees individual updates
Network eavesdropping (attacker intercepts client-server traffic)	mTLS (HIP-0054) encrypts all communication
Privacy budget exhaustion (training too many rounds leaks info)	RDP accountant with hard epsilon cap and auto-stop
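
The trimmed-mean mitigation listed for model poisoning can be illustrated with a coordinate-wise version in the spirit of Yin et al. The sketch assumes updates are plain lists of floats and that the number of values trimmed from each end is floor(beta × n); that rounding rule and the function name are assumptions.

```python
import math
from typing import List, Sequence


def trimmed_mean(updates: Sequence[Sequence[float]], beta: float) -> List[float]:
    """Coordinate-wise trimmed mean.

    For each coordinate, drop the k smallest and k largest values,
    where k = floor(beta * n), then average the rest.
    """
    if not 0 <= beta < 0.5:
        raise ValueError("beta must be in [0, 0.5)")
    n = len(updates)
    if n == 0:
        raise ValueError("at least one update is required")
    k = math.floor(beta * n)
    dim = len(updates[0])
    result = []
    for i in range(dim):
        column = sorted(u[i] for u in updates)
        kept = column[k:n - k]  # non-empty because beta < 0.5
        result.append(sum(kept) / len(kept))
    return result
```

With ten clients and beta = 0.1, one value is trimmed from each end of every coordinate, so a single extreme update has no effect on the result. With only three clients and beta = 0.1, this rounding rule trims nothing and the function reduces to a plain mean.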

Data at Rest

The FL server stores only aggregated model checkpoints, never individual client updates. Client updates exist in memory only during the secure aggregation computation and are discarded immediately after. The aggregated model checkpoints are encrypted at rest in Object Storage (HIP-0032) using server-side encryption.

Audit Trail

Every enrollment, round, aggregation, quarantine, and attestation event is logged to the audit system with the participant's DID (not real identity), timestamp, and event type. Audit logs are append-only and retained for the lifetime of the training job plus 365 days.

Privacy Budget Enforcement

The privacy budget (epsilon, delta) is enforced at the server level. The server refuses to start a new round if the cumulative privacy spend would exceed the job's configured target (target_epsilon). This prevents operators from accidentally training beyond the privacy guarantee. The hard cap (max_epsilon: 20.0 in the configuration) is a system-level safety net that cannot be overridden per job.
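
The round-gating rule can be sketched as below. This is an assumed shape rather than the server's code: `projected_epsilon` stands for the cumulative epsilon that the privacy accountant reports would be spent after the next round. The sketch takes that value as an input instead of summing per-round costs, because epsilon under RDP accounting does not simply add across rounds. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyPolicy:
    target_epsilon: float  # per-job target (target_epsilon in the job spec)
    max_epsilon: float     # system-wide hard cap (max_epsilon in the configuration)

    @property
    def effective_limit(self) -> float:
        """A job can never exceed the system cap, whatever its own target is."""
        return min(self.target_epsilon, self.max_epsilon)


def may_start_round(policy: PrivacyPolicy, projected_epsilon: float) -> bool:
    """Return True if the next round keeps cumulative spend within budget."""
    return projected_epsilon <= policy.effective_limit
```

For the example job (target 8.0, cap 20.0) a projected spend of 7.9 allows the round and 8.1 blocks it. A job that requested a target of 25.0 would still be limited to 20.0.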

References

McMahan et al., "Communication-Efficient Learning of Deep Networks from Decentralized Data" (2017) - FedAvg
Li et al., "Federated Optimization in Heterogeneous Networks" (2020) - FedProx
Karimireddy et al., "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" (2020) - SCAFFOLD
Bonawitz et al., "Practical Secure Aggregation for Privacy-Preserving Machine Learning" (2017) - Secure aggregation
Abadi et al., "Deep Learning with Differential Privacy" (2016) - DP-SGD
Zhu et al., "Deep Leakage from Gradients" (2019) - Gradient inversion attacks
Blanchard et al., "Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent" (2017) - Krum
Yin et al., "Byzantine-Robust Distributed Learning" (2018) - Trimmed mean
Paillier, "Public-Key Cryptosystems Based on Composite Degree Residuosity Classes" (1999) - Paillier HE
HIP-0024: Hanzo Sovereign L1 Chain Architecture - On-chain attestation
HIP-0048: Decentralized Identity Standard - Participant DID identity
HIP-0054: Zero Trust Architecture Standard - mTLS and participant enrollment
HIP-0057: ML Pipeline & Training Standard - Training orchestration
HIP-0032: Object Storage Standard - Model checkpoint storage
HIP-0027: Secrets Management Standard - Attestation key custody
