HIP-108: On-Demand Subsystem Supervisor + Warm Pool
Abstract
This HIP defines how the unified cloud binary (HIP-0106) supervises
two things at runtime:
- Out-of-process subsystems that cannot fold into the Go binary — Rust, Python, C++, and TypeScript services with native deps, native sandboxes, or licence/scope constraints that exclude them from the single-binary mount path.
- Per-tenant in-process state that should not be permanently resident — extension modules, sub-interpreters, per-tenant SQLite handles that activate on first request and idle out when traffic stops.
The contract is on-demand activation + warm pools + idle eviction. A realistic Hanzo deployment fits in 1-2 GB RAM at idle and scales by tenant load rather than by service count.
This HIP sits on top of HIP-0105 (the in-process runtime substrate),
HIP-0106 (the unified binary), HIP-0107 (the replication substrate),
and HIP-0302 (per-tenant SQLite + encryption). It introduces one new
package — cloud.Supervisor — and one new operator concept:
tiered activation.
Motivation
Per the HIP-0106 audit, the unified Go binary itself is ~150 MB resident at idle. But the full Hanzo ecosystem has 13+ out-of-process services that intentionally stay out of the unified binary:
| Reason out-of-process | Services |
|---|---|
| Native deps (torch, faiss, sentence-transformers) | flow, llm |
| PCI scope isolation (HIP-0106 §Solo-vault CDE) | vault, payments |
| Heavy column store with its own replication | datastore |
| Free-threading blockers per the 2026-05-19 FT audit | cli, erp, insights, sentry, studio |
| Independent frontend / studio process | platform, chat |
| Polyglot agent runtimes | brain, agents, bot |
If all 13 ran "always on, production-sized," the deployment would consume 64-256 GB RAM regardless of actual tenant traffic. Most tenants never touch most services.
Real workload profile (Hanzo flagship target):
| Quantity | What |
|---|---|
| 1000 | Total tenants |
| ~10 | Concurrently active on ML inference (flow, llm) |
| ~50 | Concurrently active on JS extensions (per-tenant goja modules) |
| ~100 | Concurrently active on CRUD (per-tenant Base) |
| ~840 | Idle in any given minute |
The "always-on" architecture over-provisions by roughly an order of magnitude. This HIP closes that gap with a supervisor pattern that co-exists with the always-on path for the workloads that warrant it.
Specification
Three activation tiers
Every workload in a Hanzo deployment falls into exactly one tier. The tier dictates how memory is amortised and how cold-start cost is paid.
| Tier | What runs there | RSS per active tenant | Cold-start cost |
|---|---|---|---|
| 0 — in-process | wazero modules, goja runtimes, starlark threads, native Go handlers | 9 KB (goja) – 200 KB (wazero) | sub-millisecond |
| 1 — sub-interpreter | per-tenant CPython sub-interp (pyvm), per-tenant Node worker_thread | 2–20 MB | low-millisecond |
| 2 — subprocess (warm pool) | per-(tenant, service) supervised OS process; idle-evictable; CRIU-snapshottable on Linux | 50–500 MB active; 0 idle | ~50–100 ms snapshot restore; ~2–10 s cold spawn |
Tier 0 is the HIP-0105 substrate. Tier 1 sub-interpreters share a
process with the unified binary (pyvm pool, Node worker_threads).
Tier 2 is what HIP-0108 introduces: supervised, idle-evictable OS
processes scoped per (orgID, service).
cloud.Supervisor API
package cloud
// Supervisor manages tier-2 warm-pool workers. Workers are OS
// processes scoped per (orgID, service). The supervisor handles spawn,
// affinity, idle eviction, optional CRIU snapshots, and resource
// isolation. Callers receive opaque WorkerHandles.
type Supervisor interface {
// Acquire returns a handle to a warm worker for (orgID, service).
// Blocks until a worker is available (warm from pool or freshly
// spawned). Caller MUST Release the handle when done.
Acquire(ctx context.Context, orgID, service string) (WorkerHandle, error)
// Stats returns current pool stats for o11y / debugging.
Stats() SupervisorStats
}
type WorkerHandle interface {
// Invoke sends a request to the worker over its IPC channel
// (unix socket / shared memory / ZAP RPC).
Invoke(ctx context.Context, method string, payload []byte) ([]byte, error)
// Release returns the worker to the warm pool (if pool has slack)
// or schedules it for idle reaping.
Release()
}
type SupervisorStats struct {
PerService map[string]ServiceStats
TotalRSS uint64
}
type ServiceStats struct {
WarmWorkers int
ActiveInvokes int
SnapshotsOnDisk int
LastEvictionAt time.Time
}
The interface is intentionally narrow. Three calls — Acquire,
Invoke, Release — cover the lifecycle. The supervisor is the only
component that knows whether a request was served by a warm worker, a
freshly spawned worker, or a CRIU-restored worker. Callers see a
uniform WorkerHandle.
Worker lifecycle
- Cold spawn.
fork+execthe target binary with--worker --tenant=<orgID>and the ZAP socket fd inherited. The worker performs language-specific initialisation and signalsREADYover the socket. - Warm pool. When
Releaseis called, the worker stays resident foridle_timeoutseconds (default 60s, configurable per service). SubsequentAcquirecalls for the same(orgID, service)return the same worker via consistent hashing. - CRIU snapshot (Linux only, opt-in). On idle eviction the
supervisor may freeze the worker to disk under
{data-dir}/snapshots/{org}/{service}.crium. NextAcquirerestores in ~50–100 ms instead of a full 2–10 s spawn. - Idle eviction. Idle longer than the configured timeout →
SIGTERM→ 5 s grace →SIGKILL. Worker MUST honourSIGTERMgracefully: flush state, release WAL, close fds. The supervisor's ZAP healthcheck verifies the worker handed back its WAL handle before reaping. - Affinity. Same tenant routes to the same warm worker via
consistent hashing on
orgID. This is what makes per-tenant in-process state caching effective at tier 2 — the worker's heap already has that tenant's hot rows in memory.
Resource isolation
Each tier-2 worker runs under explicit limits:
- cgroups v2 per worker process:
memory.max,cpu.max,pids.max. Limits are per-service defaults overridable per-tenant via supervisor config. - Namespaces per worker where the service warrants it: PID, network, mount. Network namespace is the common case (the worker reaches outbound services only through the supervisor-provided socket).
- File-descriptor limits per worker via
RLIMIT_NOFILE. - Graceful shutdown contract. Worker MUST honour
SIGTERM: flush in-flight writes, release WAL, close fds, exit within 5 s. Workers that miss the deadline takeSIGKILL; their per-tenant SQLite WAL is recovered on next acquire (the SQLite WAL recovery protocol is idempotent — see HIP-0302).
Per-tenant WAL handoff (per HIP-0302)
When the supervisor activates a warm worker for tenant T against
service S:
- The worker receives the file descriptor for
{data-dir}/orgs/{T}/{S}.db. - The per-org DEK is fetched from KMS (HKDF-derived from the per-service master key per HIP-0302).
- The worker opens SQLite in WAL mode.
- On
Releasethe workerfsyncs the WAL and returns control to the supervisor. - HIP-0107
replicatecontinues streaming WAL changes tovfs/S3 regardless of which worker currently holds the handle.
The consequence: per-tenant state is shared between the cloud binary AND the warm worker via the same SQLite file. The cloud binary can read from the same DB while the worker is active — no IPC round-trip for reads from the supervised path. The worker owns WAL writes for the duration of its acquire; the cloud binary defers WAL writes to the worker via the same Supervisor.
Service classification
For each Hanzo service, the canonical tier and pool size:
| Service | Activation tier | Pool size | Idle timeout |
|---|---|---|---|
| Most extensions (wazero / goja) | 0 in-process | per-runtime defaults (HIP-0105) | n/a |
pyvm (single-tenant CPython) | 0 in-process | 4 sub-interps | n/a |
flow (visual ML pipeline) | 2 subprocess | 0–2 warm | 120 s |
llm (Python LLM gateway) | 2 subprocess | 1–4 warm | 300 s |
insights | 2 subprocess | 0–1 warm | 60 s |
brain, agents (TS runtimes) | 2 subprocess | 0–2 warm | 60 s |
platform (Next.js UI) | always-on | 1–3 replicas | n/a |
chat (TS UI) | always-on | 1–2 replicas | n/a |
datastore (OLAP column store) | always-on (server) | single | n/a |
vault, payments (PCI scope) | always-on | 2 replicas | n/a |
The cloud binary's Supervisor manages tier-2 workers only. Tier 0
lives in the HIP-0105 substrate. Always-on services are out of
HIP-0108 scope — they remain traditional K8s deployments because
either (a) PCI scope dictates isolation, (b) the workload is itself
the always-on serving plane (UIs, OLAP), or (c) the service has its
own internal idle-management.
CRIU integration
CRIU lets the supervisor freeze a worker process to disk on idle eviction and restore it on next acquire. The trade-off:
- Snapshot size: ~100 MB per
(orgID, service)snapshot, evictable on an LRU policy. - Restore latency: ~50–100 ms vs ~2–10 s for a full spawn.
- Platform: Linux only. macOS dev environments take the spawn-only path (no snapshot, no resource limits — see Open Questions).
- Storage:
{data-dir}/snapshots/{org}/{service}.crium. The supervisor reuses HIP-0107vfssemantics for off-host snapshot storage when configured, so snapshots survive node loss. - WAL replay: on restore, the worker reopens its per-tenant SQLite handle and the HIP-0107 replicate frame stream replays any missed WAL frames before the worker accepts new requests.
CRIU snapshotting is opt-in per service. The default is spawn-only; operators turn snapshotting on for services where the cold-spawn cost dominates the request budget.
Memory budget
For the Hanzo flagship profile (1000 tenants; 10 concurrent ML; 50 active JS; 100 concurrent CRUD):
| Component | RSS |
|---|---|
| Cloud baseline (unified binary, all subsystems compiled in) | 150 MB |
| Tier 0: in-process extensions (50 active JS × ~9 KB + idle pool overhead) | ~5 MB |
| Tier 1: sub-interpreters (4 × ~4.6 MB, shared across tenants via affinity) | ~20 MB |
| Tier 2: warm ML workers (10 × ~1 GB) | ~10 GB |
| Per-tenant SQLite handles (100 × ~5 MB) | ~500 MB |
| Headroom (50%) | ~5 GB |
| Total | ~16 GB |
Compare to ~64–128 GB for an always-on equivalent. The supervisor pattern delivers a 5–10× memory reduction at the cost of ~50–100 ms restore latency on the cold path (per snapshot) or 2–10 s on a full cold spawn.
Implementation phases
Phase 1 — Supervisor MVP (~1 week)
cloud.Supervisorinterface in~/work/hanzo/cloud/alongside the existingDeps(HIP-0106 §deps.go).- Spawn/reap with unix socket IPC.
- LRU warm pool keyed by
(orgID, service). No CRIU yet. - Per-tenant consistent-hash affinity.
- O11y
/v1/supervisor/statsendpoint emittingSupervisorStats.
Phase 2 — cgroups + namespaces (~1 week)
- libcontainer wrapper per-worker for
memory.max,cpu.max,pids.max. - Memory accounting per
(orgID, service)surfaced through o11y. - Per-tenant kill switch (
POST /v1/supervisor/evict?org=...&service=...).
Phase 3 — CRIU snapshots (~2 weeks)
- Snapshot on idle eviction (Linux only).
- Restore on next request.
- WAL replay coordination with HIP-0107
replicate. - Snapshot LRU + disk quota under
{data-dir}/snapshots/. - Optional off-host snapshot storage via
vfs.
Phase 4 — per-service adapters (~1 week each)
- ML services: Python via
pyvmin supervisor mode (re-using the HIP-0105 CPython embed when the workload is single-tenant; spawning full Python subprocess workers otherwise). - LLM services: subprocess workers with model weights pre-loaded
on first acquire and held warm for
idle_timeout. - TS services: Node
worker_threadsfor lightweight cases; full Node subprocess spawn for heavyweight (brain, agents).
Non-goals
- Cluster-level scheduling. K8s does this. The supervisor is in-process to the cloud binary; cross-pod scheduling is not in scope.
- Multi-cluster failover. HIP scope is per-deployment. The supervisor coordinates with HIP-0107 replicate for state durability; multi-cluster failover is a separate concern.
- Hot reload of subsystem code. Phase 4 may add it for specific services, but it is not required by this HIP.
- Folding always-on services into the supervisor. UIs, OLAP stores, and PCI-scoped services stay always-on by design.
Open questions
- macOS dev story. No CRIU, no cgroups, no namespaces. The supervisor MUST work without these on darwin — just no snapshot path and no resource limits. Accept the per-worker memory cost in dev. Linux remains the canonical production target.
- Restart storms. If a worker crashes mid-request, what is the
retry semantic? Suggested default: exponential-backoff retry with
a circuit breaker — 3 failures in 30 s marks the
(orgID, service)unhealthy and short-circuits subsequent acquires for a 60 s cooldown. Surfaced to o11y. - Tenant migration across clusters. When tenant
Tmoves clusters, what happens to its warm workers and CRIU snapshots? Suggested protocol: "drain on migration signal" — supervisor stops acceptingAcquireforT, drains in-flight invokes, reaps the warm worker, leaves snapshot on disk. State on the destination cluster is rebuilt from the HIP-0107 replicate stream on first acquire. CRIU snapshots do NOT migrate; only the replicated SQLite WAL does. - Backpressure on
Acquire. When the pool is at its max and no warm worker is available,Acquireblocks onctx. Should the supervisor expose a queue depth metric and a per-servicemax_queueconfig? Suggest yes — wire underSupervisorStats.PerService[svc].QueueDepthin Phase 1.
References
- HIP-0105 — In-Process Extension Runtime Standard (the substrate for tier 0)
- HIP-0106 — Unified Hanzo Cloud Binary (the binary the Supervisor lives in)
- HIP-0107 — Streaming Replication over VFS (per-tenant WAL replay on worker activation)
- HIP-0302 — Encrypted SQLite + ZapDB Durability (per-tenant SQLite
- encryption contract the Supervisor honours)
~/work/hanzo/base/docs/EXTENSIONS_BENCHMARK.md— runtime perf data underpinning the tier-0/1 numbers- AWS Lambda warm-starts pattern (architectural inspiration)
- Fly.io
fly machinesmodel (single-process supervisor + idle suspend) - Linux CRIU — https://criu.org
- cgroups v2 — kernel docs
Documentation/admin-guide/cgroup-v2.rst