HIP-60: Serverless Functions (FaaS) Standard
Abstract
This proposal defines the Serverless Functions standard for the Hanzo ecosystem. Hanzo Functions is a Function-as-a-Service platform for event-driven AI workloads, built on Knative Serving with a custom AI runtime that supports GPU-attached execution and keeps cold starts in the 500ms-5s range.
Functions are the smallest deployable unit of compute in the Hanzo platform. Where the Inference Engine (HIP-0043) runs persistent model-serving processes and the Edge layer (HIP-0050) runs V8 isolates for lightweight request processing, Functions occupy the middle ground: containerized, stateless units of work that spin up on demand, execute, and disappear. They are the correct abstraction for bursty, event-driven AI workloads that do not justify a long-running service.
The platform supports four function runtimes: Python, Go, Rust, and TypeScript. Each runtime ships as a pre-built base image with AI-specific dependencies (PyTorch, ONNX Runtime, Candle, transformers.js) pre-installed and pre-warmed, eliminating the dependency installation tax that plagues cold starts on generic FaaS platforms.
AI-specific triggers connect functions to the broader Hanzo infrastructure: model inference events from the LLM Gateway (HIP-0004), webhook delivery, scheduled retraining cycles, data pipeline stages from Hanzo Stream (HIP-0030), and async task invocation from Hanzo MQ (HIP-0055). Functions can also be deployed to the Edge (HIP-0050) for latency-sensitive invocation without GPU requirements.
Repository: github.com/hanzoai/functions
Port: 8060 (API), 8061 (function invocation proxy)
Docker: ghcr.io/hanzoai/functions:latest
Binary: hanzo-fn
Motivation
The Bursty Inference Problem
AI workloads are fundamentally bursty. A customer fine-tunes a model once a week. A webhook fires when a document is uploaded and needs embedding. A nightly job recomputes recommendation scores. A Slack bot classifies incoming messages and routes them to the right team.
None of these workloads justify a long-running service. A persistent Kubernetes Deployment with 2 replicas, running 24/7 to handle a webhook that fires 50 times a day, wastes 99.9% of its allocated compute. Multiply this across dozens of AI-adjacent microservices and the waste is substantial -- hundreds of dollars per month in idle GPU and CPU time.
The serverless model eliminates this waste. A function scales to zero when idle and scales up within seconds when triggered. You pay (in cluster resources) only for the milliseconds of actual execution. For the webhook that fires 50 times a day, the cost drops from "always-on Deployment" to "50 cold starts + 50 executions."
The GPU Cold Start Problem
Generic FaaS platforms (AWS Lambda, Google Cloud Functions, Knative with default configuration) are designed for CPU workloads. Cold start time is typically 100-500ms for a pre-built container, which is acceptable for HTTP request handling.
GPU functions are a different story. A function that runs a small inference model must:
1. Schedule onto a node with an available GPU (0-30 seconds if the cluster has spare capacity, minutes if a new node must be provisioned)
2. Pull the container image (2-10 seconds for a base Python + PyTorch image, which is 2-8GB)
3. Initialize the CUDA runtime (1-3 seconds)
4. Load model weights into GPU memory (1-10 seconds depending on model size)
5. Execute the actual inference (10-500ms)
Steps 1-4 can take 30+ seconds. For a function that processes a webhook in 200ms, spending 30 seconds on cold start is absurd. The cold start is 150x longer than the useful work.
Hanzo Functions solves this with three mechanisms: pre-warmed GPU pools (step 1 is eliminated), container snapshots (steps 2-3 are reduced to <1 second), and model caching (step 4 is eliminated for recently-used models). The result is GPU function cold starts under 2 seconds for models already in the cache, and under 5 seconds for first-time model loads.
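The arithmetic behind these figures can be checked with a short sketch. It sums the per-step ranges quoted in the list above, assuming the spare-capacity case for scheduling; the dictionary keys are illustrative labels, not platform identifiers.

```python
# Cold-start step ranges in whole seconds, taken from the list above.
# Scheduling uses the 0-30 s range (cluster has spare GPU capacity).
STEPS = {
    "schedule_gpu_node": (0, 30),
    "pull_image": (2, 10),
    "init_cuda": (1, 3),
    "load_weights": (1, 10),
}


def cold_start_range(steps):
    """Return (best, worst) total cold-start time in seconds."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


def overhead_ratio(cold_start_ms, work_ms):
    """How many times longer the cold start is than the useful work."""
    return cold_start_ms / work_ms


print(cold_start_range(STEPS))      # (4, 53)
print(overhead_ratio(30_000, 200))  # 150.0
```

Even the best case (4 seconds) is twenty times longer than a 200ms webhook handler, which is why each of the first four steps gets its own mitigation.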
The Event-Driven AI Pipeline Problem
Modern AI applications are not monolithic inference endpoints. They are pipelines of discrete steps:
Document uploaded
  --> Extract text (CPU, 500ms)
  --> Chunk text (CPU, 50ms)
  --> Generate embeddings (GPU, 200ms per chunk)
  --> Store in vector DB (CPU, 100ms)
  --> Trigger reranking index update (CPU, 2s)
  --> Send notification (CPU, 50ms)
Each step has different resource requirements (CPU vs. GPU), different scaling characteristics (embedding generation is the bottleneck), and different failure modes (vector DB write might fail independently of embedding generation). Running this pipeline in a single service means the entire service needs GPU access even though only one step uses the GPU.
Functions let you decompose the pipeline into independent units. The embedding step runs on a GPU function. Every other step runs on a CPU function. The GPU function scales independently based on its queue depth. If the embedding step fails, it retries independently without re-running text extraction.
The glue between steps is Hanzo Stream (HIP-0030) for durable event-driven pipelines and Hanzo MQ (HIP-0055) for async task invocation. Functions subscribe to stream topics or MQ subjects and are triggered automatically when events arrive.
Why Not Just Use Kubernetes Jobs
Kubernetes Jobs are the simplest way to run one-off compute tasks. They schedule a pod, run a container to completion, and clean up. Why build a FaaS layer on top?
Three reasons: cold start optimization, event binding, and developer experience.
Cold start: A Kubernetes Job starts from scratch every time. It pulls the image, initializes the runtime, and loads dependencies. There is no concept of a warm pool, container reuse, or model caching. Knative Serving (which underlies Hanzo Functions) maintains a pool of warm containers that are immediately available for new requests. The difference is 200ms vs. 30 seconds for a GPU workload.
Event binding: Kubernetes Jobs have no built-in trigger mechanism. You need an external system (CronJob for schedules, a webhook receiver for HTTP events, a custom controller for Kafka consumption) to create Jobs in response to events. Hanzo Functions provides a declarative trigger configuration: "run this function when a message arrives on mq.batch.embeddings" or "run this function on a cron schedule." The trigger-to-function binding is managed by the platform, not by the developer.
Developer experience: A Kubernetes Job requires writing a Dockerfile and a Job YAML manifest, and understanding pod scheduling, resource limits, service accounts, and image pull secrets. A Hanzo Function requires writing a function in Python/Go/Rust/TypeScript, pointing the CLI at it, and declaring a trigger. The platform handles containerization, scheduling, scaling, and monitoring.
Design Philosophy
Why Knative Over OpenFaaS and AWS Lambda
The three major approaches to serverless on Kubernetes are Knative, OpenFaaS, and Lambda-compatible runtimes (Firecracker/Lambda containers). Each makes different tradeoffs.
AWS Lambda is the gold standard for serverless developer experience: write a function, deploy it, never think about infrastructure. But Lambda is a proprietary AWS service. Running Lambda-compatible runtimes on Kubernetes (via Firecracker or Lambda Web Adapter) gives you the API surface but not the operational benefits. You still manage the cluster, the networking, and the scaling. And Lambda's runtime contract (512MB of /tmp by default, 15-minute timeout, no GPU support) is designed for web APIs, not AI workloads. There is no path to GPU-attached Lambda functions on self-hosted infrastructure.
OpenFaaS is a lightweight, Kubernetes-native FaaS framework. It is simpler than Knative: fewer CRDs, no Istio dependency, and a straightforward watchdog pattern (HTTP → container → response). However, OpenFaaS has two limitations for AI workloads. First, its autoscaler is basic: it scales based on requests-per-second, not queue depth or GPU utilization. For AI functions where a single request takes 10 seconds of GPU time, request-rate scaling is the wrong signal. Second, OpenFaaS does not support scale-to-zero natively (the "zero-scale" feature requires a separate component and has a 5-10 second wake-up penalty with no warm pool concept).
Knative Serving is the most sophisticated option. It provides:
- Scale-to-zero with warm pools: Knative maintains a configurable number of warm instances (the "initial scale" and "min scale" settings). When traffic drops to zero, instances drain gracefully. When traffic returns, warm instances handle requests immediately while new instances spin up.
- Concurrency-based autoscaling: Knative scales based on concurrent requests per instance, not requests per second. This is the correct signal for AI functions: if each instance can handle 1 concurrent GPU inference, and 10 requests arrive, Knative scales to 10 instances. OpenFaaS's rate-based scaling would see "10 requests in 1 second" and might scale differently.
- Revision management: Knative supports traffic splitting between function revisions. Deploy a new version, route 10% of traffic to it, monitor GPU utilization and latency, then promote to 100%. This is essential for AI functions where a new model version might have different memory or latency characteristics.
- Kubernetes-native: Knative uses standard Kubernetes primitives (Deployments, Services, HPAs). No proprietary abstractions. When something breaks, you debug with kubectl, not a vendor-specific CLI.
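The concurrency-based scaling rule can be sketched as a single function. This is a simplified model, not Knative's actual autoscaler, which averages concurrency over a time window and applies a target utilization; the function and parameter names here are illustrative.

```python
import math


def desired_instances(in_flight, target_concurrency, min_instances=0, max_instances=10):
    """Instances needed so that each one handles at most target_concurrency requests."""
    if target_concurrency < 1:
        raise ValueError("target_concurrency must be >= 1")
    wanted = math.ceil(in_flight / target_concurrency)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, wanted))


print(desired_instances(10, 1, max_instances=20))  # 10
print(desired_instances(0, 1))                     # 0
print(desired_instances(50, 1, max_instances=20))  # 20
```

With a target of 1, ten in-flight GPU inferences yield ten instances regardless of how quickly the requests arrived, which is the property a rate-based signal lacks.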
Trade-off acknowledged: Knative is more complex than OpenFaaS. It has more CRDs, a heavier control plane, and a steeper learning curve. We accept this because its autoscaling model and revision management are correct for GPU workloads, and its Kubernetes-native design means we are not locked into a framework-specific operational model.
| Factor | AWS Lambda | OpenFaaS | Knative |
|---|---|---|---|
| GPU support | No | Manual (no first-class) | Via custom runtime (this HIP) |
| Scale-to-zero | Native | Plugin (slow wake) | Native (warm pool) |
| Autoscaling signal | Concurrency | Request rate | Concurrency |
| Revision/traffic split | Aliases + weights | No | Native |
| K8s integration | External | Lightweight | Deep (CRDs, Services) |
| Operational complexity | None (managed) | Low | Medium |
| Vendor lock-in | AWS only | None | None |
| Cold start (CPU) | 100-500ms | 1-5s | 500ms-2s |
| Cold start (GPU) | N/A | 30s+ | 2-5s (with this HIP) |
Why AI Needs Serverless
The dominant pattern for AI model serving is persistent deployments: load a model into GPU memory and keep it there, serving requests indefinitely. This is correct for high-traffic models (zen-72b serving thousands of requests per minute). But the AI ecosystem has a long tail of workloads that are poorly served by persistent deployments:
- Bursty inference: A customer's fine-tuned model receives 100 requests during business hours and zero requests overnight. A persistent deployment wastes 16 hours of GPU time per day. A function scales to zero overnight and wakes in seconds when the first morning request arrives.
- Event-driven pipelines: Document processing, image annotation, video transcription, and data enrichment are triggered by uploads, not by user requests. They run for seconds or minutes, not hours. Functions are the natural execution model.
- Scheduled retraining: Nightly model retraining, weekly evaluation runs, and monthly dataset refreshes are periodic GPU workloads. Functions with cron triggers replace the CronJob + custom-container pattern with a single declarative configuration.
- Webhook handlers: OAuth callbacks, payment confirmations, CI/CD notifications, and third-party API webhooks need lightweight, stateless request handling. Functions handle these without a dedicated microservice per webhook source.
- Prototyping and experimentation: Researchers need to deploy a model for a demo, test a new preprocessing step, or run a one-off evaluation. Functions let them deploy in seconds without writing Dockerfiles or Kubernetes manifests.
The principle: persistent deployments for steady-state serving; functions for everything else.
Why Not Edge Functions for Everything
Hanzo Edge (HIP-0050) provides V8 isolate-based functions at globally distributed PoPs. Why not use Edge for all serverless workloads?
Because Edge functions run on CPU-only infrastructure. They execute JavaScript/TypeScript in V8 isolates with sub-millisecond cold starts and <128MB memory limits. This is perfect for authentication, routing, caching, and small-model inference (embeddings, classification).
But Edge functions cannot:
- Attach to GPUs for model inference
- Run Python, Go, or Rust code natively
- Use more than 128MB of memory
- Execute for longer than 100ms of CPU time (configurable, but fundamentally limited by the isolate model)
Hanzo Functions fills the gap between Edge (millisecond-scale, CPU-only, globally distributed) and the Inference Engine (persistent, GPU-attached, origin-only). Functions run in containers with full OS access, arbitrary memory limits, GPU attachment, and execution times measured in seconds or minutes.
Workload Type          │ Execution Model      │ Cold Start  │ GPU │ Duration
───────────────────────┼──────────────────────┼─────────────┼─────┼──────────
Auth/routing/caching   │ Edge (HIP-0050)      │ <1ms        │ No  │ <100ms
Webhooks/ETL/pipelines │ Functions (HIP-0060) │ 500ms-5s    │ Opt │ <15min
Model serving          │ Engine (HIP-0043)    │ 2-30s       │ Yes │ Persistent
Training/fine-tuning   │ ML Pipeline (HIP-57) │ Minutes     │ Yes │ Hours
Specification
Architecture
                 ┌────────────────────────────────────────────┐
                 │              Hanzo Functions               │
                 │            Control Plane :8060             │
                 │                                            │
                 │   ┌──────────┐ ┌──────────┐ ┌──────────┐   │
                 │   │ Function │ │ Trigger  │ │ Revision │   │
                 │   │ Registry │ │ Manager  │ │ Router   │   │
                 │   └──────────┘ └──────────┘ └──────────┘   │
                 └────────────────────┬───────────────────────┘
                                      │
             ┌────────────────────────┼────────────────────────┐
             │                        │                        │
      ┌──────┴──────┐         ┌───────┴──────┐          ┌──────┴──────┐
      │   Knative   │         │   Knative    │          │   Knative   │
      │   Serving   │         │   Serving    │          │   Serving   │
      │ (CPU Pool)  │         │  (GPU Pool)  │          │ (Edge Sync) │
      │             │         │              │          │             │
      │ Python      │         │ Python+CUDA  │          │ Sync to     │
      │ Go          │         │ Rust+Candle  │          │ HIP-0050    │
      │ Rust        │         │              │          │             │
      │ TypeScript  │         │              │          │             │
      └──────┬──────┘         └──────┬───────┘          └─────────────┘
             │                       │
   ┌─────────┴──────────┐    ┌───────┴───────────┐
   │                    │    │                   │
┌──┴───┐  ┌──────┐   ┌──┴──┐ │ ┌──────┐ ┌─────┐  │
│Stream│  │  MQ  │   │HTTP │ │ │Model │ │GPU  │  │
│HIP-30│  │HIP-55│   │     │ │ │Cache │ │Pool │  │
└──────┘  └──────┘   └─────┘ │ └──────┘ └─────┘  │
                             └───────────────────┘
The architecture has three layers:
- Control Plane (port 8060): Manages function definitions, trigger bindings, revision history, and deployment orchestration. Talks to the Kubernetes API server to create and update Knative Services.
- Execution Plane: Knative Serving manages the lifecycle of function instances. CPU functions run on the standard node pool. GPU functions run on GPU-labeled nodes with pre-warmed CUDA containers. Edge-compatible functions are synced to the Edge control plane (HIP-0050) for deployment as V8 isolates.
- Event Plane: Trigger sources (Stream, MQ, HTTP, Cron) are connected to functions via the Trigger Manager. When an event arrives, the Trigger Manager invokes the function through the Knative invocation proxy (port 8061).
Function Definition
A function is defined by a manifest file (function.yaml) and source code:
# function.yaml
name: embed-document
runtime: python
version: 1.0.0
entry: handler.embed
description: Generate embeddings for uploaded documents
resources:
  cpu: "500m"
  memory: "1Gi"
  gpu: "nvidia.com/gpu: 1"      # Optional: request a GPU
  timeout: 300s                 # Maximum execution time
  concurrency: 4                # Max concurrent requests per instance
scaling:
  min_instances: 0              # Scale to zero when idle
  max_instances: 20             # Hard ceiling
  target_concurrency: 1         # Scale up when concurrency exceeds this
  scale_down_delay: 300s        # Wait 5 min before scaling down
triggers:
  - type: mq
    source: mq.batch.embeddings
    consumer_group: fn-embed-document
  - type: http
    path: /v1/functions/embed-document
    methods: [POST]
  - type: cron
    schedule: "0 2 * * *"       # Daily at 02:00 UTC
    payload: '{"mode": "reindex"}'
environment:
  MODEL_NAME: bge-large-en-v1.5
  BATCH_SIZE: "32"
secrets:
  - name: hanzo-api-key
    env: HANZO_API_KEY
Manifest Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Function name. Lowercase, alphanumeric + hyphens. Globally unique within the org. |
| runtime | enum | yes | python, go, rust, typescript |
| version | semver | yes | Function version. New versions create new Knative revisions. |
| entry | string | yes | Entrypoint. Format depends on runtime (see Runtime Specification). |
| description | string | no | Human-readable description. |
| resources.cpu | string | no | CPU request (Kubernetes format). Default: 250m. |
| resources.memory | string | no | Memory request. Default: 256Mi. |
| resources.gpu | string | no | GPU resource request. Omit for CPU-only functions. |
| resources.timeout | duration | no | Max execution time. Default: 60s. Max: 900s (15 minutes). |
| resources.concurrency | int | no | Max concurrent requests per instance. Default: 1 for GPU, 10 for CPU. |
| scaling.min_instances | int | no | Minimum instances. 0 enables scale-to-zero. Default: 0. |
| scaling.max_instances | int | no | Maximum instances. Default: 10. |
| scaling.target_concurrency | int | no | Target concurrency for autoscaling. Default: 1. |
| scaling.scale_down_delay | duration | no | Grace period before scaling down. Default: 300s. |
| triggers | list | yes | One or more trigger definitions (see Trigger Specification). |
| environment | map | no | Environment variables injected into the function container. |
| secrets | list | no | KMS secrets (HIP-0027) injected as environment variables. |
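A few of the rules in this table can be expressed as a validation pass. The sketch below is an illustration only: it assumes the manifest has already been parsed into a dict, it accepts durations in whole seconds only (such as 300s), and the helper names are hypothetical rather than part of the hanzo-fn tooling.

```python
import re

RUNTIMES = {"python", "go", "rust", "typescript"}
REQUIRED = ("name", "runtime", "version", "entry", "triggers")
NAME_RE = re.compile(r"[a-z0-9-]+")  # Lowercase, alphanumeric + hyphens
MAX_TIMEOUT_S = 900                  # 15 minutes


def parse_seconds(value):
    """Parse a duration such as '300s' into whole seconds (seconds suffix only)."""
    match = re.fullmatch(r"(\d+)s", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1))


def validate_manifest(manifest):
    """Return a list of problems; an empty list means the manifest passed."""
    errors = [f"missing required field: {f}" for f in REQUIRED if not manifest.get(f)]
    name = manifest.get("name", "")
    if name and not NAME_RE.fullmatch(name):
        errors.append("name must be lowercase alphanumeric plus hyphens")
    runtime = manifest.get("runtime")
    if runtime and runtime not in RUNTIMES:
        errors.append(f"unknown runtime: {runtime}")
    # Fall back to the documented 60s default when no timeout is given.
    timeout = parse_seconds(manifest.get("resources", {}).get("timeout", "60s"))
    if timeout > MAX_TIMEOUT_S:
        errors.append("resources.timeout exceeds 900s")
    return errors


manifest = {
    "name": "embed-document",
    "runtime": "python",
    "version": "1.0.0",
    "entry": "handler.embed",
    "triggers": [{"type": "http", "path": "/v1/functions/embed-document"}],
    "resources": {"timeout": "300s"},
}
print(validate_manifest(manifest))  # []
```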
Runtime Specification
Each runtime provides a base container image with pre-installed dependencies and a standard invocation contract.
Python Runtime
Base image: ghcr.io/hanzoai/fn-python:3.12
Pre-installed: torch, transformers, onnxruntime, numpy, httpx, pydantic
# handler.py
from hanzo.functions import Context, Response

# fetch_document and store_embeddings are application helpers defined elsewhere.


def embed(ctx: Context) -> Response:
    """Generate embeddings for a document."""
    doc_id = ctx.data["document_id"]
    text = fetch_document(doc_id)
    model = ctx.model("bge-large-en-v1.5")  # Loaded from model cache
    embeddings = model.encode(text, batch_size=32)
    store_embeddings(doc_id, embeddings)
    return Response(
        status=200,
        body={"document_id": doc_id, "dimensions": len(embeddings[0])},
    )
The Context object provides:
| Attribute | Type | Description |
|---|---|---|
| ctx.data | dict | Parsed trigger payload (JSON body, MQ message data, Stream event data) |
| ctx.headers | dict | HTTP headers (for HTTP triggers) or message headers |
| ctx.trigger | TriggerInfo | Trigger metadata (type, source, timestamp) |
| ctx.model(name) | Model | Load a model from the GPU model cache |
| ctx.kv | KVClient | Valkey client (HIP-0028) |
| ctx.storage | StorageClient | Object storage client (HIP-0032) |
| ctx.publish(subject, data) | None | Publish to MQ (HIP-0055) or Stream (HIP-0030) |
| ctx.log | Logger | Structured logger with request correlation ID |
Go Runtime
Base image: ghcr.io/hanzoai/fn-go:1.22
Pre-installed: ONNX Runtime C bindings, Hanzo SDK
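// handler.go
package main

import (
	"errors"

	"github.com/hanzoai/functions/sdk"
)

// Classify predicts the intent label for the "text" field of the payload.
func Classify(ctx *sdk.Context) *sdk.Response {
	// Use the comma-ok form so a missing or non-string field does not panic.
	text, ok := ctx.Data["text"].(string)
	if !ok {
		return sdk.Error(400, errors.New(`"text" must be a string`))
	}
	model, err := ctx.Model("distilbert-intent")
	if err != nil {
		return sdk.Error(500, err)
	}
	result, err := model.Predict(text)
	if err != nil {
		return sdk.Error(500, err)
	}
	return sdk.OK(map[string]interface{}{
		"label":      result.Label,
		"confidence": result.Score,
	})
}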
Rust Runtime
Base image: ghcr.io/hanzoai/fn-rust:1.77
Pre-installed: Candle (HIP-0019), tokio, serde, reqwest
// src/handler.rs
use hanzo_functions::{Context, Response, Result};

pub async fn transcribe(ctx: Context) -> Result<Response> {
    let audio_url: String = ctx.data().get("audio_url")?;
    let audio = ctx.storage().get(&audio_url).await?;
    let model = ctx.model("whisper-small").await?;
    let transcript = model.transcribe(&audio).await?;

    ctx.publish("mq.pipeline.transcription.complete", &serde_json::json!({
        "audio_url": audio_url,
        "transcript": transcript,
    })).await?;

    Ok(Response::ok(serde_json::json!({
        "transcript": transcript,
    })))
}
TypeScript Runtime
Base image: ghcr.io/hanzoai/fn-typescript:22
Pre-installed: @xenova/transformers, onnxruntime-node, Hanzo SDK
// handler.ts
import { Context, Response } from '@hanzo/functions'

// Shape of the incoming payload; only the field used below is declared.
interface WebhookPayload {
  text: string
}

export async function webhook(ctx: Context): Promise<Response> {
  const event = ctx.data as WebhookPayload

  // Classify the incoming webhook
  const intent = await ctx.model('distilbert-intent').predict(event.text)

  // Route based on classification
  if (intent.label === 'support') {
    await ctx.publish('mq.notify.support', {
      channel: 'slack',
      message: event.text,
      metadata: { confidence: intent.score },
    })
  }

  return Response.ok({ routed: true, intent: intent.label })
}
Trigger Specification
Triggers connect external events to function invocations. A function can have multiple triggers of different types.
HTTP Trigger
Exposes the function as an HTTP endpoint on the invocation proxy (port 8061).
triggers:
  - type: http
    path: /v1/functions/my-function
    methods: [GET, POST]
    auth: required              # "required" (default), "optional", "none"
    rate_limit: 100/min         # Per-API-key rate limit
HTTP triggers create a Knative Route that maps the path to the function's Knative Service. Authentication is handled by the invocation proxy using IAM (HIP-0026) JWT validation.
MQ Trigger (HIP-0055)
Invokes the function when a message arrives on a NATS JetStream subject.
triggers:
  - type: mq
    source: mq.batch.embeddings
    consumer_group: fn-embed-document
    batch_size: 1               # Messages per invocation (default: 1)
    max_batch_wait: 5s          # Max wait to fill batch
The Trigger Manager runs a NATS consumer in the specified consumer group. When a message arrives, it invokes the function via HTTP and acknowledges the message only after the function returns successfully. If the function fails, the message is nacked and redelivered per the MQ queue's retry policy.
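The acknowledge-after-success rule can be sketched with a stand-in message type. The Message class and deliver function below are hypothetical names used for illustration; a real NATS client exposes its own ack and nak calls, and the Trigger Manager's actual delivery loop is not specified here.

```python
class Message:
    """Stand-in for a queue message with explicit ack/nak state."""

    def __init__(self, data):
        self.data = data
        self.state = "pending"

    def ack(self):
        self.state = "acked"

    def nak(self):
        self.state = "nacked"


def deliver(message, invoke):
    """Invoke the function; ack only on a 2xx status, nak on anything else."""
    try:
        status = invoke(message.data)
    except Exception:
        # A crashed invocation is treated like a failed one: redeliver later.
        message.nak()
        return False
    if 200 <= status < 300:
        message.ack()
        return True
    message.nak()
    return False


msg = Message({"document_id": "doc_abc123"})
deliver(msg, lambda data: 200)
print(msg.state)  # acked
```

Because the ack happens only after the function returns, a function that crashes mid-execution leaves the message eligible for redelivery, so handlers should be idempotent.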
Stream Trigger (HIP-0030)
Invokes the function when an event is published to a Kafka topic.
triggers:
  - type: stream
    topic: llm_usage
    consumer_group: fn-usage-aggregator
    batch_size: 100             # Events per invocation
    max_batch_wait: 10s
    start_offset: latest        # "latest" or "earliest"
The Trigger Manager runs a Kafka consumer. Events are batched and delivered to the function as an array in ctx.data. The consumer commits offsets only after successful function execution.
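The interaction between batch_size and max_batch_wait can be sketched over a list of timestamped events. This is a simplified, offline model with illustrative names: a live consumer would flush on a timer rather than waiting for the next event to arrive, and would commit offsets after each batch succeeds.

```python
def batch_events(events, batch_size, max_batch_wait):
    """Group (timestamp_seconds, payload) pairs into batches.

    A batch closes when it holds batch_size events, or when the next event
    arrives more than max_batch_wait seconds after the batch's first event.
    """
    batches, current, opened_at = [], [], None
    for ts, payload in events:
        if current and ts - opened_at > max_batch_wait:
            batches.append(current)
            current = []
        if not current:
            opened_at = ts  # First event of a new batch starts the wait clock.
        current.append(payload)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


events = [(0, "a"), (1, "b"), (2, "c"), (20, "d")]
print(batch_events(events, batch_size=2, max_batch_wait=10))
# [['a', 'b'], ['c'], ['d']]
```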
Cron Trigger
Invokes the function on a schedule.
triggers:
  - type: cron
    schedule: "0 */6 * * *"     # Every 6 hours (standard cron syntax)
    timezone: UTC
    payload: '{"type": "full_reconcile"}'
The Trigger Manager uses an internal scheduler (backed by PostgreSQL, not Kubernetes CronJobs) to fire cron triggers. This avoids the Kubernetes CronJob limitation of 1-minute granularity and provides better observability through the management API.
Inference Event Trigger
Invokes the function in response to LLM Gateway (HIP-0004) inference events. This is a convenience trigger built on top of the Stream trigger that filters llm_usage events.
triggers:
  - type: inference
    models: [zen-7b, zen-14b]          # Filter by model
    event_types: [completed, failed]   # Filter by outcome
    min_tokens: 1000                   # Filter by token count
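The three filters combine as a logical AND. The sketch below shows one way to evaluate them against a usage event; the event field names (model, event_type, tokens) are assumptions made for illustration, since the llm_usage schema is not defined in this section.

```python
def matches(event, models=None, event_types=None, min_tokens=0):
    """Return True if a usage event passes every configured filter.

    A filter left as None (or min_tokens=0) is treated as "match anything".
    """
    if models is not None and event.get("model") not in models:
        return False
    if event_types is not None and event.get("event_type") not in event_types:
        return False
    return event.get("tokens", 0) >= min_tokens


trigger = {
    "models": ["zen-7b", "zen-14b"],
    "event_types": ["completed", "failed"],
    "min_tokens": 1000,
}
event = {"model": "zen-7b", "event_type": "completed", "tokens": 1500}
print(matches(event, **trigger))  # True
```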
GPU Function Execution
GPU functions are the distinguishing feature of Hanzo Functions. This section specifies the mechanisms that make GPU cold starts tolerable.
Pre-Warmed GPU Pool
The cluster maintains a pool of GPU-equipped pods that are pre-initialized with the CUDA runtime and a base function container. These pods are idle but warm -- the CUDA context is loaded, the GPU driver is initialized, and the base container is running.
When a GPU function is invoked:
- The scheduler selects a warm pod from the pool (0ms scheduling delay).
- The function code and dependencies are injected into the warm pod via a volume mount (200-500ms).
- The model is loaded from the model cache (0ms if cached, 1-10s if not).
- The function executes.
- After execution, the pod returns to the pool for reuse.
Pool sizing:
gpu_pool:
  size: 4                       # Pre-warmed GPU pods
  gpu_type: nvidia-a10g         # GPU type for pool pods
  idle_timeout: 600s            # Return to pool after 10 min idle
  max_model_cache: 8Gi          # Per-pod model cache size
  preload_models:               # Models loaded at pool startup
    - bge-large-en-v1.5
    - whisper-small
    - distilbert-intent
Container Snapshots
Traditional container cold start involves pulling the image, unpacking layers, and initializing the runtime. For a Python + PyTorch image (5GB+), this takes 5-15 seconds even from a local registry.
Hanzo Functions uses container snapshots (CRIU-based checkpoint/restore) to reduce cold start to <1 second:
- Snapshot creation: When a function is deployed, the platform runs the container, initializes the runtime (imports, CUDA setup, model loading), and creates a CRIU checkpoint of the process state.
- Snapshot restore: On cold start, instead of starting the container from scratch, the platform restores the checkpoint. All imports are loaded, CUDA is initialized, and models are in memory. The function is ready to execute in <1 second.
Snapshots are stored in Object Storage (HIP-0032) and cached on local SSD at each node. They are invalidated when the function code or runtime version changes.
Traditional cold start: Pull image (5s) → Start container (1s) → Import torch (3s) → Load CUDA (2s) → Load model (3s) = 14s
Snapshot cold start: Restore checkpoint (800ms) → Ready
Snapshot support is available for Python and Rust runtimes. Go and TypeScript runtimes already have fast cold starts (<1s) without snapshots due to their lightweight initialization.
Model Cache
GPU functions frequently load the same models. The model cache is a node-local LRU cache backed by NVMe SSD that stores model weights in a ready-to-load format.
Model requested by function
  --> Check node-local cache (NVMe SSD)
      --> Hit: mmap into GPU memory (50-200ms)
      --> Miss: Download from Object Storage (HIP-0032) (1-10s)
          --> Cache locally
          --> mmap into GPU memory
The cache operates at the node level, not the pod level. Multiple function pods on the same node share the cache. Cache eviction uses LRU with a configurable size limit (default: 50GB per node).
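The eviction policy can be sketched as a size-bounded LRU keyed by model name. This is a minimal in-memory illustration, not the on-disk implementation; the class name, the download callback, and the counters (which loosely mirror the hit, miss, and eviction metrics) are assumptions.

```python
from collections import OrderedDict


class ModelCache:
    """Size-bounded LRU cache keyed by model name (sizes in bytes)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.hits = self.misses = self.evictions = 0
        self._entries = OrderedDict()  # name -> size, least recently used first

    def __contains__(self, name):
        return name in self._entries

    def get(self, name, size, download):
        """Return 'hit' or 'miss'; on a miss, call download(name) and cache it."""
        if name in self._entries:
            self._entries.move_to_end(name)  # Mark as most recently used.
            self.hits += 1
            return "hit"
        self.misses += 1
        download(name)
        if size > self.max_bytes:
            return "miss"  # Larger than the whole cache: serve without caching.
        # Evict least recently used models until the new one fits.
        while self.used + size > self.max_bytes:
            _, evicted_size = self._entries.popitem(last=False)
            self.used -= evicted_size
            self.evictions += 1
        self._entries[name] = size
        self.used += size
        return "miss"


cache = ModelCache(max_bytes=10)
for model in ["a", "b", "a", "c"]:
    print(model, cache.get(model, 4, download=lambda name: None))
print(cache.evictions)  # 1
```

In the usage above, model "b" is evicted rather than "a" because the second request for "a" refreshed its position.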
Cache metrics:
hanzo_fn_model_cache_hits_total{node, model}
hanzo_fn_model_cache_misses_total{node, model}
hanzo_fn_model_cache_size_bytes{node}
hanzo_fn_model_cache_evictions_total{node}
hanzo_fn_model_load_duration_seconds{model}
Control Plane API
The control plane runs on port 8060 and manages function lifecycle.
| Endpoint | Method | Description |
|---|---|---|
| /v1/functions | GET | List all functions (paginated, filterable by runtime/trigger type) |
| /v1/functions | POST | Create a new function (accepts function.yaml + source archive) |
| /v1/functions/{name} | GET | Function detail: revisions, triggers, scaling config |
| /v1/functions/{name} | PUT | Update function (creates new revision) |
| /v1/functions/{name} | DELETE | Delete function and all revisions |
| /v1/functions/{name}/revisions | GET | List revisions with traffic allocation |
| /v1/functions/{name}/revisions/{rev} | GET | Revision detail: instances, metrics |
| /v1/functions/{name}/invoke | POST | Synchronous invocation (waits for response) |
| /v1/functions/{name}/invoke-async | POST | Asynchronous invocation (returns task ID) |
| /v1/functions/{name}/logs | GET | Function execution logs (streaming SSE) |
| /v1/functions/{name}/metrics | GET | Per-function metrics (invocations, latency, errors) |
| /v1/functions/{name}/traffic | PUT | Update traffic split between revisions |
| /v1/triggers | GET | List all active triggers |
| /v1/triggers/{id} | GET | Trigger detail: source, function, status |
| /v1/gpu-pool | GET | GPU pool status: warm pods, cache utilization |
| /health | GET | Control plane health |
Invocation Protocol
Functions are invoked via HTTP POST to the function's Knative Service endpoint. The invocation proxy (port 8061) handles routing, authentication, and trigger-specific payload transformation.
Request Format
POST /v1/functions/embed-document/invoke
Content-Type: application/json
Authorization: Bearer <jwt>
X-Function-Trigger: mq
X-Function-Trigger-Source: mq.batch.embeddings
X-Request-ID: req_01HQ3X7K8M2N4P5R6S7T8U9V0W

{
  "document_id": "doc_abc123",
  "options": { "model": "bge-large-en-v1.5" }
}
Response Format
HTTP/1.1 200 OK
Content-Type: application/json
X-Function-Name: embed-document
X-Function-Revision: embed-document-00003
X-Function-Duration-Ms: 1250
X-Function-Instance: fn-embed-document-00003-deployment-abc12-xyz

{
  "document_id": "doc_abc123",
  "dimensions": 1024,
  "chunks_processed": 42
}
Traffic Splitting
Functions support gradual rollouts via traffic splitting between revisions:
# Deploy new version (creates revision 4)
hanzo-fn deploy --name embed-document --source ./src
# Route 10% of traffic to the new revision
hanzo-fn traffic embed-document --revision 4 --percent 10
# Monitor metrics, then promote
hanzo-fn traffic embed-document --revision 4 --percent 100
Traffic splitting is implemented via Knative's traffic configuration:
traffic:
  - revisionName: embed-document-00003
    percent: 90
  - revisionName: embed-document-00004
    percent: 10
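The effect of such a split on individual requests can be sketched as a weighted random choice. This illustrates the routing behavior only, under the assumption that each request is routed independently; it is not how Knative's ingress is implemented, and pick_revision is a hypothetical helper.

```python
import random
from collections import Counter


def pick_revision(traffic, rng=random):
    """Pick a revision name with probability proportional to its percent."""
    if sum(entry["percent"] for entry in traffic) != 100:
        raise ValueError("traffic percentages must sum to 100")
    names = [entry["revisionName"] for entry in traffic]
    weights = [entry["percent"] for entry in traffic]
    return rng.choices(names, weights=weights, k=1)[0]


traffic = [
    {"revisionName": "embed-document-00003", "percent": 90},
    {"revisionName": "embed-document-00004", "percent": 10},
]
rng = random.Random(0)  # Seeded so repeated runs give the same counts.
counts = Counter(pick_revision(traffic, rng) for _ in range(1000))
print(counts)  # Roughly 900 requests to 00003 and 100 to 00004
```

Promotion to 100% is the same structure with the weights changed, which is why a rollback is a configuration update rather than a redeploy.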
Prometheus Metrics
Metrics are exported on port 9060 with namespace hanzo_fn:
| Metric | Type | Description |
|---|---|---|
| hanzo_fn_invocations_total | Counter | Total invocations by function, trigger type, status |
| hanzo_fn_duration_seconds | Histogram | End-to-end execution time (includes cold start) |
| hanzo_fn_cold_start_duration_seconds | Histogram | Cold start time by function, runtime |
| hanzo_fn_cold_starts_total | Counter | Cold starts by function (vs. warm invocations) |
| hanzo_fn_errors_total | Counter | Errors by function, error type |
| hanzo_fn_concurrent_executions | Gauge | Currently executing instances per function |
| hanzo_fn_gpu_utilization | Gauge | GPU utilization per function (GPU functions only) |
| hanzo_fn_gpu_memory_bytes | Gauge | GPU memory usage per function |
| hanzo_fn_gpu_pool_available | Gauge | Available warm GPU pods in pool |
| hanzo_fn_gpu_pool_in_use | Gauge | GPU pods currently executing functions |
| hanzo_fn_model_cache_hit_rate | Gauge | Model cache hit ratio per node |
| hanzo_fn_trigger_lag | Gauge | Lag between event arrival and function invocation |
| hanzo_fn_snapshot_restore_seconds | Histogram | CRIU snapshot restore time |
Implementation
CLI
The hanzo-fn CLI is the primary developer interface:
# Initialize a new function project
hanzo-fn init --runtime python --name my-function
# Local development (runs function locally with hot reload)
hanzo-fn dev --port 8080
# Deploy to cluster
hanzo-fn deploy --name my-function --source ./src
# Invoke a deployed function
hanzo-fn invoke my-function --data '{"key": "value"}'
# View logs
hanzo-fn logs my-function --follow
# View metrics
hanzo-fn metrics my-function
# List all functions
hanzo-fn list
# Delete a function
hanzo-fn delete my-function
Kubernetes Deployment
Control Plane
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-functions
  namespace: hanzo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hanzo-functions
  template:
    metadata:
      labels:
        app: hanzo-functions    # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: control-plane
          image: ghcr.io/hanzoai/functions:latest
          args: ["serve", "--config", "/etc/functions/config.yaml"]
          ports:
            - containerPort: 8060
              name: api
            - containerPort: 8061
              name: invoke
            - containerPort: 9060
              name: metrics
          resources:
            requests: { cpu: "500m", memory: "512Mi" }
            limits: { cpu: "2000m", memory: "2Gi" }
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: hanzo-functions-db
                  key: url
            - name: NATS_URL
              value: nats://nats-mq.hanzo.svc:4222
            - name: KAFKA_BROKERS
              value: insights-kafka-0.insights-kafka.hanzo.svc:9092
          volumeMounts:
            - name: config
              mountPath: /etc/functions
      volumes:
        - name: config
          configMap:
            name: functions-config
GPU Pool DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hanzo-fn-gpu-pool
  namespace: hanzo
spec:
  selector:
    matchLabels:
      app: hanzo-fn-gpu-pool
  template:
    metadata:
      labels:
        app: hanzo-fn-gpu-pool  # Must match spec.selector.matchLabels
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: gpu-warm
          image: ghcr.io/hanzoai/fn-python:3.12-cuda
          command: ["hanzo-fn-agent", "--mode=pool", "--model-cache=/cache"]
          resources:
            requests:
              cpu: "1000m"
              memory: "4Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4000m"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /cache
            - name: snapshots
              mountPath: /snapshots
      volumes:
        - name: model-cache
          hostPath:
            path: /var/hanzo/model-cache
            type: DirectoryOrCreate
        - name: snapshots
          hostPath:
            path: /var/hanzo/snapshots
            type: DirectoryOrCreate
Docker Development
# compose.yml
services:
  functions:
    image: ghcr.io/hanzoai/functions:latest
    ports:
      - "8060:8060"
      - "8061:8061"
      - "9060:9060"
    environment:
      DATABASE_URL: postgresql://hanzo:hanzo@postgres:5432/hanzo_functions
      NATS_URL: nats://nats:4222
      KAFKA_BROKERS: kafka:9092
      OBJECT_STORAGE_URL: http://minio:9000
      GPU_POOL_ENABLED: "false"  # No GPU in dev
    volumes:
      - ./config.yaml:/etc/functions/config.yaml
  # Local function runner (no Knative needed for dev)
  function-runner:
    image: ghcr.io/hanzoai/fn-python:3.12
    ports:
      - "8080:8080"
    volumes:
      - ./my-function:/app
    command: ["hanzo-fn-agent", "--mode=dev", "--source=/app"]
Configuration
# config.yaml
server:
  api_port: 8060
  invoke_port: 8061
  metrics_port: 9060
database:
  url: ${DATABASE_URL}
  max_connections: 20
knative:
  namespace: hanzo-functions
  domain: fn.hanzo.ai
triggers:
  nats:
    url: ${NATS_URL}
    credentials:
      user: functions-trigger
      password: ${NATS_PASSWORD}
  kafka:
    brokers: ${KAFKA_BROKERS}
    group_prefix: fn-
  cron:
    store: database  # Cron state in PostgreSQL
gpu_pool:
  enabled: true
  size: 4
  gpu_type: nvidia.com/gpu
  idle_timeout: 600s
  max_model_cache: 50Gi
  preload_models:
    - bge-large-en-v1.5
    - whisper-small
snapshots:
  enabled: true
  storage: s3://hanzo-functions/snapshots/
  max_age: 7d
runtimes:
  python:
    image: ghcr.io/hanzoai/fn-python:3.12
    gpu_image: ghcr.io/hanzoai/fn-python:3.12-cuda
    default_timeout: 60s
    max_timeout: 900s
  go:
    image: ghcr.io/hanzoai/fn-go:1.22
    default_timeout: 30s
    max_timeout: 300s
  rust:
    image: ghcr.io/hanzoai/fn-rust:1.77
    gpu_image: ghcr.io/hanzoai/fn-rust:1.77-cuda
    default_timeout: 60s
    max_timeout: 900s
  typescript:
    image: ghcr.io/hanzoai/fn-typescript:22
    default_timeout: 30s
    max_timeout: 300s
observability:
  log_level: info
  trace_sampling: 0.1
  metrics_namespace: hanzo_fn
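The configuration above relies on two conventions that are easy to get wrong: `${VAR}` placeholders resolved from the environment, and duration strings such as `600s` or `7d`. The following is a minimal sketch of how a loader might handle both and derive a function's effective timeout from a runtime entry. The helper names (`expand_env`, `parse_duration`, `effective_timeout`) are hypothetical, and clamping an oversized request to `max_timeout` instead of rejecting it is an assumption, not documented behavior.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def expand_env(value, env=None):
    """Replace ${VAR} placeholders; fail loudly if a variable is unset."""
    env = os.environ if env is None else env

    def _sub(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_sub, value)


def parse_duration(text):
    """Convert strings such as '600s' or '7d' to seconds."""
    number, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(number) * _UNITS[unit]


def effective_timeout(runtime_cfg, requested=None):
    """Use the runtime default when nothing is requested; cap at max_timeout."""
    default = parse_duration(runtime_cfg["default_timeout"])
    maximum = parse_duration(runtime_cfg["max_timeout"])
    wanted = default if requested is None else parse_duration(requested)
    return min(wanted, maximum)  # assumption: clamp instead of reject


python_rt = {"default_timeout": "60s", "max_timeout": "900s"}
print(effective_timeout(python_rt))         # 60
print(effective_timeout(python_rt, "30m"))  # 900
```

Failing on an unset variable at load time surfaces a missing `NATS_PASSWORD` before the control plane starts accepting deployments.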
Implementation Roadmap
Phase 1: Core Platform (Q1 2026)
- Knative Serving integration with Hanzo control plane
- Python and TypeScript runtimes with CPU execution
- HTTP triggers with IAM authentication
- CLI for deploy/invoke/logs
- Prometheus metrics export
Phase 2: Event Triggers (Q1 2026)
- MQ trigger (NATS JetStream consumer)
- Stream trigger (Kafka consumer)
- Cron trigger with PostgreSQL-backed scheduler
- Async invocation with task status tracking
Phase 3: GPU Functions (Q2 2026)
- Pre-warmed GPU pool with CUDA-initialized containers
- Model cache with LRU eviction on NVMe SSD
- Python + CUDA and Rust + Candle GPU runtimes
- Container snapshots (CRIU) for Python GPU functions
Phase 4: Edge Integration (Q2 2026)
- Sync TypeScript functions to Edge (HIP-0050) as V8 isolates
- Unified deployment: single function.yaml deploys to both origin and edge
- Latency-based routing: edge for CPU functions, origin for GPU functions
Phase 5: Advanced Features (Q3 2026)
- Go runtime with ONNX bindings
- Traffic splitting and canary deployments
- Function composition (output of one function triggers another)
- Cost attribution per function per org (integrated with billing)
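The Phase 3 model cache is described only as "LRU eviction" bounded by a size budget (`max_model_cache` in the configuration). As an illustration of that policy, and not the planned implementation, the sketch below tracks cached model sizes and reports which models to evict when a new one does not fit. The class name `ModelCache` and its methods are hypothetical; deleting the files from disk is left to the caller.

```python
from collections import OrderedDict


class ModelCache:
    """Size-bounded LRU index of cached models (bookkeeping only)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # model name -> size, least recent first

    def touch(self, name):
        """Record a cache hit. Returns False on a miss."""
        if name not in self._entries:
            return False
        self._entries.move_to_end(name)  # mark as most recently used
        return True

    def add(self, name, size_bytes):
        """Insert a model, evicting least recently used entries to make room.

        Returns the evicted model names so the caller can delete their files.
        """
        if size_bytes > self.max_bytes:
            raise ValueError(f"{name} is larger than the whole cache")
        if name in self._entries:
            self.used_bytes -= self._entries.pop(name)
        evicted = []
        while self.used_bytes + size_bytes > self.max_bytes:
            old_name, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_name)
        self._entries[name] = size_bytes
        self.used_bytes += size_bytes
        return evicted
```

Counting `touch` hits against misses is also one way the `hanzo_fn_model_cache_hit_rate` gauge could be derived per node.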
Security Considerations
Function Isolation
Each function instance runs in its own Kubernetes pod with:
- Network namespace isolation: Functions cannot communicate with each other directly. All inter-function communication goes through MQ or Stream.
- Filesystem isolation: Read-only root filesystem. Writable /tmp with size limits (512MB default).
- Resource limits: CPU, memory, and GPU limits enforced by Kubernetes. Functions that exceed their memory limit are OOM-killed; CPU usage above the limit is throttled.
- Service account: Each function runs with a dedicated Kubernetes service account with minimal RBAC permissions.
Secret Management
Function secrets are sourced from KMS (HIP-0027) and injected as environment variables. Secrets are never stored in function.yaml, the control plane database, or container images.
secrets:
  - name: hanzo-api-key         # KMS secret name
    env: HANZO_API_KEY          # Environment variable name in function
  - name: db-connection-string
    env: DATABASE_URL
The control plane fetches secrets from KMS at deployment time and creates Kubernetes Secrets that are exposed to function pods as environment variables. Secret rotation triggers a rolling update of function pods, so running instances pick up the new values.
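The deployment-time flow described here (resolve each `secrets` entry against KMS, then materialize a Kubernetes Secret) can be sketched as follows. This is an illustration under stated assumptions: `fetch_secret` stands in for a KMS client call and is hypothetical, as are the function names and the `<function>-secrets` naming scheme. The manifest shape (`type: Opaque`, base64-encoded `data`) is the standard Kubernetes Secret format.

```python
import base64


def build_secret_env(secret_specs, fetch_secret):
    """Map function.yaml `secrets` entries to environment variables.

    fetch_secret is a hypothetical callable: KMS secret name -> plaintext value.
    """
    env = {}
    for spec in secret_specs:
        name, var = spec["name"], spec["env"]
        if var in env:
            raise ValueError(f"environment variable {var} is mapped twice")
        env[var] = fetch_secret(name)
    return env


def to_kubernetes_secret(function_name, env, namespace="hanzo-functions"):
    """Build a Secret manifest; the '<function>-secrets' name is an assumption."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{function_name}-secrets", "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode()).decode()
            for key, value in env.items()
        },
    }


# In-memory stand-in for KMS, for illustration only.
fake_kms = {"hanzo-api-key": "sk-test", "db-connection-string": "postgresql://db/app"}
specs = [
    {"name": "hanzo-api-key", "env": "HANZO_API_KEY"},
    {"name": "db-connection-string", "env": "DATABASE_URL"},
]
manifest = to_kubernetes_secret("my-function", build_secret_env(specs, fake_kms.__getitem__))
```

Because only secret names appear in `function.yaml`, the plaintext values exist only in KMS and in the generated Secret, which matches the storage rule stated above.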
Authentication and Authorization
- Function deployment: Requires IAM authentication with functions:deploy permission scoped to the organization.
- HTTP trigger invocation: Requires IAM JWT or API key. Configurable per trigger (auth: required, optional, or none).
- MQ/Stream triggers: Authenticated via NATS/Kafka credentials managed by the control plane. Individual functions do not handle message bus authentication.
- Control plane API: All endpoints require IAM authentication with appropriate functions:* permissions.
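The three per-trigger `auth` modes for HTTP invocation can be read as a small decision table. The sketch below shows one plausible interpretation; the function name and the `verify` callable are hypothetical, and the choice to reject an invalid credential even in `optional` mode is an assumption the text does not settle.

```python
def authorize_http_invocation(auth_mode, credential, verify):
    """Return (allowed, identity) for an HTTP trigger request.

    auth_mode:  "required", "optional", or "none"
    credential: IAM JWT or API key from the request, or None if absent
    verify:     hypothetical callable returning an identity, or None if invalid
    """
    if auth_mode not in ("required", "optional", "none"):
        raise ValueError(f"unknown auth mode: {auth_mode!r}")
    if auth_mode == "none":
        return True, None  # credentials are ignored entirely
    if credential is None:
        # Anonymous request: allowed only when auth is optional.
        return auth_mode == "optional", None
    identity = verify(credential)
    if identity is None:
        # Assumption: a presented but invalid credential is always rejected.
        return False, None
    return True, identity
```

With `optional`, a function can serve anonymous callers while still receiving a verified identity when one is supplied.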
Network Policies
Functions can only communicate with:
- Hanzo internal services (IAM, KV, Object Storage, MQ, Stream) via their cluster-internal endpoints
- External URLs explicitly allowlisted in the function configuration
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: hanzo-functions-egress
  namespace: hanzo-functions
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/managed-by: hanzo-functions
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: hanzo
      ports:
        - protocol: TCP
          port: 4222  # NATS
        - protocol: TCP
          port: 9092  # Kafka
        - protocol: TCP
          port: 6379  # Valkey
        - protocol: TCP
          port: 9000  # MinIO
        - protocol: TCP
          port: 5432  # PostgreSQL
  policyTypes:
    - Egress
Execution Time Limits
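Functions have a hard timeout that depends on the runtime: default 60s and maximum 900s for Python and Rust, default 30s and maximum 300s for Go and TypeScript (see Configuration). The invocation proxy terminates functions that exceed their timeout and returns a 504 Gateway Timeout to the caller. For MQ/Stream triggers, the message is nacked and redelivered.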
This prevents runaway GPU consumption from buggy or malicious functions. A function that enters an infinite loop will be killed after its timeout, and the GPU is returned to the pool.
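The timeout behavior can be illustrated with a small asyncio sketch: run the handler under a deadline, return 504 on expiry, and nack the message for MQ/Stream triggers so it is redelivered. This shows only the control flow. The real proxy terminates the function instance, whereas this sketch merely cancels a coroutine; `invoke_with_timeout` and the `ack`/`nack` callbacks are hypothetical names.

```python
import asyncio


async def invoke_with_timeout(handler, event, timeout_s, ack=None, nack=None):
    """Run an async handler under a hard timeout.

    Returns (status_code, body). ack/nack are optional callbacks used by
    MQ/Stream triggers; HTTP triggers leave them as None.
    """
    try:
        body = await asyncio.wait_for(handler(event), timeout=timeout_s)
    except asyncio.TimeoutError:
        if nack is not None:
            nack()  # negative acknowledgement: the broker redelivers
        return 504, "Gateway Timeout"
    if ack is not None:
        ack()
    return 200, body


async def slow_handler(event):
    await asyncio.sleep(1)  # simulates a function that overruns its budget
    return "done"


status, body = asyncio.run(invoke_with_timeout(slow_handler, {}, timeout_s=0.05))
print(status, body)  # 504 Gateway Timeout
```

Redelivery after a timeout means MQ and Stream handlers should be idempotent, since the same message may be processed more than once.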
Supply Chain Security
Function container images are built from Hanzo-maintained base images. These base images are:
- Built from minimal base images (distroless for Go/Rust, slim for Python/TypeScript)
- Scanned for CVEs on every build via Trivy
- Signed with cosign and verified at deployment time
- Pinned to specific digests in the function manifest (not mutable tags)
User function code is injected into these base images at deployment time. Before injection, the control plane validates the source archive and rejects it if it contains executable binaries, symlinks that point outside the function directory, or any file larger than 100MB.
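The three archive checks can be sketched with the standard-library `tarfile` module. This is an illustration only: the archive format, the magic-number heuristic for detecting executables (ELF, PE, and 64-bit Mach-O headers), and the reading of 100MB as 100 MiB are all assumptions, and `validate_source_archive` is a hypothetical name.

```python
import posixpath
import tarfile

MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB
BINARY_MAGICS = (b"\x7fELF", b"MZ", b"\xcf\xfa\xed\xfe")  # ELF, PE, Mach-O 64


def _escapes(path):
    """True if a normalized POSIX path leaves the archive root."""
    return path.startswith("/") or path == ".." or path.startswith("../")


def validate_source_archive(fileobj):
    """Return a list of violations found in a tar archive (empty means OK)."""
    problems = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            path = posixpath.normpath(member.name)
            if _escapes(path):
                problems.append(f"{member.name}: path escapes function directory")
                continue
            if member.issym() or member.islnk():
                target = member.linkname
                if member.issym():
                    # Symlink targets resolve relative to the link's directory.
                    target = posixpath.join(posixpath.dirname(path), target)
                if _escapes(posixpath.normpath(target)):
                    problems.append(f"{member.name}: link points outside function directory")
                continue
            if not member.isfile():
                continue
            if member.size > MAX_FILE_BYTES:
                problems.append(f"{member.name}: larger than 100MB")
                continue
            head = archive.extractfile(member).read(4)
            if head.startswith(BINARY_MAGICS):
                problems.append(f"{member.name}: looks like an executable binary")
    return problems
```

The checks read only member headers and the first four bytes of each file, so validation cost stays low even for archives near the size limit. The short `MZ` prefix can misclassify a text file that happens to start with those letters, which a production check would need to refine.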
Relationship to Other HIPs
| HIP | Relationship |
|---|---|
| HIP-4 (LLM Gateway) | Inference event trigger source. Functions process LLM usage events. |
| HIP-19 (Tensor Operations) | Candle library used in Rust GPU runtime for tensor operations. |
| HIP-26 (IAM) | Authentication for function deployment and HTTP trigger invocation. |
| HIP-27 (KMS) | Secret injection into function environments. |
| HIP-28 (KV Store) | Functions access Valkey via ctx.kv for caching and state. |
| HIP-30 (Event Streaming) | Stream trigger consumes Kafka topics. Functions publish to Stream. |
| HIP-31 (Observability) | Prometheus metrics and structured logging. |
| HIP-32 (Object Storage) | Model cache storage. Container snapshot storage. Function access via ctx.storage. |
| HIP-37 (AI Cloud) | Functions are a deployment target within the Cloud platform. |
| HIP-43 (Inference Engine) | Persistent serving complement. Engine for steady-state; Functions for bursty. |
| HIP-50 (Edge Computing) | TypeScript functions sync to Edge for latency-sensitive CPU workloads. |
| HIP-55 (Message Queue) | MQ trigger consumes NATS subjects. Functions publish to MQ via ctx.publish. |
| HIP-57 (ML Pipeline) | Pipeline stages can be implemented as functions. Retraining triggers. |
| HIP-105 (In-Process Extension Runtime) | Complementary, different workload class. HIP-60 runs full containerized functions in Knative pods (cold start in seconds, GPU-attachable). HIP-105 runs in-process wasm/JS/Go extensions inside a host service (cold start in microseconds, no pod). Rule of thumb: if the work justifies a fresh pod, HIP-60; if it's a hot-path validator or per-record hook, HIP-105. |
References
- Knative Serving Documentation
- Knative Autoscaling
- CRIU: Checkpoint/Restore in Userspace
- NVIDIA Container Toolkit
- HIP-4: LLM Gateway
- HIP-30: Event Streaming Standard
- HIP-43: LLM Inference Engine Standard
- HIP-50: Edge Computing Standard
- HIP-55: Message Queue Standard
- HIP-57: ML Pipeline & Training Standard
- OpenFaaS Architecture
- AWS Lambda Execution Environment
- Hanzo Functions Repository
Copyright
Copyright and related rights waived via CC0.