HIP-52: Nexus Integration Hub Standard
Abstract
This proposal defines the Nexus Integration Hub, a lightweight service mesh and API orchestration layer for Hanzo Cloud. Nexus provides service discovery, circuit breaking, protocol translation, traffic management, and health-aware routing for all 33+ internal services in the Hanzo ecosystem.
Nexus is not a general-purpose service mesh. It is a Go binary purpose-built for Hanzo's topology: two DOKS Kubernetes clusters, a mix of gRPC and REST backends, and a traffic pattern dominated by fan-out from the API Gateway (HIP-0044) to multiple downstream services per request. It replaces ad hoc service-to-service communication with a configuration-driven control plane and a transparent data plane proxy.
Repository: github.com/hanzoai/nexus
Control Plane Port: 8052
Data Plane Port: 15001
Docker: ghcr.io/hanzoai/nexus:latest
Motivation
The Problem: 33 Services, No Mesh
Hanzo runs 33+ services across two Kubernetes clusters (hanzo-k8s and lux-k8s). Today, inter-service communication relies on Kubernetes Service DNS and basic ClusterIP routing. This works -- until it does not.
Consider what happens when the LLM Gateway (port 4000) calls IAM (port 8000) for token validation, then calls Search (port 3000) for context retrieval, then calls Analytics (port 9090) for usage tracking. Each call is a bare HTTP request with no retry logic, no circuit breaker, no timeout propagation, and no distributed trace. When IAM is slow, the LLM Gateway blocks. When Search is down, the Gateway returns a 500 to the user. When Analytics drops a write, the data is lost silently.
These failure modes compound. In a 33-service topology, the blast radius of a single slow service is the entire platform.
Five Specific Problems
- No circuit breaking. A degraded IAM instance accepting connections but responding in 30+ seconds will cascade latency into every service that validates tokens. Without circuit breaking, the entire platform slows to IAM's pace.
- No retry budget. Services either do not retry (losing data on transient failures) or retry aggressively (amplifying load on an already struggling service). There is no coordinated retry budget across the call graph.
- No protocol translation. Some services speak gRPC (Candle inference, node RPC), others speak REST (IAM, Commerce, Console). Services that need to call across protocol boundaries must embed translation logic, duplicating marshaling code.
- No traffic management. Deploying a new version of any service is all-or-nothing. There is no mechanism for canary deployments, traffic splitting, or header-based routing to test builds.
- No service dependency awareness. Kubernetes readiness probes check whether a single container is healthy. They do not check whether a service's dependencies are healthy. A service can be "ready" while its database is unreachable.
Why Not Istio or Linkerd?
This is the obvious question. Both are mature service meshes with large communities. We evaluated both and rejected them for Hanzo's use case.
Istio is architecturally complex. It requires Envoy sidecars (50-100 MB memory per pod), a control plane (istiod), and generates substantial configuration via CRDs. For a 33-service deployment on two modest DOKS clusters, Istio's resource overhead would consume 15-20% of available cluster capacity. Its Envoy-based data plane adds 2-5 ms of latency per hop. And its configuration surface area (VirtualService, DestinationRule, Gateway, PeerAuthentication, AuthorizationPolicy, ServiceEntry, Sidecar, EnvoyFilter) is larger than the services it manages.
Linkerd is lighter than Istio but still requires a control plane (destination, identity, proxy-injector) and Rust-based micro-proxies. Its resource overhead is lower (~20 MB per sidecar) but still non-trivial at scale. More importantly, Linkerd's mTLS identity system conflicts with our Zero Trust architecture (HIP-0054), which uses SPIFFE-based identity managed by a dedicated certificate authority.
Nexus is purpose-built for Hanzo. It runs as a single Go binary per node (DaemonSet, not sidecar), consuming ~15 MB of memory. It understands Hanzo's service topology natively. It delegates mTLS to the Zero Trust layer (HIP-0054) rather than reimplementing it. It integrates directly with Zap (HIP-0031) for traces rather than requiring adapter configuration. And its entire configuration fits in a single YAML file.
The Nexus Approach
            External Traffic
                    |
              ┌─────v─────┐
              │  Gateway  │  HIP-0044 (KrakenD)
              │   :8080   │
              └─────┬─────┘
                    |
              ┌─────v─────┐
              │   Nexus   │  Control Plane :8052
              │  Control  │  Service registry, config, health
              └─────┬─────┘
                    |
      ┌─────────────┼─────────────┐
      |             |             |
┌─────v─────┐  ┌────v─────┐  ┌────v─────┐
│   Nexus   │  │  Nexus   │  │  Nexus   │
│   Data    │  │  Data    │  │  Data    │
│  :15001   │  │  :15001  │  │  :15001  │
└─────┬─────┘  └────┬─────┘  └────┬─────┘
      |             |             |
┌─────v─────┐  ┌────v─────┐  ┌────v─────┐
│    IAM    │  │  LLM GW  │  │  Search  │
│   :8000   │  │  :4000   │  │  :3000   │
└───────────┘  └──────────┘  └──────────┘
Nexus has two planes:
- Control plane (:8052): Maintains the service registry, distributes routing configuration, aggregates health checks, and exposes the management API.
- Data plane (:15001): Runs on every node as a DaemonSet. Intercepts outbound service-to-service traffic (via iptables REDIRECT or explicit proxy configuration), applies routing rules, circuit breaking, retries, and protocol translation.
Design Philosophy
Principle 1: One Binary, One Config
Nexus is a single statically linked Go binary. The control plane and data plane are compiled into the same binary and selected at startup via nexus control or nexus proxy. There is no code generation, no CRDs, no operator pattern. Configuration is a single YAML file distributed via ConfigMap.
Principle 2: DaemonSet, Not Sidecar
Istio and Linkerd inject a proxy sidecar into every pod. This means N sidecars for N pods, each with independent memory, CPU, and connection pools. Nexus runs one proxy per node. All pods on that node route through the same proxy. This reduces memory overhead from O(pods) to O(nodes) and simplifies certificate management.
Principle 3: Delegate What Others Do Better
Nexus does not implement mTLS (HIP-0054 does that), does not implement external routing (HIP-0044 does that), and does not implement metric collection (HIP-0031 does that). Nexus emits OpenTelemetry spans and Prometheus metrics in the formats those systems expect. It trusts SPIFFE identities injected by the Zero Trust layer. It accepts traffic from the API Gateway without re-authenticating it. Each system does one thing well.
Principle 4: Fail Open, Not Closed
When the Nexus control plane is unreachable, the data plane continues routing with its last known configuration. When the data plane proxy itself fails, iptables rules are removed and traffic routes directly to services (bypassing mesh features but preserving availability). The mesh is an enhancement, not a gate.
Specification
Service Registry
Every Hanzo service registers with the Nexus control plane at startup. Registration is automatic for Kubernetes-native services (Nexus watches the Kubernetes API for Service and Endpoint resources) and explicit for external services.
# nexus.yaml - Service registry and routing configuration
registry:
  # Automatic discovery from Kubernetes
  kubernetes:
    enabled: true
    namespaces:
      - hanzo
      - lux
    label_selector: "hanzo.ai/mesh=enabled"
  # Explicit service entries for external dependencies
  external:
    - name: postgres
      endpoints:
        - address: postgres.hanzo.svc
          port: 5432
      protocol: tcp
      health_check:
        type: tcp
        interval: 10s
        timeout: 3s
    - name: redis
      endpoints:
        - address: redis.hanzo.svc
          port: 6379
      protocol: tcp
      health_check:
        type: command
        command: ["redis-cli", "ping"]
        interval: 5s
Service Definition
Each service declares its mesh configuration via annotations or the central config file:
services:
  iam:
    address: iam.hanzo.svc
    port: 8000
    protocol: http
    timeout: 5s
    retry:
      attempts: 3
      per_try_timeout: 2s
      retry_on: [503, 502, 504, "connect-failure", "reset"]
    circuit_breaker:
      consecutive_errors: 5
      interval: 30s
      base_ejection_time: 30s
      max_ejection_percent: 50
    health_check:
      path: /api/health
      interval: 10s
      healthy_threshold: 2
      unhealthy_threshold: 3
  llm-gateway:
    address: llm-gateway.hanzo.svc
    port: 4000
    protocol: http
    timeout: 120s  # LLM requests can be slow
    retry:
      attempts: 2
      per_try_timeout: 60s
      retry_on: [503, "connect-failure"]
    circuit_breaker:
      consecutive_errors: 10
      interval: 60s
      base_ejection_time: 15s
      max_ejection_percent: 30
  candle-inference:
    address: candle.hanzo.svc
    port: 50051
    protocol: grpc
    timeout: 30s
    retry:
      attempts: 2
      retry_on: ["unavailable", "resource-exhausted"]
    circuit_breaker:
      consecutive_errors: 3
      interval: 30s
      base_ejection_time: 60s
      max_ejection_percent: 50
Circuit Breaker
Nexus implements the circuit breaker pattern with three states:
┌──────────┐    errors > threshold     ┌──────────┐
│  CLOSED  │ ────────────────────────> │   OPEN   │
│ (normal) │                           │ (reject) │
└──────────┘                           └────┬─────┘
     ^                                      |
     |                        timer expires |
     |                                      v
     |                                 ┌──────────┐
     └───── success ────────────────── │HALF-OPEN │
                                       │ (probe)  │
                                       └──────────┘
CLOSED: Traffic flows normally. Nexus counts consecutive errors per upstream endpoint.
OPEN: After consecutive_errors failures within interval, the endpoint is ejected. All requests to that endpoint receive an immediate 503 without making the upstream call. This prevents cascading latency.
HALF-OPEN: After base_ejection_time, Nexus allows a single probe request through. If it succeeds, the circuit closes. If it fails, the ejection time doubles (up to a maximum of 300s).
The max_ejection_percent parameter prevents the circuit breaker from ejecting all endpoints simultaneously, which would make the service completely unreachable.
Retry Policy
Retries are budget-aware. Each service defines the maximum number of retries, but Nexus enforces a global retry budget: no more than 20% of total requests to a service can be retries. This prevents retry storms when a service is partially degraded.
type RetryPolicy struct {
	MaxAttempts   int           `yaml:"attempts"`
	PerTryTimeout time.Duration `yaml:"per_try_timeout"`
	RetryOn       []string      `yaml:"retry_on"`
	RetryBudget   float64       `yaml:"retry_budget"` // default: 0.20
	BackoffBase   time.Duration `yaml:"backoff_base"` // default: 25ms
	BackoffMax    time.Duration `yaml:"backoff_max"`  // default: 250ms
}
Retries use exponential backoff with jitter: sleep = min(backoff_base * 2^attempt + jitter, backoff_max). The jitter is uniformly distributed over [0, backoff_base) to decorrelate retry waves from multiple callers.
Protocol Translation
Nexus transparently translates between gRPC and REST at the data plane:
translation:
  routes:
    # REST callers can reach gRPC services via Nexus
    - from:
        protocol: http
        path: /api/v1/inference/predict
        method: POST
      to:
        service: candle-inference
        protocol: grpc
        method: candle.Inference/Predict
      request_mapping:
        body: proto  # JSON body is marshaled to protobuf
      response_mapping:
        proto: json  # Protobuf response is marshaled to JSON
    # gRPC callers can reach REST services via Nexus
    - from:
        protocol: grpc
        method: hanzo.IAM/ValidateToken
      to:
        service: iam
        protocol: http
        path: /api/validate-token
        method: POST
      request_mapping:
        proto: json
      response_mapping:
        json: proto
Translation uses pre-compiled protobuf descriptors. Nexus does not perform runtime reflection or dynamic proto parsing. Services register their .proto files with the control plane, which compiles them at configuration load time.
Request Routing
Nexus supports three routing modes:
1. Direct Routing (Default)
Requests are routed by service name. When service A calls http://iam:8000/api/health, the Nexus data plane intercepts the outbound connection, resolves the service name to a healthy IAM endpoint, and proxies the request.
2. Header-Based Routing
Used for canary deployments and testing:
routing:
  rules:
    - match:
        headers:
          x-api-version: "v2.1-canary"
      route:
        service: llm-gateway
        subset: canary
    - match:
        headers:
          x-iam-tenant: "enterprise-acme"
      route:
        service: llm-gateway
        subset: dedicated
3. Content-Based Routing
Used for intelligent fan-out based on request body inspection:
routing:
  rules:
    - match:
        content:
          json_path: "$.model"
          pattern: "^zen-.*"
      route:
        service: llm-gateway
        subset: zen-cluster
    - match:
        content:
          json_path: "$.model"
          pattern: "^gpt-.*|^claude-.*"
      route:
        service: llm-gateway
        subset: provider-proxy
Content-based routing incurs a small overhead (the request body must be buffered and parsed). It is disabled by default and enabled per-route.
Traffic Splitting
Canary deployments use weighted traffic splitting:
traffic:
  splits:
    - service: llm-gateway
      subsets:
        - name: stable
          weight: 95
          labels:
            version: v1.8.0
        - name: canary
          weight: 5
          labels:
            version: v1.9.0-rc1
    - service: search
      subsets:
        - name: stable
          weight: 90
          labels:
            version: v2.3.0
        - name: canary
          weight: 10
          labels:
            version: v2.4.0-beta
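Weights are enforced per-request using deterministic hashing of the request ID. This keeps routing consistent across retries and downstream hops of the same request. For a single user session to be consistently routed to the same subset (sticky canary), the hashed ID must be stable across that session, for example a session-scoped identifier rather than a per-request one.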
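Deterministic weighted selection can be sketched as hashing a key into one of 100 buckets and walking the cumulative weights. The choice of FNV-1a, the `subset` type, and the `pick` function are assumptions for illustration; the source does not name a hash function. Weights are assumed to be percentages that sum to 100.

```go
package main

import "hash/fnv"

// subset is one weighted target of a traffic split (illustrative type).
type subset struct {
	name   string
	weight uint32 // percentage of traffic
}

// pick maps key to a subset: the key is hashed into a bucket in [0, 100)
// and the first subset whose cumulative weight exceeds the bucket wins.
// The same key always yields the same subset for a given configuration.
func pick(key string, subsets []subset) string {
	if len(subsets) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	bucket := h.Sum32() % 100
	var acc uint32
	for _, s := range subsets {
		acc += s.weight
		if bucket < acc {
			return s.name
		}
	}
	// Reached only when the weights sum to less than 100.
	return subsets[len(subsets)-1].name
}
```

Whichever identifier is passed as `key` determines the granularity of stickiness: equal keys always land on the same subset.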
Health Checking and Dependency Management
Nexus extends Kubernetes health checking with dependency-aware health:
health:
  # Standard liveness/readiness
  liveness:
    path: /healthz
    interval: 10s
    timeout: 3s
  # Dependency-aware health
  dependencies:
    llm-gateway:
      critical:
        - iam        # Must be healthy for auth
        - redis      # Must be healthy for caching
      degraded:
        - analytics  # Can function without analytics
        - search     # Can function without context search
    console:
      critical:
        - iam
        - postgres
      degraded:
        - commerce
A service is healthy when all of its critical and degraded-tier dependencies are reachable. It is degraded when every critical dependency is reachable but one or more degraded-tier dependencies are not (Nexus still reports it as healthy to Kubernetes but emits a nexus_service_degraded metric). It is unhealthy when any critical dependency is unreachable.
The control plane builds a dependency graph and detects circular dependencies at configuration load time, rejecting configurations that would create health check cycles.
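The classification rules and the cycle check can be illustrated with a short sketch. The `deps` type and the `status` and `hasCycle` functions are hypothetical names; cycle detection here is a standard depth-first search that flags any edge back to a node still on the current path, which is one reasonable way to implement the check rather than a description of the actual code.

```go
package main

// deps lists a service's dependencies by tier, mirroring the
// critical/degraded keys in the configuration above.
type deps struct {
	critical []string
	degraded []string
}

// status classifies a service given which dependencies are reachable.
func status(d deps, reachable map[string]bool) string {
	for _, c := range d.critical {
		if !reachable[c] {
			return "unhealthy"
		}
	}
	for _, g := range d.degraded {
		if !reachable[g] {
			return "degraded"
		}
	}
	return "healthy"
}

// hasCycle reports whether the dependency graph contains a cycle.
func hasCycle(graph map[string]deps) bool {
	const (
		unvisited = iota
		visiting
		done
	)
	colour := map[string]int{}
	var visit func(string) bool
	visit = func(n string) bool {
		switch colour[n] {
		case visiting:
			return true // back edge: n is already on the current path
		case done:
			return false
		}
		colour[n] = visiting
		d := graph[n]
		for _, m := range append(append([]string{}, d.critical...), d.degraded...) {
			if visit(m) {
				return true
			}
		}
		colour[n] = done
		return false
	}
	for n := range graph {
		if visit(n) {
			return true
		}
	}
	return false
}
```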
Observability Integration
Nexus emits telemetry in three formats, all consumed by Zap (HIP-0031):
OpenTelemetry Traces
Every proxied request generates a span:
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
├─ Span: gateway-ingress (API Gateway, HIP-0044)
│  ├─ Span: nexus-route (Nexus data plane)
│  │  ├─ Span: iam-validate-token (IAM service)
│  │  ├─ Span: nexus-route (Nexus data plane)
│  │  │  └─ Span: llm-completion (LLM Gateway)
│  │  └─ Span: nexus-route (Nexus data plane)
│  │     └─ Span: analytics-track (Analytics)
Nexus propagates W3C Trace Context headers (traceparent, tracestate) and injects its own span attributes:
nexus.service.source = "llm-gateway"
nexus.service.destination = "iam"
nexus.routing.rule = "direct"
nexus.retry.attempt = 0
nexus.circuit_breaker.state = "closed"
Prometheus Metrics
# Request metrics
nexus_requests_total{source, destination, method, status}
nexus_request_duration_seconds{source, destination, quantile}
nexus_request_size_bytes{source, destination}
nexus_response_size_bytes{source, destination}
# Circuit breaker metrics
nexus_circuit_breaker_state{service, endpoint} # 0=closed, 1=half-open, 2=open
nexus_circuit_breaker_ejections_total{service, endpoint}
# Retry metrics
nexus_retries_total{source, destination, attempt}
nexus_retry_budget_exhausted_total{source, destination}
# Health metrics
nexus_service_healthy{service}
nexus_service_degraded{service, missing_dependency}
nexus_health_check_duration_seconds{service}
# Traffic split metrics
nexus_traffic_split_requests_total{service, subset, version}
Structured Logs
Nexus logs in JSON format to stdout, consumed by the Zap log pipeline:
{
  "level": "warn",
  "ts": "2026-02-23T10:30:00.000Z",
  "msg": "circuit breaker opened",
  "service": "iam",
  "endpoint": "10.244.0.15:8000",
  "consecutive_errors": 5,
  "ejection_time": "30s",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
Control Plane API
The control plane exposes a management API on port 8052:
endpoints:
  # Service registry
  GET    /api/v1/services                       # List all services
  GET    /api/v1/services/{name}                # Get service detail
  GET    /api/v1/services/{name}/endpoints      # List healthy endpoints
  GET    /api/v1/services/{name}/health         # Get dependency health
  # Routing
  GET    /api/v1/routes                         # List routing rules
  PUT    /api/v1/routes                         # Update routing rules (hot reload)
  GET    /api/v1/routes/graph                   # Dependency graph (DOT format)
  # Traffic management
  GET    /api/v1/traffic/splits                 # List active splits
  PUT    /api/v1/traffic/splits/{service}       # Update traffic split
  DELETE /api/v1/traffic/splits/{service}       # Remove split (100% stable)
  # Circuit breakers
  GET    /api/v1/circuits                       # List circuit states
  POST   /api/v1/circuits/{service}/reset       # Force close circuit
  # Diagnostics
  GET    /api/v1/topology                       # Full mesh topology
  GET    /api/v1/topology/dot                   # Graphviz DOT output
  GET    /metrics                               # Prometheus metrics
  GET    /healthz                               # Control plane health
Integration with API Gateway (HIP-0044)
The API Gateway handles external traffic. Nexus handles internal traffic. They meet at the boundary:
External (Internet)                    Internal (Mesh)
      |                                       |
┌─────v──────┐    Proxy Pass    ┌─────────────v───────────┐
│  Gateway   │ ───────────────> │    Nexus Data Plane     │
│   :8080    │                  │         :15001          │
│ (KrakenD)  │                  │  circuit break, retry,  │
│ TLS term,  │                  │ translate, split, trace │
│ auth,      │                  └─────────────┬───────────┘
│ rate limit │                                |
└────────────┘                          ┌─────v─────┐
                                        │  Backend  │
                                        └───────────┘
The Gateway terminates TLS and validates authentication. Nexus handles service-to-service concerns. Neither duplicates the other's work.
Integration with Zero Trust (HIP-0054)
Nexus does not implement its own mTLS. Instead, it trusts the SPIFFE identity injected by the Zero Trust layer:
- The Zero Trust agent provisions a SPIFFE SVID (X.509 certificate) for each workload.
- Nexus reads the SVID from the SPIFFE Workload API socket.
- All data plane connections use the SVID for mutual TLS.
- Nexus extracts the SPIFFE ID (spiffe://hanzo.ai/service/iam) and uses it for authorization policy evaluation.
This separation means Nexus never needs to run a certificate authority and handles no private key material beyond the SVID that the Workload API provides for its own identity.
Implementation
Binary Structure
nexus/
├── cmd/
│   └── nexus/
│       └── main.go           # CLI: nexus control | nexus proxy
├── internal/
│   ├── control/
│   │   ├── registry.go       # Service registry (K8s watch + static)
│   │   ├── health.go         # Dependency-aware health aggregation
│   │   ├── config.go         # YAML config loader with hot reload
│   │   └── api.go            # Control plane HTTP API
│   ├── proxy/
│   │   ├── listener.go       # Transparent proxy listener (:15001)
│   │   ├── router.go         # Direct / header / content routing
│   │   ├── circuit.go        # Circuit breaker state machine
│   │   ├── retry.go          # Budget-aware retry with backoff
│   │   ├── translate.go      # gRPC <-> REST translation
│   │   └── split.go          # Weighted traffic splitting
│   ├── telemetry/
│   │   ├── traces.go         # OpenTelemetry span emission
│   │   ├── metrics.go        # Prometheus metric registration
│   │   └── logs.go           # Structured JSON logging
│   └── mesh/
│       ├── topology.go       # Dependency graph construction
│       └── spiffe.go         # SPIFFE Workload API integration
├── proto/
│   └── nexus/v1/
│       ├── control.proto     # Control plane gRPC service
│       └── health.proto      # Health reporting proto
├── deploy/
│   ├── daemonset.yaml        # Data plane DaemonSet
│   ├── deployment.yaml       # Control plane Deployment
│   └── configmap.yaml        # Nexus configuration
├── nexus.yaml                # Example configuration
├── go.mod
├── go.sum
├── Makefile
├── Dockerfile
└── README.md
Deployment
Kubernetes Manifests
# Control plane
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nexus-control
  namespace: hanzo
spec:
  replicas: 2  # HA pair
  selector:
    matchLabels:
      app: nexus-control
  template:
    metadata:
      labels:
        app: nexus-control
    spec:
      serviceAccountName: nexus
      containers:
        - name: nexus
          image: ghcr.io/hanzoai/nexus:latest
          args: ["control"]
          ports:
            - containerPort: 8052
              name: api
          volumeMounts:
            - name: config
              mountPath: /etc/nexus
          env:
            - name: NEXUS_LOG_LEVEL
              value: "info"
            - name: NEXUS_OTEL_ENDPOINT
              value: "zap.hanzo.svc:4317"
      volumes:
        - name: config
          configMap:
            name: nexus-config
---
# Data plane
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nexus-proxy
  namespace: hanzo
spec:
  selector:
    matchLabels:
      app: nexus-proxy
  template:
    metadata:
      labels:
        app: nexus-proxy
    spec:
      serviceAccountName: nexus
      hostNetwork: true
      containers:
        - name: nexus
          image: ghcr.io/hanzoai/nexus:latest
          args: ["proxy", "--control-plane=nexus-control.hanzo.svc:8052"]
          ports:
            - containerPort: 15001
              name: proxy
          resources:
            requests:
              memory: "15Mi"
              cpu: "50m"
            limits:
              memory: "64Mi"
              cpu: "200m"
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]  # Required for iptables redirect
          volumeMounts:
            - name: spiffe
              mountPath: /run/spiffe
              readOnly: true
      volumes:
        - name: spiffe
          csi:
            driver: spiffe.csi.cert-manager.io
Configuration Hot Reload
The control plane watches its ConfigMap for changes (via Kubernetes informer). When the configuration changes:
- The new configuration is validated (schema check, cycle detection, endpoint resolution).
- If valid, it is distributed to all data plane proxies via gRPC streaming.
- Data plane proxies apply the new configuration atomically (swap the routing table in a single pointer write).
- The old configuration is retained as fallback for 60 seconds.
If the new configuration is invalid, the control plane rejects it and logs an error. The running configuration is not affected.
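The validate-then-swap behavior described above can be sketched with an atomic pointer. The `Config` and `Store` types and the `validateRoutes` check are hypothetical stand-ins; real validation would cover schema, cycles, and endpoint resolution, and the 60-second fallback retention is reduced here to simply keeping the previous pointer.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// Config is a stand-in for the parsed nexus.yaml (illustrative fields).
type Config struct {
	Version int
	Routes  map[string]string // service name -> upstream address
}

// Store holds the running configuration. Readers call Get on every request;
// the writer replaces the whole config with a single pointer swap.
type Store struct {
	current  atomic.Pointer[Config]
	previous atomic.Pointer[Config] // retained as fallback
}

// Apply validates next and, only if it is valid, swaps it in atomically.
// An invalid config leaves the running configuration untouched.
func (s *Store) Apply(next *Config, validate func(*Config) error) error {
	if err := validate(next); err != nil {
		return err
	}
	s.previous.Store(s.current.Swap(next))
	return nil
}

// Get returns the running configuration (nil before the first Apply).
func (s *Store) Get() *Config { return s.current.Load() }

// validateRoutes is a placeholder for the real validation steps.
func validateRoutes(c *Config) error {
	if len(c.Routes) == 0 {
		return errors.New("config has no routes")
	}
	return nil
}
```

Because readers only ever load one pointer, an in-flight request sees either the old routing table or the new one, never a mixture. `atomic.Pointer` requires Go 1.19 or later.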
Graceful Degradation
Nexus is designed to fail safely at every level:
| Failure Mode | Behavior |
|---|---|
| Control plane unreachable | Data plane continues with last known config |
| Data plane proxy crash | iptables rules removed; traffic routes directly |
| Config update invalid | Rejected; running config preserved |
| All endpoints ejected | Circuit breaker respects max_ejection_percent |
| SPIFFE socket unavailable | Falls back to plaintext (logged as critical) |
| Upstream timeout | Returns 504 with x-nexus-timeout: true header |
Security Considerations
Network Security
- No new attack surface: Nexus does not expose any port externally. The control plane API (8052) and data plane proxy (15001) are cluster-internal only.
- mTLS delegation: By delegating mTLS to HIP-0054, Nexus avoids the complexity (and risk) of managing certificates. It never possesses private key material beyond what the SPIFFE Workload API provides for its own identity.
- Authorization policy: Nexus can enforce service-to-service authorization based on SPIFFE identities. For example, only spiffe://hanzo.ai/service/llm-gateway can call spiffe://hanzo.ai/service/candle-inference.
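One simple way to represent such a policy is an allowlist keyed by destination identity with default deny. The `policy` type and `allowed` method below are assumptions for illustration; the source does not specify a policy format.

```go
package main

// policy maps a destination SPIFFE ID to the set of caller SPIFFE IDs
// allowed to reach it. Anything not listed is denied.
type policy map[string]map[string]bool

// allowed reports whether caller may call dest. Lookups on a missing
// destination or caller yield false, so the default is deny.
func (p policy) allowed(caller, dest string) bool {
	return p[dest][caller]
}
```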
Data Security
- No request storage: Nexus does not persist request or response bodies. Content-based routing reads the body, routes the request, and discards the parsed content immediately.
- Header sanitization: Nexus strips internal mesh headers (x-nexus-*) from responses before they leave the mesh, preventing internal topology leakage.
- Trace ID propagation: Trace IDs are opaque identifiers that do not contain PII. Nexus never logs request bodies in trace spans.
Operational Security
- RBAC on control plane API: The management API (8052) requires a Kubernetes ServiceAccount token with the nexus-admin ClusterRole.
- Audit logging: All configuration changes (route updates, traffic splits, circuit resets) are logged with the requesting identity and timestamp.
- Rate limiting the control plane: The management API enforces a rate limit of 100 requests/second to prevent configuration churn from destabilizing the mesh.
References
- HIP-0031: Observability & Metrics Standard -- Telemetry consumption
- HIP-0044: API Gateway Standard -- External traffic ingress
- HIP-0054: Zero Trust Security Standard -- mTLS and SPIFFE identity
- HIP-0004: LLM Gateway -- AI-specific proxy
- Envoy Proxy Circuit Breaking
- SPIFFE Specification
- W3C Trace Context
- Nexus Repository
Copyright
Copyright and related rights waived via CC0.