HIP-59: Timeseries Database Standard
Abstract
This proposal defines the timeseries database standard for the Hanzo ecosystem. Hanzo Metrics provides a dedicated TimescaleDB instance for storing, querying, and aggregating infrastructure metrics, AI performance telemetry, and operational time-series data. Every Hanzo service that produces or consumes numeric measurements indexed by time MUST use the cluster-local TimescaleDB instance following this specification.
Repository: github.com/hanzoai/metrics
Image: ghcr.io/hanzoai/metrics:latest
Ports: 8059 (HTTP API), 5433 (PostgreSQL wire protocol)
Engine: TimescaleDB 2.x on PostgreSQL 16+
Motivation
Hanzo operates 14 Zen models, 100+ third-party LLM providers, and dozens of infrastructure services across two DOKS Kubernetes clusters. Each of these produces a continuous stream of numeric measurements: request latencies, token throughput, GPU utilization, memory pressure, queue depths, error rates, and cost accumulation.
Today, Prometheus (HIP-0031) handles short-term metrics with a 15-day retention window. ClickHouse (HIP-0047) handles high-volume analytics events. But a gap exists between these two systems:
-
Prometheus is not a long-term database. Its local TSDB is designed for real-time alerting with limited retention. Queries over weeks or months require workarounds like Thanos or Cortex, each introducing significant operational complexity (object storage, compactors, store gateways, query frontends). We need a simpler long-term metrics store.
-
ClickHouse is optimized for analytics, not metrics. ClickHouse excels at scanning billions of wide event rows with dozens of string columns. Metrics are narrow: a timestamp, a metric name, a numeric value, and a handful of labels. ClickHouse can store metrics, but it lacks native constructs for continuous aggregation, time-bucket functions, gap filling, and last-observation-carried-forward interpolation that metrics queries demand constantly.
-
AI workloads need metric correlation with relational data. When a model's p99 latency spikes, engineers need to JOIN that timeseries against the relational model registry (which model version, which GPU node, which organization's traffic caused the spike). This requires the metrics database to speak SQL and live in the PostgreSQL ecosystem where relational tables already exist.
-
No rollup pipeline. Raw metrics at 1-second resolution consume substantial storage. A single metric with 100 label combinations produces 8.6 million data points per day. After 90 days, that is 780 million rows for ONE metric. Without continuous aggregation that rolls 1-second data into 1-minute, 1-hour, and 1-day summaries, long-term metric storage is economically infeasible.
-
Alerting needs historical context. Alert rules that fire on "p99 latency exceeds the 30-day rolling average by 3 standard deviations" require efficient access to 30 days of aggregated latency data. Prometheus cannot serve this. ClickHouse can, but not with the ergonomics that time-series native functions provide.
We need a dedicated timeseries database that bridges the gap between Prometheus's real-time window and ClickHouse's analytics warehouse.
Design Philosophy
This section explains the reasoning behind each major architectural decision. Understanding the why is as important as understanding the what.
Why TimescaleDB Over InfluxDB
InfluxDB is the most well-known open-source time-series database. We reject it for three reasons:
-
SQL compatibility. TimescaleDB is a PostgreSQL extension. Every SQL tool, ORM, driver, and visualization platform that works with PostgreSQL works with TimescaleDB unchanged. InfluxDB uses Flux (InfluxDB 2.x) or InfluxQL (1.x), both proprietary query languages. Flux is being deprecated in favor of SQL in InfluxDB 3.x, which validates our position: the industry is converging on SQL for time-series.
-
PostgreSQL ecosystem. TimescaleDB inherits PostgreSQL's mature ecosystem: pgvector for embeddings, pg_cron for scheduling, pg_stat_statements for query analysis, logical replication, point-in-time recovery, and decades of operational tooling. InfluxDB is a standalone system with its own backup, replication, and monitoring tools that our team must learn separately.
-
JOIN with relational data. The killer feature. TimescaleDB hypertables live in the same PostgreSQL database as regular tables. A single query can JOIN a timeseries of model latencies against a relational table of model configurations, deployment versions, or organization metadata. InfluxDB cannot do this without exporting data to an external system.
Decision: Use TimescaleDB. SQL compatibility and PostgreSQL ecosystem integration outweigh InfluxDB's purpose-built ingestion performance.
Why TimescaleDB Over VictoriaMetrics
VictoriaMetrics is a high-performance Prometheus-compatible TSDB. It excels as a drop-in Prometheus replacement with better compression and longer retention. We still choose TimescaleDB because:
-
Query expressiveness. VictoriaMetrics uses PromQL (and MetricsQL, a superset). PromQL is powerful for dashboards but limited for ad-hoc analysis. It cannot express JOINs, subqueries with CTEs, window functions, or aggregations that group by non-label dimensions. SQL can express all of these natively.
-
Unified storage layer. VictoriaMetrics is metrics-only. It cannot store the relational metadata (model registry, deployment history, organization configuration) that engineers need to correlate with metrics. With TimescaleDB, metrics and metadata live side by side.
-
Continuous aggregation. TimescaleDB's continuous aggregates are materialized views that automatically maintain rollups as new data arrives. VictoriaMetrics has downsampling, but it operates as a batch process with less granular control over aggregation intervals and retention per tier.
Decision: Use TimescaleDB. VictoriaMetrics is an excellent Prometheus backend, but we need SQL and relational JOINs.
Why Not Just ClickHouse (HIP-0047)
ClickHouse (HIP-0047) is already deployed for analytics. Could it also store metrics? Technically yes. Practically, the access patterns are different enough to warrant separate systems:
| Property | Metrics (TimescaleDB) | Analytics (ClickHouse) |
|---|---|---|
| Row width | Narrow (timestamp + value + 5-10 labels) | Wide (50+ columns per event) |
| Write pattern | Continuous single-row or small-batch inserts | Large batch inserts (10K+ rows) |
| Read pattern | Time-range scans with aggregation, gap filling | Full-table scans with GROUP BY |
| Query language | SQL with time-series functions | SQL with columnar optimizations |
| Rollups | Continuous aggregation (real-time) | Materialized views (insert-triggered) |
| Retention | Tiered (hot/warm/cold with different resolutions) | TTL-based (uniform resolution) |
| JOIN requirement | Frequent JOINs with relational metadata | Rare JOINs (denormalized schema) |
| Cardinality | Low-medium (metric names x label values) | High (arbitrary user properties) |
ClickHouse's batch-insert requirement is the fundamental incompatibility. Metrics
arrive continuously -- a Prometheus remote-write flush every 15 seconds sends
hundreds of individual metric samples. ClickHouse's too many parts failure mode
punishes exactly this pattern. TimescaleDB, built on PostgreSQL's WAL-based
storage, handles continuous small inserts without degradation.
Decision: TimescaleDB for metrics. ClickHouse for analytics. Different tools for different access patterns.
Why a Separate Instance (Port 5433, Not 5432)
Hanzo already runs PostgreSQL on port 5432 for transactional data (IAM, Cloud, Console). Why not add TimescaleDB as an extension to the existing instance?
-
Resource isolation. Metric ingestion is write-heavy and continuous. A burst of metric writes (GPU cluster scaling event producing 50,000 data points per second) should not increase p99 latency on IAM login queries running on the same PostgreSQL instance.
-
Independent scaling. The transactional PostgreSQL instance scales based on connection count and query complexity. The metrics instance scales based on data volume and retention depth. These dimensions are uncorrelated.
-
Different maintenance windows. TimescaleDB chunk compression, continuous aggregate refresh, and retention policy enforcement create I/O spikes. These should not compete with transactional VACUUM and autovacuum cycles.
-
Configuration tuning. Metrics workloads benefit from aggressive
shared_buffers, largework_memfor time-bucket aggregations, andsynchronous_commit = offfor higher write throughput. These settings would harm transactional workload latency if shared.
Decision: Dedicated TimescaleDB instance on port 5433. Isolated from transactional PostgreSQL on port 5432.
Specification
Architecture Overview
┌─────────────────────────────────────────┐
│ Hanzo Services │
│ (LLM Gateway, Cloud, IAM, Nodes) │
└──────┬──────────────┬───────────────────┘
│ │
Prometheus Direct SQL
remote_write (port 5433)
(port 8059) │
│ │
┌──────▼──────────────▼───────────────────┐
│ Hanzo Metrics │
│ ┌─────────────────────────────┐ │
│ │ HTTP API (port 8059) │ │
│ │ Prometheus remote-write │ │
│ │ compatible ingestion │ │
│ └──────────┬──────────────────┘ │
│ │ │
│ ┌──────────▼──────────────────┐ │
│ │ TimescaleDB (port 5433) │ │
│ │ PostgreSQL 16 + Timescale │ │
│ │ │ │
│ │ ┌──── Hypertables ───────┐│ │
│ │ │ metrics_raw (1s) ││ │
│ │ │ metrics_1m (cagg) ││ │
│ │ │ metrics_1h (cagg) ││ │
│ │ │ metrics_1d (cagg) ││ │
│ │ └────────────────────────┘│ │
│ │ │ │
│ │ ┌──── Relational ────────┐│ │
│ │ │ metric_metadata ││ │
│ │ │ alert_rules ││ │
│ │ │ alert_state ││ │
│ │ └────────────────────────┘│ │
│ └─────────────────────────────┘ │
└──────────────┬───────────────────────────┘
│
┌──────────────▼───────────────────┐
│ Consumers │
│ - Visor dashboards (HIP-0053) │
│ - Grafana (HIP-0031) │
│ - Alert evaluator │
│ - Analytics (HIP-0017) │
└──────────────────────────────────┘
Hypertable Schema
The core data model uses a single hypertable partitioned by time with a label-based schema following the Prometheus data model.
Raw Metrics Table
-- Enable TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;
-- Core metrics hypertable
CREATE TABLE metrics_raw (
time TIMESTAMPTZ NOT NULL,
metric TEXT NOT NULL,
value DOUBLE PRECISION NOT NULL,
-- Labels (denormalized for query performance)
service TEXT NOT NULL DEFAULT '',
instance TEXT NOT NULL DEFAULT '',
org_id TEXT NOT NULL DEFAULT '',
model TEXT NOT NULL DEFAULT '',
provider TEXT NOT NULL DEFAULT '',
node TEXT NOT NULL DEFAULT '',
endpoint TEXT NOT NULL DEFAULT '',
status TEXT NOT NULL DEFAULT '',
-- Overflow labels stored as JSONB for low-frequency dimensions
labels JSONB NOT NULL DEFAULT '{}'
);
-- Convert to hypertable with 1-hour chunks
SELECT create_hypertable(
'metrics_raw',
by_range('time', INTERVAL '1 hour')
);
-- Composite index for the most common query pattern:
-- "give me metric X for service Y over time range Z"
CREATE INDEX idx_metrics_raw_metric_service_time
ON metrics_raw (metric, service, time DESC);
-- Index for organization-scoped queries (billing, usage dashboards)
CREATE INDEX idx_metrics_raw_org_time
ON metrics_raw (org_id, time DESC)
WHERE org_id != '';
-- Index for model-specific queries (AI performance)
CREATE INDEX idx_metrics_raw_model_time
ON metrics_raw (model, time DESC)
WHERE model != '';
Design decisions in this schema:
-
Denormalized labels rather than a normalized label table with JOINs. Metric queries filter by label values in WHERE clauses. Denormalized columns enable B-tree index scans. A normalized design would require JOINs on every query, adding 2-5ms latency that compounds on dashboard loads with 20+ panels.
-
JSONB
labelscolumn for overflow. Most metrics use 3-5 well-known labels (service, instance, model). Occasionally, a metric carries additional context (Kubernetes pod name, deployment revision, canary group). These low-frequency labels go into JSONB rather than adding more columns. -
1-hour chunk interval. TimescaleDB partitions hypertables into chunks by time range. Smaller chunks (1 hour) mean faster chunk exclusion during time-range queries and more granular compression. With 1-hour chunks, a query for the last 24 hours touches only 24 chunks instead of scanning the entire table.
-
TIMESTAMPTZnotTIMESTAMP. All times are stored in UTC. TimescaleDB's time-series functions (time_bucket,time_bucket_gapfill) work correctly with timezone-aware timestamps.
AI-Specific Metrics
The following standard metric names MUST be used by all AI services:
-- Token throughput (LLM Gateway, HIP-0004)
-- metric: hanzo_llm_tokens_total
-- labels: model, provider, org_id, token_type (prompt|completion)
-- Model latency distribution
-- metric: hanzo_llm_request_duration_seconds
-- labels: model, provider, org_id, status, quantile
-- GPU utilization (Compute nodes)
-- metric: hanzo_gpu_utilization_percent
-- labels: node, gpu_index, model
-- GPU memory usage
-- metric: hanzo_gpu_memory_used_bytes
-- labels: node, gpu_index, model
-- Cost accumulation (per-organization running total)
-- metric: hanzo_cost_accumulated_usd
-- labels: org_id, model, provider
-- Inference queue depth
-- metric: hanzo_inference_queue_depth
-- labels: model, provider
-- Cache hit rate (LLM semantic cache)
-- metric: hanzo_cache_hit_ratio
-- labels: model, cache_type (semantic|exact)
-- Active concurrent requests
-- metric: hanzo_requests_active
-- labels: service, endpoint
Continuous Aggregation
Continuous aggregates are the mechanism by which raw 1-second data is rolled up into coarser resolutions. TimescaleDB evaluates these incrementally -- only new data since the last refresh is processed, not the entire dataset.
1-Minute Rollup
CREATE MATERIALIZED VIEW metrics_1m
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 minute', time) AS bucket,
metric,
service,
instance,
org_id,
model,
provider,
node,
endpoint,
status,
AVG(value) AS avg_value,
MIN(value) AS min_value,
MAX(value) AS max_value,
COUNT(*) AS sample_count,
-- Approximate percentiles for latency distributions
percentile_agg(value) AS pct_agg
FROM metrics_raw
GROUP BY bucket, metric, service, instance, org_id,
model, provider, node, endpoint, status;
-- Refresh policy: process new data every 1 minute, with a 5-minute lag
-- to ensure late-arriving samples are included
SELECT add_continuous_aggregate_policy('metrics_1m',
start_offset => INTERVAL '10 minutes',
end_offset => INTERVAL '5 minutes',
schedule_interval => INTERVAL '1 minute'
);
1-Hour Rollup
CREATE MATERIALIZED VIEW metrics_1h
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', bucket) AS bucket,
metric,
service,
org_id,
model,
provider,
status,
AVG(avg_value) AS avg_value,
MIN(min_value) AS min_value,
MAX(max_value) AS max_value,
SUM(sample_count) AS sample_count,
rollup(pct_agg) AS pct_agg
FROM metrics_1m
GROUP BY bucket, metric, service, org_id,
model, provider, status;
SELECT add_continuous_aggregate_policy('metrics_1h',
start_offset => INTERVAL '4 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour'
);
1-Day Rollup
CREATE MATERIALIZED VIEW metrics_1d
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', bucket) AS bucket,
metric,
service,
org_id,
model,
provider,
AVG(avg_value) AS avg_value,
MIN(min_value) AS min_value,
MAX(max_value) AS max_value,
SUM(sample_count) AS sample_count,
rollup(pct_agg) AS pct_agg
FROM metrics_1h
GROUP BY bucket, metric, service, org_id,
model, provider;
SELECT add_continuous_aggregate_policy('metrics_1d',
start_offset => INTERVAL '3 days',
end_offset => INTERVAL '1 day',
schedule_interval => INTERVAL '1 day'
);
Why cascaded aggregation (1m from raw, 1h from 1m, 1d from 1h): Each level processes only the data from the level below, not the full raw dataset. Rolling up 1 day of raw data at 1-second resolution means aggregating 86,400 rows per metric-label combination. Rolling up 1 day from the 1-minute aggregate means aggregating only 1,440 rows. Rolling up from the 1-hour aggregate means 24 rows. The computational cost decreases by orders of magnitude at each tier.
Retention Policies
Data retention follows a tiered model. Higher resolution data expires sooner; aggregated data lives longer.
| Tier | Resolution | Retention | Storage (est. per 1000 metrics) | Source |
|---|---|---|---|---|
| Hot | 1 second (raw) | 7 days | ~25 GB | metrics_raw |
| Warm | 1 minute (aggregated) | 90 days | ~3 GB | metrics_1m |
| Cool | 1 hour (aggregated) | 365 days | ~150 MB | metrics_1h |
| Cold | 1 day (aggregated) | Indefinite | ~5 MB/year | metrics_1d |
Retention is enforced by TimescaleDB's built-in retention policies, which drop entire chunks when they expire. This is an O(1) operation (dropping a PostgreSQL table) rather than an O(n) row-by-row DELETE.
-- Drop raw data older than 7 days
SELECT add_retention_policy('metrics_raw', INTERVAL '7 days');
-- Drop 1-minute aggregates older than 90 days
SELECT add_retention_policy('metrics_1m', INTERVAL '90 days');
-- Drop 1-hour aggregates older than 1 year
SELECT add_retention_policy('metrics_1h', INTERVAL '365 days');
-- Daily aggregates are kept indefinitely (negligible storage)
Chunk Compression
TimescaleDB compresses older chunks using columnar compression, reducing storage by 10-20x. Compressed chunks are still queryable via standard SQL but cannot accept new writes.
-- Enable compression on raw hypertable
ALTER TABLE metrics_raw SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'metric, service, org_id',
timescaledb.compress_orderby = 'time DESC'
);
-- Compress chunks older than 2 hours (after continuous aggregation has processed them)
SELECT add_compression_policy('metrics_raw', INTERVAL '2 hours');
-- Compress 1-minute aggregates older than 1 day
ALTER MATERIALIZED VIEW metrics_1m SET (
timescaledb.compress
);
SELECT add_compression_policy('metrics_1m', INTERVAL '1 day');
Why compress_segmentby = 'metric, service, org_id': Compression groups
rows by segment columns. Queries that filter on these columns can skip
entire compressed segments without decompression. Since nearly every metric
query filters by metric name and service (and often org_id), segmenting on
these columns gives the best query-time performance on compressed data.
Prometheus Remote-Write Ingestion
The HTTP API on port 8059 accepts Prometheus remote-write protocol, making Hanzo Metrics a drop-in long-term storage backend for any Prometheus instance.
┌────────────┐ remote_write ┌─────────────────────┐
│ Prometheus │ ──────────────────→ │ Metrics API (:8059) │
│ (HIP-0031) │ (protobuf/snappy)│ │
│ 15-day │ │ - Decode protobuf │
│ retention │ │ - Map labels → cols │
└────────────┘ │ - Batch INSERT │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ TimescaleDB (:5433) │
│ metrics_raw │
└─────────────────────┘
Prometheus Configuration
# prometheus.yml (HIP-0031 Zap configuration)
remote_write:
- url: "http://metrics-api:8059/api/v1/write"
queue_config:
max_samples_per_send: 5000
batch_send_deadline: 15s
max_shards: 10
write_relabel_configs:
# Only send metrics with the hanzo_ prefix to long-term storage
- source_labels: [__name__]
regex: 'hanzo_.*'
action: keep
Why relabel to hanzo_ prefix only: Prometheus scrapes thousands of
internal metrics from every pod (Go runtime, HTTP handler, gRPC, Kubernetes
cadvisor). Most of these are useful only for real-time debugging within
Prometheus's 15-day window. Only Hanzo-defined metrics (hanzo_llm_*,
hanzo_gpu_*, hanzo_cost_*, etc.) need long-term retention. The relabel
filter reduces write volume by 90%+ and keeps TimescaleDB focused on
business-relevant metrics.
Remote-Write API
The ingestion endpoint implements the Prometheus remote-write specification (protobuf with Snappy compression):
POST /api/v1/write
Content-Type: application/x-protobuf
Content-Encoding: snappy
# Request body: prometheus.WriteRequest protobuf message
# Response: 204 No Content (success) or 400/500 with error details
The API server decodes the protobuf, maps well-known labels to denormalized
columns, and inserts into metrics_raw using COPY for maximum throughput.
Query Interface
Services query TimescaleDB directly over the PostgreSQL wire protocol (port 5433) using standard SQL. The HTTP API (port 8059) also exposes a PromQL-compatible read endpoint for Grafana compatibility.
Common Query Patterns
Model latency over time (Visor dashboard, HIP-0053):
SELECT
time_bucket('5 minutes', time) AS bucket,
model,
AVG(value) AS avg_latency,
approx_percentile(0.99, pct_agg) AS p99_latency
FROM metrics_1m
WHERE metric = 'hanzo_llm_request_duration_seconds'
AND time > NOW() - INTERVAL '24 hours'
GROUP BY bucket, model
ORDER BY bucket;
GPU utilization heatmap (operations dashboard):
SELECT
time_bucket('1 hour', time) AS bucket,
node,
AVG(value) AS avg_util,
MAX(value) AS peak_util
FROM metrics_1m
WHERE metric = 'hanzo_gpu_utilization_percent'
AND time > NOW() - INTERVAL '7 days'
GROUP BY bucket, node
ORDER BY bucket, node;
Cost accumulation by organization (billing, HIP-0017):
SELECT
org_id,
model,
time_bucket('1 day', bucket) AS day,
MAX(avg_value) - MIN(avg_value) AS daily_cost_usd
FROM metrics_1h
WHERE metric = 'hanzo_cost_accumulated_usd'
AND bucket > NOW() - INTERVAL '30 days'
AND org_id = 'org_123'
GROUP BY org_id, model, day
ORDER BY day;
Gap-filled time series (for smooth chart rendering):
SELECT
time_bucket_gapfill('1 minute', time) AS bucket,
COALESCE(AVG(value), 0) AS value,
locf(AVG(value)) AS value_interpolated
FROM metrics_raw
WHERE metric = 'hanzo_requests_active'
AND service = 'llm-gateway'
AND time > NOW() - INTERVAL '1 hour'
GROUP BY bucket
ORDER BY bucket;
locf (Last Observation Carried Forward) fills gaps by repeating the
most recent known value. This is critical for metrics that report only on
change (gauge metrics from intermittent scrapers) where missing data points
mean "same as before," not "zero."
Alerting Rules Engine
TimescaleDB stores alert rule definitions and maintains alert state. The alert evaluator is a lightweight Go process that runs SQL queries on a schedule and fires notifications when thresholds are breached.
-- Alert rules table
CREATE TABLE alert_rules (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
description TEXT NOT NULL DEFAULT '',
query TEXT NOT NULL,
condition TEXT NOT NULL, -- 'gt', 'lt', 'eq', 'ne'
threshold DOUBLE PRECISION NOT NULL,
for_duration INTERVAL NOT NULL DEFAULT '5 minutes',
severity TEXT NOT NULL DEFAULT 'warning',
labels JSONB NOT NULL DEFAULT '{}',
annotations JSONB NOT NULL DEFAULT '{}',
enabled BOOLEAN NOT NULL DEFAULT true,
eval_interval INTERVAL NOT NULL DEFAULT '1 minute',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Alert state (current firing/pending/resolved)
CREATE TABLE alert_state (
rule_id INTEGER REFERENCES alert_rules(id),
state TEXT NOT NULL DEFAULT 'inactive',
value DOUBLE PRECISION,
fired_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
last_eval TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (rule_id)
);
Example alert rule: notify when LLM p99 latency exceeds 5 seconds for 5 consecutive minutes:
INSERT INTO alert_rules (name, description, query, condition, threshold, for_duration, severity)
VALUES (
'llm_p99_latency_high',
'LLM request p99 latency exceeds 5 seconds',
$$
SELECT approx_percentile(0.99, pct_agg)
FROM metrics_1m
WHERE metric = 'hanzo_llm_request_duration_seconds'
AND bucket > NOW() - INTERVAL '10 minutes'
$$,
'gt',
5.0,
'5 minutes',
'critical'
);
Alert notifications are dispatched via HIP-0031 (Zap) webhook integration to Slack, PagerDuty, or any webhook-compatible endpoint.
Metric Metadata
A relational table stores metric descriptions, types, and units for self-documenting metric discovery:
CREATE TABLE metric_metadata (
metric TEXT PRIMARY KEY,
type TEXT NOT NULL, -- 'gauge', 'counter', 'histogram', 'summary'
unit TEXT NOT NULL DEFAULT '',
description TEXT NOT NULL DEFAULT '',
service TEXT NOT NULL DEFAULT '',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Seed with standard Hanzo metrics
INSERT INTO metric_metadata (metric, type, unit, description, service) VALUES
('hanzo_llm_tokens_total', 'counter', 'tokens', 'Total LLM tokens processed', 'llm-gateway'),
('hanzo_llm_request_duration_seconds', 'histogram', 'seconds', 'LLM request latency distribution', 'llm-gateway'),
('hanzo_gpu_utilization_percent', 'gauge', 'percent', 'GPU compute utilization', 'compute'),
('hanzo_gpu_memory_used_bytes', 'gauge', 'bytes', 'GPU memory in use', 'compute'),
('hanzo_cost_accumulated_usd', 'counter', 'usd', 'Cumulative cost per organization', 'billing'),
('hanzo_inference_queue_depth', 'gauge', 'requests', 'Pending inference requests', 'llm-gateway'),
('hanzo_cache_hit_ratio', 'gauge', 'ratio', 'Semantic/exact cache hit rate', 'llm-gateway'),
('hanzo_requests_active', 'gauge', 'requests', 'Active concurrent requests', '*');
Implementation
Deployment
Docker (Development)
# compose.yml
services:
metrics-db:
image: timescale/timescaledb:latest-pg16
ports:
- "5433:5432"
environment:
POSTGRES_DB: hanzo_metrics
POSTGRES_USER: hanzo
POSTGRES_PASSWORD: "${METRICS_DB_PASSWORD}"
volumes:
- metrics_data:/var/lib/postgresql/data
- ./schemas/init.sql:/docker-entrypoint-initdb.d/01-init.sql
command: >
postgres
-c shared_preload_libraries=timescaledb
-c timescaledb.telemetry_level=off
-c max_connections=200
-c shared_buffers=2GB
-c work_mem=64MB
-c synchronous_commit=off
-c wal_level=replica
healthcheck:
test: ["CMD-SHELL", "pg_isready -U hanzo -d hanzo_metrics"]
interval: 10s
timeout: 5s
retries: 3
metrics-api:
image: ghcr.io/hanzoai/metrics:latest
ports:
- "8059:8059"
environment:
DATABASE_URL: "postgresql://hanzo:${METRICS_DB_PASSWORD}@metrics-db:5432/hanzo_metrics"
LISTEN_ADDR: "0.0.0.0:8059"
depends_on:
metrics-db:
condition: service_healthy
volumes:
metrics_data:
Kubernetes (Production)
# k8s/metrics-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: metrics-db
namespace: hanzo
spec:
serviceName: metrics-db
replicas: 1
selector:
matchLabels:
app: metrics-db
template:
metadata:
labels:
app: metrics-db
spec:
containers:
- name: timescaledb
image: timescale/timescaledb:latest-pg16
ports:
- containerPort: 5432
name: postgresql
env:
- name: POSTGRES_DB
value: hanzo_metrics
- name: POSTGRES_USER
value: hanzo
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: metrics-db-credentials
key: password
args:
- postgres
- -c
- shared_preload_libraries=timescaledb
- -c
- timescaledb.telemetry_level=off
- -c
- max_connections=200
- -c
- shared_buffers=4GB
- -c
- work_mem=128MB
- -c
- synchronous_commit=off
- -c
- effective_cache_size=12GB
resources:
requests:
memory: 8Gi
cpu: "4"
limits:
memory: 16Gi
cpu: "8"
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 200Gi
storageClassName: do-block-storage
---
apiVersion: v1
kind: Service
metadata:
name: metrics-db
namespace: hanzo
spec:
selector:
app: metrics-db
ports:
- port: 5433
targetPort: 5432
name: postgresql
clusterIP: None
Resource Sizing Guide
| Cluster Size | Metrics | Scrape Interval | CPU | RAM | Disk |
|---|---|---|---|---|---|
| Small (dev) | < 500 | 15s | 2 cores | 4 GB | 50 GB |
| Medium | 500 - 5,000 | 15s | 4 cores | 8 GB | 200 GB |
| Large | 5,000 - 50,000 | 10s | 8 cores | 16 GB | 500 GB |
| XL | 50,000+ | 10s | 16 cores | 32 GB | 1 TB |
Integration Points
O11y / Zap (HIP-0031)
Zap is the primary data producer. It scrapes Prometheus metrics from every
Hanzo service and remote-writes the hanzo_* subset to the Metrics API.
TimescaleDB serves as the long-term storage backend that extends Prometheus's
15-day retention to 1 year+.
Zap sidecar → Prometheus (15-day real-time) → remote_write → Metrics API → TimescaleDB (1y+)
Visor Dashboards (HIP-0053)
Visor connects to TimescaleDB via the PostgreSQL wire protocol as a standard
data source. Dashboard panels issue SQL queries with time_bucket for
time-series charts, approx_percentile for latency distributions, and
time_bucket_gapfill for continuous line charts.
Analytics / Insights (HIP-0017)
Business metrics (daily active users, feature adoption, conversion rates) are computed in ClickHouse (HIP-0047) and pushed to TimescaleDB as derived metric time-series. This enables a single dashboard to display both infrastructure metrics (from Prometheus) and business metrics (from ClickHouse) with consistent time-bucketing and aggregation.
Grafana (HIP-0031)
Grafana's built-in PostgreSQL data source queries TimescaleDB directly.
The timescaledb macro support in Grafana enables $__timeFilter(time),
$__timeGroup(time, $__interval), and other template variables that map
to TimescaleDB time-series functions.
Security Considerations
Authentication
TimescaleDB inherits PostgreSQL's authentication. Two database roles are provisioned:
| Role | Purpose | Permissions |
|---|---|---|
hanzo | API server, ingestion | Full DDL + DML on hanzo_metrics |
metrics_readonly | Dashboards, Grafana, Visor | SELECT only, 60s statement timeout |
Credentials are managed through KMS (HIP-0033) and injected as Kubernetes secrets.
Network
- PostgreSQL port (5433) MUST be accessible only from within the Kubernetes
cluster network (
10.0.0.0/8). - HTTP API port (8059) MUST be accessible only from within the cluster.
- No ports are exposed to the public internet.
- Kubernetes NetworkPolicy restricts access to pods with the label
metrics-client: "true".
Data Classification
| Data Type | Sensitivity | Handling |
|---|---|---|
| Metric values (numeric) | Low | Standard retention |
| Metric names | Low | Public knowledge (documented in this HIP) |
| Label values (service, instance) | Low | Internal infrastructure identifiers |
| Organization IDs (org_id) | Medium | Pseudonymized; no PII |
| Alert rules and state | Low | Operational data |
Metric data MUST NOT contain personally identifiable information. Label values MUST be limited to infrastructure identifiers (service names, node names, model names) and organizational pseudonyms (org_id, not org_name). If a metric requires user-level granularity, it MUST be aggregated to remove individual user identification before ingestion.
Backup and Recovery
Backups use PostgreSQL's native pg_basebackup and WAL archiving to
MinIO (HIP-0032):
# Base backup (weekly)
pg_basebackup -D /backups/base_$(date +%Y%m%d) \
-h metrics-db -U replication -Ft -z -P
# Continuous WAL archiving
archive_command = 'aws s3 cp %p s3://metrics-backups/wal/%f --endpoint-url http://minio:9000'
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Base backup | Weekly (Sunday 03:00 UTC) | 4 weeks | MinIO (HIP-0032) |
| WAL archive | Continuous | 7 days | MinIO (HIP-0032) |
Recovery time objective (RTO): < 30 minutes. Recovery point objective (RPO): < 5 minutes (WAL archive lag).
References
- HIP-0017: Analytics Event Standard -- business event ingestion and ClickHouse queries
- HIP-0031: Observability & Metrics Standard -- Prometheus, Zap sidecar, Grafana
- HIP-0032: Object Storage Standard -- MinIO for backups
- HIP-0033: Secret Management Standard -- KMS for credential injection
- HIP-0047: Analytics Datastore Standard -- ClickHouse for OLAP analytics events
- HIP-0053: Visor Dashboard Standard -- dashboard queries against TimescaleDB
- TimescaleDB Documentation: docs.timescale.com
- Prometheus Remote Write: prometheus.io/docs/specs/remote_write_spec
- PostgreSQL 16: postgresql.org/docs/16
Copyright
This document is placed in the public domain.