HIP-47: Analytics Datastore Standard
Abstract
This proposal defines the analytics datastore standard for the Hanzo ecosystem. Hanzo Datastore provides columnar analytics storage via ClickHouse, deployed as a replicated cluster on each DOKS Kubernetes cluster. Every Hanzo service that requires high-throughput event ingestion, OLAP queries, or time-series aggregation MUST use the cluster-local ClickHouse instance following this specification.
Repository: github.com/hanzoai/datastore
Image: ghcr.io/hanzoai/datastore:latest
Ports: 8123 (HTTP), 9000 (native TCP), 9440 (native TLS)
Engine: ClickHouse 24.x+
Motivation
Hanzo generates massive volumes of event data across its services. Insights (HIP-0017) ingests 10M+ events per day from product analytics, session replay, feature flags, and A/B tests. Metrics (HIP-0031) ships structured logs to long-term storage. Billing analytics tracks token usage, credit consumption, and API call volumes across every organization.
All of this data shares a common characteristic: it is append-only, time- stamped, and queried in aggregate. These are the exact access patterns that row-oriented databases like PostgreSQL handle poorly and columnar databases handle exceptionally well.
Without a dedicated analytics datastore:
-
PostgreSQL becomes the bottleneck. Running
SELECT COUNT(*) FROM events WHERE timestamp > '2026-01-01' GROUP BY user_idon a 500M-row PostgreSQL table takes 45-120 seconds. The same query on ClickHouse completes in 200-800 milliseconds. PostgreSQL scans every column of every row; ClickHouse reads only the columns referenced in the query. -
Storage costs explode. A single analytics event averages 1-2 KB in PostgreSQL (row storage, TOAST overhead, indexes). ClickHouse compresses the same event to 50-100 bytes using columnar compression. At 10M events/day, PostgreSQL consumes ~15 GB/day; ClickHouse consumes ~700 MB/day. Over a year, that is 5.4 TB vs 250 GB.
-
SaaS analytics are prohibitively expensive. BigQuery charges $5/TB scanned. Snowflake charges per-second compute plus storage. At our query volume (thousands of dashboard loads per day, each scanning 10-100 GB), managed analytics would cost $5,000-20,000/month. Self-hosted ClickHouse runs on existing Kubernetes infrastructure for near-zero marginal cost.
-
No unified query layer. Without a standard, Insights uses one database, Metrics uses another, and billing analytics lives in a third. Cross-cutting queries (correlate LLM token usage with product engagement) become multi-system joins that are slow and fragile.
We need ONE analytics datastore that all event-driven workloads share, with a single schema convention, replication strategy, backup policy, and query interface.
Design Philosophy
This section explains the reasoning behind each major architectural decision. Understanding the why is as important as understanding the what.
Why ClickHouse Over PostgreSQL for Analytics
PostgreSQL is the Hanzo standard for transactional data (HIP-0029). It excels at OLTP: small reads and writes, row-level locking, ACID transactions, foreign keys, and joins across normalized tables. IAM stores users, organizations, and OAuth tokens in PostgreSQL. Cloud stores projects, API keys, and configuration.
Analytics data has the opposite access pattern:
| Property | OLTP (PostgreSQL) | OLAP (ClickHouse) |
|---|---|---|
| Write pattern | Single-row inserts/updates | Batch inserts (1000+ rows) |
| Read pattern | Point lookups by primary key | Full-column scans with aggregation |
| Schema | Normalized, many tables | Denormalized, few wide tables |
| Transactions | Required (ACID) | Not needed (append-only) |
| Compression | Row-level, 1-2x | Column-level, 10-100x |
| Typical query | SELECT * FROM users WHERE id = 42 | SELECT COUNT(*) FROM events WHERE ts > '2026-01-01' GROUP BY browser |
ClickHouse stores data column-by-column on disk. When a query references 3 out of 50 columns, only those 3 columns are read from disk. PostgreSQL stores data row-by-row, so it must read all 50 columns even if only 3 are needed.
For a table with 1 billion rows and 50 columns, a query touching 3 columns:
- PostgreSQL: Reads ~1 TB from disk (all columns, all rows)
- ClickHouse: Reads ~6 GB from disk (3 columns, compressed)
This is not a marginal difference. It is 100-200x less I/O, which translates directly into 100-200x faster queries.
Decision: Use ClickHouse for all analytics workloads. Keep PostgreSQL for transactional data per HIP-0029.
Why ClickHouse Over Snowflake / BigQuery
Both Snowflake and BigQuery are excellent managed columnar databases. We reject them for three reasons:
-
Data sovereignty. Hanzo stores user behavior data, LLM prompts (in aggregated form), and billing records. Enterprise customers require that this data stays on infrastructure we control. Snowflake and BigQuery are third-party managed services where data resides on vendor infrastructure.
-
Per-query pricing is unpredictable. BigQuery charges $5/TB scanned. A single poorly written query on a 10 TB table costs $50. A dashboard with 20 panels, each scanning 5 GB, costs $0.50 per page load. At 1000 dashboard loads/day, that is $500/day or $15,000/month -- just for dashboards. ClickHouse has zero per-query cost.
-
Latency. Snowflake and BigQuery add network round-trip latency (50-200ms) before query execution begins. ClickHouse is co-located with the services that query it, so the network hop is sub-millisecond. For interactive dashboards where users expect instant results, this matters.
Decision: Self-host ClickHouse. Eliminate per-query costs and data residency concerns.
Why ClickHouse Over Apache Druid
Druid is a real-time analytics database designed for sub-second OLAP queries. It is a valid alternative. We chose ClickHouse over Druid for three reasons:
-
Simpler architecture. Druid has six process types (Coordinator, Overlord, Broker, Router, Historical, MiddleManager) plus external dependencies on ZooKeeper, a metadata store (PostgreSQL/MySQL), and deep storage (S3/HDFS). ClickHouse is a single binary. A production cluster needs ClickHouse nodes and (optionally since 24.x) ClickHouse Keeper for coordination. No ZooKeeper, no external metadata store.
-
SQL-native. Druid has its own query language (Druid SQL is a subset of SQL with significant gaps). ClickHouse supports full ANSI SQL plus extensions for arrays, maps, nested data, and window functions. Engineers already know SQL. Druid SQL requires learning Druid-specific syntax and limitations.
-
Better compression. ClickHouse consistently achieves 10-20x compression ratios on analytics data. Druid achieves 3-8x. On a dataset of 1 billion events (1 TB uncompressed), ClickHouse stores it in ~60 GB; Druid stores it in ~150 GB.
Decision: Use ClickHouse. Simpler to operate, better compression, standard SQL.
Why ClickHouse Over TimescaleDB
TimescaleDB is PostgreSQL with time-series extensions. It is excellent for metrics (CPU usage, request latency, disk I/O) where each data point is a small fixed-schema tuple with a timestamp and a few numeric values.
Analytics events are different. Each event has 20-50 fields: timestamp, user ID, session ID, event type, URL, browser, OS, device, screen dimensions, referrer, UTM parameters, custom properties (JSON), and more. Events arrive at 100-1000x the rate of metrics. A busy dashboard page load generates 10-20 analytics events; the same page generates 1-2 metric data points.
TimescaleDB inherits PostgreSQL's row-oriented storage. For wide tables with billions of rows, it cannot match ClickHouse's columnar performance:
| Workload | TimescaleDB | ClickHouse |
|---|---|---|
| Ingestion (events/sec, single node) | 50,000-100,000 | 1,000,000-5,000,000 |
| Query: COUNT by day (1B rows) | 8-15 seconds | 0.1-0.3 seconds |
| Compression ratio (analytics events) | 3-5x | 10-20x |
| Disk usage (1B events) | 200-400 GB | 50-100 GB |
TimescaleDB is the right choice for infrastructure metrics (Prometheus long-term storage, IoT sensor data). ClickHouse is the right choice for analytics events.
Decision: Use ClickHouse for analytics events. Use Prometheus + Grafana for infrastructure metrics per HIP-0031. TimescaleDB is not needed.
Why Separate from PostgreSQL (HIP-0029)
The simplest architecture would be one database for everything. We reject this because analytics and transactional workloads compete destructively when co-located:
-
Lock contention. A long-running analytics query (
SELECT ... GROUP BY ... ORDER BY ... LIMIT 100on 500M rows) holds read locks that block transactional writes. PostgreSQL's MVCC mitigates this but does not eliminate it -- vacuum, buffer pool pressure, and WAL volume all increase. -
Resource competition. Analytics queries consume CPU and memory for sorting, hashing, and aggregation. This starves transactional queries that need sub-millisecond response times for user-facing API calls.
-
Scaling dimensions differ. Transactional databases scale vertically (bigger instance) and with read replicas. Analytics databases scale horizontally (shard by time, distribute across nodes). Combining them forces a compromise that serves neither well.
Decision: PostgreSQL for OLTP (HIP-0029). ClickHouse for OLAP (this HIP). Separate processes, separate storage, separate scaling.
Specification
Architecture Overview
┌─────────────────────────────────┐
│ Hanzo Services │
│ (Insights, Metrics, Billing) │
└────────┬────────────┬───────────┘
│ │
Batch HTTP Kafka Consumer
(port 8123) (HIP-0030)
│ │
┌────────▼────────────▼───────────┐
│ ClickHouse Cluster │
│ ┌──────────┐ ┌──────────┐ │
│ │ Shard 1 │ │ Shard 1 │ │
│ │ Replica A│ │ Replica B│ │
│ └──────────┘ └──────────┘ │
│ ┌──────────────────────┐ │
│ │ ClickHouse Keeper │ │
│ │ (coordination) │ │
│ └──────────────────────┘ │
└─────────────────────────────────┘
│
┌────────▼────────┐
│ MinIO (S3) │
│ (HIP-0032) │
│ Backups │
└─────────────────┘
Engine: MergeTree Family
All tables MUST use a MergeTree-family engine. The MergeTree engine is the foundation of ClickHouse's performance. It provides:
- Sorted storage: Data is stored sorted by the
ORDER BYkey, enabling efficient range queries and binary search. - Sparse indexing: Primary index stores one entry per 8192 rows (granule), not per row. This keeps the index small enough to fit in memory even for billion-row tables.
- Background merges: Small data parts are merged into larger ones asynchronously, maintaining sort order and applying TTL/compression.
For production deployments with replication, all tables MUST use
ReplicatedMergeTree:
CREATE TABLE events ON CLUSTER '{cluster}'
(
-- columns defined below
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), uuid)
SETTINGS index_granularity = 8192;
The ZooKeeper path template {shard} and {replica} are resolved from the
ClickHouse server configuration. Since ClickHouse 24.x, ClickHouse Keeper
(a built-in Raft-based coordination service) replaces ZooKeeper.
Schema Design
Events Table
The primary table storing all analytics events. This is the table that Insights (HIP-0017) writes to and queries from.
CREATE TABLE events ON CLUSTER '{cluster}'
(
-- Identity
uuid UUID DEFAULT generateUUIDv4(),
team_id UInt64,
distinct_id String,
session_id String DEFAULT '',
-- Event
event String,
timestamp DateTime64(3, 'UTC'),
created_at DateTime64(3, 'UTC') DEFAULT now64(3),
-- Properties (stored as JSON string, queried via JSONExtract)
properties String DEFAULT '{}',
-- Denormalized property columns for high-cardinality queries
-- These are extracted from properties at ingestion time
url String DEFAULT '',
referrer String DEFAULT '',
host String DEFAULT '',
pathname String DEFAULT '',
-- Device
browser LowCardinality(String) DEFAULT '',
browser_version String DEFAULT '',
os LowCardinality(String) DEFAULT '',
os_version String DEFAULT '',
device_type LowCardinality(String) DEFAULT '',
-- Geo (resolved at ingestion from IP)
country_code LowCardinality(FixedString(2)) DEFAULT '\0\0',
city LowCardinality(String) DEFAULT '',
region LowCardinality(String) DEFAULT '',
-- Attribution
utm_source LowCardinality(String) DEFAULT '',
utm_medium LowCardinality(String) DEFAULT '',
utm_campaign LowCardinality(String) DEFAULT '',
-- LLM-specific (Hanzo extension)
model LowCardinality(String) DEFAULT '',
provider LowCardinality(String) DEFAULT '',
tokens_prompt UInt32 DEFAULT 0,
tokens_completion UInt32 DEFAULT 0,
cost_usd Decimal64(8) DEFAULT 0,
-- Billing
organization_id String DEFAULT '',
project_id String DEFAULT '',
api_key_id String DEFAULT ''
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events',
'{replica}'
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), uuid)
SETTINGS index_granularity = 8192;
Design decisions in this schema:
LowCardinality(String)for columns with few distinct values (browser, OS, country). ClickHouse stores these as dictionary-encoded integers, reducing storage 5-10x and speeding equality comparisons.FixedString(2)for country codes. All ISO 3166-1 alpha-2 codes are exactly 2 bytes. Fixed-size storage avoids length prefix overhead.propertiesas String (JSON) rather than a Map or Nested type. This preserves schema flexibility -- any event can carry any custom properties. Queries useJSONExtractString(properties, 'key')for ad-hoc access. High-frequency property keys are denormalized into dedicated columns.- ORDER BY includes
cityHash64(distinct_id)to distribute rows with the same team/date/event across granules, improving parallelism for queries that filter by distinct_id. - PARTITION BY
toYYYYMM(timestamp)creates monthly partitions. This enables efficient partition pruning (queries with date ranges skip irrelevant months) and simplified TTL management.
Sessions Table
Materialized from the events table for session-level analytics.
CREATE MATERIALIZED VIEW sessions_mv ON CLUSTER '{cluster}'
TO sessions
AS SELECT
team_id,
session_id,
distinct_id,
min(timestamp) AS session_start,
max(timestamp) AS session_end,
dateDiff('second', min(timestamp),
max(timestamp)) AS duration_seconds,
count() AS event_count,
countIf(event = '$pageview') AS pageview_count,
argMin(url, timestamp) AS entry_url,
argMax(url, timestamp) AS exit_url,
argMin(referrer, timestamp) AS initial_referrer,
argMin(utm_source, timestamp) AS initial_utm_source,
argMin(utm_medium, timestamp) AS initial_utm_medium,
argMin(utm_campaign, timestamp) AS initial_utm_campaign,
argMin(browser, timestamp) AS browser,
argMin(os, timestamp) AS os,
argMin(device_type, timestamp) AS device_type,
argMin(country_code, timestamp) AS country_code
FROM events
GROUP BY team_id, session_id, distinct_id;
The sessions target table uses ReplicatedAggregatingMergeTree so that
late-arriving events for the same session are merged correctly.
Metrics Rollup Tables
For long-term metric storage (HIP-0031), pre-aggregated rollup tables reduce query time and storage for common dashboard queries.
-- Hourly event counts by team, event type
CREATE MATERIALIZED VIEW events_hourly_mv ON CLUSTER '{cluster}'
TO events_hourly
AS SELECT
team_id,
event,
toStartOfHour(timestamp) AS hour,
count() AS count,
uniqHLL12(distinct_id) AS unique_users
FROM events
GROUP BY team_id, event, hour;
-- Daily token usage by organization, model
CREATE MATERIALIZED VIEW token_usage_daily_mv ON CLUSTER '{cluster}'
TO token_usage_daily
AS SELECT
organization_id,
model,
provider,
toDate(timestamp) AS day,
sum(tokens_prompt) AS total_prompt_tokens,
sum(tokens_completion) AS total_completion_tokens,
sum(cost_usd) AS total_cost_usd,
count() AS request_count
FROM events
WHERE event = '$llm_request'
GROUP BY organization_id, model, provider, day;
Why materialized views: ClickHouse materialized views are triggers that
run on INSERT. They are not periodic batch jobs. When a batch of events is
inserted into the events table, ClickHouse automatically updates the
materialized view target tables. This means rollup tables are always
up-to-date with sub-second latency.
Data Ingestion
Batch HTTP Inserts
The primary ingestion path for Insights (HIP-0017). The Rust capture service buffers events in memory (or Kafka, per HIP-0030) and flushes to ClickHouse in batches via HTTP.
POST http://clickhouse:8123/?query=INSERT+INTO+events+FORMAT+JSONEachRow
Content-Type: application/json
{"uuid":"...","team_id":1,"event":"$pageview","timestamp":"2026-02-23T12:00:00.000Z","distinct_id":"user_42","properties":"{\"url\":\"/dashboard\"}"}
{"uuid":"...","team_id":1,"event":"$pageview","timestamp":"2026-02-23T12:00:01.000Z","distinct_id":"user_43","properties":"{\"url\":\"/settings\"}"}
Batch size requirements:
| Parameter | Value | Rationale |
|---|---|---|
| Minimum batch size | 1,000 rows | Below this, per-INSERT overhead dominates |
| Recommended batch size | 10,000-100,000 rows | Optimal merge behavior |
| Maximum batch size | 1,000,000 rows | Beyond this, INSERT timeout risk increases |
| Flush interval | 5-10 seconds | Balance latency vs batch size |
Services MUST NOT insert single rows per HTTP request. Single-row inserts
create thousands of tiny data parts that ClickHouse must merge, consuming
CPU and degrading query performance. The too many parts exception is a
hard failure mode caused by excessive single-row inserts.
Kafka Consumer (Streaming)
For high-throughput services that cannot batch in-process, ClickHouse's built-in Kafka engine consumes directly from Kafka topics (HIP-0030).
CREATE TABLE events_kafka ON CLUSTER '{cluster}'
(
uuid String,
team_id UInt64,
event String,
timestamp String,
distinct_id String,
properties String
)
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'hanzo_events',
kafka_group_name = 'clickhouse_events_consumer',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 4,
kafka_max_block_size = 65536;
CREATE MATERIALIZED VIEW events_kafka_mv ON CLUSTER '{cluster}'
TO events
AS SELECT
toUUID(uuid) AS uuid,
team_id,
event,
parseDateTimeBestEffort(timestamp) AS timestamp,
now64(3) AS created_at,
distinct_id,
properties
FROM events_kafka;
This pattern decouples ingestion from query serving. Kafka absorbs traffic spikes; ClickHouse consumes at its own pace.
Query Interface
ClickHouse uses SQL with extensions. All Hanzo services MUST query ClickHouse using standard SQL via the HTTP interface or the native TCP protocol.
HTTP Interface (Port 8123)
# Simple query
curl 'http://clickhouse:8123/?query=SELECT+count()+FROM+events+WHERE+team_id=1'
# Parameterized query (prevents SQL injection)
curl 'http://clickhouse:8123/' \
--data-urlencode "param_team_id=1" \
--data-urlencode "param_start=2026-02-01" \
-d "query=SELECT count() FROM events WHERE team_id = {team_id:UInt64} AND timestamp >= {start:Date}"
Common Query Patterns
Funnel analysis (Insights):
SELECT
level,
count() AS users
FROM
(
SELECT
distinct_id,
windowFunnel(86400)(
timestamp,
event = 'signup',
event = 'onboarding_complete',
event = 'first_api_call'
) AS level
FROM events
WHERE team_id = 1
AND timestamp >= '2026-02-01'
AND timestamp < '2026-03-01'
GROUP BY distinct_id
)
GROUP BY level
ORDER BY level DESC;
Retention analysis (Insights):
SELECT
cohort_date,
period,
count(DISTINCT distinct_id) AS retained_users
FROM
(
SELECT
distinct_id,
toDate(min(timestamp)) AS cohort_date,
dateDiff('week', toDate(min(timestamp)), toDate(timestamp)) AS period
FROM events
WHERE team_id = 1
AND event IN ('$pageview', '$screen')
AND timestamp >= '2026-01-01'
GROUP BY distinct_id, toDate(timestamp)
)
GROUP BY cohort_date, period
ORDER BY cohort_date, period;
Token usage dashboard (Billing):
SELECT
organization_id,
model,
sum(total_prompt_tokens) AS prompt_tokens,
sum(total_completion_tokens) AS completion_tokens,
sum(total_cost_usd) AS total_cost
FROM token_usage_daily
WHERE day >= today() - 30
GROUP BY organization_id, model
ORDER BY total_cost DESC
LIMIT 100;
Partitioning Strategy
All tables MUST use time-based partitioning:
| Table | Partition Key | Granularity | Rationale |
|---|---|---|---|
| events | toYYYYMM(timestamp) | Monthly | Balances partition count vs pruning efficiency |
| sessions | toYYYYMM(session_start) | Monthly | Aligned with events |
| events_hourly | toYYYYMM(hour) | Monthly | Rollups are small per partition |
| token_usage_daily | toYYYYMM(day) | Monthly | Consistent with events |
Why monthly, not daily: Daily partitions create 365 partitions per year. With 2 replicas, that is 730 data parts to track per table. After 3 years of data, the partition count becomes unwieldy. Monthly partitions keep the count manageable (36 per year) while still enabling efficient date-range pruning.
TTL (Time-to-Live)
Raw event data has a default retention of 12 months. Rollup tables retain data for 36 months. TTL is enforced at the table level:
ALTER TABLE events ON CLUSTER '{cluster}'
MODIFY TTL timestamp + INTERVAL 12 MONTH DELETE;
ALTER TABLE events_hourly ON CLUSTER '{cluster}'
MODIFY TTL hour + INTERVAL 36 MONTH DELETE;
ALTER TABLE token_usage_daily ON CLUSTER '{cluster}'
MODIFY TTL day + INTERVAL 36 MONTH DELETE;
TTL is evaluated during background merges. Expired data is dropped during merge, not immediately upon expiration. This is a lazy deletion model that avoids write amplification.
Organizations on enterprise plans MAY override TTL via the team_settings
table in PostgreSQL. The ingestion service respects per-team TTL overrides
when creating partitions.
Compression
ClickHouse supports per-column codec selection. The default codec is LZ4, which provides fast compression/decompression with moderate ratios. For columns that benefit from higher compression, ZSTD is used.
CREATE TABLE events
(
uuid UUID CODEC(LZ4),
team_id UInt64 CODEC(Delta, LZ4),
distinct_id String CODEC(ZSTD(3)),
session_id String CODEC(ZSTD(3)),
event String CODEC(ZSTD(1)),
timestamp DateTime64(3) CODEC(DoubleDelta, LZ4),
properties String CODEC(ZSTD(3)),
url String CODEC(ZSTD(3)),
browser LowCardinality(String) CODEC(ZSTD(1)),
os LowCardinality(String) CODEC(ZSTD(1)),
country_code LowCardinality(FixedString(2)) CODEC(LZ4),
tokens_prompt UInt32 CODEC(Delta, LZ4),
tokens_completion UInt32 CODEC(Delta, LZ4),
cost_usd Decimal64(8) CODEC(Delta, LZ4)
-- ... remaining columns
);
Codec selection rationale:
| Codec | Used For | Why |
|---|---|---|
LZ4 | UUIDs, fixed-width types | Fast decompression, acceptable ratio |
ZSTD(1) | Low-cardinality strings | Better ratio than LZ4, still fast |
ZSTD(3) | High-cardinality strings (URLs, properties) | 2-3x better ratio than LZ4 |
Delta + LZ4 | Monotonic integers (team_id, timestamps) | Delta encodes small differences; LZ4 compresses the deltas |
DoubleDelta + LZ4 | Timestamps | Timestamps increment by near-constant intervals; double-delta reduces to near-zero |
Expected compression ratios on production data:
| Column | Uncompressed | Compressed | Ratio |
|---|---|---|---|
| timestamp | 8 bytes/row | 0.3 bytes/row | 27x |
| team_id | 8 bytes/row | 0.1 bytes/row | 80x |
| properties (JSON) | 500 bytes/row avg | 40 bytes/row avg | 12x |
| url | 80 bytes/row avg | 8 bytes/row avg | 10x |
| Overall | ~1200 bytes/row | ~80 bytes/row | ~15x |
Replication
Production deployments MUST use a 2-replica configuration. ClickHouse replication is asynchronous and log-based:
- When a write arrives at Replica A, it is written to the local MergeTree and a log entry is created in ClickHouse Keeper.
- Replica B polls the log and fetches the new data part from Replica A.
- Both replicas are eventually consistent. Replication lag is typically under 1 second.
Cluster Configuration
<!-- /etc/clickhouse-server/config.d/cluster.xml -->
<clickhouse>
<remote_servers>
<hanzo_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-0</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-1</host>
<port>9000</port>
</replica>
</shard>
</hanzo_cluster>
</remote_servers>
<macros>
<cluster>hanzo_cluster</cluster>
<shard>01</shard>
<replica>clickhouse-{replica_number}</replica>
</macros>
</clickhouse>
ClickHouse Keeper Configuration
Since ClickHouse 24.x, the built-in ClickHouse Keeper replaces ZooKeeper. A 3-node Keeper ensemble provides coordination for replication:
<!-- /etc/clickhouse-server/config.d/keeper.xml -->
<clickhouse>
<keeper_server>
<tcp_port>9181</tcp_port>
<server_id>1</server_id>
<raft_configuration>
<server>
<id>1</id>
<hostname>keeper-0</hostname>
<port>9234</port>
</server>
<server>
<id>2</id>
<hostname>keeper-1</hostname>
<port>9234</port>
</server>
<server>
<id>3</id>
<hostname>keeper-2</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
</clickhouse>
Why 3 Keeper nodes, not 2: Raft consensus requires a majority quorum. With 2 nodes, losing 1 node means no quorum and the cluster cannot coordinate writes. With 3 nodes, the cluster tolerates 1 node failure.
Backup and Recovery
Backups follow the Hanzo backup standard using S3-compatible storage (MinIO, per HIP-0032).
Backup Strategy
# Full backup to S3
clickhouse-backup create --tables 'default.events,default.sessions' daily_2026_02_23
clickhouse-backup upload daily_2026_02_23
# Incremental backup (only changed parts since last full)
clickhouse-backup create_remote --diff-from-remote daily_2026_02_23 incremental_2026_02_24
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Full | Weekly (Sunday 02:00 UTC) | 4 weeks | MinIO (HIP-0032) |
| Incremental | Daily (02:00 UTC) | 7 days | MinIO (HIP-0032) |
| Partition-level | On TTL drop | 30 days | MinIO (HIP-0032) |
Backup Configuration
# /etc/clickhouse-backup/config.yml
general:
remote_storage: s3
backups_to_keep_local: 2
backups_to_keep_remote: 28
s3:
access_key: "${MINIO_ACCESS_KEY}"
secret_key: "${MINIO_SECRET_KEY}"
bucket: "clickhouse-backups"
endpoint: "http://minio:9000"
region: "us-east-1"
path: "hanzo-k8s/"
compression_format: "zstd"
Recovery Procedure
# List available backups
clickhouse-backup list remote
# Download and restore
clickhouse-backup download daily_2026_02_23
clickhouse-backup restore daily_2026_02_23
# Verify row counts
clickhouse-client -q "SELECT count() FROM events"
Recovery time objective (RTO): < 1 hour for full cluster restore. Recovery point objective (RPO): < 24 hours (last daily backup).
Monitoring
ClickHouse exposes internal metrics through system tables. Hanzo Zap (HIP-0031) scrapes these and exports them as Prometheus metrics.
Key System Tables
| Table | Purpose | Query Frequency |
|---|---|---|
system.metrics | Current server state (connections, queries, merges) | Every 15s |
system.events | Cumulative counters (inserts, selects, bytes read) | Every 15s |
system.asynchronous_metrics | Background metrics (memory, disk, replication lag) | Every 60s |
system.query_log | Query history with timing and resource usage | On demand |
system.parts | Data part inventory per table | Every 60s |
system.replicas | Replication status and lag | Every 30s |
Critical Alerts
| Alert | Condition | Severity |
|---|---|---|
| Too Many Parts | SELECT count() FROM system.parts WHERE active AND table = 'events' > 300 | Critical |
| Replication Lag | SELECT replica_delay FROM system.replicas WHERE table = 'events' > 300 seconds | Warning |
| Disk Usage | SELECT free_space FROM system.disks WHERE name = 'default' < 10 GB | Critical |
| Query Duration | p99 query time > 30 seconds | Warning |
| Insert Failures | system.events counter FailedInsertQuery increasing | Critical |
| Keeper Disconnect | SELECT * FROM system.zookeeper WHERE path = '/clickhouse' fails | Critical |
Grafana Dashboard
The standard Grafana dashboard for ClickHouse includes panels for:
- Ingestion rate: Rows inserted per second (from
system.events) - Query performance: p50/p95/p99 query latency (from
system.query_log) - Active parts: Part count per table (from
system.parts) - Replication lag: Seconds behind leader (from
system.replicas) - Disk usage: Used/free per disk (from
system.disks) - Memory usage: Resident memory and cache hit rates
- Merge throughput: Background merge rate and duration
- Compression ratio: Raw vs compressed bytes per table
Deployment
Docker (Development)
# compose.yml
services:
clickhouse:
image: ghcr.io/hanzoai/datastore:latest
ports:
- "8123:8123" # HTTP
- "9000:9000" # Native TCP
volumes:
- clickhouse_data:/var/lib/clickhouse
- clickhouse_logs:/var/log/clickhouse-server
environment:
CLICKHOUSE_DB: hanzo
CLICKHOUSE_USER: hanzo
CLICKHOUSE_PASSWORD: "${CLICKHOUSE_PASSWORD}"
ulimits:
nofile:
soft: 262144
hard: 262144
healthcheck:
test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
interval: 10s
timeout: 5s
retries: 3
volumes:
clickhouse_data:
clickhouse_logs:
Kubernetes (Production)
Production deployments use the ClickHouse Operator for lifecycle management.
# k8s/clickhouse-installation.yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
name: hanzo-datastore
namespace: hanzo
spec:
configuration:
clusters:
- name: hanzo_cluster
layout:
shardsCount: 1
replicasCount: 2
templates:
podTemplate: clickhouse-pod
volumeClaimTemplate: data-volume
settings:
max_concurrent_queries: 200
max_memory_usage: 10000000000 # 10 GB per query
max_bytes_before_external_group_by: 5000000000 # 5 GB
max_execution_time: 300 # 5 minutes
log_queries: 1
merge_tree/max_parts_in_total: 100000
profiles:
default/max_memory_usage: 10000000000
default/max_execution_time: 300
readonly/readonly: 1
readonly/max_execution_time: 60
users:
hanzo/password_sha256_hex: "${CLICKHOUSE_PASSWORD_SHA256}"
hanzo/networks/ip: "10.0.0.0/8"
hanzo/profile: default
hanzo/quota: default
readonly/password_sha256_hex: "${CLICKHOUSE_READONLY_PASSWORD_SHA256}"
readonly/networks/ip: "10.0.0.0/8"
readonly/profile: readonly
templates:
podTemplates:
- name: clickhouse-pod
spec:
containers:
- name: clickhouse
image: ghcr.io/hanzoai/datastore:latest
resources:
requests:
memory: 8Gi
cpu: "4"
limits:
memory: 16Gi
cpu: "8"
volumeClaimTemplates:
- name: data-volume
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 500Gi
storageClassName: do-block-storage
Resource Sizing Guide
| Cluster Size | Events/Day | CPU | RAM | Disk | Replicas |
|---|---|---|---|---|---|
| Small (dev) | < 1M | 2 cores | 4 GB | 50 GB | 1 |
| Medium | 1M - 50M | 4-8 cores | 8-16 GB | 200-500 GB | 2 |
| Large | 50M - 500M | 16-32 cores | 32-64 GB | 1-2 TB | 2 |
| XL | 500M+ | 64+ cores | 128+ GB | 4+ TB | 2 (multi-shard) |
Security
Authentication
ClickHouse MUST require authentication for all connections. Two user profiles are provisioned:
| User | Purpose | Permissions |
|---|---|---|
hanzo | Application writes and reads | Full DDL + DML on hanzo database |
readonly | Dashboard queries, Grafana | SELECT only, 60s query timeout |
Passwords are stored as SHA-256 hashes in the ClickHouse users configuration. Credentials are managed through KMS (HIP-0033) and injected as Kubernetes secrets.
Network
- HTTP port (8123) and native port (9000) MUST be accessible only from within
the Kubernetes cluster network (
10.0.0.0/8). - Native TLS port (9440) is used for cross-cluster replication if needed.
- No ports are exposed to the public internet.
- Kubernetes NetworkPolicy restricts access to pods with the label
clickhouse-client: "true".
Data Classification
| Data Type | Sensitivity | Handling |
|---|---|---|
| Event metadata (timestamps, counts) | Low | Standard retention |
| User identifiers (distinct_id) | Medium | Hashed or pseudonymized |
| URLs and referrers | Medium | Retained per TTL, no PII |
| Properties (custom JSON) | Variable | May contain PII; TTL enforced |
| LLM request content | High | NOT stored in analytics; only token counts and costs |
Services MUST NOT store raw LLM prompts or completions in the analytics datastore. Only aggregate metadata (model, token counts, latency, cost) is permitted.
Integration Points
Insights (HIP-0017)
Insights is the primary consumer of the analytics datastore. The integration flow is:
SDK (browser/server) --> Capture Service (Rust) --> Kafka --> ClickHouse
--> PostgreSQL (metadata only)
Insights queries ClickHouse for:
- Event trends and breakdowns
- Funnel analysis (multi-step conversion)
- Retention cohorts
- Session replay event sequences
- Feature flag exposure tracking
- A/B test statistical analysis
Metrics (HIP-0031)
Zap (HIP-0031) ships structured logs to ClickHouse for long-term storage. Prometheus handles short-term metrics (15-day retention). For queries spanning weeks or months, Grafana queries ClickHouse via the ClickHouse data source plugin.
Zap sidecar --> Prometheus (15-day) --> remote_write --> ClickHouse (36-month)
Billing Analytics
The billing pipeline tracks per-organization resource consumption:
-- Monthly invoice line items
SELECT
organization_id,
model,
sum(tokens_prompt + tokens_completion) AS total_tokens,
sum(cost_usd) AS total_cost,
count() AS api_calls
FROM events
WHERE event = '$llm_request'
AND organization_id = 'org_123'
AND timestamp >= '2026-02-01'
AND timestamp < '2026-03-01'
GROUP BY organization_id, model
ORDER BY total_cost DESC;
This query runs against the token_usage_daily materialized view in
production for sub-second response times.
Migration Path
From PostgreSQL Analytics Tables
Services currently storing analytics data in PostgreSQL MUST migrate to ClickHouse:
-
Schema mapping: Convert PostgreSQL table schema to ClickHouse DDL. Replace
SERIALwithUInt64,JSONBwithString,TIMESTAMPTZwithDateTime64(3, 'UTC'). -
Data migration: Use
clickhouse-clientwithINSERT INTO ... SELECTfrom PostgreSQL via thepostgresql()table function:INSERT INTO events SELECT * FROM postgresql( 'postgres:5432', 'hanzo_iam', 'analytics_events', 'hanzo', 'password' ); -
Dual-write period: Write to both PostgreSQL and ClickHouse for 7 days. Validate row counts and query results match.
-
Cutover: Switch reads to ClickHouse. Stop writes to PostgreSQL. Drop PostgreSQL analytics tables after 30-day grace period.
Version Upgrades
ClickHouse follows a calendar-based release cycle (YY.M). Upgrades MUST follow this procedure:
- Read the changelog for breaking changes.
- Test the upgrade on a staging replica.
- Upgrade one replica at a time (rolling upgrade).
- Verify replication health after each replica upgrade.
- Run validation queries to confirm data integrity.
Rationale
The analytics datastore is a critical piece of Hanzo infrastructure that sits between event ingestion (HIP-0017, HIP-0030) and user-facing dashboards (Insights, Grafana, billing). The choice of ClickHouse is driven by three non-negotiable requirements:
-
Sub-second queries on billions of rows. Interactive dashboards require that users see results immediately. No row-oriented database can deliver this at our data volume.
-
10x-100x compression over PostgreSQL. Storage costs scale linearly with data volume. Columnar compression is the only way to keep analytics data for 12+ months affordably.
-
Self-hosted with zero per-query costs. Managed analytics services charge by the query or by data scanned. At Hanzo's query volume, this creates unpredictable and unacceptable costs.
ClickHouse satisfies all three. It is battle-tested at companies processing trillions of events (Uber, Cloudflare, eBay, GitLab). It is open-source (Apache 2.0). It has a single-binary deployment model that fits our Kubernetes-native infrastructure.
Backwards Compatibility
This HIP introduces a new infrastructure component. There are no backwards compatibility concerns for existing services because:
- PostgreSQL (HIP-0029) continues to serve transactional workloads unchanged.
- Services currently using PostgreSQL for analytics will undergo a phased migration (see Migration Path).
- The ClickHouse query interface (SQL over HTTP) requires no new client libraries -- any HTTP client can query ClickHouse.
Reference Implementation
The reference implementation is at github.com/hanzoai/datastore and includes:
Dockerfile: ClickHouse image with Hanzo-specific configurationschemas/: DDL for all tables and materialized viewsconfig/: ClickHouse server configuration templatesk8s/: Kubernetes manifests and ClickHouse Operator CRDbackup/: Backup scripts and CronJob manifestsgrafana/: Dashboard JSON exportstests/: Integration tests usingclickhouse-client
Copyright
This document is placed in the public domain.