HIP-46: Embeddings Standard

Abstract

This proposal defines the Embeddings Standard for the Hanzo ecosystem. It specifies a unified, OpenAI-compatible API for text and multimodal embedding generation served through the LLM Gateway (HIP-0004). The standard covers multi-provider routing, batch processing, dimension reduction, caching, chunking strategies, and integration with downstream consumers including Search (HIP-0012), Vector stores (HIP-0042), Agent memory (HIP-0009), and Chat RAG pipelines.

Gateway Endpoint: POST /v1/embeddings
Gateway Port: 4000
Protocol: OpenAI-compatible REST API

Motivation

Embeddings are the foundational representation layer for semantic operations across the Hanzo ecosystem. Every system that performs retrieval, similarity matching, clustering, or semantic search depends on dense vector representations of text, images, or other modalities.

Current challenges:

Provider Diversity: OpenAI, Cohere, and local models each excel at different embedding tasks. No single provider dominates all use cases.
Dimension-Cost Tradeoff: Applications need different vector sizes. A 256-dim vector suffices for coarse filtering; 3072-dim vectors are needed for fine-grained retrieval.
Scale: RAG pipelines ingest millions of document chunks. Per-request embedding is too slow; batch processing is required.
Redundant Computation: The same text fragments are embedded repeatedly across services. Caching eliminates this waste.
Fragmented Integration: Without a standard, each service implements its own embedding client, chunking logic, and normalization. Bugs and inconsistencies multiply.

This HIP provides a single specification that all Hanzo services MUST use for embedding generation.

Design Philosophy

Why Multi-Provider Embeddings

Different embedding models have different strengths. OpenAI text-embedding-3-large produces the highest-quality vectors for English retrieval tasks. text-embedding-3-small runs 5x faster at roughly 15% of the cost ($0.02 versus $0.13 per 1M tokens) with acceptable quality. Cohere embed-v3 leads on multilingual benchmarks across 100+ languages. Local models via Engine eliminate network latency and data egress concerns for sensitive workloads.

The Gateway routes embedding requests to the optimal provider based on the model field in the request. Callers specify what they want; the Gateway handles where it runs. This decouples application logic from provider selection, enabling transparent migration when better models emerge.

Why OpenAI-Compatible API

The OpenAI embeddings API is the de facto standard. Every major embedding library, vector database, and RAG framework supports it natively. By implementing the same interface, any existing OpenAI client works against the Hanzo Gateway without code changes. A developer using openai.embeddings.create() in Python simply changes the base_url to the Gateway and gains multi-provider routing, caching, and cost optimization for free.

Inventing a proprietary API would require building and maintaining client libraries for every language, writing migration guides, and convincing developers to adopt a non-standard interface. The marginal benefit is zero; the cost is substantial.

Why Not Train Our Own Embedding Models

Training competitive embedding models requires:

Massive paired datasets: Billions of (query, document) pairs with relevance labels
Contrastive learning infrastructure: Multi-GPU training with hard negative mining
Evaluation benchmarks: MTEB, BEIR, and domain-specific test suites
Ongoing maintenance: Models degrade as language shifts; retraining is continuous

OpenAI and Cohere invest heavily in training data curation and model development. Their embedding models are commodities priced at a few cents per million tokens (see Available Models). The correct strategy is to leverage their models through the Gateway and focus engineering effort on the application layer: chunking, retrieval, re-ranking, and synthesis.

If a domain-specific embedding model is needed (e.g., for code, scientific papers, or blockchain data), fine-tuning an open model via Engine is the path. This is covered in the Local Models section.

Why Dimension Reduction Support

OpenAI text-embedding-3-* models support Matryoshka Representation Learning (MRL). The full text-embedding-3-large embedding has 3072 dimensions (1536 for text-embedding-3-small), but it can be shortened to 256, 512, or 1024 dimensions with graceful quality degradation. Shortening keeps the first N dimensions, which works only because the model is trained so that the first N dimensions capture the most important semantic information.

The practical impact for text-embedding-3-large, assuming float32 storage:

Dimensions	Storage per Vector	Index Memory (1M vectors)	Quality (MTEB)
3072	12,288 bytes	~12 GB	64.6
1024	4,096 bytes	~4 GB	62.1
512	2,048 bytes	~2 GB	60.8
256	1,024 bytes	~1 GB	58.4

For a vector store with 100M documents, reducing from 3072 to 256 dimensions saves 1.1 TB of storage and proportionally reduces query latency. The Gateway exposes this via the dimensions parameter, making it trivial for callers to select their quality-cost tradeoff.
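The storage figures above follow from 4 bytes per float32 component. The sketch below reproduces that arithmetic and shows how a caller holding a full-size MRL vector could shorten it locally by keeping the leading components and re-normalizing. The function names are illustrative and are not part of the Gateway API; when the dimensions parameter is sent, the shortening happens upstream and none of this is needed.

```python
import math

BYTES_PER_FLOAT32 = 4


def vector_bytes(dimensions: int) -> int:
    """Storage for one float32 vector of the given size."""
    return dimensions * BYTES_PER_FLOAT32


def storage_saved_bytes(num_vectors: int, full_dims: int, reduced_dims: int) -> int:
    """Bytes saved by storing reduced_dims instead of full_dims per vector."""
    return num_vectors * (vector_bytes(full_dims) - vector_bytes(reduced_dims))


def truncate_and_renormalize(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit length.

    Only meaningful for MRL-trained models; this is an assumption about
    client-side use, not a behavior the Gateway is specified to expose.
    """
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


saved = storage_saved_bytes(100_000_000, 3072, 256)
print(f"{saved / 1e12:.2f} TB saved")  # 1.13 TB saved
```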

Why Batch Processing

Embedding generation is embarrassingly parallel. Each input text is independent; there are no sequential dependencies. A single request accepts up to 2048 inputs in the input array, and the provider processes them concurrently. Compared with sending each input separately, this reduces:

Network overhead: 1 HTTP request instead of 2048
Authentication overhead: 1 API key validation instead of 2048
Connection overhead: 1 TLS handshake instead of 2048

In practice, batch embedding is 10-50x faster than sequential per-request embedding for document ingestion workloads. The Gateway further optimizes by splitting oversized batches into provider-optimal chunk sizes and reassembling results.
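The split-and-reassemble step can be sketched as follows, here on the caller side for a corpus larger than one request allows. `embed_batch` is a hypothetical callable standing in for one POST /v1/embeddings request that returns vectors in input order; the Gateway's internal splitting logic is not specified in this document and may differ.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    max_batch_size: int = 2048,
) -> List[Vector]:
    """Embed texts in slices of at most max_batch_size, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    vectors: List[Vector] = []
    for start in range(0, len(texts), max_batch_size):
        batch = texts[start:start + max_batch_size]
        result = embed_batch(batch)
        # Guard against a partial response silently misaligning results.
        if len(result) != len(batch):
            raise RuntimeError("embedding count does not match input count")
        vectors.extend(result)
    return vectors


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    """Stand-in for a real request: one 1-dim vector per text."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, max_batch_size=2))
# [[1.0], [2.0], [3.0]]
```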

Specification

API Endpoint

POST /v1/embeddings
Content-Type: application/json
Authorization: Bearer sk-hanzo-...

All requests go through the LLM Gateway (HIP-0004) at port 4000. The Gateway handles provider routing, authentication, rate limiting, and response normalization.

Request Schema

{
  "model": "text-embedding-3-large",
  "input": "The quick brown fox jumps over the lazy dog",
  "encoding_format": "float",
  "dimensions": 1024,
  "user": "user-abc123"
}

Fields

Field	Type	Required	Default	Description
`model`	string	Yes	-	Model identifier (see Available Models)
`input`	string or string[]	Yes	-	Text(s) to embed. Max 2048 items in array.
`encoding_format`	string	No	`"float"`	`"float"` or `"base64"`. Base64 reduces payload by ~25%.
`dimensions`	integer	No	Model default	Output dimensions. Only supported by MRL-capable models.
`user`	string	No	-	End-user identifier for abuse tracking.

Input Constraints

String input: Single text, max token length per model (see Available Models table).
Array input: Array of strings, max 2048 items. Total tokens across all items must not exceed the model's batch token limit.
Empty strings: Rejected with 400 error. Embeddings of empty text are undefined.
Encoding: Input MUST be valid UTF-8. The Gateway rejects malformed sequences.
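A client can catch most of these violations before sending a request. The sketch below is one possible pre-flight check; it returns an error code from the Error Codes table or None. Two points are assumptions: empty strings and empty arrays are mapped to invalid_request (the document does not name a code for them), and token limits are not checked because that requires the model's tokenizer.

```python
from typing import Optional

MAX_ITEMS = 2048


def check_input(value: object) -> Optional[str]:
    """Return an error code for an invalid `input` value, or None if it passes."""
    items = [value] if isinstance(value, str) else value
    if not isinstance(items, list) or not all(isinstance(t, str) for t in items):
        return "invalid_request"
    if not items:
        return "invalid_request"  # assumption: an empty array is rejected
    if len(items) > MAX_ITEMS:
        return "batch_too_large"
    for text in items:
        if text == "":
            return "invalid_request"  # assumption: code for empty strings
        try:
            # Python strings may hold lone surrogates, which are not valid UTF-8.
            text.encode("utf-8")
        except UnicodeEncodeError:
            return "invalid_request"
    return None


print(check_input("hello"))         # None
print(check_input(["x"] * 2049))    # batch_too_large
```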

Response Schema

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023, -0.0092, 0.0156, ...],
      "index": 0
    }
  ],
  "model": "text-embedding-3-large",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

Fields

Field	Type	Description
`object`	string	Always `"list"`.
`data`	array	Array of embedding objects, one per input.
`data[].object`	string	Always `"embedding"`.
`data[].embedding`	float[]	The embedding vector. Length equals `dimensions`.
`data[].index`	integer	Position in the input array (0-indexed).
`model`	string	The model that generated the embeddings.
`usage.prompt_tokens`	integer	Total tokens consumed across all inputs.
`usage.total_tokens`	integer	Same as `prompt_tokens` (embeddings have no output).

Base64 Encoding

When encoding_format is "base64", the embedding field contains a base64-encoded string of little-endian float32 values instead of a JSON array:

{
  "object": "embedding",
  "embedding": "AGDwPQAAkL0AAHA+...",
  "index": 0
}

Decoding in Python:

import base64
import numpy as np

raw = base64.b64decode(data["embedding"])
# "<f4" pins little-endian float32 regardless of the host byte order
vector = np.frombuffer(raw, dtype="<f4")

Base64 encoding reduces JSON payload size by approximately 25% and eliminates floating-point serialization overhead. Recommended for batch operations.
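For clients that cannot depend on NumPy, the same wire format can be handled with the standard library. This is a minimal sketch assuming only what is stated above (little-endian float32 values, base64-encoded); the helper names are illustrative.

```python
import base64
import struct


def encode_embedding(vector: list[float]) -> str:
    """Pack floats as little-endian float32 and base64-encode the bytes."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(raw).decode("ascii")


def decode_embedding(encoded: str) -> list[float]:
    """Inverse of encode_embedding."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("payload is not a whole number of float32 values")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Values exactly representable in float32 survive the round trip unchanged.
print(decode_embedding(encode_embedding([0.5, -1.0, 0.25])))  # [0.5, -1.0, 0.25]
```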

Available Models

Model	Provider	Dimensions	Max Tokens	Pricing (per 1M tokens)	Strengths
`text-embedding-3-large`	OpenAI	3072*	8191	$0.13	Highest quality English
`text-embedding-3-small`	OpenAI	1536*	8191	$0.02	Fast, cost-effective
`text-embedding-ada-002`	OpenAI	1536	8191	$0.10	Legacy, wide compatibility
`embed-english-v3.0`	Cohere	1024	512	$0.10	English retrieval
`embed-multilingual-v3.0`	Cohere	1024	512	$0.10	100+ languages
`embed-english-light-v3.0`	Cohere	384	512	$0.10	Lightweight English
`hanzo/bge-large`	Engine	1024	8192	Self-hosted	On-prem, no data egress
`hanzo/e5-large`	Engine	1024	512	Self-hosted	On-prem, multilingual
`hanzo/clip-vit-l-14`	Engine	768	N/A	Self-hosted	Multimodal (text + image)

*Supports Matryoshka dimension reduction: 256, 512, 1024, or full size.

Model Selection Guidelines

Use Cases:
  high_quality_retrieval:
    model: text-embedding-3-large
    dimensions: 3072
    rationale: Maximum semantic fidelity for critical search

  general_rag:
    model: text-embedding-3-large
    dimensions: 1024
    rationale: Good quality at 1/3 storage cost

  real_time_search:
    model: text-embedding-3-small
    dimensions: 512
    rationale: Sub-10ms latency, acceptable quality

  multilingual:
    model: embed-multilingual-v3.0
    rationale: Best cross-language performance

  cost_sensitive:
    model: text-embedding-3-small
    dimensions: 256
    rationale: Lowest cost, suitable for coarse filtering

  air_gapped:
    model: hanzo/bge-large
    rationale: No external API calls, full data sovereignty

  multimodal:
    model: hanzo/clip-vit-l-14
    rationale: Shared text-image embedding space

Batch API

For bulk embedding workloads (document ingestion, index rebuilding), the batch API provides asynchronous processing with higher throughput limits and lower per-token cost.

Create Batch

POST /v1/embeddings/batches
Content-Type: application/json
Authorization: Bearer sk-hanzo-...

{
  "model": "text-embedding-3-large",
  "input_file_id": "file-abc123",
  "dimensions": 1024,
  "encoding_format": "base64",
  "metadata": {
    "purpose": "document_ingestion",
    "collection": "knowledge_base_v2"
  }
}

The input_file_id references a JSONL file uploaded via the Files API, where each line is:

{"custom_id": "doc-001", "input": "First document text..."}
{"custom_id": "doc-002", "input": "Second document text..."}

Batch Response

Returns a batch object with id, status (queued / processing / completed / failed), request_counts (total/completed/failed), and output_file_id. Poll status via GET /v1/embeddings/batches/{batch_id}. When completed, the output file is JSONL with custom_id, embedding, index, and tokens per line.
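The client side of this flow can be sketched as below: build the JSONL input, then poll until the batch reaches a terminal status. `get_status` is a hypothetical callable that performs GET /v1/embeddings/batches/{batch_id} and returns the parsed batch object; the polling interval and attempt limit are illustrative defaults, not values from this standard.

```python
import json
import time
from typing import Callable, Iterable, Tuple

TERMINAL_STATUSES = {"completed", "failed"}


def build_batch_jsonl(docs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (custom_id, text) pairs as one JSON object per line."""
    return "".join(
        json.dumps({"custom_id": custom_id, "input": text}, ensure_ascii=False) + "\n"
        for custom_id, text in docs
    )


def wait_for_batch(
    get_status: Callable[[str], dict],
    batch_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the batch is completed or failed, then return the batch object."""
    for _ in range(max_polls):
        batch = get_status(batch_id)
        if batch["status"] in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"batch {batch_id} did not finish after {max_polls} polls")


print(build_batch_jsonl([("doc-001", "First document text...")]), end="")
# {"custom_id": "doc-001", "input": "First document text..."}
```

Callers should still inspect status and request_counts on the returned object, since a batch can finish with some failed items.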

Batch Pricing

Batch embeddings are billed at 50% of the real-time rate. For text-embedding-3-large: $0.065 per 1M tokens instead of $0.13.

Caching

The Gateway implements an exact-match embedding cache, keyed on a hash of the normalized input, to eliminate redundant computation.

Cache Architecture

┌────────────────┐     ┌───────────────┐     ┌──────────────┐
│  Embedding     │────▶│  Cache Layer  │────▶│  Provider    │
│  Request       │     │  (Redis)      │     │  (OpenAI,    │
│                │◀────│               │◀────│   Cohere)    │
└────────────────┘     └───────────────┘     └──────────────┘

Cache Key Computation

Cache keys are computed as:

key = SHA-256(model || dimensions || encoding_format || normalize(input))

Where normalize(input) applies:

Unicode NFC normalization
Whitespace collapsing (multiple spaces to single)
Leading/trailing whitespace removal

This ensures that inputs differing only in trivial formatting produce cache hits.
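A minimal sketch of the key computation follows. Two details are assumptions rather than part of the specification: fields are joined with a unit-separator byte so that adjacent fields cannot run together ambiguously (the formula above shows plain concatenation), and "whitespace collapsing" is read literally as collapsing runs of space characters, leaving tabs and newlines alone.

```python
import hashlib
import re
import unicodedata
from typing import Optional

FIELD_SEPARATOR = "\x1f"  # assumption: not specified by the standard


def normalize_text(text: str) -> str:
    """NFC-normalize, collapse runs of spaces, strip leading/trailing whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


def cache_key(model: str, dimensions: Optional[int], encoding_format: str, text: str) -> str:
    """Hex SHA-256 over the request fields and the normalized input."""
    dims = "default" if dimensions is None else str(dimensions)
    payload = FIELD_SEPARATOR.join([model, dims, encoding_format, normalize_text(text)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = cache_key("text-embedding-3-large", 1024, "float", "hello   world ")
b = cache_key("text-embedding-3-large", 1024, "float", "hello world")
print(a == b)  # True
```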

Cache Configuration

embedding_cache:
  enabled: true
  backend: redis
  ttl: 86400          # 24 hours default
  max_entries: 10_000_000
  max_memory: 8GB
  eviction: lru

  # Per-model TTL overrides
  model_ttl:
    text-embedding-3-large: 604800   # 7 days (expensive)
    text-embedding-3-small: 86400    # 1 day (cheap)

  # Skip cache for these patterns
  bypass:
    - "hanzo/*"        # Local models have no API cost

Cache Metrics

The Gateway exposes cache metrics via Prometheus (HIP-0031):

hanzo_embedding_cache_hits_total{model="text-embedding-3-large"}
hanzo_embedding_cache_misses_total{model="text-embedding-3-large"}
hanzo_embedding_cache_hit_rate{model="text-embedding-3-large"}
hanzo_embedding_cache_size_bytes
hanzo_embedding_cache_evictions_total

Chunking Strategies

Embedding models have finite context windows. Documents exceeding the model's max token limit MUST be split into chunks before embedding. The Gateway does not perform chunking; it is the responsibility of the caller. Three strategies are recommended.

Fixed-Size Chunking: Split text into chunks of a fixed token count with configurable overlap. Default: 512 tokens, 64-token overlap (12.5%). Simple, predictable, works for uniform content like chat logs.

Semantic Chunking: Split at natural boundaries (paragraphs, sections) while respecting token limits. Produces higher-quality embeddings because each chunk contains a coherent unit of meaning. Best for structured documents.

Recursive Chunking: Split hierarchically -- first by section headers, then paragraphs, then sentences, then token boundaries. Most robust for heterogeneous documents. Falls back gracefully when structure is absent.
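Fixed-size chunking with overlap is the simplest of the three to illustrate. The sketch below slides a window over a token list; it uses whitespace-separated words as a stand-in for tokens, which is an assumption made to keep the example dependency-free. Real token counts require the target model's tokenizer.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing chunk
        # that only repeats the overlap region.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Whitespace-token approximation of fixed-size chunking."""
    return [" ".join(chunk) for chunk in fixed_size_chunks(text.split(), chunk_size, overlap)]


print(chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```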

Strategy Selection

Document Type	Strategy	Chunk Size	Overlap	Rationale
Knowledge base	Semantic	512	-	Coherent units for retrieval
Source code	Recursive	1024	128	Preserve function boundaries
Chat logs	Fixed	256	32	Uniform message lengths
Academic papers	Recursive	512	64	Respect section structure
API documentation	Semantic	384	-	One endpoint per chunk

Normalization

All embedding vectors MUST be L2-normalized before storage in vector databases. Normalized vectors have unit length (||v|| = 1), which makes cosine similarity equivalent to dot product. Dot product is computationally cheaper (no division), so normalization enables faster similarity search.

Normalization Formula

v_normalized = v / ||v||_2

where ||v||_2 = sqrt(sum(v_i^2))

Implementation

import numpy as np

def normalize(embedding: list[float]) -> list[float]:
    v = np.array(embedding, dtype=np.float32)
    norm = np.linalg.norm(v)
    if norm == 0:
        return v.tolist()
    return (v / norm).tolist()

OpenAI models return pre-normalized vectors, so the Gateway passes them through unchanged. Cohere models do not; the Gateway normalizes Cohere outputs before returning them to callers. Engine (local) models vary; the Gateway checks their outputs and normalizes as needed.

Integration Points

Vector Store Ingestion (HIP-0042)

The primary consumer of embeddings. Documents are chunked, embedded, and inserted into vector collections:

from hanzo import embeddings, vector

# Chunk the document
chunks = semantic_chunk(document.text, max_tokens=512)

# Generate embeddings via Gateway
response = embeddings.create(
    model="text-embedding-3-large",
    input=chunks,
    dimensions=1024
)

# Insert into vector store
vector.upsert(
    collection="knowledge_base",
    vectors=[
        {
            "id": f"{document.id}-{d.index}",
            "values": d.embedding,
            "metadata": {
                "document_id": document.id,
                "chunk_index": d.index,
                "text": chunks[d.index]
            }
        }
        for d in response.data
    ]
)

Search Indexing (HIP-0012)

The Search engine uses embeddings for hybrid search (BM25 + vector similarity):

# Index document with embedding for hybrid search
search.index(
    index="documents",
    document={
        "id": doc.id,
        "title": doc.title,
        "content": doc.content,
        "_vectors": {
            "default": embeddings.create(
                model="text-embedding-3-large",
                # Character-level truncation: a coarse guard, not a token count
                input=doc.content[:8000],
                dimensions=1024
            ).data[0].embedding
        }
    }
)

Agent Memory (HIP-0009)

Agents embed memories for semantic recall. Store uses text-embedding-3-small at 512 dimensions for cost efficiency. Recall embeds the query with the same model and performs vector similarity search against the agent's memory collection.

Chat RAG Pipeline

Chat (HIP-0011) uses embeddings for retrieval-augmented generation: embed the user query, retrieve top-k relevant chunks from the vector store, inject them as system context, and generate the response via LLM Gateway. The embedding model for RAG queries should match the model used during document ingestion to ensure consistent vector space alignment.
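The retrieval step can be illustrated without a vector store. The sketch below scores stored chunks against a query vector by dot product, which equals cosine similarity for L2-normalized vectors, and refuses to compare vectors produced by different models. All names are illustrative; a production pipeline would delegate the search to the vector store (HIP-0042).

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Vector, chunks: List[Tuple[str, Vector]], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, text) pairs, best first."""
    scored = [(dot(query, vector), text) for text, vector in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_system_context(query_model: str, index_model: str, results: List[Tuple[float, str]]) -> str:
    """Join retrieved chunks; fail if query and index used different models."""
    if query_model != index_model:
        raise ValueError("query and index embeddings come from different models")
    return "\n\n".join(text for _, text in results)


index = [("alpha", [1.0, 0.0]), ("beta", [0.0, 1.0]), ("gamma", [0.6, 0.8])]
hits = top_k([1.0, 0.0], index, k=2)
print(hits)  # [(1.0, 'alpha'), (0.6, 'gamma')]
```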

Multimodal Embeddings

Multimodal embeddings map text and images into the same vector space, enabling cross-modal search (e.g., searching images with text queries).

CLIP-based models are available via Engine for self-hosted deployments. Text inputs use the standard input field. Image inputs use {"type": "image", "data": "<base64>"}. Both produce vectors in the same 768-dimensional space, enabling cross-modal similarity via dot product.

Roadmap

Phase	Model	Modalities	Status
1	CLIP ViT-L/14	Text + Image	Available
2	SigLIP	Text + Image	Planned
3	ImageBind	Text + Image + Audio + Video	Planned
4	Unified Embed	All modalities	Research

Multimodal embedding API extensions will be specified in a future HIP when the interface stabilizes.

Rate Limits and Quotas

Rate limits are enforced per API key at the Gateway level. Limits vary by model tier and account plan.

Default Rate Limits

Plan	Requests/min	Tokens/min	Batch Quota/day
Free	60	100,000	500,000 tokens
Developer	500	1,000,000	10M tokens
Team	3,000	10,000,000	100M tokens
Enterprise	10,000	50,000,000	Unlimited

All responses include X-RateLimit-* headers: Limit-Requests, Limit-Tokens, Remaining-Requests, Remaining-Tokens, Reset-Requests, Reset-Tokens. On 429, clients SHOULD implement exponential backoff with jitter.
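One way to implement the recommended backoff is the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. In the sketch below, `send` is a hypothetical callable that performs one request and returns (status, body); the base delay, cap, and attempt count are illustrative, not values from this standard.

```python
import random
import time
from typing import Callable, Tuple


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(
    send: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Tuple[int, object]:
    """Retry on HTTP 429 only; return the last response when attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("unreachable")
```

Jitter matters when many clients are throttled at once: without it, they all retry at the same instants and hit the limit again together.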

Error Handling

Errors follow the OpenAI error format: {"error": {"message", "type", "code", "param"}}.

Error Codes

HTTP Status	Code	Description
400	`invalid_request`	Malformed request body or missing fields
400	`invalid_model`	Model not found or not an embedding model
400	`invalid_dimensions`	Dimensions not supported by model
400	`input_too_long`	Input exceeds model's max token limit
400	`batch_too_large`	More than 2048 inputs in array
401	`invalid_api_key`	Missing or invalid API key
403	`insufficient_quota`	Account has no remaining embedding quota
429	`rate_limit_exceeded`	Rate limit hit; retry after indicated time
500	`provider_error`	Upstream provider returned an error
503	`provider_unavailable`	Provider is down; failover in progress

Monitoring and Observability

The Gateway exposes embedding-specific metrics via Prometheus (HIP-0031) and logs to the unified observability stack.

Prometheus Metrics

# Latency histogram by model and provider
hanzo_embedding_latency_seconds{model, provider, status}

# Request counter by model
hanzo_embedding_requests_total{model, provider, status, encoding_format}

# Token counter by model
hanzo_embedding_tokens_total{model, provider}

# Batch size histogram
hanzo_embedding_batch_size{model}

# Dimension usage counter
hanzo_embedding_dimensions_used{model, dimensions}

# Cache metrics
hanzo_embedding_cache_hits_total{model}
hanzo_embedding_cache_misses_total{model}
hanzo_embedding_cache_hit_rate{model}

# Provider health
hanzo_embedding_provider_up{provider}
hanzo_embedding_provider_latency_p99{provider}

Alert Rules

Recommended alerts: EmbeddingLatencyHigh (p99 > 2s for 5m), EmbeddingProviderDown (provider_up == 0 for 1m, critical), EmbeddingCacheHitRateLow (hit_rate < 0.3 for 15m), EmbeddingErrorRateHigh (error rate > 5% for 5m, critical).

Structured Logging

Every embedding request produces a structured JSON log entry containing: timestamp, model, provider, dimensions, input_count, total_tokens, latency_ms, cache_hit, user_id, and api_key_hash. Raw input text MUST NOT appear in logs at INFO level.

Gateway Routing Configuration

The Gateway routes embedding requests based on the model field. Provider selection, failover, and load balancing are configured in the Gateway config:

# /app/config.yaml (LLM Gateway)
embeddings:
  providers:
    openai:
      api_key: ${OPENAI_API_KEY}
      models:
        - text-embedding-3-large
        - text-embedding-3-small
        - text-embedding-ada-002
      rate_limit: 10000/min
      timeout: 30s
      retry:
        max_attempts: 3
        backoff: exponential

    cohere:
      api_key: ${COHERE_API_KEY}
      models:
        - embed-english-v3.0
        - embed-multilingual-v3.0
        - embed-english-light-v3.0
      rate_limit: 5000/min
      timeout: 30s

    engine:
      base_url: http://engine:8080
      models:
        - hanzo/bge-large
        - hanzo/e5-large
        - hanzo/clip-vit-l-14
      rate_limit: 0   # No limit for local models
      timeout: 10s

  failover:
    text-embedding-3-large:
      - text-embedding-3-small    # Same provider, smaller model
      - embed-english-v3.0        # Different provider
    embed-english-v3.0:
      - text-embedding-3-small    # Cross-provider failover

  defaults:
    encoding_format: float
    normalize: true
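The failover chain in this configuration can be read as "try the requested model, then each listed fallback in order". A minimal sketch of that resolution, assuming a hypothetical `is_available` health check per model; how the Gateway actually decides that a model is unavailable is not specified here.

```python
from typing import Callable, Dict, List, Optional

# Mirrors the failover section of the configuration above.
FAILOVER: Dict[str, List[str]] = {
    "text-embedding-3-large": ["text-embedding-3-small", "embed-english-v3.0"],
    "embed-english-v3.0": ["text-embedding-3-small"],
}


def resolve_model(requested: str, is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available model in the chain, or None if all are down."""
    for candidate in [requested, *FAILOVER.get(requested, [])]:
        if is_available(candidate):
            return candidate
    return None


up = {"embed-english-v3.0"}
print(resolve_model("text-embedding-3-large", lambda model: model in up))
# embed-english-v3.0
```

Because vectors from different models are not comparable, callers that store embeddings should check the model field of each response and avoid mixing vectors produced by a fallback into an existing collection.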

Client Examples

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hanzo.ai/v1",
    api_key="sk-hanzo-..."
)

# Single embedding
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Hanzo provides unified AI infrastructure",
    dimensions=1024
)
vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 1024

# Batch embedding
texts = ["First document", "Second document", "Third document"]
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
    encoding_format="base64"
)
for item in response.data:
    print(f"Index {item.index}: {len(item.embedding)} chars (base64)")

JavaScript (Hanzo SDK)

import { HanzoAI } from '@hanzoai/sdk'

const hanzo = new HanzoAI({ apiKey: 'sk-hanzo-...' })

const response = await hanzo.embeddings.create({
  model: 'text-embedding-3-large',
  input: ['Hello world', 'Goodbye world'],
  dimensions: 512
})

for (const item of response.data) {
  console.log(`Index ${item.index}: [${item.embedding.slice(0, 3).join(', ')}...]`)
}

cURL

curl -X POST https://llm.hanzo.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-hanzo-..." \
  -d '{
    "model": "text-embedding-3-large",
    "input": "Sample text",
    "dimensions": 1024
  }'

Security Considerations

Data in Transit: All API communication MUST use TLS 1.3. The Gateway terminates TLS; upstream provider calls use TLS.
Data at Rest: Cached embeddings in Redis SHOULD be encrypted at rest when the deployment requires it. Embeddings themselves are not directly invertible to source text, but they can leak semantic information.
API Key Scoping: Embedding API keys can be scoped to specific models and rate limits. A key with embedding:read scope cannot access chat completions.
Input Logging: Raw input text MUST NOT be logged at INFO level. Only token counts, model names, and latency are logged by default. Debug-level logging of inputs requires explicit opt-in.
Logged metadata is limited to the fields listed under Structured Logging.
Provider Key Isolation: Provider API keys (OpenAI, Cohere) are stored in KMS (HIP-0027) and injected at runtime. They never appear in config files, logs, or error messages.

Backwards Compatibility

This HIP introduces a new endpoint with no backwards compatibility concerns. The API is fully OpenAI-compatible. Future revisions MUST only add fields, never remove or rename them.

Test Vectors

Conformance tests: (1) embed "hello" with text-embedding-3-small at 8 dimensions, verify vector length and values within tolerance 0.001; (2) embed ["hello", "world"], verify 2 results with indices [0, 1]; (3) embed "hello" with text-embedding-3-large at 256 dimensions, verify L2 norm equals 1.0 within tolerance 0.0001.
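The structural parts of these checks (result count, indices, vector length, unit norm) can be expressed as a small validator over a parsed response body. This is a sketch with illustrative names; comparing vector values against reference vectors is omitted because no reference values are given in this document.

```python
import math
from typing import List


def check_response(response: dict, expected_count: int, expected_dims: int,
                   norm_tolerance: float = 1e-4) -> List[str]:
    """Return a list of conformance problems; an empty list means all checks passed."""
    problems: List[str] = []
    data = response.get("data", [])
    if len(data) != expected_count:
        problems.append(f"expected {expected_count} embeddings, got {len(data)}")
    indices = sorted(item["index"] for item in data)
    if indices != list(range(len(data))):
        problems.append(f"unexpected indices {indices}")
    for item in data:
        vector = item["embedding"]
        if len(vector) != expected_dims:
            problems.append(f"index {item['index']}: length {len(vector)} != {expected_dims}")
        norm = math.sqrt(sum(x * x for x in vector))
        if abs(norm - 1.0) > norm_tolerance:
            problems.append(f"index {item['index']}: L2 norm {norm:.6f} is not 1.0")
    return problems


sample = {"data": [{"index": 0, "embedding": [0.6, 0.8]},
                   {"index": 1, "embedding": [1.0, 0.0]}]}
print(check_response(sample, expected_count=2, expected_dims=2))  # []
```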
