HIP-81: Computer Vision Pipeline Standard

Abstract

This proposal defines the Computer Vision Pipeline standard for the Hanzo ecosystem. Hanzo Vision provides end-to-end computer vision capabilities -- from raw sensor input to structured output -- for both real-time and batch workloads. It handles images, video streams, 3D point clouds, depth maps, and thermal/infrared data through a unified, model-agnostic pipeline architecture.

The pipeline is organized into three stages: ingest (decode, normalize, and route sensor data), process (run one or more vision models in a directed acyclic graph), and emit (format results and deliver them to downstream consumers). Each stage is independently scalable. The pipeline ships with a model zoo covering object detection (YOLO, DETR), segmentation (SAM2), classification (EfficientNet, ConvNeXt), OCR (PaddleOCR, TrOCR), face recognition (ArcFace), and pose estimation (RTMPose). Any model that accepts a tensor and returns structured output can be added to the zoo without modifying the pipeline itself.

The system integrates with Jin (HIP-0003) for vision-language tasks, Engine (HIP-0043) for GPU inference serving, Edge (HIP-0050) for lightweight models deployed at the network edge, ML Pipeline (HIP-0057) for training and fine-tuning vision models, and Robotics (HIP-0080) for robot perception. It is streaming-first: video frames flow through the pipeline as a continuous stream, not as discrete batch requests. This is a hard requirement for robotics, surveillance, and autonomous systems where latency budgets are measured in milliseconds, not seconds.

Repository: github.com/hanzoai/vision
Port: 8081 (API)
Binary: hanzo-vision
Container: ghcr.io/hanzoai/vision:latest

Motivation

The Fragmentation Problem

Every team that needs computer vision rebuilds the same infrastructure from scratch. A robotics team writes RTSP stream decoding, frame resizing, model inference, and result serialization. A retail analytics team writes the same pipeline with different models. A content moderation team writes it again. Each implementation makes different choices about color space conversion, aspect ratio handling, batching strategy, and output format. When a better model becomes available, each team ports it independently, introducing bugs that the others already fixed.

This is not a hypothetical problem. Within the Hanzo ecosystem alone, four separate subsystems need vision capabilities:

Robotics (HIP-0080): Robots need object detection, depth estimation, and semantic segmentation at 30+ FPS with <50ms latency. A warehouse robot that detects an obstacle 200ms late has already collided.
Content moderation: User-uploaded images need classification (NSFW detection, violence detection), OCR (text extraction for policy enforcement), and face detection (for consent verification). This runs as batch processing over millions of images per day.
Video analytics: Surveillance and monitoring systems need real-time event detection (person enters restricted area, package left unattended, smoke detected) from continuous video streams.
Document processing (HIP-0016): Invoices, receipts, and forms need OCR, layout analysis, and table extraction. This is batch-oriented but requires high accuracy.

Each of these has different latency requirements, different model choices, and different output formats. But they all share the same core pipeline: decode input, preprocess, run model, postprocess, emit results. Hanzo Vision standardizes this core and lets each use case configure it for its specific needs.

The Preprocessing Tax

Model inference is the glamorous part of computer vision. Preprocessing is the part that actually breaks in production.

Consider what happens before a YOLO model sees a frame from an RTSP camera stream:

Decode: The H.264/H.265 bitstream must be decoded to raw pixels. This requires hardware acceleration (NVDEC, VAAPI, VideoToolbox) to keep up with 30+ FPS per stream.
Color convert: Camera output is typically YUV (NV12 or I420). Models expect RGB or BGR. The conversion must be correct -- swapping U and V channels produces subtly wrong colors that degrade model accuracy without obvious visual artifacts.
Resize: The camera produces 1920x1080 or 3840x2160 frames. The model expects 640x640. Naive resizing distorts aspect ratio. Letterboxing preserves aspect ratio but wastes computation on padding. The resize interpolation method (bilinear, bicubic, area) affects model accuracy by up to 2% mAP on COCO.
Normalize: Pixel values must be scaled and shifted to match the model's training distribution. YOLO expects [0, 1]. ImageNet-pretrained models expect mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. Using the wrong normalization silently degrades accuracy.
Batch: For GPU efficiency, multiple frames should be batched into a single inference call. But batching introduces latency -- waiting to fill a batch delays all frames in the batch. The optimal batch size depends on the model, GPU memory, and latency budget.
Device transfer: The preprocessed tensor must be moved to GPU memory. For video pipelines, this should happen via zero-copy DMA, not through CPU staging buffers.

Every one of these steps has been implemented incorrectly in production systems. Hanzo Vision provides a tested, GPU-accelerated preprocessing pipeline that handles these correctly for all supported input modalities. Teams configure the pipeline; they do not reimplement it.
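
Two of the steps above, normalization and batching, come down to simple arithmetic. The following is a minimal sketch in plain Python; the function names are illustrative assumptions and not part of the Hanzo Vision API. Only the constants (the [0, 1] scaling and the ImageNet mean/std) come from the text.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_unit(rgb):
    """Scale 8-bit RGB values to [0, 1], as YOLO-style models expect."""
    return tuple(c / 255.0 for c in rgb)


def normalize_imagenet(rgb):
    """Scale to [0, 1], then subtract the ImageNet mean and divide by the std."""
    return tuple(
        (c / 255.0 - m) / s
        for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def batch_fill_delay_ms(batch_size, fps):
    """Extra wait imposed on the first frame of a batch from a single stream.

    The first frame has to wait for the remaining (batch_size - 1) frames.
    """
    return (batch_size - 1) * 1000.0 / fps
```

For a single 30 FPS stream, a batch of 4 makes the first frame wait 100 ms before inference even starts, which is why batch size has to be chosen against the latency budget and not only against GPU efficiency.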

The Postprocessing Tax

Postprocessing is equally treacherous. A YOLO model outputs raw bounding box coordinates in the model's internal coordinate system (640x640 with letterbox padding). Converting these back to the original image coordinates requires reversing the exact letterbox transform applied during preprocessing. Getting this wrong by even one pixel at the 640x640 scale means getting it wrong by three pixels at 1920x1080, because the letterbox scale factor for that frame size is 1/3. For a face recognition system extracting a crop for downstream processing, a three-pixel shift can cut off an ear and reduce recognition accuracy.
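
The forward and inverse letterbox transforms can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: it assumes padding is centered and keeps everything in floating point, whereas real preprocessors often round to whole pixels. The names `Letterbox`, `letterbox_params`, and `to_original` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Letterbox:
    scale: float  # resize factor applied to the original image
    pad_x: float  # padding added on the left and right (model pixels)
    pad_y: float  # padding added on the top and bottom (model pixels)


def letterbox_params(src_w, src_h, dst_w=640, dst_h=640):
    """Compute the aspect-preserving resize factor and centered padding."""
    scale = min(dst_w / src_w, dst_h / src_h)
    pad_x = (dst_w - src_w * scale) / 2.0
    pad_y = (dst_h - src_h * scale) / 2.0
    return Letterbox(scale, pad_x, pad_y)


def to_original(box, lb):
    """Map an (x1, y1, x2, y2) box from model space to original pixels."""
    x1, y1, x2, y2 = box
    return (
        (x1 - lb.pad_x) / lb.scale,
        (y1 - lb.pad_y) / lb.scale,
        (x2 - lb.pad_x) / lb.scale,
        (y2 - lb.pad_y) / lb.scale,
    )
```

For a 1920x1080 frame the scale is 1/3 with 140 rows of padding above and below, so every one-pixel error in model space becomes a three-pixel error after `to_original`.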

Non-maximum suppression (NMS) is another postprocessing step that teams implement differently. IoU thresholds, score thresholds, class-agnostic vs. class-aware NMS, soft-NMS vs. hard-NMS -- each choice affects the final detection quality. Hanzo Vision provides configurable postprocessing with sensible defaults that match the model zoo's evaluation benchmarks.
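
To make the choices concrete, here is a minimal sketch of greedy hard NMS with a switch between class-aware and class-agnostic behavior. It is plain Python for readability and is not the pipeline's implementation; the tuple layout `(box, score, class_name)` and the default thresholds are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5, score_threshold=0.4, class_aware=True):
    """Greedy hard NMS over (box, score, class_name) tuples."""
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score, cls in candidates:
        suppressed = False
        for kept_box, _, kept_cls in kept:
            # In class-aware mode, boxes of different classes never compete.
            if class_aware and kept_cls != cls:
                continue
            if iou(kept_box, box) > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append((box, score, cls))
    return kept
```

Two heavily overlapping boxes with different class labels both survive in class-aware mode, while class-agnostic mode keeps only the higher-scoring one. Soft-NMS, which decays scores instead of discarding boxes, is not shown.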

Why Streaming-First

The batch-request model (send image, receive results) works for content moderation and document processing. It does not work for robots, surveillance, or autonomous systems.

A robot navigating a warehouse receives a continuous stream of depth frames at 30 FPS. It does not send HTTP requests for each frame. It needs a pipeline that ingests frames from a shared memory buffer, runs depth estimation and object detection in parallel, fuses the results, and publishes the output to a topic that the navigation planner subscribes to. The total pipeline latency -- from frame capture to result publication -- must be under 50ms.

Surveillance systems are similar. A campus with 200 cameras produces 6,000 frames per second. Each frame needs person detection and tracking. When a person enters a restricted zone, an alert must fire within 500ms. This is a streaming problem, not a request-response problem.

Hanzo Vision is designed around streams. Frames enter the pipeline from a source (camera, video file, message queue). They flow through processing stages connected by bounded buffers. Results exit the pipeline to a sink (message queue, webhook, gRPC stream). The HTTP API exists for batch use cases, but it is implemented on top of the streaming core, not the other way around.
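
A bounded buffer between two stages might look like the sketch below. The text does not say what happens when a buffer is full; this example assumes a drop-oldest policy, which favors fresh frames over completeness and is a common choice for real-time video. The class name and its methods are hypothetical.

```python
from collections import deque


class BoundedFrameBuffer:
    """Fixed-capacity FIFO that drops the oldest frame when full (assumed policy)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._frames = deque(maxlen=capacity)
        self.dropped = 0  # number of frames evicted before being consumed

    def push(self, frame):
        if len(self._frames) == self._frames.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
        self._frames.append(frame)

    def pop(self):
        """Return the oldest buffered frame, or None if the buffer is empty."""
        return self._frames.popleft() if self._frames else None

    def __len__(self):
        return len(self._frames)
```

A batch workload would more likely block the producer instead of dropping frames; the bound itself is what keeps a slow stage from accumulating unbounded latency.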

Why Model-Agnostic

The state of the art in computer vision changes every six months. YOLO v5 was replaced by YOLO v8, which was replaced by YOLO v11. DETR introduced transformer-based detection as an alternative to Faster R-CNN. SAM replaced everything for interactive segmentation, then SAM2 replaced SAM with video support.

If the pipeline is coupled to a specific model, replacing that model means rewriting the pipeline. If the pipeline treats models as interchangeable units with defined input/output contracts, swapping YOLO for DETR requires changing only the model entry and its input shape in the configuration:

# Before
detector:
  model: yolo-v11-l
  input: [1, 3, 640, 640]
  output: detections

# After
detector:
  model: detr-resnet-50
  input: [1, 3, 800, 800]
  output: detections

The pipeline handles the different input resolutions, preprocessing requirements, and output formats automatically. The downstream consumer sees the same Detection schema regardless of which model produced it.

This is not just about convenience. It is about evaluation. When a new model claims better accuracy, you swap it into the pipeline, run the same evaluation dataset, and compare metrics end-to-end. No code changes, no new integration work, just a config change and a benchmark run.
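
One way to picture the input/output contract is a registry that pairs each model name with its input shape and a decoder that converts raw output into the shared Detection form. The sketch below is hypothetical throughout: the raw row formats are simplified stand-ins for real model outputs, and `ModelSpec`, `REGISTRY`, and `CLASS_NAMES` are names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Detection:
    class_name: str
    confidence: float
    bbox: Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


CLASS_NAMES = {0: "person", 1: "forklift"}  # hypothetical label map


def decode_pixel_cxcywh(rows, width, height):
    """Rows of (cx, cy, w, h, score, class_id) in pixel units."""
    return [
        Detection(CLASS_NAMES[int(cls)], score,
                  (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        for cx, cy, w, h, score, cls in rows
    ]


def decode_normalized_cxcywh(rows, width, height):
    """Rows of (cx, cy, w, h, score, class_id) normalized to [0, 1]."""
    scaled = [
        (cx * width, cy * height, w * width, h * height, score, cls)
        for cx, cy, w, h, score, cls in rows
    ]
    return decode_pixel_cxcywh(scaled, width, height)


@dataclass(frozen=True)
class ModelSpec:
    input_shape: Tuple[int, int, int, int]
    decode: Callable[[list, int, int], List[Detection]]


REGISTRY = {
    "yolo-v11-l": ModelSpec((1, 3, 640, 640), decode_pixel_cxcywh),
    "detr-resnet-50": ModelSpec((1, 3, 800, 800), decode_normalized_cxcywh),
}


def postprocess(model_name, rows, width, height):
    """Look up the model's decoder and return schema-conformant detections."""
    return REGISTRY[model_name].decode(rows, width, height)
```

Downstream code only ever sees `Detection` objects, so an evaluation harness can compare two models by changing the `model_name` argument and nothing else.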

Design Philosophy

Separation of Concerns: Pipeline vs. Model vs. Application

The pipeline is not a model. The pipeline is not an application. The pipeline is the infrastructure between them.

Models are opaque functions: they accept tensors and return tensors. They know nothing about cameras, video codecs, coordinate systems, or output formats. They are trained, evaluated, and versioned by the ML Pipeline (HIP-0057).

Applications are domain-specific logic: "alert when a person enters zone A," "blur all faces in this video," "extract table data from this invoice." They know nothing about GPU memory management, batch scheduling, or tensor normalization.

The pipeline connects models to applications. It handles everything between raw sensor input and structured application output: decoding, preprocessing, batching, inference scheduling, postprocessing, and result delivery. By owning this layer explicitly, we prevent models from accumulating preprocessing logic and applications from accumulating inference logic.

The DAG Execution Model

Real vision applications rarely use a single model. A video analytics system might run:

1. Person detection (YOLO)
2. Face detection on each detected person crop (RetinaFace)
3. Face recognition on each detected face (ArcFace)
4. Pose estimation on each detected person crop (RTMPose)
5. Action recognition on the pose sequence (SlowFast)

Steps 2 and 4 depend on step 1. Step 3 depends on step 2. Step 5 depends on step 4. This forms a directed acyclic graph (DAG), not a linear pipeline.

Hanzo Vision models the processing pipeline as a DAG. Each node is a processing stage (model inference, preprocessing, postprocessing, filtering, fusion). Edges define data flow. The scheduler executes independent nodes in parallel and respects dependencies. This DAG is defined in YAML configuration, not in code.

pipeline:
  name: video-analytics
  source:
    type: rtsp
    url: rtsp://camera-01.local/stream

  stages:
    - id: detect_persons
      model: yolo-v11-l
      classes: [person]
      confidence: 0.5

    - id: detect_faces
      model: retinaface-r50
      depends_on: detect_persons
      input: crops               # Run on cropped detections from previous stage

    - id: recognize_faces
      model: arcface-r100
      depends_on: detect_faces
      gallery: s3://hanzo-vision/galleries/employees/

    - id: estimate_pose
      model: rtmpose-l
      depends_on: detect_persons
      input: crops

    - id: detect_actions
      model: slowfast-r50
      depends_on: estimate_pose
      temporal_window: 32        # Frames of pose history

  sink:
    type: stream
    topic: hanzo.vision.analytics.camera-01
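
How a scheduler could derive parallel execution order from the depends_on fields above can be sketched with Python's standard graphlib module. This is an illustration of the scheduling idea only; the actual Hanzo Vision scheduler is not described at this level, and `execution_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Stage id -> set of stages it depends on (mirrors the depends_on fields above).
STAGES = {
    "detect_persons": set(),
    "detect_faces": {"detect_persons"},
    "recognize_faces": {"detect_faces"},
    "estimate_pose": {"detect_persons"},
    "detect_actions": {"estimate_pose"},
}


def execution_waves(stages):
    """Group stages into waves; stages within a wave can run in parallel."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For this pipeline the result is three waves: person detection alone, then face detection and pose estimation together, then face recognition and action recognition together.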

GPU Memory as the Binding Constraint

Vision models are large. YOLO-v11-l is 49M parameters (100MB FP16). SAM2-large is 224M parameters (450MB FP16). A pipeline running five models simultaneously can consume 2-4GB of GPU memory for weights alone, before accounting for activation memory during inference.

GPU memory is the binding constraint for vision pipelines, not compute. A modern GPU (A100, H100) has enormous compute throughput but fixed memory (40GB or 80GB). The pipeline must be memory-aware:

Model loading: Load models lazily and unload models that have not been used recently. A model needed only for batch processing at night should not consume GPU memory during the day.
Batch sizing: Automatically determine the maximum batch size that fits in available GPU memory, accounting for all loaded models and their activation requirements.
Precision selection: Use FP16 or INT8 by default. FP32 is only justified when quantization measurably degrades accuracy for the specific task.
TensorRT compilation: For NVIDIA GPUs, compile ONNX models to TensorRT engines at deployment time. TensorRT fuses operations and optimizes memory layout, reducing both memory and latency by 2-5x.
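
The weight-memory figures and the batch-sizing rule above reduce to simple arithmetic, sketched here. The per-sample activation size is a hypothetical input that would have to be measured or estimated for each model; the helper names are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(num_params, precision="fp16"):
    """Memory needed for model weights at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


def max_batch_size(free_bytes, activation_bytes_per_sample, cap=None):
    """Largest batch whose activations fit in the remaining GPU memory.

    free_bytes is what is left after all loaded model weights are accounted for.
    """
    if activation_bytes_per_sample <= 0:
        raise ValueError("activation_bytes_per_sample must be positive")
    size = free_bytes // activation_bytes_per_sample
    return min(size, cap) if cap is not None else size
```

With these definitions, 49M parameters at FP16 come to 98 MB and 224M parameters to 448 MB, which the text rounds to 100MB and 450MB.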

Privacy by Default

Vision systems see people. People have faces, license plates, personal belongings, and private spaces. Hanzo Vision treats privacy as a pipeline-level concern, not an application-level afterthought.

Every pipeline can declare privacy policies that apply before results leave the pipeline:

privacy:
  face_blur:
    enabled: true
    method: gaussian          # gaussian, pixelate, solid
    kernel_size: 31
    apply_to: output_frames   # Blur faces in any output video/images

  pii_detection:
    enabled: true
    types: [license_plate, credit_card, ssn]
    action: redact            # redact, flag, log

  consent:
    enabled: true
    gallery: s3://hanzo-vision/consent/opted-in/
    action: blur_unconsented  # Blur faces not in the consent gallery

  retention:
    raw_frames: 0h            # Do not retain raw frames
    detections: 720h          # Retain detection metadata for 30 days
    face_embeddings: 0h       # Do not retain face embeddings (derive on demand)

Face blurring runs as a mandatory postprocessing stage when enabled. It cannot be bypassed by the application. This ensures that even if the application code is buggy or malicious, faces in output frames are blurred before they leave the pipeline. The consent gallery allows opted-in individuals (employees who consented to recognition) to be excluded from blurring.
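
The blur_unconsented decision can be sketched as an embedding comparison against the consent gallery. This is an assumption about how matching might work, not the Privacy Engine's implementation: the cosine-similarity metric, the 0.6 threshold, and the function names are all illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def should_blur(face_embedding, consent_gallery, threshold=0.6):
    """Blur unless the face matches at least one opted-in embedding.

    An empty gallery means nobody has opted in, so every face is blurred.
    """
    return not any(
        cosine_similarity(face_embedding, reference) >= threshold
        for reference in consent_gallery
    )
```

The function fails closed: any face that cannot be matched, including every face when the gallery is empty or unavailable, is blurred.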

Specification

Architecture Overview

                ┌─────────────────────────────────────────────────────┐
                │              Hanzo Vision API (:8081)                │
                │                                                     │
                │  ┌──────────┐  ┌───────────┐  ┌─────────────────┐  │
                │  │  Ingest  │  │  Process   │  │      Emit       │  │
                │  │  Manager │  │  Scheduler │  │    Dispatcher   │  │
                │  └────┬─────┘  └─────┬──────┘  └───────┬─────────┘  │
                │       │              │                  │           │
                │  ┌────┴──────────────┴──────────────────┴────────┐  │
                │  │              Pipeline Runtime                 │  │
                │  │   ┌─────────┐  ┌──────────┐  ┌────────────┐  │  │
                │  │   │ Sources │  │  Stages  │  │   Sinks    │  │  │
                │  │   │ (RTSP,  │  │  (DAG    │  │  (Stream,  │  │  │
                │  │   │  file,  │──│  of model│──│   webhook, │  │  │
                │  │   │  queue) │  │  nodes)  │  │   gRPC)    │  │  │
                │  │   └─────────┘  └──────────┘  └────────────┘  │  │
                │  └───────────────────────────────────────────────┘  │
                │       │              │                  │           │
                │  ┌────┴──┐    ┌──────┴──────┐   ┌──────┴────────┐  │
                │  │ Model │    │   Privacy   │   │  Annotation   │  │
                │  │  Zoo  │    │   Engine    │   │   Pipeline    │  │
                │  └───────┘    └─────────────┘   └───────────────┘  │
                └────────────────────┬────────────────────────────────┘
                                     │
              ┌──────────────────────┼──────────────────────┐
              │                      │                      │
     ┌────────┴─────────┐  ┌────────┴────────┐  ┌─────────┴──────────┐
     │  Engine (8043)   │  │  Object Storage  │  │  Stream (HIP-0030) │
     │  GPU Inference   │  │  (HIP-0032)      │  │  Event Delivery    │
     │  HIP-0043        │  │  Models/Data     │  │                    │
     └──────────────────┘  └─────────────────┘  └────────────────────┘

The architecture has four layers:

API Layer (port 8081): REST and gRPC endpoints for pipeline management, batch inference, and stream control. Handles authentication via Hanzo IAM.
Pipeline Runtime: The streaming execution engine. Manages source connections, DAG scheduling, buffer management, and sink delivery. One runtime instance can run multiple pipelines concurrently.
Supporting Services: Model Zoo (model registry and loading), Privacy Engine (face blurring, PII detection), and Annotation Pipeline (AI-assisted labeling).
External Dependencies: Engine (HIP-0043) for GPU-accelerated inference, Object Storage (HIP-0032) for models and training data, Stream (HIP-0030) for event delivery.

Input Modalities

Hanzo Vision supports five input modalities through a unified Frame abstraction.

Frame Schema

message Frame {
  string      frame_id    = 1;   // Unique ID (UUID v7 for time-ordering)
  string      source_id   = 2;   // Source that produced this frame
  int64       timestamp   = 3;   // Capture time (nanoseconds since epoch)
  int64       sequence    = 4;   // Monotonic sequence number within source
  Modality    modality    = 5;   // Image, video, depth, pointcloud, thermal
  TensorSpec  tensor      = 6;   // Raw tensor data
  Metadata    metadata    = 7;   // Source-specific metadata (camera intrinsics, etc.)
}

enum Modality {
  IMAGE       = 0;   // Single RGB/BGR/Grayscale image
  VIDEO       = 1;   // Frame from a video stream (carries temporal context)
  DEPTH       = 2;   // Depth map (single-channel float32, meters)
  POINTCLOUD  = 3;   // 3D point cloud (Nx3 or Nx6 with normals)
  THERMAL     = 4;   // Thermal/IR image (single-channel float32, Kelvin)
}

message TensorSpec {
  repeated int64 shape  = 1;   // e.g., [1080, 1920, 3] for RGB image
  DataType       dtype  = 2;   // uint8, float16, float32
  string         layout = 3;   // HWC, CHW, NHWC, NCHW
  bytes          data   = 4;   // Raw tensor bytes (or reference to shared memory)
}
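
A consumer of TensorSpec will typically want to check that shape, dtype, and payload agree before interpreting the bytes. The sketch below mirrors the message as a Python dataclass; it is an illustration, not generated protobuf code, and it applies only when data carries the tensor inline, not when it holds a shared memory reference.

```python
from dataclasses import dataclass
from math import prod

DTYPE_SIZES = {"uint8": 1, "float16": 2, "float32": 4}


@dataclass
class TensorSpec:
    shape: tuple
    dtype: str
    layout: str
    data: bytes = b""

    def expected_bytes(self):
        """Payload size implied by shape and dtype."""
        return prod(self.shape) * DTYPE_SIZES[self.dtype]

    def validate(self):
        """Raise ValueError if layout rank or payload size is inconsistent."""
        if len(self.layout) != len(self.shape):
            raise ValueError(
                f"layout {self.layout!r} does not match rank {len(self.shape)}"
            )
        if len(self.data) != self.expected_bytes():
            raise ValueError(
                f"expected {self.expected_bytes()} bytes, got {len(self.data)}"
            )
```

A 1080p RGB frame with shape [1080, 1920, 3] and dtype uint8 is 6,220,800 bytes, which is a useful sanity check when frames arrive over shared memory or gRPC.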

Source Types

Source Type	Protocol	Modalities	Use Case
`rtsp`	RTSP/RTP	Video, Depth	IP cameras, depth sensors
`v4l2`	Video4Linux2	Video, Thermal	USB cameras, thermal cameras
`ros2`	ROS 2 DDS	All	Robot sensors (HIP-0080)
`file`	Local/S3	Image, Video, PointCloud	Batch processing
`queue`	Hanzo Stream	All	Event-driven processing
`grpc`	gRPC stream	All	Custom integrations
`shm`	Shared memory	All	Low-latency local inference

Video Stream Decoding

Video decoding is GPU-accelerated where hardware is available:

Platform	Decoder	Codecs	Throughput
NVIDIA GPU	NVDEC	H.264, H.265, VP9, AV1	60+ streams @ 1080p
Apple Silicon	VideoToolbox	H.264, H.265, ProRes	30+ streams @ 1080p
Intel	VAAPI/QSV	H.264, H.265	40+ streams @ 1080p
CPU fallback	FFmpeg (libavcodec)	All	5-10 streams @ 1080p

The pipeline automatically selects the best available decoder. When GPU decoding is available, decoded frames remain in GPU memory, avoiding a GPU-to-CPU-to-GPU round trip for subsequent model inference.
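
Decoder selection and decode capacity planning can be sketched from the table above. The preference order (highest listed throughput first) and the helper names are assumptions; the stream counts are the lower bounds from the table, with 5 used for the CPU fallback.

```python
import math

# (decoder, minimum 1080p streams per device), ordered by listed throughput.
DECODERS = [
    ("NVDEC", 60),
    ("VAAPI/QSV", 40),
    ("VideoToolbox", 30),
    ("FFmpeg (libavcodec)", 5),
]


def select_decoder(available):
    """Return the first decoder in preference order that is available."""
    for name, streams in DECODERS:
        if name in available:
            return name, streams
    raise LookupError("no supported decoder available")


def devices_needed(num_streams, streams_per_device):
    """Number of decode devices required for the given stream count."""
    return math.ceil(num_streams / streams_per_device)
```

As a rough planning figure, 200 camera streams at 1080p would need about four NVDEC-capable GPUs for decoding alone at 60 streams each.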

Model Zoo

The model zoo is a curated registry of pre-tested vision models. Each model entry includes the ONNX weights, preprocessing specification, postprocessing specification, and benchmark results on standard datasets.

Supported Model Families

Object Detection

Model	Parameters	Input Size	COCO mAP	Latency (A100)	Use Case
YOLO-v11-n	2.6M	640x640	39.5	1.2ms	Edge, real-time
YOLO-v11-s	9.4M	640x640	47.0	1.8ms	Balanced
YOLO-v11-m	20.1M	640x640	51.5	3.4ms	Accuracy-focused
YOLO-v11-l	49.0M	640x640	53.4	5.1ms	High accuracy
YOLO-v11-x	56.9M	640x640	54.7	7.8ms	Maximum accuracy
DETR-ResNet-50	41M	800x800	42.0	12ms	Transformer-based
DETR-ResNet-101	60M	800x800	43.5	18ms	Transformer-based
RT-DETR-l	32M	640x640	53.0	6.2ms	Real-time transformer

Segmentation

Model	Parameters	Input Size	Metric	Latency (A100)	Use Case
SAM2-tiny	38M	1024x1024	-	8ms	Interactive, edge
SAM2-small	46M	1024x1024	-	12ms	Interactive
SAM2-base	80M	1024x1024	-	20ms	General segmentation
SAM2-large	224M	1024x1024	-	35ms	High quality
YOLO-v11-seg-l	50M	640x640	44.6 mask mAP	6ms	Instance segmentation

Classification

Model	Parameters	Input Size	ImageNet Top-1	Latency (A100)
EfficientNet-B0	5.3M	224x224	77.1%	0.8ms
EfficientNet-B4	19M	380x380	82.9%	2.1ms
ConvNeXt-T	28M	224x224	82.1%	1.5ms
ConvNeXt-B	89M	224x224	83.8%	3.2ms

OCR

Model	Parameters	Languages	Latency (A100)	Use Case
PaddleOCR-v4	14M	80+	15ms/page	General OCR
TrOCR-base	334M	English	25ms/line	High-accuracy English
EasyOCR	20M	80+	20ms/page	Lightweight
Surya-OCR	180M	90+	30ms/page	Document-focused

Face Recognition

Model	Parameters	Embedding Dim	LFW Accuracy	Latency (A100)
ArcFace-R50	44M	512	99.5%	2.5ms
ArcFace-R100	65M	512	99.8%	4.0ms
AdaFace-R100	65M	512	99.8%	4.2ms
RetinaFace-R50	27M	-	-	3.0ms

Pose Estimation

Model	Parameters	Keypoints	COCO AP	Latency (A100)
RTMPose-t	3.3M	17	68.5	1.0ms
RTMPose-s	5.5M	17	72.2	1.3ms
RTMPose-m	13.6M	17	75.8	2.2ms
RTMPose-l	27.6M	17	76.5	3.5ms
RTMPose-x	49.4M	17	78.3	5.8ms

Model Format and Optimization

All models in the zoo are stored in ONNX format as the canonical interchange representation. At deployment time, models are optimized for the target hardware:

ONNX (canonical)
  |
  +---> TensorRT engine    (NVIDIA GPUs: 2-5x faster than ONNX Runtime)
  +---> CoreML model       (Apple Silicon: Metal acceleration)
  +---> OpenVINO IR        (Intel CPUs/GPUs)
  +---> ONNX Runtime       (CPU fallback, cross-platform)

The optimization happens automatically at first load. The optimized model is cached in Object Storage (HIP-0032) keyed by model version, hardware profile, and optimization options (precision, max batch size, workspace size). Subsequent loads on the same hardware use the cached optimized model.

model_optimization:
  precision: fp16                    # fp32, fp16, int8
  max_batch_size: 16                 # TensorRT optimization parameter
  workspace_size_mb: 2048            # TensorRT workspace
  calibration_dataset: coco-val-500  # For INT8 calibration
  cache_path: s3://hanzo-vision/model-cache/
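
A cache key built from model version, hardware profile, and optimization options could be derived as sketched below. The hashing scheme (SHA-256 over canonical JSON) and the example version and hardware strings are assumptions; the text only states which inputs the key depends on.

```python
import hashlib
import json


def cache_key(model_version, hardware_profile, options):
    """Deterministic key for an optimized model artifact.

    Keys are sorted so that option ordering does not change the result.
    """
    payload = json.dumps(
        {
            "model": model_version,
            "hardware": hardware_profile,
            "options": options,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing any one input, for example switching precision from fp16 to int8, produces a different key and therefore a separate cached engine.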

Pipeline Configuration

A pipeline is defined as a YAML document that specifies sources, processing stages, privacy policies, and sinks.

Complete Pipeline Example

apiVersion: vision.hanzo.ai/v1
kind: Pipeline
metadata:
  name: warehouse-safety
  organization: hanzo
  labels:
    environment: production
    site: warehouse-01

spec:
  sources:
    - id: camera-north
      type: rtsp
      url: rtsp://10.0.1.10/stream1
      fps: 30
      decode: gpu                    # GPU-accelerated decoding

    - id: camera-south
      type: rtsp
      url: rtsp://10.0.1.11/stream1
      fps: 30
      decode: gpu

    - id: depth-sensor
      type: ros2
      topic: /realsense/depth/image_rect_raw
      modality: depth

  stages:
    - id: detect_objects
      model: yolo-v11-l
      classes: [person, forklift, pallet, hard_hat, safety_vest]
      confidence: 0.4
      nms_iou: 0.5
      device: gpu:0
      batch_size: auto               # Auto-tune based on GPU memory

    - id: detect_faces
      model: retinaface-r50
      depends_on: detect_objects
      input: crops
      filter: "class == 'person'"    # Only crop person detections
      device: gpu:0

    - id: check_ppe
      type: rule                     # Not a model -- a rule-based stage
      depends_on: detect_objects
      rules:
        - name: hard_hat_required
          condition: "person AND NOT hard_hat WITHIN 50px"
          severity: warning
        - name: vest_required
          condition: "person AND NOT safety_vest WITHIN 50px"
          severity: warning
        - name: forklift_proximity
          condition: "person WITHIN 3m OF forklift"
          severity: critical
          requires: depth-sensor      # Uses depth data for distance

    - id: estimate_depth
      model: depth-anything-v2-small
      sources: [camera-north, camera-south]
      device: gpu:1

    - id: fuse_detections
      type: fusion
      depends_on: [detect_objects, estimate_depth]
      method: project_to_3d          # Project 2D detections into 3D using depth
      camera_intrinsics: s3://hanzo-vision/calibration/warehouse-01/

  privacy:
    face_blur:
      enabled: true
      method: gaussian
      kernel_size: 31
    retention:
      raw_frames: 0h
      detections: 720h

  sinks:
    - id: alerts
      type: stream
      topic: hanzo.vision.alerts.warehouse-01
      filter: "severity IN ('warning', 'critical')"
      format: json

    - id: analytics
      type: stream
      topic: hanzo.vision.detections.warehouse-01
      format: protobuf

    - id: dashboard
      type: grpc
      endpoint: dashboard.internal:50051
      method: StreamDetections

  resources:
    gpu: 2                           # Request 2 GPUs
    memory: 16Gi
    cpu: 8
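
The project_to_3d fusion and the forklift_proximity rule both rest on back-projecting a 2D detection into 3D with a depth value. The following is a minimal sketch under a pinhole camera model with no lens distortion, where depth is the distance along the optical axis; the intrinsics used in any call are hypothetical calibration values and the helper names are illustrative.

```python
import math


def box_center(bbox):
    """Center pixel of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth Z into camera coordinates.

    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth_m
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def within_distance(p, q, limit_m):
    """True if two 3D points are at most limit_m meters apart."""
    return math.dist(p, q) <= limit_m
```

A rule such as "person WITHIN 3m OF forklift" could then be evaluated by back-projecting the centers of both boxes and calling `within_distance` with a limit of 3.0.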

Output Schemas

All pipeline stages produce structured output conforming to standard schemas. This is what makes the pipeline model-agnostic: regardless of which model produced the result, the output schema is the same.

Detection Schema

message Detection {
  string   detection_id  = 1;   // Unique ID
  string   frame_id      = 2;   // Source frame
  string   class_name    = 3;   // e.g., "person", "forklift"
  int32    class_id      = 4;   // Numeric class ID
  float    confidence    = 5;   // [0.0, 1.0]
  BBox     bbox          = 6;   // Bounding box in original image coordinates
  Mask     mask          = 7;   // Optional instance segmentation mask
  repeated Keypoint keypoints = 8;  // Optional pose keypoints
  map<string, string> attributes = 9;  // Model-specific attributes
}

message BBox {
  float x1 = 1;  // Top-left x (pixels, original image coordinates)
  float y1 = 2;  // Top-left y
  float x2 = 3;  // Bottom-right x
  float y2 = 4;  // Bottom-right y
}

message Keypoint {
  string name       = 1;   // e.g., "left_shoulder"
  float  x          = 2;   // x coordinate (pixels)
  float  y          = 3;   // y coordinate (pixels)
  float  confidence = 4;   // [0.0, 1.0]
  bool   visible    = 5;   // Whether the keypoint is visible (not occluded)
}

OCR Schema

message OCRResult {
  string   frame_id     = 1;
  repeated TextRegion regions = 2;
  string   full_text    = 3;    // All text concatenated in reading order
  string   language     = 4;    // Detected language (ISO 639-1)
}

message TextRegion {
  repeated Point polygon = 1;   // Bounding polygon (4+ points)
  string  text           = 2;   // Recognized text
  float   confidence     = 3;   // [0.0, 1.0]
  int32   line_number    = 4;   // Reading order line number
}

Face Recognition Schema

message FaceResult {
  string  frame_id       = 1;
  BBox    bbox           = 2;   // Face bounding box
  float   detection_conf = 3;
  repeated float embedding = 4; // Face embedding vector (512-dim)
  string  identity       = 5;   // Matched identity (if gallery provided)
  float   match_score    = 6;   // Similarity to matched identity [0.0, 1.0]
  FaceLandmarks landmarks = 7;  // 5-point or 68-point landmarks
}

REST API

Batch Inference

POST /v1/inference
Content-Type: multipart/form-data

Parameters:
  image:    binary          (image file)
  model:    string          (model name, e.g., "yolo-v11-l")
  options:  JSON            (model-specific options)

Response:
{
  "request_id": "req_abc123",
  "model": "yolo-v11-l",
  "latency_ms": 12.3,
  "results": {
    "detections": [
      {
        "class": "person",
        "confidence": 0.92,
        "bbox": [120, 80, 350, 420]
      }
    ]
  }
}

Pipeline Management

POST   /v1/pipelines                  Create pipeline from YAML spec
GET    /v1/pipelines                  List pipelines
GET    /v1/pipelines/{id}             Get pipeline status
PUT    /v1/pipelines/{id}             Update pipeline (hot-reload)
DELETE /v1/pipelines/{id}             Stop and remove pipeline
POST   /v1/pipelines/{id}/start       Start a stopped pipeline
POST   /v1/pipelines/{id}/stop        Stop a running pipeline
GET    /v1/pipelines/{id}/metrics     Pipeline performance metrics
GET    /v1/pipelines/{id}/frames      Get recent processed frames (debug)

Model Management

GET    /v1/models                     List available models
GET    /v1/models/{name}              Model details and benchmarks
POST   /v1/models/{name}/load         Load model to GPU
POST   /v1/models/{name}/unload       Unload model from GPU
GET    /v1/models/status              GPU memory usage per model
POST   /v1/models/import              Import custom ONNX model

Video Analytics

Video analytics extends the base pipeline with temporal reasoning -- understanding what happens across frames, not just within a single frame.

Object Tracking

The pipeline includes multi-object tracking (MOT) as a built-in stage. Tracking assigns persistent IDs to detections across frames, enabling counting, trajectory analysis, and re-identification.

stages:
  - id: detect
    model: yolo-v11-l
    classes: [person, vehicle]

  - id: track
    type: tracker
    depends_on: detect
    algorithm: botsort             # botsort, bytetrack, ocsort
    max_age: 30                    # Frames before track is deleted
    min_hits: 3                    # Detections before track is confirmed
    iou_threshold: 0.3             # Association threshold
    reid_model: osnet-ain-x1.0    # Re-identification model (optional)
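
To show what max_age, min_hits, and iou_threshold control, here is a deliberately simple greedy IoU tracker. It is not BoT-SORT, ByteTrack, or OC-SORT: there is no motion model, no re-identification, and matching is greedy in track-ID order. The class and its internals are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    def __init__(self, max_age=30, min_hits=3, iou_threshold=0.3):
        self.max_age = max_age
        self.min_hits = min_hits
        self.iou_threshold = iou_threshold
        self._next_id = 1
        self._tracks = {}  # track_id -> {"box", "hits", "age"}

    def update(self, boxes):
        """Advance one frame; return {track_id: box} for confirmed, visible tracks."""
        unmatched = list(boxes)
        for track in self._tracks.values():
            best = max(unmatched, key=lambda b: iou(track["box"], b), default=None)
            if best is not None and iou(track["box"], best) >= self.iou_threshold:
                unmatched.remove(best)
                track.update(box=best, hits=track["hits"] + 1, age=0)
            else:
                track["age"] += 1  # missed in this frame
        for box in unmatched:  # every unmatched detection starts a new track
            self._tracks[self._next_id] = {"box": box, "hits": 1, "age": 0}
            self._next_id += 1
        # Delete tracks that have been missing for more than max_age frames.
        self._tracks = {
            tid: t for tid, t in self._tracks.items() if t["age"] <= self.max_age
        }
        return {
            tid: t["box"] for tid, t in self._tracks.items()
            if t["hits"] >= self.min_hits and t["age"] == 0
        }
```

A track is reported only after min_hits matched detections, which suppresses one-frame false positives, and it keeps its ID through gaps of up to max_age frames.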

Event Detection

Events are temporal patterns defined over tracked objects:

events:
  - name: zone_intrusion
    type: zone_crossing
    zone:
      polygon: [[100,200], [400,200], [400,500], [100,500]]
    trigger: enter                  # enter, exit, dwell
    classes: [person]
    cooldown: 30s                   # Suppress duplicate alerts

  - name: abandoned_object
    type: stationary
    duration: 300s                  # Object stationary for 5 minutes
    classes: [backpack, suitcase, bag]

  - name: crowd_formation
    type: density
    threshold: 10                   # 10+ people per 100 sq meters
    area: 100                       # Square meters (requires calibration)

  - name: anomaly
    type: anomaly_detection
    model: video-anomaly-v1
    threshold: 0.8                  # Anomaly score threshold
    temporal_window: 64             # Frames of history
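
A zone_crossing event with trigger: enter can be sketched as a point-in-polygon test applied to successive track positions. This is an illustration only: which point of a bounding box represents the object is left to the caller, cooldown handling is omitted, and a track whose first observed position is already inside the zone is treated as entering.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count polygon edges crossed by a ray extending to the right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def enter_events(track_positions, polygon):
    """Indices at which a track moves from outside the zone to inside it."""
    events = []
    was_inside = False
    for index, position in enumerate(track_positions):
        now_inside = point_in_polygon(position, polygon)
        if now_inside and not was_inside:
            events.append(index)
        was_inside = now_inside
    return events
```

Using the polygon from the zone_intrusion example, a track that walks in, leaves, and returns produces two enter events; the cooldown setting would then decide whether the second one raises an alert.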

Integration with Jin (HIP-0003)

Jin provides vision-language models that combine visual understanding with natural language. The Vision Pipeline integrates with Jin for tasks that require both modalities:

Visual question answering: "Is anyone in this frame not wearing a hard hat?" The pipeline extracts the frame, Jin processes the image with the question, and returns a natural language answer.
Image captioning: Generate natural language descriptions of scenes for accessibility, logging, or search indexing.
Open-vocabulary detection: Instead of detecting only predefined classes, describe what to find in natural language. "Find all red objects on the conveyor belt." Jin's CLIP-based vision encoder matches the description against image regions.
Grounded conversation: Combine detection results with Jin's language model for contextual understanding. The pipeline provides structured detections; Jin reasons about them in natural language.

stages:
  - id: detect
    model: yolo-v11-l

  - id: describe_scene
    type: jin
    depends_on: detect
    model: jin-base
    prompt: |
      Describe the warehouse scene. Note any safety concerns.
      Detected objects: {detections}
    output: text
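
How the {detections} placeholder gets filled is not specified here. One plausible reading, sketched below, is plain string substitution with a compact text summary of the upstream detections; the summary format and function names are assumptions made for the example.

```python
PROMPT_TEMPLATE = (
    "Describe the warehouse scene. Note any safety concerns.\n"
    "Detected objects: {detections}\n"
)


def summarize(detections):
    """Render detections as 'class (confidence)' pairs, highest confidence first."""
    ordered = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    return ", ".join(
        f'{d["class_name"]} ({d["confidence"]:.2f})' for d in ordered
    )


def render_prompt(detections, template=PROMPT_TEMPLATE):
    """Fill the {detections} placeholder; use 'none' when nothing was detected."""
    return template.format(detections=summarize(detections) or "none")
```

Keeping the summary short matters because every detection added to the prompt consumes language-model context on each frame that is described.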

Integration with Engine (HIP-0043)

The Vision Pipeline does not run inference directly. It delegates inference to Engine (HIP-0043), which manages GPU resources, model loading, request batching, and multi-model scheduling.

The integration is transparent: the pipeline sends inference requests to Engine via gRPC, and Engine returns results. From the pipeline's perspective, inference is a remote procedure call. From Engine's perspective, the Vision Pipeline is just another client.

This separation matters for resource management. A single Engine instance can serve multiple pipelines, multiple batch inference endpoints, and the LLM Gateway simultaneously. Engine handles the GPU scheduling; the Vision Pipeline handles the vision-specific preprocessing and postprocessing.

engine:
  endpoint: engine.internal:50051    # Engine gRPC endpoint
  timeout: 100ms                     # Inference timeout
  retry:
    max_attempts: 2
    backoff: 10ms
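
To make the timeout and retry settings concrete, here is a minimal sketch of how a client could apply them. It assumes that max_attempts counts total attempts (not additional retries), that backoff is a fixed delay between attempts, and that a timed-out call raises TimeoutError; none of these semantics are stated by the configuration itself. The infer callable is a hypothetical stand-in for the gRPC call to Engine.

```python
import time


def call_with_retry(infer, request, timeout_s=0.100, max_attempts=2, backoff_s=0.010):
    """Call infer(request, timeout_s), retrying on TimeoutError.

    Assumption: max_attempts is the total number of tries, and the
    backoff is a constant sleep between consecutive tries.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return infer(request, timeout_s)
        except TimeoutError as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(backoff_s)
    # Every attempt timed out: surface the last error to the caller.
    raise last_error
```

Under these assumptions, the worst case before the pipeline gives up on a request is 100 ms + 10 ms + 100 ms = 210 ms, which is worth comparing against the frame interval of the source stream.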

Integration with Edge (HIP-0050)

For edge-deployed vision models (security cameras at retail locations, robot onboard processors, mobile devices), the pipeline supports a lightweight Edge Runtime that runs on devices without datacenter GPUs.

Edge-deployed models are optimized for the target hardware:

Target	Runtime	Optimization	Typical Models
NVIDIA Jetson	TensorRT	INT8, FP16	YOLO-n/s, RTMPose-t/s
Apple Neural Engine	CoreML	FP16	YOLO-n/s, EfficientNet
Qualcomm NPU	QNN	INT8	YOLO-n, MobileNet
CPU (x86/ARM)	ONNX Runtime	INT8	YOLO-n, EfficientNet-B0

The Edge Runtime uses the same pipeline YAML configuration as the datacenter version, but with an edge-specific resource profile. A pipeline developed and tested in the datacenter can be deployed to the edge by changing the resource section:

resources:
  runtime: edge
  device: jetson-orin-nano
  power_budget: 15W               # Power-constrained optimization
  max_latency: 50ms               # Hard latency bound

Integration with Robotics (HIP-0080)

Robot perception is the most demanding vision use case: multiple sensors, hard real-time constraints, and tight integration with control systems.

The Vision Pipeline integrates with the Robotics framework through ROS 2:

Sensor input: The pipeline subscribes to ROS 2 image, depth, and point cloud topics. It synchronizes sensors using the ROS 2 message timestamp rather than wall-clock time.
Perception output: Detection, segmentation, and depth results are published to ROS 2 topics that the robot's navigation and manipulation planners subscribe to.
Calibration: Camera intrinsic and extrinsic parameters are loaded from ROS 2 camera_info topics or from calibration files. The pipeline uses these to project 2D detections into the robot's 3D coordinate frame.
Sensor fusion: For robots with multiple cameras and depth sensors, the pipeline fuses detections from all sensors into a single 3D world model. Overlapping fields of view produce deduplicated detections with higher confidence.

sources:
  - id: front_camera
    type: ros2
    topic: /front_camera/image_raw
    camera_info: /front_camera/camera_info
    modality: video

  - id: front_depth
    type: ros2
    topic: /front_camera/depth/image_rect_raw
    modality: depth

  - id: lidar
    type: ros2
    topic: /velodyne_points
    modality: pointcloud

sinks:
  - id: perception
    type: ros2
    topic: /vision/detections
    message_type: vision_msgs/Detection3DArray
    frame_id: base_link            # TF frame for 3D coordinates
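
The projection of a 2D detection into 3D mentioned under Calibration can be sketched with the standard pinhole camera model. The example below is an illustration only: it assumes the bounding box is given as [x_min, y_min, x_max, y_max] in pixels, that a metric depth value is available at the box center, and it uses made-up intrinsics (fx, fy, cx, cy, as found in the K matrix of a camera_info message). The result is in the camera's optical frame; transforming it into base_link additionally requires the extrinsic transform, which is not shown.

```python
def bbox_center(bbox):
    """Center pixel (u, v) of a box given as [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at depth_m metres into the camera frame.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 camera and a depth reading of 2 m.
u, v = bbox_center([120, 80, 350, 420])
point = backproject(u, v, depth_m=2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point)
```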

Annotation and Labeling Pipeline

Training vision models requires labeled data. The annotation pipeline provides AI-assisted labeling, which can reduce human labeling effort by 5-10x.

Workflow

Raw Images/Video
      |
      v
  AI Pre-label (run detection/segmentation models on unlabeled data)
      |
      v
  Human Review (correct AI predictions in labeling UI)
      |
      v
  Quality Check (consensus scoring, inter-annotator agreement)
      |
      v
  Export to Dataset (COCO, Pascal VOC, YOLO format)
      |
      v
  ML Pipeline (HIP-0057) for model training

AI-assisted labeling works by running the current best model on unlabeled data, converting its predictions to draft annotations, and presenting them to human annotators for correction. For mature models with >90% accuracy, the annotator only needs to correct the roughly 10% of predictions that are wrong, rather than drawing every bounding box from scratch.

Supported Export Formats

Format	Structure	Use Case
COCO JSON	Single JSON with image list + annotations	Standard benchmark format
Pascal VOC XML	Per-image XML annotation files	Legacy compatibility
YOLO TXT	Per-image text files (class cx cy w h)	YOLO training
LabelMe JSON	Per-image JSON polygons	Segmentation labeling
CVAT XML	CVAT project export	CVAT interop
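
As an example of what an export step involves, the sketch below converts one pixel-space detection into a YOLO TXT line. YOLO label files store the box center and size normalized to the image dimensions. The helper name to_yolo_line, the [x_min, y_min, x_max, y_max] input convention, and the 1000x500 image size are assumptions for illustration.

```python
def to_yolo_line(class_id, bbox, img_w, img_h):
    """Format one box as 'class cx cy w h' with coordinates normalized to [0, 1].

    bbox is assumed to be [x_min, y_min, x_max, y_max] in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical 1000x500 image with a single box for class 0.
line = to_yolo_line(0, [120, 80, 350, 420], img_w=1000, img_h=500)
print(line)  # 0 0.235000 0.500000 0.230000 0.680000
```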

Data Format Standards

Image Formats

Format	Channels	Depth	Compression	Use Case
JPEG	RGB	8-bit	Lossy	General photography
PNG	RGB/RGBA	8/16-bit	Lossless	Screenshots, synthetic
WebP	RGB/RGBA	8-bit	Lossy/Lossless	Web delivery
TIFF	Any	8/16/32-bit	Optional	Scientific imaging
EXR	Any	16/32-bit float	Lossless	HDR, depth maps
NV12/I420	YUV	8-bit	None	Video frame decode

3D Formats

Format	Contents	Use Case
PLY	Points + normals + colors	Point cloud exchange
PCD	Points + fields	ROS/PCL standard
LAS/LAZ	Points + classification	LiDAR data
NumPy (.npy)	Raw tensor	Fast I/O

Metrics and Observability

The pipeline exposes Prometheus metrics on port 8081 at /metrics:

# Pipeline throughput
hanzo_vision_frames_processed_total{pipeline, source, stage}
hanzo_vision_frames_dropped_total{pipeline, source, reason}

# Latency
hanzo_vision_stage_latency_seconds{pipeline, stage, quantile}
hanzo_vision_pipeline_latency_seconds{pipeline, quantile}
hanzo_vision_inference_latency_seconds{model, quantile}

# GPU utilization
hanzo_vision_gpu_utilization{device}
hanzo_vision_gpu_memory_used_bytes{device}
hanzo_vision_gpu_memory_total_bytes{device}
hanzo_vision_model_memory_bytes{model}

# Detection quality
hanzo_vision_detections_total{pipeline, stage, class}
hanzo_vision_detection_confidence{pipeline, stage, class, quantile}

# Stream health
hanzo_vision_source_fps{pipeline, source}
hanzo_vision_source_latency_seconds{pipeline, source}
hanzo_vision_source_reconnects_total{pipeline, source}

# Privacy
hanzo_vision_faces_blurred_total{pipeline}
hanzo_vision_pii_detected_total{pipeline, type}

Security Considerations

Access Control

All API endpoints require authentication via Hanzo IAM (HIP-0001). Pipelines are scoped to organizations. A pipeline in organization "acme" cannot access models, galleries, or streams owned by organization "hanzo."

Camera credentials (RTSP URLs with embedded passwords) are stored as Hanzo KMS secrets (HIP-0005), never in pipeline YAML. The pipeline references secrets by name:

sources:
  - id: camera-01
    type: rtsp
    url: kms://hanzo-vision/camera-01-url    # Resolved at runtime from KMS

Model Supply Chain

Models in the zoo are signed with Ed25519 keys. The signature covers the ONNX weights, preprocessing spec, and benchmark results. The pipeline verifies signatures before loading a model. This prevents a compromised Object Storage bucket from serving a poisoned model.

Custom models uploaded via the import API undergo basic safety checks: tensor shape validation, weight distribution analysis (detecting NaN/Inf), and a test inference on a standard input. These checks do not guarantee safety but catch common corruption and basic adversarial weights.
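
The NaN/Inf part of the weight check is simple enough to sketch. The example below is a simplified illustration, not the import API's implementation: it assumes weights have already been loaded into plain Python lists keyed by tensor name, and the function name find_non_finite is hypothetical.

```python
import math


def find_non_finite(weights: dict[str, list[float]]) -> dict[str, int]:
    """Return {tensor_name: count} for tensors containing NaN or +/-Inf values."""
    bad = {}
    for name, values in weights.items():
        count = sum(1 for v in values if not math.isfinite(v))
        if count:
            bad[name] = count
    return bad


# Toy weights: one tensor with a NaN, one clean, one with both infinities.
weights = {
    "conv1.weight": [0.1, float("nan"), -0.3],
    "fc.weight": [1.0, 2.0],
    "bn.bias": [float("inf"), float("-inf")],
}
report = find_non_finite(weights)
print(report)
```

A model with a non-empty report would be rejected before any test inference is attempted.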

Frame Data Security

Raw video frames are the most sensitive data in the pipeline. They are:

Never written to disk unless explicitly configured (e.g., for debugging)
Held in memory only for the duration of processing (typically <100ms)
Encrypted in transit between pipeline components using mTLS
Subject to the privacy policies defined in the pipeline configuration

Backward Compatibility

This is a new standard. There are no existing Hanzo Vision deployments to maintain compatibility with.

The output schemas (Detection, OCRResult, FaceResult) are versioned. Future schema changes will be additive (new optional fields). Removing or renaming fields requires a new major schema version with a migration period.

The pipeline YAML format follows Kubernetes conventions (apiVersion, kind, metadata, spec). Future versions will increment the API version (e.g., vision.hanzo.ai/v2) and support conversion webhooks for automatic migration.

Reference Implementation

The reference implementation is at github.com/hanzoai/vision.

Technology Choices

Component	Technology	Rationale
Pipeline runtime	Rust	Memory safety, zero-cost abstractions, GPU interop via cudarc
API server	Rust (axum)	Async, low overhead, same binary
Video decoding	FFmpeg + NVDEC/VAAPI bindings	Industry standard, hardware acceleration
Inference	ONNX Runtime + TensorRT	Cross-platform + NVIDIA optimization
Preprocessing	GPU kernels (CUDA/Metal)	Avoid CPU-GPU transfers
Streaming	gRPC streams + Hanzo Stream	Low latency + durable delivery
Configuration	YAML + JSON Schema validation	Human-readable, machine-validatable

Directory Structure

vision/
  cmd/
    hanzo-vision/          # Main binary entry point
  pkg/
    pipeline/              # Pipeline runtime and DAG scheduler
    source/                # Source implementations (RTSP, V4L2, ROS2, etc.)
    stage/                 # Processing stage implementations
    sink/                  # Sink implementations (Stream, gRPC, webhook)
    model/                 # Model loading, optimization, and caching
    preprocess/            # GPU-accelerated preprocessing kernels
    postprocess/           # NMS, coordinate transform, schema mapping
    privacy/               # Face blur, PII detection, consent
    tracking/              # Multi-object tracking algorithms
    annotation/            # AI-assisted labeling pipeline
    fusion/                # Multi-sensor fusion
  api/
    proto/                 # Protobuf definitions
    openapi/               # OpenAPI spec for REST endpoints
  models/
    zoo/                   # Model zoo manifests (YAML per model)
  deploy/
    docker/                # Dockerfile and compose files
    k8s/                   # Kubernetes manifests
    edge/                  # Edge deployment configs (Jetson, etc.)
  tests/
    integration/           # Pipeline integration tests
    benchmark/             # Performance benchmarks
    data/                  # Test images and videos

Build and Run

# Build
cargo build --release

# Run with config
./target/release/hanzo-vision --config pipeline.yaml

# Docker
docker run -d \
  --gpus all \
  -p 8081:8081 \
  -v /path/to/pipelines:/etc/hanzo-vision/pipelines \
  ghcr.io/hanzoai/vision:latest

# Kubernetes
kubectl apply -f deploy/k8s/vision-deployment.yaml

Test Vectors

Minimum Viable Pipeline

A pipeline that reads an image from disk, runs YOLO detection, and writes results to stdout:

apiVersion: vision.hanzo.ai/v1
kind: Pipeline
metadata:
  name: test-detection
spec:
  sources:
    - id: input
      type: file
      path: /data/test.jpg
  stages:
    - id: detect
      model: yolo-v11-n
      confidence: 0.25
  sinks:
    - id: output
      type: stdout
      format: json

Expected output for a test image containing one person and one car:

{
  "frame_id": "frm_001",
  "detections": [
    {
      "class": "person",
      "confidence": 0.87,
      "bbox": [120, 80, 350, 420]
    },
    {
      "class": "car",
      "confidence": 0.93,
      "bbox": [500, 200, 900, 450]
    }
  ]
}

Preprocessing Correctness

The preprocessing pipeline must produce bit-identical output for the same input across runs (deterministic). The test suite includes reference input/output pairs for each preprocessing step:

Test	Input	Expected Output
Resize (letterbox)	1920x1080 RGB	640x640 with 140px top/bottom padding
Color convert (YUV->RGB)	NV12 frame	RGB with ITU-R BT.601 (formerly CCIR 601) coefficients
Normalize (ImageNet)	[0-255] uint8	float32, mean-subtracted, std-divided
Batch (2 images)	Two 640x640 RGB	[2, 3, 640, 640] NCHW tensor
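
The letterbox geometry can be checked with a few lines of arithmetic. The sketch below assumes the common convention of scaling by the smaller of the two axis ratios and splitting the leftover space evenly between the two sides; the function name letterbox_geometry is hypothetical, and the reference implementation may round differently.

```python
def letterbox_geometry(src_w, src_h, dst_w=640, dst_h=640):
    """Compute the resized size and padding for an aspect-preserving letterbox."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = dst_w - new_w
    pad_y = dst_h - new_h
    left, top = pad_x // 2, pad_y // 2
    return {
        "resized": (new_w, new_h),
        "pad_left": left,
        "pad_right": pad_x - left,
        "pad_top": top,
        "pad_bottom": pad_y - top,
    }


geometry = letterbox_geometry(1920, 1080)
print(geometry)
```

For a 1920x1080 input the scale factor is 640/1920 = 1/3, so the image is resized to 640x360 and the remaining 280 rows are split into 140 rows of padding above and below.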
