HIP-260 | Draft | Meta

Efficient Model Practices

Best practices for developing and deploying energy-efficient AI models.

Hanzo AI Team (@hanzoai)
Created: 2025-12-17
Tags: sustainability, efficiency, optimization, energy
Requires: HIP-200, HIP-250, HIP-251

HIP-260: Efficient Model Practices

Abstract

This HIP establishes best practices for developing and deploying energy-efficient AI models at Hanzo AI. It covers architecture decisions, training optimizations, inference efficiency, and operational practices that reduce environmental impact while maintaining capability.

Efficiency Principles

Guiding Principles

  1. Right-size models: Use the smallest model that meets requirements
  2. Optimize before scaling: Pursue efficiency improvements before adding compute
  3. Measure continuously: Track energy and efficiency metrics
  4. Share learnings: Document and share efficiency improvements
  5. Balance trade-offs: Weigh efficiency alongside capability in design decisions

Efficiency Hierarchy

1. Avoid unnecessary computation
2. Reduce computation needed
3. Make computation more efficient
4. Use clean energy for remaining computation
5. Offset residual emissions

Training Efficiency

Architecture Design

Efficient Architectures

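| Technique | Benefit | Trade-off |
| --- | --- | --- |
| Sparse attention | Near-linear cost, e.g. O(n·w) for window size w, vs O(n²) for dense attention | Some quality loss |
| Linear attention | O(n) in sequence length | Limited context |
| Mixture of Experts | Conditional compute | Routing and serving complexity |
| Parameter sharing | Smaller models | Some quality loss |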
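
To make the sparse attention row concrete, the sketch below counts query-key score computations for dense attention and for a causal sliding-window pattern. The sliding window is only one assumed example of a sparse pattern, and the function names are hypothetical.

```python
def full_attention_scores(seq_len: int) -> int:
    """Query-key pairs scored by dense attention: n * n."""
    return seq_len * seq_len


def sliding_window_scores(seq_len: int, window: int) -> int:
    """Pairs scored when each query attends to at most `window` keys
    (itself plus the preceding window - 1 positions)."""
    return sum(min(i + 1, window) for i in range(seq_len))


if __name__ == "__main__":
    for n in (1_024, 8_192):
        dense = full_attention_scores(n)
        sparse = sliding_window_scores(n, window=256)
        print(f"n={n}: dense={dense}, windowed={sparse}, ratio={dense / sparse:.1f}x")
```

For a fixed window the windowed count grows linearly with sequence length, while the dense count grows quadratically.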

Architecture Guidelines

| Guideline | Rationale |
| --- | --- |
| Start small | Prove approach before scaling |
| Test efficiency | Measure before committing |
| Consider alternatives | Evaluate efficient variants |
| Document choices | Record efficiency trade-offs |

Training Optimizations

Compute Efficiency

| Technique | Implementation | Benefit |
| --- | --- | --- |
| Mixed precision (BF16/FP16) | Default for all training | Up to 2x lower memory use, ~1.5x speed |
| Gradient checkpointing | For memory-limited runs | 3-4x memory reduction, at the cost of extra recomputation |
| Flash Attention | Default for transformers | 2-4x attention speedup |
| Fused kernels | Use optimized libraries | 10-30% speedup |

Data Efficiency

| Technique | Implementation | Benefit |
| --- | --- | --- |
| Data deduplication | Preprocessing | Better quality per token |
| Quality filtering | Curation pipeline | Fewer tokens needed |
| Curriculum learning | Easy to hard | Faster convergence |
| Active learning | Targeted data collection | Less data needed |
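
As an illustration of the data deduplication row, here is a minimal sketch of exact-match deduplication after light normalization. The normalization rule (lowercasing, whitespace collapsing) is an assumption; near-duplicate detection would need a fuzzier method and is not shown.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```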

Training Strategies

| Strategy | Implementation | Benefit |
| --- | --- | --- |
| Learning rate scheduling | Cosine with warmup | Faster convergence |
| Early stopping | Validation monitoring | Avoid overtraining |
| Checkpoint averaging | Average best checkpoints | Better final model |
| Hyperparameter tuning | Systematic search | Optimal efficiency |
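
The first two strategies can be sketched without any framework. The block below shows one common form of a cosine schedule with linear warmup and a patience-based early-stopping check; the parameter names and default values are illustrative assumptions, not values prescribed by this HIP.

```python
import math


def cosine_with_warmup(step: int, total_steps: int, warmup_steps: int,
                       peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```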

Training Process Requirements

Pre-Training Checklist

| Item | Verification |
| --- | --- |
| ☐ Baseline efficiency established | Measured baseline metrics |
| ☐ Efficiency techniques applied | All applicable techniques |
| ☐ Hardware utilization planned | GPU utilization >80% target |
| ☐ Energy tracking configured | Monitoring in place |

During Training

| Metric | Target | Action if Below |
| --- | --- | --- |
| GPU utilization | >80% | Optimize batching |
| Memory utilization | >70% | Adjust batch size |
| Training loss curve | Expected descent | Investigate, adjust |
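
A minimal sketch of how the two numeric targets above might be checked automatically. The metric keys and the `check_training_metrics` helper are hypothetical; the loss-curve row is left out because "expected descent" needs a job-specific definition.

```python
# Targets and actions taken from the table above; keys are assumed names.
TARGETS = {
    "gpu_utilization": (0.80, "Optimize batching"),
    "memory_utilization": (0.70, "Adjust batch size"),
}


def check_training_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the recommended action for every metric that misses its target."""
    actions = []
    for name, (target, action) in TARGETS.items():
        value = metrics.get(name)
        if value is not None and value <= target:
            actions.append(f"{name}={value:.0%} (target >{target:.0%}): {action}")
    return actions
```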

Inference Efficiency

Model Optimization

Quantization

| Level | Format | Use Case | Quality Impact |
| --- | --- | --- | --- |
| FP16 | Half precision | Default deployment | Minimal |
| INT8 | 8-bit integer | Production | 0-2% quality loss |
| INT4 | 4-bit integer | Edge/cost-sensitive | 2-5% quality loss |
| GPTQ/AWQ | Advanced quantization | Best quality at low bits | <2% typically |
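
To show where the quality impact of integer formats comes from, the following sketch applies symmetric per-tensor INT8 quantization to a small list of weights and measures the round-trip error. This is the simplest scheme, chosen as an assumption for illustration; it is not GPTQ or AWQ, which use more elaborate calibration.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.02, -1.27, 0.5, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, f"scale={scale:.4f}", f"max_error={worst:.4f}")
```

The round-trip error of each value is bounded by half the scale, so tensors with a few large outliers lose more precision on their small values.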

Model Compression

| Technique | Reduction | Quality Impact |
| --- | --- | --- |
| Pruning | 30-50% parameters | 1-3% quality loss |
| Knowledge distillation | 2-10x smaller | Variable |
| Low-rank factorization | 20-40% reduction | 1-2% quality loss |
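
As one possible reading of the pruning row, the sketch below performs unstructured magnitude pruning on a flat list of weights. Magnitude-based selection is an assumption here; the HIP does not name a specific pruning criterion.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by magnitude, smallest first; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]
```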

Inference Optimizations

Batching

| Strategy | Use Case | Benefit |
| --- | --- | --- |
| Dynamic batching | API serving | Better utilization |
| Continuous batching | LLM serving | Higher throughput |
| Request coalescing | Similar requests | Efficiency gain |
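
A minimal sketch of dynamic batching, assuming requests are packed in arrival order under a request-count limit and a total-token budget. The `Request` type and both limits are hypothetical; continuous batching, which admits new requests between decoding steps, is not shown.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_tokens: int


def form_batches(queue: list[Request], max_batch_size: int,
                 max_batch_tokens: int) -> list[list[Request]]:
    """Greedily pack queued requests, in arrival order, into batches that
    respect both a request-count limit and a total-token budget.
    A single request larger than the budget still gets its own batch."""
    batches: list[list[Request]] = []
    current: list[Request] = []
    current_tokens = 0
    for req in queue:
        over_size = len(current) >= max_batch_size
        over_tokens = current_tokens + req.num_tokens > max_batch_tokens
        if current and (over_size or over_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req)
        current_tokens += req.num_tokens
    if current:
        batches.append(current)
    return batches
```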

Caching

| Cache Type | Implementation | Benefit |
| --- | --- | --- |
| KV cache | Standard for LLMs | Required for efficiency |
| Response cache | Exact match cache | Avoid recomputation |
| Semantic cache | Similar query cache | Reduce redundant work |
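
The response cache row can be sketched as an exact-match cache with least-recently-used eviction. The `ResponseCache` class and its `generate` callback are hypothetical; a semantic cache would additionally need an embedding model and a similarity threshold, which are out of scope for this sketch.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class ResponseCache:
    """Exact-match LRU cache keyed on the model name and prompt."""

    def __init__(self, max_entries: int = 1024) -> None:
        self.max_entries = max_entries
        self._store: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       generate: Callable[[str], str]) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        response = generate(prompt)
        self._store[key] = response
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return response
```

Exact-match caching is only safe when generation is deterministic for a given prompt, or when serving a previously sampled response is acceptable.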

Speculative Decoding

| Technique | Implementation | Benefit |
| --- | --- | --- |
| Draft model | Small model proposes | 2-3x speedup |
| Self-speculative | Same model, different depth | 1.5-2x speedup |
| Medusa heads | Multiple prediction heads | 2-3x speedup |
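
The draft model row can be illustrated with a toy greedy variant of speculative decoding. Both models are represented as plain functions from a token context to the next token, which is an assumption made to keep the sketch self-contained; sampling-based acceptance and the bonus token that production systems emit after a fully accepted proposal are omitted.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]  # maps a context to the greedy next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: list[int],
                       max_new_tokens: int, k: int = 4) -> list[int]:
    """Greedy speculative decoding: the draft proposes up to k tokens and the
    target keeps the longest prefix it agrees with. On the first disagreement
    the target's own token is used instead."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. Draft model proposes a short continuation.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. Target verifies each position. A real system scores all positions
        #    in one batched forward pass; here it is a plain loop.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                tokens.extend(proposal[:i])
                tokens.append(expected)  # target's correction replaces the miss
                break
        else:
            tokens.extend(proposal)  # every proposed token was accepted
    return tokens
```

Because every emitted token is either confirmed or produced by the target, the output matches plain greedy decoding with the target model; the speedup comes from verifying several positions per target pass.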

Serving Infrastructure

Request Routing

| Strategy | Implementation | Benefit |
| --- | --- | --- |
| Model selection | Route to appropriate model | Use smallest sufficient |
| Complexity estimation | Assess request complexity | Match model to need |
| Load balancing | Efficient distribution | Better utilization |
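
A minimal sketch of complexity-based model selection. The tier names, thresholds, relative costs, and the keyword heuristic are all hypothetical placeholders; a production router would more likely use a learned classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    max_complexity: float   # highest complexity score this tier handles well
    relative_cost: float    # cost per request relative to the smallest tier


# Hypothetical tiers, ordered from cheapest to most capable.
TIERS = [
    ModelTier("small", 0.3, 1.0),
    ModelTier("medium", 0.7, 4.0),
    ModelTier("large", 1.0, 15.0),
]


def estimate_complexity(prompt: str) -> float:
    """Crude stand-in heuristic: longer prompts and reasoning keywords score higher."""
    length_score = min(len(prompt.split()) / 200, 1.0)
    keywords = ("prove", "derive", "refactor", "analyze", "step by step")
    keyword_score = 0.4 if any(k in prompt.lower() for k in keywords) else 0.0
    return min(length_score + keyword_score, 1.0)


def route(prompt: str) -> ModelTier:
    """Pick the cheapest tier whose capability covers the estimated complexity."""
    score = estimate_complexity(prompt)
    for tier in TIERS:
        if score <= tier.max_complexity:
            return tier
    return TIERS[-1]
```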

Scaling

| Strategy | Implementation | Benefit |
| --- | --- | --- |
| Horizontal scaling | Add instances | Handle load |
| Vertical scaling | Better hardware | Efficiency per request |
| Auto-scaling | Demand-based | Avoid idle compute |
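
Demand-based auto-scaling can be reduced to a replica-count calculation. The sketch below assumes a known per-replica capacity and a target utilization; the parameter names and default bounds are illustrative.

```python
import math


def desired_replicas(requests_per_second: float, capacity_per_replica: float,
                     target_utilization: float = 0.7,
                     min_replicas: int = 1, max_replicas: int = 32) -> int:
    """Replicas needed to serve the load while keeping each near target utilization."""
    if capacity_per_replica <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    needed = math.ceil(requests_per_second / (capacity_per_replica * target_utilization))
    return max(min_replicas, min(max_replicas, needed))
```

Scaling down to `min_replicas` during quiet periods is what avoids idle compute; the upper bound protects against runaway cost.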

Operational Efficiency

Compute Scheduling

Time-Based Scheduling

| Strategy | Implementation | Benefit |
| --- | --- | --- |
| Off-peak training | Schedule for low-carbon hours | Lower emissions |
| Batch processing | Aggregate non-urgent work | Better utilization |
| Preemptible instances | Use spot/preemptible | Lower cost/emissions |

Location-Based Scheduling

| Strategy | Implementation | Benefit |
| --- | --- | --- |
| Green region preference | Route to clean grids | Lower emissions |
| Carbon-aware scheduling | Real-time carbon intensity | Optimal timing |
| Follow-the-sun | Move work to clean regions | Maximize renewables |
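
Time-based and location-based scheduling can be combined into a single search over forecast windows. The sketch assumes an hourly carbon-intensity forecast per region (in gCO2e/kWh) is already available; the region names and the data source are hypothetical.

```python
from typing import Optional


def pick_greenest_slot(forecast: dict[str, list[float]],
                       duration_hours: int) -> tuple[str, int, float]:
    """Return (region, start_hour, mean_intensity) for the contiguous window
    with the lowest average carbon intensity (gCO2e/kWh)."""
    best: Optional[tuple[str, int, float]] = None
    for region, hourly in forecast.items():
        for start in range(len(hourly) - duration_hours + 1):
            window = hourly[start:start + duration_hours]
            mean = sum(window) / duration_hours
            if best is None or mean < best[2]:
                best = (region, start, mean)
    if best is None:
        raise ValueError("no forecast window is long enough for the job")
    return best
```

A short job can exploit a brief clean window in an otherwise carbon-intensive region, while a long job tends to favor the region with the lowest average.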

Hardware Efficiency

Hardware Selection

| Factor | Consideration |
| --- | --- |
| Latest generation | 20-50% efficiency gain per generation |
| Right-sized | Match hardware to workload |
| Utilization | Shared resources where appropriate |

Hardware Lifecycle

| Practice | Implementation |
| --- | --- |
| Refresh cycles | Plan efficient hardware upgrades |
| Utilization targets | Maintain >70% average utilization |
| End-of-life | Responsible recycling/resale |

Development Practices

Experiment Efficiency

| Practice | Implementation |
| --- | --- |
| Small-scale first | Test on small data/models first |
| Ablation studies | Systematic, efficient experiments |
| Negative result tracking | Avoid repeating failed experiments |
| Experiment tracking | Log all runs to avoid duplicates |
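
One way to avoid duplicate or previously failed runs is to fingerprint each run configuration before launching it. The sketch below assumes configurations are JSON-serializable dictionaries; the `ExperimentLog` class is a hypothetical in-memory stand-in for a real tracking service.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable hash of a run configuration; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class ExperimentLog:
    """In-memory run registry; a real tracker would persist this."""

    def __init__(self) -> None:
        self._runs: dict[str, dict] = {}

    def should_run(self, config: dict) -> bool:
        """False if an identical configuration was already recorded."""
        return config_fingerprint(config) not in self._runs

    def record(self, config: dict, result: dict) -> None:
        """Store the outcome, including negative results, under the fingerprint."""
        self._runs[config_fingerprint(config)] = {"config": config, "result": result}
```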

Code Efficiency

| Practice | Implementation |
| --- | --- |
| Profiling | Identify bottlenecks |
| Optimized libraries | Use best implementations |
| Batch operations | Vectorize where possible |
| Memory management | Avoid unnecessary allocations |

Metrics & Monitoring

Efficiency Metrics

Training Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| FLOPs/token | Compute per token | Track and reduce |
| Samples/GPU-hour | Training throughput | Maximize |
| GPU utilization | Compute usage | >80% |
| Time to result | Training duration | Minimize |

Inference Metrics

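| Metric | Definition | Target |
| --- | --- | --- |
| Tokens/second/GPU | Throughput | Maximize |
| Latency (p50, p99) | Response time | Per SLA |
| CO2e/1K tokens | Carbon intensity | Minimize |
| Requests/watt | Energy efficiency (request throughput per watt of power drawn) | Track and improve |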
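
The first three inference metrics can be computed from a handful of measured quantities. In the sketch below, the measured energy, the grid carbon intensity, and the optional PUE factor are assumed inputs; how they are collected is defined elsewhere and not shown.

```python
def tokens_per_second_per_gpu(tokens: int, seconds: float, num_gpus: int) -> float:
    """Throughput normalized by the number of GPUs serving the traffic."""
    return tokens / seconds / num_gpus


def percentile_nearest_rank(values: list[float], p: int) -> float:
    """Nearest-rank percentile for integer p in 1..100 (e.g. 50 or 99)."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need at least one value and 1 <= p <= 100")
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using integers
    return ordered[rank - 1]


def co2e_grams_per_1k_tokens(energy_kwh: float, grid_intensity_g_per_kwh: float,
                             tokens: int, pue: float = 1.0) -> float:
    """Operational emissions per 1,000 tokens.
    energy_kwh is IT energy; pue scales it to facility energy."""
    total_grams = energy_kwh * pue * grid_intensity_g_per_kwh
    return total_grams / tokens * 1_000
```

For example, 2 kWh of IT energy at a PUE of 1.2 on a 400 gCO2e/kWh grid is 960 g; spread over 1.2 million tokens, that is 0.8 g CO2e per 1K tokens.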

Monitoring Dashboard

| Panel | Contents |
| --- | --- |
| Efficiency overview | Key efficiency metrics |
| Training efficiency | Current training jobs |
| Inference efficiency | Serving metrics |
| Carbon intensity | Real-time carbon metrics |
| Trends | Efficiency over time |

Reporting

| Report | Frequency | Contents |
| --- | --- | --- |
| Efficiency digest | Weekly | Key metrics, anomalies |
| Optimization opportunities | Monthly | Identified improvements |
| Efficiency review | Quarterly | Progress, initiatives |

Implementation Requirements

New Model Development

| Phase | Efficiency Requirement |
| --- | --- |
| Design | Efficiency consideration in architecture |
| Training | Efficiency techniques applied |
| Evaluation | Efficiency metrics measured |
| Deployment | Optimization before deployment |

Model Deployment

| Requirement | Verification |
| --- | --- |
| Quantization evaluated | Documented quality vs. efficiency trade-off |
| Serving optimized | Batching, caching implemented |
| Monitoring configured | Efficiency metrics tracked |
| Right-sized deployment | Hardware matches workload |

Continuous Improvement

| Activity | Frequency |
| --- | --- |
| Efficiency benchmarking | Monthly |
| Technique evaluation | Quarterly |
| Hardware assessment | Annual |
| Process review | Annual |

Related HIPs

  • HIP-200: Responsible AI Principles
  • HIP-250: Sustainability Standards Alignment
  • HIP-251: AI Compute Carbon Footprint
  • HIP-270: AI Supply Chain Responsibility

Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.0 | 2025-12-17 | Initial draft |

Copyright

Copyright and related rights waived via CC0.