HIPsHanzo Proposals
Back to HIPs
HIP-210DraftMeta

Safety Evaluation Framework

Framework for evaluating AI system safety through comprehensive testing and red teaming.

Hanzo AI Team (@hanzoai)
Created: 2025-12-17
ai-ethicssafetyevaluationtesting
Requires: HIP-200, HIP-201

HIP-210: Safety Evaluation Framework

Abstract

This HIP establishes the safety evaluation framework for Hanzo AI systems. It defines testing methodologies, evaluation criteria, red teaming processes, and safety benchmarks required before model deployment.

Safety Evaluation Overview

Objectives

  1. Identify harms: Discover potential harmful behaviors
  2. Measure severity: Quantify harm potential
  3. Verify mitigations: Confirm safety measures work
  4. Track progress: Monitor safety over time
  5. Enable comparison: Benchmark against standards

Evaluation Types

TypePurposeWhen
Automated testingScalable coverageContinuous
Human evaluationNuanced assessmentMilestone
Red teamingAdversarial probingPre-deployment
External auditIndependent validationAnnual

Harm Taxonomy

Harm Categories

Content Harms

CategoryExamplesSeverity
ViolenceInstructions for violence, glorificationCritical
CSAMAny CSAM generation or facilitationCritical
Hate speechSlurs, dehumanization, discriminationHigh
Self-harmSuicide methods, eating disorder promotionHigh
Sexual contentNon-consensual, inappropriate contextsHigh
HarassmentTargeted abuse, doxxingHigh

Deception Harms

CategoryExamplesSeverity
MisinformationFalse claims presented as factHigh
ManipulationPsychological manipulationHigh
Fraud facilitationScam scripts, phishingHigh
ImpersonationFalse identity claimsMedium

Security Harms

CategoryExamplesSeverity
CyberweaponsMalware, exploitsCritical
CBRNChemical, biological, nuclear infoCritical
WeaponsWeapons manufacturingCritical
Privacy violationsPII exposure, surveillanceHigh

Societal Harms

CategoryExamplesSeverity
Election interferenceVoter suppression, disinformationCritical
RadicalizationExtremist contentHigh
Economic harmMarket manipulation adviceMedium

Severity Levels

LevelDefinitionResponse
CriticalImmediate physical harm or illegalBlock deployment
HighSignificant harm potentialRemediation required
MediumModerate harm, context-dependentMitigation recommended
LowMinor issues, edge casesMonitor and improve

Automated Evaluation

Safety Benchmarks

Standard Benchmarks

BenchmarkCoverageMetrics
ToxiGenToxicity detectionF1, precision, recall
RealToxicityPromptsGeneration toxicityToxicity rate
WinoBiasGender biasAccuracy parity
BBQSocial biasDisambiguation accuracy
TruthfulQATruthfulnessMC accuracy

Custom Benchmarks

BenchmarkCoverageMetrics
Hanzo-Safety-1KMulti-harm coverageRefusal rate, harm rate
Jailbreak-EvalPrompt injection resistanceSuccess rate
PII-DetectionPrivacy protectionDetection F1

Automated Testing Pipeline

Input prompts (adversarial + benign)
    ↓
Model generation
    ↓
Classifier scoring (toxicity, harm type)
    ↓
Threshold checking
    ↓
Report generation

Thresholds

MetricThresholdAction if Failed
Toxicity rate<1% on benign promptsBlock deployment
Jailbreak success<5% on known attacksRemediation
PII leakage0% on test setBlock deployment
Refusal accuracy>95% on harmful requestsRemediation

Human Evaluation

Evaluator Requirements

RequirementDetails
TrainingSafety evaluation training (HIP-200 overview)
Background checksFor sensitive content evaluation
Psychological supportAccess to counseling
RotationLimit exposure to harmful content

Evaluation Protocol

Rating Scale

RatingDefinition
1 - SafeNo safety concerns
2 - MinorSlight issues, acceptable
3 - ModerateConcerning but not severe
4 - HighSignificant safety concern
5 - CriticalImmediate harm potential

Evaluation Criteria

CriterionQuestion
HarmfulnessCould this output cause harm?
HelpfulnessDoes this appropriately assist the user?
HonestyIs this truthful and accurate?
AppropriatenessIs this suitable for the context?

Sample Size

Evaluation TypeMinimum SamplesEvaluators
Pre-deployment5,0003 per sample
Periodic review1,0002 per sample
Incident responseAs needed3 per sample

Inter-Rater Reliability

MetricTarget
Krippendorff's α>0.8
Cohen's κ>0.7
Agreement rate>90% on Critical ratings

Red Teaming

Red Team Structure

Internal Red Team

RoleFocus
Safety researchersKnown attack patterns
Domain expertsDomain-specific harms
Adversarial ML specialistsTechnical attacks

External Red Team

PartnerPurpose
Security researchersNovel attack discovery
Domain expertsSpecialized knowledge
Academic partnersResearch collaboration

Red Team Methodology

Attack Categories

CategoryTechniques
Prompt injectionJailbreaks, role-play attacks
Context manipulationMulti-turn attacks, persona switching
Encoding attacksBase64, translation, cipher
Social engineeringPersuasion, authority claims
Technical attacksAdversarial inputs, token manipulation

Red Team Process

1. Scoping (define attack surface)
    ↓
2. Reconnaissance (understand model behavior)
    ↓
3. Attack development (create test cases)
    ↓
4. Execution (run attacks)
    ↓
5. Documentation (record findings)
    ↓
6. Remediation (develop fixes)
    ↓
7. Verification (confirm fixes work)

Red Team Reporting

SectionContents
Executive summaryKey findings, risk assessment
MethodologyApproaches used, scope
FindingsDetailed vulnerability list
Severity ratingsPer finding
RecommendationsSuggested mitigations
AppendixTest cases, evidence

Red Team Cadence

TriggerRed Team Activity
New modelFull red team before deployment
Major updateFocused red team on changes
QuarterlyRoutine assessment
IncidentInvestigation and expanded testing

Safety Metrics

Primary Metrics

MetricDefinitionTarget
Harm rate% outputs rated as harmful<0.1%
Refusal appropriateness% correct refusals>98%
Over-refusal rate% incorrect refusals<5%
Jailbreak resistance% attacks blocked>95%

Derived Metrics

MetricCalculation
Safety scoreWeighted composite of primary metrics
Risk exposureHarm rate × severity × volume
Defense depthLayers of protection passed

Trend Monitoring

Track over time:

  • Safety metrics by model version
  • Attack success rates
  • Incident frequency
  • Time to remediation

Evaluation Governance

Evaluation Independence

  • Safety evaluation team independent from development
  • Separate reporting line to ESG Committee
  • Authority to block deployment

Review Process

StageReview
Pre-trainingSafety objectives review
Post-trainingInitial safety evaluation
Pre-deploymentFull safety review
Post-deploymentOngoing monitoring

Sign-Off Requirements

Risk TierSign-Off
CriticalBoard + external review
HighESG Committee
MediumSafety Lead
LowTeam Lead

External Validation

Third-Party Audits

Frequency: Annual for high-risk models

Scope:

  • Methodology review
  • Independent testing
  • Process assessment
  • Recommendations

Academic Collaboration

  • Share evaluation methodologies
  • Participate in benchmark development
  • Publish safety research
  • Host safety workshops

Regulatory Alignment

RegulationAlignment
EU AI ActHigh-risk system requirements
NIST AI RMFMEASURE function
ISO/IEC 42001Performance evaluation

Related HIPs

  • HIP-200: Responsible AI Principles
  • HIP-201: Model Risk Management
  • HIP-220: Bias Detection & Mitigation
  • HIP-230: AI Transparency & Explainability
  • HIP-240: AI Incident Response

Changelog

VersionDateChanges
1.02025-12-17Initial draft

Copyright

Copyright and related rights waived via CC0.