HIP-220 | Draft | Meta

Bias Detection & Mitigation

Framework for detecting, measuring, and mitigating bias in AI systems.

Hanzo AI Team (@hanzoai)
Created: 2025-12-17
Tags: ai-ethics, fairness, bias, evaluation
Requires: HIP-200, HIP-201

HIP-220: Bias Detection & Mitigation

Abstract

This HIP establishes the framework for detecting, measuring, and mitigating bias in Hanzo AI systems. It defines bias categories, evaluation methodologies, fairness metrics, and remediation processes to ensure AI systems treat all users equitably.

Bias Framework

Definitions

Algorithmic bias: Systematic and repeatable errors in a computer system that create unfair outcomes for certain groups.

Fairness: The principle that AI systems should treat individuals and groups equitably, without discrimination based on protected characteristics.

Protected Characteristics

Category | Characteristics
Demographic | Race, ethnicity, nationality, religion
Personal | Age, gender, sexual orientation, disability
Socioeconomic | Income, education, occupation
Geographic | Region, urban/rural, language

Bias Types

Pre-existing Bias

Type | Source | Example
Historical | Past discrimination in data | Hiring data reflecting past discrimination
Representation | Underrepresentation in data | Fewer examples of certain groups
Measurement | How data is collected | Sensors less accurate for certain skin tones

Technical Bias

Type | Source | Example
Aggregation | One-size-fits-all models | Medical model trained primarily on one population
Learning | Algorithm amplification | Feedback loops reinforcing initial bias
Evaluation | Biased benchmarks | Test sets not representative

Emergent Bias

Type | Source | Example
Deployment | Context mismatch | Model applied to an unintended population
Interaction | User behavior patterns | Different usage across groups
Temporal | Changing contexts | Society changes while the model stays fixed

Bias Detection

Detection Methods

Quantitative Analysis

Method | Application | Tools
Demographic parity | Equal positive prediction rates | Statistical analysis
Equalized odds | Equal TPR/FPR across groups | Fairlearn
Calibration | Predicted probabilities mean the same thing for every group | Reliability diagrams
Individual fairness | Similar treatment for similar individuals | Distance metrics
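The calibration row above relies on reliability diagrams. The sketch below is one way to compute them per group in plain Python. It bins predicted probabilities, compares each bin's mean confidence with its observed positive rate, and summarizes the result as a count-weighted expected calibration error (ECE). The bin count (10) and all function names are illustrative assumptions, not part of this HIP.

```python
from collections import defaultdict


def calibration_by_group(probs, labels, groups, n_bins=10):
    """Reliability-diagram points per group.

    Returns {group: [(bin_index, mean_confidence, observed_rate, count), ...]}.
    A group is well calibrated when mean_confidence is close to observed_rate
    in every bin.
    """
    # bins[group][bin_index] = [sum of predicted probs, sum of labels, count]
    bins = defaultdict(lambda: defaultdict(lambda: [0.0, 0, 0]))
    for p, y, g in zip(probs, labels, groups):
        b = min(int(p * n_bins), n_bins - 1)  # put p == 1.0 in the top bin
        acc = bins[g][b]
        acc[0] += p
        acc[1] += y
        acc[2] += 1
    return {
        g: [(b, s_p / n, s_y / n, n) for b, (s_p, s_y, n) in sorted(gb.items())]
        for g, gb in bins.items()
    }


def expected_calibration_error(rows):
    """Count-weighted mean |confidence - observed rate| over one group's bins."""
    total = sum(n for *_, n in rows)
    return sum(n * abs(conf - rate) for _, conf, rate, n in rows) / total
```

Comparing `expected_calibration_error` across groups gives a single number per group. A group whose ECE is much higher than the others is one where the model's scores do not carry the same meaning.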

Qualitative Analysis

Method | Application | Approach
Output auditing | Review generated content | Human evaluation
Prompt testing | Test with demographic markers | Structured prompts
User studies | Perception of fairness | Surveys, interviews

Evaluation Datasets

Standard Benchmarks

Benchmark | Bias Type | Metrics
WinoBias | Gender | Accuracy parity
StereoSet | Stereotype | Language model score
CrowS-Pairs | Multiple | Preference score
BBQ | Social bias | Accuracy in ambiguous contexts

Custom Evaluation Sets

Dataset | Coverage | Size
Hanzo-Bias-1K | Multi-category bias probes | 1,000+
Regional-Fairness | Geographic/cultural bias | 500+
Intersectional-Set | Intersecting identities | 300+

Audit Process

Pre-Deployment Audit

1. Define evaluation scope (groups, use cases)
    ↓
2. Select appropriate metrics
    ↓
3. Run quantitative evaluation
    ↓
4. Conduct qualitative review
    ↓
5. Document findings
    ↓
6. Determine whether thresholds are met

Ongoing Monitoring

Activity | Frequency | Scope
Automated metrics | Weekly | Key fairness metrics
Sample audits | Monthly | Human review sample
Full audit | Quarterly | Comprehensive evaluation

Fairness Metrics

Group Fairness Metrics

Demographic Parity

Definition: Positive prediction rates are equal across groups

Formula:

P(Ŷ=1|A=a) = P(Ŷ=1|A=b) for all groups a, b

Threshold: <10% difference between any two groups
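The sketch below implements the demographic parity check in dependency-free Python. It assumes binary 0/1 predictions and reads "difference" as the absolute gap in positive prediction rates (highest group rate minus lowest). The HIP does not say whether the threshold is absolute or relative. Fairlearn, listed under Tools, provides `fairlearn.metrics.demographic_parity_difference`, which computes a comparable between-groups gap.

```python
from collections import defaultdict


def positive_rates(y_pred, groups):
    """P(Ŷ=1 | A=g) for each group g, from binary (0/1) predictions."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positive predictions, total]
    for yhat, g in zip(y_pred, groups):
        counts[g][0] += yhat
        counts[g][1] += 1
    return {g: pos / n for g, (pos, n) in counts.items()}


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive prediction rate between any two groups."""
    rates = positive_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


y_pred = [1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
gap = demographic_parity_difference(y_pred, groups)
print(f"gap={gap:.3f}, passes <10%: {gap < 0.10}")  # gap=0.333, passes <10%: False
```

Taking max minus min covers every pair of groups at once. It matches the "for all groups a, b" quantifier in the formula.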

Equalized Odds

Definition: Equal true positive and false positive rates across groups

Formulas:

P(Ŷ=1|Y=1,A=a) = P(Ŷ=1|Y=1,A=b)  (Equal TPR)
P(Ŷ=1|Y=0,A=a) = P(Ŷ=1|Y=0,A=b)  (Equal FPR)

Threshold: <15% difference in TPR, <10% difference in FPR
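The sketch below computes per-group true and false positive rates in plain Python and reports the two gaps separately, because this section sets different limits for TPR (<15%) and FPR (<10%). It assumes 0/1 labels and predictions and treats each gap as the absolute max-minus-min difference across groups. Fairlearn also offers `equalized_odds_difference`, but that function collapses both gaps into one number.

```python
def tpr_fpr_by_group(y_true, y_pred, groups):
    """Per-group TPR = P(Ŷ=1|Y=1,A=g) and FPR = P(Ŷ=1|Y=0,A=g)."""
    counts = {}  # group -> [tp, fn, fp, tn]
    for y, yhat, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, [0, 0, 0, 0])
        if y == 1:
            c[0 if yhat == 1 else 1] += 1
        else:
            c[2 if yhat == 1 else 3] += 1
    rates = {}
    for g, (tp, fn, fp, tn) in counts.items():
        # A rate is undefined if the group has no positives (TPR) or no negatives (FPR).
        if tp + fn == 0 or fp + tn == 0:
            raise ValueError(f"group {g!r} needs both positive and negative labels")
        rates[g] = (tp / (tp + fn), fp / (fp + tn))
    return rates


def equalized_odds_gaps(y_true, y_pred, groups):
    """(TPR gap, FPR gap): max minus min across groups for each rate."""
    rates = tpr_fpr_by_group(y_true, y_pred, groups)
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a"] * 4 + ["b"] * 4
tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, groups)
print(tpr_gap < 0.15 and fpr_gap < 0.10)  # False: both gaps are 0.5
```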

Predictive Parity

Definition: Equal precision across groups

Formula:

P(Y=1|Ŷ=1,A=a) = P(Y=1|Ŷ=1,A=b)

Threshold: <10% difference between any two groups
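The sketch below implements the predictive parity formula under the same assumptions as the other group metrics: 0/1 labels and predictions, and "difference" read as the absolute max-minus-min gap. A group that receives no positive predictions has undefined precision. The sketch omits such groups, although an audit would probably want to report them separately.

```python
from collections import defaultdict


def precision_by_group(y_true, y_pred, groups):
    """P(Y=1 | Ŷ=1, A=g): share of each group's positive predictions that are correct.

    Groups with no positive predictions have undefined precision and are omitted.
    """
    true_pos = defaultdict(int)
    pred_pos = defaultdict(int)
    for y, yhat, g in zip(y_true, y_pred, groups):
        if yhat == 1:
            pred_pos[g] += 1
            true_pos[g] += y
    return {g: true_pos[g] / pred_pos[g] for g in pred_pos}


def predictive_parity_difference(y_true, y_pred, groups):
    """Largest precision gap between any two groups."""
    prec = precision_by_group(y_true, y_pred, groups)
    return max(prec.values()) - min(prec.values())


gap = predictive_parity_difference([1, 0, 1, 1], [1, 1, 1, 1], ["a", "a", "b", "b"])
print(gap < 0.10)  # False: precision is 0.5 for "a" and 1.0 for "b"
```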

Individual Fairness Metrics

Consistency

Definition: Similar individuals receive similar treatment

Formula:

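|f(x) - f(x')| ≤ d(x, x') for all pairs x, x', where d is a task-specific distance between individuals (small d means the individuals are similar)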
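The sketch below checks the consistency condition over every pair in an audit sample, treating f as a scalar score. The original individual-fairness formulation compares output distributions, so this scalar version is a simplification. The applicant records, `score`, and `dist` are hypothetical. The distance deliberately ignores the protected attribute, so two people who differ only by group should receive equal scores. A small `tol` keeps floating-point rounding from being reported as a violation.

```python
from itertools import combinations


def lipschitz_violations(items, f, d, tol=1e-9):
    """Index pairs (i, j) where |f(x_i) - f(x_j)| > d(x_i, x_j).

    f: model score in [0, 1]; d: task-specific distance between individuals.
    All pairs are checked (O(n^2)), which suits audit samples, not full datasets.
    """
    scores = [f(x) for x in items]
    return [
        (i, j)
        for i, j in combinations(range(len(items)), 2)
        if abs(scores[i] - scores[j]) > d(items[i], items[j]) + tol
    ]


# Hypothetical applicants; only "income" is relevant to the task.
applicants = [
    {"income": 50, "group": "a"},
    {"income": 50, "group": "b"},
    {"income": 80, "group": "a"},
]


def score(x):
    # Toy model that penalizes group "b"; this is what the check should catch.
    return x["income"] / 100 - (0.2 if x["group"] == "b" else 0.0)


def dist(x, y):
    # Task-specific distance that ignores the protected "group" field.
    return abs(x["income"] - y["income"]) / 100


print(lipschitz_violations(applicants, score, dist))  # [(0, 1), (1, 2)]
```

Both flagged pairs involve the group "b" applicant, whose score is lower than the income-based distance allows.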

Language Model Metrics

Stereotype Score

Definition: Tendency to associate groups with stereotypes

Measurement: Compare likelihood of stereotypical vs. anti-stereotypical completions

Target: Score ≈ 50% (no preference)
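The stereotype score can be computed once each sentence in a pair has a likelihood. The sketch below takes that likelihood as a `log_prob` callback, a hypothetical stand-in for whatever scoring the evaluated model exposes. Benchmarks such as StereoSet and CrowS-Pairs define their own scoring procedures, and those should be used when reproducing their published numbers. Ties count as "not preferred".

```python
def stereotype_score(pairs, log_prob):
    """Percentage of pairs where the model prefers the stereotypical sentence.

    pairs: iterable of (stereotypical, anti_stereotypical) sentences.
    log_prob: callable returning the model's log-likelihood of a sentence.
    50.0 means no systematic preference; above 50 leans toward stereotypes.
    """
    pairs = list(pairs)
    preferred = sum(1 for stereo, anti in pairs if log_prob(stereo) > log_prob(anti))
    return 100.0 * preferred / len(pairs)


# Hypothetical log-likelihoods standing in for a real model.
fake_log_probs = {"s1": -10.0, "a1": -12.0, "s2": -15.0, "a2": -11.0}
print(stereotype_score([("s1", "a1"), ("s2", "a2")], fake_log_probs.get))  # 50.0
```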

Representation Metrics

Metric | Definition | Target
Mention parity | Equal mention rates | Within 20%
Sentiment parity | Equal sentiment | No significant difference
Association | Co-occurrence patterns | No stereotypical clustering
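The sketch below measures mention parity over a batch of generated texts. The group lexicon is hypothetical, and the exact-token matcher is deliberately naive: it misses inflections and multi-word terms. The HIP does not define "Within 20%", so the sketch assumes a relative reading, where every group's mention rate is at least 80% of the highest group's rate.

```python
import re


def mention_rates(texts, group_terms):
    """Fraction of texts that mention at least one term for each group.

    group_terms: {group: [lowercase terms]}, a hypothetical lexicon.
    """
    token_sets = [set(re.findall(r"[a-z']+", t.lower())) for t in texts]
    return {
        g: sum(1 for toks in token_sets if toks & set(terms)) / len(texts)
        for g, terms in group_terms.items()
    }


def within_relative(rates, tol=0.20):
    """True if every group's rate is at least (1 - tol) times the highest rate."""
    top = max(rates.values())
    return top == 0 or min(rates.values()) >= (1 - tol) * top


texts = ["She is a doctor.", "He is a nurse.", "He went home.", "They left."]
rates = mention_rates(texts, {"female": ["she", "her"], "male": ["he", "him"]})
print(rates, within_relative(rates))  # {'female': 0.25, 'male': 0.5} False
```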

Intersectional Analysis

Analyze bias at the intersection of multiple protected characteristics:

Intersection | Analysis
Gender × Race | Check for compounded bias
Age × Disability | Assess unique patterns
Region × Language | Cultural intersections
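Intersectional analysis matters because a model can look balanced on each characteristic separately while treating specific combinations very differently. The sketch below groups records by the full combination of attributes and reports the positive-outcome rate for each cell. Record fields and the `min_size` cutoff of 30 are illustrative assumptions. Small cells are still reported but flagged as unreliable, because their rates are noisy.

```python
from collections import defaultdict


def intersectional_rates(records, attrs, outcome="y_pred", min_size=30):
    """Positive-outcome rate for every observed combination of `attrs`.

    records: list of dicts holding the attributes and a 0/1 outcome.
    """
    cells = defaultdict(lambda: [0, 0])  # (attr values...) -> [positives, total]
    for r in records:
        key = tuple(r[a] for a in attrs)
        cells[key][0] += r[outcome]
        cells[key][1] += 1
    return {
        key: {"rate": pos / n, "n": n, "reliable": n >= min_size}
        for key, (pos, n) in cells.items()
    }


records = [
    {"gender": "f", "race": "x", "y_pred": 1},
    {"gender": "f", "race": "y", "y_pred": 0},
    {"gender": "m", "race": "x", "y_pred": 0},
    {"gender": "m", "race": "y", "y_pred": 1},
]
# Gender alone and race alone both give 0.5 for every group, yet the cells differ.
for key, cell in intersectional_rates(records, ["gender", "race"], min_size=1).items():
    print(key, cell["rate"])
```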

Mitigation Strategies

Pre-Training Mitigations

Strategy | Application
Data balancing | Ensure representative training data
Data augmentation | Add underrepresented examples
Data filtering | Remove biased content
Source diversification | Include diverse data sources
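Data balancing can be implemented in several ways. One well-known option is reweighing (Kamiran & Calders), used here as an illustration rather than a method this HIP prescribes. Each training example receives the weight P(A=g)·P(Y=y) / P(A=g, Y=y), which makes group membership and label statistically independent in the weighted data. The resulting weights would then be passed to any learner that accepts per-sample weights.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-example weights w(g, y) = P(A=g) * P(Y=y) / P(A=g, Y=y).

    Probabilities are estimated from counts, which simplifies to
    count(g) * count(y) / (n * count(g, y)).
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


groups = ["a", "a", "a", "b"]
labels = [1, 1, 0, 0]
print(reweighing_weights(groups, labels))  # [0.75, 0.75, 1.5, 0.5]
```

With these weights, group "a" has a weighted positive rate of 1.5 / 3.0 = 0.5, which equals the overall positive rate. Over-represented (group, label) combinations are down-weighted, and under-represented ones are up-weighted.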

Training Mitigations

Strategy | Application
Debiasing techniques | RLHF, constitutional AI
Regularization | Fairness constraints in loss
Adversarial training | Train against bias detectors
Multi-task learning | Include fairness objectives

Post-Training Mitigations

Strategy | Application
Output filtering | Flag biased outputs
Calibration | Adjust outputs for fairness
Prompt engineering | Include fairness instructions
Human review | Manual review of sensitive outputs

Deployment Mitigations

Strategy | Application
Use case restrictions | Limit high-risk applications
User warnings | Inform users of limitations
Feedback loops | Collect bias reports
A/B testing | Test mitigation effectiveness

Implementation

Bias Review Process

New Model Review

1. Complete bias assessment questionnaire
    ↓
2. Run standard benchmark suite
    ↓
3. Conduct demographic analysis
    ↓
4. Document known limitations
    ↓
5. Implement mitigations
    ↓
6. Verify mitigation effectiveness
    ↓
7. Obtain fairness sign-off

Periodic Review

Activity | Frequency | Owner
Metric monitoring | Weekly | MRM Team
Benchmark re-run | Monthly | Safety Team
Full audit | Quarterly | External + Internal
Process review | Annual | ESG Committee

Documentation Requirements

Fairness Card

Section | Contents
Groups evaluated | Demographic breakdown
Metrics used | Fairness metrics applied
Results | Metric values by group
Known limitations | Identified bias patterns
Mitigations applied | Steps taken to address identified bias
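A Fairness Card could also be kept in machine-readable form so it can be versioned alongside the model. The dataclass below is a hypothetical schema that mirrors the five sections in the table; the field names are illustrative and are not defined by this HIP. One basic consistency rule is enforced: every reported result must belong to a declared metric. The annotations require Python 3.9 or later.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FairnessCard:
    """Hypothetical machine-readable Fairness Card; field names are illustrative."""

    groups_evaluated: dict[str, list[str]]  # attribute -> groups covered
    metrics_used: list[str]
    results: dict[str, dict[str, float]]  # metric -> {group: value}
    known_limitations: list[str] = field(default_factory=list)
    mitigations_applied: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Every reported result must correspond to a declared metric.
        unknown = set(self.results) - set(self.metrics_used)
        if unknown:
            raise ValueError(f"results for undeclared metrics: {sorted(unknown)}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


card = FairnessCard(
    groups_evaluated={"gender": ["female", "male"]},
    metrics_used=["demographic_parity"],
    results={"demographic_parity": {"female": 0.41, "male": 0.44}},
    known_limitations=["Non-binary gender not covered by evaluation set"],
)
print(card.to_json())
```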

Threshold Standards

Model Risk Tier | Demographic Parity | Equalized Odds
Critical | <5% difference | <10% difference
High | <10% difference | <15% difference
Medium | <15% difference | <20% difference
Low | <20% difference | <25% difference
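The tier table can be encoded as data, so that a pass/fail check is a lookup. The table gives one Equalized Odds limit per tier, whereas the Equalized Odds metric definition gives separate TPR and FPR limits. This sketch assumes the tier limit applies to the larger of the two gaps, and it treats "difference" as an absolute gap expressed as a fraction. Both readings are assumptions.

```python
# Exclusive upper bounds from the Threshold Standards table, as fractions (0.05 = 5%).
THRESHOLDS = {
    "critical": {"demographic_parity": 0.05, "equalized_odds": 0.10},
    "high": {"demographic_parity": 0.10, "equalized_odds": 0.15},
    "medium": {"demographic_parity": 0.15, "equalized_odds": 0.20},
    "low": {"demographic_parity": 0.20, "equalized_odds": 0.25},
}


def check_thresholds(tier, dp_gap, tpr_gap, fpr_gap):
    """Return {metric: passed} for a model at the given risk tier.

    Assumption: the single Equalized Odds limit applies to the larger of
    the TPR and FPR gaps.
    """
    limits = THRESHOLDS[tier.lower()]
    return {
        "demographic_parity": dp_gap < limits["demographic_parity"],
        "equalized_odds": max(tpr_gap, fpr_gap) < limits["equalized_odds"],
    }


print(check_thresholds("High", dp_gap=0.08, tpr_gap=0.12, fpr_gap=0.17))
# {'demographic_parity': True, 'equalized_odds': False}
```

Any failing entry would then trigger the "Threshold violation" escalation path under Governance.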

Governance

Fairness Oversight

Body | Role
Fairness Lead | Day-to-day oversight
Safety Team | Evaluation execution
ESG Committee | Policy and escalations
External Board | Independent review

Escalation Process

Trigger | Escalation
Threshold violation | Fairness Lead → MRM Lead
User complaint | Support → Fairness Lead
External report | Communications → ESG Committee
Regulatory inquiry | Legal → Board

Continuous Improvement

  • Track bias metrics over model versions
  • Research new evaluation methods
  • Engage with fairness research community
  • Update thresholds based on capabilities

Related HIPs

  • HIP-200: Responsible AI Principles
  • HIP-201: Model Risk Management
  • HIP-210: Safety Evaluation Framework
  • HIP-230: AI Transparency & Explainability
  • HIP-240: AI Incident Response

Changelog

Version | Date | Changes
1.0 | 2025-12-17 | Initial draft

Copyright

Copyright and related rights waived via CC0.