Calibration vs. Discrimination Trade-Offs
Overview
In forecasting and prediction, two dimensions of accuracy often trade off against each other: calibration (how well predicted probabilities match actual frequencies) and discrimination (how well predictions distinguish between different outcomes). A well-calibrated forecaster who says "70% chance" should be right 70% of the time. A discriminating forecaster can tell the difference between 30% and 70% events. Optimal performance requires balancing both, but improving one can hurt the other.
Core Concepts
Calibration
Do your probabilities match reality?
- If you say "80% confident" 100 times, are you right ~80 times?
- Measures whether predicted probabilities match observed frequencies
- Overconfidence: Saying 90% when you are right only 70% of the time
- Underconfidence: Saying 60% when you are right 80% of the time
Perfect calibration: For all predictions at X%, X% come true in the long run.
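The check described above can be sketched as a small calibration table. This is a minimal illustration, not a standard library routine: the function name `calibration_table`, the equal-width binning, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare the
    mean predicted probability with the observed frequency in each bin."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # min() keeps p == 1.0 in the top bin instead of creating an extra one
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(o for _, o in pairs) / len(pairs)
        rows.append((mean_pred, observed, len(pairs)))
    return rows

# Hypothetical data: ten "80%" forecasts, eight of which came true.
probs = [0.8] * 10
outcomes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
for mean_pred, observed, n in calibration_table(probs, outcomes):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={n}")
```

A well-calibrated record shows predicted and observed values close together in every bin. Bins with few forecasts are noisy, so small gaps there mean little.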
Discrimination (Resolution)
Can you tell the difference between likely and unlikely events?
- Can you distinguish 90% probability events from 10% events?
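- Measures how well forecasts separate events that happen from events that don't (spread in predictions is necessary but not sufficient)
- High discrimination: Assigns high probabilities (90%) to events that occur and low ones (10%) to events that don't
- Low discrimination: Everything clustered around the same value, such as 50% (uninformative)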
Perfect discrimination: Always predict 100% for events that happen, 0% for those that don't.
The Trade-Off
Why you can't always have both:
- High discrimination with poor calibration: Overconfident extremes (90% predictions right only 60% of time)
- High calibration with poor discrimination: Safe but uninformative (every forecast sits at the base rate, e.g. everything is 50%)
- Sweet spot: Maximally discriminating while maintaining calibration
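One way to see how the two dimensions fit together is the standard decomposition of the Brier score into reliability, resolution, and uncertainty terms. The notation below is an assumption made for this note: f_i is the i-th forecast, o_i its outcome (0 or 1), the N forecasts are grouped into K distinct forecast values f_k with n_k forecasts each, \bar{o}_k is the observed frequency within group k, and \bar{o} is the overall base rate. The identity is exact only when every forecast within a group has the same value.

```latex
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(f_i - o_i)^2
= \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - \bar{o}_k)^2}_{\text{reliability (calibration error)}}
- \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k - \bar{o})^2}_{\text{resolution (discrimination)}}
+ \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```

Lower is better for the score as a whole, so the goal is a small reliability term and a large resolution term. A forecaster who always states the base rate has zero reliability error and zero resolution, and scores \bar{o}(1 - \bar{o}), which is 0.25 when the base rate is 50%.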
Execution Steps (Improving Both Dimensions)
1. Measure Current Performance
Calibration:
- Plot predicted probabilities vs. actual outcomes
- Calibration curve should hug the diagonal (45° line)
- Brier score captures both calibration and discrimination
Discrimination:
- Compare the forecasts given to events that happened with those given to events that didn't
- Area under the ROC curve (AUC)
- How often did high-confidence predictions come true compared with low-confidence ones?
Example: Record 100 forecasts and their outcomes, then plot the calibration curve.
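The two summary metrics named in this step can be computed directly from a list of forecasts and outcomes. The sketch below uses only the standard library; the function names and the six-forecast track record are hypothetical, and the AUC is computed by brute-force pairwise comparison, which is fine for small samples but slow for large ones.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def auc(probs, outcomes):
    """Probability that a randomly chosen event that happened received a higher
    forecast than a randomly chosen event that did not (ties count as half)."""
    pos = [p for p, o in zip(probs, outcomes) if o == 1]
    neg = [p for p, o in zip(probs, outcomes) if o == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative outcome")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical track record of six forecasts
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
outcomes = [1, 1, 0, 1, 0, 0]
print(f"Brier score: {brier_score(probs, outcomes):.3f}")
print(f"AUC: {auc(probs, outcomes):.3f}")
```

An AUC of 0.5 means the forecasts do not separate outcomes at all, and 1.0 means every event that happened was ranked above every event that did not. AUC depends only on the ranking of forecasts, so it says nothing about calibration.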
2. Identify Bias Direction
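- Overconfident: With predicted probability on the x-axis and observed frequency on the y-axis, the calibration curve falls below the diagonal at high probabilities (predicted 80%, actual 60%)
- Underconfident: The calibration curve rises above the diagonal at high probabilities (predicted 60%, actual 80%)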
- Overly hedged: All predictions near 50% (poor discrimination)
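A single number can summarize the bias direction: mean confidence minus hit rate. The sketch below is one simple way to compute it, not the only one. The function name `overconfidence` and the convention of skipping exact 50% forecasts are assumptions made for this example.

```python
def overconfidence(probs, outcomes):
    """Mean confidence minus hit rate.

    Confidence is the probability assigned to the side the forecaster favored;
    a hit means the favored side occurred. Positive values suggest
    overconfidence, negative values underconfidence. Forecasts of exactly 50%
    are skipped because they favor neither side.
    """
    confs, hits = [], []
    for p, o in zip(probs, outcomes):
        if p == 0.5:
            continue
        confs.append(max(p, 1 - p))
        hits.append(1 if (p > 0.5) == (o == 1) else 0)
    if not confs:
        raise ValueError("no forecasts away from 50%")
    return sum(confs) / len(confs) - sum(hits) / len(hits)

# Hypothetical record: ten 90% forecasts, six of which came true.
print(f"{overconfidence([0.9] * 10, [1] * 6 + [0] * 4):+.2f}")
```

This score does not detect hedging: a forecaster who keeps everything near 50% can score close to zero here while still discriminating poorly.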
3. Adjust for Calibration
If overconfident:
- Regress toward 50% (moderate extreme predictions)
- Ask "What would make me wrong?"
- Track base rates more carefully
If underconfident:
- Push toward extremes when evidence is strong
- Trust pattern recognition
- Acknowledge uncertainty costs (hedging isn't free)
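Both adjustments can be expressed as one operation: scaling a forecast's log-odds by a factor. The sketch below is illustrative only. The function name `rescale_confidence` and the factor `k` are hypothetical, and a sensible value for `k` would have to be estimated from your own track record.

```python
import math

def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

def rescale_confidence(p, k):
    """Scale a probability's log-odds by k.

    k < 1 pulls the forecast toward 50% (for overconfident forecasters);
    k > 1 pushes it toward 0% or 100% (for underconfident forecasters).
    p must be strictly between 0 and 1.
    """
    return inv_logit(k * logit(p))

print(f"{rescale_confidence(0.90, 0.5):.2f}")  # overconfident forecaster, moderated
print(f"{rescale_confidence(0.75, 2.0):.2f}")  # underconfident forecaster, sharpened
```

Working in log-odds rather than raw probabilities keeps the result between 0 and 1 and leaves a 50% forecast unchanged for any `k`.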
4. Improve Discrimination
- Seek better information: Distinguish strong vs. weak signals
- Identify drivers: What factors predict different outcomes?
- Decompose questions: Break complex forecasts into sub-components
- Track leading indicators: Early signals that differentiate
Example: Instead of "Will product succeed?" ask "Will it hit X downloads AND Y retention?"
5. Balance the Trade-Off
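Extremizing: When aggregating forecasts, push the crowd average toward the extremes. Averaging tends to pull forecasts toward 50%, so a moderate push can improve discrimination without hurting calibration; too strong a push causes overconfidence.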
Granular confidence: Use 1-99% scale, not just 25/50/75
Context-dependent: High-stakes decisions need calibration; exploratory decisions can tolerate weaker calibration in exchange for sharper discrimination
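Extremizing can be sketched as averaging individual forecasts in log-odds space and then multiplying the result by a factor. This is one common form of the technique, not a definitive recipe. The function name `extremized_aggregate` is hypothetical, and the default exponent `a=2.0` is a placeholder that would need tuning against resolved questions.

```python
import math

def extremized_aggregate(probs, a=2.0):
    """Average individual forecasts in log-odds space, then multiply the
    average log-odds by a.

    a = 1 gives the plain log-odds average; a > 1 pushes the result away
    from 50%. Every input must be strictly between 0 and 1.
    """
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    return 1 / (1 + math.exp(-a * mean_log_odds))

# Hypothetical crowd of three forecasters on one question
crowd = [0.6, 0.7, 0.8]
print(f"plain average: {extremized_aggregate(crowd, a=1.0):.3f}")
print(f"extremized:    {extremized_aggregate(crowd, a=2.0):.3f}")
```

Scaling log-odds by `a` is equivalent to the transform p^a / (p^a + (1 - p)^a) applied to the aggregated probability. Checking the calibration curve of the extremized forecasts is the way to tell whether `a` is too large.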
Anti-Patterns
False Precision: Claiming 73% when you mean "probably" (discrimination theater without calibration)
Perpetual Hedging: Always saying 50-60% to avoid being wrong (may look calibrated, but discrimination is useless)
Uncalibrated Extremes: Bold predictions (10%, 90%) without tracking accuracy (discrimination without calibration)
Ignoring Base Rates: Overweighting anecdotes vs. statistical priors (poor calibration)
Quality Indicators
High Signal (Good Balance):
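- Brier score < 0.20 on questions with base rates near 50% (combines calibration + discrimination; always forecasting 50% scores 0.25, and the no-skill baseline is lower for rare events)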
- Calibration curve near diagonal across full probability range
- Predictions vary meaningfully (not clustered at 50%)
- Confidence correlates with accuracy
- Regular scoring and feedback
Low Signal:
- Never track actual outcomes vs. predictions
- All predictions in narrow range (40-60%)
- Wildly overconfident (90% predictions right 50% of time)
- No improvement over time despite feedback
Cross-Domain Applications
Superforecasting
Philip Tetlock's research: Best forecasters balance both dimensions
- Track predictions in prediction markets or tournaments
- Use granular probabilities (not just high/medium/low)
- Update incrementally as new information arrives
Machine Learning
Model evaluation trade-offs:
- Precision vs. recall (a threshold trade-off between two kinds of classification error; only loosely analogous to calibration vs. discrimination)
- Confidence scores should match actual accuracy
- Platt scaling: Post-hoc calibration that fits a sigmoid to a model's raw scores on held-out data
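The idea behind Platt scaling can be sketched in a few lines: fit a sigmoid that maps raw model scores to probabilities by minimizing log loss on held-out data. This is a simplified illustration using plain gradient descent. It omits the target smoothing in Platt's original method, and the function names, learning rate, step count, and data are all assumptions made for the example.

```python
import math

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1 / (1 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # derivative of log loss w.r.t. a
            grad_b += (p - y)      # derivative of log loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_predict(score, a, b):
    """Map a raw score to a calibrated probability using fitted a and b."""
    return 1 / (1 + math.exp(-(a * score + b)))

# Hypothetical held-out scores and true labels (deliberately not separable)
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
print(f"score 2.0 -> probability {platt_predict(2.0, a, b):.2f}")
```

Because the mapping is monotonic when `a` is positive, it changes the probabilities without changing the ranking of predictions, so calibration can improve while AUC stays the same.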
Medical Diagnosis
- Discrimination: Can test distinguish sick from healthy?
- Calibration: Does "70% risk" mean 70 out of 100 similar patients?
- Both matter: Errors in either lead to wrong treatment or unnecessary anxiety
Business Forecasting
- Revenue predictions need calibration (for budgeting)
- Opportunity prioritization needs discrimination (which bets to make?)
Related Frameworks
- Brier Score: Combines calibration and discrimination in single metric
- Superforecasting: Tetlock's research on prediction accuracy
- Bayesian Updating: Incremental belief revision improves calibration
- Base Rate Neglect: Ignoring priors hurts calibration
- Extremizing: Aggregation technique that improves discrimination
Scoring (35/50)
- Practitioner Weight (7/10): Core to Tetlock's forecasting research, used in prediction markets
- Clarity (7/10): Concepts clear but measuring them requires statistical knowledge
- Proven ROI (8/10): Superforecasters demonstrably outperform by balancing both
- Novelty (6/10): Statistical concepts applied to forecasting (moderately non-obvious)
- Applicability (7/10): Relevant to forecasting, ML, risk assessment, decision-making
Sources
- Philip Tetlock: Superforecasting (calibration-discrimination trade-offs in expert predictions)
- Philip Tetlock: Expert Political Judgment (foxes vs. hedgehogs, accuracy dimensions)
- Glenn W. Brier: Verification of Forecasts Expressed in Terms of Probability (origin of the Brier score)
- Good Judgment Project: Practical forecasting tournament findings
- Nate Silver: The Signal and the Noise (calibration in prediction)