This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.

Model evaluation, selection & validation

Purpose: Project-agnostic reference for measuring, comparing, and validating models beyond a single accuracy number — metrics, splits, tuning, fairness, explainability, and production experimentation.

Audience: Teams following DATA-SCIENCE.md and techniques/README.md. Complements feature-engineering.md.

Overview

Evaluation is not a single step at the end of modeling; it is a system of choices: what to optimize, how to split data, how to compare candidates, and how to behave under uncertainty, fairness, and cost constraints. The goal is decisions you can defend — to stakeholders and to future you.

Classification metrics

Metric	Definition (intuition)	When to use	Limitations
Accuracy	Fraction correct	Balanced classes; equal error costs	Misleading under imbalance
Precision	TP / (TP + FP)	Costly false positives	Ignores false negatives
Recall	TP / (TP + FN)	Costly false negatives	Ignores false positives
F1	Harmonic mean of precision & recall	Single score balancing P/R	Hides threshold choice
AUC-ROC	Rank quality across thresholds	Compare models threshold-free	Can be optimistic if imbalance mishandled
AUC-PR	Area under precision–recall curve	Imbalanced positives	Less familiar to some stakeholders
Log loss	Probabilistic calibration penalty	When probabilities matter	Sensitive to extreme miscalibration
Matthews correlation	Correlation between predicted and actual (binary)	Imbalance-friendly single score	Binary focus

Regression metrics

Metric	Notes	Comparison
MAE	Average absolute error	Robust to outliers vs squared errors
MSE / RMSE	Penalizes large errors more	RMSE same units as target
MAPE	Percentage errors	Breaks near zero targets
R²	Explained variance fraction	Can mislead with bad baselines
Adjusted R²	Penalizes extra predictors	Model comparison with different feature counts

Confusion matrix → metrics

For probabilistic classifiers, metrics are computed at a threshold or integrated over thresholds (ROC/PR).

Validation strategies

Strategy	When to use	Caveat
Train/test split	Plenty of i.i.d. data	High variance with small data
K-fold CV	Stable metric estimates	Leakage if groups/time ignored
Stratified k-fold	Imbalanced classification	Preserve class ratios per fold
Time-series split	Temporal data	No random shuffle
Leave-one-out	Tiny datasets	High variance; expensive
Nested CV	Tuning + unbiased outer estimate	Costly; gold standard for small data

Each rotation trains on k−1 folds and validates on the held-out fold; scores are averaged.

Hyperparameter tuning

Method	Efficiency	Parallelization	Convergence notes
Grid search	Poor in high dimensions	Embarrassingly parallel	Exhaustive only for small spaces
Random search	Often better than grid per eval	Easy parallel	No principled stop
Bayesian optimization (Optuna, Hyperopt)	Sample-efficient	Parallel trials with batches	Needs enough signal per trial
Early stopping	Saves compute for iterative learners	Built into boosting / NN loops	Tune patience; watch validation leakage

Model comparison: statistical vs practical significance

Approach	Role
Paired t-test / Wilcoxon	Compare fold-wise or instance-wise scores across models
McNemar	Paired disagreements on classification outcomes
Bayesian comparison	Posterior over “probability model A wins” — useful under few runs

A statistically significant 0.1% gain may be practically irrelevant if latency, cost, or interpretability regress. Always tie back to Business Understanding criteria (see crisp-dm.md).

Fairness metrics (overview)

Metric	Idea	Typical use case
Demographic parity	Positive rate similar across groups	Equal selection rates
Equalized odds	TPR/FPR similar across groups	Balance errors across groups
Calibration	Predicted probabilities match outcomes per group	Risk scores, lending, medicine
Individual fairness	Similar individuals treated similarly	Hard to operationalize; needs meaningful distance metric

Fairness metrics conflict in general — choosing one is a policy decision, not purely technical.

Explainability methods

Method	Scope	Agnostic?	Notes
SHAP	Local + global aggregates	Model-agnostic (with appropriate explainers)	Shapley-based attributions
LIME	Local	Yes	Perturbation around instance
Partial dependence	Global marginal effects	Usually model-agnostic	Assumes feature independence (watch correlated features)
Feature importance	Global	Tree-native or permutation	Can hide interaction effects
Attention weights	Local (tokens/regions)	Transformer-specific	Not always faithful “explanation”

Model selection trade-offs

Criterion	Question
Accuracy / calibration	Does it meet metric bars on valid slices?
Interpretability	Can operators trust and debug it?
Latency	p50/p99 within SLA?
Cost	Training + inference $ acceptable?

Treat these as a constraint satisfaction problem: hard gates (latency, compliance) then optimize within the feasible set.

Production evaluation patterns

Pattern	Idea	Comparison
A/B test	Split users; compare KPIs	Gold standard for causal product impact; needs power
Shadow mode	New model scores but does not act	Safe comparison of distributions
Interleaving	Interleave ranked lists	Sensitive ranking comparison
Bandits	Allocate traffic by ongoing performance	Exploration vs exploitation

Anti-patterns

Anti-pattern	Why it hurts
Single metric obsession	Hides harm to subgroups or rare failures
Leakage in validation	Same preprocessing fit on all data; group leakage
Overfitting the validation set	Repeated peeking drives hyperparameters to noise
Ignoring fairness	Legal, ethical, and product risk

External references

scikit-learn metrics — classification, regression, scoring API.
Zheng — Evaluating Machine Learning Models — offline vs online evaluation framing.
Barocas, Hardt, Narayanan — Fairness and Machine Learning — limitations of metrics.
SHAP documentation — explainers and plots.

Keep project-specific model documentation in docs/product/ and experiment logs in docs/development/, not in this file.

Navigate