ForgeSDLC
Navigate
Home
Discover ForgeSDLC (101)
Practice (201)
Master (301)

This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.

CRISP-DM: Cross-Industry Standard Process for Data Mining

Purpose: Project-agnostic reference for CRISP-DM — the most widely adopted methodology for data mining and applied machine learning. Use it to structure discovery, modeling, and handoff without tying the blueprint to a specific tool stack.

Audience: Teams following DATA-SCIENCE.md and approaches/README.md.


Overview

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a process model, not a single algorithm or product. It organizes work from business intent through deployed models with explicit feedback loops. Its strength is shared vocabulary across business, analytics, and engineering: everyone can anchor discussions in the same six phases.

The process is iterative by design. Evaluation often sends you back to data preparation or modeling; deployment feedback revisits business understanding. Treat phase boundaries as decision gates, not linear milestones.


Six phases at a glance

Phase Objectives Key activities Typical outputs Typical duration %*
Business Understanding Align analytics with business goals and constraints Stakeholder interviews; problem framing; success criteria; risk and scope Problem statement, success metrics, project plan, feasibility note ~15–20%
Data Understanding Know what data exists and whether it can support the goals Initial collection; EDA; quality checks; hypothesis sketches Data dictionary, EDA report, quality assessment, gap analysis ~15–20%
Data Preparation Produce analysis-ready datasets and reproducible transforms Selection, cleaning, integration, feature construction, formatting Clean datasets, feature specs, transformation code/pipelines ~25–35%
Modeling Fit candidate models and compare approaches Technique selection; design of experiments; train/validation protocol; tuning Trained models, experiment logs, comparison tables ~10–20%
Evaluation Judge whether results meet business and technical bars Metric review vs criteria; error analysis; cost/benefit; deploy decision Evaluation report, go/no-go, documented limitations ~10–15%
Deployment Put the solution into operation and sustain value Rollout plan; monitoring; maintenance; documentation; knowledge transfer Deployed artifact, runbooks, monitoring dashboards, final report ~5–15%

*Percentages are indicative only; highly regulated or messy data domains shift time toward understanding and preparation.

flowchart TB BU[Business Understanding] DU[Data Understanding] DP[Data Preparation] MO[Modeling] EV[Evaluation] DE[Deployment] BU --> DU DU --> DP DP --> MO MO --> EV EV --> DE DE -.->|monitoring & new questions| BU EV -.->|revisit features or data| DP EV -.->|try other algorithms| MO DU -.->|clarify goals or scope| BU

Phase deep dives

Business Understanding

Sub-task What to do Output
Problem framing Translate business language into an analytical or ML problem Formal problem statement
Success criteria Define measurable business and model-level success KPIs, acceptance thresholds
Data mining goals Specify modeling type (e.g. classification, forecasting) and constraints Technical goal doc
Project plan Resources, timeline, data access, ethics/compliance checkpoints Plan with milestones

Data Understanding

Sub-task What to do Output
Initial collection Acquire or catalog sources; document lineage where possible Inventory, access agreements
EDA Univariate/multivariate summaries; visualizations; segment checks EDA notebook or report
Data quality Missingness, duplicates, outliers, label noise Quality scorecard
Early insights Hypotheses that inform preparation and modeling Short insight memo

Data Preparation

Sub-task What to do Output
Selection Choose rows, columns, time windows aligned to the target Subset definitions
Cleaning Imputation, deduplication, outlier handling (with rationale) Cleaning rules
Feature construction Aggregations, ratios, encodings, domain features Feature definitions
Integration Joins across sources; consistent keys and grain Integrated tables
Formatting Types, schemas, splits compatible with modeling tools Model-ready datasets

Modeling

Sub-task What to do Output
Technique selection Match algorithms to data size, interpretability, latency Shortlist with rationale
Train/test split Respect leakage rules (time, groups, stratification) Split protocol
Hyperparameter tuning Search strategy; validation discipline Tuned configs
Model building Train final candidates; track seeds and code versions Model artifacts + metadata

Evaluation

Sub-task What to do Output
Business criteria check Compare metrics to thresholds from Business Understanding Pass/fail vs criteria
Model review Error analysis, robustness spot checks, fairness slices Review notes
Deployment decision Go, iterate, or stop — with documented trade-offs Decision record

Deployment

Sub-task What to do Output
Deployment plan Rollout, rollback, ownership Runbook
Monitoring Data drift, performance, latency, business KPIs Dashboards, alerts
Maintenance Retrain triggers, data refresh, dependency updates Ops checklist
Final report Lessons learned, limitations, handover Closing documentation

CRISP-DM vs other methodologies

Dimension CRISP-DM TDSP (Microsoft) SEMMA (SAS) KDD Process
Origin / style Cross-industry standard; phase-oriented Team roles, agile sprints, code structure Sample, Explore, Modify, Model, Assess Research-oriented knowledge discovery
Business linkage Strong explicit phase Strong (business metrics, backlog) Weaker; more data-centric Strong on problem understanding
Iteration Core to the model Agile loops + phases Implicit in Assess Multiple loops
Typical audience General analytics / ML Enterprise Azure/data teams SAS-centric shops Academia / R&D framing

Use CRISP-DM when you want a neutral, widely recognized scaffold; pair it with MLOps when production automation is required (see mlops.md).


CRISP-DM mapped to SDLC (high level)

Generic SDLC letters A–F here mean: A charter/requirements, B design, C build/integrate, D implementation detail, E test/verify, F release/operate — adjust to your org’s naming.

flowchart LR subgraph CRISP["CRISP-DM"] BU2[Business Understanding] DU2[Data Understanding] DP2[Data Preparation] MO2[Modeling] EV2[Evaluation] DE2[Deployment] end subgraph SDLC["SDLC (illustrative)"] A[A / B] C[C / D] D2[D] E[E] F[F] end BU2 --> A DU2 --> A DP2 --> C MO2 --> D2 EV2 --> E DE2 --> F
  • Business UnderstandingA/B (intent, scope, success definition).
  • Data Understanding + Data PreparationC/D (data products, pipelines, integration).
  • ModelingD (training code, experiments).
  • EvaluationE (verification against criteria).
  • DeploymentF (release, monitoring, operations).

Adapting CRISP-DM for modern ML

  • Deep learning: Modeling and Evaluation expand — longer training cycles, more hardware dependency, and need for baseline + ablation discipline. Data Preparation often includes large-scale labeling and augmentation policies.
  • MLOps: Deployment is not a one-time step; it includes pipelines, registries, and continuous monitoring. Feedback loops are automated (retrain triggers) rather than only manual project reviews.
  • Feature stores and real-time serving: Data Preparation and Deployment overlap with online/offline feature consistency and point-in-time correctness — classic CRISP-DM documents assumed batch analytics unless you extend them explicitly.

Anti-patterns

Anti-pattern Why it hurts Better habit
Skipping Business Understanding Optimizes the wrong metric; shelf-ware models Lock success criteria before heavy modeling
Data leakage in preparation Inflated offline metrics; production failure Time-safe splits, group splits, pipeline encapsulation
Single evaluation metric Hidden failure modes (e.g. good AUC, bad recall on minority) Multi-metric dashboards + slice analysis
Treating Deployment as “IT handoff” No monitoring owner; silent drift Shared SLOs and runbooks from day one

External references

  • CRISP-DM 1.0 — IBM / SPSS consortium process model (widely cited standard document).
  • Witten, Frank, Hall, and Pal — Data Mining: Practical Machine Learning Tools and Techniques — practical grounding for phases and evaluation.

Keep project-specific model documentation in docs/product/ and experiment logs in docs/development/, not in this file.