This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.

MLOps: Machine Learning Operations

Purpose: Project-agnostic reference for MLOps — applying DevOps-style practices to machine learning systems so models are reproducible, deployable, observable, and maintainable.

Audience: Teams following DATA-SCIENCE.md, crisp-dm.md, and approaches/README.md.

Overview

MLOps bridges data science (experimentation, uncertainty) and software engineering (reliability, change management). A trained model is not “done” until it can be versioned, served under SLAs, monitored in production, and updated without heroics. MLOps formalizes pipelines, ownership, and automation around that lifecycle.

Maturity levels

Level	Characteristics	Automation	Team skills
0 — Manual / notebook-driven	Ad hoc training; manual copy to production; little reproducibility	Minimal; mostly scripts	Strong modeling; weak production discipline
1 — ML pipeline automation	Repeatable training pipeline; artifact storage; basic deployment scripts	Scheduled or triggered training; scripted deploy	Pipelines as code; basic CI
2 — CI/CD for ML	Tests on data and code; gated promotions; environments as code	CI runs unit + data checks; CD to staging/prod with approvals	DevOps + ML; feature flags / canaries
3 — Continuous training + proactive monitoring	Automated retraining; drift detection; feedback loops into feature and model updates	End-to-end orchestration; policy-driven retrain	SRE + ML platform; strong observability culture

ML pipeline components

Component	Role	Example tools (illustrative)
Data validation	Catch schema drift, bad batches, broken upstreams	Great Expectations, Deequ, custom checks
Feature store	Consistent offline training + online serving features	Feast, Tecton, Databricks Feature Store
Experiment tracking	Params, metrics, artifacts, lineage	MLflow, Weights & Biases, Neptune
Model registry	Versioned models, stages, approvals	MLflow Model Registry, cloud registries
Serving	Low-latency or batch inference	TorchServe, TensorFlow Serving, Ray Serve, SageMaker
Monitoring	Drift, performance, data quality in prod	Evidently, WhyLabs, cloud APM + custom metrics
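
As a rough illustration of the "custom checks" option in the data validation row, the sketch below checks a batch of records against a hand-written schema. The column names, value ranges, null-rate threshold, and the `validate_batch` helper are all hypothetical; a real project would more likely express these rules in Great Expectations or Deequ.

```python
from typing import Any

# Hypothetical schema: column -> (expected type, min, max). Illustrative only.
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}
MAX_NULL_RATE = 0.1  # assumed threshold, tune per dataset


def validate_batch(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means the batch passed."""
    if not rows:
        return ["batch is empty"]
    problems: list[str] = []
    for column, (expected_type, low, high) in SCHEMA.items():
        values = [row.get(column) for row in rows]
        null_rate = sum(v is None for v in values) / len(rows)
        if null_rate > MAX_NULL_RATE:
            problems.append(f"{column}: null rate {null_rate:.0%} exceeds limit")
        for v in values:
            if v is None:
                continue
            if not isinstance(v, expected_type):
                problems.append(f"{column}: unexpected type {type(v).__name__}")
                break  # report the first offending value per column
            if not low <= v <= high:
                problems.append(f"{column}: value {v} outside [{low}, {high}]")
                break
    return problems
```

A pipeline step could call `validate_batch` before training or scoring and stop the run when the returned list is non-empty.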

Feature store deep dive

Concern	Offline store	Online store
Use case	Training, batch backfills, analytics	Real-time inference, low-latency features
Latency	Batch / high throughput	Milliseconds to low seconds
Consistency	Must align with point-in-time joins for training	Must match definitions used at training time

Point-in-time correctness: Each training row must use only the feature values that were already known at that row's prediction time. Joining in values recorded later leaks future information into training and produces train/serve skew, because those values will not exist when the model is served.
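
A minimal sketch of a point-in-time lookup, assuming each entity's feature history is a list of `(timestamp, value)` pairs sorted by timestamp. The function names and the data layout are hypothetical; feature stores implement this join internally, usually over much larger tables.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(history, as_of):
    """Return the latest value recorded at or before `as_of`, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # count of entries with ts <= as_of
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(label_events, feature_history):
    """Attach to each label the feature value known at prediction time.

    `label_events` is a list of (entity_id, prediction_time, label) tuples.
    `feature_history` maps entity_id to that entity's sorted history.
    """
    rows = []
    for entity_id, prediction_time, label in label_events:
        history = feature_history.get(entity_id, [])
        rows.append((entity_id, point_in_time_value(history, prediction_time), label))
    return rows
```

Whether a value stamped exactly at the prediction time counts as "known" is a design choice; this sketch includes it, and a stricter pipeline could use `bisect_left` to exclude it.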

Tools: Feast (open source, Kubernetes-friendly), Tecton (managed feature platform), and Databricks Feature Store (lakehouse integration). Evaluate each against your cloud, latency, and governance needs.

Experiment tracking comparison

Tool	Tracking	Visualization	Collaboration	Deployment integration
MLflow	Params, metrics, artifacts, models	UI + API; basic plots	Multi-user server; Databricks integration	Model registry, REST serving hooks
Weights & Biases	Rich experiment + system metrics	Strong dashboards, reports	Teams, reports sharing	Registry, launch integrations
Neptune	Experiments, metadata, images	Flexible UI, comparison	Org/workspaces	CI and orchestration hooks
Comet	Experiments, panels	Project views	Sharing, comments	Production monitoring options
ClearML	Experiments + orchestration	Web UI	Multi-user	Agent-based automation, pipelines

Model serving patterns

Pattern	Description	When to use
Batch inference	Score large volumes on a schedule or trigger	Reporting, nightly scores, ETL downstream
Online inference (REST/gRPC)	Request/response API behind load balancers	User-facing apps, real-time decisions
Embedded model	Library or on-device bundle	Mobile, edge, strict latency/isolation
Streaming inference	Consume event streams; emit scores per event	Fraud, IoT, real-time pipelines

Online serving (conceptual):
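
One way to picture the request/response path is a framework-independent handler: parse the request, validate the features, score, and return the model version alongside the score. This is a sketch under stated assumptions; the weights, feature names, `MODEL_VERSION`, and `handle_request` are hypothetical, and a real service would load the model from a registry and sit behind a web framework and load balancer.

```python
import json
import math

# Hypothetical model: a tiny logistic regression with hard-coded weights.
MODEL_VERSION = "example-1"
WEIGHTS = {"amount": 0.8, "account_age_days": -0.01}
BIAS = -1.0


def predict(features: dict[str, float]) -> float:
    """Return a probability-like score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    z = max(-50.0, min(50.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(body: str) -> tuple[int, str]:
    """Map a JSON request body to an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(body)
        features = {name: float(payload[name]) for name in WEIGHTS}
    except (ValueError, KeyError, TypeError):
        expected = ", ".join(sorted(WEIGHTS))
        return 400, json.dumps({"error": f"expected numeric fields: {expected}"})
    response = {"score": predict(features), "model_version": MODEL_VERSION}
    return 200, json.dumps(response)
```

Returning the model version with every score makes it possible to trace a prediction back to a registry entry when auditing or rolling back.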

Batch scoring (conceptual):
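
A batch job can be sketched as a loop that reads records in chunks, scores each chunk, and writes the results to a sink. The CSV layout, the `score_row` stand-in model, and the chunk size below are assumptions for illustration; production jobs typically read from a warehouse or object store and run under a scheduler.

```python
import csv
import io
from itertools import islice


def score_row(row: dict[str, str]) -> float:
    """Hypothetical model: a fixed linear combination of two columns."""
    return 0.5 * float(row["amount"]) + 0.1 * float(row["visits"])


def batch_score(source, sink, chunk_size: int = 1000) -> int:
    """Score every row of a CSV stream in chunks; return the number of rows scored."""
    reader = csv.DictReader(source)
    writer = csv.writer(sink)
    writer.writerow(["id", "score"])
    scored = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # bounds memory use per step
        if not chunk:
            break
        writer.writerows((row["id"], score_row(row)) for row in chunk)
        scored += len(chunk)
    return scored
```

Any text-mode file object works for `source` and `sink`, so the same function can be exercised against `io.StringIO` in a test and against real files in a scheduled run.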

CI/CD for ML (sequence)
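
A minimal sketch of one possible gated sequence, assuming the stages named in the level 2 row of the maturity table (tests on code and data, gated promotion, deployment with approval). The stage names, the 0.90 accuracy gate, and the `run_pipeline` and `example_stages` helpers are hypothetical; in practice each stage would be a CI job rather than a Python callable.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, log), where the log has one entry per executed stage.
    """
    log: list[str] = []
    for name, check in stages:
        if not check():
            log.append(f"{name}: FAILED")
            return False, log
        log.append(f"{name}: ok")
    return True, log


def example_stages(eval_accuracy: float, approved: bool) -> list[Stage]:
    """Illustrative ordering only; every check here is a stand-in."""
    return [
        ("unit tests", lambda: True),
        ("data checks", lambda: True),
        ("train candidate", lambda: True),
        ("evaluation gate", lambda: eval_accuracy >= 0.90),  # assumed floor
        ("deploy to staging", lambda: True),
        ("manual approval", lambda: approved),
        ("deploy to production", lambda: True),
    ]
```

Because the loop stops at the first failing stage, a candidate that misses the evaluation gate never reaches staging or production.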

Model monitoring

Signal	What to watch	Example metrics / methods
Data drift	Input distribution changes vs training	PSI, KS tests, embedding distance
Concept drift	Relationship between X and Y shifts	Rolling accuracy, calibration drift
Performance degradation	Business or model metrics slip	Latency, error rate, AUC decay over time
Feature importance shifts	Model relies on different drivers	SHAP stability over windows, tree gain drift
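
To make the PSI entry concrete, the sketch below computes the population stability index for one numeric feature as the sum over bins of `(actual - expected) * ln(actual / expected)`, where each term uses the share of values falling in that bin. The equal-width binning over the training range, the bin count, and the `eps` floor for empty bins are assumptions of this sketch; quantile bins are also common.

```python
import math


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, but alert thresholds should be calibrated per feature rather than taken as fixed.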

Infrastructure notes

GPU management: Quotas, autoscaling groups, or Kubernetes device plugins; isolate training from serving clusters when possible.
Distributed training: Horovod, DeepSpeed, or cloud-managed trainers — align checkpointing with registry conventions.
Cost: Spot/preemptible for training; right-size serving; cache hot features; prune stale experiments.

Testing ML systems

Test type	Focus	Examples
Data tests	Schema, ranges, null rates, referential integrity	Great Expectations suites in CI
Model tests	Accuracy floors, fairness constraints, robustness spot checks	Golden sets, adversarial smoke tests
Integration tests	Pipeline end-to-end with small fixture data	Airflow/Prefect dry runs
A/B tests in production	Causal impact on business KPIs	Controlled rollout with power analysis
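
A model test with an accuracy floor can be as small as the sketch below, which scores a candidate against a fixed golden set and fails when accuracy drops under the floor. The threshold classifier, the golden examples, and the helper names are hypothetical placeholders for a real model and a curated evaluation set.

```python
def accuracy(model, examples) -> float:
    """Fraction of (input, expected_label) pairs the model gets right."""
    correct = sum(model(x) == y for x, y in examples)
    return correct / len(examples)


def check_accuracy_floor(model, golden_set, floor: float) -> float:
    """Raise AssertionError if the model scores below `floor` on the golden set."""
    score = accuracy(model, golden_set)
    if score < floor:
        raise AssertionError(f"accuracy {score:.2f} is below the floor {floor:.2f}")
    return score


def threshold_model(x: float) -> int:
    """Hypothetical stand-in model: predict 1 when the input exceeds 0.5."""
    return int(x > 0.5)


# Hypothetical golden set: inputs with known-correct labels.
GOLDEN_SET = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.55, 0)]
```

Run in CI, a check like `check_accuracy_floor(threshold_model, GOLDEN_SET, floor=0.75)` blocks promotion of a candidate that regresses on cases the team has agreed must stay correct.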

Anti-patterns

Anti-pattern	Risk
“It works on my laptop”	Non-reproducible training and mystery dependencies
No model versioning	Cannot roll back or audit decisions
Training–serving skew	Different code paths or features offline vs online
No monitoring	Silent degradation until business impact
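
As a small counter to the first two anti-patterns, a training job can record a manifest that ties the model artifact to its parameters, data snapshot, and runtime. The field names, `build_manifest`, and `manifest_id` are hypothetical; a model registry such as the ones listed earlier stores equivalent metadata for you.

```python
import hashlib
import json
import platform


def build_manifest(artifact: bytes, params: dict, data_snapshot_id: str) -> dict:
    """Collect what is needed to identify and reproduce a trained model."""
    return {
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
        "params": params,
        "data_snapshot_id": data_snapshot_id,
        "python_version": platform.python_version(),
    }


def manifest_id(manifest: dict) -> str:
    """Derive a short, deterministic version id from the manifest contents."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]
```

Because the id is derived from the content, retraining with identical inputs yields the same id, while any change to the artifact, parameters, or data snapshot produces a new one.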

External references

ml-ops.org — community MLOps overview and maturity framing.
Google — MLOps whitepaper and related cloud documentation (continuous delivery for ML).
Chip Huyen — Designing Machine Learning Systems — production patterns and trade-offs.
MLflow documentation — tracking, registry, projects.

Keep project-specific model documentation in docs/product/ and experiment logs in docs/development/, not in this file.
