Phase 2 of 6 · ML & AI Roadmap

Classical Machine Learning

The algorithms that power real production systems. Credit scoring, fraud detection, recommendation engines, medical diagnosis — all built on the techniques you'll master here. Classical ML works, generalizes well, and is explainable.

01 / 05
Supervised Learning
Linear & logistic regression, decision trees, random forests, SVMs, KNN
02 / 05
Unsupervised Learning
K-Means, DBSCAN, hierarchical clustering, PCA, t-SNE, autoencoders
03 / 05
Model Evaluation
Cross-validation, accuracy, precision/recall, F1, ROC-AUC, bias-variance
04 / 05
Feature Engineering
Encoding, scaling, transforms, interactions, pipelines, target encoding
05 / 05
Ensemble Methods
Bagging, boosting, XGBoost, LightGBM, stacking, hyperparameter tuning
Before starting Phase 2, you should be comfortable with: Python, NumPy arrays and vectorized operations, Pandas DataFrames, basic probability distributions, and the concept of a loss function from calculus. If any of these feel shaky, revisit Phase 1 first.
Week 1 — Supervised Learning · Unsupervised Learning
Week 2 — Model Evaluation · Feature Engineering
Week 3 — Ensemble Methods
End-to-end Tabular ML: Take any Kaggle classification dataset. Build a complete pipeline: EDA → feature engineering → baseline model → XGBoost with hyperparameter tuning → model evaluation report. Target ROC-AUC > 0.90 on a held-out test set.
Supervised Learning 01 / 05

Learn from Labeled Examples to Make Predictions

Given input features X and target labels y, find a function f such that f(X) ≈ y on unseen data. This is the foundation of the vast majority of production ML systems.

Supervised learning powers the real world: your bank's fraud detector, Spotify's song ranker, Gmail's spam filter, insurance risk models, and medical diagnosis tools. Mastering these algorithms — and knowing when to apply each one — makes you immediately useful in any data role. Crucially, these methods are interpretable, auditable, and battle-tested in production.
Linear Regression
Predict continuous values. Fits a hyperplane y = Xw + b by minimizing mean squared error. Closed-form solution or gradient descent.
Logistic Regression
Binary classification. Applies sigmoid to linear output: P(y=1) = σ(Xw). Outputs calibrated probabilities.
Decision Trees
Recursive feature splits that minimize impurity (Gini / entropy). Fully interpretable but prone to overfitting alone.
Random Forests
Ensemble of trees on bootstrap samples + random feature subsets. Reduces variance via averaging. Robust and reliable.
Support Vector Machines
Finds the maximum-margin hyperplane. Kernel trick maps to high-dim spaces. Excellent for small, high-dim datasets.
K-Nearest Neighbors
Classify by majority vote of K closest training points. Non-parametric, lazy — no training, all work at inference time.
Ridge / Lasso
Regularized regression. Ridge (L2) shrinks weights. Lasso (L1) produces sparse solutions with feature selection built in.
Naive Bayes
Assumes feature independence given class. Fast, works well on text. Surprisingly good baseline despite the strong assumption.
Linear Regression — Normal Equation
w* = (XᵀX)⁻¹ Xᵀy
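The normal equation can be checked against scikit-learn in a few lines — a sketch on synthetic data (np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Append a bias column so the intercept is part of w
Xb = np.hstack([X, np.ones((len(X), 1))])

# w* = (XᵀX)⁻¹ Xᵀy — solved as a linear system rather than an explicit inverse
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

sk = LinearRegression().fit(X, y)
print("normal equation:", w[:3].round(3), "intercept:", round(float(w[3]), 3))
print("sklearn:        ", sk.coef_.round(3), "intercept:", round(float(sk.intercept_), 3))
```

Both routes recover the same weights; sklearn uses an SVD-based solver internally, which is better conditioned than forming XᵀX.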
Logistic Regression — Sigmoid + Cross-Entropy Loss
σ(z) = 1 / (1 + e⁻ᶻ)
L = -[y log(ŷ) + (1-y) log(1-ŷ)]
Decision Tree — Gini Impurity
Gini(t) = 1 - Σ pₖ² (sum over classes k)
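Gini is simple enough to compute by hand — a tiny sketch (the `gini` helper is illustrative, not a sklearn function):

```python
import numpy as np

def gini(labels):
    """Gini(t) = 1 - Σ pₖ² over the class proportions pₖ at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))   # 0.0 — pure node, nothing to gain by splitting
print(gini([0, 1, 0, 1]))   # 0.5 — maximally impure binary node
```

A decision tree picks the split that most reduces the weighted average of child-node impurities.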
Algorithm           | Best For                       | Interpretable | Needs Scaling | Handles Missing
Linear Regression   | Continuous output, linear data | Yes           | Yes           | No
Logistic Regression | Binary classification baseline | Yes           | Yes           | No
Decision Tree       | Non-linear, categorical data   | Yes           | No            | No
Random Forest       | Tabular data, general use      | Partial       | No            | No
SVM                 | High-dim, small datasets       | No            | Yes           | No
KNN                 | Low-dimensional, no training   | Yes           | Yes           | No
supervised_learning.py
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing, load_breast_cancer
import numpy as np

# ── 1. LINEAR REGRESSION ──────────────────────────────────────────────
X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, lr.predict(X_te)))
print(f"Linear Regression RMSE: {rmse:.3f}")
print(f"Coefficients: {lr.coef_.round(3)}")

# Ridge (L2 regularization) — penalises large weights
ridge = Ridge(alpha=1.0)
ridge.fit(X_tr, y_tr)

# Lasso (L1 regularization) — drives small weights to exactly 0
lasso = Lasso(alpha=0.01)
lasso.fit(X_tr, y_tr)
print(f"Lasso zeros: {(lasso.coef_ == 0).sum()} / {len(lasso.coef_)}")

# ── 2. LOGISTIC REGRESSION ───────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=42)

# Scale first — logistic regression benefits from scaling
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

logr = LogisticRegression(C=1.0, max_iter=1000)
logr.fit(X_tr_s, y_tr)
print(f"Logistic Regression Accuracy: {accuracy_score(y_te, logr.predict(X_te_s)):.3f}")

# Get predicted probabilities (important for calibration)
probs = logr.predict_proba(X_te_s)[:, 1]   # P(class=1)

# ── 3. DECISION TREE ─────────────────────────────────────────────────
tree = DecisionTreeClassifier(
    max_depth=5,          # prevent overfitting
    min_samples_leaf=10,  # at least 10 samples per leaf
    criterion='gini',     # or 'entropy'
    random_state=42
)
tree.fit(X_tr, y_tr)
print(f"Decision Tree Accuracy: {accuracy_score(y_te, tree.predict(X_te)):.3f}")

# Print human-readable tree rules
feature_names = load_breast_cancer().feature_names
rules = export_text(tree, feature_names=list(feature_names), max_depth=3)
print(rules)

# ── 4. RANDOM FOREST ─────────────────────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=8,
    min_samples_leaf=5,
    max_features='sqrt',   # random feature subset per split
    n_jobs=-1,             # use all CPU cores
    random_state=42
)
rf.fit(X_tr, y_tr)
print(f"Random Forest Accuracy: {accuracy_score(y_te, rf.predict(X_te)):.3f}")

# Feature importances
importances = rf.feature_importances_
top5 = np.argsort(importances)[:-6:-1]
for i in top5:
    print(f"  {feature_names[i]:30s} {importances[i]:.4f}")

# ── 5. SVM ───────────────────────────────────────────────────────────
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_tr_s, y_tr)    # SVMs require scaled features!
print(f"SVM Accuracy: {accuracy_score(y_te, svm.predict(X_te_s)):.3f}")

# ── 6. KNN ───────────────────────────────────────────────────────────
knn = KNeighborsClassifier(n_neighbors=7, metric='euclidean')
knn.fit(X_tr_s, y_tr)    # KNN also requires scaled features!
print(f"KNN Accuracy: {accuracy_score(y_te, knn.predict(X_te_s)):.3f}")

# ── 7. COMPARE ALL IN ONE SHOT ───────────────────────────────────────
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM':                 SVC(kernel='rbf', gamma='scale'),
    'KNN':                 KNeighborsClassifier(n_neighbors=7),
}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te_s))
    print(f"  {name:25s}: {acc:.3f}")
01
Always start with logistic/linear regression as your baseline. If a complex model doesn't beat it by a meaningful margin, the simple model wins every time — it's faster, interpretable, and easier to maintain.
02
Random forests need no feature scaling. Tree-based models split on feature values, not distances. Scale only for linear models, SVMs, and KNN.
03
SVMs shine with small, high-dimensional data — text classification, genomics, financial signals. They struggle with millions of samples; switch to a linear kernel (or LinearSVC) at that scale.
04
Regularization parameter C in SVM/Logistic: C is inverse regularization strength. Higher C = less regularization, more overfit. Lower C = more regularization, more underfit.
05
Stratify your train/test split for classification: stratify=y ensures class ratios are preserved. Critical for imbalanced datasets.
06
KNN computational cost scales with data size. Nearest-neighbor search is O(N·D) per query. Use approximate methods (FAISS, Annoy) for large datasets.
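Tip 04 is easy to verify empirically — a minimal sketch sweeping C in LogisticRegression on the same breast-cancer data used above (the C values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

train_acc = {}
for C in [0.001, 1.0, 1000.0]:
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_tr_s, y_tr)
    train_acc[C] = clf.score(X_tr_s, y_tr)
    # higher C → weaker regularization → tighter fit on the training set
    print(f"C={C:>7}: train={train_acc[C]:.3f}  test={clf.score(X_te_s, y_te):.3f}")
```

Watch the gap between train and test accuracy grow as C increases — that gap is the overfitting tip 04 warns about.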
scikit-learn docs · ISLR (free PDF) · Hands-On ML (Géron) · StatQuest (YouTube) · fast.ai Tabular · Kaggle Learn
Unsupervised Learning 02 / 05

Find Hidden Structure in Unlabeled Data

No labels required. Clustering discovers natural groupings, dimensionality reduction reveals structure, and anomaly detection finds the unusual. Most data in the world is unlabeled — this is how you use it.

Customer segmentation, topic modeling, anomaly detection in logs, compressing features before supervised learning, visualizing high-dimensional embeddings — all unsupervised. PCA alone can turn a 1000-feature dataset into 20 features that capture 95% of variance, making your subsequent supervised model dramatically faster with better generalization.
K-Means
Assign each point to nearest centroid, update centroids, repeat until convergence. Simple, scalable, sensitive to initialization and outliers.
DBSCAN
Density-Based Spatial Clustering. Finds arbitrary-shaped clusters, labels outliers as noise (-1). Requires no K, but needs eps and min_samples tuning.
Hierarchical Clustering
Builds a dendrogram tree. Agglomerative (bottom-up): each point starts as its own cluster, merge until one remains. Cut the tree to get K clusters.
PCA
Principal Component Analysis. Finds orthogonal axes of maximum variance. Projects data to lower dimensions while preserving the most information.
t-SNE
t-Distributed Stochastic Neighbor Embedding. Non-linear 2D/3D visualization of high-dim data. Preserves local structure. Qualitative, not quantitative.
UMAP
Uniform Manifold Approximation and Projection. Faster than t-SNE, better preserves global structure. The modern default for embedding visualization.
Gaussian Mixture Models
Soft probabilistic clustering. Each point has a probability of belonging to each cluster. More flexible than K-Means (elliptical clusters).
Isolation Forest
Anomaly detection via random trees. Anomalies are isolated with fewer splits. Returns anomaly score per sample. Very practical for production.
Elbow Method: Plot inertia (sum of squared distances to nearest centroid) vs K. The "elbow" where improvement slows is a good K choice.

Silhouette Score: Ranges from -1 to 1. Higher = better defined clusters. Use sklearn.metrics.silhouette_score(X, labels). Choose K that maximizes it.
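Hierarchical clustering is described above but doesn't appear in the code file below, so here is a minimal standalone sketch (AgglomerativeClustering with Ward linkage on synthetic blobs; dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Ward linkage merges the pair of clusters that least increases total variance;
# n_clusters=3 is where we "cut" the dendrogram
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print(f"silhouette: {silhouette_score(X, labels):.3f}")
```

To inspect the full dendrogram rather than a single cut, scipy.cluster.hierarchy's `linkage` and `dendrogram` are the usual tools.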
PCA via SVD
X_centered = X - mean(X)
U, Σ, Vᵀ = SVD(X_centered)
X_reduced = X_centered @ V[:, :k]     ← k principal components
explained = Σ² / sum(Σ²)              ← variance explained per PC
unsupervised_learning.py
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_digits, make_blobs

# ── 1. K-MEANS ────────────────────────────────────────────────────────
X, y_true = make_blobs(n_samples=500, n_features=2,
                       centers=4, random_state=42)

# ALWAYS scale before K-Means
scaler = StandardScaler()
X_s = scaler.fit_transform(X)

# Elbow method — find optimal K
inertias, sil_scores = [], []
K_range = range(2, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_s)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_s, km.labels_))

best_k = K_range.start + np.argmax(sil_scores)
print(f"Best K by silhouette: {best_k}")

# Fit final model
km = KMeans(n_clusters=best_k, n_init=20, random_state=42)
labels = km.fit_predict(X_s)
print(f"Cluster sizes: {np.bincount(labels)}")
print(f"Inertia: {km.inertia_:.1f}")

# ── 2. DBSCAN — density-based, no K needed ────────────────────────────
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X_s)

n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = (db_labels == -1).sum()
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")

# ── 3. GAUSSIAN MIXTURE MODEL — soft probabilistic clusters ──────────
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
gmm.fit(X_s)
gmm_labels = gmm.predict(X_s)
probs = gmm.predict_proba(X_s)       # shape (n, k) — soft assignments
print(f"GMM BIC: {gmm.bic(X_s):.1f}")   # lower = better fit

# ── 4. PCA — dimensionality reduction ─────────────────────────────────
X_digits, y_digits = load_digits(return_X_y=True)   # 1797 × 64

pca = PCA(n_components=0.95)   # keep 95% of variance
X_pca = pca.fit_transform(X_digits)
print(f"PCA: {X_digits.shape[1]} → {X_pca.shape[1]} dims")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")

# Manual PCA via SVD — understand what sklearn does under the hood
X_c = X_digits - X_digits.mean(axis=0)
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
X_manual_pca = X_c @ Vt[:20].T   # top 20 components

# ── 5. t-SNE — 2D visualization ───────────────────────────────────────
# Best practice: PCA first to ~50 dims, then t-SNE (much faster)
X_50 = PCA(n_components=50).fit_transform(X_digits)
# note: n_iter is named max_iter in newer scikit-learn versions
X_2d = TSNE(n_components=2, perplexity=30,
             random_state=42, n_iter=1000).fit_transform(X_50)
print(f"t-SNE shape: {X_2d.shape}")   # (1797, 2)

# ── 6. ISOLATION FOREST — anomaly detection ───────────────────────────
# contamination = expected fraction of outliers in your data
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
iso.fit(X_s)
# -1 = anomaly, +1 = normal
anomaly_labels = iso.predict(X_s)
anomaly_scores = iso.decision_function(X_s)   # lower = more anomalous
print(f"Anomalies detected: {(anomaly_labels == -1).sum()}")
01
Always scale before clustering. K-Means uses Euclidean distance — a feature with range [0, 1000] will dominate one with range [0, 1].
02
PCA before t-SNE: reduce to 50 dims with PCA first, then apply t-SNE. Orders of magnitude faster with essentially the same visual result.
03
DBSCAN automatically discovers cluster count and handles noise. It struggles in high dimensions and with clusters of varying density. Use HDBSCAN for better results.
04
t-SNE perplexity balances local vs global structure. Try perplexity ∈ {5, 30, 50, 100}. Results vary significantly — always show multiple.
05
GMM BIC/AIC to select number of components. Fit GMMs for K=1..15, plot BIC, choose the elbow. More principled than K-Means elbow method.
06
PCA as preprocessing before supervised learning often improves performance — removes correlated noise features and reduces overfitting risk in linear models.
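Tip 06 in action — a minimal sketch of PCA as a preprocessing step inside a supervised pipeline (digits dataset and the 95%-variance cutoff chosen for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca',   PCA(n_components=0.95)),        # keep 95% of the variance
    ('clf',   LogisticRegression(max_iter=2000)),
])
scores = cross_val_score(pipe, X, y, cv=5)    # PCA is refit inside each fold — no leakage
print(f"CV accuracy with PCA front-end: {scores.mean():.3f}")
```

Because PCA sits inside the Pipeline, cross-validation fits it on each training fold only — the same leakage discipline the evaluation section covers.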
scikit-learn clustering guide · StatQuest K-Means · PCA explained visually · UMAP docs · HDBSCAN library
Model Evaluation 03 / 05

Measure Performance Honestly and Correctly

A model that scores 99% accuracy on imbalanced data might never predict the minority class. Choosing the right metric and validation strategy separates real ML engineers from beginners.

This is the most dangerous section to get wrong. Wrong evaluation leads to overconfident models deployed in production. Understanding bias-variance tradeoff, data leakage, and the right metric for your problem is what makes evaluation a skill rather than a formality.
Train / Val / Test Split
Never evaluate on training data. Hold out test until the very end. Use val for hyperparameter tuning. Contamination = silent failure.
K-Fold Cross-Validation
Train on K-1 folds, validate on 1, repeat K times. Average metrics across folds. More reliable estimate of true generalization performance.
Accuracy
Fraction correct. Simple. Completely misleading for imbalanced classes. A model predicting majority class always gets 95% on a 95/5 split.
Precision & Recall
Precision = TP/(TP+FP). Recall = TP/(TP+FN). Classic tradeoff: increase threshold → better precision, worse recall and vice versa.
F1 Score
Harmonic mean of precision and recall. Use when both matter equally and classes are imbalanced. F_β weights recall β times as heavily as precision.
ROC-AUC
Area under the ROC curve. Measures ranking quality — how well does the model separate classes? Threshold-independent. 0.5 = random.
RMSE / MAE
Regression metrics. RMSE penalises large errors heavily (squared). MAE is more robust to outliers. Always compare to a naive baseline.
Confusion Matrix
TP, FP, TN, FN table. The most informative single diagnostic — tells you exactly what kind of errors your model makes.
Bias = error from wrong assumptions (model too simple = underfitting). Variance = error from sensitivity to training-set fluctuations (model too complex = overfitting). Total error = Bias² + Variance + Irreducible noise. You want both low — achieved through the right model complexity, regularization, and enough data.
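The tradeoff is easy to see with polynomial regression on noisy data — a sketch with synthetic data (degrees 1, 4, and 15 chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

errs = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errs[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                    mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree={degree:2d}  train MSE={errs[degree][0]:.3f}  "
          f"test MSE={errs[degree][1]:.3f}")
# degree 1: both errors high (bias); degree 15: train error keeps falling
# while test error stops improving (variance)
```

Degree is the complexity knob here; regularization (Ridge/Lasso) plays the same role for a fixed feature set.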
Key Metrics — Formulas
Precision = TP / (TP + FP)              ← "Of predicted positives, how many were right?"
Recall    = TP / (TP + FN)              ← "Of actual positives, how many did we catch?"
F1        = 2 · P · R / (P + R)         ← harmonic mean
AUC-ROC   = P(score(pos) > score(neg))  ← ranking quality
RMSE      = sqrt(mean((y - ŷ)²))
Problem Type              | Good Metric     | Avoid    | When to Use Precision vs Recall
Balanced classification   | Accuracy, F1    | —        | Accuracy is fine
Imbalanced classification | ROC-AUC, PR-AUC | Accuracy | PR-AUC when positives are rare
Fraud detection           | Recall, PR-AUC  | Accuracy | Prioritize recall (catch all fraud)
Spam filter               | Precision, F1   | Accuracy | Prioritize precision (don't block real mail)
Regression                | RMSE or MAE     | R² alone | MAE if outliers exist in y
Ranking                   | NDCG, MAP       | Accuracy | Order matters
model_evaluation.py
import numpy as np
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, KFold,
    learning_curve, validation_curve
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    classification_report, confusion_matrix,
    mean_squared_error, mean_absolute_error, r2_score,
    ConfusionMatrixDisplay, RocCurveDisplay
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# ── 1. STRATIFIED K-FOLD CROSS-VALIDATION ─────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score multiple metrics in one CV run
from sklearn.model_selection import cross_validate
results = cross_validate(model, X, y, cv=cv,
    scoring=['accuracy', 'roc_auc', 'f1', 'precision', 'recall'],
    return_train_score=True)

for metric in ['accuracy', 'roc_auc', 'f1']:
    val = results[f'test_{metric}']
    train = results[f'train_{metric}']
    print(f"{metric:12s}  val: {val.mean():.3f} ± {val.std():.3f}  "
          f"train: {train.mean():.3f}")
# If train >> val: overfitting. If both low: underfitting.

# ── 2. THRESHOLD-BASED METRICS ────────────────────────────────────────
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]

print(classification_report(y_te, y_pred, target_names=['benign', 'malignant']))
print(f"ROC-AUC: {roc_auc_score(y_te, y_prob):.4f}")
print(f"PR-AUC:  {average_precision_score(y_te, y_prob):.4f}")

# Confusion matrix
cm = confusion_matrix(y_te, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN, FP],
#  [FN, TP]]

# ── 3. CUSTOM THRESHOLD — precision/recall tradeoff ───────────────────
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_te, y_prob)

# Find threshold for ≥0.90 precision
idx = np.argmax(precisions >= 0.90)
optimal_threshold = thresholds[idx]
print(f"Threshold for 90% precision: {optimal_threshold:.3f}")
y_pred_custom = (y_prob >= optimal_threshold).astype(int)
print(f"Custom threshold — precision: {precision_score(y_te, y_pred_custom):.3f}, "
      f"recall: {recall_score(y_te, y_pred_custom):.3f}")

# ── 4. LEARNING CURVES — diagnose bias vs variance ────────────────────
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='roc_auc', n_jobs=-1)

# Plot: if train high, val low → overfit (need regularization/more data)
#        if both low           → underfit (need more complex model)
#        if both high and close → ideal

# ── 5. REGRESSION METRICS ─────────────────────────────────────────────
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
X_r, y_r = fetch_california_housing(return_X_y=True)
X_r_tr, X_r_te, y_r_tr, y_r_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_r_tr, y_r_tr)
y_r_pred = reg.predict(X_r_te)
rmse = np.sqrt(mean_squared_error(y_r_te, y_r_pred))
mae  = mean_absolute_error(y_r_te, y_r_pred)
r2   = r2_score(y_r_te, y_r_pred)
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}  R²: {r2:.3f}")

# ── 6. DATA LEAKAGE — the silent killer ───────────────────────────────
# WRONG: fit scaler on all data, then split
from sklearn.preprocessing import StandardScaler
scaler_bad = StandardScaler()
X_bad = scaler_bad.fit_transform(X)   # ← leaks test stats into training!
X_bad_tr, X_bad_te = X_bad[:400], X_bad[400:]

# RIGHT: fit scaler only on train, transform test
X_raw_tr, X_raw_te = X[:400], X[400:]
scaler_good = StandardScaler()
X_good_tr = scaler_good.fit_transform(X_raw_tr)
X_good_te = scaler_good.transform(X_raw_te)   # ← correct!

# Or best of all: use a Pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  RandomForestClassifier(random_state=42))
])
cv_scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f"Pipeline CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
01
Stratified K-Fold for classification always. Plain K-Fold may put all of a rare class in one fold, making evaluation meaningless.
02
Look at train vs val score together to diagnose problems. Large gap → overfitting. Both low → underfitting. Check learning curves.
03
Data leakage is the #1 hidden enemy. Any preprocessing that uses information from the test set — scaling, imputation, feature selection — is leakage. Use Pipelines to prevent it.
04
For time-series data, use TimeSeriesSplit — you cannot shuffle time-dependent data and do random K-Fold. The future cannot predict the past.
05
Never tune hyperparameters on the test set. Use nested cross-validation: outer CV for evaluation, inner CV for hyperparameter search.
06
Always set a dummy baseline before modeling. sklearn's DummyClassifier(strategy="most_frequent") tells you what random performance looks like.
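Tips 04 and 06 in a few lines — a DummyClassifier baseline and a look at how TimeSeriesSplit keeps the future out of the training folds (the toy arrays are illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import TimeSeriesSplit

# Majority-class baseline: on a 90/10 imbalance it scores 0.90 "accuracy"
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
print(f"Dummy accuracy: {dummy.score(X, y):.2f}")   # beat this before celebrating

# Time-ordered CV: every fold trains on the past, validates on the future
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(np.arange(24)):
    print(f"train ≤ t={train_idx.max():2d}  →  validate t={val_idx.min()}–{val_idx.max()}")
```

Any model that can't clearly beat the dummy score is learning nothing useful, no matter how impressive its raw accuracy looks.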
scikit-learn metrics · StatQuest ROC/AUC · The bias-variance tradeoff · ISLR Ch. 5 · Kaggle evaluation guide
Feature Engineering 04 / 05

Create Features That Make Models Smarter

Better features beat better algorithms. A weak model with great features consistently outperforms a complex model on raw data. Feature engineering is where domain expertise meets machine learning.

Kaggle grand masters don't win with exotic algorithms — they win with creative feature engineering. A log transform, a date feature, or a well-constructed interaction term can be worth 5 points of AUC that no amount of hyperparameter tuning will ever recover. This skill separates domain experts from pure ML people.
One-Hot Encoding
Convert nominal categories to binary columns. Good for low-cardinality (< ~20 unique values). Use drop_first to avoid multicollinearity.
Ordinal Encoding
Map ordered categories to integers: cold=0, warm=1, hot=2. Preserves order information unlike one-hot.
Target Encoding
Replace category with mean target value. Powerful for high-cardinality (cities, zip codes). Must be done inside CV to prevent leakage.
StandardScaler
Zero mean, unit variance. Required for linear models, SVMs, KNN. Not needed for tree-based models.
Log Transform
np.log1p(x) compresses right-skewed distributions (price, income, counts). Makes linear model assumptions more valid.
Binning
Discretize continuous into buckets: age → young/mid/senior. Captures non-linear effects for linear models. pd.cut() or pd.qcut().
Feature Crosses
Multiply features: age × income. Captures interaction effects that linear models can't express on their own.
Date Features
Extract year, month, dayofweek, hour, is_weekend, days_since_event from timestamps. Datetime columns are treasure troves.
⚠ Target Encoding Leakage Warning: You must compute target encoding statistics only on training data and apply to validation/test. If you compute it on the entire dataset first, you've leaked future information. Use sklearn.preprocessing.TargetEncoder inside a Pipeline, or compute manually inside each CV fold.
feature_engineering.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, TargetEncoder,
    PolynomialFeatures
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split

# ── SAMPLE DATASET ─────────────────────────────────────────────────────
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age':         np.random.randint(18, 75, n),
    'income':      np.random.exponential(50000, n),
    'city':        np.random.choice(['NY','LA','SF','CHI','HOU'], n),
    'education':   np.random.choice(['HS','BA','MS','PhD'], n),
    'join_date':   pd.date_range('2020-01-01', periods=n, freq='D'),
    'score':       np.random.randint(400, 850, n),
    'target':      np.random.randint(0, 2, n)
})
# Inject missing values
df.loc[np.random.choice(n, 50, replace=False), 'age'] = np.nan

# ── 1. BASIC TRANSFORMS ────────────────────────────────────────────────
# Log transform — right-skewed income
df['log_income'] = np.log1p(df['income'])

# Binning age
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 100],
                          labels=['young','mid','senior','elder'])

# Interaction terms
df['income_per_age'] = df['income'] / (df['age'] + 1)
df['score_x_income'] = df['score'] * df['log_income']

# ── 2. DATE FEATURES ───────────────────────────────────────────────────
df['year']          = df['join_date'].dt.year
df['month']         = df['join_date'].dt.month
df['dayofweek']     = df['join_date'].dt.dayofweek
df['is_weekend']    = (df['dayofweek'] >= 5).astype(int)
df['quarter']       = df['join_date'].dt.quarter
df['days_since_join'] = (pd.Timestamp('2024-01-01') - df['join_date']).dt.days

# ── 3. ORDINAL ENCODING ────────────────────────────────────────────────
edu_order = [['HS', 'BA', 'MS', 'PhD']]
oe = OrdinalEncoder(categories=edu_order)
df['edu_encoded'] = oe.fit_transform(df[['education']])

# ── 4. SKLEARN PIPELINE — the right way ───────────────────────────────
X = df.drop(columns=['target', 'join_date', 'city', 'education',
                      'age_group'])
y = df['target']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_cols = ['age', 'income', 'log_income', 'score', 'days_since_join',
                'income_per_age', 'score_x_income']   # income_per_age inherits age's NaNs — impute it too

numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
], remainder='passthrough')

from sklearn.ensemble import RandomForestClassifier
full_pipe = Pipeline([
    ('prep',  preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
full_pipe.fit(X_tr, y_tr)

# ── 5. FULL ColumnTransformer with multiple column types ───────────────
df2 = df.copy()
num_features  = ['age', 'income', 'score']
cat_nominal   = ['city']
cat_ordinal   = ['education']

from sklearn.preprocessing import FunctionTransformer

preprocessor2 = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('log',    FunctionTransformer(np.log1p)),
        ('scale',  StandardScaler()),
    ]), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_nominal),
    ('ord', OrdinalEncoder(categories=edu_order), cat_ordinal),
])

# ── 6. FEATURE SELECTION ───────────────────────────────────────────────
X_prep = preprocessor.fit_transform(X_tr, y_tr)

# Select top-K features by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_prep, y_tr)
print(f"Selected {X_selected.shape[1]} features from {X_prep.shape[1]}")

# Mutual information — works for non-linear relationships too
mi_scores = mutual_info_classif(X_prep, y_tr, random_state=42)
print(f"Top MI scores: {sorted(mi_scores, reverse=True)[:5]}")

# ── 7. POLYNOMIAL FEATURES ─────────────────────────────────────────────
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X_prep[:, :3])   # first 3 features
print(f"Polynomial features: {X_prep.shape[1]} → {X_poly.shape[1]}")
01
Always use Pipelines when doing preprocessing. They prevent leakage, make cross-validation correct, and make production deployment trivial — one .predict() call handles everything.
02
Log1p before scaling for skewed features. The order matters: transform, then scale. np.log1p handles zeros gracefully unlike np.log.
03
High-cardinality categoricals (1000+ unique values) — use target encoding or embedding. One-hot produces too many sparse columns that hurt most models.
04
Date features are underrated. Extracting dayofweek, hour, is_holiday, and days_since_last_event from a timestamp can be the most predictive features in a dataset.
05
RobustScaler (scales by IQR) is better than StandardScaler when your data has outliers — outliers won't distort the scale applied to other points.
06
Feature selection after engineering: more features ≠ better model. Use mutual_info_classif or feature importances from a Random Forest to prune noise.
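Tip 05 is easy to demonstrate — a sketch with 99 ordinary points and one extreme outlier (values chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# 99 ordinary points and one extreme outlier
x = np.append(np.random.default_rng(0).normal(50, 5, 99), 10_000).reshape(-1, 1)

std = StandardScaler().fit_transform(x)   # mean/std — both dragged by the outlier
rob = RobustScaler().fit_transform(x)     # median/IQR — barely notices it

print(f"inlier spread after StandardScaler: {std[:99].std():.4f}")  # squashed near zero
print(f"inlier spread after RobustScaler:   {rob[:99].std():.4f}")  # preserved
```

One outlier inflates the standard deviation so much that the other 99 points collapse into a sliver; the IQR-based scale is essentially unaffected.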
Kaggle Feature Engineering course · sklearn ColumnTransformer · Feature Engineering for ML (book) · Target encoding guide
Ensemble Methods & Boosting 05 / 05

Combine Weak Learners into Powerful Models

XGBoost and LightGBM win most tabular ML competitions. Gradient boosting is the dominant paradigm for structured data in industry. Understanding how it works is the key to tuning it well.

Browse any top Kaggle solution for a tabular competition — you'll find XGBoost or LightGBM. These algorithms consistently outperform all others on structured data. They handle missing values, mixed feature types, non-linear interactions, and irrelevant features with minimal preprocessing. In industry, they are the go-to before any neural network is considered.
Bagging
Train models on random bootstrap samples, average predictions. Reduces variance. Parallel training. Random Forest = bagging of decision trees.
Boosting
Train models sequentially, each correcting previous errors. Reduces bias. Prone to overfitting — use early stopping and regularization.
Stacking
Use predictions of base models as features for a meta-learner (often logistic regression). Captures the best of each base model's specialization.
XGBoost
Regularized gradient boosting with second-order Taylor expansion. Handles missing values natively. The industry standard since 2016.
LightGBM
Histogram-based gradient boosting. Often an order of magnitude faster than XGBoost on large datasets. Grows trees leaf-wise vs level-wise. Better for millions of rows.
CatBoost
Handles categorical features natively via ordered target encoding. Less hyperparameter tuning needed. Strong out-of-the-box on heterogeneous data.
Gradient Boosting
Each new tree fits the negative gradient (residuals) of the loss function. The ensemble is an additive model of weak learners.
Early Stopping
Stop training when validation metric stops improving. Critical for boosting — prevents overfitting. Set eval_set and early_stopping_rounds.
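The variance-reduction claim for bagging is easy to check empirically. A minimal sketch comparing a single unpruned decision tree against a bagged ensemble of 100 such trees (the synthetic dataset and its parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y mislabels 5% of samples, so a single
# unpruned tree overfits the noise
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=100, random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=5)
bag_scores = cross_val_score(bag, X, y, cv=5)
print(f"single tree accuracy: {tree_scores.mean():.3f}")
print(f"bagged 100 accuracy:  {bag_scores.mean():.3f}")
```

Each bootstrap tree still overfits, but their errors are partly uncorrelated, so averaging them cancels variance — the same mechanism Random Forest exploits.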
Gradient Boosting — Additive Model
F₀(x) = initial prediction (e.g. mean of y)
For t = 1 to T:
    rᵢ = −∂L/∂F(xᵢ)             ← negative gradient = pseudo-residuals
    hₜ = tree fit to rᵢ         ← weak learner fits residuals
    Fₜ(x) = Fₜ₋₁(x) + η·hₜ(x)   ← η = learning rate
Final: F_T(x) = F₀(x) + η · Σₜ hₜ(x)
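The recipe above can be sketched in a few lines for squared-error loss, where the negative gradient is simply the residual y − F(x). This is an illustrative from-scratch version, not a production implementation; the data and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, T = 0.1, 100
F = np.full_like(y, y.mean())        # F0: initial prediction = mean of y
trees = []
for t in range(T):
    r = y - F                        # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)  # weak learner fits residuals
    F += eta * h.predict(X)          # F_t = F_{t-1} + eta * h_t
    trees.append(h)

mse = np.mean((y - F) ** 2)
print(f"train MSE after {T} rounds: {mse:.4f}")
```

Each depth-2 tree is a very weak learner, yet the additive ensemble drives the training error down toward the noise floor — which is also why early stopping on a validation set is needed in practice.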
| Parameter | Effect | Typical Range | Direction if Overfitting |
|---|---|---|---|
| learning_rate (η) | Step size. Smaller = more trees needed but generalizes better. | 0.01 – 0.3 | ↓ decrease |
| n_estimators | Number of trees. Use with early stopping. | 100 – 5000 | use early stopping |
| max_depth | Tree depth. Deeper = more complex interactions. | 3 – 8 | ↓ decrease |
| subsample | Fraction of training rows per tree. Adds stochasticity. | 0.6 – 1.0 | ↓ decrease |
| colsample_bytree | Fraction of features per tree. Like RF's max_features. | 0.6 – 1.0 | ↓ decrease |
| reg_lambda (L2) | L2 regularization on leaf weights. Prevents large weights. | 0.1 – 10 | ↑ increase |
| reg_alpha (L1) | L1 regularization. Sparse leaf weights. | 0 – 1 | ↑ increase |
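One way to apply the "Direction if Overfitting" column: measure the train/validation gap, then move the relevant knobs in the indicated direction. A minimal sketch using sklearn's GradientBoostingClassifier, which exposes the same core parameters (the dataset and parameter values are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data so that memorization visibly hurts validation AUC
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def gap(model):
    """Fit, then return (train AUC, val AUC, gap). A large gap = overfitting."""
    model.fit(X_tr, y_tr)
    tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    va = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return tr, va, tr - va

# Deep trees + large learning rate: memorizes the training set
deep = GradientBoostingClassifier(max_depth=8, learning_rate=0.3,
                                  n_estimators=200, random_state=0)
# Table applied: decrease max_depth and learning_rate, set subsample < 1
reg = GradientBoostingClassifier(max_depth=3, learning_rate=0.05,
                                 n_estimators=200, subsample=0.8,
                                 random_state=0)
for name, m in [("deep", deep), ("regularized", reg)]:
    tr, va, g = gap(m)
    print(f"{name:>11}: train AUC {tr:.3f}  val AUC {va:.3f}  gap {g:.3f}")
```

The deep model reaches near-perfect training AUC but a larger train/val gap; the regularized model trades a little training fit for better generalization.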
ensemble_methods.py
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    VotingClassifier, StackingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_breast_cancer
import optuna

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.2,
                                             stratify=y_tr, random_state=42)

# ── 1. XGBOOST — with early stopping ──────────────────────────────────
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.5,
    reg_alpha=0.1,
    min_child_weight=3,
    eval_metric='auc',
    early_stopping_rounds=50,
    random_state=42,
    verbosity=0          # note: use_label_encoder was removed in XGBoost 2.x
)
xgb_model.fit(X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=False)

xgb_auc = roc_auc_score(y_te, xgb_model.predict_proba(X_te)[:, 1])
print(f"XGBoost AUC: {xgb_auc:.4f}  Best round: {xgb_model.best_iteration}")

# Feature importance
feat_imp = xgb_model.feature_importances_
top_idx = np.argsort(feat_imp)[:-6:-1]
print("Top-5 features:", load_breast_cancer().feature_names[top_idx])

# ── 2. LIGHTGBM — faster, better for large data ───────────────────────
lgb_model = lgb.LGBMClassifier(
    n_estimators=2000,
    num_leaves=31,          # LightGBM uses num_leaves, not max_depth
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_samples=20,
    random_state=42,
    verbose=-1
)
lgb_model.fit(X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50, verbose=False),
               lgb.log_evaluation(-1)])

lgb_auc = roc_auc_score(y_te, lgb_model.predict_proba(X_te)[:, 1])
print(f"LightGBM AUC: {lgb_auc:.4f}")

# ── 3. CATBOOST — native categorical support ──────────────────────────
cb_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    eval_metric='AUC',
    early_stopping_rounds=50,
    random_seed=42,
    verbose=0
)
cb_model.fit(X_tr, y_tr, eval_set=(X_val, y_val))
cb_auc = roc_auc_score(y_te, cb_model.predict_proba(X_te)[:, 1])
print(f"CatBoost AUC: {cb_auc:.4f}")

# ── 4. VOTING ENSEMBLE ────────────────────────────────────────────────
rf  = RandomForestClassifier(n_estimators=200, random_state=42)
xgb2 = xgb.XGBClassifier(n_estimators=200, random_state=42, verbosity=0)
lgb2 = lgb.LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)

voting = VotingClassifier(
    estimators=[('rf', rf), ('xgb', xgb2), ('lgb', lgb2)],
    voting='soft'   # average probabilities, not votes
)
voting.fit(X_tr, y_tr)
vote_auc = roc_auc_score(y_te, voting.predict_proba(X_te)[:, 1])
print(f"Voting Ensemble AUC: {vote_auc:.4f}")

# ── 5. STACKING ───────────────────────────────────────────────────────
estimators = [
    ('rf',  RandomForestClassifier(n_estimators=200, random_state=42)),
    ('xgb', xgb.XGBClassifier(n_estimators=200, random_state=42, verbosity=0)),
    ('lgb', lgb.LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)),
]
stacker = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=0.1),  # meta-learner
    cv=5,
    passthrough=True   # include original features alongside base predictions
)
stacker.fit(X_tr, y_tr)
stack_auc = roc_auc_score(y_te, stacker.predict_proba(X_te)[:, 1])
print(f"Stacking AUC: {stack_auc:.4f}")

# ── 6. OPTUNA HYPERPARAMETER OPTIMIZATION ─────────────────────────────
def objective(trial):
    params = {
        'n_estimators':     trial.suggest_int('n_estimators', 100, 1000),
        'max_depth':        trial.suggest_int('max_depth', 3, 8),
        'learning_rate':    trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample':        trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_lambda':       trial.suggest_float('reg_lambda', 0.1, 10.0, log=True),
        'reg_alpha':        trial.suggest_float('reg_alpha', 0.0, 1.0),
    }
    model = xgb.XGBClassifier(**params, random_state=42, verbosity=0)
    score = cross_val_score(model, X_tr, y_tr, cv=5, scoring='roc_auc')
    return score.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Train final model with best params
best = xgb.XGBClassifier(**study.best_params, random_state=42, verbosity=0)
best.fit(X_tr, y_tr)
final_auc = roc_auc_score(y_te, best.predict_proba(X_te)[:, 1])
print(f"Final XGBoost AUC on test: {final_auc:.4f}")
01
Always use early stopping with a validation set. Set n_estimators high (2000+) and let early stopping find the right number. Never guess n_estimators.
02
LightGBM is often ~10x faster than XGBoost on large datasets (1M+ rows). Use XGBoost for smaller data where you want maximum accuracy and interpretability.
03
Learning rate and n_estimators are coupled. Smaller learning rate = more trees needed. A good pattern: lr=0.05 with 500-2000 trees + early stopping.
04
Optuna > GridSearch > RandomSearch for hyperparameter optimization. Optuna's TPE sampler is Bayesian — it learns which regions of the space are promising.
05
Soft voting usually beats hard voting when the base classifiers produce well-calibrated probabilities. Averaging probabilities keeps more information than a majority vote.
06
Feature importance from XGBoost has three types: weight (frequency), gain (improvement in loss), cover (data coverage). Gain is usually most informative.
🎉 Phase 2 Complete! You now have the full Classical ML toolkit. The natural next step is Phase 3: Deep Learning — where you'll understand why neural networks are needed for tasks where these algorithms fall short (images, text, sequences).
XGBoost docs · LightGBM docs · CatBoost docs · Optuna tutorial · Kaggle winning solutions · StatQuest Gradient Boosting