Phase 2 of 6 · ML & AI Roadmap

Classical Machine Learning

The algorithms that power real production systems. Credit scoring, fraud detection, recommendation engines, medical diagnosis — all built on the techniques you'll master here. Classical ML works, generalizes well, and is explainable.

01 / 05
Supervised Learning
Linear & logistic regression, decision trees, random forests, SVMs, KNN
02 / 05
Unsupervised Learning
K-Means, DBSCAN, hierarchical clustering, PCA, t-SNE, autoencoders
03 / 05
Model Evaluation
Cross-validation, accuracy, precision/recall, F1, ROC-AUC, bias-variance
04 / 05
Feature Engineering
Encoding, scaling, transforms, interactions, pipelines, target encoding
05 / 05
Ensemble Methods
Bagging, boosting, XGBoost, LightGBM, stacking, hyperparameter tuning
Before starting Phase 2, you should be comfortable with: Python, NumPy arrays and vectorized operations, Pandas DataFrames, basic probability distributions, and the concept of a loss function from calculus. If any of these feel shaky, revisit Phase 1 first.
Week 1 — Supervised Learning · Unsupervised Learning
Week 2 — Model Evaluation · Feature Engineering
Week 3 — Ensemble Methods
End-to-end Tabular ML: Take any Kaggle classification dataset. Build a complete pipeline: EDA → feature engineering → baseline model → XGBoost with hyperparameter tuning → model evaluation report. Target ROC-AUC > 0.90 on a held-out test set.
Supervised Learning 01 / 05

Learn from Labeled Examples to Make Predictions

Given input features X and target labels y, find a function f such that f(X) ≈ y on unseen data. This is the foundation of the vast majority of production ML systems.

Supervised learning powers the real world: your bank's fraud detector, Spotify's song ranker, Gmail's spam filter, insurance risk models, and medical diagnosis tools. Mastering these algorithms — and knowing when to apply each one — makes you immediately useful in any data role. Crucially, these methods are interpretable, auditable, and battle-tested in production.
Linear Regression
Predict continuous values. Fits a hyperplane y = Xw + b by minimizing mean squared error. Closed-form solution or gradient descent.
Logistic Regression
Binary classification. Applies sigmoid to linear output: P(y=1) = σ(Xw). Outputs calibrated probabilities.
Decision Trees
Recursive feature splits that minimize impurity (Gini / entropy). Fully interpretable but prone to overfitting alone.
Random Forests
Ensemble of trees on bootstrap samples + random feature subsets. Reduces variance via averaging. Robust and reliable.
Support Vector Machines
Finds the maximum-margin hyperplane. Kernel trick maps to high-dim spaces. Excellent for small, high-dim datasets.
K-Nearest Neighbors
Classify by majority vote of K closest training points. Non-parametric, lazy — no training, all work at inference time.
Ridge / Lasso
Regularized regression. Ridge (L2) shrinks weights. Lasso (L1) produces sparse solutions with feature selection built in.
Naive Bayes
Assumes feature independence given class. Fast, works well on text. Surprisingly good baseline despite the strong assumption.
Linear Regression — Normal Equation
w* = (XᵀX)⁻¹ Xᵀy
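The normal equation can be checked against scikit-learn in a few lines — a sketch on synthetic data (np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Append a bias column so the intercept is part of w
Xb = np.hstack([X, np.ones((len(X), 1))])

# w* = (XᵀX)⁻¹ Xᵀy — solved as a linear system rather than an explicit inverse
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

sk = LinearRegression().fit(X, y)
print("normal equation:", w[:3].round(3), "intercept:", round(float(w[3]), 3))
print("sklearn:        ", sk.coef_.round(3), "intercept:", round(float(sk.intercept_), 3))
```

Both routes recover the same weights; sklearn uses an SVD-based solver internally, which is better conditioned than forming XᵀX.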
Logistic Regression — Sigmoid + Cross-Entropy Loss
σ(z) = 1 / (1 + e⁻ᶻ)
L = -[y log(ŷ) + (1-y) log(1-ŷ)]
Decision Tree — Gini Impurity
Gini(t) = 1 - Σ pₖ² (sum over classes k)
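Gini is simple enough to compute by hand — a tiny sketch (the `gini` helper is illustrative, not a sklearn function):

```python
import numpy as np

def gini(labels):
    """Gini(t) = 1 - Σ pₖ² over the class proportions pₖ at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))   # 0.0 — pure node, nothing to gain by splitting
print(gini([0, 1, 0, 1]))   # 0.5 — maximally impure binary node
```

A decision tree picks the split that most reduces the weighted average of child-node impurities.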
Algorithm           | Best For                       | Interpretable | Needs Scaling | Handles Missing
Linear Regression   | Continuous output, linear data | Yes           | Yes           | No
Logistic Regression | Binary classification baseline | Yes           | Yes           | No
Decision Tree       | Non-linear, categorical data   | Yes           | No            | No
Random Forest       | Tabular data, general use      | Partial       | No            | No
SVM                 | High-dim, small datasets       | No            | Yes           | No
KNN                 | Low-dimensional, no training   | Yes           | Yes           | No
supervised_learning.py
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing, load_breast_cancer
import numpy as np

# ── 1. LINEAR REGRESSION ──────────────────────────────────────────────
X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, lr.predict(X_te)))
print(f"Linear Regression RMSE: {rmse:.3f}")
print(f"Coefficients: {lr.coef_.round(3)}")

# Ridge (L2 regularization) — penalises large weights
ridge = Ridge(alpha=1.0)
ridge.fit(X_tr, y_tr)

# Lasso (L1 regularization) — drives small weights to exactly 0
lasso = Lasso(alpha=0.01)
lasso.fit(X_tr, y_tr)
print(f"Lasso zeros: {(lasso.coef_ == 0).sum()} / {len(lasso.coef_)}")

# ── 2. LOGISTIC REGRESSION ───────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=42)

# Scale first — logistic regression benefits from scaling
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

logr = LogisticRegression(C=1.0, max_iter=1000)
logr.fit(X_tr_s, y_tr)
print(f"Logistic Regression Accuracy: {accuracy_score(y_te, logr.predict(X_te_s)):.3f}")

# Get predicted probabilities (important for calibration)
probs = logr.predict_proba(X_te_s)[:, 1]   # P(class=1)

# ── 3. DECISION TREE ─────────────────────────────────────────────────
tree = DecisionTreeClassifier(
    max_depth=5,          # prevent overfitting
    min_samples_leaf=10,  # at least 10 samples per leaf
    criterion='gini',     # or 'entropy'
    random_state=42
)
tree.fit(X_tr, y_tr)
print(f"Decision Tree Accuracy: {accuracy_score(y_te, tree.predict(X_te)):.3f}")

# Print human-readable tree rules
feature_names = load_breast_cancer().feature_names
rules = export_text(tree, feature_names=list(feature_names), max_depth=3)
print(rules)

# ── 4. RANDOM FOREST ─────────────────────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=8,
    min_samples_leaf=5,
    max_features='sqrt',   # random feature subset per split
    n_jobs=-1,             # use all CPU cores
    random_state=42
)
rf.fit(X_tr, y_tr)
print(f"Random Forest Accuracy: {accuracy_score(y_te, rf.predict(X_te)):.3f}")

# Feature importances
importances = rf.feature_importances_
top5 = np.argsort(importances)[:-6:-1]
for i in top5:
    print(f"  {feature_names[i]:30s} {importances[i]:.4f}")

# ── 5. SVM ───────────────────────────────────────────────────────────
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_tr_s, y_tr)    # SVMs require scaled features!
print(f"SVM Accuracy: {accuracy_score(y_te, svm.predict(X_te_s)):.3f}")

# ── 6. KNN ───────────────────────────────────────────────────────────
knn = KNeighborsClassifier(n_neighbors=7, metric='euclidean')
knn.fit(X_tr_s, y_tr)    # KNN also requires scaled features!
print(f"KNN Accuracy: {accuracy_score(y_te, knn.predict(X_te_s)):.3f}")

# ── 7. COMPARE ALL IN ONE SHOT ───────────────────────────────────────
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM':                 SVC(kernel='rbf', gamma='scale'),
    'KNN':                 KNeighborsClassifier(n_neighbors=7),
}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te_s))
    print(f"  {name:25s}: {acc:.3f}")
01
Always start with logistic/linear regression as your baseline. If a complex model doesn't beat it by a meaningful margin, the simple model wins every time — it's faster, interpretable, and easier to maintain.
02
Random forests need no feature scaling. Tree-based models split on feature values, not distances. Scale only for linear models, SVMs, and KNN.
03
SVMs shine with small, high-dimensional data — text classification, genomics, financial signals. They struggle with millions of samples; switch to a linear kernel (or LinearSVC) at that scale.
04
Regularization parameter C in SVM/Logistic: C is inverse regularization strength. Higher C = less regularization, more overfit. Lower C = more regularization, more underfit.
05
Stratify your train/test split for classification: stratify=y ensures class ratios are preserved. Critical for imbalanced datasets.
06
KNN computational cost scales with data size. Nearest-neighbor search is O(N·D) per query. Use approximate methods (FAISS, Annoy) for large datasets.
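Tip 04 is easy to verify empirically — a minimal sketch sweeping C in LogisticRegression on the same breast-cancer data used above (the C values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

train_acc = {}
for C in [0.001, 1.0, 1000.0]:
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_tr_s, y_tr)
    train_acc[C] = clf.score(X_tr_s, y_tr)
    # higher C → weaker regularization → tighter fit on the training set
    print(f"C={C:>7}: train={train_acc[C]:.3f}  test={clf.score(X_te_s, y_te):.3f}")
```

Watch the gap between train and test accuracy grow as C increases — that gap is the overfitting tip 04 warns about.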
scikit-learn docs · ISLR (free PDF) · Hands-On ML (Géron) · StatQuest (YouTube) · fast.ai Tabular · Kaggle Learn
Unsupervised Learning 02 / 05

Find Hidden Structure in Unlabeled Data

No labels required. Clustering discovers natural groupings, dimensionality reduction reveals structure, and anomaly detection finds the unusual. Most data in the world is unlabeled — this is how you use it.

Customer segmentation, topic modeling, anomaly detection in logs, compressing features before supervised learning, visualizing high-dimensional embeddings — all unsupervised. PCA alone can turn a 1000-feature dataset into 20 features that capture 95% of variance, making your subsequent supervised model dramatically faster with better generalization.
K-Means
Assign each point to nearest centroid, update centroids, repeat until convergence. Simple, scalable, sensitive to initialization and outliers.
DBSCAN
Density-Based Spatial Clustering. Finds arbitrary-shaped clusters, labels outliers as noise (-1). Requires no K, but needs eps and min_samples tuning.
Hierarchical Clustering
Builds a dendrogram tree. Agglomerative (bottom-up): each point starts as its own cluster, merge until one remains. Cut the tree to get K clusters.
PCA
Principal Component Analysis. Finds orthogonal axes of maximum variance. Projects data to lower dimensions while preserving the most information.
t-SNE
t-Distributed Stochastic Neighbor Embedding. Non-linear 2D/3D visualization of high-dim data. Preserves local structure. Qualitative, not quantitative.
UMAP
Uniform Manifold Approximation and Projection. Faster than t-SNE, better preserves global structure. The modern default for embedding visualization.
Gaussian Mixture Models
Soft probabilistic clustering. Each point has a probability of belonging to each cluster. More flexible than K-Means (elliptical clusters).
Isolation Forest
Anomaly detection via random trees. Anomalies are isolated with fewer splits. Returns anomaly score per sample. Very practical for production.
Elbow Method: Plot inertia (sum of squared distances to nearest centroid) vs K. The "elbow" where improvement slows is a good K choice.

Silhouette Score: Ranges from -1 to 1. Higher = better defined clusters. Use sklearn.metrics.silhouette_score(X, labels). Choose K that maximizes it.
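Hierarchical clustering is described above but doesn't appear in the code file below, so here is a minimal standalone sketch (AgglomerativeClustering with Ward linkage on synthetic blobs; dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Ward linkage merges the pair of clusters that least increases total variance;
# n_clusters=3 is where we "cut" the dendrogram
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print(f"silhouette: {silhouette_score(X, labels):.3f}")
```

To inspect the full dendrogram rather than a single cut, scipy.cluster.hierarchy's `linkage` and `dendrogram` are the usual tools.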
PCA via SVD
X_centered = X - mean(X)
U, Σ, Vᵀ = SVD(X_centered)
X_reduced = X_centered @ V[:, :k]     ← k principal components
explained = Σ² / sum(Σ²)              ← variance explained per PC
unsupervised_learning.py
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_digits, make_blobs

# ── 1. K-MEANS ────────────────────────────────────────────────────────
X, y_true = make_blobs(n_samples=500, n_features=2,
                       centers=4, random_state=42)

# ALWAYS scale before K-Means
scaler = StandardScaler()
X_s = scaler.fit_transform(X)

# Elbow method — find optimal K
inertias, sil_scores = [], []
K_range = range(2, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_s)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_s, km.labels_))

best_k = K_range.start + np.argmax(sil_scores)
print(f"Best K by silhouette: {best_k}")

# Fit final model
km = KMeans(n_clusters=best_k, n_init=20, random_state=42)
labels = km.fit_predict(X_s)
print(f"Cluster sizes: {np.bincount(labels)}")
print(f"Inertia: {km.inertia_:.1f}")

# ── 2. DBSCAN — density-based, no K needed ────────────────────────────
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X_s)

n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = (db_labels == -1).sum()
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")

# ── 3. GAUSSIAN MIXTURE MODEL — soft probabilistic clusters ──────────
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
gmm.fit(X_s)
gmm_labels = gmm.predict(X_s)
probs = gmm.predict_proba(X_s)       # shape (n, k) — soft assignments
print(f"GMM BIC: {gmm.bic(X_s):.1f}")   # lower = better fit

# ── 4. PCA — dimensionality reduction ─────────────────────────────────
X_digits, y_digits = load_digits(return_X_y=True)   # 1797 × 64

pca = PCA(n_components=0.95)   # keep 95% of variance
X_pca = pca.fit_transform(X_digits)
print(f"PCA: {X_digits.shape[1]} → {X_pca.shape[1]} dims")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")

# Manual PCA via SVD — understand what sklearn does under the hood
X_c = X_digits - X_digits.mean(axis=0)
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
X_manual_pca = X_c @ Vt[:20].T   # top 20 components

# ── 5. t-SNE — 2D visualization ───────────────────────────────────────
# Best practice: PCA first to ~50 dims, then t-SNE (much faster)
X_50 = PCA(n_components=50).fit_transform(X_digits)
# note: n_iter is named max_iter in newer scikit-learn versions
X_2d = TSNE(n_components=2, perplexity=30,
             random_state=42, n_iter=1000).fit_transform(X_50)
print(f"t-SNE shape: {X_2d.shape}")   # (1797, 2)

# ── 6. ISOLATION FOREST — anomaly detection ───────────────────────────
# contamination = expected fraction of outliers in your data
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
iso.fit(X_s)
# -1 = anomaly, +1 = normal
anomaly_labels = iso.predict(X_s)
anomaly_scores = iso.decision_function(X_s)   # lower = more anomalous
print(f"Anomalies detected: {(anomaly_labels == -1).sum()}")
01
Always scale before clustering. K-Means uses Euclidean distance — a feature with range [0, 1000] will dominate one with range [0, 1].
02
PCA before t-SNE: reduce to 50 dims with PCA first, then apply t-SNE. Orders of magnitude faster with essentially the same visual result.
03
DBSCAN automatically discovers cluster count and handles noise. It struggles in high dimensions and with clusters of varying density. Use HDBSCAN for better results.
04
t-SNE perplexity balances local vs global structure. Try perplexity ∈ {5, 30, 50, 100}. Results vary significantly — always show multiple.
05
GMM BIC/AIC to select number of components. Fit GMMs for K=1..15, plot BIC, choose the elbow. More principled than K-Means elbow method.
06
PCA as preprocessing before supervised learning often improves performance — removes correlated noise features and reduces overfitting risk in linear models.
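Tip 06 in action — a minimal sketch of PCA as a preprocessing step inside a supervised pipeline (digits dataset and the 95%-variance cutoff chosen for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca',   PCA(n_components=0.95)),        # keep 95% of the variance
    ('clf',   LogisticRegression(max_iter=2000)),
])
scores = cross_val_score(pipe, X, y, cv=5)    # PCA is refit inside each fold — no leakage
print(f"CV accuracy with PCA front-end: {scores.mean():.3f}")
```

Because PCA sits inside the Pipeline, cross-validation fits it on each training fold only — the same leakage discipline the evaluation section covers.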
scikit-learn clustering guide · StatQuest K-Means · PCA explained visually · UMAP docs · HDBSCAN library
Model Evaluation 03 / 05

Measure Performance Honestly and Correctly

A model that scores 99% accuracy on imbalanced data might never predict the minority class. Choosing the right metric and validation strategy separates real ML engineers from beginners.

This is the most dangerous section to get wrong. Wrong evaluation leads to overconfident models deployed in production. Understanding bias-variance tradeoff, data leakage, and the right metric for your problem is what makes evaluation a skill rather than a formality.
Train / Val / Test Split
Never evaluate on training data. Hold out test until the very end. Use val for hyperparameter tuning. Contamination = silent failure.
K-Fold Cross-Validation
Train on K-1 folds, validate on 1, repeat K times. Average metrics across folds. More reliable estimate of true generalization performance.
Accuracy
Fraction correct. Simple. Completely misleading for imbalanced classes. A model predicting majority class always gets 95% on a 95/5 split.
Precision & Recall
Precision = TP/(TP+FP). Recall = TP/(TP+FN). Classic tradeoff: increase threshold → better precision, worse recall and vice versa.
F1 Score
Harmonic mean of precision and recall. Use when both matter equally and classes are imbalanced. F_β weights recall β times as heavily as precision.
ROC-AUC
Area under the ROC curve. Measures ranking quality — how well does the model separate classes? Threshold-independent. 0.5 = random.
RMSE / MAE
Regression metrics. RMSE penalises large errors heavily (squared). MAE is more robust to outliers. Always compare to a naive baseline.
Confusion Matrix
TP, FP, TN, FN table. The most informative single diagnostic — tells you exactly what kind of errors your model makes.
Bias = error from wrong assumptions (model too simple = underfitting). Variance = error from sensitivity to training-set fluctuations (model too complex = overfitting). Total error = Bias² + Variance + Irreducible noise. You want both low — achieved through the right model complexity, regularization, and enough data.
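The tradeoff is easy to see with polynomial regression on noisy data — a sketch with synthetic data (degrees 1, 4, and 15 chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

errs = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errs[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                    mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree={degree:2d}  train MSE={errs[degree][0]:.3f}  "
          f"test MSE={errs[degree][1]:.3f}")
# degree 1: both errors high (bias); degree 15: train error keeps falling
# while test error stops improving (variance)
```

Degree is the complexity knob here; regularization (Ridge/Lasso) plays the same role for a fixed feature set.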
Key Metrics — Formulas
Precision = TP / (TP + FP)              ← "Of predicted positives, how many were right?"
Recall    = TP / (TP + FN)              ← "Of actual positives, how many did we catch?"
F1        = 2 · P · R / (P + R)         ← harmonic mean
AUC-ROC   = P(score(pos) > score(neg))  ← ranking quality
RMSE      = sqrt(mean((y - ŷ)²))
Problem Type              | Good Metric     | Avoid    | When to Use Precision vs Recall
Balanced classification   | Accuracy, F1    | —        | Accuracy is fine
Imbalanced classification | ROC-AUC, PR-AUC | Accuracy | PR-AUC when positives are rare
Fraud detection           | Recall, PR-AUC  | Accuracy | Prioritize recall (catch all fraud)
Spam filter               | Precision, F1   | Accuracy | Prioritize precision (don't block real mail)
Regression                | RMSE or MAE     | R² alone | MAE if outliers exist in y
Ranking                   | NDCG, MAP       | Accuracy | Order matters
model_evaluation.py
import numpy as np
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, KFold,
    learning_curve, validation_curve
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    classification_report, confusion_matrix,
    mean_squared_error, mean_absolute_error, r2_score,
    ConfusionMatrixDisplay, RocCurveDisplay
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# ── 1. STRATIFIED K-FOLD CROSS-VALIDATION ─────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score multiple metrics in one CV run
from sklearn.model_selection import cross_validate
results = cross_validate(model, X, y, cv=cv,
    scoring=['accuracy', 'roc_auc', 'f1', 'precision', 'recall'],
    return_train_score=True)

for metric in ['accuracy', 'roc_auc', 'f1']:
    val = results[f'test_{metric}']
    train = results[f'train_{metric}']
    print(f"{metric:12s}  val: {val.mean():.3f} ± {val.std():.3f}  "
          f"train: {train.mean():.3f}")
# If train >> val: overfitting. If both low: underfitting.

# ── 2. THRESHOLD-BASED METRICS ────────────────────────────────────────
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]

print(classification_report(y_te, y_pred, target_names=['benign', 'malignant']))
print(f"ROC-AUC: {roc_auc_score(y_te, y_prob):.4f}")
print(f"PR-AUC:  {average_precision_score(y_te, y_prob):.4f}")

# Confusion matrix
cm = confusion_matrix(y_te, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN, FP],
#  [FN, TP]]

# ── 3. CUSTOM THRESHOLD — precision/recall tradeoff ───────────────────
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_te, y_prob)

# Find threshold for ≥0.90 precision
idx = np.argmax(precisions >= 0.90)
optimal_threshold = thresholds[idx]
print(f"Threshold for 90% precision: {optimal_threshold:.3f}")
y_pred_custom = (y_prob >= optimal_threshold).astype(int)
print(f"Custom threshold — precision: {precision_score(y_te, y_pred_custom):.3f}, "
      f"recall: {recall_score(y_te, y_pred_custom):.3f}")

# ── 4. LEARNING CURVES — diagnose bias vs variance ────────────────────
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='roc_auc', n_jobs=-1)

# Plot: if train high, val low → overfit (need regularization/more data)
#        if both low           → underfit (need more complex model)
#        if both high and close → ideal

# ── 5. REGRESSION METRICS ─────────────────────────────────────────────
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
X_r, y_r = fetch_california_housing(return_X_y=True)
X_r_tr, X_r_te, y_r_tr, y_r_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_r_tr, y_r_tr)
y_r_pred = reg.predict(X_r_te)
rmse = np.sqrt(mean_squared_error(y_r_te, y_r_pred))
mae  = mean_absolute_error(y_r_te, y_r_pred)
r2   = r2_score(y_r_te, y_r_pred)
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}  R²: {r2:.3f}")

# ── 6. DATA LEAKAGE — the silent killer ───────────────────────────────
# WRONG: fit scaler on all data, then split
from sklearn.preprocessing import StandardScaler
scaler_bad = StandardScaler()
X_bad = scaler_bad.fit_transform(X)   # ← leaks test stats into training!
X_bad_tr, X_bad_te = X_bad[:400], X_bad[400:]

# RIGHT: fit scaler only on train, transform test
X_raw_tr, X_raw_te = X[:400], X[400:]
scaler_good = StandardScaler()
X_good_tr = scaler_good.fit_transform(X_raw_tr)
X_good_te = scaler_good.transform(X_raw_te)   # ← correct!

# Or best of all: use a Pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  RandomForestClassifier(random_state=42))
])
cv_scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f"Pipeline CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
01
Stratified K-Fold for classification always. Plain K-Fold may put all of a rare class in one fold, making evaluation meaningless.
02
Look at train vs val score together to diagnose problems. Large gap → overfitting. Both low → underfitting. Check learning curves.
03
Data leakage is the #1 hidden enemy. Any preprocessing that uses information from the test set — scaling, imputation, feature selection — is leakage. Use Pipelines to prevent it.
04
For time-series data, use TimeSeriesSplit — you cannot shuffle time-dependent data and do random K-Fold. The future cannot predict the past.
05
Never tune hyperparameters on the test set. Use nested cross-validation: outer CV for evaluation, inner CV for hyperparameter search.
06
Always set a dummy baseline before modeling. sklearn's DummyClassifier(strategy="most_frequent") tells you what random performance looks like.
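Tips 04 and 06 in a few lines — a DummyClassifier baseline and a look at how TimeSeriesSplit keeps the future out of the training folds (the toy arrays are illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import TimeSeriesSplit

# Majority-class baseline: on a 90/10 imbalance it scores 0.90 "accuracy"
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
print(f"Dummy accuracy: {dummy.score(X, y):.2f}")   # beat this before celebrating

# Time-ordered CV: every fold trains on the past, validates on the future
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(np.arange(24)):
    print(f"train ≤ t={train_idx.max():2d}  →  validate t={val_idx.min()}–{val_idx.max()}")
```

Any model that can't clearly beat the dummy score is learning nothing useful, no matter how impressive its raw accuracy looks.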
scikit-learn metrics · StatQuest ROC/AUC · The bias-variance tradeoff · ISLR Ch. 5 · Kaggle evaluation guide
Feature Engineering 04 / 05

Create Features That Make Models Smarter

Better features beat better algorithms. A weak model with great features consistently outperforms a complex model on raw data. Feature engineering is where domain expertise meets machine learning.

Kaggle grand masters don't win with exotic algorithms — they win with creative feature engineering. A log transform, a date feature, or a well-constructed interaction term can be worth 5 points of AUC that no amount of hyperparameter tuning will ever recover. This skill separates domain experts from pure ML people.
One-Hot Encoding
Convert nominal categories to binary columns. Good for low-cardinality (< ~20 unique values). Use drop_first to avoid multicollinearity.
Ordinal Encoding
Map ordered categories to integers: cold=0, warm=1, hot=2. Preserves order information unlike one-hot.
Target Encoding
Replace category with mean target value. Powerful for high-cardinality (cities, zip codes). Must be done inside CV to prevent leakage.
StandardScaler
Zero mean, unit variance. Required for linear models, SVMs, KNN. Not needed for tree-based models.
Log Transform
np.log1p(x) compresses right-skewed distributions (price, income, counts). Makes linear model assumptions more valid.
Binning
Discretize continuous into buckets: age → young/mid/senior. Captures non-linear effects for linear models. pd.cut() or pd.qcut().
Feature Crosses
Multiply features: age × income. Captures interaction effects that linear models can't express on their own.
Date Features
Extract year, month, dayofweek, hour, is_weekend, days_since_event from timestamps. Datetime columns are treasure troves.
⚠ Target Encoding Leakage Warning: You must compute target encoding statistics only on training data and apply to validation/test. If you compute it on the entire dataset first, you've leaked future information. Use sklearn.preprocessing.TargetEncoder inside a Pipeline, or compute manually inside each CV fold.
feature_engineering.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, TargetEncoder,
    PolynomialFeatures
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split

# ── SAMPLE DATASET ─────────────────────────────────────────────────────
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age':         np.random.randint(18, 75, n),
    'income':      np.random.exponential(50000, n),
    'city':        np.random.choice(['NY','LA','SF','CHI','HOU'], n),
    'education':   np.random.choice(['HS','BA','MS','PhD'], n),
    'join_date':   pd.date_range('2020-01-01', periods=n, freq='D'),
    'score':       np.random.randint(400, 850, n),
    'target':      np.random.randint(0, 2, n)
})
# Inject missing values
df.loc[np.random.choice(n, 50, replace=False), 'age'] = np.nan

# ── 1. BASIC TRANSFORMS ────────────────────────────────────────────────
# Log transform — right-skewed income
df['log_income'] = np.log1p(df['income'])

# Binning age
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 100],
                          labels=['young','mid','senior','elder'])

# Interaction terms
df['income_per_age'] = df['income'] / (df['age'] + 1)
df['score_x_income'] = df['score'] * df['log_income']

# ── 2. DATE FEATURES ───────────────────────────────────────────────────
df['year']          = df['join_date'].dt.year
df['month']         = df['join_date'].dt.month
df['dayofweek']     = df['join_date'].dt.dayofweek
df['is_weekend']    = (df['dayofweek'] >= 5).astype(int)
df['quarter']       = df['join_date'].dt.quarter
df['days_since_join'] = (pd.Timestamp('2024-01-01') - df['join_date']).dt.days

# ── 3. ORDINAL ENCODING ────────────────────────────────────────────────
edu_order = [['HS', 'BA', 'MS', 'PhD']]
oe = OrdinalEncoder(categories=edu_order)
df['edu_encoded'] = oe.fit_transform(df[['education']])

# ── 4. SKLEARN PIPELINE — the right way ───────────────────────────────
X = df.drop(columns=['target', 'join_date', 'city', 'education',
                      'age_group'])
y = df['target']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_cols = ['age', 'income', 'log_income', 'score', 'days_since_join',
                'income_per_age', 'score_x_income']   # income_per_age inherits age's NaNs — impute it too

numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
], remainder='passthrough')

from sklearn.ensemble import RandomForestClassifier
full_pipe = Pipeline([
    ('prep',  preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
full_pipe.fit(X_tr, y_tr)

# ── 5. FULL ColumnTransformer with multiple column types ───────────────
df2 = df.copy()
num_features  = ['age', 'income', 'score']
cat_nominal   = ['city']
cat_ordinal   = ['education']

from sklearn.preprocessing import FunctionTransformer

preprocessor2 = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('log',    FunctionTransformer(np.log1p)),
        ('scale',  StandardScaler()),
    ]), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_nominal),
    ('ord', OrdinalEncoder(categories=edu_order), cat_ordinal),
])

# ── 6. FEATURE SELECTION ───────────────────────────────────────────────
X_prep = preprocessor.fit_transform(X_tr, y_tr)

# Select top-K features by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_prep, y_tr)
print(f"Selected {X_selected.shape[1]} features from {X_prep.shape[1]}")

# Mutual information — works for non-linear relationships too
mi_scores = mutual_info_classif(X_prep, y_tr, random_state=42)
print(f"Top MI scores: {sorted(mi_scores, reverse=True)[:5]}")

# ── 7. POLYNOMIAL FEATURES ─────────────────────────────────────────────
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X_prep[:, :3])   # first 3 features
print(f"Polynomial features: {X_prep.shape[1]} → {X_poly.shape[1]}")
01
Always use Pipelines when doing preprocessing. They prevent leakage, make cross-validation correct, and make production deployment trivial — one .predict() call handles everything.
02
Log1p before scaling for skewed features. The order matters: transform, then scale. np.log1p handles zeros gracefully unlike np.log.
03
High-cardinality categoricals (1000+ unique values) — use target encoding or embedding. One-hot produces too many sparse columns that hurt most models.
04
Date features are underrated. Extracting dayofweek, hour, is_holiday, and days_since_last_event from a timestamp can be the most predictive features in a dataset.
05
RobustScaler (scales by IQR) is better than StandardScaler when your data has outliers — outliers won't distort the scale applied to other points.
06
Feature selection after engineering: more features ≠ better model. Use mutual_info_classif or feature importances from a Random Forest to prune noise.
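Tip 05 is easy to demonstrate — a sketch with 99 ordinary points and one extreme outlier (values chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# 99 ordinary points and one extreme outlier
x = np.append(np.random.default_rng(0).normal(50, 5, 99), 10_000).reshape(-1, 1)

std = StandardScaler().fit_transform(x)   # mean/std — both dragged by the outlier
rob = RobustScaler().fit_transform(x)     # median/IQR — barely notices it

print(f"inlier spread after StandardScaler: {std[:99].std():.4f}")  # squashed near zero
print(f"inlier spread after RobustScaler:   {rob[:99].std():.4f}")  # preserved
```

One outlier inflates the standard deviation so much that the other 99 points collapse into a sliver; the IQR-based scale is essentially unaffected.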
Kaggle Feature Engineering course · sklearn ColumnTransformer · Feature Engineering for ML (book) · Target encoding guide
Ensemble Methods & Boosting 05 / 05

Combine Weak Learners into Powerful Models

XGBoost and LightGBM win most tabular ML competitions. Gradient boosting is the dominant paradigm for structured data in industry. Understanding how it works is the key to tuning it well.

Browse any top Kaggle solution for a tabular competition — you'll find XGBoost or LightGBM. These algorithms consistently outperform all others on structured data. They handle missing values, mixed feature types, non-linear interactions, and irrelevant features with minimal preprocessing. In industry, they are the go-to before any neural network is considered.
Bagging
Train models on random bootstrap samples, average predictions. Reduces variance. Parallel training. Random Forest = bagging of decision trees.
Boosting
Train models sequentially, each correcting previous errors. Reduces bias. Prone to overfitting — use early stopping and regularization.
Stacking
Use predictions of base models as features for a meta-learner (often logistic regression). Captures the best of each base model's specialization.
XGBoost
Regularized gradient boosting with second-order Taylor expansion. Handles missing values natively. The industry standard since 2016.
LightGBM
Histogram-based gradient boosting. Often an order of magnitude faster than XGBoost on large datasets. Grows trees leaf-wise vs level-wise. Better for millions of rows.
CatBoost
Handles categorical features natively via ordered target encoding. Less hyperparameter tuning needed. Strong out-of-the-box on heterogeneous data.
Gradient Boosting
Each new tree fits the negative gradient (residuals) of the loss function. The ensemble is an additive model of weak learners.
Early Stopping
Stop training when validation metric stops improving. Critical for boosting — prevents overfitting. Set eval_set and early_stopping_rounds.
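The variance-reduction claim for bagging is easy to check empirically. A minimal sketch comparing a single unpruned decision tree against a bagged ensemble of 100 such trees (the synthetic dataset and its parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y mislabels 5% of samples, so a single
# unpruned tree overfits the noise
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=100, random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=5)
bag_scores = cross_val_score(bag, X, y, cv=5)
print(f"single tree accuracy: {tree_scores.mean():.3f}")
print(f"bagged 100 accuracy:  {bag_scores.mean():.3f}")
```

Each bootstrap tree still overfits, but their errors are partly uncorrelated, so averaging them cancels variance — the same mechanism Random Forest exploits.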
Gradient Boosting — Additive Model
F₀(x) = initial prediction (e.g. mean of y)
For t = 1 to T:
    rᵢ = −∂L/∂F(xᵢ)             ← negative gradient = pseudo-residuals
    hₜ = tree fit to rᵢ         ← weak learner fits residuals
    Fₜ(x) = Fₜ₋₁(x) + η·hₜ(x)   ← η = learning rate
Final: F_T(x) = F₀(x) + η · Σₜ hₜ(x)
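The recipe above can be sketched in a few lines for squared-error loss, where the negative gradient is simply the residual y − F(x). This is an illustrative from-scratch version, not a production implementation; the data and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, T = 0.1, 100
F = np.full_like(y, y.mean())        # F0: initial prediction = mean of y
trees = []
for t in range(T):
    r = y - F                        # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)  # weak learner fits residuals
    F += eta * h.predict(X)          # F_t = F_{t-1} + eta * h_t
    trees.append(h)

mse = np.mean((y - F) ** 2)
print(f"train MSE after {T} rounds: {mse:.4f}")
```

Each depth-2 tree is a very weak learner, yet the additive ensemble drives the training error down toward the noise floor — which is also why early stopping on a validation set is needed in practice.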
| Parameter | Effect | Typical Range | Direction if Overfitting |
|---|---|---|---|
| learning_rate (η) | Step size. Smaller = more trees needed but generalizes better. | 0.01 – 0.3 | ↓ decrease |
| n_estimators | Number of trees. Use with early stopping. | 100 – 5000 | use early stopping |
| max_depth | Tree depth. Deeper = more complex interactions. | 3 – 8 | ↓ decrease |
| subsample | Fraction of training rows per tree. Adds stochasticity. | 0.6 – 1.0 | ↓ decrease |
| colsample_bytree | Fraction of features per tree. Like RF's max_features. | 0.6 – 1.0 | ↓ decrease |
| reg_lambda (L2) | L2 regularization on leaf weights. Prevents large weights. | 0.1 – 10 | ↑ increase |
| reg_alpha (L1) | L1 regularization. Sparse leaf weights. | 0 – 1 | ↑ increase |
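One way to apply the "Direction if Overfitting" column: measure the train/validation gap, then move the relevant knobs in the indicated direction. A minimal sketch using sklearn's GradientBoostingClassifier, which exposes the same core parameters (the dataset and parameter values are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data so that memorization visibly hurts validation AUC
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def gap(model):
    """Fit, then return (train AUC, val AUC, gap). A large gap = overfitting."""
    model.fit(X_tr, y_tr)
    tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    va = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return tr, va, tr - va

# Deep trees + large learning rate: memorizes the training set
deep = GradientBoostingClassifier(max_depth=8, learning_rate=0.3,
                                  n_estimators=200, random_state=0)
# Table applied: decrease max_depth and learning_rate, set subsample < 1
reg = GradientBoostingClassifier(max_depth=3, learning_rate=0.05,
                                 n_estimators=200, subsample=0.8,
                                 random_state=0)
for name, m in [("deep", deep), ("regularized", reg)]:
    tr, va, g = gap(m)
    print(f"{name:>11}: train AUC {tr:.3f}  val AUC {va:.3f}  gap {g:.3f}")
```

The deep model reaches near-perfect training AUC but a larger train/val gap; the regularized model trades a little training fit for better generalization.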
ensemble_methods.py
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    VotingClassifier, StackingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_breast_cancer
import optuna

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           stratify=y, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.2,
                                             stratify=y_tr, random_state=42)

# ── 1. XGBOOST — with early stopping ──────────────────────────────────
xgb_model = xgb.XGBClassifier(
    n_estimators=2000,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.5,
    reg_alpha=0.1,
    min_child_weight=3,
    eval_metric='auc',
    early_stopping_rounds=50,
    random_state=42,
    verbosity=0          # note: use_label_encoder was removed in XGBoost 2.x
)
xgb_model.fit(X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=False)

xgb_auc = roc_auc_score(y_te, xgb_model.predict_proba(X_te)[:, 1])
print(f"XGBoost AUC: {xgb_auc:.4f}  Best round: {xgb_model.best_iteration}")

# Feature importance
feat_imp = xgb_model.feature_importances_
top_idx = np.argsort(feat_imp)[:-6:-1]
print("Top-5 features:", load_breast_cancer().feature_names[top_idx])

# ── 2. LIGHTGBM — faster, better for large data ───────────────────────
lgb_model = lgb.LGBMClassifier(
    n_estimators=2000,
    num_leaves=31,          # LightGBM uses num_leaves, not max_depth
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_samples=20,
    random_state=42,
    verbose=-1
)
lgb_model.fit(X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50, verbose=False),
               lgb.log_evaluation(-1)])

lgb_auc = roc_auc_score(y_te, lgb_model.predict_proba(X_te)[:, 1])
print(f"LightGBM AUC: {lgb_auc:.4f}")

# ── 3. CATBOOST — native categorical support ──────────────────────────
cb_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    eval_metric='AUC',
    early_stopping_rounds=50,
    random_seed=42,
    verbose=0
)
cb_model.fit(X_tr, y_tr, eval_set=(X_val, y_val))
cb_auc = roc_auc_score(y_te, cb_model.predict_proba(X_te)[:, 1])
print(f"CatBoost AUC: {cb_auc:.4f}")

# ── 4. VOTING ENSEMBLE ────────────────────────────────────────────────
rf  = RandomForestClassifier(n_estimators=200, random_state=42)
xgb2 = xgb.XGBClassifier(n_estimators=200, random_state=42, verbosity=0)
lgb2 = lgb.LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)

voting = VotingClassifier(
    estimators=[('rf', rf), ('xgb', xgb2), ('lgb', lgb2)],
    voting='soft'   # average probabilities, not votes
)
voting.fit(X_tr, y_tr)
vote_auc = roc_auc_score(y_te, voting.predict_proba(X_te)[:, 1])
print(f"Voting Ensemble AUC: {vote_auc:.4f}")

# ── 5. STACKING ───────────────────────────────────────────────────────
estimators = [
    ('rf',  RandomForestClassifier(n_estimators=200, random_state=42)),
    ('xgb', xgb.XGBClassifier(n_estimators=200, random_state=42, verbosity=0)),
    ('lgb', lgb.LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)),
]
stacker = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=0.1),  # meta-learner
    cv=5,
    passthrough=True   # include original features alongside base predictions
)
stacker.fit(X_tr, y_tr)
stack_auc = roc_auc_score(y_te, stacker.predict_proba(X_te)[:, 1])
print(f"Stacking AUC: {stack_auc:.4f}")

# ── 6. OPTUNA HYPERPARAMETER OPTIMIZATION ─────────────────────────────
def objective(trial):
    params = {
        'n_estimators':     trial.suggest_int('n_estimators', 100, 1000),
        'max_depth':        trial.suggest_int('max_depth', 3, 8),
        'learning_rate':    trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample':        trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_lambda':       trial.suggest_float('reg_lambda', 0.1, 10.0, log=True),
        'reg_alpha':        trial.suggest_float('reg_alpha', 0.0, 1.0),
    }
    model = xgb.XGBClassifier(**params, random_state=42, verbosity=0)
    score = cross_val_score(model, X_tr, y_tr, cv=5, scoring='roc_auc')
    return score.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Train final model with best params
best = xgb.XGBClassifier(**study.best_params, random_state=42, verbosity=0)
best.fit(X_tr, y_tr)
final_auc = roc_auc_score(y_te, best.predict_proba(X_te)[:, 1])
print(f"Final XGBoost AUC on test: {final_auc:.4f}")
01
Always use early stopping with a validation set. Set n_estimators high (2000+) and let early stopping find the right number. Never guess n_estimators.
02
LightGBM is often ~10x faster than XGBoost on large datasets (1M+ rows). Use XGBoost for smaller data where you want maximum accuracy and interpretability.
03
Learning rate and n_estimators are coupled. Smaller learning rate = more trees needed. A good pattern: lr=0.05 with 500-2000 trees + early stopping.
04
Optuna > GridSearch > RandomSearch for hyperparameter optimization. Optuna's TPE sampler is Bayesian — it learns which regions of the space are promising.
05
Soft voting usually beats hard voting when the base classifiers produce well-calibrated probabilities. Averaging probabilities keeps more information than a majority vote.
06
Feature importance from XGBoost has three types: weight (frequency), gain (improvement in loss), cover (data coverage). Gain is usually most informative.
🎉 Phase 2 Complete! You now have the full Classical ML toolkit. The natural next step is Phase 3: Deep Learning — where you'll understand why neural networks are needed for tasks where these algorithms fall short (images, text, sequences).
XGBoost docs · LightGBM docs · CatBoost docs · Optuna tutorial · Kaggle winning solutions · StatQuest Gradient Boosting