Phase 3 of 6 · ML & AI Roadmap

Deep
Learning

Where the real magic begins. Classical ML hits a wall with images, text, audio, and sequences. Deep learning breaks through it — neural networks learn hierarchical representations that no human could hand-craft. This phase takes you from a single neuron to the Transformer architecture that powers GPT-4 and Claude.

01 / 06
Neural Networks & Backprop
Build and train a network from scratch. Understand every gradient.
02 / 06
PyTorch
The dominant deep learning framework. Tensors, autograd, training loop.
03 / 06
CNNs & Computer Vision
Convolutions, pooling, transfer learning, data augmentation.
04 / 06
RNNs & LSTMs
Sequential data, vanishing gradients, gating mechanisms.
05 / 06
Transformers & Attention
The architecture powering every modern LLM and vision model.
06 / 06
Training Tricks & Tuning
Regularization, optimizers, schedulers, debugging training failures.
Before starting Phase 3, you should know: Python, NumPy vectorized operations, gradient descent conceptually (from Phase 1 Calculus), and how to use scikit-learn's fit/predict API. You don't need to know PyTorch yet — that's Topic 2 here. But comfort with matrix multiplication and derivatives is essential.
Week 1
Backprop
+ PyTorch basics
Week 2–3
CNNs
+ RNNs / LSTMs
Week 4–5
Transformers
+ Training tricks
Image + Text Classifier: Train a CNN on CIFAR-10 (>90% accuracy with transfer learning), then build a sentiment classifier with a Transformer encoder on IMDb. Both must use proper train/val/test splits, early stopping, and a learning rate scheduler. Deploy both as FastAPI endpoints.
Neural Networks 01 / 06

Build Networks from Scratch
& Understand Every Gradient

Before you reach for PyTorch, implement backpropagation by hand. Once you understand why gradients flow the way they do, every training failure becomes diagnosable.

Most people jump straight to high-level frameworks and treat training as a black box. Then they can't debug why their model diverges, why gradients vanish, or why loss spikes after epoch 10. Implementing backprop from scratch — even once — gives you a mental model that pays dividends for years. You'll understand what loss.backward() actually does, and why.
Neuron
z = Wx + b, a = activation(z). Linear transform followed by non-linearity. The atomic unit of a neural network.
Layers
Stack of neurons. Input → hidden → output. Each layer applies a linear transform then an activation function.
Activations
ReLU: max(0,z). Sigmoid: 1/(1+e⁻ᶻ). Tanh. GELU. Non-linearities let networks learn complex functions. ReLU is default.
Forward Pass
Data flows input → output through successive matrix multiplications and activations. Produces prediction.
Loss Function
MSE for regression, cross-entropy for classification. Scalar measure of how wrong the prediction is.
Backward Pass
Chain rule applied layer by layer from output to input. Computes ∂L/∂w for every weight.
Weight Update
w = w − α · ∂L/∂w. All weights nudged to reduce loss. α is the learning rate.
Initialization
He init for ReLU (√2/n), Xavier for Tanh. Wrong init → gradients vanish or explode immediately.
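To make the initialization point concrete, here is a small standalone NumPy sketch (not part of the lesson file) comparing He scaling against a naive small-constant scale across a 20-layer ReLU stack:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(scale_fn, depth=20, width=256):
    """Push data through `depth` Linear+ReLU layers, return final activation std."""
    x = rng.standard_normal((128, width))
    for _ in range(depth):
        W = scale_fn(width) * rng.standard_normal((width, width))
        x = np.maximum(0.0, x @ W)       # linear transform + ReLU
    return x.std()

he_std    = forward_stats(lambda n: np.sqrt(2.0 / n))   # He init
naive_std = forward_stats(lambda n: 0.01)               # naive constant scale

print(f"He init activation std:    {he_std:.4f}")
print(f"naive init activation std: {naive_std:.2e}")    # collapses toward 0
```

With He scaling the activation scale survives all 20 layers; with the naive scale it shrinks by roughly 10x per layer, which is exactly the vanishing-signal failure the Initialization card warns about.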
Forward Pass — Single Layer
Z = X @ W + b           ← linear transform
A = relu(Z)             ← activation
Loss = -mean(y·log(ŷ))  ← cross-entropy
Backward Pass — Chain Rule
dL/dZ = dL/dA ⊙ relu'(Z)     ← activation gradient (ReLU: 1 if z>0)
dL/dW = Aᵀ_prev @ dL/dZ      ← gradient w.r.t. weights
dL/db = sum(dL/dZ, axis=0)   ← gradient w.r.t. bias
dL/dA_prev = dL/dZ @ Wᵀ      ← gradient propagated backward
neural_net_scratch.py
import numpy as np

class Layer:
    def __init__(self, in_dim, out_dim):
        # He initialization for ReLU networks
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
        self.b = np.zeros((1, out_dim))
        self.dW = self.db = None

    def forward(self, x):
        self.x = x             # cache for backward pass
        return x @ self.W + self.b

    def backward(self, d_out):
        self.dW = self.x.T @ d_out
        self.db = d_out.sum(axis=0, keepdims=True)
        return d_out @ self.W.T  # gradient to pass to previous layer

    def update(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db


class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, d_out):
        return d_out * self.mask  # kill gradient where input ≤ 0


class CrossEntropyLoss:
    def forward(self, logits, y):
        # Softmax + cross-entropy (numerically stable)
        shifted = logits - logits.max(axis=1, keepdims=True)
        exp_s = np.exp(shifted)
        self.probs = exp_s / exp_s.sum(axis=1, keepdims=True)
        self.y = y
        n = logits.shape[0]
        return -np.log(self.probs[np.arange(n), y] + 1e-9).mean()

    def backward(self):
        n = self.probs.shape[0]
        d = self.probs.copy()
        d[np.arange(n), self.y] -= 1
        return d / n


class MLP:
    def __init__(self, dims):
        # dims: e.g. [784, 256, 128, 10]
        self.layers  = [Layer(dims[i], dims[i+1]) for i in range(len(dims)-1)]
        self.relus   = [ReLU() for _ in range(len(dims)-2)]
        self.loss_fn = CrossEntropyLoss()

    def forward(self, x, y):
        a = x
        for i, layer in enumerate(self.layers):
            z = layer.forward(a)
            a = self.relus[i].forward(z) if i < len(self.relus) else z
        return self.loss_fn.forward(a, y)

    def backward(self):
        d = self.loss_fn.backward()
        for i in reversed(range(len(self.layers))):
            if i < len(self.relus):
                d = self.relus[i].backward(d)
            d = self.layers[i].backward(d)

    def step(self, lr):
        for layer in self.layers:
            layer.update(lr)

    def predict(self, x):
        a = x
        for i, layer in enumerate(self.layers):
            z = layer.forward(a)
            a = self.relus[i].forward(z) if i < len(self.relus) else z
        return a.argmax(axis=1)


# ── TRAIN ON SYNTHETIC DATA ────────────────────────────────────────────
np.random.seed(42)
# 4-class classification, 2D input
X = np.random.randn(1000, 2)
y = ((X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int))

net = MLP([2, 64, 32, 4])
lr = 0.05
batch_size = 64

for epoch in range(100):
    idx = np.random.permutation(len(X))
    total_loss = 0
    for i in range(0, len(X), batch_size):
        xb = X[idx[i:i+batch_size]]
        yb = y[idx[i:i+batch_size]]
        loss = net.forward(xb, yb)
        net.backward()
        net.step(lr)
        total_loss += loss
    if epoch % 20 == 0:
        acc = (net.predict(X) == y).mean()
        print(f"Epoch {epoch:3d} | loss: {total_loss:.3f} | acc: {acc:.3f}")

# ── NUMERICAL GRADIENT CHECK (debug tool) ─────────────────────────────
# Verify your analytical gradients match finite differences
def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy(); x_plus[idx] += eps
        x_minus = x.copy(); x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad
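To see the checker in action, here is a standalone example (it re-defines numerical_gradient so it runs on its own) verifying the analytical gradient of a simple linear-layer loss:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))

def loss_fn(W):
    return ((X @ W) ** 2).mean()        # simple scalar loss in W

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy();  x_plus[idx]  += eps
        x_minus = x.copy(); x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Analytical: L = mean(Z²) with Z = X @ W, so dL/dZ = 2Z/Z.size
# and dL/dW = X.T @ dL/dZ (same shape recipe as Layer.backward above)
Z = X @ W
analytic = X.T @ (2 * Z / Z.size)
numeric  = numerical_gradient(loss_fn, W)

rel_err = np.abs(analytic - numeric).max() / np.abs(analytic).max()
print(f"max relative error: {rel_err:.2e}")   # should be far below 1e-6
```

If the relative error is larger than about 1e-6, the analytical backward pass has a bug, not the finite-difference check.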
01
Implement backprop from scratch at least once. Watch Andrej Karpathy's micrograd video (~2 hours). After that, PyTorch's autograd will never feel like magic again.
02
Use He initialization with ReLU, Xavier with Tanh. Wrong initialization causes vanishing or exploding gradients from the very first forward pass.
03
Numerical gradient checking is your debugging superpower. If analytical and numerical gradients disagree, your backward pass has a bug. Run it on small tensors.
04
ReLU is the default activation. It avoids the vanishing gradient problem that kills sigmoid/tanh in deep networks, and it's cheap to compute and differentiate.
05
Monitor gradient norms during training. If they explode (NaN loss), use gradient clipping. If they shrink toward zero, check your initialization and architecture depth.
06
Overfit one batch first. Before training the full dataset, try to memorize 32 samples. If loss doesn't go to zero, your model, loss, or training loop has a bug.
Andrej Karpathy micrograd · CS231n backprop notes · Deep Learning book Ch. 6 · 3Blue1Brown Neural Nets
PyTorch 02 / 06

The Premier Framework
for Deep Learning

PyTorch is the dominant framework in both research and industry. Its dynamic computation graph makes debugging feel natural. Every modern LLM was trained with it or a variant.

TensorFlow was the framework of 2018. PyTorch is the framework of now — and the foreseeable future. The entire research ecosystem (HuggingFace, fast.ai, Lightning, vLLM) is built on it. Learning PyTorch properly means you can read, modify, and contribute to virtually any modern deep learning codebase.
Tensors
N-dimensional arrays with GPU support. Like NumPy arrays but differentiable and CUDA-aware. The fundamental data structure.
Autograd
Automatic differentiation. PyTorch records a computation graph and walks it backward to compute gradients automatically.
nn.Module
Base class for all models. Define layers in __init__, implement forward(). PyTorch handles backward automatically.
Optimizers
Adam, AdamW, SGD with momentum. Updates parameters using computed gradients. AdamW is the default for most DL.
DataLoader
Batches, shuffles, and loads data with multiple workers. Feeds data to the training loop efficiently.
Device (GPU)
.to("cuda") moves tensors and models to GPU for massive speedups. Always check torch.cuda.is_available().
nn.Sequential
Stack layers in order without writing forward(). Clean for simple architectures.
torch.no_grad()
Context manager that disables gradient tracking. Use during validation and inference to save memory.
1. optimizer.zero_grad()
2. pred = model(x)
3. loss = criterion(pred, y)
4. loss.backward()
5. optimizer.step()
pytorch_fundamentals.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
import numpy as np

# ── TENSORS — the basics ───────────────────────────────────────────────
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(x.shape, x.dtype, x.device)  # torch.Size([2, 2]) float32 cpu

# Move to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = x.to(device)

# Common operations (same as NumPy, but on GPU)
a = torch.randn(3, 4).to(device)
b = torch.zeros(4, 2).to(device)
c = a @ b                     # matrix multiply → (3, 2)
d = torch.cat([a, a], dim=1) # concatenate → (3, 8)

# Convert between NumPy and PyTorch
np_arr = np.random.randn(5)
t = torch.from_numpy(np_arr).float()
back = t.cpu().numpy()

# ── AUTOGRAD — automatic differentiation ──────────────────────────────
# requires_grad=True tells PyTorch to track operations on this tensor
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
y = w * x ** 2 + w ** 3  # y = w·x² + w³
y.backward()
print(w.grad)  # dy/dw = x² + 3w² = 9 + 12 = 21.0

# Context manager: no gradient tracking (for inference)
with torch.no_grad():
    inference_result = w * x  # does not build computation graph

# ── nn.MODULE — define your model ──────────────────────────────────────
class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),   # normalize before activation
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, out_dim)
        )

    def forward(self, x):
        return self.net(x)  # autograd records this graph; loss.backward() traverses it

model = MLP(20, 128, 1).to(device)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# ── DATASET + DATALOADER ──────────────────────────────────────────────
X_raw = torch.randn(2000, 20)
y_raw = (X_raw.sum(dim=1) > 0).float().unsqueeze(1)
dataset = TensorDataset(X_raw, y_raw)
train_ds, val_ds, test_ds = random_split(dataset, [1600, 200, 200])

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=0)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=0)

# ── THE TRAINING LOOP ──────────────────────────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCEWithLogitsLoss()
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val_loss = float('inf')
patience, patience_counter = 10, 0

for epoch in range(100):
    # ── Training ──
    model.train()
    train_loss = 0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()          # 1. clear gradients
        pred = model(xb)               # 2. forward pass
        loss = criterion(pred, yb)     # 3. compute loss
        loss.backward()                # 4. backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip
        optimizer.step()               # 5. update weights
        train_loss += loss.item()
    scheduler.step()

    # ── Validation ──
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            val_loss += criterion(model(xb), yb).item()

    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}"); break

    if epoch % 10 == 0:
        print(f"Ep {epoch:3d} | train: {train_loss/len(train_loader):.4f}"
              f" | val: {val_loss/len(val_loader):.4f}"
              f" | lr: {scheduler.get_last_lr()[0]:.6f}")

# ── SAVE & LOAD ────────────────────────────────────────────────────────
torch.save(model.state_dict(), 'model.pt')

model_loaded = MLP(20, 128, 1).to(device)
model_loaded.load_state_dict(torch.load('model.pt', map_location=device))
model_loaded.eval()

# ── INFERENCE ─────────────────────────────────────────────────────────
with torch.no_grad():
    test_x = torch.randn(10, 20).to(device)
    logits = model_loaded(test_x)
    probs  = torch.sigmoid(logits)
    preds  = (probs > 0.5).long()
    print("Predictions:", preds.squeeze().tolist())
01
Always call model.train() before training and model.eval() before validation. This toggles Dropout and BatchNorm behavior — forgetting it causes subtle, hard-to-diagnose bugs.
02
AdamW over Adam for most tasks. AdamW decouples weight decay from the adaptive learning rate; Adam entangles the two, which weakens the regularization. AdamW generalizes better.
03
Gradient clipping prevents explosions. clip_grad_norm_(model.parameters(), 1.0) should be in every training loop. It's especially important for RNNs and Transformers.
04
torch.compile() (PyTorch 2.0+) gives free 2-3x speedup on most models. Just wrap your model: model = torch.compile(model). One line of code.
05
Use BCEWithLogitsLoss, not BCELoss. The former combines sigmoid + BCE in one numerically stable operation; applying sigmoid and then BCELoss separately risks log(0) blow-ups.
06
Save model.state_dict(), not the whole model. Saving the whole model with pickle ties you to specific Python/PyTorch versions. state_dict is portable.
PyTorch official tutorials · fast.ai course · PyTorch Lightning · Andrej Karpathy YouTube · PyTorch docs
CNNs & Computer Vision 03 / 06

Teach Machines
to See

Convolutional Neural Networks exploit the structure of images — local patterns, translation invariance, and hierarchical features. They remain the workhorses of production vision systems.

Vision is everywhere: medical imaging (tumor detection), autonomous vehicles, quality control in manufacturing, satellite imagery analysis, face recognition, document understanding. CNNs — and their modern successor, Vision Transformers — power all of it. Transfer learning from ImageNet-pretrained models means you can hit 90%+ accuracy on custom vision tasks with just a few hundred images.
Convolution
Slide a learnable kernel (filter) across the image, computing dot products. Each kernel detects one local pattern (edge, texture, etc.).
Feature Maps
Output of a conv layer. With 64 filters you get 64 feature maps — one per learned pattern. Deeper layers detect more complex patterns.
Pooling
Max or average pooling reduces spatial dimensions. Adds translation invariance. MaxPool2d(2) halves height and width.
Receptive Field
Region of input that influences one output neuron. Grows with depth. Deeper networks see larger context.
BatchNorm
Normalizes activations per channel per batch. Dramatically stabilizes training. Place after conv, before activation.
Transfer Learning
Use a pretrained backbone (ResNet, EfficientNet, ViT). Freeze all but the last layers. Fine-tune with a small learning rate.
Data Augmentation
Random crops, flips, color jitter, rotation. Artificially expands your dataset. Free regularization — always use it.
Global Avg Pooling
Collapses spatial dimensions to 1×1. Replaces the large flatten+linear combination. Works for any input resolution.
Input (3 × 224 × 224)
→ Conv2d(3→64, k=3, p=1) + BN + ReLU → (64 × 224 × 224)
→ MaxPool2d(2) → (64 × 112 × 112)
→ Conv2d(64→128, k=3, p=1) + BN + ReLU → (128 × 112 × 112)
→ MaxPool2d(2) → (128 × 56 × 56)
→ Conv2d(128→256, k=3, p=1) + BN + ReLU → (256 × 56 × 56)
→ AdaptiveAvgPool2d(1, 1) → (256 × 1 × 1)
→ Flatten → Linear(256 → num_classes) → (num_classes,)
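The shape trace above can be checked mechanically. A standalone sketch that mirrors each listed layer and asserts the expected output shapes:

```python
import torch
import torch.nn as nn

# Mirror the architecture traced above, one entry per arrow.
layers = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),   nn.BatchNorm2d(64),  nn.ReLU()),
    nn.MaxPool2d(2),
    nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU()),
    nn.MaxPool2d(2),
    nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU()),
    nn.AdaptiveAvgPool2d((1, 1)),
])

x = torch.randn(1, 3, 224, 224)          # (batch, C, H, W)
expected = [(64, 224, 224), (64, 112, 112), (128, 112, 112),
            (128, 56, 56), (256, 56, 56), (256, 1, 1)]
for layer, shape in zip(layers, expected):
    x = layer(x)
    assert x.shape[1:] == shape          # check every intermediate shape
    print(tuple(x.shape[1:]))

logits = nn.Linear(256, 10)(x.flatten(1))   # Flatten → Linear(256 → 10)
print(logits.shape)
```

Running a dummy tensor through a new architecture like this is the fastest way to catch padding and stride mistakes before any training starts.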
cnn_vision.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as T
import torchvision.models as models
from torch.utils.data import DataLoader

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. BUILD A CNN FROM SCRATCH ────────────────────────────────────────
class ConvBlock(nn.Module):
    """Conv → BatchNorm → ReLU — the fundamental CNN building block"""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride=stride,
                      padding=kernel//2, bias=False),  # bias=False with BN
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )
    def forward(self, x): return self.block(x)


class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(3, 32),
            ConvBlock(32, 64),
            nn.MaxPool2d(2),                 # 32→16
            ConvBlock(64, 128),
            ConvBlock(128, 128),
            nn.MaxPool2d(2),                 # 16→8
            ConvBlock(128, 256),
            nn.AdaptiveAvgPool2d((1, 1)),   # any input size → 1×1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(128, num_classes)
        )

    def forward(self, x): return self.classifier(self.features(x))

model = SmallCNN(num_classes=10).to(device)
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")

# ── 2. DATA AUGMENTATION ───────────────────────────────────────────────
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),         # random crop with padding
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomRotation(10),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],   # ImageNet stats
                [0.229, 0.224, 0.225])
])
val_transform = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],
                [0.229, 0.224, 0.225])
])

train_ds = torchvision.datasets.CIFAR10('./data', train=True,
               download=True, transform=train_transform)
val_ds   = torchvision.datasets.CIFAR10('./data', train=False,
               download=True, transform=val_transform)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True,  num_workers=4, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=256, shuffle=False, num_workers=4)

# ── 3. TRANSFER LEARNING — ResNet50 ───────────────────────────────────
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Strategy 1: Freeze backbone, train only head (fast, small datasets)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 10)
)

# Strategy 2: Fine-tune all layers with differential learning rates
backbone_full = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone_full.fc = nn.Linear(backbone_full.fc.in_features, 10)
optimizer_full = optim.AdamW([
    {'params': backbone_full.layer4.parameters(), 'lr': 1e-4},
    {'params': backbone_full.layer3.parameters(), 'lr': 5e-5},
    {'params': backbone_full.fc.parameters(),     'lr': 1e-3},
], weight_decay=1e-4)

# ── 4. TRAINING WITH AMP (Automatic Mixed Precision) ──────────────────
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01,
    steps_per_epoch=len(train_loader), epochs=50)

for epoch in range(50):
    model.train()
    correct, total = 0, 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        with autocast():            # float16 for speed
            out = model(imgs)
            loss = criterion(out, labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        correct += (out.argmax(1) == labels).sum().item()
        total += labels.size(0)

    model.eval()
    val_correct = 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            val_correct += (model(imgs).argmax(1) == labels).sum().item()
    print(f"Ep {epoch:2d} | train acc: {correct/total:.3f}"
          f" | val acc: {val_correct/len(val_ds):.3f}")
01
Almost always use transfer learning. Training ResNet from scratch on CIFAR-10 takes hours. Fine-tuning ImageNet weights takes 15 minutes and gives better results.
02
Batch normalization after Conv, before ReLU. BN dramatically stabilizes training. Without it, deeper CNNs are notoriously difficult to train.
03
Data augmentation is free regularization. Use it aggressively for small datasets (<10k images). RandomCrop + HorizontalFlip alone can boost accuracy by 3-5%.
04
OneCycleLR scheduler — one cycle of warmup + decay — often outperforms other schedules. Works especially well with SGD + momentum.
05
Label smoothing (0.1) prevents overconfidence and consistently improves generalization on classification tasks. One parameter that almost always helps.
06
AMP (autocast) gives a free 1.5-2x speedup on NVIDIA GPUs. bfloat16 is more numerically stable than float16 on Ampere+ GPUs.
CS231n Stanford · fast.ai Lesson 1 · torchvision models · Papers With Code · timm library
RNNs & LSTMs 04 / 06

Model Sequences,
Time, and Memory

Recurrent networks process sequential data step by step, maintaining a hidden state. LSTMs solve the vanishing gradient problem that cripples vanilla RNNs. Essential for time series, NLP, and audio.

Even in the Transformer age, RNNs and LSTMs remain relevant: time series forecasting, real-time streaming inference, and on-device models where Transformers are too heavy. More importantly, understanding LSTMs and their gating mechanisms gives you deep intuition for why Transformers work — attention is a more powerful solution to the same problem that LSTMs partially solved.
Recurrent Connection
hₜ = f(Wxₜ + Uhₜ₋₁ + b). Hidden state carries information across timesteps. Parameters shared across all steps.
Vanishing Gradient
Gradients shrink exponentially with sequence length. RNNs can't learn long-range dependencies. The core problem LSTMs solve.
LSTM Gates
Forget gate: what to erase. Input gate: what to write. Output gate: what to read. Cell state c provides an information highway.
Cell State
The "memory conveyor belt" in LSTMs. Information flows through with minimal transformation — enables long-range gradients.
GRU
Gated Recurrent Unit — simplified LSTM. Two gates (reset, update). Comparable performance, fewer parameters.
Bidirectional
Run one RNN forward and one backward. Doubles hidden size. Useful when the full sequence is available at inference time (not streaming).
Sequence-to-Sequence
Encoder LSTM encodes input sequence. Decoder LSTM generates output sequence. Foundation of pre-Transformer NMT.
TBPTT
Truncated Backprop Through Time. Only backprop through last K steps. Prevents memory explosion on long sequences.
LSTM — All Four Gates
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)     ← forget gate: erase old memory
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)     ← input gate: allow new info
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)  ← candidate cell state
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t         ← update cell (the key!)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)     ← output gate
h_t = o_t ⊙ tanh(c_t)                   ← hidden state
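The gate equations map line by line onto code. A minimal single-timestep sketch with random placeholder weights, just to show the data flow (the lesson code below uses nn.LSTM for real work):

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 8, 16

# One weight matrix per gate, acting on the concatenated [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (torch.randn(input_dim + hidden_dim, hidden_dim)
                      for _ in range(4))
b_f = b_i = b_c = b_o = torch.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t], dim=1)        # [h_{t-1}, x_t]
    f = torch.sigmoid(z @ W_f + b_f)           # forget gate: erase old memory
    i = torch.sigmoid(z @ W_i + b_i)           # input gate: allow new info
    c_tilde = torch.tanh(z @ W_c + b_c)        # candidate cell state
    c = f * c_prev + i * c_tilde               # cell update (the key!)
    o = torch.sigmoid(z @ W_o + b_o)           # output gate
    h = o * torch.tanh(c)                      # hidden state
    return h, c

h = c = torch.zeros(4, hidden_dim)             # batch of 4
for _ in range(10):                            # unroll 10 timesteps
    h, c = lstm_step(torch.randn(4, input_dim), h, c)
print(h.shape, c.shape)
```

Notice that c flows through each step with only elementwise gating, no matrix multiply: that is the "memory conveyor belt" that lets gradients survive long sequences.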
rnn_lstm.py
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. VANILLA RNN — understand the mechanics ──────────────────────────
class VanillaRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.Wx = nn.Linear(input_dim, hidden_dim, bias=False)
        self.Wh = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        B, T, _ = x.shape
        h = torch.zeros(B, self.Wh.in_features).to(x.device)
        outputs = []
        for t in range(T):
            h = self.tanh(self.Wx(x[:, t, :]) + self.Wh(h))
            outputs.append(h)
        last_h = outputs[-1]
        return self.out(last_h)  # classify based on final hidden state


# ── 2. LSTM — PyTorch built-in ────────────────────────────────────────
class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim,
                 dropout=0.3, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional,
            batch_first=True    # (batch, seq, feature) — ALWAYS use this
        )
        d = hidden_dim * (2 if bidirectional else 1)
        self.head = nn.Sequential(
            nn.Linear(d, d // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d // 2, output_dim)
        )

    def forward(self, x):
        # out: (batch, seq, hidden*dirs), (h_n, c_n)
        out, (h_n, c_n) = self.lstm(x)
        # Use the output at the final timestep of the last layer
        last = out[:, -1, :]
        return self.head(last)


# ── 3. TIME SERIES FORECASTING ────────────────────────────────────────
class LSTMForecaster(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, forecast_horizon):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, forecast_horizon)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])   # last layer hidden state → forecast


# ── 4. SEQUENCE-TO-SEQUENCE ───────────────────────────────────────────
class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed   = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc_out  = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, (h, c) = self.encoder(self.embed(src))
        dec_out, _      = self.decoder(self.embed(tgt), (h, c))
        return self.fc_out(dec_out)


# ── 5. FULL TRAINING EXAMPLE — Sentiment Classification ───────────────
np.random.seed(42); torch.manual_seed(42)
# Synthetic feature sequences (batch, seq_len, features) standing in for embedded text
SEQ_LEN, BATCH, FEAT = 50, 64, 16
X_train = torch.randn(800, SEQ_LEN, FEAT)
y_train = torch.randint(0, 2, (800,))
X_val   = torch.randn(200, SEQ_LEN, FEAT)
y_val   = torch.randint(0, 2, (200,))

model = LSTMClassifier(FEAT, 128, 2, 2, dropout=0.3, bidirectional=True).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):
    model.train()
    for i in range(0, len(X_train), BATCH):
        xb = X_train[i:i+BATCH].to(device)
        yb = y_train[i:i+BATCH].to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_acc = (model(X_val.to(device)).argmax(1) == y_val.to(device)).float().mean()
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d} | val acc: {val_acc:.3f}")
01
Always use batch_first=True in nn.LSTM. The default (seq, batch, feature) is confusing and error-prone. batch_first=True matches what you'd expect from most data loading pipelines.
02
Gradient clipping is mandatory for RNNs. Exploding gradients are a constant threat. Always clip to 1.0. Vanishing gradients are addressed by LSTM's cell state design.
03
Bidirectional LSTMs for offline tasks only. If you're classifying a full sentence or time series known in advance, use BiLSTM. For streaming/real-time, you can only go forward.
04
GRU is a solid default when you want something between vanilla RNN and LSTM. Same performance on most tasks, fewer parameters, trains faster.
05
For long sequences, Transformers beat LSTMs. Even with the cell state, LSTMs struggle to carry information beyond a few hundred steps. Once your sequences run into the hundreds, consider switching to attention mechanisms.
06
Pack padded sequences with nn.utils.rnn.pack_padded_sequence when sequences have variable lengths. It stops the LSTM from processing padding timesteps and polluting its hidden state with them.
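Tip 06 in action: a standalone sketch of pack_padded_sequence / pad_packed_sequence with three variable-length sequences (the dimensions here are arbitrary illustration values):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(3, 5, 8)              # batch of 3, padded to length 5
lengths = torch.tensor([5, 3, 2])     # true lengths (descending, since
                                      # enforce_sorted defaults to True)
packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, (h_n, c_n) = lstm(packed)           # padding steps never processed
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)

print(out.shape)                      # padded back to (3, 5, 16)
print(out_lengths.tolist())           # [5, 3, 2]
```

Positions past each sequence's true length come back as zeros, and h_n holds the state at each sequence's real last step, not at the padded end.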
Understanding LSTMs (Olah) · PyTorch RNN tutorial · StatQuest LSTMs · The Unreasonable Effectiveness of RNNs (Karpathy)
Transformers & Attention 05 / 06

The Architecture That
Powers Modern AI

Every frontier AI system today — GPT-4, Claude, Gemini, DALL-E, Whisper — is built on the Transformer. Understanding attention is now a foundational skill, not an advanced one.

The 2017 paper "Attention Is All You Need" changed the field permanently. Before it: task-specific architectures, RNN bottlenecks, slow sequential processing. After it: one architecture rules everything — text, images, audio, video, protein structure, code. Understanding the Transformer puts you at the frontier of modern AI research and engineering.
Self-Attention
Every token attends to every other token simultaneously, so any two positions are connected in a single step (at O(n²) compute, versus an RNN's O(n)-step path). The core mechanism.
Q, K, V
Query, Key, Value. Each token produces three vectors. Attention = softmax(QKᵀ/√d) · V. Q asks, K answers, V provides content.
Multi-Head Attention
Run attention H times in parallel with different learned projections. Each head can attend to different relationship types. Concatenate and project.
Positional Encoding
Attention is permutation-invariant. Positional encodings inject position information. Sinusoidal (original) or learned (modern).
Feed-Forward Block
Two linear layers (expand → compress) with a GELU activation between them, applied after each attention block. Usually 4× expansion: 512→2048→512.
LayerNorm (Pre-LN)
Normalize per token. Modern Transformers use Pre-LN (before attention) not Post-LN. Much more stable training.
Residual Connections
x = x + sublayer(x). Critical for training depth. Gradients can flow around any layer. Without them deep Transformers don't train.
Causal Masking
Decoder-only (GPT-style): each token can only attend to previous tokens. Achieved by masking the upper triangle of the attention matrix.
Scaled Dot-Product Attention
Q = X @ W_Q      K = X @ W_K      V = X @ W_V

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
  ↑ The √d_k scaling prevents softmax saturation in high dimensions.
  ↑ Causal mask: add -∞ to the upper triangle before softmax → prob = 0.

Multi-Head: concat [head₁ … headₕ] @ W_O
  where headᵢ = Attention(Q@Wᵢ_Q, K@Wᵢ_K, V@Wᵢ_V)
Property | Encoder (BERT-style) | Decoder (GPT-style)
Attention direction | Bidirectional (all tokens see all) | Causal (only past tokens)
Training objective | Masked language modeling | Next-token prediction
Use cases | Classification, NER, embeddings | Text generation, LLMs
Examples | BERT, RoBERTa, DeBERTa | GPT-2/3/4, LLaMA, Claude
Inference | One forward pass | Autoregressive (token by token)
transformer.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── SCALED DOT-PRODUCT ATTENTION ──────────────────────────────────────
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.d_k = d_model // n_heads
        # One projection matrix for Q, K, V combined (faster)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and split into heads
        qkv = self.qkv(x).reshape(B, T, 3, self.h, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, h, T, d_k)
        Q, K, V = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        scale = math.sqrt(self.d_k)
        scores = (Q @ K.transpose(-2, -1)) / scale   # (B, h, T, T)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = self.dropout(F.softmax(scores, dim=-1))
        out  = (attn @ V).transpose(1, 2).contiguous().reshape(B, T, C)
        return self.proj(out)


# ── TRANSFORMER BLOCK (Pre-LN — modern & stable) ──────────────────────
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1  = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ln2  = nn.LayerNorm(d_model)
        self.ff   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                   # GELU outperforms ReLU in Transformers
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, mask=None):
        x = x + self.attn(self.ln1(x), mask)  # pre-LN residual
        x = x + self.ff(self.ln2(x))
        return x


# ── GPT-STYLE LANGUAGE MODEL ──────────────────────────────────────────
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers,
                 d_ff, max_seq_len, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)  # learned
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.ln_final = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying — share embedding and output weights (GPT-2 trick)
        self.head.weight = self.tok_embed.weight
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, mean=0, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Embedding):
                nn.init.normal_(m.weight, mean=0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_embed(idx) + self.pos_embed(pos))

        # Causal (autoregressive) mask — upper triangle = 0
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0).unsqueeze(0)

        for block in self.blocks:
            x = block(x, mask)
        logits = self.head(self.ln_final(x))

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            logits, _ = self(idx[:, -512:])  # keep last 512 tokens
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_tok], dim=1)
        return idx


# ── INSTANTIATE A SMALL GPT ────────────────────────────────────────────
model = GPT(
    vocab_size=50257,  # GPT-2 vocabulary
    d_model=256,
    n_heads=8,
    n_layers=6,
    d_ff=1024,
    max_seq_len=512,
    dropout=0.1
).to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Training uses exactly the same loop as PyTorch section above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4,
    betas=(0.9, 0.95), weight_decay=0.1)

# ── USING HUGGINGFACE INSTEAD (production choice) ─────────────────────
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Classify with BERT-style encoder (much easier than from scratch)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
clf_model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2)

texts = ["This movie is amazing!", "I hated every minute of it."]
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    logits = clf_model(**tokens).logits
    preds = logits.argmax(dim=-1)
print("Predictions:", [["negative", "positive"][p] for p in preds.tolist()])
01
Read "Attention Is All You Need" (2017). It's only 15 pages and surprisingly readable. The original paper explains the motivation better than most tutorials.
02
Study Karpathy's nanoGPT — ~300 lines of clean PyTorch that trains a real GPT. Reading it once is worth a week of tutorials.
03
Pre-LN (LayerNorm before attention) is more stable than the original Post-LN. Nearly all modern models use Pre-LN or RMSNorm. It also reduces how much learning-rate warmup you need.
04
Flash Attention computes attention in tiled blocks that fit in GPU SRAM, making it severalfold faster and far more memory-efficient than the naive implementation. For any real training, use F.scaled_dot_product_attention (PyTorch 2.0+), which dispatches to a fused kernel when one is available.
05
Weight tying (input embedding = output projection) reduces parameters and consistently improves perplexity in language models. Always do it for LMs.
06
For most applications, use HuggingFace rather than implementing from scratch. Understanding the architecture matters; reinventing the wheel in production doesn't.
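As a quick check that the fused kernel computes the same thing as the explicit formula, here's a sketch comparing F.scaled_dot_product_attention against a manual implementation (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, h, T, d_k = 2, 8, 128, 32
Q = torch.randn(B, h, T, d_k)
K = torch.randn(B, h, T, d_k)
V = torch.randn(B, h, T, d_k)

# Fused attention (PyTorch 2.0+). is_causal=True applies the
# upper-triangular mask internally — no explicit mask tensor needed.
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

# Equivalent manual computation, for verification
scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float('-inf'))
manual = F.softmax(scores, dim=-1) @ V

print("match:", torch.allclose(out, manual, atol=1e-4))
```

Swapping the hand-rolled attention in MultiHeadAttention above for this one call is usually the single biggest speedup available for Transformer training.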
Attention Is All You Need Illustrated Transformer Karpathy nanoGPT HuggingFace course Flash Attention paper
Training Tricks & Tuning 06 / 06

The Craft of Getting
Models to Actually Train

Knowing the architecture is only half the battle. This is the practitioner knowledge that separates people who get results from people who get NaN loss and give up. Regularization, optimizers, debugging — all of it here.

A great architecture with poor training practices underperforms a mediocre architecture with expert training. The gap between "the model doesn't train" and "the model works great" is almost always in the training loop: the wrong learning rate, missing gradient clipping, wrong regularization, or a subtle data normalization bug. These tricks are the craft that textbooks skip.
Dropout
Randomly zero out p% of neurons during training. Forces redundant representations. Use 0.1–0.5 depending on model size.
Batch Normalization
Normalize activations per batch. Allows higher learning rates, reduces sensitivity to init. Use after linear/conv, before activation.
Layer Normalization
Normalize per sample across features. Used in Transformers. Independent of batch size — works even at batch size 1.
Weight Decay
L2-style penalty that discourages large weights. Use AdamW (wd 0.01–0.1), which decouples weight decay from the adaptive update, rather than Adam's coupled (and flawed) version.
Learning Rate Warmup
Linearly ramp LR from 0 to target over first N steps. Prevents large early updates from destabilizing training.
Cosine Annealing
LR follows cosine curve from max to min. Smooth decay. CosineAnnealingLR or OneCycleLR in PyTorch.
Gradient Accumulation
Accumulate gradients over N mini-batches before stepping. Simulates larger batch size without extra GPU memory.
Mixed Precision (AMP)
Use float16/bfloat16 for forward/backward, float32 for optimizer updates. 2x speedup, 2x memory savings.
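To make the BatchNorm-vs-LayerNorm distinction concrete, a quick sketch (default affine parameters are identity at init, so the outputs are pure normalizations):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 64)          # (batch, features)

bn = nn.BatchNorm1d(64)          # per-feature stats, computed across the batch
ln = nn.LayerNorm(64)            # per-sample stats, computed across features

y_bn = bn(x)                     # train mode: normalizes with batch statistics
y_ln = ln(x)

# Each *column* of the BN output has ~zero mean...
print(y_bn.mean(dim=0).abs().max())
# ...while each *row* of the LN output has ~zero mean.
print(y_ln.mean(dim=1).abs().max())
```

This is why LayerNorm works at any batch size (including 1 at inference) while BatchNorm degrades with small batches.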
Optimizer | Best For | Key Params | Notes
SGD + Momentum | CNNs, ResNets with careful tuning | lr=0.1, momentum=0.9 | Best final accuracy with proper schedule; hard to tune
Adam | Transformers, MLPs, general use | lr=3e-4, β=(0.9, 0.999) | Adaptive. Flawed weight decay — use AdamW instead
AdamW | Transformers, LLMs, default choice | lr=3e-4, wd=0.01–0.1 | Decoupled weight decay. The modern standard
Lion | Large-model fine-tuning | lr=3e-5, wd=1.0 | Uses only the sign of the gradient. Memory efficient
Muon | Language model pretraining | — | Orthogonalization-based. Recent SOTA for LM pretraining
Learning rate range test: Start very small (1e-7) and increase exponentially over one epoch. Plot loss vs LR. The optimal LR is just before where loss stops decreasing. This "LR finder" technique (from Smith 2015) should be your first step on any new model.
training_tricks.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
import numpy as np
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. REGULARIZATION TECHNIQUES ──────────────────────────────────────
class RegularizedMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),          # BN before activation
            nn.GELU(),
            nn.Dropout(0.3),               # dropout after activation
            nn.Linear(hidden, hidden // 2),
            nn.LayerNorm(hidden // 2),      # or use LayerNorm
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(hidden // 2, out_dim)
        )
    def forward(self, x): return self.net(x)

# ── 2. LEARNING RATE WARMUP + COSINE DECAY ────────────────────────────
def get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps, min_lr_ratio=0.1):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return max(min_lr_ratio, cosine)            # cosine decay
    return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = RegularizedMLP(20, 128, 2).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05,
                        betas=(0.9, 0.999), eps=1e-8)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, warmup_steps=100, total_steps=1000)

# ── 3. GRADIENT ACCUMULATION (simulate large batch on small GPU) ───────
ACCUM_STEPS = 4   # effective batch = batch_size × ACCUM_STEPS
criterion = nn.CrossEntropyLoss()

model.train()
optimizer.zero_grad()

for step, (xb, yb) in enumerate(train_loader):  # assume train_loader exists
    xb, yb = xb.to(device), yb.to(device)
    loss = criterion(model(xb), yb) / ACCUM_STEPS  # scale loss
    loss.backward()

    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# ── 4. MIXED PRECISION (AMP) ──────────────────────────────────────────
# GradScaler guards against float16 gradient underflow; with bfloat16
# (used below) it's optional, since bfloat16 shares float32's exponent range.
scaler = GradScaler()

for xb, yb in train_loader:
    xb, yb = xb.to(device), yb.to(device)
    optimizer.zero_grad()

    with autocast(dtype=torch.bfloat16):   # bfloat16 on Ampere+
        pred = model(xb)
        loss = criterion(pred, yb)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

# ── 5. LEARNING RATE FINDER ────────────────────────────────────────────
def lr_finder(model, optimizer, criterion, train_loader,
              start_lr=1e-7, end_lr=10, num_iter=100):
    lrs, losses = [], []
    mult = (end_lr / start_lr) ** (1 / num_iter)
    lr = start_lr
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    model.train()
    for i, (xb, yb) in enumerate(train_loader):
        if i >= num_iter: break
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        lrs.append(lr); losses.append(loss.item())
        lr *= mult
        for pg in optimizer.param_groups: pg['lr'] = lr
        if loss.item() > 4 * min(losses): break  # stop if diverging
    return lrs, losses

# Plot: steepest descent region → your optimal LR

# ── 6. DEBUGGING TOOLKIT ──────────────────────────────────────────────
def diagnose_model(model, x_sample, y_sample, criterion):
    """Quick sanity checks before training."""
    model.eval()
    with torch.no_grad():
        out = model(x_sample.to(device))

    # 1. Check output shape
    print(f"Output shape: {out.shape}")

    # 2. Check initial loss matches random baseline
    loss = criterion(out, y_sample.to(device))
    n_classes = out.shape[-1]
    expected = -math.log(1.0 / n_classes)
    print(f"Initial loss: {loss:.3f} (expected ~{expected:.3f} for {n_classes} classes)")

    # 3. Check gradient flow
    model.train()
    out = model(x_sample.to(device))
    loss = criterion(out, y_sample.to(device))
    loss.backward()
    grads = [(n, p.grad.abs().mean().item())
             for n, p in model.named_parameters()
             if p.grad is not None]
    print("Gradient norms:")
    for name, gnorm in grads[-5:]:
        print(f"  {name:40s} {gnorm:.6f}")

    # 4. Overfit one batch
    print("\nOverfitting 1 batch:")
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    for i in range(100):
        optimizer.zero_grad()
        loss = criterion(model(x_sample.to(device)), y_sample.to(device))
        loss.backward()
        optimizer.step()
        if i % 20 == 0: print(f"  step {i}: {loss.item():.4f}")
    # Should reach near-zero. If not: model, loss, or loop has a bug.

# ── 7. COMMON NaN DEBUGGING ────────────────────────────────────────────
def add_nan_hooks(model):
    """Register hooks that print which layer produced NaN."""
    def hook(module, inp, out, name=''):
        if isinstance(out, torch.Tensor) and torch.isnan(out).any():
            print(f"NaN detected in {name} output!")
    for name, layer in model.named_modules():
        layer.register_forward_hook(
            lambda m, i, o, n=name: hook(m, i, o, n))

# ── 8. EMA — Exponential Moving Average (improves eval performance) ────
class EMA:
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {k: v.clone().detach()
                       for k, v in model.state_dict().items()}

    def update(self):
        with torch.no_grad():
            for k, v in self.model.state_dict().items():
                self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

    def apply_shadow(self, model):
        model.load_state_dict(self.shadow)

ema = EMA(model, decay=0.9999)
# Call ema.update() after each optimizer.step() during training
# Use ema.apply_shadow(eval_model) for evaluation
01
Run the overfit-one-batch test before any long training run. If loss doesn't reach near-zero on 32 samples in 100 steps, something is fundamentally broken. Find it now, not after 8 hours.
02
Initial loss reveals bugs immediately. A randomly initialized model should produce -log(1/C) loss where C = num_classes. If it's far off, your model architecture or loss function has a bug.
03
Log gradient norms per layer. Norms should be similar across layers. If early layers have near-zero gradients → vanishing. If any layer spikes → exploding. Fix with clipping or LR adjustment.
04
Warmup + cosine annealing is the safest default schedule. Warmup for 5-10% of total steps, then cosine decay to 10% of peak LR. Works across architectures and tasks.
05
EMA weights consistently outperform final weights at evaluation. Keep an exponential moving average of model parameters and use that for inference. One line of code, measurable gain.
06
Large batch → large LR (linear scaling rule). If you 4× the batch size, 4× the learning rate. Use warmup when scaling LR to avoid instability in early training.
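The linear scaling rule and the warmup + cosine schedule from the tips above can be sketched as plain functions (base values are illustrative):

```python
import math

def scaled_lr(batch_size, base_lr=3e-4, base_batch=32):
    """Linear scaling rule: grow LR proportionally with batch size."""
    return base_lr * batch_size / base_batch

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(min_ratio, 0.5 * (1 + math.cos(math.pi * progress)))

peak = scaled_lr(128)                                  # 4x the batch → 4x the LR
print(peak)
print(lr_at_step(50, peak, warmup_steps=100, total_steps=1000))    # mid-warmup: half of peak
print(lr_at_step(1000, peak, warmup_steps=100, total_steps=1000))  # floor: 0.1 * peak
```

The same curve is what `get_cosine_schedule_with_warmup` in training_tricks.py produces, expressed there as a multiplier on the optimizer's base LR.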
Loss is NaN from step 1: Check for log(0), division by zero, wrong dtype.
Loss explodes after N steps: Add gradient clipping. Lower LR. Increase weight decay.
Loss doesn't decrease at all: Check LR (too small?), data normalization, gradient flow.
Train low, val high (overfit): Add dropout, weight decay, data augmentation, reduce model size.
Both high (underfit): Larger model, more training steps, lower weight decay, higher LR.
Loss oscillates wildly: Reduce LR. Add LR warmup. Check for bad batches in dataset.
🎉 Phase 3 Complete! You now understand how neural networks learn, how to build and train CNNs, RNNs, and Transformers in PyTorch, and how to diagnose and fix training failures. Phase 4 (NLP & LLMs) builds directly on everything here — you're ready.
fast.ai Part 2 PyTorch profiler Deep Learning tuning playbook (Google) Andrej Karpathy — Training NNs Weights & Biases guides