Phase 3 of 6 · ML & AI Roadmap

Deep
Learning

Where the real magic begins. Classical ML hits a wall with images, text, audio, and sequences. Deep learning breaks through it — neural networks learn hierarchical representations that no human could hand-craft. This phase takes you from a single neuron to the Transformer architecture that powers GPT-4 and Claude.

01 / 06
Neural Networks & Backprop
Build and train a network from scratch. Understand every gradient.
02 / 06
PyTorch
The dominant deep learning framework. Tensors, autograd, training loop.
03 / 06
CNNs & Computer Vision
Convolutions, pooling, transfer learning, data augmentation.
04 / 06
RNNs & LSTMs
Sequential data, vanishing gradients, gating mechanisms.
05 / 06
Transformers & Attention
The architecture powering every modern LLM and vision model.
06 / 06
Training Tricks & Tuning
Regularization, optimizers, schedulers, debugging training failures.
Before starting Phase 3, you should know: Python, NumPy vectorized operations, gradient descent conceptually (from Phase 1 Calculus), and how to use scikit-learn's fit/predict API. You don't need to know PyTorch yet — that's Topic 2 here. But comfort with matrix multiplication and derivatives is essential.
Week 1
Backprop
+ PyTorch basics
Week 2–3
CNNs
+ RNNs / LSTMs
Week 4–5
Transformers
+ Training tricks
Image + Text Classifier: Train a CNN on CIFAR-10 (>90% accuracy with transfer learning), then build a sentiment classifier with a Transformer encoder on IMDb. Both must use proper train/val/test splits, early stopping, and a learning rate scheduler. Deploy both as FastAPI endpoints.
Neural Networks 01 / 06

Build Networks from Scratch
& Understand Every Gradient

Before you reach for PyTorch, implement backpropagation by hand. Once you understand why gradients flow the way they do, every training failure becomes diagnosable.

Most people jump straight to high-level frameworks and treat training as a black box. Then they can't debug why their model diverges, why gradients vanish, or why loss spikes after epoch 10. Implementing backprop from scratch — even once — gives you a mental model that pays dividends for years. You'll understand what loss.backward() actually does, and why.
Neuron
z = Wx + b, a = activation(z). Linear transform followed by non-linearity. The atomic unit of a neural network.
Layers
Stack of neurons. Input → hidden → output. Each layer applies a linear transform then an activation function.
Activations
ReLU: max(0,z). Sigmoid: 1/(1+e⁻ᶻ). Tanh. GELU. Non-linearities let networks learn complex functions. ReLU is default.
Forward Pass
Data flows input → output through successive matrix multiplications and activations. Produces prediction.
Loss Function
MSE for regression, cross-entropy for classification. Scalar measure of how wrong the prediction is.
Backward Pass
Chain rule applied layer by layer from output to input. Computes ∂L/∂w for every weight.
Weight Update
w = w − α · ∂L/∂w. All weights nudged to reduce loss. α is the learning rate.
Initialization
He init for ReLU (√2/n), Xavier for Tanh. Wrong init → gradients vanish or explode immediately.
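To make the initialization point concrete, here is a small standalone NumPy sketch (not part of the lesson file) comparing He scaling against a naive small-constant scale across a 20-layer ReLU stack:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(scale_fn, depth=20, width=256):
    """Push data through `depth` Linear+ReLU layers, return final activation std."""
    x = rng.standard_normal((128, width))
    for _ in range(depth):
        W = scale_fn(width) * rng.standard_normal((width, width))
        x = np.maximum(0.0, x @ W)       # linear transform + ReLU
    return x.std()

he_std    = forward_stats(lambda n: np.sqrt(2.0 / n))   # He init
naive_std = forward_stats(lambda n: 0.01)               # naive constant scale

print(f"He init activation std:    {he_std:.4f}")
print(f"naive init activation std: {naive_std:.2e}")    # collapses toward 0
```

With He scaling the activation scale survives all 20 layers; with the naive scale it shrinks by roughly 10x per layer, which is exactly the vanishing-signal failure the Initialization card warns about.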
Forward Pass — Single Layer
Z = X @ W + b           ← linear transform
A = relu(Z)             ← activation
Loss = -mean(y·log(ŷ))  ← cross-entropy
Backward Pass — Chain Rule
dL/dZ = dL/dA ⊙ relu'(Z)     ← activation gradient (ReLU: 1 if z>0)
dL/dW = Aᵀ_prev @ dL/dZ      ← gradient w.r.t. weights
dL/db = sum(dL/dZ, axis=0)   ← gradient w.r.t. bias
dL/dA_prev = dL/dZ @ Wᵀ      ← gradient propagated backward
neural_net_scratch.py
import numpy as np

class Layer:
    def __init__(self, in_dim, out_dim):
        # He initialization for ReLU networks
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
        self.b = np.zeros((1, out_dim))
        self.dW = self.db = None

    def forward(self, x):
        self.x = x             # cache for backward pass
        return x @ self.W + self.b

    def backward(self, d_out):
        self.dW = self.x.T @ d_out
        self.db = d_out.sum(axis=0, keepdims=True)
        return d_out @ self.W.T  # gradient to pass to previous layer

    def update(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db


class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, d_out):
        return d_out * self.mask  # kill gradient where input ≤ 0


class CrossEntropyLoss:
    def forward(self, logits, y):
        # Softmax + cross-entropy (numerically stable)
        shifted = logits - logits.max(axis=1, keepdims=True)
        exp_s = np.exp(shifted)
        self.probs = exp_s / exp_s.sum(axis=1, keepdims=True)
        self.y = y
        n = logits.shape[0]
        return -np.log(self.probs[np.arange(n), y] + 1e-9).mean()

    def backward(self):
        n = self.probs.shape[0]
        d = self.probs.copy()
        d[np.arange(n), self.y] -= 1
        return d / n


class MLP:
    def __init__(self, dims):
        # dims: e.g. [784, 256, 128, 10]
        self.layers  = [Layer(dims[i], dims[i+1]) for i in range(len(dims)-1)]
        self.relus   = [ReLU() for _ in range(len(dims)-2)]
        self.loss_fn = CrossEntropyLoss()

    def forward(self, x, y):
        a = x
        for i, layer in enumerate(self.layers):
            z = layer.forward(a)
            a = self.relus[i].forward(z) if i < len(self.relus) else z
        return self.loss_fn.forward(a, y)

    def backward(self):
        d = self.loss_fn.backward()
        for i in reversed(range(len(self.layers))):
            if i < len(self.relus):
                d = self.relus[i].backward(d)
            d = self.layers[i].backward(d)

    def step(self, lr):
        for layer in self.layers:
            layer.update(lr)

    def predict(self, x):
        a = x
        for i, layer in enumerate(self.layers):
            z = layer.forward(a)
            a = self.relus[i].forward(z) if i < len(self.relus) else z
        return a.argmax(axis=1)


# ── TRAIN ON SYNTHETIC DATA ────────────────────────────────────────────
np.random.seed(42)
# 4-class classification, 2D input
X = np.random.randn(1000, 2)
y = ((X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int))

net = MLP([2, 64, 32, 4])
lr = 0.05
batch_size = 64

for epoch in range(100):
    idx = np.random.permutation(len(X))
    total_loss = 0
    for i in range(0, len(X), batch_size):
        xb = X[idx[i:i+batch_size]]
        yb = y[idx[i:i+batch_size]]
        loss = net.forward(xb, yb)
        net.backward()
        net.step(lr)
        total_loss += loss
    if epoch % 20 == 0:
        acc = (net.predict(X) == y).mean()
        print(f"Epoch {epoch:3d} | loss: {total_loss:.3f} | acc: {acc:.3f}")

# ── NUMERICAL GRADIENT CHECK (debug tool) ─────────────────────────────
# Verify your analytical gradients match finite differences
def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy(); x_plus[idx] += eps
        x_minus = x.copy(); x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad
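To see the checker in action, here is a standalone example (it re-defines numerical_gradient so it runs on its own) verifying the analytical gradient of a simple linear-layer loss:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))

def loss_fn(W):
    return ((X @ W) ** 2).mean()        # simple scalar loss in W

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy();  x_plus[idx]  += eps
        x_minus = x.copy(); x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Analytical: L = mean(Z²) with Z = X @ W, so dL/dZ = 2Z/Z.size
# and dL/dW = X.T @ dL/dZ (same shape recipe as Layer.backward above)
Z = X @ W
analytic = X.T @ (2 * Z / Z.size)
numeric  = numerical_gradient(loss_fn, W)

rel_err = np.abs(analytic - numeric).max() / np.abs(analytic).max()
print(f"max relative error: {rel_err:.2e}")   # should be far below 1e-6
```

If the relative error is larger than about 1e-6, the analytical backward pass has a bug, not the finite-difference check.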
01
Implement backprop from scratch at least once. Watch Andrej Karpathy's micrograd video (~2 hours). After that, PyTorch's autograd will never feel like magic again.
02
Use He initialization with ReLU, Xavier with Tanh. Wrong initialization causes vanishing or exploding gradients from the very first forward pass.
03
Numerical gradient checking is your debugging superpower. If analytical and numerical gradients disagree, your backward pass has a bug. Run it on small tensors.
04
ReLU is the default activation. It avoids the vanishing gradient problem that kills sigmoid/tanh in deep networks, and it's cheap to compute and differentiate.
05
Monitor gradient norms during training. If they explode (NaN loss), use gradient clipping. If they shrink toward zero, check your initialization and architecture depth.
06
Overfit one batch first. Before training the full dataset, try to memorize 32 samples. If loss doesn't go to zero, your model, loss, or training loop has a bug.
Andrej Karpathy micrograd · CS231n backprop notes · Deep Learning book Ch. 6 · 3Blue1Brown Neural Nets
PyTorch 02 / 06

The Premier Framework
for Deep Learning

PyTorch is the dominant framework in both research and industry. Its dynamic computation graph makes debugging feel natural. Every modern LLM was trained with it or a variant.

TensorFlow was the framework of 2018. PyTorch is the framework of now — and the foreseeable future. The entire research ecosystem (HuggingFace, fast.ai, Lightning, vLLM) is built on it. Learning PyTorch properly means you can read, modify, and contribute to virtually any modern deep learning codebase.
Tensors
N-dimensional arrays with GPU support. Like NumPy arrays but differentiable and CUDA-aware. The fundamental data structure.
Autograd
Automatic differentiation. PyTorch records a computation graph and walks it backward to compute gradients automatically.
nn.Module
Base class for all models. Define layers in __init__, implement forward(). PyTorch handles backward automatically.
Optimizers
Adam, AdamW, SGD with momentum. Updates parameters using computed gradients. AdamW is the default for most DL.
DataLoader
Batches, shuffles, and loads data with multiple workers. Feeds data to the training loop efficiently.
Device (GPU)
.to("cuda") moves tensors and models to GPU for massive speedups. Always check torch.cuda.is_available().
nn.Sequential
Stack layers in order without writing forward(). Clean for simple architectures.
torch.no_grad()
Context manager that disables gradient tracking. Use during validation and inference to save memory.
1. optimizer.zero_grad()
2. pred = model(x)
3. loss = criterion(pred, y)
4. loss.backward()
5. optimizer.step()
pytorch_fundamentals.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
import numpy as np

# ── TENSORS — the basics ───────────────────────────────────────────────
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(x.shape, x.dtype, x.device)  # torch.Size([2, 2]) float32 cpu

# Move to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = x.to(device)

# Common operations (same as NumPy, but on GPU)
a = torch.randn(3, 4).to(device)
b = torch.zeros(4, 2).to(device)
c = a @ b                     # matrix multiply → (3, 2)
d = torch.cat([a, a], dim=1) # concatenate → (3, 8)

# Convert between NumPy and PyTorch
np_arr = np.random.randn(5)
t = torch.from_numpy(np_arr).float()
back = t.cpu().numpy()

# ── AUTOGRAD — automatic differentiation ──────────────────────────────
# requires_grad=True tells PyTorch to track operations on this tensor
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
y = w * x ** 2 + w ** 3  # y = w·x² + w³
y.backward()
print(w.grad)  # dy/dw = x² + 3w² = 9 + 12 = 21.0

# Context manager: no gradient tracking (for inference)
with torch.no_grad():
    inference_result = w * x  # does not build computation graph

# ── nn.MODULE — define your model ──────────────────────────────────────
class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),   # normalize before activation
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, out_dim)
        )

    def forward(self, x):
        return self.net(x)  # autograd records this graph; loss.backward() traverses it

model = MLP(20, 128, 1).to(device)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# ── DATASET + DATALOADER ──────────────────────────────────────────────
X_raw = torch.randn(2000, 20)
y_raw = (X_raw.sum(dim=1) > 0).float().unsqueeze(1)
dataset = TensorDataset(X_raw, y_raw)
train_ds, val_ds, test_ds = random_split(dataset, [1600, 200, 200])

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=0)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=0)

# ── THE TRAINING LOOP ──────────────────────────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCEWithLogitsLoss()
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val_loss = float('inf')
patience, patience_counter = 10, 0

for epoch in range(100):
    # ── Training ──
    model.train()
    train_loss = 0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()          # 1. clear gradients
        pred = model(xb)               # 2. forward pass
        loss = criterion(pred, yb)     # 3. compute loss
        loss.backward()                # 4. backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip
        optimizer.step()               # 5. update weights
        train_loss += loss.item()
    scheduler.step()

    # ── Validation ──
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            val_loss += criterion(model(xb), yb).item()

    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}"); break

    if epoch % 10 == 0:
        print(f"Ep {epoch:3d} | train: {train_loss/len(train_loader):.4f}"
              f" | val: {val_loss/len(val_loader):.4f}"
              f" | lr: {scheduler.get_last_lr()[0]:.6f}")

# ── SAVE & LOAD ────────────────────────────────────────────────────────
torch.save(model.state_dict(), 'model.pt')

model_loaded = MLP(20, 128, 1).to(device)
model_loaded.load_state_dict(torch.load('model.pt', map_location=device))
model_loaded.eval()

# ── INFERENCE ─────────────────────────────────────────────────────────
with torch.no_grad():
    test_x = torch.randn(10, 20).to(device)
    logits = model_loaded(test_x)
    probs  = torch.sigmoid(logits)
    preds  = (probs > 0.5).long()
    print("Predictions:", preds.squeeze().tolist())
01
Always call model.train() before training and model.eval() before validation. This toggles Dropout and BatchNorm behavior — forgetting it causes subtle, hard-to-diagnose bugs.
02
AdamW over Adam for most tasks. AdamW decouples weight decay from the adaptive learning rate; Adam entangles the two, which weakens the regularization. AdamW generalizes better.
03
Gradient clipping prevents explosions. clip_grad_norm_(model.parameters(), 1.0) should be in every training loop. It's especially important for RNNs and Transformers.
04
torch.compile() (PyTorch 2.0+) gives free 2-3x speedup on most models. Just wrap your model: model = torch.compile(model). One line of code.
05
Use BCEWithLogitsLoss, not BCELoss. The former combines sigmoid + BCE in one numerically stable operation; applying sigmoid and then BCELoss separately risks log(0) blow-ups.
06
Save model.state_dict(), not the whole model. Saving the whole model with pickle ties you to specific Python/PyTorch versions. state_dict is portable.
PyTorch official tutorials · fast.ai course · PyTorch Lightning · Andrej Karpathy YouTube · PyTorch docs
CNNs & Computer Vision 03 / 06

Teach Machines
to See

Convolutional Neural Networks exploit the structure of images — local patterns, translation invariance, and hierarchical features. They remain the workhorses of production vision systems.

Vision is everywhere: medical imaging (tumor detection), autonomous vehicles, quality control in manufacturing, satellite imagery analysis, face recognition, document understanding. CNNs — and their modern successor, Vision Transformers — power all of it. Transfer learning from ImageNet-pretrained models means you can hit 90%+ accuracy on custom vision tasks with just a few hundred images.
Convolution
Slide a learnable kernel (filter) across the image, computing dot products. Each kernel detects one local pattern (edge, texture, etc.).
Feature Maps
Output of a conv layer. With 64 filters you get 64 feature maps — one per learned pattern. Deeper layers detect more complex patterns.
Pooling
Max or average pooling reduces spatial dimensions. Adds translation invariance. MaxPool2d(2) halves height and width.
Receptive Field
Region of input that influences one output neuron. Grows with depth. Deeper networks see larger context.
BatchNorm
Normalizes activations per channel per batch. Dramatically stabilizes training. Place after conv, before activation.
Transfer Learning
Use a pretrained backbone (ResNet, EfficientNet, ViT). Freeze all but the last layers. Fine-tune with a small learning rate.
Data Augmentation
Random crops, flips, color jitter, rotation. Artificially expands your dataset. Free regularization — always use it.
Global Avg Pooling
Collapses spatial dimensions to 1×1. Replaces the large flatten+linear combination. Works for any input resolution.
Input (3 × 224 × 224)
→ Conv2d(3→64, k=3, p=1) + BN + ReLU → (64 × 224 × 224)
→ MaxPool2d(2) → (64 × 112 × 112)
→ Conv2d(64→128, k=3, p=1) + BN + ReLU → (128 × 112 × 112)
→ MaxPool2d(2) → (128 × 56 × 56)
→ Conv2d(128→256, k=3, p=1) + BN + ReLU → (256 × 56 × 56)
→ AdaptiveAvgPool2d(1, 1) → (256 × 1 × 1)
→ Flatten → Linear(256 → num_classes) → (num_classes,)
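The shape trace above can be checked mechanically. A standalone sketch that mirrors each listed layer and asserts the expected output shapes:

```python
import torch
import torch.nn as nn

# Mirror the architecture traced above, one entry per arrow.
layers = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),   nn.BatchNorm2d(64),  nn.ReLU()),
    nn.MaxPool2d(2),
    nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU()),
    nn.MaxPool2d(2),
    nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU()),
    nn.AdaptiveAvgPool2d((1, 1)),
])

x = torch.randn(1, 3, 224, 224)          # (batch, C, H, W)
expected = [(64, 224, 224), (64, 112, 112), (128, 112, 112),
            (128, 56, 56), (256, 56, 56), (256, 1, 1)]
for layer, shape in zip(layers, expected):
    x = layer(x)
    assert x.shape[1:] == shape          # check every intermediate shape
    print(tuple(x.shape[1:]))

logits = nn.Linear(256, 10)(x.flatten(1))   # Flatten → Linear(256 → 10)
print(logits.shape)
```

Running a dummy tensor through a new architecture like this is the fastest way to catch padding and stride mistakes before any training starts.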
cnn_vision.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as T
import torchvision.models as models
from torch.utils.data import DataLoader

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. BUILD A CNN FROM SCRATCH ────────────────────────────────────────
class ConvBlock(nn.Module):
    """Conv → BatchNorm → ReLU — the fundamental CNN building block"""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride=stride,
                      padding=kernel//2, bias=False),  # bias=False with BN
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )
    def forward(self, x): return self.block(x)


class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(3, 32),
            ConvBlock(32, 64),
            nn.MaxPool2d(2),                 # 32→16
            ConvBlock(64, 128),
            ConvBlock(128, 128),
            nn.MaxPool2d(2),                 # 16→8
            ConvBlock(128, 256),
            nn.AdaptiveAvgPool2d((1, 1)),   # any input size → 1×1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(128, num_classes)
        )

    def forward(self, x): return self.classifier(self.features(x))

model = SmallCNN(num_classes=10).to(device)
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")

# ── 2. DATA AUGMENTATION ───────────────────────────────────────────────
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),         # random crop with padding
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomRotation(10),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],   # ImageNet stats
                [0.229, 0.224, 0.225])
])
val_transform = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406],
                [0.229, 0.224, 0.225])
])

train_ds = torchvision.datasets.CIFAR10('./data', train=True,
               download=True, transform=train_transform)
val_ds   = torchvision.datasets.CIFAR10('./data', train=False,
               download=True, transform=val_transform)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True,  num_workers=4, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=256, shuffle=False, num_workers=4)

# ── 3. TRANSFER LEARNING — ResNet50 ───────────────────────────────────
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Strategy 1: Freeze backbone, train only head (fast, small datasets)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 10)
)

# Strategy 2: Fine-tune all layers with differential learning rates
backbone_full = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone_full.fc = nn.Linear(backbone_full.fc.in_features, 10)
optimizer_full = optim.AdamW([
    {'params': backbone_full.layer4.parameters(), 'lr': 1e-4},
    {'params': backbone_full.layer3.parameters(), 'lr': 5e-5},
    {'params': backbone_full.fc.parameters(),     'lr': 1e-3},
], weight_decay=1e-4)

# ── 4. TRAINING WITH AMP (Automatic Mixed Precision) ──────────────────
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01,
    steps_per_epoch=len(train_loader), epochs=50)

for epoch in range(50):
    model.train()
    correct, total = 0, 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        with autocast():            # float16 for speed
            out = model(imgs)
            loss = criterion(out, labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        correct += (out.argmax(1) == labels).sum().item()
        total += labels.size(0)

    model.eval()
    val_correct = 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            val_correct += (model(imgs).argmax(1) == labels).sum().item()
    print(f"Ep {epoch:2d} | train acc: {correct/total:.3f}"
          f" | val acc: {val_correct/len(val_ds):.3f}")
01
Almost always use transfer learning. Training ResNet from scratch on CIFAR-10 takes hours. Fine-tuning ImageNet weights takes 15 minutes and gives better results.
02
Batch normalization after Conv, before ReLU. BN dramatically stabilizes training. Without it, deeper CNNs are notoriously difficult to train.
03
Data augmentation is free regularization. Use it aggressively for small datasets (<10k images). RandomCrop + HorizontalFlip alone can boost accuracy by 3-5%.
04
OneCycleLR scheduler — one cycle of warmup + decay — often outperforms other schedules. Works especially well with SGD + momentum.
05
Label smoothing (0.1) prevents overconfidence and consistently improves generalization on classification tasks. One parameter that almost always helps.
06
AMP (autocast) gives a free 1.5-2x speedup on NVIDIA GPUs. bfloat16 is more numerically stable than float16 on Ampere+ GPUs.
CS231n Stanford · fast.ai Lesson 1 · torchvision models · Papers With Code · timm library
RNNs & LSTMs 04 / 06

Model Sequences,
Time, and Memory

Recurrent networks process sequential data step by step, maintaining a hidden state. LSTMs solve the vanishing gradient problem that cripples vanilla RNNs. Essential for time series, NLP, and audio.

Even in the Transformer age, RNNs and LSTMs remain relevant: time series forecasting, real-time streaming inference, and on-device models where Transformers are too heavy. More importantly, understanding LSTMs and their gating mechanisms gives you deep intuition for why Transformers work — attention is a more powerful solution to the same problem that LSTMs partially solved.
Recurrent Connection
hₜ = f(Wxₜ + Uhₜ₋₁ + b). Hidden state carries information across timesteps. Parameters shared across all steps.
Vanishing Gradient
Gradients shrink exponentially with sequence length. RNNs can't learn long-range dependencies. The core problem LSTMs solve.
LSTM Gates
Forget gate: what to erase. Input gate: what to write. Output gate: what to read. Cell state c provides an information highway.
Cell State
The "memory conveyor belt" in LSTMs. Information flows through with minimal transformation — enables long-range gradients.
GRU
Gated Recurrent Unit — simplified LSTM. Two gates (reset, update). Comparable performance, fewer parameters.
Bidirectional
Run one RNN forward and one backward. Doubles hidden size. Useful when the full sequence is available at inference time (not streaming).
Sequence-to-Sequence
Encoder LSTM encodes input sequence. Decoder LSTM generates output sequence. Foundation of pre-Transformer NMT.
TBPTT
Truncated Backprop Through Time. Only backprop through last K steps. Prevents memory explosion on long sequences.
LSTM — All Four Gates
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)     ← forget gate: erase old memory
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)     ← input gate: allow new info
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)  ← candidate cell state
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t         ← update cell (the key!)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)     ← output gate
h_t = o_t ⊙ tanh(c_t)                   ← hidden state
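The gate equations map line by line onto code. A minimal single-timestep sketch with random placeholder weights, just to show the data flow (the lesson code below uses nn.LSTM for real work):

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 8, 16

# One weight matrix per gate, acting on the concatenated [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (torch.randn(input_dim + hidden_dim, hidden_dim)
                      for _ in range(4))
b_f = b_i = b_c = b_o = torch.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t], dim=1)        # [h_{t-1}, x_t]
    f = torch.sigmoid(z @ W_f + b_f)           # forget gate: erase old memory
    i = torch.sigmoid(z @ W_i + b_i)           # input gate: allow new info
    c_tilde = torch.tanh(z @ W_c + b_c)        # candidate cell state
    c = f * c_prev + i * c_tilde               # cell update (the key!)
    o = torch.sigmoid(z @ W_o + b_o)           # output gate
    h = o * torch.tanh(c)                      # hidden state
    return h, c

h = c = torch.zeros(4, hidden_dim)             # batch of 4
for _ in range(10):                            # unroll 10 timesteps
    h, c = lstm_step(torch.randn(4, input_dim), h, c)
print(h.shape, c.shape)
```

Notice that c flows through each step with only elementwise gating, no matrix multiply: that is the "memory conveyor belt" that lets gradients survive long sequences.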
rnn_lstm.py
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. VANILLA RNN — understand the mechanics ──────────────────────────
class VanillaRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.Wx = nn.Linear(input_dim, hidden_dim, bias=False)
        self.Wh = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        B, T, _ = x.shape
        h = torch.zeros(B, self.Wh.in_features).to(x.device)
        outputs = []
        for t in range(T):
            h = self.tanh(self.Wx(x[:, t, :]) + self.Wh(h))
            outputs.append(h)
        last_h = outputs[-1]
        return self.out(last_h)  # classify based on final hidden state


# ── 2. LSTM — PyTorch built-in ────────────────────────────────────────
class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim,
                 dropout=0.3, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional,
            batch_first=True    # (batch, seq, feature) — ALWAYS use this
        )
        d = hidden_dim * (2 if bidirectional else 1)
        self.head = nn.Sequential(
            nn.Linear(d, d // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d // 2, output_dim)
        )

    def forward(self, x):
        # out: (batch, seq, hidden*dirs), (h_n, c_n)
        out, (h_n, c_n) = self.lstm(x)
        # Use the output at the final timestep of the last layer
        last = out[:, -1, :]
        return self.head(last)


# ── 3. TIME SERIES FORECASTING ────────────────────────────────────────
class LSTMForecaster(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, forecast_horizon):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, forecast_horizon)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])   # last layer hidden state → forecast


# ── 4. SEQUENCE-TO-SEQUENCE ───────────────────────────────────────────
class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed   = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc_out  = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, (h, c) = self.encoder(self.embed(src))
        dec_out, _      = self.decoder(self.embed(tgt), (h, c))
        return self.fc_out(dec_out)


# ── 5. FULL TRAINING EXAMPLE — Sentiment Classification ───────────────
np.random.seed(42); torch.manual_seed(42)
# Synthetic feature sequences (batch, seq_len, features) standing in for embedded text
SEQ_LEN, BATCH, FEAT = 50, 64, 16
X_train = torch.randn(800, SEQ_LEN, FEAT)
y_train = torch.randint(0, 2, (800,))
X_val   = torch.randn(200, SEQ_LEN, FEAT)
y_val   = torch.randint(0, 2, (200,))

model = LSTMClassifier(FEAT, 128, 2, 2, dropout=0.3, bidirectional=True).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):
    model.train()
    for i in range(0, len(X_train), BATCH):
        xb = X_train[i:i+BATCH].to(device)
        yb = y_train[i:i+BATCH].to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_acc = (model(X_val.to(device)).argmax(1) == y_val.to(device)).float().mean()
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d} | val acc: {val_acc:.3f}")
01
Always use batch_first=True in nn.LSTM. The default (seq, batch, feature) is confusing and error-prone. batch_first=True matches what you'd expect from most data loading pipelines.
02
Gradient clipping is mandatory for RNNs. Exploding gradients are a constant threat. Always clip to 1.0. Vanishing gradients are addressed by LSTM's cell state design.
03
Bidirectional LSTMs for offline tasks only. If you're classifying a full sentence or time series known in advance, use BiLSTM. For streaming/real-time, you can only go forward.
04
GRU is a solid default when you want something between vanilla RNN and LSTM. Same performance on most tasks, fewer parameters, trains faster.
05
For long sequences, Transformers beat LSTMs. Even with the cell state, LSTMs struggle to carry information beyond a few hundred steps. Once your sequences run into the hundreds, consider switching to attention mechanisms.
06
Pack padded sequences with nn.utils.rnn.pack_padded_sequence when sequences have variable lengths. It stops the LSTM from processing padding timesteps and polluting its hidden state with them.
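Tip 06 in action: a standalone sketch of pack_padded_sequence / pad_packed_sequence with three variable-length sequences (the dimensions here are arbitrary illustration values):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(3, 5, 8)              # batch of 3, padded to length 5
lengths = torch.tensor([5, 3, 2])     # true lengths (descending, since
                                      # enforce_sorted defaults to True)
packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, (h_n, c_n) = lstm(packed)           # padding steps never processed
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)

print(out.shape)                      # padded back to (3, 5, 16)
print(out_lengths.tolist())           # [5, 3, 2]
```

Positions past each sequence's true length come back as zeros, and h_n holds the state at each sequence's real last step, not at the padded end.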
Understanding LSTMs (Olah) · PyTorch RNN tutorial · StatQuest LSTMs · The Unreasonable Effectiveness of RNNs (Karpathy)
Transformers & Attention 05 / 06

The Architecture That
Powers Modern AI

Every frontier AI system today — GPT-4, Claude, Gemini, DALL-E, Whisper — is built on the Transformer. Understanding attention is now a foundational skill, not an advanced one.

The 2017 paper "Attention Is All You Need" changed the field permanently. Before it: task-specific architectures, RNN bottlenecks, slow sequential processing. After it: one architecture rules everything — text, images, audio, video, protein structure, code. Understanding the Transformer puts you at the frontier of modern AI research and engineering.
Self-Attention
Every token attends to every other token simultaneously, so any two positions are connected in a single step (at O(n²) compute, versus an RNN's O(n)-step path). The core mechanism.
Q, K, V
Query, Key, Value. Each token produces three vectors. Attention = softmax(QKᵀ/√d) · V. Q asks, K answers, V provides content.
Multi-Head Attention
Run attention H times in parallel with different learned projections. Each head can attend to different relationship types. Concatenate and project.
Positional Encoding
Attention is permutation-invariant. Positional encodings inject position information. Sinusoidal (original) or learned (modern).
Feed-Forward Block
Two linear layers (expand → compress) with a GELU activation between them, applied after each attention block. Usually 4× expansion: 512→2048→512.
LayerNorm (Pre-LN)
Normalize per token. Modern Transformers use Pre-LN (before attention) not Post-LN. Much more stable training.
Residual Connections
x = x + sublayer(x). Critical for training depth. Gradients can flow around any layer. Without them deep Transformers don't train.
Causal Masking
Decoder-only (GPT-style): each token can only attend to previous tokens. Achieved by masking the upper triangle of the attention matrix.
Scaled Dot-Product Attention
Q = X @ W_Q      K = X @ W_K      V = X @ W_V

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
  ↑ The √d_k scaling prevents softmax saturation in high dimensions.
  ↑ Causal mask: add -∞ to the upper triangle before softmax → prob = 0.

Multi-Head: concat [head₁ … headₕ] @ W_O
  where headᵢ = Attention(Q@Wᵢ_Q, K@Wᵢ_K, V@Wᵢ_V)
Property | Encoder (BERT-style) | Decoder (GPT-style)
Attention direction | Bidirectional (all tokens see all) | Causal (only past tokens)
Training objective | Masked language modeling | Next-token prediction
Use cases | Classification, NER, embeddings | Text generation, LLMs
Examples | BERT, RoBERTa, DeBERTa | GPT-2/3/4, LLaMA, Claude
Inference | One forward pass | Autoregressive (token by token)
transformer.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── SCALED DOT-PRODUCT ATTENTION ──────────────────────────────────────
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.d_k = d_model // n_heads
        # One projection matrix for Q, K, V combined (faster)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and split into heads
        qkv = self.qkv(x).reshape(B, T, 3, self.h, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, h, T, d_k)
        Q, K, V = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        scale = math.sqrt(self.d_k)
        scores = (Q @ K.transpose(-2, -1)) / scale   # (B, h, T, T)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = self.dropout(F.softmax(scores, dim=-1))
        out  = (attn @ V).transpose(1, 2).contiguous().reshape(B, T, C)
        return self.proj(out)


# ── TRANSFORMER BLOCK (Pre-LN — modern & stable) ──────────────────────
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1  = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ln2  = nn.LayerNorm(d_model)
        self.ff   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                   # GELU outperforms ReLU in Transformers
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, mask=None):
        x = x + self.attn(self.ln1(x), mask)  # pre-LN residual
        x = x + self.ff(self.ln2(x))
        return x


# ── GPT-STYLE LANGUAGE MODEL ──────────────────────────────────────────
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers,
                 d_ff, max_seq_len, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)  # learned
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.ln_final = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying — share embedding and output weights (GPT-2 trick)
        self.head.weight = self.tok_embed.weight
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, mean=0, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Embedding):
                nn.init.normal_(m.weight, mean=0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_embed(idx) + self.pos_embed(pos))

        # Causal (autoregressive) mask — upper triangle = 0
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0).unsqueeze(0)

        for block in self.blocks:
            x = block(x, mask)
        logits = self.head(self.ln_final(x))

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            logits, _ = self(idx[:, -512:])  # keep last 512 tokens
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_tok], dim=1)
        return idx


# ── INSTANTIATE A SMALL GPT ────────────────────────────────────────────
model = GPT(
    vocab_size=50257,  # GPT-2 vocabulary
    d_model=256,
    n_heads=8,
    n_layers=6,
    d_ff=1024,
    max_seq_len=512,
    dropout=0.1
).to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Training uses exactly the same loop as PyTorch section above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4,
    betas=(0.9, 0.95), weight_decay=0.1)

# ── USING HUGGINGFACE INSTEAD (production choice) ─────────────────────
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Classify with BERT-style encoder (much easier than from scratch)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
clf_model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2)

texts = ["This movie is amazing!", "I hated every minute of it."]
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    logits = clf_model(**tokens).logits
    preds = logits.argmax(dim=-1)
print("Predictions:", [["negative", "positive"][p] for p in preds.tolist()])
01
Read "Attention Is All You Need" (2017). It's only 15 pages and surprisingly readable. The original paper explains the motivation better than most tutorials.
02
Study Karpathy's nanoGPT — ~300 lines of clean PyTorch that trains a real GPT. Reading it once is worth a week of tutorials.
03
Pre-LN (LayerNorm before attention) is more stable than the original Post-LN. Nearly all modern models use Pre-LN or RMSNorm. It also reduces how much learning-rate warmup you need.
04
Flash Attention computes attention in tiled blocks that fit in GPU SRAM, making it severalfold faster and far more memory-efficient than the naive implementation. For any real training, use F.scaled_dot_product_attention (PyTorch 2.0+), which dispatches to a fused kernel when one is available.
05
Weight tying (input embedding = output projection) reduces parameters and consistently improves perplexity in language models. Always do it for LMs.
06
For most applications, use HuggingFace rather than implementing from scratch. Understanding the architecture matters; reinventing the wheel in production doesn't.
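As a quick check that the fused kernel computes the same thing as the explicit formula, here's a sketch comparing F.scaled_dot_product_attention against a manual implementation (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, h, T, d_k = 2, 8, 128, 32
Q = torch.randn(B, h, T, d_k)
K = torch.randn(B, h, T, d_k)
V = torch.randn(B, h, T, d_k)

# Fused attention (PyTorch 2.0+). is_causal=True applies the
# upper-triangular mask internally — no explicit mask tensor needed.
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

# Equivalent manual computation, for verification
scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float('-inf'))
manual = F.softmax(scores, dim=-1) @ V

print("match:", torch.allclose(out, manual, atol=1e-4))
```

Swapping the hand-rolled attention in MultiHeadAttention above for this one call is usually the single biggest speedup available for Transformer training.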
Attention Is All You Need Illustrated Transformer Karpathy nanoGPT HuggingFace course Flash Attention paper
Training Tricks & Tuning 06 / 06

The Craft of Getting
Models to Actually Train

Knowing the architecture is only half the battle. This is the practitioner knowledge that separates people who get results from people who get NaN loss and give up. Regularization, optimizers, debugging — all of it here.

A great architecture with poor training practices underperforms a mediocre architecture with expert training. The gap between "the model doesn't train" and "the model works great" is almost always in the training loop: the wrong learning rate, missing gradient clipping, wrong regularization, or a subtle data normalization bug. These tricks are the craft that textbooks skip.
Dropout
Randomly zero out p% of neurons during training. Forces redundant representations. Use 0.1–0.5 depending on model size.
Batch Normalization
Normalize activations per batch. Allows higher learning rates, reduces sensitivity to init. Use after linear/conv, before activation.
Layer Normalization
Normalize per sample across features. Used in Transformers. Independent of batch size — works even at batch size 1.
Weight Decay
L2-style penalty that discourages large weights. Use AdamW (wd 0.01–0.1), which decouples weight decay from the adaptive update, rather than Adam's coupled (and flawed) version.
Learning Rate Warmup
Linearly ramp LR from 0 to target over first N steps. Prevents large early updates from destabilizing training.
Cosine Annealing
LR follows cosine curve from max to min. Smooth decay. CosineAnnealingLR or OneCycleLR in PyTorch.
Gradient Accumulation
Accumulate gradients over N mini-batches before stepping. Simulates larger batch size without extra GPU memory.
Mixed Precision (AMP)
Use float16/bfloat16 for forward/backward, float32 for optimizer updates. 2x speedup, 2x memory savings.
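To make the BatchNorm-vs-LayerNorm distinction concrete, a quick sketch (default affine parameters are identity at init, so the outputs are pure normalizations):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 64)          # (batch, features)

bn = nn.BatchNorm1d(64)          # per-feature stats, computed across the batch
ln = nn.LayerNorm(64)            # per-sample stats, computed across features

y_bn = bn(x)                     # train mode: normalizes with batch statistics
y_ln = ln(x)

# Each *column* of the BN output has ~zero mean...
print(y_bn.mean(dim=0).abs().max())
# ...while each *row* of the LN output has ~zero mean.
print(y_ln.mean(dim=1).abs().max())
```

This is why LayerNorm works at any batch size (including 1 at inference) while BatchNorm degrades with small batches.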
Optimizer | Best For | Key Params | Notes
SGD + Momentum | CNNs, ResNets with careful tuning | lr=0.1, momentum=0.9 | Best final accuracy with proper schedule; hard to tune
Adam | Transformers, MLPs, general use | lr=3e-4, β=(0.9, 0.999) | Adaptive. Flawed weight decay — use AdamW instead
AdamW | Transformers, LLMs, default choice | lr=3e-4, wd=0.01–0.1 | Decoupled weight decay. The modern standard
Lion | Large-model fine-tuning | lr=3e-5, wd=1.0 | Uses only the sign of the gradient. Memory efficient
Muon | Language model pretraining | — | Orthogonalization-based. Recent SOTA for LM pretraining
Learning rate range test: Start very small (1e-7) and increase exponentially over one epoch. Plot loss vs LR. The optimal LR is just before where loss stops decreasing. This "LR finder" technique (from Smith 2015) should be your first step on any new model.
training_tricks.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
import numpy as np
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. REGULARIZATION TECHNIQUES ──────────────────────────────────────
class RegularizedMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),          # BN before activation
            nn.GELU(),
            nn.Dropout(0.3),               # dropout after activation
            nn.Linear(hidden, hidden // 2),
            nn.LayerNorm(hidden // 2),      # or use LayerNorm
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(hidden // 2, out_dim)
        )
    def forward(self, x): return self.net(x)

# ── 2. LEARNING RATE WARMUP + COSINE DECAY ────────────────────────────
def get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps, min_lr_ratio=0.1):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return max(min_lr_ratio, cosine)            # cosine decay
    return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = RegularizedMLP(20, 128, 2).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05,
                        betas=(0.9, 0.999), eps=1e-8)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, warmup_steps=100, total_steps=1000)

# ── 3. GRADIENT ACCUMULATION (simulate large batch on small GPU) ───────
ACCUM_STEPS = 4   # effective batch = batch_size × ACCUM_STEPS
criterion = nn.CrossEntropyLoss()

model.train()
optimizer.zero_grad()

for step, (xb, yb) in enumerate(train_loader):  # assume train_loader exists
    xb, yb = xb.to(device), yb.to(device)
    loss = criterion(model(xb), yb) / ACCUM_STEPS  # scale loss
    loss.backward()

    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# ── 4. MIXED PRECISION (AMP) ──────────────────────────────────────────
# GradScaler guards against float16 gradient underflow; with bfloat16
# (used below) it's optional, since bfloat16 shares float32's exponent range.
scaler = GradScaler()

for xb, yb in train_loader:
    xb, yb = xb.to(device), yb.to(device)
    optimizer.zero_grad()

    with autocast(dtype=torch.bfloat16):   # bfloat16 on Ampere+
        pred = model(xb)
        loss = criterion(pred, yb)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

# ── 5. LEARNING RATE FINDER ────────────────────────────────────────────
def lr_finder(model, optimizer, criterion, train_loader,
              start_lr=1e-7, end_lr=10, num_iter=100):
    lrs, losses = [], []
    mult = (end_lr / start_lr) ** (1 / num_iter)
    lr = start_lr
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    model.train()
    for i, (xb, yb) in enumerate(train_loader):
        if i >= num_iter: break
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        lrs.append(lr); losses.append(loss.item())
        lr *= mult
        for pg in optimizer.param_groups: pg['lr'] = lr
        if loss.item() > 4 * min(losses): break  # stop if diverging
    return lrs, losses

# Plot: steepest descent region → your optimal LR

# ── 6. DEBUGGING TOOLKIT ──────────────────────────────────────────────
def diagnose_model(model, x_sample, y_sample, criterion):
    """Quick sanity checks before training."""
    model.eval()
    with torch.no_grad():
        out = model(x_sample.to(device))

    # 1. Check output shape
    print(f"Output shape: {out.shape}")

    # 2. Check initial loss matches random baseline
    loss = criterion(out, y_sample.to(device))
    n_classes = out.shape[-1]
    expected = -math.log(1.0 / n_classes)
    print(f"Initial loss: {loss:.3f} (expected ~{expected:.3f} for {n_classes} classes)")

    # 3. Check gradient flow
    model.train()
    out = model(x_sample.to(device))
    loss = criterion(out, y_sample.to(device))
    loss.backward()
    grads = [(n, p.grad.abs().mean().item())
             for n, p in model.named_parameters()
             if p.grad is not None]
    print("Gradient norms:")
    for name, gnorm in grads[-5:]:
        print(f"  {name:40s} {gnorm:.6f}")

    # 4. Overfit one batch
    print("\nOverfitting 1 batch:")
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    for i in range(100):
        optimizer.zero_grad()
        loss = criterion(model(x_sample.to(device)), y_sample.to(device))
        loss.backward()
        optimizer.step()
        if i % 20 == 0: print(f"  step {i}: {loss.item():.4f}")
    # Should reach near-zero. If not: model, loss, or loop has a bug.

# ── 7. COMMON NaN DEBUGGING ────────────────────────────────────────────
def add_nan_hooks(model):
    """Register hooks that print which layer produced NaN."""
    def hook(module, inp, out, name=''):
        if isinstance(out, torch.Tensor) and torch.isnan(out).any():
            print(f"NaN detected in {name} output!")
    for name, layer in model.named_modules():
        layer.register_forward_hook(
            lambda m, i, o, n=name: hook(m, i, o, n))

# ── 8. EMA — Exponential Moving Average (improves eval performance) ────
class EMA:
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {k: v.clone().detach()
                       for k, v in model.state_dict().items()}

    def update(self):
        with torch.no_grad():
            for k, v in self.model.state_dict().items():
                self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

    def apply_shadow(self, model):
        model.load_state_dict(self.shadow)

ema = EMA(model, decay=0.9999)
# Call ema.update() after each optimizer.step() during training
# Use ema.apply_shadow(eval_model) for evaluation
01
Run the overfit-one-batch test before any long training run. If loss doesn't reach near-zero on 32 samples in 100 steps, something is fundamentally broken. Find it now, not after 8 hours.
02
Initial loss reveals bugs immediately. A randomly initialized model should produce -log(1/C) loss where C = num_classes. If it's far off, your model architecture or loss function has a bug.
03
Log gradient norms per layer. Norms should be similar across layers. If early layers have near-zero gradients → vanishing. If any layer spikes → exploding. Fix with clipping or LR adjustment.
04
Warmup + cosine annealing is the safest default schedule. Warmup for 5-10% of total steps, then cosine decay to 10% of peak LR. Works across architectures and tasks.
05
EMA weights consistently outperform final weights at evaluation. Keep an exponential moving average of model parameters and use that for inference. One line of code, measurable gain.
06
Large batch → large LR (linear scaling rule). If you 4× the batch size, 4× the learning rate. Use warmup when scaling LR to avoid instability in early training.
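The linear scaling rule and the warmup + cosine schedule from the tips above can be sketched as plain functions (base values are illustrative):

```python
import math

def scaled_lr(batch_size, base_lr=3e-4, base_batch=32):
    """Linear scaling rule: grow LR proportionally with batch size."""
    return base_lr * batch_size / base_batch

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(min_ratio, 0.5 * (1 + math.cos(math.pi * progress)))

peak = scaled_lr(128)                                  # 4x the batch → 4x the LR
print(peak)
print(lr_at_step(50, peak, warmup_steps=100, total_steps=1000))    # mid-warmup: half of peak
print(lr_at_step(1000, peak, warmup_steps=100, total_steps=1000))  # floor: 0.1 * peak
```

The same curve is what `get_cosine_schedule_with_warmup` in training_tricks.py produces, expressed there as a multiplier on the optimizer's base LR.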
Loss is NaN from step 1: Check for log(0), division by zero, wrong dtype.
Loss explodes after N steps: Add gradient clipping. Lower LR. Increase weight decay.
Loss doesn't decrease at all: Check LR (too small?), data normalization, gradient flow.
Train low, val high (overfit): Add dropout, weight decay, data augmentation, reduce model size.
Both high (underfit): Larger model, more training steps, lower weight decay, higher LR.
Loss oscillates wildly: Reduce LR. Add LR warmup. Check for bad batches in dataset.
🎉 Phase 3 Complete! You now understand how neural networks learn, how to build and train CNNs, RNNs, and Transformers in PyTorch, and how to diagnose and fix training failures. Phase 4 (NLP & LLMs) builds directly on everything here — you're ready.
fast.ai Part 2 PyTorch profiler Deep Learning tuning playbook (Google) Andrej Karpathy — Training NNs Weights & Biases guides