Phase 4 of 6 · ML & AI Roadmap

NLP &
Large Language Models

Language is the frontier. This phase takes you from converting text to numbers all the way to fine-tuning billion-parameter models and building production RAG pipelines. You'll understand the stack that powers GPT-4, Claude, and every modern language AI — from tokenization to LoRA to agents.

01 / 05
Text & Embeddings
Tokenization, BPE, word vectors, sentence embeddings, semantic search, FAISS
02 / 05
BERT & Fine-tuning
Masked LM pretraining, the CLS token, classification, NER, QA with HuggingFace
03 / 05
GPT & Text Generation
Causal LMs, autoregressive decoding, temperature, sampling strategies, streaming
04 / 05
Fine-tuning & PEFT
Full fine-tuning, LoRA, QLoRA, instruction tuning, SFT, DPO, RLHF
05 / 05
RAG & LLM Agents
Retrieval-Augmented Generation, chunking strategies, reranking, tool use, ReAct
Before starting Phase 4, you should know: The Transformer architecture (self-attention, Q/K/V, multi-head attention, positional encoding), PyTorch's training loop and nn.Module, and basic text preprocessing. Phase 3 Topic 4 (Transformers) is the direct prerequisite for everything here — revisit it if attention mechanics feel unclear.
Week 1
Text &
Embeddings
Week 2
BERT &
Fine-tuning
Week 3
GPT &
Generation
Week 4
PEFT &
LoRA
Week 5
RAG &
Agents
End-to-End LLM Application: Fine-tune a DistilBERT classifier for a domain-specific task (e.g. legal clause classification, medical triage, customer intent). Then build a RAG system over a PDF corpus that answers questions using that classifier to route to the right context. Deploy both as a single FastAPI service. Target F1 > 0.88 on the classification task and RAGAS faithfulness > 0.85.
Text & Embeddings 01 / 05

Represent Language
as Mathematical Objects

All NLP begins with one question: how do you turn text into numbers a model can reason about? Modern embeddings encode meaning so precisely that arithmetic in vector space mirrors relationships in language.

Before any model can process language, text must become vectors. The quality of those vectors determines everything downstream. Modern embedding models like all-MiniLM and text-embedding-3 encode semantic meaning so faithfully that you can build semantic search, clustering, deduplication, anomaly detection, and recommendation systems purely from embeddings, before writing a single line of fine-tuning code. Vector databases and embeddings are now core infrastructure for every LLM application.
Tokenization
Split text into tokens before encoding. Word-level, character-level, or subword (most common). Vocabulary size is a key design choice.
BPE
Byte-Pair Encoding: iteratively merge the most frequent character pair. Used by GPT-2, GPT-4, LLaMA. Handles unknown words by subword splitting.
WordPiece
BERT's tokenizer. Merges based on maximizing likelihood. Unknown words split at ## boundaries: "playing" → ["play", "##ing"].
Word2Vec
Learn word vectors by predicting context (Skip-gram) or predicting center word from context (CBOW). Enabled "king − man + woman = queen".
Sentence Embeddings
Encode an entire sentence or paragraph into a fixed-size dense vector. Used for semantic search, clustering, and cross-lingual retrieval.
Cosine Similarity
Measures the angle between vectors. 1 = identical direction, 0 = orthogonal, -1 = opposite. Standard metric for embedding similarity.
Vector Databases
FAISS, Chroma, Pinecone, Weaviate. Index millions of embeddings for approximate nearest-neighbor (ANN) search in milliseconds.
Contextual Embeddings
Unlike Word2Vec, BERT-style models produce different vectors for the same word in different contexts: "bank" (river) vs "bank" (finance).
Raw text → Tokenizer → Token IDs → Transformer encoder → Token embeddings
→ Mean pooling (or CLS) → Sentence embedding (768d or 384d)
→ L2 normalize → FAISS / vector DB → Approximate nearest-neighbor search

Query: "What causes inflation?" → encode → find top-K most similar docs
BPE — How It Builds the Vocabulary
Start:  character vocabulary {"h","e","l","o","w","r","d",...}
Corpus: "hello world hello" → count adjacent pairs
Most frequent pair: ("l","l") → merge → "ll"
Repeat until vocab_size reached (e.g. ~50,000 merges for GPT-2)
Result: common words are single tokens; rare words split into subwords
        "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]
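The merge loop above can be sketched in plain Python. This is a toy illustration of the algorithm, not the optimized implementation in the `tokenizers` library; ties between equally frequent pairs are broken by insertion order here.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus (toy illustration)."""
    # Each word starts as a tuple of single characters
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # Apply the merge to every word in the corpus
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

print(bpe_merges(["hello", "world", "hello"], 3))
```

Real tokenizers run this over gigabytes of text; the learned merge list is then applied greedily to new input.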
text_embeddings.py
import numpy as np
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import faiss
import torch

# ── 1. TOKENIZATION — understand what happens before encoding ──────────
# BPE tokenizer (GPT-style)
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "The Transformer architecture changed NLP forever."

tokens = gpt_tokenizer.tokenize(text)
ids = gpt_tokenizer.encode(text)
print("GPT-2 tokens:", tokens)
# ['The', 'ĠTransformer', 'Ġarchitecture', 'Ġchanged', 'ĠNLP', 'Ġforever', '.']
print("Token IDs:", ids)
print("Decoded back:", gpt_tokenizer.decode(ids))

# WordPiece tokenizer (BERT-style)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_tokens = bert_tokenizer.tokenize("unbelievable transformations")
print("BERT tokens:", bert_tokens)
# ['un', '##believable', 'transform', '##ations']

# Special tokens: [CLS], [SEP], [PAD], [MASK]
encoded = bert_tokenizer(
    "Hello world",
    return_tensors='pt',
    padding=True,
    truncation=True,
    max_length=128
)
print("Input IDs:", encoded['input_ids'])
print("Attention mask:", encoded['attention_mask'])
# attention_mask: 1 = real token, 0 = padding (ignored in attention)

# Count tokens before sending to an API — critical for cost control
from functools import lru_cache

@lru_cache(maxsize=8)
def _get_tokenizer(name: str):
    return AutoTokenizer.from_pretrained(name)   # loading is slow — cache it

def count_tokens(text: str, model: str = 'gpt2') -> int:
    return len(_get_tokenizer(model).encode(text))

print(f"Token count: {count_tokens('Hello, how are you today?')}")

# ── 2. SENTENCE EMBEDDINGS ────────────────────────────────────────────
# all-MiniLM-L6-v2: fast, good quality, 384 dims — the practical default
embedder = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is transforming industries.",
    "Artificial intelligence reshapes the business world.",
    "I love hiking through mountain trails.",
    "Deep learning requires large amounts of data.",
    "The weather is beautiful today.",
]
embeddings = embedder.encode(sentences, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# Cosine similarity (dot product of normalized vectors)
def cosine_sim(a, b):
    return float(np.dot(a, b))   # works if vectors already L2-normalized

print(f"ML vs AI similarity:   {cosine_sim(embeddings[0], embeddings[1]):.3f}")
# ~0.85 — semantically similar
print(f"ML vs hiking:          {cosine_sim(embeddings[0], embeddings[2]):.3f}")
# ~0.12 — semantically different
print(f"ML vs deep learning:   {cosine_sim(embeddings[0], embeddings[3]):.3f}")
# ~0.72 — related

# Batch pairwise similarity matrix
sim_matrix = embeddings @ embeddings.T
print("Similarity matrix:\n", sim_matrix.round(2))

# ── 3. SEMANTIC SEARCH WITH FAISS ─────────────────────────────────────
# Build a corpus of documents
corpus = [
    "BERT is a bidirectional Transformer encoder for NLP tasks.",
    "GPT uses a causal decoder to generate text autoregressively.",
    "LoRA reduces fine-tuning parameters by injecting low-rank matrices.",
    "Attention mechanisms allow models to weigh token importance.",
    "Vector databases store embeddings for fast similarity retrieval.",
    "Gradient descent minimizes loss by following the negative gradient.",
    "Convolutional networks extract spatial features from images.",
    "Reinforcement learning trains agents via reward signals.",
    "Tokenization converts raw text into integer IDs for models.",
    "Transformers replaced RNNs as the dominant sequence model.",
]

# Encode corpus
corpus_embeds = embedder.encode(corpus, normalize_embeddings=True)
corpus_embeds = corpus_embeds.astype(np.float32)

# Build FAISS index (Inner Product = cosine sim for normalized vecs)
dim = corpus_embeds.shape[1]   # 384
index = faiss.IndexFlatIP(dim)
index.add(corpus_embeds)
print(f"FAISS index: {index.ntotal} vectors of dim {dim}")

# Semantic search
def semantic_search(query: str, k: int = 3):
    q_embed = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    scores, indices = index.search(q_embed, k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        results.append({'doc': corpus[idx], 'score': float(score)})
    return results

results = semantic_search("how do transformers process sequences?")
for r in results:
    print(f"  [{r['score']:.3f}] {r['doc'][:60]}...")

# ── 4. FAISS AT SCALE — IVF index for millions of vectors ─────────────
# Flat index is exact but O(N). For >100K vectors, use approximate search.
n_clusters = 100
quantizer = faiss.IndexFlatIP(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, n_clusters,
                                faiss.METRIC_INNER_PRODUCT)
ivf_index.train(corpus_embeds)    # must train before adding
ivf_index.add(corpus_embeds)
ivf_index.nprobe = 10            # check 10 clusters (speed vs accuracy)
# nprobe=10 gives ~95% recall of exact search, 10x faster

# ── 5. CROSS-ENCODER RERANKING — improve top-K quality ────────────────
from sentence_transformers import CrossEncoder

# Bi-encoder (fast): use for retrieval top-K
# Cross-encoder (slow but accurate): use to rerank top-K
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "how do transformers process sequences?"
candidates = [r['doc'] for r in results]

# Cross-encoder scores the query-document pair jointly
rerank_scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(rerank_scores, candidates), reverse=True)
print("\nReranked results:")
for score, doc in reranked:
    print(f"  [{score:.3f}] {doc[:60]}...")

# ── 6. WORD2VEC — understand the foundations ──────────────────────────
from gensim.models import Word2Vec

sentences_w2v = [
    ['machine', 'learning', 'is', 'transforming', 'industries'],
    ['deep', 'learning', 'requires', 'large', 'data'],
    ['neural', 'networks', 'learn', 'representations'],
    ['transformers', 'use', 'attention', 'mechanisms'],
]
model_w2v = Word2Vec(sentences_w2v, vector_size=64, window=3,
                     min_count=1, workers=4, epochs=100)

vec = model_w2v.wv['learning']          # 64-dim vector
similar = model_w2v.wv.most_similar('learning', topn=3)
print("Most similar to 'learning':", similar)
Model                   Dims   Speed     Quality    Best For
all-MiniLM-L6-v2        384    Fast      Good       Semantic search, default choice
all-mpnet-base-v2       768    Medium    Very good  High-quality retrieval
text-embedding-3-small  1536   API call  Excellent  Production OpenAI apps
nomic-embed-text        768    Medium    Excellent  Local, long-context (8192 tokens)
bge-m3                  1024   Medium    SOTA       Multilingual, dense + sparse hybrid
01
Always L2-normalize embeddings before storing in FAISS. Then dot product equals cosine similarity — cheaper and numerically identical. Set normalize_embeddings=True in SentenceTransformer.
02
Two-stage retrieval beats one-stage. Use a fast bi-encoder to fetch top-50, then a cross-encoder reranker to score those 50 against the query. 10-15% NDCG improvement, minimal latency cost.
03
Token counting before API calls prevents surprises. A 100-page PDF can be 80,000 tokens. Count with the tokenizer before chunking to avoid truncation and unexpected cost.
04
FAISS IndexFlatIP for accuracy, IndexIVFFlat for scale. Below 100K vectors, exact search is fast enough. Above 1M vectors, use IVF with nprobe=10–50 for the accuracy/speed tradeoff.
05
Embed queries and documents with the same model. Mixing models breaks the embedding space alignment. If you switch embedding models, re-embed your entire corpus.
06
Matryoshka embeddings (e.g. nomic-embed) can be truncated to fewer dimensions with minimal quality loss. You can store 256-dim and retrieve 768-dim only for top candidates — saves memory.
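Mechanically, Matryoshka truncation is just "keep the leading dimensions, then re-normalize". The numpy sketch below shows only that mechanics on random vectors; the quality claim in tip 06 holds only for models actually trained with Matryoshka-style losses (e.g. nomic-embed).

```python
import numpy as np

def truncate_embeddings(embeds: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize to unit length.
    Only meaningful for Matryoshka-trained models, whose leading
    dimensions carry most of the information."""
    truncated = embeds[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for real embeddings: random unit-normalized 768-d vectors
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_embeddings(full, 256)
print(small.shape)                    # (4, 256)
print(np.linalg.norm(small, axis=1))  # all ~1.0 — ready for cosine / inner-product search
```

A common pattern: store and search the 256-d version, then re-score the top candidates with the full 768-d vectors.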
MTEB Leaderboard · Sentence-Transformers docs · FAISS wiki · The Illustrated Word2Vec · HuggingFace NLP course Ch.1
BERT & Fine-tuning 02 / 05

Bidirectional Understanding
for NLP Tasks

BERT and its variants are the workhorses of NLP understanding tasks: classification, named entity recognition, and question answering. Fine-tuning a pretrained model takes hours instead of weeks and consistently outperforms training from scratch.

Before BERT (2018), every NLP task required a custom architecture trained from scratch. BERT changed that: one pretrained model, adapted by fine-tuning, achieved state-of-the-art results on 11 NLP benchmarks simultaneously. The pattern — pretrain on massive unlabeled text, fine-tune on labeled task data — is now the standard approach for virtually every NLP problem in production. RoBERTa, DeBERTa, and ALBERT are BERT variants used daily in industry.
Masked Language Modeling
BERT's pretraining task: randomly mask 15% of tokens, predict them. Forces bidirectional context — the model must use both left and right context.
Next Sentence Prediction
Predict if sentence B follows sentence A. Helps BERT understand relationships between sentences. Later shown to be less important than MLM.
[CLS] Token
Special classification token prepended to every input. Its final hidden state aggregates the whole sequence — used as input to the classification head.
Fine-tuning
Continue training all pretrained weights on your labeled dataset. Use a very small LR (2e-5 to 5e-5). Only a few epochs needed (2–5).
Classification Head
A single linear layer on top of [CLS]. Maps 768-dim to num_classes. This is the only new weight — everything else is pretrained.
Named Entity Recognition
Label each token (B-PER, I-PER, O, B-ORG…). Uses a classification head on every token, not just [CLS].
Extractive QA
Given context + question, predict start and end token positions of the answer span. Two classification heads: one per position.
DistilBERT
Knowledge-distilled BERT: 40% smaller, 60% faster, retains 97% of performance. The practical default for resource-constrained fine-tuning.
Input Structure
[CLS] sentence_A [SEP] sentence_B [SEP] [PAD] ... [PAD]

Token IDs: [101] [...] [102] [...] [102] [0] ... [0]
Seg. IDs:  [ 0 ] [ 0 ] [ 0 ] [ 1 ] [ 1 ] [0] ... [0]
Attn mask: [ 1 ] [ 1 ] [ 1 ] [ 1 ] [ 1 ] [0] ... [0]

For classification: use [CLS] final hidden state → Linear → logits
For NER:            use each token's final hidden state → Linear → tag
For QA:             use each token → two Linear heads → start/end logits
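The three arrays in the diagram can be assembled by hand in a few lines. The special ids (101 = [CLS], 102 = [SEP], 0 = [PAD]) are BERT's actual vocabulary ids; the word-piece ids for the two sentences below are placeholder values for illustration.

```python
def build_bert_input(ids_a, ids_b, max_len):
    """Assemble [CLS] A [SEP] B [SEP] + padding, with segment ids and mask.
    101 = [CLS], 102 = [SEP], 0 = [PAD] in bert-base-uncased's vocab."""
    input_ids = [101] + ids_a + [102] + ids_b + [102]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return (input_ids + [0] * pad,
            token_type_ids + [0] * pad,
            attention_mask + [0] * pad)

# Placeholder word-piece ids for two short sentences
ids, segs, mask = build_bert_input([7592, 2088], [2129, 2024], max_len=10)
print(ids)   # [101, 7592, 2088, 102, 2129, 2024, 102, 0, 0, 0]
print(segs)  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

In practice the tokenizer does all of this for you — call `tokenizer(sent_a, sent_b, padding='max_length')` and it returns exactly these three arrays.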
bert_finetuning.py
import torch
import numpy as np
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    AutoModelForTokenClassification, AutoModelForQuestionAnswering,
    TrainingArguments, Trainer, DataCollatorWithPadding,
    EarlyStoppingCallback
)
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score, f1_score, classification_report
import evaluate

# ── 1. SENTIMENT CLASSIFICATION — fine-tune on IMDb ───────────────────
MODEL_NAME = 'distilbert-base-uncased'   # 40% smaller, 97% of BERT perf
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load dataset (HuggingFace Hub)
dataset = load_dataset('imdb')
print(dataset)
# DatasetDict with 'train' (25,000), 'test' (25,000), and 'unsupervised' (50,000) splits

# Tokenize — batch processing is much faster than row-by-row
def tokenize_fn(batch):
    return tokenizer(
        batch['text'],
        truncation=True,
        max_length=256,      # 256 sufficient for sentiment; 512 uses 2× memory
        padding=False        # DataCollatorWithPadding handles per-batch padding
    )

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
tokenized.set_format('torch')

# DataCollator pads to the longest sequence in each batch (not globally)
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load pretrained model + randomly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2,
    id2label={0: 'NEGATIVE', 1: 'POSITIVE'},    # label names for pipeline output
    label2id={'NEGATIVE': 0, 'POSITIVE': 1})
print(f"Trainable params: {sum(p.numel() for p in model.parameters()):,}")

# Evaluation metric
accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'f1': float(f1_score(labels, preds, average='binary'))
    }

# Training arguments — carefully tuned for BERT fine-tuning
args = TrainingArguments(
    output_dir='./bert-imdb',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,           # key: must be small for fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,             # 10% of steps for warmup
    lr_scheduler_type='cosine',
    evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    logging_steps=100,
    report_to='none',            # disable W&B for now
    dataloader_num_workers=4,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
trainer.train()
trainer.evaluate()
trainer.save_model('./bert-imdb')   # persist final model for pipeline() below

# ── 2. CUSTOM DATASET — bring your own data ───────────────────────────
import pandas as pd
from sklearn.model_selection import train_test_split

# Load any CSV with 'text' and 'label' columns
df = pd.read_csv('my_dataset.csv')
# df columns: text (str), label (int 0..N-1)

train_df, val_df = train_test_split(df, test_size=0.15, stratify=df['label'], random_state=42)
train_ds = Dataset.from_pandas(train_df)
val_ds   = Dataset.from_pandas(val_df)

# ── 3. INFERENCE — use the fine-tuned model ───────────────────────────
from transformers import pipeline

# HuggingFace pipeline: easiest way to run inference
classifier = pipeline(
    'text-classification',
    model='./bert-imdb',
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)
results = classifier([
    "This film was absolutely incredible!",
    "Complete waste of two hours. Terrible plot.",
])
print(results)
# [{'label': 'POSITIVE', 'score': 0.9987}, {'label': 'NEGATIVE', 'score': 0.9962}]

# Manual inference with probabilities
def predict_proba(texts, model, tokenizer, device='cpu'):
    model.eval()
    encodings = tokenizer(texts, truncation=True, padding=True,
                          max_length=256, return_tensors='pt').to(device)
    with torch.no_grad():
        logits = model(**encodings).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

# ── 4. NAMED ENTITY RECOGNITION ───────────────────────────────────────
ner_model = AutoModelForTokenClassification.from_pretrained(
    'dslim/bert-base-NER')        # pretrained on CoNLL-2003
ner_tokenizer = AutoTokenizer.from_pretrained('dslim/bert-base-NER')

ner_pipe = pipeline('ner', model=ner_model, tokenizer=ner_tokenizer,
                    aggregation_strategy='simple')
entities = ner_pipe("Elon Musk founded SpaceX in Hawthorne, California.")
for e in entities:
    print(f"  {e['entity_group']:5s} [{e['score']:.2f}]: {e['word']}")
# PER   [0.99]: Elon Musk
# ORG   [0.98]: SpaceX
# LOC   [0.99]: Hawthorne
# LOC   [0.99]: California

# ── 5. EXTRACTIVE QUESTION ANSWERING ──────────────────────────────────
qa_model = pipeline('question-answering',
                    model='deepset/roberta-base-squad2')
context = """
The Transformer model was introduced in the paper 'Attention Is All You Need'
by Vaswani et al. in 2017. It replaced recurrent networks with self-attention
and became the foundation of BERT, GPT, and all modern LLMs.
"""
answer = qa_model(question="When was the Transformer introduced?", context=context)
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})")
# Answer: 2017 (score: 0.987)

# ── 6. MULTI-LABEL CLASSIFICATION ─────────────────────────────────────
# Same architecture, different loss and output interpretation
model_ml = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=6,        # 6 emotion labels
    problem_type='multi_label_classification'
)
# Use BCEWithLogitsLoss (handled automatically by problem_type)
# Predictions: sigmoid(logits) > 0.5 for each label

# ── 7. MODEL EFFICIENCY TRICKS ────────────────────────────────────────
# Gradient checkpointing — trade compute for memory
model.gradient_checkpointing_enable()

# torch.compile() — 10–20% speedup on A100/H100
compiled_model = torch.compile(model)   # keep `model` eager: ONNX export needs it

# Export to ONNX for production inference (export the eager model, not the compiled one)
dummy = tokenizer("dummy input", return_tensors='pt')
torch.onnx.export(
    model, (dummy['input_ids'], dummy['attention_mask']),
    'model.onnx',
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                  'attention_mask': {0: 'batch', 1: 'seq'},
                  'logits': {0: 'batch'}}
)
01
Learning rate is the single most important fine-tuning hyperparameter. Stay in the range 1e-5 to 5e-5. Too high (1e-3) causes catastrophic forgetting — the pretrained knowledge is destroyed.
02
Use DistilBERT for prototyping, full BERT or RoBERTa for production. DistilBERT is ~60% faster and 40% smaller. Once your pipeline works, swap to RoBERTa for a 2–3% accuracy gain.
03
DataCollatorWithPadding is almost always better than fixed padding. It pads to the longest sequence in each batch (not globally), saving 30–50% of computation on variable-length inputs.
04
For imbalanced classes, use WeightedRandomSampler or class_weight. Compute weights as 1/class_frequency, pass to CrossEntropyLoss(weight=...). Crucial for medical, fraud, and rare-event tasks.
05
DeBERTa-v3 is the current BERT-family SOTA. For classification tasks where accuracy is critical, swap to microsoft/deberta-v3-base. Consistently outperforms BERT and RoBERTa by 2–4%.
06
For long documents (>512 tokens), use Longformer (allenai/longformer-base-4096) or a sliding window approach. Truncating at 512 loses critical information in legal, scientific, and financial text.
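The sliding-window approach from tip 06 is easy to implement yourself: split the document's token ids into overlapping windows, classify each window, and aggregate. The chunker below is self-contained; the commented aggregation step at the end is a sketch assuming a HuggingFace classification model.

```python
def sliding_windows(token_ids, window=512, stride=384):
    """Split a long token sequence into overlapping windows.
    stride < window gives overlap, so no passage loses its context."""
    if len(token_ids) <= window:
        return [token_ids]
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break                      # last window already reaches the end
    return windows

chunks = sliding_windows(list(range(1000)), window=512, stride=384)
print([len(c) for c in chunks])   # [512, 512, 232]

# At inference (sketch): classify each window, then mean-pool the logits:
# logits = torch.stack([model(**encode(c)).logits for c in chunks]).mean(dim=0)
```

Mean-pooling logits is the simplest aggregation; max-pooling per class works better when the signal is localized in one passage.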
HuggingFace course Ch.3 · BERT paper (Devlin 2018) · RoBERTa paper · DeBERTa paper · Trainer API docs · evaluate library
GPT & Text Generation 03 / 05

Autoregressive Models
and the Art of Decoding

GPT-style decoder-only models generate text one token at a time. The decoding strategy — temperature, top-p, beam search — determines whether output is coherent, creative, or diverse. Understanding this unlocks full control over LLM outputs.

Every commercial LLM — GPT-4, Claude, Gemini, LLaMA, Mistral — is a GPT-style causal decoder. Understanding how autoregressive generation works, what tokens are, and how decoding parameters affect output is essential for anyone building on top of these models. Whether you're calling an API or running a model locally, these fundamentals determine the quality of your outputs.
Causal Masking
Each token can only attend to previous tokens — not future ones. This enables autoregressive generation: the model can produce token N using only tokens 0..N-1.
Autoregressive Decoding
Generate one token at a time: feed context → predict next token → append → repeat. Output length is unbounded. The fundamental LLM generation loop.
Temperature
Scales logits before softmax: logits / T. T→0: deterministic (argmax). T=1: model distribution. T>1: more uniform/random. The most frequently tuned generation hyperparameter.
Top-k Sampling
Keep only the top k most probable tokens, zero out the rest, then sample. Prevents very unlikely tokens. k=50 is a common default.
Nucleus Sampling (Top-p)
Keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive — more tokens sampled when the distribution is flat, fewer when peaked.
Beam Search
Maintain B best sequences at each step. Produces more grammatically consistent output. Worse for open-ended generation; better for translation and summarization.
Repetition Penalty
Penalize tokens that have already appeared. Prevents the degenerate "the the the the" loops common in greedy decoding.
KV Cache
Cache key and value tensors from previous tokens. Avoids recomputing attention over the whole context at each step. Makes generation ~10x faster.
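The speedup comes from never recomputing K and V for earlier positions. A single-head numpy sketch (toy dimensions, not a real model) shows that incremental decoding with a cache produces exactly the same output as full causal attention, while doing only O(n) work per step:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, n = 16, 6
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
xs = rng.normal(size=(n, d))               # token hidden states

# Full causal attention for the LAST token: recompute K/V for all n positions
q_last = xs[-1] @ Wq
K_full, V_full = xs @ Wk, xs @ Wv
full_out = softmax(q_last @ K_full.T / np.sqrt(d)) @ V_full

# Incremental decoding: each step computes only its own k, v and appends to the cache
K_cache, V_cache = [], []
for x in xs:
    K_cache.append(x @ Wk)                 # one new row per step, never recomputed
    V_cache.append(x @ Wv)
    q = x @ Wq
    out = softmax(q @ np.array(K_cache).T / np.sqrt(d)) @ np.array(V_cache)

print(np.allclose(out, full_out))          # True — identical result, far less work
```

This is exactly what `past_key_values` holds in HuggingFace `generate()`; the cache trades memory (it grows linearly with context length) for compute.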
Strategy           Temperature  Use Case                            Tradeoff
Greedy             0 (argmax)   Factual QA, extraction              Deterministic but repetitive
Beam search (B=4)  —            Translation, summarization          More coherent; worse for open-ended gen
Top-p (p=0.9)      0.7–0.9      Creative writing, chat              Diverse and natural; can hallucinate
Top-k + top-p      0.8          General text generation             Robust combination
Low temperature    0.1          Code generation, structured output  Predictable; less creative
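The temperature column above is just a rescaling of logits before softmax. A few lines of numpy make the effect concrete (the logits are made-up numbers for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T   # divide logits by temperature
    e = np.exp(z - z.max())                   # subtract max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]   # hypothetical next-token logits

for T in (0.1, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: {p.round(3)}")
# T→0 approaches argmax (greedy); T=1 keeps the model's own distribution;
# T>1 flattens toward uniform — more diversity, more hallucination risk
```

At T=0.1 nearly all mass sits on the top token; at T=2.0 the distribution is visibly flatter. Top-k and top-p then act on this rescaled distribution.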
gpt_generation.py
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, TextStreamer,
    GenerationConfig, StoppingCriteria, StoppingCriteriaList
)
import anthropic

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. LOAD A GPT-STYLE MODEL ──────────────────────────────────────────
MODEL = 'gpt2'   # swap for 'mistralai/Mistral-7B-v0.1', 'meta-llama/Llama-3.2-3B', etc.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 has no pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
    device_map='auto'    # auto-place layers across available GPUs/CPU
)
model.eval()

prompt = "The key insight about large language models is that"
inputs = tokenizer(prompt, return_tensors='pt').to(device)

# ── 2. GREEDY DECODING — deterministic, fastest ────────────────────────
with torch.no_grad():
    greedy_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        repetition_penalty=1.2,     # prevent loops
        pad_token_id=tokenizer.eos_token_id
    )
output = tokenizer.decode(greedy_ids[0][inputs.input_ids.shape[1]:],
                           skip_special_tokens=True)
print("Greedy:", output)

# ── 3. TEMPERATURE + NUCLEUS SAMPLING ─────────────────────────────────
with torch.no_grad():
    sample_ids = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,          # lower = more focused
        top_p=0.92,               # nucleus sampling
        top_k=50,                 # also cap at top-50
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id
    )
print("Sampled:",
      tokenizer.decode(sample_ids[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))

# ── 4. BEAM SEARCH — best for translation/summarization ───────────────
with torch.no_grad():
    beam_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3,    # prevent 3-gram repetition
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id
    )
print("Beam:",
      tokenizer.decode(beam_ids[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))

# ── 5. STREAMING — show tokens as they're generated ───────────────────
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
print("\nStreaming output:")
with torch.no_grad():
    model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        streamer=streamer,         # prints each token as it's decoded
        pad_token_id=tokenizer.eos_token_id
    )

# ── 6. MANUAL AUTOREGRESSIVE LOOP — understand what generate() does ───
def generate_manual(model, tokenizer, prompt, max_new=50, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated = input_ids.clone()

    for _ in range(max_new):
        with torch.no_grad():
            outputs = model(generated)
            next_logits = outputs.logits[:, -1, :] / temperature  # last token

        # Top-p filtering (shift the mask so the most probable token is always kept —
        # otherwise a single token with prob > top_p would leave nothing to sample)
        sorted_logits, sorted_idx = torch.sort(next_logits, descending=True)
        cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cumulative_probs > top_p
        remove[:, 1:] = remove[:, :-1].clone()
        remove[:, 0] = False
        sorted_logits[remove] = float('-inf')
        filtered_logits = torch.zeros_like(next_logits).scatter(1, sorted_idx, sorted_logits)

        probs = torch.softmax(filtered_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)

        if next_token.item() == tokenizer.eos_token_id:
            break

        generated = torch.cat([generated, next_token], dim=-1)

    new_tokens = generated[0, input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

result = generate_manual(model, tokenizer, "Transformers work by")
print("Manual:", result)

# ── 7. STRUCTURED OUTPUT WITH OUTLINES ────────────────────────────────
# Force the model to output valid JSON using guided generation
import outlines
from pydantic import BaseModel
from typing import List

class PersonInfo(BaseModel):
    name: str
    age: int
    skills: List[str]

outlines_model = outlines.models.transformers(MODEL)
generator = outlines.generate.json(outlines_model, PersonInfo)
person = generator("Extract info: John is 28 years old and knows Python and SQL.")
print(person)  # PersonInfo(name='John', age=28, skills=['Python', 'SQL'])

# ── 8. CALLING AN LLM API (Claude) ────────────────────────────────────
# Production: almost always better to call an API than host yourself
client = anthropic.Anthropic()

# Basic call
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain backpropagation in 3 sentences."}]
)
print(response.content[0].text)

# Streaming API call
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about gradient descent."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# System prompt + multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "What is its population?"}
]
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    system="You are a precise geography assistant. Give only facts.",
    messages=messages
)
print(response.content[0].text)
01
Temperature 0.7 + top_p 0.9 is the universal default for open-ended generation. For factual tasks (extraction, classification, code), use temperature ≤ 0.2. For creative tasks, try 0.9–1.1.
02
Streaming dramatically improves perceived latency. A 3-second response that streams feels faster than a 1-second response that bulk-loads. Always stream for user-facing applications.
03
Prompt format is as important as model quality. Few-shot examples in the prompt (2–5 input/output pairs) can match fine-tuning for many tasks — and takes seconds not hours.
04
For structured output use tool/function calling, not "respond in JSON". Tool calling is schema-validated. Plain JSON prompting breaks 10–20% of the time on edge cases.
05
Perplexity measures how well a model fits text. Lower perplexity means the model assigns higher probability to the test text: perplexity = exp(cross_entropy_loss). Use it to compare model checkpoints.
06
For 7B+ models on a single GPU, use bitsandbytes 4-bit quantization: load_in_4bit=True. Runs a 7B model in ~5GB VRAM with <1% quality loss. Use bfloat16 compute dtype.
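The perplexity formula from tip 05 fits in a few lines. The per-token probabilities below are made-up numbers for illustration; in practice you get them from the model's cross-entropy loss on held-out text.

```python
import math

def perplexity(token_probs):
    """perplexity = exp(mean negative log-likelihood per token)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities the model assigned to each token of a test sentence
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.3, 0.15]

print(f"{perplexity(confident):.2f}")   # low — model fits this text well
print(f"{perplexity(uncertain):.2f}")   # much higher — model is surprised
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step; a perfect model (probability 1 everywhere) has perplexity 1.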
HuggingFace generation docs · GPT-2 paper · vLLM (fast inference) · Anthropic API cookbook · Outlines library · nanoGPT (Karpathy)
Fine-tuning & PEFT 04 / 05

Adapt Large Models
Without Breaking the Bank

Fully fine-tuning a 7B model requires 80GB+ VRAM and days of compute. LoRA achieves comparable quality by training only 0.1–1% of the parameters. QLoRA goes further: fine-tune LLaMA-3 8B on a single 24GB GPU in under 2 hours.

Custom LLMs for specific domains — legal reasoning, medical coding, customer support — consistently outperform general-purpose models on those tasks. Fine-tuning used to be prohibitively expensive. PEFT methods like LoRA and QLoRA democratized it: you can now adapt a billion-parameter model to your domain on consumer hardware. This is now a standard production capability, not a research curiosity.
Full Fine-tuning
Update all model parameters. Best quality but requires full model in optimizer state: 7B params × 16 bytes = ~112GB VRAM for AdamW.
LoRA
Freeze pretrained weights W. Inject trainable ΔW = B·A with rank r ≪ d. Only train B (d×r) and A (r×k). Typically 0.1–1% of parameters.
QLoRA
4-bit NF4 quantize the base model (saves ~75% memory). Train LoRA adapters in bf16. Fine-tune 7B in ~6GB VRAM, 70B in ~48GB. The current practical gold standard.
LoRA Rank (r)
Controls capacity. r=4 for style adaptation; r=16 standard; r=64 for complex tasks. Higher r = more parameters = more expressiveness but more overfitting risk.
Instruction Tuning (SFT)
Supervised Fine-Tuning on (instruction, response) pairs. Teaches the model to follow natural language instructions. The first step in making a base model useful.
RLHF
Reinforcement Learning from Human Feedback. Train a reward model on human preferences, then fine-tune the LLM with PPO to maximize that reward. How ChatGPT was aligned.
DPO
Direct Preference Optimization. Aligns LLMs from preference data without an explicit reward model or RL. Simpler than RLHF, comparable quality. The modern alignment default.
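The DPO objective itself fits in a few lines. A toy numpy sketch over scalar per-response log-probabilities (a real trainer obtains these by summing token log-probs; the numbers below are invented):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Pushes the policy to prefer the chosen response over the rejected one,
    measured relative to a frozen reference model (the KL anchor).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -np.log(sigmoid(chosen_reward - rejected_reward))

# Policy already prefers the chosen response more than the reference does → low loss
print(dpo_loss(-10.0, -30.0, -15.0, -25.0))
# Policy prefers the rejected response → loss above log(2) (worse than chance)
print(dpo_loss(-30.0, -10.0, -25.0, -15.0))
```

Note how the reward model of RLHF disappears: preferences are optimized directly through the log-probability margins, with beta playing the role of the KL coefficient.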
Catastrophic Forgetting
Over-aggressive fine-tuning overwrites general knowledge. Mitigate with lower LR, fewer epochs, LoRA (which barely touches base weights), and evaluation on general benchmarks.
Low-Rank Adaptation
Original forward:  h = Wx                     W ∈ ℝ^(d×k), frozen
LoRA forward:      h = Wx + ΔWx = Wx + BAx    B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r ≪ min(d,k)
Initialization:    A ~ N(0, σ²), B = 0   →   ΔW = 0 at start
Scaling:           ΔWx = (α/r) · BAx          α = lora_alpha controls scale
Parameters:        r×k + d×r  vs  d×k         e.g. 8×1024 + 4096×8 = 40,960 vs 4,194,304
Savings:           ~99% fewer trainable parameters for r=8 on a 4096-dim layer
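The parameter counts and the zero-init property can be verified with a few lines of numpy (dimensions chosen to match the example: d=4096, k=1024, r=8):

```python
import numpy as np

d, k, r = 4096, 1024, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))                  # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))      # trainable, Gaussian init
B = np.zeros((d, r))                         # trainable, zero init → ΔW = 0
alpha = 16                                   # lora_alpha

def lora_forward(x):
    """h = Wx + (alpha/r)·B·A·x — identical to the base output at init."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
assert np.allclose(lora_forward(x), W @ x)   # B = 0 → base model unchanged

trainable = A.size + B.size                  # r×k + d×r = 40,960
full = W.size                                # d×k = 4,194,304
print(f"{trainable:,} trainable vs {full:,} full ({100*trainable/full:.2f}%)")
```

Because B starts at zero, fine-tuning begins exactly from the pretrained model and ΔW grows only as training demands it.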
peft_lora_qlora.py
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, TrainingArguments
)
from peft import (
    LoraConfig, get_peft_model, TaskType,
    prepare_model_for_kbit_training,
    PeftModel
)
from trl import SFTTrainer, DPOTrainer, DPOConfig, DataCollatorForCompletionOnlyLM
from datasets import load_dataset, Dataset

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # replace with your target model

# ── 1. QLORA SETUP — 4-bit quantization + LoRA ────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4: best for weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"           # required for SFT with causal LMs

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Prepare model for k-bit training: cast LayerNorm, enable gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# ── 2. LORA CONFIGURATION ──────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,                   # rank — higher = more capacity, more params
    lora_alpha=32,          # scaling factor (typically 2× rank)
    target_modules=[        # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj"       # FFN
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 3,226,390,528 || trainable%: 0.42%

# ── 3. PREPARE INSTRUCTION-TUNING DATASET ─────────────────────────────
# Format: Alpaca-style {"instruction": ..., "input": ..., "output": ...}
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

def format_instruction(sample):
    """Convert to chat template / instruction format."""
    instruction = sample['instruction']
    inp = sample.get('input', '')
    output = sample['output']
    if inp:
        user_msg = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}"
    else:
        user_msg = f"### Instruction:\n{instruction}"
    return {"text": f"{user_msg}\n\n### Response:\n{output}{tokenizer.eos_token}"}

dataset = raw_dataset.map(format_instruction)

# DataCollator that only trains on the completion (response), not the prompt
response_template = "### Response:\n"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer
)

# ── 4. TRAIN WITH TRL SFTTrainer ──────────────────────────────────────
sft_args = TrainingArguments(
    output_dir="./qlora_output",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch = 16
    learning_rate=2e-4,               # higher LR OK for LoRA (only adapters)
    weight_decay=0.001,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    optim="paged_adamw_32bit",        # paged optimizer saves VRAM
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    logging_steps=25,
    save_steps=200,
    group_by_length=True,            # batch similar-length sequences → less padding
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=sft_args,
    tokenizer=tokenizer,
    data_collator=collator,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,                    # packing is incompatible with the completion-only collator
)
trainer.train()

# ── 5. SAVE AND MERGE ─────────────────────────────────────────────────
# Save only LoRA adapters (small, ~50MB)
model.save_pretrained("./lora_adapters")
tokenizer.save_pretrained("./lora_adapters")

# Load base model + merge LoRA weights for deployment (optional)
base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
merged = PeftModel.from_pretrained(base, "./lora_adapters")
merged = merged.merge_and_unload()    # fuse LoRA into base weights
merged.save_pretrained("./merged_model")

# ── 6. DPO — Direct Preference Optimization ───────────────────────────
# Alignment from preference pairs — simpler than RLHF, no reward model needed

# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
dpo_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs[:2000]")

dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,         # very small for alignment — preserve capabilities
    beta=0.1,                   # KL divergence coefficient
    bf16=True,
)

# Reference model: the SFT model before alignment (frozen)
ref_model = AutoModelForCausalLM.from_pretrained(
    "./merged_model", torch_dtype=torch.bfloat16, device_map="auto")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()

# ── 7. EVALUATE FINE-TUNED MODEL ──────────────────────────────────────
from lm_eval import evaluator

# Evaluate on standard benchmarks (ARC, HellaSwag, TruthfulQA, MMLU)
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged_model",
    tasks=["arc_easy", "hellaswag", "truthfulqa_mc1"],
    num_fewshot=5,
    device="cuda",
)
print(results["results"])
Method            | VRAM (7B model) | Trainable Params | Quality   | When to Use
Full fine-tuning  | ~80GB           | 100%             | Best      | Small models, ample resources
LoRA (bf16)       | ~14GB           | 0.1–1%           | Near-best | Single A100/H100
QLoRA (4-bit)     | ~6GB            | 0.1–1%           | Very good | Consumer GPU (RTX 3090/4090)
Prompt tuning     | ~12GB           | <0.01%           | Decent    | Fixed base model, soft prompts
Adapter layers    | ~14GB           | ~0.5%            | Good      | Multi-task, modular fine-tuning
01
Train only on the response, not the prompt. DataCollatorForCompletionOnlyLM masks the instruction tokens so loss is only computed on the generated response. Without this, quality degrades.
02
Use paged_adamw_32bit optimizer for QLoRA. It keeps optimizer states on CPU when not needed, reducing peak VRAM by ~30%. Essential for fitting 7B on a 24GB GPU with a useful batch size.
03
Always evaluate on held-out data AND general benchmarks. A fine-tuned model can score 98% on your task while degrading 10% on MMLU due to catastrophic forgetting. Check both.
04
DPO is now preferred over RLHF for alignment. Simpler to implement, more stable to train, requires no reward model. Use HuggingFaceH4/ultrafeedback_binarized as a starting dataset.
05
Data quality beats data quantity for fine-tuning. 1,000 high-quality, diverse instruction pairs usually outperform 100,000 noisy ones. Curate aggressively. Filter with another LLM.
06
Merge LoRA for deployment, keep separate for multi-task. Merging gives a single model for production (no inference overhead). Keeping separate lets you switch tasks by swapping adapter weights.
LoRA paper (Hu et al. 2021) QLoRA paper (Dettmers 2023) HuggingFace PEFT library TRL SFTTrainer docs DPO paper Axolotl fine-tuning framework
RAG & LLM Agents 05 / 05

Ground LLMs in Facts
and Give Them Tools

LLMs hallucinate. RAG grounds them in your documents and eliminates stale knowledge. Agents go further — they reason, call tools, observe results, and iterate. Together, they power every serious LLM application in production today.

An LLM without RAG is a confident guesser with a fixed knowledge cutoff. An LLM without tools can describe actions but not take them. RAG pipelines power enterprise document QA, legal research assistants, and technical support bots. Agents power coding assistants, research tools, and workflow automation. If you only learn one LLM application pattern, RAG is it — it is ubiquitous, well-understood, and immediately deployable in production.
RAG Pipeline
Index: embed documents → vector DB. Query: embed query → retrieve top-K → insert into prompt → generate grounded answer.
Chunking
Split documents into 256–1024 token chunks before embedding. Chunk size controls recall vs precision. Overlap (50–100 tokens) prevents context loss at boundaries.
Retrieval Strategies
Dense (embedding similarity), sparse (BM25 keyword), or hybrid. Hybrid retrieval + cross-encoder reranking is the current production SOTA.
Reranking
After retrieving top-K candidates, a cross-encoder scores each (query, chunk) pair. Expensive but dramatically improves relevance. Run on top-20, return top-5.
Tool Calling
Define functions the LLM can invoke (search, calculator, DB query, API calls). The model outputs structured JSON to call a tool, then receives the result.
ReAct
Reason + Act: LLM alternates Thought → Action → Observation until it reaches a final answer. Enables multi-step problem solving with tool use.
Agentic Loops
The LLM drives a loop: decide → act → observe → decide. Can run until convergence. Requires careful design to prevent infinite loops and runaway costs.
RAGAS Evaluation
Framework for evaluating RAG: faithfulness (is the answer grounded?), answer relevancy, context precision, context recall. Essential for production monitoring.
INDEXING (offline)
PDF/HTML/TXT → Text extraction → Chunker (512 tok, 50 overlap)
→ Embedding model → Vector store (Chroma / FAISS / Pinecone)

QUERYING (online)
User question → Embed query → ANN search top-20
→ BM25 sparse retrieval top-20 → Hybrid merge (RRF)
→ Cross-encoder reranker → Top-5 chunks
→ Prompt assembly: [system] + [chunks] + [question]
→ LLM → Grounded answer + source citations
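The "Hybrid merge (RRF)" step above is simple enough to write out. A sketch of Reciprocal Rank Fusion over ranked lists of document IDs (k=60 is the conventional smoothing constant; the doc IDs are made up):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc IDs; score(doc) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc3", "doc1", "doc7", "doc2"]   # embedding-similarity ranking
sparse = ["doc1", "doc5", "doc3", "doc9"]   # BM25 keyword ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
# doc1 and doc3 rise to the top — they rank highly in both lists
```

RRF needs only ranks, never raw scores, which is why it merges dense and sparse retrievers cleanly despite their incomparable scoring scales.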
rag_agents.py
import os
import json
from pathlib import Path
from typing import List, Dict, Any
import numpy as np
import anthropic

# ── 1. DOCUMENT LOADING & CHUNKING ────────────────────────────────────
from langchain_community.document_loaders import (
    PyPDFLoader, DirectoryLoader, TextLoader, WebBaseLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents from a directory
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")

# RecursiveCharacterTextSplitter: tries paragraph → sentence → word → char splits
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,            # counted in characters here (length_function=len) ≈ 128 tokens
    chunk_overlap=64,          # overlap prevents cross-boundary context loss
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    add_start_index=True,
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
print(f"Sample: {chunks[0].page_content[:200]}")

# ── 2. EMBEDDING + VECTOR STORE ───────────────────────────────────────
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",   # fast, small, excellent quality
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Create persistent Chroma vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)
print(f"Stored {vectorstore._collection.count()} chunks in Chroma")

# Load existing store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_docs"
)

# ── 3. RETRIEVAL STRATEGIES ───────────────────────────────────────────
# Dense retrieval — embedding similarity
dense_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 20}
)

# MMR (Maximum Marginal Relevance) — diverse results, less redundancy
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 10, "fetch_k": 50, "lambda_mult": 0.5}
)

# BM25 sparse retrieval (keyword-based)
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(chunks, k=20)

# Hybrid retrieval: merge dense + sparse with Reciprocal Rank Fusion
from langchain.retrievers import EnsembleRetriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]   # dense usually more important
)

# ── 4. CROSS-ENCODER RERANKING ────────────────────────────────────────
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def retrieve_and_rerank(query: str, k_retrieve: int = 20, k_final: int = 5):
    # Step 1: fast retrieval (top-20)
    candidates = hybrid_retriever.invoke(query)[:k_retrieve]

    # Step 2: cross-encoder reranking (top-5)
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:k_final]]

# ── 5. RAG WITH CLAUDE — full pipeline ────────────────────────────────
client = anthropic.Anthropic()

def rag_query(question: str, stream: bool = True) -> str:
    # Retrieve relevant chunks
    relevant_docs = retrieve_and_rerank(question, k_retrieve=20, k_final=5)

    # Build context block
    context_parts = []
    for i, doc in enumerate(relevant_docs, 1):
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', '?')
        context_parts.append(
            f"[Source {i}: {source}, p.{page}]\n{doc.page_content}")
    context = "\n\n---\n\n".join(context_parts)

    prompt = f"""Use ONLY the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Always cite the source number(s) you used.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

    if stream:
        print("Answer: ", end="")
        with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        ) as s:
            full_response = ""
            for text in s.text_stream:
                print(text, end="", flush=True)
                full_response += text
        print()
        return full_response
    else:
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return resp.content[0].text

# Usage
answer = rag_query("What are the main risk factors described in the report?")

# ── 6. TOOL-CALLING AGENT WITH CLAUDE ─────────────────────────────────
import math, subprocess

tools = [
    {
        "name": "calculator",
        "description": "Evaluate a mathematical expression and return the result.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression, e.g. '2**32 / 1024'"}
            },
            "required": ["expression"]
        }
    },
    {
        "name": "search_docs",
        "description": "Search the document corpus for relevant information.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute a Python expression and return the output.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python expression to evaluate"}
            },
            "required": ["code"]
        }
    }
]

def execute_tool(name: str, inputs: dict) -> str:
    """Dispatch tool calls to actual functions."""
    if name == "calculator":
        try:
            result = eval(inputs["expression"], {"__builtins__": {}},
                          {"math": math, "sqrt": math.sqrt})
            return f"Result: {result}"
        except Exception as e:
            return f"Error: {e}"
    elif name == "search_docs":
        docs = retrieve_and_rerank(inputs["query"], k_final=3)
        return "\n\n".join(d.page_content[:400] for d in docs)
    elif name == "run_python":
        try:
            # WARNING: eval() on model-generated code is unsafe — sandbox in production
            return str(eval(inputs["code"]))
        except Exception as e:
            return f"Error: {e}"
    return "Unknown tool"

def run_agent(user_message: str, max_steps: int = 8) -> str:
    """ReAct-style agentic loop with tool use."""
    messages = [{"role": "user", "content": user_message}]
    system = "You are a helpful research assistant with access to tools. Use them to answer questions accurately."

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages
        )

        # Append assistant response to history
        messages.append({"role": "assistant", "content": response.content})

        # Check if done
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, 'text'):
                    return block.text
            return "Done"

        # Execute tool calls
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  → Tool: {block.name}({block.input})")
                    result = execute_tool(block.name, block.input)
                    print(f"  ← Result: {result[:100]}...")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})

    return "Max steps reached"

# Run an agent
result = run_agent(
    "Find what the document says about Q3 revenue, then calculate its percentage "
    "change compared to a baseline of $4.2M."
)
print("Agent result:", result)

# ── 7. EVALUATE RAG WITH RAGAS ────────────────────────────────────────
from ragas import evaluate as ragas_eval
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset as HFDataset

# Build evaluation dataset
eval_data = {
    "question": ["What is the revenue for Q3?", "Who is the CEO?"],
    "answer":   ["Q3 revenue was $4.8M.", "Jane Smith is the CEO."],
    "contexts": [["Q3 revenue reached $4.8 million..."],
                  ["Jane Smith was appointed CEO in 2022..."]],
    "ground_truth": ["$4.8M", "Jane Smith"]
}

ragas_dataset = HFDataset.from_dict(eval_data)
ragas_results = ragas_eval(
    ragas_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(ragas_results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}
Strategy            | Chunk Size | Best For            | Notes
Fixed size          | 512 tokens | General purpose     | Simple; can split mid-sentence
Recursive character | 512 tokens | Most documents      | Respects paragraph/sentence structure
Semantic chunking   | Variable   | Coherent passages   | Splits on embedding similarity drops
Document-aware      | Variable   | PDFs, HTML          | Respects headers, sections, tables
Small-to-big        | 128 + 512  | Precision + context | Retrieve small chunks, return parent doc
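The first strategy in the table — fixed-size chunking with overlap — is a few lines of plain Python. A sketch over an already-tokenized list (real splitters add the separator-aware logic used earlier):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Slice a token list into fixed-size windows that overlap by `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)
            if tokens[i:i + size]]

tokens = list(range(1200))
chunks = chunk_fixed(tokens, size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])  # → 3 [512, 512, 304]
# Each chunk's last 64 tokens reappear at the start of the next chunk,
# so a sentence split at a boundary survives in full in at least one chunk.
```

This is the "boundary blindness" mitigation from the tips below the table made concrete: the stride is size − overlap, not size.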
01
Hybrid retrieval (dense + BM25) consistently outperforms either alone. Dense handles semantic similarity; BM25 handles exact keyword matches (model names, product codes, IDs). Always combine them.
02
Cross-encoder reranking is the highest-ROI RAG improvement. Retrieve top-20 with a fast bi-encoder, rerank with a cross-encoder, return top-5. Typically +8–15 points of NDCG at modest latency cost.
03
Chunk overlap prevents boundary blindness. A 50-100 token overlap ensures that sentences split across chunk boundaries still appear in at least one chunk in full context.
04
Evaluate with RAGAS before deploying. Measure faithfulness (hallucination rate), answer relevancy, and context precision. Set minimum thresholds (e.g. faithfulness > 0.85) as a deployment gate.
05
Small-to-big retrieval improves coherence. Index small chunks (128 tokens) for precision, but return their parent document (512+ tokens) as context. The LLM needs surrounding text to reason well.
06
Always cap agent steps and handle tool errors gracefully. Set max_steps=10 hard limit. Catch all tool exceptions and return error strings — never let a crashed tool kill the agent loop.
🎉 Phase 4 Complete! You now command the full NLP & LLM stack — from tokenization to embeddings, BERT classification to GPT generation, LoRA fine-tuning to production RAG pipelines. Phase 5 (Specializations: Generative AI, RL, MLOps) applies these foundations across domains. You're in the top tier of the ML curriculum.
LangChain docs LlamaIndex docs RAGAS framework Anthropic tool use guide RAG survey paper (2023) BM25 explainer