NLP &
Large Language Models
Language is the frontier. This phase takes you from converting text to numbers all the way to fine-tuning billion-parameter models and building production RAG pipelines. You'll understand the stack that powers GPT-4, Claude, and every modern language AI — from tokenization to LoRA to agents.
Embeddings
Fine-tuning
Generation
LoRA
Agents
Represent Language
as Mathematical Objects
All NLP begins with one question: how do you turn text into numbers a model can reason about? Modern embeddings encode semantic meaning so precisely that arithmetic in vector space reflects linguistic meaning.
Text → Tokenizer → Transformer encoder (token vectors) → Mean pooling (or CLS) → Sentence embedding (768d or 384d)
→ L2 normalize → FAISS / vector DB → Approximate nearest-neighbor search
Query: "What causes inflation?" → encode → find top-K most similar docs
import numpy as np
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import faiss
import torch

# ── 1. TOKENIZATION — understand what happens before encoding ──────────
# BPE tokenizer (GPT-style)
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "The Transformer architecture changed NLP forever."
tokens = gpt_tokenizer.tokenize(text)
ids = gpt_tokenizer.encode(text)
print("GPT-2 tokens:", tokens)
# ['The', 'ĠTransformer', 'Ġarchitecture', 'Ġchanged', 'ĠNLP', 'Ġforever', '.']
print("Token IDs:", ids)
print("Decoded back:", gpt_tokenizer.decode(ids))

# WordPiece tokenizer (BERT-style)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_tokens = bert_tokenizer.tokenize("unbelievable transformations")
print("BERT tokens:", bert_tokens)
# ['un', '##believable', 'transform', '##ations']

# Special tokens: [CLS], [SEP], [PAD], [MASK]
encoded = bert_tokenizer(
    "Hello world",
    return_tensors='pt',
    padding=True,
    truncation=True,
    max_length=128
)
print("Input IDs:", encoded['input_ids'])
print("Attention mask:", encoded['attention_mask'])
# attention_mask: 1 = real token, 0 = padding (ignored in attention)

# Count tokens before sending to an API — critical for cost control
def count_tokens(text: str, model: str = 'gpt2') -> int:
    tok = AutoTokenizer.from_pretrained(model)
    return len(tok.encode(text))

print(f"Token count: {count_tokens('Hello, how are you today?')}")

# ── 2. SENTENCE EMBEDDINGS ────────────────────────────────────────────
# all-MiniLM-L6-v2: fast, good quality, 384 dims — the practical default
embedder = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is transforming industries.",
    "Artificial intelligence reshapes the business world.",
    "I love hiking through mountain trails.",
    "Deep learning requires large amounts of data.",
    "The weather is beautiful today.",
]
embeddings = embedder.encode(sentences, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# Cosine similarity (dot product of normalized vectors)
def cosine_sim(a, b):
    return float(np.dot(a, b))  # works if vectors already L2-normalized

print(f"ML vs AI similarity: {cosine_sim(embeddings[0], embeddings[1]):.3f}")  # ~0.85 — semantically similar
print(f"ML vs hiking: {cosine_sim(embeddings[0], embeddings[2]):.3f}")          # ~0.12 — semantically different
print(f"ML vs deep learning: {cosine_sim(embeddings[0], embeddings[3]):.3f}")   # ~0.72 — related

# Batch pairwise similarity matrix
sim_matrix = embeddings @ embeddings.T
print("Similarity matrix:\n", sim_matrix.round(2))

# ── 3. SEMANTIC SEARCH WITH FAISS ─────────────────────────────────────
# Build a corpus of documents
corpus = [
    "BERT is a bidirectional Transformer encoder for NLP tasks.",
    "GPT uses a causal decoder to generate text autoregressively.",
    "LoRA reduces fine-tuning parameters by injecting low-rank matrices.",
    "Attention mechanisms allow models to weigh token importance.",
    "Vector databases store embeddings for fast similarity retrieval.",
    "Gradient descent minimizes loss by following the negative gradient.",
    "Convolutional networks extract spatial features from images.",
    "Reinforcement learning trains agents via reward signals.",
    "Tokenization converts raw text into integer IDs for models.",
    "Transformers replaced RNNs as the dominant sequence model.",
]

# Encode corpus
corpus_embeds = embedder.encode(corpus, normalize_embeddings=True)
corpus_embeds = corpus_embeds.astype(np.float32)

# Build FAISS index (Inner Product = cosine sim for normalized vecs)
dim = corpus_embeds.shape[1]  # 384
index = faiss.IndexFlatIP(dim)
index.add(corpus_embeds)
print(f"FAISS index: {index.ntotal} vectors of dim {dim}")

# Semantic search
def semantic_search(query: str, k: int = 3):
    q_embed = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    scores, indices = index.search(q_embed, k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        results.append({'doc': corpus[idx], 'score': float(score)})
    return results

results = semantic_search("how do transformers process sequences?")
for r in results:
    print(f"  [{r['score']:.3f}] {r['doc'][:60]}...")

# ── 4. FAISS AT SCALE — IVF index for millions of vectors ─────────────
# Flat index is exact but O(N). For >100K vectors, use approximate search.
# (Illustrative config: IVF training needs at least n_clusters vectors,
#  so run this against a real corpus, not the 10-document toy corpus above.)
n_clusters = 100
quantizer = faiss.IndexFlatIP(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, n_clusters, faiss.METRIC_INNER_PRODUCT)
ivf_index.train(corpus_embeds)   # must train before adding
ivf_index.add(corpus_embeds)
ivf_index.nprobe = 10            # check 10 clusters (speed vs accuracy)
# nprobe=10 gives ~95% recall of exact search, 10x faster

# ── 5. CROSS-ENCODER RERANKING — improve top-K quality ────────────────
from sentence_transformers import CrossEncoder

# Bi-encoder (fast): use for retrieval top-K
# Cross-encoder (slow but accurate): use to rerank top-K
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "how do transformers process sequences?"
candidates = [r['doc'] for r in results]

# Cross-encoder scores the query-document pair jointly
rerank_scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(rerank_scores, candidates), reverse=True)
print("\nReranked results:")
for score, doc in reranked:
    print(f"  [{score:.3f}] {doc[:60]}...")

# ── 6. WORD2VEC — understand the foundations ──────────────────────────
from gensim.models import Word2Vec

sentences_w2v = [
    ['machine', 'learning', 'is', 'transforming', 'industries'],
    ['deep', 'learning', 'requires', 'large', 'data'],
    ['neural', 'networks', 'learn', 'representations'],
    ['transformers', 'use', 'attention', 'mechanisms'],
]
model_w2v = Word2Vec(sentences_w2v, vector_size=64, window=3,
                     min_count=1, workers=4, epochs=100)

vec = model_w2v.wv['learning']                             # 64-dim vector
similar = model_w2v.wv.most_similar('learning', topn=3)
print("Most similar to 'learning':", similar)
| Model | Dims | Speed | Quality | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Semantic search, default choice |
| all-mpnet-base-v2 | 768 | Medium | Very Good | High-quality retrieval |
| text-embedding-3-small | 1536 | API call | Excellent | Production OpenAI apps |
| nomic-embed-text | 768 | Medium | Excellent | Local, long-context (8192 tokens) |
| bge-m3 | 1024 | Medium | SOTA | Multilingual, dense + sparse hybrid |
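The API-hosted row above (text-embedding-3-small) is called through the OpenAI SDK rather than loaded locally; a minimal sketch, assuming the openai package is installed and OPENAI_API_KEY is set:

```python
# Minimal sketch: hosted embeddings via the OpenAI API (assumes OPENAI_API_KEY is set).
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What causes inflation?", "Why do prices rise over time?"]
)
vectors = [d.embedding for d in resp.data]   # list of 1536-dim float lists
print(len(vectors), len(vectors[0]))         # 2 1536
```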
Always encode with normalize_embeddings=True in SentenceTransformer so that dot products equal cosine similarity; this is what lets the FAISS inner-product index above behave as cosine search.
Bidirectional Understanding
for NLP Tasks
BERT and its variants are the workhorses of NLP understanding tasks: classification, named entity recognition, and question answering. Fine-tuning a pretrained model takes hours instead of weeks and consistently outperforms training from scratch.
import torch
import numpy as np
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    AutoModelForTokenClassification, AutoModelForQuestionAnswering,
    TrainingArguments, Trainer, DataCollatorWithPadding, EarlyStoppingCallback
)
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score, f1_score, classification_report
import evaluate

# ── 1. SENTIMENT CLASSIFICATION — fine-tune on IMDb ───────────────────
MODEL_NAME = 'distilbert-base-uncased'   # 40% smaller, 97% of BERT perf
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load dataset (HuggingFace Hub)
dataset = load_dataset('imdb')
print(dataset)   # DatasetDict({'train': 25000, 'test': 25000})

# Tokenize — batch processing is much faster than row-by-row
def tokenize_fn(batch):
    return tokenizer(
        batch['text'],
        truncation=True,
        max_length=256,   # 256 sufficient for sentiment; 512 uses 2× memory
        padding=False     # DataCollatorWithPadding handles per-batch padding
    )

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
tokenized.set_format('torch')

# DataCollator pads to the longest sequence in each batch (not globally)
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load pretrained model + randomly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2,
    id2label={0: 'NEGATIVE', 1: 'POSITIVE'},
    label2id={'NEGATIVE': 0, 'POSITIVE': 1}
)
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Evaluation metric
accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'f1': float(f1_score(labels, preds, average='binary'))
    }

# Training arguments — carefully tuned for BERT fine-tuning
args = TrainingArguments(
    output_dir='./bert-imdb',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,              # key: must be small for fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,                # 10% of steps for warmup
    lr_scheduler_type='cosine',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    logging_steps=100,
    report_to='none',                # disable W&B for now
    dataloader_num_workers=4,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

trainer.train()
trainer.evaluate()
trainer.save_model('./bert-imdb')   # write final model + tokenizer to the output dir

# ── 2. CUSTOM DATASET — bring your own data ───────────────────────────
import pandas as pd
from sklearn.model_selection import train_test_split

# Load any CSV with 'text' and 'label' columns
df = pd.read_csv('my_dataset.csv')   # df columns: text (str), label (int 0..N-1)
train_df, val_df = train_test_split(df, test_size=0.15,
                                    stratify=df['label'], random_state=42)
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)

# ── 3. INFERENCE — use the fine-tuned model ───────────────────────────
from transformers import pipeline

# HuggingFace pipeline: easiest way to run inference
classifier = pipeline(
    'text-classification',
    model='./bert-imdb',
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)
results = classifier([
    "This film was absolutely incredible!",
    "Complete waste of two hours. Terrible plot.",
])
print(results)
# [{'label': 'POSITIVE', 'score': 0.9987}, {'label': 'NEGATIVE', 'score': 0.9962}]

# Manual inference with probabilities
def predict_proba(texts, model, tokenizer, device='cpu'):
    model.eval()
    encodings = tokenizer(texts, truncation=True, padding=True,
                          max_length=256, return_tensors='pt').to(device)
    with torch.no_grad():
        logits = model(**encodings).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

# ── 4. NAMED ENTITY RECOGNITION ───────────────────────────────────────
ner_model = AutoModelForTokenClassification.from_pretrained(
    'dslim/bert-base-NER')   # pretrained on CoNLL-2003
ner_tokenizer = AutoTokenizer.from_pretrained('dslim/bert-base-NER')
ner_pipe = pipeline('ner', model=ner_model, tokenizer=ner_tokenizer,
                    aggregation_strategy='simple')

entities = ner_pipe("Elon Musk founded SpaceX in Hawthorne, California.")
for e in entities:
    print(f"  {e['entity_group']:5s} [{e['score']:.2f}]: {e['word']}")
#   PER [0.99]: Elon Musk
#   ORG [0.98]: SpaceX
#   LOC [0.99]: Hawthorne
#   LOC [0.99]: California

# ── 5. EXTRACTIVE QUESTION ANSWERING ──────────────────────────────────
qa_model = pipeline('question-answering', model='deepset/roberta-base-squad2')
context = """
The Transformer model was introduced in the paper 'Attention Is All You Need'
by Vaswani et al. in 2017. It replaced recurrent networks with self-attention
and became the foundation of BERT, GPT, and all modern LLMs.
"""
answer = qa_model(question="When was the Transformer introduced?", context=context)
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})")
# Answer: 2017 (score: 0.987)

# ── 6. MULTI-LABEL CLASSIFICATION ─────────────────────────────────────
# Same architecture, different loss and output interpretation
model_ml = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=6,   # 6 emotion labels
    problem_type='multi_label_classification'
)
# Use BCEWithLogitsLoss (handled automatically by problem_type)
# Predictions: sigmoid(logits) > 0.5 for each label

# ── 7. MODEL EFFICIENCY TRICKS ────────────────────────────────────────
# Gradient checkpointing — trade compute for memory
model.gradient_checkpointing_enable()

# Export to ONNX for production inference (export before torch.compile)
dummy = tokenizer("dummy input", return_tensors='pt')
torch.onnx.export(
    model, (dummy['input_ids'], dummy['attention_mask']), 'model.onnx',
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                  'attention_mask': {0: 'batch', 1: 'seq'},
                  'logits': {0: 'batch'}}
)

# torch.compile() — 10–20% speedup on A100/H100
model = torch.compile(model)
For a stronger encoder at similar size, use microsoft/deberta-v3-base. It consistently outperforms BERT and RoBERTa by 2–4%.
For documents longer than 512 tokens, use a long-context model (allenai/longformer-base-4096) or a sliding window approach. Truncating at 512 loses critical information in legal, scientific, and financial text.
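One way to implement the sliding-window approach is through the tokenizer's overflow support, averaging the per-window logits; a minimal sketch, with illustrative window and stride values:

```python
# Sketch: classify a long document with overlapping 512-token windows
# and average the window-level logits. Window/stride values are illustrative.
import torch

def classify_long_text(text, model, tokenizer, max_length=512, stride=64):
    enc = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,   # one row per overlapping window
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    return torch.softmax(logits.mean(dim=0), dim=-1)   # average over windows
```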
Autoregressive Models
and the Art of Decoding
GPT-style decoder-only models generate text one token at a time. The decoding strategy — temperature, top-p, beam search — determines whether output is coherent, creative, or diverse. Understanding this unlocks full control over LLM outputs.
| Strategy | Temperature | Use Case | Tradeoff |
|---|---|---|---|
| Greedy | 0 (argmax) | Factual QA, extraction | Deterministic but repetitive |
| Beam Search (B=4) | — | Translation, summarization | More coherent; worse for open-ended gen |
| Top-p (p=0.9) | 0.7–0.9 | Creative writing, chat | Diverse and natural; can hallucinate |
| Top-k + Top-p | 0.8 | General text generation | Robust combination |
| Temperature=0.1 | 0.1 | Code generation, structured output | Predictable; less creative |
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, TextStreamer,
    GenerationConfig, StoppingCriteria, StoppingCriteriaList
)
import anthropic

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── 1. LOAD A GPT-STYLE MODEL ──────────────────────────────────────────
MODEL = 'gpt2'   # swap for 'mistralai/Mistral-7B-v0.1', 'meta-llama/Llama-3.2-3B', etc.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
    device_map='auto'   # auto-place layers across available GPUs/CPU
)
model.eval()

prompt = "The key insight about large language models is that"
inputs = tokenizer(prompt, return_tensors='pt').to(device)

# ── 2. GREEDY DECODING — deterministic, fastest ────────────────────────
with torch.no_grad():
    greedy_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        repetition_penalty=1.2,   # prevent loops
        pad_token_id=tokenizer.eos_token_id
    )
output = tokenizer.decode(greedy_ids[0][inputs.input_ids.shape[1]:],
                          skip_special_tokens=True)
print("Greedy:", output)

# ── 3. TEMPERATURE + NUCLEUS SAMPLING ─────────────────────────────────
with torch.no_grad():
    sample_ids = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,   # lower = more focused
        top_p=0.92,        # nucleus sampling
        top_k=50,          # also cap at top-50
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id
    )
print("Sampled:", tokenizer.decode(sample_ids[0][inputs.input_ids.shape[1]:],
                                   skip_special_tokens=True))

# ── 4. BEAM SEARCH — best for translation/summarization ───────────────
with torch.no_grad():
    beam_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3,   # prevent 3-gram repetition
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id
    )
print("Beam:", tokenizer.decode(beam_ids[0][inputs.input_ids.shape[1]:],
                                skip_special_tokens=True))

# ── 5. STREAMING — show tokens as they're generated ───────────────────
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
print("\nStreaming output:")
with torch.no_grad():
    model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        streamer=streamer,   # prints each token as it's decoded
        pad_token_id=tokenizer.eos_token_id
    )

# ── 6. MANUAL AUTOREGRESSIVE LOOP — understand what generate() does ───
def generate_manual(model, tokenizer, prompt, max_new=50,
                    temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated = input_ids.clone()
    for _ in range(max_new):
        with torch.no_grad():
            outputs = model(generated)
        next_logits = outputs.logits[:, -1, :] / temperature   # last token's logits
        # Top-p filtering: drop tokens outside the nucleus,
        # but always keep the single most likely token
        sorted_logits, sorted_idx = torch.sort(next_logits, descending=True)
        cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        to_remove = cumulative_probs > top_p
        to_remove[..., 1:] = to_remove[..., :-1].clone()
        to_remove[..., 0] = False
        sorted_logits[to_remove] = float('-inf')
        filtered_logits = torch.full_like(next_logits, float('-inf')).scatter(
            1, sorted_idx, sorted_logits)
        probs = torch.softmax(filtered_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated = torch.cat([generated, next_token], dim=-1)
    new_tokens = generated[0, input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

result = generate_manual(model, tokenizer, "Transformers work by")
print("Manual:", result)

# ── 7. STRUCTURED OUTPUT WITH OUTLINES ─────────────────────────────────
# Force the model to output valid JSON using guided generation
import outlines
from pydantic import BaseModel
from typing import List

class PersonInfo(BaseModel):
    name: str
    age: int
    skills: List[str]

outlines_model = outlines.models.transformers(MODEL)
generator = outlines.generate.json(outlines_model, PersonInfo)
person = generator("Extract info: John is 28 years old and knows Python and SQL.")
print(person)   # PersonInfo(name='John', age=28, skills=['Python', 'SQL'])

# ── 8. CALLING AN LLM API (Claude) ─────────────────────────────────────
# Production: almost always better to call an API than host yourself
client = anthropic.Anthropic()

# Basic call
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user",
               "content": "Explain backpropagation in 3 sentences."}]
)
print(response.content[0].text)

# Streaming API call
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    messages=[{"role": "user",
               "content": "Write a haiku about gradient descent."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# System prompt + multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "What is its population?"}
]
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    system="You are a precise geography assistant. Give only facts.",
    messages=messages
)
print(response.content[0].text)
Perplexity, the standard intrinsic metric for language models, is exp(cross_entropy_loss); lower is better.
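Concretely, any causal LM gives you perplexity from its own loss; a minimal sketch with GPT-2:

```python
# Sketch: perplexity = exp(mean cross-entropy) of a causal LM on a text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The Transformer architecture changed NLP forever."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss              # mean cross-entropy per token
print(f"Perplexity: {torch.exp(loss).item():.1f}")   # lower is better
```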
To run large models on modest hardware, load them with load_in_4bit=True. This runs a 7B model in ~5GB VRAM with <1% quality loss. Use bfloat16 as the compute dtype.
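For inference-only loading (no fine-tuning), the same idea takes a few lines; a minimal sketch, assuming bitsandbytes is installed and a CUDA GPU is available (the model ID is illustrative):

```python
# Sketch: load a causal LM in 4-bit for inference (assumes bitsandbytes + CUDA).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"   # any 7B-class model
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # bf16 compute dtype
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)
tok = AutoTokenizer.from_pretrained(model_id)
```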
Adapt Large Models
Without Breaking the Bank
Fully fine-tuning a 7B model requires 80GB+ of VRAM and days of compute. LoRA achieves comparable quality by training only 0.1–1% of the parameters. QLoRA goes further: fine-tune LLaMA-3 8B on a single 24GB GPU in under 2 hours.
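The parameter savings follow from simple arithmetic: for one d×d weight matrix, LoRA trains two thin factors of shapes (d, r) and (r, d) instead of the full matrix. A back-of-envelope sketch with illustrative Llama-like dimensions:

```python
# Back-of-envelope: LoRA parameter count for a single d×d weight matrix.
# ΔW = B @ A, with B of shape (d, r) and A of shape (r, d); d and r are illustrative.
d, r = 4096, 16

full_params = d * d            # full fine-tuning of this matrix
lora_params = d * r + r * d    # only the A and B adapters train

print(f"Full:  {full_params:,}")                   # 16,777,216
print(f"LoRA:  {lora_params:,}")                   # 131,072
print(f"Ratio: {lora_params / full_params:.2%}")   # 0.78% of the original weights
```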
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
)
from peft import (
    LoraConfig, get_peft_model, TaskType,
    prepare_model_for_kbit_training, PeftModel
)
from trl import SFTTrainer, DPOTrainer, DPOConfig, DataCollatorForCompletionOnlyLM
from datasets import load_dataset, Dataset

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"   # replace with your target model

# ── 1. QLORA SETUP — 4-bit quantization + LoRA ────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4: best for weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for speed
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"   # required for SFT with causal LMs

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Prepare model for k-bit training: cast LayerNorm, enable gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# ── 2. LORA CONFIGURATION ──────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,              # rank — higher = more capacity, more params
    lora_alpha=32,     # scaling factor (typically 2× rank)
    target_modules=[   # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj"       # FFN
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 3,226,390,528 || trainable%: 0.42%

# ── 3. PREPARE INSTRUCTION-TUNING DATASET ─────────────────────────────
# Format: Alpaca-style {"instruction": ..., "input": ..., "output": ...}
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

def format_instruction(sample):
    """Convert to chat template / instruction format."""
    instruction = sample['instruction']
    inp = sample.get('input', '')
    output = sample['output']
    if inp:
        user_msg = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}"
    else:
        user_msg = f"### Instruction:\n{instruction}"
    return {"text": f"{user_msg}\n\n### Response:\n{output}{tokenizer.eos_token}"}

dataset = raw_dataset.map(format_instruction)

# DataCollator that only trains on the completion (response), not the prompt
response_template = "### Response:\n"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template, tokenizer=tokenizer
)

# ── 4. TRAIN WITH TRL SFTTrainer ──────────────────────────────────────
sft_args = TrainingArguments(
    output_dir="./qlora_output",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 16
    learning_rate=2e-4,              # higher LR OK for LoRA (only adapters)
    weight_decay=0.001,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    optim="paged_adamw_32bit",       # paged optimizer saves VRAM
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    logging_steps=25,
    save_steps=200,
    group_by_length=True,            # batch similar-length sequences → less padding
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=sft_args,
    tokenizer=tokenizer,
    data_collator=collator,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,   # packing is incompatible with completion-only collation
)
trainer.train()

# ── 5. SAVE AND MERGE ─────────────────────────────────────────────────
# Save only LoRA adapters (small, ~50MB)
model.save_pretrained("./lora_adapters")
tokenizer.save_pretrained("./lora_adapters")

# Load base model + merge LoRA weights for deployment (optional)
base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
merged = PeftModel.from_pretrained(base, "./lora_adapters")
merged = merged.merge_and_unload()   # fuse LoRA into base weights
merged.save_pretrained("./merged_model")

# ── 6. DPO — Direct Preference Optimization ───────────────────────────
# Alignment from preference pairs — simpler than RLHF, no reward model needed
# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
# (this dataset exposes train_prefs / test_prefs splits)
dpo_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                           split="train_prefs[:2000]")

dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,   # very small for alignment — preserve capabilities
    beta=0.1,             # KL divergence coefficient
    bf16=True,
)

# Reference model: the SFT model before alignment (frozen)
ref_model = AutoModelForCausalLM.from_pretrained(
    "./merged_model", torch_dtype=torch.bfloat16, device_map="auto")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()

# ── 7. EVALUATE FINE-TUNED MODEL ──────────────────────────────────────
from lm_eval import evaluator

# Evaluate on standard benchmarks (ARC, HellaSwag, TruthfulQA, MMLU)
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged_model",
    tasks=["arc_easy", "hellaswag", "truthfulqa_mc1"],
    num_fewshot=5,
    device="cuda",
)
print(results["results"])
| Method | VRAM (7B model) | Trainable Params | Quality | When to Use |
|---|---|---|---|---|
| Full fine-tuning | ~80GB | 100% | Best | Small models, ample resources |
| LoRA (bf16) | ~14GB | 0.1–1% | Near-best | Single A100/H100 |
| QLoRA (4-bit) | ~6GB | 0.1–1% | Very good | Consumer GPU (RTX 3090/4090) |
| Prompt tuning | ~12GB | <0.01% | Decent | Fixed base model, soft prompts |
| Adapter layers | ~14GB | ~0.5% | Good | Multi-task, modular fine-tuning |
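The prompt-tuning row trains only a small set of soft-prompt vectors while the base model stays frozen; a minimal sketch with peft, where num_virtual_tokens and the init text are illustrative:

```python
# Sketch: soft prompt tuning with peft; only the virtual-token embeddings are trained.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(MODEL_ID)   # MODEL_ID from the QLoRA code above
pt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Answer the instruction helpfully and concisely:",
    num_virtual_tokens=20,            # illustrative
    tokenizer_name_or_path=MODEL_ID,
)
pt_model = get_peft_model(base, pt_config)
pt_model.print_trainable_parameters()   # well under 0.01% of the model
```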
DataCollatorForCompletionOnlyLM masks the instruction tokens so loss is only computed on the generated response. Without this, quality degrades.
For DPO you need preference pairs (chosen vs. rejected responses); use HuggingFaceH4/ultrafeedback_binarized as a starting dataset.
Ground LLMs in Facts
and Give Them Tools
LLMs hallucinate. RAG grounds them in your documents and works around stale training data. Agents go further: they reason, call tools, observe results, and iterate. Together, they power every serious LLM application in production today.
INDEXING (offline)
PDF/HTML/TXT → Text extraction → Chunker (512 tok, 50 overlap)
→ Embedding model → Vector store (Chroma / FAISS / Pinecone)
QUERYING (online)
User question → Embed query → ANN search top-20
→ BM25 sparse retrieval top-20 → Hybrid merge (RRF)
→ Cross-encoder reranker → Top-5 chunks
→ Prompt assembly: [system] + [chunks] + [question]
→ LLM → Grounded answer + source citations
import os
import json
from pathlib import Path
from typing import List, Dict, Any
import numpy as np
import anthropic

# ── 1. DOCUMENT LOADING & CHUNKING ────────────────────────────────────
from langchain_community.document_loaders import (
    PyPDFLoader, DirectoryLoader, TextLoader, WebBaseLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents from a directory
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")

# RecursiveCharacterTextSplitter: tries paragraph → sentence → word → char splits
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters here (length_function=len), roughly 128 tokens
    chunk_overlap=64,   # overlap prevents cross-boundary context loss
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    add_start_index=True,
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
print(f"Sample: {chunks[0].page_content[:200]}")

# ── 2. EMBEDDING + VECTOR STORE ───────────────────────────────────────
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",   # fast, small, excellent quality
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Create persistent Chroma vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)
print(f"Stored {vectorstore._collection.count()} chunks in Chroma")

# Load existing store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_docs"
)

# ── 3. RETRIEVAL STRATEGIES ───────────────────────────────────────────
# Dense retrieval — embedding similarity
dense_retriever = vectorstore.as_retriever(
    search_type="similarity", search_kwargs={"k": 20}
)

# MMR (Maximum Marginal Relevance) — diverse results, less redundancy
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 10, "fetch_k": 50, "lambda_mult": 0.5}
)

# BM25 sparse retrieval (keyword-based)
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(chunks, k=20)

# Hybrid retrieval: merge dense + sparse with Reciprocal Rank Fusion
from langchain.retrievers import EnsembleRetriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]   # dense usually more important
)

# ── 4. CROSS-ENCODER RERANKING ────────────────────────────────────────
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def retrieve_and_rerank(query: str, k_retrieve: int = 20, k_final: int = 5):
    # Step 1: fast retrieval (top-20)
    candidates = hybrid_retriever.invoke(query)[:k_retrieve]
    # Step 2: cross-encoder reranking (top-5)
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:k_final]]

# ── 5. RAG WITH CLAUDE — full pipeline ─────────────────────────────────
client = anthropic.Anthropic()

def rag_query(question: str, stream: bool = True) -> str:
    # Retrieve relevant chunks
    relevant_docs = retrieve_and_rerank(question, k_retrieve=20, k_final=5)

    # Build context block
    context_parts = []
    for i, doc in enumerate(relevant_docs, 1):
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', '?')
        context_parts.append(
            f"[Source {i}: {source}, p.{page}]\n{doc.page_content}")
    context = "\n\n---\n\n".join(context_parts)

    prompt = f"""Use ONLY the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Always cite the source number(s) you used.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

    if stream:
        print("Answer: ", end="")
        with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        ) as s:
            full_response = ""
            for text in s.text_stream:
                print(text, end="", flush=True)
                full_response += text
        print()
        return full_response
    else:
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return resp.content[0].text

# Usage
answer = rag_query("What are the main risk factors described in the report?")

# ── 6. TOOL-CALLING AGENT WITH CLAUDE ─────────────────────────────────
import math, subprocess

tools = [
    {
        "name": "calculator",
        "description": "Evaluate a mathematical expression and return the result.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string",
                               "description": "Math expression, e.g. '2**32 / 1024'"}
            },
            "required": ["expression"]
        }
    },
    {
        "name": "search_docs",
        "description": "Search the document corpus for relevant information.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute a Python expression and return the output.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python expression to evaluate"}
            },
            "required": ["code"]
        }
    }
]

def execute_tool(name: str, inputs: dict) -> str:
    """Dispatch tool calls to actual functions."""
    if name == "calculator":
        try:
            result = eval(inputs["expression"], {"__builtins__": {}},
                          {"math": math, "sqrt": math.sqrt})
            return f"Result: {result}"
        except Exception as e:
            return f"Error: {e}"
    elif name == "search_docs":
        docs = retrieve_and_rerank(inputs["query"], k_final=3)
        return "\n\n".join(d.page_content[:400] for d in docs)
    elif name == "run_python":
        try:
            return str(eval(inputs["code"]))
        except Exception as e:
            return f"Error: {e}"
    return "Unknown tool"

def run_agent(user_message: str, max_steps: int = 8) -> str:
    """ReAct-style agentic loop with tool use."""
    messages = [{"role": "user", "content": user_message}]
    system = ("You are a helpful research assistant with access to tools. "
              "Use them to answer questions accurately.")

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages
        )
        # Append assistant response to history
        messages.append({"role": "assistant", "content": response.content})

        # Check if done
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, 'text'):
                    return block.text
            return "Done"

        # Execute tool calls
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  → Tool: {block.name}({block.input})")
                    result = execute_tool(block.name, block.input)
                    print(f"  ← Result: {result[:100]}...")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})

    return "Max steps reached"

# Run an agent
result = run_agent(
    "Find what the document says about Q3 revenue, then calculate its percentage "
    "change compared to a baseline of $4.2M."
)
print("Agent result:", result)

# ── 7. EVALUATE RAG WITH RAGAS ────────────────────────────────────────
from ragas import evaluate as ragas_eval
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall
)
from datasets import Dataset as HFDataset

# Build evaluation dataset
eval_data = {
    "question": ["What is the revenue for Q3?", "Who is the CEO?"],
    "answer": ["Q3 revenue was $4.8M.", "Jane Smith is the CEO."],
    "contexts": [["Q3 revenue reached $4.8 million..."],
                 ["Jane Smith was appointed CEO in 2022..."]],
    "ground_truth": ["$4.8M", "Jane Smith"]
}
ragas_dataset = HFDataset.from_dict(eval_data)

ragas_results = ragas_eval(
    ragas_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(ragas_results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}
| Strategy | Chunk Size | Best For | Notes |
|---|---|---|---|
| Fixed size | 512 tokens | General purpose | Simple; can split mid-sentence |
| Recursive character | 512 tokens | Most documents | Respects paragraph/sentence structure |
| Semantic chunking | Variable | Coherent passages | Splits on embedding similarity drops |
| Document-aware | Variable | PDFs, HTML | Respects headers, sections, tables |
| Small-to-big | 128 + 512 | Precision + context | Retrieve small chunks, return parent doc |
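The semantic-chunking row above splits wherever embedding similarity between consecutive sentences drops; a minimal sketch, where the naive sentence split and the 0.5 threshold are simplifying assumptions:

```python
# Sketch: semantic chunking. Start a new chunk when consecutive-sentence
# similarity falls below a threshold; sentence split and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, threshold: float = 0.5):
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embs = embedder.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(embs[i - 1], embs[i]))   # cosine (vectors are normalized)
        if sim < threshold:
            chunks.append(". ".join(current) + ".")
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(". ".join(current) + ".")
    return chunks
```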
Always bound the agent loop with a hard limit such as max_steps=10. Catch all tool exceptions and return error strings; never let a crashed tool kill the agent loop.