LLM Overview and Foundations
Understanding the core concepts behind Large Language Models
What are LLMs?
Large Language Models are AI systems trained on vast amounts of text data to understand and generate human-like text. They learn patterns, grammar, facts, and reasoning abilities from the training data.
Key Characteristics:
- Billions of parameters
- Self-supervised learning
- Emergent capabilities
- Context understanding
Evolution Timeline
LLM Training Pipeline
Data Collection
Web crawling, books, code
Pre-Training
Next token prediction
Post-Training
SFT + RLHF
Evaluation
Benchmarks & testing
Deployment
API & applications
Pre-Training
The foundation of LLM capabilities through large-scale training
Data Collection
Manual Crawling
Targeted collection from specific high-quality sources like Wikipedia, academic papers, books, and curated websites.
# Example: Simple web crawler
# (clean_text is a placeholder for your own cleaning step)
import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    text = soup.get_text(separator=' ', strip=True)
    return clean_text(text)
Common Crawl
A massive open repository of web crawl data, containing petabytes of raw web page content collected over years.
Data Cleaning
RefinedWeb
High-quality filtered web data using strict deduplication and quality filtering.
- URL filtering
- Text extraction
- Language identification
- Quality scoring
Dolma
Open corpus for language model pre-training with documented curation pipeline.
- Multi-source mixing
- Deduplication
- Content filtering
- Reproducible pipeline
FineWeb
15 trillion token dataset with aggressive deduplication and filtering.
- MinHash deduplication
- Quality classifiers
- Educational content boost
- Open and reproducible
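The filtering and deduplication steps listed above can be sketched in a few lines. This is a toy stand-in for the exact-hash stage of pipelines like RefinedWeb and FineWeb (real systems add MinHash for near-duplicates and learned quality classifiers); the thresholds here are illustrative, not the values those pipelines use:

```python
import hashlib

def dedupe_and_filter(docs, min_words=5, max_symbol_ratio=0.3):
    """Remove exact duplicates and low-quality documents."""
    seen = set()
    kept = []
    for doc in docs:
        text = " ".join(doc.split())  # normalize whitespace
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:  # exact duplicate
            continue
        seen.add(digest)
        words = text.split()
        if len(words) < min_words:  # too short to be useful
            continue
        symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
        if symbols / max(len(text), 1) > max_symbol_ratio:  # mostly markup/noise
            continue
        kept.append(text)
    return kept
```

Each document passes through the same gates in order: dedup first (cheapest to reject), then length and symbol-ratio heuristics.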
Interactive: Data Cleaning Pipeline
Tokenization
Byte Pair Encoding (BPE)
BPE is a subword tokenization algorithm that iteratively merges the most frequent pairs of characters or character sequences.
Step 1: Character Vocabulary
Start with individual characters: ['l', 'o', 'w', 'e', 'r', 's', 't']
Step 2: Count Pairs
Find most frequent adjacent pairs in corpus
Step 3: Merge
Merge most frequent pair: 'l' + 'o' → 'lo'
Step 4: Repeat
Continue until vocabulary size is reached
Interactive: Tokenization Visualizer
Architecture
Self-Attention
Q, K, V projections
Scaled dot-product attention
Multi-head mechanism
Feed-Forward
Two linear layers
ReLU/GELU activation
Dimension expansion
Layer Norm
Stabilizes training
Pre-norm vs Post-norm
Attention Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
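Scaled dot-product attention can be sketched in plain Python. This is a single head with no masking and no learned Q/K/V projections, purely to make the formula concrete:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # attention weights over the keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value vectors, weighted by how strongly the query matches each key.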
Interactive: Attention Visualization
| Model | Parameters | Context Length | Training Data | Key Features |
|---|---|---|---|---|
| GPT-1 | 117M | 512 | ~5GB (BookCorpus) | Decoder-only, pre-training + fine-tuning |
| GPT-2 | 1.5B | 1024 | 40GB (WebText) | Zero-shot learning, larger scale |
| GPT-3 | 175B | 2048 | ~570GB | Few-shot learning, emergent abilities |
| GPT-4 | ~1.8T (MoE, unconfirmed) | 8K-128K | Undisclosed | Multimodal, advanced reasoning |
| Model | Parameters | Context Length | Key Innovations |
|---|---|---|---|
| LLaMA 1 | 7B-65B | 2048 | RMSNorm, RoPE, SwiGLU |
| LLaMA 2 | 7B-70B | 4096 | Grouped Query Attention, longer training |
| LLaMA 3 | 8B-70B | 8192 | New tokenizer, improved data mix |
| LLaMA 3.1 | 8B-405B | 128K | Massive scale, tool use, multilingual |
Key Architectural Features:
RMSNorm
Simpler normalization without mean subtraction
RoPE
Rotary Position Embeddings for better position encoding
SwiGLU
Improved activation function in FFN layers
GQA
Grouped Query Attention for efficient inference
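Of these features, RMSNorm is the simplest to show in code. Unlike LayerNorm it divides by the root-mean-square of the activations without subtracting the mean; this is a plain-Python sketch of the idea, not LLaMA's actual implementation:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale by root-mean-square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]
```

After normalization the mean square of the output (with unit weights) is approximately 1, which is what stabilizes training.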
Text Generation
Greedy Search
Always select the token with the highest probability.
Pros
- Fast and simple
- Deterministic
Cons
- Repetitive output
- Misses better sequences
Beam Search
Maintain multiple candidate sequences (beams) and select the best overall.
Pros
- Better global optimization
- More coherent outputs
Cons
- Computationally expensive
- Can still be repetitive
Top-k Sampling
Sample from the k most likely next tokens.
Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds p.
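The four decoding strategies above differ only in how the next token is chosen from the model's probability distribution. A minimal sketch over a toy distribution (beam search is omitted since it tracks whole sequences, not single steps):

```python
import random

def top_p_filter(probs, p):
    """Smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized distribution

def sample_next(probs, strategy="greedy", k=2, p=0.9, rng=random):
    if strategy == "greedy":   # always the single most likely token
        return max(range(len(probs)), key=lambda i: probs[i])
    if strategy == "top_k":    # sample among the k most likely tokens
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        return rng.choices(order, weights=[probs[i] for i in order])[0]
    if strategy == "top_p":    # sample from the nucleus
        dist = top_p_filter(probs, p)
        return rng.choices(list(dist), weights=list(dist.values()))[0]
    raise ValueError(strategy)
```

With a peaked distribution and a small p, top-p collapses to greedy; with a larger p it behaves more like top-k over a variable-size set.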
Interactive: Text Generation Simulator
Post-Training
Aligning LLMs with human preferences and specific tasks
Supervised Fine-Tuning (SFT)
What is SFT?
SFT involves training the model on a curated dataset of (instruction, response) pairs to learn to follow instructions and provide helpful responses.
SFT Data Examples
Explain quantum computing in simple terms.
Quantum computing uses quantum mechanics to process information. Unlike regular computers that use bits (0 or 1), quantum computers use qubits that can be both 0 and 1 simultaneously...
Popular SFT Datasets:
Alpaca
52K instruction-following examples generated with OpenAI's text-davinci-003 (GPT-3.5)
ShareGPT
Human-ChatGPT conversations shared by users
FLAN
Large-scale instruction tuning collection
OpenAssistant
Crowd-sourced conversation trees
Reinforcement Learning from Human Feedback (RLHF)
Step 1: Collect Comparisons
Human annotators compare multiple model outputs and rank them by preference.
Step 2: Train Reward Model
Train a model to predict human preferences based on comparison data.
Step 3: Optimize with RL
Use PPO to optimize the LLM policy against the reward model.
Reward Models
Reward models learn to predict human preferences from comparison data.
# Reward model training objective
import math

def reward_loss(r_preferred, r_rejected):
    # Bradley-Terry model: -log sigmoid(r(preferred) - r(rejected)),
    # applied here to precomputed scalar reward scores
    return -math.log(1 / (1 + math.exp(-(r_preferred - r_rejected))))
PPO (Proximal Policy Optimization)
PPO optimizes the policy while preventing excessively large policy updates.
- Clipped objective prevents large updates
- KL penalty to stay close to SFT model
- Value function for advantage estimation
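The clipped objective in the first bullet can be written down directly. This sketch evaluates the surrogate for a single (ratio, advantage) pair; a real PPO step averages this over a batch and adds the KL penalty and value loss mentioned above:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to move the policy more than eps away from the old one.
    """
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

When the advantage is positive, gains are capped once the ratio exceeds 1 + eps; when it is negative, the penalty is not softened below 1 - eps.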
Verifiable Tasks
Tasks with objectively verifiable outputs enable automatic reward signals.
DPO (Direct Preference Optimization)
A simpler alternative to PPO that directly optimizes on preferences without a reward model.
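DPO's loss can be written in a few lines: it compares the policy's log-probabilities of the preferred and rejected responses against a frozen reference model, with no separate reward model. This sketch takes precomputed sequence log-probabilities as scalars; beta is the usual temperature on the log-ratio margin:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO: -log sigmoid(beta * (policy/reference log-ratio margin))."""
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    return -math.log(1 / (1 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; it falls as the policy shifts probability toward preferred responses.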
Interactive: Reward Model Simulator
Task: Compare these two responses
Prompt: "How do I make a cup of tea?"
Response A
Boil water. Put tea bag in cup. Pour water. Wait 3-5 minutes. Remove bag. Add milk/sugar if desired. Enjoy!
Response B
Making tea is easy. Just use hot water and a tea bag. It's a popular drink worldwide with many varieties.
Evaluation
Measuring LLM performance across various dimensions
Traditional Metrics
Perplexity
Measures how well the model predicts the test data. Lower is better.
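Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each test token:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

A model that assigns every token probability 1/10 has perplexity exactly 10, so perplexity can be read as an effective branching factor.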
BLEU Score
Measures n-gram overlap with reference text. Used for translation.
ROUGE Score
Recall-oriented measure for summarization tasks.
- ROUGE-N: N-gram overlap
- ROUGE-L: Longest common subsequence
- ROUGE-W: Weighted LCS
F1 Score
Harmonic mean of precision and recall for QA tasks.
Task-Specific Benchmarks
MMLU
Massive Multitask Language Understanding
57 subjects, 15K questions
HellaSwag
Commonsense reasoning
70K examples
TruthfulQA
Truthfulness evaluation
817 questions
HumanEval
Code generation
164 programming problems
GSM8K
Grade school math
8.5K math problems
MT-Bench
Multi-turn conversation quality
80 multi-turn questions
Model Performance Comparison
Human Evaluation & Leaderboards
Human Evaluation Methods
- Absolute Rating: rate outputs on a Likert scale (1-5)
- Pairwise Comparison: compare two outputs and pick the better one
- Ranking: rank multiple outputs from best to worst
Chatbot Arena (LMSYS)
Crowdsourced platform for blind model comparisons using Elo ratings.
Ratings are approximate and change over time
Chatbots' Overall Design
Building production-ready conversational AI systems
Chatbot Architecture
User Interface Layer
Orchestration Layer
Processing Layer
Data Layer
Key Components
Conversation Memory
Managing context across multiple turns.
- Buffer Memory (recent messages)
- Summary Memory (compressed history)
- Entity Memory (key information extraction)
- Vector Memory (semantic retrieval)
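Of the memory types listed, buffer memory is the simplest to sketch: keep only the most recent turns inside the context window. The class name and interface here are illustrative, not from any particular framework:

```python
from collections import deque

class BufferMemory:
    """Buffer memory: retain only the N most recent conversation turns."""

    def __init__(self, max_turns=3):
        self.turns = deque(maxlen=max_turns)  # old turns fall off automatically

    def add(self, role, text):
        self.turns.append((role, text))

    def as_prompt(self):
        """Render the retained turns as a prompt prefix."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)
```

Summary memory would replace the evicted turns with a compressed recap instead of dropping them outright.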
RAG (Retrieval Augmented Generation)
Enhance LLM responses with external knowledge.
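A minimal RAG loop has two steps: retrieve the documents most relevant to the query, then splice them into the prompt. This sketch scores relevance by word overlap purely for illustration; real systems use embedding similarity over a vector store:

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy stand-in
    for embedding similarity) and return the top k."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_rag_prompt(query, documents):
    """Assemble retrieved context and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (f"Answer using the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")
```

The assembled prompt is then sent to the LLM, which grounds its answer in the retrieved passages rather than its parametric memory alone.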
Function/Tool Calling
Enable LLMs to interact with external systems.
{
"name": "get_weather",
"parameters": {
"location": "San Francisco",
"unit": "celsius"
}
}
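On the application side, a tool call like the JSON above is parsed and dispatched to a registered function. The `get_weather` implementation below is a stub invented for this sketch, not a real weather API:

```python
import json

def get_weather(location, unit="celsius"):
    """Stub tool; a real one would call an external weather service."""
    return f"22 degrees {unit} in {location}"

TOOLS = {"get_weather": get_weather}  # name -> callable registry

def dispatch_tool_call(call_json):
    """Parse a model-emitted tool call and invoke the matching function."""
    call = json.loads(call_json)
    func = TOOLS[call["name"]]
    return func(**call["parameters"])
```

The tool's return value is then fed back to the model as an extra message so it can compose the final user-facing answer.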
Safety & Guardrails
Ensuring safe and appropriate responses.
- Input validation
- Content filtering
- Output moderation
- Rate limiting
System Prompt Engineering
Role Definition
Define who the assistant is and its capabilities.
Behavioral Guidelines
Specify how the assistant should behave.
Constraints & Boundaries
Define what the assistant should NOT do.
Response Format
Specify the expected output structure.
Interactive: Mini Chatbot Demo
Interactive Playground
Experiment with LLM concepts hands-on
Tokenizer
Explore how text gets converted to tokens
Attention
Visualize self-attention patterns
Generation
Compare different sampling methods
Embeddings
Explore semantic similarity