LLM Playground

An interactive journey through Large Language Models

LLM Overview and Foundations

Understanding the core concepts behind Large Language Models

What are LLMs?

Large Language Models are AI systems trained on vast amounts of text data to understand and generate human-like text. They learn patterns, grammar, facts, and reasoning abilities from the training data.

Key Characteristics:

  • Billions of parameters
  • Self-supervised learning
  • Emergent capabilities
  • Context understanding

Evolution Timeline

2017 Transformer Architecture (Attention Is All You Need)
2018 GPT-1 & BERT
2019 GPT-2 (1.5B parameters)
2020 GPT-3 (175B parameters)
2022 ChatGPT
2023-24 GPT-4, Claude, LLaMA 1/2/3, Mixtral

LLM Training Pipeline

1. Data Collection: web crawling, books, code
2. Pre-Training: next token prediction
3. Post-Training: SFT + RLHF
4. Evaluation: benchmarks & testing
5. Deployment: API & applications

Pre-Training

The foundation of LLM capabilities through large-scale training

Data Collection

Manual Crawling

Targeted collection from specific high-quality sources like Wikipedia, academic papers, books, and curated websites.

# Example: simple web crawler
import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collapse runs of whitespace in the extracted text
    return ' '.join(soup.get_text().split())

Common Crawl

A massive open repository of web crawl data, containing petabytes of raw web page content collected over years.

250B+ Pages
3PB+ Raw Data
Monthly Updates

Data Cleaning

RefinedWeb

High-quality filtered web data using strict deduplication and quality filtering.

  • URL filtering
  • Text extraction
  • Language identification
  • Quality scoring

Dolma

Open corpus for language model pre-training with documented curation pipeline.

  • Multi-source mixing
  • Deduplication
  • Content filtering
  • Reproducible pipeline

FineWeb

15 trillion token dataset with aggressive deduplication and filtering.

  • MinHash deduplication
  • Quality classifiers
  • Educational content boost
  • Open and reproducible

Interactive: Data Cleaning Pipeline

Tokenization

Byte Pair Encoding (BPE)

BPE is a subword tokenization algorithm that iteratively merges the most frequent pairs of characters or character sequences.

Step 1: Character Vocabulary

Start with individual characters: ['l', 'o', 'w', 'e', 'r', 's', 't']

Step 2: Count Pairs

Find most frequent adjacent pairs in corpus

Step 3: Merge

Merge most frequent pair: 'l' + 'o' → 'lo'

Step 4: Repeat

Continue until vocabulary size is reached
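The four steps above can be sketched in plain Python; the toy corpus and its word frequencies are invented for illustration:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across a corpus of {word: frequency}."""
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Toy corpus: words pre-split into characters, with occurrence counts
corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2,
          ('n', 'e', 'w', 'e', 's', 't'): 6}
for _ in range(3):  # three merge steps
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
```

In this tiny corpus the first merge is ('w', 'e'), since "we" occurs in both "lower" and "newest"; production tokenizers run thousands of such merges over byte-level vocabularies.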

Interactive: Tokenization Visualizer


Architecture

Self-Attention

  • Q, K, V projections
  • Scaled dot-product attention
  • Multi-head mechanism

Feed-Forward

  • Two linear layers
  • ReLU/GELU activation
  • Dimension expansion

Layer Norm

  • Stabilizes training
  • Pre-norm vs Post-norm

Attention Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
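The formula translates directly into NumPy; this is a minimal single-head sketch, without masking, learned projections, or the multi-head split:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over the keys, so every output row is a convex combination of the value vectors.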

Interactive: Attention Visualization

Model | Parameters               | Context Length | Training Data     | Key Features
GPT-1 | 117M                     | 512            | ~5GB (BookCorpus) | Decoder-only, pre-training + fine-tuning
GPT-2 | 1.5B                     | 1024           | 40GB (WebText)    | Zero-shot learning, larger scale
GPT-3 | 175B                     | 2048           | ~570GB            | Few-shot learning, emergent abilities
GPT-4 | ~1.8T (MoE, unconfirmed) | 8K-128K        | Undisclosed       | Multimodal, advanced reasoning

Model     | Parameters | Context Length | Key Innovations
LLaMA 1   | 7B-65B     | 2048           | RMSNorm, RoPE, SwiGLU
LLaMA 2   | 7B-70B     | 4096           | Grouped Query Attention, longer training
LLaMA 3   | 8B-70B     | 8192           | New tokenizer, improved data mix
LLaMA 3.1 | 8B-405B    | 128K           | Massive scale, tool use, multilingual

Key Architectural Features:

RMSNorm

Simpler normalization without mean subtraction

RoPE

Rotary Position Embeddings for better position encoding

SwiGLU

Improved activation function in FFN layers

GQA

Grouped Query Attention for efficient inference

Text Generation

Greedy Search

Always select the token with the highest probability.

Pros
  • Fast and simple
  • Deterministic
Cons
  • Repetitive output
  • Misses better sequences

Beam Search

Maintain multiple candidate sequences (beams) and select the best overall.

Pros
  • Better global optimization
  • More coherent outputs
Cons
  • Computationally expensive
  • Can still be repetitive

Top-k Sampling

Sample from the k most likely next tokens.

Top-p (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds p.
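Greedy, top-k, and top-p selection all reduce to a few lines over the next-token probability vector; here is a minimal NumPy sketch (beam search needs a full decoding loop and is omitted):

```python
import numpy as np

def greedy(probs):
    """Always pick the single most likely token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k, rng):
    """Sample among only the k most likely tokens."""
    idx = np.argsort(probs)[-k:]           # indices of the k most likely tokens
    p = probs[idx] / probs[idx].sum()      # renormalize over the kept set
    return int(rng.choice(idx, p=p))

def top_p_sample(probs, p, rng):
    """Sample from the smallest set whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    cdf = np.cumsum(probs[order])
    cut = int(np.searchsorted(cdf, p)) + 1  # smallest nucleus with mass >= p
    idx = order[:cut]
    q = probs[idx] / probs[idx].sum()
    return int(rng.choice(idx, p=q))
```

Note how top-p adapts: a peaked distribution yields a tiny nucleus (near-greedy behavior), while a flat one keeps many candidates.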

Interactive: Text Generation Simulator

Post-Training

Aligning LLMs with human preferences and specific tasks

Supervised Fine-Tuning (SFT)

What is SFT?

SFT involves training the model on a curated dataset of (instruction, response) pairs to learn to follow instructions and provide helpful responses.

Instruction Dataset → Fine-tune Model → Instruction-following LLM

SFT Data Examples

Instruction:

Explain quantum computing in simple terms.

Response:

Quantum computing uses quantum mechanics to process information. Unlike regular computers that use bits (0 or 1), quantum computers use qubits that can be both 0 and 1 simultaneously...

Popular SFT Datasets:

Alpaca

52K instruction-following examples generated with OpenAI's text-davinci-003

ShareGPT

Human-ChatGPT conversations shared by users

FLAN

Large-scale instruction tuning collection

OpenAssistant

Crowd-sourced conversation trees

Reinforcement Learning from Human Feedback (RLHF)

Step 1: Collect Comparisons

Human annotators compare multiple model outputs and rank them by preference.

Step 2: Train Reward Model

Train a model to predict human preferences based on comparison data.

Step 3: Optimize with RL

Use PPO to optimize the LLM policy against the reward model.

Reward Models

Reward models learn to predict human preferences from comparison data.

# Reward model training objective (Bradley-Terry)
import math

def reward_loss(r_preferred, r_rejected):
    # -log sigmoid(r_w - r_l): push the preferred reward above the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

PPO (Proximal Policy Optimization)

PPO optimizes the policy while preventing excessively large policy updates.

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]
  • Clipped objective prevents large updates
  • KL penalty to stay close to SFT model
  • Value function for advantage estimation
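The clipped objective itself is cheap to compute once the probability ratios and advantage estimates exist; a minimal NumPy sketch (KL penalty and value loss omitted):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize).

    ratio: r_t(θ) = π_θ(a|s) / π_old(a|s) for each sampled action
    advantage: estimated Â_t for the same actions
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the elementwise min makes the objective pessimistic:
    # large ratio changes cannot increase it beyond the clip boundary.
    return -np.minimum(unclipped, clipped).mean()
```

With ratio 1.5 and advantage +1.0 (eps = 0.2), the clipped term 1.2 wins the min, so pushing the ratio further gains nothing, which is exactly what bounds the update size.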

Verifiable Tasks

Tasks with objectively verifiable outputs enable automatic reward signals.

Math Problems
Code Execution
Game Outcomes
Fact Verification

DPO (Direct Preference Optimization)

A simpler alternative to PPO that optimizes directly on preference pairs, without training a separate reward model.

L_DPO = -log σ(β log(π_θ(y_w)/π_ref(y_w)) - β log(π_θ(y_l)/π_ref(y_l)))
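For a single preference pair the loss is a one-liner, assuming the per-response log-probabilities under the policy and the frozen reference model have already been computed:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid of the beta-scaled margin between chosen (w) and rejected (l).

    logp_*: sum of token log-probs under the policy π_θ
    ref_logp_*: same sums under the frozen reference π_ref
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log σ(margin)
```

At initialization the policy equals the reference, the margin is 0, and the loss is log 2; raising the chosen response's likelihood relative to the reference drives the loss down.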

Interactive: Reward Model Simulator

Task: Compare these two responses

Prompt: "How do I make a cup of tea?"

Response A

Boil water. Put tea bag in cup. Pour water. Wait 3-5 minutes. Remove bag. Add milk/sugar if desired. Enjoy!

Response B

Making tea is easy. Just use hot water and a tea bag. It's a popular drink worldwide with many varieties.

Evaluation

Measuring LLM performance across various dimensions

Traditional Metrics

Perplexity

Measures how well the model predicts the test data. Lower is better.

PPL = exp(-(1/N) Σ_i log P(x_i | x_<i))
Example: GPT-3 reports ~20.5 PPL (zero-shot) on Penn Treebank
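Given the natural-log probabilities a model assigned to each token, the formula is a one-liner:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(negative mean log-likelihood) over the token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

As a sanity check, a model that assigns uniform probability 1/4 to every token has perplexity exactly 4: it is "as confused" as choosing among four options.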

BLEU Score

Measures n-gram overlap with reference text. Used for translation.

Scale (0-100): 0-25 poor, 25-50 understandable, 50-75 good, 75-100 high quality

ROUGE Score

Recall-oriented measure for summarization tasks.

  • ROUGE-N: N-gram overlap
  • ROUGE-L: Longest common subsequence
  • ROUGE-W: Weighted LCS

F1 Score

Harmonic mean of precision and recall for QA tasks.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
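For extractive QA, precision and recall are typically computed over overlapping tokens between prediction and reference (SQuAD-style); a small sketch:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1: harmonic mean of token precision and recall."""
    pred, ref = prediction.split(), reference.split()
    # Counter intersection handles repeated tokens correctly
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For "the cat sat" vs. reference "the cat": precision is 2/3, recall is 1, so F1 = 0.8.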

Task-Specific Benchmarks

MMLU

Massive Multitask Language Understanding

57 subjects, 15K questions

HellaSwag

Commonsense reasoning

70K examples

TruthfulQA

Truthfulness evaluation

817 questions

HumanEval

Code generation

164 programming problems

GSM8K

Grade school math

8.5K math problems

MT-Bench

Multi-turn conversation quality

80 multi-turn questions

Model Performance Comparison

Human Evaluation & Leaderboards

Human Evaluation Methods

  • Absolute Rating

    Rate outputs on a Likert scale (1-5)

  • Pairwise Comparison

    Compare two outputs and pick the better one

  • Ranking

    Rank multiple outputs from best to worst

Chatbot Arena (LMSYS)

Crowdsourced platform for blind model comparisons using Elo ratings.

Rank | Model             | Elo Rating
1    | GPT-4o            | 1287
2    | Claude 3.5 Sonnet | 1271
3    | Gemini 1.5 Pro    | 1260
4    | Llama 3.1 405B    | 1248
5    | GPT-4 Turbo       | 1243

Ratings are approximate and change over time
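A single Elo update after one pairwise vote looks like this; the K-factor of 32 is a common but arbitrary choice (the arena's actual rating procedure is more involved):

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """Update two ratings after one head-to-head comparison."""
    # Expected score of A from the logistic curve on the rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Two equally rated models exchange k/2 points on a win; an upset against a much stronger opponent transfers nearly the full k.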

Chatbot Design Overview

Building production-ready conversational AI systems

Chatbot Architecture

User Interface Layer: Web/Mobile App, API Endpoint, Voice Interface
Orchestration Layer: Conversation Manager, Context Handler, Router
Processing Layer: LLM Inference, RAG Pipeline, Tool Calling
Data Layer: Vector Store, Memory Store, Knowledge Base

Key Components

Conversation Memory

Managing context across multiple turns.

  • Buffer Memory (recent messages)
  • Summary Memory (compressed history)
  • Entity Memory (key information extraction)
  • Vector Memory (semantic retrieval)
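Buffer memory, the simplest of the four, can be sketched as a fixed-size message window; the class and field names here are hypothetical, not from any particular framework:

```python
class BufferMemory:
    """Keep only the most recent messages, up to a fixed turn budget."""

    def __init__(self, max_turns=10):
        self.max_turns = max_turns
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages once the window is full
        self.messages = self.messages[-self.max_turns:]

    def context(self):
        """Messages to prepend to the next LLM call."""
        return list(self.messages)
```

Summary, entity, and vector memory all trade this simplicity for longer effective history: they compress or index old turns instead of discarding them.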

RAG (Retrieval Augmented Generation)

Enhance LLM responses with external knowledge.

Query → Embed → Retrieve → Augment → Generate
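A toy version of the retrieve-and-augment steps, assuming document and query embeddings already exist (the vectors below are placeholders, not outputs of a real embedding model):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, top_n=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    best = np.argsort(d @ q)[::-1][:top_n]  # highest similarity first
    return [docs[i] for i in best]

def build_prompt(question, passages):
    """Augment: splice retrieved passages into the generation prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {question}")
```

The final "Generate" step would send `build_prompt(...)` to the LLM; in production the retrieval runs against a vector store rather than an in-memory matrix.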

Function/Tool Calling

Enable LLMs to interact with external systems.

{
  "name": "get_weather",
  "parameters": {
    "location": "San Francisco",
    "unit": "celsius"
  }
}

Safety & Guardrails

Ensuring safe and appropriate responses.

  • Input validation
  • Content filtering
  • Output moderation
  • Rate limiting

System Prompt Engineering

Role Definition

Define who the assistant is and its capabilities.

"You are a helpful customer service agent for TechCorp..."
Behavioral Guidelines

Specify how the assistant should behave.

"Always be polite and professional. If you don't know something, admit it."
Constraints & Boundaries

Define what the assistant should NOT do.

"Never share personal information. Don't provide medical advice."
Response Format

Specify the expected output structure.

"Respond in JSON format with fields: answer, confidence, sources."

Interactive: Mini Chatbot Demo


Interactive Playground

Experiment with LLM concepts hands-on

Tokenizer

Explore how text gets converted to tokens

Attention

Visualize self-attention patterns

Generation

Compare different sampling methods

Embeddings

Explore semantic similarity