LLM Playground

An interactive journey through Large Language Models

LLM Overview and Foundations

Understanding the core concepts behind Large Language Models

What are LLMs?

Large Language Models are AI systems trained on vast amounts of text data to understand and generate human-like text. They learn patterns, grammar, facts, and reasoning abilities from the training data.

Key Characteristics:

  • Billions of parameters
  • Self-supervised learning
  • Emergent capabilities
  • Context understanding

Evolution Timeline

2017 Transformer Architecture (Attention Is All You Need)
2018 GPT-1 & BERT
2019 GPT-2 (1.5B parameters)
2020 GPT-3 (175B parameters)
2022 ChatGPT
2023-24 GPT-4, Claude, LLaMA 1/2/3, Mixtral

LLM Training Pipeline

1. Data Collection: web crawling, books, code
2. Pre-Training: next token prediction
3. Post-Training: SFT + RLHF
4. Evaluation: benchmarks & testing
5. Deployment: API & applications

Pre-Training

The foundation of LLM capabilities through large-scale training

Data Collection

Manual Crawling

Targeted collection from specific high-quality sources like Wikipedia, academic papers, books, and curated websites.

# Example: simple web crawler
import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collapse runs of whitespace in the extracted text
    return ' '.join(soup.get_text().split())

Common Crawl

A massive open repository of web crawl data, containing petabytes of raw web page content collected over years.

250B+ Pages
3PB+ Raw Data
Monthly Updates

Data Cleaning

RefinedWeb

High-quality filtered web data using strict deduplication and quality filtering.

  • URL filtering
  • Text extraction
  • Language identification
  • Quality scoring

Dolma

Open corpus for language model pre-training with documented curation pipeline.

  • Multi-source mixing
  • Deduplication
  • Content filtering
  • Reproducible pipeline

FineWeb

15 trillion token dataset with aggressive deduplication and filtering.

  • MinHash deduplication
  • Quality classifiers
  • Educational content boost
  • Open and reproducible

Interactive: Data Cleaning Pipeline

Tokenization

Byte Pair Encoding (BPE)

BPE is a subword tokenization algorithm that iteratively merges the most frequent pairs of characters or character sequences.

Step 1: Character Vocabulary

Start with individual characters: ['l', 'o', 'w', 'e', 'r', 's', 't']

Step 2: Count Pairs

Find most frequent adjacent pairs in corpus

Step 3: Merge

Merge most frequent pair: 'l' + 'o' → 'lo'

Step 4: Repeat

Continue until vocabulary size is reached
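The four steps above can be sketched in plain Python; the toy corpus and its word frequencies are invented for illustration:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across a corpus of {word: frequency}."""
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Toy corpus: words pre-split into characters, with occurrence counts
corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2,
          ('n', 'e', 'w', 'e', 's', 't'): 6}
for _ in range(3):  # three merge steps
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
```

In this tiny corpus the first merge is ('w', 'e'), since "we" occurs in both "lower" and "newest"; production tokenizers run thousands of such merges over byte-level vocabularies.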

Interactive: Tokenization Visualizer


Architecture

Self-Attention

  • Q, K, V projections
  • Scaled dot-product attention
  • Multi-head mechanism

Feed-Forward

  • Two linear layers
  • ReLU/GELU activation
  • Dimension expansion

Layer Norm

  • Stabilizes training
  • Pre-norm vs Post-norm

Attention Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
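The formula translates directly into NumPy; this is a minimal single-head sketch, without masking, learned projections, or the multi-head split:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over the keys, so every output row is a convex combination of the value vectors.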

Interactive: Attention Visualization

Model | Parameters               | Context Length | Training Data     | Key Features
GPT-1 | 117M                     | 512            | ~5GB (BookCorpus) | Decoder-only, pre-training + fine-tuning
GPT-2 | 1.5B                     | 1024           | 40GB (WebText)    | Zero-shot learning, larger scale
GPT-3 | 175B                     | 2048           | ~570GB            | Few-shot learning, emergent abilities
GPT-4 | ~1.8T (MoE, unconfirmed) | 8K-128K        | Undisclosed       | Multimodal, advanced reasoning

Model     | Parameters | Context Length | Key Innovations
LLaMA 1   | 7B-65B     | 2048           | RMSNorm, RoPE, SwiGLU
LLaMA 2   | 7B-70B     | 4096           | Grouped Query Attention, longer training
LLaMA 3   | 8B-70B     | 8192           | New tokenizer, improved data mix
LLaMA 3.1 | 8B-405B    | 128K           | Massive scale, tool use, multilingual

Key Architectural Features:

RMSNorm

Simpler normalization without mean subtraction

RoPE

Rotary Position Embeddings for better position encoding

SwiGLU

Improved activation function in FFN layers

GQA

Grouped Query Attention for efficient inference

Text Generation

Greedy Search

Always select the token with the highest probability.

Pros
  • Fast and simple
  • Deterministic
Cons
  • Repetitive output
  • Misses better sequences

Beam Search

Maintain multiple candidate sequences (beams) and select the best overall.

Pros
  • Better global optimization
  • More coherent outputs
Cons
  • Computationally expensive
  • Can still be repetitive

Top-k Sampling

Sample from the k most likely next tokens.

Top-p (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds p.
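Greedy, top-k, and top-p selection all reduce to a few lines over the next-token probability vector; here is a minimal NumPy sketch (beam search needs a full decoding loop and is omitted):

```python
import numpy as np

def greedy(probs):
    """Always pick the single most likely token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k, rng):
    """Sample among only the k most likely tokens."""
    idx = np.argsort(probs)[-k:]           # indices of the k most likely tokens
    p = probs[idx] / probs[idx].sum()      # renormalize over the kept set
    return int(rng.choice(idx, p=p))

def top_p_sample(probs, p, rng):
    """Sample from the smallest set whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    cdf = np.cumsum(probs[order])
    cut = int(np.searchsorted(cdf, p)) + 1  # smallest nucleus with mass >= p
    idx = order[:cut]
    q = probs[idx] / probs[idx].sum()
    return int(rng.choice(idx, p=q))
```

Note how top-p adapts: a peaked distribution yields a tiny nucleus (near-greedy behavior), while a flat one keeps many candidates.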

Interactive: Text Generation Simulator

Post-Training

Aligning LLMs with human preferences and specific tasks

Supervised Fine-Tuning (SFT)

What is SFT?

SFT involves training the model on a curated dataset of (instruction, response) pairs to learn to follow instructions and provide helpful responses.

Instruction Dataset → Fine-tune Model → Instruction-following LLM

SFT Data Examples

Instruction:

Explain quantum computing in simple terms.

Response:

Quantum computing uses quantum mechanics to process information. Unlike regular computers that use bits (0 or 1), quantum computers use qubits that can be both 0 and 1 simultaneously...

Popular SFT Datasets:

Alpaca

52K instruction-following examples generated with OpenAI's text-davinci-003

ShareGPT

Human-ChatGPT conversations shared by users

FLAN

Large-scale instruction tuning collection

OpenAssistant

Crowd-sourced conversation trees

Reinforcement Learning from Human Feedback (RLHF)

Step 1: Collect Comparisons

Human annotators compare multiple model outputs and rank them by preference.

Step 2: Train Reward Model

Train a model to predict human preferences based on comparison data.

Step 3: Optimize with RL

Use PPO to optimize the LLM policy against the reward model.

Reward Models

Reward models learn to predict human preferences from comparison data.

# Reward model training objective (Bradley-Terry)
import math

def reward_loss(r_preferred, r_rejected):
    # -log sigmoid(r_w - r_l): push the preferred reward above the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

PPO (Proximal Policy Optimization)

PPO optimizes the policy while preventing excessively large policy updates.

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]
  • Clipped objective prevents large updates
  • KL penalty to stay close to SFT model
  • Value function for advantage estimation
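The clipped objective itself is cheap to compute once the probability ratios and advantage estimates exist; a minimal NumPy sketch (KL penalty and value loss omitted):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize).

    ratio: r_t(θ) = π_θ(a|s) / π_old(a|s) for each sampled action
    advantage: estimated Â_t for the same actions
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the elementwise min makes the objective pessimistic:
    # large ratio changes cannot increase it beyond the clip boundary.
    return -np.minimum(unclipped, clipped).mean()
```

With ratio 1.5 and advantage +1.0 (eps = 0.2), the clipped term 1.2 wins the min, so pushing the ratio further gains nothing, which is exactly what bounds the update size.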

Verifiable Tasks

Tasks with objectively verifiable outputs enable automatic reward signals.

Math Problems
Code Execution
Game Outcomes
Fact Verification

DPO (Direct Preference Optimization)

A simpler alternative to PPO that optimizes directly on preference pairs, without training a separate reward model.

L_DPO = -log σ(β log(π_θ(y_w)/π_ref(y_w)) - β log(π_θ(y_l)/π_ref(y_l)))
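For a single preference pair the loss is a one-liner, assuming the per-response log-probabilities under the policy and the frozen reference model have already been computed:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid of the beta-scaled margin between chosen (w) and rejected (l).

    logp_*: sum of token log-probs under the policy π_θ
    ref_logp_*: same sums under the frozen reference π_ref
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log σ(margin)
```

At initialization the policy equals the reference, the margin is 0, and the loss is log 2; raising the chosen response's likelihood relative to the reference drives the loss down.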

Interactive: Reward Model Simulator

Task: Compare these two responses

Prompt: "How do I make a cup of tea?"

Response A

Boil water. Put tea bag in cup. Pour water. Wait 3-5 minutes. Remove bag. Add milk/sugar if desired. Enjoy!

Response B

Making tea is easy. Just use hot water and a tea bag. It's a popular drink worldwide with many varieties.

Evaluation

Measuring LLM performance across various dimensions

Traditional Metrics

Perplexity

Measures how well the model predicts the test data. Lower is better.

PPL = exp(-(1/N) Σ_i log P(x_i | x_<i))
Example: GPT-3 reports ~20.5 PPL (zero-shot) on Penn Treebank
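Given the natural-log probabilities a model assigned to each token, the formula is a one-liner:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(negative mean log-likelihood) over the token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

As a sanity check, a model that assigns uniform probability 1/4 to every token has perplexity exactly 4: it is "as confused" as choosing among four options.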

BLEU Score

Measures n-gram overlap with reference text. Used for translation.

Scale (0-100): 0-25 poor, 25-50 understandable, 50-75 good, 75-100 high quality

ROUGE Score

Recall-oriented measure for summarization tasks.

  • ROUGE-N: N-gram overlap
  • ROUGE-L: Longest common subsequence
  • ROUGE-W: Weighted LCS

F1 Score

Harmonic mean of precision and recall for QA tasks.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
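For extractive QA, precision and recall are typically computed over overlapping tokens between prediction and reference (SQuAD-style); a small sketch:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1: harmonic mean of token precision and recall."""
    pred, ref = prediction.split(), reference.split()
    # Counter intersection handles repeated tokens correctly
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For "the cat sat" vs. reference "the cat": precision is 2/3, recall is 1, so F1 = 0.8.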

Task-Specific Benchmarks

MMLU

Massive Multitask Language Understanding

57 subjects, 15K questions

HellaSwag

Commonsense reasoning

70K examples

TruthfulQA

Truthfulness evaluation

817 questions

HumanEval

Code generation

164 programming problems

GSM8K

Grade school math

8.5K math problems

MT-Bench

Multi-turn conversation quality

80 multi-turn questions

Model Performance Comparison

Human Evaluation & Leaderboards

Human Evaluation Methods

  • Absolute Rating

    Rate outputs on a Likert scale (1-5)

  • Pairwise Comparison

    Compare two outputs and pick the better one

  • Ranking

    Rank multiple outputs from best to worst

Chatbot Arena (LMSYS)

Crowdsourced platform for blind model comparisons using Elo ratings.

Rank | Model             | Elo Rating
1    | GPT-4o            | 1287
2    | Claude 3.5 Sonnet | 1271
3    | Gemini 1.5 Pro    | 1260
4    | Llama 3.1 405B    | 1248
5    | GPT-4 Turbo       | 1243

Ratings are approximate and change over time
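A single Elo update after one pairwise vote looks like this; the K-factor of 32 is a common but arbitrary choice (the arena's actual rating procedure is more involved):

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """Update two ratings after one head-to-head comparison."""
    # Expected score of A from the logistic curve on the rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Two equally rated models exchange k/2 points on a win; an upset against a much stronger opponent transfers nearly the full k.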

Chatbot Design Overview

Building production-ready conversational AI systems

Chatbot Architecture

User Interface Layer: Web/Mobile App, API Endpoint, Voice Interface
Orchestration Layer: Conversation Manager, Context Handler, Router
Processing Layer: LLM Inference, RAG Pipeline, Tool Calling
Data Layer: Vector Store, Memory Store, Knowledge Base

Key Components

Conversation Memory

Managing context across multiple turns.

  • Buffer Memory (recent messages)
  • Summary Memory (compressed history)
  • Entity Memory (key information extraction)
  • Vector Memory (semantic retrieval)
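Buffer memory, the simplest of the four, can be sketched as a fixed-size message window; the class and field names here are hypothetical, not from any particular framework:

```python
class BufferMemory:
    """Keep only the most recent messages, up to a fixed turn budget."""

    def __init__(self, max_turns=10):
        self.max_turns = max_turns
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages once the window is full
        self.messages = self.messages[-self.max_turns:]

    def context(self):
        """Messages to prepend to the next LLM call."""
        return list(self.messages)
```

Summary, entity, and vector memory all trade this simplicity for longer effective history: they compress or index old turns instead of discarding them.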

RAG (Retrieval Augmented Generation)

Enhance LLM responses with external knowledge.

Query → Embed → Retrieve → Augment → Generate
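A toy version of the retrieve-and-augment steps, assuming document and query embeddings already exist (the vectors below are placeholders, not outputs of a real embedding model):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, top_n=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    best = np.argsort(d @ q)[::-1][:top_n]  # highest similarity first
    return [docs[i] for i in best]

def build_prompt(question, passages):
    """Augment: splice retrieved passages into the generation prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {question}")
```

The final "Generate" step would send `build_prompt(...)` to the LLM; in production the retrieval runs against a vector store rather than an in-memory matrix.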

Function/Tool Calling

Enable LLMs to interact with external systems.

{
  "name": "get_weather",
  "parameters": {
    "location": "San Francisco",
    "unit": "celsius"
  }
}

Safety & Guardrails

Ensuring safe and appropriate responses.

  • Input validation
  • Content filtering
  • Output moderation
  • Rate limiting

System Prompt Engineering

Role Definition

Define who the assistant is and its capabilities.

"You are a helpful customer service agent for TechCorp..."
Behavioral Guidelines

Specify how the assistant should behave.

"Always be polite and professional. If you don't know something, admit it."
Constraints & Boundaries

Define what the assistant should NOT do.

"Never share personal information. Don't provide medical advice."
Response Format

Specify the expected output structure.

"Respond in JSON format with fields: answer, confidence, sources."

Interactive: Mini Chatbot Demo


Interactive Playground

Experiment with LLM concepts hands-on

Tokenizer

Explore how text gets converted to tokens

Attention

Visualize self-attention patterns

Generation

Compare different sampling methods

Embeddings

Explore semantic similarity