How Language Models Work

A hands-on guide to the core ideas behind language models, from the ground up.

1 What is a Language Model?

A language model is a program that predicts the next token given what came before. That's it. When you type "The cat sat on the", a language model figures out that "mat" is more likely than "xylophone". By chaining these predictions together, it can generate whole paragraphs of coherent text.

Core idea

Given a sequence of text, predict what comes next. Each prediction is a probability distribution over all possible next tokens.

Why it works

Language has patterns. After learning millions of patterns from text, the model captures grammar, facts, and even reasoning — all from next-token prediction.

Scale matters

Frontier models can have trillions of parameters. Our tiny model has ~20,000. Same idea, vastly different scale. But the fundamentals are identical.

2 Tokenization

Before a model can process text, it needs to break it into tokens — meaningful pieces that each get a numeric ID. Our model uses Byte Pair Encoding (BPE), the same algorithm used by GPT and most modern LLMs. BPE starts with individual characters and iteratively merges the most frequent adjacent pairs into new tokens like "the", "ing", or "at".

// BPE Algorithm:
1. Start with individual characters as tokens
2. Count all adjacent token pairs
3. Merge the most frequent pair into a new token
4. Repeat until vocab_size is reached

// Result: common patterns become single tokens
"the" → [token_42],   " and" → [token_87],   "ing" → [token_63]
vocab_size = base chars + number of merges (typically 100-300)
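The merge loop above can be sketched in a few lines of Python. This is a toy implementation for illustration, not the exact tokenizer the Lab uses:

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    tokens = list(text)  # 1. start with individual characters as tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # 2. count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # 3. pick the most frequent
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):  # replace every occurrence of the pair
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged  # 4. repeat with the new, shorter token list
    return tokens, merges

tokens, merges = bpe_train("the cat and the hat", 3)
# first learned merge is ('t', 'h'), since "th" is a frequent pair here
```

Note that merging never loses information: joining the final tokens back together reproduces the original text exactly.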

3 Embeddings

After tokenization, each token is just a number — an ID like 0, 1, or 42. But a number alone doesn't tell the model anything about what a token means. The model needs a richer representation: a list of numbers (a vector) that captures a token's properties. That's what an embedding is.

Think of it like coordinates. A city can be described by two numbers (latitude, longitude) that tell you where it is relative to other cities. An embedding does the same thing for a token, but in many more dimensions. Each dimension might loosely capture something like "how noun-like is this?" or "does this appear at the start of sentences?" — though the model discovers these dimensions on its own during training.

The lookup table

The model stores a big table (matrix) where each row is one token's embedding. To get the embedding for token #42, it just looks up row 42. The number of dimensions is a configurable parameter — our tiny model defaults to 8, while GPT-style models use thousands. You can adjust this in the Lab.

Similar tokens cluster

As the model trains, tokens that behave similarly — like "the" and "a", or "run" and "walk" — end up with similar embeddings (nearby vectors). You can see this in the embedding plot in the Lab.

Context window

The model looks at several tokens at once (a "context window" of 10 tokens in our case). It concatenates their embeddings into one long vector, giving the neural network a complete picture of the recent context.

// The embedding table: one row per token, each row has embed_dim numbers
C = matrix of shape [vocab_size, embed_dim]

// Example: token "the" has ID 42
// Its embedding is row 42 of the table — a vector of 8 numbers
C[42] = [0.23, -0.71, 0.45, 0.12, ...]

// To feed 10 tokens into the network, concatenate their embeddings
input = concat(C[tok_0], C[tok_1], C[tok_2], C[tok_3], C[tok_4], ...)
// Result: a single vector of 10 × 8 = 80 numbers
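A minimal Python sketch of the lookup-and-concatenate step. The sizes here (5 tokens, 8 dimensions, context of 3) are toy values chosen for illustration, not the Lab's defaults:

```python
import random

vocab_size, embed_dim, context = 5, 8, 3  # toy sizes for illustration
random.seed(0)
# The embedding table: one row per token ID, each row has embed_dim numbers
C = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]

def embed_context(token_ids):
    """Look up each token's row in C and concatenate into one input vector."""
    vec = []
    for t in token_ids:
        vec.extend(C[t])  # row t of the table is token t's embedding
    return vec

x = embed_context([4, 2, 0])
assert len(x) == context * embed_dim  # 3 × 8 = 24 numbers in one flat vector
```

In a real model the rows of `C` start random, exactly as here, and training gradually moves them so that similar tokens end up with similar rows.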

4 The Neural Network

The concatenated embeddings flow through a neural network that learns to map patterns in the input to predictions about the next token. Our model uses one hidden layer with a tanh activation function.

Input (10 tokens) → Embeddings (10 × 8 = 80 numbers) → Hidden layer → Predictions (softmax → probs)
// Forward pass through the network
hidden = tanh(input_vector × W1 + b1)   // 128 neurons learn patterns
logits = hidden × W2 + b2   // raw scores for each token
probs = softmax(logits)   // convert to probabilities (sum to 1)

// Softmax formula: turns any numbers into probabilities
softmax(x_i) = exp(x_i) / sum(exp(x_j) for all j)
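The same forward pass in plain Python. The layer sizes here (4 inputs, 3 hidden neurons, vocabulary of 5) are assumptions chosen to keep the example small; each row of `W1` holds one hidden neuron's incoming weights:

```python
import math, random

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]      # probabilities that sum to 1

def forward(x, W1, b1, W2, b2):
    """One tanh hidden layer, then softmax over the vocabulary."""
    hidden = [math.tanh(sum(xi * wi for xi, wi in zip(x, w)) + b)
              for w, b in zip(W1, b1)]    # each w: one neuron's weights
    logits = [sum(hi * wi for hi, wi in zip(hidden, w)) + b
              for w, b in zip(W2, b2)]    # raw score for each token
    return softmax(logits)

# Toy sizes assumed for illustration: 4 inputs, 3 hidden neurons, vocab of 5
random.seed(1)
W1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
b1 = [0.0] * 3
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
b2 = [0.0] * 5
probs = forward([0.2, -0.4, 0.1, 0.9], W1, b1, W2, b2)  # one prob per token
```

Subtracting the maximum logit before exponentiating doesn't change the result (it cancels in the division) but prevents `exp` from overflowing on large logits.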

5 Training

Training is how the model learns. We show it examples from the text: "given these 10 tokens, the next one should be X." The model makes a prediction, we measure how wrong it was (the loss), and then nudge all the parameters slightly to make a better prediction next time. This is called gradient descent.

// 1. Loss: how wrong was the prediction?
loss = -log(probability assigned to the correct token)
// If model gave 90% to correct answer: loss = 0.105 (low = good)
// If model gave 1% to correct answer: loss = 4.605 (high = bad)

// 2. Gradients: which direction to adjust each parameter?
gradients = d(loss) / d(parameters)   // computed via backpropagation

// 3. Update: nudge parameters to reduce loss
parameters = parameters - learning_rate × gradients
// learning_rate controls step size (typically 0.001 to 0.1)
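A Python sketch of the loss and the update step. The gradient formula used here, (prob − one-hot target), is the standard closed form for softmax followed by cross-entropy loss:

```python
import math

def cross_entropy(probs, target):
    """loss = -log(probability assigned to the correct token)."""
    return -math.log(probs[target])

# Matches the worked numbers above: -log(0.9) ≈ 0.105, -log(0.01) ≈ 4.605

def logit_gradients(probs, target):
    # For softmax + cross-entropy, d(loss)/d(logit_i) is simply
    # (prob_i - 1) for the correct token and prob_i for every other token.
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def sgd_step(params, grads, learning_rate=0.05):
    # Nudge each parameter against its gradient to reduce the loss
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

Backpropagation then carries these logit gradients backward through `W2`, the tanh layer, `W1`, and the embedding table, so every parameter gets its own nudge.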

Batch training

Instead of updating after every single example, we average gradients over a "mini-batch" of examples (e.g., 64 at a time). This makes training faster and more stable.
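Averaging over a mini-batch is just an element-wise mean of the per-example gradients. A minimal sketch, assuming each example's gradients have been flattened into one list:

```python
def batch_gradients(per_example_grads):
    """Average gradients over a mini-batch, element-wise."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] for g in per_example_grads) / n for i in range(dim)]

# Two examples' gradients for two parameters → their element-wise mean
avg = batch_gradients([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

One parameter update per averaged batch replaces 64 noisy single-example updates, which is both faster and less erratic.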

Overfitting

With enough training, the model memorizes the training text exactly. For our tiny model, this is expected! Real LLMs train on trillions of tokens, far more than they can memorize, which forces them to generalize instead.

6 Generation

To generate text, we start with a seed and repeatedly predict the next token. Each prediction is sampled from the probability distribution — we don't always pick the most likely token. The temperature parameter controls randomness.

// Temperature controls randomness
scaled_logits = logits / temperature

temperature = 0.1 → almost always picks the top prediction (nearly deterministic)
temperature = 1.0 → samples proportionally from the distribution
temperature = 2.0 → more random, more creative/chaotic
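A Python sketch of temperature sampling, assuming we start from raw logits. Dividing by a small temperature stretches the gaps between logits, so softmax concentrates almost all probability on the top token; a large temperature flattens the distribution:

```python
import math, random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Scale logits by temperature, softmax, then sample proportionally."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)                              # stability trick, as before
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                             # pick index i with prob probs[i]
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
random.seed(0)
# At temperature 0.1 the top token (index 0) wins almost every draw
picks = [sample_with_temperature(logits, 0.1) for _ in range(100)]
```

Generation then loops: embed the last 10 tokens, run the forward pass, sample one token at the chosen temperature, append it, and repeat.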

The Lab

Train your own tiny language model right in the browser. Add text, hit train, and watch it learn.

1. Training Data

Paste text, load a sample, or upload a .txt file. More text = better patterns to learn.