LLM Anatomy
Large Language Models can feel like black boxes. You type a prompt, smart text comes back, and somewhere in the middle billions of parameters supposedly did “AI”. Inside, though, there is no magic: there is a precise pipeline, repeated and optimized down to the last FLOP.
This guide opens that box by following the full data path: input text -> tokens -> vectors -> Transformer blocks -> logits -> generated token.
If you can trace this chain, you can explain most things that matter: limits, cost, latency, hallucinations, output quality.
Tokenization: from text to numbers
An LLM does not read words and sentences the way we do. It reads a sequence of numeric IDs produced by a tokenizer.
The tokenizer splits text into pieces called tokens: a whole word, a subword piece, punctuation. Each token has an ID in the model’s vocabulary.
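As a rough illustration of the text -> pieces -> IDs mapping, here is a minimal sketch that uses greedy longest-match over a tiny hand-written vocabulary. The vocabulary, the IDs, and the `toyTokenize` function are hypothetical; real tokenizers learn their vocabulary from data and typically operate on bytes.

```typescript
// Hypothetical toy vocabulary; real vocabularies hold tens of thousands of learned entries.
const toyVocab: Record<string, number> = {
  "token": 0, "ization": 1, "un": 2, "believ": 3, "able": 4, " ": 5, "!": 6, "<unk>": 7,
};

// Greedy longest-match: at each position take the longest vocabulary entry that fits.
function toyTokenize(text: string): { pieces: string[]; ids: number[] } {
  const pieces: string[] = [];
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let matched = "";
    for (const candidate of Object.keys(toyVocab)) {
      if (candidate.length > matched.length && text.slice(pos, pos + candidate.length) === candidate) {
        matched = candidate;
      }
    }
    // Unknown character: fall back to <unk> and advance by one character.
    const emitted = matched === "" ? "<unk>" : matched;
    pieces.push(emitted);
    ids.push(toyVocab[emitted]);
    pos += matched === "" ? 1 : matched.length;
  }
  return { pieces, ids };
}

const tokenized = toyTokenize("unbelievable tokenization!");
console.log(tokenized.pieces); // ["un", "believ", "able", " ", "token", "ization", "!"]
console.log(tokenized.ids);    // [2, 3, 4, 5, 0, 1, 6]
```

Two words become seven tokens here, which is the point: the model sees IDs, and the count of IDs is what fills the context window.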
Why not use whole words? They are too rigid: new names, typos, inflections, code, multilingual text would constantly produce words the model has never seen. Why not use single characters? That would solve the unknown-word problem but make every input much longer, reducing the usable context window. Subword tokens are the reasonable compromise.
In practice, modern tokenizers use variants of BPE or Unigram. They also include special tokens (start/end of text, separators, tool-call markers, and so on). These are not cosmetic details: they directly shape model behavior in chat, coding, and function calling.
Token cost and context window
Two prompts with similar meaning can have very different token cost. Verbose JSON, code, or mixed-language text tends to explode in token count. This impacts:
- inference cost (API bill or GPU time)
- latency
- room left for useful context
Practical rule: prompt optimization is often token optimization, not word optimization. In practice, you are optimizing the available context window.
Embeddings: from IDs to vectors
Numeric IDs are just labels. Token 15339 is not “close to” token 15340 in any meaningful way.
The embedding layer turns each ID into a vector (a list of floating-point numbers), a point in a high-dimensional space where tokens with similar meanings end up close together.
flowchart LR
A["Token ID<br/>464"] --> B["Embedding layer<br/>(lookup table)"]
B --> C["Vector<br/>[0.72, -0.38, 0.11, ...]"]
This is the moment where discrete symbols enter a continuous space. Once they become vectors, the model can compare them, combine them, transform them.
Important: the initial embedding is context-free. The same token always gets the same vector regardless of meaning: “mole” the animal and “mole” the unit of measurement start from the same embedding. Later layers rewrite that vector based on surrounding words.
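The lookup can be sketched in a few lines. The table below is a hypothetical 4-token vocabulary with 3-dimensional vectors chosen by hand (real models learn these values and use thousands of dimensions); the point is only that the layer is a row lookup and that the same ID always yields the same vector.

```typescript
// Hypothetical embedding table: one row per token id.
const embeddingTable: number[][] = [
  [0.72, -0.38, 0.11], // id 0
  [0.05, 0.91, -0.47], // id 1
  [-0.63, 0.2, 0.34],  // id 2
  [0.7, -0.35, 0.15],  // id 3: deliberately placed close to id 0
];

// The embedding layer is a row lookup: no arithmetic, no context.
function embed(tokenIds: number[]): number[][] {
  return tokenIds.map((id) => {
    const row = embeddingTable[id];
    if (row === undefined) throw new Error(`unknown token id ${id}`);
    return row.slice();
  });
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

const vectors = embed([0, 2, 0, 3]);
// Positions 0 and 2 hold the same id, so they get identical vectors whatever their neighbors are.
console.log(vectors[0], vectors[2]);
// "Close in meaning" is measured geometrically, for example with cosine similarity.
console.log(cosineSimilarity(vectors[0], vectors[3]) > cosineSimilarity(vectors[0], vectors[1])); // true
```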
In many models, the input embedding matrix and the final vocabulary projection matrix are shared (weight tying). This means fewer parameters and often better statistical efficiency.
Attention: how tokens talk to each other
Embeddings alone are not enough: each vector still knows nothing about its neighbors. The model needs tokens to exchange information. This is where the attention mechanism comes in.
For each token, the model creates three learned views:
- Query (Q): what this token is looking for
- Key (K): what this token advertises about itself
- Value (V): the information this token can contribute
The model compares Query with Key to compute attention scores, normalizes them with softmax, and uses those weights to mix the Value vectors. Q and K decide where information flows. V is the information that flows.
Compact form for one head:
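$$ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + M\right)V $$
Here $M$ is the causal mask: it sets future positions to $-\infty$, so the model cannot “cheat” by looking at tokens that have not been generated yet. Dividing by $\sqrt{d_k}$ keeps the scores from growing with the head dimension.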
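The formula can be followed step by step in a small sketch for a single head. Plain nested arrays stand in for tensors, and the names `causalAttention` and `softmaxRow` are illustrative, not taken from any library; production code uses batched matrix kernels instead of explicit loops.

```typescript
type Matrix = number[][];

function softmaxRow(scores: number[]): number[] {
  const maxScore = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Q and K have shape [n][dK], V has shape [n][dV].
function causalAttention(Q: Matrix, K: Matrix, V: Matrix): { output: Matrix; weights: Matrix } {
  const n = Q.length;
  const dK = Q[0].length;
  const weights: Matrix = [];
  const output: Matrix = [];
  for (let i = 0; i < n; i++) {
    const scores: number[] = [];
    for (let j = 0; j < n; j++) {
      let dot = 0;
      for (let d = 0; d < dK; d++) dot += Q[i][d] * K[j][d];
      // Causal mask M: future positions (j > i) get -Infinity, so softmax assigns them weight 0.
      scores.push(j <= i ? dot / Math.sqrt(dK) : -Infinity);
    }
    const w = softmaxRow(scores);
    weights.push(w);
    // Weighted mix of the Value vectors.
    output.push(V[0].map((_, c) => w.reduce((acc, wj, j) => acc + wj * V[j][c], 0)));
  }
  return { output, weights };
}

const queries: Matrix = [[1, 0], [0, 1], [1, 1]];
const keys: Matrix = [[1, 0], [0, 1], [1, 1]];
const values: Matrix = [[1, 0], [0, 1], [0.5, 0.5]];
const attended = causalAttention(queries, keys, values);
console.log(attended.weights[0]); // [1, 0, 0]: the first token can only attend to itself
console.log(attended.weights[2]); // all three weights are non-zero for the last token
```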
flowchart LR
subgraph Input
T1["token<br/>A"]
T2["token<br/>B"]
T3["token<br/>C"]
end
subgraph QKV["Q/K/V Projections"]
Q1["Q"]
K1["K"]
V1["V"]
end
subgraph Att["Attention"]
S["Q·K^T / √d"]
SM["softmax"]
M["Σ weight × V"]
end
T1 --> Q1
T1 --> K1
T1 --> V1
Q1 & K1 --> S --> SM
SM & V1 --> M
Multi-Head Attention
A sentence contains many kinds of relationships at once. An adjective modifies a noun. A pronoun refers to something earlier. A closing bracket matches an opening one.
Multi-Head Attention runs several attention operations in parallel, each with its own learned projections. Each head can learn a different type of relationship. The outputs are combined and projected back to the model dimension.
Not every head learns a pretty textbook pattern. Some track syntax, some support local copying, some look redundant. The model still works because the output projection mixes all heads together, so no single head has to be clean or interpretable.
In many recent architectures, you also see GQA (Grouped-Query Attention): multiple Query heads share a smaller number of Key/Value heads. Practical effect: lower KV memory use at inference time, with quality often close to classic MHA.
Positional Encoding / RoPE
“Dog bites man” and “Man bites dog” contain the same words, but they do not mean the same thing. Attention on its own compares tokens by content and is blind to order, so position has to be injected explicitly.
Many modern models use RoPE (Rotary Position Embedding, from the RoFormer paper), which rotates parts of the Query and Key vectors based on token position. When attention compares a Query with a Key, the result depends on both content and relative position.
RoPE helps generalization on longer contexts, but it is not magic: beyond certain lengths, quality can degrade (“lost in the middle”, less stable retrieval of far-away facts).
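The "relative position" property can be checked numerically. The sketch below assumes the base of 10000 from the RoFormer paper and rotates adjacent pairs of dimensions; implementations differ in how they pair dimensions, so treat the layout as an assumption. The function names are illustrative.

```typescript
// Rotate each pair (x[2i], x[2i+1]) by an angle proportional to the token position.
function applyRope(x: number[], position: number, base = 10000): number[] {
  const d = x.length; // assumed even
  const out = new Array<number>(d);
  for (let i = 0; i < d / 2; i++) {
    const theta = position * Math.pow(base, (-2 * i) / d);
    const cos = Math.cos(theta);
    const sin = Math.sin(theta);
    out[2 * i] = x[2 * i] * cos - x[2 * i + 1] * sin;
    out[2 * i + 1] = x[2 * i] * sin + x[2 * i + 1] * cos;
  }
  return out;
}

function dotProduct(a: number[], b: number[]): number {
  return a.reduce((acc, v, i) => acc + v * b[i], 0);
}

const queryVec = [0.3, -1.2, 0.8, 0.5];
const keyVec = [1.0, 0.4, -0.7, 0.2];

// Same offset (query 3 positions after key) at two different absolute positions.
const scoreNearStart = dotProduct(applyRope(queryVec, 5), applyRope(keyVec, 2));
const scoreFarAway = dotProduct(applyRope(queryVec, 105), applyRope(keyVec, 102));
// A different offset (1 position apart).
const scoreOtherOffset = dotProduct(applyRope(queryVec, 5), applyRope(keyVec, 4));

console.log(Math.abs(scoreNearStart - scoreFarAway) < 1e-9);     // true: only the offset matters
console.log(Math.abs(scoreNearStart - scoreOtherOffset) > 1e-3); // true: a different offset changes the score
```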
Complexity: attention costs $O(n^2)$
If the sequence length is $n$, the attention map is an $n \times n$ matrix. Doubling the context does not double the cost of attention: it roughly quadruples it. This is why optimizations such as FlashAttention, sliding windows, paged attention, and sparse variants exist.
Feed-Forward Network: rewriting meanings
After attention, each token has gathered information from others. But the vector for “mole” may still be ambiguous between “animal” and “unit of measurement”. The Feed-Forward Network (FFN) helps resolve this ambiguity.
The FFN applies the same small neural network to each token independently. It expands the vector into a wider representation, applies a non-linearity (typically GELU or SwiGLU), then projects back to the original dimension. Result: a richer, more contextualized vector.
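The expand -> non-linearity -> project pattern looks like this in a minimal sketch. The weights are hypothetical hand-picked numbers (model dimension 2, hidden dimension 4), and GELU uses the common tanh approximation, written here through the identity 0.5 * (1 + tanh(u)) = sigmoid(2u).

```typescript
// tanh approximation of GELU, expressed with a sigmoid so only Math.exp is needed.
function gelu(x: number): number {
  const u = Math.sqrt(2 / Math.PI) * (x + 0.044715 * x * x * x);
  return x / (1 + Math.exp(-2 * u));
}

// y = W · x + b, with W stored as [outDim][inDim].
function linear(W: number[][], b: number[], x: number[]): number[] {
  return W.map((row, o) => row.reduce((acc, w, i) => acc + w * x[i], b[o]));
}

interface FfnWeights {
  wUp: number[][];
  bUp: number[];
  wDown: number[][];
  bDown: number[];
}

// Expand, apply the non-linearity, project back. Works on one token vector at a time.
function feedForward(x: number[], p: FfnWeights): number[] {
  const hidden = linear(p.wUp, p.bUp, x).map((h) => gelu(h));
  return linear(p.wDown, p.bDown, hidden);
}

// Hypothetical tiny weights, for illustration only.
const ffnWeights: FfnWeights = {
  wUp: [[1, 0], [0, 1], [1, 1], [1, -1]],
  bUp: [0, 0, 0, 0],
  wDown: [[0.5, 0, 0.25, 0], [0, 0.5, 0, 0.25]],
  bDown: [0, 0],
};

const tokenVectors = [[1, 2], [-1, 0.5], [1, 2]];
const rewritten = tokenVectors.map((t) => feedForward(t, ffnWeights));
// Tokens 0 and 2 have the same input, so they get the same output: the FFN never looks at neighbors.
console.log(rewritten);
```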
flowchart LR
subgraph Block
A["Self-Attention"] --> B["Add & Norm"]
B --> C["Feed-Forward"]
C --> D["Add & Norm"]
end
Each sub-block is wrapped with a residual connection (Add) and normalization (LayerNorm or RMSNorm). Residual connections preserve gradient flow during training and make it possible to stack dozens or hundreds of layers.
In modern decoders, the pre-norm layout is common: normalization is applied before each sub-block, which is usually more stable for deep stacks (the diagram above shows the original post-norm order).
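A pre-norm block can be sketched as two applications of the same wrapper, `x + f(norm(x))`. This is a simplified single-vector view: real attention mixes information across tokens, and real models keep a separate learned gain for each normalization. The `SubLayer` type and the stand-in sub-layers are hypothetical.

```typescript
// RMSNorm: divide by the root mean square of the vector, then apply a learned per-dimension gain.
function rmsNorm(x: number[], gain: number[], eps = 1e-6): number[] {
  const meanSquare = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const scale = 1 / Math.sqrt(meanSquare + eps);
  return x.map((v, i) => v * scale * gain[i]);
}

type SubLayer = (x: number[]) => number[];

// Pre-norm residual step: x + f(norm(x)). The untouched x is the residual path.
function preNormResidual(x: number[], gain: number[], subLayer: SubLayer): number[] {
  const update = subLayer(rmsNorm(x, gain));
  return x.map((v, i) => v + update[i]);
}

// One block: attention step, then FFN step, each wrapped the same way.
function transformerBlock(x: number[], gain: number[], attention: SubLayer, ffn: SubLayer): number[] {
  const afterAttention = preNormResidual(x, gain, attention);
  return preNormResidual(afterAttention, gain, ffn);
}

// Hypothetical stand-ins for the real sub-layers, just to exercise the wiring.
const halve: SubLayer = (v) => v.map((e) => e * 0.5);
const zero: SubLayer = (v) => v.map(() => 0);

console.log(rmsNorm([3, 4], [1, 1]));                       // roughly [0.8485, 1.1314]
console.log(transformerBlock([3, 4], [1, 1], zero, zero));  // [3, 4]: the residual path passes x through
console.log(transformerBlock([3, 4], [1, 1], halve, zero)); // x plus half of its normalized version
```

With zero updates the block is the identity, which is why very deep stacks stay trainable: each layer only has to learn a correction on top of what it receives.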
Complete Transformer architecture
A modern LLM (decoder-only, like GPT, Llama, Mistral) stacks tens of these blocks in sequence. Each block keeps the same tensor shape (sequence length × model dimension) but progressively rewrites the information content of the vectors.
flowchart LR
subgraph Pipeline
T["Tokenization"] --> E["Embedding"]
E --> B1["Block 1"]
B1 --> B2["Block 2"]
B2 --> D["... N blocks"]
D --> L["Final layer norm"]
L --> P["Vocabulary projection"]
P --> S["Softmax"]
S --> G["Generated token"]
end
Many recent models also use MoE (Mixture of Experts) variants: the FFN is replaced by a pool of experts, and a router activates only a few experts per token. Benefit: huge capacity without activating all parameters at every step.
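A minimal sketch of top-k routing for one token follows. It assumes one common choice, a softmax over the selected router scores; other routers normalize differently or add load-balancing terms. The experts here are hypothetical toy functions, and the call counter exists only to show that unselected experts never run.

```typescript
type Expert = (x: number[]) => number[];

function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Score every expert, keep the top-k, renormalize their weights, mix their outputs.
function moeForward(
  x: number[],
  routerWeights: number[][],
  experts: Expert[],
  topK: number,
): { output: number[]; chosen: number[] } {
  const routerLogits = routerWeights.map((row) => row.reduce((acc, w, i) => acc + w * x[i], 0));
  const selected = routerLogits
    .map((logit, index) => ({ logit, index }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, topK);
  const gate = softmax(selected.map((s) => s.logit));
  const output = x.map(() => 0);
  selected.forEach((s, rank) => {
    const expertOut = experts[s.index](x); // only the selected experts are evaluated
    for (let i = 0; i < output.length; i++) output[i] += gate[rank] * expertOut[i];
  });
  return { output, chosen: selected.map((s) => s.index) };
}

// Four hypothetical experts; expert i multiplies its input by (i + 1).
let expertCalls = 0;
const toyExperts: Expert[] = [0, 1, 2, 3].map((id) => (x: number[]) => {
  expertCalls += 1;
  return x.map((v) => v * (id + 1));
});
const toyRouter = [[1, 0], [0, 1], [-1, 0], [0, -1]];

const routed = moeForward([2, 1], toyRouter, toyExperts, 2);
console.log(routed.chosen); // [0, 1]
console.log(expertCalls);   // 2: half of the experts did no work for this token
```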
Generation: from logits to text
After the last Transformer block, the model still has not produced a word. It has a vector representing its “opinion” on what the next token should be.
A linear layer projects this vector into a score for every token in the vocabulary. These raw scores are called logits. Softmax turns them into probabilities.
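In code, this last step is one matrix-vector product followed by softmax. The four-word vocabulary, the projection matrix, and the hidden vector below are hypothetical values picked for readability.

```typescript
const toyVocabulary = ["cat", "dog", "car", "the"];
// Hypothetical projection matrix with shape [vocabSize][modelDim].
const outputProjection = [
  [1.0, 0.2],
  [0.9, 0.3],
  [-0.5, 1.0],
  [0.0, -1.0],
];

// One raw score (logit) per vocabulary entry.
function projectToLogits(hidden: number[], W: number[][]): number[] {
  return W.map((row) => row.reduce((acc, w, i) => acc + w * hidden[i], 0));
}

function softmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const hiddenVector = [2.0, 0.5];
const nextLogits = projectToLogits(hiddenVector, outputProjection);
const nextProbs = softmax(nextLogits);

toyVocabulary.forEach((word, i) => {
  console.log(`${word}: logit=${nextLogits[i].toFixed(2)} p=${nextProbs[i].toFixed(3)}`);
});
```

Logits can be any real number; only after softmax do they become non-negative values that sum to 1.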
Important detail: during training, the model computes logits for all sequence positions (next-token supervision at every step). During chat inference, only the distribution at the last position is used to choose the next emitted token.
flowchart LR
H["Hidden vector"] --> L["Linear layer<br/>(vocabulary × 1)"]
L --> LG["Logits<br/>(raw scores)"]
LG --> SM["Softmax (+ temperature)"]
SM --> P["Probabilities"]
P --> DC["Decoding<br/>(greedy / top-k / top-p)"]
DC --> T["Selected token"]
This is where sampling hyperparameters come into play:
- Temperature (T): scales the logits before softmax and controls randomness. Higher → more uniform distribution → more randomness. Lower → more deterministic.
- Top-k / Top-p: sampling strategies; top-k keeps only the k most likely tokens, top-p keeps the smallest set of tokens whose cumulative probability reaches p.
Other controls used in production:
- Repetition penalty: reduces loops and obsessive repetition.
- Frequency/presence penalty: nudges the model toward fresher content.
- Stop sequences: cut the output at known delimiters.
Next-token prediction -> full generation
The step that feels “magical” in an LLM is actually a very simple loop:
- predict next-token distribution
- choose one token (greedy or sampling)
- append token to context
- repeat
This autoregressive loop is exactly what you see when output streams token by token.
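The four steps above map directly onto a loop. In this sketch the "model" is a hypothetical bigram table that looks only at the last token, and token 0 is assumed to be the end-of-text marker; a real model would take the whole context into account.

```typescript
// Hypothetical stand-in for a real model: returns next-token logits for a given context.
type NextTokenModel = (context: number[]) => number[];

const END_OF_TEXT = 0; // assumption: id 0 marks the end of generation

function argmax(scores: number[]): number {
  let best = 0;
  for (let i = 1; i < scores.length; i++) if (scores[i] > scores[best]) best = i;
  return best;
}

function generateGreedy(model: NextTokenModel, promptIds: number[], maxNewTokens: number): number[] {
  const context = promptIds.slice();
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = model(context);    // 1. predict next-token distribution
    const next = argmax(logits);      // 2. choose one token (greedy here)
    if (next === END_OF_TEXT) break;  //    stop when the model emits end-of-text
    context.push(next);               // 3. append token to context
  }                                   // 4. repeat
  return context;
}

// Toy bigram table over 4 token ids: row = last token, column = score of the next token.
const bigramLogits = [
  [5, 0, 0, 0], // after 0 -> 0
  [0, 0, 5, 0], // after 1 -> 2
  [0, 0, 0, 5], // after 2 -> 3
  [5, 0, 0, 0], // after 3 -> end of text
];
const toyModel: NextTokenModel = (context) => bigramLogits[context[context.length - 1]];

console.log(generateGreedy(toyModel, [1], 10)); // [1, 2, 3]
```

Swapping `argmax` for a sampling function is the only change needed to move from greedy decoding to temperature or top-p sampling.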
Prefill vs decode (real latency)
Autoregressive inference has two operational phases:
- Prefill: the model processes the full prompt in parallel.
- Decode: the model emits one token at a time, reusing the KV cache.
Prefill dominates cost for long prompts and determines time to first token. Decode dominates user-perceived speed (tokens/sec). Good UX means balancing both.
sequenceDiagram
participant U as User
participant M as Model
participant K as KV cache
U->>M: Full prompt
M->>M: Parallel prefill on all tokens
M->>K: Store initial Key/Value tensors
loop Autoregressive decode
M->>K: Read cached context
M->>M: Compute next token
M->>K: Append new K/V
M-->>U: Stream token
end
Training: how an LLM is born
Building an LLM happens in stages:
flowchart LR
A["Data collection"] --> B["Data curation\n dedup, filtering, quality"]
B --> C["Pre-training\n next-token prediction"]
C --> D["SFT\n supervised dialogues"]
D --> E["Preference tuning\n RLHF or DPO"]
E --> F["Evaluation\n safety, quality, latency"]
F --> G["Deployment\n serving and monitoring"]
1. Pre-training
The model is exposed to massive text corpora and trained to predict the next token. Typical objective: minimize the cross-entropy loss between the predicted distribution and the correct token. Backpropagation computes a gradient for each weight, and an optimizer (often AdamW) updates the weights. Repeat this billions of times and linguistic patterns, factual associations, and syntax emerge.
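The loss itself is small enough to write out. This sketch computes the mean negative log-probability of the correct next token over a few positions; the logits are made-up numbers, and gradients and the optimizer step are left out.

```typescript
// Numerically stable log-softmax: log p_i = logit_i - logsumexp(logits).
function logSoftmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((acc, l) => acc + Math.exp(l - maxLogit), 0);
  const logSumExp = maxLogit + Math.log(sumExp);
  return logits.map((l) => l - logSumExp);
}

// Mean cross-entropy over positions: targets[t] is the id of the correct next token at position t.
function crossEntropy(logitsPerPosition: number[][], targets: number[]): number {
  let total = 0;
  for (let t = 0; t < targets.length; t++) {
    total -= logSoftmax(logitsPerPosition[t])[targets[t]];
  }
  return total / targets.length;
}

const guessingLoss = crossEntropy([[0, 0, 0, 0]], [2]);   // ln(4): uniform guess over 4 tokens
const confidentLoss = crossEntropy([[0, 0, 10, 0]], [2]); // close to 0: most mass on the right token
const wrongLoss = crossEntropy([[0, 0, 10, 0]], [1]);     // large: confident and wrong

console.log(guessingLoss.toFixed(4)); // 1.3863
console.log(confidentLoss < guessingLoss && guessingLoss < wrongLoss); // true
```

Being confidently wrong is penalized far more than guessing, which is what pushes the model toward calibrated next-token distributions.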
It is not just “more data = better”. Data curation matters: deduplication, spam removal, domain balancing, filtering harmful or low-quality sources.
2. Supervised Fine-Tuning (SFT)
A base model is a “sentence completer”, not an assistant. SFT shows it thousands of dialogues (question + ideal answer) written by human experts. The model learns to respond helpfully, truthfully, safely.
3. Reinforcement Learning (RLHF / DPO)
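In this phase the model generates its own answers and is trained on a reward signal. With RLHF, the reward comes from a model trained on human preference rankings. With verifiable tasks, the model answers problems with known solutions, and correct answers are selected and used for further training. This second variant produced surprising discoveries: “thinking models” (like DeepSeek R1) develop self-reflection, double-checking, and multi-step reasoning capabilities.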
In modern stacks, many teams use DPO and other preference-based methods to avoid some engineering complexity of classic RL while still aligning with human preferences.
Compute, scaling laws, trade-offs
Final quality depends on balance across:
- parameter count
- training tokens
- compute budget
Scaling laws show regular trends: as these factors increase, loss usually drops predictably until practical bottlenecks hit (energy cost, memory bandwidth, data quality ceiling).
Practical considerations
KV Cache
During generation, the model computes Key and Value for each token. Without caching, every step would recompute everything from scratch. The KV Cache stores K and V tensors of already processed tokens, drastically reducing computational cost at the expense of more memory. This trade-off makes interactive generation feasible.
GQA-style architectures also help here: by sharing part of K/V across groups of query heads, cache growth is smaller and multi-tenant serving scales better.
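The memory side of this trade-off is simple arithmetic: two tensors (K and V) per layer, each holding one vector per KV head per token. The model shape below is hypothetical, chosen for round numbers, and the field names are illustrative.

```typescript
interface KvCacheConfig {
  layers: number;
  kvHeads: number;       // number of Key/Value heads (equals the query heads in classic MHA)
  headDim: number;
  bytesPerValue: number; // 2 for fp16 / bf16
}

// Factor 2 = one Key tensor plus one Value tensor per layer.
function kvCacheBytes(cfg: KvCacheConfig, sequenceLength: number, batchSize = 1): number {
  return 2 * cfg.layers * cfg.kvHeads * cfg.headDim * cfg.bytesPerValue * sequenceLength * batchSize;
}

const GIB = Math.pow(1024, 3);

// Hypothetical shapes: same model, classic MHA versus GQA with 4 query heads per KV head.
const mhaConfig: KvCacheConfig = { layers: 32, kvHeads: 32, headDim: 128, bytesPerValue: 2 };
const gqaConfig: KvCacheConfig = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

console.log(kvCacheBytes(mhaConfig, 8192) / GIB);     // 4 (GiB for one 8192-token sequence)
console.log(kvCacheBytes(gqaConfig, 8192) / GIB);     // 1
console.log(kvCacheBytes(gqaConfig, 8192, 16) / GIB); // 16 (the cache grows linearly with batch size)
```

The cache grows linearly with context length and with the number of concurrent sequences, which is why KV head count and paging strategy matter so much for serving.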
In multi-tenant systems, cache strategy is critical: eviction policy, paging, and continuous batching often matter more for throughput than raw theoretical TFLOPS.
Quantization
LLM inference is often memory-bound: moving weights from memory takes more time than the arithmetic itself. Quantization reduces the number of bits used to represent weights (from 16 or 32 bits down to 8 or 4 bits), which shrinks the model and usually speeds up inference, often with only a small quality loss. This is what allows running multi-billion-parameter models on consumer hardware.
Important caveat: there is no universal “4-bit behaves the same” rule. Quantization scheme (per-channel, per-group, activation-aware, etc.) strongly affects quality/speed trade-off.
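To make the idea concrete, here is a sketch of the simplest scheme: symmetric per-tensor "absmax" quantization to int8, with one scale for the whole weight vector. This is not one of the more advanced schemes named above, and the weights are made-up values; it only shows where the rounding error comes from.

```typescript
// Map floats to integers in [-127, 127] using a single scale derived from the largest magnitude.
function quantizeInt8(weights: number[]): { q: Int8Array; scale: number } {
  const absMax = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
  const scale = absMax === 0 ? 1 : absMax / 127;
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.round(weights[i] / scale);
  }
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < q.length; i++) out.push(q[i] * scale);
  return out;
}

const originalWeights = [0.12, -0.5, 0.33, 0.0, 0.49];
const quantized = quantizeInt8(originalWeights);
const restored = dequantize(quantized.q, quantized.scale);

console.log(quantized.q); // integers; the largest-magnitude weight maps to -127
const maxError = originalWeights.reduce((m, w, i) => Math.max(m, Math.abs(w - restored[i])), 0);
console.log(maxError <= quantized.scale / 2 + 1e-12); // true: rounding error is at most half a step
```

A single outlier weight stretches the scale for everyone else, which is one reason per-channel and per-group schemes tend to preserve quality better than per-tensor ones.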
Context window and real limits
A long window does not mean perfect memory. Typical issues:
- attention diffusion over huge contexts
- weaker retrieval for facts hidden in the middle
- non-linear growth in latency/cost
For enterprise tasks, focused context plus retrieval (RAG) is often better than dumping an entire knowledge base into one prompt.
Hallucinations: why they happen
An LLM optimizes next-token probability, not absolute truth. Under ambiguous context or missing reliable evidence, it may produce plausible but false text. Common mitigations: source grounding, verifiable citations, tool use, post-generation validation.
Encoder-only, decoder-only, encoder-decoder
Not all Transformers are the same:
- Encoder-only (BERT): reads the full context in both directions, used for classification and embeddings.
- Decoder-only (GPT, Llama, Mistral): generate text autoregressively.
- Encoder-Decoder (T5, original Transformer): translation and summarization.
Modern conversational models are almost all decoder-only.
flowchart TD
T["Transformer families"] --> E["Encoder-only\n BERT"]
T --> D["Decoder-only\n GPT, Llama, Mistral"]
T --> ED["Encoder-Decoder\n T5"]
E --> E1["Tasks: classification\n and embeddings"]
D --> D1["Tasks: autoregressive\n generation"]
ED --> ED1["Tasks: translation\n and summarization"]
Practical examples: from concept to terminal
Theory is great, but when you need to estimate cost or debug weird outputs, you need concrete numbers. Here are two copy-paste TypeScript snippets.
Example 1: estimate token usage and prompt cost
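A minimal sketch of such an estimate, under loud assumptions: it uses a crude characters-per-token heuristic (about 4 characters per token for English prose) instead of a real tokenizer, and the prices are hypothetical placeholders. For real budgeting, count tokens with the model's own tokenizer (tiktoken, for example) and use your provider's current price list.

```typescript
interface Pricing {
  inputUsdPerMillionTokens: number;  // hypothetical value, replace with real pricing
  outputUsdPerMillionTokens: number; // hypothetical value, replace with real pricing
}

// Crude heuristic: code, JSON, and non-English text usually need more tokens per character.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens * pricing.inputUsdPerMillionTokens +
      outputTokens * pricing.outputUsdPerMillionTokens) / 1e6
  );
}

const examplePricing: Pricing = { inputUsdPerMillionTokens: 1.0, outputUsdPerMillionTokens: 4.0 };
const samplePrompt = "Summarize the following support ticket in three bullet points.";
const estimatedInputTokens = estimateTokens(samplePrompt);
const expectedOutputTokens = 150; // assumption about the typical answer length
const callsPerDay = 10000;        // assumption about traffic

const costPerCall = estimateCostUsd(estimatedInputTokens, expectedOutputTokens, examplePricing);
console.log(`input tokens (estimated): ${estimatedInputTokens}`);
console.log(`cost per call (USD): ${costPerCall.toFixed(6)}`);
console.log(`cost per day (USD): ${(costPerCall * callsPerDay).toFixed(2)}`);
```

Output tokens are often priced higher than input tokens, so the expected answer length can dominate the bill even when the prompt is short.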
Expected output (illustrative values):
Example 2: controlled decoding (temperature, top-k, top-p)
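A minimal sketch of the three controls applied in sequence: temperature scaling, then top-k and top-p filtering, then renormalization. The function names and the logits are hypothetical, and inference libraries may apply these filters in a different order; `random` is injectable so a run can be made reproducible.

```typescript
interface DecodingOptions {
  temperature: number; // must be > 0; values near 0 approach greedy decoding
  topK: number;
  topP: number;
}

function softmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Returns the renormalized next-token distribution after temperature, top-k, and top-p.
function filteredDistribution(logits: number[], opts: DecodingOptions): number[] {
  const probs = softmax(logits.map((l) => l / opts.temperature));
  // Token indices sorted from most to least likely.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept: number[] = [];
  let cumulative = 0;
  for (let r = 0; r < order.length; r++) {
    if (kept.length >= opts.topK) break;   // top-k: at most k candidates
    kept.push(order[r]);
    cumulative += probs[order[r]];
    if (cumulative >= opts.topP) break;    // top-p: smallest set whose mass reaches p
  }
  const keptMass = kept.reduce((acc, i) => acc + probs[i], 0);
  const filtered = probs.map(() => 0);
  kept.forEach((i) => { filtered[i] = probs[i] / keptMass; });
  return filtered;
}

// Draw one token index from a distribution.
function sampleIndex(probs: number[], random: () => number = Math.random): number {
  const r = random();
  let cumulative = 0;
  let lastAllowed = 0;
  for (let i = 0; i < probs.length; i++) {
    if (probs[i] === 0) continue; // filtered-out tokens can never be drawn
    cumulative += probs[i];
    lastAllowed = i;
    if (r < cumulative) return i;
  }
  return lastAllowed; // guard against floating-point shortfall in the cumulative sum
}

const exampleLogits = [2.0, 1.0, 0.5, -1.0];
const balanced = filteredDistribution(exampleLogits, { temperature: 1, topK: 3, topP: 0.9 });
const sharp = filteredDistribution(exampleLogits, { temperature: 0.1, topK: 3, topP: 0.9 });
console.log(balanced.map((p) => p.toFixed(3))); // the least likely token is filtered out
console.log(sharp);                             // [1, 0, 0, 0]: low temperature leaves one candidate
console.log(sampleIndex(balanced));
```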
Expected output:
Best practices and anti-patterns
What to do
- Always measure tokens before production: true cost lives there, not in character count.
- Track prefill and decode separately: they have different bottlenecks.
- Add grounding (RAG, trusted tools, citations) to reduce hallucinations.
- Version prompts and parameters (temperature, top-p, stop sequences) like code.
What to avoid
- Anti-pattern: giant prompts dumping the whole database “just in case”.
- Anti-pattern: high temperature for critical tasks (compliance, security, numeric outputs).
- Anti-pattern: optimizing only perceived quality while ignoring latency and cost.
- Anti-pattern: no observability for token/sec, cache hit ratio, tool-call failures.
Useful tools in practice
- tiktoken: token estimation and budgeting.
- vLLM: high-performance serving with paged attention.
- llama.cpp: local inference and quantized model testing.
- Langfuse: tracing, observability, and latency/cost analysis.
Operational checklist: before production LLM rollout
- I estimated input/output token budgets for main use cases.
- I defined guardrails (stop sequences, policy filters, post-output validation).
- I separated prefill/decode KPIs (p95 latency, tokens/sec).
- I tested at least one anti-hallucination strategy (RAG or tool grounding).
- I benchmarked one full-precision baseline and one quantized variant.
- I enabled continuous monitoring for cost, errors, and quality drift.
Related deep dives
- Prompt Engineering: deeper prompting parameters and strategy.
- AI generative for Dummies: softer intro if you are starting from zero.
- Diagrams with Generative AI: visual workflows for technical docs.
- Context engineering: the real magic behind AI: why context beats prompt-only setups.
References
Original sources used in this guide
- 3Blue1Brown - Transformers, the tech behind LLMs (Deep Learning Chapter 5): primary walkthrough for token -> embedding -> attention -> logits -> sampling flow.
- Roy van Rijn - The Anatomy of an LLM: interactive visual guide covering pipeline, training, KV cache, and quantization.
- Kuhan Sundaram - The Anatomy of an LLM (Part 1): accessible explanation of attention and Transformer blocks.
- Enrico Piccinin - Dentro ad un LLM come ChatGPT: practical synthesis of base model -> assistant -> alignment stages.
- 3Blue1Brown - Neural Networks / Transformers: visual series on neural networks and Transformers.
- Jay Alammar - The Illustrated Transformer: classic visual explanation of the Transformer architecture.
- Andrej Karpathy - Neural Networks: Zero to Hero: code-first path to building a GPT from scratch.
- Attention Is All You Need: the original Transformer paper.
- RoFormer: Enhanced Transformer with Rotary Position Embedding: theoretical basis for RoPE.
- FlashAttention: exact memory-efficient attention for long contexts.
Quick glossary
- Token: smallest text unit used by the model (word, subword, punctuation, whitespace).
- Tokenizer: algorithm that converts text into token IDs.
- Vocabulary: set of tokens known to the model.
- Embedding: numeric vector associated with a token ID.
- Context window: maximum number of tokens a model can process in one inference pass.
- Query (Q): representation of what a token is looking for in context.
- Key (K): representation of what a token exposes for matching.
- Value (V): information a token can pass to other tokens.
- Self-attention: mechanism where tokens in the same sequence weight each other.
- RoPE: rotary positional encoding applied to Q/K so attention is order-aware.
- FFN (Feed-Forward Network): subnetwork that transforms each token independently after attention.
- GELU: non-linear activation commonly used inside Transformer FFN blocks.
- SwiGLU: gated FFN variant combining gating with SiLU/Swish-like behavior.
- LayerNorm: per-token activation normalization used to stabilize training.
- RMSNorm: root-mean-square normalization variant, often lighter than LayerNorm.
- Residual connection: skip/add connection that preserves signal and helps gradient flow.
- Pre-norm: layout where normalization is applied before attention/FFN sub-blocks.
- MoE (Mixture of Experts): architecture with many experts selectively activated per token.
- Logits: raw scores for each vocabulary token before softmax.
- Softmax: function that converts logits into normalized probabilities.
- Temperature: decoding parameter that makes output more deterministic (low) or more creative (high).
- Top-k / Top-p: sampling strategies that restrict candidate next tokens.
- Prefill: phase where initial prompt is processed in parallel.
- Decode: autoregressive phase generating one token at a time.
- KV cache: cache of past Key/Value tensors used to speed up decode.
- GQA: attention variant where multiple query heads share fewer K/V heads to reduce memory.
- Quantization: lower numeric precision for weights to reduce memory and improve throughput.
- SFT: supervised fine-tuning on prompt/response examples.
- RLHF / DPO: preference-tuning methods to align responses with human preferences.
- Cross-entropy loss: objective function penalizing predicted distributions far from target token.
- Backpropagation: algorithm that propagates error backward to compute gradients.
- AdamW: adaptive optimizer widely used to train Transformer models.
- Hallucination: plausible-looking output not grounded in reliable facts.