There is a version of learning machine learning where you start with code. Clone a repo. Call model.fit(). Watch loss drop. Feel productive in 48 hours.
I did not do that.
Before touching code, I wanted to understand what I was actually looking at. Not the interface. The mechanism. The real picture underneath the function calls.
So I built mental maps. Not definitions. Not steps to memorize. Actual pictures of how concepts relate, what breaks if you remove one, where each idea lives in the bigger machine.
These are the ones that landed hardest.
A Model Is a Frozen Function
f(x; θ) → ŷ
That is it. Input x, parameters θ, output prediction ŷ.
The architecture defines the shape of the function. Training fills in the numbers. After training ends, θ freezes permanently. The model ships. You run it.
Every token an LLM generates for you is one forward pass through a frozen function. The numbers have not changed since the training run ended. The intelligence is not in the process. It is encoded in the weights.
The weights are not a database. They are compressed knowledge shaped by exposure to trillions of tokens. That is a completely different thing.
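A sketch makes this concrete. The toy weights below are invented for illustration; imagine they came out of a finished training run. The point is the shape of the operation: inference reads θ, never writes it.

```python
# A toy frozen function. These weights are made up for illustration;
# pretend they are the output of some earlier training run.
FROZEN_THETA = {"w": [0.5, -1.2, 0.3], "b": 0.1}  # learned once, now constant

def f(x, theta=FROZEN_THETA):
    """One forward pass: y_hat = w . x + b. Reads theta, never writes it."""
    return sum(wi * xi for wi, xi in zip(theta["w"], x)) + theta["b"]

# Same input, same output, every time. Nothing about theta changes at inference.
print(f([1.0, 2.0, 3.0]))
print(f([1.0, 2.0, 3.0]))
```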
Two Timelines. Never Confuse Them.
| Phase | What happens | Weights |
|---|---|---|
| Training | Find the best θ | Changing every batch |
| Inference | Run f(x; θ) on new input | Frozen permanently |
This is the single mental model that unlocks the most other things. Confusing these two timelines is the source of more misconceptions about AI than almost anything else.
Loss Is a Thermometer, Not a Scoreboard
Loss does not accumulate. It does not keep a running total of mistakes. Every batch is a fresh reading: how wrong is the model right now, on these specific examples?
Cross-entropy makes this concrete:
L = −log(p_correct)
The model assigned a probability to the correct answer. Loss punishes based on how low that probability was.
| Confidence on correct answer | Loss | Reading |
|---|---|---|
| 99% | ~0.01 | barely a scratch |
| 50% | ~0.69 | uncertain |
| 10% | ~2.3 | struggling |
| 1% | ~4.6 | destroyed |
Confident and right. Nearly free. Confident and wrong. Destroyed. The math enforces what makes sense.
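The table is just the formula evaluated at four points. A minimal check:

```python
import math

def cross_entropy(p_correct):
    """L = -log(p_correct), where p_correct is the probability
    the model assigned to the right answer."""
    return -math.log(p_correct)

for p in (0.99, 0.50, 0.10, 0.01):
    print(f"confidence {p:.0%} -> loss {cross_entropy(p):.2f}")
```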
Backprop Is Blame Assignment
The model makes a wrong prediction. Loss measures how wrong. Then backpropagation asks one question:
Which weights caused this, and by how much?
Chain rule propagates the error signal backward through every layer, assigning responsibility to each weight proportional to its contribution to the mistake. Gradient descent then corrects those weights.
θ ← θ − η · ∂L / ∂θ
Not magic. Calculus applied systematically at enormous scale.
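A one-weight sketch of the whole loop: forward pass, loss, gradient via the chain rule, update via the formula above. Real backprop does this through millions of weights per layer; the mechanics are identical.

```python
# One weight, one training example: f(x; theta) = theta * x, squared-error loss.
# With a single weight, blame assignment collapses to one derivative, dL/dtheta.
x, y = 2.0, 6.0        # want f(2) == 6, so theta should end up near 3
theta, eta = 0.0, 0.1  # start wrong; eta is the learning rate

for _ in range(50):
    y_hat = theta * x              # forward pass
    loss = (y_hat - y) ** 2        # how wrong, right now
    grad = 2 * x * (y_hat - y)     # chain rule: dL/dtheta
    theta = theta - eta * grad     # theta <- theta - eta * dL/dtheta

print(theta)  # converges to 3.0
```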
The Embedding Table Is the Bridge
Models cannot read words. They operate on vectors of numbers.
The embedding table translates. One row per token in the vocabulary. Look up a token ID, get back a vector of hundreds or thousands of numbers. That vector is the token's meaning, as a position in a high-dimensional geometric space.
"king" → [ 0.2, -0.4, 0.8, 0.1, ... ]
"queen" → [ 0.3, -0.3, 0.7, 0.2, ... ]
"dog" → [-0.9, 0.6, -0.2, 0.8, ... ]
Similar meanings end up geometrically close. The model never learned this explicitly. It emerged from seeing which words appear in similar contexts across billions of sentences.
Every projection in a transformer, every Q, K, and V matrix, operates on these vectors. Linear transformations on embedded tokens. That is the actual operation.
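A toy version, using the illustrative vectors above (4 dimensions instead of hundreds or thousands), with cosine similarity as the closeness measure:

```python
import math

# Toy embedding table built from the illustrative vectors above.
EMBED = {
    "king":  [ 0.2, -0.4,  0.8, 0.1],
    "queen": [ 0.3, -0.3,  0.7, 0.2],
    "dog":   [-0.9,  0.6, -0.2, 0.8],
}

def cosine(u, v):
    """Geometric closeness: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(EMBED["king"], EMBED["queen"]))  # high: royalty sits close together
print(cosine(EMBED["king"], EMBED["dog"]))    # negative: far apart
```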
The Standard Training Recipe
Every modern LLM runs this exact stack:
Optimizer : AdamW
Start : warmup (ramp LR up slowly)
Middle / End : cosine decay (ramp LR down smoothly)
Safety net : gradient clipping (cap any explosions)
AdamW gives each weight its own adaptive learning rate. Weights that barely move get bigger steps. Weights that move a lot get smaller ones. Plus correct weight decay baked in.
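For concreteness, here is the standard AdamW update (Loshchilov and Hutter's formulation) written out for a single weight. The hyperparameter defaults are typical values, not from any specific training run:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a single weight.
    m, v are running moment estimates; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * grad           # smoothed gradient (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # smoothed squared gradient (scale)
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    # Adaptive step plus decoupled weight decay, the "W" in AdamW.
    theta -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

The division of m̂ by √v̂ is the adaptive learning rate in one line: a weight whose gradients are consistently tiny still takes meaningful steps, and a weight with huge gradients takes damped ones.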
Warmup because at the start, weights are random and gradients are chaotic. Big steps at that stage wreck progress before it can accumulate.
Cosine decay because late in training you want smaller and smaller nudges. Settle, do not jump.
Gradient clipping because deep networks compound multiplications during backprop, and one bad batch can send gradients into the thousands. Cap the magnitude. Preserve the direction.
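The schedule and the safety net together, as a minimal sketch. The specific learning rates and warmup length are illustrative, not prescriptive:

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=3e-5):
    """Warmup, then cosine decay. All specific numbers here are illustrative."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear ramp up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cos = 0.5 * (1 + math.cos(math.pi * progress))           # 1 -> 0, smoothly
    return min_lr + (peak_lr - min_lr) * cos

def clip_gradients(grads, max_norm=1.0):
    """Global-norm clipping: cap the magnitude, preserve the direction."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads
```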
Bigger Is Not Always Better · Chinchilla
For years the assumption was simple: more parameters, more capability. Scale the model.
Then DeepMind showed that was wrong.
They trained a 70B-parameter model on 1.4T tokens. GPT-3, for comparison: 175B parameters trained on 300B tokens. Same compute budget. Smaller model. Way more data. Chinchilla matched or beat GPT-3 on most benchmarks.
The rule that came out of it:
Optimal training tokens ≈ 20 × number of parameters.
A 7B model needs roughly 140B tokens. A 70B model needs roughly 1.4T tokens. Train with less and you are leaving performance on the table.
The analogy that made it click: a massive model trained on too little data is a genius who only read ten books. The capacity is there. The exposure is not. It just memorizes what little it saw.
Parameters are capacity. Data fills that capacity. Parameters without data are wasted.
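The arithmetic behind those numbers is worth writing down, if only to see how trivial it is:

```python
def chinchilla_tokens(n_params):
    """Compute-optimal training tokens: roughly 20 per parameter."""
    return 20 * n_params

print(f"{chinchilla_tokens(7e9) / 1e9:.0f}B tokens for a 7B model")
print(f"{chinchilla_tokens(70e9) / 1e12:.1f}T tokens for a 70B model")
```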
Overfitting vs Underfitting · One Diagnostic
| Train loss | Val loss | Diagnosis |
|---|---|---|
| HIGH | HIGH | Underfitting. Never learned. |
| LOW | LOW | Good. This is the goal. |
| LOW | HIGH | Overfitting. Memorized, did not generalize. |
Validation loss is the lie detector. Train loss can always go down if you let the model memorize. Val loss tells you whether it learned something real.
When val loss starts rising while train loss keeps falling, stop. That is overfitting.
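That stopping rule can be mechanized. A deliberately simplistic detector, with invented loss curves for illustration:

```python
def first_overfit_epoch(train_losses, val_losses):
    """Epoch where val loss turns up while train loss still falls; None if never.
    Deliberately simplistic: real early stopping usually adds patience."""
    for t in range(1, len(val_losses)):
        if val_losses[t] > val_losses[t - 1] and train_losses[t] < train_losses[t - 1]:
            return t
    return None

# Invented curves showing the classic signature: train keeps dropping,
# val bottoms out and turns back up.
train = [2.0, 1.2, 0.8, 0.5, 0.3]
val   = [2.1, 1.4, 1.1, 1.2, 1.4]
print(first_overfit_epoch(train, val))  # 3: the epoch to stop at
```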
The Map Underneath the Code
These are not isolated facts. They are one connected picture.
Input (x)
↓
Embedding table // learned weights θ
↓
Model f(x; θ)
↓
Prediction ŷ
↓ // training only
Loss // cross-entropy: L = −log(p_correct)
↓
Backpropagation // chain rule, computes ∂L/∂θ
↓
AdamW // θ ← θ − η·∂L/∂θ
↓
Updated θ
↓
Repeat until val loss converges
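The whole diagram fits in a few dozen lines. This sketch swaps AdamW for plain SGD to stay short, and uses an invented three-word vocabulary; everything else follows the blocks above: embedding lookup, forward pass, cross-entropy, chain-rule gradients, update, repeat.

```python
import math, random

random.seed(0)
VOCAB = ["the", "cat", "sat"]    # invented three-token vocabulary
DIM = 4                          # embedding dimension

# theta: an embedding table plus an output projection, both trainable.
embed = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in VOCAB}
W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_step(inp, target, lr=0.2):
    """One trip through the diagram: embed -> forward -> loss -> grads -> update."""
    x = embed[inp]                                            # embedding lookup
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    p = softmax(logits)                                       # prediction y_hat
    t = VOCAB.index(target)
    loss = -math.log(p[t])                                    # cross-entropy
    # Backprop: for softmax + cross-entropy, dL/dlogits = p - onehot(target).
    dlogits = [pi - (1.0 if i == t else 0.0) for i, pi in enumerate(p)]
    dx = [sum(dlogits[i] * W[i][j] for i in range(len(VOCAB))) for j in range(DIM)]
    # Update (plain SGD here; swap in AdamW for the real recipe).
    for i in range(len(VOCAB)):
        for j in range(DIM):
            W[i][j] -= lr * dlogits[i] * x[j]
    for j in range(DIM):
        x[j] -= lr * dx[j]                                    # embeddings train too
    return loss

losses = [train_step("the", "cat") for _ in range(500)]
print(losses[0], "->", losses[-1])   # loss falls as theta learns
```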
Every block in that diagram connects to real mathematics. Cross-entropy comes from maximum likelihood estimation. Backprop is the chain rule. The gradient that AdamW consumes, ∂L/∂θ, is the derivative of the loss with respect to every weight.
The mental maps tell you what you are looking at.
The math tells you why it has to be exactly that way.