← research log
April 2026

Seven Chapters Deep: What Linear Algebra Looks Like From the Inside

Personal narrative · Not a tutorial · Written after the climb

There is a table I found months ago. Seven rows. Four columns. It listed matrix transformations—identity, scaling, rotation, shear, projection—and mapped each one to a transformer component. Residual connections. Attention temperature. RoPE encoding. Dimensionality reduction.

I could read every word. I understood none of it.

That table became the reason I opened Gilbert Strang's textbook. And seven chapters later—five hundred pages, dozens of handwritten problem sets, four master notes documents, one Python engine built from scratch—I opened that table again.

Every row clicked.

Not because I memorized the mappings. Because I had built the math underneath them, by hand, one chapter at a time. This is what that looked like from the inside.

The Decision That Changed the Trajectory

My second blog post ended with Chapter 2 complete and a working attention head built from scratch. I had the dot product. I had elimination. I had a Python file that ran Q @ K.T and produced a 4×4 attention map. I thought I was close to understanding transformers.

I was not close.

Chapter 3 opened with a question I could not answer: what is a vector space? Not “a collection of vectors.” The real answer. What are the rules? What qualifies? What fails? I sat there for twenty minutes trying to explain it to myself and could not.

That was the moment I realized the top-down path was exhausted. I had built the attention head, yes. But I had built it the way I build software—by assembling components. The mathematical foundation was missing. The understanding was shallow.

So I made a decision: go deeper. Not wider. Not faster. Deeper. Every section. Every example. Every problem by hand. No skipping.

Seven chapters later, I have a document I call The Constitution. Every section, every mental model, every AI connection, every formula—one document, top to bottom. I read it three times. Each time, more connections appeared that I had not seen before.

Chapter 3 — Vector Spaces and Subspaces

This chapter broke me and rebuilt me.

The nullspace was the hardest concept. Inputs that a matrix maps to zero—not zero inputs. Real, nonzero vectors that produce zero output. The matrix is blind to them. They exist, they carry information, and the matrix cannot see them.

The moment it clicked: I was working a problem where a 3×4 matrix had rank 2. Two pivot columns, two free columns. I set the free variables to 1 and 0, solved backward, and got a special solution. Then I multiplied A times that solution and watched every entry become zero. The input was (−2, 1, 0, 0). It was not the zero vector. But the matrix killed it completely.
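That check is easy to reproduce. A sketch in NumPy — the 3×4 rank-2 matrix below is my own construction (column 2 is twice column 1, row 3 is row 1 plus row 2), chosen so the same special solution (−2, 1, 0, 0) falls out; it is not necessarily the textbook's exact problem:

```python
import numpy as np

# Hypothetical 3x4 rank-2 matrix (my construction, not the book's):
# column 2 = 2 * column 1, and row 3 = row 1 + row 2.
A = np.array([[1, 2, 0, 1],
              [0, 0, 1, 1],
              [1, 2, 1, 2]], dtype=float)

print(np.linalg.matrix_rank(A))   # 2: two pivot columns, two free columns

# Special solution: set free variables x2 = 1, x4 = 0, back-solve the pivots.
x = np.array([-2, 1, 0, 0], dtype=float)

print(A @ x)                      # every entry is zero: A is blind to x
```

The input is real and nonzero, but `A @ x` is the zero vector — exactly the "matrix kills it completely" moment from the problem set.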

That is when I understood what a transformer's attention head cannot see. Each head has a nullspace—input directions that produce zero output. The head is blind to them. Not because the information isn't there. Because the weight matrix's geometry makes those directions invisible.

Chapter 4 — Orthogonality

The four subspaces from Chapter 3 are not just different—they are perpendicular. The row space and nullspace meet at right angles. The column space and left nullspace meet at right angles. Every vector in the input space splits cleanly into a row space piece (signal the head reads) and a nullspace piece (silence). Perpendicular. Zero interference.

Projection is the tool that finds the closest reachable point when the exact target is unreachable. The formula AᵀAx̂ = Aᵀb. I worked it by hand on a 3×2 matrix, verified the error was perpendicular to both columns, and then stared at the equation for a long time.
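The hand computation can be replayed in a few lines. The 3×2 matrix and target vector below are illustrative numbers of my own choosing, not necessarily the ones worked in the text:

```python
import numpy as np

# A small 3x2 example (illustrative numbers):
A = np.array([[1, 0],
              [1, 1],
              [1, 2]], dtype=float)
b = np.array([6, 0, 0], dtype=float)

# Normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat      # projection of b onto the column space
e = b - p          # error: the part of b the columns cannot reach

print(A.T @ e)     # zero in both entries: e is perpendicular to both columns
```

The exact target `b` is unreachable, so the normal equations hand back the closest reachable point `p`, and the leftover error is orthogonal to everything the columns can produce.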

That equation is the mathematical ancestor of attention. When a query projects onto a key, it is computing a dot product—measuring alignment. The attention mechanism is a learned, softmax-normalized version of the projection formula I derived on paper. Same principle. Different scale.
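The "learned, softmax-normalized" part can be sketched directly. This is a minimal illustration with made-up dimensions and random vectors, not code from the engine described in these posts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4
q = rng.standard_normal(d)          # one query vector
K = rng.standard_normal((3, d))     # three key vectors

scores = K @ q / np.sqrt(d)         # dot products: alignment of query with each key
weights = softmax(scores)           # normalized into a probability distribution

print(weights, weights.sum())       # non-negative weights that sum to 1
```

The dot products measure alignment, just as in the projection formula; softmax turns those raw alignments into weights that sum to one.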

Chapter 5 — Determinants

One number. The signed volume of the box formed by a matrix's rows. Zero means singular—the box is flat, a dimension collapsed, information destroyed. Nonzero means invertible—the box has real volume, nothing lost.

The rule that stayed with me: det(AB) = det(A) × det(B). Determinants multiply when transformations stack. A 96-layer transformer compounds the volume effect of each layer. If each layer slightly compresses space—|det| = 0.99—after 96 layers the signal shrinks to 0.99⁹⁶ ≈ 0.38. That is the vanishing gradient problem, stated as a volume calculation.

Residual connections fix this. Each layer computes x + F(x), so the effective transformation has the form I + (small matrix). The determinant of I + (small matrix) is approximately 1. Volume preserved. Signals survive. That is why deep transformers can train at all.
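Both numbers are easy to check. A sketch in NumPy, with a randomly drawn 64×64 matrix standing in for a "small" learned update (the size and scale are my choices, not from the text):

```python
import numpy as np

# Plain stacking: each layer multiplies volume by ~0.99, and it compounds.
print(0.99 ** 96)                  # ~0.38 -- most of the volume is gone

# Residual form: the transformation is I + (small matrix),
# so the determinant stays near 1 and volume is roughly preserved.
rng = np.random.default_rng(0)
F = 0.01 * rng.standard_normal((64, 64))   # stand-in for a small learned update
print(np.linalg.det(np.eye(64) + F))       # close to 1
```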

Chapter 6 — Eigenvalues and Eigenvectors

This chapter gave me the language I had been missing. An eigenvalue tells you how much a matrix stretches along one direction. An eigenvector tells you which direction. Together they describe the natural behavior of any linear transformation.

The computation: det(A − λI) = 0 finds the eigenvalues. Then (A − λI)x = 0 finds the eigenvectors—which is a nullspace problem from Chapter 3. I had already learned how to find nullspaces. The new concept reduced to old machinery. That compounding—new ideas reducing to things I already knew—started happening in every chapter.
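The reduction to old machinery is visible in code. A toy 2×2 example of my own (symmetric, so everything stays real):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# det(A - lambda*I) = 0 gives lambda^2 - 4*lambda + 3 = 0 -> lambda = 3, 1
lams = np.linalg.eigvals(A)
print(sorted(lams))                  # [1.0, 3.0]

# (A - 3I) x = 0 is a nullspace problem -- Chapter 3 machinery.
x = np.array([1.0, 1.0])             # lies in the nullspace of A - 3I
print((A - 3 * np.eye(2)) @ x)       # (0, 0): x is an eigenvector for lambda = 3
print(A @ x)                         # (3, 3) = 3 * x
```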

The connection to transformers: after k layers, each eigenvalue gets raised to the kth power. λ = 1.01 after 96 layers becomes 2.6. λ = 0.99 becomes 0.38. The eigenvalue spectrum of a weight matrix determines what survives across depth. Training shapes those eigenvalues—pushing the important feature directions toward λ ≈ 1 and the noise directions toward λ ≈ 0.
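The compounding is one line to verify. A diagonal matrix makes the eigenvalues explicit (an idealized stand-in, not a real weight matrix):

```python
import numpy as np

A = np.diag([1.01, 0.99])            # eigenvalues sit on the diagonal
x = np.array([1.0, 1.0])

y = np.linalg.matrix_power(A, 96) @ x
print(y)   # ~(2.60, 0.38): one direction amplified, the other nearly gone
```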

Chapter 7 — The Singular Value Decomposition

The final chapter. The SVD decomposes any matrix—square, rectangular, any rank—into three pieces: rotate, stretch, rotate. That is all any matrix does. The singular values tell you how much stretching happens in each direction, ordered by importance.

The five-step process: compute AᵀA, find its eigenvalues, take square roots (singular values), find eigenvectors (right singular vectors V), compute Av/σ (left singular vectors U). Each step uses machinery from a previous chapter. AᵀA is always symmetric (Chapter 6.4). Its eigenvalues are always non-negative (Chapter 6.5). The eigenvectors are perpendicular (Chapter 4). The nullspace vectors fill the remaining basis (Chapter 3).
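The five steps can be walked through numerically and checked against a library SVD. The 2×2 matrix is my own small example:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Steps 1-2: eigenvalues of A^T A (symmetric, eigenvalues >= 0)
evals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]      # largest first, to match SVD convention
evals, V = evals[order], V[:, order]

# Step 3: singular values are the square roots
sigma = np.sqrt(evals)

# Steps 4-5: V holds the right singular vectors; u_i = A v_i / sigma_i
U = (A @ V) / sigma

print(np.linalg.svd(A, compute_uv=False))   # same sigmas as the hand route
print(U @ np.diag(sigma) @ V.T)             # reconstructs A = U Sigma V^T
```

Every line is an earlier chapter doing its job: `eigh` on a symmetric matrix, square roots of non-negative eigenvalues, perpendicular eigenvectors assembled into V.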

Everything converged into one decomposition: A = UΣVᵀ.

The Constitution

After Chapter 7, I needed something. Not another set of notes. Not a summary. A single document that connected everything—every section from every chapter, every mental model, every formula, every AI connection, every cross-chapter link. A document I could read from top to bottom and feel the full picture emerge.

I call it The Constitution. It runs through all seven chapters. It has connection maps showing how Chapter 1's dot product becomes Chapter 4's projection becomes Chapter 7's SVD. It has a complete transformer map—every linear algebra concept matched to its role in transformer architecture. Twenty-four rows. Every one built from first principles over these sessions.

The document exists to be read repeatedly. First read: let the flow happen. Second read: stop at each section, restate in your own words. Third read: focus on the connection maps. See how singular values (Chapter 7) are square roots of the eigenvalues (Chapter 6) of AᵀA (Chapters 2, 4, 6), which is always symmetric (Chapter 6.4), which guarantees perpendicular eigenvectors (Chapter 4.1).

Every chapter feeds every other chapter. That was invisible on the first pass. It became obvious on the third.

What I See Now That I Could Not See Before

When I look at a transformer architecture diagram now, I see subspaces. Each attention head has a row space (the input directions it reads) and a nullspace (the directions it ignores). Multi-head attention exists because one head's nullspace might be another head's row space. Thirty-two heads with different row spaces can cover all 4096 input dimensions. The combined blindness approaches zero.

When I see LoRA—adding a low-rank update ΔW = BA to a weight matrix—I see column space expansion. The SVD tells me exactly how good that low-rank approximation is. The singular values I would drop determine how much information I lose.

When I read about training instability, I see eigenvalues compounding. When I see residual connections, I see determinants staying near 1. When I see layer normalization, I see vectors being pushed toward equal length—the “normal” in orthonormal.

None of these connections were available to me before I did the math by hand. They are not the kind of thing you can learn from a summary or a video. They come from friction.

The Numbers

Seven chapters completed. Roughly 500 pages of Strang's textbook. Four chapter master notes documents compiled. One Constitution document connecting everything. One Python engine with five files implementing vectors through attention. Dozens of handwritten problem sets. Three blog posts documenting the journey.

This is not a sprint. It is a deliberate, sustained investment in understanding the mathematics that makes modern AI work.

What Comes Next

The next phase is probability, statistics, and calculus—the three pillars that sit alongside linear algebra inside every transformer.

  • Probability — because attention weights are probability distributions and every softmax output sums to 1
  • Statistics — because training is fundamentally a statistical estimation problem from a finite sample
  • Calculus — because backpropagation is the chain rule applied to a computational graph

Linear algebra gave me the structure—what matrices do, how spaces split, which directions matter. Probability will give me the uncertainty. Calculus will give me the dynamics. Three foundations. Linear algebra is the first one built. The other two are next.

The Line That Keeps Coming Back

In the early sessions, I made a joke about myself. I said I was “the guy with 50,000 fake weights.” All the tokens were there—the vocabulary, the embedding dimensions, the matrix shapes. But the numbers inside were random. Meaningless. Untrained.

Seven chapters of hand computation later, the weights are adjusting. Not all of them. Not perfectly. But the loss is dropping. The gradients are flowing. The representations are starting to mean something.

The weights are getting real.


Built in the pre-AGI era