Start here — the series hub
A short intro to a series where I build a chip that trains a transformer — and document the whole thing, mistakes included.
What this is
I’m building a digital chip that trains a transformer on-chip. It doesn’t just run the model; it actually learns, with the forward pass, backpropagation, and weight updates all done in hardware. I’m taking it through the full design flow to a chip layout, then writing it up as a series, one step at a time.
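For readers new to training, here is a toy software sketch of the three phases named above: forward pass, backpropagation, and weight update. It fits a single weight to a single data point with plain gradient descent. The function names, the learning rate, and the data are all made up for illustration; this is not the chip's design, only the loop the hardware has to reproduce.

```python
# Hypothetical toy example: learn w so that w * x matches a target.
# Everything here (names, values, learning rate) is an illustrative assumption.

def train_step(w, x, target, lr):
    y = w * x                        # forward pass
    loss = 0.5 * (y - target) ** 2   # squared-error loss
    grad_w = (y - target) * x        # backpropagation: dL/dw via the chain rule
    w_new = w - lr * grad_w          # weight update (plain gradient descent)
    return w_new, loss

def train(w=0.0, x=2.0, target=6.0, lr=0.1, steps=20):
    losses = []
    for _ in range(steps):
        w, loss = train_step(w, x, target, lr)
        losses.append(loss)
    return w, losses

if __name__ == "__main__":
    w, losses = train()
    print(f"learned w = {w:.4f}, first loss = {losses[0]}, last loss = {losses[-1]:.2e}")
```

With these values the weight moves toward 3, since 3 * 2 = 6, and the loss shrinks on every step. An inference-only design would stop after the first line of `train_step`.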
This post is the map. It’s short on purpose: what I’m doing, why, what to expect, and where it’s headed.
Why I’m doing it
Two reasons.
First, curiosity. Nearly all the “AI hardware” you read about is inference — running a model someone else already trained. The training side — gradients, backprop, weight updates in silicon — gets far less attention and is the harder, more interesting problem. I wanted to actually build it and see what breaks.
Second, learning in public. The best way I know to really understand something is to build it end-to-end and explain it. So instead of a finished, polished result, this is the process — including the parts where I get it wrong and have to figure out why. That’s where the real learning is anyway.
What to expect from the series
A few promises about how I’ll write these:
- One idea per post. Each post takes the design a single step further, and shows what that change costs and buys.
- Numbers and pictures, not just prose. Area, timing, power, waveforms, layout shots — the actual evidence.
- The mistakes stay in. The router that hung for hours, the “clean” report that wasn’t, the block placed in slightly the wrong spot: those stories are the most useful part.
- Honest about what it is. It’s a small design on a generic, educational process, and it isn’t a fabricated chip. I’ll say so every time. The goal is to demonstrate the mechanism clearly, not to claim a record.
Series index
Plans change, but here’s the direction. I’ll link each post here as it goes up (this post, #0, is the hub, so bookmark it):
- #0 — The Genesis (you are here)
- #1 — Backprop on a Chip: the premise, the approach, and how I keep myself honest (checking the math before touching the hardware).
- #2 — A Decoder Is Just an Encoder with a Mask: the minimal GPT-style trainer, and what actually differs between the two.
- #3 — Giving It a Vocabulary: adding a real language-model head so it predicts tokens.
- #4 — Scaling Up (for Free): growing the model, and why this design makes that mostly a memory problem, not a logic one.
- Later: multi-head attention, stacking layers, and hardware-friendly training tricks (low-precision training, memory-efficient attention, and friends).
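As a preview of the claim in #2, the sketch below shows one common way to express the encoder/decoder difference in software: the same attention-weight computation, with a causal mask switched on for the decoder. The function name and the pure-Python list representation are assumptions for illustration and say nothing about how the hardware in this series implements it.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square matrix of attention scores.

    causal=False: every position attends to every position (encoder-style).
    causal=True:  position i attends only to positions j <= i (decoder-style).
    Hypothetical helper for illustration only.
    """
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(n) if (not causal) or j <= i]
        m = max(row[j] for j in allowed)  # subtract the max for numerical stability
        exps = [math.exp(row[j] - m) if j in allowed else 0.0 for j in range(n)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

if __name__ == "__main__":
    scores = [[0.0] * 3 for _ in range(3)]  # equal scores make the mask easy to see
    for name, causal in (("encoder", False), ("decoder", True)):
        print(name)
        for row in attention_weights(scores, causal):
            print("  ", [round(v, 3) for v in row])
```

With equal scores, the encoder rows spread their weight evenly over all three positions, while the decoder rows put weight only on the current and earlier positions. The rest of the computation is unchanged.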
Who this is for
If you work in chip design, machine learning, or EDA, or you’re just curious what it takes to put learning itself into hardware, you’ll get something out of it. I don’t assume you know all three; I’ll explain the crossover pieces as they come up.
And if you spot a mistake, please tell me. For a series like this, being corrected in public is a feature, not a bug.
Next up — #1: Backprop on a Chip.