Train a Transformer on Silicon: #0 The Genesis


Start here — the series hub

A short intro to a series where I build a chip that trains a transformer — and document the whole thing, mistakes included.

What this is

I’m building a digital chip that trains a transformer in hardware. It doesn’t just run the model; it actually learns. The forward pass, backpropagation, and weight updates all happen on-chip. I’m taking the design through the full flow to a chip layout, and writing it up as a series, one step at a time.

This post is the map. It’s short on purpose: what I’m doing, why, what to expect, and where it’s headed.

Why I’m doing it

Two reasons.

First, curiosity. Nearly all the “AI hardware” you read about is inference — running a model someone else already trained. The training side — gradients, backprop, weight updates in silicon — gets far less attention and is the harder, more interesting problem. I wanted to actually build it and see what breaks.

Second, learning in public. The best way I know to really understand something is to build it end-to-end and explain it. So instead of a finished, polished result, this is the process — including the parts where I get it wrong and have to figure out why. That’s where the real learning is anyway.

What to expect from the series

A few promises about how I’ll write these:

One idea per post. Each post takes the design a single step further, and shows what that change costs and buys.
Numbers and pictures, not just prose. Area, timing, power, waveforms, layout shots — the actual evidence.
The mistakes stay in. The router that hung for hours, the “clean” report that wasn’t, the block placed slightly in the wrong spot — those stories are the most useful part.
Honest about what it is. It’s a small design on a generic, educational process, and it isn’t a fabricated chip. I’ll say so every time. The goal is to demonstrate the mechanism clearly, not to claim a record.

Series index

Plans change, but here’s the direction. I’ll link each post here as it goes up (this post, #0, is the hub, so bookmark it):

#0 — The Genesis (you are here)
#1 — Backprop on a Chip: the premise, the approach, and how I keep myself honest (checking the math before touching the hardware).
#2 — A Decoder Is Just an Encoder with a Mask: the minimal GPT-style trainer, and what actually differs between the two.
#3 — Giving It a Vocabulary: adding a real language-model head so it predicts tokens.
#4 — Scaling Up (for Free): growing the model, and why this design makes that mostly a memory problem, not a logic one.
Later: multi-head attention, stacking layers, and hardware-friendly training tricks (low-precision training, memory-efficient attention, and friends).

Who this is for

If you work in chip design, machine learning, or EDA, or you’re just curious what it takes to put learning itself into hardware, you’ll get something out of it. I won’t assume you know all three; I’ll explain the crossover pieces as they come up.

And if you spot a mistake, please tell me. For a series like this, being corrected in public is a feature, not a bug.

Next up — #1: Backprop on a Chip.


Posted July 1, 2026

Xiaomeng Wang

Tags: cadence, digital design, llm