Building a Transformer from Scratch in Rust

I wanted to know what a transformer actually is, not what an API returns when I call it. So I deleted the safety net. No PyTorch, no autograd, no NumPy. Just Rust, a flat Vec<f32>, and a stubborn refusal to let any gradient compute itself. The result became Rustformer, and it taught me more in three weeks than two years of model.fit() ever did.

This post walks through how I got there: the tensor layer, scaled dot-product attention, the manual backward pass, and the memory and performance tradeoffs that only show up once nothing is hidden from you.

Why Rust, and what no-autograd forces you to learn

Python lets you stay vague. You write the forward pass, call .backward(), and the framework quietly threads a computation graph behind your back. That convenience is exactly the thing standing between you and understanding. When there is no autograd, every gradient is a thing you must derive, store, and chain by hand. You cannot hand-wave the chain rule when nothing will compute it for you and every shape has to line up in code you wrote.

Rust adds a second kind of honesty. Ownership means you decide who holds each buffer and when it dies. There is no hidden allocation, no garbage collector pausing under your loss curve. A matmul is a matmul: three nested loops you wrote, over memory you laid out. If it is slow, the profiler points at your code, not a C++ kernel you will never read.

Removing autograd does not make the math harder. It makes the math visible. Every derivative you skipped learning becomes a function you now have to name.

The tensor layer

I refused to build a clever generic tensor. A transformer needs 2D and 3D arrays of f32, and that is all. So a tensor is a flat buffer plus a shape, with element offsets computed on demand as row * cols + col. Row-major, contiguous, boring on purpose.

pub struct Tensor {
    pub data: Vec<f32>,
    pub rows: usize,
    pub cols: usize,
}

impl Tensor {
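    pub fn matmul(&self, b: &Tensor) -> Tensor {
        // Inner dimensions must agree: (rows x cols) times (b.rows x b.cols).
        assert_eq!(self.cols, b.rows, "matmul shape mismatch");
        let mut out = vec![0.0; self.rows * b.cols];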
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.data[i * self.cols + k];
                for j in 0..b.cols {
                    out[i * b.cols + j] += a * b.data[k * b.cols + j];
                }
            }
        }
        Tensor { data: out, rows: self.rows, cols: b.cols }
    }
}

Note the loop order: i, k, j, not i, j, k. The inner loop walks contiguous memory in both b and out, so the cache stays happy. That single reordering was worth roughly a 3x speedup over the naive version before I touched anything else.
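To make the comparison concrete, here is a minimal sketch of both loop orders side by side. The free functions matmul_ijk and matmul_ikj are names invented for this illustration, not Rustformer's API, and the Tensor struct is repeated only so the snippet stands on its own.

```rust
/// Same layout as the post's tensor: a flat, row-major buffer plus its shape.
pub struct Tensor {
    pub data: Vec<f32>,
    pub rows: usize,
    pub cols: usize,
}

/// Naive i, j, k order. The inner loop reads b.data[k * b.cols + j] for
/// increasing k, so each step jumps b.cols elements ahead in memory.
pub fn matmul_ijk(a: &Tensor, b: &Tensor) -> Tensor {
    assert_eq!(a.cols, b.rows, "inner dimensions must match");
    let mut out = vec![0.0f32; a.rows * b.cols];
    for i in 0..a.rows {
        for j in 0..b.cols {
            let mut acc = 0.0f32;
            for k in 0..a.cols {
                acc += a.data[i * a.cols + k] * b.data[k * b.cols + j];
            }
            out[i * b.cols + j] = acc;
        }
    }
    Tensor { data: out, rows: a.rows, cols: b.cols }
}

/// Reordered i, k, j. The inner loop walks one row of b and one row of out,
/// both contiguous, while a single element of a stays in a register.
pub fn matmul_ikj(a: &Tensor, b: &Tensor) -> Tensor {
    assert_eq!(a.cols, b.rows, "inner dimensions must match");
    let mut out = vec![0.0f32; a.rows * b.cols];
    for i in 0..a.rows {
        for k in 0..a.cols {
            let a_ik = a.data[i * a.cols + k];
            for j in 0..b.cols {
                out[i * b.cols + j] += a_ik * b.data[k * b.cols + j];
            }
        }
    }
    Tensor { data: out, rows: a.rows, cols: b.cols }
}
```

Both functions compute the same product; only the order in which memory is touched differs. How large the gap is will depend on matrix size and hardware, so treat the 3x figure as a measurement from this project rather than a constant.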

The transformer block

Before the math, the shape of the thing. A block is a small fixed pipeline, and every arrow below is a tensor I had to carry forward and then unwind in reverse during backprop.

One transformer block. The grey paths are the residual connections feeding each add+norm.
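In code, the pipeline might look like the sketch below. It rests on several assumptions: the add+norm ordering is read off the caption (residual add first, then normalisation), the layer norm has no learnable gain or bias and uses an epsilon of 1e-5, and the two sublayers are passed in as closures. The names add_norm and block_forward are hypothetical; the post does not show Rustformer's actual block.

```rust
/// Flat, row-major tensor, repeated here so the sketch is self-contained.
pub struct Tensor {
    pub data: Vec<f32>,
    pub rows: usize,
    pub cols: usize,
}

/// Residual add followed by per-row layer normalisation.
/// Assumption: no learnable gain or bias, eps = 1e-5.
pub fn add_norm(x: &Tensor, sublayer_out: &Tensor) -> Tensor {
    assert_eq!((x.rows, x.cols), (sublayer_out.rows, sublayer_out.cols));
    assert!(x.cols > 0, "rows must not be empty");
    let eps = 1e-5f32;
    // Residual connection: add the sublayer output back onto its input.
    let mut data: Vec<f32> = x
        .data
        .iter()
        .zip(&sublayer_out.data)
        .map(|(a, b)| a + b)
        .collect();
    // Normalise each row to zero mean and (approximately) unit variance.
    for row in data.chunks_mut(x.cols) {
        let n = row.len() as f32;
        let mean = row.iter().sum::<f32>() / n;
        let var = row.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
        let inv_std = 1.0 / (var + eps).sqrt();
        for v in row.iter_mut() {
            *v = (*v - mean) * inv_std;
        }
    }
    Tensor { data, rows: x.rows, cols: x.cols }
}

/// One block: attention, add+norm, feed-forward, add+norm.
/// The sublayers are closures so the sketch stays independent of their details.
pub fn block_forward<A, F>(x: &Tensor, attention: A, feed_forward: F) -> Tensor
where
    A: Fn(&Tensor) -> Tensor,
    F: Fn(&Tensor) -> Tensor,
{
    let h = add_norm(x, &attention(x));
    add_norm(&h, &feed_forward(&h))
}
```

Every intermediate here (the attention output, h, the feed-forward output) is a tensor the backward pass has to revisit, which is why the forward pass in a no-autograd design usually returns a cache alongside the result.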

Scaled dot-product attention

This is the heart of it. Given queries Q, keys K, and values V, attention computes a weighted average of the values, where the weights come from how well each query matches each key. Dividing the scores by the square root of the head dimension keeps large dot products from saturating the softmax into a near one-hot distribution.
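Written as a formula, with d_k standing for the head dimension (the symbol is chosen here for illustration), the operation is:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The usual argument for the scale factor: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by the square root of d_k brings the scores back to unit variance before the softmax sees them.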

Scaled dot-product attention dataflow, from Q and K through the softmax weights into a weighted sum over V.

In Rust the forward pass reads almost exactly like the diagram:

fn attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Tensor {
    let scale = 1.0 / (k.cols as f32).sqrt();
    let mut scores = q.matmul(&k.transpose());
    for x in scores.data.iter_mut() {
        *x *= scale;
    }
    softmax_rows(&mut scores); // numerically stable, subtract row max
    scores.matmul(v)
}

The softmax subtracts the per-row maximum before exponentiating. Skip that and a few epochs of training will hand you a buffer full of NaN and no clue why. With no framework to warn you, numerical stability is your job now.
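The attention function leans on two helpers the post does not show, transpose and softmax_rows. A minimal sketch of how they could look, assuming the flat row-major Tensor from earlier; the bodies are an illustration, not Rustformer's actual code.

```rust
/// Flat, row-major tensor, repeated here so the sketch is self-contained.
pub struct Tensor {
    pub data: Vec<f32>,
    pub rows: usize,
    pub cols: usize,
}

impl Tensor {
    /// Returns a new tensor with rows and columns swapped (copies the data).
    pub fn transpose(&self) -> Tensor {
        let mut out = vec![0.0f32; self.data.len()];
        for r in 0..self.rows {
            for c in 0..self.cols {
                out[c * self.rows + r] = self.data[r * self.cols + c];
            }
        }
        Tensor { data: out, rows: self.cols, cols: self.rows }
    }
}

/// In-place softmax over each row, shifted by the row maximum for stability.
pub fn softmax_rows(t: &mut Tensor) {
    if t.cols == 0 {
        return; // nothing to normalise, and chunks_mut(0) would panic
    }
    for row in t.data.chunks_mut(t.cols) {
        // Subtracting the max makes the largest exponent exp(0) = 1,
        // so no entry can overflow to infinity.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for x in row.iter_mut() {
            *x = (*x - max).exp();
            sum += *x;
        }
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
}
```

The shift does not change the result, because softmax is invariant to adding a constant to every entry of a row. Without it, a score near 1000 would overflow exp in f32 and the division would produce NaN.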

The manual backward pass

Here is where autograd earns its keep, and here is what you learn by doing without it. The forward pass cached three things: the attention weights, Q Kᵀ, and the inputs. Backprop walks them in reverse. The gradient of the output flows back through the value matmul, then through the softmax Jacobian, then splits into Q and K.

// d_out: gradient flowing in from the next layer
fn attention_backward(d_out: &Tensor, ctx: &Cache) -> Grads {
    let d_v = ctx.weights.transpose().matmul(d_out);
    let d_weights = d_out.matmul(&ctx.v.transpose());
    let d_scores = softmax_backward(&d_weights, &ctx.weights);
    let d_q = d_scores.matmul(&ctx.k).scale(ctx.inv_sqrt_d);
    let d_k = d_scores.transpose().matmul(&ctx.q).scale(ctx.inv_sqrt_d);
    Grads { d_q, d_k, d_v }
}
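For reference, the same chain written as equations. The symbols are chosen for this note: S is the scaled score matrix, W the attention weights, O the output, G the incoming gradient (d_out in the code), and d the head dimension, so that ctx.inv_sqrt_d corresponds to one over the square root of d.

```latex
\[
\begin{aligned}
S &= \tfrac{1}{\sqrt{d}}\, Q K^{\top}, \qquad
W = \mathrm{softmax}_{\text{rows}}(S), \qquad
O = W V, \qquad
G = \frac{\partial L}{\partial O} \\[4pt]
\frac{\partial L}{\partial V} &= W^{\top} G \\
\frac{\partial L}{\partial W} &= G\, V^{\top} \\
\frac{\partial L}{\partial S} &= \mathrm{softmax\_backward}\!\left(\frac{\partial L}{\partial W},\, W\right)
  \quad \text{(applied row by row)} \\
\frac{\partial L}{\partial Q} &= \tfrac{1}{\sqrt{d}}\, \frac{\partial L}{\partial S}\, K \\
\frac{\partial L}{\partial K} &= \tfrac{1}{\sqrt{d}} \left(\frac{\partial L}{\partial S}\right)^{\!\top} Q
\end{aligned}
\]
```

Each line maps onto one statement in attention_backward, in the same order. A quick sanity check is that every gradient has the shape of the tensor it belongs to.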

The softmax backward is the part everyone glosses over. For a single row, the gradient is s * (g - dot(g, s)), where s is the softmax output and g is the incoming gradient. Deriving that by hand, then watching a numerical gradient check agree to five decimal places, was the single most satisfying moment of the project. That check is non-negotiable. I perturbed each parameter by a tiny epsilon, measured the finite-difference gradient, and compared. Every layer I wrote got verified this way before I trusted it.
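A minimal sketch of that check for a single softmax row. The function names are hypothetical, and the scalar loss dot(g, softmax(x)) is picked here only because its gradient with respect to x is exactly what the analytic formula should return.

```rust
/// Numerically stable softmax of one row.
pub fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Analytic backward for one row: s * (g - dot(g, s)).
/// `s` is the softmax output, `g` the gradient flowing in.
pub fn softmax_backward_row(g: &[f32], s: &[f32]) -> Vec<f32> {
    assert_eq!(g.len(), s.len());
    let dot: f32 = g.iter().zip(s).map(|(gi, si)| gi * si).sum();
    g.iter().zip(s).map(|(gi, si)| si * (gi - dot)).collect()
}

/// Central finite-difference gradient of loss(x) = dot(g, softmax(x)).
pub fn numeric_grad(x: &[f32], g: &[f32], eps: f32) -> Vec<f32> {
    let loss = |v: &[f32]| -> f32 {
        softmax(v).iter().zip(g).map(|(s, gi)| s * gi).sum()
    };
    (0..x.len())
        .map(|i| {
            // Nudge one input up and down, leaving the others untouched.
            let mut plus = x.to_vec();
            plus[i] += eps;
            let mut minus = x.to_vec();
            minus[i] -= eps;
            (loss(&plus) - loss(&minus)) / (2.0 * eps)
        })
        .collect()
}
```

To use it, compute the analytic gradient with softmax_backward_row(g, &softmax(x)) and compare it element by element against numeric_grad(x, g, eps). In f32 the choice of eps is a balance: too large and the truncation error grows, too small and rounding noise swamps the difference, so something around 1e-2 to 1e-3 with a tolerance to match is a reasonable starting point.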

Memory and performance vs Python

The headline number first. Training a small character-level model (4 layers, 128 dimensions) on a CPU, Rustformer ran a single training step in roughly 38 ms, against around 210 ms for a comparable Python-plus-NumPy implementation. Most of that gap is not magic. It is the absence of interpreter overhead and the cache-friendly loop order, plus the fact that I could reuse buffers instead of allocating fresh arrays every step.
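Buffer reuse can be as simple as a matmul variant that writes into a tensor the caller already owns. The sketch below is an assumption about how that might look; matmul_into is a hypothetical name, not a function the post shows.

```rust
/// Flat, row-major tensor, repeated here so the sketch is self-contained.
pub struct Tensor {
    pub data: Vec<f32>,
    pub rows: usize,
    pub cols: usize,
}

/// Computes a * b into `out`, reusing out's existing allocation.
/// If out.data already has enough capacity, no heap allocation happens.
pub fn matmul_into(a: &Tensor, b: &Tensor, out: &mut Tensor) {
    assert_eq!(a.cols, b.rows, "inner dimensions must match");
    out.rows = a.rows;
    out.cols = b.cols;
    // clear + resize zeroes the buffer while keeping its capacity.
    out.data.clear();
    out.data.resize(a.rows * b.cols, 0.0);
    for i in 0..a.rows {
        for k in 0..a.cols {
            let a_ik = a.data[i * a.cols + k];
            for j in 0..b.cols {
                out.data[i * b.cols + j] += a_ik * b.data[k * b.cols + j];
            }
        }
    }
}
```

Allocate the output once before the training loop and pass it to matmul_into on every step. The borrow rules also make the aliasing constraint explicit: out is borrowed mutably, so it cannot be the same tensor as a or b.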

Memory told the more interesting story. Because ownership is explicit, I could see exactly which activations had to survive until the backward pass and which could be dropped immediately. In Python those lifetimes are invisible, decided by the GC and the autograd graph. In Rust the borrow checker effectively handed me an activation-memory audit for free. I cut peak memory by about a third just by being honest about what the backward pass actually needed.

The tradeoff is real, though. With the Tensor shown here, shapes are runtime values, so a mismatch surfaces at run time rather than as a compile error, and the checks that catch it early are checks you write yourself. Every new layer is also code you write twice, forward and backward. Python buys iteration speed with that hidden graph. Rust buys clarity and throughput with your time. For learning, the trade was overwhelmingly worth it.

Lessons

A few things I will carry into everything else I build.

Gradient checking is the only thing that matters early. A forward pass that looks right tells you nothing. The numerical check is ground truth.
Memory layout is a hyperparameter. The loop reorder beat every micro-optimization I tried afterward. Know how your data sits in cache.
Numerical stability is a design concern, not a bug fix. Subtract the max, clip where needed, and assume f32 will betray you.
Autograd is a luxury you should understand before you depend on it. Once you have hand-chained the gradients through attention, the framework stops being magic.

Conclusion

Rustformer is not going to dethrone anything, and it was never meant to. It is a transformer I understand to the last gradient, written in a language that refused to let me lie to myself about memory or shapes. If you have only ever called .backward(), I cannot recommend this enough: pick a language with no safety net, delete the autograd, and derive the chain rule until it is yours. You will never look at a model the same way again.