Section 22.4: Knowledge Distillation

"My teacher was a hundred times my size and twice as sure of everything. I kept only the parts of its doubt that were useful, and now I answer in a tenth of the time."
A Student Model That Fit in One GPU

Big Picture

Knowledge distillation trains a small student model to reproduce the behavior of a large teacher, so the cheap student is what you actually deploy. The trick that makes it work is that a teacher's full output, a probability distribution over every class or token, carries far more information than the single hard label a dataset provides: the relative weights it assigns to the wrong answers ("dark knowledge") tell the student which mistakes are reasonable and which are absurd. A student that learns from this softened signal can reach accuracy that a same-size model trained on hard labels alone often cannot. Distillation is a scale-up technique in the sense that it makes one node cheaper, but its payoff is a fleet payoff. Every gigabyte of memory and millisecond of latency the student saves is earned once, in training, and collected on every replica in a serving fleet, which makes a successful distillation the single largest cost lever in Part V when it lands.

The previous section reduced a model's cost by removing weights from a fixed architecture (Section 22.3, pruning and sparsity). Distillation takes a different route to the same goal: instead of shrinking one model, it trains a separate, smaller model, typically from scratch, to imitate the large one. Quantization and pruning are constrained to stay near the original network; distillation is free to choose any student architecture that fits the deployment budget, then teach it to behave like the expensive teacher. That freedom is why distillation often gives the best accuracy per parameter of the three at aggressive compression ratios, and why the largest frontier labs ship distilled variants of their flagship models for everyday serving.

Like everything in this chapter, distillation is single-node economics: it changes what one replica costs to run. We treat it here as a labeled per-node prerequisite, because you cannot size or price a serving fleet (Chapter 23) until you know the unit cost of the model each node holds. The figure below frames the technique and the rest of the section unpacks it.

Figure 22.4.1: The distillation pipeline. The large teacher (left) turns each input into a softened probability distribution at temperature $T$ (center); the relative mass it places on the wrong classes is the "dark knowledge" a hard label discards. The small student (right) is trained to match that distribution, then is the only model deployed, replicated across the fleet so its per-node saving is collected on every replica.

1. Why a Soft Target Teaches More Than a Hard Label Beginner

A labeled training example gives the model a single hard fact: this image is a cat, not a dog, not a car. A one-hot label over $C$ classes carries at most $\log_2 C$ bits, and it says nothing about how the wrong classes relate to one another. A trained teacher, shown the same image, returns something richer: it might assign 0.55 to cat, 0.42 to dog, and 0.03 to car. That distribution encodes a similarity structure the hard label throws away. It says a cat is easily confused with a dog and almost never with a car, a fact the teacher learned from millions of examples and that a small student could not reliably infer from a handful of one-hot labels. Geoffrey Hinton and colleagues named this discarded information dark knowledge, and the central claim of distillation is that learning to reproduce it is what lets a small model punch above its parameter count.

The wrinkle is that a well-trained teacher is often overconfident: it places 0.999 on the right class and spreads almost nothing over the rest, so the informative structure is squashed into digits the student barely sees. Distillation fixes this by raising a temperature $T$ inside the softmax, flattening the distribution so the relative weights of the non-target classes become legible. The student is then trained to match this softened distribution rather than the hard label, and the temperature is the dial that controls how much of the teacher's nuance is exposed.
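To see the flattening concretely, the following minimal sketch applies a temperature-scaled softmax to one set of teacher logits. The logit values and the class names are hypothetical, chosen only to mimic an overconfident teacher; `softmax_T` is an illustrative helper, not a library function.

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax of a 1-D logit vector after dividing the logits by temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

classes = ("cat", "dog", "car")
logits = np.array([9.0, 4.0, 1.0])       # hypothetical overconfident teacher logits

for T in (1.0, 2.0, 4.0, 8.0):
    p = softmax_T(logits, T)
    row = "  ".join(f"{c}={v:.3f}" for c, v in zip(classes, p))
    print(f"T={T:>4}: {row}")
```

At $T = 1$ nearly all the mass sits on cat. As $T$ rises, the dog and car probabilities grow large enough to compare with each other, while the ranking of the classes never changes.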

Key Insight: The Information Is in the Wrong Answers

A hard label tells the student only what the right answer is. The teacher's softened distribution additionally tells it how wrong each wrong answer is, and that ranking of the alternatives is the part a small model cannot recover on its own from limited data. Distillation works precisely to the extent that the student needs this similarity structure and lacks the capacity or the data to learn it directly. When the student is already big enough and the labeled data already plentiful, the soft target adds little; the technique earns its keep at the small, data-limited end, which is exactly the deployment regime that matters for a serving fleet.

2. The Distillation Loss and Temperature Intermediate

Make the temperature precise. Given a teacher's logits $z^{\mathcal{T}}$ and a student's logits $z^{\mathcal{S}}$ over $C$ classes, the temperature-scaled softmax is

$$p_i(z; T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{C} \exp(z_j / T)}.$$

At $T = 1$ this is the ordinary softmax; as $T$ grows, the distribution flattens toward uniform and the relative weights of the small probabilities are amplified. The student is trained to minimize a convex combination of two terms: a distillation term that matches the teacher's softened distribution, and an ordinary term that matches the true hard label when one is available,

$$\mathcal{L} = (1 - \alpha)\, \underbrace{\mathrm{CE}\big(y,\, p(z^{\mathcal{S}}; 1)\big)}_{\text{hard-label loss}} \;+\; \alpha\, T^2 \underbrace{\mathrm{KL}\big(p(z^{\mathcal{T}}; T)\,\|\,p(z^{\mathcal{S}}; T)\big)}_{\text{soft-target loss}}.$$

Two details earn their place. The mixing weight $\alpha$ trades the cheap, abundant soft targets against the scarce, authoritative hard labels; pure distillation is $\alpha = 1$. The $T^2$ factor rescales the soft-target gradient, which otherwise shrinks like $1/T^2$ under temperature scaling, so that the two loss terms keep comparable magnitudes as $T$ changes. With these in place, and assuming the logits are zero-meaned for each example, the student's soft-target gradient in the high-temperature limit is proportional to the difference $z^{\mathcal{S}} - z^{\mathcal{T}}$ between student and teacher logits, which is why distillation is sometimes described as logit matching with a soft floor.
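The loss and the high-temperature claim can be checked numerically. The sketch below is a minimal single-example implementation of the combined loss, together with the analytic gradient of the $T^2$-scaled soft-target term with respect to the student logits, which works out to $T\,\big(p(z^{\mathcal{S}}; T) - p(z^{\mathcal{T}}; T)\big)$. The function names and the logit values are illustrative assumptions, and the logits are deliberately zero-mean so the limit applies.

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax of a 1-D logit vector."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(z_s, z_t, y, T=4.0, alpha=0.5):
    """(1 - alpha) * CE(y, p_S at T=1) + alpha * T^2 * KL(p_T at T || p_S at T)."""
    p_t, p_s = softmax_T(z_t, T), softmax_T(z_s, T)
    hard = -np.log(softmax_T(z_s, 1.0)[y])               # hard-label cross-entropy
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s)))     # KL(teacher || student)
    return (1 - alpha) * hard + alpha * T**2 * soft

def soft_grad(z_s, z_t, T):
    """Analytic gradient of T^2 * KL with respect to the student logits."""
    return T * (softmax_T(z_s, T) - softmax_T(z_t, T))

# Hypothetical zero-mean logits for one example with C = 3 classes.
z_t = np.array([2.0, 1.0, -3.0])
z_s = np.array([0.5, -1.0, 0.5])

for T in (1.0, 4.0, 100.0):
    print(f"T={T:>5}: grad={soft_grad(z_s, z_t, T)}  (z_s - z_t)/C={(z_s - z_t) / z_s.size}")
```

As $T$ grows, the printed gradient approaches $(z^{\mathcal{S}} - z^{\mathcal{T}})/C$. Without the $T^2$ factor the same gradient would be $(1/T)\,(p^{\mathcal{S}} - p^{\mathcal{T}})$, which shrinks like $1/T^2$ because the probability difference itself shrinks like $1/T$.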

Fun Note: The Cake That Remembered the Recipe

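In one experiment in the original distillation paper (Hinton, Vinyals, and Dean, 2015), every example of the digit 3 was removed from the transfer set used to train the student, so the student never saw a single three. The teacher had been trained on all ten digits. The student still recognized most threes at test time, and nearly all of them once the bias of its "3" output was adjusted, having learned what a three looks like entirely from the small probabilities the teacher assigned to the class "3" when shown other digits. The knowledge of "three-ness" was hiding in the teacher's confusions, not its correct answers. The wrong columns of the softmax were doing the teaching.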

3. From Scratch: A Distilled Student Beats Its Same-Size Twin Intermediate

The claim that soft targets buy generalization a hard-label model of the same size cannot reach is easy to demonstrate with no deep-learning machinery at all. The program below builds a nonlinear three-class problem, trains a large teacher on a big labeled corpus, then trains two small students of identical size: one on the handful of hard labels we can afford, the other on the teacher's softened predictions over a larger unlabeled transfer set. The student is a tiny multinomial classifier using only a fraction of the teacher's features, so it is genuinely capacity-limited, exactly the deployment situation distillation is meant for.

import numpy as np

rng = np.random.default_rng(3)

def make_data(n):                                  # nonlinear 3-class spiral rule
    Z = rng.standard_normal((n, 2)) * 1.3
    r = np.sqrt((Z**2).sum(1))
    ang = np.arctan2(Z[:, 1], Z[:, 0])
    y = (np.floor((ang + np.pi) / (2 * np.pi) * 3 + 0.9 * r)).astype(int) % 3
    return Z, y

def softmax(Z, T=1.0):                             # temperature-scaled softmax
    Z = (Z / T); Z = Z - Z.max(1, keepdims=True)
    E = np.exp(Z); return E / E.sum(1, keepdims=True)

def onehot(y, C=3):
    O = np.zeros((y.size, C)); O[np.arange(y.size), y] = 1.0; return O

def train(F, Targets, epochs=2500, lr=0.8, l2=1e-3):   # fit to (soft or hard) Targets
    W = np.zeros((F.shape[1], Targets.shape[1])); b = np.zeros(Targets.shape[1])
    n = F.shape[0]
    for _ in range(epochs):
        G = (softmax(F @ W + b) - Targets) / n     # cross-entropy gradient; student softmax stays at T=1
        W -= lr * (F.T @ G + l2 * W); b -= lr * G.sum(0)
    return W, b

acc = lambda F, W, b, y: (np.argmax(F @ W + b, 1) == y).mean()

D = 600                                            # shared random-feature lift
W1 = rng.standard_normal((2, D)) * 1.6
b1 = rng.standard_normal(D) * 1.6
feat = lambda X: np.tanh(X @ W1 + b1)

# TEACHER: large model trained on a big labeled corpus (the expensive asset)
Xbig, ybig = make_data(8000)
Wte, bte = train(feat(Xbig), onehot(ybig))

d = 16                                             # student sees only d features
Xte, yte = make_data(20000)
Fte = feat(Xte)[:, :d]

Xlab, ylab = make_data(40)                         # all the hard labels we can afford
Xtr = np.vstack([Xlab, make_data(2000)[0]])        # + a large UNLABELED transfer set
Ftr_big = feat(Xtr); Ftr_s = Ftr_big[:, :d]

# Student A: hard labels only, on the 40 labeled points
WhA, bhA = train(feat(Xlab)[:, :d], onehot(ylab))
student_hard = acc(Fte, WhA, bhA, yte)

# Student B: distilled. Teacher softly labels the whole transfer set at temperature T.
T = 4.0
soft = softmax(Ftr_big @ Wte + bte, T=T)
WhB, bhB = train(Ftr_s, soft)
student_soft = acc(Fte, WhB, bhB, yte)

teacher_test = acc(feat(Xte), Wte, bte, yte)
pT = Wte.size + bte.size; pS = WhA.size + bhA.size
print(f"teacher params (W,b)     : {pT:,}")
print(f"student params (W,b)     : {pS:,}  ({pS/pT:.1%} of teacher)")
print(f"distillation temperature : {T}")
print("-" * 46)
print(f"teacher test accuracy    : {teacher_test:.3f}")
print(f"student, hard labels     : {student_hard:.3f}")
print(f"student, distilled (soft): {student_soft:.3f}")
print(f"gap to teacher recovered : {(student_soft-student_hard)/(teacher_test-student_hard):.0%}")

Code 22.4.1: Pure-NumPy distillation. Two students of identical size are trained, one on scarce hard labels and one on the teacher's temperature-softened predictions over a larger transfer set, and their test accuracy is compared against the teacher and against each other.

teacher params (W,b)     : 1,803
student params (W,b)     : 51  (2.8% of teacher)
distillation temperature : 4.0
----------------------------------------------
teacher test accuracy    : 0.852
student, hard labels     : 0.585
student, distilled (soft): 0.745
gap to teacher recovered : 60%

Output 22.4.1: The distilled student carries 2.8 percent of the teacher's parameters yet recovers 60 percent of the accuracy gap that separated the hard-label student from the teacher, lifting test accuracy from 0.585 to 0.745 at no change in student size.

The two students are the same model with the same parameter count and the same optimizer; what differs is the supervision. The hard-label student sees only the 40 labeled points and lands at 0.585. The distilled student sees the teacher's softened distribution over those 40 inputs plus 2,000 unlabeled ones, reaches 0.745, and closes about 60 percent of the distance to the teacher. The gain therefore comes from two things the teacher makes possible at once: it turns unlabeled inputs into training data, and its softened probabilities carry the similarity structure of the problem that 40 hard labels could not. This is the value proposition of distillation in one experiment, and the same logic carries from this toy classifier up to billion-parameter language models.

Practical Example: Replacing a Flagship LLM on the Hot Path

Who: An ML platform team running a customer-support assistant on a fleet of inference nodes.

Situation: Every request hit a 70-billion-parameter chat model that needed multiple GPUs per replica, and traffic had grown until the GPU bill dominated the product's unit economics.

Problem: Quantization (Section 22.2) had already squeezed the model to 4-bit and pruning gave diminishing returns; the architecture itself was simply too large for the latency target at peak load.

Dilemma: Keep the large model and pay for a sprawling fleet with slack capacity for spikes, or switch to a much smaller model and risk a quality regression that support agents and customers would notice immediately.

Decision: They distilled. The 70B model became the teacher, generating softened responses on a large corpus of real support transcripts, and a 7B student was trained to match them, response-based plus sequence-level distillation on the generated text.

How: The teacher ran an offline, inference-heavy data-generation job (the cost discussed in Section 4) to label millions of transcripts; the student trained on that synthetic corpus, then was evaluated against the teacher on held-out tickets before any rollout.

Result: The 7B student fit on a single GPU per replica, cut per-request latency by more than half, and held within two points of the teacher's task quality. Because the saving was per node, the fleet shrank by roughly a factor of four, and that multiplied saving, not the single-node win, was what justified the project.

Lesson: Distillation is the biggest fleet-cost lever when it lands, because a smaller student is paid for once in training and collected on every replica forever. The teacher's data-generation cost is real but one-time; the serving saving is continuous.

4. Variants, Cost, and When Distillation Wins Advanced

The soft-target recipe above is the response-based variant: the student matches the teacher's final output distribution. Two other families extend it. Feature-based distillation adds terms that align the student's intermediate representations with the teacher's, so the student learns not just what the teacher answers but how it internally arrives there; this helps when the student architecture differs enough that matching only the final layer leaves too little signal. Sequence-level distillation is the form that matters for generative models: instead of matching the teacher's per-token distribution, the student is trained on whole sequences the teacher generates, so it learns to reproduce the teacher's outputs as coherent wholes rather than token by token. Modern LLM distillation leans heavily on this last form, training students on text the teacher writes.
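A feature-based term can be sketched in a few lines. Because the student's hidden layer is usually narrower than the teacher's, one common approach is to learn a linear projection from student features to teacher features and penalize the squared error. Everything below is an illustrative assumption: the activations are synthetic, the widths (8 and 32) are arbitrary, and only the projection `P` is updated here, whereas in a real run the same gradient would also flow back into the student's weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_loss(h_s, h_t, P):
    """Batch-mean squared distance between projected student and teacher features."""
    return np.mean(np.sum((h_s @ P - h_t) ** 2, axis=1))

# Synthetic hidden activations for a batch of 256 inputs (hypothetical data).
h_t = rng.standard_normal((256, 32))                   # teacher layer, width 32
M = rng.standard_normal((32, 8)) / np.sqrt(32)
h_s = h_t @ M + 0.1 * rng.standard_normal((256, 8))    # student layer, width 8

P = np.zeros((8, 32))                                  # student-to-teacher projection
lr = 0.1
before = feature_loss(h_s, h_t, P)
for _ in range(500):
    R = h_s @ P - h_t                                  # residual in teacher feature space
    P -= lr * (2.0 / h_s.shape[0]) * (h_s.T @ R)       # gradient of feature_loss w.r.t. P
after = feature_loss(h_s, h_t, P)

print(f"feature loss before: {before:.3f}  after: {after:.3f}")
```

In a full training loop this term would be added to the response-based loss with its own weight. The loss does not fall to zero here because an 8-dimensional student layer cannot linearly reconstruct all 32 teacher dimensions, which is the capacity limit the projection is meant to bridge rather than remove.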

That points at the one real cost of distillation, the cost that ties it back to the distributed theme of this book. Generating the teacher's targets is itself a large inference job: to distill on millions of examples you must run the expensive teacher over all of them first. For a frontier LLM teacher this is a serious distributed-inference workload in its own right, the subject of Chapter 23, and it is the reason distillation is an offline data-generation pipeline, not a training trick you bolt on for free. The economics still favor it because the generation cost is paid once while the serving saving recurs on every replica of the fleet, but the bill is front-loaded onto the very distributed-inference machinery the student is meant to lighten.
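The front-loaded-versus-recurring argument reduces to a short calculation. The helper below is a rough sketch under stated assumptions: generation and serving GPUs cost the same hourly price, the student's own training cost is ignored, and every number in the example call is hypothetical. Note that spreading generation across more GPUs shortens the wall-clock time but leaves the GPU-hours, and so the cost, unchanged.

```python
def distillation_break_even(n_examples, sec_per_example, gpu_price_per_hour,
                            gpus_before, gpus_after):
    """Return (one-time generation cost, daily serving saving, break-even in days)."""
    gen_gpu_hours = n_examples * sec_per_example / 3600.0
    gen_cost = gen_gpu_hours * gpu_price_per_hour
    daily_saving = (gpus_before - gpus_after) * gpu_price_per_hour * 24.0
    return gen_cost, daily_saving, gen_cost / daily_saving

# Hypothetical numbers, for illustration only.
cost, saving, days = distillation_break_even(
    n_examples=1_000_000, sec_per_example=2.0, gpu_price_per_hour=3.0,
    gpus_before=80, gpus_after=30)

print(f"generation cost ${cost:,.0f}, saving ${saving:,.0f}/day, "
      f"break-even after {days:.2f} days")
```

The generation cost appears once in the numerator, while the saving in the denominator recurs every day for every GPU removed from the fleet, so the break-even time shrinks as the fleet grows.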

When does distillation beat quantization and pruning? Quantization keeps the same architecture and shrinks each number; pruning keeps the architecture and removes some numbers; both are bounded by how far you can perturb a fixed network before it breaks. Distillation is unbounded in that sense: the student can be an entirely different, smaller architecture, so when you need a four- or ten-fold size reduction rather than two-fold, distillation is usually the only one of the three that reaches it without collapse. The cost is that it requires a full training run and the teacher's generated data, whereas post-training quantization needs neither. In practice the three compose: distill to a smaller architecture, then quantize and prune the student, stacking the savings.

Library Shortcut: HuggingFace and TextBrewer Do the Loss Plumbing

Code 22.4.1 wrote the temperature softmax, the soft-target loss, and the training loop by hand to expose the mechanism. In practice you assemble those pieces from a library. HuggingFace Transformers ships distillation example scripts (the DistilBERT recipe that produced a model 40 percent smaller and 60 percent faster at 97 percent of BERT's quality), and the Trainer API lets you override compute_loss to add the temperature-scaled KL term in a few lines. TextBrewer is a dedicated PyTorch distillation toolkit that goes further, declaring response-based, feature-based, and attention-matching losses through a configuration object so you wire teacher-to-student layer mappings without touching the training loop:

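# pip install textbrewer
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Assumed to exist already: `teacher` and `student` (PyTorch modules),
# `optimizer` (built over the student's parameters), and `train_loader` (a DataLoader).

def adaptor(batch, model_outputs):
    # Maps raw model outputs to the dict TextBrewer reads. Which field holds the
    # logits depends on your model; `.logits` assumes a HuggingFace-style output.
    return {"logits": model_outputs.logits}

distill_cfg = DistillationConfig(
    temperature=4.0,                 # the T from the loss in Section 2
    hard_label_weight=0.0,           # plays the role of 1 - alpha; 0 means pure soft-target distillation
    kd_loss_type="ce",               # cross-entropy / KL on softened logits
)
train_cfg = TrainingConfig()
distiller = GeneralDistiller(        # handles the teacher forward pass + combined loss
    train_config=train_cfg, distill_config=distill_cfg,
    model_T=teacher, model_S=student,
    adaptor_T=adaptor, adaptor_S=adaptor)

with distiller:                      # one call replaces the hand-written loop above
    distiller.train(optimizer, train_loader, num_epochs=3, callback=None)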

Code 22.4.2: The same temperature and soft-target loss as Code 22.4.1, now declared as a TextBrewer configuration. The library runs the teacher forward pass, assembles the combined loss, and manages the loop, collapsing the from-scratch implementation to a configuration object and one train call.

Research Frontier: Distilling Frontier Models and Distillation Scaling Laws (2024 to 2026)

Two threads dominate the recent literature. The first is the systematic distillation of large frontier models into small deployable ones: the Gemma 2 and Gemma 3 small models were trained with knowledge distillation from larger teachers rather than on next-token prediction alone, and synthetic-data distillation, where a strong teacher generates the entire training corpus for a smaller student, has become a standard route to capable compact models (the open reproductions in the lineage of Alpaca and the broader "distill from a frontier API" recipe). The second is the search for distillation scaling laws: Busbridge et al. (2025) fit predictive laws for how a distilled student's loss depends on student size, teacher size, and the amount of distillation data. They report the counterintuitive finding that a stronger teacher does not always yield a better student (a capacity gap can hurt), which connects distillation directly to the scaling-law machinery for foundation models in Section 19.2. Together these turn distillation from a hand-tuned craft into a budgeted design choice: given a serving budget, the laws help estimate which teacher and how much generated data produce the best student.

The takeaway for the rest of Part V is that distillation changes the single most important number a serving fleet is built around: the size of the model on each node. The sections that follow attack the runtime cost of whatever model the node ends up holding. The next one is the most consequential for generative serving, the memory the model spends remembering the conversation so far.

Exercise 22.4.1: Read the Dark Knowledge Conceptual

A three-class image classifier, shown a photo, outputs the softened distribution (cat 0.55, dog 0.42, car 0.03) at temperature $T = 4$. A second image yields (cat 0.50, dog 0.05, car 0.45). Both are labeled "cat," so the hard-label loss treats them identically. Explain precisely what extra information the soft targets give a student about these two images, and why a small student trained on the soft targets could end up with a different, better decision boundary than one trained on the hard labels alone. Then state one situation in which the soft target would add almost nothing, and connect it to the Key Insight in Section 1.

Exercise 22.4.2: Sweep the Temperature and the Mixing Weight Coding

Extend Code 22.4.1. First, sweep the temperature $T$ over $\{1, 2, 4, 8, 16\}$ and plot the distilled student's test accuracy against $T$; identify the temperature that maximizes it and explain the shape of the curve in terms of how flattening the distribution exposes or washes out the dark knowledge. Second, reinstate the hard-label term by training the student on the loss $\mathcal{L} = (1-\alpha)\,\mathrm{CE}(y, p^{\mathcal{S}}) + \alpha\, T^2\,\mathrm{KL}(p^{\mathcal{T}} \| p^{\mathcal{S}})$ over the labeled points, and sweep $\alpha \in \{0, 0.25, 0.5, 0.75, 1\}$. Report which combination of $T$ and $\alpha$ gives the best student and whether mixing in the hard labels ever beats pure distillation here.

Exercise 22.4.3: Price the Distillation Against the Serving Saving Analysis

Suppose a 70B teacher serves a fleet at a cost you would like to cut by distilling to a 7B student. Generating the teacher's soft targets requires running the teacher once over a 5-million-example corpus, at roughly 0.5 seconds per example on one GPU. The resulting student lets you shrink the serving fleet from 400 GPUs to 100 GPUs, each GPU costing two dollars per hour. Estimate the one-time teacher data-generation cost (you may parallelize it across 50 GPUs, the distributed-inference job of Chapter 23) and the continuous fleet saving per day. Compute the break-even time at which the distillation has paid for itself, and argue from these numbers why the saving is "per node, collected on every replica" rather than a one-time gain.