"They asked me to hold sixteen bits per weight, then four. I lost twelve bits and almost none of my opinions."
A Weight Matrix, Newly Compressed
Quantization stores a model's numbers in fewer bits, so the model occupies less memory and moves fewer bytes per token, and bytes moved per token is the quantity that actually limits inference on one accelerator. A trained network arrives in 16-bit or 32-bit floating point; mapping those values onto 8-bit or 4-bit integers (or 8-bit floats) with a small scale factor cuts the parameter memory by two to four times relative to a 16-bit baseline and cuts the memory traffic by roughly the same factor. Token generation is memory-bound: reading the weights once per token is the dominant cost, so that traffic reduction translates almost directly into speed. This is a single-node, scale-up technique: it makes one node cheaper, and a serving fleet (Chapter 23) multiplies that per-node saving across every replica. A model that needs a quarter of the memory fits on roughly a quarter of the GPUs, so the fleet shrinks almost proportionally.
The previous section established the rule that governs this whole chapter: the efficiency of one node sets the unit cost that the fleet multiplies, so a per-node saving is never just a local optimization. Quantization is the first and largest of those savings. It does not change what the model computes in any deep sense; it changes how precisely each number is stored, and modern large models tolerate a surprising amount of imprecision in their weights with negligible loss of quality. The result is the single most cost-effective lever in inference serving, which is why every production stack, from a phone to a datacenter, reaches for it before anything else.
To see why fewer bits buy speed and not just space, recall the roofline model of Chapter 3. Autoregressive generation produces one token at a time against a batch that is often small, so each weight is read from memory and used for only a handful of arithmetic operations before the next weight is needed. The accelerator's arithmetic units sit idle waiting for bytes; the workload lives far to the left of the roofline's ridge, firmly memory-bound. Halving the bytes per weight roughly halves the time spent waiting, which is why quantization speeds up generation even though it adds a tiny amount of arithmetic to rescale the integers back to a usable range.
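The roofline argument can be turned into a back-of-envelope bound. The sketch below assumes that every weight is streamed from memory exactly once per generated token and that nothing else limits speed; the model size, the bandwidth figure, and the helper name `min_ms_per_token` are illustrative assumptions, not measurements of any real accelerator.

```python
# Memory-bound lower bound on per-token latency:
#   time_per_token >= weight_bytes / memory_bandwidth
# All numbers are illustrative assumptions.

def min_ms_per_token(n_params: float, bits_per_weight: int, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency (ms) from weight traffic alone."""
    weight_bytes = n_params * bits_per_weight / 8
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3

N_PARAMS = 8e9          # hypothetical 8B-parameter model
BANDWIDTH_GB_S = 1000   # hypothetical accelerator with 1 TB/s of memory bandwidth

for bits in (16, 8, 4):
    ms = min_ms_per_token(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit weights: >= {ms:5.1f} ms/token  (<= {1e3 / ms:5.0f} tokens/s)")
```

Under these assumptions the bound falls from 16 ms to 8 ms to 4 ms per token as the weights go from 16 to 8 to 4 bits. Real kernels land above the bound, but the proportionality is the point.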
1. The Quantization Map: Floats Into Integers and Back (Intermediate)
Quantization is an affine map between a continuous range of real values and a small set of integers. Pick a range of weight values to represent, choose a number of bits $b$, and divide the range into $2^b$ evenly spaced levels. A floating-point weight $w$ becomes an integer $q$, and the integer can be turned back into an approximate float $\hat{w}$, through a scale $s$ and an optional zero-point $z$:
$$ q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{w}{s}\right) + z,\; q_{\min},\; q_{\max}\right), \qquad \hat{w} = s\,(q - z). $$
The scale $s$ is the width of one quantization step in the original units; the zero-point $z$ is the integer that represents a true zero, needed when the range is not symmetric about the origin. For weights, which are usually centered near zero, a symmetric scheme with $z = 0$ and a signed integer range is common: an 8-bit signed integer covers $[-128, 127]$, so $q_{\max} = 127$ and the scale is $s = \max|w| / 127$ over whatever group of weights shares that scale. For any weight that is not clipped, the reconstruction error $w - \hat{w}$ is bounded in magnitude by half a step, $s/2$, which makes the message of the formula plain: the error scales with $s$, and $s$ is set by the largest magnitude in the group divided by the integer range. One large outlier inflates $s$ and coarsens the quantization of every other weight that shares it. Figure 22.2.1 shows the mapping and the memory it saves.
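A tiny numeric instance may make the map concrete. The sketch below applies the two formulas directly to six made-up weights under symmetric INT8 ($z = 0$); the function names `quantize` and `dequantize` and the weight values are illustrative choices, not part of any library.

```python
import numpy as np

def quantize(w, s, z, qmin, qmax):
    """q = clip(round(w / s) + z, qmin, qmax)"""
    return np.clip(np.round(w / s) + z, qmin, qmax).astype(np.int32)

def dequantize(q, s, z):
    """w_hat = s * (q - z)"""
    return s * (q.astype(np.float64) - z)

# Made-up weights; symmetric INT8, so the zero-point is 0.
w = np.array([-1.27, -0.402, 0.0, 0.013, 0.5, 1.27])
qmin, qmax = -128, 127
s = np.max(np.abs(w)) / qmax      # 1.27 / 127, about 0.01
z = 0

q = quantize(w, s, z, qmin, qmax)
w_hat = dequantize(q, s, z)
max_err = np.max(np.abs(w - w_hat))

print("scale s    :", s)
print("integers q :", q)
print("max |error|:", max_err, " bound s/2:", s / 2)
```

The integers come out as -127, -40, 0, 1, 50, 127, and the largest error (about 0.003, on the weight 0.013) stays under the half-step bound of about 0.005.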
2. Per-Tensor Versus Per-Channel, and the Outlier Problem (Intermediate)
The choice of which weights share a scale is where most of the accuracy is won or lost. The cheapest option is per-tensor quantization: one scale for the entire weight matrix. It is compact, one extra number per matrix, but it is fragile, because a single large entry anywhere in the matrix sets $s$ for all of it. The finer option is per-channel (or per-group) quantization: a separate scale for each row, column, or small block of the matrix. It costs a few hundred extra scale values, a rounding error in the total memory, and it isolates a large entry so that it only coarsens its own channel instead of the whole tensor. The numerical formats that make these integer and low-bit-float representations possible are the subject of Chapter 15; here we use them to shrink a deployed model.
Outliers are not a rare nuisance in large transformers; they are systematic. Certain feature dimensions in the activations of large language models carry values tens of times larger than their neighbors, and these outlier channels dominate any shared scale. The activation outlier problem is the central difficulty in pushing quantization below 8 bits, and it has produced a family of methods that either give outlier-bearing channels their own finer scale (per-channel and per-group schemes) or move the difficulty from activations to weights. SmoothQuant, for instance, rescales activations and weights jointly so that the hard-to-quantize outliers are absorbed into the weights, where per-channel scales handle them, leaving the activations smooth enough for plain 8-bit integers. The code below makes the per-tensor versus per-channel gap concrete on a matrix with deliberately injected outliers.
import numpy as np

rng = np.random.default_rng(0)
rows, cols = 512, 512                      # a weight matrix W: out x in
W = rng.standard_normal((rows, cols)).astype(np.float32)

# Inject heavy-tailed outliers into a few columns (an input-channel effect).
outlier_cols = rng.choice(cols, size=8, replace=False)
W[:, outlier_cols] *= 20.0

def quant_dequant(W, bits, axis):
    qmax = 2 ** (bits - 1) - 1             # symmetric signed range
    if axis is None:                       # per-tensor: one scale
        scale = np.max(np.abs(W)) / qmax
        q = np.clip(np.round(W / scale), -qmax - 1, qmax)
        return q * scale, np.array([scale])
    # Per-channel: reduce over `axis`, so axis=0 yields one scale per column,
    # which isolates the outlier columns injected above.
    amax = np.max(np.abs(W), axis=axis, keepdims=True)
    scale = amax / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale, scale.ravel()

def report(name, bits, axis):
    Wq, scales = quant_dequant(W, bits, axis)
    err = np.linalg.norm(W - Wq) / np.linalg.norm(W)   # relative reconstruction error
    fp16_bytes = W.size * 2
    int_bytes = W.size * bits / 8 + scales.size * 4    # packed ints + fp32 scales
    print(f"{name:<26} rel_err={err:.4f} "
          f"mem={int_bytes/1024:7.1f} KiB "
          f"shrink_vs_fp16={fp16_bytes/int_bytes:4.1f}x")

print(f"fp16 baseline mem={W.size*2/1024:7.1f} KiB")
report("INT8 per-tensor", 8, None)
report("INT8 per-channel", 8, 0)
report("INT4 per-tensor", 4, None)
report("INT4 per-channel", 4, 0)
Output 22.2.1
fp16 baseline mem=  512.0 KiB
INT8 per-tensor            rel_err=0.0702 mem=  256.0 KiB shrink_vs_fp16= 2.0x
INT8 per-channel           rel_err=0.0079 mem=  258.0 KiB shrink_vs_fp16= 2.0x
INT4 per-tensor            rel_err=0.4021 mem=  128.0 KiB shrink_vs_fp16= 4.0x
INT4 per-channel           rel_err=0.1450 mem=  130.0 KiB shrink_vs_fp16= 3.9x
Two lessons sit in Output 22.2.1. First, the memory reduction is exactly what the bit count promises: INT8 halves it, INT4 quarters it, and the per-channel scales add a negligible two KiB. Second, the accuracy is governed entirely by how the scale is shared. Per-tensor INT4 destroys the matrix (a $40\%$ relative error) because eight outlier columns set a scale that crushes the other five hundred; per-channel INT4 confines the damage and recovers most of the fidelity at the same compression. This is the empirical core of the outlier story: the bits are cheap, the scale assignment is everything.
Quantization error is bounded by half the step size, $s/2$, and $s$ is the largest magnitude in a group divided by the integer range. Reducing the bit count raises the error by shrinking the integer range (the denominator); sharing one scale across more weights raises it by enlarging the maximum (the numerator). The first is a fixed cost you pay for compression; the second is a design choice you control. Per-channel and per-group scales, and outlier-aware methods like SmoothQuant, are all attacks on the same numerator: keep each scale tight by never letting one outlier set the range for many ordinary weights.
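The SmoothQuant idea of moving difficulty from activations into weights can be sketched in a few lines. The block below is a minimal illustration, assuming a per-input-channel smoothing factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha = 0.5$; the synthetic data, the shapes, and the choice of $\alpha$ are assumptions for the demo, and a real implementation estimates the activation ranges from calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 64, 128, 128
X = rng.standard_normal((tokens, d_in))      # activations: tokens x in
X[:, :4] *= 30.0                             # four synthetic outlier channels
W = rng.standard_normal((d_in, d_out))       # weights: in x out, so Y = X @ W

alpha = 0.5                                  # migration strength (assumed value)
act_max = np.max(np.abs(X), axis=0)          # activation range per input channel
w_max = np.max(np.abs(W), axis=1)            # weight range per input channel
s = act_max**alpha / w_max**(1 - alpha)      # per-channel smoothing factor

X_s = X / s                                  # activations shrink where they were large
W_s = W * s[:, None]                         # weights absorb the same factor

assert np.allclose(X @ W, X_s @ W_s)         # the layer output is unchanged
print("activation absmax before:", np.max(np.abs(X)))
print("activation absmax after :", np.max(np.abs(X_s)))
```

Because the division and the multiplication cancel inside the matrix product, the layer computes the same output, while the shared activation range that a per-tensor INT8 scale must cover becomes much smaller.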
3. PTQ, QAT, and the LLM-Specific Methods (Advanced)
There are two ways to obtain a quantized model. Post-training quantization (PTQ) takes a finished model and quantizes it directly, using a small calibration set, a few hundred unlabeled examples, to observe the typical range of activations and set the scales. PTQ is fast, needs no labels and no gradient, and is the default for large models because retraining them is expensive. Quantization-aware training (QAT) instead simulates the rounding during training or fine-tuning, inserting fake-quantize operations into the forward pass so the optimizer learns weights that are robust to the eventual precision loss. QAT recovers more accuracy, especially at very low bit widths, but it costs a training run, so it is reserved for cases where PTQ's accuracy is not good enough and the model is small enough to retrain.
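The calibration step of PTQ can be sketched as an observer that records the activation range and then fixes a scale. The class `AbsMaxObserver`, the function `fake_quant`, and the random "calibration" data below are hypothetical stand-ins; production toolkits offer more robust range estimators (percentiles, moving averages), and a QAT setup would apply the same quantize-then-dequantize operation inside the training forward pass with a gradient workaround that NumPy cannot show.

```python
import numpy as np

class AbsMaxObserver:
    """Tracks the largest activation magnitude seen during calibration."""
    def __init__(self):
        self.amax = 0.0

    def observe(self, x):
        self.amax = max(self.amax, float(np.max(np.abs(x))))

    def scale(self, bits=8):
        qmax = 2 ** (bits - 1) - 1
        return self.amax / qmax

def fake_quant(x, scale, bits=8):
    """Quantize then dequantize with a fixed, pre-calibrated scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
calib = rng.standard_normal((256, 64))       # 256 unlabeled examples (synthetic)

observer = AbsMaxObserver()
for start in range(0, len(calib), 32):       # feed the calibration set in batches
    observer.observe(calib[start:start + 32])

s = observer.scale(bits=8)
x_new = rng.standard_normal((16, 64))        # activations seen after deployment
x_q = fake_quant(x_new, s)
print("calibrated scale  :", s)
print("mean abs error    :", np.mean(np.abs(x_new - x_q)))
```

Values that fall inside the calibrated range are reconstructed to within half a step; values outside it are clipped, which is why the calibration set should resemble real traffic.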
For large language models specifically, weight-only PTQ has converged on a few high-quality methods. GPTQ quantizes the weights one column at a time and, after rounding each column, adjusts the not-yet-quantized columns to compensate for the error just introduced, an error-correcting sweep based on second-order (Hessian) information that keeps the layer's output close to the original. AWQ (activation-aware weight quantization) starts from the observation that a small fraction of weight channels, the ones multiplied by large activations, matter far more than the rest; it scales those salient channels up before quantizing so they survive the rounding, protecting accuracy without storing them in higher precision. Both are weight-only and post-training, and both target the same regime: 4-bit weights with 16-bit activations, the standard for memory-bound LLM serving.
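The activation-aware idea behind AWQ can be illustrated with a toy layer. The sketch below is a simplification under stated assumptions: it picks the salient input channels by mean activation magnitude and scales them by a fixed factor of 2, whereas the published method searches for the scaling; the helper `quant_rows_int4`, the shapes, and the synthetic data are all illustrative.

```python
import numpy as np

def quant_rows_int4(W):
    """Symmetric INT4 quantize-dequantize with one scale per output row."""
    qmax = 7
    scale = np.max(np.abs(W), axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d_out, d_in, tokens = 256, 256, 128
W = rng.standard_normal((d_out, d_in))        # weights: out x in, y = x @ W.T
X = rng.standard_normal((tokens, d_in))
X[:, :4] *= 30.0                              # four channels carry large activations

Y_ref = X @ W.T

# Baseline: round the weights directly.
err_plain = np.linalg.norm(X @ quant_rows_int4(W).T - Y_ref) / np.linalg.norm(Y_ref)

# Activation-aware: find the channels with the largest activations ...
salient = np.argsort(np.mean(np.abs(X), axis=0))[-4:]
s = np.ones(d_in)
s[salient] = 2.0                              # fixed factor (assumption; AWQ searches for it)

# ... scale those weight columns up before rounding, and fold the inverse
# scale into the activations so the exact product is unchanged.
Wq_scaled = quant_rows_int4(W * s)
err_awq = np.linalg.norm((X / s) @ Wq_scaled.T - Y_ref) / np.linalg.norm(Y_ref)

print(f"relative output error, plain INT4     : {err_plain:.4f}")
print(f"relative output error, scaled salient : {err_awq:.4f}")
```

Scaling a salient column up makes its rounding error smaller relative to its value, and that column is exactly where the large activations amplify any error, so the layer's output error drops even though every weight is still stored in 4 bits.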
Why weight-only INT4 specifically? Because in autoregressive generation the weights are read from memory every single token while the activations of a single small batch are tiny by comparison. The weights dominate both the storage and the traffic, so quantizing them to 4 bits captures almost the entire benefit, and leaving activations at 16 bits sidesteps the worst of the activation-outlier problem. The accuracy-versus-compression trade-off is therefore favorable in exactly this corner: with a well-tuned method, 4-bit weight-only quantization typically costs a fraction of a perplexity point while quartering the model's footprint, whereas pushing weights to 3 or 2 bits, or quantizing activations too, starts to cost real quality and needs the heavier machinery of QAT or careful per-group outlier handling.
Who: An inference platform engineer at a software company deploying an open-weights 70-billion-parameter chat model.
Situation: In FP16 the model's weights alone need about 140 GB, so serving it meant a node of four 40 GB GPUs just to hold one copy, before any room for the KV cache or batching headroom.
Problem: Each replica tied up four expensive accelerators, and the request volume needed several replicas, so the fleet ballooned and the per-token cost was uncompetitive.
Dilemma: Buy more GPUs to add replicas, scaling the fleet linearly with traffic, or quantize so each replica fits on fewer accelerators, risking a quality regression that users would notice.
Decision: They applied 4-bit weight-only PTQ with an activation-aware method and a calibration set drawn from real traffic, validating quality on their own evaluation suite before rollout.
How: The quantized weights dropped to roughly 35 GB, which fits on a single 40 GB GPU with about 5 GB left for the KV cache, so one replica went from four GPUs to one.
Result: The fleet shrank by close to four times for the same throughput, perplexity moved by less than a tenth of a point, and the latency improved because generation became less memory-bound.
Lesson: A per-node memory cut is a per-replica cut, and a per-replica cut is a fleet cut. Quantization is where fleet sizing usually starts, and the saving compounds with every replica (Chapter 23).
It is genuinely strange that a model trained with the full expressive range of 16-bit floats barely notices when three quarters of those bits are taken away at inference time. The intuition is that training pushes the network toward solutions that are flat and redundant, so no single weight needs to be specified to high precision. The model spent its training budget learning to not care about the low bits, which is exactly the property that makes it cheap to serve.
4. Why This Matters for Serving: From One Node to a Fleet (Beginner)
Everything above is a single-node, scale-up technique, and that is precisely why it belongs at the front of a book about scale-out. The fleet you deploy is a multiple of one node's behavior, so a saving on the node is a saving on the fleet, multiplied. A 4-bit model needs a quarter of the weight memory, so it fits on a quarter (or a half, accounting for the KV cache and batching room) of the accelerators per replica; with each replica cheaper, the same request volume is served by a smaller fleet at lower cost. The roofline argument adds a second win: because generation was memory-bound, the lighter traffic also raises per-node throughput, so each replica serves more, shrinking the replica count further.
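The "quarter or a half" arithmetic can be checked with a short calculation. The sketch below counts only weight memory against a per-GPU budget and ignores parallelism constraints; the function `gpus_per_replica`, the reserve fractions, and the replica count are assumptions chosen for illustration, with the 70B model and 40 GB GPUs taken from the case study above.

```python
import math

def gpus_per_replica(n_params, bits_per_weight, gpu_gb, reserve_frac):
    """GPUs needed to hold one copy of the weights while keeping reserve_frac
    of each GPU free for the KV cache and batching headroom."""
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    usable_gb = gpu_gb * (1.0 - reserve_frac)
    return math.ceil(weight_gb / usable_gb)

N_PARAMS = 70e9     # 70B-parameter model, as in the case study
GPU_GB = 40         # 40 GB accelerators
REPLICAS = 6        # hypothetical replica count

for reserve in (0.10, 0.25):                 # assumed reserve fractions
    for bits in (16, 8, 4):
        g = gpus_per_replica(N_PARAMS, bits, GPU_GB, reserve)
        print(f"reserve={reserve:.0%}  {bits:>2}-bit: {g} GPU(s)/replica, "
              f"{g * REPLICAS} GPUs for {REPLICAS} replicas")
```

With a 10% reserve the count per replica goes 4, 2, 1 across 16, 8, and 4 bits, the full fourfold saving; with a 25% reserve it goes 5, 3, 2, which is why the saving in practice sits between a quarter and a half.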
This is the bridge from per-node economics to fleet sizing that the rest of Part V builds on. Chapter 23 turns these per-node numbers into a replica count and a load-balancing strategy, and Chapter 24 carries the quantized model into a distributed LLM serving stack where the saving compounds with paged KV-cache management and continuous batching. The discipline is always the same one this chapter opened with: measure the unit, then multiply. Quantization is the largest single adjustment you can make to that unit before any distribution begins.
Code 22.2.1 implemented symmetric integer quantization by hand to expose the scale arithmetic. In production you never roll your own; mature libraries pack the integers, fuse the dequantization into the matmul kernel, and ship calibrated low-bit kernels. Loading a 4-bit model is one flag:
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

cfg = BitsAndBytesConfig(
    load_in_4bit=True,                  # 4-bit weight-only: ~4x less weight VRAM than FP16
    bnb_4bit_quant_type="nf4",          # NF4 data type
    bnb_4bit_compute_dtype="bfloat16",  # matmuls run in bf16 after dequantization
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=cfg,
    device_map="auto",                  # place layers on available devices (uses accelerate)
)
The frontier is moving from integer formats to low-bit floating-point ones and toward quantizing activations as well as weights. Hardware support for 8-bit floats (FP8, in E4M3 and E5M2 layouts) on recent accelerators has made FP8 inference, and increasingly FP8 training, practical, because a low-bit float keeps a wider dynamic range than an integer of the same width and so tolerates outliers better. Work from 2024 to 2026 pushes further: NVIDIA's Blackwell generation added hardware FP4, and methods such as QuaRot and SpinQuant rotate the weight and activation spaces with orthogonal transforms (randomized Hadamard matrices in QuaRot, learned rotations in SpinQuant) to flatten outliers before quantizing, making aggressive 4-bit weight-and-activation schemes viable. On the weight-only side, AWQ and GPTQ remain the workhorses and have been folded into mainstream serving stacks (vLLM, TensorRT-LLM), with newer variants tightening the per-group calibration. We return to how these per-node formats interact with a distributed serving stack in Chapter 24; for now, note that the field treats the bit width of a served model as a tunable knob, not a fixed property of the checkpoint.
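The outlier-flattening effect of a rotation is easy to see on synthetic data. The sketch below uses a random orthogonal matrix obtained from a QR decomposition as a stand-in for the structured or learned rotations of the published methods; the shapes, the injected outliers, and the helper `peak_to_rms` are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d = 128, 256
X = rng.standard_normal((tokens, d))
X[:, :4] *= 30.0                              # synthetic outlier channels
W = rng.standard_normal((d, d))               # Y = X @ W

# Random orthogonal matrix via QR (a stand-in, not the published construction).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

X_rot = X @ Q                                 # rotate the activations
W_rot = Q.T @ W                               # counter-rotate the weights

def peak_to_rms(A):
    """Ratio of the largest magnitude to the root-mean-square value."""
    return np.max(np.abs(A)) / np.sqrt(np.mean(A ** 2))

assert np.allclose(X @ W, X_rot @ W_rot)      # Q @ Q.T = I, so the output is unchanged
print("peak/rms before rotation:", peak_to_rms(X))
print("peak/rms after rotation :", peak_to_rms(X_rot))
```

The rotation preserves the total energy of the activations but spreads the outlier channels across all dimensions, so the peak that a shared scale must cover shrinks relative to the typical value.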
Using the quantization map in Section 1, derive the worst-case per-weight reconstruction error for symmetric $b$-bit quantization of a group whose largest magnitude is $M$. Express it in terms of $M$ and $b$, and use the result to explain, in one or two sentences each, (a) why moving from 8 to 4 bits raises the error and (b) why putting one outlier of magnitude $20M$ into a per-tensor group raises the error for every other weight in that group. Connect your answer to the memory-bound roofline argument from Chapter 3: why is paying a little arithmetic to rescale integers a good trade during generation?
Code 22.2.1 uses symmetric quantization ($z = 0$), which wastes range when the values are not centered on zero (for example a post-ReLU activation, which is non-negative). Extend the quant_dequant function to an asymmetric scheme that computes both a scale $s = (w_{\max} - w_{\min}) / (q_{\max} - q_{\min})$ and a zero-point $z$, mapping the true minimum to $q_{\min}$. Quantize a non-negative matrix (take np.abs(W)) at INT8 with both the symmetric and the asymmetric schemes and report the relative error of each. Explain why the asymmetric scheme wins on non-negative data and roughly how much range the symmetric scheme threw away.
A 70B-parameter model has FP16 weights of about 140 GB. Each serving node has four 40 GB GPUs (160 GB usable), and you must reserve 30% of memory per node for the KV cache and batching headroom. Compute how many nodes one replica needs in FP16, in INT8, and in INT4. If incoming traffic requires twelve replicas, give the total GPU count for each precision and the percentage fleet reduction from FP16 to INT4. State one assumption under which the INT4 saving would be smaller than this arithmetic suggests (hint: think about what does not get quantized and grows with batch size and sequence length). This is the per-node-to-fleet multiplication that Chapter 23 formalizes.