Section 34.3: On-Device Inference | Building Scalable AI

"They quantized me to four bits and dropped me into 4 GB of phone RAM. I lost a little precision and gained a billion homes."
A Quantized Model, Living in 4 GB of Phone RAM

Big Picture

On-device inference is the act of running the entire model on the endpoint that produced the data, with no datacenter in the loop, and it is what makes intelligence distributable to a billion phones, cameras, and sensors at zero per-query server cost. The device has no rack of accelerators, no terabytes of memory, and no wall socket of unlimited power; it has a few gigabytes of RAM shared with the rest of the operating system, a mobile system-on-chip, and a battery that warms in your hand. Fitting a model into that envelope is not a new theory of inference; it is the per-node efficiency toolkit of Chapter 22, quantization and pruning and distillation, deployed at population scale across endpoints that no operator controls. This section sets out the four hard constraints, then the two-stage toolchain (compress the model, then execute it through a mobile runtime that targets the on-chip neural accelerator) that turns a server-class network into something that lives in your pocket.

Every form of distribution in this book so far has moved work between machines an operator owns: shards across a cluster, replicas across a serving fleet, agents across a network. On-device inference moves the model to the one machine the operator does not own, the user's own hardware, and then runs the whole computation there. The motivation is the same triad that pushed work off a single machine in Section 1.1, read in reverse: when the endpoints number in the billions, sending every query to a datacenter is the ceiling. Latency crosses a wireless link twice, privacy-sensitive data leaves the device, the service pays for every inference, and a dropped connection means no answer at all. Running the model locally removes all four problems at once, in exchange for living inside an envelope far smaller than any node in Chapter 24 ever sees.

Figure 34.3.1: On-device inference runs the whole model inside one system-on-chip. The application hands the model to a mobile runtime, which compiles each operator to the most efficient available unit: the CPU as a universal fallback, the GPU for parallel float work, and the neural processing unit (NPU) or digital signal processor (DSP) for the low-power integer matrix multiplies that quantized models are built around. Every unit draws on the same memory, battery, and thermal budget named in the bottom bar, which is the binding constraint this section is about.

1. Four Constraints That Have No Datacenter Remedy (Beginner)

A datacenter node answers resource pressure by adding more of the resource: more memory, more accelerators, more power and cooling. The device cannot. Its constraints are fixed at manufacture and shared with everything else the user is doing, so the model must be shaped to fit them rather than the other way around. Four of them bind, and they bind together.

Memory. A phone may report 8 GB of RAM, but the operating system, the foreground app, and the graphics surface claim most of it; an inference engine that demands more than a gigabyte or two is liable to be killed by the memory manager. The model's weights must fit in that slice with room left for activations and the key-value cache.

Compute. A mobile system-on-chip delivers a fraction of a datacenter accelerator's throughput, and it has no high-bandwidth memory feeding it, so both arithmetic and memory bandwidth are scarce.

Energy and thermal. Every inference drains a battery and produces heat with nowhere to go; sustained heavy compute trips a thermal limit and the chip throttles its own clock, so a model that is fast for one query can be slow for the hundredth.

No datacenter accelerators. There is no NVLink, no fast interconnect, no second device to offload to; the model runs on what the chip already has, which is a CPU, perhaps a mobile GPU, and increasingly an NPU specialized for integer matrix multiplies.

Key Insight: The Constraints Are Fixed, So the Model Must Move to Meet Them

On a cluster you grow the hardware to fit the model. On a device you shrink the model to fit the hardware, because the memory, compute, power, and silicon are set at manufacture and shared with the rest of the system. This inverts the entire posture of the book: instead of asking how to spread one model across more machines, on-device inference asks how to compress one model until it lives entirely within one machine you do not control. The remedy is never "add a node"; it is always "make the model smaller and the runtime leaner".

2. Stage One: Compress the Model (Borrowing Chapter 22) (Intermediate)

The first stage of the toolchain is to make the model small enough to fit in memory and cheap enough to run within the energy budget. These are exactly the per-node efficiency techniques developed in Chapter 22 as the baseline that distribution multiplies. Here we are not re-deriving them; we are deploying that same toolkit at the extreme end of the envelope, where the node is a phone and the multiplication factor is a billion endpoints. Three families do the work.

Quantization stores each weight in fewer bits. A parameter held as a 32-bit float can be represented, after calibration, as an 8-bit or even 4-bit integer plus a shared scale, cutting the memory footprint by four or eight times and letting the integer units of an NPU do the arithmetic. Pruning removes weights or whole structures (channels, attention heads) that contribute little, shrinking both storage and compute. Distillation trains a small student model to imitate a large teacher, so that a network small enough for the device inherits much of the behavior of one that never could be. The three compose: a model is commonly distilled to a compact architecture, pruned of dead structure, and then quantized for storage and integer execution.
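Quantization gets its own listing later in this section; the other two families can be sketched just as briefly. The block below is a minimal illustration, not a production recipe: it prunes whole output rows by L2 norm (one common criterion among several) and computes a temperature-softened distillation loss. The matrix sizes, the `keep_fraction` of 0.75, and the temperature of 2.0 are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Structured pruning: drop the output rows with the smallest L2 norm ---
W = rng.standard_normal((8, 16)).astype(np.float32)
W[[1, 5]] *= 0.01                       # make two rows nearly dead on purpose
keep_fraction = 0.75                    # hypothetical target: keep 6 of 8 rows

row_norms = np.linalg.norm(W, axis=1)
n_keep = int(round(keep_fraction * W.shape[0]))
keep_idx = np.sort(np.argsort(row_norms)[-n_keep:])   # strongest rows, in original order
W_pruned = W[keep_idx]                  # smaller matrix: less storage, less compute

# --- Distillation: the student matches the teacher's softened outputs ---
def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

teacher = rng.standard_normal((4, 10))
student_far = rng.standard_normal((4, 10))                      # unrelated student
student_near = teacher + 0.05 * rng.standard_normal((4, 10))    # close imitation

print("rows kept    :", keep_idx.tolist())
print("pruned shape :", W_pruned.shape)
print("loss (far)   :", round(distill_loss(student_far, teacher), 4))
print("loss (near)  :", round(distill_loss(student_near, teacher), 4))
```

The two deliberately weakened rows are the ones removed, and the loss falls as the student's logits approach the teacher's, which is the signal a distillation run minimizes.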

The size arithmetic is direct. A model with $P$ parameters stored at $b$ bits per parameter occupies

$$\text{bytes} = P \cdot \frac{b}{8},$$

so moving from $b = 32$ to $b = 4$ is an eightfold reduction in the weight footprint, the single largest lever available. The energy story is parallel: to a first approximation the energy per inference scales with the number of arithmetic operations and the bytes moved from memory,

$$E_{\text{infer}} \approx N_{\text{ops}} \cdot e_{\text{op}} + N_{\text{bytes}} \cdot e_{\text{mem}},$$

where $e_{\text{op}}$ and $e_{\text{mem}}$ are the per-operation and per-byte energy costs of the chip. Integer arithmetic has a smaller $e_{\text{op}}$ than floating point, and fewer bits per weight shrinks $N_{\text{bytes}}$, so quantization attacks both terms at once. That is why it is the first reflex of on-device deployment and the lever the demo below measures directly.
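The energy identity can be turned into a quick back-of-envelope calculator. The sketch below is illustrative only: the per-operation and per-byte costs are placeholder values chosen to make the arithmetic visible, not measurements of any chip, and it assumes a hypothetical 100-million-parameter dense model in which each weight is read from memory once and used in one multiply-add per inference.

```python
def infer_energy(n_ops, n_bytes, e_op, e_mem):
    """E_infer ~= N_ops * e_op + N_bytes * e_mem, in joules."""
    return n_ops * e_op + n_bytes * e_mem

P = 100_000_000                 # hypothetical parameter count
N_OPS = 2 * P                   # assumption: one multiply and one add per weight

# Placeholder per-unit costs in joules. These are NOT measured values.
E_OP_FP32 = 4e-12               # assumed cost of one fp32 operation
E_OP_INT8 = 0.5e-12             # assumed cost of one int8 operation
E_MEM = 100e-12                 # assumed cost of moving one byte from memory

fp32 = infer_energy(N_OPS, P * 32 / 8, E_OP_FP32, E_MEM)
int8 = infer_energy(N_OPS, P * 8 / 8, E_OP_INT8, E_MEM)

print(f"fp32 energy per inference : {fp32 * 1e3:.2f} mJ")
print(f"int8 energy per inference : {int8 * 1e3:.2f} mJ")
print(f"ratio fp32 / int8         : {fp32 / int8:.2f}x")
```

Under these assumed costs the memory term dominates both totals, so most of the saving comes from moving a quarter of the bytes; with different placeholder values the split between the two terms would shift, but both terms still shrink.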

Code 34.3.1 quantizes one layer's weight matrix from 32-bit float to 8-bit integers using a single per-tensor scale, measures how much accuracy that costs, and then prints the memory footprint of the same parameter count at four common widths.

import numpy as np

rng = np.random.default_rng(0)
# A small dense layer's weight matrix: 256 x 256 fp32 parameters.
W = rng.standard_normal((256, 256)).astype(np.float32)
P = W.size

# --- Symmetric per-tensor int8 quantization (one scale, zero-point fixed at 0) ---
qmax = 127
scale = np.max(np.abs(W)) / qmax           # one scale for the whole tensor
W_q = np.round(W / scale).astype(np.int8)  # store int8 codes
W_hat = W_q.astype(np.float32) * scale     # dequantize for inference math

mae = np.mean(np.abs(W - W_hat))                                  # reconstruction error
rel = np.linalg.norm(W - W_hat) / np.linalg.norm(W)

# --- Memory footprint of the SAME parameter count at four widths ---
def mb(bits):
    return P * bits / 8 / 1e6              # bytes = params * bits / 8

print(f"parameters            : {P:,}")
print(f"int8 scale            : {scale:.5f}")
print(f"reconstruction MAE    : {mae:.5f}")
print(f"relative L2 error     : {rel:.4%}")
print()
print(f"fp32 footprint        : {mb(32):.3f} MB")
print(f"fp16 footprint        : {mb(16):.3f} MB")
print(f"int8 footprint        : {mb(8):.3f} MB  ({mb(32)/mb(8):.0f}x vs fp32)")
print(f"int4 footprint        : {mb(4):.3f} MB  ({mb(32)/mb(4):.0f}x vs fp32)")
print()
# Scale the same widths up to a 7-billion-parameter on-device LLM.
B = 7_000_000_000
for bits, name in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"7B model in {name:<4}     : {B * bits / 8 / 1e9:6.2f} GB")

Code 34.3.1: Symmetric int8 quantization of one weight tensor, with the reconstruction error and the four-width footprint computed from the identity $\text{bytes} = P \cdot b/8$. The same store-as-integer, multiply-by-scale pattern is what a mobile toolchain's converter applies to every layer before the model ships to the device.

parameters            : 65,536
int8 scale            : 0.03726
reconstruction MAE    : 0.00934
relative L2 error     : 1.0793%

fp32 footprint        : 0.262 MB
fp16 footprint        : 0.131 MB
int8 footprint        : 0.066 MB  (4x vs fp32)
int4 footprint        : 0.033 MB  (8x vs fp32)

7B model in fp16     :  14.00 GB
7B model in int8     :   7.00 GB
7B model in int4     :   3.50 GB

Output 34.3.1: Int8 quantization reproduces the weights to about one percent relative error (1.08 percent here) while cutting the footprint fourfold, and int4 cuts it eightfold. The bottom block is the constraint that decides deployment: a 7-billion-parameter model needs 14 GB in fp16 and will not fit a phone, but at int4 it is 3.5 GB and lives, as the epigraph promises, inside a handful of gigabytes of RAM.

The bottom three lines of Output 34.3.1 are the whole argument for on-device quantization in miniature. A 7-billion-parameter model, a size that is now common for capable assistants, simply does not fit a phone at full or half precision, but the eightfold reduction of 4-bit quantization relative to fp32 brings its weights down to 3.5 GB, within reach of a phone with enough free memory. Note that the one percent relative weight error in Output 34.3.1 was measured at int8; a 4-bit code has 16 levels instead of 256, so its error is larger and must be checked separately. The same calibration error that Chapter 22 analyzes as a per-node accuracy trade is, on the device, the price of admission.
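Output 34.3.1 reports reconstruction error only for int8. As a hedged sketch of what changes at 4 bits, the block below reuses the same symmetric scheme on the same random matrix and compares a single per-tensor scale against group-wise scales, a refinement that 4-bit weight formats commonly use. The group size of 32 and the fp16 storage of each scale are assumptions made for the example, not values taken from any particular runtime.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

def quantize_symmetric(w, bits, axis=None):
    """Symmetric quantize-then-dequantize with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for int8, 7 for int4
    scale = np.max(np.abs(w), axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)     # integer codes
    return q * scale                                  # dequantized weights

def rel_err(w, w_hat):
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))

group = 32                                            # assumed group size
W_grouped = quantize_symmetric(W.reshape(-1, group), 4, axis=1).reshape(W.shape)

err_int8 = rel_err(W, quantize_symmetric(W, 8))
err_int4_tensor = rel_err(W, quantize_symmetric(W, 4))
err_int4_group = rel_err(W, W_grouped)

# Each group of 32 four-bit codes also carries one scale (assumed stored as fp16).
bits_per_weight = 4 + 16 / group

print(f"int8 per-tensor relative error : {err_int8:.2%}")
print(f"int4 per-tensor relative error : {err_int4_tensor:.2%}")
print(f"int4 group-wise relative error : {err_int4_group:.2%}")
print(f"effective bits per weight      : {bits_per_weight}")
print(f"7B model at that width         : {7e9 * bits_per_weight / 8 / 1e9:.2f} GB")
```

Group-wise scales recover part of the accuracy lost to the coarser grid at the price of a slightly larger footprint, which is why the effective width of a practical 4-bit model is a little above four bits per weight.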

Thesis Thread: Per-Node Efficiency, Multiplied by a Billion Endpoints

The thesis of this book is that intelligence is engineered by distributing it across many machines. On-device inference is that thesis taken to its limit: the "many machines" are not a cluster an operator racks and cools, they are a billion phones and sensors the operator never touches, each running the whole model locally. The lever that makes it possible is the per-node efficiency of Chapter 22, the same quantization and pruning that shrink a datacenter node's cost, now deciding whether a model fits a pocket at all. Where Chapter 24 multiplies per-node KV-cache economics across a serving fleet the operator pays for, this section multiplies per-node compression across an endpoint population that costs the operator nothing per query. Same primitive, opposite end of the scale.

3. Stage Two: Execute Through a Mobile Runtime and an NPU (Intermediate)

A compressed model is a file; running it on the device is the job of a mobile or edge runtime, a lightweight inference engine that loads the model, plans its execution, and dispatches each operator to the best available silicon. The runtime is what Figure 34.3.1 places between the application and the chip, and several have become standard. TensorFlow Lite (now LiteRT) targets Android and microcontrollers and pairs with quantization-aware training. ONNX Runtime Mobile runs the portable ONNX graph format with a small binary footprint. Core ML is Apple's runtime, scheduling models across the CPU, GPU, and Apple Neural Engine. ExecuTorch is PyTorch's on-device path, exporting a model to run without the full framework. GGML and llama.cpp brought quantized large language models to commodity CPUs and phones, and are the reason a 4-bit assistant runs on a laptop with no GPU at all.

The runtime's central trick is the delegate or execution provider: a plug-in that hands compatible subgraphs to a hardware accelerator. The most consequential accelerator on a modern device is the NPU (neural processing unit), a block of silicon built for the low-precision integer matrix multiplies that quantized inference is made of. An NPU performs an int8 matmul at a fraction of the energy a CPU spends on the float equivalent, which is why stage one (quantize) and stage two (dispatch to the NPU) are designed together: the model is quantized precisely so the NPU can run it. The DSP plays a similar role on chips without a dedicated NPU. When the runtime meets an operator no accelerator supports, it falls back to the CPU, so correctness never depends on the accelerator being present.
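The delegate-with-fallback idea can be made concrete with a toy partitioner. This is a simplified sketch, not how any specific runtime is implemented: the operator names and the `NPU_SUPPORTED` table are hypothetical, and the graph is treated as a linear chain, whereas real runtimes partition a general operator graph.

```python
from itertools import groupby

# Hypothetical support table; a real delegate publishes its own operator list.
NPU_SUPPORTED = {"conv2d", "matmul", "relu", "add"}

def assign_units(ops, accelerator_ops=NPU_SUPPORTED, accelerator_present=True):
    """Group consecutive ops by target unit: [(unit, [op, ...]), ...]."""
    def unit(op):
        # An op goes to the accelerator only if one exists and supports it.
        return "npu" if accelerator_present and op in accelerator_ops else "cpu"
    return [(u, list(run)) for u, run in groupby(ops, key=unit)]

graph = ["conv2d", "relu", "conv2d", "custom_nms", "matmul", "add", "softmax"]

for u, run in assign_units(graph):
    print(f"{u}: {run}")

# The same graph on a device with no accelerator still runs, entirely on the CPU.
print(assign_units(graph, accelerator_present=False))
```

Each switch between units in the plan is a point where data crosses from one execution unit to another, so a single unsupported operator in the middle of a graph can split one accelerator subgraph into two.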

Library Shortcut: Convert and Run a Quantized Model in a Handful of Lines

Code 34.3.1 quantized one tensor by hand to expose the mechanics. In production the converter quantizes the whole graph and the runtime dispatches it to the NPU for you. Converting a model to TFLite with full int8 quantization, or loading an ONNX model through ONNX Runtime with an accelerator provider, takes only a few lines in each toolchain:

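# TensorFlow Lite: post-training int8 quantization of a whole model.
import numpy as np
import tensorflow as tf

def calib_gen():
    # Representative inputs for calibration. Placeholder random data with an
    # assumed 1x224x224x3 input; use real samples and your model's shape.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

conv = tf.lite.TFLiteConverter.from_saved_model("model/")
conv.optimizations = [tf.lite.Optimize.DEFAULT]          # enable quantization
conv.representative_dataset = calib_gen                  # calibrates the scales
conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
with open("model_int8.tflite", "wb") as f:
    f.write(conv.convert())                              # about 4x smaller than fp32

# ONNX Runtime Mobile: load and run, picking the NPU/GPU delegate if present.
import onnxruntime as ort
sess = ort.InferenceSession("model.onnx",
        providers=["NnapiExecutionProvider", "CPUExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder input, assumed shape
out = sess.run(None, {sess.get_inputs()[0].name: x})    # accelerator first, CPU fallback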

Code 34.3.2: The two-stage toolchain as production calls. The converter handles calibration, scale selection, and operator fusion that a hand-rolled quantizer would take dozens of lines to reproduce; the runtime handles NPU dispatch and CPU fallback behind a single provider list. For a quantized large language model the equivalent one-liner is llama-cli -m model-q4.gguf -p "...", which loads a 4-bit GGUF and runs it on the CPU.

Practical Example: The Keyboard That Stopped Calling Home

Who: A mobile engineer on a smartphone keyboard team shipping next-word prediction to 200 million phones.

Situation: The predictor ran as a cloud call on every keystroke, which cost money per request, added a round-trip of latency the user felt as lag, and sent typed text off the device.

Problem: A model good enough to predict well was 400 MB in fp32, far too large for the memory the operating system would grant a keyboard, and too slow to run on the CPU within the per-keystroke energy budget.

Dilemma: Keep the cloud call, paying per query forever and leaking keystrokes, or move the model on-device, which meant fitting 400 MB and a tight latency budget into a phone with no datacenter accelerator.

Decision: They moved it on-device, because at 200 million users the per-query server cost and the privacy exposure both scaled with the population, while an on-device model costs nothing per keystroke and never transmits text.

How: They distilled the predictor to a compact architecture, applied quantization-aware int8 quantization, and shipped it as a TFLite model that the runtime dispatched to the phone's NPU, with a CPU fallback for older devices.

Result: The model dropped to roughly 100 MB and ran in single-digit milliseconds per keystroke entirely on-device, with no network call, no per-query cost, and no text leaving the phone, at a small and acceptable loss in top-1 prediction accuracy.

Lesson: When the endpoint population is large, the per-query cost and the privacy exposure are the binding ceilings, and on-device inference removes both at once, in exchange for the compression work of stage one.

4. When the Device Is Enough, and When It Is Not (Advanced)

On-device inference is the right answer when the compressed model fits the envelope of Section 1 and the latency, cost, and privacy gains justify the accuracy given up to compression. That covers a great deal: wake-word detection, keyboard prediction, on-device translation, photo segmentation, and increasingly small assistant models. It contrasts sharply with the server-side path of Chapter 24, where a frontier model is sharded across many accelerators behind a continuous-batching scheduler, the operator pays for every token, and the memory budget is hundreds of gigabytes rather than a handful. The device gives up that ceiling of capability to gain locality, privacy, and zero marginal cost.

The envelope still binds, though. A model too large to quantize into the device's memory, a task that needs a frontier-scale network, or a workload whose sustained compute would overrun the thermal budget cannot live entirely on the device. The next section takes up exactly that case: when the device alone is not enough, the computation is split, with the early layers running on-device and the rest offloaded to a nearby fog node or the cloud, a hybrid that keeps the locality of on-device inference where it can and reaches for more hardware only where it must. That is the subject of Section 34.4.

Research Frontier: Sub-4-Bit and On-Device LLMs (2024 to 2026)

Because the memory line in Output 34.3.1 decides whether a model fits at all, the active frontier is pushing precision below four bits while holding accuracy. Weight-only schemes such as GPTQ and AWQ quantize large language models to 4-bit and 3-bit with small quality loss, and 2-bit and ternary lines including the BitNet b1.58 family (Ma et al., 2024) train models whose weights are natively low-bit rather than quantized after the fact, collapsing matrix multiplies into additions an NPU executes cheaply. In parallel, Apple, Google, and Microsoft ship small on-device foundation models (the Phi, Gemma, and Gemini Nano lines, on-device Apple Intelligence) engineered around exactly the runtime-plus-NPU stack of Figure 34.3.1. The open question is how far per-weight precision can fall before the calibration error of Code 34.3.1 stops being a one-percent admission price and starts changing what the model can do; the field is mapping that boundary now.
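The claim that ternary weights turn matrix multiplies into additions can be checked in a few lines. The sketch below applies an absmean rounding rule of the kind described for BitNet b1.58 to a random matrix after the fact, purely to show the arithmetic; the BitNet models themselves are trained under this constraint rather than converted this way, and the matrix sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32)).astype(np.float32)   # stand-in weight matrix
x = rng.standard_normal(64).astype(np.float32)         # one input vector

# Absmean ternarization: divide by mean |w|, round, clip to {-1, 0, +1}.
scale = np.mean(np.abs(W))
W_t = np.clip(np.round(W / scale), -1, 1).astype(np.int8)

# Reference result: an ordinary matmul against the ternary codes.
y_matmul = x @ W_t.astype(np.float32)

# Multiply-free result: per output column, add inputs where the weight is +1
# and subtract inputs where it is -1; zeros are skipped entirely.
y_addsub = np.array(
    [x[W_t[:, j] == 1].sum() - x[W_t[:, j] == -1].sum() for j in range(W_t.shape[1])],
    dtype=np.float32,
)

print("distinct weight values :", np.unique(W_t).tolist())
print("results agree          :", bool(np.allclose(y_matmul, y_addsub, atol=1e-4)))
```

The single floating-point `scale` is applied once to the output rather than once per weight, so the inner loop needs only additions and subtractions.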

Fun Note: The Model Runs Your Battery, Not Just Your CPU

On a server, a model's cost shows up on an electricity bill the operator never personally feels. On a phone, the same model warms the device in the user's hand and the battery icon ticks down while they watch. It is the one deployment in this book where the end user pays the energy term of $E_{\text{infer}}$ directly, in heat and battery percent, which is why a runtime that wastes a few millijoules per inference gets noticed in a way no datacenter ever notices a stray watt.

Exercise 34.3.1: Does It Fit? (Conceptual)

A device grants an inference engine 2 GB of RAM, of which roughly 30 percent must be reserved for activations and the key-value cache. Using $\text{bytes} = P \cdot b/8$, decide whether each of the following fits the remaining weight budget, and if not, what compression would make it fit: (a) a 1.5-billion-parameter model in fp16; (b) the same model in int4; (c) a 7-billion-parameter model in int8; (d) a 7-billion-parameter model in int4. State for each which of the four constraints from Section 1 you used.

Exercise 34.3.2: Per-Channel Beats Per-Tensor (Coding)

Code 34.3.1 uses one scale for the whole weight tensor. Modify it to use a separate scale per output channel (one scale per row of W), then recompute the reconstruction MAE and relative L2 error against the per-tensor version. Construct a weight matrix where one row has a much larger dynamic range than the others, and show that per-channel quantization reduces the error substantially while per-tensor does not. Explain in two sentences why mobile runtimes default to per-channel quantization for weights.

Exercise 34.3.3: On-Device Versus a Round-Trip (Analysis)

An assistant feature can run either on-device at 120 ms per query drawing 0.5 J, or as a server call with a 40 ms network round-trip plus 25 ms of server compute, at a server cost of \$0.0002 per query. For a feature invoked 50 times per day by 100 million users, compute the daily server cost of the cloud path and the daily energy drawn from each user's battery by the on-device path. Argue from these two numbers which path you would ship, and identify the single change to the workload (latency target, model size, or query volume) that would flip your decision. Relate your answer to the split-computing tradeoff of Section 34.4.