"I left the phone as a thin slab of numbers, half the size of the photo that birthed me, and nobody on the cloud side could tell what I once was a picture of."
An Intermediate Feature Map, Crossing the Wireless Link
Split computing partitions the layers of a single neural network across a weak device and a strong cloud separated by a slow wireless link, so the device runs the early layers, transmits one intermediate activation rather than the raw input, and the cloud finishes the rest. It is the wide-area twin of the model parallelism you met inside a datacenter rack: the same computation graph is cut at a layer boundary and the two halves run on different hardware, except now the "interconnect" between the halves is a cellular uplink that is several orders of magnitude slower than NVLink. That single fact reshapes every decision. Where you cut is no longer governed by balancing compute across identical GPUs but by finding a layer whose activation is small enough to ship cheaply, yet deep enough that the device has already discarded the raw, privacy-laden input. This section derives the latency model that picks that cut point, then layers on a second collaboration pattern, the confidence-gated cascade, in which easy inputs are answered on the device and only the hard ones are escalated to the cloud at all.
The previous sections of this chapter treated the device as a self-contained inference engine: quantize the model, shrink the KV cache, and run everything locally because the network is unreliable or the data is private. That picture is correct when the device is capable enough to host the whole model, but it collapses for the models that matter most. A modern vision backbone or a billion-parameter language model does not fit, or runs at a fraction of a frame per second, on a phone-class accelerator. The opposite extreme, streaming the raw camera frame or microphone buffer to a datacenter and reading back the answer, leaks the most sensitive data the user owns and pays a bandwidth bill on every inference. Edge-cloud collaboration is the middle path: keep part of the computation on the device, send only what the cloud genuinely needs, and split the model along a seam chosen to minimize total latency. The technical heart of that middle path is split computing, and it is, at its core, model parallelism stretched across a wide-area network.
1. One Model, Two Nodes, a Slow Wire Between Them (Intermediate)
Recall the structure of pipeline parallelism from Chapter 16. A deep network is a sequence of layers; you cut the sequence at one or more boundaries, place each contiguous block of layers on a different device, and pass the activation tensor at the cut from one device to the next. Inside a training cluster the cut is chosen to balance compute across identical accelerators, and the activation crosses a fast intra-node link whose cost is almost an afterthought. Split computing performs the identical surgery, but across the edge-cloud boundary, and inverts which cost dominates. The device runs the head of the network, layers $1$ through $s$; it transmits the activation produced by layer $s$ across the wireless link; the cloud runs the tail, layers $s+1$ through the output. The two nodes are wildly heterogeneous (a milliwatt NPU on one side, a datacenter GPU on the other) and the wire between them is a cellular or Wi-Fi uplink whose bandwidth is measured in tens of megabytes per second, not the hundreds of gigabytes per second of an in-rack interconnect.
This reframing is the whole point of treating split inference as a distributed-AI method rather than a mobile-systems trick: it partitions one model's computation across heterogeneous nodes separated by a slow link, which is precisely the definition of model parallelism, only with the network promoted from a footnote to the central constraint. Every choice that the in-datacenter version could ignore now binds. The seam must fall where the activation is small. The seam must fall deep enough that what crosses the wire is an abstract feature map, not a reconstructable image. And because the device computes far slower than the cloud, pushing the seam later buys privacy and bandwidth savings but costs device time. Figure 34.4.1 shows the surgery and names the three quantities whose sum we are about to minimize.
The activation at a deep enough layer is both smaller and more private than the raw input. Smaller, because pooling and strided convolutions compress spatial resolution as depth increases, so the feature map at layer $s$ often holds a fraction of the bytes of the original image or audio buffer. More private, because that feature map is a lossy, task-specific encoding: reconstructing the original input from a mid-network activation is difficult and degrades quickly with depth. Split computing turns these two facts into a single design move. Instead of paying to ship the raw input and exposing it to the cloud, the device ships an intermediate representation that costs less bandwidth and reveals less of the user. Choosing the cut is therefore choosing simultaneously how many bytes you pay for and how much of the original signal you are willing to let leave the device.
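To make the "smaller with depth" claim concrete, the sketch below tracks the byte count of a float32 activation through a hypothetical four-stage strided backbone. The stage names, channel widths, and the 224x224 RGB input are illustrative assumptions, not a profile of any real model. Wide early layers in real networks can emit more bytes than the input, which is one reason to measure rather than assume.

```python
# Illustrative only: every stage below is a hypothetical stride-2 convolution.
# Each stage halves height and width and doubles the channel count.

def activation_bytes(height, width, channels, bytes_per_elem=4):
    """Bytes in one float32 activation tensor of shape (channels, height, width)."""
    return height * width * channels * bytes_per_elem

height, width, channels = 224, 224, 3            # assumed RGB input, stored as float32
stages = [("conv1", 8, 2), ("conv2", 16, 2),      # (name, out_channels, stride)
          ("conv3", 32, 2), ("conv4", 64, 2)]

profile = [("input", activation_bytes(height, width, channels))]
for name, out_channels, stride in stages:
    height, width = height // stride, width // stride
    profile.append((name, activation_bytes(height, width, out_channels)))

raw_bytes = profile[0][1]
for name, size in profile:
    print(f"{name:6s} {size:8d} bytes  ({size / raw_bytes:.2f} of raw input)")
```

In this toy profile each stage quarters the spatial area while doubling the channels, so the payload halves at every stage; a cut after `conv4` would ship roughly a twelfth of the raw input.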
2. Where to Cut: The Latency Model (Intermediate)
Make the trade-off precise. Index the layers $1, \dots, n$. Let $c_k$ be the compute cost of layer $k$ in FLOPs and let $a_k$ be the size in bytes of the activation that layer $k$ emits; write $a_0$ for the raw input size. The device runs at $r_{\text{dev}}$ FLOP/s, the cloud at $r_{\text{cld}}$ FLOP/s with typically $r_{\text{cld}} \gg r_{\text{dev}}$, and the uplink moves bytes at $B$ bytes per second. Cutting after layer $s$ (with $s = 0$ meaning "send the raw input and let the cloud do everything", and $s = n$ meaning "run the whole model on the device and send only the tiny final output $a_n$") gives an end-to-end latency
$$T(s) \;=\; \underbrace{\frac{1}{r_{\text{dev}}}\sum_{k=1}^{s} c_k}_{\text{device compute}} \;+\; \underbrace{\frac{a_s}{B}}_{\text{activation transfer}} \;+\; \underbrace{\frac{1}{r_{\text{cld}}}\sum_{k=s+1}^{n} c_k}_{\text{cloud compute}}.$$
The optimal split is simply the cut that minimizes this sum,
$$s^\star \;=\; \arg\min_{0 \le s \le n} \; T(s),$$
and because there are only $n+1$ candidate cuts, you find it by evaluating $T$ at each one. The shape of $T(s)$ explains why the answer is usually interior rather than at either extreme. As $s$ grows, the device-compute term rises monotonically (the device shoulders more layers, each at its slow rate), while the cloud-compute term falls. The transfer term $a_s/B$ is the interesting one: it follows the activation-size profile, which for convolutional backbones drops steeply over the first few layers and then flattens. So the raw-input cut $s=0$ pays a huge transfer cost ($a_0$ is the largest payload of all), an early interior cut sheds most of that transfer cost for only a little device compute, and a very late cut pays almost no transfer but loads the slow device with nearly the entire network. The minimum sits wherever the falling transfer-plus-cloud cost stops outrunning the rising device cost. This is the same bottleneck reasoning as choosing a pipeline stage boundary in Chapter 16, with the activation-transfer term, negligible in-rack, now often the largest of the three.
Split computing is not a new primitive; it is the model parallelism of Chapter 16 with its interconnect swapped for a wide-area link. The book's spine, that scale-out is the partitioning of one intelligence across many machines, holds here exactly: one network, two heterogeneous nodes, one tensor crossing the seam. What changes is the cost model. In-datacenter pipeline parallelism balances compute because the link is effectively free; over a WAN the link dominates, so the cut migrates toward wherever the activation is smallest. Carry this comparison forward: every time you meet a parallelism strategy in this book, ask what the interconnect costs relative to the compute, because that single ratio decides where the model should be cut.
The demo below turns the formula into an executable search. It profiles a small backbone with a per-layer FLOP vector and a per-layer activation-size vector, fixes a device speed, a cloud speed, and a link bandwidth, and evaluates $T(s)$ at every candidate cut to report $s^\star$ and the latency it achieves against the two naive baselines.
import numpy as np

# A small convolutional-style backbone described by two profiled vectors:
#   flops[k]  : compute (in MFLOP) to produce layer k's output, running on EITHER node
#   act_mb[k] : size (in megabytes) of layer k's output activation, the cut payload
# Layer 0's "input" payload is the raw image; act_mb[0] is therefore the raw input size.
layer = ["input", "conv1", "conv2", "conv3", "pool", "fc1", "fc2", "softmax"]
flops = np.array([0.0, 420.0, 880.0, 950.0, 60.0, 320.0, 90.0, 2.0])    # MFLOP per layer
act_mb = np.array([6.00, 3.10, 1.60, 0.80, 0.40, 0.05, 0.02, 0.004])    # MB emitted by layer

# Heterogeneous nodes: the device is far slower than the cloud accelerator.
DEV_GFLOPS = 28.0    # device throughput, GFLOP/s (a phone-class NPU)
CLD_GFLOPS = 900.0   # cloud throughput, GFLOP/s (a datacenter accelerator)
LINK_MBPS = 50.0     # uplink bandwidth, megabytes/s (a healthy 5G/Wi-Fi link)

# Cut after layer s: device runs layers 1..s, transfers act_mb[s], cloud runs s+1..end.
# s = 0 means "send the raw input, cloud does everything" (pure offload).
# s = last means "device does everything" (pure on-device; only the tiny final output is sent).
n = len(layer) - 1   # number of computational layers (indices 1..n)
results = []
for s in range(0, n + 1):
    dev_ms = flops[1:s + 1].sum() / DEV_GFLOPS    # MFLOP / (GFLOP/s) = ms
    cld_ms = flops[s + 1:].sum() / CLD_GFLOPS
    xfer_ms = act_mb[s] / LINK_MBPS * 1000.0      # MB / (MB/s) -> s -> ms
    results.append(dev_ms + xfer_ms + cld_ms)

best = int(np.argmin(results))
print("optimal split after layer:", layer[best] if best > 0 else "(raw input)")
print(f"best end-to-end latency : {results[best]:.2f} ms")
print(f"pure on-device total : {results[n]:.2f} ms")
print(f"pure cloud offload : {results[0]:.2f} ms")
print(f"speedup of best vs on-device: {results[n] / results[best]:.2f}x")
_demo_split.py alongside this section; the loop here is the load-bearing core that selects $s^\star$.
cut after | dev ms | xfer MB | xfer ms | cloud ms | total ms
----------------------------------------------------------------
(raw in) | 0.00 | 6.000 | 120.00 | 3.02 | 123.02
conv1 | 15.00 | 3.100 | 62.00 | 2.56 | 79.56
conv2 | 46.43 | 1.600 | 32.00 | 1.58 | 80.01
conv3 | 80.36 | 0.800 | 16.00 | 0.52 | 96.88
pool | 82.50 | 0.400 | 8.00 | 0.46 | 90.96
fc1 | 93.93 | 0.050 | 1.00 | 0.10 | 95.03
fc2 | 97.14 | 0.020 | 0.40 | 0.00 | 97.55
softmax | 97.21 | 0.004 | 0.08 | 0.00 | 97.29
----------------------------------------------------------------
optimal split after layer: conv1
best end-to-end latency : 79.56 ms
pure on-device total : 97.29 ms
pure cloud offload : 123.02 ms
speedup of best vs on-device: 1.22x
Cutting after conv1 ships a 3.1 MB activation, so the device pays only 15 ms of compute and the transfer cost roughly halves, for a 79.56 ms total that beats both extremes. Note the total is non-monotone in the cut position: it dips, rises, dips again, which is exactly why the search over all cuts, not a heuristic, is what finds $s^\star$.
The result repays study. Neither naive strategy is optimal, and the curve is not even monotone, so the optimal cut cannot be guessed from "earlier is always better" or "later is always better". It depends jointly on the activation-size profile, the device-to-cloud speed ratio, and the link bandwidth $B$. Halve $B$ and the transfer term doubles, pushing $s^\star$ later (the device should do more so it ships a smaller activation). Make the device faster and $s^\star$ moves later still. Make the cloud faster or the link faster and $s^\star$ moves earlier, toward pure offload. Because $B$ on a mobile link varies minute to minute, production systems re-solve this tiny optimization online, which is why the search is written to be cheap.
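One way the online re-solve might look is a small vectorized function that is called again whenever a fresh bandwidth estimate arrives. This is a sketch under assumptions: `best_cut` and its argument names are hypothetical, and the profile is the same illustrative one used in Code 34.4.1, not a measured model.

```python
import numpy as np

def best_cut(flops_mflop, act_mb, dev_gflops, cld_gflops, link_mbps):
    """Evaluate T(s) for every cut s = 0..n and return (s_star, latencies_ms).

    flops_mflop[0] must be 0 (the raw input costs no compute), so the
    cumulative sum at index s is exactly the work done by layers 1..s.
    """
    flops_mflop = np.asarray(flops_mflop, dtype=float)
    act_mb = np.asarray(act_mb, dtype=float)
    on_device = np.cumsum(flops_mflop)                     # MFLOP run on the device per cut
    dev_ms = on_device / dev_gflops                        # MFLOP / (GFLOP/s) = ms
    cld_ms = (on_device[-1] - on_device) / cld_gflops      # remaining layers on the cloud
    xfer_ms = act_mb / link_mbps * 1000.0                  # MB / (MB/s) -> s -> ms
    latencies = dev_ms + xfer_ms + cld_ms
    return int(np.argmin(latencies)), latencies

layer = ["input", "conv1", "conv2", "conv3", "pool", "fc1", "fc2", "softmax"]
flops = [0.0, 420.0, 880.0, 950.0, 60.0, 320.0, 90.0, 2.0]
act_mb = [6.00, 3.10, 1.60, 0.80, 0.40, 0.05, 0.02, 0.004]

# Re-solve for two bandwidth estimates: the demo's 50 MB/s and a halved 25 MB/s.
for mbps in (50.0, 25.0):
    s_star, t = best_cut(flops, act_mb, 28.0, 900.0, mbps)
    print(f"B = {mbps:5.1f} MB/s -> cut after {layer[s_star]:7s} ({t[s_star]:.2f} ms)")
```

With this profile the 50 MB/s call reproduces the conv1 cut from the demo, and halving the bandwidth moves the cut to fc1, consistent with the claim that a slower link pushes $s^\star$ later.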
The latency model needs one number per layer, the activation byte count, and you should never hand-count it. A forward hook reads it straight off the live tensors, so the $a_s$ vector that drives Code 34.4.1 comes from the real model in a few lines:
import torch

# `model` (a torch.nn.Module) and `example_input` are assumed to be defined by the caller.
sizes = {}

def hook(name):
    return lambda m, inp, out: sizes.__setitem__(
        name, out.element_size() * out.nelement())  # bytes of this layer's output

for name, module in model.named_children():
    module.register_forward_hook(hook(name))  # one hook per top-level layer

model(example_input)  # one forward pass populates `sizes`
# sizes[name] now holds a_s in bytes for every candidate cut point.
Combined with a FLOP counter such as torch.profiler or fvcore, the hook gives both vectors Code 34.4.1 consumes, so the cut-point optimizer runs on measured numbers rather than guesses. Roughly a dozen lines replace a manual audit of the model graph.
Who: A computer-vision engineer building a fall-detection feature for a wrist wearable that pairs with a phone hub.
Situation: The first prototype streamed the camera feed to a cloud classifier; latency spiked whenever the cellular uplink congested, and a privacy review flagged the raw video leaving the home.
Problem: The full detector ran at two frames per second on the wearable's tiny accelerator, too slow to be safe, yet shipping raw frames was both the latency bottleneck and the privacy violation.
Dilemma: Run everything on-device and miss falls, or keep offloading raw video and fail the privacy review and the latency budget together.
Decision: Split the detector. They profiled activation sizes with the hook in Code 34.4.2, ran the cut-point search of Code 34.4.1 against the measured uplink bandwidth, and placed the seam after the third convolutional block, where the activation had shrunk to roughly a tenth of the frame and was no longer visually reconstructable.
How: The wearable ran the head and transmitted the feature map to the phone hub, which ran the tail; when the uplink degraded, the device re-solved $T(s)$ and pushed the seam two layers deeper to cut the payload further.
Result: End-to-end latency dropped under the safety budget and held steady across link conditions, and the privacy review passed because raw frames never left the wearable.
Lesson: The right cut is the one the latency model picks against the actual link, and that same cut often resolves the privacy constraint for free, because depth buys both smaller payloads and less reconstructable representations.
3. Cascades and Confidence-Based Offloading (Intermediate)
Split computing partitions every inference across the two nodes. A second, complementary pattern partitions the inputs instead: handle the easy ones entirely on the device and escalate only the hard ones to the cloud. This is the model cascade, and its on-device building block is the early-exit network. An early-exit model attaches lightweight classifier heads after intermediate layers; during inference, each head produces a tentative prediction together with a confidence, and the moment a head is confident enough, the network exits and returns that prediction without running the remaining layers. The efficiency idea, doing less work on easy inputs, is the per-node thread from Chapter 22; the systems machinery for routing requests to different model tiers is the distributed-inference thread from Chapter 23. The edge-cloud cascade fuses the two: the early-exit heads live on the device, and the "remaining layers" that a low-confidence input would trigger live in the cloud.
Concretely, the device runs a small, fast model (or the head of a large one with an early-exit classifier). For each input it produces a prediction and a confidence score $p$, typically the maximum softmax probability or the negative entropy of the softmax distribution. If $p$ clears a threshold $\tau$, the device answers locally and the input never touches the network. If $p$ falls short, the device escalates: it sends the input, or, combining the two patterns, the intermediate activation it has already computed, to a larger, more accurate cloud model that produces the final answer. Easy inputs (a clearly-lit frontal face, an unambiguous spoken command) are dispatched in milliseconds on the device; hard inputs (occlusion, noise, an out-of-distribution sample) pay the network round trip but get the accuracy of the big model. Figure 34.4.2 draws the branch.
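The gate itself is only a few lines. The sketch below uses the maximum softmax probability as the confidence $p$ and assumes the device model exposes raw logits; the function names `softmax` and `route`, the example logits, and the threshold of 0.8 are all hypothetical choices for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def route(logits, tau):
    """Return (decision, predicted_label, confidence) for one input.

    decision is "local" when the max softmax probability clears tau,
    otherwise "escalate" (the caller would then contact the cloud model).
    """
    probs = softmax(np.asarray(logits, dtype=float))
    confidence = float(probs.max())
    label = int(probs.argmax())
    decision = "local" if confidence >= tau else "escalate"
    return decision, label, confidence

TAU = 0.8                                   # assumed threshold, tuned per deployment
easy = route([4.0, 0.5, 0.2], TAU)          # one class dominates
hard = route([1.0, 0.9, 0.8], TAU)          # near-uniform, ambiguous input
print("easy input:", easy)
print("hard input:", hard)
```

The first input is answered on the device because one class dominates; the second is near-uniform, so it falls below the threshold and would be sent on.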
4. The Expected Cost of a Cascade (Advanced)
A cascade's economics are governed by one quantity: the escalation rate. Let $q(\tau)$ be the probability that an input fails the confidence test and is sent to the cloud, that is, $q(\tau) = \Pr[p < \tau]$ over the input distribution. Raising $\tau$ makes the device more cautious, so $q$ rises toward $1$; lowering $\tau$ keeps more inputs local, so $q$ falls toward $0$. If the device path costs $T_{\text{dev}}$ and the escalated path costs $T_{\text{dev}} + T_{\text{esc}}$ (the device still ran its small model before deciding to escalate, then paid the transfer plus the cloud compute $T_{\text{esc}}$), the expected per-input latency is
$$\mathbb{E}[T] \;=\; T_{\text{dev}} \;+\; q(\tau)\, T_{\text{esc}},$$
and the analogous expression with monetary or energy costs in place of latency governs the bandwidth and cloud bill. The same threshold also sets accuracy: every escalated input gets the big model, so accuracy increases with $q$, while cost increases with $q$ as well. Choosing $\tau$ is therefore choosing a point on an accuracy-versus-cost frontier. The well-calibrated regime is the one that makes cascades pay: if the device model's confidence is trustworthy, the inputs it keeps are exactly the ones it gets right, so a small $q$ buys accuracy close to the cloud model's at a fraction of the average cost. Poor calibration breaks the deal, because the device confidently keeps inputs it is wrong about, and no threshold recovers the lost accuracy. This is why the practical work in cascades is as much about calibrating $p$ as about picking $\tau$.
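In practice $q(\tau)$ is estimated empirically from confidences logged on held-out inputs. The sketch below does that and plugs the estimate into the expected-latency formula; the ten confidence values and the 5 ms and 80 ms path costs are made-up illustrative numbers, and the helper names are hypothetical.

```python
import numpy as np

def escalation_rate(confidences, tau):
    """Empirical q(tau): fraction of inputs whose confidence falls below tau."""
    return float(np.mean(np.asarray(confidences, dtype=float) < tau))

def expected_latency_ms(q, t_dev_ms, t_esc_ms):
    """E[T] = T_dev + q * T_esc: every input pays T_dev, a fraction q also pays T_esc."""
    return t_dev_ms + q * t_esc_ms

# Hypothetical confidences logged by the device model on ten held-out inputs.
conf = [0.99, 0.97, 0.95, 0.91, 0.88, 0.80, 0.72, 0.65, 0.55, 0.40]
T_DEV_MS, T_ESC_MS = 5.0, 80.0              # assumed path costs

for tau in (0.5, 0.7, 0.9):
    q = escalation_rate(conf, tau)
    print(f"tau = {tau:.1f}  q = {q:.1f}  E[T] = {expected_latency_ms(q, T_DEV_MS, T_ESC_MS):.1f} ms")
```

Raising the threshold from 0.5 to 0.9 moves the escalation rate from one input in ten to six in ten on this sample, and the expected latency rises in step with it.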
Split inference partitions the computation of every input across device and cloud; the cascade partitions the inputs, keeping easy ones local and escalating hard ones. They are orthogonal and compose. A combined system runs the device head on every input, exits locally through an early-exit head when confident, and on the uncertain fraction forwards the already-computed activation (not the raw input) to the cloud tail. The device-head compute is then amortized across both paths, the escalated payload is the small activation rather than the raw input, and only the hard inputs pay the WAN round trip at all. The two patterns answer different questions, "where do I cut the model?" and "which inputs deserve the big model?", and the strongest edge-cloud designs answer both at once.
A hierarchical generalization stacks more than two tiers: a tiny model on the device, a medium model on a nearby fog node (the fog tier of this chapter's earlier sections), and a large model in the cloud, with each tier escalating its uncertain inputs to the next. A speculative variant inverts the dependency: the device runs its fast model immediately and acts on the answer, while simultaneously sending the input to the cloud, then reconciles or corrects if the cloud disagrees, hiding the WAN latency behind the device's optimistic response. These hierarchical and speculative patterns are the multi-tier descendants of the two-node cascade, and they reuse exactly the expected-cost arithmetic above, now summed over tiers.
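One plausible reading of "summed over tiers" is sketched below: each tier adds its cost weighted by the probability that an input reaches it. The function name, the assumption that each escalation rate is conditional on the input having reached that tier, and the example costs are illustrative choices rather than anything fixed by the text.

```python
def cascade_expected_latency(tier_costs_ms, escalation_rates):
    """Expected latency of a multi-tier cascade.

    tier_costs_ms[k]    : extra latency paid by any input that reaches tier k
                          (for k > 0 this includes the transfer to that tier).
    escalation_rates[k] : probability that tier k escalates an input, given
                          the input reached tier k. The last tier never
                          escalates, so this list is one element shorter.
    """
    if len(escalation_rates) != len(tier_costs_ms) - 1:
        raise ValueError("need exactly one escalation rate per non-final tier")
    expected = 0.0
    reach_prob = 1.0                      # every input reaches the first tier
    for k, cost in enumerate(tier_costs_ms):
        expected += reach_prob * cost
        if k < len(escalation_rates):
            reach_prob *= escalation_rates[k]
    return expected

# Two tiers reduce to T_dev + q * T_esc.
two_tier = cascade_expected_latency([5.0, 80.0], [0.3])
# Hypothetical device -> fog -> cloud chain.
three_tier = cascade_expected_latency([5.0, 20.0, 80.0], [0.3, 0.5])
print(f"two-tier   E[T] = {two_tier:.1f} ms")
print(f"three-tier E[T] = {three_tier:.1f} ms")
```

With a fog tier absorbing half of the escalated inputs, only 15% of traffic reaches the cloud in this example, so the expected latency falls below the two-tier figure even though a third hop has been added.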
The fixed activation-size profile that Code 34.4.1 optimizes over is itself a target for redesign. Bottleneck-injection methods train a slim autoencoder into the split point so the transmitted tensor is deliberately compressed and made unreconstructable, shrinking the $a_s$ term below what any natural layer offers; recent work pushes this toward learned, task-aware feature compression that jointly optimizes the cut, the bottleneck, and a rate-distortion objective. For language models, split and cascade ideas meet speculative decoding: a small on-device draft model proposes tokens that a cloud model verifies in parallel, the token-level cousin of the input-level cascade in Section 3, and a vigorous 2024 to 2026 line studies edge-cloud collaborative decoding under intermittent links. The unifying frontier question is to learn, end to end, both what crosses the wire and which inputs cross it, rather than choosing a natural layer and a hand-set threshold. We return to the serving-side machinery these methods plug into in Chapter 23.
An attacker who taps the wireless link in a split-inference system intercepts a mid-network feature map: a few thousand floating-point numbers that once were a face but are now a smear of learned filter responses. With the network's tail weights they might recover a coarse silhouette; without them they have noise that happens to classify well. The most private byte is the one you transformed before you sent it, and the eavesdropper, expecting a photo, gets a tensor that refuses to be one.
Using the latency model $T(s)$ from Section 2 and the activation-size profile in Code 34.4.1, reason qualitatively about how $s^\star$ moves in each of these independent changes, and explain the mechanism in one sentence each: (a) the uplink bandwidth $B$ drops by a factor of ten as the user enters a basement; (b) the device gets a new NPU that triples $r_{\text{dev}}$; (c) the cloud tier is moved to a nearby fog node, cutting the link latency but also lowering $r_{\text{cld}}$ to a mid-range GPU. For each, state whether the cut moves earlier (toward pure offload) or later (toward on-device) and why.
Extend Code 34.4.1 so that the optimal cut is recomputed for a sweep of link bandwidths $B \in \{5, 10, 20, 50, 100\}$ MB/s. Print, for each $B$, the optimal cut layer and its end-to-end latency, and confirm that $s^\star$ moves later as $B$ shrinks. Then add a fixed per-transfer link latency $\ell$ (a constant added to the transfer term whenever $0 < s < n$) and show how a large $\ell$ biases the optimizer toward the two extreme cuts that avoid the wire entirely or send the smallest possible payload.
A device model answers correctly on 90% of inputs and emits a confidence $p$. Suppose the escalation rate is $q(\tau) = \tau$ (a uniform calibration), the device path costs $T_{\text{dev}} = 8$ ms, and the escalated path adds $T_{\text{esc}} = 60$ ms. The cloud model is correct on 99% of the inputs it sees. Write the expected latency $\mathbb{E}[T] = T_{\text{dev}} + q(\tau) T_{\text{esc}}$ and an expression for overall accuracy as a function of $\tau$ under the assumption that escalated inputs are exactly the ones the device would otherwise get wrong. Find the smallest $\tau$ that reaches 97% overall accuracy and report the expected latency there. Then explain, in two or three sentences, how the answer degrades if the device is miscalibrated so that the inputs it escalates are no better than random with respect to its own errors.