"I am a camera in a mesh of ten thousand, reporting only what changed. Alone I see a corner; together we see the room, and no one of us ever holds the whole picture."
A Sensor Node Pruning Its Own Telemetry
Distributed sensing turns many partial, noisy, local observations into one shared estimate of the world that no single sensor could produce, and it must do so under a hard bandwidth budget that forbids shipping every raw byte to the center. A camera mesh, a roadside-to-vehicle network, or a field of environmental motes each sees only a fragment: one viewing angle, one location, one moment. The intelligence lives not in any one node but in the fusion of their reports. This section is about that fusion: how each node decides locally what is worth transmitting, how partial observations combine into a single estimate whose uncertainty is smaller than any contributor's, and how nodes reach agreement on that estimate without a central oracle. Per-sensor perception models are the easy part; the fleet-level world model is the distributed-systems problem.
The chapters before this one treated the edge as a place to run a model that was trained elsewhere. Distributed sensing inverts the emphasis. Here the edge is not a deployment target but a measurement instrument with thousands of apertures, and the central question is how to assemble a coherent view from observations that are scattered across space, arriving at different times, corrupted by different noise, and far too voluminous to centralize in full. A single traffic camera infers vehicle positions in its own frame; a city's worth of cameras, fused, infers a live traffic state that drives signal timing. A single vehicle's radar sees the car ahead; vehicles exchanging messages see around the blind corner. The pattern is always the same. Many sensors produce observations, and those observations must be combined into intelligence under constraints that make naive centralization impossible.
1. Many Apertures, One World Model Beginner
Call a deployment a sensor network when its nodes both observe and communicate: a camera mesh watching a stadium, motes measuring soil moisture across a farm, microphones localizing a sound, or vehicles broadcasting their kinematics over vehicle-to-everything (V2X) links. Each node carries a perception model, often the same on-device network discussed earlier in this chapter, that turns a raw signal into a compact observation: a bounding box, a detection, an embedding, an estimate of some physical quantity with an attached uncertainty. The observations are partial by construction. A camera sees only its frustum; a vehicle sees only what is not occluded; a microphone hears a time-of-arrival, not a position. No node has the data to answer the question the fleet is actually asked.
Multi-view perception is the canonical case. Several cameras with overlapping fields of view each detect the same pedestrian from a different angle; fusing their detections recovers a 3D track that no single 2D view determines. The same logic drives V2X: a vehicle that cannot see a crossing pedestrian receives a detection from a roadside unit or another car that can, and acts on a world model assembled from messages rather than from its own sensors alone. We treat these systems as a stream-processing problem (the observations are an unbounded, time-stamped flow, the subject of Chapter 9) layered on a graph (the nodes and their communication links form exactly the kind of network analyzed in Chapter 13). The fusion that sits on top is what this section adds.
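The multi-view claim can be illustrated with a minimal sketch, reduced to a 2D analogue for brevity: each camera reports only a bearing angle toward the target, which no single camera can turn into a position, yet stacking the bearing constraints gives a small least-squares problem whose solution is the position. The function name `triangulate_bearings`, the camera layout, and the noise-free bearings are assumptions made for illustration, not part of the original text.

```python
import numpy as np

def triangulate_bearings(cam_positions, bearings):
    """Least-squares intersection of 2D bearing lines.

    A camera at position c reports only an angle theta toward the target.
    Any point p on that line of sight satisfies n . (p - c) = 0, where n is
    the unit normal to the viewing direction. One row per camera gives a
    small linear system in the unknown position p.
    """
    normals = np.array([[-np.sin(t), np.cos(t)] for t in bearings])
    offsets = np.einsum("ij,ij->i", normals, np.asarray(cam_positions, float))
    position, *_ = np.linalg.lstsq(normals, offsets, rcond=None)
    return position

# Hypothetical layout: three cameras around one target.
cams = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
target = np.array([4.0, 3.0])
bearings = [np.arctan2(target[1] - c[1], target[0] - c[0]) for c in cams]

estimate = triangulate_bearings(cams, bearings)
print("recovered position:", estimate)
```

With noisy bearings the same system would be solved in a weighted least-squares sense, which is where the uncertainty attached to each observation starts to matter.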
Each sensor's perception model is a solved, single-node problem: detect, classify, estimate, attach an uncertainty. The hard, distinctly distributed part is everything above the node: deciding what to transmit when bandwidth is scarce, combining observations that overlap and disagree, and converging on one estimate without a central authority. A fleet of mediocre sensors with good fusion routinely beats a single excellent sensor, because fusion reduces uncertainty in a way no individual aperture can. Design effort spent on the combine step pays off faster than effort spent making any one node see better.
2. In-Network Aggregation: Decide Locally What to Transmit Intermediate
The defining constraint of distributed sensing is bandwidth. Ten thousand cameras cannot stream raw video to a datacenter; the uplink does not exist, and even if it did, its cost and latency would be prohibitive. The remedy is to push computation to the sensor and transmit only what fusion actually needs. This is in-network aggregation: each node, and each intermediate fog aggregator, decides locally what is worth sending. A node that sees nothing new sends nothing (the dashed link in Figure 34.5.1). A node that detects an event sends the detection, not the frame. An aggregator that receives ten overlapping detections of the same object fuses them into one and forwards the single fused result, so traffic shrinks as it climbs the hierarchy rather than growing.
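One way to sketch the "send nothing when nothing changed" rule is a send-on-delta policy: a node transmits only when its reading has moved by more than a threshold since its last transmission. The class name `SendOnDeltaNode`, the threshold, and the periodic keep-alive message are assumptions made for illustration; a real node would tune these against its own noise floor and link.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SendOnDeltaNode:
    threshold: float                  # smallest change worth reporting (assumed units)
    heartbeat_every: int              # silent ticks allowed before a keep-alive
    last_sent: Optional[float] = None
    silent_ticks: int = 0

    def step(self, value: float) -> Optional[Tuple[str, Optional[float]]]:
        """Return the message to transmit this tick, or None to stay silent."""
        changed = (self.last_sent is None
                   or abs(value - self.last_sent) >= self.threshold)
        if changed:
            self.last_sent = value
            self.silent_ticks = 0
            return ("update", value)
        self.silent_ticks += 1
        if self.silent_ticks >= self.heartbeat_every:
            self.silent_ticks = 0
            return ("heartbeat", None)  # proves the node is alive, carries no data
        return None

node = SendOnDeltaNode(threshold=0.5, heartbeat_every=3)
readings = [10.0, 10.1, 10.2, 10.1, 10.9, 11.0, 11.0, 11.1, 11.0]
messages = [node.step(r) for r in readings]
sent = [m for m in messages if m is not None]
print(f"{len(sent)} messages for {len(readings)} readings:", sent)
```

The keep-alive lets a receiver distinguish "nothing changed" from "the node died", at the cost of a few extra bytes.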
The central design choice is what abstraction to ship. At one extreme, send raw data: maximum information for the fusion center, maximum bandwidth. At the other, send final decisions: minimum bandwidth, but the center cannot re-fuse what it never saw, and early hard decisions discard the soft evidence that good fusion depends on. The productive middle is to ship detections or embeddings with attached uncertainty, compact enough to fit the budget yet soft enough to fuse correctly. A bounding box with a covariance, or a fixed-length feature vector, is typically two to four orders of magnitude smaller than the frame it summarizes, and it preserves exactly the quantities the next section needs.
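The size gap can be made concrete with a small sketch. The wire layouts below (a detection carrying a sensor id, a timestamp, a 2D position, and the three unique entries of a symmetric 2x2 covariance; a 128-dimensional float32 embedding) and the 200 kB figure for one compressed frame are all assumptions chosen for illustration, not formats taken from any particular system.

```python
import math
import struct

# Hypothetical layouts, little-endian with no padding.
DETECTION = struct.Struct("<Id5f")   # id, timestamp, x, y, cxx, cxy, cyy
EMBEDDING = struct.Struct("<128f")   # fixed-length feature vector

detection_msg = DETECTION.pack(17, 1700000000.25, 3.2, 8.7, 0.09, 0.01, 0.25)
embedding_msg = EMBEDDING.pack(*([0.0] * 128))

FRAME_BYTES = 200_000                # assumed size of one compressed video frame

for name, msg in [("detection", detection_msg), ("embedding", embedding_msg)]:
    ratio = FRAME_BYTES / len(msg)
    print(f"{name:9s} {len(msg):4d} bytes, "
          f"{math.log10(ratio):.1f} orders of magnitude smaller than the frame")
```

Under these assumptions both payloads land inside the two-to-four-orders range quoted above; against an uncompressed frame the gap would be larger still.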
Who: A perception engineer at a logistics company instrumenting a warehouse with 400 ceiling cameras.
Situation: The first design streamed H.264 from every camera to a central server that ran detection and tracking; the aggregate uplink saturated the site network and added seconds of latency.
Problem: Forklift-collision alerts needed sub-200-millisecond reaction, and centralized video could not meet it at 400 cameras.
Dilemma: Buy a far larger network and a bigger central server (scale up the pipe), or push detection onto each camera and centralize only the results (scale out the sensing).
Decision: They ran an on-device detector per camera and transmitted only detections, each a few dozen bytes with a position covariance, plus a heartbeat when a camera saw nothing.
How: Fog aggregators on each aisle fused overlapping detections of the same forklift before forwarding, so the central tracker received one fused track per object rather than per camera.
Result: Uplink traffic fell by more than three orders of magnitude, end-to-end alert latency dropped under the budget, and tracking accuracy improved because multi-view fusion resolved occlusions that any single camera missed.
Lesson: The right unit to transmit is rarely the raw signal and rarely the final decision; it is the soft, uncertainty-tagged observation that fusion can still combine.
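The aggregator step in this case study can be sketched as follows. The sketch assumes each detection is a 2D position with a single scalar variance, and it uses a greedy distance gate to decide which detections refer to the same object; both are simplifications introduced here, since real data association is usually more careful. The function name `fuse_overlapping` and the gate value are hypothetical.

```python
import numpy as np

def fuse_overlapping(detections, gate=1.0):
    """Merge detections that fall within `gate` of an existing cluster.

    detections: list of (position, variance) pairs from different cameras.
    Returns one (fused_position, fused_variance) pair per distinct object,
    combining cluster members by inverse-variance weighting.
    """
    clusters = []
    for pos, var in detections:
        pos = np.asarray(pos, dtype=float)
        for c in clusters:
            if np.linalg.norm(pos - c["mean"]) <= gate:
                c["info"] += 1.0 / var          # accumulated precision
                c["info_pos"] += pos / var      # precision-weighted position sum
                c["mean"] = c["info_pos"] / c["info"]
                break
        else:  # no cluster was close enough: start a new one
            clusters.append({"info": 1.0 / var,
                             "info_pos": pos / var,
                             "mean": pos.copy()})
    return [(c["mean"], 1.0 / c["info"]) for c in clusters]

# Five camera detections of what are really two forklifts (made-up numbers).
detections = [((2.0, 5.0), 0.04), ((2.2, 5.1), 0.16), ((1.9, 4.8), 0.16),
              ((9.0, 1.0), 0.09), ((9.3, 1.2), 0.09)]
tracks = fuse_overlapping(detections)
for mean, var in tracks:
    print("fused position", np.round(mean, 3), "variance", round(var, 4))
```

Five inbound messages become two outbound ones, and each fused variance is smaller than that of any detection that went into it.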
3. Distributed Data Fusion: Combining Partial Observations Intermediate
Once compact observations arrive, fusion combines them into one estimate. The principle is that a measurement should count in proportion to its reliability, and reliability is the inverse of variance. Suppose $K$ sensors each report a noisy estimate $x_k$ of the same scalar quantity, with the estimate from sensor $k$ modeled as the truth plus zero-mean noise of variance $\sigma_k^2$, independent across sensors. The minimum-variance unbiased combination is the inverse-variance weighted mean,
$$\hat{x} = \frac{\sum_{k=1}^{K} \sigma_k^{-2}\, x_k}{\sum_{k=1}^{K} \sigma_k^{-2}}, \qquad \sigma_{\hat{x}}^{2} = \left( \sum_{k=1}^{K} \sigma_k^{-2} \right)^{-1}.$$
Two facts make this the workhorse of distributed sensing. First, a precise sensor (small $\sigma_k^2$) dominates a noisy one automatically; no thresholding or hand-tuning is needed, the weights $\sigma_k^{-2}$ do it. Second, the fused variance $\sigma_{\hat{x}}^2$ is strictly smaller than every individual variance, because adding any positive precision $\sigma_k^{-2}$ to the denominator only shrinks the result. The fleet is genuinely more certain than its best member. The vector form replaces variances with covariance matrices and inverse-variances with precision (inverse-covariance) matrices; this is exactly the update a Kalman filter performs, fusing a prediction with a new measurement, and it is also the Bayesian posterior when the priors and likelihoods are Gaussian. We treat these as one idea at a high level: a covariance-weighted combination of partial observations, equation above, that any node can compute from messages alone.
The code below makes the variance-shrinking claim concrete. Four heterogeneous sensors, a precise lidar through a noisy radar, each report an estimate of one position. We fuse them by the equation above and check that the fused standard deviation is below the best single sensor's.
import numpy as np
rng = np.random.default_rng(7)
true_pos = 12.0 # the quantity every sensor estimates
# Four heterogeneous sensors: a precise lidar, two mid cameras, a noisy radar.
sigma = np.array([0.30, 0.90, 1.10, 2.50]) # per-sensor noise std dev
var = sigma ** 2 # per-sensor variance
est = true_pos + rng.standard_normal(len(sigma)) * sigma # each node's local estimate
# Covariance-weighted (inverse-variance) fusion: the precision-weighted mean.
w = (1.0 / var) / np.sum(1.0 / var) # weights sum to 1
fused = np.sum(w * est) # the shared estimate
fused_var = 1.0 / np.sum(1.0 / var) # fused variance
print("true value :", true_pos)
for i in range(len(sigma)):
    print(f" sensor {i} est={est[i]:6.3f} sigma={sigma[i]:.2f} weight={w[i]:.3f}")
print("fused estimate :", f"{fused:.3f}")
print("fused error :", f"{abs(fused - true_pos):.3f}")
print("best single sigma :", f"{sigma.min():.3f}")
print("fused sigma :", f"{np.sqrt(fused_var):.3f}")
print("fused var < best var :", fused_var < var.min())
The weights w are the normalized inverse variances, so the noisy radar contributes little; the fused variance fused_var is computed directly from the equation above.
true value : 12.0
sensor 0 est=12.000 sigma=0.30 weight=0.833
sensor 1 est=12.269 sigma=0.90 weight=0.093
sensor 2 est=11.698 sigma=1.10 weight=0.062
sensor 3 est= 9.774 sigma=2.50 weight=0.012
fused estimate : 11.980
fused error : 0.020
best single sigma : 0.300
fused sigma : 0.274
fused var < best var : True
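The vector form can be sketched by hand in a few lines, which also makes the Kalman connection checkable. The sketch below assumes two independent Gaussian estimates of the same 2D position, with made-up means and covariances; the function name `fuse_gaussians` is hypothetical. It fuses in precision (information) form and then confirms that a Kalman measurement update with an identity observation matrix gives the same answer.

```python
import numpy as np

def fuse_gaussians(means, covs):
    """Precision-weighted fusion of independent Gaussian estimates."""
    precisions = [np.linalg.inv(P) for P in covs]
    fused_cov = np.linalg.inv(sum(precisions))
    fused_mean = fused_cov @ sum(L @ m for L, m in zip(precisions, means))
    return fused_mean, fused_cov

# Sensor A is precise in x and noisy in y; sensor B is the reverse (assumed values).
m_a, P_a = np.array([12.0, 11.0]), np.diag([0.09, 6.25])
m_b, P_b = np.array([12.6, 10.2]), np.diag([4.00, 0.16])

mean, cov = fuse_gaussians([m_a, m_b], [P_a, P_b])

# Same result via a Kalman update: A is the prior, B the measurement, H = I.
gain = P_a @ np.linalg.inv(P_a + P_b)
kalman_mean = m_a + gain @ (m_b - m_a)
kalman_cov = (np.eye(2) - gain) @ P_a

print("fused mean     :", np.round(mean, 3))
print("fused variances:", np.round(np.diag(cov), 4))
print("matches Kalman :", np.allclose(mean, kalman_mean) and np.allclose(cov, kalman_cov))
```

Each axis of the fused estimate is tighter than the better of the two sensors on that axis, the matrix counterpart of the scalar result above. If the sensors' errors are correlated, this simple sum of precisions no longer applies.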
Code 34.5.1 fuses scalars by hand. For the vector case, fusing a state prediction with a new multi-sensor measurement, you would otherwise write the matrix precision update, the Kalman gain, and the covariance bookkeeping yourself. The filterpy library reduces the measurement-fusion step to a single call:
from filterpy.kalman import KalmanFilter # pip install filterpy
import numpy as np
kf = KalmanFilter(dim_x=2, dim_z=2) # state: 2D position
kf.x = np.array([0., 0.]) # prior estimate
kf.P = np.eye(2) * 5.0 # prior covariance (uncertain)
kf.H = np.eye(2) # sensor observes position directly
kf.R = np.diag([0.09, 6.25]) # sensor noise: precise in x, noisy in y
kf.predict()
kf.update(np.array([12.0, 11.0])) # fuse one sensor's measurement
print(kf.x) # posterior is precision-weighted automatically
Only two calls are needed, predict and update; the library handles the Kalman gain, the precision weighting, and the posterior covariance that fold the prior and the measurement together.
4. Consensus on a Shared Estimate Advanced
Fusion as written assumes one place that holds all $K$ observations. In a true mesh there is no such place; nodes see only their neighbors, and we still want every node to converge on the same fused estimate. Distributed averaging is the bridge. If each node repeatedly replaces its value with a weighted average of its own and its neighbors', then on a connected graph every node's value converges to the network-wide average, and no node ever needs the full set of observations. Iterating
$$x_i^{(t+1)} = x_i^{(t)} + \epsilon \sum_{j \in \mathcal{N}(i)} \left( x_j^{(t)} - x_i^{(t)} \right)$$
drives all $x_i$ to the global mean for a suitable step size $\epsilon$, using only neighbor-to-neighbor messages; on a connected undirected graph, any $0 < \epsilon < 1/\Delta_{\max}$ suffices, where $\Delta_{\max}$ is the largest number of neighbors of any node. To recover the inverse-variance weighting of the equation in Section 3, each node runs the same iteration on two quantities, the precision-weighted estimate $\sigma_k^{-2} x_k$ and the precision $\sigma_k^{-2}$, and divides the first by the second; the ratio of the two network averages is exactly $\hat{x}$, so the mesh computes the same covariance-weighted fusion in a fully decentralized way. This is the same gossip and decentralized-averaging machinery that powers decentralized learning in Chapter 14; here it produces a shared world estimate rather than a shared model. Consensus also buys robustness: with no central fusion node, there is no single point of failure, and the trade is more rounds of communication for that resilience, the recurring tension of this book between bandwidth and reliability.
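A minimal sketch of this decentralized fusion follows, under stated assumptions: six nodes on a ring, made-up estimates and noise levels, and a step size of 0.3, which is below one over the largest node degree (each ring node has two neighbors). Each node averages two quantities with its neighbors, its precision-weighted estimate and its precision, and divides one by the other. The function name `consensus_step` is hypothetical.

```python
import numpy as np

def consensus_step(x, neighbors, eps):
    """One round: every node moves toward its neighbors' current values."""
    return np.array([x[i] + eps * sum(x[j] - x[i] for j in neighbors[i])
                     for i in range(len(x))])

K = 6
neighbors = {i: [(i - 1) % K, (i + 1) % K] for i in range(K)}  # ring topology

est = np.array([12.1, 11.8, 12.4, 9.9, 12.0, 13.1])     # local estimates (assumed)
sigma = np.array([0.3, 0.9, 1.1, 2.5, 0.5, 1.5])        # local noise std devs (assumed)

a = est / sigma**2       # precision-weighted estimate held by each node
b = 1.0 / sigma**2       # precision held by each node
eps = 0.3                # assumed step size, below 1 / max degree = 0.5

for _ in range(200):
    a = consensus_step(a, neighbors, eps)
    b = consensus_step(b, neighbors, eps)

local_fused = a / b      # what each node now believes
central = np.sum(est / sigma**2) / np.sum(1.0 / sigma**2)

print("per-node fused estimates:", np.round(local_fused, 4))
print("centralized fusion      :", round(central, 4))
# A node that also knows K can recover the fused variance as 1 / (K * b[i]).
```

Every node ends up holding the value a central fusion node would have computed, although no node ever saw more than its two neighbors' messages. The 200 rounds are generous for six nodes; larger or sparser graphs need more.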
Distributed sensing is the perception face of the book's thesis. The work, building a model of the world, is partitioned across many machines, each holding a fragment, and a combine step (covariance-weighted fusion, reached by consensus) reassembles a result none of them could compute alone. The combine step here is a cousin of the all-reduce that synchronizes gradients in data-parallel training: a sum of per-node quantities, weighted and shared, that leaves every participant holding the same answer. Whenever you meet a distributed-sensing system, ask the two questions this book always asks: what does each node transmit, and how is the combine performed under a bandwidth budget.
5. The Bandwidth Budget Sets the Design Intermediate
Every choice in the preceding sections reduces to one trade: how many bytes per observation, against how much fusion quality those bytes buy. Sending raw data maximizes fusion fidelity and saturates the link. Sending hard detections minimizes traffic and starves the fusion center of the soft evidence it needs to weigh sources well. Sending uncertainty-tagged detections or embeddings sits where most production systems land, small enough to fit the budget and soft enough that the covariance-weighted combine still behaves. The right point on this curve depends on the link, the event rate, and how much the application is hurt by a slightly worse estimate, and it is an engineering measurement rather than a default. The fused world model feeds directly into the action layer: a robot or autonomous vehicle consumes the fused estimate to plan and act, the subject of Section 34.8 on robotics and autonomous systems.
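As a first-pass sanity check before any measurement, the budget arithmetic can be sketched as below. The payload sizes, event rate, and link rates are all assumed numbers, and the rule "pick the richest representation that fits" is a deliberate simplification: it ignores how much fusion quality each option actually buys, which is the part that has to be measured. The function name `richest_that_fits` is hypothetical.

```python
def richest_that_fits(options, events_per_sec, link_bytes_per_sec):
    """Return the first option whose sustained rate fits the link.

    options: (name, bytes_per_observation) pairs, ordered richest first.
    Returns None when even the smallest representation does not fit.
    """
    for name, size in options:
        if size * events_per_sec <= link_bytes_per_sec:
            return name
    return None

# Assumed payload sizes in bytes, ordered from most to least informative.
options = [
    ("raw frame", 200_000),
    ("embedding + uncertainty", 512),
    ("detection + covariance", 32),
    ("hard decision", 1),
]

for link in (125_000, 2_000, 20):          # assumed link budgets in bytes/second
    choice = richest_that_fits(options, events_per_sec=30, link_bytes_per_sec=link)
    print(f"link {link:>7} B/s -> {choice}")
```

Under these assumptions a 1 Mbit/s link carries embeddings, a 2 kB/s link falls back to detections with covariance, and a 20 B/s link cannot sustain even hard decisions at 30 events per second.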
Cooperative (or collaborative) perception is an active frontier in autonomous driving and multi-robot systems. The question of what to transmit has itself become a learning problem: rather than fixing detections or embeddings by hand, methods in the V2X-ViT and CoBEVT lineage learn an intermediate feature representation that each agent shares, fusing bird's-eye-view features across vehicles and roadside units with attention so the fusion is robust to pose error and communication latency. A parallel thread learns a transmission policy that sends only the features most useful to neighbors under a bandwidth cap, treating the uplink as a constrained channel to be optimized end-to-end with the perception model. The same fused-world-model idea scales up to coordinated fleets of robots and drone swarms, where many agents must agree on a shared map while moving, taken up in the multi-agent robotics case study of Chapter 39. The open problems are familiar from this section: fusing under pose and clock error, staying robust when some agents are wrong or adversarial, and learning what to send when bandwidth is the binding constraint.
We now have the full distributed-sensing loop: per-sensor perception, in-network aggregation that decides locally what to transmit, covariance-weighted fusion that makes the fleet more certain than any node, and consensus that spreads the shared estimate without a center. The next section turns from sensing the world to acting on a model that was learned across many devices, continuing the chapter's edge story in Section 34.6.
For each system, state whether each sensor should transmit raw data, hard detections, or uncertainty-tagged detections/embeddings, and justify it from the bandwidth-versus-fusion-quality trade of Section 5: (a) a 5,000-camera city traffic mesh feeding a central signal-timing controller; (b) two cars at an intersection cooperating over a V2X link with a few milliseconds to spare; (c) a battery-powered acoustic mote network that must run for a year on one charge. Explain why shipping final decisions would degrade the fused result in at least one of these cases.
Start from Code 34.5.1. Add a fifth sensor that reports a wildly wrong estimate (say, 40.0) but advertises a small variance, claiming high confidence it does not deserve. Show that the covariance-weighted fusion is now pulled badly off the true value, because it trusts the advertised variance. Then implement a simple robust fix: down-weight any sensor whose estimate is more than three fused standard deviations from the current fused mean, refuse its claimed precision, and re-fuse. Report the fused estimate before and after, and explain why advertised uncertainty cannot be trusted in an open mesh.
Consider $K$ nodes on a ring, each holding one scalar estimate, running the distributed-averaging iteration of Section 4 to agree on the fused mean. Argue qualitatively how the number of rounds to reach agreement grows with $K$ on a ring versus on a fully connected graph, and relate this to the graph structure studied in Chapter 13. Then state the trade against a single central fusion node: how many messages does each approach send per estimate, and what reliability does the consensus version buy in exchange for its extra rounds?