Part VIII: Case Studies and Capstone Projects
Chapter 39: Multi-Agent Robotics and Drone Swarms

Shared Situational Awareness

"They handed me a map, stitched from a hundred partial views that never fully agree, and asked me to believe in it anyway. So I weighted each view by how little it doubted itself, and called the seam a world."

A Pose Graph Reconciling Its Loop Closures
Big Picture

A swarm acts on a world that no single robot can see, so its first collective task is to build one shared world model out of many partial, noisy, and slightly inconsistent local views. Each robot perceives a fragment: a patch of map, a few tracked objects, an estimate of its own pose. Alone, each fragment is impoverished and uncertain. Fused correctly, they become a global situational picture that is more complete than any one robot's view and more accurate than the best of them. This section shows how the swarm assembles that picture: each robot maps and localizes locally, ships compact summaries rather than raw sensor streams, and the team merges those summaries by adding information, not by averaging guesses. The mathematics is the distributed sensing and fusion you already met in Section 34.5, now carrying the weight of an entire mission, and the engineering is the consistency problem from Chapter 2 applied to a belief instead of a database.

The previous section gave the swarm a way to move and stay together; this section gives it something to be together about. Coordinated motion is only useful in service of a shared understanding of the environment: where the obstacles are, which cells of the search area are already cleared, where the target was last seen, how each robot's own position relates to the others. None of this lives in one place. It is distributed across the swarm exactly as the sensing that produced it is distributed, and the central problem of shared situational awareness is to turn that distribution from a liability into an advantage. A swarm of inexpensive sensors, fused well, can localize a faint signal that no individual sensor could resolve, the same way an ensemble of weak estimators can beat a single strong one.

We build that capability in layers. First, each robot must map its surroundings and locate itself within them, which is the collaborative simultaneous localization and mapping (SLAM) problem. Second, the robots must fuse their overlapping estimates of shared quantities, which is distributed state estimation. Third, they must keep the resulting shared belief consistent despite a network that drops messages and delivers them late, which is the eventual-consistency problem in a new guise. We take these in turn, then close with what happens when two robots simply disagree about what they saw.

Local submaps (partial views) Robot A Robot B Robot C share & merge Merged global map + pose graph pose node shared landmark odometry / observation edge inter-robot loop closure
Figure 39.5.1: Building one world from partial views. Each robot (left) carries a small local submap of poses (blue) and landmarks (orange) covering only what it has seen. Robots that observe a common landmark or revisit a place another robot mapped contribute a shared landmark or an inter-robot loop closure (dashed edges, right), which stitches the separate submaps into a single connected pose graph and one global map. The dashed edges are exactly the cross-robot constraints that fix the relative alignment of otherwise independent local frames.

1. Collaborative Mapping and Localization Intermediate

A single robot doing SLAM estimates a trajectory of its own poses and a map of landmarks simultaneously, because neither can be pinned down without the other: you need the map to localize, and you need your location to build the map. The modern formulation casts this as a graph. Nodes are robot poses and landmark positions; edges are constraints, each one a measurement that says "this pose and that landmark differed by roughly this much, with roughly this uncertainty". Solving SLAM means finding the node positions that best satisfy all the edges at once. That graph view is precisely the pose graph or factor graph, and it is the same machinery that Chapter 13 studies when computation is partitioned across the vertices of a large graph; here the graph is a map and its vertices are spread across robots.

Multi-robot SLAM scales this out. No robot builds the whole map. Each one runs local SLAM over its own trajectory, producing a compact submap: a small pose graph annotated with the landmarks it has seen. The submaps are then shared and merged. Merging is not concatenation, because two robots' local coordinate frames are arbitrary and unrelated until something links them. The link comes from overlap: when robot A and robot B both observe the same landmark, or when B drives through a place A already mapped and recognizes it (an inter-robot loop closure), that shared observation becomes a cross-robot edge in the combined graph. Adding enough such edges anchors every robot's frame to a common one, and a single graph optimization over the union yields the global map. Figure 39.5.1 shows three submaps fused through shared landmarks and loop-closure edges into one connected pose graph.

Key Insight: Share Summaries, Not Sensor Streams

The swarm cannot afford to stream raw point clouds or video between robots; the radio link is the scarcest resource in the system. Collaborative SLAM works because the quantity that must travel between robots is small: a submap is a handful of poses, a list of landmarks, and the relative-pose constraints between them, all far smaller than the sensor data that produced them. Each robot does the heavy local perception, then ships a compressed belief. This is the same move that makes distributed sensing tractable in Section 34.5: push computation to where the data is born and communicate only the distilled estimate.

2. Distributed State Estimation and Data Fusion Advanced

Mapping fixes the geometry; fusion fixes the values. Many quantities in the shared world model are observed by several robots at once: the position of a moving target, the location of a landmark in the overlap region, a robot's own pose seen partly by itself and partly by a neighbor. Each robot holds a local estimate of such a quantity together with an uncertainty, and the swarm must combine these into one estimate that is at least as good as the best contributor. The right tool, introduced for sensing in Section 34.5, is the information form of the Kalman filter, because in that form fusion is addition.

Suppose robot $k$ estimates a shared state $x$ as a Gaussian with mean $\hat{x}_k$ and covariance $P_k$. The information form represents the same belief by an information matrix and an information vector,

$$Y_k = P_k^{-1}, \qquad y_k = P_k^{-1} \hat{x}_k.$$

The information matrix $Y_k$ is the inverse covariance, so a confident robot (small $P_k$) carries large information, and a vague one carries little. The decisive fact is that for conditionally independent observations, the fused belief over all $K$ robots is obtained by simply summing their contributions,

$$Y_{\text{fused}} = \sum_{k=1}^{K} Y_k, \qquad y_{\text{fused}} = \sum_{k=1}^{K} y_k,$$

after which the fused estimate and covariance are recovered by inverting back,

$$P_{\text{fused}} = Y_{\text{fused}}^{-1}, \qquad \hat{x}_{\text{fused}} = P_{\text{fused}}\, y_{\text{fused}} = \left( \sum_k P_k^{-1} \right)^{-1} \sum_k P_k^{-1} \hat{x}_k.$$

The fused mean is a covariance-weighted average of the local means: each robot's estimate is weighted by its information $P_k^{-1}$, so precise robots dominate and noisy ones contribute little. Two properties make this the backbone of swarm fusion. First, the fused information $Y_{\text{fused}}$ is the sum of nonnegative-definite terms, so it is at least as large as any single $Y_k$; equivalently $P_{\text{fused}} \preceq P_k$ for every $k$, meaning the fused estimate is no less certain than the best individual one. Second, summation is associative and commutative, so robots can fuse in any order, in any grouping, along any spanning tree of the communication graph, and reach the identical result, which is exactly what a decentralized swarm with no fixed leader needs. We make both properties concrete in Code 39.5.1.

The code below fuses partial state estimates from four robots, each watching the same landmark with a different sensor noise, by adding their information matrices and vectors. It then cross-checks the decentralized sum against a centralized weighted-least-squares fusion that has access to every observation at once.

import numpy as np

rng = np.random.default_rng(7)
d = 3                                  # shared state: a 3D landmark position
x_true = np.array([4.0, -2.0, 1.0])

# Four robots, each with its own sensor covariance (robot 4 is far and noisy).
covs = [np.diag([0.50, 0.80, 0.30]),
        np.diag([1.20, 0.40, 0.90]),
        np.diag([0.70, 1.50, 0.60]),
        np.diag([2.00, 1.80, 2.20])]
locals_x = [x_true + rng.multivariate_normal(np.zeros(d), P) for P in covs]

# Information-filter fusion: each robot ships (Y_k, y_k); the team ADDS them.
Y_sum, y_sum = np.zeros((d, d)), np.zeros(d)
for x_k, P_k in zip(locals_x, covs):
    Y_k = np.linalg.inv(P_k)           # information matrix = inverse covariance
    Y_sum += Y_k                       # add information matrices
    y_sum += Y_k @ x_k                 # add information vectors
P_fused = np.linalg.inv(Y_sum)         # back to covariance form
x_fused = P_fused @ y_sum              # covariance-weighted mean

# Centralized cross-check: one weighted least squares over all observations.
H = np.vstack([np.eye(d) for _ in covs])
R = np.zeros((d * len(covs),) * 2)
for i, P_k in enumerate(covs):
    R[i*d:(i+1)*d, i*d:(i+1)*d] = P_k
z = np.concatenate(locals_x)
P_central = np.linalg.inv(H.T @ np.linalg.inv(R) @ H)
x_central = P_central @ (H.T @ np.linalg.inv(R) @ z)

for k, P_k in enumerate(covs, 1):
    print(f"robot {k}: trace(P)={np.trace(P_k):.2f}")
print("trace(P_fused)            :", f"{np.trace(P_fused):.3f}")
print("min single-robot trace    :", f"{min(np.trace(P) for P in covs):.3f}")
print("fused tighter than best?  :", np.trace(P_fused) < min(np.trace(P) for P in covs))
print("max |x_fused - x_central| :", f"{np.max(np.abs(x_fused - x_central)):.2e}")
print("max |P_fused - P_central| :", f"{np.max(np.abs(P_fused - P_central)):.2e}")
Code 39.5.1: Decentralized information-filter fusion of four robots' estimates of one landmark. Fusion is the sum of information matrices and vectors; the result is compared against a centralized weighted least squares that sees every observation simultaneously.
robot 1: trace(P)=1.60
robot 2: trace(P)=2.50
robot 3: trace(P)=2.80
robot 4: trace(P)=6.00
trace(P_fused)            : 0.563
min single-robot trace    : 1.600
fused tighter than best?  : True
max |x_fused - x_central| : 0.00e+00
max |P_fused - P_central| : 0.00e+00
Output 39.5.1: The fused covariance trace (0.563) is smaller than every robot's, including the most precise (1.600), so fusion strictly reduces uncertainty. The fused estimate matches the centralized fusion to machine precision, confirming that summing information decentrally is exactly equivalent to fusing everything at one node.

Two readings of Output 39.5.1 matter. The fused uncertainty is lower than any single robot's, which is the quantitative content of "the swarm sees what no member can": four mediocre sensors, fused, beat the best of them. And the decentralized sum equals the centralized solve bit for bit, which is the swarm analogue of the exact-gradient identity that opened this book: just as data-parallel training reorganizes one gradient into a sum of partial gradients without approximation (Section 1.1), information fusion reorganizes one estimate into a sum of partial informations without approximation. The combine step is again addition, and again it is exact.

Thesis Thread: The Same Sum, Now Over Beliefs

The swarm computes a global situational picture that no single agent can perceive, and the mechanism is the additive combine that recurs throughout this book. In data-parallel training the workers add gradients (Section 1.1); in the parameter server they add updates (Chapter 11); here the robots add information matrices and vectors. Perception, not just optimization, is distributed across the team and recombined by a sum. Whenever a swarm produces a collective estimate, ask which quantity is being summed and whether the sum is order-independent; for the information filter it is, which is why a leaderless swarm can fuse along any communication tree and still agree.

Library Shortcut: Factor-Graph Back Ends Do the Optimization for You

Code 39.5.1 fused a single static landmark by hand. A real collaborative-SLAM system fuses thousands of poses and landmarks tied by inter-robot loop closures, and you do not write that solver yourself. Factor-graph libraries such as GTSAM and the Ceres Solver let you declare the variables and the constraints and then run an optimized sparse nonlinear least squares over the whole graph:

import gtsam
from gtsam import noiseModel

graph = gtsam.NonlinearFactorGraph()
NOISE = noiseModel.Diagonal.Sigmas([0.7, 0.7, 0.3])   # per-edge uncertainty

# Each robot adds odometry and observation factors; shared landmarks and
# inter-robot loop closures are just factors that touch two robots' variables.
graph.add(gtsam.BetweenFactorPose2(A1, A2, odom_A, NOISE))     # robot A motion
graph.add(gtsam.BetweenFactorPose2(B3, A2, loop_BA, NOISE))    # B-to-A loop closure
# ... add every robot's factors into ONE graph ...

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
marginals = gtsam.Marginals(graph, result)        # per-variable covariances, for free
Code 39.5.2: The same fuse-by-graph idea at scale. A hand-rolled multi-robot bundle adjustment of hundreds of lines collapses to declaring factors and calling an optimizer; GTSAM handles the sparse linear algebra, the information-form marginalization, and the covariance recovery that Code 39.5.1 did manually for one variable.

3. A Consistent Shared Belief Under Communication Limits Advanced

The fusion of Section 2 assumed every contribution arrives. In a swarm it does not. Radios drop packets, robots move out of range, and a message that does arrive may be seconds stale. The shared world model is therefore a replicated, distributed data structure subject to exactly the partition and delay pressures that Chapter 2 framed for any distributed system. We cannot demand that all robots hold an identical belief at every instant; that would require synchronous, lossless communication the swarm does not have. We instead settle for the weaker and achievable guarantee from Chapter 2: eventual consistency. If observations stop arriving, every robot that has received the same set of updates converges to the same world model, regardless of the order in which the updates were applied.

The information filter makes eventual consistency unusually easy to achieve, because its combine operation is a commutative, associative sum. An update is a pair $(Y_k, y_k)$; applying updates in any order yields the same total, so a robot does not need messages in sequence, only eventually. This is the conflict-free structure that Chapter 2 identifies as the friendly case for replication: when the merge of two states is order-independent and idempotent under deduplication, replicas reconcile without coordination. The practical requirement is therefore not ordering but exactly-once accounting: each robot must fold every distinct observation into its sum once and only once, never twice and never zero times, which swarms enforce by tagging observations with unique source-and-sequence identifiers and discarding duplicates. The observations themselves arrive as a continuous flow, so each robot runs the fusion as a streaming computation over an unbounded sequence of incoming summaries, the online-processing pattern developed in Chapter 9.

Key Insight: Commutative Merge Buys Consistency Without a Clock

The reason a leaderless swarm can hold a coherent shared belief over a lossy network is that its fusion operator is a sum. Addition does not care about order or grouping, so robots need no global clock, no leader to serialize updates, and no guarantee of in-order delivery; they need only that each observation is eventually counted exactly once. Replace the additive information filter with an operator that depends on update order, and you reintroduce the coordination cost that Chapter 2 shows is expensive under partitions. The choice of fusion algebra is a consistency decision in disguise.

4. Handling Disagreement and Conflicting Observations Advanced

Eventual consistency promises that robots converge on the same belief; it does not promise that belief is correct. Two robots can legitimately report incompatible observations: one sees the target at a location the other reports empty, a faulty sensor injects a confident but wrong estimate, or two loop closures imply contradictory alignments of the same place. Naive information fusion handles mild disagreement gracefully, because the covariance weighting already discounts a vague observation, but it is fragile against a confident outlier, which carries a large $Y_k$ and can drag the fused estimate toward a falsehood. A swarm that trusts every report equally is one bad sensor away from a shared delusion.

The defenses operate at two levels. At the measurement level, each cross-robot constraint is gated before it enters the fusion: the innovation, the gap between a new observation and the current belief, is tested against its expected covariance, and observations that fall outside a chi-squared confidence bound are rejected as inconsistent rather than fused. At the graph level, robust pose-graph optimization treats every inter-robot loop closure as provisional and lets the optimizer switch off the ones that cannot be reconciled with the bulk of the constraints, so a single spurious loop closure cannot warp the whole map. This is the same robustness-to-bad-contributors theme that distributed learning meets as Byzantine-robust aggregation in Chapter 35: a sum is efficient but trusting, so production systems wrap the sum in a test that asks each contribution to prove it belongs before it is counted.

Practical Example: The Warehouse Fleet That Stopped Trusting One Camera

Who: A robotics engineer running a fleet of twenty autonomous forklifts that share a live map of a warehouse.

Situation: Each forklift localized against shared visual landmarks and fused its pose estimates into a common map so the fleet could route around one another.

Problem: One forklift's camera mount loosened, so it reported tightly confident but badly wrong landmark positions, and because its information matrices were large, the naive sum pulled the shared map several meters off, triggering phantom near-collisions across the fleet.

Dilemma: Drop all inter-robot fusion and lose the accuracy gain that made the shared map worthwhile, or keep fusing and remain hostage to any single miscalibrated sensor.

Decision: They kept the information-filter fusion but gated every incoming constraint with a chi-squared innovation test and made the pose-graph optimizer robust to outlier loop closures.

How: Observations whose innovation exceeded the confidence bound were quarantined rather than summed, and a robust kernel let the optimizer down-weight the loosened camera's loop closures automatically.

Result: The faulty forklift's reports were rejected within seconds, the shared map stayed accurate, and the fleet flagged the outlier robot for maintenance instead of inheriting its error.

Lesson: Fusion must be robust before it is precise. A swarm's shared belief is only as trustworthy as its worst unchecked contributor, so the gate that rejects bad observations is not an optimization, it is a safety requirement.

Research Frontier: Robust, Communication-Efficient Collaborative SLAM (2024 to 2026)

Two threads dominate current work on swarm situational awareness. The first is making collaborative SLAM robust and fully decentralized at scale: systems in the lineage of DOOR-SLAM and Kimera-Multi (Tian et al., 2023, and its follow-ups) perform distributed pose-graph optimization with robust outlier rejection over real radio links, and recent work pushes toward dozens of agents with no central server and explicit bandwidth budgets. The second is learned and semantic shared maps: instead of fusing geometric landmarks alone, robots fuse compact neural scene representations and object-level semantics, so the shared belief is a learned world model rather than a point cloud, which raises new fusion questions about merging latent representations consistently across agents. Both threads circle the same constraint this section opened with, that the bottleneck is the link, and treat the bytes-per-shared-belief as the quantity to be engineered down while keeping the fused estimate trustworthy.

The swarm now holds something no member could build alone: a single, eventually consistent, outlier-resistant world model assembled from a hundred partial views. That belief is not an end in itself. It exists to be acted upon, and the next section closes the loop by putting this shared situational picture under distributed control, so that the team's motion and task decisions are computed against the world model this section taught it to build. That is the subject of Section 39.6.

Exercise 39.5.1: Why Order Does Not Matter Conceptual

A swarm fuses observations over a lossy network, so different robots receive the same set of updates in different orders and at different times. Explain why the information-filter fusion of Section 2 guarantees that any two robots holding the same set of updates reach an identical belief, and connect this to the eventual-consistency guarantee of Chapter 2. Then describe one concrete fusion rule that would break this guarantee, and state what coordination it would force the swarm to add back.

Exercise 39.5.2: A Confident Liar Coding

Extend Code 39.5.1 by adding a fifth robot whose estimate is badly wrong (say, off by ten units in one axis) but whose covariance is very small, so it claims high confidence. First fuse all five robots with the plain information sum and measure how far the fused estimate moves from the true landmark. Then add a chi-squared innovation gate that rejects any observation whose squared Mahalanobis distance from the current fused estimate exceeds a threshold, and show that gating restores an accurate fused estimate. Report both fused errors and explain why a large information matrix makes an outlier especially dangerous.

Exercise 39.5.3: How Many Robots Until It Is Tight Enough? Analysis

Suppose $K$ robots each estimate the same scalar quantity with independent observations of identical variance $\sigma^2$. Using the information form, derive the fused variance as a function of $K$ and $\sigma^2$, and show it falls as $\sigma^2 / K$. Given a target fused standard deviation, compute the number of robots required, and discuss how this clean $1/K$ scaling degrades once the observations are correlated (for example, when robots share a common landmark whose own position is uncertain). What does the correlated case imply for a swarm that fuses many views of the same few landmarks?