Part VIII: Case Studies and Capstone Projects
Chapter 39: Multi-Agent Robotics and Drone Swarms

Communication Constraints

"I sent the update three times. I will never know if it arrived, because the neighbor I sent it to has already flown over a hill, and so have I."

A Message, Hoping Its Neighbor Is Still in Range
Big Picture

In a robot swarm the network is not a fast fabric you can ignore; it is the binding constraint that shapes every algorithm in this chapter. Links are wireless and range-limited, so two agents can talk only while they are close enough; the agents move, so the set of who-can-talk-to-whom changes from one instant to the next; packets are lost, bandwidth is scarce, and there is no switch in the middle guaranteeing delivery. This is the exact opposite of the datacenter assumption that underwrote every collective in Chapter 4, where the interconnect was fast, fixed, and reliable. A swarm algorithm that waits for a global barrier or assumes a stable topology will deadlock the first time a drone crests a ridge. The remedy, developed in this section, is to design for a network that is unreliable and changing by construction: disseminate information by gossip rather than coordination, and send only what changed rather than everything.

The previous sections built up the swarm as a controlled, learning collective: agents with policies, local sensing, and shared goals. Every one of those mechanisms quietly assumed the agents could exchange information. This section pays that bill. It is the section the rest of the chapter rests on, because the character of the communication network is what forces a robot swarm to look so different from the data-parallel clusters of Part IV. In a datacenter, the cables do not move and the switch does not forget; in a swarm of drones over a field, the network is a living thing that grows and tears itself apart as the agents fly. We start from the physical reality of that link, build up to the communication graph it induces, and then show why robust dissemination under churn pushes the design toward the gossip and no-coordinator ideas first met in Chapter 1 and Chapter 2.

1. The Link Is Wireless, Range-Limited, and Mobile Beginner

A wire in a datacenter connects two fixed endpoints at a fixed, high rate, and stays connected. A radio link between two flying robots has none of those properties. It exists only while the two agents are within communication range, a distance set by transmit power, antenna, and the radio-frequency environment; it carries a bandwidth that falls off sharply with distance and with interference from every other agent transmitting nearby; it adds latency that is variable rather than fixed; and it drops packets routinely, because radio is a shared, noisy medium with no switch arbitrating access. None of these are occasional faults to be handled in an error path. They are the normal operating condition, present on every link, every second.

Four properties of this link drive the rest of the section, and it helps to name them separately because each one breaks a different datacenter assumption. Range means connectivity is local: an agent can reach only the handful of others within its radius, never the whole swarm at once. Mobility means that set of reachable neighbors changes continuously as the agents move, so the topology is a moving target rather than a wiring diagram. Intermittency means a link that exists now may be gone in a second and back a second later, so a missing message is not evidence of a failed peer. Scarcity means bandwidth and airtime are shared and small, so every byte transmitted competes with every other agent's bytes, and the practical limit is reached long before the datacenter's would be. We made the case in Section 34.7 that latency at the edge is a first-class design variable; here latency is joined by three siblings, and together they dominate.

Key Insight: In a Swarm, the Network Is the Binding Constraint, Not a Fast Fabric

Every parallel method in Parts III and IV was designed against a network that was fast, fixed, and reliable, so the algorithm could treat communication as a tax to be minimized but otherwise trusted. A swarm inverts all three assumptions at once: the network is slow, moving, and lossy, so the algorithm cannot trust it at all. The design consequence is sharp. You may not assume any specific message arrives, any specific peer is reachable, or the topology this round resembles last round's. Algorithms that survive are the ones that are correct even when most messages are lost and the graph is rewired underneath them, which is why a swarm reaches for gossip where a datacenter reaches for all-reduce.

2. The Communication Graph, and How Mobility Rewrites It Intermediate

Make the link concrete with a model that is standard for exactly this setting: the random geometric graph. Place $N$ agents at positions $p_1, \dots, p_N$ in the plane, and connect two agents whenever they are within communication radius $r$. The resulting graph $G = (V, E)$ has an edge set

$$E = \{\, (i, j) : \lVert p_i - p_j \rVert \le r,\; i \ne j \,\},$$

and this graph, not a fixed wiring diagram, is the substrate every swarm algorithm runs on. Its structure follows from the geometry. If the agents are spread over a region of area $\mathcal{A}$ with density $\rho = N / \mathcal{A}$, the expected number of neighbors of an agent, its degree, is approximately

$$\mathbb{E}[\deg(i)] \approx \rho \, \pi r^2 = \frac{N \pi r^2}{\mathcal{A}},$$

because each agent connects to whoever falls inside the disc of radius $r$ around it. Degree grows with the square of the radius, so a small increase in transmit power (a larger $r$) buys a disproportionately denser graph, at a cost in energy and interference. There is a sharp threshold below which the graph fractures into disconnected islands no message can cross; for $N$ agents in a unit square, connectivity requires a radius on the order of $r \sim \sqrt{\log N / (\pi N)}$, the well-known scaling for connectivity of random geometric graphs. Below that radius, the swarm is not one network but several, and no amount of protocol cleverness joins them.

Time t: a value starts at agent A A orange = holds the value, gray = does not yet Time t+1: agents moved, graph rewired A link broke new link the value still spreads, now along different edges
Figure 39.4.1: The communication graph is a moving target. Both panels show the same six agents a moment apart. Between $t$ and $t+1$ the agents fly, one in-range link breaks (dashed red) and one new link forms (green), yet a value seeded at agent A keeps spreading because gossip rides whatever edges exist this round rather than a fixed route. An algorithm that had pinned a delivery path through the broken link at $t$ would now be stuck; one that re-gossips to current neighbors is not.

The single fact that distinguishes this graph from every network earlier in the book is that it is dynamic. In a datacenter, $G$ is the wiring; you measure it once and plan collectives against it, as Chapter 4 does when it lays out a ring or tree all-reduce over a fixed topology. In a swarm, $G$ at time $t+1$ differs from $G$ at time $t$ because the agents have moved: some pairs crossed inside $r$ and gained an edge, others crossed outside and lost one. The edge set is a function of motion, $E(t)$, and no plan computed against $E(t)$ can be trusted at $E(t+1)$. This is the structural reason a swarm cannot reuse the static, topology-aware collective schedules of Chapter 4: there is no fixed topology to be aware of.

3. Gossip and Epidemic Dissemination Under Churn Intermediate

If you cannot pin a route and cannot run a barrier, how does a value held by one agent reach the whole swarm? The answer is the oldest robust idea in distributed systems, met first as the no-coordinator principle of Section 1.4 and as a failure-tolerant primitive in Chapter 2: gossip. Each agent periodically tells whatever it knows to whoever happens to be in range, and those neighbors retell it to theirs. Information spreads like an epidemic, hop by hop, with no central plan and no fixed path. Crucially, gossip needs no agreement on the topology and tolerates massive message loss, because a fact that fails to cross a link this round simply tries again next round over whatever links exist then. The dynamic graph that defeated routing is exactly the regime gossip was built for.

The cost of this robustness is rounds, not bytes. On a graph with good expansion, a single value started at one agent reaches all $N$ agents in a number of synchronous gossip rounds that scales like the graph diameter, on the order of $O(\log N)$ for well-connected random graphs, growing toward the diameter as the graph thins out near the connectivity threshold. The relevant quantity is the mixing or dissemination time: how many rounds of local exchange are needed before every node holds the value. Contrast this with the all-to-all interconnect of Chapter 4, where every node can reach every other in a single hop and a broadcast finishes in one round. The swarm pays in rounds what the datacenter pays for once in cabling, and accepts that price because rounds survive a network that cabling assumptions do not.

The demonstration below makes the trade visible. It builds a range-limited communication graph over sixty agents scattered in a unit square, seeds one value at a single agent, and floods it epidemically, counting the rounds until every agent holds it. It sweeps the communication radius to show both the connectivity threshold (below which the value can never reach everyone) and the way dissemination time falls as the graph densifies, and it compares against the all-to-all clique that stands in for a datacenter fabric.

import numpy as np

rng = np.random.default_rng(7)
N = 60                      # agents (robots/drones)
side = 1.0                  # square arena side length
pos = rng.random((N, 2)) * side

def neighbors(pos, radius):
    """Adjacency of a random geometric graph: link iff within comm radius."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    A = (d <= radius) & (d > 0.0)
    return A

def rounds_to_spread(A):
    """Synchronous epidemic flood from one source: rounds until all 'know'.
    Each round, every informed node informs every current neighbor.
    Returns rounds, or -1 if the source's component does not cover all N."""
    informed = np.zeros(A.shape[0], dtype=bool)
    informed[0] = True       # agent 0 starts with the value
    for r in range(1, A.shape[0] + 1):
        newly = (A & informed[:, None]).any(axis=0) | informed
        if newly.all():
            return r
        if (newly == informed).all():
            return -1         # stalled: graph is disconnected
        informed = newly
    return -1

# Communication radius sweep: connectivity and epidemic spread time.
print(f"{'radius':>7} {'avg degree':>11} {'connected':>10} {'gossip rounds':>14}")
for radius in [0.10, 0.15, 0.20, 0.25, 0.30, 0.40]:
    A = neighbors(pos, radius)
    avg_deg = A.sum(axis=1).mean()
    rounds = rounds_to_spread(A)
    connected = "yes" if rounds > 0 else "no"
    shown = str(rounds) if rounds > 0 else "-"
    print(f"{radius:>7.2f} {avg_deg:>11.2f} {connected:>10} {shown:>14}")

# All-to-all baseline: a clique reaches everyone in exactly one round.
A_full = np.ones((N, N), dtype=bool) ^ np.eye(N, dtype=bool)
print()
print("all-to-all (datacenter fabric) rounds:", rounds_to_spread(A_full))
Code 39.4.1: Epidemic dissemination of one value across a range-limited swarm graph, swept over communication radius, with an all-to-all clique as the datacenter reference. The flood is the worst-case-robust form of gossip: in each round every informed agent informs every current neighbor.
 radius  avg degree  connected  gossip rounds
   0.10        1.33         no              -
   0.15        3.27         no              -
   0.20        5.43        yes              8
   0.25        8.37        yes              5
   0.30       11.80        yes              5
   0.40       19.47        yes              3

all-to-all (datacenter fabric) rounds: 1
Output 39.4.1: Below radius $0.20$ the graph is fragmented and the value never reaches everyone (dissemination fails, marked "-"). Above the threshold, more communication range means higher degree and fewer rounds to cover the swarm: eight rounds at $r = 0.20$ down to three at $r = 0.40$. The datacenter clique finishes in a single round; the swarm pays the difference in rounds, which is the price of running over a range-limited, lossy network instead of a fixed fast fabric.

Two lessons sit in that table. First, connectivity is not free: there is a radius below which no protocol disseminates anything, because the graph itself is in pieces, and a swarm controller must keep the agents close enough, or numerous enough, to stay above it. Second, above the threshold, dissemination time and communication range trade off smoothly, so a swarm designer chooses an operating point that balances rounds-to-converge against the energy and interference cost of a larger radius. Neither lesson has an analog in the datacenter, where the single-round clique of Chapter 4 makes both questions vanish. The shared awareness that Section 39.5 builds, every agent maintaining a consistent enough picture of the swarm's state, is precisely a value being gossiped over this graph, round after round, fast enough to stay useful as the world changes.

Thesis Thread: The Same Primitive, Now Against a Hostile Network

Gossip is not new to this chapter. It carried decentralized averaging in federated learning (Chapter 14) and stood as the no-coordinator alternative to a central reducer back in Section 1.4. What the swarm changes is the network underneath it. In a datacenter, gossip is a design choice you make to avoid a coordinator bottleneck; in a swarm, it is the only thing that works, because the network is range-limited, mobile, and lossy by physics, not by configuration. The scale-out lesson sharpens here: a distributed algorithm earns its keep precisely by tolerating a network that is unreliable and changing, the exact opposite of the datacenter assumption that every collective in Part IV was allowed to make.

4. Bandwidth-Aware Communication: Send Only What Changed Advanced

Robustness through gossip buys reach; it does not buy bandwidth. Airtime is the scarcest resource in a swarm, because every agent shares one radio medium and interference rises with traffic, so the second design imperative is to put as few bytes on the wire as possible. The governing inequality is blunt: if each agent must send a message of $B$ bytes every round and there are $N$ agents contending for a shared channel of capacity $C$ bytes per second, then the achievable round rate is bounded by

$$f_{\text{round}} \;\lesssim\; \frac{C}{N \, B},$$

so halving the bytes per message $B$ doubles how often the swarm can exchange state, and the system stays responsive only while $N B f_{\text{round}}$ sits comfortably below $C$. The lever the designer controls is $B$, and the cheapest way to shrink it is to stop sending what the receiver already knows. Instead of broadcasting a full state vector every round, an agent sends only the delta from what it last transmitted, an event or a changed field rather than a snapshot. When the swarm is calm, $B$ collapses toward zero and the channel is nearly idle; when something changes, the bytes are spent where they carry information.

The learning analog of this idea is where the frontier sits, and it sets up the policy work of Section 39.7. In multi-agent reinforcement learning, the agents need not be told what to communicate; they can learn it. Differentiable communication channels let a policy learn to emit a short message only when it is worth the bandwidth, and learned attention or gating learns which neighbor to address, so the message rate itself becomes a quantity the swarm optimizes rather than a fixed broadcast schedule. The principle is the same as the hand-coded delta, raised one level: under a bandwidth budget, the best swarms say little and say it only when it changes the listener's decision.

Library Shortcut: Gossip and Bounded Communication Without Hand-Rolling the Protocol

Code 39.4.1 floods by hand to expose the mechanism. In a real swarm you do not implement epidemic dissemination, neighbor discovery, and bandwidth budgeting from scratch. ROS 2 ships on DDS, whose discovery and quality-of-service settings (best-effort vs reliable, history depth, deadline) give per-topic control over exactly the loss-versus-latency trade this section is about; setting a topic to best-effort with shallow history is the production form of "send the latest, drop the stale." For the learned-communication side, MARL frameworks expose differentiable message channels as a few lines on top of a policy network, so a learned "send only what changed" gate is a configuration rather than a custom protocol. The from-scratch flood of Code 39.4.1 is roughly forty lines; the production path is a handful of QoS flags plus an off-the-shelf policy wrapper, with the framework handling discovery, retransmission policy, and serialization.

Practical Example: The Survey Drones That Stopped Talking So Much

Who: A robotics engineer fielding a twenty-drone team mapping a forest canopy.

Situation: Each drone broadcast its full state estimate, a few kilobytes, ten times a second to share awareness across the team.

Problem: As the drones spread out and the canopy attenuated the radio, the shared channel saturated, packets collided, and shared awareness went stale exactly when the team was most dispersed and needed it most.

Dilemma: Increase transmit power to widen range and fight attenuation, which worsens interference and battery drain for everyone, or cut the traffic each drone generates so the channel clears, at the risk of neighbors holding slightly older estimates.

Decision: They cut the traffic, switching from full-state broadcast to delta updates gossiped to in-range neighbors, sending the full vector only on a heartbeat once per second and otherwise sending only fields that moved past a threshold.

How: They set the state topic to best-effort with a one-deep history so stale frames were dropped, and added a change-threshold gate so a hovering drone fell nearly silent.

Result: Average bytes per round fell by roughly an order of magnitude, the channel stopped saturating, and effective round rate $f_{\text{round}}$ rose, so awareness was fresher across the dispersed team than the original chatty design had achieved.

Lesson: Under a shared, scarce channel, sending less can make the swarm know more, because freshness is bounded by airtime, and airtime is spent on bytes you can choose not to send.

Research Frontier: Learned, Sparse, Bandwidth-Aware Communication (2024 to 2026)

The line that began with differentiable inter-agent communication (CommNet, DIAL, and the attention-based TarMAC) has turned squarely toward the bandwidth question this section raises. Recent multi-agent reinforcement learning work treats the communication budget as an explicit constraint and learns gating policies that decide whether to transmit at all, scheduling messages so that total airtime stays bounded while task performance holds, and learns whom to address so a message goes to the one neighbor whose decision it changes rather than to all in range. A parallel thread studies these policies under the realistic dynamic, lossy graphs of this section rather than the fully connected toy graphs of early work, asking how learned protocols degrade as links drop and the topology churns. The unifying research target is a swarm that learns to communicate as little as the task allows over a network it cannot trust, which is the natural endpoint of treating the network, not the compute, as the binding constraint. We give these learned protocols their full treatment in Section 39.7.

5. Why This Inverts the Datacenter Assumption Intermediate

It is worth stating plainly what has changed from Part IV, because the contrast is the whole point. There, the network was a fast fabric: high bandwidth, fixed topology, and reliable delivery, so the algorithm's job was to minimize a communication tax it could otherwise trust, and the entire collective machinery of Chapter 4 was built on that trust. A ring all-reduce assumes the ring stays wired; a tree broadcast assumes the tree's edges deliver. Here, none of those assumptions hold. The network is the binding constraint, not a fast fabric, and the algorithm's job is not to minimize a trusted tax but to remain correct over an untrusted, shifting medium. That is why the same value that an all-reduce would synchronize in one trusted round must instead be gossiped over many rounds against a graph that rewrites itself, as Output 39.4.1 quantified.

The scale-out lesson the book has built toward lands with full force in the swarm. Distribution always trades local certainty for global reach; in the datacenter the network made that trade cheap, and the chapters of Part IV could treat communication as nearly free and nearly certain. Strip away the fast fabric and the trade is laid bare: a distributed algorithm is only as good as its tolerance for a network that loses messages, moves its endpoints, and changes its shape, because that is what a network actually is once you leave the machine room. Designing for that network, rather than against an idealization of it, is the discipline this chapter teaches, and gossip plus bandwidth-awareness are its two load-bearing techniques. Everything that follows, the shared awareness of Section 39.5 most immediately, is an application of these two ideas to a network that will not sit still.

Fun Note: The Acknowledgment That Never Came Home

There is a small grief unique to swarm engineering. In a datacenter, when a message is not acknowledged, something is wrong and an alarm should fire. In a swarm, an unacknowledged message is Tuesday. The peer is probably fine; it simply flew behind a tree. Engineers who come from the datacenter spend their first week building elaborate retry-and-alert logic for lost packets, and their second week deleting all of it, because in a swarm the only sane response to silence is to shrug, gossip again next round, and trust that the epidemic will get there even if no single message ever does.

Exercise 39.4.1: Why Not Just All-Reduce? Conceptual

A colleague proposes running a standard ring all-reduce across a thirty-drone swarm to average each drone's local estimate every round, citing its efficiency in the datacenter. Using the four link properties of Section 1 (range, mobility, intermittency, scarcity), explain at least three distinct ways this plan fails over a swarm network that it would not face in a datacenter. Then state what gossip gives up relative to all-reduce, and why that trade is the right one here.

Exercise 39.4.2: Mobility Breaks the Plan Coding

Extend Code 39.4.1 so the agents move: at each round, add a small random displacement to every position, then recompute the graph before the next gossip step. Seed one value at agent 0 and flood it, but now over the per-round graph rather than a fixed one. Measure rounds-to-spread as a function of the per-round movement magnitude at a fixed radius near the connectivity threshold. Report whether mobility helps or hurts dissemination, and explain the result in terms of links that break versus links that form as agents mix.

Exercise 39.4.3: Bandwidth Versus Freshness Analysis

Using the channel inequality $f_{\text{round}} \lesssim C / (N B)$, suppose a shared channel of $C = 1$ megabyte per second serves $N = 25$ agents. Compute the maximum round rate if each agent sends a full $4$-kilobyte state vector every round, then again if a delta scheme cuts the average message to $200$ bytes. Translate both into the staleness of the freshest neighbor estimate (round period in milliseconds). Then argue, in terms of the gossip dissemination time from Output 39.4.1, why a faster round rate can matter more for swarm awareness than a larger communication radius, and identify the regime where the opposite is true.