Part VII: Cluster, Edge, and Reliable Infrastructure
Chapter 34: Edge, Fog, and On-Device Distributed AI

Robotics and Autonomous Systems

"My reflexes live in my wrist, my conscience lives in the cloud, and the only thing standing between them is a network link with opinions about latency."

A Robot, Its Brain Split Between Onboard and the Cloud
Big Picture

A robot is not one program that thinks and moves; it is a distributed system of sensing, planning, and control nodes exchanging messages on a real-time bus, with the heaviest cognition pushed off the body to a fog or cloud tier, and a fleet of such robots sharing maps and learned policies over the network. The work splits along a hard physical boundary: a control loop that must close in milliseconds to keep the machine upright stays onboard, while perception, mapping, and global planning, which need far more compute than the body carries, run offboard and feed the loop asynchronously. This section reads a single robot as the smallest interesting distributed AI system, shows why the onboard-offboard split is forced by deadlines rather than chosen for convenience, and scales the picture up to fleets that learn together. It is the bridge from the sensing of Section 34.5 and the latency budgets of Section 34.7 to the multi-robot swarms of Chapter 39.

The earlier sections of this chapter treated the edge device as a thing that senses (Section 34.5) and a thing that must respect a latency budget (Section 34.7). A robot is where those two pressures collide hardest, because the device does not merely report or infer; it moves a physical body through a world that does not pause while a packet is in flight. That physical coupling makes the robot the clearest case in the book of distribution forced by an external clock. The deadline is not a service-level target a product manager negotiated; it is the rate at which a falling body must be corrected, and missing it is not slow, it is a fall. Everything in this section follows from taking that clock seriously and asking which work can possibly run inside it.

What surprises newcomers is that the distribution starts inside one robot, before any network appears. A modern autonomous machine is already a graph of communicating processes, called nodes, that publish sensor readings, planned trajectories, and actuator commands to one another over a message bus. The camera driver, the obstacle detector, the path planner, and the motor controller are separate programs, often on separate cores or separate boards, coordinating through a publish-subscribe layer that is, in the precise sense of this book, a distributed system on wheels. Once we see one robot as a cluster, scaling out to many robots is a change of degree, not of kind.

1. One Robot Is Already a Cluster Beginner

The dominant way to build a robot's software, the Robot Operating System in its second generation (ROS 2), makes the distributed structure explicit and unavoidable. The system is decomposed into nodes, each a process responsible for one concern: reading a lidar, fusing an inertial measurement unit with wheel odometry, building a local cost map, planning a path, or driving the wheels. Nodes never call one another directly. They communicate by publishing typed messages to named topics and subscribing to the topics they need, exactly the publish-subscribe pattern that decouples producers from consumers across a cluster. The transport underneath, the Data Distribution Service (DDS), discovers peers, manages quality-of-service contracts such as reliability and deadline, and moves messages between nodes whether they sit on the same core, on two boards inside the chassis, or on two machines across a wireless link.

This is the same publish-subscribe decoupling that stream-processing systems use to fan data through a cluster (Chapter 9), turned inward and given a real-time contract. Because the nodes are location-transparent, moving a node from the robot's onboard computer to a fog server is, in principle, an edit to a configuration file rather than a rewrite. That property is what makes the onboard-offboard split tractable: the SLAM node can run on the body today and on a fog box tomorrow, and the controller subscribing to its pose estimates does not need to know which. Figure 34.8.1 draws one robot as this graph of nodes over a DDS bus, with the dashed line marking the real-time boundary that Section 2 motivates and Section 3 formalizes.

[Figure: one robot as a graph of ROS 2 nodes over a DDS publish-subscribe bus (topics with reliability and deadline QoS). Onboard, real-time: sensor drivers (lidar, IMU, camera), state estimator (fuse odometry), controller (100 Hz). Offboard, best-effort: perception (detect, segment), global planner (map, route), SLAM / mapping (offloadable to fog).]
Figure 34.8.1: A single robot read as a distributed system. Sensor, estimator, and controller nodes (left, amber) form the onboard real-time loop; perception, mapping, and global-planning nodes (right, blue) carry heavier compute and can run offboard on a fog server. All nodes exchange typed messages over the DDS publish-subscribe bus (green), so a node's physical location is transparent to its subscribers. The dashed line is the real-time boundary that Section 3 turns into a deadline inequality.
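
The location transparency described above can be illustrated without ROS at all. The following is a minimal sketch, assuming a toy in-process `Bus` class that stands in for DDS (it does no discovery, serialization, or QoS); the topic name `/pose` and the two SLAM functions are hypothetical. It shows only the one property the text relies on: the controller subscribes to a topic and never names the node that publishes to it.

```python
from collections import defaultdict
from typing import Any, Callable


class Bus:
    """Toy in-process stand-in for a publish-subscribe bus (not DDS)."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subs[topic].append(callback)

    def publish(self, topic: str, msg: Any) -> None:
        # Deliver to every current subscriber of the topic.
        for callback in list(self._subs[topic]):
            callback(msg)


bus = Bus()
commands = []


def controller(pose):
    # The controller reacts to poses; it has no reference to any SLAM node.
    commands.append(("steer_toward", pose))


bus.subscribe("/pose", controller)


def slam_onboard(scan):
    # Hypothetical SLAM placement 1: runs on the robot's own computer.
    bus.publish("/pose", {"x": scan * 0.5, "source": "onboard"})


def slam_fog(scan):
    # Hypothetical SLAM placement 2: runs on a fog server.
    bus.publish("/pose", {"x": scan * 0.5, "source": "fog"})


slam_onboard(2.0)   # today: SLAM on the body
slam_fog(4.0)       # tomorrow: SLAM on a fog box; controller code is unchanged
print(commands)
```

Swapping the publisher changed nothing on the subscriber side, which is the sense in which relocating a node is a configuration change. A real bus adds what this sketch omits: the latency and loss of the link the message now crosses.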

2. The Real-Time Loop Stays Onboard Beginner

The reason the split exists is a deadline. A balancing or stabilizing controller runs at a fixed period, often 100 Hz to 1 kHz, and within each period it must read the latest sensor state, compute a correction, and command the actuators. If that loop slips, the machine does not degrade gracefully; a quadrotor tumbles, a legged robot collapses, a manipulator overshoots its target. This is the deadline coupling introduced in Section 34.7 at its most unforgiving, because here the consequence of a missed deadline is mechanical and irreversible rather than a delayed response a user might forgive.

Two facts follow immediately. First, the control loop must run on the body, because no network can guarantee a millisecond-scale round trip across an unreliable wireless link, and a loop that depends on a remote reply inherits that link's worst case as its own. Second, the heavy cognition that the loop's decisions ultimately rest on, building a map of the environment, detecting and classifying obstacles, planning a route across a building, cannot fit in the loop's budget and so must run elsewhere and feed the loop asynchronously. The loop does not wait for the latest plan; it acts on the most recent plan it has, while a slower offboard pipeline keeps that plan as fresh as the network allows. This is precisely the sensing tier of Section 34.5 producing data that a separate, slower cognition tier consumes.

Key Insight: The Loop Acts on the Freshest Plan, Not the Latest One

An autonomous system cannot block its control loop waiting for offboard perception, because the perception round trip is many control periods long. Instead the loop and the planner run as decoupled producers and consumers: the planner publishes an updated plan whenever it finishes, and the loop reads whatever plan is currently posted, every period, on time. Correctness becomes a statement about plan freshness (how stale is the plan the loop is acting on?) rather than plan latency (how long until the next plan arrives?). Offboarding heavy cognition is therefore safe exactly when the world changes more slowly than the plan goes stale, and dangerous when it does not.
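
The decoupling in this box can be sketched as a latest-value mailbox driven by simulated time. This is an illustration under stated assumptions, not a real-time implementation: `PlanMailbox` and `simulate` are hypothetical names, the 53 ms plan latency is an assumed offboard compute time plus round trip, and the planner is assumed to start its next plan the moment the previous one is delivered.

```python
PERIOD_MS = 10          # control period of a 100 Hz loop
PLAN_LATENCY_MS = 53    # assumed offboard compute time plus network round trip


class PlanMailbox:
    """Holds only the newest plan; reading never blocks."""

    def __init__(self, plan, sensed_at_ms):
        self.plan, self.sensed_at_ms = plan, sensed_at_ms

    def post(self, plan, sensed_at_ms):
        # The planner overwrites the previous plan; old plans are not queued.
        self.plan, self.sensed_at_ms = plan, sensed_at_ms

    def read(self, now_ms):
        # Returns the posted plan and its age relative to when its inputs were sensed.
        return self.plan, now_ms - self.sensed_at_ms


def simulate(duration_ms=500):
    mailbox = PlanMailbox(plan="hold", sensed_at_ms=0)   # safe default plan at t = 0
    in_flight = (0, PLAN_LATENCY_MS)                     # (sensed_at, arrives_at)
    ages = []
    for now in range(duration_ms + 1):                   # 1 ms simulation steps
        sensed_at, arrives_at = in_flight
        if now == arrives_at:
            mailbox.post(f"plan@{sensed_at}", sensed_at)
            in_flight = (now, now + PLAN_LATENCY_MS)     # planner starts the next plan
        if now % PERIOD_MS == 0:                         # control tick: read, never wait
            _, age = mailbox.read(now)
            ages.append(age)
    return ages


ages = simulate()
print(f"ticks run: {len(ages)}, oldest plan acted on: {max(ages)} ms")
```

Every tick runs on schedule regardless of the planner. Under these assumptions the age of the plan in use sits between one and two plan latencies, because a delivered plan keeps aging until its successor arrives; that aging, not any missed tick, is what the freshness argument has to bound.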

3. How Much Perception Can You Offload? Intermediate

The split admits a budget. Let the control period be $T$, the fixed onboard per-step work (read sensors, run the controller, command actuators) be $c$, and the in-loop planning cost, if any, be $p_{\text{in}}$. The loop meets its deadline only when the onboard work fits inside the period,

$$c + p_{\text{in}} \le T.$$

Any planning whose cost exceeds the remaining budget $T - c$ cannot run in the loop and must be offboarded. When it is offboarded, the loop pays nothing per step for it, but the plan the loop acts on is stale by the time the offboard pipeline takes to produce it and return it. With offboard perception-and-planning time $p_{\text{off}}$ and network round-trip time $R$, the plan freshness, the age of the newest plan when it reaches the loop, is

$$\text{staleness} \;=\; p_{\text{off}} + R \quad\text{(time)}, \qquad \frac{p_{\text{off}} + R}{T} \quad\text{(control periods)}.$$

Because the loop samples the plan only at tick boundaries, a plan whose inputs were sensed at a tick is first acted on $\left\lceil (p_{\text{off}} + R)/T \right\rceil$ periods later; the code below reports the ratio itself, rounded to the nearest period.

The design question of Section 34.7 returns sharpened: offboarding is safe when the environment's rate of change is slow relative to this staleness, so that a plan computed a few periods ago is still approximately correct now. The code below simulates both regimes, an onboard-only reactive planner that fits in the loop, and an offboard heavy planner reached over a network, and reports deadline behavior and plan freshness for each.

PERIOD_MS = 10.0          # control-loop period (100 Hz)
ONBOARD_PLAN_MS = 1.2     # cheap reactive planner that fits onboard
PERCEPTION_OFFBOARD_MS = 35.0  # heavy perception + global planner, runs remotely
RTT_MS = 18.0             # network round trip to the fog planner

def control_step_cost(plan_source):
    # Per-step onboard work: read sensors, run controller, write actuators.
    sense, control, actuate = 0.8, 1.5, 0.6
    fixed = sense + control + actuate
    if plan_source == "onboard":
        return fixed + ONBOARD_PLAN_MS      # plan computed inside the loop
    return fixed                            # plan arrives asynchronously, not in-loop

def run(plan_source, steps=200):
    plan_latency = (PERCEPTION_OFFBOARD_MS + RTT_MS) if plan_source == "offboard" else ONBOARD_PLAN_MS
    misses, worst = 0, 0.0
    for _ in range(steps):
        step_ms = control_step_cost(plan_source)
        worst = max(worst, step_ms)
        if step_ms > PERIOD_MS:             # a missed deadline is a fall
            misses += 1
    return misses, worst, PERIOD_MS - worst, plan_latency

print(f"control period          : {PERIOD_MS:.1f} ms  ({1000/PERIOD_MS:.0f} Hz)\n")
for src in ("onboard", "offboard"):
    misses, worst, headroom, lat = run(src)
    print(f"plan source = {src:9s} | worst step {worst:5.1f} ms | "
          f"headroom {headroom:5.1f} ms | deadline misses {misses:3d}/200 | "
          f"plan freshness {lat:5.1f} ms")

budget_ms = PERIOD_MS - control_step_cost("offboard")   # in-loop room left for planning
print(f"\nin-loop budget left after onboard control : {budget_ms:.1f} ms")
print(f"=> heavy perception ({PERCEPTION_OFFBOARD_MS:.0f} ms) MUST run offboard; "
      f"it is {PERCEPTION_OFFBOARD_MS/budget_ms:.0f}x the in-loop budget")
print(f"=> end-to-end plan latency when offboarded  : "
      f"{PERCEPTION_OFFBOARD_MS + RTT_MS:.0f} ms "
      f"({(PERCEPTION_OFFBOARD_MS + RTT_MS)/PERIOD_MS:.0f} control periods stale)")
Code 34.8.1: A sense-plan-act budget check. The onboard control work is timed against a 10 ms period; the heavy perception-and-planning cost is compared to the in-loop budget that remains after control, and the offboard plan's staleness is reported in control periods.
control period          : 10.0 ms  (100 Hz)

plan source = onboard   | worst step   4.1 ms | headroom   5.9 ms | deadline misses   0/200 | plan freshness   1.2 ms
plan source = offboard  | worst step   2.9 ms | headroom   7.1 ms | deadline misses   0/200 | plan freshness  53.0 ms

in-loop budget left after onboard control : 7.1 ms
=> heavy perception (35 ms) MUST run offboard; it is 5x the in-loop budget
=> end-to-end plan latency when offboarded  : 53 ms (5 control periods stale)
Output 34.8.1: Both regimes meet the 10 ms deadline, but they buy it differently. Keeping a cheap reactive planner onboard gives a plan only 1.2 ms old; offboarding the 35 ms perception pipeline frees the loop (7.1 ms headroom) at the cost of a plan that is 53 ms stale, about five control periods (5.3 before rounding). The heavy planner costs about five times the 7.1 ms in-loop budget, which is why it has to leave the body.

The numbers state the trade precisely. The onboard loop never misses, in either regime, because we refused to put anything in it that did not fit. The cost of offboarding shows up not as a missed deadline but as staleness: the loop is steering on a plan computed five periods ago. For a warehouse robot crossing an open floor at walking pace, five periods of staleness is nothing; for a drone dodging a thrown ball, it is a crash. The split is correct only relative to how fast the world moves, which is the judgment Section 34.7 taught us to make with a latency budget and Figure 34.8.1 drew as a dashed line.
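
The comparison between the warehouse robot and the drone can be made concrete by converting staleness into distance. The sketch below is illustrative only: `drift_m` is a hypothetical helper, and both speeds are assumed round numbers rather than figures from the text. The 53 ms staleness is the offboard value from Output 34.8.1.

```python
def drift_m(speed_m_s: float, staleness_ms: float) -> float:
    """Distance covered, by the robot or by an obstacle, while the plan ages."""
    return speed_m_s * staleness_ms / 1000.0


STALENESS_MS = 53.0   # offboard plan age from Output 34.8.1

# Assumed speeds, chosen only to illustrate the two regimes.
scenarios = {
    "warehouse robot at walking pace": 1.4,   # m/s
    "thrown ball approaching a drone": 10.0,  # m/s
}

for name, speed in scenarios.items():
    cm = 100 * drift_m(speed, STALENESS_MS)
    print(f"{name:32s}: world moves {cm:5.1f} cm while the plan ages")
```

Whether a given drift is tolerable depends on the clearance the robot keeps around obstacles, which is a property of the deployment rather than of the budget.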

Library Shortcut: A ROS 2 Publisher Is the Whole Bus in Ten Lines

Building the publish-subscribe transport of Figure 34.8.1 from scratch (peer discovery, typed serialization, reliability and deadline contracts) would be hundreds of lines. The ROS 2 Python client, rclpy, gives you a node that publishes velocity commands onto a topic in about ten, and DDS handles discovery, transport, and quality-of-service underneath:

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist          # standard velocity-command message

class Controller(Node):
    def __init__(self):
        super().__init__("controller")
        self.pub = self.create_publisher(Twist, "/cmd_vel", 10)   # topic + queue depth
        self.create_timer(0.01, self.tick)    # 100 Hz control loop, the onboard deadline

    def tick(self):
        cmd = Twist()
        cmd.linear.x = 0.4                     # act on the freshest plan held in state
        self.pub.publish(cmd)                  # DDS delivers to every subscriber

rclpy.init()                                   # start the ROS 2 client library
rclpy.spin(Controller())                       # discovery, transport, QoS all handled by DDS
Code 34.8.2: A 100 Hz controller node publishing to /cmd_vel with rclpy. The timer fires the control callback at the 10 ms period of Code 34.8.1 (it schedules the loop; it does not by itself guarantee hard real-time execution). The publisher and DDS layer replace the hand-rolled message bus, so a planner node on a fog server can subscribe to the same robot's topics with no change to this code.

Practical Example: The Warehouse Robot That Stopped Carrying Its Map

Who: A robotics platform engineer at a fulfillment-center automation company running a few hundred mobile picking robots per building.

Situation: Each robot ran full onboard SLAM, holding and updating its own occupancy map of the warehouse, on an embedded computer sized (and priced) to fit that load.

Problem: Maps drifted apart between robots, every robot rediscovered the same moved shelf independently, and the onboard compute set a hard floor on the cost of every unit in the fleet.

Dilemma: Keep heavy SLAM onboard for autonomy under network loss, simple but redundant and expensive, or offload mapping to a fog server and share one map across the fleet, cheaper and consistent but dependent on the wireless link staying healthy.

Decision: They split the loop. A lightweight reactive obstacle-avoidance controller and local pose estimator stayed onboard so a robot could always stop and hold safely; global mapping and route planning moved to a fog server that maintained one shared map for the building.

How: The onboard nodes published odometry and scans over DDS to a fog SLAM node; the fog node published pose corrections and a shared cost map back. When the link dropped, robots fell back to the onboard reactive layer and slowed to a safe pace until the fog returned.

Result: One coherent map for the whole building, cheaper robots (the embedded computer no longer had to size for SLAM), and a shelf moved by one robot was known to all of them within seconds. The control loop's deadline was never at risk because nothing real-time had left the body.

Lesson: Offload the cognition that benefits from being shared and is tolerant of staleness; keep onboard only what the deadline and the worst-case network demand. The map is a fleet asset, the reflex is a body asset.
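
The fallback in the example above (drop to the onboard reactive layer and slow down when the link is lost) can be sketched as a watchdog evaluated on every control tick. This is a minimal sketch under assumptions: the timeout, the two speed limits, and the names `Mode` and `select_mode` are hypothetical and are not taken from the deployment described.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen for illustration only.
FOG_TIMEOUT_MS = 200.0   # silence from the fog tier tolerated before falling back
CRUISE_SPEED = 1.2       # m/s, allowed while fog plans are fresh
SAFE_SPEED = 0.3         # m/s, reduced pace on onboard reflexes only


@dataclass
class Mode:
    planner: str         # "fog" or "reactive"
    speed_limit: float   # m/s


def select_mode(now_ms: float, last_fog_msg_ms: float) -> Mode:
    """Evaluated inside the onboard loop each tick; it never waits on the network."""
    if now_ms - last_fog_msg_ms <= FOG_TIMEOUT_MS:
        return Mode("fog", CRUISE_SPEED)
    return Mode("reactive", SAFE_SPEED)


print(select_mode(now_ms=1000.0, last_fog_msg_ms=950.0))   # link healthy
print(select_mode(now_ms=1000.0, last_fog_msg_ms=600.0))   # link silent too long
```

The decision uses only a local timestamp comparison, so it costs the loop almost nothing and keeps working when the network does not.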

4. Cloud Robotics: One Map, Many Bodies Intermediate

Once heavy cognition lives off the body, it is a short step to letting many bodies share it. This is cloud robotics: a fog or cloud tier holds assets that are expensive to compute and valuable to share, and the robots become comparatively thin clients of those assets. Three shared assets matter most. A shared map means a feature mapped by one robot is immediately usable by every other, so the fleet's collective experience of a building converges to one representation rather than diverging into hundreds of private ones, exactly the consolidation the warehouse example above achieved. Offloaded SLAM moves the localization-and-mapping computation itself off the body, so a robot contributes its sensor stream and receives a pose estimate, paying network latency instead of onboard compute and energy. Shared skills and policies let a manipulation behavior learned by one robot, or in simulation, be distributed to the whole fleet as a downloaded model.

The shared map is a sharded, replicated data structure of exactly the kind Part II built, and keeping the fleet's view of it consistent is the partitioning-and-consistency problem of Chapter 2 wearing a robotics hat. The policy distribution, meanwhile, raises a learning question: how does a fleet improve from the pooled experience of all its members without shipping every robot's raw sensor logs to a central server? That is fleet learning, and it is the same federated pattern this book develops elsewhere.

Thesis Thread: Fleet Learning Is Federated Learning on Wheels

A fleet of robots that each gather private experience and jointly improve a shared policy, without centralizing raw data, is precisely the federated-learning setup of Chapter 14. Each robot is a client computing local updates on its own interaction logs; a coordinator aggregates those updates into a better shared policy and pushes it back. The control loop you kept onboard in Section 2 is the client's inference path; the offboard policy you downloaded in cloud robotics is the federated model. The robotics framing adds non-stationarity: the policy under training is also the one driving the robots that generate the data. The distributed-RL machinery of Chapter 20, where actors collect trajectories and learners update the shared policy, is the engine that turns pooled fleet experience into a better controller.
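
The coordinator's aggregation step can be sketched in a few lines. The text does not name an aggregation rule, so the sample-weighted average below (in the style of federated averaging) is an assumption, and `federated_average` and the three robot updates are hypothetical. Policy weights are plain lists to keep the example dependency-free.

```python
def federated_average(updates):
    """updates: list of (weights, n_samples) pairs, one per robot.

    Returns the mean of the weight vectors, weighted by how much local
    experience each robot contributed. Raw logs never leave the robots.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


# Three hypothetical robots: locally updated weights and local log size.
robot_updates = [
    ([1.0, 0.0], 100),
    ([0.0, 1.0], 100),
    ([1.0, 1.0], 200),
]

shared = federated_average(robot_updates)
print(shared)   # the coordinator would push this back to every robot
```

The robot with twice the experience pulls the shared policy twice as hard, which is the usual reason to weight by sample count; it does nothing to address the non-stationarity noted above.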

5. From One Robot to a Fleet That Coordinates Advanced

The publish-subscribe bus that ties one robot's nodes together does not stop at the chassis. Because DDS topics are discovered across the network, several robots can join one logical bus and subscribe to each other's poses, intentions, and observations, which is the seed of multi-robot coordination. A fleet of warehouse robots negotiating who takes which aisle, a team of drones holding a formation while one of them loses GPS, a group of delivery robots merging at a doorway, all are distributed-coordination problems layered on top of the single-robot architecture of this section. The shared map of Section 4 becomes shared situational awareness; the shared policy becomes a joint policy in which each robot's best action depends on what the others are doing.

That dependence is what turns a fleet from many independent autonomous systems into one multi-agent system, and it is exactly where this book's multi-agent thread takes over. The coordination mechanisms (task allocation, formation control, conflict resolution) build on the game-theoretic and swarm foundations of Chapter 31, and the full treatment of robot teams and drone swarms as a distributed AI system is the case study of Chapter 39. This section has built the single node of that larger graph: a robot that is itself a cluster, split across a real-time boundary, reaching out to a fog tier for the cognition it cannot carry, and ready to be one member of a coordinating fleet.

Research Frontier: Cloud-and-Fleet Robot Learning (2024 to 2026)

Two lines are reshaping the onboard-offboard split. First, large vision-language-action models such as the RT-2 and OpenVLA lineage (Brohan et al., 2023; Kim et al., 2024), together with the open-source generalist policies released around the Open X-Embodiment collaboration, train one policy on the pooled experience of many robot embodiments and then distribute it across heterogeneous fleets. This is fleet learning at foundation-model scale, and it pushes the heaviest inference toward the offboard tier, because these models are too large to run on the body in full. Second, cloud-robotics frameworks built on ROS 2 and DDS, together with edge-AI accelerators on the robot, are making the boundary in Figure 34.8.1 dynamic: a node migrates between body and fog at runtime as the network and the deadline allow, an instance of the offloading-decision problem of Section 34.7 solved continuously rather than once at design time. The open question both lines share is how to guarantee the real-time loop's safety while the cognition it depends on is being trained, distributed, and relocated underneath it.
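
A runtime placement decision of the kind described above can be sketched by combining the two conditions from Section 3: a node may run in the loop only if it fits the in-loop budget, and may run offboard only if the resulting plan is fresh enough. This is a hypothetical rule for illustration, not the policy of any particular framework; `place_node`, its parameters, and the sample numbers are assumptions.

```python
def place_node(cost_onboard_ms, cost_offboard_ms, rtt_ms,
               loop_budget_ms, max_staleness_ms):
    """Decide where a cognition node should run under current conditions.

    loop_budget_ms   : T - c, the in-loop time left after fixed control work
    max_staleness_ms : oldest plan the current environment can tolerate
    """
    if cost_onboard_ms <= loop_budget_ms:
        return "onboard"      # fits inside the period, so keep it fresh
    if cost_offboard_ms + rtt_ms <= max_staleness_ms:
        return "offboard"     # too heavy for the loop, fresh enough remotely
    return "fallback"         # neither holds: rely on onboard reflexes and slow down


# Re-evaluated as the measured round trip changes (all values hypothetical).
for rtt in (18.0, 80.0):
    where = place_node(cost_onboard_ms=120.0, cost_offboard_ms=35.0, rtt_ms=rtt,
                       loop_budget_ms=7.1, max_staleness_ms=100.0)
    print(f"rtt {rtt:5.1f} ms -> heavy planner runs: {where}")
```

Solving this continuously means re-running the check as the round trip and the tolerable staleness change; the sketch says nothing about the cost of the migration itself, which a real system would also have to budget.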

Fun Note: The Robot That Was Fearless Until the Wi-Fi Dropped

A demo robot once navigated a conference floor beautifully, right up to the moment it rolled behind a metal pillar that blocked the access point. Its global planner lived on a laptop in the booth, and with the link gone the robot fell back to its onboard reflex layer, which knew exactly one thing: do not hit the wall in front of you. It stopped dead and waited, perfectly safe and perfectly useless, until someone carried the laptop closer. The reflexes were onboard, the courage was on the network.

Exercise 34.8.1: Which Node Goes Where? Conceptual

For a legged delivery robot, classify each of the following nodes as onboard-real-time, onboard-best-effort, or offboard, and justify each from the deadline and staleness arguments of Sections 2 and 3: (a) the balance controller stabilizing the legs; (b) the obstacle detector running a neural network on camera frames; (c) the building-wide route planner; (d) the local pose estimator fusing IMU and joint encoders; (e) the fleet policy that decides which packages to prioritize. State, for each offboard choice, what the robot must do if the network drops, and why that fallback keeps it safe.

Exercise 34.8.2: Stretch the Budget Until It Breaks Coding

Modify Code 34.8.1 so the onboard fixed work $c$ grows (for example, add a heavier onboard filter that raises control from 1.5 ms toward 9 ms) and find the value of $c$ at which the onboard-only regime starts missing the 10 ms deadline. Then, holding $c$ fixed at a safe value, sweep the offboard cost $p_{\text{off}}$ and round-trip $R$ and plot plan staleness in control periods. Identify the staleness at which a robot moving at 1.5 m/s would travel more than 10 cm between plan updates, and argue from your plot what maximum offboard latency is acceptable for that robot.

Exercise 34.8.3: Cost of Offloading SLAM for a Fleet Analysis

A fleet of 200 robots each streams a lidar scan of 1.5 MB at 10 Hz to a fog SLAM service. Estimate the aggregate uplink bandwidth the fog tier must absorb, and compare it to keeping SLAM onboard (zero network, but a more expensive computer per robot). Suppose moving SLAM offboard lets you replace each robot's compute module with one costing \$400 less, while the fog server and network upgrade cost \$30,000. At what fleet size does offloading pay for itself, ignoring the consistency and freshness benefits? Then argue qualitatively how the shared-map consistency of Section 4 changes the calculation in offloading's favor.