Part VI: Distributed AI and Multi-Agent Systems
Chapter 31: Swarm Intelligence and Collective Behavior

Particle Swarm Optimization

"I found a good spot. The swarm found a better one. So I pointed myself somewhere between the two and trusted the crowd to be only half wrong."

A Particle Splitting the Difference
Big Picture

Particle swarm optimization turns a hard optimization problem into a population of simple agents that each remember their own best discovery, listen for the swarm's best discovery, and steer between the two. No agent computes a gradient, no agent sees the whole picture, and no central solver coordinates the search; the optimum emerges from many cheap evaluations of the objective combined with one tiny piece of shared information, the best position found so far. That structure is what makes the method matter for this book: the expensive part, evaluating the objective at each particle, happens independently and in parallel, while the only thing that must cross the network is a single best-position vector, an all-reduce so small it is almost free. This section gives the velocity and position update rules, the three coefficients that tune exploration against exploitation, the choice between a fully connected swarm and a neighborhood topology, and a runnable swarm that converges on a deliberately nasty test function.

The previous section put ants on a graph and let pheromone trails solve discrete routing problems. Continuous optimization asks a different question: given a real-valued objective $f(\mathbf{x})$ over a vector $\mathbf{x} \in \mathbb{R}^d$, find the input that minimizes it. When $f$ is smooth and you can differentiate it, gradient descent is the right tool and most of Part IV is about scaling it across machines. But many objectives are black boxes: a physics simulation, a wind-tunnel surrogate, the validation accuracy of a trained model as a function of its hyperparameters. You can evaluate them, sometimes at considerable cost, but you cannot differentiate them, and they may be riddled with local minima. Particle swarm optimization (PSO), introduced by Kennedy and Eberhart in 1995, attacks exactly this regime with a swarm of candidate solutions that explore the space together.

One particle, one update step x (current) personal best p swarm best g inertia w·v cognitive c₁(p−x) social c₂(g−x) new v → new x the three pulls sum to one step; coefficients set how far the particle leans toward self versus swarm
Figure 31.4.1: The velocity update for a single particle. Three vectors combine: inertia (keep going the way you were going), a cognitive pull toward the particle's own best-seen position $\mathbf{p}$, and a social pull toward the swarm's best-seen position $\mathbf{g}$. Their weighted sum is the new velocity, which moves the particle to its new position. The coefficients $w$, $c_1$, $c_2$ control the relative strength of the three pulls and thereby the balance between exploring new regions and exploiting known-good ones.

1. The Swarm as a Search Population Beginner

A particle swarm is a fixed population of $n$ candidate solutions, each a point $\mathbf{x}_i \in \mathbb{R}^d$ in the search space together with a velocity $\mathbf{v}_i \in \mathbb{R}^d$. You can picture each particle as a tiny agent flying over the landscape of the objective function, its height at any moment given by $f(\mathbf{x}_i)$. The swarm's job is to drive as many particles as possible down to the lowest valley. Two pieces of memory make this work. Each particle remembers $\mathbf{p}_i$, the best position it personally has visited (the one with the smallest objective value it has ever recorded). The swarm as a whole tracks $\mathbf{g}$, the best position any particle has visited. These are the only states a particle needs beyond its current position and velocity.

What makes PSO a member of the swarm-intelligence family, alongside the ant colonies of the previous section and the flocks of the next, is that the global behavior is not programmed anywhere. Each particle follows a fixed, local rule, and the convergence of the population onto good optima is emergent. There is no master that decides where the swarm should look next; there is only a shared scalar-and-vector summary, $\mathbf{g}$, that every particle reads and that any particle can improve. The contrast with a centralized optimizer is the same contrast this part of the book keeps returning to, and it is exactly what lets the method scale.

2. The Velocity and Position Update Intermediate

At each iteration, every particle updates its velocity and then its position. The velocity update is the heart of the method, and it is a weighted sum of the three pulls drawn in Figure 31.4.1. For particle $i$ at iteration $t$,

$$\mathbf{v}_i^{\,t+1} = \underbrace{w\,\mathbf{v}_i^{\,t}}_{\text{inertia}} + \underbrace{c_1\,\mathbf{r}_1 \odot (\mathbf{p}_i - \mathbf{x}_i^{\,t})}_{\text{cognitive}} + \underbrace{c_2\,\mathbf{r}_2 \odot (\mathbf{g} - \mathbf{x}_i^{\,t})}_{\text{social}},$$

followed by the position update

$$\mathbf{x}_i^{\,t+1} = \mathbf{x}_i^{\,t} + \mathbf{v}_i^{\,t+1}.$$

Here $\mathbf{r}_1$ and $\mathbf{r}_2$ are vectors of independent uniform random numbers in $[0,1]$, drawn fresh each step, and $\odot$ is the elementwise product. The randomness is essential: it keeps the swarm from collapsing into a single deterministic trajectory and lets particles overshoot, undershoot, and probe around the two attractors rather than marching straight at them. Each of the three terms has a clear job. The inertia term $w\,\mathbf{v}_i^{\,t}$ carries the particle's momentum forward, so it keeps exploring in its current direction rather than snapping instantly to an attractor. The cognitive term pulls the particle back toward its own best discovery $\mathbf{p}_i$, the "nostalgia" that keeps it loyal to good places it has personally seen. The social term pulls it toward the swarm's best $\mathbf{g}$, the conformity that lets one particle's lucky find recruit the whole population.

Key Insight: Three Coefficients Buy the Exploration-Exploitation Balance

The entire character of a particle swarm is set by three numbers. A large inertia $w$ (near 1) keeps velocities high and the swarm exploring; a small $w$ (toward 0.4) damps motion and lets the swarm settle, which is why many implementations decay $w$ over time, exploring early and exploiting late. The cognitive coefficient $c_1$ rewards individual memory and spreads the swarm out; the social coefficient $c_2$ rewards consensus and pulls it together. Push $c_2$ too high and the swarm rushes to the first decent point it finds and stagnates there (premature convergence); push $c_1$ too high and the particles wander as loners and never agree. The widely used defaults $w \approx 0.72$, $c_1 = c_2 \approx 1.49$ come from a stability analysis (Clerc and Kennedy's constriction result) that keeps velocities from exploding, and they are a sane starting point precisely because they balance the self-pull against the swarm-pull.

3. Global Best Versus Local Best Intermediate

The social term hides a design decision: what does "the swarm's best" mean to a given particle? In the global-best (gbest) variant, every particle is connected to every other, so $\mathbf{g}$ is the single best position in the entire swarm and all particles feel the same social pull. This is fast: one good discovery propagates to the whole population in a single iteration. It is also the variant most prone to premature convergence, because if that one best position sits in a deep but suboptimal local minimum, the entire swarm dives in after it and may never escape.

The local-best (lbest) variant gives each particle a neighborhood instead of the whole swarm. A particle's social attractor is the best position among only its neighbors on some fixed communication graph, the classic choice being a ring in which each particle talks to two others. Now a good discovery spreads slowly, hopping neighbor to neighbor around the ring rather than broadcasting instantly. The swarm explores more independent regions in parallel, which makes it slower to converge but markedly more robust against being trapped: a subgroup stuck in a bad basin does not immediately drag the rest in with it. This is the same idea, expressed on a population of optimizers, that Section 29.9 develops as the communication-graph topology of a multi-agent system, where the connectivity of the agent graph trades speed of consensus against resistance to a bad signal propagating too quickly. The neighborhood graph is the swarm's communication topology, and choosing it is choosing how fast information, good or bad, travels.

Fun Note: The Swarm That Agrees Too Fast Is the Swarm That Is Wrong Together

There is a pleasing social parable in the gbest-versus-lbest distinction. A fully connected swarm is a room where everyone instantly hears the loudest claim and rushes toward it; if the claim is right, they win quickly, and if it is wrong, they are all wrong in unison. A ring-connected swarm is a room of overlapping small conversations where a good idea has to earn its way around, slowly, which is annoying when the idea is correct and a lifesaver when it is not. Optimization, like committees, often does better with a little friction in the wiring.

4. Why PSO Is Embarrassingly Parallel Intermediate

Here is where PSO earns its place in a book about scaling out. Within one iteration, the expensive operation is evaluating the objective $f(\mathbf{x}_i)$ at each of the $n$ particles, and those evaluations are completely independent: particle $i$'s objective value does not depend on particle $j$'s. If $f$ is a black box that takes seconds or minutes to evaluate, a wind-tunnel surrogate or a full model-training run, you can scatter the $n$ particles across $n$ workers, evaluate them all at once, and the per-iteration wall-clock collapses from $n$ sequential evaluations to one. This is the textbook definition of an embarrassingly parallel workload, and it is the dominant cost whenever the objective is expensive.

The only information that must cross the network is the best position. After the workers report their objective values, the swarm needs to know the single best (gbest) or each neighborhood's best (lbest) so that every particle can apply its social pull next iteration. Finding the global minimum over $n$ scalar values and broadcasting the winning position is a reduction followed by a broadcast: precisely the all-reduce primitive built from scratch in Section 4.3, except that the reduced payload is a single $d$-dimensional vector and one scalar, kilobytes rather than the gigabyte gradient buffers of distributed training. The communication-to-computation ratio is therefore tiny: heavy independent compute, a featherweight collective. That is the ideal shape for scale-out, and it is why PSO parallelizes so naturally for exactly the expensive black-box objectives where you most need the speed.

Thesis Thread: The Same All-Reduce, Now Choosing an Optimum

The all-reduce that summed gradient vectors in Section 1.1 returns here with a different reduction operator. Data-parallel training reduces with sum (average the gradients); a global-best particle swarm reduces with argmin (find the worker holding the best position) and broadcasts that position. The local-best variant replaces the single global reduction with neighborhood reductions over a communication graph, which is exactly the topology-aware collective that Section 4.3 treats in general form. One primitive, two reduction operators, two topologies: the swarm is a distributed optimizer wearing the same plumbing as the trainer.

5. A Swarm That Converges, From Scratch Intermediate

The code below implements both variants on the Rastrigin function, a standard optimization torture test whose surface is a smooth bowl studded with a regular grid of local minima, with the global minimum of $0$ at the origin. A method that only does local descent gets stuck in the nearest ripple; a working swarm has to use its exploration to hop across the ripples toward the center. The implementation is plain NumPy: positions, velocities, the three-term velocity update, and the two topologies (a single shared best for gbest, a three-particle ring neighborhood for lbest).

import numpy as np

def rastrigin(P):
    # P: (n_particles, d). Global minimum 0 at the origin.
    A = 10.0
    return A * P.shape[1] + np.sum(P**2 - A * np.cos(2 * np.pi * P), axis=1)

def run_pso(topology, n=40, d=2, iters=60, seed=0):
    rng = np.random.default_rng(seed)
    w, c1, c2 = 0.72, 1.49, 1.49          # inertia, cognitive, social coefficients
    lo, hi = -5.12, 5.12                   # Rastrigin search box per dimension
    X = rng.uniform(lo, hi, (n, d))        # particle positions
    V = rng.uniform(-1, 1, (n, d))         # particle velocities
    Pbest = X.copy()                       # each particle's own best position
    Pbest_val = rastrigin(X)
    history = []
    for _ in range(iters):
        if topology == "gbest":            # fully connected: one shared best
            g = Pbest[np.argmin(Pbest_val)]
            G = np.broadcast_to(g, X.shape)
        else:                              # lbest ring: best among the 2 ring neighbours
            order = np.argsort(Pbest_val)
            G = np.empty_like(X)
            for i in range(n):
                nb = [(i - 1) % n, i, (i + 1) % n]
                G[i] = Pbest[nb[int(np.argmin(Pbest_val[nb]))]]
        r1, r2 = rng.random((n, d)), rng.random((n, d))
        V = w * V + c1 * r1 * (Pbest - X) + c2 * r2 * (G - X)   # velocity update
        X = np.clip(X + V, lo, hi)                              # position update
        val = rastrigin(X)
        improved = val < Pbest_val
        Pbest[improved], Pbest_val[improved] = X[improved], val[improved]
        history.append(Pbest_val.min())
    return Pbest[np.argmin(Pbest_val)], Pbest_val.min(), history

for topo in ("gbest", "lbest"):
    best_x, best_f, hist = run_pso(topo)
    print(f"{topo:6s}  best f = {best_f:.6e}   at ({best_x[0]:+.4f}, {best_x[1]:+.4f})")
    print(f"        best-so-far at iter 0 / 10 / 30 / 59 : "
          f"{hist[0]:7.3f} {hist[10]:7.3f} {hist[30]:7.4f} {hist[59]:.2e}")
Code 31.4.1: A complete particle swarm in NumPy, run for both topologies on the 2D Rastrigin function. The velocity update is the one line implementing the equation from Section 2; the only difference between gbest and lbest is how the social attractor G is computed, a single shared best versus a per-particle ring neighborhood best.
gbest   best f = 5.907045e-05   at (+0.0005, -0.0002)
        best-so-far at iter 0 / 10 / 30 / 59 :   6.363   1.052  0.0181 5.91e-05
lbest   best f = 3.079467e-03   at (+0.0013, +0.0037)
        best-so-far at iter 0 / 10 / 30 / 59 :   9.480   1.470  0.0682 3.08e-03
Output 31.4.1: Both swarms find the global minimum near the origin, driving the objective from single-digit values down toward zero across 60 iterations. The fully connected gbest swarm converges faster and lands closer (objective $\approx 6\times10^{-5}$); the ring-connected lbest swarm converges more slowly (objective $\approx 3\times10^{-3}$ at the same iteration), trading speed for the broader, more cautious search that the topology buys.

The numbers tell the gbest-versus-lbest story cleanly. Both topologies escape the Rastrigin ripples and home in on the origin, the global optimum, so both work. But gbest reaches a far smaller objective at iteration 59 because its instant broadcast of the best position lets the whole swarm pile onto the most promising spot early. The lbest swarm is one to two orders of magnitude behind at the same iteration count, the visible price of letting good news travel only neighbor to neighbor; the compensating benefit, harder to see on a single friendly seed but real across many runs and harder problems, is that the slower spread makes lbest much less likely to commit the whole swarm to a wrong basin. On a problem with one dominant deceptive minimum, the rankings can reverse, which is exactly why the topology is a knob and not a default.

Practical Example: Tuning a Black-Box Simulator Nobody Could Differentiate

Who: A mechanical engineer optimizing the blade geometry of a small wind turbine.

Situation: Each candidate geometry was scored by a computational-fluid-dynamics simulation that took about four minutes to run and returned a single efficiency number, with no usable gradient.

Problem: The design space had eleven continuous parameters and several local optima where intuitive hand-tuning kept getting stuck, and a grid search over eleven dimensions was hopeless.

Dilemma: Run a sequential gradient-free optimizer that respects the four-minute cost but evaluates one design at a time, or run a population method that needs many evaluations per iteration but could evaluate them all at once on the available 32-core cluster.

Decision: They chose a 32-particle global-best PSO, mapping one particle to one core so that an entire iteration of CFD runs finished in roughly four minutes of wall-clock rather than 32 times that.

How: Particles were scattered to workers, each ran its CFD evaluation independently, the controller reduced to find the best efficiency and broadcast that geometry, and the swarm updated; only a small parameter vector and one score crossed the network per particle per iteration.

Result: Forty iterations (about three hours of wall-clock, versus four days sequentially) found a geometry two percentage points more efficient than the best hand-tuned design, escaping the local optimum that had trapped the manual search.

Lesson: When the objective is an expensive black box, the population size that looks wasteful sequentially becomes free parallel width, and PSO's tiny shared state keeps the cluster busy with compute instead of communication.

6. PSO Against Gradients, and Where It Is Used Advanced

It is worth being precise about when to reach for a swarm and when not to. PSO is derivative-free: it never needs $\nabla f$, only the ability to evaluate $f$. That is its whole reason for existing and its main weakness inverted. When you do have a cheap, reliable gradient, gradient-based optimization is dramatically more sample-efficient, converges with provable rates, and should be your default; throwing a swarm at a smooth convex problem is wasteful. PSO earns its keep precisely where gradients are unavailable or unhelpful: black-box objectives you can only query, non-differentiable or discontinuous objectives, noisy simulations, and rugged multi-modal landscapes where a gradient would just slide you into the nearest local pit. In those regimes its population-based exploration is a genuine advantage rather than a crutch.

The most common place a reader of this book will meet PSO is hyperparameter optimization. Validation performance as a function of learning rate, weight decay, and architecture choices is a black box with no gradient and many local optima, and each evaluation is a full training run, expensive and independent. That is the PSO sweet spot, and population-based search sits alongside random search, Bayesian optimization, and evolutionary methods in the distributed HPO toolkit that Chapter 21 builds out, where the parallel evaluation of a population across a cluster is the central engineering concern. Beyond HPO, PSO and its relatives are workhorses of engineering design optimization: antenna and circuit layout, structural and aerodynamic shape design, controller tuning, and power-system scheduling, wherever a simulator scores a continuous design vector and no gradient is on offer.

Library Shortcut: pyswarms and scikit-opt Replace the Loop

The roughly thirty lines of Code 31.4.1 are a teaching implementation; in practice you reach for a library that provides the topologies, coefficient schedules, boundary handling, and parallel objective evaluation already tested. pyswarms offers global-best and local-best optimizers as two classes, and scikit-opt packs PSO alongside other swarm and evolutionary methods behind one interface:

# pip install pyswarms
import numpy as np
import pyswarms as ps

def rastrigin(P):                                  # P: (n_particles, d), vectorized
    return 10*P.shape[1] + np.sum(P**2 - 10*np.cos(2*np.pi*P), axis=1)

opts = {"w": 0.72, "c1": 1.49, "c2": 1.49}
opt = ps.single.GlobalBestPSO(n_particles=40, dimensions=2, options=opts)
best_cost, best_pos = opt.optimize(rastrigin, iters=60)   # swap for LocalBestPSO for lbest
Code 31.4.2: The same global-best swarm in a handful of lines with pyswarms. Switching to LocalBestPSO (and passing a neighborhood size k) gives the lbest variant; the library handles velocity clamping, boundary conditions, convergence history, and an n_processes argument that farms the objective evaluations out across processes, the parallelism that Section 4 motivates.

7. Strengths, Weaknesses, and Honest Limits Intermediate

PSO's strengths are real and explain its popularity. It is simple to implement and to explain, it has very few parameters to set, it makes no assumptions about the objective beyond being able to evaluate it, and its parallel structure matches expensive black-box problems beautifully. Those four properties, simplicity, few knobs, generality, and parallelism, are why it shows up across engineering disciplines that have no interest in optimization theory for its own sake.

The weaknesses are equally real and must be stated plainly. PSO is a heuristic: it offers no guarantee of finding the global optimum and no convergence-rate bound of the kind gradient methods enjoy on well-behaved problems. It can stagnate, the swarm collapsing onto a suboptimal point with all velocities near zero and no mechanism to reignite exploration, which is the failure mode of overly social settings. Its performance is sensitive to the coefficients and the population size, so the method that has "few parameters" still requires some tuning to work well on a hard problem. And on high-dimensional problems the population needed to cover the space grows uncomfortably, so PSO is most at home in low to moderate dimensions. None of this disqualifies it; it means you use a swarm where its assumptions fit (expensive, non-differentiable, low-to-moderate-dimensional black boxes) and reach for gradients or Bayesian optimization where theirs fit better. The collective failure modes of swarms in general, stagnation among them, get their own dedicated treatment later in this chapter.

Research Frontier: Swarms Meet Surrogates and Learned Optimizers (2024 to 2026)

Two active lines are reshaping where PSO sits. The first is surrogate-assisted swarm optimization for expensive objectives: a cheap learned model (a Gaussian process or a neural surrogate) approximates the costly $f$, the swarm searches the surrogate aggressively, and only the most promising particles are evaluated on the real objective, cutting the number of expensive evaluations by large factors. Recent work pushes this to high-dimensional and constrained engineering problems where vanilla PSO struggles, and pairs it with the distributed-evaluation infrastructure of Chapter 21. The second line treats the swarm itself as something to be learned or hybridized: adaptive and self-tuning PSO variants adjust $w$, $c_1$, $c_2$ online from swarm diagnostics rather than fixing them, and hybrids splice PSO's global exploration with a local gradient or quasi-Newton polish for the final descent. The throughline for this book is that the parallel-evaluation backbone stays the same; the research is about spending those parallel evaluations more wisely, which is exactly the comparison this section drew against gradient methods, now made adaptive.

You now have the continuous-optimization member of the swarm-intelligence family: a population of particles, a three-term velocity rule, two coefficients that trade exploration against exploitation plus an inertia that paces the search, a topology choice that trades convergence speed against robustness, and a parallel structure that reduces communication to a single best-position vector. The next section keeps the swarm but drops the objective function entirely: in flocking and distributed consensus, the agents have no global goal to optimize, only local rules of attraction, alignment, and separation, and we ask what coherent collective motion emerges and when the flock can be said to agree. That story begins in Section 31.5.

Exercise 31.4.1: Reading the Three Coefficients Conceptual

Without running code, predict the qualitative effect of each change to Code 31.4.1, and explain it in terms of the inertia, cognitive, and social terms: (a) setting $w = 0.2$ from the first iteration; (b) setting $c_2 = 3.0$ while leaving $c_1 = 1.49$; (c) setting $c_1 = 3.0$ while leaving $c_2 = 1.49$; (d) setting all three of $w$, $c_1$, $c_2$ near zero. For each, state whether the swarm is more likely to converge prematurely, stagnate, or wander, and tie your answer to the exploration-exploitation balance.

Exercise 31.4.2: A Deceptive Landscape Where lbest Wins Coding

Replace the Rastrigin objective in Code 31.4.1 with a one-dimensional or two-dimensional function that has a wide, deep decoy basin near the starting region and a narrow global minimum far away (for example, a sum of two Gaussians of different widths and depths). Run both gbest and lbest from many random seeds (say 50) and report, for each topology, the fraction of runs that find the true global minimum rather than the decoy. Explain how the result illustrates the premature-convergence weakness of gbest and the robustness that the ring topology of Section 3 buys, and relate the effect to how fast information travels on the communication graph.

Exercise 31.4.3: Costing the Parallel Speedup Analysis

Suppose each objective evaluation takes $T_f$ seconds, the swarm has $n$ particles, you run $I$ iterations, and you have $W$ workers ($W \le n$). Ignoring the swarm-update arithmetic, write the wall-clock time for a sequential run and for a parallel run that scatters particles across workers and does one all-reduce per iteration costing $T_c$ seconds. Using the all-reduce cost model from Section 4.3, argue why the parallel speedup approaches $W$ when $T_f \gg T_c$, and identify the regime (cheap objective, large swarm, slow network) in which the all-reduce stops being negligible. State the condition on $T_f$ and $T_c$ under which parallelizing PSO is worthwhile.