Section 28.3: Nash Equilibria and Solution Concepts

"They asked me to predict what the agents would do. I gave them a fixed point and a warning: there are three of these, and I cannot tell you which one they will pick."
An Equilibrium With Selection Anxiety

Big Picture

A solution to a game is not a single best move but a stable joint configuration: a strategy for every agent such that no agent, acting alone, can do better by changing its own choice. That configuration is the Nash equilibrium, and it is the load-bearing prediction of the whole field. When many AI agents share an environment (a fleet of bidders, a swarm of robots, competing recommender policies, a market of trading bots), no single agent's behavior is well defined until we know what the others are doing, so the only coherent notion of "the outcome" is a joint one that is consistent with itself. This section defines that fixed point, proves it always exists for finite games, confronts the inconvenient facts that there can be many of them and that finding one is computationally hard, and explains why distributed AI usually reaches equilibria by learning rather than by solving. Those two facts, multiplicity and hardness, are exactly why the rest of this chapter and the multi-agent learning chapters exist.

The previous section gave us the two ways to write a game down: the normal form, a payoff matrix over simultaneous choices, and the extensive form, a tree of sequential moves. Writing a game down is not the same as saying what will happen in it. We now need a solution concept: a precise rule that takes a game and returns the strategy profiles we are willing to call outcomes. A good solution concept is predictive (agents acting in their own interest will tend to land there), self-consistent (it does not assume an agent will throw away payoff), and computable, at least in principle. The central solution concept, the one almost every multi-agent method in this book ultimately leans on, is the Nash equilibrium, and the cleanest way into it is to ask what each agent would do if it already knew everyone else's plan.

1. Best Response: The Atom of Every Solution Beginner

Fix all the other agents' strategies and ask a single agent a purely local question: given what everyone else is doing, what is the best I can do? The answer is the agent's best response. Write a strategy profile as $\sigma = (\sigma_i, \sigma_{-i})$, where $\sigma_i$ is agent $i$'s own strategy and $\sigma_{-i}$ collects the strategies of all the others. Agent $i$'s strategy is a best response when it maximizes $i$'s expected payoff $u_i$ holding the rest fixed:

$$\sigma_i \in \operatorname*{arg\,max}_{\sigma_i'} \; u_i(\sigma_i', \sigma_{-i}).$$

Best response is a one-agent computation: it needs no agreement, no global view, only the beliefs an agent holds about the others. That locality is exactly why it scales to many interacting agents and why it is the building block of both the equilibrium definition below and the learning dynamics at the end of this section. The catch is that everyone is computing a best response at the same time, against a target that is itself shifting. A solution is the place where all of those local optimizations stop pointing somewhere else.

Key Insight: An Equilibrium Is a Mutual Best Response, Not a Global Optimum

A Nash equilibrium asks only that every agent is best-responding to the others. It does not ask that the outcome be good for the group, efficient, or fair. The prisoner's dilemma equilibrium (both defect) is strictly worse for both agents than mutual cooperation, yet it is the unique Nash equilibrium because each agent, taking the other's choice as fixed, prefers to defect. Equilibrium predicts what self-interested agents settle on; whether that settlement is any good is a separate question, the price-of-anarchy question we open in Section 28.5.

2. The Nash Equilibrium Intermediate

A strategy profile $\sigma^* = (\sigma_1^*, \dots, \sigma_n^*)$ is a Nash equilibrium when every agent is simultaneously best-responding to all the others. Equivalently, no agent can raise its own expected payoff by a unilateral deviation:

$$u_i(\sigma_i^*, \sigma_{-i}^*) \;\ge\; u_i(\sigma_i', \sigma_{-i}^*) \qquad \text{for every agent } i \text{ and every alternative } \sigma_i'.$$

Read the inequality slowly: for each agent $i$, fixing everyone else at $\sigma_{-i}^*$, no other strategy $\sigma_i'$ does strictly better than $\sigma_i^*$. That is the entire content of the concept. It is a fixed point of the joint best-response map: plug the profile into "everyone best-responds" and the same profile comes back out. A pure-strategy equilibrium has each agent committing to one action; a mixed-strategy equilibrium has agents randomizing, and the inequality is then over expected payoffs. The randomization is not hand-waving: in a mixed equilibrium each agent is deliberately indifferent among the actions it mixes over, because if it strictly preferred one, it would put all its weight there and the profile would not be stable.

Figure 28.3.1: Two faces of best-response dynamics. On the left, a coordination game: from any off-diagonal cell each agent jumps to its better action, and the motion can only stop on one of the two diagonal cells, which are the pure Nash equilibria (so the game has two, the multiplicity problem of Section 4). On the right, matching pennies: best responses chase each other around a cycle and never settle on a cell, so there is no pure equilibrium; the only equilibrium is the mixed profile $(\tfrac12,\tfrac12)$ marked by the star at the center. The code in Code 28.3.1 computes both of these outcomes directly.

3. Existence: Nash's Theorem Intermediate

It would be a poor foundation if some games simply had no solution. Matching pennies, on the right of Figure 28.3.1, has no pure equilibrium at all: whatever deterministic choice one agent makes, the other has a strict reason to deviate, and the chase never ends. The resolution, and one of the most consequential results in the theory of interacting agents, is Nash's theorem.

Nash's theorem (1950): every finite game, meaning finitely many agents each with finitely many actions, has at least one Nash equilibrium in mixed strategies. The proof applies a fixed-point theorem (Brouwer's, or Kakutani's for the set-valued best-response map) to the joint best-response map on the compact, convex space of mixed-strategy profiles; a fixed point of that map is exactly a profile where everyone is best-responding, which is a Nash equilibrium. The price of guaranteed existence is that the guarantee is only for mixed strategies: matching pennies has an equilibrium, but you have to allow randomization to find it. For AI this is reassuring and demanding at once. Reassuring, because any finite multi-agent interaction we model has a well-defined predicted outcome. Demanding, because that outcome may require agents to randomize, and, as the next section shows, there may be more than one of them and reaching one may be expensive.

4. Multiplicity and the Selection Problem Intermediate

Existence is not uniqueness. The coordination game on the left of Figure 28.3.1 has two pure equilibria, and many games have more. When a game has several equilibria, the theory predicts a set of outcomes but does not, by itself, say which one a population of agents will reach. This is the equilibrium-selection problem, and it is not a technicality: two driver-assist policies that both prefer the same side of the road have two equilibria (all-left and all-right), both perfectly stable, and the entire safety question is which one a fleet coordinates on. Refinements try to narrow the set by adding plausibility conditions an equilibrium must survive, and the most important ones for AI are these.

A dominant-strategy equilibrium is the strongest and rarest: each agent has a single action that is best regardless of what the others do, so no beliefs about opponents are needed at all. When it exists (as in the prisoner's dilemma, where defection dominates), prediction is trivially robust, which is why Section 28.6 prizes mechanisms that make truth-telling a dominant strategy. A subgame-perfect equilibrium refines the set for sequential (extensive-form) games of Section 28.2: it requires the strategy to be a Nash equilibrium in every subgame, ruling out threats that an agent would not actually carry out, and you compute it by backward induction from the leaves of the game tree. A correlated equilibrium goes the other way and enlarges the solution set: a trusted shared signal (a traffic light is the canonical example) privately recommends an action to each agent, and the profile is a correlated equilibrium when no agent wants to deviate from its recommendation given what the signal implies about the others. Correlated equilibria can achieve payoffs no Nash equilibrium reaches, they are easier to compute (a linear program rather than a fixed-point search), and natural learning dynamics converge to them, which is why they recur in multi-agent learning.

Fun Note: The Traffic Light Is a Coordination Device, Not a Suggestion

A correlated equilibrium is the formal reason a traffic light works. The light is a shared random signal that whispers "go" to one direction and "stop" to the other. Neither driver wants to run a red, not out of obedience but because, given that the cross traffic was told "go," stopping is a best response. The signal lets the two of them reach an outcome (one flow moves, nobody crashes) that beats every equilibrium they could coordinate on with no signal at all. A great deal of multi-agent coordination is, underneath, the search for a good shared light.

5. The Computational Reality: PPAD-Hardness and Learning Advanced

Here the theory collides with practice in a way that shapes how AI actually uses equilibria. Nash's theorem proves an equilibrium exists but is non-constructive: it does not hand you one. Computing a Nash equilibrium is, in general, PPAD-complete (Daskalakis, Goldberg, and Papadimitriou, 2009), a complexity class for fixed-point problems widely believed to admit no polynomial-time algorithm. The practical upshot is blunt: for a general game with many agents or many actions, you should not expect to compute an exact Nash equilibrium directly, and any system design that assumes you can is building on sand.

Distributed AI sidesteps the wall with a different question. Instead of solving for an equilibrium, let the agents learn their way toward one by repeatedly playing and adapting. Two dynamics dominate. In fictitious play, each agent best-responds to the empirical frequency of the others' past actions, updating its beliefs as it observes more rounds; for important game classes (zero-sum and certain potential games) the empirical play converges to a Nash equilibrium. In no-regret learning, each agent runs an online algorithm that guarantees its average payoff approaches that of the best fixed action in hindsight; a classical result says that when all agents use no-regret algorithms, the empirical distribution of play converges to the set of correlated equilibria. These dynamics are local, communication-light, and robust to the messiness of real systems, which is exactly the profile distributed AI needs. They are the bridge from the static fixed point of this section to the repeated-game and learning treatment of Section 28.7, and to the policy-learning machinery of multi-agent reinforcement learning in Chapter 30, which builds on the distributed reinforcement-learning infrastructure of Chapter 20.

Research Frontier: Equilibrium-Finding in Enormous Games (2024 to 2026)

The learn-rather-than-solve strategy has scaled to games with more states than there are atoms in the observable universe. Counterfactual regret minimization (CFR), a no-regret dynamic of exactly the kind in Section 5, is the engine behind superhuman poker (Libratus, Pluribus), and a vigorous 2024 to 2026 line pushes it further: predictive and magnetic CFR variants tighten last-iterate convergence rather than only the time-averaged guarantee, and deep-learning function approximators (deep CFR and player-of-games style architectures from DeepMind) replace tabular regret tables so the method reaches games far too large to enumerate. Equilibrium-style reasoning also drives agents in games that mix cooperation and competition: Meta's CICERO reached human-level play in Diplomacy by pairing a planning module that searches for jointly stable action profiles with a language model for negotiation, and follow-on work studies regret-based and no-regret learners as the backbone of large language-model agent populations. The throughline is the message of this section: at scale you do not compute the fixed point, you build a learner whose trajectory provably approaches it.

Thesis Thread: Equilibria Bound What Coordination Can Achieve

This book is about getting many machines to act as one coherent intelligence, and game theory sets the hard limits on that project. When the machines are self-interested agents rather than obedient workers, the reachable outcomes are not whatever a designer wishes but the equilibria of the induced game. You cannot engineer a fleet of bidding, routing, or trading agents into a jointly optimal state if that state is not an equilibrium, because some agent will always have a profitable unilateral deviation and will take it. Equilibrium analysis is therefore the feasibility check for distributed coordination, and the learning dynamics above are how a real system reaches a feasible point without a central solver. The mechanism design of Section 28.6 is the constructive flip side: reshape the game so the equilibrium you want is the one the agents fall into on their own.

6. Computing and Learning Equilibria from Scratch Intermediate

The code below makes every idea in this section concrete with nothing but NumPy. It finds the pure Nash equilibria of two $2 \times 2$ games by direct best-response analysis, derives the mixed equilibrium of matching pennies in closed form from the indifference condition, and then runs fictitious play to show empirical behavior converging toward that mixed equilibrium, the right panel of Figure 28.3.1 brought to life round by round.

import numpy as np

# Two 2-player games as payoff bimatrices (A = row player's payoff, B = column's).
def pure_nash(A, B):
    """Return all pure-strategy Nash equilibria as (row, col) index pairs."""
    n, m = A.shape
    eq = []
    for i in range(n):
        for j in range(m):
            row_best = A[i, j] >= A[:, j].max() - 1e-12   # row can't beat col j
            col_best = B[i, j] >= B[i, :].max() - 1e-12    # col can't beat row i
            if row_best and col_best:
                eq.append((i, j))
    return eq

# Coordination game: two pure equilibria on the diagonal.
A1 = np.array([[3.0, 0.0], [0.0, 2.0]])
B1 = np.array([[2.0, 0.0], [0.0, 3.0]])
print("Game 1  (coordination)")
print("  pure Nash equilibria (row,col):", pure_nash(A1, B1))

# Matching pennies: no pure equilibrium, one mixed equilibrium.
A2 = np.array([[1.0, -1.0], [-1.0, 1.0]])
B2 = -A2
print("Game 2  (matching pennies)")
print("  pure Nash equilibria (row,col):", pure_nash(A2, B2) or "none")

# Closed-form mixed equilibrium: each player randomizes to make the OTHER
# indifferent between its two actions (the indifference condition).
def mixed_2x2(A, B):
    a, b, c, d = A[0,0], A[0,1], A[1,0], A[1,1]
    q = (d - c) / (a - b - c + d)        # col's prob on action 0
    e, f, g, h = B[0,0], B[0,1], B[1,0], B[1,1]
    p = (h - f) / (e - g - f + h)        # row's prob on action 0
    return p, q

p, q = mixed_2x2(A2, B2)
print(f"  mixed equilibrium: row plays action 0 w.p. {p:.3f}, col w.p. {q:.3f}")

# Fictitious play: each agent best-responds to the empirical mix of the other.
def fictitious_play(A, B, steps, seed=0):
    rng = np.random.default_rng(seed)
    row_counts = np.zeros(A.shape[0]); col_counts = np.zeros(A.shape[1])
    row_counts[rng.integers(A.shape[0])] += 1     # seed the history
    col_counts[rng.integers(A.shape[1])] += 1
    for _ in range(steps):
        col_mix = col_counts / col_counts.sum()
        row_mix = row_counts / row_counts.sum()
        row_counts[int(np.argmax(A @ col_mix))] += 1   # row best-responds
        col_counts[int(np.argmax(row_mix @ B))] += 1   # col best-responds
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

for T in (100, 1000, 20000):
    rm, cm = fictitious_play(A2, B2, T)
    print(f"  fictitious play, {T:6d} rounds: row mix {rm.round(3)}, col mix {cm.round(3)}")

Code 28.3.1: Equilibria three ways with only NumPy: an exhaustive best-response scan for pure equilibria, the indifference condition solved in closed form for the mixed equilibrium of matching pennies, and fictitious play as a learning dynamic that converges toward it.

Game 1  (coordination)
  pure Nash equilibria (row,col): [(0, 0), (1, 1)]
Game 2  (matching pennies)
  pure Nash equilibria (row,col): none
  mixed equilibrium: row plays action 0 w.p. 0.500, col w.p. 0.500
  fictitious play,    100 rounds: row mix [0.446 0.554], col mix [0.485 0.515]
  fictitious play,   1000 rounds: row mix [0.516 0.484], col mix [0.494 0.506]
  fictitious play,  20000 rounds: row mix [0.503 0.497], col mix [0.502 0.498]

Output 28.3.1: The coordination game reports its two pure equilibria; matching pennies reports none, plus the closed-form mixed equilibrium $(\tfrac12,\tfrac12)$. Fictitious play, computing nothing but repeated best responses, drifts from a noisy $(0.45, 0.55)$ at 100 rounds to within a few thousandths of the mixed equilibrium by 20,000 rounds. No solver ever ran; the agents learned the fixed point.

The two halves of the output are the two ways AI meets an equilibrium. The top half computes equilibria, which is feasible here only because the games are tiny and the indifference condition is a two-variable equation; scale either game up and the PPAD wall of Section 5 makes this approach hopeless. The bottom half learns one, and that approach keeps working as the game grows, which is why it, not the solver, is the template for large multi-agent systems.

Library Shortcut: nashpy Finds Equilibria in a Few Lines

The exhaustive scan and the closed-form indifference solve in Code 28.3.1 are exactly what a game-theory library packages for you. Nashpy takes the same payoff matrices and enumerates the equilibria, pure and mixed together, through vertex enumeration or the Lemke-Howson algorithm, so the dozen lines of best-response logic collapse to a single call that also handles the mixed cases the hand-rolled scanner would miss:

import nashpy as nash
import numpy as np

game = nash.Game(np.array([[1, -1], [-1, 1]]),    # row payoffs (matching pennies)
                 np.array([[-1, 1], [1, -1]]))    # column payoffs
for row_strategy, col_strategy in game.support_enumeration():
    print(row_strategy, col_strategy)             # -> [0.5 0.5] [0.5 0.5]

Code 28.3.2: The same matching-pennies equilibrium as Output 28.3.1, now from nashpy.support_enumeration. The library handles vertex and support enumeration internally; you supply the bimatrix and read off every equilibrium, mixed strategies included. It remains a small-game tool, since enumeration inherits the same hardness that Section 5 describes.

Practical Example: The Ad-Bidding Fleet That Stopped Oscillating

Who: A platform engineer running a fleet of automated bidding agents in a real-time ad auction.

Situation: Hundreds of advertiser agents each set bids every few seconds against a shared inventory, all optimizing their own return on spend.

Problem: The agents oscillated: each one best-responded to last interval's prices, prices swung, everyone overcorrected, and spend and fill rates lurched instead of settling.

Dilemma: Compute the auction's Nash equilibrium centrally and push fixed bids to every agent, accurate in principle but PPAD-hard at this scale and brittle when advertisers churn, or let the agents keep adapting but tame the instability so behavior converges.

Decision: They abandoned the central solve and reshaped the learning dynamic, replacing raw best-response with a no-regret update that smooths each agent's reaction to recent prices.

How: Each agent ran a regret-matching update over a small set of bid multipliers, mixing over them rather than jumping to the single myopic best response, the smoothed analogue of the fictitious play in Code 28.3.1.

Result: The empirical bid distribution settled toward a stable correlated outcome within minutes of churn instead of oscillating indefinitely, and realized spend tracked targets without any agent solving the full game.

Lesson: At fleet scale you do not compute the equilibrium, you engineer a learning dynamic that converges to one. Convergence, not the closed-form solution, is the deliverable.

7. Why Equilibria Are the Right Lens, and Their Limits Beginner

Equilibria earn their central place because they are the only self-consistent predictions of self-interested behavior: any outcome that is not an equilibrium contains an agent with a reason to move, so it cannot be where the system rests. That makes equilibrium analysis the natural feasibility test for any coordination scheme, the foundation the cooperative games of Section 28.4 and the mechanisms of Section 28.6 build on, and the static backdrop against which the learning dynamics of Section 28.7 play out. The same lens recurs whenever agents interact in this book, from contract-net bidding in distributed problem solving (Chapter 27) to the communication-efficient adaptation studied in distributed optimization (Chapter 10).

The limits are equally important to carry forward. Equilibrium assumes agents are rational and, in the classic definition, that beliefs are somehow consistent, assumptions real bots and humans violate. Multiplicity means the concept often predicts a set, not a point, and leaves selection unresolved. Hardness means exact computation is usually off the table. And equilibrium is silent on quality: the rest point can be far from the social optimum, the gap we will quantify as the price of anarchy in Section 28.5. These are not reasons to abandon the lens but the reasons the chapter continues: cooperative solution concepts, welfare and efficiency, mechanism design, and learning dynamics each address one of these limits in turn.

Exercise 28.3.1: Find Every Equilibrium by Hand Conceptual

Consider the "Battle of the Sexes" coordination game with row payoffs $A = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$ and column payoffs $B = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. (a) Find the two pure-strategy Nash equilibria by checking, cell by cell, that neither agent wants to deviate. (b) Using the indifference condition, find the mixed-strategy equilibrium: the probability $p$ with which the row agent plays action 0 that makes the column agent indifferent, and vice versa. (c) Explain in one sentence why this game's three equilibria make it the textbook illustration of the equilibrium-selection problem from Section 4.

Exercise 28.3.2: Make Fictitious Play Converge (and Watch It Struggle) Coding

Starting from Code 28.3.1, (a) run fictitious_play on the coordination game A1, B1 and report which pure equilibrium the empirical mix converges to; then change the random seed and show that the selected equilibrium can change, a direct demonstration of equilibrium selection. (b) Modify the function to also record, every 1000 rounds, the maximum absolute change in the row mix over the last 1000 rounds, and plot or print it for matching pennies to show convergence slowing as $1/T$. (c) Construct a $3 \times 3$ game (for example, rock-paper-scissors as a zero-sum bimatrix) and confirm fictitious play still approaches its known mixed equilibrium $(\tfrac13, \tfrac13, \tfrac13)$.

Exercise 28.3.3: Cost of Solving vs Cost of Learning Analysis

A platform must coordinate $n$ self-interested agents, each choosing among $m$ actions. (a) Argue why exhaustive best-response checking for pure equilibria costs on the order of $m^n$ profile evaluations, and why this is hopeless even for modest $n$ and $m$ (compute the count for $n = 20$, $m = 5$). (b) Explain why a no-regret learning dynamic instead costs roughly $O(m)$ work per agent per round and converges in a number of rounds independent of the number of agents, then state what the learners give up in exchange (which equilibrium concept they reach, and whether the outcome is exact or average-iterate). (c) Connect this trade-off to the PPAD-hardness of Section 5: which of the two approaches does the hardness result rule out at scale, and which does it leave standing?