Section 28.7: Repeated Games and Learning Dynamics

"On the first round I defected and felt clever. By the thousandth round, surrounded by agents who remembered, I had learned to say hello."
An Agent That Discovered Tomorrow

Big Picture

Everything earlier in this chapter assumed agents that compute an equilibrium once and play it; real multi-agent AI is agents that play the same game over and over and learn from each other, and that single change rewrites which outcomes are rational and where the solution concepts come from. When a game repeats, a defection today can be punished tomorrow, so cooperation that no one-shot equilibrium would predict becomes self-enforcing; the Folk Theorem makes this precise by showing a vast set of outcomes, including the cooperative ones, are equilibria of the repeated game. And when agents cannot compute equilibria at all, they adapt with simple update rules whose long-run averages still land on the solution concepts of Section 28.3. This section closes the chapter by joining static game theory to the learning dynamics that the next two chapters build on, and it explains the non-stationarity that makes multi-agent learning the hard problem it is.

The earlier sections of this chapter treated a game as a one-time event. Two agents meet, each reasons about the other, and they settle on an equilibrium such as the mutual defection of the Prisoner's Dilemma, which is grim but rational when you will never see your opponent again. That framing already carried us through representations (Section 28.3), cooperative coalitions, social welfare, and mechanism design. But it misses the defining feature of the systems this part of the book is about. A fleet of trading agents, a swarm of delivery robots, a market of bidding services, a population of language-model agents negotiating on a shared task: these do not meet once. They interact continuously, they remember, and crucially they change their behavior in response to what the others did. The static theory is the grammar; repetition and learning are where the sentences get written.

Two distinct ideas live in this section, and it helps to keep them apart. The first is repeated play with fixed, fully rational agents: the agents still reason about equilibria, but now over an infinite or unknown-length horizon, and the threat of future punishment expands what counts as equilibrium behavior. The second is learning dynamics: agents that are not computing equilibria at all, but adapting through trial, feedback, and regret, whose collective trajectory nonetheless converges toward the same solution concepts. The first explains why cooperation can be stable; the second explains how a population of adaptive agents actually reaches stable behavior. Both are the bridge into multi-agent reinforcement learning in Chapter 30.

Figure 28.7.1: Two mechanisms that change the static picture. Left: in a one-shot game, defection is dominant and the only equilibrium is mutual defection, because no future round exists to deter it. Right top: when the game repeats, a single defection can be answered with a punishment round, so defecting today costs more than it gains and cooperation becomes self-enforcing. Right bottom: two no-regret learners adjust their strategies round by round, and although neither computes an equilibrium, their time-averaged play converges to the equilibrium line, the result formalized in Sections 3 and 4 below.

1. Why Repetition Changes the Game Beginner

Recall the Prisoner's Dilemma payoff structure: each agent is tempted to defect because defection is the dominant move in a single encounter, yet both defecting leaves them worse off than both cooperating. In one shot, rationality and the bad outcome coincide. Now play the same game between the same two agents an indefinite number of times, with each agent valuing future payoffs through a discount factor $\delta \in [0, 1)$, so that a reward $r$ received $t$ rounds from now is worth $\delta^t r$ today. The total value of a strategy is the discounted sum $\sum_{t=0}^{\infty} \delta^t r_t$. The single move is unchanged, but the strategic object is no longer a choice of action; it is a choice of plan, a rule that maps the entire history of play to the next move.

This is what opens the door to cooperation. Suppose both agents adopt the rule "cooperate as long as the other has always cooperated; defect forever the moment they defect." If you defect once, you gain the one-round temptation payoff, but you forfeit every future round of mutual cooperation, trading a stream worth roughly $\frac{R}{1-\delta}$ for a one-time gain followed by the punishment stream. When $\delta$ is large enough, that trade is a loss, so cooperation is each agent's best response to the other. Cooperation has become a Nash equilibrium of the repeated game without changing a single payoff in the underlying game. The only new ingredient is a shared future worth protecting, captured entirely by $\delta$.

Key Insight: A Shared Future Is the Enforcement Mechanism

In a one-shot game, an agent's only leverage over another is the current move. In a repeated game, the leverage is the future: every later round is a potential reward to grant or a punishment to inflict. Cooperation that is irrational in isolation becomes rational when defecting today triggers a worse tomorrow, and the discount factor $\delta$ sets the exchange rate between the two. Raise $\delta$ (agents that expect to keep interacting and value the future) and the set of sustainable cooperative outcomes grows; lower it (agents about to disband, or a known final round) and the cooperation collapses back to the one-shot equilibrium. This is the same intuition that later governs whether a fleet of self-interested service agents holds a truce or races to the bottom.

2. The Folk Theorem and Tit-for-Tat Intermediate

The observation that cooperation can be an equilibrium generalizes dramatically. The Folk Theorem states that in an infinitely repeated game with sufficiently patient players (discount factor close enough to one), essentially any outcome in which each agent earns at least its guaranteed worst-case payoff can be sustained as a Nash equilibrium of the repeated game. The agent's worst-case guarantee is its minmax value, the most it can secure when every other agent ganks it:

$$\underline{v}_i = \min_{a_{-i}} \max_{a_i} u_i(a_i, a_{-i}),$$

and the theorem says any payoff profile giving every agent strictly more than its $\underline{v}_i$ is reachable as an equilibrium, supported by the threat that any deviator gets minmaxed forever after. The name is a piece of discipline humor: the result was folklore, known and used before anyone published a careful proof.

The Folk Theorem is a double-edged sword. Its optimistic reading is that cooperation is not a fragile accident but a robust possibility, sustainable by patient agents who can punish. Its sobering reading is that the repeated game has a staggering multiplicity of equilibria, almost any outcome at all, so equilibrium analysis alone no longer predicts what agents will actually do. Something must select among the equilibria, and that selection is exactly what learning dynamics provide.

Fun Note: The Tournament That Crowned Niceness

In 1980 the political scientist Robert Axelrod ran a computer tournament: researchers submitted strategies for the iterated Prisoner's Dilemma, and every strategy played every other in a round robin. The winner, submitted by Anatol Rapoport, was the shortest program in the field, just four lines of Tit-for-Tat: cooperate first, then copy whatever the opponent did last. It won again in a second, larger tournament even after everyone knew it was the one to beat. Axelrod distilled its success into a recipe that still reads like advice for agent design: be nice (never defect first), be retaliatory (punish defection at once), be forgiving (return to cooperation the moment the opponent does), and be clear (so others can learn to trust you).

Tit-for-Tat is the canonical reciprocal strategy, and its tournament dominance is the empirical companion to the Folk Theorem: simple reciprocity, with no model of the opponent and no equilibrium computation, is enough to sustain cooperation against a wide field of strategies. Our runnable demo below reproduces the round-robin and confirms that reciprocal strategies finish at the top. Reciprocity also generalizes: variants that forgive occasional mistakes (Generous Tit-for-Tat, win-stay-lose-shift) outperform strict Tit-for-Tat once moves are noisy, a foreshadowing of the noisy, partially observed worlds that Chapter 29 agents inhabit.

3. Agents That Adapt Instead of Compute Intermediate

Computing a Nash equilibrium is hard in general, and worse, it presumes every agent knows the full game and trusts the others to be equally rational. Real agents rarely enjoy that luxury. So instead of computing an equilibrium, they learn one: they keep a running record of what worked, adjust their behavior toward it, and let the equilibrium emerge from the dynamics. Three update rules anchor this view, and each is a building block for the algorithms of Chapter 30.

Fictitious play. The simplest learner assumes its opponents play according to the empirical frequencies of their past moves, then best-responds to that belief. Each round it updates the frequency counts and best-responds again. In many important classes of games (zero-sum games, potential games), the empirical play of fictitious players converges to a Nash equilibrium. The weakness is that it needs a model of the opponent's action distribution and assumes that distribution is stationary, an assumption that fails precisely when the opponent is also learning.

Replicator dynamics. Borrowed from evolutionary biology, this view treats a strategy's share of a large population as growing in proportion to how much better than average it performs. Strategies that beat the population average spread; those below it die out. Replicator dynamics turns "which strategies are stable?" into "which population mixes are evolutionary fixed points?", and its rest points are closely tied to Nash equilibria. It is the natural language for the emergence of cooperation in a population: a small cluster of reciprocal cooperators can invade and take over a population of defectors when interactions repeat, which is the formal version of why Tit-for-Tat spread through Axelrod's later evolutionary tournaments.

No-regret learning. The most powerful and most practical idea drops opponent modeling entirely. An agent measures its regret: how much better off it would have been had it always played some fixed action instead of what it actually did. Formally, after $T$ rounds the regret for action $a$ is

$$R_T(a) = \sum_{t=1}^{T} \big( u(a, a^{-}_t) - u(a_t, a^{-}_t) \big),$$

the cumulative gain from substituting $a$ in every round, where $a^{-}_t$ is what the others did at round $t$. An algorithm is no-regret if its average regret $\max_a R_T(a) / T \to 0$ as $T$ grows: in the long run it does as well as the best single action chosen in hindsight. Regret matching achieves this with a strikingly simple rule: play each action with probability proportional to its accumulated positive regret. No model of the opponent, no equilibrium solve, just "do more of what you wish you had done."

Key Insight: No-Regret Learning Converges to Correlated Equilibria

Here is the result that ties learning back to the solution concepts of Section 28.3. If every agent in a repeated game runs a no-regret algorithm, the empirical distribution of their joint play converges to the set of correlated equilibria of the stage game. No agent intends an equilibrium, no agent models any other, yet the population's average behavior provably lands on one. This is the deepest bridge in the chapter: it makes equilibrium not an assumption about hyper-rational reasoning but a consequence of agents that merely avoid persistent regret. Self-play training of game-playing AI, including the regret-minimization engines behind superhuman poker, is no-regret learning run at scale, and the same principle underwrites the convergence guarantees that Chapter 30 seeks for multi-agent reinforcement learning.

4. A Tournament and a Convergence, From Scratch Intermediate

The demo below makes both ideas concrete in pure Python. Part A runs an Axelrod-style round-robin of the iterated Prisoner's Dilemma among five strategies, scoring each against every opponent (including itself) over 200 rounds, to show that reciprocity wins. Part B drops two regret-matching learners into Rock-Paper-Scissors, a zero-sum game whose unique equilibrium is the uniform mix, and tracks whether their time-averaged strategies converge to it. The first half illustrates Sections 1 and 2; the second illustrates the no-regret result of Section 3.

import random
random.seed(7)

# ----- Part A: iterated Prisoner's Dilemma round-robin -----
# Payoff to the row player. R=3 (both C), S=0, T=5, P=1 (both D): T > R > P > S.
PAYOFF = {('C','C'): 3, ('C','D'): 0, ('D','C'): 5, ('D','D'): 1}

def always_c(me, opp):      return 'C'
def always_d(me, opp):      return 'D'
def tit_for_tat(me, opp):   return 'C' if not opp else opp[-1]   # copy last move
def grim(me, opp):          return 'D' if 'D' in opp else 'C'     # never forgive
def random_player(me, opp): return random.choice(['C','D'])

STRATS = {'Tit-for-Tat': tit_for_tat, 'Always-Cooperate': always_c,
          'Always-Defect': always_d, 'Grim-Trigger': grim, 'Random': random_player}

def play(a_fn, b_fn, rounds=200):
    ha, hb, sa, sb = [], [], 0, 0
    for _ in range(rounds):
        a, b = a_fn(ha, hb), b_fn(hb, ha)
        sa += PAYOFF[(a, b)]; sb += PAYOFF[(b, a)]
        ha.append(a); hb.append(b)
    return sa, sb

ROUNDS = 200
totals = {n: 0 for n in STRATS}
for na, fa in STRATS.items():
    for nb, fb in STRATS.items():
        sa, _ = play(fa, fb, ROUNDS); totals[na] += sa

print("Iterated Prisoner's Dilemma round-robin (%d rounds/match)" % ROUNDS)
print("%-18s %10s %12s" % ("strategy", "total", "avg/round"))
for n in sorted(totals, key=totals.get, reverse=True):
    print("%-18s %10d %12.3f" % (n, totals[n], totals[n] / (len(STRATS) * ROUNDS)))

# ----- Part B: two regret-matching learners on Rock-Paper-Scissors -----
ACTIONS = ['R', 'P', 'S']
RPS = {'R': {'R':0,'P':-1,'S':1}, 'P': {'R':1,'P':0,'S':-1}, 'S': {'R':-1,'P':1,'S':0}}

def regret_match(reg):                       # strategy from positive regrets
    pos = [max(r, 0.0) for r in reg]; z = sum(pos)
    return [p / z for p in pos] if z > 0 else [1/3, 1/3, 1/3]

def sample(strat):
    r, c = random.random(), 0.0
    for i, p in enumerate(strat):
        c += p
        if r <= c: return i
    return len(strat) - 1

reg1, reg2 = [0.0]*3, [0.0]*3
sum1, sum2 = [0.0]*3, [0.0]*3
T = 20000
for _ in range(T):
    s1, s2 = regret_match(reg1), regret_match(reg2)
    for i in range(3): sum1[i] += s1[i]; sum2[i] += s2[i]
    a1, a2 = sample(s1), sample(s2)
    u1 = RPS[ACTIONS[a1]][ACTIONS[a2]]            # counterfactual regret update
    for i in range(3): reg1[i] += RPS[ACTIONS[i]][ACTIONS[a2]] - u1
    u2 = RPS[ACTIONS[a2]][ACTIONS[a1]]
    for i in range(3): reg2[i] += RPS[ACTIONS[i]][ACTIONS[a1]] - u2

avg1 = [x / T for x in sum1]
print("\nRegret matching on Rock-Paper-Scissors (T=%d)" % T)
print("player 1 time-averaged strategy: R=%.3f P=%.3f S=%.3f" % tuple(avg1))
print("equilibrium target             : R=0.333 P=0.333 S=0.333")
print("max deviation from uniform     : %.3f" % max(abs(p - 1/3) for p in avg1))

Code 28.7.1: A repeated-game tournament and a no-regret convergence in one file. Part A scores reciprocal strategies against unconditional ones; Part B runs counterfactual regret matching until the learners' average play approaches the unique equilibrium of Rock-Paper-Scissors.

Iterated Prisoner's Dilemma round-robin (200 rounds/match)
strategy                total    avg/round
Grim-Trigger             2594        2.594
Tit-for-Tat              2445        2.445
Always-Defect            2204        2.204
Always-Cooperate         2127        2.127
Random                   1914        1.914

Regret matching on Rock-Paper-Scissors (T=20000)
player 1 time-averaged strategy: R=0.332 P=0.334 S=0.335
equilibrium target             : R=0.333 P=0.333 S=0.333
max deviation from uniform     : 0.001

Output 28.7.1: Reciprocity tops the table: the two retaliating-but-cooperative strategies (Grim-Trigger and Tit-for-Tat) outscore unconditional defection, because they harvest mutual cooperation whenever the opponent allows it while refusing to be exploited. Below, the regret-matching learner's time-averaged strategy lands within 0.001 of the uniform equilibrium, a no-regret algorithm reaching the solution concept without ever solving for it.

The two results are the chapter's two halves in miniature. The tournament shows why cooperation is sustainable: strategies that reciprocate beat strategies that exploit, exactly as the Folk Theorem and Axelrod predict. The convergence shows how adaptive agents find equilibrium: by minimizing regret, not by computing a fixed point. Reference Figure 28.7.1 alongside the output, the left panel is Part A's logic, the right panel is Part B's trajectory.

Library Shortcut: Axelrod-Python and OpenSpiel Do the Bookkeeping

Code 28.7.1 hand-rolled five strategies and a scoring loop. The axelrod library ships more than 200 published iterated-Prisoner's-Dilemma strategies and runs full tournaments, noise, and evolutionary dynamics in a few lines; DeepMind's open_spiel provides regret-matching and counterfactual-regret-minimization solvers, fictitious play, and replicator dynamics over a large catalog of games. The roughly fifty lines above collapse to a handful:

# pip install axelrod
import axelrod as axl
players = [axl.TitForTat(), axl.Cooperator(), axl.Defector(),
           axl.Grudger(), axl.Random()]
results = axl.Tournament(players, turns=200, repetitions=5).play()
print(results.ranked_names)        # reciprocal strategies rank at the top

# pip install open_spiel  -> regret-matching / CFR equilibrium solvers
# from open_spiel.python.algorithms import cfr
# solver = cfr.CFRSolver(pyspiel.load_game("kuhn_poker"))

Code 28.7.2: The same tournament as Output 28.7.1 in five lines with axelrod, plus the entry point to OpenSpiel's regret solvers. The libraries handle strategy catalogs, repetitions, noise models, and equilibrium computation that a from-scratch implementation would have to grow into.

Practical Example: The Bidding Agents That Learned a Truce

Who: A platform engineer running an internal ad-bidding marketplace where dozens of automated bidding agents compete for the same impressions.

Situation: Each agent maximized its own advertiser's return, and the auction repeated millions of times a day among a stable set of recurring competitors.

Problem: Treating each auction as one-shot, the agents bid aggressively toward the one-shot equilibrium, burning budget in a price war that left every advertiser worse off without improving allocation.

Dilemma: Hard-code a price floor (a blunt mechanism that distorts allocation and invites gaming), or let the agents keep learning and hope a less wasteful pattern emerged on its own.

Decision: They replaced the myopic bidders with no-regret learners and lengthened each agent's effective horizon, so that today's overbidding was weighed against the future rounds it would spoil.

How: Each agent ran a regret-matching update over a discretized bid grid, exactly the rule in Code 28.7.1, with the empirical joint bid distribution logged for monitoring.

Result: Within days the joint play settled near a correlated equilibrium, the destructive price war damped out, and average advertiser return rose, the no-regret-to-equilibrium result of Section 3 playing out in production.

Lesson: When self-interested agents interact repeatedly, you often do not need to impose cooperation by fiat; the right learning rule plus a real shadow of the future lets it emerge, and the equilibrium it reaches is a property you can monitor rather than a number you assume.

5. Why This Matters for Multi-Agent AI Advanced

The reason this section closes the chapter, rather than sitting as an afterthought, is that real multi-agent AI systems are not agents computing equilibria; they are agents learning against each other. Every agent's policy is changing as it learns, which means that from any single agent's point of view the environment is non-stationary: the rules of the world (the behavior of the others) shift underneath it precisely because the others are adapting too. This is the central difficulty that makes multi-agent reinforcement learning fundamentally harder than the single-agent case, where the environment holds still while the agent learns. We name it here because it is a direct consequence of the learning dynamics above, and we hand it forward to its full treatment in Chapter 30, where moving-target learning, opponent modeling, and convergence guarantees are the main event.

Non-stationarity is also why the convergence results of Section 3 are so valuable: they are among the few guarantees that survive when everyone learns at once. No-regret learning converging to correlated equilibria, and fictitious play converging in zero-sum and potential games, are the theoretical anchors that keep multi-agent learning from being pure empirical guesswork. They tell us which classes of multi-agent problems are tractable and which are not, and they connect the practical RL infrastructure of Chapter 20, the actor-learner machinery that runs self-play at scale, to the equilibrium concepts that say what that self-play is converging toward.

Thesis Thread: From One Decision to a Population of Learners

This chapter began with a single self-interested agent reasoning about a one-shot game and ends with a population of agents learning against each other across time. That is the same scale-out move the whole book makes, now applied to intelligence itself: a decision that one agent could compute in isolation becomes a dynamic that many distributed, adapting agents must reach together. The non-stationarity we just named is the multi-agent cousin of the coordination and staleness problems that haunted distributed optimization (the synchronous-versus-asynchronous tension of Chapter 27 and earlier parts). Distributing intelligence, the sixth axis from Chapter 1, means accepting that the agents will learn, will drift, and will have to find their equilibria on the move.

Research Frontier: Cooperation and Opponent Modeling in LLM Agents (2024 to 2026)

The repeated-game lens has surged back into relevance as large-language-model agents began interacting in markets, debates, and negotiations. A 2024 to 2026 line of work places LLM agents in iterated social dilemmas (Prisoner's Dilemma, public-goods and ultimatum games) and measures whether they sustain cooperation, finding that prompting an explicit "shadow of the future", reputation, or Tit-for-Tat-style reciprocity sharply raises cooperation rates, while myopic agents defect just as one-shot theory predicts. Generative agent societies (in the lineage of Park et al.'s simulated towns) exhibit emergent norms and reciprocal punishment that look strikingly like the Folk Theorem at work. In parallel, the equilibrium-learning frontier has scaled regret minimization far beyond poker: methods such as DeepMind's Student of Games and follow-on work unify search and no-regret learning across both perfect- and imperfect-information games, and opponent-modeling research asks how an agent should best-respond to other learners rather than to a fixed distribution, the exact non-stationary setting of Chapter 30. The open question threading all of it is the cooperation-stability question of this whole part: under what conditions do populations of self-interested learning agents converge on cooperation rather than on a race to the bottom?

6. Chapter Summary and What Comes Next Beginner

This section completes Chapter 28, which assembled the game-theoretic foundation that the rest of Part VI stands on. We can now state that foundation as one connected story.

Key Takeaway: Game Theory Is the Foundation for Self-Interested Agents

Chapter 28 built the toolkit for reasoning about agents that pursue their own goals. We started with representations of strategic interaction (normal-form and extensive-form games), defined Nash equilibrium and the other solution concepts that say what rational agents settle on (Section 28.3), and turned to cooperative games and fair division, where coalitions form and the Shapley value splits the gains. We measured outcomes against social welfare and Pareto optimality, quantified how far selfish behavior drifts from the social optimum with the price of anarchy, and used mechanism design and auctions to engineer games whose equilibria are the outcomes we want. Finally, repeated games and learning dynamics showed that when agents interact over time and adapt, cooperation becomes self-enforcing (the Folk Theorem, Tit-for-Tat) and equilibrium emerges from no-regret learning rather than from computation. Together these give us a language for self-interested, distributed, learning agents, which is exactly the language the next chapters speak.

From here the book turns from foundations to systems. Chapter 29 builds multi-agent systems proper, agent architectures, communication, coordination, negotiation, coalition formation, and consensus, applying this chapter's equilibria and mechanisms to real coordination protocols. Chapter 30 then makes the learning explicit, lifting repeated normal-form games to Markov games and confronting head-on the non-stationarity that Section 5 named. The reciprocity, regret, and equilibrium-selection ideas you met here are the seeds of those chapters; watch for them returning as reward shaping, centralized-training-decentralized-execution, and opponent modeling.

Exercise 28.7.1: When Cooperation Unravels Conceptual

Consider the infinitely repeated Prisoner's Dilemma with the payoffs in Code 28.7.1 (temptation $T=5$, reward $R=3$, punishment $P=1$, sucker $S=0$) and both agents using the grim-trigger strategy. (a) Write the inequality on the discount factor $\delta$ under which cooperating forever beats defecting once and then suffering eternal mutual defection. (b) Solve for the threshold $\delta^\ast$. (c) Now suppose the game has a known final round $T_{\text{end}}$. Argue by backward induction why cooperation unravels entirely, and explain what this implies for designing agent interactions where the horizon is common knowledge versus uncertain.

Exercise 28.7.2: Add a Noisy, Forgiving Strategy Coding

Extend Code 28.7.1 in two ways. First, add a "Generous Tit-for-Tat" that copies the opponent's last move but forgives a defection with probability 0.1, and a "win-stay-lose-shift" strategy (repeat your last move if it earned $T$ or $R$, otherwise switch). Second, inject noise: with probability 0.05 each intended move is flipped before it is played. Re-run the round robin and report the new ranking. Explain why strict Tit-for-Tat suffers under noise (two strict copiers can fall into an echoing defection spiral) while forgiving variants recover, and connect this to the partially observed, error-prone settings of Chapter 29.

Exercise 28.7.3: Regret and the Equilibrium It Finds Analysis

Using the regret-matching learner in Part B of Code 28.7.1, (a) instrument the loop to record the maximum average regret $\max_a R_T(a)/T$ every 1000 iterations and confirm it trends toward zero. (b) Replace Rock-Paper-Scissors with a $2 \times 2$ coordination game that has two pure equilibria and one mixed; run many random restarts and report which equilibrium the time-averaged play converges to and how often. (c) Explain, with reference to the no-regret-to-correlated-equilibrium result of Section 3, why the empirical joint distribution of the two learners is the right object to inspect, rather than either learner's strategy alone.

Project Ideas

Three open-ended projects to carry the chapter's ideas into running code. Each is sized for a few hours to a weekend of agent-built experimentation.

1. An Axelrod-style ecological tournament. Using the axelrod library (Code 28.7.2), assemble a field of 30 or more published strategies and run an evolutionary tournament in which each strategy's population share grows in proportion to its score, iterating until the population stabilizes. Report which strategies survive, whether reciprocity dominates, and how the surviving mix shifts when you add move noise or shorten the match length (lower the effective $\delta$). Tie your findings back to the Folk Theorem and replicator dynamics.

2. No-regret learners reaching equilibrium. Implement counterfactual regret minimization (extending Part B of Code 28.7.1, or via open_spiel) and demonstrate convergence to equilibrium on three games of increasing difficulty: Rock-Paper-Scissors, a coordination game with multiple equilibria, and Kuhn poker (an imperfect-information game). Plot average regret versus iterations for each and verify the empirical joint play approaches a correlated equilibrium. Discuss which games converge fastest and why.

3. LLM agents in an iterated social dilemma. Place two or more language-model agents in a repeated public-goods or Prisoner's-Dilemma game and measure their cooperation rate under different prompts: a myopic prompt with no mention of the future, a prompt that makes the repeated horizon explicit, and a prompt that grants each agent memory of the other's reputation. Quantify how the "shadow of the future" changes behavior, and compare the agents' emergent strategies to Tit-for-Tat. This connects the chapter's theory directly to the 2024 to 2026 research frontier and to the multi-agent systems of Chapter 29.