Section 32.5: Debate, Critique, and Reflection Across Agents

"I gave a confident wrong answer. Then a second agent gave a confident wrong answer that agreed with mine, and we called it consensus."
A Judge That Forgot to Check for Diversity

Big Picture

A single forward pass commits to one chain of reasoning and lives with its mistakes; a group of agents that disagree, critique, and revise turns extra computation into extra accuracy, the agentic echo of ensembling and code review. The same statistical fact that makes a committee of weak classifiers beat any one member, and the same social fact that makes a second reviewer catch the bug the author was blind to, apply to language agents: independent perspectives catch errors a single pass misses. The patterns in this section, self-reflection, a separate critic, multi-agent debate, and voting, all spend more agent calls to think harder, an instance of inference-time scaling. They are also a consensus protocol over opinions, which means they inherit a failure mode from Section 31.9: when agents copy each other instead of reasoning independently, they herd onto a confident wrong answer, and the gain evaporates. Quality through disagreement only works while the disagreement is real.

The previous section, Section 32.4, ran many agents in parallel to do more work per unit of wall-clock: a fan-out of independent subtasks recombined into one result. This section keeps the fan-out but changes its purpose. Instead of dividing labor, the agents now reconsider the same problem from different angles, and we combine not to assemble a larger answer but to arrive at a better one. The unit of work is no longer a subtask; it is an opinion, and the orchestration question becomes how to aggregate opinions so that the group is more reliable than any member. That is a question Part VI has asked before, in the wisdom of crowds (Section 31.1) and in distributed consensus (Section 29.9), and the answers transfer almost directly to language agents.

Figure 32.5.1: Three ways agents improve quality through disagreement and revision. Left: a generator drafts, and an independent critic approves the draft or sends it back for revision (the generate-critique loop; the self-loop variant, in which one agent plays both roles, is reflection). Middle: several agents argue different positions over rounds and a judge resolves the debate. Right: many independent agents answer and an aggregator takes the majority. All three trade more LLM calls for higher accuracy; the right two are consensus over opinions and depend on the votes being independent.

1. Reflection: An Agent Revises Its Own Work Beginner

The cheapest quality lever is to let a single agent look at its own output and try again. A first pass produces a draft; a second pass, prompted to find flaws in that draft, produces a critique; a third pass revises in light of the critique. Nothing about the model changes between passes, yet the revised answer is frequently better, for two reasons: evaluating an answer is often an easier task than producing one, and reframing the prompt from "solve this" to "what is wrong with this solution?" surfaces errors the generation pass glossed over. This is self-reflection. The Reflexion pattern adds memory to it: the agent writes its self-critique into a scratchpad that conditions the next attempt, so repeated trials accumulate lessons rather than repeating the same mistake.
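The draft, critique, revise cycle with a scratchpad can be sketched as follows. This is a minimal illustration under stated assumptions, not the Reflexion authors' implementation: `reflect_and_revise`, `stub_draft`, and `stub_critique` are hypothetical names, and the two stubs stand in for calls to the same model under different prompts.

```python
from typing import Callable, List, Optional, Tuple

def reflect_and_revise(
    task: str,
    draft_fn: Callable[[str, List[str]], str],          # (task, lessons) -> answer
    critique_fn: Callable[[str, str], Optional[str]],   # (task, answer) -> flaw, or None
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Run draft -> self-critique -> revise, keeping critiques in a scratchpad."""
    lessons: List[str] = []            # scratchpad that conditions later attempts
    answer = draft_fn(task, lessons)
    for _ in range(max_trials - 1):
        flaw = critique_fn(task, answer)
        if flaw is None:               # the self-critique found nothing to fix
            break
        lessons.append(flaw)           # remember the lesson for the next trial
        answer = draft_fn(task, lessons)
    return answer, lessons

# Hypothetical stub "model": it forgets the unit until a lesson reminds it.
def stub_draft(task: str, lessons: List[str]) -> str:
    return "42 km" if any("unit" in lesson for lesson in lessons) else "42"

def stub_critique(task: str, answer: str) -> Optional[str]:
    return None if answer.endswith("km") else "missing unit: state the answer in km"

final, notes = reflect_and_revise("How far is the depot?", stub_draft, stub_critique)
print(final)   # revised answer after one recorded lesson
print(notes)   # the scratchpad contents
```

The stub error here is a careless one (a dropped unit), which is the kind of slip a self-critique pass can plausibly surface; the loop has no way to recover when the critique function shares the draft function's misconception and returns `None`.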

Reflection has a ceiling that distinguishes it from the multi-agent patterns to come. The same model, with the same training and the same blind spots, is grading its own homework. If the error stems from a misconception the model holds confidently, it will tend to approve its own wrong answer, because the faculty that generated the mistake also judges it. Reflection reliably catches careless errors (a dropped constraint, an arithmetic slip, an unhandled case) and reliably misses errors rooted in the model's own confident misunderstanding. To escape that ceiling you need a second, genuinely different perspective, which is where separate agents earn their cost.

Key Insight: Evaluating Is Easier Than Generating, and Independence Is What Pays

Every pattern in this section rests on one asymmetry: checking a candidate answer is usually an easier task than producing it, so a verification pass can catch errors the generation pass missed. But self-reflection caps out where the checker shares the generator's blind spots. The accuracy gain scales with the independence of the second opinion, not merely its presence. A separate critic, an adversary in a debate, or an independent voter each buys more than a self-review, and a crowd of identical agents that all copy one another buys nothing at all. Spend your extra compute on diversity, not on repetition.

2. Critic and Verifier Agents: The Generator-Critic Split Beginner

The next step separates the roles across two agents. A generator proposes a solution; a distinct critic, often given a different prompt, a different model, or access to a tool the generator lacks (a code runner, a calculator, a retrieval index from Chapter 25), checks the proposal and either approves it or returns specific objections. The generator revises and resubmits until the critic is satisfied or a round budget is exhausted. This is the generator-critic split, and it is the agentic form of the adversarial-verify idea: a verifier whose only job is to find the flaw is harder to fool than a generator hoping it got the answer right.

The split is powerful precisely when verification is cheaper or more reliable than generation. A unit test either passes or it does not; a SQL query either parses or it does not; a cited fact either appears in the retrieved document or it does not. When the critic can ground its judgment in an external check rather than its own opinion, the loop converges on answers that satisfy a real constraint rather than answers that merely sound right. The left panel of Figure 32.5.1 is this loop: the arrow back from critic to generator carries exactly the objection that the next draft must address.
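A grounded critic of this kind can be sketched with a unit-test check. The example below is a hedged illustration: `run_tests`, `generator`, and `generate_critique_loop` are hypothetical names, the generator is a stub that switches to a better draft once it has seen an objection, and the task (return the largest element of a list) is chosen only because it is easy to test.

```python
from typing import Callable, List, Optional

def run_tests(candidate: Callable[[List[int]], int]) -> Optional[str]:
    """Grounded critic: return a concrete objection, or None to approve."""
    cases = [([1, 2, 3], 3), ([-5, -2], -2), ([7], 7)]
    for xs, expected in cases:
        try:
            got = candidate(xs)
        except Exception as exc:
            return f"raised {type(exc).__name__} on input {xs}"
        if got != expected:
            return f"expected {expected} for {xs}, got {got}"
    return None

def generator(objections: List[str]) -> Callable[[List[int]], int]:
    """Stub generator: a real one would prompt a model with the objections."""
    if not objections:
        def draft(xs: List[int]) -> int:   # buggy first draft: assumes non-negative input
            best = 0
            for x in xs:
                best = max(best, x)
            return best
        return draft
    return lambda xs: max(xs)              # revised draft after reading the objection

def generate_critique_loop(max_rounds: int = 3):
    objections: List[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = generator(objections)
        objection = run_tests(candidate)   # the check is external, not an opinion
        if objection is None:
            return candidate, round_no, objections
        objections.append(objection)       # this is what the loop edge carries back
    return None, max_rounds, objections    # round budget exhausted without approval

candidate, rounds_used, history = generate_critique_loop()
print(rounds_used, history)
```

The critic never judges whether the draft "looks right"; it reports which test failed and how, so the objection passed back to the generator is specific enough to act on.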

Library Shortcut: Frameworks Provide the Critic Loop as a Pattern

Hand-wiring a generator-critic loop means managing two prompts, a round counter, a stopping rule, and the message-passing between roles. Agent frameworks ship this as a primitive. In AutoGen, a reflection pattern or a two-agent chat between an AssistantAgent and a critic AssistantAgent runs the revise-until-approved loop for you; LangGraph expresses it as a graph with a conditional edge that routes back to the generator node while the critic rejects:

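# LangGraph: a generate node, a critic node, and a loop edge.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict, total=False):
    draft: str
    approved: bool
    round: int

def generate_draft(s: ReviewState) -> dict:   # stub: replace with a generator LLM call
    return {"draft": "draft text", "round": s.get("round", 0) + 1}

def run_critic(s: ReviewState) -> dict:       # stub: replace with a critic LLM or tool check
    return {"approved": False}

g = StateGraph(ReviewState)
g.add_node("generate", generate_draft)        # writes state["draft"], bumps state["round"]
g.add_node("critique", run_critic)            # sets state["approved"] = True/False
g.add_edge("generate", "critique")
g.add_conditional_edges(
    "critique",
    lambda s: "done" if s["approved"] or s["round"] >= 3 else "again",
    {"again": "generate", "done": END},       # loop back, or stop
)
g.set_entry_point("generate")
app = g.compile()                             # the whole revise loop, ready to run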

Code 32.5.1: The generate-critique loop as a compiled graph. The framework owns the message routing and the state passing between nodes; you supply only the two node functions and the condition, which carries the round budget and the stopping rule. Roughly the same control flow you would write by hand collapses into one conditional edge.

3. Multi-Agent Debate: Consensus Over Opinions Intermediate

Debate generalizes the critic into a symmetric contest. Several agents independently answer the same question, then read one another's answers and reasoning and are asked to critique, defend, or revise their position over a few rounds. A judge agent (or a simple majority) resolves the final answer. The mechanism that makes debate work is that a wrong answer usually cannot survive scrutiny from an agent that reasoned its way to a different conclusion: the disagreement forces each side to expose its reasoning, and exposed reasoning is checkable. Reported results across reasoning and factuality benchmarks show debate lifting accuracy over a single agent and over naive self-reflection, with the gain largest on problems where a single chain of thought is brittle.

Read as a distributed system, debate is a consensus protocol whose values are opinions rather than log entries, and the connection to Section 29.9 is close. The agents are replicas proposing values; the rounds are message exchanges; the judge or majority is the commit rule. The crucial difference from machine consensus is that we do not merely want agreement, we want agreement on the correct value, and that only happens if the proposals were independent enough that the correct answer had support that the incorrect ones could not match. The middle panel of Figure 32.5.1 shows the structure: cross-talk among agents, then resolution by a judge.
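The propose, exchange, commit structure can be sketched with stub debaters. Everything here is an illustrative assumption rather than a published protocol: the names `debate`, `make_agent`, `steps_check`, and `majority_judge` are hypothetical, and the stub is deliberately rigged so that an agent revises only when its own exposed steps fail a check and a peer's steps pass. Real language agents can also be swayed by unsound arguments, which this stub does not model.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[int, int, int]        # (a, b, claimed value of a*b): one line of exposed reasoning
Position = Dict[str, Any]          # {"answer": int, "steps": List[Step]}
Agent = Callable[[str, List[Position]], Position]

def steps_check(steps: List[Step]) -> bool:
    return all(a * b == claimed for a, b, claimed in steps)

def make_agent(answer: int, steps: List[Step]) -> Agent:
    """Stub debater: defends its position unless scrutiny shows its own steps
    fail while a peer's steps hold, in which case it adopts the peer's position."""
    def agent(question: str, peers: List[Position]) -> Position:
        mine: Position = {"answer": answer, "steps": steps}
        if not peers or steps_check(steps):
            return mine                       # opening round, or the position survives
        for peer in peers:
            if steps_check(peer["steps"]):
                return peer                   # revise toward checkable reasoning
        return mine
    return agent

def majority_judge(question: str, positions: List[Position]) -> int:
    return Counter(p["answer"] for p in positions).most_common(1)[0][0]

def debate(question: str, agents: List[Agent], rounds: int,
           judge: Callable[[str, List[Position]], int]) -> int:
    positions = [agent(question, []) for agent in agents]      # independent proposals
    for _ in range(rounds):                                     # message exchanges
        positions = [agent(question, positions[:i] + positions[i + 1:])
                     for i, agent in enumerate(agents)]
    return judge(question, positions)                           # commit rule

good = [(10, 6, 60), (7, 6, 42)]   # 17*6 = 60 + 42 = 102
bad = [(10, 6, 60), (7, 6, 52)]    # slip in the second step leads to 112
agents = [make_agent(112, bad), make_agent(102, good), make_agent(112, bad)]

print(debate("17 * 6", agents, rounds=0, judge=majority_judge))   # vote on openings only
print(debate("17 * 6", agents, rounds=1, judge=majority_judge))   # vote after one exchange
```

With zero exchange rounds the two agents that share the same slip outvote the correct one; after one round of reading each other's steps, the judge sees agreement on the answer whose reasoning checks out.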

Research Frontier: Debate, Reflexion, and Inference-Time Scaling (2024 to 2026)

Three threads converge here. Multi-agent debate (Du et al., 2023, and a wave of 2024 to 2025 follow-ups) showed that having several language-model instances propose and then debate answers improves factuality and accuracy on arithmetic and reasoning tasks, and later work studies how many agents and rounds actually help before returns flatten. Reflexion (Shinn et al., 2023) framed self-critique with episodic memory as verbal reinforcement learning, and its descendants explore richer critic signals. Both sit inside the broader inference-time scaling agenda made prominent by reasoning models in 2024 to 2025: spend more compute at inference (more samples, more debate rounds, longer deliberation) to raise quality without retraining, and study the accuracy-versus-compute curve directly. A sobering counter-thread quantifies sycophancy and herding, showing that agents readily abandon a correct answer when peers disagree, which is why preserving diversity and using a strong, independent judge are active design concerns rather than afterthoughts.

4. Voting and Self-Consistency: Wisdom of Crowds, Returned Intermediate

The simplest aggregation skips the cross-talk entirely: sample many answers independently and take the majority. When the many answers come from one model run several times at nonzero temperature, this is self-consistency; when they come from genuinely different agents, it is an ensemble vote. Either way it is the wisdom of crowds from Section 31.1, applied to agent outputs, and the same Condorcet arithmetic governs it. If each of $k$ agents (with $k$ odd, so that ties cannot occur) answers a two-way question correctly and independently with probability $p > \tfrac{1}{2}$, the majority of $k$ is correct with probability

$$P_{\text{maj}}(k) = \sum_{j=\lceil k/2 \rceil}^{k} \binom{k}{j} \, p^{\,j} (1-p)^{\,k-j},$$

which rises toward $1$ as $k$ grows. With $p = 0.62$ a single agent is right about $62\%$ of the time, while a majority of seven independent agents is right about $75\%$ of the time ($P_{\text{maj}}(7) \approx 0.748$), and a majority of fifteen does better still. The entire benefit rides on the word independent: the formula assumes the agents' errors are uncorrelated. The moment the votes become correlated, because the agents share a prompt that biases them the same way, or because they copy a leader, the effective $k$ collapses toward one and the gain disappears. This is the herding failure of Section 31.9, an information cascade in which each agent's confidence is borrowed from the crowd rather than earned from the problem.
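The binomial sum is short enough to evaluate directly. The sketch below assumes odd $k$ and fully independent voters, exactly as the formula does; `p_majority` is a hypothetical helper name.

```python
from math import comb

def p_majority(k: int, p: float) -> float:
    """Probability that a strict majority of k independent voters is correct (k odd)."""
    if k < 1 or k % 2 == 0:
        raise ValueError("use a positive odd k so that ties cannot occur")
    need = k // 2 + 1                      # equals ceil(k / 2) when k is odd
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(need, k + 1))

for k in (1, 3, 7, 15):
    print(f"k={k:2d}  P_maj={p_majority(k, 0.62):.3f}")
```

For $k = 7$ and $p = 0.62$ this evaluates to about $0.748$. The function says nothing about correlated voters; it is the best case against which a real committee can be compared.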

The code below makes all of this concrete with stub agents and no language model at all. A task has a hidden correct label; each stub agent emits the correct label with probability $p$, mimicking a noisy but better-than-chance reasoner. We measure four strategies on the same stream of tasks: a single agent, a generate-then-critique pair, a diverse seven-agent vote, and a herding seven-agent vote in which followers copy the running majority instead of thinking for themselves.

import random
random.seed(7)

N_TASKS = 20000
P_AGENT = 0.62          # base per-agent accuracy: better than chance, far from perfect
N_VOTERS = 7
P_CRITIC = 0.70         # critic's accuracy at judging "is this answer wrong?"

def agent_answer(truth, p=P_AGENT):
    return truth if random.random() < p else (1 - truth)   # noisy guess

def single_agent(truth):
    return agent_answer(truth)

def generate_then_critique(truth):
    draft = agent_answer(truth)
    draft_ok = (draft == truth)
    critic_says_ok = draft_ok if random.random() < P_CRITIC else (not draft_ok)
    return draft if critic_says_ok else agent_answer(truth)   # revise once if doubted

def diverse_vote(truth, k=N_VOTERS):
    votes = [agent_answer(truth) for _ in range(k)]          # k INDEPENDENT agents
    return 1 if sum(votes) > k / 2 else 0

def herding_vote(truth, k=N_VOTERS, copy_prob=0.85):
    votes = [agent_answer(truth)]                            # the leader answers first
    for _ in range(k - 1):
        running = 1 if sum(votes) > len(votes) / 2 else 0
        if random.random() < copy_prob:
            votes.append(running)                           # copy the crowd, add no info
        else:
            votes.append(agent_answer(truth))               # think independently
    return 1 if sum(votes) > k / 2 else 0

def accuracy(strategy):
    correct = 0
    for _ in range(N_TASKS):
        truth = random.randint(0, 1)
        correct += (strategy(truth) == truth)
    return correct / N_TASKS

print(f"single agent                  : {accuracy(single_agent):.3f}")
print(f"generate-then-critique        : {accuracy(generate_then_critique):.3f}")
print(f"diverse vote  (k={N_VOTERS})           : {accuracy(diverse_vote):.3f}")
print(f"herding vote  (k={N_VOTERS}, copy=.85) : {accuracy(herding_vote):.3f}")

Code 32.5.2: Four interaction strategies over stub agents with noisy correctness. The agents carry no language model; P_AGENT stands in for a reasoner that is right more often than chance. The herding strategy makes followers copy the running majority, deliberately destroying vote independence.

single agent                  : 0.629
generate-then-critique        : 0.712
diverse vote  (k=7)           : 0.748
herding vote  (k=7, copy=.85) : 0.620

Output 32.5.2: Critique lifts a single agent from $0.629$ to $0.712$, and a diverse seven-agent vote reaches $0.748$, matching the Condorcet prediction for $p = 0.62$. The herding vote, despite using the same seven agents, falls back to $0.620$: copying the running majority, which starts as the leader's answer, collapses the effective committee toward one, so the wisdom-of-crowds gain vanishes.

Thesis Thread: Wisdom of Crowds and Consensus, Scaled Out to Agents

The aggregation you just measured is the same primitive that ran through Part VI. The Condorcet jury theorem behind the diverse vote is the wisdom-of-crowds engine of Section 31.1; the debate-with-a-judge structure is the opinion consensus of Section 29.9; and the collapse of the herding vote is the information cascade of Section 31.9, now wearing language agents as its replicas. Distributed problem solving improved quality by combining many independent computations; multi-agent interaction improves quality by combining many independent opinions. The lesson is identical at both scales: aggregation pays only while the things you aggregate stay independent.

5. The Price: More Calls for More Quality Advanced

None of this is free. A single answer costs one LLM call; generate-then-critique costs at least two and often more as it loops; a $k$-agent vote costs $k$ calls; a debate over $r$ rounds with $m$ agents costs on the order of $m \cdot r$ calls plus the judge. These patterns multiply the dominant cost of an agentic system, the model calls, so a quality gain only justifies itself when the answer is worth the extra compute and latency. A throwaway summarization request does not deserve a five-agent debate; a medical triage recommendation or an irreversible financial action may deserve far more. This is the inference-time-scaling trade in its starkest form: you are buying accuracy with calls, and the exchange rate has diminishing returns, since $P_{\text{maj}}(k)$ flattens as $k$ grows while the cost rises linearly. The accounting that decides how much deliberation a task can afford is the subject of Section 32.10.
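The call counts above can be written down as a small budget helper. This is a sketch under simplifying assumptions: every call is counted as equal in cost, a generate-critique round is one generator call plus one critic call, and a debate has each of $m$ agents speak once in each of $r$ rounds. The function names are hypothetical.

```python
def calls_generate_critique(rounds: int) -> int:
    """One generator call plus one critic call per round."""
    return 2 * rounds

def calls_vote(k: int) -> int:
    """One call per independent voter."""
    return k

def calls_debate(m: int, r: int, judge: bool = True) -> int:
    """m agents speaking once per round for r rounds, plus an optional judge call."""
    return m * r + (1 if judge else 0)

budgets = {
    "single answer": 1,
    "generate-critique, 1 round": calls_generate_critique(1),
    "generate-critique, 3 rounds": calls_generate_critique(3),
    "vote, k=7": calls_vote(7),
    "debate, m=3, r=2, with judge": calls_debate(3, 2),
}
for name, calls in budgets.items():
    print(f"{name:30s} {calls:3d} calls")
```

Real deployments would also weight calls by token count and latency, since later debate rounds carry the whole transcript as context; the equal-cost assumption understates that growth.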

There is also a correctness hazard beyond cost. Because debate and voting are consensus protocols, they can converge confidently on a wrong answer when diversity fails, and a confident wrong answer is more dangerous than an uncertain one because downstream agents trust it. Practical defenses mirror the diversity-preservation tactics of swarm systems: vary the prompts, models, or temperatures so the agents do not share one bias; keep the judge independent of the debaters and, where possible, ground it in an external verifier rather than a vote; and watch for the telltale of a cascade, an answer whose support grew because agents agreed with each other rather than with the evidence. Disagreement is the asset these patterns monetize, so any orchestration that quietly suppresses disagreement is destroying the value it was built to create.
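One way to watch for that telltale is to record each agent's answer before it sees its peers and compare it with the answer it holds afterward. The helper below is a heuristic sketch, not an established detector: `cascade_report` is a hypothetical name, and the flag only marks cases that deserve a grounded check, since agents can also switch for good reasons.

```python
from collections import Counter
from typing import Any, Dict, Hashable, List

def cascade_report(initial: List[Hashable], final: List[Hashable]) -> Dict[str, Any]:
    """Compare each agent's answer before and after it saw its peers."""
    if len(initial) != len(final) or not final:
        raise ValueError("need one initial and one final answer per agent")
    winner, final_support = Counter(final).most_common(1)[0]
    initial_support = sum(1 for a in initial if a == winner)
    switched_in = sum(1 for a, b in zip(initial, final) if a != winner and b == winner)
    return {
        "winner": winner,
        "initial_support": initial_support,   # votes earned independently
        "final_support": final_support,       # votes after cross-talk
        "switched_in": switched_in,           # votes gained by conversion
        # Heuristic: the winner had no independent majority and won through conversions.
        "suspect_cascade": initial_support <= len(initial) / 2 and switched_in > 0,
    }

herded = cascade_report(["B", "A", "C", "A", "B"], ["B", "B", "B", "B", "B"])
steady = cascade_report(["A", "A", "A", "B", "A"], ["A", "A", "A", "A", "A"])
print(herded)
print(steady)
```

When the flag is raised, the safer response is the one already named in the text: hand the contested answer to an external verifier or an independent judge rather than trusting the final tally.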

Practical Example: The Debate That Caught the Contract Error

Who: An applied-AI team building a contract-review assistant for a legal-tech company.

Situation: A single LLM pass extracted obligations and deadlines from uploaded contracts, and reviewers trusted it, until a missed indemnity clause in a signed deal triggered a costly dispute.

Problem: The single pass was confidently wrong on the hard clauses, exactly the ones where one chain of reasoning is brittle, and there was no second opinion to catch it.

Dilemma: Fine-tune a stronger single model, slow and expensive with no guarantee on the long tail, or spend more calls at inference by adding agents that disagree and a judge that resolves them.

Decision: They added a three-agent debate (extract, challenge, defend) with an independent judge, but only on clauses the first pass flagged as high-risk, so the extra compute landed where it mattered.

How: Each debater used a different prompt and temperature to keep the opinions independent; the judge was instructed to disregard the tone of the debate transcript and to ground its ruling in quoted clause text, not in which side argued louder.

Result: Recall on high-risk clauses rose sharply at roughly three times the per-clause cost, applied to under a tenth of the clauses, so total spend grew modestly while the dangerous misses fell. An early version that reused one prompt across all three debaters had herded onto the same misreads and was scrapped.

Lesson: Spend the extra calls where errors are costly, enforce real diversity among the debaters, and keep the judge independent; a debate of clones is just one agent paying triple.
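As a rough check on the Result above, the blended cost can be worked out under two assumptions the example does not state precisely: a fraction $f$ of clauses is escalated to debate, and an escalated clause costs $c$ times a single pass in total. Relative to running one pass on every clause, the spend is then

$$\frac{\text{spend with selective debate}}{\text{spend with a single pass}} = (1 - f) \cdot 1 + f \cdot c = 1 + f\,(c - 1).$$

With $c \approx 3$ and $f < 0.1$ this ratio stays below $1 + 0.1 \times 2 = 1.2$, an increase of under $20\%$ in total calls, which is consistent with spend growing modestly while the extra compute lands only on the high-risk clauses.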

Fun Note: Three Agents in a Trench Coat

A debate in which every agent shares the same prompt, model, and temperature is three agents in a trench coat pretending to be a committee. They will nod along to one another, reach unanimous consensus in record time, and be exactly as wrong as a single agent would have been, only now with a quorum to blame. If your committee never argues, you do not have a committee; you have one opinion with extra latency.

Exercise 32.5.1: When Does the Critic Help? Conceptual

The generate-then-critique strategy in Code 32.5.2 used a critic with accuracy $P_{\text{critic}} = 0.70$ at judging whether a draft is wrong. Reason about two limits. First, what happens to the strategy's accuracy if the critic is at chance ($P_{\text{critic}} = 0.5$), and why? Second, suppose the critic is worse than chance because it shares and amplifies the generator's bias; argue qualitatively why adding such a critic could drag accuracy below a single agent. Connect your answer to the Key Insight that the gain scales with the critic's independence, not its mere presence.

Exercise 32.5.2: The Condorcet Curve and the Cost of Voters Coding

Using the binomial formula for $P_{\text{maj}}(k)$ with $p = 0.62$, compute and plot the majority accuracy for odd $k$ from $1$ to $31$. Mark the point where adding two more voters raises accuracy by less than one percentage point. Then overlay a cost line that grows linearly in $k$ (one call per voter) and identify the $k$ beyond which the marginal accuracy per extra call is no longer worth it. Explain how your chosen operating point would shift if each agent were stronger ($p = 0.75$) or weaker ($p = 0.55$), and what happens to the curve as $p \to 0.5$.

Exercise 32.5.3: Measuring the Herding Penalty Analysis

Modify herding_vote in Code 32.5.2 to sweep copy_prob from $0.0$ (fully independent) to $1.0$ (everyone copies the leader) and record the resulting accuracy at each setting. Plot accuracy against copy_prob and explain the shape: where does the diverse-vote gain survive, and at what copy probability does the committee's accuracy collapse to roughly the single-agent value? Define an "effective number of independent voters" $k_{\text{eff}}$ implied by the measured accuracy via the Condorcet formula, and describe how $k_{\text{eff}}$ falls as correlation rises. Relate this to the information-cascade analysis of Section 31.9.