Section 40.10: Project Extension | Building Scalable AI

"I began as one file answering one question over one folder. They kept giving me tools, and agents, and replicas, until one day I noticed I had become a fleet that thought I was a chatbot."
A Single-File Agent, Dreaming of the Fleet It Could Become

Big Picture

This chapter taught the distributed agentic application as a stack of cooperating layers; this final section hands it back to you as a buildable project that grows a single-process RAG chatbot into that full application one layer at a time, with a measurable milestone at every step. You will start where every practitioner starts, a single Python process that retrieves over a few documents, calls a model, and answers. Then you will scale each layer in the order the chapter introduced it: distributed document processing and chunking (Section 40.2), an embedding pipeline (Section 40.3), a sharded vector index with access-control filtering (Section 40.4), advanced multi-hop RAG (Section 40.5), a multi-agent orchestrator with tools (Section 40.6), a vLLM serving fleet (Section 40.7), cost control by caching and cascade (Section 40.8), and an evaluation harness that reports task success against cost (Section 40.9). The project is the whole book in miniature: one distributed agentic application that touches every axis of distribution from Section 1.2, and it is the capstone of the case studies in Part VIII.

A case study read passively teaches the shape of a system; a case study rebuilt teaches its cost. The nine sections before this one walked the distributed agentic application as a finished artifact, naming the distributed mechanism behind each layer and the chapters that own it. This section inverts that posture. It gives you a staged construction plan in which you begin with a baseline small enough to fit one process and honest enough to answer real questions, then remove one single-process assumption per layer, measuring what the distribution bought you before moving on. Each layer draws on a specific earlier section, so the project doubles as a guided tour back through the chapter and the book: when you serve the fleet you are applying Chapter 24, when you shard the index you are applying Chapter 25, when you orchestrate the agents you are applying Chapter 32, and the whole shape echoes the web-scale RAG capstone of Chapter 36. The discipline that makes the project worth its time is the one Section 1.1 opened with: distribute a layer only when a ceiling forces it, and prove with a number that the distribution helped.

Figure 40.10.1: The staged build. The baseline at the top runs the whole RAG chatbot in one process; each numbered layer below scales one part of it out, drawing on the section and chapter that own that mechanism, and attaches a measurable milestone (chunk throughput, embedding efficiency, recall with zero access-control leaks, multi-hop success, task success, tail latency, cost per task, success against cost). Carried to the end, the eight layers exercise all six axes of distribution from Section 1.2, which is why this single project is the book in miniature and the capstone of the case studies.

1. The Baseline You Scale Out From Beginner

Every honest scale-out project begins with a single-process baseline, for two reasons. The first is correctness: the distributed application must answer at least as well as the baseline, and you cannot check that against a baseline you never built. The second is measurement: speedup, task success, and cost per task, the headline metrics of Chapter 3 and Chapter 5, are defined only relative to a one-process reference. Your baseline is a single file over a small document set that performs the whole loop in memory: embed the documents, retrieve the passages relevant to a query, call a tool when the question needs a fact the model should not guess, and template the retrieved evidence into an answer. The corpus is small enough to fit one process and structured enough that a genuinely two-hop question (one whose answer requires chaining two retrieved facts) is answerable, because the multi-hop capability of Section 40.5 is exactly what the later layers must preserve.

The code below is that baseline compressed to one dependency-free file: the distributed agentic application in miniature. It embeds a handful of documents with a deterministic hashing encoder, then runs the agent loop on a two-hop question, retrieve the cluster that serves the fleet, then retrieve and tool-extract how many replicas that named cluster runs, and templates the two facts into an answer. The check at the end, that the loop actually resolves both hops, is the task-success invariant your project must preserve at every later layer.

import hashlib, math, re

DIM = 64  # toy embedding dimension; a real encoder emits 384-1024
DOCS = {
    0: "the inference fleet is served by a cluster named orion",
    1: "the orion cluster runs eight vllm replicas behind one router",
    2: "data parallelism splits training gradients across worker machines",
    3: "a multi agent orchestrator routes subtasks to specialist agents",
    4: "prefix caching reuses the shared system prompt across requests",
    5: "a vector index is sharded so no single node holds the whole corpus",
}

def _h(tok):                       # deterministic hash so the demo is reproducible
    return int(hashlib.md5(tok.encode()).hexdigest(), 16)

def embed(text):                   # hashing bag of words: a stand in for an encoder
    v = [0.0] * DIM
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        v[_h(tok) % DIM] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

CORPUS = {d: embed(t) for d, t in DOCS.items()}

def retrieve(query, k=1):          # the retriever the agent calls as a tool
    q = embed(query)
    ranked = sorted(CORPUS.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def count_replicas_tool(text):     # a toy "tool": extract a replica count from text
    m = re.search(r"(\d+|one|two|eight)\s+vllm", text)
    word = {"one": 1, "two": 2, "eight": 8}
    if not m:
        return None
    tok = m.group(1)
    return int(tok) if tok.isdigit() else word.get(tok)

# Two hop question: which cluster serves the fleet, and how many replicas does it run?
QUESTION = "how many vllm replicas does the cluster serving the inference fleet run"

# Hop 1: retrieve the cluster name for the inference fleet.
hop1 = retrieve("which cluster serves the inference fleet", k=1)[0]
cluster = re.search(r"named (\w+)", DOCS[hop1]).group(1)

# Hop 2: retrieve facts about that named cluster, then call the tool.
hop2 = retrieve(f"how many vllm replicas does the {cluster} cluster run", k=1)[0]
replicas = count_replicas_tool(DOCS[hop2])

# Template generate: fill a deterministic answer from the two retrieved hops.
answer = (f"The {cluster} cluster serves the inference fleet and runs "
          f"{replicas} vLLM replicas.")

print("question        :", QUESTION)
print("hop 1 doc        :", hop1, "->", f"cluster = {cluster}")
print("hop 2 doc        :", hop2, "->", f"tool replicas = {replicas}")
print("answer           :", answer)
print("answered 2-hop   :", cluster == "orion" and replicas == 8)

Code 40.10.1: The agentic application in miniature: a retrieve, tool-call, template-generate loop over six documents. The agent resolves a two-hop question by retrieving the cluster name, then retrieving and tool-extracting that cluster's replica count, and templates both facts into the answer; the final flag reports whether the two-hop chain succeeded.

question        : how many vllm replicas does the cluster serving the inference fleet run
hop 1 doc        : 0 -> cluster = orion
hop 2 doc        : 1 -> tool replicas = 8
answer           : The orion cluster serves the inference fleet and runs 8 vLLM replicas.
answered 2-hop   : True

Output 40.10.1: The miniature agent answers the two-hop question correctly: hop one resolves the cluster name (orion), hop two retrieves that cluster's document and the tool extracts the replica count (eight), and the template composes both into a grounded answer. This task-success invariant is what your project must hold as you scale every layer out.

Key Insight: Build the Single-Process Agent First, or You Have Nothing to Beat

The temptation is to start distributed, because the orchestrator and the serving fleet are the interesting parts. Resist it. Without a single-process baseline you cannot compute a speedup, cannot detect a task-success regression when an agent or a shard misbehaves, and cannot tell whether the distributed application is correct or merely plausible. The two-hop success in Output 40.10.1 is exactly the kind of check a baseline makes possible, and every later layer (a second agent, an approximate index, a cheaper cascade tier) is a change that could quietly break it. The baseline is cheap, it runs in a fraction of a second, and every milestone in this project is measured against it. A scale-out project without a baseline is not a project; it is a hope.

2. Staging the Scale-Out, Milestone by Milestone Intermediate

With the baseline in hand, you scale the application out one layer at a time, in the order of Figure 40.10.1, never advancing until the current layer hits its milestone. The discipline of one-layer-at-a-time matters because it isolates cause and effect: when task success jumps or tail latency spikes, exactly one thing changed, and you know which section's mechanism to blame. Table 40.10.1 is the project plan. Each row names the layer, the binding ceiling that forces the distribution, the section and chapter that supply the mechanism, and the measurable milestone that tells you the layer is done.

Table 40.10.1: The staged build plan for the distributed agentic application. Scale out top to bottom; do not advance until the milestone is met. Each layer removes one single-process assumption and draws its mechanism from the named section and chapter.

Layer	Binding ceiling	Mechanism (section / chapter)	Milestone to hit
1. Document processing + chunking	Corpus too big / slow for one node	Section 40.2, MapReduce (Ch 6)	Near-linear chunk throughput across workers
2. Embedding pipeline	Encoding the corpus is GPU-bound	Section 40.3, data-parallel encode (Ch 15)	Efficiency $E \ge 0.8$ across the GPU fleet
3. Sharded index + ACL filtering	Index exceeds one node; multi-tenant	Section 40.4, sharded vector search (Ch 25)	Recall held; zero access-control leaks
4. Multi-hop RAG	Single-shot retrieval misses chained facts	Section 40.5, iterative retrieval (Ch 36)	Two-hop task success above single-shot
5. Multi-agent orchestrator + tools	One agent cannot plan and act reliably	Section 40.6, orchestration (Ch 32)	Task success above the single-agent baseline
6. vLLM serving fleet	Request volume exceeds one server	Section 40.7, replicated serving (Ch 24)	p99 latency under the SLO at target QPS
7. Cost control: caching + cascade	Token spend dominates the bill	Section 40.8, prefix cache, cascade (Ch 22)	Cost per task down at held task success
8. Evaluation harness	Quality and cost must survive distribution	Section 40.9, distributed eval (Ch 5)	Task success and cost per task reported together

The milestones are quantitative on purpose. "Added an orchestrator" is not a milestone; "the multi-agent orchestrator raised task success from the single-agent baseline of seventy percent to eighty-eight percent on the held-out task set" is. Layers 1 and 2 are throughput-and-capacity layers, judged by speedup and the corpus size they unlock. Layer 3 is a quality-and-safety layer, where approximate nearest neighbor (Chapter 12) trades a little recall for a large latency win while access-control filtering must leak nothing across tenants. Layers 4 and 5 are task-success layers, where multi-hop retrieval and the multi-agent orchestrator (Chapter 32) raise the fraction of tasks the system actually completes. Layer 6 is a tail-latency layer, where replication (Chapter 24) buys throughput under a hard p99 budget. Layer 7 is a cost layer, where prefix caching and a cheap-to-expensive model cascade (Chapter 22) cut the bill without dropping success. Layer 8 closes the loop by proving, with the distributed evaluation of Chapter 5, that the fully distributed application still answers as well as the baseline, at a cost per task you can defend.

Practical Example: The Capstone That Grew One Layer at a Time

Who: A graduate student building this exact project as a term capstone on a four-node lab cluster plus two cloud GPUs.

Situation: The single-process baseline over a few hundred internal documentation pages answered single-hop questions acceptably but failed most two-hop ones and saturated one GPU under load.

Problem: The assignment required a distributed agentic design, but jumping straight to a multi-agent everything produced a system that was slower than the baseline, leaked documents across tenants, and was impossible to debug.

Dilemma: Rebuild the whole thing distributed in one leap, fast to write but a black box when it underperformed, or scale one layer at a time against milestones, slower to start but debuggable throughout.

Decision: The staged plan of Table 40.10.1. The student scaled the serving fleet only after profiling showed the single replica was the binding ceiling, and stopped at each milestone before touching the next layer.

How: Data-parallel embedding hit efficiency $0.85$; the index was sharded across four nodes with per-tenant access-control filters that passed a zero-leak audit; multi-hop RAG and a planner-plus-specialist orchestrator lifted two-hop success from $0.41$ to $0.86$; three vLLM replicas held p99 under the SLO; prefix caching plus a small-then-large cascade cut cost per task by forty percent at held success.

Result: The final application matched or beat the baseline on every held-out task, served the target load under its latency budget, and, because each layer was measured, every number in the writeup was defensible.

Lesson: Profile to find the binding layer, scale that one first, and never advance past an unmet milestone. A staged build is the only kind whose final numbers you can trust.

3. The Numbers Your Project Must Hit Intermediate

A scale-out project lives or dies by whether the gains it claims are real, so each milestone is a target you compute, not a feeling. The throughput layers are governed by Amdahl's law from Chapter 3: speedup on $p$ machines is $S(p) = T_1 / T_p$, parallel efficiency is $E(p) = S(p) / p$, and if a fraction $f$ of the pipeline is inherently serial (the final agent reasoning step, the single generator call), then

$$S(p) = \frac{1}{f + \dfrac{1 - f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f}.$$

This sets the ceiling on the embedding and document-processing layers: if their serial fraction is $f = 0.1$, no number of machines passes a tenfold speedup, so a milestone demanding twentyfold is unphysical and the fix is to shrink $f$, not add nodes. The serving and cost layers, by contrast, are sized by tokens and concurrency rather than speedup. The dominant operating cost of an agentic task is its tokens: if a task issues $r$ model calls averaging $t_{\text{in}}$ input and $t_{\text{out}}$ output tokens, at per-token prices $c_{\text{in}}$ and $c_{\text{out}}$, the cost per task is

$$C_{\text{task}} = r \,\bigl(t_{\text{in}}\, c_{\text{in}} + t_{\text{out}}\, c_{\text{out}}\bigr).$$

A multi-agent task with $r = 6$ calls of $t_{\text{in}} = 1500$ and $t_{\text{out}} = 300$ tokens, at $c_{\text{in}} = 3 \times 10^{-6}$ and $c_{\text{out}} = 15 \times 10^{-6}$ dollars per token, costs $C_{\text{task}} = 6\,(1500 \cdot 3 + 300 \cdot 15)\times 10^{-6} = 0.054$ dollars per task. Prefix caching the shared system prompt (Chapter 22) removes most of $t_{\text{in}}$ on the repeated calls, and a cascade that resolves the easy fraction on a cheap model lowers the effective $c$, which is exactly what the layer-7 milestone measures. The serving fleet is sized by concurrency: to sustain a target $\lambda$ requests per second when one replica serves at $\mu$ requests per second, you need at least $N_{\text{rep}} = \lceil \lambda / (\rho\, \mu) \rceil$ replicas for a target utilization $\rho < 1$ that leaves headroom for the p99 tail. At $\lambda = 40$, $\mu = 6$, and $\rho = 0.7$, that is $N_{\text{rep}} = \lceil 40 / 4.2 \rceil = 10$ replicas, the stage-6 capacity argument from Section 40.7 made arithmetic. Compute these targets before you build, so each milestone is a prediction you test rather than a result you rationalize.

Library Shortcut: Each Layer Is a Few Lines in a Production Tool

The hand-rolled retrieve-tool-template loop in Code 40.10.1 is for understanding; in the real project each layer maps to a framework that handles the distribution for you. Code 40.10.2 names that mapping, turning the staged plan into a near-deployment plan exactly as the design checklist did in Section 1.8:

# Each staged milestone -> the production tool that owns it.
STACK = {
    "doc processing":   "Ray Data / Spark   # shard the corpus, chunk in parallel",
    "embedding":        "sentence-transformers + torch DDP  # data-parallel encode",
    "sharded index+ACL":"Qdrant / FAISS-IVF # ANN shards, per-tenant filters",
    "multi-hop RAG":    "LangGraph          # iterative retrieve-reason graph",
    "agent orchestrator":"LangGraph         # planner + specialist agents, tools",
    "serving fleet":    "vLLM + Ray Serve   # replicate, batch, prefix-cache",
    "cost control":     "vLLM prefix cache + a small-then-large model cascade",
    "evaluation":       "Ray / RAGAS        # distributed task-success + cost",
}
for layer, tool in STACK.items():
    print(f"{layer:20s}-> {tool}")

Code 40.10.2: The eight staged layers mapped to the production tool that owns each. A finished Table 40.10.1 row selects a key, and the value is the framework the corresponding section teaches; the from-scratch agent loop of Code 40.10.1 collapses to a LangGraph orchestration over a vLLM serving fleet and a Qdrant index, with Ray Serve replicating the whole thing.

4. Extension Challenges Worth the Cluster Advanced

Once the eight layers hit their milestones you have a working distributed agentic application, and the project becomes a platform for the harder questions the chapter only gestured at. Each extension below adds one capability that a real fleet needs, and each reaches into a different part of the book, so finishing them turns the capstone from a pipeline into a system. Add a second specialist agent, splitting the single worker into a retrieval agent and a calculation-or-code agent that the planner routes between, and measure the task-success lift on tasks that need both skills against the extra orchestration latency it costs (Chapter 32). Add prefix caching to the serving fleet explicitly, sharing the long system prompt and the retrieved context across an agent's repeated calls, and quantify the tokens and dollars saved per task against the layer-7 cost milestone (Chapter 22). Add a Byzantine-tolerant tool-result check, so that when several agents or shards return a fact the orchestrator takes a quorum rather than the first answer, and a single corrupted tool result cannot poison the final response, the robust-aggregation idea from Chapter 35.

The deepest extension is the self-host versus API decision that governs the whole economics of the application. Run the same evaluation harness against two backends, a self-hosted vLLM fleet you size with the $N_{\text{rep}}$ formula above and a hosted model API priced per token, and find the crossover volume at which owning the fleet beats renting it. The self-hosted path pays a fixed hourly cost for the replicas whether or not they are busy, so its cost per task falls as utilization rises; the API path pays $C_{\text{task}}$ per task regardless of volume. Below the crossover the API wins on simplicity and idle cost, above it the fleet wins on marginal cost, and the exact crossover depends on your task's token profile and your achieved utilization, which is why this is an extension you measure rather than assert. Each of these extensions is a small, bounded change to a working system, which is exactly the posture in which distributed-systems concepts are learned best: against a baseline you can measure, in an application you already understand.

Research Frontier: Where Distributed Agentic Applications Are Heading (2024 to 2026)

The extensions above track live research lines, so a capstone that implements them is working at the current edge. Agentic RAG, where a planning agent issues multiple retrieval rounds and reasons over them rather than retrieving once, has moved from prototype to product, and frameworks in the LangGraph and AutoGen lineage now treat the retriever, tools, and sub-agents as a typed graph the orchestrator traverses. On the serving side, prefix and prompt caching has become a first-class vLLM and provider feature precisely because agentic workloads reissue the same long system prompt dozens of times per task, and automatic prefix caching with paged KV memory (the lineage of PagedAttention) is what makes the layer-7 economics work. Multi-agent reliability is the newest frontier: as orchestrators fan work out to sub-agents and tools of varying trust, Byzantine-robust aggregation of tool and agent outputs (the lineage of robust distributed learning, now applied to agent ensembles) is moving from theory toward practical guardrails. Finally, the self-host-versus-API question has spawned a small literature on agentic-workload cost modeling and routing, where a cascade or router sends each call to the cheapest backend that can handle it; treating model choice itself as a per-call distributed decision is, as of 2026, among the most active frontiers in the whole application.

5. Chapter Summary and What You Built Beginner

This section closes Chapter 40, so it is worth stating the through-line the whole chapter built. We began with the problem definition (Section 40.1): build an agentic application that answers and acts over a document corpus and a set of tools at a scale no single process can hold, retrieve, reason over, serve, or afford. From there the chapter walked the application layer by layer, and every layer was the same move applied to a different part of the system, partition the work across machines, move only what must be moved between them, and recombine the result correctly. Distributed document processing and chunking (Section 40.2) distributed the data with the MapReduce model of Chapter 6. The embedding pipeline (Section 40.3) applied the data parallelism of Chapter 15 to encode the corpus across a GPU fleet. The sharded vector index with access-control filtering (Section 40.4) partitioned the index and turned a query into the scatter-gather pattern of Chapter 25. Advanced multi-hop RAG (Section 40.5) made retrieval iterative, the bridge from Chapter 36. The multi-agent orchestrator with tools (Section 40.6) coordinated specialist agents with the orchestration machinery of Chapter 32. The vLLM serving fleet (Section 40.7) replicated the serving path of Chapter 24, cost control (Section 40.8) cut the bill with the per-node economics of Chapter 22 multiplied across the fleet, and evaluation (Section 40.9) proved the distributed application preserved task success at a defensible cost. The chapter is, end to end, one distributed agentic application that spans all six axes of Section 1.2: data, training, model, inference, cluster coordination, and intelligence.

Thesis Thread: One Application, All Six Axes, One Thesis

The book's spine is that AI at scale is the engineering of systems whose data, computation, models, inference, and decisions are distributed across many machines, and that each distribution is forced by a ceiling, not chosen for elegance. The distributed agentic application is the clearest single demonstration of that thesis in the book, because one application hits every ceiling in turn: the corpus forces distributing data, the encoder forces distributing training-style compute, a large generator forces distributing the model, the request volume forces distributing inference, the fleet forces cluster coordination, and the multi-agent orchestrator forces distributing intelligence itself. The staged project in this section is the thesis made buildable: you do not read about the six axes, you implement them, one ceiling at a time, and watch the single-file chatbot become the fleet it was always going to be. That is the capstone of the case studies, and the bridge into Chapter 41, where you choose your own problem and defend its distribution axis from scratch.

Key Takeaway: Chapter 40 as a Buildable System

A distributed agentic application is not eight unrelated tricks; it is one system in which every layer is the same partition-move-recombine move applied to a different resource, and the whole is a fleet of cooperating agents, tools, and model replicas. (1) Distribute the document processing so a corpus no node can hold is chunked in parallel. (2) Data-parallel the embedding so encoding scales near-linearly with the GPU fleet. (3) Shard the index and filter by access control so the corpus exceeds any node while leaking nothing across tenants. (4) Make retrieval multi-hop so chained-fact questions succeed. (5) Orchestrate specialist agents with tools so task success rises above one agent acting alone. (6) Replicate the vLLM serving path so request volume meets a hard p99 budget. (7) Cache prefixes and cascade models so cost per task falls at held success. (8) Evaluate across the fleet to prove distribution kept task success while reporting cost. Built in this order against milestones, the application exercises all six axes of distribution, which is why it is the book in miniature and the capstone of the case studies.

Project Ideas: Build the Agentic Application, Then Push It

Each idea is sized so that carrying it through the staged plan of Table 40.10.1 becomes a capstone in the sense of Chapter 41. Core build: start from the Code 40.10.1 baseline over a small document set and scale all eight layers out on a small cluster (or cloud spot instances), recording the task-success, p99-latency, cost-per-task, and recall milestone at each layer; the deliverable is a writeup in which every number is measured against the baseline. Second specialist agent: split the worker into a retrieval agent and a calculation-or-code agent that the planner routes between, and report the task-success lift on dual-skill tasks against the orchestration latency it adds (Chapter 32). Prefix caching: add explicit prefix caching of the shared system prompt and retrieved context, and report the tokens and dollars saved per task against the layer-7 cost milestone (Chapter 22). Byzantine-tolerant tool check: take a quorum over repeated agent or shard tool results so one corrupted result cannot poison the answer, the robust-aggregation idea of Chapter 35. Self-host versus API: run the evaluation harness against a self-hosted vLLM fleet sized by the $N_{\text{rep}}$ formula and a hosted API priced per token, and find the crossover volume at which owning the fleet beats renting it.

Exercise 40.10.1: Name the Ceiling, Layer by Layer Conceptual

For each of the eight layers in Table 40.10.1, state the single binding ceiling (data, model, throughput, quality, or cost) that forces the distribution, and the axis from Section 1.2 it maps to. Then identify the two layers whose milestone is a task-success target rather than a speed target, and explain why scaling those layers wrong can make the application answer worse, not just slower. Finally, argue which layer you would scale out first on a real corpus and why profiling, not intuition, should decide that order.

Exercise 40.10.2: Extend the Miniature Agent Coding

Starting from Code 40.10.1, (a) add a second tool, a simple arithmetic evaluator, and a third document stating a per-replica request rate, then pose a three-hop question whose answer requires the cluster name, the replica count, and a multiplication, and confirm the agent resolves all three hops. (b) Add a second specialist agent so the planner routes retrieval questions to one agent and arithmetic to the other, and report which questions now succeed that failed before. (c) Corrupt one document's replica count and add a quorum check that retrieves the fact from two phrasings and takes the majority, then show that the corrupted value no longer reaches the answer. Relate each part to the layer-5 and Byzantine-tolerance milestones in Table 40.10.1.

Exercise 40.10.3: Size the Fleet and the Bill Analysis

An agentic task issues $r = 8$ model calls averaging $t_{\text{in}} = 2000$ input and $t_{\text{out}} = 400$ output tokens, at prices $c_{\text{in}} = 3 \times 10^{-6}$ and $c_{\text{out}} = 15 \times 10^{-6}$ dollars per token. (a) Compute the cost per task $C_{\text{task}}$ from Section 3, then recompute it assuming prefix caching removes the input tokens on all but the first call. (b) The application must sustain $\lambda = 25$ tasks per second, each task occupying one replica for the duration of its calls, and one replica serves $\mu = 5$ tasks per second; using $N_{\text{rep}} = \lceil \lambda / (\rho\, \mu) \rceil$ with target utilization $\rho = 0.7$, compute the replica count. (c) If a self-hosted replica costs two dollars per hour, compute the self-hosted cost per task at the utilization above and compare it to the API $C_{\text{task}}$, stating the crossover task volume at which self-hosting wins.