Part VIII: Case Studies and Capstone Projects
Chapter 40: Distributed LLM and Agentic Applications

The final case study: a production agentic application that reads a corpus too large for one machine, retrieves under security constraints, reasons in many steps with tools and sub-agents, and answers thousands of users on a fleet of model replicas without blowing its latency or its budget.

Conceptual illustration for Chapter 40: Distributed LLM and Agentic Applications

"The user asked one question. I read forty documents I was allowed to see, called three tools, spun up two sub-agents, and queued behind nine hundred other requests for a slice of a model that lives on someone else's GPU. They are still waiting for one sentence. I am, somehow, a distributed system."

An Agent That Discovered It Was a Cluster

Big Picture

An LLM-powered agentic application looks like a single helpful assistant and is, underneath, a distributed system of cooperating agents, tool services, and model replicas that spans every axis this book has named. A user types one question; the system ingests and indexes a corpus too large for any one machine, retrieves the few passages that question needs under security rules about who may see what, reasons over several steps by calling tools and delegating to sub-agents, and produces the answer from a fleet of model servers shared by thousands of other users, all inside a latency budget the user can feel and a cost budget the operator must defend. This is the integrative capstone of the case studies, the chapter that pulls together the web-scale retrieval-augmented generation of Chapter 36, the sharded vector search of Chapter 25, the distributed LLM serving of Chapter 24, and the agent orchestration of Chapter 32 into one running application. The thread through every section is that an agentic application is not a prompt and a model but a distributed system of agents, tools, and replicas, and that reasoning, retrieval, and serving each become a distribution problem the moment the corpus, the user count, or the model outgrows a single box. By the end you will be able to read any agentic product as a cluster of cooperating services, and to see why retrieval, orchestration, serving, and cost control are the four axes its design must answer at once.

Chapter Overview

Part VIII assembles the book into end-to-end systems, and this is its fifth and most integrative assembly, the case study that asks every earlier part to run at once. The four case studies before it each isolated one binding constraint: a corpus too large to retrieve from one index, a model too sensitive to centralize, a recommender too wide for one host, a swarm too physical to coordinate from a center. This one removes the isolation. A production agentic application (a research assistant over a company's documents, a customer-support agent with tools, or a coding agent that reads a repository and acts on it) is distributed on every axis simultaneously because none of its jobs fits on one machine. The document corpus does not fit, so ingestion is a distributed data pipeline. The embeddings and the index do not fit, so retrieval is sharded vector search. The reasoning is multi-step and calls out to tools and sub-agents, so control is distributed agent orchestration. And the model is shared by thousands of concurrent users, so generation rides a distributed serving fleet. The application is the place where all four meet, and the chapter is the discipline of making them meet within a budget.

The defining shape of the system is an agentic loop wrapped around a retrieval-and-serving substrate. Section 40.1 fixes the problem and its constraints: the agentic task, the corpus that must be ingested, the security boundary that decides which passages a given user may retrieve, and the cost and latency budgets that rule out a design that simply calls the largest model on every step. Section 40.2 builds distributed document processing, the ingestion pipeline that parses, chunks, and cleans a corpus spread across many machines before any of it can be embedded or indexed. Section 40.3 stands up the embedding pipelines that turn that processed corpus into vectors at scale, a throughput problem in its own right and the bridge into search. Section 40.4 confronts retrieval head on with sharded vector search, the index partitioned across nodes and queried under the security-aware filtering that Section 40.1 made a requirement rather than an afterthought.
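As a rough illustration of the sharded, security-aware search that Section 40.4 develops, the Python sketch below scatters one query to every shard, filters each shard by the caller's permissions before scoring, and merges the per-shard top-k lists. It is a minimal sketch under stated assumptions: the names `Chunk`, `allowed_groups`, `search_shard`, and `search_all` and the toy data are hypothetical, and exact dot-product scoring stands in for an approximate-nearest-neighbor index.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    vector: tuple              # embedding, kept tiny for the example
    allowed_groups: frozenset  # hypothetical ACL attached at ingestion time


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_shard(shard, query, user_groups, k):
    """Top-k on one shard, scoring only chunks the user may see."""
    visible = (c for c in shard if c.allowed_groups & user_groups)
    scored = ((dot(c.vector, query), c.chunk_id) for c in visible)
    return heapq.nlargest(k, scored)


def search_all(shards, query, user_groups, k):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partials = [search_shard(s, query, user_groups, k) for s in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# Two toy shards; in a real deployment each would live on its own node.
shards = [
    [Chunk("hr-1", (1.0, 0.0), frozenset({"hr"})),
     Chunk("eng-1", (0.9, 0.1), frozenset({"eng", "hr"}))],
    [Chunk("eng-2", (0.2, 0.8), frozenset({"eng"})),
     Chunk("pub-1", (0.7, 0.3), frozenset({"eng", "hr", "public"}))],
]

results = search_all(shards, query=(1.0, 0.0),
                     user_groups=frozenset({"eng"}), k=2)
for score, chunk_id in results:
    print(chunk_id, round(score, 2))
```

Filtering before scoring on each shard means a forbidden chunk can never reach the merge step, and the caller's k results are not thinned out afterwards by a post-hoc permission check. In the toy data the best-matching chunk, `hr-1`, is invisible to a user in the `eng` group, so it never appears in the results.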

The middle of the chapter assembles retrieval and reasoning into an agent. Section 40.5 builds RAG at scale, wiring the sharded index of Section 40.4 into the generation loop so that grounding many concurrent queries in retrieved evidence stays correct and affordable, the case-study payoff of Chapter 36. Section 40.6 turns the single call into a system with distributed agent orchestration: the planner, tools, and sub-agents that the orchestration machinery of Chapter 32 coordinates across a network of services. Section 40.7 puts the generation itself on a distributed model-serving fleet with vLLM, where the continuous batching and paged attention of Chapter 24 become the throughput engine under every agent step.
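One way to picture the orchestration step is as a fan-out with deadlines: the planner delegates to several tools or sub-agents at once and refuses to wait indefinitely for any one of them. The sketch below uses Python's standard `asyncio` for this. The service names, the simulated delays, and the helper functions `sub_agent`, `call_with_deadline`, and `orchestrate` are all hypothetical stand-ins for real network calls, not the chapter's orchestration machinery.

```python
import asyncio


async def sub_agent(name: str, delay_s: float) -> str:
    """Stand-in for a remote sub-agent or tool; the sleep simulates latency."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def call_with_deadline(name: str, delay_s: float, deadline_s: float) -> str:
    """Give each delegated call a deadline so one slow service cannot stall the step."""
    try:
        return await asyncio.wait_for(sub_agent(name, delay_s), timeout=deadline_s)
    except asyncio.TimeoutError:
        return f"{name}: timed out"


async def orchestrate(plan: dict, deadline_s: float) -> list:
    """Fan the plan out concurrently; results come back in plan order."""
    calls = [call_with_deadline(n, d, deadline_s) for n, d in plan.items()]
    return await asyncio.gather(*calls)


# Hypothetical plan: service name -> simulated latency in seconds.
plan = {"search": 0.01, "calculator": 0.02, "slow_summarizer": 2.0}
results = asyncio.run(orchestrate(plan, deadline_s=0.2))
print(results)
```

Because the calls run concurrently, the step costs roughly the deadline rather than the sum of the individual latencies, and the planner receives an explicit timeout marker it can reason about instead of a hung request.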

The final stretch makes the assembled system affordable, measurable, and yours to extend. Section 40.8 takes on cost control across the fleet: routing cheap steps to small models and hard steps to large ones, caching what repeats, and capping the agentic loop so a single runaway query cannot spend the day's budget. Section 40.9 turns to evaluation, the hard problem of measuring whether a multi-step agentic application is actually correct, grounded, and within budget, from retrieval metrics through LLM-as-a-judge to end-to-end task success. Section 40.10 closes with a project extension that hands the reader the levers, swapping the index, adding tools and sub-agents, tightening the security boundary, or changing the serving and routing policy, so the case study becomes a system to build and defend rather than only to read. Read in order, the ten sections make the argument the whole of Part VIII has been building toward: a real distributed AI system is shaped by its binding constraints, and when an application must read everything, reason in many steps, and serve everyone at once, retrieval, orchestration, serving, and cost stop being separate concerns and become a single distributed design.
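The three cost levers named for Section 40.8 (routing, caching, and capping the loop) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the chapter's design: the `BudgetedAgent` class, the `hard:` prefix used as a routing signal, and the prices in `PRICE` are hypothetical, and `call_model` returns a string instead of calling a real model.

```python
from dataclasses import dataclass, field

# Hypothetical per-call prices in arbitrary cost units.
PRICE = {"small": 1, "large": 10}


@dataclass
class BudgetedAgent:
    max_steps: int
    max_cost: int
    spent: int = 0
    cache: dict = field(default_factory=dict)

    def route(self, step: str) -> str:
        """Toy routing rule: steps flagged 'hard:' go to the large model."""
        return "large" if step.startswith("hard:") else "small"

    def call_model(self, model: str, step: str) -> str:
        """Stand-in for a real model call."""
        return f"{model} answered {step!r}"

    def run(self, steps):
        outputs = []
        for i, step in enumerate(steps):
            if i >= self.max_steps:
                outputs.append("stopped: step cap reached")
                break
            if step in self.cache:  # repeated step: reuse the answer, spend nothing
                outputs.append(self.cache[step])
                continue
            model = self.route(step)
            if self.spent + PRICE[model] > self.max_cost:
                outputs.append("stopped: cost cap reached")
                break
            self.spent += PRICE[model]
            self.cache[step] = self.call_model(model, step)
            outputs.append(self.cache[step])
        return outputs


agent = BudgetedAgent(max_steps=5, max_cost=12)
trace = agent.run([
    "lookup order",
    "hard: reconcile invoices",
    "lookup order",               # served from the cache
    "hard: draft refund policy",  # would exceed the cost cap
])
print(trace, agent.spent)
```

In this run the repeated lookup costs nothing, and the second hard step is refused because it would push spending past the cap, which is the behavior that keeps a single runaway query from consuming the whole budget.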

Prerequisites

This chapter is the most integrative synthesis in the book, so it assumes the parts it composes rather than reteaching them. From Chapter 24 it assumes distributed LLM serving, the continuous batching, paged attention, and multi-replica fleet that Section 40.7 turns into the generation engine under every agent step. From Chapter 25 it assumes distributed retrieval and vector search, the sharded approximate-nearest-neighbor index that Section 40.4 queries under security-aware filtering. From Chapter 32 it assumes distributed agent orchestration, the planner, tool calls, and sub-agent coordination that Section 40.6 assembles into the agentic loop. From Chapter 36 it assumes the web-scale retrieval-augmented generation case study, the ingestion-to-grounding pipeline that Sections 40.2 through 40.5 specialize to a single application's corpus. From Chapter 22 it assumes per-node inference efficiency, the quantization and KV-cache numbers that Sections 40.7 and 40.8 multiply across the fleet to make a budget hold. From Chapter 26 it assumes MLOps for distributed AI, the deployment, monitoring, and evaluation discipline that Section 40.9 applies to the assembled agentic system. A reader comfortable with those threads can read this chapter as the place where retrieval, orchestration, serving, and operations finally run together as one application.
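The KV-cache arithmetic carried over from Chapter 22 can be made concrete with a small worked calculation. The sketch below uses the standard per-token KV-cache size (a key and a value vector per layer per KV head) and asks how many full-length sequences fit in a replica's cache budget. The model shape, the 40 GiB budget, the 4,096-token sequences, and the 16-replica fleet are hypothetical numbers chosen for round results, and the estimate ignores weights, activations, and block fragmentation.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_bytes, tokens_per_seq, per_token_bytes):
    # Upper bound: every sequence is assumed to hold its full token count.
    return kv_budget_bytes // (tokens_per_seq * per_token_bytes)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
int8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

per_replica_fp16 = max_concurrent_sequences(40 * GIB, 4096, fp16)
per_replica_int8 = max_concurrent_sequences(40 * GIB, 4096, int8)
fleet_fp16 = 16 * per_replica_fp16  # hypothetical 16-replica fleet

print(fp16 // 1024, "KiB per token at fp16")
print(per_replica_fp16, per_replica_int8, fleet_fp16)
```

Under these assumptions each token costs 128 KiB of cache at 16-bit precision, a replica holds 80 full-length sequences, and halving the element size doubles that to 160, which is the sense in which per-node numbers multiply across the fleet.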

Learning Objectives

The One Idea to Carry Out of This Chapter

If you keep one thing from this chapter, keep this: an agentic application is a distributed system of cooperating agents, tool services, and model replicas that spans every axis of distribution at once, and its reasoning, retrieval, and serving each become distribution problems the instant the corpus, the user count, or the model stops fitting on one machine. The earlier case studies each pivoted on one binding constraint; this one pivots on the fact that an agentic product has all of them together. The corpus does not fit, so ingestion, embedding, and indexing are distributed data pipelines feeding a sharded vector index. The right passages depend on who is asking, so retrieval carries a security boundary, not just a similarity score. The reasoning is multi-step and reaches out to tools and sub-agents, so control is orchestration across a network of services rather than a single prompt. The model is shared by thousands of users, so generation rides a serving fleet whose throughput, governed by continuous batching and paged attention, sets the ceiling on how much an agent may think. And because every step costs tokens and milliseconds, the whole loop runs against a budget that forces cheap steps onto small models, caches what repeats, and caps runaway reasoning before it can spend the day. Read forward, the chapter walks that system from the agentic problem to the deployed, evaluated, budgeted application. Read as a question, it is the checklist you carry into any agentic product: where does the corpus live, who may retrieve what, how does the reasoning fan out across services, what fleet serves the tokens, and does the whole loop stay inside its latency and cost budget? The roadmap below walks the ten sections that build that application end to end.

Chapter Roadmap

Read the ten sections in order and you will have traced one realistic system from an agentic problem to a deployed, evaluated, budget-bound application built on a corpus that does not fit, retrieval that respects who is asking, reasoning that fans out across services, and a fleet that serves everyone at once: Sections 40.1 through 40.4 fix the problem and build the distributed data path from ingestion through embedding to sharded, security-aware search; Sections 40.5 through 40.7 assemble retrieval, orchestration, and a serving fleet into a working agent; and Sections 40.8 through 40.10 make that agent affordable, measurable, and yours to extend. The thread to watch runs back to Chapter 25 and Chapter 24: the sharded index built there to retrieve fast and the serving fleet built there to generate cheaply return here as the two ends of every agent step, which is why Section 40.4 and Section 40.7 are the technical hinges on which the whole application turns.

What's Next?

This chapter assembled the book into one running application: an agentic system that ingests a corpus, retrieves under a security boundary, reasons across tools and sub-agents, and serves thousands of users from a model fleet, all inside a cost and latency budget. It is the last of the worked case studies, and with it the integrative survey of real distributed AI systems is complete. Chapter 41: Capstone Project Design turns the book from reading systems to building one of your own. Where the case studies handed you finished architectures to dissect, the capstone hands you the design space and asks you to choose: pick a problem, name the ceiling that forces it to distribute, select the axis that answers that ceiling, and defend the trade in communication and cost the choice commits you to. The six axes of Chapter 1 and the design-space checklist you met there return as the rubric for your own distributed AI system. Read it next to turn five case studies' worth of pattern recognition into a project you propose, build, and defend.

Bibliography & Further Reading

Agents & Tools

Yao, S., Zhao, J., Yu, D., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arXiv:2210.03629

Interleaves chain-of-thought reasoning with tool-calling actions so a model can plan, act, and observe in a loop; the reasoning-and-acting pattern at the core of the agentic orchestration in Section 40.6.

📄 Paper

Schick, T., Dwivedi-Yu, J., Dessì, R., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023. arXiv:2302.04761

Shows a model can learn when and how to call external tools (search, calculator, APIs) from self-supervised data; the grounding for the tool services that an agent step in Section 40.6 dispatches to.

📄 Paper

Patil, S. G., Zhang, T., Wang, X., Gonzalez, J. E. "Gorilla: Large Language Model Connected with Massive APIs." NeurIPS 2024. arXiv:2305.15334

Trains a model to select and invoke the right API from a large, changing catalog; the tool-routing problem that scales when an agent in Section 40.6 has many services to choose from.

📄 Paper

Shinn, N., Cassano, F., Gopinath, A., et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023. arXiv:2303.11366

Has an agent reflect on its own failed trajectories in natural language and retry better; the self-correction loop that makes the multi-step reasoning of Section 40.6 more robust without retraining.

📄 Paper

Wu, Q., Bansal, G., Zhang, J., et al. "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." 2023. arXiv:2308.08155

A framework for building applications from multiple conversing agents that delegate and coordinate; a concrete realization of the sub-agent orchestration that Section 40.6 distributes across services.

🔧 Tool

RAG & Serving

Lewis, P., Perez, E., Piktus, A., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. arXiv:2005.11401

The paper that named RAG, coupling a neural retriever to a generator so answers are grounded in retrieved passages; the foundation for the RAG-at-scale loop of Section 40.5.

📄 Paper

Kwon, W., Li, Z., Zhuang, S., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180

Introduces paged attention, which stores the KV cache in fixed-size blocks to cut memory waste, and builds the vLLM engine around it together with continuous (iteration-level) batching to raise serving throughput sharply; the engine of the distributed model-serving fleet in Section 40.7.

📄 Paper

Khattab, O., Singhvi, A., Maheshwari, P., et al. "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." ICLR 2024. arXiv:2310.03714

Treats a multi-step LLM pipeline as a program to be compiled and optimized rather than hand-prompted; the systematic way to build and tune the retrieval-and-reasoning chains of Section 40.5 and Section 40.6.

🔧 Tool

Evaluation & Protocols

Es, S., James, J., Espinosa-Anke, L., Schockaert, S. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." 2023. arXiv:2309.15217

A reference-free framework that scores faithfulness, answer relevance, and context relevance for RAG systems; the retrieval-and-grounding metrics that anchor the evaluation of Section 40.9.

🔧 Tool

Zheng, L., Chiang, W.-L., Sheng, Y., et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685

Validates using a strong model as an automated judge of open-ended answers and surfaces its biases; the LLM-as-a-judge backbone of the end-to-end task evaluation in Section 40.9.

📄 Paper

Anthropic. "Model Context Protocol (MCP)." 2024. modelcontextprotocol.io

An open protocol standardizing how agents connect to tools, data sources, and context servers; the interface discipline that lets the orchestrated tools and sub-agents of Section 40.6 compose across a network of services.

📖 Spec