Section 32.2: Tool Use and Function Calling

"The model said it wanted to call search. I called search. The search timed out. The model, serene, said it wanted to call search again. We are, the two of us, a distributed system now."
A Tool Runtime Holding a Half-Open Socket

Big Picture

A language model reasons; a tool call is how that reasoning reaches the world, and the moment it reaches the world it becomes a remote procedure call with all the failure modes of one. Tool use, also called function calling, lets a model emit a structured request (a function name plus JSON arguments) that the agent runtime executes against an external service (a search API, a database, a code sandbox, another agent) and feeds the result back into the context. This is the act in the reason-act-observe loop. It looks like a prompting trick, but the instant the runtime crosses the process boundary it inherits latency, timeouts, partial failure, retries, rate limits, idempotency, and round-trip cost. This section treats tool calling for what it operationally is: the smallest distributed system in the book, one model and one service, and the place where an agent's reliability is won or lost.

The previous section, Section 32.1, established that an LLM agent is a distributed component: a stateful node that communicates with other nodes over a network. That framing is abstract until we name the thing the agent actually does. An agent that only emits text is a chatbot. An agent becomes an agent the moment it can act, and acting means invoking tools. Tool use is the interface between the model's internal reasoning and every external system it must touch to be useful, and because those systems live on other machines, every act is a network call. This section is about that interface: what a tool call is, why it is a distributed-systems concern rather than a prompting one, and what the runtime must do so that an agent built on flaky remote services still finishes its task.

Figure 32.2.1: The reason-act-observe (ReAct) loop. The model emits a structured tool call (top left); the runtime executes it as a remote procedure call against one or more external services (top right), where calls may fan out in parallel, fail and be retried with backoff, or time out and fall back. Each returned result becomes an observation appended to the model's context, and the loop repeats until the model emits a final answer. Every pass costs one LLM call plus one or more network round trips.

1. What a Tool Call Is Beginner

A tool call has three parts. First, the developer declares a set of tools to the model: each has a name, a natural-language description of what it does, and a typed schema for its arguments (a JSON Schema in every major API). Second, during generation the model does not return prose; it returns a structured object naming one tool and supplying arguments that conform to that schema, for example {"name": "get_weather", "arguments": {"city": "Paris"}}. Third, the runtime parses that object, dispatches it to the actual function or service, and returns the result into the model's context as a new message. The model then continues, either emitting another tool call or producing a final answer. The model never executes anything itself; it only ever requests an action, and the runtime, code you write, decides whether and how to carry it out.
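As a minimal sketch of those three parts, assume a hypothetical `get_weather` tool and a model response that has already arrived as a JSON string. The names `TOOL_SPECS`, `REGISTRY`, and `dispatch` are illustrative and do not come from any provider SDK.

```python
import json

# Part 1: the declaration the model sees (name, description, JSON Schema).
TOOL_SPECS = [{
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city):
    # Stand-in for a real remote service; the value is made up for the demo.
    return f"weather in {city}: 18C, cloudy"

# The runtime's private mapping from declared names to real callables.
REGISTRY = {"get_weather": get_weather}

def dispatch(raw_call):
    """Part 3: parse the model's structured request and execute it."""
    call = json.loads(raw_call)
    fn = REGISTRY.get(call["name"])
    if fn is None:                      # the model asked for a tool we never declared
        return f"error: unknown tool {call['name']!r}"
    return fn(**call["arguments"])

# Part 2: what the model emits instead of prose.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
observation = dispatch(model_output)
print(observation)
```

The registry lookup is where the runtime keeps its veto: a name the model invents is turned into an error observation rather than executed.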

This separation is the whole point. The model is a reasoning engine that proposes actions; the runtime is the trusted boundary that performs them. Everything that makes tool use a distributed-systems problem lives on the runtime side of that boundary, because that is where the network is. The model's request is local and cheap; executing it is remote and fallible.

Key Insight: A Tool Call Is a Remote Procedure Call Wearing a Schema

The structured object a model emits is indistinguishable, operationally, from any other RPC: a name, typed arguments, a remote target, and a result that may never arrive. Once you see a tool call this way, the entire toolkit of distributed systems applies unchanged. You need timeouts because the service may hang. You need retries because the network drops packets. You need idempotency keys because a retry may double-execute a side effect. You need rate limiting because the service has a quota. The novelty is only the caller: a probabilistic model that may emit malformed arguments, call the wrong tool, or loop forever. The remedies are the same ones Chapter 4 introduced for communication and Chapter 23 for serving.

2. The ReAct Loop: Reason, Act, Observe Beginner

Tool use becomes agency when it iterates. The dominant pattern, named ReAct (reasoning and acting), interleaves a reasoning step with an action step in a loop. The model reasons about the goal and the observations so far, emits a tool call (the action), the runtime executes it and returns the result (the observation), and the observation is appended to the context for the next reasoning step. The loop continues until the model decides it has enough to answer and emits a final response instead of a tool call. Figure 32.2.1 traces one full circuit. The trajectory the model produces, an alternating sequence of thoughts, actions, and observations, is sometimes called the agent's scratchpad, and it is the entire working memory of the agent for that task.

From a systems standpoint the loop has a precise cost structure. Let a task take $T$ turns to complete. Each turn pays for one LLM generation plus the round trip of whatever tools it calls. If $c_{\text{llm}}$ is the latency of one model call and $c_{\text{tool}}$ the latency of a (sequential) tool call, the wall-clock for the trajectory is approximately

$$\text{latency} \approx \sum_{t=1}^{T} \big( c_{\text{llm}}^{(t)} + c_{\text{tool}}^{(t)} \big),$$

a sum that grows linearly in the number of turns. This is why agent latency is dominated not by any single call but by the depth of the loop, and why two of the most effective optimizations are reducing the number of turns and running independent tool calls within a turn concurrently. The tie to serving is direct: $c_{\text{llm}}$ is exactly the per-request latency that Chapter 23 teaches you to drive down, and an agent multiplies it by $T$.
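To make the linear growth concrete, the sketch below evaluates the sum for two trajectories. The figures (600 ms per model call, 150 ms per tool round trip) are assumed for illustration, not measurements, and `trajectory_latency` is a hypothetical helper.

```python
def trajectory_latency(turns):
    """Sum per-turn (c_llm + c_tool) over a trajectory, in milliseconds."""
    return sum(c_llm + c_tool for c_llm, c_tool in turns)

# Assumed costs: 600 ms per model call, 150 ms per sequential tool call.
four_turns = [(600, 150)] * 4
two_turns = [(600, 150)] * 2

print(trajectory_latency(four_turns))  # 3000
print(trajectory_latency(two_turns))   # 1500
```

Halving the number of turns halves the latency even though no individual call got faster, which is the sense in which loop depth dominates.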

3. Structured Output Makes the Call Reliable Intermediate

An RPC is only callable if its arguments parse. A model that emits {"city": "Paris" with a missing brace, or invents an argument the schema does not declare, produces a call the runtime cannot dispatch. The fix is to constrain the model's output so it is guaranteed to conform to the tool's schema. Modern serving stacks do this with constrained decoding: at each generation step the set of permissible next tokens is masked to exactly those that keep the partial output a valid prefix of the schema. The model literally cannot emit a malformed call. This is the same structured-generation machinery that Chapter 22 develops as a per-node serving feature, here doing double duty: it turns a probabilistic text generator into a reliable emitter of typed RPCs.

Constrained decoding moves a class of failures from runtime to impossible. Without it, a meaningful fraction of tool calls fail to parse and must be repaired by re-prompting, which costs another full LLM call and another turn; with it, the call is well-formed by construction. What remains are semantic mistakes (a schema-valid call to the wrong tool, or with the wrong value) and failures on the service side, which Section 5 addresses. Reliability of the agent is built up in layers: structure the output so the call is valid, then harden the execution so a valid call still completes when the service is flaky.
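Where constrained decoding is unavailable, the runtime can still check a call before dispatching it. The sketch below is a minimal standard-library validator under stated assumptions: it handles only flat object schemas with primitive types, and `validate_args` and `TYPE_MAP` are illustrative names rather than part of any library. It is a runtime-side check, not constrained decoding itself.

```python
import json

# Assumed mapping from JSON Schema primitive names to Python types.
# Note: bool passes as "integer" here; a fuller validator would exclude it.
TYPE_MAP = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

def validate_args(raw, schema):
    """Return (args, None) if raw parses and fits the schema, else (None, error)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"malformed JSON: {e.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            return None, f"missing required argument {key!r}"
    for key, value in args.items():
        if key not in props:                       # the model invented an argument
            return None, f"undeclared argument {key!r}"
        expected = TYPE_MAP.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            return None, f"argument {key!r} should be {props[key]['type']}"
    return args, None

schema = {"type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]}

print(validate_args('{"city": "Paris"}', schema))
print(validate_args('{"city": "Paris"', schema))   # missing closing brace
```

The error string is short enough to hand back to the model as an observation, which is the re-prompting repair path the text describes.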

4. Parallel Tool Calls: An Embarrassingly Parallel Fan-Out Intermediate

Often a model needs several independent facts at once: the weather in three cities, the price of two products, the contents of four files. A model that supports parallel tool calling emits all of them in a single turn, as a list of calls rather than one. Because the calls are independent, the runtime can execute them concurrently and gather the results before the next reasoning step. This is an embarrassingly parallel fan-out, structurally identical to the map phase of Chapter 23's batched serving and to the data-parallel split of Chapter 4: no call depends on another, so they collapse onto whatever concurrency the runtime can muster, and the turn's tool latency becomes the maximum of the calls rather than their sum.

The payoff is large precisely because of the linear-in-turns cost from Section 2. Three facts gathered in three sequential turns cost three LLM calls and three round trips, serialized; the same three facts requested in one parallel turn cost one LLM call and one round trip equal to the slowest tool. Parallel tool calling attacks both terms of the latency sum at once, which is why it is the first optimization to reach for in any agent that gathers information before acting. The demo in Section 6 issues exactly such a parallel fan-out of two searches.
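The same point can be written in the notation of Section 2. Assume turn $t$ issues $k_t$ independent calls with latencies $c_{\text{tool}}^{(t,j)}$; the indices $k_t$ and $j$ are introduced here for illustration. Run one after another, the calls cost their sum; run concurrently, they cost their maximum:

$$c_{\text{tool,seq}}^{(t)} = \sum_{j=1}^{k_t} c_{\text{tool}}^{(t,j)}, \qquad c_{\text{tool,par}}^{(t)} = \max_{1 \le j \le k_t} c_{\text{tool}}^{(t,j)},$$

so that with concurrent execution

$$\text{latency} \approx \sum_{t=1}^{T} \Big( c_{\text{llm}}^{(t)} + \max_{1 \le j \le k_t} c_{\text{tool}}^{(t,j)} \Big).$$

As a worked instance with assumed figures, three calls taking 120 ms, 200 ms, and 150 ms cost $120 + 200 + 150 = 470$ ms sequentially and $\max(120, 200, 150) = 200$ ms concurrently, before counting the LLM calls saved by issuing them in one turn.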

Fun Note: The Model Discovers Concurrency by Accident

Nobody trained a language model in the theory of parallel computing, yet a model fine-tuned for tool use will routinely emit a batch of independent calls when it needs several facts, having absorbed from its training data that these things can be asked together. It is a small, unreasonable delight that the cheapest concurrency in your agent stack is sometimes proposed, unprompted, by the least systems-aware component in it. Your job is only to honor the request by actually running the calls at the same time.

5. When Tools Fail: Timeouts, Retries, Fallbacks Advanced

External services fail. They return 503s under load, they hang past any reasonable deadline, they enforce rate limits, they return errors for inputs the model thought were valid. An agent runtime that assumes its tools always succeed is an agent that stalls or crashes the first time reality intrudes, which is immediately. The runtime must therefore wrap every tool call in the standard reliability machinery: a timeout so a hung service cannot block the loop forever, a bounded number of retries with backoff so transient failures self-heal, and a fallback (a cached value, a default, or a graceful "this information is unavailable") so that an exhausted-retry tool degrades the answer instead of aborting the task. These are exactly the patterns Chapter 23 develops for reliable serving, applied now one level up, to the tool call rather than the model call.
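One way to enforce a deadline the caller actually controls is to run the call in a worker thread and stop waiting when the budget expires. The sketch below uses the standard library's `concurrent.futures`; `call_with_deadline`, `slow_service`, and the pool size are hypothetical choices for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)   # shared worker pool (illustrative size)

def call_with_deadline(fn, args, timeout_s, fallback):
    """Run fn(**args) in a worker thread; stop waiting after timeout_s seconds."""
    future = _POOL.submit(fn, **args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()          # only effective if the call has not started yet
        return fallback

def slow_service(query):
    time.sleep(0.3)              # simulates a hung upstream
    return f"result for {query}"

start = time.monotonic()
value = call_with_deadline(slow_service, {"query": "x"}, timeout_s=0.05,
                           fallback="[unavailable]")
waited = time.monotonic() - start
print(value, f"waited {waited:.2f}s")
```

The caller is unblocked at the deadline, but the worker thread keeps running until the service returns, because Python cannot kill a thread. A real client would also set a socket-level timeout so the abandoned call releases its resources.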

There is a subtlety the model itself can help with. Because the observation is fed back into the model's context, a tool error is not necessarily fatal: a well-prompted agent that receives "search failed: rate limited" can reason about it, wait, rephrase the query, or try a different tool. The runtime handles the mechanical retries (transient network faults the model should never see), and the model handles the semantic recovery (a genuinely failed action it must route around). Designing which failures the runtime absorbs silently and which it surfaces to the model as observations is one of the central craft decisions in building a robust agent.
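The split between absorbed and surfaced failures can be sketched as a small classification function. Everything here is an assumed policy rather than a standard: `ToolError`, the set of transient status codes, and the choice to surface rate limiting to the model all stand in for decisions you would tune per service.

```python
class ToolError(Exception):
    """Hypothetical error type carrying an HTTP-like status code."""
    def __init__(self, status, message):
        super().__init__(message)
        self.status = status

# Assumed policy: gateway-style faults are transient; 429 is left for the model,
# which can wait, rephrase, or switch tools.
TRANSIENT_STATUSES = {502, 503, 504}

def classify(exc):
    """Return 'retry' for faults the runtime absorbs, 'surface' for the model."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "retry"
    if isinstance(exc, ToolError) and exc.status in TRANSIENT_STATUSES:
        return "retry"
    return "surface"

def to_observation(tool_name, exc):
    """Render a durable failure as text the model can reason about."""
    return f"{tool_name} failed: {exc}"

print(classify(ToolError(503, "service busy")))
print(to_observation("search", ToolError(429, "rate limited")))
```

Keeping the policy in one function makes the craft decision explicit and testable, instead of scattering it across `except` clauses.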

Practical Example: The Research Agent That Survived a Flaky Search API

Who: A platform engineer building an internal research assistant that answers analyst questions by querying several third-party data APIs.

Situation: The agent worked in demos but failed roughly one task in five in production, returning "I was unable to complete that" for questions it had answered minutes earlier.

Problem: One upstream search API returned intermittent 503s and occasionally took eight seconds to respond. A single failed call aborted the whole ReAct trajectory, wasting every LLM call spent so far.

Dilemma: Push the failure to the model and let it reason its way around every transient fault (costly extra turns, slower, but flexible), or absorb transient faults in the runtime and only surface durable ones to the model (faster and cheaper, but the runtime must classify failures correctly).

Decision: They split responsibilities. The runtime retried transient network faults three times with exponential backoff under a two-second per-call timeout, and surfaced only durable errors (a 400, an empty result, an exhausted retry budget) to the model as observations.

How: They wrapped each tool in a thin executor identical in shape to call_tool in Code 32.2.1, added a per-tool fallback value, and logged every attempt for later analysis of which tools were actually flaky.

Result: Task success rose from about 80 percent to over 99 percent, median latency fell because most transient faults now self-healed in tens of milliseconds instead of costing a full extra reasoning turn, and the cost per task dropped because fewer trajectories were thrown away half-finished.

Lesson: An agent is only as reliable as its flakiest tool unless the runtime makes the tool look reliable. Retries, timeouts, and fallbacks are not optional polish; they are the difference between a demo and a service.

6. From Scratch: A ReAct Loop That Survives a Flaky Tool Intermediate

The code below implements a complete, if miniature, tool-calling agent in pure Python: a stub model that emits structured calls based on its scratchpad, a runtime that executes tools as fallible remote calls with timeout, retry, and fallback, and a ReAct loop that ties them together. The search tool is deliberately flaky: about 45 percent of attempts raise a 503 and a further 15 percent respond too slowly to meet the timeout, so roughly 60 percent of attempts fail, exactly the condition Section 5 describes. The model requests a fan-out of two searches in its first turn (this miniature runtime executes them one after the other, which keeps the output deterministic), then a calculation, then a final answer. We are interested in whether the agent completes its task despite the failing tool.

import time, random

random.seed(7)

# ----- The "outside world": tools the runtime executes as remote calls. -----
def tool_search(query):
    roll = random.random()                     # a flaky remote service
    if roll < 0.45:
        raise ConnectionError("search service 503")
    if roll < 0.60:
        time.sleep(0.40)                       # slow response -> will time out
    return f"top hit for '{query}': population 2.16 million"

def tool_calc(expr):
    # Demo only: emptying __builtins__ does NOT make eval safe on untrusted input.
    return str(eval(expr, {"__builtins__": {}}, {}))   # arithmetic on the stub's own string

TOOLS = {"search": tool_search, "calc": tool_calc}

# ----- The stub "LLM": a policy that emits structured calls from the trace. -----
class StubLLM:
    def decide(self, scratchpad):
        if "top hit" not in scratchpad:
            # Two tools at once: an embarrassingly parallel fan-out.
            return [{"name": "search", "args": {"query": "Paris population"}},
                    {"name": "search", "args": {"query": "Lyon population"}}]
        if "answer" not in scratchpad:
            return [{"name": "calc", "args": {"expr": "2160000 + 513000"}}]
        return [{"name": "final", "args": {"text": scratchpad.split("answer=")[1]}}]

# ----- The runtime: each call is an RPC with timeout, retries, fallback. -----
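def call_tool(name, args, retries=3, timeout=0.25):
    """Execute one tool call with up to `retries` attempts; None means all failed."""
    fn = TOOLS[name]
    for attempt in range(1, retries + 1):
        start = time.monotonic()                 # monotonic clock for elapsed time
        try:
            result = fn(**args)
            elapsed = time.monotonic() - start
            # Simulated deadline: checked after the call returns, so it rejects a
            # slow reply but would not interrupt a call that never returns.
            if elapsed > timeout:
                raise TimeoutError(f"{elapsed*1000:.0f}ms > {timeout*1000:.0f}ms budget")
            print(f"    [{name} attempt {attempt}] ok in {elapsed*1000:4.0f}ms")
            return result
        except (ConnectionError, TimeoutError) as e:
            print(f"    [{name} attempt {attempt}] FAILED: {e}")
            if attempt < retries:
                time.sleep(0.02 * 2 ** (attempt - 1))   # exponential backoff: 20ms, 40ms
    return None                                  # all attempts failed -> caller falls back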

# ----- The ReAct loop: reason -> act (parallel) -> observe, until done. -----
def react(llm, goal, max_turns=6):
    scratchpad = f"goal={goal}"
    for turn in range(1, max_turns + 1):
        calls = llm.decide(scratchpad)
        print(f"  turn {turn}: model requested {[c['name'] for c in calls]}")
        if calls[0]["name"] == "final":
            return calls[0]["args"]["text"]
        obs = []                                 # calls requested together; run one by one here
        for c in calls:
            r = call_tool(c["name"], c["args"])
            obs.append(r if r is not None else "[fallback: cached 0]")
        scratchpad += " | " + " ; ".join(obs)
        if "answer" not in scratchpad and any(c["name"] == "calc" for c in calls):
            scratchpad += f" answer=combined metro population is {obs[-1]}"   # use the calc result
    return "[gave up]"

print("ReAct tool-calling loop (stub LLM, flaky search service):")
print("FINAL ANSWER:", react(StubLLM(), goal="combined metro population of Paris and Lyon"))

Code 32.2.1: A complete ReAct tool-calling agent in standard-library Python. The model proposes structured calls; call_tool executes each as a fallible RPC with a timeout check, up to three attempts with backoff between them, and a fallback supplied by the loop when every attempt fails; the loop observes results and repeats. The search tool fails most of the time on purpose.

ReAct tool-calling loop (stub LLM, flaky search service):
  turn 1: model requested ['search', 'search']
    [search attempt 1] FAILED: search service 503
    [search attempt 2] FAILED: search service 503
    [search attempt 3] ok in    0ms
    [search attempt 1] FAILED: search service 503
    [search attempt 2] FAILED: 400ms > 250ms budget
    [search attempt 3] FAILED: search service 503
  turn 2: model requested ['calc']
    [calc attempt 1] ok in    0ms
  turn 3: model requested ['final']
FINAL ANSWER: combined metro population is 2673000

Output 32.2.1: The agent finishes despite the flaky tool. In turn 1 the two searches are requested together; the first succeeds on its third attempt, the second exhausts all three attempts (including a 400ms response that breaches the 250ms timeout) and falls back. The loop carries on to the calculation and a final answer. No transient failure aborted the task.

Notice what the output demonstrates concretely. The fan-out appears as two searches requested in one turn. Retries appear as repeated attempts with backoff. A timeout appears as the 400ms response rejected against the 250ms budget. A fallback appears when the second search exhausts its attempts and the runtime substitutes a default rather than crashing. And despite every one of those failures, the trajectory reaches a final answer in three turns. The stub model supplies the operands of the calculation itself, so the demo shows that the loop completes, not that a fallback value produces a right answer. That is the entire thesis of Section 5 made executable: the agent's reliability is the runtime's reliability, not the tool's.

Library Shortcut: The Provider SDK Runs the Loop for You

The hand-rolled loop in Code 32.2.1 exists to show the machinery; in production you declare tools with a schema and let the provider SDK or an agent framework drive the reason-act-observe cycle, handle parallel tool calls, and surface structured calls you simply dispatch. With the OpenAI tools interface the loop collapses to a tool declaration plus a dispatch table:

import json
from openai import OpenAI
client = OpenAI()

tools = [{                                        # JSON-Schema tool declaration
  "type": "function",
  "function": {
    "name": "search",
    "description": "Search the web for a fact.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]}}}]

messages = [{"role": "user", "content": "combined metro population of Paris and Lyon"}]
for _ in range(8):                                 # cap turns so the loop cannot spin forever
    resp = client.chat.completions.create(model="gpt-4o", tools=tools, messages=messages)
    msg = resp.choices[0].message
    if not msg.tool_calls:                         # model emitted a final answer
        print(msg.content)
        break
    messages.append(msg)
    for tc in msg.tool_calls:                       # may be several: parallel calls
        result = call_tool(tc.function.name, json.loads(tc.function.arguments))
        if result is None:                          # attempts exhausted: tell the model
            result = "tool unavailable"
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": str(result)})

Code 32.2.2: The same loop with the OpenAI tools API. The provider returns structured tool_calls whose arguments arrive as a JSON string; OpenAI documents that the arguments are guaranteed to match the schema only when strict schema enforcement is enabled for the function, so without it the runtime should still guard the json.loads call. The API supports several calls in one response for parallel execution, and you supply only the dispatch table and the same call_tool reliability wrapper from Code 32.2.1. A higher-level framework such as LangChain's create_react_agent or LangGraph wraps even this, leaving you to register tools and define the graph; you still own the timeout and retry policy, because the failure is on your side of the network.

7. Tools Have Side Effects: Security and Sandboxing Advanced

A search tool reads; a tool that sends an email, executes shell commands, runs arbitrary code, or moves money writes, and that changes the stakes entirely. The model proposing the call is a probabilistic system that can be steered by adversarial input in its context (a malicious document it retrieved, a prompt-injection payload hidden in a tool result), so a tool call must never be trusted simply because the model emitted it. Every consequential tool is a privilege the agent holds, and the runtime is the security boundary that decides whether to exercise it. The defenses are concrete: run code-execution tools in a sandbox with no network and no filesystem access beyond a scratch directory; require least-privilege credentials scoped to exactly what the tool needs; gate irreversible or high-impact actions behind human confirmation; and make write tools idempotent so a retry cannot send the same email twice. These guardrails are part of the broader deployment discipline that Chapter 26 develops for operating AI systems safely in production.
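A confirmation gate of the kind described above can be sketched as a default-deny policy check that runs before dispatch. The tool names, the two sets, and the `authorize` function are hypothetical; a real deployment would also scope credentials and sandbox execution, which this sketch does not attempt.

```python
# Assumed policy tables: which tools only read, and which need a human's approval.
READ_ONLY = {"search", "get_weather"}
NEEDS_CONFIRMATION = {"send_email", "transfer_funds"}

def authorize(call, confirm):
    """Decide whether the runtime may execute a model-proposed call.

    `confirm` is a callback that asks a human and returns True or False.
    """
    name = call["name"]
    if name in READ_ONLY:
        return True
    if name in NEEDS_CONFIRMATION:
        return bool(confirm(call))
    return False                  # default deny: unknown tools are never executed

def always_no(call):
    return False                  # stands in for a human who declines

print(authorize({"name": "search", "args": {"query": "Lyon"}}, always_no))
print(authorize({"name": "send_email", "args": {"to": "a@example.com"}}, always_no))
```

Because the check lives in the runtime and not in the prompt, a prompt-injection payload that persuades the model to propose a write still has to pass it.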

Idempotency deserves a second mention because it sits exactly at the intersection of this section's two themes. Section 5 wants the runtime to retry failed calls; Section 7 warns that retrying a write can double-execute a side effect. The reconciliation is the same as in any distributed system: attach an idempotency key to each write so the service can deduplicate retries. Tool use forces you to confront, at the smallest scale, the precise problem that makes distributed systems hard, an action that may have happened, may not have happened, or may have happened twice, and to engineer it away before an agent acts on the world.
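The deduplication can be sketched with an in-memory service. `EmailService` is a hypothetical stand-in for a real write API, and the key is minted once per logical action so that every retry of that action carries the same key.

```python
import uuid

class EmailService:
    """Hypothetical write service that deduplicates on an idempotency key."""
    def __init__(self):
        self.sent = []            # side effects that actually happened
        self._seen = {}           # idempotency key -> stored response

    def send(self, to, body, idempotency_key):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]   # replay: no new side effect
        self.sent.append((to, body))
        response = {"status": "sent", "id": len(self.sent)}
        self._seen[idempotency_key] = response
        return response

service = EmailService()
key = str(uuid.uuid4())           # one key per logical action, not per attempt

first = service.send("analyst@example.com", "report ready", key)
retry = service.send("analyst@example.com", "report ready", key)  # e.g. after a lost reply
print(len(service.sent), first == retry)
```

The retry returns the stored response and sends nothing, so the runtime can retry a write as freely as a read. A production service would persist the key table and expire old entries, which the sketch omits.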

Thesis Thread: The Smallest Distributed System Is One Model and One Tool

This book's spine is that intelligence at scale is distributed across machines that must communicate and coordinate to act as one. A single tool call is that thesis in miniature: a reasoning node and an acting service, separated by a network, requiring exactly the primitives the earlier parts built. The round-trip cost is the communication cost of Chapter 4; the per-call latency is the serving latency of Chapter 23; the retries and fallbacks are its reliability patterns (Section 23.7); the parallel fan-out is the embarrassing parallelism of every map phase in the book. When the next section composes many such agents into planner-executor and role-specialized teams, it is composing nodes whose fundamental unit of action is the distributed call you just built by hand. Tool use is where multi-agent orchestration becomes a distributed-systems problem rather than a prompting one.

8. Research Frontier Advanced

Research Frontier: Function-Calling Agents and Computer Use (2024 to 2026)

Tool use is one of the most active frontiers in applied AI. On the model side, function-calling has become a first-class trained capability with public benchmarks such as the Berkeley Function-Calling Leaderboard (Patil et al., 2024) measuring accuracy, parallel-call handling, and hallucinated-argument rates across models. Tool-use agent benchmarks like ToolBench and $\tau$-bench (Yao et al., 2024) evaluate multi-turn ReAct trajectories against real and simulated APIs, where reliability under flaky tools, exactly the concern of Code 32.2.1, is a measured axis rather than an afterthought. The boldest direction is computer use: Anthropic's computer-use models (2024) and the open OSWorld benchmark (Xie et al., 2024) cast the screen itself as the tool, the model emitting low-level clicks and keystrokes as structured calls, which turns every desktop application into a callable service and makes sandboxing and human confirmation from Section 7 load-bearing rather than precautionary. A parallel standardization push, the Model Context Protocol, reframes tool declaration and invocation as a network protocol between agents and tool servers, the subject of Section 32.6; the through-line is that the field now treats a tool call as infrastructure to be specified, benchmarked, and hardened, not a prompt to be tuned.

The next section moves up a level. Having established how a single agent acts through tool calls, Section 32.3 composes agents into planner-executor and role-specialized teams, where one agent's output becomes another's input and the distributed call you built here becomes the edge between cooperating nodes.

Exercise 32.2.1: Classify the Failures Conceptual

For each tool, decide whether a transient failure should be retried silently by the runtime, surfaced to the model as an observation, or both, and justify your choice: (a) a read-only weather lookup that occasionally returns a 503; (b) a tool that charges a customer's credit card; (c) a database query that returns an empty result for a valid-but-rare input; (d) a code-execution sandbox that times out on an infinite loop the model wrote. For each, state what the fallback should be and whether an idempotency key is required. Connect your answers to the runtime-versus-model split in Section 5 and the side-effect concerns of Section 7.

Exercise 32.2.2: Measure the Cost of Sequential Tools Coding

Extend Code 32.2.1 so the model needs four independent facts. Implement it two ways: first as four separate single-call turns (sequential), then as one turn that fans out all four calls and runs them concurrently with concurrent.futures.ThreadPoolExecutor. Give each tool a fixed 100ms latency, measure wall-clock for both versions, and confirm the parallel version's tool time approaches the maximum of the four rather than their sum, as Section 4 predicts. Report how the saving scales as you add more facts.

Exercise 32.2.3: Budget the ReAct Loop Analysis

Using the latency model from Section 2, suppose one LLM call costs 800ms and one tool round trip costs 200ms. Estimate the wall-clock for a task that takes 5 sequential turns. Now suppose 40 percent of tool calls fail and require an average of 2 retries at 200ms each before succeeding. Recompute the latency, and compute how much a parallel fan-out that collapses the 5 turns into 3 would save. Argue from these numbers which lever (fewer turns, faster LLM calls, or parallel tools) you would pull first, and tie your reasoning to the serving costs of Chapter 23.