Section 40.4: Sharded Vector Search

"I am one slice of an index, and I rank only what I am allowed to see. When the coordinator asks, I answer for my shard and forget the rest exists. Somehow, between the four of us, the agent always gets the right ten."
A Vector Index, Sharded Across Nodes That Vote on Relevance

Big Picture

The agent's retrieval tool is, underneath the API, a distributed approximate-nearest-neighbor index that no single machine can hold, queried under a per-user permission filter that no single machine can ignore. An enterprise knowledge corpus runs to hundreds of millions of chunks, far past the memory of one node, so the vectors are sharded; the agent that calls the tool acts on behalf of a specific user, so every result must be filtered by access-control rules before it leaves the cluster. This section assembles the serving path for that tool. It applies the vector-search machinery of Chapter 25 and the scatter-gather ranking of Section 36.6, then adds the enterprise constraints that make agentic retrieval distinct: security-aware filtering at scale, hybrid lexical-plus-dense scoring, the build-versus-buy decision around managed vector databases, and the freshness guarantee the agent expects as the corpus changes underneath it.

The previous section gave the agent a memory and a planning loop; this section gives it eyes into the enterprise's documents. When the agent decides it needs a fact it does not carry in context, it calls a retrieval tool, and that tool must return the most relevant passages the current user is permitted to read, in tens of milliseconds, from a corpus too large to live on one machine. The retrieval primitives are exactly those Chapter 25 developed for distributed search and Section 36.6 developed for sharded ranking; what changes here is the operating context. A web-scale RAG pipeline retrieves over a public corpus where every document is fair game. An enterprise agent retrieves over a corpus where the same query from two different users must return different results, because they have different permissions, and a single leaked passage is a security incident, not a relevance miss. We build the sharded index first, then layer the permission filter, the hybrid score, the managed-platform option, and the freshness contract on top of it.

Figure 40.4.1: The retrieval tool's serving path. The agent sends a query vector $q$ together with the acting user's identity; the coordinator attaches the user's permission groups and scatters the request to every shard. Each shard runs its own approximate-nearest-neighbor search (HNSW or IVF-PQ), filters the candidates against the access-control list, and returns its local top-$k$. The coordinator gathers the $S$ partial lists and merges them into a single global top-$k$ that, by construction, contains only items the user is allowed to read.

1. The Sharded ANN Index Behind the Tool Intermediate

An enterprise corpus of a few hundred million chunks, embedded at 768 or 1024 dimensions in single precision, occupies hundreds of gigabytes before any index structure is added, and the navigable-small-world graphs or inverted lists that make search fast inflate that further. No single node holds it, so the index is partitioned across $S$ shards exactly as Chapter 25 describes in its treatment of index sharding and replication, and as Section 36.6 applies to sharded retrieval. Each shard owns a disjoint slice of the corpus and builds a self-contained approximate-nearest-neighbor structure over its own vectors: a Hierarchical Navigable Small World graph (HNSW) when memory is plentiful and recall must be high, or an inverted file with product quantization (IVF-PQ) when the footprint must be compressed. The two coexist in one cluster, because shards can be tuned independently; Figure 40.4.1 shows both in the same fleet.

A query for the global top-$k$ becomes a scatter-gather, the pattern Section 36.6 builds in detail. The coordinator sends the query vector to every shard, each shard returns its own local top-$k$ candidates with their similarity scores, and the coordinator merges the $S$ lists and keeps the best $k$ overall. The correctness of the merge rests on a simple fact: a globally top-ranked item must be top-ranked within whatever shard happens to hold it. Formally, if shard $s$ returns its top-$k$ set $T_s$ and we keep

$$T = \operatorname{top\text{-}k} \Big( \bigcup_{s=1}^{S} T_s \Big), \qquad \operatorname{score}(x, q) = \langle e_x, q \rangle,$$

then $T$ equals the exact global top-$k$ whenever every globally-best item survived its shard's local cutoff. With exact per-shard ranking that condition always holds for $k$ items, because a shard contributing $m \le k$ of the global winners ranks all of them inside its own top-$k$. The approximation enters only through the ANN search inside each shard, not through the merge, which is why we can reason about recall shard-by-shard. The latency of the whole operation is governed by the slowest shard, not the sum,

$$\text{latency} \approx \max_{s=1,\dots,S} \big( t^{\text{ann}}_s + t^{\text{filter}}_s \big) + t^{\text{merge}},$$

so a single straggler shard, the failure mode named in Section 2.7, sets the user-visible response time. Replicating hot shards and hedging requests against the slowest replica is the standard remedy, carried over from Chapter 25's discussion of replication.

Key Insight: The Merge Is Exact; Only the Per-Shard Search Approximates

Sharding does not, by itself, cost you recall. The scatter-gather merge of per-shard top-$k$ lists is mathematically exact when each shard ranks its own slice exactly, because a global winner is always a local winner on its home shard. Every recall loss you measure traces to the approximate-nearest-neighbor search inside a shard (the HNSW graph missing a neighbor, the IVF probe skipping the right cell), never to the act of splitting the corpus. This lets you budget recall locally, tune each shard's ANN parameters in isolation, and trust that the distribution layer adds latency but not error.

2. Permission Filtering at Scale Advanced

Here the enterprise agent departs from the public-corpus RAG of Chapter 36. The agent acts for a user, and that user may read only the documents their access-control list (ACL) permits: their team's wiki, the projects they are staffed on, the customer accounts they own. The retrieval tool must therefore return the top-$k$ relevant items among those the user may see, which is a filtered nearest-neighbor query. Getting this wrong is not a quality regression; it is a data-exfiltration path, and a capable agent will happily summarize whatever you hand it, permissions or not. The filter is a security boundary, and it belongs inside the retrieval cluster, never in a post-hoc check the agent could be prompted around.

The subtlety is where the filter runs. Filtering after the ANN search (post-filtering) is simple but can starve the result: if the user may see one percent of the corpus and the ANN search returns the global top 100, perhaps only one survives, and the user-visible recall collapses. Filtering before the search (pre-filtering) restricts the candidate set to permitted items but can defeat the very graph or inverted-list structure that makes ANN fast, because the reachable neighborhood is now sparse. Production systems take a middle path: they fold the ACL predicate into the ANN traversal itself, so the search walks the graph but only accepts permitted nodes as candidates, and they over-fetch (raise the per-shard $k$) to keep enough permitted items in flight. The recall the user actually experiences is a filtered quantity,

$$\text{recall@}k_{\text{filt}} = \frac{\lvert R_k(q,u) \cap G_k(q,u) \rvert}{k}, \qquad G_k(q,u) = \operatorname{top\text{-}k}\{ x : \text{visible}(x,u) \},$$

where $G_k(q,u)$ is the exact top-$k$ over the user-visible subset and $R_k(q,u)$ is what the sharded system returned. Two retrievals for the same query against the same index can have different $G_k$, because $\text{visible}(x,u)$ depends on the user. The demonstration in Code 40.4.1 builds a sharded index with per-item ACL tags, runs the filtered scatter-gather, and checks that the result equals $G_k$ exactly and never includes a forbidden item.

The code below constructs $S$ shards over a synthetic corpus, tags each item with the groups allowed to read it, and answers a query for a user in two groups. Each shard applies the ACL predicate to its own candidates before ranking, then returns its local top-$k$; the coordinator merges. We verify the merged result against a brute-force exact filtered top-$k$ and confirm that no item the user may not see ever appears.

import numpy as np

# A tiny sharded vector index with per-item ACL tags. Each item belongs to one
# shard and carries a set of group ids the viewer must intersect to be allowed.
rng = np.random.default_rng(7)
N, d, S = 6000, 32, 4               # items, embedding dim, shards
groups = [f"g{i}" for i in range(6)]

vecs = rng.standard_normal((N, d)).astype(np.float64)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)        # cosine == dot product
# Each item is readable by a random subset of 1-2 groups.
acl = [set(rng.choice(groups, size=rng.integers(1, 3), replace=False)) for _ in range(N)]
shard_of = rng.integers(0, S, size=N)                      # static placement

# The querying agent acts on behalf of a user in these groups.
user_groups = {"g1", "g4"}
def visible(i):                                            # ACL predicate
    return len(acl[i] & user_groups) > 0

q = rng.standard_normal(d); q /= np.linalg.norm(q)
k = 10

# ---- Exact filtered top-k (ground truth): score every VISIBLE item globally.
allow = np.array([visible(i) for i in range(N)])
scores = vecs @ q
masked = np.where(allow, scores, -np.inf)
exact = np.argsort(-masked)[:k]
exact_set = set(int(i) for i in exact)

# ---- Sharded scatter-gather: each shard filters by ACL FIRST, then ranks its
# own visible items, returns its local top-k; a coordinator merges the S lists.
def shard_topk(sid):
    idx = np.where(shard_of == sid)[0]
    idx = idx[[visible(int(i)) for i in idx]]              # ACL filter on the shard
    if idx.size == 0:
        return []
    s = vecs[idx] @ q
    order = np.argsort(-s)[:k]
    return [(int(idx[j]), float(s[j])) for j in order]

partials = [shard_topk(sid) for sid in range(S)]           # S parallel workers
merged = sorted(sum(partials, []), key=lambda t: -t[1])[:k]
merged_set = set(i for i, _ in merged)

print("items / shards / dim     :", N, S, d)
print("viewer groups            :", sorted(user_groups))
print("visible items (of N)     :", int(allow.sum()))
print("exact filtered top-k     :", sorted(exact_set))
print("sharded gather top-k     :", sorted(merged_set))
print("recall@k vs exact        :", f"{len(exact_set & merged_set)/k:.2f}")
print("any forbidden item leaked:", any(not visible(i) for i in merged_set))

items / shards / dim     : 6000 4 32
viewer groups            : ['g1', 'g4']
visible items (of N)     : 2861
exact filtered top-k     : [238, 305, 2260, 2608, 2874, 3376, 3499, 4069, 4522, 5706]
sharded gather top-k     : [238, 305, 2260, 2608, 2874, 3376, 3499, 4069, 4522, 5706]
recall@k vs exact        : 1.00
any forbidden item leaked: False

Code 40.4.1: A filtered scatter-gather over four ACL-tagged shards, verified against the exact filtered top-$k$. Each shard applies the permission predicate to its own slice before ranking, so a forbidden item is never even a candidate; the merge reproduces $G_k(q,u)$ from the equation above exactly, recall@$k$ is $1.00$, and the leak check confirms every returned item is visible to the user. The per-shard ranking here is exact, isolating the merge correctness from ANN approximation.

Practical Example: The Agent That Retrieved Across a Permission Wall

Who: A platform engineer rolling out an internal copilot for a 12,000-person company.

Situation: The copilot answered questions over the company's entire document store, sharded across eight vector-search nodes, and was used by everyone from interns to the finance team.

Problem: A pilot user in engineering asked about "next quarter's headcount plan" and the agent cheerfully retrieved and summarized a confidential finance document that the user's role did not permit.

Dilemma: Apply the ACL filter after retrieval, a one-line post-filter that was simple but left permitted-result recall unpredictable and kept forbidden vectors in the candidate pool, or push the filter into each shard's ANN traversal, which was more work but made a leak structurally impossible.

Decision: They pushed the filter into the shards, indexing each chunk with its ACL group set and admitting only permitted nodes as ANN candidates, with per-shard over-fetch to protect recall, exactly the structure of Code 40.4.1.

How: Each shard stored a compact group bitmask per vector; the traversal rejected any candidate whose mask did not intersect the caller's groups before scoring it.

Result: The leak class disappeared in audit, two different users got correctly different answers to the identical query, and the over-fetch raised median latency by under two milliseconds.

Lesson: In an enterprise agent, the permission filter is part of the index, not a wrapper around it. Put it where a prompt cannot reach it.

3. Hybrid Lexical and Dense Retrieval Intermediate

Dense embeddings capture meaning but miss exact tokens: an account number, an error code, a product SKU, a person's surname. An enterprise agent asked to "find the ticket where customer ACME-4471 reported error E2207" needs those literals matched, not paraphrased. Hybrid retrieval, developed for the distributed case in Chapter 25's section on distributed hybrid search, runs a lexical index (BM25 over an inverted index) alongside the dense ANN index and fuses their rankings. The fusion is typically reciprocal rank fusion or a weighted score combination,

$$\text{score}_{\text{hybrid}}(x) = \alpha \cdot \tilde{s}_{\text{dense}}(x) + (1-\alpha) \cdot \tilde{s}_{\text{lex}}(x),$$

where the two component scores are normalized to a common range and $\alpha$ trades semantic recall against literal precision. In a sharded deployment both indexes are partitioned the same way, so each shard contributes a fused local top-$k$ and the scatter-gather of Section 1 carries over unchanged; the only addition is that each shard now runs two retrievers and fuses before it returns. The permission filter of Section 2 applies identically to both, because the ACL is a property of the item, not of how it was scored. For the agent, hybrid retrieval is what makes the tool dependable across both "explain our refund policy" (dense wins) and "pull invoice INV-99812" (lexical wins) without the user choosing a mode.

4. Build Versus Buy: Managed Vector Databases Beginner

Everything in Sections 1 through 3, sharded ANN, scatter-gather, ACL filtering, hybrid fusion, replication, and freshness, is offered as a managed service by vector databases and the managed platforms of Section 33.9. Pinecone, Weaviate, Qdrant, Milvus, and the vector extensions of the cloud data platforms expose a single query endpoint that hides the shards, runs the filtered ANN, and handles the index updates. The build-versus-buy decision is the enterprise version of the right-tool principle that runs through this book. Building your own gives control over the ANN parameters, the filter semantics, and the placement, and avoids per-query egress pricing at very high volume; buying gives you an SLA, automatic replication and rebalancing, and metadata-filter support you do not have to make correct yourself, which for a security-critical filter is a meaningful reduction in risk. For most agentic applications the managed path is the right first move, because the differentiating work is the agent, not the reimplementation of a sharded index that a vendor has already hardened. The library shortcut below shows how thin the application-facing query becomes when the platform owns the distribution.

Library Shortcut: A Managed Vector DB Collapses the Whole Path to One Query

The sharded scatter-gather, the per-shard ACL filter, and the merge of Code 40.4.1 are exactly what a managed vector database exposes behind one call. With FAISS you still own the index but get the ANN structure for free; with a hosted service you own neither the index nor the distribution. Both reduce the from-scratch path to a few lines:

# Option A: FAISS, you own the shard, the library owns the ANN structure.
import faiss, numpy as np
index = faiss.IndexHNSWFlat(d, 32)          # HNSW graph, M=32 neighbors
index.add(vecs.astype('float32'))           # build this shard's index
scores, ids = index.search(q[None].astype('float32'), k)   # local top-k

# Option B: a managed vector DB owns the shards, the ANN, AND the filter.
# The metadata filter is the security boundary; the service applies it
# inside the sharded search, not after it.
result = index_client.query(
    vector=q.tolist(),
    top_k=k,
    filter={"acl_groups": {"$in": ["g1", "g4"]}},   # only items the user may read
)

Code 40.4.2: The dozens of lines of manual sharding, filtering, and merging in Code 40.4.1 collapse to one search call (FAISS, single shard) or one query with a metadata filter (managed service, all shards). The service handles process-group setup, shard placement, replication, and the ACL-aware traversal internally; the application supplies only the query vector, $k$, and the caller's permitted groups.

5. Freshness and Consistency as the Index Updates Advanced

An enterprise corpus is not static. Documents are created, edited, and deleted; permissions change when someone joins a team or a project is reclassified. The index must reflect these changes, and the agent expects a degree of freshness: a document uploaded a minute ago should be retrievable, and a document the user just lost access to must vanish from their results immediately, because a stale ACL is a leak. These are the consistency questions of Section 2.5 applied to the retrieval layer. Most vector systems are eventually consistent for content: a write enters a fast mutable segment, becomes searchable within seconds, and is later merged into the compacted immutable index, so newly added vectors are visible before the background reindex completes. For permission revocations the looser guarantee is unacceptable, so production systems evaluate the ACL against a strongly-consistent authorization store at query time rather than trusting the possibly-stale group tags baked into the index, accepting a small extra lookup to close the revocation gap. The split is deliberate: eventual consistency for which documents exist, strong consistency for who may read them.

Thesis Thread: One Tool Call, a Whole Distributed System Underneath

To the agent, retrieval is a single tool invocation that returns ten passages. Underneath sits the full scale-out stack this book has built: the corpus is partitioned across shards (the sharding of Section 2.3), each shard runs an approximate index (Chapter 25), the query is a scatter-gather collective whose cost is set by the slowest shard (Section 36.6), and freshness rides on a consistency model chosen per field (Section 2.5). The agentic application does not add a new distributed primitive; it composes the ones already developed and adds a security boundary, the permission filter, that the public-corpus case never needed. Scale-out is not a feature of the agent, it is the substrate the agent stands on.

Research Frontier: Filtered ANN and Secure Retrieval (2024 to 2026)

Filtered nearest-neighbor search is an active research line precisely because the naive pre- and post-filter strategies both degrade. Methods such as Filtered-DiskANN and ACORN (2024) build filter-aware graph indexes that keep traversal efficient even when the permitted subset is a small fraction of the corpus, closing the recall gap that Section 2 described. A parallel thread targets the security boundary itself: privacy-preserving and access-controlled retrieval that enforces permissions cryptographically or with provable isolation between tenants, so a multi-tenant managed index cannot leak across customers even under a compromised query path. On the agentic side, work on retrieval as a learned tool (the agent deciding when and what to retrieve, and reranking with the same LLM that consumes the results) is pushing the filter and the freshness contract up into the agent's reasoning loop. The open problem the field is converging on is a sharded index that is simultaneously fast, high-recall under tight filters, and verifiably permission-safe at multi-tenant scale.

Fun Note

An agent with retrieval but no ACL filter is the world's most enthusiastic intern: handed a key to every filing cabinet, it will earnestly fetch the one document you least wanted it to read and summarize it beautifully. The filter is the part of onboarding where someone explains which drawers are not yours.

The tool is now complete: a sharded, hybrid, permission-filtered, fresh vector index that returns the right passages for the right user in a single call. The next section connects it to generation. The retrieved passages become the context a distributed LLM serving stack conditions on, turning retrieval into retrieval-augmented generation for the enterprise agent; that consumer is built in Section 40.5, and Section 40.6 shows the agent calling this index as one tool among several in its reasoning loop.

Exercise 40.4.1: Pre-Filter, Post-Filter, or In-Traversal? Conceptual

A user is permitted to read 0.5 percent of a 200-million-chunk corpus. Explain, for each of the three filtering strategies (post-filter the ANN output, pre-filter to the permitted set before searching, and filter inside the ANN traversal), what goes wrong or right with both recall and latency at this selectivity. Then state which you would deploy and what the per-shard over-fetch factor protects against. Connect your answer to the filtered recall@$k$ definition in Section 2.

Exercise 40.4.2: Break and Fix the Merge Under Approximation Coding

Modify Code 40.4.1 so that each shard returns an approximate local top-$k$: have shard_topk randomly drop each candidate with probability 0.1 before ranking, simulating ANN misses. Measure recall@$k$ against the exact filtered top-$k$ over many random queries. Then raise the per-shard fetch size from $k$ to $2k$ and remeasure. Report how over-fetch recovers recall, and confirm that the leak check still reports no forbidden item regardless of fetch size, demonstrating that over-fetch trades latency for recall but never for safety.

Exercise 40.4.3: Latency of the Slowest Shard Analysis

Suppose a query scatters to $S = 16$ shards, each shard's combined ANN-plus-filter time is drawn independently from a distribution with mean 8 ms and a 99th-percentile tail of 40 ms, and the merge costs 1 ms. Using the latency equation from Section 1, argue why the expected end-to-end latency is far closer to the tail than to the mean, and estimate how it grows as $S$ increases. Then explain how replicating each shard and hedging the query against the faster of two replicas changes the calculation, tying your reasoning to the straggler discussion of Section 2.7.