What we’re trying to ship
You have a prototype that uses an LLM to answer questions over your internal documents (policies, runbooks, specs, tickets). The demo works. Now you want to ship a production “RAG” (Retrieval-Augmented Generation) service that:
- Returns answers with citations to source snippets
- Doesn’t leak sensitive data across users/tenants
- Has predictable latency and cost
- Doesn’t silently hallucinate with high confidence
- Can be operated at 2am without a PhD in embeddings
In scope: text documents, chunking, embeddings, vector search, reranking, prompting, citations, authZ, observability, rollout, cost controls.
Out of scope: training/fine-tuning your own foundation model, multimodal RAG, and fully autonomous agents that take actions.
Assumptions (say these out loud to your team):
- Traffic shape: bursty QPS during business hours, long tail at night
- Data sensitivity: mixed (public, internal, confidential); you must assume users will paste secrets into prompts
- Deployment: service behind your SSO, running in your cloud/VPC; you can call an external LLM API or host one
Bench setup
Most teams “benchmark” RAG by asking 20 questions and eyeballing answers. That’s a vibe check, not an engineering artifact. A bench that survives contact with production has three parts: a fixed corpus snapshot, a fixed question set, and a repeatable scoring harness.
Prototype setup (the common starting point)
- Ingest: PDF/HTML/text → split into chunks
- Embed chunks → store in a vector DB
- Query: embed question → top-k retrieval → stuff chunks into prompt → LLM answer
Make it a real bench
Hard requirements:
- Freeze a corpus snapshot (and version it). Otherwise you can’t compare runs.
- Build a gold QA set with expected citations (not just expected text).
- Record every run artifact: chunking params, embedding model, vector index config, prompt, top-k, reranker, LLM model, temperature, max tokens.
Practical scoring (no fake numbers required):
- Citation correctness: “Does the cited text actually support the claim?”
- Answer faithfulness: “Is the answer entailed by retrieved text?”
- Coverage/recall: “Did we retrieve the right source chunk anywhere in top-k?”
- Latency budget split: retrieval vs generation
- Cost per query (units that matter; see cost section)
Tip: store bench inputs/outputs as JSONL so you can diff runs and regress quickly.
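A minimal sketch of that JSONL habit, under assumptions: the record fields (`question_id`, `config`, `retrieved`, etc.) are illustrative, not a required schema. The point is that every run is a diffable artifact.

```python
import json

def make_run_record(question_id, config, retrieved_chunk_ids, answer, citations):
    """Bundle everything needed to reproduce and diff one bench answer.
    `config` should carry chunking params, embedding model, top-k, prompt
    version -- every knob listed in the hard requirements above."""
    return {
        "question_id": question_id,
        "config": config,
        "retrieved": retrieved_chunk_ids,
        "answer": answer,
        "citations": citations,
    }

def write_jsonl(path, records):
    """One record per line so standard diff tools work across runs."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r, sort_keys=True) + "\n")

def diff_runs(old_records, new_records):
    """Report question ids whose retrieved chunk set changed between runs --
    the cheapest retrieval-regression signal."""
    old = {r["question_id"]: set(r["retrieved"]) for r in old_records}
    return [r["question_id"] for r in new_records
            if old.get(r["question_id"]) != set(r["retrieved"])]
```

Diffing retrieved chunk sets first, before scoring answers, localizes regressions: if retrieval changed, don't bother debating answer quality yet.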
What the benchmark actually tells you (and what it doesn’t)
What it tells you:
- Whether retrieval finds relevant snippets for your question distribution
- Whether your prompt format reliably produces citations and refusal behavior
- Sensitivity to chunk size, overlap, top-k, reranking
- Rough latency/cost shape per query class (short vs long answers)
What it doesn’t tell you (and will bite you):
- Permissioning correctness (bench data rarely tests cross-tenant leakage)
- Worst-case latency under load (vector DB tail latency + LLM queueing)
- Corpus drift: new docs, reorgs, broken HTML, duplicate content, stale versions
- Adversarial prompting (users trying to exfiltrate or override system behavior)
- Operational failure modes: partial outages, timeouts, model/provider regressions
- Real user intent: “What is X?” in a bench is not “I’m on-call and need the exact runbook step.”
Rule: treat bench wins as “eligible for a production trial,” not “ready.”
Production constraints
Define constraints before you argue about vector DBs.
Latency
Set an SLO (example shape, not a number): “Interactive answers should feel fast; long answers can stream.” Split the budget:
- Retrieval (embedding + vector search + rerank)
- Prompt assembly
- LLM generation (dominant in many cases)
Gotcha: RAG adds network hops. Each hop adds tail latency and failure probability.
Scale
Consider:
- Corpus size growth (chunk count, not document count)
- Ingest rate (batch vs continuous)
- Query QPS and concurrency
- Multi-region needs (data residency, latency)
Cost
The main cost drivers:
- Tokenized prompt size (retrieved context + chat history)
- Tokens generated
- Reranking calls (if using a cross-encoder or LLM-as-reranker)
- Embedding calls (ingest-time and query-time)
- Vector DB storage + index maintenance
Most teams lose money on “top-k too high” + “chunks too big” + “chat history unbounded.”
Compliance / data handling
Decide early:
- Can prompts and retrieved snippets be sent to a third-party LLM API?
- Must data remain in-region?
- Retention: do you log prompts? If yes, how do you redact?
- Access control model: document-level, section-level, row-level?
SLOs and correctness expectations
RAG is not a transactional system, but production still needs:
- Availability targets
- Defined refusal behavior (“I don’t know” with suggested sources)
- Escalation path (“open the source doc” or “file a ticket”)
Architecture that survives reality
You want something boring, debuggable, and permission-safe.
Minimum viable production architecture
- Ingestion pipeline (async)
- Fetch → normalize → extract text → chunk → embed → store
- Persist raw text + metadata (doc id, version, ACL, timestamps, source URL)
- Query service (sync)
- AuthN/AuthZ → query rewrite (optional) → retrieval → rerank → prompt → LLM → postprocess (citations, safety, formatting)
- Data stores
- Vector store for embeddings + metadata
- Source-of-truth store for document text and ACLs (don’t rely on vector DB alone)
- Control plane
- Config registry for prompts, models, top-k, chunking versions
- Feature flags for rollout
Permissioning: do it at retrieval time, not after generation
Hard requirement: enforce access control before you retrieve/assemble context.
Common pattern:
- Store ACL metadata per chunk (tenant_id, groups, doc_visibility)
- Filter vector search by ACL constraints (or pre-partition indexes per tenant)
- If your vector DB filtering is limited or slow, use one of:
- Per-tenant index (simple, can be expensive)
- Coarse partitioning (per business unit) + post-filter + rerank
- Hybrid: candidate retrieval broad, then strict filter, then rerank
Do not retrieve across tenants and “trust the LLM to ignore it.”
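A sketch of filtering at retrieval time, assuming the chunk metadata fields from the pattern above (`tenant_id`, `groups`, `doc_visibility`); the field names and visibility values are illustrative.

```python
def acl_allows(chunk_meta, tenant_id, user_groups):
    """Return True only if this chunk is visible to this user."""
    if chunk_meta["tenant_id"] != tenant_id:
        return False  # never cross tenants, regardless of visibility
    if chunk_meta["doc_visibility"] == "public":
        return True
    # otherwise require group overlap
    return bool(set(chunk_meta["groups"]) & set(user_groups))

def filter_candidates(candidates, tenant_id, user_groups):
    """candidates: list of (score, chunk_meta) pairs from vector search.
    Filtering happens before any context assembly or LLM call."""
    return [(s, m) for s, m in candidates
            if acl_allows(m, tenant_id, user_groups)]
```

If your vector DB supports metadata filters, push this predicate into the query instead; the post-filter version above is the fallback for limited filtering support.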
Retrieval quality: hybrid + rerank (usually)
Pure embeddings can miss exact matches (IDs, error codes). Pure keyword search can miss paraphrases. In production, hybrid tends to win:
- Lexical search (BM25) for exact terms, codes, names
- Vector search for semantic match
- Merge candidates → rerank to top-N
Decision point:
- If your corpus is heavy on structured identifiers (tickets, logs, runbooks): hybrid is strongly favored.
- If your corpus is mostly prose and synonyms matter: vector-first can be fine, but still consider rerank.
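One common way to do the "merge candidates" step is reciprocal rank fusion (RRF), which combines ranked lists without needing comparable scores across BM25 and vector search. A minimal sketch; `k=60` is the conventional RRF constant, not something this document prescribes:

```python
def rrf_merge(ranked_lists, k=60, top_n=10):
    """Merge several ranked candidate lists (e.g., BM25 and vector search)
    by reciprocal rank fusion: each list contributes 1/(k + rank) per doc.
    Raw scores are discarded, so lexical and vector lists fuse cleanly."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Docs appearing high in both lists rise to the top; docs found by only one retriever still survive into the rerank stage.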
Context assembly that won’t explode tokens
Guardrails:
- Cap retrieved tokens (not just number of chunks)
- Prefer smaller, well-formed chunks + rerank over giant chunks
- Use “quote then answer” formatting to keep the model grounded
- Include document title + section headers in chunks to preserve meaning
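A sketch of the first guardrail, capping retrieved tokens rather than chunk count. The whitespace token counter is a deliberate stand-in; in practice use your model's tokenizer.

```python
def assemble_context(chunks, max_tokens, count_tokens=lambda t: len(t.split())):
    """Greedily pack rerank-ordered chunks under a token budget.
    Oversized chunks are skipped, not truncated, so smaller well-formed
    chunks further down the ranking can still fit."""
    picked, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue
        picked.append(chunk)
        used += cost
    return picked, used
```

Note this caps tokens, not chunks: a top-k of 10 with huge chunks can still blow the prompt budget if you only count chunks.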
Answer format contract
Treat output as an API, not prose:
- JSON (or structured) fields: answer, citations[], confidence/coverage hints, refusal_reason
- Enforce max length and required citations for “factual” answers
If you can’t reliably parse output, ops will be miserable.
Security and privacy checklist
Non-negotiables for internal RAG:
- AuthZ before retrieval (tenant/group filters, doc-level allow lists)
- Prompt injection awareness: retrieved text is untrusted input
- Strip/ignore instructions from documents (“Ignore previous instructions…”)
- Use a system message that explicitly treats documents as data, not directives
- Secrets handling
- Redact known secret patterns in logs (API keys, tokens)
- Provide a “don’t paste secrets” UX warning, but don’t rely on it
- Logging policy
- Decide whether to store prompts/responses; if yes, retention + access controls
- Separate operational logs (latency, error codes) from content logs
- Data egress controls
- If calling external LLMs: approved endpoints, TLS, vendor terms, regional routing as required
- Model isolation
- Don’t share caches across tenants unless keys include tenant identity
- Document provenance
- Store source URL/path and version; show it to users to reduce blind trust
Observability and operations
RAG debugging is mostly “why did it say that?” Build observability around the pipeline, not just the endpoint.
What to log (carefully, with redaction):
- Request id, tenant id, user id (or hashed)
- Retrieval:
- top-k doc ids, chunk ids, scores
- filters applied (ACL constraints)
- reranker version + scores
- Prompt stats:
- tokens in: system + user + context + history
- tokens out
- Model info: provider/model id, temperature, max tokens
- Latency breakdown: embed, search, rerank, generation
- Outcome tags:
- answered vs refused
- citation_count
- “no relevant context found” reason
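The items above can be bundled into one metadata-only trace record per request. A sketch, with illustrative field names; note it hashes the user id and carries chunk ids and scores but never document text, keeping the content/operational log split described in the security checklist.

```python
import hashlib

def trace_record(request_id, tenant_id, user_id,
                 retrieval, prompt_stats, latency_ms, outcome):
    """One operational trace row: metadata only, no content.
    user_id is hashed so traces can be joined without storing identity."""
    return {
        "request_id": request_id,
        "tenant_id": tenant_id,
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "retrieval": retrieval,        # chunk ids, scores, filters applied
        "prompt_stats": prompt_stats,  # token counts per prompt section
        "latency_ms": latency_ms,      # per-stage breakdown
        "outcome": outcome,            # answered / refused / no_context
    }
```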
Dashboards that matter:
- Answer rate vs refusal rate (by tenant and query type)
- p50/p95/p99 latency split by stage
- Cost proxy: tokens in/out per request
- Retrieval health: “% queries with at least one citation from expected collections”
- Error budgets: timeouts, provider errors, vector DB errors
On-call runbook:
- How to disable reranking
- How to lower top-k and cap context tokens
- How to switch models/providers
- How to flip to “citations-only mode” (return snippets without synthesis)
Failure modes and how to handle them
Common real-world failures and mitigations:
- Vector DB slow or down
  - Fallback: lexical search only (if available)
  - Fallback: “no synthesis, show top sources” mode
  - Circuit breaker + cached “popular questions” results (tenant-scoped)
- LLM provider latency spikes / errors
  - Timeouts + retry with jitter (careful: retries can double cost)
  - Secondary model/provider failover
  - Degraded mode: shorter answers, smaller context, stream partial
- Retrieval finds nothing relevant
  - Refuse with “I couldn’t find this in your docs” + suggest query reformulations
  - Offer top 3 near matches with titles, not hallucinated answers
- Hallucinated synthesis despite good sources
  - Force cite-then-answer prompt pattern
  - Post-check: if no citations, refuse
  - Consider an answer verification pass only for high-risk categories (policy, security)
- Prompt injection via documents
  - Treat retrieved text as untrusted
  - Use a “document is data” instruction and ignore instructions in sources
  - Filter or flag documents that contain obvious injection patterns (best-effort)
- Stale/duplicate docs leading to conflicting answers
  - Version metadata + prefer latest
  - Deduplicate at ingest (hash normalized text)
  - Show doc timestamps and “last updated” in citations
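The first two failure modes form a degradation ladder: full RAG, then lexical-only retrieval, then sources-only with no synthesis. A sketch, assuming the three stages are injected as callables (their names and the `mode` tags are illustrative):

```python
def answer_with_fallbacks(query, vector_search, lexical_search, synthesize):
    """Degrade step by step instead of failing outright.
    Each callable may raise; each failure drops to the next rung."""
    try:
        chunks = vector_search(query)
    except Exception:
        try:
            chunks = lexical_search(query)  # vector DB down: lexical only
        except Exception:
            return {"mode": "unavailable", "answer": None, "sources": []}
    if not chunks:
        # nothing relevant found: refuse, don't hallucinate
        return {"mode": "no_context", "answer": None, "sources": []}
    try:
        return {"mode": "full", "answer": synthesize(query, chunks),
                "sources": [c["id"] for c in chunks]}
    except Exception:
        # LLM down or slow: citations-only mode, show the top sources
        return {"mode": "sources_only", "answer": None,
                "sources": [c["id"] for c in chunks]}
```

The `mode` tag doubles as the outcome tag for dashboards, so you can watch degraded-mode rates per tenant during an incident.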
Rollout plan
Ship in controlled phases. RAG is easy to demo and hard to trust.
- Feature flags
- Enable by tenant/team
- Enable by query category (start with low-risk: FAQs, onboarding)
- Canary
- Route a small percentage of traffic to new retrieval config/prompt/model
- Compare: refusal rate, citations present, user feedback
- Human feedback loop
- “Was this helpful?” + “Report incorrect citation” buttons
- Triage queue that links directly to the retrieval trace
- Rollback
- One-click revert of prompt/model/top-k/reranker version
- Keep last-known-good configuration pinned
- Launch gates
- No cross-tenant leakage incidents in trial
- Latency within budget at expected concurrency
- Clear refusal behavior (no “confident nonsense”)
Cost model (rough)
Don’t pretend you can compute exact dollars without your provider pricing and traffic. Model the units:
Per query cost is roughly:
- Embedding(query) calls (usually 1)
- Vector search + rerank compute (varies by approach)
- LLM tokens:
- Input tokens = system + user + chat history + retrieved context
- Output tokens = answer length + citations formatting overhead
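The unit structure above can be written down as a back-of-envelope function. All prices here are placeholders to be filled in from your provider's sheet; the point is which terms exist, not the numbers.

```python
def cost_per_query(tokens_in, tokens_out, price_in_per_1k, price_out_per_1k,
                   embed_calls=1, embed_price=0.0, rerank_price=0.0):
    """Rough per-query cost in your currency of choice.
    tokens_in = system + user + chat history + retrieved context;
    tokens_out = answer + citation formatting overhead."""
    llm = (tokens_in / 1000 * price_in_per_1k
           + tokens_out / 1000 * price_out_per_1k)
    return llm + embed_calls * embed_price + rerank_price
```

Plugging in hypothetical prices makes the levers obvious: with input at 1.0 and output at 3.0 per 1k tokens, a 4000-token context and 500-token answer costs 5.5 units, and the context cap dominates.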
Key levers (in order of impact, usually):
- Context token cap (biggest predictable lever)
- top-k and rerank-to-N
- Chunk size/overlap (affects both retrieval and tokens)
- Chat history policy (summarize or window it)
- Model selection per route:
- Cheap model for rewrite + retrieval help
- Stronger model only for synthesis when sources are good
Budget guardrails:
- Hard cap max input tokens
- Rate limits per tenant/user
- Quotas + alerts on token usage anomalies
- Cache embeddings for repeated questions (tenant-scoped)
Bench to Prod checklist
Copy this into a ticket.
Bench readiness
- [ ] Frozen corpus snapshot + versioned ingest config
- [ ] Gold QA set with expected citations
- [ ] Automated run harness with stored artifacts (prompt, configs, outputs)
- [ ] Regression detection for retrieval recall and citation correctness
Production architecture
- [ ] Source-of-truth store for doc text + metadata + versions
- [ ] Vector store schema includes tenant_id + ACL fields
- [ ] AuthZ enforced before retrieval (filtering/partitioning validated)
- [ ] Hybrid retrieval decision made (vector-only vs hybrid) with rationale
- [ ] Reranker strategy chosen (or explicitly rejected)
Safety and security
- [ ] Prompt injection mitigations in place (documents treated as untrusted)
- [ ] Logging policy defined (content vs metadata, retention, access)
- [ ] Redaction for known secret patterns in logs
- [ ] Tenant-scoped caches and isolation checks
- [ ] Egress controls reviewed (if external LLM used)
Ops
- [ ] Per-stage latency metrics (embed/search/rerank/generate)
- [ ] Token in/out metrics and cost proxy dashboards
- [ ] Trace viewer for “why this answer” (top chunks + scores + prompt stats)
- [ ] Circuit breakers + degraded modes (sources-only, lexical-only)
- [ ] Runbook for model/provider failover and config rollback
Rollout
- [ ] Feature flagging by tenant/team
- [ ] Canary plan with success metrics and abort conditions
- [ ] User feedback loop wired to traces
- [ ] Launch gates defined (leakage, refusal quality, latency)
Recommendation
Productionizing RAG is mostly about two things: permission-safe retrieval and bounded context/cost. Start with a minimum architecture that enforces AuthZ before retrieval, versions your corpus and prompts, and logs retrieval traces you can debug. Use hybrid retrieval if your documents contain lots of identifiers and exact terms; add reranking when “top-k contains the right chunk but answer still stinks.”
Most importantly: ship a system that can refuse safely and show sources, then iterate. A “sometimes wrong but always confident” RAG bot will get turned off the first time it burns an on-call engineer.