Case study
Architecture
How Agent Lab is put together, the choices behind each piece, and the tradeoffs that shaped them. The interesting work is mostly in the backend pipeline: ingestion, recall, streaming, tool safety, and traceable agent orchestration.
Stack
- Frontend: Next.js 15, Tailwind, shadcn-style primitives
- Backend: FastAPI, SQLAlchemy async, Pydantic v2
- LLM: OpenAI Responses API + LangGraph
- Storage: Postgres + pgvector, Redis, MinIO / S3
- Workers: RQ workers with retry + per-document Redis locks
- Auth: Google OAuth + guest sessions, cookie-backed
- Deploy: Static Next export on nginx; Docker Compose dev; k3s prod-sim
- Observability: Structured JSON logs, OpenTelemetry, Prometheus
Backend Control Plane
The backend is the product boundary. It resolves guest or user ownership, validates document scope, checks safety policy, enforces credits and token budgets, chooses the model provider, claims idempotency keys, creates a durable run, then dispatches to grounded_qa, agent, or assistant. That keeps expensive work behind one audited gate instead of spreading policy across the UI.
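A condensed sketch of that gate, in order. Every helper here (`resolve_owner`, `check_budgets`, `claim_idempotency`, `dispatch`, and the rest) is an illustrative stand-in, not the real module API:

```python
from fastapi import APIRouter

router = APIRouter()

@router.post("/runs")
async def create_run(req: dict):
    owner = await resolve_owner(req)                  # guest session or user
    await validate_document_scope(owner, req["document_ids"])
    await check_safety_policy(owner, req["query"])    # deterministic guards first
    await check_budgets(owner)                        # credits + hourly token budget
    provider = select_provider(owner, req)            # app-paid, user key, or Ollama
    if (replay := await claim_idempotency(owner, req)) is not None:
        return replay                                 # same key + same fingerprint
    run = await create_durable_run(owner, req, provider)
    await dispatch(run, mode=req["mode"])             # grounded_qa | agent | assistant
    return {"run_id": run.id, "status": run.status}
```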
Core Data Flow
1. Upload request validates ownership, type, size, quota, storage key, and malware policy before bytes are accepted.
2. Object bytes go to MinIO / S3; document + job rows become the durable ingestion contract; RQ carries retry metadata.
3. Worker takes a Redis document lock, downloads, parses, cleans, chunks, embeds, writes a new chunk revision, and flips the active pointer.
4. Run request validates document scope, budgets, provider selection, safety, idempotency, and thread state.
5. Grounded QA retrieves chunks, asks for structured JSON, then persists answer, citations, support status, token usage, and response metadata.
6. Agent mode plans, retrieves or calls tools, verifies evidence, retries within caps, synthesizes, runs post-answer actions, and persists every node as trace data.
7. Redis pub/sub carries live SSE events while Postgres remains the source of truth for completed traces.
Ingestion Pipeline
- Ingestion is staged: download, parse, OCR when needed, chunk, embed, upsert, cleanup, publish. Each stage records duration so a slow file can be diagnosed as parse-bound, OCR-bound, embed-bound, or database-bound.
- Chunks are written under a fresh revision id. Retrieval only reads the active revision, so a failed reindex does not erase the previous searchable version.
- Embedding calls are batched by token budget, not raw chunk count. The worker verifies the embedding count before writing so chunk and vector order cannot silently drift (sketched after this list).
- Enqueue uses RQ retry intervals and the worker takes a Redis lock keyed by `document_id`. Two worker pods can run in parallel, but only one ingestion for a document can own the write path at a time.
- The document stores an ingestion report: chunk count, token count, embed batch count, slowest stage, PDF cleanup notes, OCR summary, and page-level parse status where available.
- Failed RQ jobs are retained for operator inspection. A bounded CLI reports failed jobs and safely requeues app-level ingestion jobs after resetting document and job state.
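A minimal sketch of the token-budget batching and count check, assuming per-chunk token counts are already stored; `embed_batch` stands in for the real embedding client:

```python
def batch_by_token_budget(chunks, max_tokens=8000):
    """Group chunks so each embedding call stays under the token budget."""
    batch, used = [], 0
    for chunk in chunks:
        if batch and used + chunk.token_count > max_tokens:
            yield batch
            batch, used = [], 0
        batch.append(chunk)
        used += chunk.token_count
    if batch:
        yield batch

def embed_revision(chunks, embed_batch):
    for batch in batch_by_token_budget(chunks):
        vectors = embed_batch([c.embedding_text for c in batch])
        if len(vectors) != len(batch):       # refuse to write on count drift
            raise RuntimeError("embedding count mismatch")
        for chunk, vec in zip(batch, vectors):
            chunk.embedding = vec            # chunk/vector order preserved by zip
```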
Parsing And Chunking
The chunker is format-aware. Markdown preserves heading paths. PDF keeps page numbers and optional bounding boxes. Slide / ebook-like sources preserve logical page numbers. Text-like formats normalize into token-sized chunks. Every chunk stores display content, embedding text, content hash, token count, page or heading metadata, revision id, full-text vector, and the active embedding column.
The embedding text intentionally includes document title and heading context where useful. That helps semantic search without polluting citation display, because the UI still cites the clean source content and metadata.
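Roughly the per-chunk record, as an illustrative dataclass; the real SQLAlchemy model will differ in names and types:

```python
from dataclasses import dataclass

@dataclass
class ChunkRow:
    document_id: str
    revision_id: str          # retrieval only reads the active revision
    display_content: str      # clean source text, used for citations
    embedding_text: str       # may prepend title / heading context for search
    content_hash: str
    token_count: int
    page: int | None          # PDFs, slides, ebooks
    heading_path: str | None  # markdown heading trail
    bbox: tuple | None        # optional PDF geometry for exact highlights
    # plus: the full-text search vector and active embedding column in Postgres
```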
PDF And OCR
- PDF text extraction prefers a real text layer, including special handling for right-to-left Hebrew extraction. If text is empty or looks like CID/mojibake garbage, the affected pages go through OCR (see the sketch after this list).
- Cleanup removes repeated headers, footers, page-number noise, soft line wraps, hyphenation artifacts, table-border noise, and obvious OCR junk before chunking.
- Page-level results are preserved as `pdfplumber`, `ocr`, `skipped_empty`, or `failed`. A single bad page should not fail a useful document, but an entirely unreadable document should not become ready with zero chunks.
- Bounding boxes power exact highlights when geometry exists. Otherwise the reader falls back to source-text matching while citations still point at page and chunk metadata.
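A simplified sketch of the per-page decision; `looks_garbled` and `ocr_page` stand in for the real mojibake heuristic and the Tesseract call:

```python
def parse_page(page, looks_garbled, ocr_page):
    """Return (text, status) for one PDF page, text layer first."""
    try:
        text = page.extract_text() or ""    # pdfplumber text layer
        if text.strip() and not looks_garbled(text):
            return text, "pdfplumber"
        ocr_text = ocr_page(page)           # Tesseract, lang="heb+eng"
        if ocr_text.strip():
            return ocr_text, "ocr"
        return "", "skipped_empty"
    except Exception:
        return "", "failed"

def guard_readability(statuses):
    # A few bad pages are fine; a fully unreadable document must not become ready.
    if statuses and all(s in ("skipped_empty", "failed") for s in statuses):
        raise ValueError("document produced zero readable pages")
```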
Retrieval
Retrieval is hybrid: pgvector cosine search plus Postgres full-text search, fused with Reciprocal Rank Fusion. Vector retrieval handles paraphrase and semantic drift. FTS handles names, IDs, exact terms, invoice numbers, policy labels, and short keyword-heavy prompts. RRF is the current latency/quality tradeoff before adding a cross-encoder reranker.
Every hit carries fused score, source label, vector rank, FTS rank, fused rank, vector score, FTS score, page, heading, document title, and snippet. That is what makes retrieval failures diagnosable: you can tell whether the miss came from extraction, chunking, scoping, vector search, FTS, fusion, or missing source material.
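Reciprocal Rank Fusion itself is small; this is the standard formulation with the conventional k=60, which is not necessarily the constant used here:

```python
def rrf(vector_hits, fts_hits, k=60):
    """Fuse two ranked hit lists; each list contributes 1/(k + rank) per chunk."""
    scores = {}
    for hits in (vector_hits, fts_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit.chunk_id] = scores.get(hit.chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```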
Recall Quality
Retrieval quality is measured separately from answer quality. A gold set declares natural-language questions and expected chunk substrings. The recall driver uploads each unique fixture, runs the normal grounded_qa path through POST /runs, reads the trace, and scores hit@1, hit@3, and hit@5. Current dashboards show recall by category, document type, and difficulty so regressions can be tied to the data shape that failed.
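hit@k for one gold question reduces to a substring check against the top-k retrieved chunks, roughly:

```python
def hit_at_k(retrieved, expected_substrings, k):
    """True if any expected substring appears in the top-k chunk texts."""
    top_texts = [chunk.display_content for chunk in retrieved[:k]]
    return any(sub in text for text in top_texts for sub in expected_substrings)

# e.g. per-question scores rolled up by category / document type / difficulty
scores = {f"hit@{k}": hit_at_k(hits, expected, k) for k in (1, 3, 5)}
```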
Grounded QA
1. Retrieve active chunks scoped to the owner and selected documents.
2. Publish retrieval start, retrieval end, and `retrieval.hits` events.
3. Build a context block from current-turn evidence, with conversation history above it for reference.
4. Call the selected provider for strict JSON: answer, citations, `support_status`.
5. Parse the JSON, persist tokens and model metadata, then stream the final answer text in chunks.
Structured JSON is not streamed directly because partial JSON is fragile. The backend waits for the valid object, then streams the answer field so the UX still feels live while the stored result remains machine-checkable.
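The parse-then-stream step, sketched; event names and chunk size are illustrative:

```python
import json

async def stream_answer(run, raw_model_output, publish, persist_result):
    payload = json.loads(raw_model_output)   # fails loudly on partial JSON
    await persist_result(run, payload)       # stored object stays machine-checkable
    answer = payload["answer"]
    for i in range(0, len(answer), 64):      # then stream only the answer text
        await publish(run.session_id, {"type": "run.text", "delta": answer[i:i + 64]})
```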
Agent Workflow
Agent mode is a LangGraph workflow: safety guard, planner, retrieval, tool call, verifier, bounded retry, synthesis, and optional post-answer actions. The planner sees ready documents, connected sources, enabled tools, conversation history, and retry feedback. If it returns bad JSON, picks an unknown tool, or chooses a disabled integration, the run degrades to retrieval instead of crashing (a wiring sketch follows the list below).
- State is explicit: query, owner, scopes, planner output, retrieval results, tool results, verification verdict, final answer, citations, support status, steps, token usage, response id, and response metadata.
- Wall-clock budget wraps the graph. Timeout becomes an inspectable `wall_timeout` step with insufficient evidence rather than a raw 500.
- Retry edges are bounded by planner-loop, tool-call, and retrieval-attempt caps so the agent can improve weak evidence without spinning forever.
- Composite requests like "summarize this and email it" force retrieval and synthesis first. Gmail send runs afterward through the same approval, safety, and idempotency path as planner-picked write tools, using only recipients explicitly supplied by the user or reused from prior user email requests.
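A minimal wiring sketch of that graph; node bodies, routing predicates, and the state fields are placeholders for the real implementations:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict, total=False):
    query: str
    plan: dict
    evidence: list
    verdict: str
    retries: int

def node(state: AgentState) -> AgentState:   # placeholder body for every node
    return {}

def route_plan(state):     # bad JSON / unknown tool -> degrade to retrieval
    return "tool" if state.get("plan", {}).get("tool") else "retrieve"

def route_verdict(state):  # bounded retries before synthesis
    weak = state.get("verdict") == "weak" and state.get("retries", 0) < 2
    return "retry" if weak else "done"

builder = StateGraph(AgentState)
for name in ("safety_guard", "planner", "retrieve",
             "tool_call", "verifier", "synthesize"):
    builder.add_node(name, node)

builder.add_edge(START, "safety_guard")
builder.add_edge("safety_guard", "planner")
builder.add_conditional_edges("planner", route_plan,
                              {"retrieve": "retrieve", "tool": "tool_call"})
builder.add_edge("retrieve", "verifier")
builder.add_edge("tool_call", "verifier")
builder.add_conditional_edges("verifier", route_verdict,
                              {"retry": "planner", "done": "synthesize"})
builder.add_edge("synthesize", END)
graph = builder.compile()
```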
Tooling
- Built-in tools: `architecture_lookup`, `document_search`, `document_compare`, `quote_finder`, `structured_extraction`, and `run_summary`.
- Integration tools: `web_fetch.fetch_url` (SSRF-hardened, UA set, public-host checks), read-only project knowledge, Drive-backed source reads, GitHub actions, and Gmail actions.
- Tool output is JSON-serializable and traceable. Tool failures become failed or blocked tool steps with bounded error kinds, duration, and output summaries.
- Sensitive external writes require human approval. Approval pauses the graph, stores pending interrupt state, and resumes from a checkpoint after allow or deny.
- Post-synthesis email is modeled as a follow-up action rather than prompt text, so the answer body stays evidence-focused and the delivery result is appended as traceable run state.
Streaming
Live events are session-scoped SSE over Redis pub/sub. The client subscribes once and receives run lifecycle, phase, step, text, reasoning, retrieval, tool, state, interrupt, cancellation, and ingestion events. Postgres stays the durable source of truth; Redis is the live coordination layer.
There are two streams by design: /events for live session progress and /runs/:id/stream for replaying a completed run from stored trace rows. Publish failure is diagnostic, not load-bearing, so a Redis event hiccup should never fail a run.
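The live endpoint is essentially a Redis subscription bridged into an SSE response; channel naming here is illustrative:

```python
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
import redis.asyncio as redis

router = APIRouter()
r = redis.Redis()

@router.get("/events")
async def events(session_id: str):
    async def stream():
        pubsub = r.pubsub()
        await pubsub.subscribe(f"session:{session_id}")
        try:
            async for msg in pubsub.listen():
                if msg["type"] == "message":                   # skip subscribe acks
                    yield f"data: {msg['data'].decode()}\n\n"  # one SSE frame
        finally:
            await pubsub.unsubscribe(f"session:{session_id}")
    return StreamingResponse(stream(), media_type="text/event-stream")
```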
Cancellation
Stop is cooperative but broad. A cancellation flag in Redis is polled during retrieval, OpenAI calls, stream reads, agent planning, tool execution, direct synthesis, and simulated answer streaming. Cancelled runs persist as cancelled, emit run.cancelled, clear the flag, and skip cancelled assistant credits or response-chain persistence.
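The polling pattern is simple; the key layout and exception type are illustrative:

```python
import redis.asyncio as redis

r = redis.Redis()

class RunCancelled(Exception):
    pass

async def check_cancelled(run_id: str):
    if await r.get(f"cancel:{run_id}"):    # flag set by the stop endpoint
        raise RunCancelled(run_id)

async def run_phases(run_id: str, phases):
    for phase in phases:                   # retrieval, LLM call, tool execution, ...
        await check_cancelled(run_id)      # poll between units of work
        await phase()
```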
Safety And Quotas
- Deterministic guards normalize Unicode and common homoglyphs before matching prompt-injection, exfiltration, abuse, and unsafe tool patterns.
- Retrieved text is treated as untrusted data. If chunks contain instruction-like content, the run records safety metadata and adds an explicit context warning.
- Budgets are layered: request rate, app credits, hourly tokens, upload storage, run wall-clock, planner loops, tool calls, integration timeout, circuit breaker, and web-fetch per-session / per-host windows.
- Idempotency keys protect retry-sensitive requests. Same key and same fingerprint replays the original result; same key and different input conflicts (sketched after this list).
- Production adds edge controls: guest-session proof-of-work, stricter Traefik IP limits for auth/session/token routes, crawler noindex rules for private routes, and in-cluster-only metrics access.
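A sketch of the idempotency claim; key format, TTL, and the replay lookup are illustrative:

```python
import hashlib
import json
import redis.asyncio as redis

r = redis.Redis()

async def claim_idempotency(key: str, body: dict, load_original):
    fp = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    if await r.set(f"idem:{key}", fp, nx=True, ex=86_400):  # first writer wins
        return None                           # fresh claim: run proceeds
    stored = await r.get(f"idem:{key}")
    if stored.decode() != fp:                 # same key, different input
        raise ValueError("idempotency key reused with different input")  # -> 409
    return await load_original(key)           # same fingerprint: replay result
```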
Provider Boundary
Generation can use app-paid OpenAI, user-provided OpenAI-compatible credentials, or the Free Ollama API when configured. The selected provider becomes run metadata. Document upload and retrieval currently stay on the shared OpenAI embedding index, including when the chat provider is Free Ollama API, because the configured Ollama Cloud path does not expose the embedding model used by this demo.
Observability
- Metrics cover HTTP duration, run duration, tokens, LLM latency, tool invocations, retries, fallbacks, queue depth, failed RQ jobs, ingestion stages, retrieval duration, and retrieval hit source.
- Labels stay low cardinality: mode, status, provider, model family, failure reason, tool kind, source. No session id, document id, filename, or raw prompt belongs in Prometheus (see the sketch after this list).
- Traces and logs carry enough bounded context to separate ingestion failures, retrieval misses, provider failures, tool timeouts, stale response-chain fallbacks, and safety blocks.
- Queue depth and failed-job gauges refresh from a long-running backend process, not only from scheduled maintenance, so Prometheus sees backlog changes between CronJob runs.
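Metric names here are illustrative; the point is the bounded label set:

```python
from prometheus_client import Counter, Histogram

RUN_DURATION = Histogram(
    "run_duration_seconds", "End-to-end run duration",
    ["mode", "status", "provider", "model_family"],
)
RETRIEVAL_HITS = Counter(
    "retrieval_hits_total", "Retrieval hits by originating search",
    ["source"],   # vector | fts | fused -- never session ids or filenames
)
```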
Conversation Continuity
Threads use provider response ids when available and also inject a bounded text history window. If a prior response id goes stale, the backend retries once without it and relies on injected history. Evidence remains current-turn scoped so citations do not drift from older retrieval results.
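Sketched with the OpenAI Responses API; treating a stale id as a `BadRequestError` is an assumption of this sketch, not a confirmed detail:

```python
from openai import OpenAI, BadRequestError

client = OpenAI()

def continue_thread(model, turn_input, last_response_id):
    try:
        return client.responses.create(
            model=model, input=turn_input,
            previous_response_id=last_response_id,  # provider-side continuity
        )
    except BadRequestError:                         # stale/unknown response id
        # Retry once without it; the injected history window carries the context.
        return client.responses.create(model=model, input=turn_input)
```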
Deployment
Local development runs a composed stack. The web app builds as a static Next export and is served from an unprivileged nginx image, with browser-visible environment baked at image build time. Production targets k3s on a VPS with separate web, API, worker, Postgres, Redis, object storage, maintenance jobs, and observability services. The current production-simulation manifests run two backend replicas and two worker replicas while keeping Postgres, Redis, and MinIO single-replica for the demo.
The deployment is intentionally conventional: readiness/liveness checks, migrations, CPU and memory resources, backups, rollback, and separate worker capacity for ingestion-heavy work. This tests app and worker scaling behavior without pretending the stateful layer is highly available.
Tradeoffs
- Responses API over Chat Completions. Better response chaining and metadata; fewer mature ecosystem helpers.
- RRF before rerank. Lower latency and simpler ops; less precision on some hard questions until recall data justifies reranking.
- SSE over WebSockets. Fits one-way run progress, works well with Redis pub/sub, and is simpler to deploy behind normal proxies.
- Revisioned chunks over in-place replacement. More storage churn during reindex, but avoids losing search while a new ingestion attempt fails.
- Curated integrations over arbitrary tools. Less open-ended than user-supplied servers, but safer for auth, approval, rate limits, and failure handling.
- Production simulation over full HA. Two backend pods and two worker pods exercise app scaling and queue concurrency, while single-replica Postgres, Redis, and MinIO keep the demo understandable and cheap to run.
Known limits & next steps
This section exists because every real system has failure modes and most portfolios hide them. Here's what Agent Lab does not yet do well, and what's next.
- Re-rank is missing. Cross-encoder reranking after hybrid candidates is still a measured upgrade, not default hot-path latency.
- OCR on low-quality scans. Tesseract with `heb+eng` handles typeset Hebrew well, but struggles on old handwritten forms and heavily ruled tables. A gated `ocrmypdf` pass now catches harder scans; vision-model OCR remains the later fallback if evals prove it is worth the cost.
- Eval coverage. Scenario tests, a judge-model assertion, and the retrieval gold set are in place. Recall runs manually and weekly, but the gold set still needs to grow with real failure cases.
- Chunk highlights are best-effort. Stored PDF bounding boxes drive exact overlays when available; older chunks or geometry misses still fall back to text-layer substring highlighting.
- In-process agent failover. Agent runs execute inside the backend pod that accepted the request. LangGraph checkpoints help resume approval-gated state, but an in-flight non-background provider or tool call is not migrated if that pod dies.
- Integration breaker half-open. Tripped breakers need a real tool success to close; an exponential probe window would let a recovered integration heal on its own.