In Q4 2025, a FinTech client asked us to deploy a RAG system over their internal compliance documentation — 14,000+ pages of regulatory filings, internal policies, and audit reports. The system had to answer complex questions from compliance officers in real time, while meeting SOC 2 Type II controls and providing full audit trails for every answer.
The Stack (After Three Iterations)
- Document parsing: Unstructured.io for PDFs, custom parsers for proprietary formats
- Chunking strategy: Hierarchical (section → subsection → paragraph) with 20% overlap
- Embeddings: Voyage-3 for retrieval, OpenAI text-embedding-3-large for fallback
- Vector store: pgvector on AWS RDS Postgres
- Reranking: Cohere Rerank v3
- Generation: Claude 3.7 Sonnet
- Audit layer: Custom — every retrieval + generation logged to immutable storage
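The hierarchical chunking with 20% overlap listed above could look roughly like this. It is a minimal sketch under stated assumptions, not the production chunker: the `Chunk` dataclass, the nested-dict document shape, the function names, and the word-based window of 200 words are all illustrative. The post itself only specifies the section → subsection → paragraph hierarchy and the 20% overlap.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section_path: str  # e.g. "AML Compliance Manual > 4.2 > Sanction Screening"
    content: str


def split_with_overlap(words: list[str], size: int, overlap: float) -> list[list[str]]:
    """Slide a window of `size` words; each window shares `overlap` of its words with the previous one."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, size - round(size * overlap))
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows


def chunk_document(node: dict, parents: tuple[str, ...] = (),
                   size: int = 200, overlap: float = 0.2) -> list[Chunk]:
    """Walk section -> subsection -> paragraph, tagging every chunk with its path."""
    path = parents + (node["title"],)
    chunks = []
    for paragraph in node.get("paragraphs", []):
        for window in split_with_overlap(paragraph.split(), size, overlap):
            chunks.append(Chunk(" > ".join(path), " ".join(window)))
    for child in node.get("children", []):
        chunks.extend(chunk_document(child, path, size, overlap))
    return chunks
```

Carrying the section path on every chunk is what lets a later stage rebuild document context. A token-based window would track the embedding model's limits more closely than the word count assumed here.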
The Retrieval + Citation Discipline (Working Code)
import os
from contextlib import closing

import anthropic
import psycopg2
from voyageai import Client as VoyageClient

# Assumption: the connection string comes from the environment; the variable name is illustrative.
DSN = os.environ["COMPLIANCE_DB_DSN"]

voyage = VoyageClient()         # reads VOYAGE_API_KEY from the environment
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def retrieve(question: str, k: int = 12):
    """Vector retrieval against pgvector."""
    emb = voyage.embed([question], model="voyage-3").embeddings[0]
    # psycopg2's connection context manager only ends the transaction,
    # so closing() is what actually releases the connection.
    with closing(psycopg2.connect(DSN)) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT chunk_id, document_id, section_path, content, content_hash,
                   1 - (embedding <=> %s::vector) AS sim
            FROM compliance_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (emb, emb, k))
        return cur.fetchall()


def answer(question: str):
    chunks = retrieve(question, k=12)
    # Tag each chunk with its ID and section path so the model can cite it.
    context = "\n\n".join(
        f"[chunk:{c[0]}|{c[2]}]\n{c[3]}" for c in chunks
    )
    response = claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=2000,
        system="""You are a compliance assistant.
Cite every claim with [chunk:ID]. If the answer is not in the context,
say so explicitly. Never invent regulatory references.""",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
    )
    # audit_log is the custom audit layer (not shown here).
    audit_log(question, chunks, response)  # immutable S3 Object Lock
    return response.content[0].text
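The prompt asks the model to cite with `[chunk:ID]`, but a prompt alone cannot guarantee that every cited ID was actually retrieved. One way to enforce that is a post-generation check like the sketch below. The function name, the regular expression, and the returned dictionary shape are assumptions for illustration, not part of the system described here.

```python
import re

# Matches "[chunk:17]" as well as "[chunk:17|4.2 Sanction Screening]",
# capturing only the ID portion.
CITATION_RE = re.compile(r"\[chunk:([^\]|\s]+)")


def check_citations(answer_text: str, retrieved_ids: set[str]) -> dict:
    """Compare the chunk IDs cited in an answer with the IDs that were retrieved."""
    cited = set(CITATION_RE.findall(answer_text))
    return {
        "cited": sorted(cited),
        "unknown": sorted(cited - retrieved_ids),  # cited but never retrieved
        "has_citations": bool(cited),
    }
```

With the row layout used by `retrieve`, the ID set could be built as `{str(c[0]) for c in chunks}`. An answer with a non-empty `unknown` list, or one that makes claims with no citations at all, could then be flagged for review rather than shown to the compliance officer.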
What We Got Wrong First
Our initial implementation used semantic chunking with no hierarchical context. Retrieval quality was poor because chunks lost their document context. The fix: every chunk is now embedded with a synthetic preamble such as “From section 4.2 of the AML Compliance Manual, subsection on Sanction Screening:”. This single change improved retrieval accuracy by 31% on our benchmark.
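A preamble of this kind can be generated from metadata the parser already has. The sketch below assumes the document title, section number, and subsection title are available as separate fields. The function names are hypothetical, and whether the stored `content` column keeps the raw text or the prefixed text is a design choice this post does not specify.

```python
from typing import Optional


def build_preamble(document_title: str, section: str,
                   subsection: Optional[str] = None) -> str:
    """Build a one-line description of where a chunk sits in its document."""
    preamble = f"From section {section} of the {document_title}"
    if subsection:
        preamble += f", subsection on {subsection}"
    return preamble + ":"


def text_to_embed(preamble: str, content: str) -> str:
    """Prefix the chunk with its preamble; this combined string is what gets embedded."""
    return f"{preamble}\n{content}"
```

Embedding the prefixed text while returning only the raw chunk to the model is one option: the vector then carries document context without repeating the preamble in every prompt.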
Production Metrics (4 Months In)
- 2,400+ queries/month from compliance officers
- 96.4% answer satisfaction (in-app feedback)
- Zero audit findings from the Q1 2026 SOC 2 review
- Estimated 60+ analyst hours/month saved on document lookup
Need help with compliance-grade RAG?
Ohveda runs free 30-minute architecture reviews. We will identify your top opportunities in writing within 48 hours, at no cost.