
Building a Compliance-Grade RAG System: Lessons from a 2026 FinTech Deployment


In Q4 2025, a FinTech client asked us to deploy a RAG system over their internal compliance documentation — 14,000+ pages of regulatory filings, internal policies, and audit reports. The system had to answer complex questions from compliance officers in real time, while meeting SOC 2 Type II controls and providing full audit trails for every answer.

The Stack (After Three Iterations)

  • Document parsing: Unstructured.io for PDFs, custom parsers for proprietary formats
  • Chunking strategy: Hierarchical (section → subsection → paragraph) with 20% overlap
  • Embeddings: Voyage-3 for retrieval, OpenAI text-embedding-3-large as a fallback
  • Vector store: pgvector on AWS RDS Postgres
  • Reranking: Cohere Rerank v3
  • Generation: Claude 3.7 Sonnet
  • Audit layer: Custom — every retrieval + generation logged to immutable storage
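The stack places Cohere Rerank v3 between vector retrieval and generation, but the retrieval code in this post stops at pgvector. Here is a minimal sketch of that step, assuming the Cohere Python SDK's `rerank` method and the `rerank-english-v3.0` model name. `rerank_chunks` and its `top_n=5` default are our illustration, not the deployed code. It takes the row tuples returned by `retrieve()` and returns them in reranked order.

```python
# Hypothetical helper: reorder pgvector rows with a Cohere-style reranker.
# Rows are assumed to be (chunk_id, document_id, section_path, content, content_hash, sim),
# matching the SELECT in retrieve().


def rerank_chunks(client, question, rows, top_n=5, model="rerank-english-v3.0"):
    """Return up to top_n rows, most relevant first, as judged by the reranker."""
    if not rows:
        return []
    response = client.rerank(
        model=model,
        query=question,
        documents=[row[3] for row in rows],  # rerank on chunk text only
        top_n=min(top_n, len(rows)),
    )
    # Each result carries .index, which points back into the documents list.
    return [rows[result.index] for result in response.results]
```

In production, `client` would be a `cohere.Client` instance, for example `rerank_chunks(co, question, retrieve(question, k=12))`. Over-retrieving 12 candidates and keeping 5 is an illustrative split. It lets a cheap vector search cast a wide net while the cross-encoder decides what actually reaches the prompt.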
[Figure: Compliance-Grade RAG Pipeline, SOC 2 + audit-trail architecture. Documents (14K pages) → Chunking (hierarchical) → Embeddings (Voyage-3) → pgvector (on RDS) → Rerank (Cohere v3) → Claude 3.7 (citation-disciplined). Audit trail: every query → S3 Object Lock, 7-year retention, recording question, retrieved chunks (hashed), prompt version, model version, final answer. Callouts: zero audit findings (Q1 2026 SOC 2); 96.4% answer satisfaction; 2,400+ queries/month; p95 retrieval 180 ms.]
The compliance-grade RAG architecture Ohveda deployed for a regulated FinTech in Q4 2025.
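The audit trail in the diagram records five things per query: the question, hashed retrieved chunks, prompt version, model version, and final answer. These go to S3 with Object Lock and 7-year retention. Below is a minimal sketch of assembling such a record. It assumes SHA-256 for chunk hashes, a hypothetical `audit/YYYY/MM/DD/<hash>.json` key layout, and COMPLIANCE-mode retention, and the helper names are ours.

The second function only builds keyword arguments for boto3's `put_object` (`ObjectLockMode`, `ObjectLockRetainUntilDate`), so it runs without AWS access. The target bucket must have Object Lock enabled for the real call to succeed.

```python
import hashlib
import json
import math
from datetime import datetime, timedelta, timezone


def build_audit_record(question, chunk_rows, prompt_version, model_version,
                       answer_text, now=None):
    """One audit entry with the fields shown in the diagram.

    chunk_rows are assumed to be (chunk_id, document_id, section_path,
    content, content_hash, sim) tuples, as returned by retrieve().
    """
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "question": question,
        # Store each chunk's ID and a SHA-256 of its text, not the text itself.
        "retrieved_chunks": [
            {"chunk_id": row[0],
             "sha256": hashlib.sha256(row[3].encode("utf-8")).hexdigest()}
            for row in chunk_rows
        ],
        "prompt_version": prompt_version,
        "model_version": model_version,
        "answer": answer_text,
    }


def object_lock_put_kwargs(record, bucket, retention_years=7):
    """Keyword arguments for boto3 s3.put_object with COMPLIANCE-mode retention."""
    ts = datetime.fromisoformat(record["timestamp"])
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    # ceil(365.25 * years) days covers that many calendar years, leap days included.
    retain_until = ts + timedelta(days=math.ceil(365.25 * retention_years))
    return {
        "Bucket": bucket,
        "Key": f"audit/{ts:%Y/%m/%d}/{hashlib.sha256(body).hexdigest()}.json",
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until,
    }
```

Usage would be `boto3.client("s3").put_object(**object_lock_put_kwargs(record, "your-audit-bucket"))`. COMPLIANCE mode is the stricter of the two Object Lock modes: no user, including the root account, can shorten or remove the retention period before it expires. GOVERNANCE mode, by contrast, allows specially permitted users to bypass it.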

The Retrieval + Citation Discipline (Working Code)

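import os
from contextlib import closing

import anthropic
import psycopg2
from voyageai import Client as VoyageClient

# Postgres connection string; the environment variable name is illustrative.
DSN = os.environ["COMPLIANCE_DB_DSN"]

voyage = VoyageClient()         # reads VOYAGE_API_KEY from the environment
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def retrieve(question: str, k: int = 12):
    """Top-k cosine-similarity retrieval against pgvector."""
    emb = voyage.embed([question], model="voyage-3", input_type="query").embeddings[0]
    # pgvector's text format is '[x1,x2,...]'; pass that and cast it to vector.
    vec = "[" + ",".join(map(str, emb)) + "]"
    # psycopg2's connection context manager only ends the transaction,
    # so closing() is needed to actually release the connection.
    with closing(psycopg2.connect(DSN)) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT chunk_id, document_id, section_path, content, content_hash,
                   1 - (embedding <=> %s::vector) AS sim
            FROM compliance_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (vec, vec, k))
        return cur.fetchall()


def answer(question: str):
    chunks = retrieve(question, k=12)
    # Label every chunk with its ID and section path so the model can cite it.
    context = "\n\n".join(
        f"[chunk:{c[0]}|{c[2]}]\n{c[3]}" for c in chunks
    )

    response = claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=2000,
        system="""You are a compliance assistant.
Cite every claim with [chunk:ID]. If the answer is not in the context,
say so explicitly. Never invent regulatory references.""",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
    )

    audit_log(question, chunks, response)  # defined elsewhere: writes to S3 with Object Lock
    return response.content[0].text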
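The system prompt asks for citations but cannot guarantee them. A cheap post-check is to parse the answer and compare every cited ID against the rows that were actually retrieved. Here is a minimal sketch, assuming the `[chunk:ID]` format and row layout used above. `check_citations` is a hypothetical helper, not part of the deployed code. It would run inside `answer()`, where `chunks` is in scope, before `audit_log`.

```python
import re

# Matches [chunk:ID] and also tolerates the [chunk:ID|section] form used in the context.
CITATION = re.compile(r"\[chunk:([^\]|]+)(?:\|[^\]]*)?\]")


def check_citations(answer_text, chunk_rows):
    """Return (cited_ids, unknown_ids) for a generated answer.

    unknown_ids are cited IDs that match no retrieved chunk, i.e. citations
    the model could not have grounded in the supplied context.
    """
    retrieved = {str(row[0]) for row in chunk_rows}
    cited = [c.strip() for c in CITATION.findall(answer_text)]
    unknown = sorted({c for c in cited if c not in retrieved})
    return cited, unknown
```

A non-empty `unknown` list is a reason to block or regenerate the answer. An empty `cited` list is not automatically an error, because the prompt allows "not in the context" answers that carry no citations.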

What We Got Wrong First

Our initial implementation used semantic chunking with no hierarchical context. Retrieval quality suffered because each chunk was cut off from the document and section it came from. The fix: every chunk is now embedded with a synthetic preamble such as “From section 4.2 of the AML Compliance Manual, subsection on Sanction Screening:”. This single change improved retrieval accuracy by 31% on our benchmark.
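Here is a minimal sketch of the preamble approach combined with the 20% overlap listed in the stack. It assumes fixed word-count windows within each subsection, since the deployment's actual splitting rules are not described here. `Chunk`, `make_preamble`, `window_words`, `chunk_subsection`, and the 200-word window size are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section_path: str  # e.g. "4.2 > Sanction Screening"
    text: str          # original wording, used in the prompt and for citations
    embed_text: str    # preamble + text: the string sent to the embedding model


def make_preamble(doc_title, section, subsection):
    return f"From section {section} of the {doc_title}, subsection on {subsection}:"


def window_words(words, size, overlap=0.2):
    """Split words into windows of `size`, each sharing ~overlap*size words with the previous one."""
    step = max(1, round(size * (1 - overlap)))
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end
    return windows


def chunk_subsection(doc_title, section, subsection, paragraphs, size=200):
    """Chunk one subsection's paragraphs, prefixing each chunk's embed_text with its location."""
    words = " ".join(paragraphs).split()
    preamble = make_preamble(doc_title, section, subsection)
    chunks = []
    for window in window_words(words, size):
        text = " ".join(window)
        chunks.append(Chunk(f"{section} > {subsection}", text, f"{preamble} {text}"))
    return chunks
```

Only `embed_text` goes to the embedder, so the vector carries the document context. `text` stays free of the preamble, which keeps the prompt compact and the cited wording identical to the source.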

Production Metrics (4 Months In)

  • 2,400+ queries/month from compliance officers
  • 96.4% answer satisfaction (in-app feedback)
  • Zero audit findings from the Q1 2026 SOC 2 review
  • Estimated 60+ analyst hours/month saved on document lookup
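Taking the two volume figures at face value, the hours saved can be restated per query. Both inputs are lower bounds ("60+" and "2,400+"), so the result is only a rough average, not a measured per-query saving.

```latex
\frac{60~\text{h/month} \times 60~\text{min/h}}{2400~\text{queries/month}}
  = \frac{3600~\text{min}}{2400~\text{queries}}
  = 1.5~\text{min/query}
```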

Ready to optimize your cloud or AI footprint?

Book a free 30-minute architecture review. We will deliver a written cost-and-architecture audit within 48 hours.

Book a free architecture review · sales@ohveda.com
