
Building a Contract Review Agent with LLMs, pgvector, and a Dash of Production Regret
Somewhere around the third time a lawyer friend showed me a vendor contract with a liability cap buried in a footnote, I decided this was a tooling problem. Legal review for MSAs and vendor paper is expensive, repetitive, and deeply adversarial. Vendors use the same tricks. The same indemnity clauses, same microscopic caps, same auto-renewal traps. Every senior lawyer spends part of their week catching the same things they caught last week in slightly different fonts.
Legal Agent is my attempt at compressing that first-pass into a structured workflow: upload a contract, extract typed clauses, score risk against a living playbook, propose redlines, and send only the flagged items to human review. Here's how I built it, what actually runs, and what nearly took the backend down on day one.
The Pipeline (Six Stages, One Goal)

Each stage is a distinct concern. The guardrails fire before any scoring — deterministic checks run faster and catch anything catastrophic without spending tokens on a nuanced risk narrative that won't matter if the indemnity clause is outright unlimited.
Extraction: Getting Typed Clauses from Unstructured Text
The LLM extraction stage uses with_structured_output to get Pydantic models back directly from the model. No regex parsing, no brittle JSON prompts. You hand the model a schema and it returns a typed object or raises.
chain = llm.with_structured_output(ContractClauses)
clauses: ContractClauses = await asyncio.get_event_loop().run_in_executor(
_cpu_pool, lambda: chain.invoke({"text": chunk})
)The run_in_executor detail matters. SQLAlchemy async sessions and LangChain's sync .invoke() don't coexist happily in the same coroutine. The executor keeps the blocking call off the event loop without spawning a thread per request.
Chunking strategy for large contracts: I split on clause headings when available (regex against common MSA section patterns), fall back to 1000-token sliding windows with 200-token overlap if the document doesn't have obvious structure. This is the part I'd do differently (more on that later).
Risk Scoring: Two Signals, One Number
The risk.py chain combines two signals before generating the narrative:
Rule weights — each bright-line flag from the guardrail stage carries a pre-assigned severity (0–30 points).
LLM holistic score — the model sees the full clause context and the rule violations and returns a score with a
chain-of-thoughtreasoning block I keep for audit but don't surface to users.
These are merged:
def merge_levels(rule_score: int, llm_score: int) -> int:
# rule engine is authoritative for hard violations
return max(rule_score, int(llm_score * 0.7 + rule_score * 0.3))Simple, but effective. Hard violations can't be "reasoned down" by an LLM that thinks the overall contract is fair.
Playbook RAG: Why pgvector, Not Pinecone
The playbook lives in Postgres as PlaybookEntry rows — preferred language for each clause type, policy rationale, and a Vector(1536) column for the embedding:
class PlaybookEntry(Base):
__tablename__ = "playbook_entries"
clause_type: Mapped[str]
preferred_language: Mapped[str]
policy_rationale: Mapped[Optional[str]]
embedding: Mapped[list[float]] = mapped_column(Vector(1536))The retrieval path is a hybrid RRF (Reciprocal Rank Fusion): cosine similarity on the embedding column plus Postgres full-text search on preferred_language, results merged by rank rather than score. This degrades gracefully — if embeddings are stale (e.g., you changed playbook wording and haven't re-run make vector.index), FTS still returns something sane.
Why pgvector over Pinecone or Weaviate? The playbook is O(hundreds) of entries, not millions. Another managed service meant another auth secret, another billing account, another failure mode. The operational simplicity of "it's just a Postgres column" wins at this scale. If we ever hit tens of millions of vectors, that calculus changes.
The Migration That Almost Wasn't
Here's the war story. I added the submitted_for_review_at column to documents and a reviewer workflow to redlines in Alembic migration 0004, named 0004_redline_review_document_submit. Ran alembic upgrade head. The DDL executed — ALTER TABLE documents ADD COLUMN submitted_for_review_at TIMESTAMPTZ — successfully. Then Alembic tried to update alembic_version:
asyncpg.exceptions.StringDataRightTruncationError:
value too long for type character varying(32)The alembic_version.version_num column is VARCHAR(32). The revision string 0004_redline_review_document_submit is 34 characters. So: columns exist in the database, version table never updated, migration "never ran" according to Alembic. Next upgrade attempt? Alembic tries to run 0004 again, hits column already exists, crashes differently.
Fix: shorten the revision ID to 0004_redline_submit (19 chars), then alembic stamp 0004_redline_submit to sync the version table with reality.
Lesson: keep Alembic revision IDs under 32 characters. Alembic doesn't validate this for you.
The Agent Layer: LangGraph + ReAct
For the orchestration layer I used LangGraph's ReAct pattern. The agent has tools: extract_clauses, run_guardrails, score_risk, retrieve_playbook, propose_redlines. The two-phase pattern was deliberate: the agent first reasons about which clauses need deeper scrutiny (given the rule violations), then invokes retrieval only for those — rather than retrieving against every clause regardless.
SSE streaming (/api/v1/review/stream) sends incremental events as each stage completes:
data: {"stage": "extraction", "progress": 40, "clauses_found": 12}
data: {"stage": "risk_scoring", "progress": 70, "score": 67}
data: {"stage": "redlines", "progress": 90, "count": 5}
data: {"stage": "complete", "progress": 100}The frontend renders a live progress rail instead of a spinner. For a process that can take 20–40 seconds on a long contract, this matters more than it sounds.
Honest Retrospective
What I'd change:
The clause chunking loop is sequential — one LLM call per clause type. For a 40-clause contract that's 40 serial round-trips. Batching into a single structured extraction call with the full contract would cut latency by ~80%. I kept sequential extraction for simplicity during development and never went back.
I also don't have an eval harness. Prompts get adjusted by feel, not by regression. That's fine for a solo project; it's not fine the moment a second engineer touches the prompts or a model upgrade silently changes extraction behavior.
What's next:
Multi-agent supervision: a second LangGraph agent that challenges the first's risk scores — adversarial review
Slack approval routing: send flagged redlines to a reviewer channel, capture approval/rejection via slash command
Prompt versioning tied to migration IDs so you know which prompt produced which clause extraction in the audit trail
The Takeaway
LLMs don't replace legal judgment. What they do is handle the volume problem — getting a structured, annotated, risk-scored first pass in front of a human reviewer who can then spend their time on actual judgment calls instead of hunting for auto-renewal clauses.
The tooling complexity is real: async SQLAlchemy, RAG retrieval tuning, streaming UX, migration hygiene. None of it is magic. It's just software engineering applied to a domain that hasn't had much of it yet.