The Epstein Files — We Made Them Rigorously Searchable With One-Off Data Pipelines
How we made the DOJ's release of the Epstein files semantically searchable with one-off data pipelines per query — no traditional RAG involved.
Abstract
- 01 Traditional embedding-based RAG breaks down on messy, large-scale document corpora like the Epstein files.
- 02 A structured pre-processing pipeline — ingestion, transcription, extraction, resolution — creates a queryable dataset, not just embedded chunks.
- 03 Each user query generates a one-off pipeline that runs against the full corpus, not a sampled subset.
- 04 Deterministic operations handle filtering and aggregation; LLM calls are reserved for genuine judgment — interpretation and synthesis.
- 05 The result is speed (minutes, not hours), traceability (every answer links to source documents), and reliability (fewer hallucinations).
We took everything we know about large-scale data systems and applied it to the DOJ’s public release of the Epstein files. The result: a system that can query the entire corpus — including messy scans, exhibits, and OCR-heavy documents — in minutes. And no, it’s not built using traditional RAG.
The rest of this post walks through how we did this under the hood, and is moderately technical.
Why we didn’t use traditional embedding-based RAG
Most of us have been there — setting up vector embeddings, experimenting with chunking strategies, maybe getting something that works well enough in a demo. We’ve been through that too. And in our experience, these systems sorta work… until they really don’t.
The issue we kept running into wasn’t just about retrieval. It was about what happens before retrieval — the “compression” step, when you take a large, messy document and collapse it down into a vector. At some point, you have to ask: does that representation actually preserve what matters? We’re not convinced it does, especially at scale.
Now throw in a corpus like the Epstein files — scanned exhibits, heavy redactions, inconsistent naming conventions, OCR artifacts, fragmented references. In that environment, embedding-based retrieval gets brittle fast. Specifically, three things tend to go wrong:
- 01
Context fragments
A meaningful signal might span several sections of a document, but retrieval only surfaces pieces of it.
- 02
Entities drift
"HRH Prince Andrew," "Prince Andrew," and "The Duke of York" don't reliably resolve to the same person.
- 03
False confidence
The system returns plausible-sounding snippets and acts certain — even when it hasn't actually reasoned across the full corpus.
For casual or exploratory search, that’s probably fine. For investigative work, it isn’t. We needed a system that could evaluate the corpus much more deterministically when the question called for it — not one that samples it and hopes for the best.
The pre-processing pipeline
Rather than starting with retrieval, we built a structured data layer underneath everything. The pre-processing pipeline runs in four stages:
- 01
Ingestion
Provenance tracking and deduplication from day one.
- 02
Transcription
OCR and structured descriptions so every document becomes searchable text.
- 03
Extraction
People, organizations, locations, dates, and relationships pulled into structured form.
- 04
Resolution
Obvious aliases normalized so that references actually cohere across documents.
The end result is a curated, queryable dataset — not just a pile of embedded chunks. And that’s the input into the next stage that involves the user query.
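To make two of these stages concrete, here is a minimal sketch of ingestion-time deduplication with provenance, plus alias resolution. It is illustrative only, not our production code: the function names (`ingest`, `resolve`), the `ALIASES` table, the sample file paths, and the choice of content hashing for deduplication are all assumptions made for the example.

```python
import hashlib

# Hypothetical alias table; a real resolver would be far larger and curated.
ALIASES = {
    "hrh prince andrew": "Prince Andrew",
    "prince andrew": "Prince Andrew",
    "the duke of york": "Prince Andrew",
}


def ingest(files):
    """Deduplicate by content hash, recording every source path as provenance."""
    corpus = {}
    for path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        record = corpus.setdefault(digest, {"content": content, "sources": []})
        record["sources"].append(path)
    return corpus


def resolve(mention):
    """Map a raw mention to a canonical name; unknown mentions pass through."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    return ALIASES.get(key, mention.strip())


# Hypothetical inputs: (path, raw bytes) pairs standing in for released files.
files = [
    ("release_1/exhibit_a.pdf", b"scanned page bytes"),
    ("release_2/exhibit_a_copy.pdf", b"scanned page bytes"),  # exact duplicate
    ("release_2/exhibit_b.pdf", b"different page bytes"),
]
corpus = ingest(files)
canonical = resolve("The Duke  of York")
```

In this sketch the duplicate exhibit collapses into one record that still remembers both source paths, and all three spellings of the name land on the same canonical entity, which is what lets later filters and joins treat them as one person.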
From intent to post-query pipeline
When a user submits a query, we don’t retrieve snippets and summarize them. Instead, we translate the intent behind that query into a one-off pipeline that runs across the full corpus. Yes, the full corpus: there is no sampling and no random subset. The pipeline evaluates every document individually.
Within this one-off pipeline, much of the work is deterministic (e.g., filters, joins, aggregations). We bring in LLM calls only where genuine judgment is needed, such as interpretation and synthesis, and we parallelize them heavily. This gives us:
- Speed — most queries resolve in minutes, across the full corpus. For deterministic operations we use frameworks built for high data scale (Polars / PySpark depending on the data); for LLM-based operations we heavily parallelize our requests.
- Traceability — every answer links back to a source document.
- Reliability — less drift, fewer hallucinations.
The best way to think about it: this behaves less like a chatbot sitting on top of a pile of PDFs and more like a data pipeline in which inference is one transformation among many deterministic ones.
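The shape of such a one-off pipeline can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our actual implementation: the record fields, the `judge` function (a keyword check standing in for an LLM call), and the sample documents are all hypothetical, and plain Python lists stand in for the Polars or PySpark frames mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-processed records; the field names are assumptions.
DOCS = [
    {"doc_id": "doc-001", "entities": ["Person A"], "text": "Person A flew on 2002-03-01."},
    {"doc_id": "doc-002", "entities": ["Person B"], "text": "Unrelated memo."},
    {"doc_id": "doc-003", "entities": ["Person A", "Person B"], "text": "Person A met Person B."},
]


def judge(doc):
    """Placeholder for an LLM call; a trivial keyword check stands in here."""
    relevant = "met" in doc["text"] or "flew" in doc["text"]
    return {"doc_id": doc["doc_id"], "relevant": relevant, "excerpt": doc["text"]}


def run_pipeline(docs, entity, max_workers=8):
    # Deterministic step: scan every document and filter on the structured field.
    candidates = [d for d in docs if entity in d["entities"]]
    # Judgment step: fan the (stubbed) model calls out in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(judge, candidates))
    # Deterministic step: keep relevant hits, each carrying its source doc_id.
    return [v for v in verdicts if v["relevant"]]


results = run_pipeline(DOCS, "Person A")
for hit in results:
    print(hit["doc_id"], hit["excerpt"])
```

Every result row keeps its `doc_id`, so the final answer can point back to the documents it came from, and the expensive judgment step only runs on documents that survived the deterministic filter.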
And the icing on the cake: composing this pipeline is itself an act of inference, so we can go from user intent to written pipeline to executed pipeline in just a few minutes.
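One way to picture this, offered as a sketch rather than a description of our actual planner: a model emits a small declarative plan, and a fixed interpreter executes it using only whitelisted operations. The plan format, the operation names (`filter`, `count_by`), and the sample rows below are all hypothetical.

```python
# Hypothetical plan, in the shape a model might emit as JSON.
PLAN = [
    {"op": "filter", "field": "year", "equals": 2005},
    {"op": "count_by", "field": "location"},
]


def op_filter(rows, field, equals):
    """Keep rows whose field matches the given value exactly."""
    return [r for r in rows if r.get(field) == equals]


def op_count_by(rows, field):
    """Count rows per distinct value of the given field."""
    counts = {}
    for r in rows:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts


# Only operations listed here can run; anything else raises KeyError.
OPS = {"filter": op_filter, "count_by": op_count_by}


def execute(plan, rows):
    data = rows
    for step in plan:
        args = {k: v for k, v in step.items() if k != "op"}
        data = OPS[step["op"]](data, **args)
    return data


# Hypothetical structured rows from the pre-processed dataset.
ROWS = [
    {"doc_id": "doc-001", "year": 2005, "location": "City X"},
    {"doc_id": "doc-002", "year": 2005, "location": "City X"},
    {"doc_id": "doc-003", "year": 2006, "location": "City Y"},
]
summary = execute(PLAN, ROWS)
```

The design point of the sketch is the separation of concerns: inference decides which steps to run, while the steps themselves stay deterministic and inspectable.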
For the public interest — please reach out
There are people out there fighting the good fight with respect to Jeffrey Epstein and his co-conspirators. We want them to have a tool like this — to do deep research through these files, pull out instances from documents reliably, and make a compelling, fact-based case against an individual or organization very quickly.
If you or someone you know is in the press, in law enforcement, or in government, please reach out at founders@overstandlabs.com, or request access at overstandlabs.com/epstein-files.
Frequently asked questions
Traditional RAG embeds documents into vectors and retrieves snippets based on similarity. That works for casual search but breaks down on messy corpora with OCR artifacts, redactions, and inconsistent naming. Our system builds a structured data layer first, then translates each query into a one-off pipeline that runs against the full corpus deterministically — bringing in LLM calls only where genuine judgment is needed.
We're offering access to responsible journalists, researchers, and government officials working in the public interest. Reach out directly at founders@overstandlabs.com, or request access at overstandlabs.com/epstein-files.
Yes. As the DOJ releases additional documents, we ingest them through the same pre-processing pipeline — ingestion, transcription, extraction, and resolution — so the corpus stays current and queryable.