Smart search engine for the DOJ's Epstein files, built on a RAG pipeline that indexes them and makes them searchable in natural language (i.e. LLM style). Documents (PDFs, HTML, images) go through text extraction with OCR as a fallback for scanned pages, then get sliced into ~500-token chunks with 50-token overlap. Each chunk gets a metadata prefix baked in (document title, source section) before being embedded into 1536-dimensional vectors through OpenAI's text-embedding-3-small. Those vectors live in PostgreSQL with pgvector, sitting behind an HNSW index with ef_search cranked to 400 (the default of 40 misses too much).
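To make the ingest side concrete, here's a rough sketch of the chunk -> prefix -> embed -> store flow. The table layout, column names, and helper functions are illustrative assumptions, not the project's exact schema:

```python
# Minimal sketch: chunk a document, bake in the metadata prefix, embed, store in pgvector.
# Table layout, column names, and connection details are illustrative assumptions.
import psycopg2
import tiktoken
from openai import OpenAI

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    doc_title text,
    section   text,
    content   text,
    embedding vector(1536)
);
-- HNSW index on cosine distance; at query time the app raises hnsw.ef_search
-- from the default 40 to 400.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

enc = tiktoken.get_encoding("cl100k_base")
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chunk_tokens(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Slice text into ~500-token windows with 50 tokens of overlap."""
    toks = enc.encode(text)
    return [enc.decode(toks[i:i + size]) for i in range(0, len(toks), size - overlap)]


def index_document(conn, title: str, section: str, text: str) -> None:
    # Bake the metadata prefix into the string that actually gets embedded.
    prefixed = [f"Document: {title}\nSection: {section}\n\n{c}" for c in chunk_tokens(text)]

    embeddings = []
    for i in range(0, len(prefixed), 100):  # embeddings are requested in batches of 100
        resp = client.embeddings.create(model="text-embedding-3-small", input=prefixed[i:i + 100])
        embeddings.extend(d.embedding for d in resp.data)

    with conn.cursor() as cur:
        for chunk, emb in zip(prefixed, embeddings):
            cur.execute(
                "INSERT INTO chunks (doc_title, section, content, embedding) "
                "VALUES (%s, %s, %s, %s::vector)",
                (title, section, chunk, str(emb)),
            )
    conn.commit()

# Usage (connection string is illustrative):
# conn = psycopg2.connect("dbname=epstein_files")
# with conn.cursor() as cur:
#     cur.execute(DDL)
# conn.commit()
# index_document(conn, "DOJ Release Vol. 1", "Court filings", extracted_text)
```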
Queries hit the same embedding model, and the system pulls the top-K most similar chunks by cosine distance. There's a hybrid search mode too: it over-fetches 5x candidates from the vector index in parallel with keyword search (full-text search via a GIN-indexed tsvector column, falling back to trigram ILIKE when FTS returns few results). Results are merged using a slot reservation system: 60% of the final top-K comes from vector results ranked by cosine similarity, with up to 40% reserved for keyword-only matches that the vector search missed. Retrieved chunks get stuffed into a prompt with source metadata and sent to Claude Sonnet or GPT-4o with instructions to cite sources in bracket notation.
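Roughly, the slot-reservation merge looks like this (simplified sketch: vector_hits and keyword_hits are assumed to be pre-ranked (chunk_id, score) lists from the 5x over-fetched vector query and the FTS/trigram query, and the exact tie-breaking differs):

```python
def merge_hybrid(vector_hits, keyword_hits, top_k=10, vector_share=0.6):
    """Merge pre-ranked (chunk_id, score) lists from the vector and keyword paths.

    Both lists are over-fetched (5x top_k) and already sorted best-first. 60% of
    the final slots go to vector results; the rest are reserved for keyword-only
    matches that the vector search missed.
    """
    vector_slots = int(top_k * vector_share)  # e.g. 6 of 10
    merged, seen = [], set()

    # 1) Fill the vector slots first, ranked by cosine similarity.
    for cid, score in vector_hits[:vector_slots]:
        merged.append((cid, score, "vector"))
        seen.add(cid)

    # 2) Give the remaining (up to 40%) slots to keyword-only matches.
    for cid, score in keyword_hits:
        if len(merged) >= top_k:
            break
        if cid not in seen:
            merged.append((cid, score, "keyword"))
            seen.add(cid)

    # 3) If keyword search can't fill its reservation, backfill with more vector hits.
    for cid, score in vector_hits[vector_slots:]:
        if len(merged) >= top_k:
            break
        if cid not in seen:
            merged.append((cid, score, "vector"))
            seen.add(cid)

    return merged
```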
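And the generation step in sketch form, with the merged chunks numbered so the model can cite them in bracket notation (the prompt wording and model ID here are placeholders, not the exact prompt):

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def answer(question: str, chunks: list[dict]) -> str:
    # Number each chunk and prepend its source metadata so the model can cite [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {c['doc_title']} ({c['section']}):\n{c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    system = (
        "Answer using only the provided sources. "
        "Cite every claim with the matching source number in bracket notation, e.g. [2]."
    )
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.content[0].text
```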
On the backend, pub-sub workers handle the indexing pipeline: text extraction, chunking, batch embedding in groups of 100, and firing off face detection through AWS Rekognition on images pulled from PDFs (very good in some cases, not so much in others). The query endpoint is free with some rate limiting, but it also accepts x402 micropayments ($0.50) that bypass the rate limits when a valid payment is attached (these queries aren't cheap to run right now). There's also an MCP server so AI agents can query it directly as a tool.
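The MCP server is essentially a thin tool wrapper over the same retrieval path. A minimal sketch using the Python MCP SDK's FastMCP (the tool name, arguments, and return shape are illustrative, and the retrieval call is stubbed out):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("epstein-files-search")


def run_hybrid_search(query: str, top_k: int) -> list[dict]:
    """Placeholder for the real retrieval path (embed query, hybrid search, merge).

    In the actual server this would call into the pipeline sketched above; here it
    returns an empty list so the module stands alone.
    """
    return []


@mcp.tool()
def search_documents(query: str, top_k: int = 10) -> list[dict]:
    """Hybrid vector + keyword search over the indexed DOJ Epstein files.

    Returns up to top_k chunks with document title, section, and text so the
    calling agent can quote and cite them.
    """
    return run_hybrid_search(query, top_k)


if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so agents can attach it as a tool
```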
Built with the help of Claude, so some of the tech (RAG via LLM, pgvector, etc.) is newish to me. Was a fun exercise!