How I Turned 10,000 Company Documents Into an AI That Answers Any Question in Seconds

9 min read · AI · RAG · Architecture

A couple of years ago, a new engineer asked a simple question during onboarding: "What's our refund policy for customers who cancel mid-contract?"

Three people on the team gave three different answers. The actual policy was buried in a Google Doc that nobody could find. It took 45 minutes and a Slack thread with 12 people to get the correct answer.

That was the moment I decided to build an internal AI helpdesk — a system where anyone in the company could ask a question in plain English and get an accurate answer, sourced from our actual documents, in seconds.

Within months, it was handling over 300 queries per day. Support team response times dropped by 60%. New hire onboarding went from "ask everyone everything" to "ask the AI first." And the system hadn't hallucinated a wrong answer in production in months.

Here's exactly how I built it.

The problem: tribal knowledge doesn't scale

Every growing company hits the same wall. Knowledge lives in people's heads, in random Slack threads, in Google Docs that haven't been updated in months, in Confluence pages that nobody reads. When the company is 20 people, this works. When it's 200, it's a disaster.

In a fast-growing company, the problem compounds quickly. Product knowledge, compliance policies, operational procedures, vendor agreements — this isn't just "nice to know" information. Getting it wrong can mean a compliance violation or a customer getting incorrect guidance.

We had over 10,000 documents across Google Drive, Confluence, Notion, and internal wikis. The information was there. Finding it was the problem.

Why traditional search fails

The obvious solution is "just use search." We tried. It didn't work, for reasons that will be familiar to anyone who's tried searching across multiple document systems:

Keyword search is brittle. If you search "refund policy" but the document calls it "cancellation terms," you get nothing. Real questions don't use the same words as the documents that answer them.

Documents are long. Even when search finds the right document, you still have to read a 15-page PDF to find the one paragraph that answers your question.

Context is scattered. The answer to "can a customer switch from Plan A to Plan B mid-cycle?" might require information from three different documents — the product spec, the billing policy, and the operations handbook.

Traditional search finds documents. What we needed was a system that finds answers.

The architecture: Document → Parsing → Chunking → Embeddings → Vector DB → LLM

This is the RAG (Retrieval-Augmented Generation) pipeline. I'll walk through each stage, because the devil is in the details and most guides gloss over the parts that actually matter.

Stage 1: Document parsing

Before you can do anything intelligent with documents, you need to extract clean text from them. This sounds trivial. It's not.

Our documents were in every format imaginable: PDFs (some scanned, some digital), Google Docs, Confluence pages, Markdown files, Excel spreadsheets with policy tables, even PowerPoint presentations.

Each format needs different parsing:

  • Digital PDFs: libraries like PyMuPDF or pdfplumber extract text with layout awareness
  • Scanned PDFs: OCR with Google Cloud Vision API (this is where my Google Cloud Vertex AI certification came in handy — understanding the Vision API ecosystem saved weeks of trial and error)
  • Google Docs / Confluence: API-based extraction that preserves structure
  • Tables: specialized extraction that maintains row/column relationships — critical for policy and pricing documents where the answer depends on which row you're in

The biggest lesson: garbage in, garbage out. If your parser drops a table or mangles a heading, the AI will give wrong answers downstream. We spent three weeks just on parsing — more than on any other stage — and it was worth every day.

Stage 2: Chunking — the most underrated step

Once you have clean text, you need to split it into chunks. This is where most RAG implementations go wrong.

Why chunk at all? LLMs have context windows, and vector search works better with focused passages than with entire documents. You want each chunk to contain one coherent idea or answer.

Naive chunking (splitting every 500 tokens) is terrible. It splits sentences in half. It separates a question from its answer. It puts a table header in one chunk and the table data in another.

What we did instead:

Semantic chunking. We split on document structure — headings, sections, paragraphs — not on token count. A section titled "Refund Policy" becomes one chunk, regardless of whether it's 200 tokens or 800.

Overlap with context. Each chunk includes the document title, section hierarchy, and a small overlap with adjacent chunks. So a chunk from "Employee Handbook > Benefits > Insurance" carries that context with it.

Table-aware chunking. Tables are kept whole. A policy table with 10 rows stays as one chunk, because splitting it would destroy the meaning.

Metadata tagging. Every chunk gets tagged with source document, last updated date, document type (policy, product spec, operations guide), and department. This metadata becomes critical for filtering and attribution later.

We ended up with roughly 45,000 chunks from 10,000 documents. The average chunk is about 300-400 tokens — small enough for precise retrieval, large enough to contain a complete thought.
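The heading-based splitting can be sketched like this. It's a simplified version assuming markdown-style headings; the `Chunk` fields mirror the metadata tagging described above, and the field names are illustrative:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    title: str
    section_path: list       # e.g. ["Employee Handbook", "Benefits", "Insurance"]
    metadata: dict = field(default_factory=dict)

def chunk_by_headings(doc_title: str, markdown: str, metadata: dict) -> list:
    """Split on headings so each section becomes one chunk with its hierarchy."""
    chunks = []
    path = []   # current heading stack by level
    buf = []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(text=body, title=doc_title,
                                section_path=[doc_title] + path,
                                metadata=dict(metadata)))
        buf.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.+)", line)
        if m:
            flush()  # close out the previous section
            level = len(m.group(1))
            path[:] = path[:level - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return chunks
```

Every chunk carries its full `section_path`, so a retrieval hit on "Insurance" still knows it came from "Employee Handbook > Benefits > Insurance".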

Stage 3: Embeddings — turning text into math

Embeddings convert text chunks into high-dimensional vectors — essentially, mathematical representations of meaning. Similar meanings produce similar vectors, which is what makes semantic search possible.

We evaluated several embedding models:

  • OpenAI text-embedding-3-large: excellent quality, highest cost
  • Google Vertex AI embeddings: strong quality, good integration with our GCP infrastructure
  • Open-source models (e5-large, BGE): decent quality, self-hosted, no API costs

We went with Google Vertex AI embeddings — partly because of quality, partly because our infrastructure was already on GCP, and partly because my experience with the Vertex AI platform from my certification work meant I could optimize the pipeline quickly.

Key decisions that mattered:

Embed queries differently from documents. We use instruction-prefixed embeddings: document chunks get embedded as-is, but search queries get prefixed with "Represent this question for searching relevant passages:". This asymmetric approach improved retrieval accuracy significantly.

Batch processing for the initial load. Embedding 45,000 chunks one by one would take hours. We parallelized it across Vertex AI batch prediction endpoints — the full corpus was embedded in under 20 minutes.
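The asymmetric prefixing and the batching boil down to a few lines. This is a sketch of the prep step only, before the actual embedding API call; the batch size of 250 is illustrative:

```python
QUERY_PREFIX = "Represent this question for searching relevant passages: "

def prepare_for_embedding(texts: list, kind: str) -> list:
    """Asymmetric embedding prep: prefix queries, leave document chunks as-is."""
    if kind == "query":
        return [QUERY_PREFIX + t for t in texts]
    return list(texts)

def batches(items: list, size: int = 250):
    """Yield fixed-size batches for parallel embedding requests."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

The prefixed queries and the unprefixed chunks land in the same vector space, but the instruction nudges the query embedding toward "what passage answers this" rather than "what passage repeats these words".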

Stage 4: Vector database — where the chunks live

The vector database stores embeddings and enables fast similarity search. When a user asks a question, their query gets embedded into the same vector space, and the database finds the most similar chunks.

We evaluated Pinecone, Weaviate, Qdrant, and pgvector (Postgres extension). We chose Qdrant for several reasons:

  • Self-hosted on our GCP infrastructure (important for data privacy — sensitive internal data can't leave our environment)
  • Excellent filtering capabilities (we filter by document type, department, and recency)
  • Handles our scale easily (45K vectors is small for a vector DB)
  • Open source with a strong community

The filtering is critical. When someone from the operations team asks a question, we boost operational documents. When someone asks about billing, we boost policy documents. This hybrid approach — vector similarity + metadata filtering — dramatically improved answer relevance.
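Qdrant handles this natively, but the hybrid idea is easy to show in plain Python: hard metadata filters drop ineligible chunks, soft boosts nudge preferred ones up the ranking. The 1.2 boost factor is illustrative:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_search(query_vec, chunks, filters=None, boosts=None, top_k=5):
    """Vector similarity plus hard metadata filters and soft score boosts."""
    results = []
    for chunk in chunks:
        meta = chunk["metadata"]
        # Hard filter: drop chunks whose metadata doesn't match
        if filters and any(meta.get(k) != v for k, v in filters.items()):
            continue
        score = cosine(query_vec, chunk["vector"])
        # Soft boost: prefer certain document types or departments
        if boosts:
            for k, v in boosts.items():
                if meta.get(k) == v:
                    score *= 1.2  # boost factor is illustrative
        results.append((score, chunk))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_k]
```

In production the filter and boost values come from who is asking and what they asked about, which is exactly the per-team behavior described above.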

Stage 5: The LLM — turning retrieval into answers

The final stage: take the retrieved chunks, feed them to an LLM along with the user's question, and generate a natural language answer.

Our prompt architecture:

You are an internal knowledge assistant for [company].
Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say so.
Always cite your sources with document name and section.

Context:
[top 5 retrieved chunks with metadata]

Question: [user's question]

Key design decisions:

Strict grounding. The LLM is instructed to only use information from the retrieved context. No making things up. No "based on my general knowledge." If the documents don't contain the answer, the system says "I don't have enough information to answer this — here are the closest documents I found."

Source citation. Every answer includes clickable links to the source documents and specific sections. Users can verify any answer in seconds. This built trust faster than anything else.

Confidence scoring. We calculate a relevance score based on the vector similarity of retrieved chunks. If the top chunks have low similarity scores, the system warns that the answer may be incomplete.

Conversation memory. The system maintains conversation context, so users can ask follow-up questions like "What about for customers in California?" without restating the entire original question.
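Pulled together, the grounding, citation, and confidence rules look roughly like this. It's a sketch: the 0.75 threshold and the chunk field names are illustrative, and the real system also threads conversation history into the prompt:

```python
REFUSAL = ("I don't have enough information to answer this — "
           "here are the closest documents I found.")

def build_prompt(question: str, retrieved: list, min_score: float = 0.75):
    """Return (prompt, None) when grounded, or (None, refusal) otherwise.

    `retrieved` is a list of (similarity_score, chunk_dict) pairs.
    """
    # Confidence gate: refuse rather than answer from weak matches
    if not retrieved or max(score for score, _ in retrieved) < min_score:
        closest = "\n".join(f"- {c['title']}" for _, c in retrieved)
        return None, f"{REFUSAL}\n{closest}"
    # Each chunk carries its source path so the LLM can cite it
    context = "\n\n".join(
        f"[{c['title']} > {c['section']}]\n{c['text']}" for _, c in retrieved
    )
    prompt = (
        "You are an internal knowledge assistant for [company].\n"
        "Answer the question based ONLY on the provided context.\n"
        "If the context doesn't contain enough information, say so.\n"
        "Always cite your sources with document name and section.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return prompt, None
```

The important design point is that the refusal happens before the LLM is ever called: low-similarity retrievals never reach generation, which is where most hallucinations would otherwise start.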

The results

After running in production:

Response time: average 3.2 seconds from question to cited answer. Compare that to 15-45 minutes of Slack threads and document hunting.

Accuracy: 94% of answers rated "correct and complete" by users. The remaining 6% were mostly "partially correct" — the system found relevant information but missed context from a document that hadn't been ingested yet.

Adoption: 300+ queries per day across the company. Support team, operations team, product team, new hires — everyone uses it.

Support impact: tier-1 support response times dropped 60%. Agents use the AI to find answers instantly instead of escalating to senior team members.

Zero hallucinations in production over extended periods. The strict grounding approach works. When the system doesn't know, it says so.

What I'd do differently

Start with fewer document sources. We tried to ingest everything at once. Should have started with the 100 most-accessed documents and expanded from there.

Invest in chunk quality earlier. We iterated on our chunking strategy five times. Should have spent more time upfront analyzing how people actually ask questions, then designed chunks to match.

Build feedback loops from day one. We added thumbs up/down feedback later. Should have been there from launch. User feedback is the fastest way to find retrieval gaps.

The tech stack

For anyone building something similar:

  • Document parsing: PyMuPDF, Google Cloud Vision API (OCR), custom parsers per format
  • Chunking: custom Python pipeline with semantic splitting
  • Embeddings: Google Vertex AI text-embedding
  • Vector DB: Qdrant (self-hosted on GCP)
  • LLM: GPT-4 / Claude for generation (we A/B test between them)
  • Orchestration: LangChain for the RAG pipeline
  • Frontend: React chat interface embedded in our internal tools
  • Monitoring: custom dashboard tracking query volume, relevance scores, user feedback

The entire system runs on our existing GCP infrastructure. Monthly cost for serving 300+ queries/day: roughly $800, mostly LLM API calls. Compare that to the engineering hours it saves.

The bigger picture

RAG isn't magic. It's plumbing. Good plumbing.

The documents have to be parsed correctly. The chunks have to be the right size and shape. The embeddings have to capture meaning accurately. The retrieval has to find the right chunks. The LLM has to stay grounded in what was retrieved.

Get any one of those stages wrong and the system gives bad answers. Get them all right and you have something that genuinely transforms how a company operates.

Every company has this problem — years of accumulated knowledge locked in documents that nobody can find. The technology to unlock it exists today. The hard part isn't the AI. It's the engineering discipline to build each stage of the pipeline correctly.

And if there's one thing I've learned building engineering organizations: the discipline to do things right is always worth the investment.