Internal Knowledge Search With Citations
Finance, insurance, and professional services teams often have valuable knowledge spread across PDFs, folders, slides, policy documents, and old project materials. Search exists, but finding the right answer still takes too long.
AI knowledge search can help, but only if every answer is tied back to its sources. A model that answers fluently without citations is a liability in any team where a wrong answer has a regulatory or contractual cost. The technical work is not just "build a chatbot"; it is building a retrieval system that the team can audit.
Why Citations Matter
An answer without a source is hard to trust. A useful internal knowledge system should show which documents were used, where the answer came from, and what the user should verify.
This matters for compliance-sensitive teams because the AI should support work, not invent policy. Citations also change the failure mode: when the model is wrong, the user can see why — usually because the retrieved chunk was outdated, ambiguous, or off-topic. That diagnostic visibility is what turns a knowledge system from a black box into a tool the team can improve.
A workable citation contract for the response object:
```json
{
  "answer": "...",
  "citations": [
    {
      "doc_id": "policy-2024-03",
      "title": "Underwriting Guidelines 2024",
      "page": 17,
      "snippet": "...",
      "score": 0.82,
      "version": "rev-3",
      "approved_at": "2024-03-12"
    }
  ],
  "confidence": 0.74,
  "unresolved": false
}
```
Every answer should be reproducible: same question, same indexed version, same citations. If the system cannot reproduce its own output, the team cannot defend it.
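One way to make the reproducibility requirement checkable is to fingerprint each response and compare fingerprints across runs. The sketch below is illustrative only: the `Citation` fields mirror the contract above, while `index_version` and `response_fingerprint` are hypothetical names introduced here, and leaving `score` out of the hash is an assumption (scores can drift slightly between runs).

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Citation:
    # Same fields as the citation objects in the JSON contract.
    doc_id: str
    title: str
    page: int
    snippet: str
    score: float
    version: str
    approved_at: str


def response_fingerprint(
    question: str, index_version: str, citations: Sequence[Citation]
) -> str:
    """Hash the question, the index version, and the cited locations.

    Assumption: two runs count as "the same" when they cite the same
    (doc_id, version, page) triples, regardless of order or score.
    """
    cited = sorted((c.doc_id, c.version, c.page) for c in citations)
    payload = json.dumps(
        {
            "question": question.strip().lower(),
            "index_version": index_version,
            "cited": cited,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to each logged answer lets a reviewer replay a question against the same index version and confirm that the citations did not change.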
A Good First Scope
The first sprint should focus on one document collection and one team, with a deliberately narrow feature set:
Index approved documents
Search in natural language
Return answer summaries with citations
Show source snippets
Log questions and missing content
This creates a safe pilot before expanding to more repositories.
A practical baseline architecture for the first sprint:
Ingestion. A worker that pulls from one source (SharePoint, Google Drive, S3, or a network folder), runs OCR where needed (Tesseract, Azure Document Intelligence, or AWS Textract for scanned PDFs), and writes normalized markdown to object storage.
Chunking. Section-aware splits (heading, paragraph, table), 400-800 tokens, with overlap only where structure breaks. Store the original page number, section heading, and document version alongside each chunk.
Embedding and storage. OpenAI `text-embedding-3-large` or a comparable model, written to pgvector or a managed vector DB (Pinecone, Weaviate, Qdrant). Keep the raw text and metadata in Postgres so the chunk can always be reconstructed.
Retrieval. Hybrid search: BM25 (Postgres full-text or OpenSearch) plus vector similarity, then a re-ranker (Cohere Rerank, or a cross-encoder model) on the top 30 candidates.
Generation. A small prompt that takes the top 5-8 chunks and produces an answer with mandatory inline citations. Reject any output that cites a `doc_id` not in the retrieved set.
Surface. A minimal Next.js or Streamlit app showing the answer, the citation chips, and a preview of each cited chunk on hover.
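Two of the steps above are small enough to sketch directly: merging the BM25 and vector result lists before re-ranking, and rejecting answers that cite documents outside the retrieved set. The text does not say how the two rankings are combined; reciprocal rank fusion is assumed here as one common option, and the function names and the `k = 60` constant are illustrative choices, not part of the described system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def unknown_citations(
    cited_doc_ids: Iterable[str], retrieved_doc_ids: Iterable[str]
) -> Set[str]:
    """Return cited doc_ids that were never retrieved.

    An empty set means the answer passes the check; anything else
    should cause the answer to be rejected or regenerated.
    """
    return set(cited_doc_ids) - set(retrieved_doc_ids)


# Example: fuse a keyword ranking with a vector ranking, then check an answer.
bm25_hits = ["c1", "c2", "c3"]
vector_hits = ["c3", "c1", "c4"]
candidates = reciprocal_rank_fusion([bm25_hits, vector_hits])
bad = unknown_citations(["policy-2024-03", "memo-99"], ["policy-2024-03"])
```

In this example `candidates` places chunks found by both searches ahead of chunks found by only one, and `bad` contains `memo-99`, so the generated answer would be rejected.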
Access control is not optional. Every chunk should carry the same permissions as its source document, and the retrieval step must filter by the requesting user's access before generation. "The model accidentally summarized a confidential file" is a much more expensive incident than a missing answer.
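A minimal sketch of that permission filter follows, assuming each chunk carries a set of groups copied from its source document at ingestion. The `Chunk` shape, the `allowed_groups` field, and the group-based model are all assumptions made for illustration; a real deployment would mirror whatever ACL model the source repository uses.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    text: str
    # Hypothetical field: groups allowed to read the source document.
    allowed_groups: FrozenSet[str]


def filter_by_access(
    chunks: Sequence[Chunk], user_groups: FrozenSet[str]
) -> List[Chunk]:
    """Keep only chunks the requesting user may read.

    A chunk with no allowed groups is denied by default, so a missing
    ACL never leaks content into the prompt.
    """
    return [c for c in chunks if c.allowed_groups & user_groups]


# Example: an underwriter sees the policy chunk but not the HR chunk.
retrieved = [
    Chunk("c1", "policy-2024-03", "...", frozenset({"underwriting"})),
    Chunk("c2", "hr-salaries", "...", frozenset({"hr"})),
    Chunk("c3", "orphan-doc", "...", frozenset()),
]
visible = filter_by_access(retrieved, frozenset({"underwriting", "claims"}))
```

Filtering in application code like this is the simplest form. Where the store supports it, pushing the same condition into the retrieval query itself is likely safer, because restricted chunks then never leave the database.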
What To Measure
Measure time to answer, citation usefulness, repeated questions, missing documents, and user confidence.
Specific metrics that hold up in a pilot:
Citation precision. Of the citations the model returns, how many actually support the claim they are attached to? Sample 50-100 answers per week and label them by hand for the first month.
Answer coverage. Of real user questions, how many produced an answer with at least one valid citation, versus "I don't know"? "I don't know" is a feature, not a regression — track it but do not punish it.
Top unanswered topics. Cluster the unanswered questions; these are the gaps in the document corpus, and they are often more valuable than the answers themselves.
Reviewer override rate. When an expert reviewer reads the AI answer, how often do they edit or reject it? A falling override rate is the clearest sign the system is earning trust.
Latency. Track P50 and P95 end-to-end. Delivering a fully cited answer within 4-6 seconds is a reasonable target on a moderate corpus.
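Most of these metrics can be computed from a small table of hand-labeled answers. The sketch below assumes one record per sampled question. The `ReviewedAnswer` fields are hypothetical, every answered question is assumed to have been reviewed, and percentiles use the simple nearest-rank method rather than interpolation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ReviewedAnswer:
    answered: bool        # False when the system said "I don't know"
    citations_total: int  # citations returned with the answer
    citations_valid: int  # citations a reviewer judged to support the claim
    overridden: bool      # reviewer edited or rejected the answer
    latency_s: float      # end-to-end latency in seconds


def nearest_rank(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def pilot_metrics(rows: Sequence[ReviewedAnswer]) -> Dict[str, float]:
    if not rows:
        raise ValueError("need at least one reviewed answer")
    answered = [r for r in rows if r.answered]
    total = sum(r.citations_total for r in answered)
    valid = sum(r.citations_valid for r in answered)
    covered = sum(1 for r in answered if r.citations_valid >= 1)
    overridden = sum(1 for r in answered if r.overridden)
    latencies = sorted(r.latency_s for r in rows)
    return {
        "citation_precision": valid / total if total else 0.0,
        "answer_coverage": covered / len(rows),
        "override_rate": overridden / len(answered) if answered else 0.0,
        "latency_p50_s": nearest_rank(latencies, 50),
        "latency_p95_s": nearest_rank(latencies, 95),
    }


sample = [
    ReviewedAnswer(True, 2, 2, False, 2.0),
    ReviewedAnswer(True, 3, 1, True, 3.0),
    ReviewedAnswer(False, 0, 0, False, 1.0),
    ReviewedAnswer(True, 1, 0, True, 6.0),
]
metrics = pilot_metrics(sample)
```

Running this weekly on the labeled sample gives a trend line for each metric, which matters more in a pilot than any single week's value.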
If people trust the sources, the system can become a practical internal assistant rather than another search box. The route to that trust is boring and concrete: small corpus, strong retrieval, mandatory citations, visible failure modes, and a measurable weekly improvement loop.