My LLM Cited a Paper That Doesn't Exist. Here's How I Fixed It.

Three months into deploying a research assistant for a pharma client, I got the call every AI consultant dreads.
"Your system cited a paper. We tried to look it up. It doesn't exist."
Not just the URL — the entire paper. The title, the authors, the journal. All fabricated. The model had generated a perfectly plausible citation — "Zhang et al., 2023, Journal of Clinical Pharmacology" — complete with a DOI that resolved to a 404 page. And one of their analysts had already included it in a regulatory submission draft.
That was two years ago. Since then, I've built hallucination prevention into every system I touch, and I've gotten way more paranoid about it. This post is everything I've learned — not as a clean list of techniques, but as the messy, hard-won lessons from production systems that couldn't afford to be wrong.
Related: RAG vs fine-tuning for knowledge grounding, production scaling for reliability patterns, and responsible AI for building trustworthy systems.
Why LLMs Hallucinate (The 30-Second Version)
LLMs predict probable next tokens. They don't verify facts. They don't "know" things — they pattern-match from training data and generate the most statistically likely continuation.
When the model doesn't have strong enough patterns for the right answer, it generates the most plausible-sounding answer instead. Which is worse than a wrong answer — it's a convincing wrong answer.
Common hallucination types I've personally encountered:
- Fabricated citations — the pharma incident above
- Confident wrong numbers — "Revenue grew 23% YoY" when the actual number was 8%
- Entity confusion — mixing up two similar companies' financials
- Context ignoring — answering from training data when the provided context had the right answer
- Logical drift — starting a reasoning chain correctly, then gradually going off the rails
You can't eliminate hallucinations entirely. Current models hallucinate 3-5% of the time even in ideal conditions. But you can get from a 20-30% baseline error rate to 2-5% in production — which is the difference between "unusable" and "useful with appropriate caution."
Layer 1: Prompt Engineering (The Cheapest Fix)
Before you build anything complex, fix your prompts. This alone handles 40-60% of hallucination issues and costs nothing extra.
The core principle: be explicit about what the model should and shouldn't do. Especially the shouldn't.
Here's a prompt structure I use in every production system:
```python
def create_grounded_prompt(user_query, context_docs):
    return f"""You are a customer support specialist with access to order information.

CONTEXT (use ONLY this information to answer):
{context_docs}

RULES:
1. Answer using ONLY information from the CONTEXT above
2. If the context doesn't contain the answer, say "I don't have that information"
3. Never guess at dates, numbers, or order details
4. Cite which part of the context your answer comes from

WHAT NOT TO DO:
- Do not make up order numbers or tracking IDs
- Do not infer shipping dates that aren't explicitly stated
- Do not combine information from different orders

QUESTION: {user_query}

ANSWER (with citation):"""
```
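To make that concrete, here's the function called with a made-up order record (both the document and the order details are hypothetical):

```python
# Hypothetical usage; in production, context_docs comes from the retrieval layer (Layer 2)
order_record = "Order #18212: 2 items, shipped 2024-03-02 via UPS, tracking 1Z999AA10123456784"
prompt = create_grounded_prompt("When did order 18212 ship?", order_record)
# The model should answer with the shipped date and cite the order record,
# or say "I don't have that information" for anything outside the context.
```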
Two things that make the biggest difference:
The "ONLY" instruction. Without it, the model happily supplements context with training data. "Answer using ONLY the context provided" is the single most important sentence in any grounded prompt.
Explicit DO NOT rules. I maintain a running list of hallucination patterns I've seen for each system. Every time the model invents something new and creative that's wrong, I add a DO NOT rule. After a few weeks, the prompt is dialed in.
The model also needs permission to say "I don't know." This sounds obvious, but without explicit instruction, models almost never admit ignorance — they fill the gap with plausible-sounding fabrication instead.
Layer 2: RAG — Give the Model a Source of Truth
If prompt engineering is the cheapest fix, RAG (Retrieval-Augmented Generation) is the most impactful one. Instead of hoping the model remembers facts from training, you give it the relevant documents before each response.
For the full RAG vs fine-tuning analysis, see my decision guide.
```python
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone

class RAGSystem:
    def __init__(self, index_name, docs):
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Pinecone.from_documents(
            docs, self.embeddings, index_name=index_name
        )
        self.llm = OpenAI(temperature=0)  # temperature=0 reduces randomness

    def query_with_sources(self, question, k=4):
        retriever = self.vectorstore.as_retriever(search_kwargs={"k": k})
        qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # concatenate all retrieved docs into one prompt
            retriever=retriever,
            return_source_documents=True,
        )
        result = qa_chain({"query": question})
        return {
            "answer": result["result"],
            "sources": [doc.metadata for doc in result["source_documents"]],
            "confidence": self._score_confidence(result),  # scoring helper defined elsewhere
        }
```
RAG alone cuts hallucinations by 50-70%. But — and this is important — RAG doesn't solve everything. I've seen three ways RAG systems still hallucinate:
1. Bad retrieval. The vector search pulls the wrong documents. The model then answers based on irrelevant context, which technically isn't a hallucination but is still wrong. This is why retrieval quality is your ceiling (a quick way to measure it follows this list).
2. The model ignores the context. Even with "use ONLY this context" instructions, models sometimes drift into training data, especially for common topics where the training-data signal is strong.
3. Fabricated citations. The model claims its answer comes from "Document 3" when it actually doesn't. You need citation verification (see Layer 4).
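Since retrieval quality is the ceiling, measure it directly. Here's a minimal recall@k sketch, assuming a small hand-labeled evaluation set (question plus the IDs of the documents that should be retrieved) that you maintain yourself:

```python
def retrieval_recall_at_k(retriever, labeled_set, k=4):
    """labeled_set: list of {"question": str, "relevant_ids": list of doc IDs}."""
    hits = 0
    for case in labeled_set:
        docs = retriever.get_relevant_documents(case["question"])[:k]
        retrieved_ids = {d.metadata.get("id") for d in docs}
        # A hit means at least one truly relevant document made the top k
        if retrieved_ids & set(case["relevant_ids"]):
            hits += 1
    return hits / len(labeled_set)
```

Track this number over time. If recall drops, fix the index before touching prompts.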
For production, combine dense search (embeddings) with sparse search (BM25 keyword matching):
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

class HybridRAG(RAGSystem):
    def __init__(self, index_name, docs):
        super().__init__(index_name, docs)
        self.dense_retriever = self.vectorstore.as_retriever(search_kwargs={"k": 10})
        self.sparse_retriever = BM25Retriever.from_documents(docs)
        self.sparse_retriever.k = 10
        # 60% semantic, 40% keyword — this balance works well for most domains
        self.ensemble = EnsembleRetriever(
            retrievers=[self.dense_retriever, self.sparse_retriever],
            weights=[0.6, 0.4],
        )
```
Layer 3: Chain-of-Thought with Self-Verification
This is the technique that caught the pharma citation problem — after I implemented it.
The idea: ask the model to reason step by step, then ask it to check its own work. It sounds like asking a student to grade their own exam, but in practice it catches a surprising number of errors.
```python
def verified_answer(question, context):
    """Two-pass generate-then-verify. `llm` is any text-completion client."""
    # Step 1: Generate answer with reasoning
    reasoning_prompt = f"""Given this context, answer step by step.

CONTEXT: {context}

QUESTION: {question}

REASONING: [your step-by-step thought process]
ANSWER: [your answer]
EVIDENCE: [exact quotes from context that support your answer]"""
    initial = llm.generate(reasoning_prompt)

    # Step 2: Ask the model to verify itself
    verify_prompt = f"""Review this answer for accuracy.

CONTEXT: {context}

QUESTION: {question}

PROPOSED ANSWER: {initial}

Check:
1. Is every claim supported by the context?
2. Are there any facts that aren't in the context?
3. Is the reasoning logically sound?

If anything is unsupported, provide a corrected answer using ONLY verified information.

VERIFICATION:"""
    verified = llm.generate(verify_prompt)
    return verified
```
Why this works: The model is better at spotting errors in existing text than avoiding errors in generation. It's the same reason humans are better editors than writers — it's easier to evaluate than to create.
When it doesn't work: When the hallucination is subtle enough that the model can't detect it even in review mode. Factual errors about obscure topics slip through because the model doesn't have strong enough signal to flag them. That's why this is Layer 3, not Layer 1 — you need RAG underneath to provide the factual ground truth.
Cost: doubles your API calls. Worth it for high-stakes applications (medical, legal, financial). Overkill for a chatbot that recommends restaurants.
Layer 4: Output Validation (The Safety Net)
This is where you stop trusting the model and start checking its work programmatically. Every production system I build has at least basic output validation.
```python
# Pydantic v1 style (v2 renames validator to field_validator and min_items to min_length)
from pydantic import BaseModel, Field, validator

class ValidatedResponse(BaseModel):
    answer: str = Field(..., min_length=10, max_length=500)
    confidence: float = Field(..., ge=0.0, le=1.0)
    sources: list[str] = Field(..., min_items=1)  # at least one source required

    @validator('answer')
    def no_hedging_language(cls, v):
        """If the model is hedging, it probably doesn't know."""
        red_flags = ['I believe', 'probably', 'might be', 'as far as I know']
        for flag in red_flags:
            if flag.lower() in v.lower():
                raise ValueError(f"Uncertain language detected: {flag}")
        return v
```
But schema validation only catches formatting issues. For factual consistency, you need something smarter:
```python
from scipy.special import softmax
from sentence_transformers import CrossEncoder

class ConsistencyChecker:
    def __init__(self):
        self.nli_model = CrossEncoder('cross-encoder/nli-deberta-v3-base')

    def _extract_claims(self, answer):
        # Naive sentence split; a real system would use a proper claim extractor
        return [s.strip() for s in answer.split('.') if s.strip()]

    def _entailment_prob(self, premise, hypothesis):
        # This model outputs logits for [contradiction, entailment, neutral]
        logits = self.nli_model.predict([(premise, hypothesis)])[0]
        return softmax(logits)[1]

    def validate_answer(self, answer, source_documents):
        """Check if every claim in the answer is supported by sources."""
        claims = self._extract_claims(answer)
        results = []
        for claim in claims:
            supported = any(
                self._entailment_prob(doc.page_content, claim) > 0.7
                for doc in source_documents
            )
            results.append({"claim": claim, "supported": supported})
        unsupported = [r["claim"] for r in results if not r["supported"]]
        return {
            "is_valid": len(unsupported) == 0,
            "unsupported_claims": unsupported,
            "support_ratio": sum(r["supported"] for r in results) / max(len(results), 1),
        }
```
This is the layer that would have caught the pharma citation. The NLI model checks: "Is this claim actually entailed by the source documents?" If the model says "Zhang et al. found X" but no source document mentions Zhang, the validation fails.
Impact: catches 70-85% of remaining hallucinations after RAG and prompt engineering. It's the most important layer for high-stakes applications.
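Wiring the checker into the response path takes a few lines. A sketch, reusing the Layer 2 `RAGSystem` (the refusal message and the re-retrieval step are my own choices, not a fixed pattern):

```python
checker = ConsistencyChecker()

def safe_answer(rag, question, k=4):
    # Re-fetch the raw documents so the NLI check sees full page_content
    docs = rag.vectorstore.as_retriever(search_kwargs={"k": k}).get_relevant_documents(question)
    result = rag.query_with_sources(question, k=k)
    check = checker.validate_answer(result["answer"], docs)
    if not check["is_valid"]:
        # Refuse rather than ship unsupported claims to the user
        return {"answer": "I can't verify that answer against my sources.",
                "unsupported_claims": check["unsupported_claims"]}
    return result
```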
Layer 5: Confidence Scoring — Know When You Don't Know
Not all model outputs are equally trustworthy. Some answers come from strong retrieval matches with clear evidence. Others are the model shooting in the dark.
The simplest useful approach: generate multiple responses and measure agreement.
```python
from collections import Counter
from difflib import SequenceMatcher

import numpy as np

def calculate_pairwise_similarity(responses):
    # Naive lexical similarity; use embedding cosine similarity in production
    return [SequenceMatcher(None, a, b).ratio()
            for i, a in enumerate(responses)
            for b in responses[i + 1:]]

def ensemble_confidence(prompt, n_samples=5):
    """If 5 attempts give the same answer, it's probably right.
    If they all disagree, the model is guessing."""
    # `model` is any LLM client exposing generate(prompt, temperature)
    responses = [model.generate(prompt, temperature=0.7) for _ in range(n_samples)]
    agreement = np.mean(calculate_pairwise_similarity(responses))
    if agreement > 0.85:
        # Exact-match vote; cluster near-duplicate answers in production
        return {"answer": Counter(responses).most_common(1)[0][0], "confidence": "HIGH"}
    else:
        return {
            "answer": "I'm not confident enough to give a reliable answer.",
            "confidence": "LOW"
        }
```
5x the cost? Yes. But for a medical or financial system where one wrong answer has real consequences, it's cheap insurance. For a content recommendation engine, skip it.
I use confidence scoring as a routing mechanism: high-confidence answers go to the user directly. Low-confidence answers get flagged for human review. This way, 80-90% of queries are handled automatically while the risky 10-20% get human oversight.
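A minimal routing sketch on top of `ensemble_confidence`, with `review_queue` standing in for whatever human-review channel you use:

```python
def route(prompt, review_queue):
    result = ensemble_confidence(prompt)
    if result["confidence"] == "HIGH":
        return result["answer"]  # safe to send directly
    # Low confidence: park the draft for a human instead of guessing
    review_queue.append({"prompt": prompt, "draft": result["answer"]})
    return "I've flagged this question for a specialist to review."
```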
Layer 6: The Human-in-the-Loop Feedback System
Every hallucination your system produces is training data for preventing future hallucinations — if you capture it.
```python
from datetime import datetime

class FeedbackLoop:
    def __init__(self, db):
        self.db = db  # e.g., a MongoDB database handle

    def log_interaction(self, question, answer, sources, user_feedback=None):
        interaction = {
            "timestamp": datetime.utcnow(),
            "question": question,
            "answer": answer,
            "sources": sources,
            "user_feedback": user_feedback,
            "needs_review": user_feedback in ["hallucination", "factual_error"],
        }
        self.db.interactions.insert_one(interaction)
        if interaction["needs_review"]:
            self._flag_for_expert_review(interaction)  # review routing defined elsewhere
```
The key insight: sample based on confidence. Don't randomly review 5% of interactions — review the ones the system was least confident about. Low-confidence responses have 5-10x higher hallucination rates.
Over time, the corrected examples become your test suite and potentially your fine-tuning dataset. The system gets better at exactly the types of questions it previously struggled with.
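A sketch of that confidence-based sampling over the interaction log; it assumes each record also stores the Layer 5 confidence score (a field the `FeedbackLoop` above doesn't log yet):

```python
def select_for_review(db, budget=50):
    # Review the least-confident unreviewed interactions first, not a random sample
    candidates = list(db.interactions.find({"user_feedback": None}))
    candidates.sort(key=lambda r: r.get("confidence", 0.0))
    return candidates[:budget]
```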
Layer 7: Automated Testing (The Regression Prevention)
After the pharma incident, I built a hallucination test suite. Every time we find a new type of hallucination, it becomes a test case. The suite now has 400+ test cases across domains.
```python
class HallucinationTestSuite:
    def __init__(self, test_cases):
        self.test_cases = test_cases  # one entry per hallucination seen in the wild

    def test_factual_accuracy(self, llm_system):
        results = []
        for case in self.test_cases:
            response = llm_system.query(case["question"], case["context"])
            # Detector defined elsewhere (e.g., the NLI check from Layer 4)
            has_hallucination = self._detect_hallucination(
                response["answer"],
                case["context"],
                case.get("known_hallucinations", []),
            )
            results.append({
                "test_case": case["question"],
                "passed": not has_hallucination,
                "response": response["answer"],
            })
        hallucination_rate = sum(not r["passed"] for r in results) / len(results)
        assert hallucination_rate < 0.05, f"Rate {hallucination_rate:.1%} exceeds 5% threshold"
```
Run this on every deployment. Run it when you change prompts. Run it when you update the vector database. Hallucination patterns are sneaky — a change that improves one area can introduce regressions in another.
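One way to wire the suite into CI is a thin pytest wrapper; `load_test_cases` and `build_system` here are hypothetical stand-ins for your own loading and setup code:

```python
# test_hallucinations.py, collected by pytest on every deploy
import pytest

@pytest.fixture(scope="session")
def suite():
    return HallucinationTestSuite(load_test_cases("tests/hallucination_cases.json"))

def test_no_regressions(suite):
    suite.test_factual_accuracy(build_system())  # asserts the <5% threshold internally
```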
Layer 8: External Verification (The Nuclear Option)
For the highest-stakes applications, verify claims against external knowledge bases in real-time.
```python
class FactVerifier:
    def verify_claim(self, claim: str) -> dict:
        entities = self._extract_entities(claim)  # NER over the claim text
        for entity in entities:
            wiki_data = self._query_wikipedia(entity)  # external lookup; adds latency
            if wiki_data:
                is_consistent = self._check_consistency(claim, wiki_data)
                if not is_consistent:
                    return {"verified": False, "conflicting_source": wiki_data["url"]}
        return {"verified": True}
```
I call this the nuclear option because it adds significant latency (external API calls) and complexity. Use it for medical, legal, or financial applications. For most other use cases, Layers 1-5 are sufficient.
The Layered Defense: Putting It All Together
Here's how I stack these layers in practice:
| Layer | Technique | Hallucination Reduction | Cost | When to Use |
|---|---|---|---|---|
| 1 | Structured prompts + DO NOT rules | 40-60% baseline | Free | Always |
| 2 | RAG with hybrid retrieval | +50-70% on remaining | Moderate | Always for factual tasks |
| 3 | Chain-of-thought + self-verification | +15-25% on remaining | 2x API cost | High-stakes answers |
| 4 | Output validation + NLI checking | Catches 70-85% of what slips through | Low compute | Always in production |
| 5 | Confidence scoring + routing | Routes uncertain answers to humans | 5x for ensemble | When mistakes are costly |
| 6 | Human feedback loop | 10-20% improvement over time | Human time | Always |
| 7 | Automated test suite | Prevents regressions | CI/CD time | Always |
| 8 | External fact verification | 80-90% for verifiable claims | High latency | Medical, legal, financial |
For a typical production system (customer support, content generation): Layers 1, 2, 4, 6, 7. Gets you to 90-95% reliability.
For high-stakes applications (medical, legal, financial): All 8 layers. Gets you to 97-99% reliability with human oversight for the rest.
What I Actually Measure in Production
Forget complicated metric frameworks. Track these four numbers weekly:
- Hallucination rate — percentage of responses with factual errors (target: <5%)
- "I don't know" rate — how often the system admits uncertainty (healthy range: 8-15%. Too low means it's bullshitting. Too high means retrieval is broken.)
- User correction rate — how often users flag wrong answers (target: <3%)
- Confidence calibration — do high-confidence answers actually have higher accuracy? (If not, your confidence scoring is useless.)
If hallucination rate starts climbing, check retrieval quality first (the most common cause), then prompt drift, then data freshness.
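Here's a sketch of computing the first three from the Layer 6 interaction log (field names follow the `FeedbackLoop` schema above; the "I don't know" match string is whatever refusal phrasing your prompts use):

```python
def weekly_metrics(db, since):
    rows = list(db.interactions.find({"timestamp": {"$gte": since}}))
    n = max(len(rows), 1)
    hallucinations = [r for r in rows
                      if r.get("user_feedback") in ("hallucination", "factual_error")]
    idk = [r for r in rows if "don't have that information" in r["answer"]]
    corrected = [r for r in rows if r.get("user_feedback")]
    return {
        "hallucination_rate": len(hallucinations) / n,  # target: < 5%
        "idk_rate": len(idk) / n,                       # healthy: 8-15%
        "user_correction_rate": len(corrected) / n,     # target: < 3%
    }
```

Calibration needs the per-response confidence score logged as well: bucket responses by confidence and check that accuracy actually rises with confidence.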
The Honest Truth
After implementing all of this across dozens of systems, here's what I've come to accept:
You will not eliminate hallucinations. Current LLM technology hallucinates. Period. The question is whether you've built enough layers of defense that the remaining hallucinations are caught before they cause damage.
"I don't know" is your most valuable output. A system that confidently says wrong things is dangerous. A system that says "I'm not sure — let me flag this for a human" is trustworthy. Trustworthy beats impressive every time.
The best hallucination prevention is good retrieval. If your RAG pipeline consistently finds the right documents, the model rarely hallucinates. If retrieval is bad, no amount of prompt engineering will save you. Invest in retrieval quality before anything else.
Monitor, don't just test. Your system will encounter questions in production that you never thought to test for. Continuous monitoring catches the hallucinations that your test suite misses.
The pharma client? They're still using the system — with all 8 layers in place. The citation fabrication hasn't happened again. Last time I checked, their hallucination rate was 1.8%. Not zero. But low enough that the system delivers far more value than risk.
That's the goal. Not perfection — reliability.
Building an LLM system that can't afford to hallucinate? I've hardened systems for pharma, finance, and legal. Let's talk about what reliability looks like for your use case.
