🧠 Turn AI Hallucinations Into Reliable Evidence

Can you turn AI hallucinations into reliable evidence? Spot the three types of hallucinations, and apply proven strategies to get more reliable outputs from LLMs in your research.

Kia ora, Namaskaram 🙏🏾

Do you trust your favourite AI chatbot?

If you're using ChatGPT, Claude, Copilot, or any other LLM for research, you're bound to encounter "hallucinations".

These are answers that may sound right but are often made up, taken out of context, or just confident-sounding waffle.

📚 Evidence on Hallucinations

To reduce hallucinations, it helps to know which type you're dealing with, so you can prompt effectively and get more reliable evidence.

Type I – Factual Inaccuracies (Sounds true, but wrong facts)

šŸ‘‰šŸ¾ Use RAG i.e. Retrieval-Augmented Generation
Cross-check outputs with trusted sourcesā€”like giving your AI assistant access to a fixed but reliable research library.

📈 Effectiveness: Very High
📝 Why: The evidence highlights RAG as one of the most potent strategies. It grounds LLM outputs in external, trustworthy information (e.g., knowledge graphs, databases, APIs). This significantly reduces reliance on memorised or fabricated content.

Type II – Semantic Distortions (Misunderstands the context)

šŸ‘‰šŸ¾ Use Prompt Engineering
Be clear and specificā€”like you're briefing a junior colleague. The more context you give, the better the AI understands what and how you're asking it to research.

📈 Effectiveness: Medium to High
📝 Why: The evidence notes that well-structured, context-rich prompts (including techniques like in-context learning and instruction tuning) can improve the relevance and precision of outputs. It's especially helpful for improving how LLMs interpret meaning, but it depends heavily on user skill.

Type III – Fluency Discrepancies (Overconfident output)

šŸ‘‰šŸ¾ Use Chain-of-Thought (CoT) Prompting
Ask AI to think and write down step by stepā€”it makes the logic visible and easier to understand how it arrived at the final output.

📈 Effectiveness: High
📝 Why: Chain-of-Thought prompting encourages step-by-step reasoning, which the evidence shows improves factual consistency and logical coherence. It reduces the risk of "fluent nonsense" by making the AI reveal its internal logic.

Hallucinations have been reduced, but they haven't disappeared.

Even with the best prompts and the latest model updates, they're likely here to stay.

Because that's just how the models work.


"LLMs are 'good at things that don't have wrong answers' but 'very bad at precise information retrieval'."

@benedictevans on X (formerly Twitter)

The real question is…
When should we use LLMs for more reliable research?

✅ When to Use LLMs for Research

  • Idea Generation: Great for brainstorming and exploring fresh angles.

  • Summarising: Use AI to distill academic papers or research reports into concise, digestible summaries.

  • Writing: Avoid the blank page by asking AI for a rough outline or for feedback on your research reports.

āŒ When Not to Use LLMs for Research

  • Facts: Don't rely on AI for precise facts; check claims against the source material.

  • High-Stakes Decisions: Use human judgment for decisions that impact people, policy, or planet.

  • Inferences: LLMs can make assumptions; validate insights through multiple sources.

Want my full ChatGPT Playbook for Behaviour Change, taught live? Join my upcoming 2-week course.

Written by Vishal George, Chief Behavioural Scientist at Behavioural by Design.

P.S. Newsletter readers can use the code EVIDENCE to get $50 off the course (offer expires in 48 hours).