
Hallucination detection tools benchmarked in our study.
This benchmarking study was conducted within the Genie R&D group at Emumba. Genie (GENaI@Emumba) is Emumba’s dedicated Generative AI research and prototyping team, working across recent advances in large language models and applied GenAI domains — including retrieval-augmented generation (RAG), multi-agent applications, and LLM fine-tuning. Our goal is to validate practical methods that help build safer, more reliable, and production-ready AI systems for enterprise use.
Hallucinated answers — confident but incorrect outputs — remain a leading cause of user distrust in AI-powered RAG systems and a critical failure point in real-world applications.
To address this, we systematically evaluated six hallucination detection techniques: built-in metrics from three LLM evaluation frameworks (UpTrain, Ragas, and DeepEval); Vectara HHEM, an open-source LLM-based classifier; the LettuceDetect library; and Trustworthy Language Model (TLM), a SaaS solution for hallucination detection. Except for TLM, all methods are open source and free to use, with the only cost being the underlying LLM API calls where applicable.
Our goal was to identify which frameworks are most effective for practical use cases, using criteria that go beyond simple accuracy:
Granularity: Can the method pinpoint exactly which parts of an answer are hallucinated?
Precision: Does it avoid false positives (flagging correct content) and false negatives (missing hallucinations)?
We conducted a three-phase experiment:
Benchmarking on a labeled, hallucination-rich dataset
Stress-testing top performers with adversarial examples
Evaluating the top-performing method (UpTrain) for fine-grained detection
This post shares our methodology, findings, and practical recommendations for choosing the right hallucination detection tool.
TL;DR: We tested six hallucination detectors for RAG. UpTrain stood out with pinpoint accuracy and clear explanations. Read on to see which tools failed, which ones surprised us, and how to pick the best fit for safer AI.
We evaluated six hallucination detection tools, each using a distinct method to assess whether an LLM-generated answer is grounded in its supporting context. The table below summarizes the core technique behind each tool, the required inputs, and the format of its output.
To evaluate how well each framework detects hallucinated outputs, we curated a small but high-quality benchmark dataset deliberately designed to include a wide variety of hallucination types. Starting from a base dataset of question and answer pairs over our internal Wiki data, we injected synthetic hallucinations generated by an LLM, and refined them manually.
Our goal was to test detection performance across diverse hallucination patterns often encountered in RAG pipelines, including:
Context overrides (ignoring retrieved documents),
Fabricated details,
Misinterpretations of source content,
Over-generalizations,
Contradictions,
Conflicts across sources, and
Incorrect selection from context.
The hallucinated answers included a balanced mix of the types listed above. Each generated hallucination was manually reviewed and labeled to ensure clean ground-truth annotations. We maintained a high hallucination ratio (~80%) to test the detection frameworks under challenging conditions. The remaining examples were factual, grounded answers for contrast.
To simulate a realistic retrieval setting, we built a standard RAG pipeline using LangChain over the document set (internal Wiki documents) used for creating the benchmark dataset.
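Our actual pipeline was built with LangChain over a vector store; the dependency-free sketch below only illustrates the shape of that flow (retrieve top-k context, then assemble a grounded prompt). The keyword-overlap retriever is a stand-in for the real embedding-based retriever, and the prompt wording is illustrative, not the one we used.

```python
def retrieve(query, docs, k=2):
    """Rank documents by naive word overlap with the query.
    (Stand-in for the embedding-based retriever in the real pipeline.)"""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Assemble the context-grounded prompt sent to the answering LLM."""
    context = "\n\n".join(retrieve(query, docs))
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

In the real pipeline, the retrieved context is also what the hallucination detectors receive as the grounding reference, so the retriever and the detector always see the same documents.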
Some frameworks return a continuous score instead of a binary label, so in production this score must be thresholded to decide whether to flag a response. We experimented with both F1 score maximization and Youden’s J statistic on the ROC curve to determine optimal thresholds. While not the focus of this study, these calibration methods can significantly affect precision and recall.
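The two calibration strategies can be sketched with a small helper that sweeps every observed score as a candidate threshold and records the one maximizing F1 and the one maximizing Youden's J (TPR minus FPR). This is a minimal illustration, not our exact calibration code; it assumes higher scores indicate hallucination, so tools that emit factuality scores (higher = better) would need their scores inverted first.

```python
def best_thresholds(scores, labels):
    """Return (f1_threshold, youden_threshold) for binary labels,
    where label 1 = hallucinated and higher score = more suspicious."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_f1, f1_thr = -1.0, None
    best_j, j_thr = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / pos if pos else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        j = recall - (fp / neg if neg else 0.0)  # Youden's J = TPR - FPR
        if f1 > best_f1:
            best_f1, f1_thr = f1, t
        if j > best_j:
            best_j, j_thr = j, t
    return f1_thr, j_thr
```

The two criteria can disagree: F1 maximization weights precision and recall on the hallucinated class, while Youden's J balances the true-positive rate against the false-positive rate, which matters when the classes are as imbalanced as our ~80% hallucination ratio.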
Below, we report accuracy results based on the best-performing thresholds for each tool, sorted from most to least accurate.
These results provided an initial view of accuracy, but real-world robustness needed more rigorous testing — addressed in Phase 2.
In Phase 1, we evaluated the overall accuracy of different frameworks at detecting whether an answer contained hallucinated content. But benchmark accuracy alone isn’t enough — real-world RAG systems often produce answers that blend fact with subtle hallucinations. Phase 2 tested whether top frameworks can still detect hallucinations embedded within otherwise factual responses.
This experiment further tested the three top-performing frameworks from Phase 1 — UpTrain, Vectara HHEM, and Ragas — on a larger dataset of 181 queries, each containing adversarially injected hallucinations. The goal was to see whether these systems would still flag the responses correctly, even when the hallucinations were subtle and contextually embedded.
Note that we evaluated only overall detection robustness, not granularity: frameworks needed only to detect that a hallucination was present, not locate it precisely, because Vectara HHEM and Ragas do not provide a statement- or span-level breakdown.
For both Phase 2 and Phase 3, we used a new dataset derived from RAGBench, a public benchmark available on Hugging Face. We sampled 181 question–answer–context triples, all originally generated by GPT-4 and grounded in supporting documents.
To construct a clean factual baseline, we passed each answer through UpTrain and retained only those for which it returned a perfect factuality score of 1.0, manually spot-checking to ensure reliability of the ground truth data. This ensured we began with fully grounded answers with no hallucinations.
From this verified subset, we created a new test set by injecting hallucinations into selected factual statements using controlled LLM prompts. This produced answers that contained a mix of grounded and hallucinated statements, simulating real-world LLM behavior in RAG pipelines where hallucinations often appear subtly within otherwise factual text.
Each hallucination-augmented sample was passed through the three frameworks. Since all responses now contained some hallucinated content, we expected that:
None of the frameworks should return a perfect score (i.e. indicating zero hallucination)
Any perfect score would indicate a missed hallucination
This allowed us to evaluate real-world robustness without relying on threshold tuning.
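The check itself reduces to a few lines: since every sample is known to contain an injected hallucination, any sample scored as perfectly factual is a miss. A sketch (the tolerance parameter is an assumption to guard against floating-point output):

```python
def missed_hallucinations(scores, perfect=1.0, tol=1e-9):
    """Every sample in this phase contains an injected hallucination,
    so a perfect factuality score means the detector missed it.
    Returns the indices of missed samples and the overall miss rate."""
    missed = [i for i, s in enumerate(scores) if abs(s - perfect) <= tol]
    return missed, len(missed) / len(scores)
```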
As the table below shows, UpTrain missed 10.5% of the injected hallucinations, while Ragas missed only 5.5%.
While Vectara HHEM appears to have caught all hallucinations, this result is misleading. In our tests, Vectara HHEM never returns a perfect 1.0 score — even for fully factual answers. As a result, if we treat only a perfect score as proof of factuality, every response would be flagged as partially hallucinated by default. This means proper threshold calibration is essential for meaningful results with Vectara HHEM, and its raw output cannot be fairly compared in this phase without that step.
While Phase 2 tested whether frameworks can detect that hallucination occurred, it still treated the answer as a whole. But in actual RAG applications, that’s not good enough. For safe and meaningful use, we need frameworks that not only raise a flag, but can also tell us exactly which parts of an answer are unsupported.
UpTrain is the only framework in our study that outputs a statement-level breakdown. In Phase 3, we evaluated how reliable this breakdown actually is.
Does UpTrain correctly flag the hallucinated statements?
Equally important, does it avoid falsely flagging factual statements?
We reused the hallucination-augmented dataset from Phase 2, but this time focused on per-statement performance. Specifically, we measured:
True Detections: Hallucinated statements correctly identified
False Positives: Grounded statements incorrectly flagged
This deeper evaluation allowed us to assess how usable UpTrain’s output would be in a pipeline that aims to remove or revise only the hallucinated parts, rather than discard the whole answer.
True detections: UpTrain correctly flagged 232 out of 303 hallucinated statements, achieving a 76.6% detection rate.
False positives: UpTrain correctly identified 278 out of 322 factual statements, mislabeling the remaining 44 (13.7%) as hallucinations.
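These headline figures follow directly from the raw counts, and the same arithmetic yields a precision figure not quoted above; the helper below is just that bookkeeping made explicit.

```python
def statement_metrics(tp, total_hallucinated, tn, total_factual):
    """Derive per-statement metrics from raw counts:
    tp = hallucinated statements correctly flagged,
    tn = factual statements correctly left unflagged."""
    fp = total_factual - tn          # factual statements wrongly flagged
    return {
        "detection_rate": tp / total_hallucinated,   # recall on hallucinations
        "false_positive_rate": fp / total_factual,
        "precision": tp / (tp + fp),                 # of flags, how many were real
    }

# Counts reported in this phase: 232/303 hallucinated caught, 278/322 factual kept.
m = statement_metrics(tp=232, total_hallucinated=303, tn=278, total_factual=322)
```

The implied precision (about 84%) matters in practice: in a pipeline that automatically prunes flagged statements, it bounds how much valid content would be discarded along with the hallucinations.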
While UpTrain’s fine-grained detection is not flawless, its ability to pinpoint hallucinated statements makes it uniquely valuable for production RAG pipelines. Unlike frameworks that only flag the entire answer, UpTrain’s span-level verdicts enable targeted corrections, content pruning, or automated re-queries, all while preserving valid information. As the only open-source tool in our study that provides this granularity, UpTrain stands out as the most practical choice for teams that want to move beyond binary hallucination checks and towards real-time factuality refinement. With further prompt tuning, dataset-specific examples, and potential integration with custom retrieval signals, its accuracy can be further improved, unlocking safer, more trustworthy LLM outputs in high-stakes deployments.
In summary, hallucination detection is essential for any robust RAG deployment — yet not all frameworks are equal in practicality or depth of insights. Our multi-phase evaluation shows that while general-purpose tools like Ragas and Vectara HHEM offer strong baseline detection with minimal integration effort, UpTrain delivers the best combination of high accuracy and actionable granularity. For organizations building production-grade, trust-sensitive AI systems, investing in a fine-grained, tunable detection workflow like UpTrain can dramatically reduce the risk of misleading outputs and build user confidence in LLM-powered products.
We hope this benchmarking and fine-grained testing effort helps teams make informed choices to safeguard their RAG pipelines against the persistent risk of hallucinations.
The code for these experiments is available on GitHub.