Show HN: HalluMix – A Benchmark for Real-World LLM Hallucination Detection
We built HalluMix to evaluate how well hallucination detectors perform in the kinds of messy, high-stakes environments where LLMs are actually deployed: long-form outputs, multi-document contexts, and domain-specific content from law, medicine, science, and news.
Most existing benchmarks focus on synthetic or short-form QA data, which didn't match what we were seeing in production, so we built our own benchmark to test our hallucination detectors and decided to open-source it.
The dataset includes 6,500 examples drawn from QA, summarization, and natural language inference (NLI) tasks. We added distractor documents, shuffled the context, and removed assumptions about format (like requiring a question) to better reflect real-world conditions.
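If you want to poke at the data yourself, here's a minimal sketch of loading it with the Hugging Face datasets library. The split name is an assumption on my part; check the dataset card for the actual splits and columns:

    # Minimal sketch: pull HalluMix from the Hugging Face Hub.
    # Assumption: a "train" split exists; see the dataset card for the real schema.
    from datasets import load_dataset

    ds = load_dataset("quotient-ai/hallumix", split="train")
    print(ds)      # features and row count
    print(ds[0])   # inspect one record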
We ran 7 detection systems on it, both open-source models and commercial APIs. While some performed well on shorter examples, even the best struggled with long-form content and multi-document grounding — precisely where hallucinations tend to be most harmful.
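Scoring your own detector against the benchmark comes down to binary classification over the hallucination labels. A rough sketch below; the column names ("documents", "response", "is_hallucination") and my_detector are placeholders I made up for illustration, not the dataset's actual schema:

    # Rough sketch of scoring a detector on HalluMix.
    # Column names and my_detector() are placeholders; adapt to the real schema.
    from datasets import load_dataset

    ds = load_dataset("quotient-ai/hallumix", split="train")

    def my_detector(documents: str, response: str) -> int:
        # Dummy baseline: always predict "faithful" (0). Swap in a real detector here.
        return 0

    labels = [row["is_hallucination"] for row in ds]
    preds = [my_detector(row["documents"], row["response"]) for row in ds]
    accuracy = sum(int(p == t) for p, t in zip(preds, labels)) / len(labels)
    print(f"accuracy: {accuracy:.3f}")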
Would love feedback, especially from anyone working on evals, hallucination detection, or RAG.
Links:
– HF Dataset: https://huggingface.co/datasets/quotient-ai/hallumix
– HF Blog: https://huggingface.co/blog/quotientai/hallumix
– Internal Blog: https://blog.quotientai.co/introducing-hallumix-a-task-agnos…
– Paper: https://arxiv.org/abs/2505.00506