Make hallucinations rare and explainable.
STACK Forensics catches fabricated outputs in real time, traces each one to its root cause, and gives your team a forensic record of every false claim your model made.
Hallucination is a category, not a single failure
Different hallucinations have different causes. We classify before we remediate.
Citation Fabrication
Verify that every cited URL, paper, or section actually exists and contains the claimed content.
Claim Grounding
Score every factual statement against retrieved context. Flag claims with no source.
Identity Confusion
Catch person/place/product confusion: wrong CEO, wrong year, wrong jurisdiction.
Arithmetic Errors
Re-evaluate numeric claims symbolically. Catch bad math before users see it.
Drift Patterns
Cluster hallucinations by topic, prompt template, and model version to find systemic issues fast.
Root Cause
Trace each hallucination to a retrieval miss, prompt ambiguity, model brittleness, or a training-data gap.
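To make one of these checks concrete, here is a minimal sketch of how a numeric claim might be re-checked. It is an illustration only, not the STACK Forensics implementation: it uses plain numeric re-evaluation as a simpler stand-in for symbolic evaluation, and it assumes an upstream step has already extracted the expression and the claimed result from the model output. The names `safe_eval` and `check_numeric_claim` are hypothetical.

```python
import ast
import math
import operator

# Binary operators the evaluator accepts; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def check_numeric_claim(expr: str, claimed: float, rel_tol: float = 1e-9) -> bool:
    """Return True when the claimed result matches the recomputed value."""
    return math.isclose(safe_eval(expr), claimed, rel_tol=rel_tol)
```

For example, a model output stating that 15% of 1200 is 190 would be checked with `check_numeric_claim("1200 * 0.15", 190.0)`, which returns `False` and could be flagged for review.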
Questions teams ask before deploying
Straightforward answers about scope, integration, data handling, and rollout.
How is this different from generic LLM evals?
Evals score samples; we monitor production. Detection runs on live traffic with sub-15s latency, and every event comes with a forensic root-cause analysis.
What's the false-positive rate?
2.1% on our public benchmark. We disclose calibration data per claim type: false positives are near zero for arithmetic, and identity confusion is the hardest category.
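If you want to reproduce a per-claim-type false-positive rate on your own labeled sample, a sketch along these lines may help. It assumes the conventional definition (false positives divided by all truly non-hallucinated claims) and a hypothetical record layout of `(claim_type, flagged, is_hallucination)`. Neither the layout nor the function name comes from the product.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Compute FP / (FP + TN) per claim type.

    records: iterable of (claim_type, flagged, is_hallucination) tuples,
    a hypothetical layout chosen for this illustration.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for claim_type, flagged, is_hallucination in records:
        if not is_hallucination:
            # Only genuinely correct claims can produce false positives.
            negatives[claim_type] += 1
            if flagged:
                false_positives[claim_type] += 1
    return {t: false_positives[t] / n for t, n in negatives.items()}
```

Claim types with no correct claims in the sample are omitted from the result rather than reported as zero, since the rate is undefined for them.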
Do you replace user feedback?
No. We complement thumbs-down feedback by catching the hallucinations users don't notice and giving QA teams a queue to review.
How do we feed findings back to improve the model?
Findings export to your eval set, fine-tuning corpus, or retrieval index as targeted negatives.
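As a rough picture of what a targeted-negatives export could look like, the sketch below serializes findings to JSON Lines, a format most eval and fine-tuning pipelines can ingest. The `Finding` fields, the `label` value, and the function name are assumptions made for illustration; the actual export schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Finding:
    # Hypothetical fields; not the product's actual export schema.
    prompt: str
    model_output: str
    claim_type: str
    root_cause: str


def to_negative_examples(findings):
    """Serialize findings as JSON Lines, one labeled negative per line."""
    lines = []
    for finding in findings:
        record = asdict(finding)
        record["label"] = "hallucination"  # marks the row as a negative example
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)
```

Each output line is a standalone JSON object, so the same file can be appended to an eval set or filtered by `claim_type` or `root_cause` before it goes into a fine-tuning corpus.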