When your AI makes things up
Language models occasionally fabricate facts with complete confidence. That's manageable in experimentation. In production workflows handling contracts, invoices, or internal knowledge bases, it's a real liability.
Your AI doesn't know it's wrong
That's what makes it dangerous in production.
A client's contract summarization workflow ran for three months without anyone reviewing the outputs. The model extracted parties, terms, notice periods. Looked right. Ran daily. Nobody checked.
Then a contract auto-renewed. The summary said 30 days' notice. The contract said 90.
One wrong output. Three months undetected. Another year stuck in a vendor relationship they were trying to exit.
What actually happens
Language models don't retrieve facts. They predict the next most plausible token, based on training and whatever context you hand them. When the information you're asking for isn't cleanly present, the model fills in something that sounds correct, with the same confidence as when it's actually right.
That's not a bug waiting to be patched. That's how the technology works.
The error rate is usually low enough that nobody notices in daily use. Which is exactly what makes it dangerous: the workflow looks fine, outputs look plausible, and the edge case reveals itself three months later when the consequences are already in play.
Where it actually hurts
Contract analysis is the obvious one. Notice periods, liability clauses, payment terms. Using a language model to extract these without a verification step means running a process that will occasionally be wrong in ways that cost money. The error rate isn't high enough to catch every day. That's the problem.
Financial data carries similar risk. Invoice amounts, IBANs, tax IDs. A small extraction error creates a failed transfer or a reconciliation problem that takes longer to investigate than the original transaction was worth.
The less obvious area: internal knowledge bases. When an AI assistant answers questions by citing internal documents, it makes implicit factual claims. "According to our procedures, the correct approach is..." If that's wrong, the company owns the consequences. Not the model.
What actually helps
The simplest fix is identifying which output fields actually matter and building review steps only for those. Not everything needs a human check; that defeats the point. But the fields where a wrong value has real consequences need some form of verification.
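The routing logic can be small. Here's a minimal sketch, where the field names, the critical-field set, and the optional validator callback are all illustrative assumptions, not a prescribed schema:

```python
# Hypothetical field names: only these three carry real consequences here.
CRITICAL_FIELDS = {"notice_period", "iban", "amount"}

def needs_review(field: str, value, validator=None) -> bool:
    """Route a field to human review only when it matters.

    Non-critical fields pass through. A critical field passes only
    when an automatic validator exists and accepts the value;
    otherwise it lands in the review queue.
    """
    if field not in CRITICAL_FIELDS:
        return False
    return validator is None or not validator(value)
```

So a description field never stops the pipeline, while a notice period with no automatic check always gets a human look.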
Confidence prompting is underused. If you instruct the model to respond with "Unable to determine" when it isn't confident, uncertain cases surface instead of quietly passing through. Simple instruction, big difference.
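The pattern looks something like this; a sketch where the prompt wording and the sentinel string are assumptions you'd tune, and the model call itself is left out:

```python
# Confidence prompting: give the model an explicit escape hatch,
# then route on it. Prompt text and sentinel are illustrative.
EXTRACTION_PROMPT = """Extract the notice period from the contract below.
If the notice period is not explicitly stated, respond with exactly:
Unable to determine

Contract:
{contract_text}
"""

SENTINEL = "Unable to determine"

def route(model_output: str) -> str:
    """Uncertain extractions go to a review queue, not downstream."""
    if model_output.strip() == SENTINEL:
        return "review_queue"
    return "downstream"
```

The point is that uncertainty becomes an explicit, machine-checkable value instead of a plausible-sounding guess.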
Most hallucinations are also formally detectable before they cause damage. An IBAN with the wrong check digit. A date that doesn't exist. An amount two orders of magnitude off from historical values. Run validation rules on outputs before they hit any downstream system.
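All three of those checks are a few lines each. A sketch using only standard-library tools; the ten-fold plausibility factor is an assumed threshold, not a rule:

```python
from datetime import date

def iban_is_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the
    end, map letters A-Z to 10-35, and the result must be 1 mod 97.
    Catches single-digit extraction errors in the check digits."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not s[:2].isalpha():
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

def date_exists(year: int, month: int, day: int) -> bool:
    """Reject calendar dates that don't exist, like February 30."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def amount_plausible(amount: float, history: list[float],
                     factor: float = 10) -> bool:
    """Flag amounts far outside the historical range (assumed 10x band)."""
    if not history:
        return True
    return min(history) / factor <= amount <= max(history) * factor
```

Each one runs in microseconds, and a single failed check is enough to hold an output back from the downstream system.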
Cross-checking against reference data handles the rest: supplier name from the invoice against the CRM, amount against the purchase order range. Differences flag for review. None of this is complicated; it's just the verification step that was missing.
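A cross-check of that kind can be as plain as this; the field names and data sources are illustrative assumptions, standing in for whatever your CRM and purchase-order system actually expose:

```python
def crosscheck(extracted: dict, crm_suppliers: set[str],
               po_range: tuple[float, float]) -> list[str]:
    """Compare extracted invoice fields against reference data.

    Returns a list of discrepancies; an empty list means nothing
    to flag. Field names ("supplier", "amount") are placeholders.
    """
    flags = []
    if extracted["supplier"] not in crm_suppliers:
        flags.append("supplier not found in CRM")
    lo, hi = po_range
    if not (lo <= extracted["amount"] <= hi):
        flags.append("amount outside purchase-order range")
    return flags
```

Anything the function returns goes into the review queue; an empty list lets the output proceed.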
When to skip the language model
For processes where every input is structured and predictable, a language model is often the wrong tool regardless of hallucination risk.
OCR plus a rule-based parser on a standard invoice format is deterministic, cheaper, and easier to debug. If clear rules cover 95% of your cases, start there. Use language models for the edges where rules break down: unstructured text, varying formats, context that needs interpretation.
The question isn't "where can we use AI?" It's "what's the most reliable tool for this specific problem?" Sometimes that's a parser. Sometimes a rule set. Sometimes an LLM with a review step wired in. Sometimes a person.
What this actually means
You own the outputs of systems you build. "The model said so" doesn't shift liability to the vendor.
Every process where a wrong AI output has real consequences needs a mechanism that catches it. The mechanism can be simple. A validation rule. A confidence check. A review queue. But it needs to exist.
Most AI projects don't fail because the model was bad. They fail because the surrounding process wasn't designed to handle the cases where it's wrong.
Want to know which of your processes have that gap? Our free Automation Check walks through that in 30 minutes.