When your AI makes things up
Language models occasionally fabricate facts with complete confidence. That's manageable in experimentation. In production workflows handling contracts, invoices, or internal knowledge bases, it's a real liability.
Your AI doesn't know it's wrong
That's what makes it dangerous in production.
A client's contract summarization workflow ran for three months without anyone reviewing the outputs. The model extracted parties, terms, notice periods. Looked right. Ran daily. Nobody checked.
Then a contract auto-renewed. The summary said 30 days' notice. The contract said 90.
One wrong output. Three months undetected. Another year stuck in a vendor relationship they were trying to exit.
What actually happens
Language models don't retrieve facts. They predict the next most plausible token, based on training and whatever context you hand them. When the information you're asking for isn't cleanly present, the model fills in something that sounds correct, with the same confidence as when it's actually right.
That's not a bug waiting to be patched. That's how the technology works.
The error rate is usually low enough that nobody notices in daily use. Which is exactly what makes it dangerous: the workflow looks fine, outputs look plausible, and the edge case reveals itself three months later when the consequences are already in play.
Where it actually hurts
Contract analysis is the obvious one. Notice periods, liability clauses, payment terms. Using a language model to extract these without a verification step means running a process that will occasionally be wrong in ways that cost money. The error rate isn't high enough to catch every day. That's the problem.
Financial data carries similar risk. Invoice amounts, IBANs, tax IDs. A small extraction error creates a failed transfer or a reconciliation problem that takes longer to investigate than the original transaction was worth.
The less obvious area: internal knowledge bases. When an AI assistant answers questions by citing internal documents, it makes implicit factual claims. "According to our procedures, the correct approach is..." If that's wrong, the company owns the consequences. Not the model.
What actually helps
The simplest fix is identifying which output fields actually matter and building review steps only for those. Not everything needs a human check; that defeats the point. But the fields where a wrong value has real consequences need some form of verification.
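The routing logic can be small. Here's a minimal sketch, where the field names, the critical-field set, and the optional validator callback are all illustrative assumptions, not a prescribed schema:

```python
# Hypothetical field names: only these three carry real consequences here.
CRITICAL_FIELDS = {"notice_period", "iban", "amount"}

def needs_review(field: str, value, validator=None) -> bool:
    """Route a field to human review only when it matters.

    Non-critical fields pass through. A critical field passes only
    when an automatic validator exists and accepts the value;
    otherwise it lands in the review queue.
    """
    if field not in CRITICAL_FIELDS:
        return False
    return validator is None or not validator(value)
```

So a description field never stops the pipeline, while a notice period with no automatic check always gets a human look.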
Confidence prompting is underused. If you instruct the model to respond with "Unable to determine" when it isn't confident, uncertain cases surface instead of quietly passing through. Simple instruction, big difference.
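The pattern looks something like this; a sketch where the prompt wording and the sentinel string are assumptions you'd tune, and the model call itself is left out:

```python
# Confidence prompting: give the model an explicit escape hatch,
# then route on it. Prompt text and sentinel are illustrative.
EXTRACTION_PROMPT = """Extract the notice period from the contract below.
If the notice period is not explicitly stated, respond with exactly:
Unable to determine

Contract:
{contract_text}
"""

SENTINEL = "Unable to determine"

def route(model_output: str) -> str:
    """Uncertain extractions go to a review queue, not downstream."""
    if model_output.strip() == SENTINEL:
        return "review_queue"
    return "downstream"
```

The point is that uncertainty becomes an explicit, machine-checkable value instead of a plausible-sounding guess.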
Most hallucinations are also formally detectable before they cause damage. An IBAN with the wrong check digit. A date that doesn't exist. An amount two orders of magnitude off from historical values. Run validation rules on outputs before they hit any downstream system.
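All three of those checks are a few lines each. A sketch using only standard-library tools; the ten-fold plausibility factor is an assumed threshold, not a rule:

```python
from datetime import date

def iban_is_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the
    end, map letters A-Z to 10-35, and the result must be 1 mod 97.
    Catches single-digit extraction errors in the check digits."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not s[:2].isalpha():
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

def date_exists(year: int, month: int, day: int) -> bool:
    """Reject calendar dates that don't exist, like February 30."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def amount_plausible(amount: float, history: list[float],
                     factor: float = 10) -> bool:
    """Flag amounts far outside the historical range (assumed 10x band)."""
    if not history:
        return True
    return min(history) / factor <= amount <= max(history) * factor
```

Each one runs in microseconds, and a single failed check is enough to hold an output back from the downstream system.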
Cross-checking against reference data handles the rest: supplier name from the invoice against the CRM, amount against the purchase order range. Differences flag for review. None of this is complicated; it's just the verification step that was missing.
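A cross-check of that kind can be as plain as this; the field names and data sources are illustrative assumptions, standing in for whatever your CRM and purchase-order system actually expose:

```python
def crosscheck(extracted: dict, crm_suppliers: set[str],
               po_range: tuple[float, float]) -> list[str]:
    """Compare extracted invoice fields against reference data.

    Returns a list of discrepancies; an empty list means nothing
    to flag. Field names ("supplier", "amount") are placeholders.
    """
    flags = []
    if extracted["supplier"] not in crm_suppliers:
        flags.append("supplier not found in CRM")
    lo, hi = po_range
    if not (lo <= extracted["amount"] <= hi):
        flags.append("amount outside purchase-order range")
    return flags
```

Anything the function returns goes into the review queue; an empty list lets the output proceed.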
When to skip the language model
For processes where every input is structured and predictable, a language model is often the wrong tool regardless of hallucination risk.
OCR plus a rule-based parser on a standard invoice format is deterministic, cheaper, and easier to debug. If clear rules cover 95% of your cases, start there. Use language models for the edges where rules break down: unstructured text, varying formats, context that needs interpretation.
The question isn't "where can we use AI?" It's "what's the most reliable tool for this specific problem?" Sometimes that's a parser. Sometimes a rule set. Sometimes an LLM with a review step wired in. Sometimes a person.
What this actually means
You own the outputs of systems you build. "The model said so" doesn't shift liability to the vendor.
Every process where a wrong AI output has real consequences needs a mechanism that catches it. The mechanism can be simple. A validation rule. A confidence check. A review queue. But it needs to exist.
Most AI projects don't fail because the model was bad. They fail because the surrounding process wasn't designed to handle the cases where it's wrong.
Want to know which of your processes have that gap? Our free Automation Check walks through that in 30 minutes.