Technology · 22 min read · 04.05.2026 · Max Fey

The token bill: why AI automations get expensive in year two

Pilot month one: $47 on OpenAI. A year later: $1,840. The work hadn't changed. Here is what actually happens between the demo and production, and the five moves that cut LLM costs in half.


A founder I worked with last spring was thrilled with his first OpenAI invoice. Forty-seven dollars for a workflow that triaged customer emails and drafted responses. "We save three hours a day with this," he told me. "It pays for itself."

Two weeks ago he called me again. The monthly bill was $1,840. The workflows had not changed. The savings had not either. The cost side had grown by a factor of thirty-nine.

This is what happens to most AI automations once they leave the demo phase. Volume grows. Prompts get richer. New use cases get added. And what looked like a few cents per call turns into something that costs more than the headcount it was supposed to replace.

This article is about why LLM costs almost never match the demo numbers, where the actual money goes, and which architectural decisions you should make today so your AI automation is still affordable in two years.

What the demo doesn't show

A typical demo looks like this: one call to GPT-5 or Claude with a 200-word prompt, a tidy answer, applause. Cost mentioned, if at all, as "a few cents per call."

That number isn't wrong. It's just incomplete.

In real production, calls look different.

First, every call carries a system prompt that defines the model's behavior, persona, and rules. In most production setups, this is between 1,500 and 3,000 tokens. Per call. Always.

Second, few-shot examples. Four to eight example queries with example answers, showing the model how to behave. Another 2,000 to 5,000 tokens.

Third, context. For an email analysis, the last ten messages from this customer. For a contract review, the relevant excerpt from the case file. For a chatbot, the last ten exchanges. Often 3,000 to 10,000 tokens. Sometimes much more.

Fourth, and this gets ignored: models don't answer in fifty words. They answer in 300 to 800 tokens, because they default to being thorough.

Add these layers up and a typical production call lands between 8,000 and 18,000 tokens. At current GPT-5 pricing (around $12 per million input tokens, $48 per million output tokens), that's eleven to twenty-five cents per call.

A workflow running a hundred times a day costs $11 to $25 daily. At a thousand calls, $110 to $250. Daily.
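As a sanity check, here is the same arithmetic in a few lines of Python. The prices and token counts are the assumed figures from above, not official list prices.

```python
# Back-of-the-envelope cost per call, using the assumed figures above.
INPUT_PRICE = 12 / 1_000_000   # $ per input token (assumed GPT-5 list price)
OUTPUT_PRICE = 48 / 1_000_000  # $ per output token

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

low = cost_per_call(8_000, 300)    # ~ $0.11
high = cost_per_call(18_000, 800)  # ~ $0.25
print(f"per call: ${low:.2f} to ${high:.2f}")
print(f"at 1,000 calls/day: ${low * 1000:,.0f} to ${high * 1000:,.0f} per day")
```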

Then production adds the things demos don't have: retries (when the model chokes), validations (when the response won't parse), and cascades (when a cheaper first call fails and a more expensive one takes over).

Real-world cost factor in production: three to five times what the pilot suggested. That's not a guess. That's the consistent pattern in every project where I've been brought in to fix the cost problem.

Where the money actually goes

If you want to control LLM costs, you have to understand where they come from. Three drivers dominate in nearly every setup.

The biggest driver is the system prompt being resent on every call.

Imagine a workflow that classifies inbound support tickets. The system prompt describes your fourteen categories, gives examples for each, defines edge cases, specifies the JSON output format. That's 2,500 tokens.

At 800 tickets per day, those 2,500 tokens get sent 800 times daily. That's two million tokens per day for content that never changes. At current pricing, $24 per day and $720 per month, paid over and over for the exact same prompt.

The fix is prompt caching, and it's the most underused feature in the LLM ecosystem. Anthropic, OpenAI, and Google all support it. Configured properly, it cuts the cost of the cached portion by 50 to 90 percent, depending on the provider. Almost nobody turns it on.

The second driver is unnecessary output tokens.

LLMs are conversational. They explain things you didn't ask them to explain. They restate the question. They wrap up with a polite summary. Pleasant in conversation, expensive in automation.

I've seen workflows where a binary yes/no classification has the model reasoning out loud for 400 tokens, then giving the answer, then summarizing again. The downstream system uses one word. You pay for 400.

At today's output prices, every unnecessary output token is four to eight times more expensive than an input token. Output discipline is one of the highest-return fixes per hour of work.

The third driver is model overshoot.

Most workflows default to the best available model, because it's the easiest choice. Using Claude Opus 4.7 or GPT-5 because "it's the best one" often means buying the most expensive hammer for a screw.

A two-class classification doesn't need a frontier model. Claude Haiku 4.5 handles it. GPT-5 Mini handles it. A locally hosted Llama 3.3 handles it. The cost difference between a top-tier and a mid-tier model is on the order of 10 to 30 times.

Running everything on the top model is convenient. It's also a convenience tax.

Why volume grows faster than you expect

A pattern that catches most teams off guard: AI automations almost always expand in their first year, not contract.

What starts as a pilot with a hundred calls a day, if it works, gets cloned to other use cases. The email assistant becomes a quote drafter, becomes a lead qualifier, becomes a report summarizer. Each use case multiplies volume.

User behavior changes too. In month one, employees send ten queries because they don't trust it yet. In month six, they send a hundred. Whatever's efficient gets used more. Good for the users, brutal for the bill.

I've watched a project go from 50 calls per day at pilot launch to 380 after three months to 2,100 after seven months. Functionality stayed largely the same. Adoption grew. The invoice followed.

Anyone planning an LLM automation should budget for several times the pilot volume, not for the pilot volume itself. Otherwise the cost surprise is built into the plan.

The honest TCO calculation

A serious cost calculation for an AI automation accounts for three categories, of which token costs are only one.

Direct token costs. Your monthly invoice from OpenAI, Anthropic, Google, or Azure. Cleanly measurable.

Infrastructure and tooling. Vector database if you use RAG. Logging tools for traceability. Monitoring for hallucinations. A self-hosted setup for data processing if data residency rules out cloud services. Add $200 to $800 per month, depending on complexity.

Maintenance and adjustment. Models change. OpenAI ships a new GPT and your existing prompts behave differently. Anthropic deprecates a model and you migrate. An API changes its response shape and your parser breaks. On top of that, your requirements evolve, new edge cases appear, the setup needs ongoing tuning. In practice, one to two person-days per month for each production LLM workflow.

Sum the three and your monthly cost is rarely just the token bill. It's token costs times two or three.

For a planned saving of $5,000 per month (about half a full-time role displaced), the honest total cost lands around $1,500 to $2,500 per month. Still a sound business case. Just not the "a few cents per call" case the project was sold on.

Architectural choices that quietly cost money

Some architectural choices drive costs disproportionately without anyone noticing. Four show up over and over.

A pipeline of multiple LLM calls where one would do.

Common pattern: call one extracts entities, call two classifies, call three picks the next action, call four drafts the response. Four calls per email, 5,000 tokens each. It adds up fast.

Most of these chains can collapse into a single call that solves the same task in one shot. Token usage drops 60 to 75 percent because the system prompt and context get sent once, not four times.

The exception is when different steps genuinely need different models, like cheap classification followed by expensive generation. That's a deliberate choice. The chain-by-default architecture isn't.
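When the chain does collapse, the single call can look like the sketch below: one request that returns entities, category, next action, and a draft reply in a single JSON object. The prompt wording, field names, and model ID are illustrative placeholders, not a drop-in replacement for any specific pipeline.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

SYSTEM_PROMPT = """You process inbound customer emails.
Return a single JSON object with the keys:
"entities" (names, order IDs), "category", "next_action", "draft_reply"."""

def handle_email(email_text: str) -> dict:
    # One call instead of four: the system prompt and context are sent once.
    response = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": email_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```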

The full conversation history in every call.

Chat workflows often pass the entire conversation back to the model on every turn so it has context. After twenty turns, you're sending 30,000 to 50,000 tokens of history, every time.

In most cases, the last three to five turns plus a compact summary of the older ones is enough. The summary gets generated once and reused. On a thirty-turn conversation, that easily saves 80 percent of the tokens.
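A sketch of that trimming logic, assuming the conversation is a list of {role, content} dicts and the summary of older turns has already been generated once by a helper you provide. The cutoff of five turns is the rule of thumb from above, not a fixed constant.

```python
def build_context(turns: list[dict], summary: str | None, keep_last: int = 5) -> list[dict]:
    """Send a compact summary of the old turns plus only the most recent ones."""
    old, recent = turns[:-keep_last], turns[-keep_last:]
    messages = []
    if old and summary:
        # The summary is generated once and reused; it replaces thousands of
        # tokens of old history with a few hundred.
        messages.append({"role": "system",
                         "content": f"Summary of the earlier conversation: {summary}"})
    return messages + recent
```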

The RAG that returns too much.

Retrieval-augmented generation is an elegant way to inject current data into LLM calls. It gets expensive when the retrieval step is configured generously.

Typical case: the system queries a knowledge base and returns the top ten chunks, each 500 tokens. That's 5,000 extra tokens per call, when three chunks usually answer the question.

Discipline in tuning the retrieval depth is missing in nearly every setup I review. Three chunks instead of ten saves 70 percent of the context tokens with no measurable drop in answer quality.
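In code, the lever is usually a single parameter. A minimal sketch assuming a generic vector-store client with a query() method; the method name, chunk size, and numbers are placeholders for whatever your RAG stack actually exposes.

```python
# Retrieval depth is the single biggest context-cost lever in most RAG setups.
TOP_K = 3           # instead of 10: roughly 70 percent fewer context tokens
CHUNK_TOKENS = 500  # rough size of one retrieved chunk

def retrieve_context(store, question: str) -> str:
    chunks = store.query(question, top_k=TOP_K)  # placeholder vector-store API
    return "\n\n".join(chunk.text for chunk in chunks)

# Rough context budget per call: 3 * 500 = 1,500 tokens instead of 5,000.
```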

The model that's smarter than required.

I've seen production code that uses GPT-5 to extract addresses, phone numbers, and order IDs from emails. Any cheap model handles that with identical accuracy.

Reaching for the best model by default costs without delivering. A useful rule: if the task is well-defined and reproducible, route it to the cheapest model that solves it. Save the top model for the steps that genuinely need it.
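One way to make that rule operational is a small routing table per task type, so the cheap default is explicit instead of accidental. The task names and model IDs below are illustrative.

```python
# Route well-defined, reproducible tasks to the cheapest model that solves them.
MODEL_FOR_TASK = {
    "extract_fields":  "claude-haiku",  # illustrative model IDs
    "classify_ticket": "gpt-5-mini",
    "draft_reply":     "gpt-5",         # only the generation step gets the top model
}

def pick_model(task: str) -> str:
    # Default to mid-tier, never to the most expensive model.
    return MODEL_FOR_TASK.get(task, "gpt-5-mini")
```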

What real cost control looks like

Three things matter: visibility, discipline, limits.

Visibility means you know which workflow costs what.

Every major provider lets you tag costs by project, API key, or workflow. Use it. If one workflow generates half your monthly bill, you should know that before the invoice arrives, not after.

Concretely, every production LLM workflow gets its own API key or tag. A daily report shows costs per workflow. Surprises get caught early.
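A minimal version of that report can live in your own logging, assuming you record the usage block that every provider returns with each response. The table layout and file name here are illustrative.

```python
import sqlite3
from datetime import date

# Log the token usage every API response reports, keyed by workflow.
db = sqlite3.connect("llm_costs.db")
db.execute("""CREATE TABLE IF NOT EXISTS usage
              (day TEXT, workflow TEXT, input_tokens INT, output_tokens INT)""")

def log_usage(workflow: str, input_tokens: int, output_tokens: int) -> None:
    db.execute("INSERT INTO usage VALUES (?, ?, ?, ?)",
               (date.today().isoformat(), workflow, input_tokens, output_tokens))
    db.commit()

def daily_report(day: str) -> list[tuple]:
    # One row per workflow: who is actually driving the bill?
    return db.execute("""SELECT workflow, SUM(input_tokens), SUM(output_tokens)
                         FROM usage WHERE day = ? GROUP BY workflow""", (day,)).fetchall()
```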

Discipline means treating tokens as a resource, not a given.

Every prompt should be developed with the question: what's the minimum input the model needs to do this task? Every response should be configured with the question: what's the shortest form that's still useful?

In practice: review system prompts and cut examples that don't pull their weight. Constrain output format with JSON schemas, stop sequences, or strict length limits. And yes, sometimes "answer in one word" is the right instruction.
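The same discipline, expressed as API parameters, sketched against the OpenAI chat API. The model name, schema, and token cap are examples, and exact parameter names can differ between model generations.

```python
from openai import OpenAI

client = OpenAI()

def classify(ticket_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder model name
        max_tokens=20,       # hard cap: no 400-token reasoning detours
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": 'Classify the ticket. Return JSON only: {"category": "<category name>"}.'},
            {"role": "user", "content": ticket_text},
        ],
    )
    return response.choices[0].message.content
```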

Limits means hard stops before the bill hurts.

Set daily budgets per workflow where the provider supports it. OpenAI has project-level spend limits. Anthropic supports workspace budgets. When the limit hits, the system stops, instead of producing more.

Hard limits feel restrictive until the day a bug puts a workflow into a retry loop and burns through API calls all night. Without limits, that's a four- to five-figure event by morning. These incidents aren't theoretical. They happen often enough to plan for.
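Even when the provider supports spend limits, an in-process guard is cheap insurance against exactly that retry-loop scenario. A minimal sketch; the limit, the in-memory storage, and the idea of charging each call with its estimated cost are assumptions to adapt.

```python
from collections import defaultdict
from datetime import date

DAILY_BUDGET_USD = 50.0      # example limit per workflow
_spend = defaultdict(float)  # (day, workflow) -> dollars spent so far

class BudgetExceeded(RuntimeError):
    pass

def charge(workflow: str, cost_usd: float) -> None:
    """Call this before each API request with the estimated cost of the call."""
    key = (date.today().isoformat(), workflow)
    if _spend[key] + cost_usd > DAILY_BUDGET_USD:
        # Hard stop: better a paused workflow than a five-figure invoice.
        raise BudgetExceeded(f"{workflow} hit its ${DAILY_BUDGET_USD:.0f} daily budget")
    _spend[key] += cost_usd
```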

Cache, cache, cache

If I had to name one lever, it would be prompt caching. Best return per hour of work, and almost nobody uses it well.

Anthropic, OpenAI, and Google all ship prompt caching today. The principle is similar across all three: repeated portions of a prompt (system prompt, few-shot examples, static context) get cached server-side and processed at a fraction of the normal cost on subsequent calls.

A cache hit on Anthropic costs roughly 10 percent of the normal input price. On OpenAI, around 50 percent. On Google, around 25 percent. If your system prompt and examples account for 80 percent of input tokens (typical), and you cache them, your input costs drop by roughly 40 percent on the low end and just over 70 percent on the high end, depending on the provider.

In numbers: a workflow that runs $2,000 per month uncached lands somewhere between $600 and $1,200 per month with proper caching, depending on the provider's discount. Same volume. Same functionality.

Implementation isn't hard, but it requires deliberate setup. The cacheable block (system prompt, few-shot examples) needs to be marked correctly in the API call so the provider treats it as static. The rest is automatic.
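With the Anthropic API, marking the static block means adding a cache_control entry to the system prompt. A minimal sketch; the model ID and the placeholder strings are examples.

```python
import anthropic

client = anthropic.Anthropic()

STATIC_BLOCK = "..."  # the 2,500-token system prompt plus few-shot examples
ticket_text = "..."   # the small, per-call dynamic part

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=500,
    system=[{
        "type": "text",
        "text": STATIC_BLOCK,
        "cache_control": {"type": "ephemeral"},  # marks the block as cacheable
    }],
    messages=[{"role": "user", "content": ticket_text}],
)
# response.usage reports cache_creation_input_tokens and cache_read_input_tokens,
# so you can verify that subsequent calls actually hit the cache.
```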

I regularly find setups where someone has spent days hand-tuning prompts to save two sentences, and caching isn't enabled. The order of priorities is wrong.

Smaller models, more discipline

A second discipline that often gets skipped: the right model for the task.

GPT-5 and Claude Opus 4.7 are excellent when the task requires nuance. They're wasteful when the task is routine.

A practical scheme:

Top-tier (Claude Opus, GPT-5, Gemini Pro): tasks that need real reasoning. Drafting customer-facing replies, summarizing long documents, multi-step argument validation.

Mid-tier (Claude Sonnet, GPT-5 Mini, Gemini Flash): tasks with clear structure. Classifications across five to twenty categories, extracting from semi-structured inputs, simple summaries.

Small (Claude Haiku, GPT-5 Nano, local Llama): high-volume, narrowly defined tasks. Binary classifications, simple extractions, routing decisions.

The cost gap between these tiers is roughly 5x to 20x from mid to top, another 3x to 5x from small to mid. Using all three tiers, instead of running everything on top, can cut total costs in half.

The investment is the test. Try each task on the next-smaller model, measure the quality, decide. One to two hours per use case. Worth it every time.

When self-hosting pays off

A question that comes up around the time the monthly bill clears $2,000: would a self-hosted open-source model be cheaper?

The honest answer: usually no, sometimes yes.

It pays off when you have high, steady volume (well over 5 million tokens per day), the task can be solved with a mid-tier or small model, and you already have the infrastructure team to run a GPU stack. With those preconditions, a locally operated Llama 3.3 or Mistral Large is often 50 to 70 percent cheaper than equivalent API usage.

It doesn't pay off when your volume is uneven, you have spikes that need top-tier quality, or you don't have GPU operations capacity. With sporadic usage, the fixed cost of the servers exceeds the variable cost of API calls.

In practice, self-hosting solves a narrow set of high-volume, stable workloads. For the typical mix of business automations, it's too heavy, too operational, and too hard to scale up and down.

If you do want to test it: vLLM or Hugging Face's Text Generation Inference are the standard frameworks. An A10G or L4 GPU runs around $250 to $400 per month in a cloud environment. Enough to serve models in the 7-to-14-billion-parameter range at reasonable speed; a 70-billion-parameter model needs several such cards or aggressive quantization.
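For a feel of the setup, here is a minimal vLLM sketch that loads a model small enough for a single 24 GB card. The model ID is an example (and gated on Hugging Face), not a recommendation.

```python
from vllm import LLM, SamplingParams

# Loads an 8B instruct model onto one 24 GB GPU (L4 / A10G class).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model ID
params = SamplingParams(temperature=0.0, max_tokens=100)

outputs = llm.generate(["Classify this ticket: my invoice is wrong."], params)
print(outputs[0].outputs[0].text)
```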

What to do this week

If you're planning or running AI automations, the following don't belong in next quarter's roadmap. They belong on this week's checklist.

Make costs visible. One API key or tag per workflow. Weekly cost review. If you don't know which workflow costs what, you can't manage it.

Turn on caching. Set the cache markers in your Anthropic, OpenAI, or Google API calls. Highest return per hour spent.

Right-size the models. Test each production workflow on the next-smaller model. If quality holds, switch. The hit rate is higher than people expect.

Tighten outputs. Adjust system prompts so the model answers in the most compact useful form. JSON schemas. Stop tokens. Length caps.

Set spending limits. Daily budget per workflow, hard stop on overrun. Prevents the runaway loop that creates the surprise invoice.

These five steps typically cut monthly LLM costs by 40 to 70 percent in real setups, with no functionality loss. The work involved: one to two days per workflow.

The question worth asking

Before adding the next AI automation, ask: what does this cost per call if volume goes up by 10x?

If the answer is "a few cents, scales fine," you've probably only counted the base token cost. Multiply by three to five. Add infrastructure and maintenance. That's the real number.

If the answer is "no idea, we'll find out," you'll find out. Two quarters after launch, when finance asks why cloud costs are out of control.

If the answer is "we have caching enabled, the right model in each step, output discipline in the prompts, and hard limits configured," you've built an automation that's still economical in year two.

Asking the question is annoying. Skipping it is more expensive.

If you want to know how economical your current AI workflows actually are, and where the biggest cost levers sit, the free Automations Check gives you a clear picture in about 30 minutes.

#LLM-Costs#OpenAI#Anthropic#AI-Automation#TCO#Token#Prompt Caching#Cost Management