Controlling LLM Costs In Production Workflows

The most common LLM cost surprise is not an expensive pilot. It is a cheap pilot that becomes an expensive production system, or a viable project killed by a scary projection. Both happen because pilots and production systems run on different economics.

Cost is a design property of a workflow, not a price you discover at the end of the month. The teams that control it treat cost per workflow-run as a number they engineer, measure, and accept or reject, the same way they treat accuracy.

Why Pilot Costs Mislead

A chat pilot runs on per-seat economics. Twenty people asking a few questions a day is bounded by human patience; the monthly bill stays small almost regardless of model choice. A production workflow runs on per-event economics. It fires on every inbound email, every document, every webhook. Volume is set by the business, not by patience, and it grows when the business does.

Pilots are also unoptimized by design: one large model for every step, full documents in the context, retrieval dumped into the prompt because nobody had a reason to trim it. That is the right way to validate quality, and the wrong baseline for a budget.

Both failure modes follow. One team extrapolates the pilot linearly, sees a five-figure monthly bill, and cancels a project the levers below would have made cheap. Another assumes "it was $200 in the pilot" and discovers production economics in the second invoice.

The Cost Levers, In Order Of Leverage

Right-size the model per step. A workflow is not one model call; it is several steps of different difficulty. Use a frontier model for the genuinely hard steps and a small model for classification, routing, and extraction. This is the largest single lever, often worth a 10x cost reduction on its own.
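One way to make per-step sizing explicit is a small routing table. The sketch below is a minimal illustration: the tier labels, step names, and the `tier_for` helper are all hypothetical, and each tier would map to a real model ID in your own configuration.

```python
# Hypothetical tier labels; map each one to a real model ID in your own config.
SMALL, MID, FRONTIER = "small", "mid", "frontier"

# Hypothetical step names for a triage-style workflow.
STEP_TIERS = {
    "classify": SMALL,
    "extract": SMALL,
    "draft_reply": MID,
    "escalation_review": FRONTIER,
}


def tier_for(step: str) -> str:
    """Return the configured tier for a step.

    Unknown steps raise instead of silently falling back to the most
    expensive model, so a new step forces an explicit sizing decision.
    """
    try:
        return STEP_TIERS[step]
    except KeyError:
        raise ValueError(f"no model tier configured for step {step!r}") from None
```

Failing on unknown steps is a design choice: the default for a new step becomes a sizing decision rather than the frontier model.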

Prompt and context discipline. Trim retrieval to the chunks that change the answer, and cache the static system prompt. Prompt caching cuts the cost of repeated context by up to roughly 90% where supported.
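The effect of caching a static prefix can be estimated with simple arithmetic. This sketch assumes cached reads are billed at 10% of the normal input price (the "up to roughly 90%" case) and ignores cache-write surcharges and expiry; the token counts and the $3 per million tokens price are hypothetical.

```python
def input_cost(static_tokens: int, dynamic_tokens: int, price_per_mtok: float,
               cache_hit: bool, cached_read_fraction: float = 0.10) -> float:
    """Dollar cost of one call's input tokens.

    Assumption: on a cache hit the static prefix is billed at
    `cached_read_fraction` of the normal input price. Cache-write
    surcharges and cache expiry are ignored in this sketch.
    """
    static_price = price_per_mtok * (cached_read_fraction if cache_hit else 1.0)
    return (static_tokens * static_price + dynamic_tokens * price_per_mtok) / 1_000_000


# Hypothetical call: 4,000-token system prompt plus 2,000 tokens of trimmed
# context, at an assumed $3 per million input tokens.
uncached = input_cost(4_000, 2_000, 3.0, cache_hit=False)
cached = input_cost(4_000, 2_000, 3.0, cache_hit=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions the input cost of the call falls by 60%, not 90%, because only the static prefix is discounted. Trimming the dynamic context still matters.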

Batch and go async. Most workflow steps do not need an interactive answer. Batch APIs are typically priced around half the interactive rate, and async queues smooth spikes.
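A minimal sketch of the batching side: group pending requests into fixed-size chunks for submission, and estimate the cost at a batch rate. The `batches` and `batch_cost` helpers are hypothetical, and the 50% discount is the rough figure above, not a quoted price.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def batch_cost(interactive_cost: float, batch_discount: float = 0.5) -> float:
    """Estimated cost of the same work at the batch rate (assumed discount)."""
    return interactive_cost * (1.0 - batch_discount)
```

Each chunk would then go to whatever batch endpoint or queue your provider offers; that call is deliberately left out because its shape depends on the provider.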

Cache repeated answers. If the same question arrives every day, serve the approved answer from a cache and call the model only on misses.
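A minimal sketch of an approved-answer cache, assuming exact matching after case and whitespace normalization. The class name and the `call_model` callable are hypothetical; a real system would likely use a persistent store and may need fuzzier matching.

```python
import hashlib
from typing import Callable, Dict


class ApprovedAnswerCache:
    """Serve reviewed answers from memory; call the model only on misses."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model  # hypothetical wrapper around your model client
        self._approved: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivial variants share one entry.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def approve(self, question: str, answer: str) -> None:
        """Store an answer that a human has reviewed."""
        self._approved[self._key(question)] = answer

    def answer(self, question: str) -> str:
        key = self._key(question)
        if key in self._approved:
            self.hits += 1
            return self._approved[key]
        self.misses += 1
        # Model output is returned but not cached until someone approves it.
        return self._call_model(question)
```

The hit and miss counters give the hit rate, which tells you whether the cache is earning its keep.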

Control output length. Output tokens usually cost around five times as much as input tokens. Constrain outputs to schema-bound JSON, set per-step maximum lengths, and never let a classification step write an essay.
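As a sketch of what schema-bound output and per-step caps can look like, the snippet below keeps a table of hypothetical output-token limits and rejects any classification output that is not a single allowed label. The step names, limits, and label set are assumptions for illustration.

```python
import json

# Hypothetical per-step caps, passed as the max-output-tokens setting of each call.
MAX_OUTPUT_TOKENS = {"classify": 20, "extract": 300, "draft_reply": 500}

# Hypothetical label set for the classification step.
ALLOWED_LABELS = {"billing", "bug", "how_to", "other"}


def parse_classification(raw: str) -> str:
    """Accept only a JSON object of the form {"label": <allowed label>}."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    if not isinstance(data, dict) or set(data) != {"label"}:
        raise ValueError("expected a JSON object with exactly one key: 'label'")
    label = data["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return label
```

The cap bounds what a misbehaving step can cost; the parser turns an essay into a visible error instead of a silent expense.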

Make cost per workflow-run an acceptance criterion. Log tokens and cost per step, per run, from day one, and put the number next to the quality metrics in every demo. What is not measured will not be managed.
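A minimal sketch of per-step, per-run cost logging. `Price` and `RunLedger` are hypothetical names, the ledger lives in memory, and prices are whatever per-million-token rates you pass in; in practice the token counts would come from the usage data your provider returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # dollars per million input tokens
    output_per_mtok: float  # dollars per million output tokens


class RunLedger:
    """Accumulate cost per step and per run."""

    def __init__(self, prices: Dict[str, Price]) -> None:
        self._prices = prices
        self._entries: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def record(self, run_id: str, step: str, tier: str,
               input_tokens: int, output_tokens: int) -> float:
        """Log one model call and return its dollar cost."""
        price = self._prices[tier]
        cost = (input_tokens * price.input_per_mtok
                + output_tokens * price.output_per_mtok) / 1_000_000
        self._entries[run_id].append((step, cost))
        return cost

    def cost_by_step(self, run_id: str) -> Dict[str, float]:
        totals: Dict[str, float] = defaultdict(float)
        for step, cost in self._entries.get(run_id, []):
            totals[step] += cost
        return dict(totals)

    def cost_of_run(self, run_id: str) -> float:
        return sum(cost for _, cost in self._entries.get(run_id, []))
```

`cost_of_run` is the number to put next to the quality metrics; `cost_by_step` shows which step to optimize first.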

The order matters. Most teams reach for caching tricks first, but model right-sizing and context discipline usually deliver more than everything else combined.

A Worked Example: Support Triage At 3,000 Emails A Month

Take a standard support email triage workflow: classify each inbound email, extract structured fields, and draft a reply for the roughly 40% that need one. Unoptimized, every email means a full thread plus retrieved knowledge-base context, call it 6,000 input tokens and 800 output tokens.

| Design | Cost per email | Monthly at 3,000 emails |
| --- | --- | --- |
| Frontier model for every step, full context | ~$0.15 | ~$450 |
| Mid-tier model for every step, same prompts | ~$0.03 | ~$90 |
| Tiered: small model classifies and extracts; mid-tier drafts the 40%, trimmed context, cached system prompt | ~$0.006 | ~$18 |

The figures are illustrative, computed from public 2026 per-token price ranges: read the ratios, not the absolutes. The 25x spread comes from design decisions, not from negotiating a discount.
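The first two rows can be reproduced with a few lines of arithmetic. The per-token prices below are assumptions chosen for illustration (they are not stated in this article), with output priced at five times input; the token counts are the 6,000 input and 800 output tokens from the example. The tiered row is not reproduced because it depends on how far the context is trimmed.

```python
def cost_per_email(input_tokens: int, output_tokens: int,
                   input_per_mtok: float, output_per_mtok: float) -> float:
    """Dollar cost of one email given per-million-token prices."""
    return (input_tokens * input_per_mtok
            + output_tokens * output_per_mtok) / 1_000_000


# Assumed prices in dollars per million tokens: (input, output).
ASSUMED_PRICES = {
    "frontier": (15.0, 75.0),
    "mid_tier": (3.0, 15.0),
}

for design, (p_in, p_out) in ASSUMED_PRICES.items():
    per_email = cost_per_email(6_000, 800, p_in, p_out)
    print(f"{design}: ${per_email:.3f} per email, "
          f"${per_email * 3_000:,.0f} per month")
```

With these assumed prices the script gives $0.150 and $450 for the frontier design, and $0.030 and $90 for the mid-tier design.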

At 3,000 emails a month, every row is affordable, which is exactly how the trap forms: nothing forces the discipline until volume does. At 300,000 emails a month, the same workflow costs roughly $45,000 in the first design and $1,800 in the third. Tiering down must also be verified per step against an evaluation set, not applied as a blanket downgrade; choosing the right Claude model per step is its own decision.
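Per-step verification can be as simple as a gate that compares the cheaper model with the current one on a labeled evaluation set. This is a sketch under assumptions: `baseline` and `candidate` are hypothetical callables wrapping the two models, exact-match accuracy is the metric, and the one-point tolerance is arbitrary.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def accuracy(predict: Callable[[str], str], eval_set: Sequence[Example]) -> float:
    """Exact-match accuracy of `predict` on the evaluation set."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)


def can_tier_down(baseline: Callable[[str], str],
                  candidate: Callable[[str], str],
                  eval_set: Sequence[Example],
                  max_drop: float = 0.01) -> bool:
    """True if the cheaper candidate loses at most `max_drop` accuracy."""
    drop = accuracy(baseline, eval_set) - accuracy(candidate, eval_set)
    return drop <= max_drop
```

Run the gate once per step, not once per workflow: a small model that passes on classification may still fail on drafting.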

When A Step Should Leave The LLM

Past a certain volume, the cheapest model call is the one you stop making. The signals:

A high-volume classification step with stable labels: a fine-tuned small model, or embeddings plus a classical classifier, does the same job for a fraction of the price.

A deterministic transform (date parsing, currency conversion, lookups, schema mapping) belongs in ordinary code, never in a model call.

The same answers served repeatedly: promote reviewed outputs to a cache or a rules table, and reserve the model for genuinely new cases.

One step dominating the bill at volume: fine-tuning can shrink the prompt itself, because a tuned model needs fewer instructions and examples per call.
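For the deterministic-transform signal above, the replacement is usually a few lines of standard-library code. A minimal sketch, assuming two hypothetical input date formats and an exchange rate supplied by the caller:

```python
from datetime import date, datetime
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical: list the formats your inputs actually use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text: str) -> date:
    """Parse a date using the known formats; fail loudly on anything else."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def convert(amount: Decimal, rate: Decimal) -> Decimal:
    """Convert an amount at the given rate, rounded to two decimal places."""
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Inputs that fail to parse are the genuinely new cases; those, and only those, are candidates for a model call.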

Fine-tuning is not free even when training is cheap: it carries dataset curation, evaluation, and retraining whenever the task drifts. Justify it with arithmetic (projected monthly savings against that carrying cost), and let the same arithmetic protect you in the other direction: nobody should fine-tune to save $70 a month.
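That arithmetic fits in one function. In the sketch below, the carrying cost, one-time cost, and horizon are hypothetical inputs you would estimate yourself; only the $70 a month comes from the text above.

```python
def fine_tune_net_benefit(monthly_savings: float, monthly_carrying_cost: float,
                          one_time_cost: float, horizon_months: int) -> float:
    """Net dollars gained over the horizon; negative means do not fine-tune."""
    return ((monthly_savings - monthly_carrying_cost) * horizon_months
            - one_time_cost)


# Hypothetical carrying cost of $400 a month and $2,000 of initial work,
# evaluated over 12 months.
small_saving = fine_tune_net_benefit(70, 400, 2_000, 12)
large_saving = fine_tune_net_benefit(5_000, 400, 2_000, 12)
print(f"saving $70/month: {small_saving:,.0f}; saving $5,000/month: {large_saving:,.0f}")
```

With these assumed costs, the $70 case loses money over the year and the $5,000 case clearly pays off; the decision follows from the inputs, so the effort belongs in estimating them honestly.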

In our sprints, cost per workflow-run is written into the acceptance criteria, and the weekly demo shows the cost dashboard next to the quality metrics; see packages for how the sprints are scoped. By handover you know what a production month costs before you commit to one. The practical rule: if you cannot state today what one run of your workflow costs, that, not the model choice, is the first thing to fix.