Why does an agent cost so much more than a single chat?

Because the harness re-sends a growing context every step and adds retries, tool calls, and reasoning tokens. A ten-step run can bill many times the tokens of one call, especially without caching.

How much does prompt caching save on an agent run?

A lot on long runs, because the repeated prefix is charged at the cache-read rate (about a tenth of input) after a one-time write premium. The simulator shows the cached cost beside the uncached counterfactual.

What is the power-user risk?

A small number of users running long or looping agents can dominate spend. Forecasting a heavy run shows whether your pricing or caps survive that, before it shows up on the invoice.

Can I trust the forecast?

The prices and cache model are validated against real public usage logs; the step-count and token-shape coefficients are directional seed values until you calibrate them to your own runs. Treat it as a sized estimate, not a guarantee.

Tools

Simulator

What does one AI agent run really cost, and can a power user blow your budget?

An agent run costs far more than a single API call. The harness re-sends a growing transcript every step, retries repeat work, tool calls add tokens, and reasoning tokens are billed as output. On a context-accumulating harness, total cost grows roughly with the square of the step count; prompt caching pulls it back toward linear. A few heavy users can dominate your bill.

For builders shipping coding agents or agent loops who need a per-run and per-user cost forecast before launch, and anyone sizing the risk that power users or runaway loops quietly burn the budget.

Agent harness

Model

5) for Haiku (

/$5) can cut a token bill ~5×.

Task

Files touched4

Repo size (kLOC)50

Complexity50%

Ambiguity30%

Compile/test failures1

Reasoning & volume

Reasoning effort

Runs / month

Estimated cost / agent run

.77

/ run

36 steps · 1.5M in · 68K out (40K reasoning)

Per run

.77

Claude Code

Monthly

at your run volume

Cache savings

67%

vs $5.41 uncached

Why is this run expensive?

Reasoning$0.599 · 34%

Output$0.417 · 24%

Cache reads$0.409 · 23%

Cache writes$0.247 · 14%

Uncached input$0.097 · 5%

Cost distribution (P50 / P90 / P99)

stochastic

The same task can vary up to ~30× in tokens, so the cost is a band, not a point. P50 is a typical run; budget against P90.

P50 typical

.77

P90 budget

$7.08

P99 worst case

6.56

End-to-end reliability18% · retry ×1.82

Expected monthly (incl. retries)$323

API monthly

3-year TCO (~25% API)$708/mo

Cost per step

Context accumulates — watch input climb across steps (caching keeps it in check).

Step

SEED coefficients — a directional estimate, not calibrated to real runs.

Caching assumes a stable prompt prefix; any non-determinism busts it and raises cost.

Cost is stochastic — the P50/P90/P99 band reflects how the same task can vary up to ~30× in tokens.

Pricing validated to the cent against real Claude Code logs (ccusage/LiteLLM). In aggregate, real usage runs ~$0.68–0.94 / 1M tokens blended with ~98% cache reads; a single short task reads higher until cache reads accumulate.

Priced on Claude Sonnet 4.5 (Anthropic) at $3.00 in ·

5.00 out / 1M.

Example scenario

The same task can cost very differently by harness. A context accumulator (Claude Code style) re-sends the whole transcript each step, so a long run climbs fast; a windowed harness keeps a bounded slice and a compressed harness sends summaries. Turn prompt caching on and the accumulator's bill drops sharply, because the repeated prefix is charged at the cache-read rate. The simulator shows both the cached cost and the uncached counterfactual.

What the inputs mean

Harness type: how it builds context, accumulating, windowed, or compressed.
Model: sets the token and cache rates.
Task size and steps: how long the run is.
Prompt caching: read and write behavior for the repeated prefix.
Retries: repeated work when a step fails.
Reasoning: hidden tokens that are billed as output.

What the result means

You get a per-step and total forecast in tokens and dollars, the cached cost next to the uncached counterfactual so you can see what caching is doing, and the reasoning-token share that is invisible in the response but still billed.

Assumptions

Step-count and token-shape coefficients are seed values, directional until you tune them to your own runs.
The underlying prices and the cache model (read about 0.1 times input, write about 1.25 times input) are validated against real logs.
In aggregate usage, cache reads are about 98 percent of all tokens, so a long run's blended rate sits well below list input, around $0.68 to $0.94 per 1M tokens across viberank, clawdboard, and local ccusage.
A single short task reads a little higher before cache reads pile up.

Where the prices come from

Per-token and cache read/write rates come from the source-backed pricing index, where every figure links to the provider's own page and carries a last-checked date. This tool reads those committed numbers; it never calls a provider or fetches live prices.

How the calculation works

Token price is the same wherever you call a model; what differs is how much the harness sends. Each step bills the context it re-sends plus new output and any reasoning tokens. Caching charges the repeated prefix at the cheaper cache-read rate after a one-time write, which is why an accumulating harness with caching grows closer to linear than quadratic. Retries multiply a step's cost. The forecast combines these drivers; it does not change the model's published rates.

Frequently asked questions

Why does an agent cost so much more than a single chat?: Because the harness re-sends a growing context every step and adds retries, tool calls, and reasoning tokens. A ten-step run can bill many times the tokens of one call, especially without caching.
How much does prompt caching save on an agent run?: A lot on long runs, because the repeated prefix is charged at the cache-read rate (about a tenth of input) after a one-time write premium. The simulator shows the cached cost beside the uncached counterfactual.
What is the power-user risk?: A small number of users running long or looping agents can dominate spend. Forecasting a heavy run shows whether your pricing or caps survive that, before it shows up on the invoice.
Can I trust the forecast?: The prices and cache model are validated against real public usage logs; the step-count and token-shape coefficients are directional seed values until you calibrate them to your own runs. Treat it as a sized estimate, not a guarantee.

Pricing data last checked 2026-06-01. Rates are read from the source-backed pricing index and its change history. This tool never calls a provider or fetches live prices.