Example scenario
The same task can cost very different amounts depending on the harness. A context accumulator (Claude Code style) re-sends the whole transcript each step, so the cost of a long run climbs fast; a windowed harness keeps a bounded slice, and a compressed harness sends summaries. Turn prompt caching on and the accumulator's bill drops sharply, because the repeated prefix is charged at the cache-read rate. The simulator shows both the cached cost and the uncached counterfactual.
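To make the three harness shapes concrete, here is a minimal sketch of how many input tokens each one might send per step. It is an illustration only, not the simulator's code: the function name, the `window_steps` and `summary_percent` parameters, and all token counts are hypothetical placeholders.

```python
def context_per_step(harness, steps, base_tokens, step_tokens,
                     window_steps=3, summary_percent=20):
    """Input tokens sent at each step under a simplified harness model.

    base_tokens: fixed prefix (system prompt, task description).
    step_tokens: tokens each completed step adds to the transcript.
    window_steps and summary_percent are illustrative assumptions.
    """
    sent = []
    for step in range(steps):
        history = step * step_tokens  # tokens added by earlier steps
        if harness == "accumulating":
            # Re-send the whole transcript every step.
            context = base_tokens + history
        elif harness == "windowed":
            # Keep only the most recent window_steps steps.
            context = base_tokens + min(step, window_steps) * step_tokens
        elif harness == "compressed":
            # Replace history with a summary of fixed relative size.
            context = base_tokens + history * summary_percent // 100
        else:
            raise ValueError(f"unknown harness: {harness}")
        sent.append(context)
    return sent


if __name__ == "__main__":
    for harness in ("accumulating", "windowed", "compressed"):
        per_step = context_per_step(harness, steps=20,
                                    base_tokens=4000, step_tokens=1500)
        print(f"{harness:>12}: last step {per_step[-1]:>6} tokens, "
              f"total {sum(per_step):>7}")
```

In this model the accumulating harness's total grows with the square of the step count, while the windowed total grows linearly once the window is full.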
What the inputs mean
- Harness type: how it builds context (accumulating, windowed, or compressed).
- Model: sets the token and cache rates.
- Task size and steps: how long the run is.
- Prompt caching: read and write behavior for the repeated prefix.
- Retries: repeated work when a step fails.
- Reasoning: hidden tokens that are billed as output.
What the result means
You get three things: a per-step and total forecast in tokens and dollars; the cached cost next to the uncached counterfactual, so you can see what caching is doing; and the reasoning-token share, which is invisible in the response but still billed.
Assumptions
- Step-count and token-shape coefficients are seed values, directional until you tune them to your own runs.
- The underlying prices and the cache model (cache reads at about 0.1 times the input rate, cache writes at about 1.25 times the input rate) are validated against real public usage logs.
- In aggregate usage, cache reads are about 98 percent of all tokens, so a long run's blended rate sits well below the list input price, around $0.68 to $0.94 per 1M tokens across viberank, clawdboard, and local ccusage.
- A single short task has a somewhat higher blended rate, because cache reads have not yet piled up.
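A blended rate is a share-weighted average of the per-token rates. The sketch below shows that arithmetic under the cache multipliers stated above. The function name, the token-kind labels, and the prices and shares in the usage example are hypothetical placeholders, not figures from the logs, so the output is not expected to reproduce the quoted range.

```python
import math

CACHE_READ_MULT = 0.10   # cache read is about 0.1 times the input rate
CACHE_WRITE_MULT = 1.25  # cache write is about 1.25 times the input rate


def blended_rate(shares, input_per_m, output_per_m):
    """Dollars per 1M tokens, weighted by each token kind's share.

    shares maps "cache_read", "cache_write", "input", "output" to
    fractions of all tokens; the fractions must sum to 1.
    """
    if not math.isclose(sum(shares.values()), 1.0, abs_tol=1e-9):
        raise ValueError("token shares must sum to 1")
    rates = {
        "cache_read": input_per_m * CACHE_READ_MULT,
        "cache_write": input_per_m * CACHE_WRITE_MULT,
        "input": input_per_m,
        "output": output_per_m,
    }
    return sum(share * rates[kind] for kind, share in shares.items())


if __name__ == "__main__":
    # Placeholder mix: 98 percent cache reads, remainder split arbitrarily.
    shares = {"cache_read": 0.98, "cache_write": 0.012,
              "input": 0.003, "output": 0.005}
    # Placeholder prices in dollars per 1M tokens.
    rate = blended_rate(shares, input_per_m=3.00, output_per_m=15.00)
    print(f"blended rate: ${rate:.3f} per 1M tokens")
```

Because cache reads dominate the mix, the small shares of output and cache-write tokens still contribute a large part of the blended figure.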
Where the prices come from
Per-token and cache read/write rates come from the source-backed pricing index, where every figure links to the provider's own page and carries a last-checked date. This tool reads those committed numbers; it never calls a provider or fetches live prices.
How the calculation works
The token price is the same wherever you call a model; what differs is how much the harness sends. Each step bills the context it re-sends plus new output and any reasoning tokens. Caching charges the repeated prefix at the cheaper cache-read rate after a one-time write, which is why the cost of an accumulating harness with caching grows closer to linearly than quadratically with the number of steps. Retries multiply a step's cost. The forecast combines these drivers; it does not change the model's published rates.
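The drivers above can be sketched as follows for an accumulating harness. This is a simplified illustration under stated assumptions, not the simulator's implementation: the `Rates` class, the function names, and every number are hypothetical; a retry is modeled as re-billing the whole step; and reasoning tokens are billed at the output rate.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    input_per_m: float             # dollars per 1M input tokens (placeholder)
    output_per_m: float            # dollars per 1M output tokens (placeholder)
    cache_read_mult: float = 0.10  # cache read vs. input rate
    cache_write_mult: float = 1.25 # cache write vs. input rate


def step_cost(rates, prefix_tokens, new_tokens, output_tokens,
              reasoning_tokens, attempts=1, cached=True):
    """Dollar cost of one step; attempts > 1 models retries."""
    if cached:
        # Repeated prefix is read from cache; new tokens are written once.
        input_units = (prefix_tokens * rates.cache_read_mult
                       + new_tokens * rates.cache_write_mult)
    else:
        input_units = prefix_tokens + new_tokens
    input_cost = input_units * rates.input_per_m
    # Reasoning tokens are billed as output even though they are hidden.
    output_cost = (output_tokens + reasoning_tokens) * rates.output_per_m
    return attempts * (input_cost + output_cost) / 1_000_000


def run_cost(rates, steps, base_tokens, step_tokens, output_tokens,
             reasoning_tokens, attempts=1, cached=True):
    """Total cost of an accumulating run that re-sends the full transcript."""
    total, prefix, new = 0.0, 0, base_tokens
    for _ in range(steps):
        total += step_cost(rates, prefix, new, output_tokens,
                           reasoning_tokens, attempts, cached)
        prefix += new      # everything sent so far becomes the next prefix
        new = step_tokens  # tokens each later step appends
    return total


if __name__ == "__main__":
    rates = Rates(input_per_m=3.00, output_per_m=15.00)  # placeholder prices
    args = dict(steps=30, base_tokens=4000, step_tokens=1500,
                output_tokens=400, reasoning_tokens=600)
    print(f"cached:   ${run_cost(rates, cached=True, **args):.4f}")
    print(f"uncached: ${run_cost(rates, cached=False, **args):.4f}")
```

Note that on the first step the cached path costs more than the uncached one, because it pays the write premium before any cache read occurs.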
Frequently asked questions
- Why does an agent cost so much more than a single chat?
- Because the harness re-sends a growing context every step and adds retries, tool calls, and reasoning tokens. A ten-step run can bill many times the tokens of one call, especially without caching.
- How much does prompt caching save on an agent run?
- A lot on long runs, because the repeated prefix is charged at the cache-read rate (about a tenth of the input rate) after a one-time write premium. The simulator shows the cached cost beside the uncached counterfactual.
- What is the power-user risk?
- A small number of users running long or looping agents can dominate spend. Forecasting a heavy run shows whether your pricing or caps survive that, before it shows up on the invoice.
- Can I trust the forecast?
- The prices and cache model are validated against real public usage logs; the step-count and token-shape coefficients are directional seed values until you calibrate them to your own runs. Treat it as a sized estimate, not a guarantee.