The Hidden Cost Curve: How Token Billing Really Works in Claude API
March 18, 2026 · Based on 315 Claude Code sessions, 20,800+ user turns
Key takeaways
- Every API call resends the entire conversation as input — cost grows quadratically with conversation length
- Prompt caching cuts that cost by 85%, but the O(N²) shape remains
- Each API call in a tool chain re-reads the full context — a two-tool chain (three API calls) costs 3× a simple reply
- One chat response ($0.12) costs more than two agent tool calls ($0.11) at 100K context
- Context compression at ~165K tokens is the main cost-control mechanism
How chat-model billing works
LLM APIs are stateless. Every time you send a message, the API receives the full conversation history as input and generates a response as output. There is no persistent "session" on the server — the entire context is re-processed on each call.
This means the cost of turn k is not proportional to what you just typed — it's proportional to the total conversation length at that point. And since context grows with each turn, total cost across a conversation is O(N²).
Concrete example: 100 turns at 10K tokens each is not 1M total input tokens — it's 49.7M, because each turn re-sends everything before it.
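The arithmetic can be sketched in a few lines (a toy model: context grows by exactly 10K tokens per turn, no caching, no system prompt — the article's 49.7M comes from its empirical data, while this simplification lands at 50.5M, the same order of magnitude):

```python
PRICE_IN = 5.00        # $ per MTok, base input (no cache)
TURN_TOKENS = 10_000   # tokens added per turn in this toy model
TURNS = 100

# Naive view: each turn is billed only for its own tokens.
naive_total = TURNS * TURN_TOKENS

# Reality: turn k re-sends all k*10K tokens of history as input.
actual_total = sum(k * TURN_TOKENS for k in range(1, TURNS + 1))

print(naive_total)    # 1,000,000
print(actual_total)   # 50,500,000 -- ~50x the naive estimate
print(actual_total * PRICE_IN / 1_000_000)  # ~$252 without caching
```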
Claude Opus 4.6 pricing
| Component | Price per MTok | Relative cost |
|---|---|---|
| Base input (no cache) | $5.00 | 1× (baseline) |
| Cache read | $0.50 | 0.1× — main savings |
| Cache write (1h TTL) | $10.00 | 2× base — expensive but small volume |
| Output | $25.00 | 5× base — expensive but tiny volume in agent mode |
The full 1M context window is available at standard pricing — no premium for >200K tokens.
N: the hidden cost multiplier
When you send a message to Claude Code, the model doesn't just reply with text. It can trigger a chain of tool calls — read a file, run grep, edit code, run tests. Each step in the chain is a separate API call that re-reads the entire context from scratch.
Example: user asks "find and fix this bug"
API call 1: reads context (100K) → decides to search → tool_use
API call 2: reads context (100K + result) → reads file → tool_use
API call 3: reads context (100K + ...) → edits file → tool_use
API call 4: reads context (100K + ...) → responds to user
Result: N=4, context read 4 times
Cost ≈ 4 × 100K × $0.50/MTok = $0.20 (cache read alone)
Distribution of N across 20,800+ real turns:
| N (API calls) | Share of turns | What it is | Cost multiplier |
|---|---|---|---|
| N=1 | 57% | Simple text response or single tool call | ×1 (baseline) |
| N=2 | 30% | One tool call + response | ×2 |
| N=3 | 11% | Chain of 2 tool calls + response | ×3 |
| N=4+ | 2% | Long chains (search + read + edit + verify) | ×4+ |
The cost formula per user turn:
cost_turn = N × (ctx × $cache_read + cw × $cache_write + out × $output) / MTok
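The formula translates directly into a small helper (a sketch; prices are the Opus 4.6 $/MTok rates from the table above):

```python
def cost_turn(n_calls: int, ctx: int, cache_write: int, output: int,
              cr: float = 0.50, cw: float = 10.00, out_: float = 25.00) -> float:
    """Dollar cost of one user turn: N API calls, each re-reading `ctx`
    cached tokens, writing `cache_write` new tokens, emitting `output` tokens."""
    return n_calls * (ctx * cr + cache_write * cw + output * out_) / 1_000_000

# The N=4 "find and fix this bug" example above, cache reads only:
print(cost_turn(4, 100_000, 0, 0))  # 0.2
```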
Interactive explorer
[Interactive widget omitted in this version: sliders adjust usage patterns over the cost formula, and a scatter plot shows 20,800+ real data points from Claude Code sessions, with overlapping points appearing brighter.]
Conclusions
1. O(N²) growth is a fundamental property of chat APIs
Each turn resends all previous messages. Over 100 turns at 10K tokens each, total input is 49.7M tokens (not 1M). Without caching that would cost $268. The cost of each successive turn grows linearly: from $0.21 (1st) to $5.16 (100th).
2. Prompt caching reduces cost by 80–90%, but doesn't change the asymptotics
Cache read costs 10% of base input ($0.50 vs $5.00/MTok). At 1M context, a turn costs $0.79 instead of $5.20. Growth is still O(N²), but with a ×0.1 coefficient. The savings increase with context size (at 10K there is almost no saving; at 1M it is 85%).
3. Context compression is the main cost-control mechanism
Claude Code automatically compresses context at ~165K tokens, resetting it to ~50K. This breaks the quadratic growth — average turn cost stays around $0.07–0.15 instead of growing without bound. In the longest observed session (2,863 turns), 18 compressions occurred.
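A minimal simulation of the effect (assumed numbers: ~5K tokens of context growth per turn, and only cache-read cost is counted — write and output costs are omitted):

```python
CACHE_READ = 0.50 / 1_000_000  # $ per token read from cache
GROWTH = 5_000                 # assumed context growth per turn

def total_read_cost(turns: int, compress: bool,
                    start=30_000, limit=165_000, reset=50_000) -> float:
    """Total cache-read spend over a session, with or without compaction."""
    ctx, cost = start, 0.0
    for _ in range(turns):
        cost += ctx * CACHE_READ
        ctx += GROWTH
        if compress and ctx >= limit:
            ctx = reset  # compaction resets context to ~50K
    return cost

with_compression = total_read_cost(500, compress=True)
without = total_read_cost(500, compress=False)
print(with_compression, without)  # compaction keeps per-turn cost bounded
```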
4. Output is a negligible share of cost
Despite costing $25/MTok (5× more than input), output accounted for only $7.52 out of $384.78 (2%) in a real session. Total output was 300K tokens vs. 488M total input (cache read). Even with long responses (8K/turn), the dominant cost is re-reading context.
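The arithmetic behind that split (the 300K-output and 488M-cache-read figures are the session totals quoted above):

```python
OUTPUT_PRICE = 25.00      # $/MTok
CACHE_READ_PRICE = 0.50   # $/MTok

output_cost = 300_000 * OUTPUT_PRICE / 1_000_000        # $7.50 of output
read_cost = 488_000_000 * CACHE_READ_PRICE / 1_000_000  # $244 of cache reads

print(output_cost, read_cost)
print(output_cost / 384.78)  # ~0.02 -- output is ~2% of total spend
```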
5. Chat vs Agent: a detailed comparison
At 100K context, the cost difference per call comes entirely from cache write and output — cache read is the same:
| Component (at ctx=100K) | Chat (claude.ai) | Agent (Claude Code) | Why |
|---|---|---|---|
| Cache read: ctx × $0.50/MTok | $0.049 | $0.050 | Same — both read full context |
| Cache write × $10/MTok | $0.025 (2.5K tok) | $0.006 (600 tok) | Chat: long response + message. Agent: small tool_use JSON |
| Output × $25/MTok | $0.050 (2K tok) | $0.001 (26 tok) | Chat: long text. Agent: short tool call |
| Total per call | $0.124 | $0.057 | Chat is 2.2× more expensive per call |
One chat call ($0.124) costs more than two agent calls ($0.114). The difference comes from two "expensive" components:
- Cache write ($10/MTok) — 20× more expensive than cache read. Chat writes 2.5K new tokens, agent writes 600.
- Output ($25/MTok) — 50× more expensive than cache read. Chat generates 2K tokens of text, agent generates 26 tokens of JSON.
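Plugging the table's per-call token counts into the cost formula reproduces the comparison (cache read rounded to a flat 100K for both sides):

```python
CR, CW, OUT = 0.50, 10.00, 25.00  # $/MTok: cache read, cache write, output

def call_cost(ctx: int, cache_write: int, output: int) -> float:
    return (ctx * CR + cache_write * CW + output * OUT) / 1_000_000

chat  = call_cost(100_000, 2_500, 2_000)  # long text reply
agent = call_cost(100_000,   600,    26)  # short tool_use JSON

print(round(chat, 3), round(agent, 3))  # 0.125 0.057
print(chat > 2 * agent)                 # True -- one chat call > two agent calls
```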
6. Caching saves 85% in practice
Aggregate data across all sessions, plus the longest observed session (kick-compose):
| Metric | Value |
|---|---|
| Sessions analyzed | 315 |
| User turns | 20,460 |
| Context range | 5.5K – 326K tokens |
| Longest session cost (with cache) | $384.78 |
| Same session without cache | $2,513.15 |
| Savings from caching | 84.7% |
Cache reads account for 97.3% of all input tokens.
7. Context never starts at zero
Minimum context is ~30K tokens (system prompt ~6K + tool definitions ~9K + CLAUDE.md + overhead). Only 4% of turns have context below 30K. The bulk of turns sit between 30K–170K.
Practical implications
- Prompt caching is mandatory — without it, costs are 5–7× higher.
- Context is the main expense — optimize system prompt size and conversation history, not response length.
- Context compression/truncation is more effective than any prompt optimization. Resetting from 160K to 50K saves ~$0.05 on every subsequent turn.
- Memory systems (RAG, fact-based memory) become cheaper than long-context after ~10 turns at >100K context (arxiv 2603.04814).
- Batch API provides an additional 50% discount for non-interactive workloads.
Methodology
Data was collected from ~/.claude/projects/*/*.jsonl session files — 315 sessions, 20,800+ user turns, 32,600+ API calls from February–March 2026. Model distribution: 87% Claude Opus 4.6, 10% Claude Opus 4.5 (same pricing), 3% Claude Sonnet 4.6.
A user turn is defined as one user message plus all subsequent API calls until the next user message. Streaming chunks are deduplicated by UUID (keeping the one with max output_tokens). Per-turn cost is the sum of all call costs; per-turn context is the maximum context observed in that turn.
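The dedup step described above can be sketched as follows (the field names `uuid` and `usage.output_tokens` are assumptions about the JSONL schema):

```python
import json

def dedupe_calls(lines):
    """Keep one record per UUID: the streaming chunk with the largest
    output_tokens, i.e. the final cumulative usage snapshot."""
    best = {}
    for line in lines:
        rec = json.loads(line)
        uid = rec.get("uuid")
        if uid is None:
            continue
        out = rec.get("usage", {}).get("output_tokens", 0)
        if uid not in best or out > best[uid][0]:
            best[uid] = (out, rec)
    return [rec for _, rec in best.values()]

sample = [
    '{"uuid": "a", "usage": {"output_tokens": 10}}',
    '{"uuid": "a", "usage": {"output_tokens": 120}}',  # later chunk wins
    '{"uuid": "b", "usage": {"output_tokens": 5}}',
]
print([r["usage"]["output_tokens"] for r in dedupe_calls(sample)])  # [120, 5]
```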
Theoretical curves use the formula cost = N × (ctx × CR + cw × CW + out × OUT) with official pricing. Empirical lines are per-bin medians stratified by N.
Source: Claude API pricing documentation