The Hidden Cost Curve: How Token Billing Really Works in Claude API
March 18, 2026 · Based on 315 Claude Code sessions, 20,800+ user turns
Key takeaways
- Every API call resends the entire conversation as input — cost grows quadratically with conversation length
- Prompt caching cuts that cost by 85%, but the O(N²) shape remains
- Each API call in a tool chain re-reads the full context — a two-tool chain (three API calls) costs 3× a simple reply
- One chat response ($0.12) costs more than two agent tool calls ($0.11) at 100K context
- Context compression at ~165K tokens is the main cost-control mechanism
How chat-model billing works
LLM APIs are stateless. Every time you send a message, the API receives the full conversation history as input and generates a response as output. There is no persistent "session" on the server — the entire context is re-processed on each call.
This means the cost of turn k is not proportional to what you just typed — it's proportional to the total conversation length at that point. And since context grows with each turn, total cost across a conversation is O(N²).
Concrete example: 100 turns at 10K tokens each is not 1M total input tokens — it's 49.7M, because each turn re-sends everything before it.
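The arithmetic can be sketched in a few lines (a toy model: context grows by exactly 10K tokens per turn, no caching, no system prompt — the article's 49.7M comes from its empirical data, while this simplification lands at 50.5M, the same order of magnitude):

```python
PRICE_IN = 5.00        # $ per MTok, base input (no cache)
TURN_TOKENS = 10_000   # tokens added per turn in this toy model
TURNS = 100

# Naive view: each turn is billed only for its own tokens.
naive_total = TURNS * TURN_TOKENS

# Reality: turn k re-sends all k*10K tokens of history as input.
actual_total = sum(k * TURN_TOKENS for k in range(1, TURNS + 1))

print(naive_total)    # 1,000,000
print(actual_total)   # 50,500,000 -- ~50x the naive estimate
print(actual_total * PRICE_IN / 1_000_000)  # ~$252 without caching
```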
Claude Opus 4.6 pricing
| Component | Price per MTok | Relative cost |
|---|---|---|
| Base input (no cache) | $5.00 | 1× (baseline) |
| Cache read | $0.50 | 0.1× — main savings |
| Cache write (1h TTL) | $10.00 | 2× base — expensive but small volume |
| Output | $25.00 | 5× base — expensive but tiny volume in agent mode |
The full 1M context window is available at standard pricing — no premium for >200K tokens.
N: the hidden cost multiplier
When you send a message to Claude Code, the model doesn't just reply with text. It can trigger a chain of tool calls — read a file, run grep, edit code, run tests. Each step in the chain is a separate API call that re-reads the entire context from scratch.
Example: user asks "find and fix this bug"
API call 1: reads context (100K) → decides to search → tool_use
API call 2: reads context (100K + result) → reads file → tool_use
API call 3: reads context (100K + ...) → edits file → tool_use
API call 4: reads context (100K + ...) → responds to user
Result: N=4, context read 4 times
Cost ≈ 4 × 100K × $0.50/MTok = $0.20 (cache read alone)
Distribution of N across 20,800+ real turns:
| N (API calls) | Share of turns | What it is | Cost multiplier |
|---|---|---|---|
| N=1 | 57% | Simple text response or single tool call | ×1 (baseline) |
| N=2 | 30% | One tool call + response | ×2 |
| N=3 | 11% | Chain of 2 tool calls + response | ×3 |
| N=4+ | 2% | Long chains (search + read + edit + verify) | ×4+ |
The cost formula per user turn:
cost_turn = N × (ctx × $cache_read + cw × $cache_write + out × $output) / MTok
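The formula translates directly into a small helper (a sketch; prices are the Opus 4.6 $/MTok rates from the table above):

```python
def cost_turn(n_calls: int, ctx: int, cache_write: int, output: int,
              cr: float = 0.50, cw: float = 10.00, out_: float = 25.00) -> float:
    """Dollar cost of one user turn: N API calls, each re-reading `ctx`
    cached tokens, writing `cache_write` new tokens, emitting `output` tokens."""
    return n_calls * (ctx * cr + cache_write * cw + output * out_) / 1_000_000

# The N=4 "find and fix this bug" example above, cache reads only:
print(cost_turn(4, 100_000, 0, 0))  # 0.2
```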
Interactive explorer
[Interactive widget omitted in this version: sliders adjust usage patterns over the cost formula, and a scatter plot shows 20,800+ real data points from Claude Code sessions, with overlapping points appearing brighter.]
Conclusions
1. O(N²) growth is a fundamental property of chat APIs
Each turn resends all previous messages. Over 100 turns at 10K tokens each, total input is 49.7M tokens (not 1M). Without caching that would cost $268. The cost of each successive turn grows linearly: from $0.21 (1st) to $5.16 (100th).
2. Prompt caching reduces cost by 80–90%, but doesn't change the asymptotics
Cache read costs 10% of base input ($0.50 vs $5.00/MTok). At 1M context, a turn costs $0.79 instead of $5.20. Growth is still O(N²), but with a ×0.1 coefficient. The savings increase with context size (at 10K there is almost no saving; at 1M it is 85%).
3. Context compression is the main cost-control mechanism
Claude Code automatically compresses context at ~165K tokens, resetting it to ~50K. This breaks the quadratic growth — average turn cost stays around $0.07–0.15 instead of growing without bound. In the longest observed session (2,863 turns), 18 compressions occurred.
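A minimal simulation of the effect (assumed numbers: ~5K tokens of context growth per turn, and only cache-read cost is counted — write and output costs are omitted):

```python
CACHE_READ = 0.50 / 1_000_000  # $ per token read from cache
GROWTH = 5_000                 # assumed context growth per turn

def total_read_cost(turns: int, compress: bool,
                    start=30_000, limit=165_000, reset=50_000) -> float:
    """Total cache-read spend over a session, with or without compaction."""
    ctx, cost = start, 0.0
    for _ in range(turns):
        cost += ctx * CACHE_READ
        ctx += GROWTH
        if compress and ctx >= limit:
            ctx = reset  # compaction resets context to ~50K
    return cost

with_compression = total_read_cost(500, compress=True)
without = total_read_cost(500, compress=False)
print(with_compression, without)  # compaction keeps per-turn cost bounded
```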
4. Output is a negligible share of cost
Despite costing $25/MTok (5× more than input), output accounted for only $7.52 out of $384.78 (2%) in a real session. Total output was 300K tokens vs. 488M total input (cache read). Even with long responses (8K/turn), the dominant cost is re-reading context.
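The arithmetic behind that split (the 300K-output and 488M-cache-read figures are the session totals quoted above):

```python
OUTPUT_PRICE = 25.00      # $/MTok
CACHE_READ_PRICE = 0.50   # $/MTok

output_cost = 300_000 * OUTPUT_PRICE / 1_000_000        # $7.50 of output
read_cost = 488_000_000 * CACHE_READ_PRICE / 1_000_000  # $244 of cache reads

print(output_cost, read_cost)
print(output_cost / 384.78)  # ~0.02 -- output is ~2% of total spend
```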
5. Chat vs Agent: a detailed comparison
At 100K context, the cost difference per call comes entirely from cache write and output — cache read is the same:
| Component (at ctx=100K) | Chat (claude.ai) | Agent (Claude Code) | Why |
|---|---|---|---|
| Cache read: ctx × $0.50/MTok | $0.049 | $0.050 | Same — both read full context |
| Cache write × $10/MTok | $0.025 (2.5K tok) | $0.006 (600 tok) | Chat: long response + message. Agent: small tool_use JSON |
| Output × $25/MTok | $0.050 (2K tok) | $0.001 (26 tok) | Chat: long text. Agent: short tool call |
| Total per call | $0.124 | $0.057 | Chat is 2.2× more expensive per call |
One chat call ($0.124) costs more than two agent calls ($0.114). The difference comes from two "expensive" components:
- Cache write ($10/MTok) — 20× more expensive than cache read. Chat writes 2.5K new tokens, agent writes 600.
- Output ($25/MTok) — 50× more expensive than cache read. Chat generates 2K tokens of text, agent generates 26 tokens of JSON.
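Plugging the table's per-call token counts into the cost formula reproduces the comparison (cache read rounded to a flat 100K for both sides):

```python
CR, CW, OUT = 0.50, 10.00, 25.00  # $/MTok: cache read, cache write, output

def call_cost(ctx: int, cache_write: int, output: int) -> float:
    return (ctx * CR + cache_write * CW + output * OUT) / 1_000_000

chat  = call_cost(100_000, 2_500, 2_000)  # long text reply
agent = call_cost(100_000,   600,    26)  # short tool_use JSON

print(round(chat, 3), round(agent, 3))  # 0.125 0.057
print(chat > 2 * agent)                 # True -- one chat call > two agent calls
```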
6. Caching saves 85% in practice
Aggregate data across all sessions, plus the longest observed session (kick-compose):
| Metric | Value |
|---|---|
| Sessions analyzed | 315 |
| User turns | 20,460 |
| Context range | 5.5K – 326K tokens |
| Longest session cost (with cache) | $384.78 |
| Same session without cache | $2,513.15 |
| Savings from caching | 84.7% |
Cache reads account for 97.3% of all input tokens.
7. Context never starts at zero
Minimum context is ~30K tokens (system prompt ~6K + tool definitions ~9K + CLAUDE.md + overhead). Only 4% of turns have context below 30K. The bulk of turns sit between 30K–170K.
Practical implications
- Prompt caching is mandatory — without it, costs are 5–7× higher.
- Context is the main expense — optimize system prompt size and conversation history, not response length.
- Context compression/truncation is more effective than any prompt optimization. Resetting from 160K to 50K saves ~$0.05 on every subsequent turn.
- Memory systems (RAG, fact-based memory) become cheaper than long-context after ~10 turns at >100K context (arxiv 2603.04814).
- Batch API provides an additional 50% discount for non-interactive workloads.
Methodology
Data was collected from ~/.claude/projects/*/*.jsonl session files — 315 sessions, 20,800+ user turns, 32,600+ API calls from February–March 2026. Model distribution: 87% Claude Opus 4.6, 10% Claude Opus 4.5 (same pricing), 3% Claude Sonnet 4.6.
A user turn is defined as one user message plus all subsequent API calls until the next user message. Streaming chunks are deduplicated by UUID (keeping the one with max output_tokens). Per-turn cost is the sum of all call costs; per-turn context is the maximum context observed in that turn.
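The dedup step described above can be sketched as follows (the field names `uuid` and `usage.output_tokens` are assumptions about the JSONL schema):

```python
import json

def dedupe_calls(lines):
    """Keep one record per UUID: the streaming chunk with the largest
    output_tokens, i.e. the final cumulative usage snapshot."""
    best = {}
    for line in lines:
        rec = json.loads(line)
        uid = rec.get("uuid")
        if uid is None:
            continue
        out = rec.get("usage", {}).get("output_tokens", 0)
        if uid not in best or out > best[uid][0]:
            best[uid] = (out, rec)
    return [rec for _, rec in best.values()]

sample = [
    '{"uuid": "a", "usage": {"output_tokens": 10}}',
    '{"uuid": "a", "usage": {"output_tokens": 120}}',  # later chunk wins
    '{"uuid": "b", "usage": {"output_tokens": 5}}',
]
print([r["usage"]["output_tokens"] for r in dedupe_calls(sample)])  # [120, 5]
```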
Theoretical curves use the formula cost = N × (ctx × CR + cw × CW + out × OUT) with official pricing. Empirical lines are per-bin medians stratified by N.
Source: Claude API pricing documentation