Token Counting Everywhere
Your Token Numbers Are Wrong
Go look at your LLM spending dashboard right now. If you are attributing cost by feature, by customer, or by endpoint, I will bet that the numbers are at least 10–30% off from what you are actually being billed. Not because you are not collecting data — because the data you are collecting is from the wrong place, at the wrong time, with the wrong tokenizer.
Token accounting sounds like bookkeeping. It is actually a systems problem: you have three independent sources of token counts (server response, client estimation, database storage), and they disagree in ways that compound over time. This post explains the gaps and shows you how to close them.
Where Counts Come From — and Why They Diverge
Every LLM API response includes usage metadata: prompt tokens, completion tokens, total tokens. That number is authoritative. The problem is not that the server lies — it is that teams supplement server counts with client-side estimates and never reconcile the two.
Three paths. Three numbers. The pre-flight estimate drives product decisions (context budgeting, cost gates). The server response drives billing. The analytics database may be pulling from neither — it might be computing tokens from stored prompt text with a heuristic that was accurate for GPT-3 and is now 15% off for GPT-4o.
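To see how far a stored-text heuristic can drift, run the same string through a legacy approximation and the current tokenizer. This is a minimal sketch: the four-characters-per-token rule of thumb stands in for whatever heuristic your analytics pipeline uses, and the server's usage.prompt_tokens remains the third number, only available from the API response itself.

```python
import tiktoken

prompt = "Summarize the customer's last five support tickets and flag any refund requests."

heuristic = len(prompt) // 4  # legacy chars/4 approximation often baked into analytics jobs
actual = len(tiktoken.get_encoding("o200k_base").encode(prompt))  # GPT-4o tokenizer

print(f"heuristic estimate: {heuristic} tokens")
print(f"o200k_base count:   {actual} tokens")
# The server's usage.prompt_tokens is the third number -- it also includes
# chat-format overhead, so it will not match either of these exactly.
```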
Tokenizer Drift: The Silent Budget Killer
Tokenizers change between model versions. OpenAI moved from cl100k_base (GPT-3.5/4) to o200k_base (GPT-4o). The same input string tokenizes differently — often 5–15% fewer tokens under the newer tokenizer due to longer token merges.
If your cost-gate logic still uses cl100k_base while the production model uses o200k_base, you are blocking requests that would have fit, and your context-utilization metrics report more usage than the model actually sees.
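You can see the drift directly by encoding the same text with both tokenizers; a quick sketch, with the exact percentage depending on your content mix:

```python
import tiktoken

text = "Attribution requires counting tokens with the tokenizer the model actually uses."

old = len(tiktoken.get_encoding("cl100k_base").encode(text))  # GPT-3.5 / GPT-4
new = len(tiktoken.get_encoding("o200k_base").encode(text))   # GPT-4o

print(f"cl100k_base: {old} tokens, o200k_base: {new} tokens")
print(f"drift: {(old - new) / old:.1%}")
```

The fix is to derive the encoding from the model name instead of pinning one: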
```python
import tiktoken

def count_tokens(text: str, model: str) -> int:
    """Always derive the encoding from the model name, never hard-code it."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back for fine-tuned or custom model names
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))

# Bad: hard-coded encoding
# enc = tiktoken.get_encoding("cl100k_base")  # wrong for gpt-4o

# Good: model-derived encoding
tokens = count_tokens(prompt, model="gpt-4o")
```

Make the model name a parameter everywhere you count tokens. Never hard-code an encoding. When you upgrade a model, the tokenizer updates automatically.
Counting at the Right Boundary
The most common mistake: counting tokens on the final assembled prompt string at call time, then storing only that count. You lose the breakdown.
Count and store token counts at every composition boundary:
```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    system_prompt: int = 0
    conversation_history: int = 0
    retrieved_context: int = 0
    user_message: int = 0
    reserved_completion: int = 0

    @property
    def total_input(self) -> int:
        return (
            self.system_prompt
            + self.conversation_history
            + self.retrieved_context
            + self.user_message
        )

    @property
    def total_with_reserve(self) -> int:
        return self.total_input + self.reserved_completion


def build_prompt(
    system: str,
    history: list[dict],
    context_chunks: list[str],
    user_msg: str,
    model: str,
    max_completion: int = 1024,
) -> tuple[list[dict], TokenBudget]:
    budget = TokenBudget(
        system_prompt=count_tokens(system, model),
        conversation_history=sum(
            count_tokens(m["content"], model) for m in history
        ),
        retrieved_context=sum(count_tokens(c, model) for c in context_chunks),
        user_message=count_tokens(user_msg, model),
        reserved_completion=max_completion,
    )
    messages = [
        {"role": "system", "content": system},
        *history,
        {"role": "user", "content": "\n\n".join(context_chunks) + "\n\n" + user_msg},
    ]
    return messages, budget
```

Store the full `TokenBudget` alongside the response usage. Now you can answer questions like "how much of our token spend is going to retrieved context versus conversation history?" — which is the question that actually drives retrieval optimization decisions.
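If you want a quick picture of where prompt tokens go before building the full pipeline below, a few lines over the stored records will do. A minimal sketch, assuming each stored event carries the budget components plus the server's prompt count; the field names match the illustrative telemetry schema in the next section.

```python
def context_vs_history_share(events: list[dict]) -> dict[str, float]:
    """Fraction of billed prompt tokens attributable to retrieved context vs. history."""
    total_prompt = sum(e["actual_prompt_tokens"] for e in events) or 1  # avoid /0 on empty input
    return {
        "retrieved_context": sum(e["est_context_tokens"] for e in events) / total_prompt,
        "conversation_history": sum(e["est_history_tokens"] for e in events) / total_prompt,
    }
```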
Attribution: Connecting Tokens to Features and Customers
Token counts without attribution are useless for cost optimization. You need to know which feature or customer is driving which spend, not just the aggregate.
Instrument every LLM call with structured metadata and pipe it to a time-series store or OLAP database:
```python
import time
import uuid

from openai import OpenAI

client = OpenAI()

def call_with_attribution(
    messages: list[dict],
    model: str,
    budget: TokenBudget,
    feature: str,
    customer_id: str,
    tenant_id: str,
):
    request_id = str(uuid.uuid4())
    t0 = time.monotonic()

    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )

    latency_ms = (time.monotonic() - t0) * 1000
    usage = response.usage

    # Emit structured telemetry event
    emit_usage_event({
        "request_id": request_id,
        "timestamp": time.time(),
        "model": model,
        "feature": feature,
        "customer_id": customer_id,
        "tenant_id": tenant_id,
        # Estimated (client-side)
        "est_prompt_tokens": budget.total_input,
        "est_system_tokens": budget.system_prompt,
        "est_context_tokens": budget.retrieved_context,
        "est_history_tokens": budget.conversation_history,
        "est_user_tokens": budget.user_message,
        # Authoritative (server-side)
        "actual_prompt_tokens": usage.prompt_tokens,
        "actual_completion_tokens": usage.completion_tokens,
        "actual_total_tokens": usage.total_tokens,
        # Derived
        "estimation_error_pct": (
            (usage.prompt_tokens - budget.total_input) / usage.prompt_tokens * 100
        ),
        "latency_ms": latency_ms,
    })
    return response
```

The `estimation_error_pct` field is the key diagnostic. If it consistently runs above 5%, you have a tokenizer mismatch. If it spikes on certain features, those features are assembling prompts in ways your client-side counter does not model correctly (images, tool schemas, special tokens).
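One gap worth modeling explicitly is chat-format overhead: the server wraps every message in special tokens, so a counter that only sums message contents will always undercount slightly. Here is a hedged sketch using the per-message constants from OpenAI's token-counting cookbook for recent chat models; treat the exact numbers as approximations that can change between model versions.

```python
def estimate_chat_prompt_tokens(messages: list[dict], model: str) -> int:
    """Approximate prompt tokens including chat-format overhead.

    The 3-tokens-per-message and 3-token reply-priming constants follow
    OpenAI's token-counting cookbook for gpt-3.5/gpt-4-family chat models;
    they are approximations, not guarantees.
    """
    tokens_per_message = 3  # wrapper tokens around each message (approximate)
    total = 0
    for message in messages:
        total += tokens_per_message
        for value in message.values():
            total += count_tokens(str(value), model)
    total += 3  # every reply is primed with an assistant header (approximate)
    return total
```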
The Dashboard That Actually Helps
Once you have the attribution data flowing, the useful views are not the ones most observability tools default to.
Avoid: total tokens per day (aggregate, no signal).

Build instead:

- Cost per feature, per week, normalized by request count. Detects feature-level regressions when a prompt change inflates context.
- Context token fraction by feature. For RAG features: `est_context_tokens / actual_prompt_tokens`. If this climbs above 70%, retrieval is over-fetching.
- Estimation error by model version. When you roll out a new model, watch this metric. A jump signals tokenizer drift.
- Completion/prompt token ratio by feature. A feature where completions are 3x prompt tokens is generating verbose output — often a prompt bug.
```sql
-- Cost breakdown by feature (ClickHouse / BigQuery)
SELECT
    feature,
    SUM(actual_total_tokens) AS total_tokens,
    SUM(actual_total_tokens) / COUNT(*) AS tokens_per_request,
    AVG(estimation_error_pct) AS avg_estimation_error,
    SUM(actual_completion_tokens) * 1.0 / SUM(actual_prompt_tokens) AS completion_ratio
FROM llm_usage_events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY feature
ORDER BY total_tokens DESC
```

Run this query weekly. The features with the highest tokens-per-request are your optimization targets.
Reconciling Against the Invoice
The final step: validate your telemetry against the actual billing invoice. Do this monthly.
Your telemetry should match the invoice within 2–3%. Larger gaps indicate:
- Requests that bypassed your instrumented client (direct API calls from scripts, third-party integrations)
- Streaming responses where you are not accumulating usage from the final chunk
- Batch API calls that are not flowing through the same instrumentation path
For streaming responses, the usage data is in the final chunk, not in each delta:
```python
# Correct: request usage in the stream, then collect it from the final chunk
usage = None
stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    stream_options={"include_usage": True},  # without this, no usage chunk is sent
)
for chunk in stream:
    if chunk.usage is not None:
        usage = chunk.usage  # only the final, otherwise-empty chunk carries usage
    if chunk.choices:
        ...  # process chunk.choices[0].delta.content
# usage.prompt_tokens and usage.completion_tokens are now populated
```

If you are missing usage data on streaming calls, that is likely where your reconciliation gap is hiding.
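To make the monthly check mechanical, compare what your telemetry saw against the invoice line items. A minimal sketch, assuming you can export both as per-model token totals; the input dictionaries and the 3% threshold are illustrative.

```python
def reconcile_against_invoice(
    telemetry_tokens: dict[str, int],   # model -> total tokens from your usage events
    invoice_tokens: dict[str, int],     # model -> total tokens from the billing export
    threshold_pct: float = 3.0,
) -> list[str]:
    """Return warnings for models whose telemetry/invoice gap exceeds the threshold."""
    warnings = []
    for model in sorted(set(telemetry_tokens) | set(invoice_tokens)):
        billed = invoice_tokens.get(model, 0)
        observed = telemetry_tokens.get(model, 0)
        if billed == 0:
            continue
        gap_pct = (billed - observed) / billed * 100
        if abs(gap_pct) > threshold_pct:
            warnings.append(
                f"{model}: telemetry {observed:,} vs invoice {billed:,} tokens "
                f"({gap_pct:+.1f}% gap) -- check for uninstrumented call paths"
            )
    return warnings
```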
Key Takeaways
- Client-side estimates, server usage metadata, and database heuristics are three independent sources that diverge — the server response is the only authoritative count.
- Tokenizer drift between model versions causes systematic estimation errors; always derive the tokenizer from the model name, never hard-code it.
- Count tokens at every prompt composition boundary (system, history, context, user message) to enable cost attribution by component, not just by request.
- Attribute every LLM call with feature and customer metadata; aggregate token counts are nearly useless for optimization decisions.
- Track estimation error percentage as a diagnostic metric; sustained error above 5% signals a tokenizer mismatch or prompt assembly gap.
- Reconcile your telemetry against the billing invoice monthly — gaps above 3% reveal instrumentation blind spots.