Where Your LLM Tokens Actually Go
You profile CPU and memory as a reflex. The context window is the most expensive resource in an AI app, and almost nobody profiles it. So I built llmprof - pprof for your LLM context. It is a local proxy that flame-graphs every request's tokens, prices the call, and tells you what to cut. This is what it shows and how it works.
An agent call that should be simple arrives at the model carrying twenty thousand tokens of context. Some of it is the system prompt. A lot of it is tool schemas, most of which this particular call never uses. The rest is conversation history that has been quietly accreting since turn one, including the same file pasted three times. You pay for all of it, on every single turn, and you have almost no idea how it splits.
We would never accept this for CPU. If a service were slow you would pull a profile and see exactly which function burned the cycles. But for the context window - the thing you pay per token for, on every request, forever - the best most teams have is the provider's billing page. That is not a profiler. It is a meter.
01 A bill is not a profile
Your provider dashboard tells you that you spent $4,000 last month and that it trended up. It will not tell you that 38% of every request is tool schemas, that a quarter of those schemas are for tools the model never calls, or that your retrieval step keeps pasting the same document into context turn after turn. Those are the facts that change what you build, and a meter cannot surface a single one of them.
The unit that matters is not "dollars this month." It is "tokens in this request, broken down by what put them there." Get that and the optimization work stops being guesswork. That breakdown has an obvious shape, and the shape is a flame graph.
02 The context flame graph
Take one request's context window and treat it like a sampled profile, except the resource is tokens, not CPU samples. The root is the whole window. Each child is a category - system prompt, tool schemas, history, the user's message - and a frame's width is the share of tokens it owns. Tool schemas split into individual tools; history splits into tool results and prior turns. Wide frames are where your tokens, and therefore your dollars, actually go.
Here is a real-shaped one from a coding agent. Scan for the widest frames, then look for the ones you would not have guessed. Hover any frame for its token count and share; click to zoom into its subtree; reset to zoom back out.
The user's actual question is the sliver on the right. The two widest things in the
window are tool schemas and history - and inside them are the findings: a
browser tool definition worth 2,300 tokens that this agent never calls, and
a file that retrieval pasted into context three separate times. Neither is visible on a
bill. Both are obvious here in seconds.
03 From breakdown to reclaimable dollars
A breakdown is interesting; a number you can act on is useful. So on top of the attribution llmprof runs a waste detector that looks for three things it can price: duplicated content (the same chunk repeated across the window or across turns), tool schemas that are never called, and stable prefixes that are not being cached when the provider would happily cache them. It rolls those into a single "reclaimable per month" figure, projected from your actual traffic, with cache advice that is aware of each provider's caching model.
The throughline: attribute every token to what put it there, price it against the real per-model rates, then point at the specific tokens you are paying for and not using. The flame graph makes it legible; the waste detector makes it a dollar figure.
04 How it works, and why it stays local
The mechanism is deliberately boring: llmprof is a proxy that speaks the OpenAI and
Anthropic wire formats. You point your client's base URL at localhost:4000
and your API key passes straight through to the real provider. The request is forwarded
unchanged and the response streams straight back; the tokenizing, attribution, pricing,
and waste detection all happen off the hot path, so it adds essentially no
latency to the call.
Two design choices matter. First, it is local by default: prompts, completions, and keys only ever go to the upstream provider you already use. Traces land in a SQLite file you own. Nothing is sent to me, to a cloud, or to any third party, which is what makes it safe to run against production traffic. Second, pricing is offline: a snapshot of rates for 1,000+ models ships inside the package, with curated rates for the newest flagships, so cost is correct even with no network and you are never one upstream outage away from a broken profiler.
05 The payoff
One base-URL change and the context window stops being a black box. You see which prompt template drives the bill, watch history balloon turn over turn across an agent run, and get a concrete list of tokens to cut. "Our LLM costs are creeping up" stops being a vague worry and becomes a flame graph with the expensive frames highlighted.
llmprof is open source (MIT) and out now. Want to see it before installing anything? Try the live dashboard in your browser - the real interface on a recorded session, nothing to set up. Then profile your own context in about a minute:
pipx install llmprof && llmprof up # or, no Python: npx llmprof up
Point a client at the proxy, open localhost:4000, and look at where your
tokens went. Code and docs:
github.com/luthraG/llmprof
· live demo
· documentation
· PyPI
· npm.
Building AI infrastructure or LLM tooling and want a second pair of hands? luthra.zenith@gmail.com
← all writing