Continuous Profiling: Finding Where the CPU Actually Goes

Metrics tell you a service is slow. Traces tell you which request. Profiling tells you the exact function - and continuous profiling does it in production, always on, at about 1% overhead. Here is how sampling profilers work, how to read a flame graph, and what it takes to support every runtime at once.

You get paged: a service's CPU is pinned and latency is climbing. Your dashboards confirm it - CPU at 90%, p99 doubled. Your traces show the slow endpoint. And then you are stuck, because none of those tools answer the only question that fixes the problem: which code is burning the CPU?

That is the gap profiling fills. Metrics are aggregates, traces are request paths, but a profile is a breakdown of where the machine actually spent its cycles, down to the function and line. The catch has always been that profiling was something you did locally, on your laptop, against a synthetic workload - which behaves nothing like production at 3am under real traffic. Continuous profiling closes that gap: it runs always-on, in production, cheaply enough to leave on forever.

01 Sample, don't instrument

The naive way to measure where time goes is to instrument every function - record a timestamp on entry and exit. That is accurate and ruinously expensive: you pay on every single call, and the measurement itself distorts what you are measuring.
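To make the cost concrete, here is a minimal sketch of entry/exit instrumentation as a Python decorator. The names `instrumented`, `call_counts`, and `total_seconds` are made up for illustration; the point is only that the bookkeeping runs once per call, so its cost grows with traffic.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function totals gathered by the instrumentation (illustrative names).
call_counts = defaultdict(int)
total_seconds = defaultdict(float)

def instrumented(fn):
    """Wrap fn so every call records a timestamp on entry and on exit."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()              # timestamp on entry
        try:
            return fn(*args, **kwargs)
        finally:
            # Timestamp on exit. This bookkeeping runs on every single call.
            total_seconds[fn.__name__] += time.perf_counter() - start
            call_counts[fn.__name__] += 1
    return wrapper

@instrumented
def add(a, b):
    return a + b

for i in range(10_000):
    add(i, 1)
```

For a function as small as `add`, the two clock reads and two dictionary updates likely cost more than the work being measured, which is the distortion described above.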

Sampling profilers take the opposite approach. Many times per second - say 100 Hz - the profiler interrupts the program and records just one thing: the current call stack. That's it. Each sample is cheap, and over thousands of samples a statistical picture emerges: a function that shows up in 40% of samples was on-CPU roughly 40% of the time. You never counted a single function call, yet you know, to within sampling error, where the time went.

At a fixed frequency the profiler snapshots the live call stack; functions that appear in more samples were on-CPU more often.
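A toy version of that loop, as a sketch rather than a production design: it assumes CPython, uses `sys._current_frames()` to peek at another thread's stack, and samples on wall-clock time from inside the process. Real profilers typically use signals or kernel facilities and sample on-CPU time; `sample_thread`, `share_by_function`, and `busy_loop` are illustrative names.

```python
import sys
import threading
import time
from collections import Counter

def sample_thread(thread_id, hz=100, duration=0.5):
    """Snapshot one thread's call stack `hz` times per second.

    Returns a list of stacks; each stack is a tuple of function names,
    outermost caller first.
    """
    samples = []
    interval = 1.0 / hz
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)   # CPython-specific
        stack = []
        while frame is not None:                       # walk callee -> caller
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            samples.append(tuple(reversed(stack)))
        time.sleep(interval)
    return samples

def share_by_function(samples):
    """Fraction of samples in which each function appears anywhere on the stack."""
    if not samples:
        return {}
    seen = Counter()
    for stack in samples:
        seen.update(set(stack))    # count once per sample, even if recursive
    return {name: n / len(samples) for name, n in seen.items()}

def busy_loop(stop):
    while not stop.is_set():
        sum(i * i for i in range(1000))

stop = threading.Event()
worker = threading.Thread(target=busy_loop, args=(stop,))
worker.start()
samples = sample_thread(worker.ident, hz=100, duration=0.5)
stop.set()
worker.join()
shares = share_by_function(samples)
```

Nothing in `busy_loop` was modified or wrapped; the share attributed to it comes entirely from how often it was found on the stack.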

Because the cost is fixed per sample rather than per call, overhead is set by the sampling rate, not by how many calls the service makes, and stays around 1% however busy it gets. That is the number that makes "always on in production" possible.
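The arithmetic behind that claim is short. A rough sketch, assuming a stack capture costs about 100 microseconds and a timed call about 100 nanoseconds; both figures are illustrative assumptions, not measurements from any particular profiler.

```python
def sampling_overhead(hz, seconds_per_sample):
    """Fraction of one core spent taking samples: fixed per second of runtime."""
    return hz * seconds_per_sample

def instrumentation_overhead(calls_per_second, seconds_per_call):
    """Fraction of one core spent on entry/exit bookkeeping: scales with load."""
    return calls_per_second * seconds_per_call

# Assumed, illustrative costs.
COST_PER_SAMPLE = 100e-6   # 100 microseconds to capture one stack
COST_PER_CALL = 100e-9     # 100 nanoseconds of bookkeeping per instrumented call

for calls_per_second in (10_000, 1_000_000, 10_000_000):
    s = sampling_overhead(100, COST_PER_SAMPLE)
    i = instrumentation_overhead(calls_per_second, COST_PER_CALL)
    print(f"{calls_per_second:>10,} calls/s  sampling {s:.1%}  instrumentation {i:.1%}")
```

Under those assumptions, sampling at 100 Hz costs 1% of a core at every load level, while per-call instrumentation climbs from 0.1% to 100% of a core as the call rate goes from ten thousand to ten million calls per second.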

02 The flame graph

Thousands of stack samples are useless as a list. The flame graph is how you make them legible. Every sampled stack is merged into a tree: the bottom frame is the entry point, and each frame above it is a function called by the frame below. A frame's width is proportional to how many samples contained it - which is to say, how much CPU it and its children used. Wide frames are where your time goes. Tall stacks are deep call chains.
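The merge step is small enough to sketch. The version below assumes each sample is a tuple of function names, outermost caller first; `Frame`, `build_tree`, `render`, and the sample data are all invented for illustration, and the text rendering is a stand-in for a real flame graph.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    samples: int = 0                       # samples that passed through this frame
    children: dict = field(default_factory=dict)

def build_tree(stacks):
    """Merge sampled stacks (outermost caller first) into one tree."""
    root = Frame("all")
    for stack in stacks:
        root.samples += 1
        node = root
        for name in stack:
            node = node.children.setdefault(name, Frame(name))
            node.samples += 1
    return root

def render(node, total, depth=0):
    """One text line per frame: indentation is stack depth, bar length is width."""
    width = node.samples / total
    lines = [f"{'  ' * depth}{node.name:<12} {'#' * round(width * 20)} {width:.0%}"]
    for child in sorted(node.children.values(), key=lambda c: -c.samples):
        lines.extend(render(child, total, depth + 1))
    return lines

# Ten made-up samples from a hypothetical request handler.
stacks = (
    [("main", "handle", "db_query", "scan_rows")] * 6
    + [("main", "handle", "db_query")] * 2
    + [("main", "handle", "render")] * 1
    + [("main", "gc")] * 1
)
tree = build_tree(stacks)
print("\n".join(render(tree, tree.samples)))
```

Stacks that share a prefix collapse into the same nodes, which is why thousands of samples reduce to a tree small enough to draw.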

The trick to reading one: scan left to right for the widest frames, ignore the tall-but-thin ones (deep but cheap), and look for a wide frame whose children are all narrow - that frame is doing expensive work itself.
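That reading rule can be stated as numbers: a frame's total share is how often it appears anywhere on the stack, and its self share is how often it is the innermost frame. A small sketch, assuming non-empty stacks listed outermost caller first; the function names and sample data are invented.

```python
from collections import Counter

def self_and_total(stacks):
    """Count, per function, samples where it is the leaf (self) and where it appears at all (total)."""
    total = Counter()
    self_ = Counter()
    for stack in stacks:
        total.update(set(stack))   # appears anywhere on the stack
        self_[stack[-1]] += 1      # innermost frame: the code actually running
    return self_, total

def hot_frames(stacks, min_share=0.2):
    """Functions whose own code was running in at least min_share of samples.

    Returns (name, self_share, total_share) tuples, largest self share first.
    """
    self_, total = self_and_total(stacks)
    n = len(stacks)
    ranked = sorted(self_.items(), key=lambda kv: -kv[1])
    return [(name, count / n, total[name] / n)
            for name, count in ranked if count / n >= min_share]

stacks = (
    [("main", "handle", "db_query", "scan_rows")] * 6
    + [("main", "handle", "db_query")] * 2
    + [("main", "handle", "render")] * 1
    + [("main", "gc")] * 1
)
hot = hot_frames(stacks)
```

In this made-up data `main` is in every sample but never the leaf, so it is wide without being expensive itself; `scan_rows` is the frame doing the work.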

In the example flame graph for one request handler, the handler's own code is a sliver - almost all the CPU is under db.Query, and within it the row scanning and reflection, not the SQL. That is the kind of thing a flame graph makes obvious in seconds and a dashboard never will.

03 One model, many runtimes

Sampling is simple in principle and a mess in practice, because every language runtime emits its profiles in a different shape. Go speaks pprof. The JVM emits JFR. Python, Node, Ruby, .NET, PHP, and native (perf-style) agents each have their own wire format, their own idea of what a stack frame even is. If you want one product that ingests all of them, the formats are the hard part, not the sampling.

A dispatcher parses each runtime's wire format into one normalized profile model; everything downstream sees a single shape.

The design that survives contact with reality is a parser dispatcher: a thin front that recognizes the format and routes to a per-runtime parser, each of which normalizes into a single internal profile model. Everything downstream - storage, query, the flame-graph UI - sees one shape. Adding a new runtime is a new parser file, not a rewrite. That boundary is the whole architecture; get it right and the eighth format costs the same as the second.
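A sketch of that boundary, under loud assumptions: the two input formats here are toy stand-ins (semicolon-separated collapsed-stack text, and a hypothetical JSON shape), not pprof or JFR, and the normalized model is just a `Counter` from stack to sample count. The names `PARSERS`, `parse_profile`, `parse_collapsed`, and `parse_json` are illustrative.

```python
import json
from collections import Counter

# Normalized model for this sketch: Counter mapping a stack (tuple of function
# names, outermost first) to its sample count.

def parse_collapsed(data: bytes) -> Counter:
    """Parse 'a;b;c 12' lines: one stack and its sample count per line."""
    profile = Counter()
    for line in data.decode().splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        profile[tuple(stack.split(";"))] += int(count)
    return profile

def parse_json(data: bytes) -> Counter:
    """Parse a hypothetical shape: {"samples": [{"stack": [...], "count": n}]}."""
    profile = Counter()
    for sample in json.loads(data)["samples"]:
        profile[tuple(sample["stack"])] += sample["count"]
    return profile

# Registry of (sniffer, parser) pairs. Supporting a new runtime means
# appending one entry here and writing one parser function.
PARSERS = [
    (lambda d: d.lstrip().startswith(b"{"), parse_json),
    (lambda d: b";" in d or b" " in d, parse_collapsed),
]

def parse_profile(data: bytes) -> Counter:
    """Thin front: recognize the format, route to its parser, return the one model."""
    for matches, parser in PARSERS:
        if matches(data):
            return parser(data)
    raise ValueError("unrecognized profile format")

text_payload = b"main;handle;db_query 8\nmain;gc 2\n"
json_payload = json.dumps({"samples": [
    {"stack": ["main", "handle", "db_query"], "count": 8},
    {"stack": ["main", "gc"], "count": 2},
]}).encode()
```

Both payloads come out of `parse_profile` as the same `Counter`, so nothing downstream needs to know which runtime produced them.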

04 Storing and diffing

Profiles are heavier than metrics - each one is a whole tree of stacks, not a single number - so storage gets its own treatment: keep the raw profile bytes, index them by service and time, and lean on the fact that consecutive profiles from the same service overlap enormously, so they compress and dedup well.
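One way to picture the overlap, as a sketch only: instead of the raw-bytes storage described above, this toy store interns each distinct stack once under a hash and keeps per-profile counts against those ids, indexed by service and timestamp. `ProfileStore` and its methods are hypothetical, and a real system would add compression and persistence.

```python
import hashlib
from collections import defaultdict

class ProfileStore:
    """Toy in-memory store: stacks are deduplicated, profiles indexed by service and time."""

    def __init__(self):
        self.stacks = {}                   # stack_id -> stack, stored once
        self.index = defaultdict(list)     # service -> [(timestamp, {stack_id: count})]

    def put(self, service, timestamp, profile):
        """Store one profile given as {stack tuple: sample count}."""
        compact = {}
        for stack, count in profile.items():
            stack_id = hashlib.sha256(";".join(stack).encode()).hexdigest()[:16]
            self.stacks.setdefault(stack_id, stack)    # no-op if already known
            compact[stack_id] = count
        self.index[service].append((timestamp, compact))

    def query(self, service, start, end):
        """Merge every profile for `service` with start <= timestamp < end."""
        merged = defaultdict(int)
        for timestamp, compact in self.index[service]:
            if start <= timestamp < end:
                for stack_id, count in compact.items():
                    merged[self.stacks[stack_id]] += count
        return dict(merged)

store = ProfileStore()
store.put("checkout", 1000, {("main", "handle", "db_query"): 8, ("main", "gc"): 2})
store.put("checkout", 1010, {("main", "handle", "db_query"): 7, ("main", "handle", "render"): 3})
```

The two profiles contain four stack entries between them but only three distinct stacks, so the shared one is stored once; with real services the overlap between consecutive profiles is what makes this kind of sharing pay off.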

The feature that pays for all of it is the diff. Pick two time windows - before a deploy and after - and subtract one flame graph from the other. Frames that got wider light up: that is your regression, named down to the function, with no guesswork. "p99 went up after Tuesday's release" becomes "this function started taking 3x the CPU," which is a bug you can actually go fix.
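The subtraction itself is simple once both windows are in the same shape. A minimal sketch, assuming each window is a mapping from stack to sample count; shares are normalized by each window's total so that windows with different sample counts compare fairly. The function names, threshold, and data are illustrative.

```python
from collections import Counter

def function_shares(profile):
    """Share of samples in which each function appears, for one time window."""
    total = sum(profile.values())
    shares = Counter()
    for stack, count in profile.items():
        for name in set(stack):            # count once per stack, even if recursive
            shares[name] += count / total
    return shares

def diff_profiles(before, after, threshold=0.05):
    """Functions whose share of CPU grew by at least `threshold`, biggest growth first."""
    b, a = function_shares(before), function_shares(after)
    deltas = {name: a[name] - b[name] for name in set(a) | set(b)}
    grew = [(name, delta) for name, delta in deltas.items() if delta >= threshold]
    return sorted(grew, key=lambda nd: -nd[1])

# Made-up windows: before and after a hypothetical deploy.
before = {("main", "handle", "db_query"): 20, ("main", "handle", "render"): 80}
after = {("main", "handle", "db_query"): 60, ("main", "handle", "render"): 40}
regressions = diff_profiles(before, after)
```

Here `db_query` goes from 20% to 60% of samples and is the only function reported; `main` and `handle` are in every sample in both windows, so their delta is zero and they stay quiet.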

The throughline: sampling makes profiling cheap enough to always run, a normalized model makes it work for every runtime, and diffing turns it from a debugging tool into a regression alarm.

05 The payoff

~1% overhead, always on

many runtimes, one model

function-level resolution, in prod

Metrics and traces tell you that something is wrong and roughly where. Continuous profiling tells you the exact line, in production, under real load, cheaply enough that it is just always there when you need it. Once a team has it, "the service is slow" stops being an investigation and starts being a lookup.

Building observability tooling and want a second pair of hands? luthra.zenith@gmail.com
