
Anomaly Detection for Alerting, at Nanosecond Cost

A static threshold can't tell a Tuesday-afternoon traffic peak from an outage. Anomaly detection can - but the obvious way to build it melts your CPU at ten alerts. Here is how to get it to 8.5 nanoseconds per evaluation and 100,000+ timeseries, by refusing to do real work in the hot path.

Every alerting system starts with static thresholds: page me when CPU goes over 80%, when error rate crosses 2%. They are simple and they are wrong about half the time. Real traffic is seasonal - quiet at 4am, busy at 2pm, different on weekends. A threshold tuned for the night fires all afternoon; one tuned for the afternoon sleeps through a 3am incident. You end up either drowning in false pages or missing the real ones.

The fix everyone reaches for is anomaly detection: instead of a fixed line, learn what "normal" looks like for this metric at this time of week, and alert when reality strays too far. The idea is right. The naive implementation is what gets you in trouble.

01 The trap: doing the work in the hot path

The straightforward design computes the baseline on demand. Every time the alert rule evaluates - typically every 15 to 60 seconds, per timeseries - it pulls a few weeks of history from the metrics store, crunches a baseline and a spread, and compares the latest point against it.

Naive design: each evaluation fans out several history queries to the time-series store, every tick, uncached.

It works beautifully in a demo with three alerts. Then it meets production. Each evaluation becomes four or more queries against the time-series database, fetching and re-aggregating weeks of data that barely changed since the last tick. Nothing is cached, because "the latest window" is always slightly different. CPU climbs linearly with the number of alerts and saturates somewhere around seven to ten timeseries. For a platform that needs to watch tens of thousands, that is a non-starter.

The core mistake is putting expensive, repeated work on the latency-critical path. The fix is almost always the same shape: precompute the expensive part out of band, and leave the hot path doing nothing but arithmetic.

02 Split training from scoring

So separate the two jobs. Training - turning weeks of raw history into a compact model - is expensive but rarely needs to be fresh to the second. Run it in the background, on a schedule. Scoring - deciding whether the current point is anomalous - has to be instant, but if the model is already built, scoring is just a subtraction and a divide.

Once you frame it this way, the evaluation hot path stops talking to the database entirely. It reads a small precomputed model from memory and does a few floating-point operations. Evaluation-time query load no longer grows with your alert count, because evaluations no longer issue queries at all; the only fetches left are the background training runs, which happen on their own schedule.
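One way to picture the split is a detector whose model sits behind an atomic pointer: a background job swaps in a fresh model, and the scoring path only ever loads it. This is a minimal sketch, not the system's actual code. The names `Detector`, `Retrain` and the single-bucket `Model` are assumptions made for illustration, and the "training" here just accepts already-computed numbers instead of querying history.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// Model is a hypothetical precomputed baseline. A real one would hold a
// whole grid of buckets; one center and spread keeps the sketch short.
type Model struct {
	Center, Spread float64
}

// Detector keeps the current model behind an atomic pointer, so the scoring
// path never waits on the trainer.
type Detector struct {
	model atomic.Pointer[Model]
}

// Retrain is the expensive side. In a real system it would fetch and reduce
// weeks of history; here the caller passes in the result (an assumption).
func (d *Detector) Retrain(center, spread float64) {
	d.model.Store(&Model{Center: center, Spread: spread})
}

// Score is the hot path: one atomic load and a little arithmetic, no I/O.
// The boolean is false when there is no usable model yet.
func (d *Detector) Score(x float64) (float64, bool) {
	m := d.model.Load()
	if m == nil || m.Spread == 0 {
		return 0, false
	}
	return math.Abs(x-m.Center) / m.Spread, true
}

func main() {
	var d Detector
	d.Retrain(100, 5) // would normally run from a background scheduler
	score, ok := d.Score(120)
	fmt.Println(score, ok) // prints "4 true"
}
```

The point of the pointer swap is that a retrain replaces the model wholesale, so a concurrent `Score` sees either the old model or the new one, never a half-written mix.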

03 A model shaped like the week

What should the model actually be? Seasonality is the whole point, so the model is indexed by position in the week. Bucket time into a grid: seven days, and within each day, 288 five-minute slots. That's 2,016 buckets covering every five-minute window of a typical week. Each bucket stores what "normal" looked like for that slot, learned across however many weeks of history you keep.

A timestamp maps to one (day-of-week, time-of-day) bucket; each bucket carries a learned center and spread.

Scoring a point becomes: find its bucket from the timestamp, read the bucket's center and spread, and ask how far out the point is. The lookup is an array index. No search, no allocation, no I/O.
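The timestamp-to-bucket mapping can be done with integer arithmetic alone. The sketch below assumes Unix timestamps in seconds and a grid aligned to UTC with Sunday 00:00 as bucket 0; the article does not specify either, so treat both as illustrative choices. A per-metric timezone would need an offset added before the modulo.

```go
package main

import (
	"fmt"
	"time"
)

const (
	slotSeconds    = 5 * 60                     // five-minute slots
	slotsPerDay    = 24 * 60 * 60 / slotSeconds // 288
	bucketsPerWeek = 7 * slotsPerDay            // 2016
	weekSeconds    = 7 * 24 * 60 * 60
)

// bucketIndex maps a Unix timestamp in seconds to its slot in the weekly
// grid. The Unix epoch (1970-01-01) fell on a Thursday, so shifting by four
// days makes index 0 land on Sunday 00:00 UTC.
func bucketIndex(ts int64) int {
	sec := (ts + 4*24*60*60) % weekSeconds
	if sec < 0 {
		sec += weekSeconds // keep pre-1970 timestamps in range
	}
	return int(sec / slotSeconds)
}

func main() {
	ts := time.Date(2024, time.January, 2, 14, 7, 0, 0, time.UTC) // a Tuesday
	fmt.Println(bucketsPerWeek, bucketIndex(ts.Unix()))           // prints "2016 745"

	// Cross-check against the standard library's calendar arithmetic.
	fmt.Println(int(ts.Weekday())*slotsPerDay + (ts.Hour()*60+ts.Minute())/5) // prints "745"
}
```

Tuesday is day 2 and 14:07 falls in slot 169 of the day, so the index is 2·288 + 169 = 745.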

04 Why median and MAD, not mean and standard deviation

Here is the subtle part, and the part most home-grown detectors get wrong. The intuitive choice for "center" and "spread" is the mean and the standard deviation. But think about what's in your training data: it includes the very incidents and spikes you are trying to detect. The mean gets dragged toward outliers, and standard deviation gets dragged even harder - it squares the distances. Train on data containing a spike and your "normal" band quietly inflates to include the spike. The detector learns to ignore exactly what it was built to catch.

The robust alternative is the median for the center and the median absolute deviation (MAD) for the spread:

# center and spread that ignore outliers
center = median(values)
MAD    = median(|value - center|  for value in values)

# robust z-score: how many MADs from the center?
score  = |x - center| / MAD

The median barely moves when you add a few extreme values, and neither does the MAD. You can drop a 10x spike into the training window and the learned band barely flinches. The score is a robust analogue of the z-score - "how many typical deviations away is this point" - and you alert when it exceeds some sensitivity k.
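To make the difference concrete, here is a small sketch that computes both pairs of statistics on a clean sample and on the same sample with one 10x spike added. The data values are made up for illustration, and the helper names (`median`, `mad`, `meanStd`) are not from any particular library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the middle value of xs without modifying the caller's slice.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// mad returns the median absolute deviation of xs around center.
func mad(xs []float64, center float64) float64 {
	dev := make([]float64, len(xs))
	for i, x := range xs {
		dev[i] = math.Abs(x - center)
	}
	return median(dev)
}

// meanStd returns the mean and population standard deviation, for comparison.
func meanStd(xs []float64) (mean, std float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var ss float64
	for _, x := range xs {
		ss += (x - mean) * (x - mean)
	}
	return mean, math.Sqrt(ss / float64(len(xs)))
}

func main() {
	clean := []float64{98, 99, 100, 100, 101, 102, 100}
	spiked := append(append([]float64(nil), clean...), 1000) // one 10x spike

	for _, xs := range [][]float64{clean, spiked} {
		c := median(xs)
		m, sd := meanStd(xs)
		fmt.Printf("median=%.1f MAD=%.1f | mean=%.1f std=%.1f\n", c, mad(xs, c), m, sd)
	}
}
```

On this sample the median stays at 100 and the MAD stays at 1 with or without the spike, while the mean jumps from 100 to 212.5 and the standard deviation grows by two orders of magnitude.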

05 See it move

The band around the baseline is center ± k·MAD. The sensitivity k is the one knob that matters: small k means a tight band and a twitchy detector; large k means a forgiving band that only fires on the dramatic stuff. Lower k and borderline points start crossing the line; raise it and they fall back inside. Inject a spike into the training data and a robust band refuses to be impressed.

Anomaly band - one day of a metric

Notice the band is wider in the busy part of the day and tighter at night - because the spread is learned per bucket, the detector is automatically more tolerant when the metric is naturally noisier, and stricter when it's calm. A fixed threshold can't do that.
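The band itself is two multiplications per bucket. A rough sketch, with a hypothetical `Bucket` type and invented numbers for a calm night slot and a busy afternoon slot:

```go
package main

import "fmt"

// Bucket is a hypothetical learned (center, spread) pair for one weekly slot.
type Bucket struct {
	Center, MAD float64
}

// Band returns the lower and upper edge of the normal band for sensitivity k.
func (b Bucket) Band(k float64) (lo, hi float64) {
	return b.Center - k*b.MAD, b.Center + k*b.MAD
}

// Outside reports whether x falls outside the band. This is the same test
// as "robust z-score greater than k", just written in the metric's units.
func (b Bucket) Outside(x, k float64) bool {
	lo, hi := b.Band(k)
	return x < lo || x > hi
}

func main() {
	night := Bucket{Center: 40, MAD: 2}  // calm slot: tight band
	peak := Bucket{Center: 400, MAD: 25} // busy slot: wide band

	for _, k := range []float64{2, 4} {
		nl, nh := night.Band(k)
		pl, ph := peak.Band(k)
		fmt.Printf("k=%v night=[%v, %v] peak=[%v, %v]\n", k, nl, nh, pl, ph)
	}

	// A 10-unit excursion is an anomaly at night, a 50-unit one is not at peak.
	fmt.Println(night.Outside(50, 4), peak.Outside(450, 4)) // prints "true false"
}
```

The same k gives a band 16 units wide at night and 200 units wide at peak, which is the per-bucket tolerance described above.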

06 The hot path is just arithmetic

With the model precomputed, scoring at evaluation time is a handful of operations and zero allocations - no garbage for the collector to chase, no syscalls, no network:

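// Score returns the robust z-score of x against the bucket for timestamp ts.
// It does no I/O and no allocation: one array read plus a little arithmetic.
func (m *Model) Score(ts int64, x float64) float64 {
    b := m.bucket[bucketIndex(ts)] // array index - O(1)
    if b.mad == 0 {
        return 0 // flat history: no spread to scale by, so nothing is flagged
    }
    return math.Abs(x-b.center) / b.mad // robust z-score
}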

This is the move that buys the orders of magnitude. Scoring a point drops from roughly 200 milliseconds (four network round-trips and a re-aggregation) to about 8.5 nanoseconds. Rendering a 24-hour band for a chart - 288 points - goes from tens of seconds of repeated queries to microseconds of arithmetic, because every point is read from the same in-memory model.
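The zero-allocation claim is easy to check for yourself. The sketch below rebuilds a stand-in for the model as a fixed-size array and measures allocations per call with `testing.AllocsPerRun` from the standard library. The `stat` type, the UTC-aligned `bucketIndex`, and the sample values are assumptions for illustration; the nanosecond figure depends on hardware and is not reproduced here.

```go
package main

import (
	"fmt"
	"math"
	"testing"
)

const (
	slotSeconds    = 300
	weekSeconds    = 7 * 24 * 3600
	bucketsPerWeek = weekSeconds / slotSeconds // 2016
)

// stat is a hypothetical per-bucket record: learned center and spread.
type stat struct{ center, mad float64 }

// Model is a fixed-size weekly grid, so a lookup is a plain array index.
type Model struct{ bucket [bucketsPerWeek]stat }

// bucketIndex assumes Unix seconds and a UTC week starting on Sunday.
func bucketIndex(ts int64) int {
	sec := (ts + 4*24*3600) % weekSeconds
	if sec < 0 {
		sec += weekSeconds
	}
	return int(sec / slotSeconds)
}

// Score is the hot path from the article: index, guard, subtract, divide.
func (m *Model) Score(ts int64, x float64) float64 {
	b := m.bucket[bucketIndex(ts)]
	if b.mad == 0 {
		return 0
	}
	return math.Abs(x-b.center) / b.mad
}

var sink float64 // stops the compiler from discarding the call

// scoreAllocs reports the average number of heap allocations per Score call.
func scoreAllocs(m *Model) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = m.Score(1704204420, 130)
	})
}

func main() {
	m := &Model{}
	m.bucket[bucketIndex(1704204420)] = stat{center: 100, mad: 5}
	fmt.Println("score:", m.Score(1704204420, 130))
	fmt.Println("allocs per call:", scoreAllocs(m))
}
```

For timing rather than allocations, the usual tool is a `Benchmark` function run with `go test -bench`, which is left out to keep the sketch short.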

07 Keeping the model fresh without a stampede

A precomputed model is stale the moment you build it, so it needs refreshing - but you can't have every rule retraining at once, or you've just recreated the query stampede on a timer. Two techniques keep it cheap:

  • Piggybacked incremental updates. Every evaluation already has the latest data point in hand. Fold it into the bucket's running statistics right there, for free, so the model drifts along with reality between full retrains. Streaming estimators (Welford's algorithm for running moments, bounded approximations for the robust statistics) make this O(1) per point.
  • Staggered retraining. Spread full rebuilds across the schedule with something like (rule_id + day_of_epoch) % N == 0, so each rule retrains once every N days, on its own rotating day. The heavy historical fetch is amortized instead of synchronized - no thundering herd.
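Both techniques are small in code. The sketch below shows Welford's algorithm for a running mean and variance, plus a staggering check built on the modulo expression above. It covers only the running moments; a streaming estimate of the median and MAD needs a different, approximate estimator that is not shown here. The names `Running` and `dueForRetrain` are illustrative.

```go
package main

import "fmt"

// Running tracks mean and variance in O(1) time and memory per point,
// using Welford's algorithm.
type Running struct {
	n    int
	mean float64
	m2   float64 // sum of squared deviations from the current mean
}

// Add folds one new observation into the running statistics.
func (r *Running) Add(x float64) {
	r.n++
	d := x - r.mean
	r.mean += d / float64(r.n)
	r.m2 += d * (x - r.mean)
}

// Mean returns the running mean.
func (r *Running) Mean() float64 { return r.mean }

// Variance returns the population variance of the points seen so far.
func (r *Running) Variance() float64 {
	if r.n == 0 {
		return 0
	}
	return r.m2 / float64(r.n)
}

// dueForRetrain reports whether a rule's turn comes up today. With period n,
// each rule is due on exactly one day in every n, and rules with different
// IDs are spread evenly across those days.
func dueForRetrain(ruleID, dayOfEpoch, n int64) bool {
	return (ruleID+dayOfEpoch)%n == 0
}

func main() {
	var r Running
	for _, x := range []float64{2, 4, 4, 4, 5, 5, 7, 9} {
		r.Add(x)
	}
	fmt.Printf("mean=%.3f variance=%.3f\n", r.Mean(), r.Variance())

	// Of 700 rules with a 7-day period, one seventh retrain on any given day.
	due := 0
	for id := int64(0); id < 700; id++ {
		if dueForRetrain(id, 19723, 7) {
			due++
		}
	}
	fmt.Println("rules retraining today:", due) // prints 100
}
```

With sequential rule IDs the load is perfectly even; with sparse or hashed IDs it is only approximately even, which is usually good enough to avoid a synchronized fetch.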

New alerts get a one-off model trained inside the request so they have a usable band immediately, rather than waiting for the next background cycle.
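A train-on-first-use lookup might look like the sketch below. The `Registry` type and its `train` callback are hypothetical stand-ins for the real model store and the historical fetch. For brevity it holds one lock across training, which would stall other rules in a real system; a per-rule lock or a request-coalescing helper would be the obvious refinement.

```go
package main

import (
	"fmt"
	"sync"
)

// Model is a placeholder for a trained per-rule baseline.
type Model struct {
	Center, MAD float64
}

// Registry caches one model per rule and trains missing ones on demand.
type Registry struct {
	mu     sync.Mutex
	models map[string]*Model
	train  func(ruleID string) *Model // stands in for the expensive fetch
}

// NewRegistry builds an empty registry around the given training function.
func NewRegistry(train func(ruleID string) *Model) *Registry {
	return &Registry{models: make(map[string]*Model), train: train}
}

// Get returns the cached model for a rule. If none exists, it trains one
// inline so a brand-new alert has a usable band immediately.
func (r *Registry) Get(ruleID string) *Model {
	r.mu.Lock()
	defer r.mu.Unlock()
	if m, ok := r.models[ruleID]; ok {
		return m
	}
	m := r.train(ruleID) // one-off cost, paid only on first use
	r.models[ruleID] = m
	return m
}

func main() {
	calls := 0
	reg := NewRegistry(func(ruleID string) *Model {
		calls++
		return &Model{Center: 100, MAD: 5} // made-up values for illustration
	})

	reg.Get("cpu-high")
	reg.Get("cpu-high")
	fmt.Println("training runs:", calls) // prints 1
}
```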

08 The lesson

200ms → 8.5ns per evaluation
~10 → 100k+ timeseries watched
0 queries added per alert

None of the individual pieces are exotic - seasonal bucketing, robust statistics, streaming updates, and a background job are all standard tools. The leverage came entirely from where the work happens. The naive design and the fast one compute almost the same numbers; one does it on the latency-critical path under a per-alert multiplier, the other does it once, in the background, and leaves the hot path doing arithmetic on data that's already in memory.

That's the reusable idea, and it long outlives this particular system: find the expensive thing on your hot path, and ask whether it can be precomputed. Most of the time, it can.

Building something in this space and want a second pair of hands? luthra.zenith@gmail.com
