The Hard Part of Sampling Is the Zero

You can answer "how many errors in the last 30 days?" by reading a sliver of the data and multiplying back, for a fraction of the cost. The multiply is the easy part. The engineering is refusing to report a confident zero when your sliver happened to miss everything.

Some questions are cheap to ask and ruinous to answer. "How many checkout failures in the last 30 days?" is one number - but the obvious way to get it reads every log line in those 30 days, tens of millions of rows, to hand back a single integer. You pay full price for the whole haystack to count a few needles.

There is a much cheaper way, and it is old. Do not read the whole month. Read a slice of it, count what is there, and scale up.

01 A spoonful of soup

You have a giant pot of soup and you want to know how salty it is. You do not drink the pot. You stir, taste one spoonful, and judge the whole pot from it. Sampling a query is the same move: read one window of time instead of the whole range, count the matches in it, and multiply by the inverse of the fraction you read. Read a quarter of the month, see 1,800 errors, estimate 1,800 × 4 = 7,200 for the month.

Read one sub-window, count the matches inside it, and multiply by 1 / the fraction read. The textbook name is the Horvitz-Thompson estimator; in plain words, "I looked at a quarter, so I multiply by four."

That multiplier - one divided by the fraction you read - is the whole trick. It is provably unbiased for totals: as long as the window's position is chosen at random, counted ÷ fraction lands on the true total on average, no matter how the data is shaped. This only works for quantities you can legally add back up - count and sum. An average or a 95th percentile cannot be recovered by multiplying a slice, and the number of distinct users is hopeless this way, so we never sample those. For the totals, though, the arithmetic is a one-liner.
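A minimal sketch of that one-liner in Python. The function name `scale_up` and the `ADDITIVE` set are illustrative, not taken from any particular library; the only things borrowed from the text are the arithmetic (counted ÷ fraction) and the rule that only count and sum are eligible.

```python
# Aggregates that can be scaled back up from a slice (per the text: count and sum).
ADDITIVE = {"count", "sum"}


def scale_up(observed: float, fraction_read: float, aggregate: str = "count") -> float:
    """Estimate a full-range total from the value observed in one slice.

    observed      -- count or sum measured inside the slice
    fraction_read -- share of the full range the slice covers, in (0, 1]
    aggregate     -- hypothetical label for the kind of query being sampled
    """
    if aggregate not in ADDITIVE:
        # Averages, percentiles and distinct counts are not totals; refuse to scale them.
        raise ValueError(f"cannot sample a {aggregate!r} aggregate")
    if not 0 < fraction_read <= 1:
        raise ValueError("fraction_read must be in (0, 1]")
    return observed / fraction_read


# The example from the text: a quarter of the month read, 1,800 errors seen.
print(scale_up(1_800, 0.25))  # 7200.0
```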

02 The multiply is the easy part

If the estimator is unbiased, where is the catch? In the word "average." Unbiased means it is right on average over many random slices. Any single slice you actually run can be off - and how far off depends entirely on the shape of the data you filtered down to.

For common, steady things - all requests, all 200s - every slice looks like every other, so the estimate is tight. The danger is the rare filter. Ask for "500s from one app build in one region," maybe 40 events in a month, bunched into a handful of short bursts, and a narrow slice can easily contain none of them. Multiply zero by anything and you get zero. The errors are real. Your window just missed them, and you reported their absence with total confidence.

This is the failure that matters: not a number that is 8% high, but a confident 0 that is wholly wrong. A noisy estimate, a user can distrust. A clean zero reads as "nothing happened" - and silence is the one answer an alerting or analytics system must never fake.
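A rough way to put a number on "just missed them", under a deliberately simplified model that is an assumption of this sketch rather than something the text states: the rare events arrive in `bursts` independent clumps, each short enough to count as a single instant and equally likely to land anywhere in the range. A window covering `fraction` of the range then misses every clump with probability (1 - fraction) ** bursts. `miss_probability` is a hypothetical helper name.

```python
def miss_probability(fraction: float, bursts: int) -> float:
    """Chance that a window covering `fraction` of the range contains none of
    `bursts` independent, uniformly placed, instant-length clumps of events."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if bursts < 0:
        raise ValueError("bursts must be non-negative")
    return (1.0 - fraction) ** bursts


# Read a quarter of the month; vary how many independent clumps the events form.
for bursts in (1, 2, 5, 40):
    p = miss_probability(0.25, bursts)
    print(f"{bursts:>2} independent bursts -> P(empty window) = {p:.4f}")
```

In this model the risk is driven by the number of independent clumps and the width of the slice rather than by the raw event count: the fewer distinct moments the events occupy, the more often a window comes back empty.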

03 See it break

Below is the estimator running on a month of synthetic events. Pick a data shape, then drag the sample window across the range. Watch the estimate track the truth on steady data, swing on bursty data, and collapse to a confident zero on a rare filter. When it lands empty, hit widen once and see whether a bigger look rescues it.

[Interactive demo: Sampling a month - drag the window, change the shape]

On the rare filter most positions show the same thing: a window with nothing in it, an estimate of zero, and a true total that is plainly not zero. That is the case the rest of this article is about. Everything else - the math, the speed - is the easy 80%.
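For a static version of that experiment, here is a small Python sketch on hand-built synthetic data. Everything in it is made up for illustration: the two event shapes, the choice of ten non-overlapping window positions, and the helper name `estimate_total`. It slides a window across a 30-day range and prints the estimate at each position.

```python
from bisect import bisect_left

MINUTES = 30 * 24 * 60  # a 30-day range, measured in minutes


def estimate_total(timestamps, start, width, total_range=MINUTES):
    """Count events in [start, start + width) and scale by total_range / width.

    `timestamps` must be sorted. Returns (matched_rows, estimate).
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_left(timestamps, start + width)
    matched = hi - lo
    return matched, matched * total_range / width


# Synthetic shapes (assumptions of this sketch, not real traffic):
steady = list(range(0, MINUTES, 10))  # one event every 10 minutes
rare = [5_000 + i for i in range(20)] + [30_000 + i for i in range(20)]  # 40 events, 2 bursts

width = MINUTES // 10  # each window reads a tenth of the range
for name, events in (("steady", steady), ("rare", rare)):
    estimates = [estimate_total(events, s, width)[1] for s in range(0, MINUTES, width)]
    empty = sum(1 for e in estimates if e == 0)
    mean = sum(estimates) / len(estimates)
    print(f"{name}: true={len(events)} mean_estimate={mean:.0f} "
          f"empty_windows={empty}/{len(estimates)} estimates={estimates}")
```

On the rare shape the average over all ten positions still equals the true total of 40, which is what unbiased means, yet no individual position gets close: eight of the ten windows report 0 and the other two report 200.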

04 Measure the matches, not the answer

The first instinct is to guard on the result: "if the estimate is zero, do not trust it." That is wrong, and the bug is subtle. A sum of refund amounts that genuinely comes out to zero over a window full of non-refund traffic is a correct zero. A count of errors that is legitimately zero because the service was healthy is a correct zero. Flagging those is crying wolf.

The thing that tells a real zero from a dangerous one is not the value at all - it is how many rows the window matched. A window that matched ten thousand rows and summed to zero is solid. A window that matched almost no rows might have landed in a gap, and any number it produces - including zero - is a guess. So the guard keys off the matched-row count, never the output. Plenty of matches, any value: trust it. Near-zero matches: be suspicious, whatever the value.

The distinction in one line: a zero answer over many matched rows is data. A zero answer over an empty window is the absence of data wearing the same clothes. Guard on the matches, not the number.
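The two guards side by side, as a Python sketch. The `WindowResult` shape and the `MIN_MATCHED` floor of 100 are placeholders chosen for illustration; the text does not prescribe a specific threshold.

```python
from dataclasses import dataclass

MIN_MATCHED = 100  # hypothetical floor on matched rows; tune for your own error budget


@dataclass
class WindowResult:
    matched_rows: int  # rows in the window that passed the filter
    value: float       # the count or sum computed over those rows


def guard_on_value(result: WindowResult) -> bool:
    """The tempting guard: distrust any zero. Flags legitimate zeros, passes thin non-zeros."""
    return result.value != 0


def guard_on_matches(result: WindowResult, floor: int = MIN_MATCHED) -> bool:
    """The guard the text argues for: trust depends only on how many rows matched."""
    return result.matched_rows >= floor


solid_zero = WindowResult(matched_rows=10_000, value=0.0)  # plenty of data, sums to zero
thin_zero = WindowResult(matched_rows=0, value=0.0)        # window landed in a gap
thin_nonzero = WindowResult(matched_rows=2, value=2.0)     # a guess that happens to be non-zero

for r in (solid_zero, thin_zero, thin_nonzero):
    print(r, "| value guard trusts:", guard_on_value(r),
          "| match guard trusts:", guard_on_matches(r))
```

The value guard gets two of the three cases wrong: it rejects the solid zero and accepts the thin non-zero. The match guard never looks at `value` at all.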

05 Widen once, then decide

So the window matched too few rows. You have three options and two of them are traps. Returning the thin estimate anyway is the confident-zero lie. Quietly falling back to scanning the whole range is the other trap - it turns the fast approximate query you were paid to deliver back into the slow exact one, silently, exactly when the data is awkward. The honest move sits between them: look a little wider, once.

Double the window, recompute the scale factor from the new width so the arithmetic stays correct, re-run, and re-check the matched-row count. One bounded step, not a loop. If the wider window now has enough matches, return that estimate. If it still does not, stop guessing and say so.

The safety chain. The decision keys off matched rows at every gate. A thin window widens exactly once; if it is still thin, the request refuses rather than emit a number nobody should trust.
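One way to sketch the chain in Python. `run_window` stands in for whatever actually executes the query over a time window; it, the `Outcome` type, and the floor of 100 matched rows are all assumptions of this example. The widened window keeps the same start where it can and is capped at the full range, which is one choice among several.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# (start, width) -> (matched_rows, raw count or sum inside the window); hypothetical signature
RunWindow = Callable[[float, float], Tuple[int, float]]


@dataclass
class Outcome:
    estimate: Optional[float]  # None means "refused: too sparse for a cheap answer"
    matched_rows: int
    width_used: float
    widened: bool


def sample_with_one_widen(run_window: RunWindow, start: float, width: float,
                          total_range: float, floor: int = 100) -> Outcome:
    """Run one window; if it matched too few rows, double it once and re-check."""
    matched, raw = run_window(start, width)
    if matched >= floor:
        return Outcome(raw * total_range / width, matched, width, widened=False)

    # One bounded step: double the width, capped at the full range,
    # and shift the start back if the wider window would run off the end.
    wider = min(2 * width, total_range)
    start = max(0, min(start, total_range - wider))
    matched, raw = run_window(start, wider)
    if matched >= floor:
        # The scale factor is recomputed from the NEW width.
        return Outcome(raw * total_range / wider, matched, wider, widened=True)

    # Still thin: stop guessing. No loop, no silent full scan.
    return Outcome(None, matched, wider, widened=True)


# Toy backends for illustration: one event per time unit, and no events at all.
def dense(start, width):
    return int(width), float(int(width))


def empty(start, width):
    return 0, 0.0


print(sample_with_one_widen(dense, 0, 50, 1000))  # thin at 50, passes after one widen
print(sample_with_one_widen(empty, 0, 50, 1000))  # still thin after widening: estimate=None
```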

Why one step and not "keep widening until it is dense enough"? Because that loop has a name - it is a full scan with extra steps. The bounded widen says "give the estimate a fair second chance," and the refusal says "past that, the data is too sparse for a cheap answer, and pretending otherwise is the bug."

A quick aside on the threshold: the accuracy of a scaled count depends on the absolute number of matched rows. For roughly independent events, its relative error shrinks like 1 / √(matched rows) - so the floor is a count of matches, not a fraction of the dataset. A bigger dataset does not need a bigger floor; it just makes the floor easier to clear in a small slice.
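To turn that aside into numbers, a short Python sketch. It assumes matches behave roughly like independent arrivals, so that the standard error of a count k is about √k; bursty data will do worse than this. Both helper names are made up for the example.

```python
import math


def relative_error(matched_rows: int) -> float:
    """Approximate relative standard error of a scaled count: 1 / sqrt(k)."""
    if matched_rows <= 0:
        return math.inf  # nothing matched, nothing to estimate from
    return 1.0 / math.sqrt(matched_rows)


def floor_for(target_relative_error: float) -> int:
    """Approximate matched-row count needed for a 1/sqrt(k) error at or below the target."""
    if target_relative_error <= 0:
        raise ValueError("target must be positive")
    return math.ceil(1.0 / target_relative_error ** 2)


for k in (4, 25, 100, 10_000):
    print(f"k={k:>6}: relative error ~ {relative_error(k):.1%}")

# The floor depends only on the target error, not on how big the dataset is.
print(floor_for(0.10))  # 100
```

Note that neither function takes the dataset size as an argument: under this model, 100 matched rows give about the same relative error whether they came from a million rows or a billion.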

06 A flag you do not render is not a safety feature

When the widened window is still too thin, the tempting design is to return the estimate with a lowConfidence: true flag attached and let the caller decide. It feels responsible. It is the most common way this ships, and it is mostly theater.

A flag only protects a user if something on the path actually renders it. Think about who calls an analytics API: a dashboard that draws the number in 48px and never reads your metadata. An alerting rule that compares the number to a threshold and pages or does not. A nightly script that writes the number into a report. A spreadsheet pulling it over HTTP. None of them look at lowConfidence. To every one of them, your carefully flagged guess is just a number, and a flagged confident zero fires - or fails to fire - an alert exactly as loudly as an unflagged one.

The reusable lesson: when the safe behavior depends on the caller reading a field, the default is unsafe. Design for the consumer who ignores your metadata, because most of them do, and silently. Fail closed: by default, refuse the thin sample outright - return an explicit error, not a flagged number. Only return the flagged estimate when something opts in and visibly renders the warning. The flag becomes real the day a UI is built to show it; until then, the rejection is the only honest answer.
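A fail-closed default might look like the following Python sketch. The exception name, the `allow_low_confidence` parameter, and the response shape are inventions for this example, not an existing API; the one idea taken from the text is that the thin-sample path raises unless the caller explicitly opts in.

```python
from dataclasses import dataclass


class ThinSampleError(Exception):
    """Raised when too few rows matched to give a cheap answer worth trusting."""


@dataclass
class Estimate:
    value: float
    matched_rows: int
    low_confidence: bool = False


def respond(value: float, matched_rows: int, floor: int = 100,
            allow_low_confidence: bool = False) -> Estimate:
    """Return an estimate, refusing thin samples unless the caller opted in."""
    if matched_rows >= floor:
        return Estimate(value, matched_rows)
    if allow_low_confidence:
        # Opt-in path: meant only for callers that will visibly render the warning.
        return Estimate(value, matched_rows, low_confidence=True)
    # Default path: an explicit error, so a dashboard or alert rule cannot mistake it for a number.
    raise ThinSampleError(
        f"only {matched_rows} rows matched (floor {floor}); sample too thin to estimate"
    )


try:
    respond(value=0.0, matched_rows=3)
except ThinSampleError as err:
    print("refused:", err)

print(respond(value=0.0, matched_rows=3, allow_low_confidence=True))
```

The design choice is in the default value: a caller that passes nothing gets the refusal, and the flagged number only exists for callers that asked for it by name.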

This is not exotic. It is the discipline the systems that have lived with sampling longest all converge on: report uncertainty and escalate rather than answer (BlinkDB), state plainly that a restrictive filter can miss rare values "altogether" (Elasticsearch), keep extrapolation opt-in and warn (New Relic). The shared rule underneath all of them is the same: never let a thin sample wear the costume of a confident answer.

07 The payoff

~1 / √k: error set by matched rows, not data size

widen ×2, once: a bounded look, then a decision - never a loop

confident zeros shipped to a caller

The cheap part of approximate analytics is the part everyone demos: read a quarter, multiply by four, watch the query get fast. The part that decides whether you can put it in front of real users is the unglamorous safety chain - measure the matches not the value, widen once, and fail closed when the data is too sparse to answer cheaply. Get that right and "our long-range queries are too expensive" stops being a tradeoff between speed and trust. Get it wrong and you have built a system that is fast, cheap, and occasionally, confidently, lying.

Building approximate query or observability infrastructure and want a second pair of hands? luthra.zenith@gmail.com
