Don't aggregate your metrics

Recently Visnu Pitiyanuvath of Observable presented how dataviz techniques can be applied to developer dashboards to improve insights. I followed his advice and it had a transformational effect on my work.

(full talk is below but you don't need to watch it right now)

His talk emphasized that we often over-aggregate metrics [proof]. Indeed, most monitoring dashboards are time series of the mean, perhaps with some percentiles, but a load of line graphs nonetheless.

[Image: Typical Grafana dashboards are not good for building insight]

His message was to stop doing that, and just draw every single event, no aggregation into trendlines. Allow your visual system to notice patterns that would otherwise be invisible under aggregation.

I was quite intrigued. The advice felt like the opposite of what the Google SRE book suggests, where SREs are encouraged to distill the system down to a small number of actionable graphs and precise SLO boundaries. (FWIW: I think the difference is SREs expect to be piloting a well-understood system whereas Visnu is advising how you can start to understand a poorly understood system).

Anyway, I was building a new system that had lots of weird performance quirks, so it was a great testbed to try out the techniques he suggested.

Serverless Cells

The system I am developing executes Observable notebooks in a serverless environment. It's multi-tenant Puppeteer on Cloud Run. It potentially has multiple levels of cold starts, and I was interested in understanding its performance characteristics better.

So I hacked together a latency prober notebook, which satisfyingly could utilize the serverless environment to implement, schedule, and execute the probes, as well as visualize the results.

Data collection

The core latency measurement work is done by a cell named timedRequest that wraps an HTTP call with a timer. Multiple calls to timedRequest are made in the serverless cell called ping. In particular, for each region, every 30 minutes, two calls are made in quick succession, with the results saved to the Firebase Realtime Database. The first is tagged the "cold" call, and the second is tagged "warm".

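    async function ping(region) {
      // First request: may pay for a cold start.
      const coldTime = readableTimestamp();
      const datapointCold = await timedRequest(region);
      baseRef.child(coldTime + region).set({
        tag: "cold",
        time: coldTime,
        ...datapointCold // assumption: the remaining fields are missing from this listing
      });
      // Second request, made immediately afterwards: should hit a warm instance.
      const warmTime = readableTimestamp();
      const datapointWarm = await timedRequest(region);
      await baseRef.child(warmTime + region).set({
        tag: "warm",
        time: warmTime,
        ...datapointWarm // assumption, as above
      });
    }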
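
As a rough illustration of what a timing wrapper like timedRequest involves, here is a minimal sketch. It is not the notebook's actual cell: the name `timeRequest`, the injected `doRequest` callback, and the `ms` and `status` field names are all assumptions made for this example.

```javascript
// Hypothetical sketch, not the notebook's real timedRequest.
// `doRequest` is any async function that resolves to an object with a
// `status` field (for example, a fetch() Response).
async function timeRequest(doRequest) {
  const start = Date.now();
  try {
    const response = await doRequest();
    return { ms: Date.now() - start, status: response.status };
  } catch (err) {
    // Record failures as data points too, so they still appear on the chart.
    return { ms: Date.now() - start, status: "error: " + err.message };
  }
}
```

In a browser or a recent Node.js it could be called as `timeRequest(() => fetch(url))`. Returning failures as ordinary data points, rather than throwing, means errors get plotted alongside successful requests.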

Scheduling the work every 30 mins is achieved with a cron. I let the cron kick off latency work executed in europe-west1, which probes 3 different regions: europe-west4, us-central1 and asia-east1.

One would hope the latency between europe-west1 and europe-west4 would be the lowest, but of course, our stack is complicated and might route traffic to the US, so that was the motivation for the regions chosen.


Visnu suggested using Vega-Lite (JS API) and Observable together as a good pairing for quick dashboards. Wow, he was 100% right about how easy it was to draw and refine a decent graph containing all the measurement points read (in realtime) from the Realtime Database.

The following code (source) produces the main workhorse of the dashboard. Every latency measurement is plotted: X is time, Y is latency. Each point is color-coded by the response code and by whether it was a warm or cold measurement.

The visualization includes the ability to hover over individual data points to get the raw data written in a tooltip, which makes it incredibly easy to pin down the precise time before heading to the system logs.

Here is the full code for building the main dashboard:

    vl.markCircle({ size: 100, tooltip: { content: "data" } })
        vl.calculate("datum.tag + '/' + datum.status").as("type/status")
          .scale({ type: 'log' })
            title: "latency (ms)"
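
For readers who prefer the JSON form, the same kind of chart can be sketched as a plain Vega-Lite specification instead of the JS API. This is an illustration under assumptions: the `latencySpec` helper and the `time` and `ms` field names are hypothetical, while the mark options, the calculated "type/status" field, the log scale, and the axis title are taken from the code above.

```javascript
// Hypothetical helper: builds a Vega-Lite spec for a scatter plot of raw samples.
// Assumes each sample looks like { time, ms, tag, status }; `ms` is an assumed name.
function latencySpec(samples) {
  return {
    $schema: "https://vega.github.io/schema/vega-lite/v5.json",
    data: { values: samples },
    // One circle per measurement; the tooltip shows the full raw datum.
    mark: { type: "circle", size: 100, tooltip: { content: "data" } },
    transform: [
      { calculate: "datum.tag + '/' + datum.status", as: "type/status" }
    ],
    encoding: {
      x: { field: "time", type: "temporal" },
      y: {
        field: "ms",
        type: "quantitative",
        // A log scale keeps sub-second and multi-minute points on one chart.
        scale: { type: "log" },
        title: "latency (ms)"
      },
      color: { field: "type/status", type: "nominal" }
    }
  };
}
```

The important design choice is that nothing is aggregated: the spec has no `aggregate` or binning step, so every sample becomes its own mark.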

Immediate results - Thundering Herd Issues

Within 24 hours I had observed my first performance glitch: a cluster of measurements with a constellation of response codes occurring at nearly the same time.


It turned out the system had a bug in its pool mechanism. The intent was that requests to the same notebook would reuse the underlying Puppeteer instance. Unfortunately, the pool was not populated until AFTER the instance started. So if n requests came in at once, the system would exhaust its resources booting up n Puppeteer instances and generally go bananas. It was a one-line fix.
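
The general shape of this kind of fix can be sketched as follows. This is an illustration rather than the actual change: `pool`, `getInstance`, and the injected `boot` function are hypothetical names. The idea is to put the pending promise into the pool before the instance has finished booting, so concurrent requests share a single boot.

```javascript
// Hypothetical sketch of a pool that registers an instance *before* it has started.
const pool = new Map(); // notebook id -> Promise of a running instance

function getInstance(notebook, boot) {
  if (!pool.has(notebook)) {
    // Store the promise immediately, so requests arriving during boot reuse it
    // instead of each starting their own instance.
    const pending = boot(notebook).catch((err) => {
      pool.delete(notebook); // do not cache a failed boot
      throw err;
    });
    pool.set(notebook, pending);
  }
  return pool.get(notebook);
}
```

With this shape, n simultaneous requests for the same notebook all await the same promise, and `boot` runs once.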


Chasing down the long tail

After fixing the first set of obvious problems, the focus turned to issues in the long tail. We had several unexpectedly long performance measurements. Requests taking longer than 5 minutes, for a service with a timeout of 20 seconds! What the hell?

In the following graph, you can see on the far left a period where some requests took an impossibly long time. The graph is segmented into intervals, and in each one we tried out a different fix.


We first noted that the latencies were beyond the timeout setting for Cloud Run. Cloud Run severs the client connection at the timeout and freezes the container, but the Express handler continues to run once the container is unfrozen to serve unrelated subsequent requests.

The fix was to detect and end the long-running processes explicitly (see commit).
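
The commit has the real change. As an illustration of the idea only, a handler can race its work against its own timer and clean up explicitly when the timer wins, rather than relying on the platform timeout. `withDeadline` and `onTimeout` are hypothetical names for this sketch:

```javascript
// Hypothetical sketch: enforce a deadline inside the process itself.
// `work` is a promise; `onTimeout` is a callback that stops the underlying work.
function withDeadline(work, ms, onTimeout) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. close the page or end the child process
      reject(new Error("deadline of " + ms + "ms exceeded"));
    }, ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot fire later.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

The key point is the explicit `onTimeout` call: severing the client connection does not by itself stop the work, so something inside the process has to end it.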

However, we then saw a reduction in the number of warm latency measurements. Now that the 20s timeout was truly respected, the latency prober ran out of time to gather the 2nd datapoint. So the 2nd adjustment was to bump the deadline to 60 seconds.

After that fix, it seemed to work at first, but latency measurements crept up over time. This turned out to be a problem with the measurement system, not the infra. Each latency prober boots up the latency measurement notebook, which queries the measurement history.

The problem is that the Realtime Database will pause when large quantities of data arrive. If that happens in the middle of a latency measurement, the measurement is stretched in proportion to the amount of data in the system, which is why latency was trending upwards over time. It also explains why some data points are unaffected while many others are inflated by a similar magnitude, even though they are readings for different regions!

Figuring out that last bug was quite tricky. I ended up running the serverless env locally and step debugging. At one point I paused during a latency measurement for several minutes, causing a massive spike in latency! But that was just me debugging.

So after adding some logic so that the cron job never loads the dashboard pointlessly:

    if (!getContext().serverless) ...

We were finally back where we were but with no crazy long tails! We were often recording warm latencies below 1s!


Open questions remain. We are seeing more 429s from us-central1 despite it being our least loaded region. Latency is also lower in us-central1 when it should be lowest in europe-west4, indicating our traffic is pointlessly being routed to the US somewhere (the Firebase Hosting origin is only in the US).

Visnu was right

The main point of this dashboard was to see if plotting individual measurements was better than aggregated trend lines. My conclusion: absolutely.

Many of the trickier issues were only diagnosed because we could see strange unexpected correlations across groups. Or very precise synchronizations in time. Or problems that affected only a small number of data points.

Aggregation would have washed out those details! I am a convert. Never aggregate!
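
A toy calculation shows how this happens. The numbers below are invented for illustration, not taken from the dashboard: 97 healthy requests and 3 that took five minutes.

```javascript
// Made-up data: 97 requests at 800 ms and 3 requests at 300,000 ms (5 minutes).
const latenciesMs = [
  ...Array(97).fill(800),
  ...Array(3).fill(300000)
];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

console.log(mean(latenciesMs));            // 9776
console.log(percentile(latenciesMs, 50));  // 800
console.log(percentile(latenciesMs, 95));  // 800
console.log(Math.max(...latenciesMs));     // 300000
```

The median and 95th percentile look perfectly healthy, and the mean of 9776 ms matches no request that actually happened. On a scatter plot, the three five-minute points are impossible to miss.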

It's also cool that you can host an end-to-end latency prober system in a single notebook. If you want to build your own, you can fork mine.