Uptime is a Poor Metric

A colleague recently shared "GitHub's Historic Uptime", a visualization designed to highlight the significant increase in instability of GitHub's services since Microsoft's acquisition in late 2018.

At first glance the contrast is glaring, but on closer inspection one might notice that the y-axis starts at 99.5%. Adjusting the graph to show the entire range from 0-100% would make the graph appear flat.

A line chart showing GitHub's average uptime from 2016 to 2025 showing the range from 0-100%. The line appears to be mostly flat near 100% with minimal variation.

Does that mean I think the Microsoft acquisition didn't have a significant impact on GitHub's services?

No.


I'm reminded of an apocryphal story of two smart students comparing their grades. The first received a grade of 99, and the second a grade of 100.

The first says to the second

One point difference isn't that big a deal.

to which the second replies

You got a 99 because that's the most you could get. I got 100 because that's all I could get.


Returning to uptime, it's hard to overstate how significant a difference there is between no problems and almost no problems.

When discussing high availability, it's common to hear measurements in numbers of nines.

Nines   Uptime
2       99%
3       99.9%
4       99.99%
5       99.999%

Each nine generally represents an order of magnitude more effort.

Visually, they're practically indistinguishable.

This brings me back around to my primary topic and the title of this post: uptime is a poor metric. In my opinion, uptime expressed as a percent is just about the worst way to represent this data.

At this point you should be asking

So what do you think is a better way to represent uptime data?

I'm so glad you asked.

Before I give my answer, let's take a look at everyone's favorite subject: math.

We have three co-dependent values:

  • Uptime
  • Downtime
  • Total Time

$$ \text{Total Time} = \text{Uptime} + \text{Downtime} $$

Uptime can be expressed as a percent which is really just a ratio of uptime to total time.

$$ \text{Uptime\%} = \frac{\text{Uptime}}{\text{Total Time}} $$

Downtime can similarly be expressed as a percent.

$$ \text{Downtime\%} = \frac{\text{Downtime}}{\text{Total Time}} $$

With a bit of massaging we can show the relationship between Uptime% and Downtime%:

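$$
\begin{aligned}
\text{Downtime\%} &= \frac{\text{Total Time} - \text{Uptime}}{\text{Total Time}} \\
&= \frac{\text{Total Time}}{\text{Total Time}} - \frac{\text{Uptime}}{\text{Total Time}} \\
&= 100\% - \frac{\text{Uptime}}{\text{Total Time}} \\
&= 100\% - \text{Uptime\%}
\end{aligned}
$$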

Uptime% and Downtime% are unitless measures, as the units cancel. Both are relative to the total time.

What if, instead, we look at the ratio of uptime to downtime? For brevity I will call this the "Uptime Ratio".

$$ \text{Uptime Ratio} = \frac{\text{Uptime}}{\text{Downtime}} $$

We can massage the equation a bit to express Uptime Ratio as a function of Uptime%:

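$$
\begin{aligned}
\text{Uptime Ratio} &= \frac{\text{Total Time} - \text{Downtime}}{\text{Downtime}} \\
&= \frac{\text{Total Time}}{\text{Downtime}} - \frac{\text{Downtime}}{\text{Downtime}} \\
&= \frac{\text{Total Time}}{\text{Downtime}} - 1 \\
&= \frac{1}{\text{Downtime\%}} - 1 \\
&= \frac{1}{1 - \text{Uptime\%}} - 1
\end{aligned}
$$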

Like Uptime%, Uptime Ratio is a unitless value, but instead of representing the portion of the time that a service is up out of the total time, it represents an average of how many units of time a service is up for every unit of time a service is down.

With an Uptime Ratio of 1, you'd expect the service to be down for one minute for every minute it's up (which is not a very good service).

Looking at Uptime Ratios for the "nines", an interesting pattern emerges:

Nines   Uptime    Uptime Ratio (approx)
2       99%       99
3       99.9%     999
4       99.99%    9999
5       99.999%   99999
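The pattern in the table can be reproduced with a few lines of Python. This is only a rough sketch: the function name `uptime_ratio` is my own, and it takes Uptime% as a fraction of 1 (so 99.9% is passed as "0.999"). It uses exact fractions so that floating-point rounding doesn't blur the pattern.

```python
from fractions import Fraction


def uptime_ratio(uptime_fraction):
    """Uptime Ratio = 1 / (1 - Uptime%) - 1, with Uptime% given as a fraction of 1.

    Accepts anything Fraction() accepts; decimal strings such as "0.999"
    are converted exactly. Returns float("inf") at 100% uptime.
    """
    u = Fraction(uptime_fraction)
    if not 0 <= u <= 1:
        raise ValueError("uptime must be between 0 and 1")
    if u == 1:
        # No downtime at all: the ratio is unbounded.
        return float("inf")
    return 1 / (1 - u) - 1


# Print the Uptime Ratio for two through five nines.
for nines, uptime in [(2, "0.99"), (3, "0.999"), (4, "0.9999"), (5, "0.99999")]:
    print(nines, uptime, uptime_ratio(uptime))
```

Passing the uptime as a string rather than a float is a deliberate choice here: a float like 0.999 is not stored exactly, so the result would come out slightly off from 999.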

Unfortunately, graphing uptime ratios runs into an issue at 100% uptime because the ratio becomes infinity.

A line chart showing GitHub's average uptime ratio from 2016 to 2025 with the values off the chart before the Microsoft acquisition followed by values in the 30,000 range or lower with periodic asymptotes indicating when uptime was at 100%.

We could massage the data further by changing the units of the ratio from 1:1 to something else. Different scales might make sense for different levels of availability, such as months per minute (1/43200), or days per second (1/86400).

We can then also choose an arbitrarily large value as our maximum, as a stand-in for infinity. In the following example, which uses a months per minute adjustment, the maximum value used is 30, which implies that those months may have up to two seconds of downtime.

The same data as the previous image, but with the y scale normalized to months per minute, and a maximum value of 30. Before the MS acquisition the values are mostly 30. After the acquisition the values are under 2 and often less than 1.
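For the curious, here is roughly how the months per minute scaling and the cap of 30 could be computed. Treat it as a sketch under my own assumptions: a 30-day month, downtime measured in seconds, and a hypothetical function name `uptime_months_per_minute`.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month
MINUTES_PER_MONTH = 30 * 24 * 60       # 43,200, the 1/43200 scale factor


def uptime_months_per_minute(downtime_seconds, cap=30.0):
    """Uptime Ratio for one month, rescaled to months of uptime per minute of downtime.

    Months with no downtime would be infinite, so they are clamped to `cap`.
    """
    if downtime_seconds <= 0:
        return cap
    uptime_seconds = SECONDS_PER_MONTH - downtime_seconds
    ratio = uptime_seconds / downtime_seconds  # unitless Uptime Ratio
    return min(ratio / MINUTES_PER_MONTH, cap)


# A perfect month, a month with two seconds of downtime, and a month with an hour.
for seconds in (0, 2, 3600):
    print(seconds, uptime_months_per_minute(seconds))
```

Two seconds of downtime lands just under the cap, which is why a plotted value of 30 can hide up to about two seconds of downtime in a month.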

Even with data massaging, the asymptotic nature of values representing 100% uptime makes this visualization less than ideal. Some improvements might be had by switching to a logarithmic plot, but ultimately I think we can do better.

What if, instead of Uptime Ratio, we looked at Downtime Ratio?

$$ \text{Downtime Ratio} = \frac{\text{Downtime}}{\text{Uptime}} $$

Just like Uptime Ratio, we can massage the equation to express Downtime Ratio as a function of Uptime%:

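$$
\begin{aligned}
\text{Downtime Ratio} &= \frac{\text{Total Time} - \text{Uptime}}{\text{Uptime}} \\
&= \frac{\text{Total Time}}{\text{Uptime}} - \frac{\text{Uptime}}{\text{Uptime}} \\
&= \frac{\text{Total Time}}{\text{Uptime}} - 1 \\
&= \frac{1}{\text{Uptime\%}} - 1
\end{aligned}
$$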

Like Uptime%, and Uptime Ratio before it, Downtime Ratio is a unitless value, and represents the average of how many units of time a service is down for every unit of time the service is up.

And, just like Uptime Ratio, we can scale the values by different units in order to produce a meaningful visualization.

Nines   Uptime    Downtime Ratio
2       99%       7h16m/month
3       99.9%     43m/month
4       99.99%    4m19s/month
5       99.999%   26s/month
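If you want to check the table yourself, here is a small sketch. The names `downtime_ratio`, `downtime_seconds_per_month`, and `format_duration` are mine, the month is assumed to be 30 days long, and Uptime% is passed as a decimal string so the arithmetic stays exact.

```python
from fractions import Fraction

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def downtime_ratio(uptime_fraction):
    """Downtime Ratio = 1 / Uptime% - 1, with Uptime% given as a fraction of 1."""
    u = Fraction(uptime_fraction)
    if not 0 <= u <= 1:
        raise ValueError("uptime must be between 0 and 1")
    if u == 0:
        # Never up: the ratio is unbounded.
        return float("inf")
    return 1 / u - 1


def downtime_seconds_per_month(uptime_fraction):
    """Scale the unitless ratio to whole seconds of downtime per month of uptime."""
    return round(downtime_ratio(uptime_fraction) * SECONDS_PER_MONTH)


def format_duration(seconds):
    """Render a number of seconds as e.g. '7h16m22s'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h{minutes}m{secs}s"


for uptime in ("0.99", "0.999", "0.9999", "0.99999"):
    print(uptime, format_duration(downtime_seconds_per_month(uptime)))
```

The results agree with the table once you drop the smaller units, for example about 7 hours 16 minutes per month at two nines.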

The benefit of Downtime Ratio is that 100% uptime becomes a Downtime Ratio of 0. 100% downtime does turn into infinity, but anyone experiencing 100% downtime has more important things to worry about than how they're reporting their data.

A line chart showing GitHub's average downtime ratio, expressed in units of minutes per month, from 2016 to 2025. Values are mostly 0 before the Microsoft acquisition followed by values oscillating in the 20-200 range.

This is how I'd prefer to talk about uptime.

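It's not significantly different from just plotting the actual number of minutes of downtime each month (or seconds each day), but by using the ratio, downtime is weighted more heavily the longer it lasts, because every minute of downtime is also a minute taken away from the uptime in the denominator.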
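To get a feel for how much heavier that weighting is, here is a quick comparison of raw downtime minutes against the Downtime Ratio scaled to minutes per month. As before, this is a sketch with my own assumptions: a 30-day month and a hypothetical helper named `scaled_downtime_ratio`.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def scaled_downtime_ratio(downtime_minutes, total_minutes=MINUTES_PER_MONTH):
    """Downtime Ratio (downtime / uptime) expressed as minutes per month."""
    uptime_minutes = total_minutes - downtime_minutes
    if uptime_minutes <= 0:
        # Down for the whole period: the ratio is unbounded.
        return float("inf")
    return downtime_minutes / uptime_minutes * total_minutes


# Raw downtime next to its ratio-based equivalent.
for minutes in (1, 60, 600, 6000, 21600):
    print(minutes, round(scaled_downtime_ratio(minutes), 2))
```

For short outages the two numbers are nearly identical, and the gap only becomes noticeable once downtime is a meaningful share of the month. At half a month of downtime the ratio reports a full month.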

You can go further by inverting the range on the graph, since the Downtime Ratio represents a negative effect on the system.

The same line chart as the previous image, but with the y-axis flipped so that the range is 220 to 0 rather than 0 to 200

And suddenly all that math and data massaging has led us right back to where we began, with the same basic graph, telling the same story, except this version has better units to provide understandable context to the reader.