The perfect CDN would be able to send an unlimited amount of data with very little latency to every HTTP client that is connected to its servers no matter where those clients are physically located. Obviously, there is no perfect CDN just as there is no perfect wireless operator that can give your phone 5 bars of service no matter where you travel in the world.
All we can do is measure how different CDNs perform (in terms of throughput and latency) in different regions of the world and track this performance over time. But measuring CDN performance can be confusing. There are different ways to measure throughput and latency and different ways to interpret the results. This blog post will discuss how the conclusions we reach regarding CDN performance can change depending on the lens we use to view the data collected.
How to measure CDN performance and what are p95 and p99?
Like most things in life, CDN performance can be analyzed using a normal distribution.
To create this bell curve, we measure latency and throughput for a group of users at regular intervals over a period of time. We then sort the observed latency and throughput values from the best to the worst (lowest latency to highest, and highest throughput to lowest). When we’re done sorting, we’ll have a collection of “good scores” and another collection of “poor scores”. If we have a sufficiently large number of observations, roughly 68.3% of the scores will fall within one standard deviation (sigma) of the mean for each metric. In layman’s terms, most of the scores are somewhere in the middle or, put another way, most students are average :-)
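As a sketch of what p95 and p99 actually mean, here is the nearest-rank percentile method applied to a batch of made-up latency samples (the numbers below are purely illustrative):

```python
# Hypothetical latency samples (ms) collected at regular intervals.
samples = [42, 38, 51, 47, 40, 39, 45, 250, 44, 41,
           43, 46, 48, 39, 42, 44, 50, 37, 41, 300]

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)                 # "good scores" first, "poor scores" last
    k = max(1, -(-len(ordered) * p // 100))  # ceiling of p% of n (1-based rank)
    return ordered[k - 1]

print(percentile(samples, 50))  # the score in the middle
print(percentile(samples, 95))  # the tail: 95% of samples sit at or below this value
```

Note how the two slow outliers (250 ms and 300 ms) barely influence the median but dominate the p95 - which is exactly why the tail percentiles matter.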
So, the question is: do we focus more on the scores in the middle or the scores on the edges? The answer requires a little more context.
Generally speaking, there are two types of optimization problems for any service connected to the Internet:
A major, critical problem: Most users will experience this type of problem. For developers, these are usually easier to detect and reproduce. If most users experience this problem, then chances are that a developer can open a browser and see the problem locally right away.
A local, specific problem (also known as an “anomaly”): For instance, what if there’s a new version of Firefox that some users have which is causing a bug? It might “only” be 7% of the users, but they experience terrible performance, and it’s harder to reproduce if we don’t already know that it’s just related to the newest version of Firefox.
Unfortunately (or fortunately), the “low-hanging fruit” problems all get solved at some point, and improving performance gets harder and harder. Often, that doesn’t mean that performance is actually good. User expectations are very high, and there’s more impatience and churn than ever before. That’s why many companies are forced to go through the Sisyphean process of detecting and fixing anomalies that can be elusive and tricky. It can be a bug in the video player, a problem in the encoder, or an error in the ad server or the CDN.
When looking at performance metrics such as rebuffering, abandonment, latency, or throughput, the picture can be misleading: median or mean values (the scores in the middle) will drop when there is a critical problem, but not necessarily when there is a local problem. Since smaller-scale problems are more common but trickier to detect, it’s important to look at the 95th and 99th percentiles (p95, p99), as these are the values that will move when local problems pop up.
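To see why the tail moves when the middle doesn’t, here is a small simulation with made-up latency numbers: 7% of users (say, one affected region) degrade badly, the median barely changes, but the p95 jumps dramatically.

```python
import random

def percentile(values, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    k = max(1, -(-len(ordered) * p // 100))  # ceiling of p% of n
    return ordered[k - 1]

random.seed(7)

# Baseline: 10,000 users, all with healthy ~50 ms latency.
baseline = [random.gauss(50, 5) for _ in range(10_000)]

# Local problem: the same population, but ~7% of users degrade to ~800 ms.
degraded = [random.gauss(800, 50) if random.random() < 0.07 else random.gauss(50, 5)
            for _ in range(10_000)]

print(f"median: {percentile(baseline, 50):.0f} ms -> {percentile(degraded, 50):.0f} ms")
print(f"p95:    {percentile(baseline, 95):.0f} ms -> {percentile(degraded, 95):.0f} ms")
```

Because the affected 7% is larger than the 5% tail that p95 measures, the p95 lands squarely inside the degraded group, while the median still reflects the healthy majority.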
By keeping a close eye on the p95 and p99 for different metrics, the best streaming companies ensure that their service works great for almost everyone, not just the majority of their users. After all, you wouldn’t want even 1% of your users to experience poor quality if it could have been prevented.
CDN performance is local
CDNs rarely fail completely or get congested on a global scale. On the other hand, they frequently suffer local outages which force them to “re-route” traffic. These local outages are caused by a variety of factors, including (a) the need to upgrade hardware, (b) congested local links, or (c) software bugs. This happens every day and users in the affected regions experience serious performance degradation.
Let’s see a real-world example of how a local problem affects the different percentiles. The following charts show throughput and latency for 10 different CDNs as measured on March 25th.
When looking at the median (50th percentile) throughput chart, there is no obvious problem:
When looking at the 75th percentile throughput chart, the issue begins to present itself:
The p95 throughput chart makes it clear that the pink CDN had a local outage:
This particular outage was confined to Brazil, so it only affected a small percentage of the total user base.
Similarly, when we look at the 50th percentile latency chart, no obvious pattern emerges:
The 75th percentile latency chart reveals more information:
The p95 latency chart makes it quite obvious that all is not well:
Key takeaways

- When comparing CDNs, look at the 95th percentile performance figures rather than the 50th percentile figures, because they better represent which vendor is more stable and provides better performance in problematic regions.
- Anomaly detection will be an important subject in the coming years. The best companies already do it; you can do it too with relatively simple tools that we can help you deploy.
- Alerts and monitors on the 95th and 99th percentiles are encouraged.
- Intelligently switching to a different CDN during local outages can greatly improve the user experience for customers in the affected regions - that’s why we released the Peer5 MultiCDN.
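The alerting idea above can be sketched as a rolling-window check on the p95. The window size and 500 ms threshold below are hypothetical values for illustration, not recommendations:

```python
from collections import deque

def percentile(values, p):
    """Nearest-rank percentile of a collection of samples."""
    ordered = sorted(values)
    k = max(1, -(-len(ordered) * p // 100))  # ceiling of p% of n
    return ordered[k - 1]

class P95Alert:
    """Fire when the rolling p95 latency exceeds a threshold.

    Window size and threshold are made-up illustrative defaults.
    """
    def __init__(self, window=100, threshold_ms=500):
        self.samples = deque(maxlen=window)
        self.threshold_ms = threshold_ms

    def observe(self, latency_ms):
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # wait for a full window before judging
        return percentile(self.samples, 95) > self.threshold_ms

monitor = P95Alert()
for _ in range(100):
    fired = monitor.observe(50)   # healthy traffic: stays quiet
for _ in range(10):
    fired = monitor.observe(900)  # a local problem pushes the tail up
print(fired)
```

A median-based monitor with the same window would stay silent here, since 90 of the 100 samples are still healthy; watching the p95 is what surfaces the local problem.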