Reducing video rebuffering is hard. As discussed in our last blog post, monitoring your 95th percentile metrics can make it easier to detect anomalies, but what about prevention? One solution that many people are talking about these days is moving to a multi-CDN architecture. But will going multi-CDN magically reduce your rebuffering and cure all of your streaming ills? The answer, of course, is complicated, which is why we’ve written this article.
Going multi-CDN can provide several benefits for a broadcaster, such as better geographic coverage and, possibly, better economics. Adding live switching logic between the CDNs goes a step further and enables load balancing and redundancy in case of problems. But what are the problems that you’re likely to encounter? Let’s examine some common CDN problems and their impact.
Catastrophe & Chaos
Once in a blue moon, a CDN will experience a major outage that affects a large geographic area. These outages are so extreme that they shut down a large portion of the Internet for a non-trivial amount of time. Recent examples of such outages:
- June 2, 2019 - Google Cloud Platform multi-region outage lasting 4 hours
- July 24, 2019 - BGP Routing issue as described on Cloudflare’s Blog
Detectability: Very easy to detect as any metric that you care to measure will explode. Your alerts will fire or, if you don’t have alerts, you’ll get messages and phone calls from users.
Solution: A good CDN switching engine will attempt to re-route users to a working CDN within the limits of the outage.
Occasionally, a CDN will experience a local issue in one of its PoPs (Points of Presence), which means that all of the users routed through that specific PoP will have problems fetching video segments and will most likely experience rebuffering.
Detectability: Medium or hard depending on the percentage of users that are affected. A large portion of the traffic would skew the metrics enough to create an anomaly, while a small portion might get swallowed up within the geographic granularity of the monitoring system.
Solution: A good CDN will re-route the impacted traffic to a different PoP within its network, which will cover for the faulty PoP with best-effort performance. A good CDN switching engine will drop the faulty CDN altogether in that region and use a healthy one instead.
Smaller scale outages are so hard to detect that you might miss them entirely. Have your users experienced such an outage? The answer is most likely yes, since we see dozens of them as part of our daily monitoring efforts. We call these events “blind spots”.
The Blind Spots of CDN Switching
There are 3 main blind spots that server-side CDN switching engines do not address very well:
#1 The DNS Propagation Problem
“We already know there’s a problem, but we have to wait at least 5-10 minutes for DNS to propagate”
A common CDN switching implementation is based around DNS resolution. The DNS resolver incorporates a switching logic that responds with the best CDN at that given moment. If one of the CDNs in the portfolio experiences an outage or degradation, the DNS resolver will start responding with a different (healthy) CDN for the affected region.
The blind spot of this approach is DNS propagation time. From the moment the switching logic decides to change CDNs, it might take several minutes (or longer) until the majority of traffic has actually transitioned. Moreover, while most ISPs will honor the TTL (DNS response lifetime) defined by the DNS resolver, some will not, so the faulty CDN remains the assigned CDN for the users behind those ISPs. Rebuffering on existing sessions is inevitable, at least until the cached DNS record expires in the user’s browser.
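To get a feel for how the TTL bounds the switch, here is a minimal sketch of the share of users that is still pinned to the old CDN some time after the switching decision. It assumes that cached DNS records were fetched at evenly spread times, so they expire at a steady rate over one TTL window, and that a fixed share of users sits behind resolvers that ignore the TTL. The function name, the 300-second TTL and the 10% non-compliant share are illustrative assumptions, not measurements.

```python
def fraction_still_on_old_cdn(seconds_since_switch, ttl_seconds, non_compliant_share=0.0):
    """Estimate the share of users whose DNS cache still points at the old CDN.

    Assumes cache ages are spread evenly, so compliant caches expire linearly
    over one TTL window. Users behind resolvers that ignore the TTL
    (non_compliant_share) are treated as staying on the old CDN.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    compliant_stale = max(0.0, 1.0 - seconds_since_switch / ttl_seconds)
    return non_compliant_share + (1.0 - non_compliant_share) * compliant_stale


# Hypothetical 5-minute TTL with 10% of users behind non-compliant resolvers.
for elapsed in (0, 150, 300, 600):
    share = fraction_still_on_old_cdn(elapsed, ttl_seconds=300, non_compliant_share=0.1)
    print(f"{elapsed:>4}s after the switch: {share:.0%} still on the old CDN")
```

Under these assumptions roughly half of the users are still on the faulty CDN halfway through the TTL, and the non-compliant share never moves at all.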
#2 The Data Problem
“We select CDNs based on a synthetic test file, but real video delivery is much slower”
Any switching solution must implement a data feed that reflects the performance of the CDNs in different regions and from different ISPs. A common approach for gathering such data is to use test objects that are stored on all of the CDNs in the portfolio. The test objects are downloaded to users’ browsers, which then report back the performance that was observed.
Often, for various reasons, the test objects don’t represent the actual performance of the video resources. For example, imagine that the connection between the origin and the edge server is congested - the test object will not be impacted since it’s already warmed in the cache of the edge server and does not need to use the congested middle mile connection to the origin.
This performance gap can cause a CDN to be erroneously selected as the best one even though real delivery is slower. It’s also possible that the test objects and the actual video resources do not share the same CDN bucket configuration. If the video resources bucket is misconfigured, some users might get unoptimized or even faulty responses that are never detected, because the test objects bucket functions properly.
This disconnect between synthetic performance measurements and actual delivery performance often causes degraded performance that goes undetected for a very long time.
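One way to surface this disconnect is to compare the synthetic measurements with timings taken from real video segment requests. The sketch below flags CDNs whose median real-segment TTFB is much worse than the median test-object TTFB. The function name, the sample data and the 2x threshold are all hypothetical and would need tuning against real traffic.

```python
from statistics import median


def find_misleading_cdns(synthetic_ms, real_ms, max_ratio=2.0):
    """Return CDNs whose real-segment TTFB is far worse than the synthetic TTFB.

    synthetic_ms and real_ms map a CDN name to a list of TTFB samples in
    milliseconds. max_ratio is an assumed tolerance, not a recommended value.
    """
    flagged = []
    for cdn, synthetic_samples in synthetic_ms.items():
        real_samples = real_ms.get(cdn)
        if not synthetic_samples or not real_samples:
            continue  # not enough data to compare this CDN
        if median(real_samples) > max_ratio * median(synthetic_samples):
            flagged.append(cdn)
    return sorted(flagged)


# Hypothetical samples: cdn_b looks fine on the cached test object,
# but real segments travel over a congested middle mile.
synthetic = {"cdn_a": [100, 110, 120], "cdn_b": [90, 95, 100]}
real = {"cdn_a": [120, 130, 140], "cdn_b": [400, 450, 500]}
print(find_misleading_cdns(synthetic, real))
```

In this made-up data set cdn_b would win a selection based on test objects alone, while the real segment timings tell the opposite story.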
#3 The Granularity Problem
“We select CDNs based on overall performance in each region, but this stream is performing poorly for a subset of the users”
A typical CDN Switching flow might be:
- measure CDN performance across different regions
- report results to the server
- server chooses the “best” CDN
- users are assigned to the best CDN for their region
Unfortunately, not all regions have fresh performance data all the time and so a fallback logic is usually applied. When there isn’t enough data in a specific region, data from its greater containing region will be used instead.
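The fallback logic can be sketched roughly as follows. This is a simplified illustration rather than any particular vendor's implementation: the region names, the parent mapping, the 30-sample minimum and the use of mean TTFB as the score are all assumptions.

```python
from statistics import mean


def pick_cdn(region, samples, parents, min_samples=30):
    """Pick the CDN with the lowest mean TTFB, falling back to parent regions.

    samples maps region -> {cdn: [ttfb_ms, ...]} and parents maps a region to
    its containing region (None at the top). Returns (cdn, region_used).
    """
    current = region
    while current is not None:
        per_cdn = {cdn: v for cdn, v in samples.get(current, {}).items() if v}
        total = sum(len(v) for v in per_cdn.values())
        if per_cdn and total >= min_samples:
            best = min(per_cdn, key=lambda cdn: mean(per_cdn[cdn]))
            return best, current
        current = parents.get(current)  # too little data, widen the region
    raise LookupError(f"no region with enough data for {region!r}")


# Hypothetical data: cdn_a is failing locally, but only 3 local samples exist.
samples = {
    "small_town": {"cdn_a": [6000, 6500], "cdn_b": [800]},
    "state": {"cdn_a": [850] * 40, "cdn_b": [900] * 40},
}
parents = {"small_town": "state", "state": None}
print(pick_cdn("small_town", samples, parents))
```

With only three local samples, the decision for small_town is made on the state-level data, so the locally failing cdn_a is still selected.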
It’s possible that a region with a small number of users gets swallowed up by a larger fallback region, in which case an outage might never be detected at all because the affected users comprise a small portion of total traffic that is not enough to “move the needle”.
This is a data granularity problem. A broadcaster might have 100k users spread across 15 countries, 1,000 unique regions, 5,000 ISPs and a host of other parameters. Taken together, these parameters split the user base into millions of tiny segments, none of which has enough data to support meaningful switching decisions, not to mention the computational load that tracking them would place on the switching system.
For this reason, server-side switching is inherently limited to a more coarse grouping that is technically and mathematically viable. This reality creates a blind spot when it comes to smaller regions that might get hit by a local, undetectable outage.
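A back-of-the-envelope check using the figures above shows how thin the data gets. The calculation only crosses regions with ISPs and assumes that every combination can occur, so it is an upper bound on the number of segments rather than a measured count.

```python
users = 100_000
regions = 1_000
isps = 5_000

# Upper bound: treat every (region, ISP) pair as a possible segment.
segments = regions * isps
avg_users_per_segment = users / segments

print(f"{segments:,} possible segments")
print(f"{avg_users_per_segment} users per segment on average")
```

Even before adding devices, browsers or streams, the average segment holds a small fraction of a single user, which is why per-segment decisions on the server are not statistically meaningful.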
Real World Example of The Granularity Problem
Last week, an outage occurred in the USA that demonstrates the granularity problem.
The CDN outage caused a significant drop in request performance that, in turn, led to rebuffering. Starting at 7:20 AM, Time-To-First-Byte (TTFB) rose from an average of 850 ms to a peak of 6,700 ms.
When comparing the 95th percentile TTFB of the affected area to its greater containing region, it’s clear that the affected area didn’t constitute enough data to move the overall metric.
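The dilution effect is easy to reproduce with made-up numbers. In the sketch below, the healthy traffic has a TTFB of 800 to 899 ms and the affected area sits at the 6,700 ms peak mentioned above. The assumption that the affected area accounts for 1% of the containing region's samples is hypothetical, as is the nearest-rank percentile method.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-len(ordered) * pct // 100)  # integer ceiling of n * pct / 100
    return ordered[rank - 1]


# Hypothetical samples: 9,900 healthy requests and 100 from the affected area.
healthy_ttfb_ms = [800 + (i % 100) for i in range(9_900)]
affected_ttfb_ms = [6_700] * 100
region_ttfb_ms = healthy_ttfb_ms + affected_ttfb_ms

print("p95 affected area only:  ", percentile(affected_ttfb_ms, 95))
print("p95 region without outage:", percentile(healthy_ttfb_ms, 95))
print("p95 region with outage:   ", percentile(region_ttfb_ms, 95))
```

The regional 95th percentile shifts by about a millisecond, because the affected samples all land above the 95th percentile cut-off and never reach it.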
The greater containing region (blue) doesn’t show any anomalies throughout the outage. (lower is better)
Rebuffering spikes render the playback unwatchable. (lower is better)
While this chart might seem dull, it illustrates that throughout the entirety of the outage, no CDN switching took place for the affected region.
Enter: Per User CDN Switching
Video playback is a very fragile thing - a user might have just a couple of seconds of content buffered ahead and any slowdown in fetching segments can easily consume that buffer and freeze the playback. For this reason, we created our client-side switching feature which constantly monitors the playback experience for each individual user and is able to react to poorly performing CDNs within a split second (literally, milliseconds) and prevent rebuffering from ever happening. This means that even an outage that affects only one user will be accounted for.
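The general idea of per-request failover can be sketched as follows. This is a generic illustration and not the actual Peer5 implementation: the function names, the host names, the injected fetch callable and the rule that ties the timeout to half of the remaining buffer are all assumptions.

```python
class SegmentFetchError(Exception):
    """Raised when no CDN delivered the segment in time."""


def fetch_with_failover(segment_path, cdn_hosts, fetch, buffer_seconds, safety_factor=0.5):
    """Try each CDN in order and return (host, data) from the first success.

    fetch(url, timeout) is supplied by the caller and is expected to raise
    OSError (which includes TimeoutError) on failure. The per-attempt timeout
    is a fraction of the buffered playback time, so a slow CDN is abandoned
    before the buffer runs dry.
    """
    timeout = max(0.1, buffer_seconds * safety_factor)
    errors = []
    for host in cdn_hosts:
        try:
            return host, fetch(f"https://{host}{segment_path}", timeout)
        except OSError as exc:
            errors.append((host, exc))  # remember the failure, try the next CDN
    raise SegmentFetchError(f"all CDNs failed for {segment_path}: {errors}")


# Hypothetical fetcher: cdn-a times out, cdn-b answers.
def fake_fetch(url, timeout):
    if "cdn-a" in url:
        raise TimeoutError(f"no response within {timeout}s")
    return b"segment-bytes"


print(fetch_with_failover("/live/seg42.ts", ["cdn-a.example", "cdn-b.example"], fake_fetch, 2.0))
```

Because the decision is made per request and per user, it does not depend on any regional aggregate having enough data to reveal the problem.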
Let’s examine the metrics during the outage described above with and without our client-side switching feature.
The TTFB of the client-side switching group (green) was affected as well, but much less than the other group. (lower is better)
The client-side switching group (green) experiences almost no interruption in playback. (lower is better)
As seen in the graph above, users that relied solely on server-side switching (red line) were impacted significantly compared to users with client-side switching. Server-side CDN switching was not granular enough to detect the local outage, so the assigned CDN for that region remained the same even though some users experienced terrible performance degradation. Client-side switching, with its per-user granularity, was able to change the mix of CDNs within the region and avoid the issue in real time. Rebuffering dropped from 11.2% (server-side only users) to 0.2% (client-side switching enabled users), and overall rebuffering in the region dropped by 70%, from 1% to 0.3%.
When CDNs experience outages, users will encounter rebuffering. There are multiple types of outages - some stay below the radar and are completely undetected while others will make your phone ring constantly. Different layers of redundancy and different levels of granularity are strategies to address the various outages that an online delivery pipeline might experience. A combination of several such strategies is likely to achieve the best UX.
Employing server-side switching alongside client-side switching allows us to:
- Reduce rebuffering by monitoring video playback constantly for all users
- Respond very quickly to outages by switching CDNs at a per-request level
- Improve bitrate and quality by increasing the granularity of CDN selection to a per-user level
Read more about the Peer5 MultiCDN