
Lessons in Designing Blast Radius The Hard Way; One Mistake Crashes Facebook For Hours

Facebook, Instagram, and WhatsApp are deeply integrated into many aspects of daily life for many communities and businesses. One networking misconfiguration reminded 3.5 billion users of just that.


The CBC Radio segment has been archived and is only available from CBC by request.

Facebook, Instagram, and WhatsApp are part of daily life for many. It’s hard to imagine how deeply until they go offline. That’s what happened on Monday, 04-Oct-2021, for the better part of a day.

It’s very rare for Facebook and its other properties to go offline. The last major outage was in 2019; before that, 2017, 2015, and twice in 2014. The same cause was at the root of each: a misconfiguration. Really, a mistake.

What Happened?

Due to a mistake, Facebook’s DNS (Domain Name System) and BGP (Border Gateway Protocol) services were removed from the internet.

This means that no device or user could look up domains like Facebook.com. When you try to access Facebook, your device asks a DNS server, “Where is Facebook.com?”

That DNS server replies with something like, “Oh, they’re at 157.240.241.35”

That’s the IP address that the device will then use to get to Facebook.com.
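
If you want to see that lookup in action, here’s a minimal Python sketch that asks your configured DNS resolver for Facebook.com’s address. The exact addresses printed will vary by region and over time, so treat the output as illustrative.

```python
import socket

# Ask the operating system's configured DNS resolver where facebook.com lives.
# The addresses returned depend on your location and Facebook's current setup.
results = socket.getaddrinfo("facebook.com", 443, proto=socket.IPPROTO_TCP)

for family, _type, _proto, _canonname, sockaddr in results:
    # sockaddr is (ip, port) for IPv4 or (ip, port, flowinfo, scope_id) for IPv6
    print(sockaddr[0])
```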

How does that request get there? Your internet service provider maintains a map of possible routes to any IP address on the internet. That map is constantly updated to account for changes in the networks. That’s BGP at work.

Big services and internet providers share BGP routes (the maps) in order to optimize how the internet works. That way each major service can say, “I’m here and here is how you can reach me.”

When Facebook’s engineering team made a mistake in their update, that information was removed. Users could no longer get the actual address of their systems (DNS) or the route to get to those systems (BGP).

Their mistake told the internet they didn’t exist.
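
As a rough illustration only (real BGP routers exchange far richer information than this), here’s a toy Python model of a routing table. The prefix and peer names are made up; the point is simply that withdrawing an announcement leaves the rest of the internet with no way to reach you.

```python
# Toy model of BGP announcements and withdrawals. This is not how routers
# are implemented; it only shows why withdrawing routes makes you unreachable.

routing_table = {}  # prefix -> next hop to reach it


def announce(prefix, next_hop):
    """'I'm here and here is how you can reach me.'"""
    routing_table[prefix] = next_hop


def withdraw(prefix):
    """Remove the route; the prefix effectively disappears from the map."""
    routing_table.pop(prefix, None)


announce("157.240.0.0/16", "peer-A")
print(routing_table.get("157.240.0.0/16"))  # peer-A

withdraw("157.240.0.0/16")                  # the misconfiguration, in effect
print(routing_table.get("157.240.0.0/16"))  # None: "they don't exist"
```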

The Impact

In addition to Facebook’s core services, Oculus VR headsets stopped working. Sites and apps that use Facebook login were unable to use that feature. Sites that use Facebook’s services for commenting no longer had commenting available.

The bigger impact? Businesses that use Facebook’s services—like Messenger and WhatsApp—to run their operations could not connect with their customers.

The New York Times coverage has more of these personal stories about the impact. It’s well worth a read to get a better feel for that perspective.

Overall Facebook has been tremendously successful at delivering its service. If you include partial outages, they’ve averaged just under 5 hours downtime per year since 2014. Most of that is associated with the outage in 2019 that lasted a full 24 hours.

That’s an “uptime” (the time where the service is working as expected) of 99.94%.

If we only look at complete outages, Facebook hits 99.98% uptime. That’s less than two hours of downtime per year. Very impressive for a system that provides services to over 3.5 billion users.
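
For a quick sanity check on those numbers, here’s a small Python calculation that converts an uptime percentage into downtime per year, assuming a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_hours_per_year(uptime_percent):
    """Hours of downtime per year implied by a given uptime percentage."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


print(f"{downtime_hours_per_year(99.94):.1f} hours")  # ~5.3 hours per year
print(f"{downtime_hours_per_year(99.98):.1f} hours")  # ~1.8 hours per year
```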

But any downtime can be significant when you run your business through Facebook. That mirrors the challenge that this outage highlights: the single point of failure.

The Lesson

For builders, the critical lesson here is about reducing the “blast radius.”

That’s the term we use for making sure that—when possible—a problem doesn’t impact other parts of a system. Or at least that it has the smallest impact possible. This is an expansion of the idea of a “single point of failure.”

When it comes to DNS and BGP, you can’t avoid them. They are critical to the way the internet works!

To mitigate the risk of a single point of failure for DNS, you use more than one hosting server. It’s not uncommon to use 3 or 4 different services—hopefully in different geographies—to host DNS records for a domain.
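
As a quick way to check how many name servers answer for a domain, here’s a sketch using the third-party dnspython library (an assumption; install it with pip). The domain is a placeholder; a redundant setup will typically show several servers, ideally spread across providers and geographies.

```python
# Requires the third-party dnspython package: pip install dnspython
import dns.resolver

# List the authoritative name servers for a domain (example.com is a placeholder).
answers = dns.resolver.resolve("example.com", "NS")

for record in answers:
    print(record.target)
```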

You take a similar approach with BGP. You typically have more than one place on your network that provides those routes to the internet.

Both of these services are critical. There are a lot of popular, well run DNS services you can take advantage of. But, if you’re running at a large enough scale, you have to deal with BGP.

These core services are critical to a working internet. They have redundancies built into their protocols to avoid issues like this outage.

That leads to a logical question, “Is there anything that builders can do to mitigate the risk of changes to these key resources?”

Yes.

This is a situation where some technical controls and a lot of process are the right call.

On the technical side of things, make it hard to change these items.

Lock down the permissions required to change DNS and BGP settings. Take the extra step of making sure that only specific “DNS-change-marknca” and “BGP-change-marknca” accounts or roles are able to make changes to these critical records.
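
The details depend on your provider’s access controls, but the shape of the idea looks something like this hypothetical role check. The role names mirror the examples above; in practice, this kind of restriction lives in your cloud provider’s or registrar’s permission system, not in application code.

```python
# Hypothetical sketch: only dedicated roles may touch DNS or BGP settings.

ALLOWED_ROLES = {
    "change_dns": {"DNS-change-marknca"},
    "change_bgp": {"BGP-change-marknca"},
}


def can_apply(role, action):
    """Return True only if the role is explicitly allowed to perform the action."""
    return role in ALLOWED_ROLES.get(action, set())


print(can_apply("DNS-change-marknca", "change_dns"))  # True
print(can_apply("day-to-day-admin", "change_dns"))    # False: explicit effort required
```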

Is that a pain for administrators? 100%. That’s the point.

That’s the opposite of the normal approach to operations but it makes sense here. You want to make sure that no one on the team could, “Whoops!” you off the internet.

You must have a clear process for making these types of changes. That process should require more than one person to complete. That way, you can make sure that every change is reviewed at least twice from a design perspective. You can also separately review the change as it’s being made. A.k.a., “Did I type this in properly?”

DNS and BGP are an edge case where you can’t just add more technical redundancy. You have to make sure that it takes explicit effort to make changes.

There are times when you can’t design around a single point of failure. This is when process and a healthy bit of paranoia can make sure that you don’t have a really, really bad day.
