
Can We Improve How Netflix Handled Failover Using DNS In 2017?

In late 2017, Netflix explained how they tackled the problem of failing over when disaster struck. Four years later, how well does that design hold up? What could we improve given the services and features available today?


In late 2017, Netflix did an AWS “This is My Architecture” video. It was one of the first. In the video they explained how they tackled the problem of failing over when disaster struck.

Specifically, they drilled down on a clever use of DNS records to create a very flexible system that allows services to continue in the event of some pretty significant disaster.

Few companies operate at Netflix’s scale (even their scale back then), but this technique is broadly applicable to any team that needs to fail over in the event of a disaster or even a simple failure.

Now, a few years later, I react to that video to see what’s stood the test of time and what could be done more simply with today’s technology, and I critique the design against the AWS Well-Architected Framework.

The AWS Well-Architected Framework

The AWS Well-Architected Framework is designed to help you and your team make informed trade-offs while building in the AWS Cloud. It’s built on five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization.

These pillars cover the primary concerns of building and running any solution. And as much as we’d all love to have everything, that’s just not possible.

…enter the framework.

It’ll help you strike the right balance for your goals to make sure that your build is the best it can be now and moving forward.

Why Architecture?

I often get asked why I talk about building in the cloud and architectural choices so often…aren’t I a security person?

Yes, I do focus on security and architecture is a critical part of that.

There are really two types of security design work. The first is when you’re handed something and need to make sure the risks of that technology match the risk appetite of the users.

The second type is when you’re building the technology. This is where making choices informed by security early in the process can have profound effects. You’re no longer bolting security on but building it in by design.

That’s why I talk about architecture and building so much. It’s where we all can have the largest possible security impact!

This video—and the ones that will come after—looks at a specific set of design decisions and how they balance the concerns of the AWS Well-Architected Framework…where security is one of the five pillars.

Netflix’s Design

Netflix operates at a massive scale and has for years. They run concurrently in multiple regions, and that should make the failure of one region easier to handle.

In this video, Netflix explains how they leverage the DNS (domain name system) to make failing over various services easier for the team. They use alias records and a smart domain tiering structure to give themselves flexibility for failover and restoration.
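To make the tiering idea concrete, here is a minimal sketch in Python. It is an in-memory model only, not Netflix’s actual zone layout and not the Route 53 API. Every name in it (the `example.com` records, `resolve`, `fail_over`) is hypothetical and chosen for illustration. The assumption is a stable top-level name that aliases to a regional name, which in turn aliases to a regional load balancer.

```python
from typing import Dict

# Hypothetical alias table: each name points at the next name in the chain.
# Names that are not keys are treated as terminal endpoints
# (for example, a regional load balancer hostname).
ALIASES: Dict[str, str] = {
    "api.example.com": "api.us-east-1.example.com",            # tier 1: name clients use
    "api.us-east-1.example.com": "lb.us-east-1.example.net",   # tier 2: regional name
    "api.us-west-2.example.com": "lb.us-west-2.example.net",   # tier 2: regional name
}


def resolve(name: str, aliases: Dict[str, str], max_hops: int = 8) -> str:
    """Follow alias records until reaching a name with no further alias."""
    seen = []
    while name in aliases:
        if name in seen or len(seen) >= max_hops:
            raise ValueError(f"alias loop or chain too long at {name!r}")
        seen.append(name)
        name = aliases[name]
    return name


def fail_over(aliases: Dict[str, str], failed: str, healthy: str) -> Dict[str, str]:
    """Return a new table in which the failed regional name aliases to a healthy one.

    The original table is left untouched, so restoring service is just a
    matter of switching back to it.
    """
    updated = dict(aliases)
    updated[failed] = healthy
    return updated


normal_target = resolve("api.example.com", ALIASES)

evacuated = fail_over(
    ALIASES,
    failed="api.us-east-1.example.com",
    healthy="api.us-west-2.example.com",
)
failover_target = resolve("api.example.com", evacuated)
restored_target = resolve("api.example.com", ALIASES)
```

The point of the extra tier is that failing over touches one record in the middle of the chain. The name clients use never changes, and restoring the region means putting that one record back.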

It’s a very clever solution that still holds up years later. Watch the video 👆 for more of the details!

Btw, I’ve updated my course, “Mastering The AWS Well-Architected Framework” on A Cloud Guru. If you want a solid walk through of the ideas behind the framework and how to apply it to your work in the AWS Cloud, check it out!
