Canadians Are Reliant on Rogers Whether We Like It or Not


I spoke with Hallie Cotnam on CBC Ottawa Morning on 07-Aug-2022 about this issue. Have a listen 👇

On Friday, 08-Jul-2022, the Rogers network suffered a massive outage. Rogers is a major ISP and cellular provider in Canada. Just how massive might surprise anyone not living here. They have 35% of the national market share for mobile connections and 30% of all Canadian home internet connections.

On top of that, they have 2.25 million retail internet customers and another 7,000 enterprise customers.

Over a third of the country is online because of Rogers. Over a third of the country went dark for the entire day.

Much has been made of the outage (just check the references section at the end of this post) but when you wade through all of the opinions, it appears that the issue was the result of one mistake.

It's the type of mistake that keeps network engineers and operations teams up at night. A simple misconfiguration that threads the wrong needle and is extremely difficult to roll back.

Cloudflare has a great summary of the issue as seen from the internet.

Cloudflare BGP data showing Rogers network drop off the internet on the day of the outage, 08-Jul-2022


👆 that big cliff? That's not good.

Network Access

Most people will never see the inside of a data centre, including a lot of that network's engineers. Most of the work is done remotely. That requires a secure access path into systems that can update the network resources in question.

Care to guess where simple mistakes escalate out of control?

If you said, "Remote access and update configurations?", you win! …and by that, we all lost on July 8th.

Someone, somewhere made a simple mistake that apparently closed much-needed update pathways and took most of the network offline.

How? These types of changes usually have both technical and process guardrails in place but they aren't infallible. Mistakes still make it to production. It happens…thankfully rarely.

The good news? The root cause of the issue was probably located quickly.

The bad news? The issue had already taken enough of the network offline that bringing it back up presented its own, unique challenge.

While this network outage lasted almost 17 hours, all indications seem to point to the original issue being resolved reasonably quickly, with the rest of the time spent unravelling the nightmare of legacy systems.

Rogers took a lot of heat for this outage. Their stock dropped 1.17% on the day. But while it's easy to blame them, the reason the outage was so long was written in a thirty-year buildup of technical debt, business incentives, and the geographical challenges of the Canadian market.

Are There Any Takeaways?

Everyone impacted has called for change. The Government called Rogers, Bell, and others to the carpet to figure out how to prevent another outage this significant. Those efforts won't drive any significant changes.

Canada is just too big and our population is too small to have a diverse set of telecommunications providers. That's ok. We have reasonable, if expensive, coverage today. We need significantly better coverage in the Territories and some rural areas but most Canadians have access to reasonably fast internet.

Do we need change in this sector? Yes.

Lower costs would help. So would regulation that prevents the bundling of multiple services (discounts for more services from one provider), a practice that pushes Canadians to put all of their eggs in one basket. So would subsidized access for rural and northern areas.

But at the end of the day, this massive outage was from a mistake. A mistake that happened despite technical and process safeguards. Why? Because 💩 happens. 🤷

Thoughts On The Day

A Twitter thread from me on the day with my initial reactions:

References
