The Help Desk Is a Key to Incident Response

First of all, no one working a help desk is that 👆 happy. You should aim to help them be that happy, but this is not an expression you are apt to see on the help desk. If you do see this, back away very, very slowly.

I recently experienced an outage from my internet provider, Bell Canada. My cell provider was still up and, thanks to a very generous mobile data allowance, I was still able to get some work done over the 9-hour outage of my primary fibre connection.

I tweeted about it a few times.

...still down https://t.co/BIMK8Tv8Jw— Mark Nunnikhoven (@marknca) August 7, 2020

In fact, I fired off a bit of a tweet rant at one point when I was on a call with support because I simply couldn’t believe what I was hearing.

deleted the rant tweets about @Bell_Support as they weren’t helping anyone, me included...to easy of a dunk 🏀

maybe a blog post about how to help front line support if your org is having an issue?#BellDown— Mark Nunnikhoven (@marknca) August 7, 2020

I ended up deleting those tweets because they didn’t help anyone…but I think this post might.

#BellDown

The timeline for this incident is very straightforward. At 11:20 am, a major part of the Bell and Telus network went down. This impacted all services: cellular, internet, and TV.

At about 1:20 pm, the issue was resolved and the network was restored.

...at least that’s what the official support account and the media statement said:

Some customers in parts of Eastern Ontario and Québec experienced brief service disruptions earlier today. Service has now been restored. Our team is continuing to investigate the cause.— Bell Support (@Bell_Support) August 6, 2020

After a bit of digging on Twitter and in some forums, that statement didn’t hold up. Customers all over the province were still experiencing issues, and calls to the help desk were taking longer and longer, if you could get through at all.

I called in at 3:20 pm and spent an hour on hold. When I finally reached a support team member, the answer I got was simple, “We’re still trying to figure it out.”

That contradicted the public statement but at least it was honest.

When service wasn’t restored by 9:05 pm, I finally called back in. I had periodically tried rebooting my router during the outage to ensure it was attempting to reconnect. After rebooting it again to a failure state, I figured I should call in.

This time, it only took a couple of minutes to reach someone but that’s when things got really interesting.

Keeping the Front Line Informed

The first level support team member I spoke to said the network was fine and no services were currently experiencing outages.

I informed them that wasn’t the case and in fact, the outage that was so bad it had landed in the news earlier that day still had customers offline.

This was news to the team member.

Not that some customers were still offline (though that too was news to them) but that there was a massive outage earlier in the day.

What!?! How is that possible?

I ended up getting pushed to second level support and, after they manually reset my connection in the backend, my service was restored. During this process, I had the chance to chat with the second level team member to get some clarity about what was happening behind the scenes.

Shift Change

It turns out that the team member I spoke to during the afternoon was on shift during the outage so they knew about it. The people I spoke to in the evening had just come on shift and had no idea it had happened because it wasn’t an active incident.

Apparently, once an incident is resolved, it drops out of their active data feed after a short amount of time. They would have to manually seek out the information in order to figure out that a massive chunk of the network had failed earlier that day.

While that would be a good practice to start their shift, they don’t have time to do that type of research. They are expected to log in and start processing customer requests ASAP.

This setup hangs them out to dry.
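One way to close that gap is to build the lookup into the start of the shift. Here's a minimal sketch of the idea, not a description of Bell's actual tooling: the `Incident` record, the `shift_briefing` function, and the 12-hour lookback window are all hypothetical names and values chosen for illustration. The briefing keeps active incidents and adds any that were resolved recently enough that customers might still be calling about them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; a real feed will look different."""
    title: str
    started_at: datetime
    resolved_at: Optional[datetime] = None  # None means still active


def shift_briefing(
    incidents: List[Incident],
    shift_start: datetime,
    lookback: timedelta = timedelta(hours=12),  # assumed window, tune to taste
) -> List[Incident]:
    """Return the incidents an agent should see when their shift starts.

    Includes every active incident plus any incident resolved within
    the lookback window before the shift, newest first.
    """
    cutoff = shift_start - lookback
    briefing = []
    for incident in incidents:
        still_active = incident.resolved_at is None
        recently_resolved = not still_active and incident.resolved_at >= cutoff
        if still_active or recently_resolved:
            briefing.append(incident)
    return sorted(briefing, key=lambda i: i.started_at, reverse=True)


# Illustrative data only: an evening shift on the day of a midday outage.
feed = [
    Incident("Billing portal slow", datetime(2020, 8, 4, 9, 0), datetime(2020, 8, 4, 10, 0)),
    Incident("Regional network outage", datetime(2020, 8, 6, 11, 20), datetime(2020, 8, 6, 13, 20)),
]
for incident in shift_briefing(feed, shift_start=datetime(2020, 8, 6, 20, 0)):
    print(incident.title)
```

With a briefing like this, the evening shift would see the midday outage even though it is flagged as resolved, while the two-day-old billing issue stays out of the way.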

Jumping to Conclusions

Based on my experience, once an incident’s root cause has been determined and the core service restored, incidents tend to be flagged as “resolved.” Why? Most organizations’ metrics are based around the amount of downtime an incident causes.

This naturally drives teams to resolve incidents as quickly as possible.

Nothing wrong with that, right?

Well, there is. Downtime as a metric tends to push teams to resolve the big issue and declare victory too early. It’s not malicious or even deliberate; it’s just a result of what they’re measuring.

A much better metric is “time to restore full service”. This means that you’ve verified that the root cause and all downstream impacts have been resolved. In this case, that means fixing the network and then verifying that each customer is back online.

That’s more work but also more representative of the customer perspective. Good metrics should drive the customer experience, not the Help Desk managers.
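To show how different the two numbers can be, here's a rough sketch that computes both for the same incident. The function names and the per-customer timestamps are assumptions made up for this example; only the 11:20 am start and 1:20 pm "resolved" times come from the timeline above.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def downtime(started_at: datetime, core_restored_at: datetime) -> timedelta:
    """The usual metric: incident start until the core service is fixed."""
    return core_restored_at - started_at


def time_to_restore_full_service(
    started_at: datetime,
    customer_restored_at: Dict[str, Optional[datetime]],
) -> Optional[timedelta]:
    """Incident start until the last affected customer is verified online.

    Returns None while any customer is still unverified (value is None),
    or when no customers have been tracked at all, because the incident
    cannot be called fully restored in either case.
    """
    times = list(customer_restored_at.values())
    if not times or any(t is None for t in times):
        return None
    return max(times) - started_at


started = datetime(2020, 8, 6, 11, 20)
core_fixed = datetime(2020, 8, 6, 13, 20)

# Hypothetical verification times for three affected customers.
customers = {
    "customer_a": datetime(2020, 8, 6, 13, 25),
    "customer_b": datetime(2020, 8, 6, 15, 0),
    "customer_c": datetime(2020, 8, 6, 21, 30),
}

print("Downtime:", downtime(started, core_fixed))
print("Time to restore full service:", time_to_restore_full_service(started, customers))
```

With these made-up numbers, the first metric reports two hours while the second reports just over ten. The second one also refuses to produce a number at all until every affected customer has been checked, which is the behaviour you want from a metric that is supposed to reflect the customer's experience.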

Here’s How Bell Could Do Better

When an outage is so big that the media reports on it, that’s a black eye for the company. There is massive pressure to report a small amount of downtime and to declare a quick victory. That’s what Bell appears to have done for this incident.

The problem with that is the only publicly available information to customers was that the issue was resolved. That was the statement to the press and on Twitter. When a customer sees this and also sees that their service hasn’t been restored, they assume it’s a new issue.

They then call into the Help Desk and quickly overwhelm that outlet. This causes a great deal of frustration because the usual preemptive messages like, “We’re currently experiencing a large outage and working quickly to resolve it” aren’t in place to inform customers and reduce support calls.

If customers manage to hang on and speak to someone, they get pushed through a frustrating troubleshooting script that has no hope of resolving the issue. This isn’t fair to the customer or the Help Desk.

The desire for positive public perception actually made the situation a lot worse for the people the company should care the most about: their customers!

What Can You Do?

If your organization is experiencing an incident, whether it’s operational or security-based, one of the first calls your response team needs to make is to the Help Desk.

This is the team that is going to take the brunt of the outcry but they are also your best olive branch to affected customers (internal or external). Being honest is key here. An incident is not the time to get “cute” with your phrasing.

The Help Desk can help you set customer expectations.

Direct, clear language is best…even if you don’t know what’s going on. Provide regular updates. Yes, even if those updates are, “We’re still investigating.” Just knowing that someone is looking into the issue goes a very long way to build and maintain trust with your customers.

It’s a small amount of effort that goes a long way. Don’t let concern about appearance impact the level of service you provide. Ironically, it’s reacting to those worries that will have the biggest negative impact on how your organization is perceived.

Yes, it’s difficult to admit there’s a problem. It’s even harder to admit you don’t know what it is.

But being honest about an issue and clearly communicating throughout the process can turn that failure into a positive example of how your organization can help.