Adding Guardrails To A Cloud Account After The Fact
The best communities generate good discussions that spark ideas in others. Skycrafters is just starting out, but those discussions are already happening. After posting “A New Cloud Account and It’s Full Speed Ahead, Right?” the other day, @marykay25 replied with this comment:
“This sentence could be a topic all on its own.”
👆 Yes, it could be, and this post will get that ball rolling. Why? Because when I read @marykay25’s comment, I knew they were spot on.
What I described in the original post is the ideal setup. It’s definitely easier to create that configuration in a brand new environment.
And while I don’t agree that these ideal setups are the exception, this again highlights the strength of the community. @marykay25 has different experiences than I have had and deals with a different set of teams.
That said, I have picked up a few tips along the way to implement the same principles within well established accounts.
With a brand new account, your initial configuration sets the tone. In the original post, I laid out how to put a strong security spin on that initial tone.
With existing accounts, the challenge is twofold.
The first is with the team. The team working with that account will already be used to operating under the existing configuration. They’ve been doing it this way for a while and things are working, so there’s really no direct motivation to change.
The second challenge is on the technical side. Can these guardrails be implemented without breaking anything inside the active account? What level of testing will be required? How much work is involved overall?
Boiling it down, this is a security feature request that needs to be prioritized. How can we approach this challenge?
Getting The Team Onboard
Everyone wants their systems to be more secure. But security is just one of the pillars of building well in the cloud. When faced with deploying a new feature that directly helps customers or deploying security guardrails that may help in the future, it’s hard to argue against the customer.
That’s completely understandable and one of the key reasons the centralized security monitoring structure is so hard to put in place in an environment that is already working.
The story usually proceeds like this:
- Security determines they need visibility into every account now
- Security decrees from on high that this work must be done immediately for “compliance” reasons
- A few teams comply grumpily, others dig their heels in and slow down the work
No one likes being told they have to drop their work and do different work that doesn’t directly advance their goals. This is squarely on the security team’s shoulders. They need to adjust their approach.
Until they do, let’s look at this from your team’s point of view. How can centralized security monitoring and audit help you meet your goals?
As much as auditing sounds scary, it’s really just having someone double check your work. If you’re able to get feedback (preferably automated) that your workloads are configured in a strong manner, isn’t that a positive thing?
Similarly, while centralized monitoring always has challenges with context, having another team looking for security issues can add a layer of assurance that your team hasn’t missed anything. Additionally, centralized monitoring can have added benefits, like spotting larger patterns that aren’t visible with only one account’s data.
There are positives for your team. They just aren’t as direct or impactful as you may want…which is fine as long as the cost or effort to implement isn’t too high.
That leads to the technical implementation of these guardrails. What are the risks associated with the steps I suggested in the original post?
Digging Up Roots
The first step in the sample checklist was locking down the root account and stopping its day-to-day use.
This is probably the trickiest step to back away from. If you’ve made the mistake—and yes, it’s a mistake—of using the root account to create resources or run workloads in your account, you may have to re-launch them with a less privileged account or re-assign ownership.
The good news? Most cloud resources don’t have ownership assigned to a user but to the account. That means any account with sufficient permissions should be able to maintain or remove those resources.
Backing away from root ownership is more an exercise in reducing permissions than changing ownership. Still, there is potential for downtime here, but the risk of those elevated privileges usually justifies moving this work up as a high priority.
The one area that might be a “gotcha” is if someone is using the root account credentials on their workstation or has them embedded somewhere else like a deployment server.
Use the API call audit tool available in each of the big three clouds to find that access if it does exist.
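Once you have those logs in hand, a short script can surface root usage. Here’s a minimal sketch, assuming events exported in CloudTrail’s JSON record format — the field names below match AWS; Azure and Google use different event shapes:

```python
def find_root_activity(events):
    """Return CloudTrail-style events made with root credentials.

    Assumes each event is a dict shaped like a CloudTrail record,
    where root usage shows up as userIdentity.type == "Root".
    """
    return [
        e for e in events
        if e.get("userIdentity", {}).get("type") == "Root"
    ]

# Sample records trimmed down to the fields we care about.
sample = [
    {"eventName": "RunInstances",
     "userIdentity": {"type": "Root",
                      "arn": "arn:aws:iam::111122223333:root"}},
    {"eventName": "DescribeInstances",
     "userIdentity": {"type": "IAMUser", "userName": "deploy-bot"}},
]

for event in find_root_activity(sample):
    print(event["eventName"])  # -> RunInstances
```

Anything this flags — especially write calls like launching instances — points you at the workstation or deployment server still holding root credentials.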
Estimated time to resolve? An hour.
Level of effort? High, due to the log searches required and possible permission changes.
Return on investment? Very high. Root accounts are the keys to the kingdom and should be protected at all costs.
API Call Auditing
Of course, in order to check the API call logs, those logs have to be enabled! The good news is that for most accounts, those logs have been enabled by default since the account was created. That’s true for Azure, Google, and AWS.
But each of the clouds does have an exception (or three) that might apply here. There was a time when API calls were either not logged by default or used a different system.
With Azure, “Classic” resources may or may not log to the activity log. For Google, some services use the activity logs and not the newer audit logs. In AWS, older accounts simply didn’t have CloudTrail enabled and weren’t logging those calls in any form.
For older accounts, taking a few minutes to enable this logging is a smart move.
The configuration is minimal and essentially boils down to providing a place to store the logs. This should not impact any production resources or result in any downtime.
The only downside is the possible costs associated with storing the logs. Though, again, all of the clouds have ways to easily reduce that cost over time.
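As a sketch of what that check can look like on the AWS side, here’s a small function that inspects a DescribeTrails-style response for trails that aren’t logging across all regions. The response shape mirrors the CloudTrail API, but treat this as an assumption and verify against your account’s actual output:

```python
def trails_missing_multi_region_logging(describe_trails_response):
    """List trail names that are not logging across all regions.

    Expects a dict shaped like CloudTrail's DescribeTrails response:
    a "trailList" of trails, each with "Name" and "IsMultiRegionTrail".
    """
    return [
        t["Name"]
        for t in describe_trails_response.get("trailList", [])
        if not t.get("IsMultiRegionTrail", False)
    ]

response = {
    "trailList": [
        {"Name": "org-trail", "IsMultiRegionTrail": True},
        {"Name": "legacy-trail", "IsMultiRegionTrail": False},
    ]
}
print(trails_missing_multi_region_logging(response))  # ['legacy-trail']
```

An empty `trailList` in an older account is the clearest signal that this logging was never turned on at all.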
Estimated time to resolve? Five minutes.
Level of effort? Minimal. These features are probably already on.
Return on investment? High. These logs are a fantastic source of troubleshooting information for any operational issue (including security).
You Spent What?
Billing alerts are something that should be enabled on all cloud accounts by default. The CSPs won’t enable them by default because what I am willing to spend on the account hosting my personal website is significantly different from what I’m willing to spend on my workload supporting paying customers.
That means it’s up to you to set up billing alerts that match your risk tolerance.
Again, the good news here is that this is a non-breaking change. These alerts don’t stop resources in your accounts; they highlight spending that might be higher than you expect.
Ask any team out there: it’s always better to get a notification early in the month that something is off than a bill that is thousands and thousands of dollars higher than you expected.
A simple billing alert can help avoid that disaster. There’s no reason not to apply these to your account immediately. It’s five minutes that could save you thousands.
Estimated time to resolve? Ten minutes.
Level of effort? Moderate. You have to decide not only where to send the alerts but what to do if you receive one.
Return on investment? High. It doesn’t take a lot of searching to find horror stories of very large and very unexpected cloud spending bills.
Centralized Monitoring And Audit

This is the step that typically meets with the most pushback. The truly interesting part of that is the reason for the pushback: this step is usually fought against because of the idea of someone looking over your shoulder.
The technical side of this step is relatively simple. The centralized accounts need to already be set up and then provided a read-only role in your accounts.
This means there won’t be any production impact and this setup should be completely automated. The centralized teams should be able to provide a cloud-specific script that sets up the needed permissions.
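To make the read-only role idea concrete, here’s a hypothetical sketch of the cross-account trust policy such a script would create on AWS. The account ID is a placeholder, and the role would still need a read-only permissions policy (such as AWS’s managed ReadOnlyAccess) attached:

```python
import json

# Hypothetical account ID for the central security account.
SECURITY_ACCOUNT_ID = "999988887777"

# Trust policy letting the central account assume a role in yours.
# It grants no permissions by itself; those come from the read-only
# policy attached to the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::{SECURITY_ACCOUNT_ID}:root"
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

Because the role is read-only and assumed from outside, nothing in your account changes behavior — which is why this step is safe to automate completely.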
The true issue here is the relationship between your team and the centralized services. These can be tricky waters to navigate and are definitely worth discussing in the forum.
Estimated time to resolve? Five minutes.
Level of effort? Minimal. This should be completely scripted and have zero production impact.
Return on investment? Low for your team. High for the overall organization. The idea behind centralized security and audit accounts is to get a handle on the overall risk the organization faces. This is one you take for the team.
Organizing Access Control Permissions
Despite the high level of pushback in the previous step, this recommendation is by far the hardest to pull off.
For some reason, permissions almost always gradually drift towards “administrator” levels.
It’s often little changes here and there over time, and before you know it, a resource needlessly has full administrator access to your cloud account. This is why you need to regularly review and maintain the permissions in your cloud account.
Remember, the goal is to manage these permissions using a higher level abstraction. Creating policies or roles for various tasks is a great first step.
There’s a lot of information out there to help get you started. Here are a few examples:
- Get started with permissions, access levels, and security groups – Azure DevOps from the Microsoft Docs site
- The “Getting started with AWS identity services” session from AWS re:Invent 2020
- “How IAM works” from the Google Cloud documentation
Unfortunately the tooling that would help you monitor which permissions are actually being used isn’t nearly as mature as I’d like to see. Leading the way is the AWS IAM Access Analyzer which I’m hoping other clouds will copy.
It should be very simple to find out which assigned permissions have never been used. Sadly, it still takes a lot of effort.
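Until the tooling matures, even a rough heuristic can help spot drift toward administrator. This sketch flags any Allow statement with a bare wildcard action or resource — far shallower than what IAM Access Analyzer does, but it catches the obvious cases:

```python
def overly_broad_statements(policy_document):
    """Flag IAM-style policy statements that grant admin-like access.

    A rough heuristic only: any Allow statement whose Action or
    Resource is a bare "*" deserves a closer look. It won't catch
    subtler over-grants like service-wide wildcards ("s3:*").
    """
    flagged = []
    for stmt in policy_document.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        # Action and Resource may each be a string or a list.
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/*"},
    ],
}
print(len(overly_broad_statements(policy)))  # 1
```

Running something like this across your policies on a schedule is a cheap way to catch the gradual drift before it becomes the default.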
Estimated time to resolve? Ongoing.
Level of effort? Hard. This is a complicated and constant activity and if you remove a critical permission, the consequences could be dire.
Return on investment? High. Almost all of the public security breaches in the cloud stem from misconfigured permissions. This is the top security issue by far.
We have gone through each of the sample checklist ideas and determined the level of effort required to implement them along with a ballpark return.
They are a reasonable set of trade-offs. So, circling back to @marykay25’s original comment: why aren’t they implemented more often in existing environments?
There are a multitude of reasons and they all center on culture and communication. There is no one way to move forward addressing these challenges…and I haven’t even mentioned the second part of @marykay25’s point around hybrid and multi-cloud.
What do you think the reason is? Why are teams not implementing these relatively simple technical measures to help improve the strength of their builds?
Is it lack of awareness? Positioning from the security team? A lack of time?