Advanced Security Automation Made Simple

Originally presented at the AWS Summits in Santa Clara, NYC, Washington D.C., and Toronto in 2019. This was also presented at the inaugural AWS re:Inforce in Boston.

Security is often misunderstood and addressed in the last stages of a build. Operationally, it’s ignored until there is an emergency. In this talk, we review several advanced security processes and discuss how to easily automate them using common tools in the AWS Cloud.

This approach helps you and your team increase the security of your build while reducing the overall operational requirement of security in your stack. Leave this dev chat with everything you need to get started with automating security.

Here’s a PDF of the slide deck for future reference, though the format below will probably make a lot more sense…

Slide from presentation, described in the next grid item

The goal of cybersecurity is often stated in technical terms. I think there's a much simpler way to look at it. The goal of any and all cybersecurity activities is, "To make sure that systems work as intended...and only as intended".

Security is fundamentally a software/hardware quality issue. No one wakes up thinking they want to write 💩 code today. This goal is far more conducive to solving the problem vs. blaming or focusing on cybercriminals and hackers.

Slide from presentation, described in the next grid item

The Shared Responsibility Model is often positioned as a security model. In actuality, it governs how everything is done in the AWS Cloud.

There are business and security wins as you move more and more of your build to the right side of the model towards "abstract" type services.

* For more on the Shared Responsibility Model, check out this live stream that was part of my "Road to re:Invent 2019" series.

Slide from presentation, described in the next grid item

Within most organizations, security, operations, and development are completely different teams with different reporting structures. That's crazy given that they have a shared goal.

Development and operations have been (slowly) starting to come together but security is still on the outside...despite the marketing push around the term 🥶 DevSecOps 🥶 which still isn't a thing.

Slide from presentation, described in the next grid item

Really, these teams should start coming together. Whether that means a two-pizza team approach (a la AWS) or another governance model, the important thing is that they work together to solve problems for the customer.

Slide from presentation, described in the next grid item

One tool that really makes this clear is the AWS Well-Architected Framework.

The framework helps you decide which AWS services to use, how to deploy various technologies, and which operational aspects you need to focus on. It's a battle-proven decision-making tool written with input from AWS teams around the world.

* I teach a course on the Well-Architected Framework for A Cloud Guru. Check it out if you're interested in learning how to effectively apply the framework to your builds.

Slide from presentation, described in the next grid item

Security is one of the five pillars of the framework. It's not off on its own but part of everything that you do in the AWS Cloud. That's a critical way of thinking about security and (sadly) not one that's often used.

Slide from presentation, described in the next grid item

Mainly because people think that cybersecurity is crazy complicated. In fact, most practitioners make it far more complicated than it needs to be.

Slide from presentation, described in the next grid item

Cybersecurity can be simple. That's what we're going to do in the rest of this presentation. We're going to take some rather advanced and complex concepts and automate them in a very simple and effective manner.

Slide from presentation, described in the next grid item

Restricting permissions in your build

Slide from presentation, described in the next grid item

A core security tenet: the principle of least privilege.

Simply put, users and entities should only have the permissions they need to accomplish a task and no more.

Slide from presentation, described in the next grid item

But that's rarely the case. More often than not, a lot of extra permissions are assigned. Often in the name of "making it work" with promises of "we'll clean it up later"...

#protip: later never comes.

Slide from presentation, described in the next grid item

AWS doesn't make it easy to stick to the principle of least privilege. Despite my best efforts with the team, IAM continues to offer "FullAccess" policies.

While the user experience makes sense, this usually causes some serious security issues.

#protip: 99% of the time you see these policies used in production, it's wrong. Way, way too many permissions for most tasks.

Slide from presentation, described in the next grid item

The steps to automate a solution:

  • In an isolated test environment, apply a FullAccess policy or the permissions you believe are required
  • Complete the desired tasks
  • Compare against CloudTrail logs to verify actual permissions used
  • Use new policy to enforce the principle of least privilege
  • Repeat as code changes
Slide from presentation, described in the next grid item

Believe it or not, a simple(ish) command-line one-liner will solve this.

aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=USERNAME --start-time 2019-10-03T01:00:00Z --end-time 2019-10-03T02:00:00Z | jq '.["Events"]' | jq '.[] | {call: .EventName, service: .EventSource}' | more

This generates a clean list of the service endpoints called during the specified timeframe.

Slide from presentation, described in the next grid item

This command uses the AWS CLI to call the CloudTrail service, specifically the 'lookup-events' endpoint.

We use this command to return the audit log entries for our testing period. The command offers a number of ways to filter out irrelevant information.

Slide from presentation, described in the next grid item

We then pass the results of that AWS CLI command to a very handy tool called "jq".

This tool helps search JSON data and pull out values of interest. The syntax takes a little getting used to but it's immensely powerful.

Slide from presentation, described in the next grid item

With this two-step process, we've automated a simple verification of which permissions are actually being used.
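
As a rough illustration of the next step, the sketch below turns the call/service pairs into a draft IAM policy. It is only a sketch: the function name `policy_from_events` and the sample records are made up for this example, and it assumes the CloudTrail event source prefix and event name line up with the IAM action name, which isn't true for every service. Review the result before applying it.

```python
import json


def policy_from_events(events, resource="*"):
    """Build a draft IAM policy document from CloudTrail lookup-events records."""
    actions = set()
    for event in events:
        # "s3.amazonaws.com" -> "s3"; assumed (not guaranteed) to match the IAM prefix
        service = event["EventSource"].split(".")[0]
        actions.add(f"{service}:{event['EventName']}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(actions), "Resource": resource}
        ],
    }


# Hypothetical records shaped like the "Events" list returned by lookup-events
sample_events = [
    {"EventName": "DescribeInstances", "EventSource": "ec2.amazonaws.com"},
    {"EventName": "GetBucketPolicy", "EventSource": "s3.amazonaws.com"},
    {"EventName": "DescribeInstances", "EventSource": "ec2.amazonaws.com"},
]

policy = policy_from_events(sample_events)
print(json.dumps(policy, indent=2))
```

Duplicate calls collapse into a single action, so the draft stays short even for a long test run.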

Slide from presentation, described in the next grid item

As with anything in the AWS Cloud, there are at least six ways to solve any problem. We can expand this process to automatically apply the policy.

We could run the query as a 'cron' job using a scheduled CloudWatch Event, or we could use a Lambda function triggered from a build server.

Similarly for the output, we could not only apply the policy but we could add it to the code repository. We could even send a message to the team's Slack channel.
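
If the query runs on a schedule, the time window has to be computed rather than typed in. Here's a minimal sketch that builds the same lookup-events command for the hour before a given time. The helper name `lookup_command` is made up for this example, and `end` is assumed to already be in UTC.

```python
from datetime import datetime, timedelta, timezone


def lookup_command(username, end, hours=1):
    """Return the argument list for `aws cloudtrail lookup-events` for the window before `end`."""
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # `end` is assumed to be UTC; no conversion happens here
    return [
        "aws", "cloudtrail", "lookup-events",
        "--lookup-attributes",
        f"AttributeKey=Username,AttributeValue={username}",
        "--start-time", start.strftime(fmt),
        "--end-time", end.strftime(fmt),
    ]


# Reproduce the window used in the command from the talk
end = datetime(2019, 10, 3, 2, 0, 0, tzinfo=timezone.utc)
command = lookup_command("USERNAME", end)
print(" ".join(command))
```

A scheduled job could pass this list to `subprocess.run` and pipe the output into the same jq filter.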

Slide from presentation, described in the next grid item

Monitor S3 Exposure

Slide from presentation, described in the next grid item

One of my pet principles that should be obvious...

The principle of the face palm: do not make that which is secure, insecure.

Slide from presentation, described in the next grid item

AWS does its best to help builders and stop them from mistakenly exposing their data in Amazon S3.

All of these services have one or more ways to warn users when an Amazon S3 bucket is set to allow public access. But since access is your responsibility, the buck stops with you (as it were).

Slide from presentation, described in the next grid item

The workflow is simple. When a permissions change is made that applies to the target bucket, we hook the appropriate CloudWatch Event in order to trigger a custom AWS Lambda function.
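
What the Lambda function checks is up to you. As a rough sketch, the code below flags bucket policy statements that allow access to everyone. It's a deliberate simplification and not how AWS evaluates policies: the function names are made up for this example, and any statement with a Condition is treated as not public.

```python
def is_public_statement(statement):
    """Return True if a bucket policy statement grants access to everyone."""
    if statement.get("Effect") != "Allow":
        return False
    # A Condition may narrow access; this sketch treats conditioned statements as not public
    if "Condition" in statement:
        return False
    principal = statement.get("Principal")
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        values = aws if isinstance(aws, list) else [aws]
        return "*" in values
    return False


def public_statements(policy):
    """Return the statements in a bucket policy that look public."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    return [s for s in statements if is_public_statement(s)]


# Hypothetical bucket policy with one public and one restricted statement
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "PublicRead", "Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Sid": "TeamOnly", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
         "Action": "s3:*", "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
}

exposed = public_statements(sample_policy)
print([s["Sid"] for s in exposed])
```

From here the function could alert the team or roll the policy back, depending on your risk tolerance.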

Slide from presentation, described in the next grid item

We'll hook the PUT bucket policy event.

You should also hook several IAM events in order to get full coverage. For simplicity, we'll stick with the one event here.
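
For reference, here's roughly what the event pattern for that one event can look like, written as a Python dictionary. The pattern assumes S3 API calls reach CloudWatch Events by way of a CloudTrail trail. The `matches` helper is not an AWS API; it's a simplified local check, made up for this example, that lets you try a sample event against the pattern.

```python
import json

# Event pattern for the S3 PutBucketPolicy API call recorded by CloudTrail
PUT_BUCKET_POLICY_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutBucketPolicy"],
    },
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must exist and each list must contain the event's value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, trimmed-down event for a policy change on "example-bucket"
sample_event = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "s3.amazonaws.com",
        "eventName": "PutBucketPolicy",
        "requestParameters": {"bucketName": "example-bucket"},
    },
}

print(json.dumps(PUT_BUCKET_POLICY_PATTERN, indent=2))
print(matches(PUT_BUCKET_POLICY_PATTERN, sample_event))
```

Adding more event names to the `eventName` list is one way to widen coverage later.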

Slide from presentation, described in the next grid item

Track Production Logins

Slide from presentation, described in the next grid item

A core principle of the DevOps movement is that systems, not users, access production systems.

This ensures consistent configurations and reduces the possibilities of an error.

Slide from presentation, described in the next grid item

The workflow for this process is also straightforward. We'll use CloudWatch Logs to monitor system events such as an RDP or SSH login.

Once we see one of those events, we'll trigger an AWS Lambda function. That function then sends a message to the team's Slack channel.
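
A minimal sketch of that function might look like the code below. It assumes the instance ships its sshd log to CloudWatch Logs and that a subscription filter invokes the function, which delivers the log events as a base64-encoded, gzip-compressed payload. The log line format, the function names, and the `webhook_url` parameter are assumptions for this example. RDP logins would need their own pattern.

```python
import base64
import gzip
import json
import re
import urllib.request

# Matches sshd lines such as "Accepted publickey for alice from 203.0.113.10 port 50022 ssh2"
SSH_LOGIN = re.compile(r"Accepted (\S+) for (\S+) from (\S+)")


def decode_log_events(event):
    """Unpack the base64-encoded, gzip-compressed payload CloudWatch Logs sends to Lambda."""
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))["logEvents"]


def login_messages(event):
    """Return one Slack-ready line per successful SSH login found in the payload."""
    messages = []
    for log_event in decode_log_events(event):
        match = SSH_LOGIN.search(log_event["message"])
        if match:
            method, user, source = match.groups()
            messages.append(f"SSH login: {user} from {source} ({method})")
    return messages


def post_to_slack(webhook_url, text):
    """Send one message to a Slack incoming webhook (webhook_url is a value you supply)."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)


def handler(event, context, webhook_url=None):
    """Lambda entry point; only posts to Slack when a webhook URL is provided."""
    messages = login_messages(event)
    if webhook_url:
        for text in messages:
            post_to_slack(webhook_url, text)
    return messages


# Build a hypothetical event locally, the same way CloudWatch Logs packages it
payload = {"logEvents": [
    {"id": "1", "timestamp": 0,
     "message": "Accepted publickey for alice from 203.0.113.10 port 50022 ssh2"},
    {"id": "2", "timestamp": 0, "message": "Connection closed by 203.0.113.10"},
]}
encoded = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
sample_event = {"awslogs": {"data": encoded.decode("ascii")}}

print(handler(sample_event, None))
```

In a real deployment the webhook URL would likely come from an environment variable or a secrets store rather than a parameter.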

Slide from presentation, described in the next grid item

#protip: I met an organization that was using this workflow and actually prompting users in Slack to verify if they were logging into a system. If the user said yes, they were prompted with some education info. If they said no, a security incident was immediately raised.

Slide from presentation, described in the next grid item

Forensic Isolation

Slide from presentation, described in the next grid item

Another made-up-but-makes-sense principle...

The Crichton principle: if something unknown is happening, quarantine it until you figure it out.

Slide from presentation, described in the next grid item

The workflow here is simple and straightforward. In order for this to work, you'll need a security tool (open source, commercial, or home grown) that can emit events to something like Amazon SNS.

  • Security controls on the instance raise an alert on an issue
  • Lambda is triggered by the alert
  • Change the security group to make the system inaccessible
  • Open security incident
  • Create a forensic instance to analyze the infected instance
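
The first three steps above can be sketched as follows. The SNS message format (a JSON body with an `instance_id` field), the `quarantine` tag, and the function names are assumptions for this example. The `ec2` argument is expected to behave like `boto3.client("ec2")`, and a small recording stand-in is used here so the sketch runs without AWS access. Opening the incident and creating the forensic instance are left out.

```python
import json


def instance_from_sns(event):
    """Pull the instance ID out of the SNS record; the JSON message shape is an assumption."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["instance_id"]


def isolate(ec2, instance_id, quarantine_group_id):
    """Replace all security groups on the instance with a single quarantine group."""
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_group_id])
    # Tag the instance so responders can find it; the tag key is hypothetical
    ec2.create_tags(Resources=[instance_id], Tags=[{"Key": "quarantine", "Value": "true"}])
    return instance_id


def handler(event, context, ec2, quarantine_group_id):
    """Entry point; a real Lambda would create the client and read the group ID itself."""
    return isolate(ec2, instance_from_sns(event), quarantine_group_id)


class RecordingClient:
    """Stand-in for an EC2 client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def modify_instance_attribute(self, **kwargs):
        self.calls.append(("modify_instance_attribute", kwargs))

    def create_tags(self, **kwargs):
        self.calls.append(("create_tags", kwargs))


# Hypothetical alert published to SNS by the security tool
sample_event = {"Records": [{"Sns": {"Message": json.dumps(
    {"instance_id": "i-0123456789abcdef0", "reason": "malware detected"})}}]}

client = RecordingClient()
result = handler(sample_event, None, client, "sg-0123456789abcdef0")
print(result, [name for name, _ in client.calls])
```

The quarantine security group is assumed to exist already, with no inbound or outbound rules beyond what your forensic tooling needs.
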
Slide from presentation, described in the next grid item

The trick to this workflow is an Auto Scaling group. For cloud-native workloads, this is a normal feature.

But even for legacy workloads, the Auto Scaling group can help when you set the minimum and maximum to one. This ensures that there is always one server running.

When you change the security group to be inaccessible, the Auto Scaling health check fails and the system spins up a new instance.

Best case scenario, this takes care of the attack and you have an instance to analyze at your leisure.

In the case of a persistent attacker, this workflow automatically keeps them at bay by destroying their foothold every time they compromise the instance.
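
For a legacy workload, pinning the group to one instance comes down to three settings. The sketch below builds the keyword arguments you could pass to the boto3 Auto Scaling client's `update_auto_scaling_group` call. The helper name and the group name are made up for this example, and it assumes the group's health check is one that notices an unreachable instance, such as a load balancer health check.

```python
def single_instance_settings(group_name):
    """Return settings that keep exactly one instance running in the group."""
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
    }


# "legacy-app" is a hypothetical group name
settings = single_instance_settings("legacy-app")
print(settings)

# With AWS credentials available, these could be applied with:
#   boto3.client("autoscaling").update_auto_scaling_group(**settings)
```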

Slide from presentation, described in the next grid item

What's Next?

Slide from presentation, described in the next grid item

A few ideas for workflows you can implement yourself...

  • Send custom application logs to AWS Config (via rules) for a central compliance log
  • Correlate Auto Scaling alerts with backend data to detect possible DDoS attacks
  • Detect unauthorized drift from applications and infrastructure using CloudFormation templates
  • Streamline the incident response process, including restoring to full capacity
  • Automatically find and mitigate vulnerabilities before deployment
Slide from presentation, described in the next grid item

Bottom line: don't overcomplicate security!

Slide from presentation, described in the next grid item

There are three generic steps to automating any activity:

  1. Start manually
  2. Determine your risk tolerance
  3. Lambda all the things
Slide from presentation, described in the next grid item

An important caveat: be aware of how fast your event source is.

CloudWatch Events trigger in near real time. This makes the service ideal for response workflows.

In contrast, CloudTrail usually delivers logs 2-4 minutes after an event occurs. There's no guaranteed delivery time, so treat this as a rule of thumb.

This timing makes CloudTrail ideal for compliance and cleanup workflows.

Slide from presentation, described in the next grid item

Remember the goal of cybersecurity: to make sure that whatever you are building works as intended...and only as intended!

Slide from presentation, described in the next grid item

I'll be streaming a lot leading up to AWS re:Invent 2019. Usually twice a week and each time I do a ~30m look into a specific AWS service.

You can follow along at markn.ca/2019/aws-reinvent.

Slide from presentation, described in the next grid item

Thank you!

You can reach me on Twitter, where I'm @marknca or via my website at markn.ca/2019/aws-reinvent.