Why monitor resource configuration changes for cloud and DevOps tools?
Ten years ago, before DevOps hit the headlines, developers wrote code and operations engineers ran it. Traditionally, software development and operations were two different jobs. Specialized roles focusing on quality and testing were also in demand.
In the last decade, DevOps and DevSecOps triggered a big shift: developers now take on more and more responsibilities. Removing barriers has been about empowering developers to own their code all the way to the customer. In modern teams, developers have the tools to test and ship their code without getting approval from other teams.
There are two main reasons behind this shift. First, companies need to ship code to production faster and reach customers before their competitors do. Second, reliable software is now seen as a fundamental requirement for customer satisfaction and revenue. To meet these high reliability requirements, developers own what they build because they can fix most problems faster than anyone else.
These benefits are great, but they come with challenges. Higher expectations of developers mean there is a lot more to learn and master. The path to becoming a great developer was never easy, but now it is harder than ever.
When there is so much to learn and care for, security and compliance are often treated as second-class citizens. Most teams start to approach security and compliance as a necessary evil, not as something they should care about for the sake of the company.
To fix this problem, we can’t just keep saying security is everyone’s job. It doesn’t work if we don't take action. Just as DevOps/platform teams support developers with infrastructure as code and CI/CD tooling, we should support developers with the right security and compliance tooling. These tools should offer a seamless experience that makes security and compliance part of developers' daily routine without slowing them down.
Can Terraform or similar tools help?
Yes and no.
Yes, because Terraform, Pulumi, CloudFormation, and other infrastructure automation tools are great for managing infrastructure. As a best practice, we should use them to update our cloud infrastructure and other tools. They make resource creation repeatable and much safer than manual updates.
No, because we can’t ensure all changes go through Terraform. There is always a legacy, manually configured tool in our stack. Even when we try to enforce policies that require IaC tools, writing providers for all sorts of resources can be time-consuming. Eventually, a team will write a custom script or change things manually. Remember, it is not just the cloud. We use Git, CI/CD, containers, chat, collaboration, and authentication tools, all the way to tools built in-house. Besides, changes going through these tools are not readily available for auditing and querying for compliance and security. Teams have to build infrastructure to collect all of that in one place.
Many services and tools to ensure safe changes
Even with the right infrastructure automation, too many people now have access to critical tools and services. On top of that, we now have a lot of tools in our stack. Most teams have the freedom to choose the best tools for their jobs. But even companies with standardized tooling still run cloud services, CI/CD, Git, and chat tools. This makes it really hard to keep track of who changed what and to ensure that these changes don’t create a potential breach. It also makes debugging a big pain.
For example, teams use AWS services like EC2, IAM, and S3 and combine them with tools like GitHub, Jira, Jenkins, and Slack. Each service has objects, which we call resources. For example, Amazon EC2 has instances, images, and security groups, while GitHub has repositories, users, issues, and organizations. When we say resource configuration, we refer to the documents, usually JSON objects, that describe these resources. For example, the resource configuration for a GitHub repository is a JSON document that records properties such as its name, owner, and visibility.
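As a minimal sketch, such a document might be built and serialized as follows. The field names are modeled on the repository object returned by the GitHub REST API, while every value (organization, repository name, ID, timestamp) is made up for illustration and does not reflect any particular tool's schema.

```python
import json

# Illustrative values only. Field names follow the GitHub REST API
# repository object; the data itself is invented for this example.
github_repository = {
    "id": 123456789,
    "name": "payments-service",
    "full_name": "example-org/payments-service",
    "private": True,
    "visibility": "private",
    "default_branch": "main",
    "archived": False,
    "owner": {"login": "example-org", "type": "Organization"},
    "updated_at": "2021-06-01T12:00:00Z",
}

# Serialize to the JSON form in which such a document would be stored.
document = json.dumps(github_repository, indent=2, sort_keys=True)
print(document)
```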
Resmo collects all your cloud and DevOps resource configurations in a central place and lets you spot compliance and security issues through its SQL interface. We will talk more about Resmo in upcoming blog posts. For now, we just wanted to let you know that we are coming to solve this problem!
Now, let’s talk more about why this data is important and in what ways it can help us.
Resource configuration data are more critical than ever
APM, logging, and distributed tracing solutions give visibility into what is happening in our applications and infrastructure. But with the increasing number of cloud services and developer tools, we now have more places to look. When configured the wrong way, these tools can lead to unauthorized access, leaks of customer or internal company data, or bad practices.
For example, developers find it easy to create IAM roles, security groups, or Amazon S3 buckets with admin permissions, and they may forget to delete them once they are done. Or an important GitHub repository may be public and no one notices until it’s too late. Even chat tools like Slack have become critical for development with the rise of ChatOps. If a user is granted admin permissions and no one notices, this can create a serious security issue. In another real-life example, when a Slack app is tied to a user and that user is disabled, the Slack app stops working. Debugging scenarios like these is hard when there are too many places to look.
Having a central place for resource changes means you can ask questions
When all your resource and config changes are in one place, you can debug changes and find out what is wrong much more easily. Combine that with an easy-to-use query language like SQL and the ability to query resources from different tools, and many opportunities open up. You can answer questions like:
- Which AWS IAM users have an access key that is enabled but has never been used?
- Is there any Amazon S3 bucket that grants full control access to anybody other than the owner?
- Are there any security-related GitHub issues that have been open for more than one day?
- Which PRs did this developer open in the last 5 days?
- Are there any repositories whose visibility changed from private to public?
- Are there any Amazon S3 buckets that don’t have the required prefix?
- Which Slack users installed Slack apps with the channels:read permission?
Time, service, or user-based filters open up even more querying capabilities that help teams get further insights from the resource config data.
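To make the first question concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a central store. The table name `iam_access_key`, its columns, and the sample rows are all assumptions made for this example; they are not Resmo's schema or real AWS data.

```python
import sqlite3

# In-memory database standing in for a central store of resource configs.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per IAM access key.
# last_used_at is NULL when the key has never been used.
conn.execute(
    """
    CREATE TABLE iam_access_key (
        user_name     TEXT NOT NULL,
        access_key_id TEXT NOT NULL,
        status        TEXT NOT NULL,
        last_used_at  TEXT
    )
    """
)

# Made-up sample data.
conn.executemany(
    "INSERT INTO iam_access_key VALUES (?, ?, ?, ?)",
    [
        ("alice", "AKIAEXAMPLE1", "Active", "2021-05-30T09:15:00Z"),
        ("bob", "AKIAEXAMPLE2", "Active", None),
        ("carol", "AKIAEXAMPLE3", "Inactive", None),
    ],
)

# Which IAM users have an access key that is enabled but never used?
rows = conn.execute(
    """
    SELECT user_name
    FROM iam_access_key
    WHERE status = 'Active' AND last_used_at IS NULL
    ORDER BY user_name
    """
).fetchall()

unused_active_keys = [user_name for (user_name,) in rows]
print(unused_active_keys)
```

The same pattern extends to the other questions once the relevant resources are stored as tables; only the table and the WHERE clause change.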
Compliance shouldn’t be driven by Jira issues or Excel sheets
Most SaaS providers go through the lengthy process of obtaining SOC 2, hoping to prove to their customers that they follow best practices, that changes are audited, and that controls are in place. To do so, they collect compliance evidence, usually by opening Jira tickets or passing around Excel and Word documents. This is inefficient and time-consuming, and it doesn’t help companies get the real benefits. Instead, everyone ends up hating compliance and security.
Modern teams that embrace a DevSecOps mentality make continuous compliance part of their daily routine. They start with GitOps and check policies using code, a practice known as compliance as code. Many compliance questions can be answered if teams have a central place where they record resource config changes. Saved queries for the required questions should be run regularly to check whether there is a compliance breach. These questions are there for a reason, and checking them only quarterly is a big risk for enterprises.
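One way to picture compliance as code is a small set of named checks that run against the latest snapshot of resource configurations on a schedule. The sketch below assumes a hypothetical resource shape (`type`, `name`, `visibility`), a hypothetical bucket naming policy (`acme-`), and made-up sample data; it illustrates the idea rather than any specific product's policy engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Resource = Dict[str, object]

# Hypothetical naming policy used only for this example.
REQUIRED_BUCKET_PREFIX = "acme-"


@dataclass
class Check:
    name: str                            # identifier reported on failure
    applies_to: str                      # resource type the check covers
    passes: Callable[[Resource], bool]   # True when the resource is compliant


CHECKS = [
    Check(
        "s3-bucket-has-required-prefix",
        "aws_s3_bucket",
        lambda r: str(r["name"]).startswith(REQUIRED_BUCKET_PREFIX),
    ),
    Check(
        "github-repository-is-not-public",
        "github_repository",
        lambda r: r["visibility"] != "public",
    ),
]


def run_checks(resources: List[Resource]) -> List[str]:
    """Return one 'check-name: resource-name' entry per failed check."""
    failures = []
    for resource in resources:
        for check in CHECKS:
            if check.applies_to == resource["type"] and not check.passes(resource):
                failures.append(f"{check.name}: {resource['name']}")
    return failures


# Made-up snapshot of collected resource configurations.
snapshot = [
    {"type": "aws_s3_bucket", "name": "acme-invoices"},
    {"type": "aws_s3_bucket", "name": "tmp-test-bucket"},
    {"type": "github_repository", "name": "payments-service", "visibility": "public"},
]

failures = run_checks(snapshot)
for failure in failures:
    print(failure)
```

Because the checks live in code, they can be reviewed, versioned, and run on every snapshot instead of once a quarter.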
Timely security alerts require seeing the big picture
Security breaches are hard to debug when no one tracks what has changed in your cloud and critical dev resources. With luck, these updates are recorded, sometimes manually, in an internal logging engine that most teams have no idea how to use. Digging into logs requires deep knowledge of your systems and services. Capturing these changes also tends to require enabling many expensive auditing tools and moving to enterprise plans. That is why so many teams enable them only when it is too late. Given the critical information these resources hold for the health and security of your services, this data should be accessible and easily queryable.
In line with the DevSecOps mentality, a service-driven approach should alert the relevant development teams, not just the security team. Teams should have visibility into their resources and a set of alerts defined for their services. Yet when teams take on security and compliance responsibilities, they usually need a lot of training because there are many unknowns and unfamiliar tools and languages. The key to DevSecOps adoption is to offer familiar, easy-to-use tooling.
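A service-driven alert can be sketched as two steps: diff two snapshots of a resource's configuration, then route each change to the team that owns the resource as well as to the security team. The ownership map, team names, and snapshots below are hypothetical and exist only to illustrate the routing idea.

```python
from typing import Dict, List, Tuple

Config = Dict[str, object]

# Hypothetical ownership map and team names.
OWNERS = {"payments-service": "team-payments"}
SECURITY_TEAM = "team-security"


def diff_config(before: Config, after: Config) -> Dict[str, Tuple[object, object]]:
    """Return {field: (old, new)} for every top-level field whose value changed."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


def recipients_for(resource_name: str) -> List[str]:
    """The owning team (if known) comes first, followed by the security team."""
    owner = OWNERS.get(resource_name)
    return [owner, SECURITY_TEAM] if owner else [SECURITY_TEAM]


def alerts_for(resource_name: str, before: Config, after: Config) -> List[str]:
    """Build one alert message per changed field and recipient."""
    messages = []
    for field, (old, new) in diff_config(before, after).items():
        for team in recipients_for(resource_name):
            messages.append(
                f"[{team}] {resource_name}: {field} changed from {old!r} to {new!r}"
            )
    return messages


# Made-up snapshots of the same repository at two points in time.
before = {"visibility": "private", "archived": False}
after = {"visibility": "public", "archived": False}

alerts = alerts_for("payments-service", before, after)
for alert in alerts:
    print(alert)
```

Routing by ownership keeps the alert in front of the people who can act on it fastest, while the security team still sees every change.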
Resource configurations are too precious to be left in the dark
Change is inevitable and necessary, and the speed of development keeps increasing. Whether manually or through automation, developers hit APIs many times a day, hoping to improve things. But changes to resources can have dangerous consequences if they are not recorded and observed. In this blog post, we talked about why monitoring resource configuration changes matters so much.