Building a Policy Decision Platform that detects and prevents Security Misconfigurations
TL;DR — This post covers what I learned building a generic Policy Decision Platform using Open Policy Agent, integrated at various points in the development pipeline (the CI/CD platform, pre-commit hooks, etc.) to detect and prevent security misconfigurations. This is a foundational piece in moving an organization to a standards-based security posture, and also a platform component needed to move an enterprise to a zero-trust model.
I first came across Open Policy Agent (OPA) in late 2017 through an old colleague of mine who was looking at OPA as a way to enforce authorization for their platform's service-to-service communications. Back then, I spent a couple of hours looking at it and concluded it was a framework limited to enforcing authorization, and that was about it. It was only while watching the 2019 KubeCon and CloudNativeCon videos that I realized how many people had started using OPA as a generic Policy Decision Point (PDP) to validate their Kubernetes configurations, Terraform code, serverless configs, etc. That, to me, was revolutionary.
In 2020, almost everyone is using a cloud provider or a managed service to deploy their apps, and now that DevOps has shifted left, most engineers spend their time writing structured configurations. The fact that we now have a generic policy engine that can validate these configurations opens up ways to detect or prevent a vast variety of security issues that, in the past, were only catchable by running a vulnerability scanner.
There's a lot of content and tutorials on the internet about how to use OPA and how to write policies. This isn't going to be one of those posts. Instead, I plan on covering what problems I was trying to solve and my learnings and observations from working on them.
Purely from a career point of view, 2020 has been pretty great. I changed teams. We started a new team called Security Architecture that defines enterprise security architecture and spends its time strategically building programs that add controls to detect and mitigate our company's top security risks. As part of that, while looking at our top 10 security risks, I realized a good number of the issues identified had a repeating pattern. So we decided to build a program around it.
So what were those problems that I was trying to solve?
- As an organization, we were fairly new to the cloud. 2020 was also the year our business grew so much that we had to move to multiple active regions/data centers. That was a great business problem to have, but in reality our configuration management system wasn't being used consistently, and we were identifying a good number of misconfigured instances and resources. Operationally, the engineering organization as a whole started moving away from having everything provisioned by one traditional infrastructure team to a more self-service model. That helped us scale without depending on the infrastructure team, but it was not great from a security perspective, as we didn't have the right toolchain or guardrails to ensure secure defaults were being adhered to. Which brings us to the problem: how do we detect or block misconfigured infrastructure from being deployed?
- While performing an internal threat model of our ecosystem, we realized that our deployment tool was overly permissive, to a point that made me uncomfortable: any compromise of the deployment pipeline could mean a complete takeover of all our clusters. How can the ecosystem and our clusters take care of themselves even if the deployment tool were compromised?
- While I changed teams in the hope of moving away from a pure application security role, in reality most of my time in early 2020 was still spent performing design reviews. There was a huge knowledge gap in the team, and to fix that and make myself redundant in these reviews, I spent a good portion of my time writing down security standards and common architecture patterns. I've talked more about that here. While writing these standards down, I kept asking myself: how do we actually verify that our engineers are following these standards?
- From an application security point of view, for authorization between services, we rolled out our own internal OAuth 2.0 Authorization Server. All our services' endpoints are protected with a scope, and we had a standard requiring every endpoint to be protected with one. We built a declarative authorization model where all an engineer needs to do is fill out a configuration file and our frameworks handle the rest. But as the number of services and environments (dev/test/stage/prod) grew, the number of configuration files and overrides a service owner had to maintain grew with them. This got tricky to the point where we couldn't manually keep track of all the changes, and we were seeing some services misconfigured, opening up their endpoints with no authorization at all. Which got me thinking: how do we identify this early in the development process?
- As a company, we've also been moving to a Zero Trust model (even before COVID-19). For any zero-trust architecture to work, one needs a component that acts as a Policy Decision Point (PDP) and one that acts as a Policy Enforcement Point (PEP). So how do we build a platform that acts as a generic PDP?
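To make the endpoint-authorization problem above concrete, here is a minimal sketch of the kind of check a policy engine can run against a service's declarative authorization config. The field names (`endpoints`, `required_scopes`, `path`) are illustrative, not our actual schema:

```python
def find_unprotected_endpoints(auth_config: dict) -> list:
    """Return paths of endpoints missing a required OAuth 2.0 scope."""
    violations = []
    for endpoint in auth_config.get("endpoints", []):
        # An empty or missing scope list means the endpoint is wide open.
        if not endpoint.get("required_scopes"):
            violations.append(endpoint.get("path", "<unknown>"))
    return violations

# Hypothetical service config with one misconfigured endpoint.
service_config = {
    "service": "payments",
    "endpoints": [
        {"path": "/v1/charge", "required_scopes": ["payments.write"]},
        {"path": "/v1/refund", "required_scopes": []},  # no scope: violation
    ],
}

print(find_unprotected_endpoints(service_config))  # ['/v1/refund']
```

Running a check like this against every config change, rather than auditing deployed services after the fact, is exactly the "identify this early" shift the bullet describes.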
How did I approach these problems?
- We built a cloud-native Policy Decision Platform using Open Policy Agent that validates a change against a codified policy (standards) written in Rego.
- The platform is currently being used to detect and prevent misconfigurations for changes that are being made to our cloud platform (GCP) and our Kubernetes clusters. Given that this is a generic policy decision platform, other infrastructure teams have been using this platform to detect or enforce non-security related misconfigurations too.
- For changes to most of our infrastructure, we use Terraform (an Infrastructure as Code solution). We integrated our platform with the build pipeline so that any new change being introduced is scanned for misconfigurations. This way, an engineer introducing a change gets Just In Time (JIT) feedback. An interesting observation: even though we do not block a misconfigured change from being pushed, only a couple of times have engineers pushed changes without fixing the issue. This suggests that engineers will most likely do the right thing if they're told what to do at the right time. The same solution is also used to detect misconfigurations in how a service configures its endpoints' authorization.
- While we encourage our engineers to Terraform all their changes, in reality, for dev-like environments, we still have engineers making infrastructure changes through the cloud console. To detect such misconfigurations too, our platform subscribes to all cloud audit logs and looks for security misconfigurations made to GCP resources via the UI/console. If a Critical or High severity misconfiguration is detected on a resource, we Slack the person who made the change and let them know about the misconfiguration their change introduced. This ensures that both the developer and the security team are aware of any misconfigured resource.
- For changes to our Kubernetes clusters, we made use of Kubernetes' admission controllers: a validating webhook talks to our platform and verifies that changes to the cluster meet our standards. We initially configured this to be permissive, only logging and reporting on misconfigurations. Having this control ensured that even if our deployment tool were compromised, we would still have the ability to detect, prevent, and possibly even fix the misconfiguration (using a mutating webhook).
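For the Terraform path described above, the pipeline check boils down to evaluating the machine-readable plan (`terraform show -json plan.tfplan`) against a policy. A hedged sketch, using one illustrative rule (a GCS bucket created without uniform bucket-level access) rather than our actual policy set:

```python
def find_misconfigured_buckets(plan: dict) -> list:
    """Flag new GCS buckets in a Terraform JSON plan that lack
    uniform bucket-level access."""
    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "google_storage_bucket":
            continue
        if "create" not in change.get("change", {}).get("actions", []):
            continue
        # "after" holds the planned post-apply state of the resource.
        after = change["change"].get("after") or {}
        if not after.get("uniform_bucket_level_access", False):
            violations.append(change.get("address", "<unknown>"))
    return violations

# Trimmed-down example of the JSON plan structure Terraform emits.
plan = {
    "resource_changes": [
        {
            "address": "google_storage_bucket.logs",
            "type": "google_storage_bucket",
            "change": {
                "actions": ["create"],
                "after": {"uniform_bucket_level_access": False},
            },
        }
    ]
}

print(find_misconfigured_buckets(plan))  # ['google_storage_bucket.logs']
```

In the real pipeline this decision comes from Rego policies evaluated by OPA; the Python above just shows the shape of the input and the decision.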
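The audit-log path can be sketched the same way. Given a (heavily simplified) Cloud Audit Logs entry, the platform extracts who made the change and what it touched, and decides whether it is severe enough to ping someone on Slack. The method-to-severity mapping below is hypothetical, not our actual policy:

```python
# Illustrative severity policy: which audited API methods we page on.
SEVERITY_BY_METHOD = {
    "v1.compute.firewalls.insert": "HIGH",
    "storage.setIamPermissions": "CRITICAL",
}

def triage_audit_entry(entry: dict):
    """Return a notification record for severe changes, else None."""
    payload = entry.get("protoPayload", {})
    severity = SEVERITY_BY_METHOD.get(payload.get("methodName", ""))
    if severity is None:
        return None  # not something we notify an engineer about
    return {
        "severity": severity,
        "actor": payload.get("authenticationInfo", {}).get("principalEmail"),
        "resource": payload.get("resourceName"),
    }

entry = {
    "protoPayload": {
        "methodName": "v1.compute.firewalls.insert",
        "authenticationInfo": {"principalEmail": "dev@example.com"},
        "resourceName": "projects/demo/global/firewalls/allow-all",
    }
}

print(triage_audit_entry(entry))
```

In production this would run in a handler subscribed to the audit-log Pub/Sub topic, with the actual decision delegated to OPA and the `actor` email used to look up a Slack handle.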
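And for the cluster path, a validating webhook receives a Kubernetes AdmissionReview and must answer with an allow/deny decision. A minimal sketch of the decision logic (the example policy, denying privileged pods, is illustrative), including the permissive log-only mode we started with:

```python
def review(admission_review: dict, permissive: bool = True) -> dict:
    """Build an AdmissionReview response; deny privileged pods
    unless running in permissive (log-only) mode."""
    request = admission_review["request"]
    pod_spec = request["object"].get("spec", {})
    violations = [
        c["name"]
        for c in pod_spec.get("containers", [])
        if c.get("securityContext", {}).get("privileged", False)
    ]
    # Permissive mode: record the violation but let the change through.
    allowed = permissive or not violations
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the request uid
            "allowed": allowed,
            "status": {"message": f"privileged containers: {violations}"},
        },
    }

# Trimmed-down AdmissionReview for a pod with one privileged container.
ar = {
    "request": {
        "uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
        "object": {
            "spec": {
                "containers": [
                    {"name": "app",
                     "securityContext": {"privileged": True}}
                ]
            }
        },
    }
}

print(review(ar, permissive=False)["response"]["allowed"])  # False
print(review(ar, permissive=True)["response"]["allowed"])   # True
```

In our setup this logic lived behind a Cloud Function fronting OPA; the `permissive` flag mirrors the staged rollout from log-only to enforcing mode.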
Learnings from building this platform
- Build a pure cloud-native solution: One of the requirements we set was not to babysit this platform's infrastructure ourselves, so we used only cloud-native services in the design. We are a GCP shop, and we used Google-managed services like Cloud Functions, Cloud Run, Pub/Sub, and GCS. This way, the only thing we need to worry about is the application code. From a pricing point of view, the compute and processing costs average about $425 a month. The platform processes close to 44 million GCP audit logs a month, is integrated as an admission controller for about 6 clusters (per environment) validating every change made to those clusters, and is also integrated with our build pipeline. Vendor products that do the same thing cost about $120K a year, so from a cost, scale, and feature-relevance perspective, building this internally was definitely the better approach. And since the platform was built on cloud-native services, Google manages the infrastructure for us!
- Integrate security tools into the developer pipeline: Back in 2018, while solving for the "Using Components with Known Vulnerabilities" class of vulnerabilities, we had a solution that required a security engineer to manually run a tool against our codebase and generate a monthly report. The problems with that process were that we had to run the job by hand, and because it wasn't part of the developers' workflow, they had no visibility into the issues they were adding to their code. We moved to a different solution that was much easier to integrate with our CI/CD and gave developers visibility into the vulnerabilities their libraries were adding. This had a positive effect, and we saw a lot more engagement from engineering teams. Instead of the security team filing issues to update library versions, the tool lets developers know how to fix the issue when possible, and even goes as far as creating a Pull Request (PR) that fixes it. We no longer need to babysit engineers for this class of vulnerabilities. I used the same pattern here for the security misconfiguration class of vulnerabilities: we integrated our PDP platform into the CI/CD pipeline. On identifying an issue, the platform leaves a comment on the PR (along with sending an alert to the security team) describing what the misconfiguration is and how an engineer can potentially fix it. Now, instead of the security team babysitting these issues, the developer who introduced the misconfiguration triages and fixes it when necessary. This reduces the load the tool adds to the security team and, more importantly, educates the developer about a potential security issue they were about to introduce.
- Reduce the number of decisions an engineer needs to make: Engineers make a lot of decisions throughout the SDLC, from design through testing, to meet product and security requirements. Early on, the number of violations being detected and reported was in the thousands per month. That volume was overwhelming, so we decided to track and report only on issues we classified as Critical and High. Also, to reduce decision fatigue, we created an internal Terraform template registry with secure defaults turned on. This drastically reduced the number of violations reported for any newly provisioned infrastructure. For issues that don't change the functionality of an instance or service (for example, turning on Google-managed disk-level encryption or enabling Secure Boot), we started fixing the misconfigurations automatically in the background, notifying engineers rather than asking them to fix anything. We also plan to create fix pull requests for issues that can be fixed easily. Over time, across multiple projects, we've learned that doing this kind of work for your developers via automation helps build trust with them and reduces the number of decisions they need to make about this class of vulnerability.
- Quality content/telemetry for the SOC: Every violation the policy engine detects is reported to the SIEM. Not all violations are actionable by an engineer; we reach out to engineers only for the actionable ones. We also have rules that look for anomalies (for example, taking a snapshot of a disk), which is great telemetry for the SOC even though there's no fix needed from an engineer per se. Having all violations reported to the SOC also lets the SOC team correlate multiple logs if and where needed, and helps someone on the security team keep an eye on the Critical and High misconfigurations that could prove expensive for the company.
- A Cloud Function as a validating webhook is fine but limited: When we first started looking at how to protect our clusters, we evaluated Gatekeeper as a solution. We looked at (roughly) version 2 and found it extremely complex, given that we weren't the ones maintaining our clusters: every time a policy needed an update, we had to ask the platform team to help implement the change, and that whole process was painful. So we decided to point the validating webhook at a Cloud Function (which in turn spoke to the policy agent) that validated changes to the cluster. This approach worked great for us, as we controlled the Cloud Function and the policies being validated. We soon realized the approach has limitations. Because it's deployed outside the cluster, we couldn't write rules/policies that require knowing the state of the cluster. For example, a rule like "is any other pod using the same Kubernetes Service Account (KSA)?" can no longer be written, because the Cloud Function doesn't have access to the cluster's state. Also, if we wanted to move to a preventive mode using a mutating webhook, we would need to modify our master authorized networks to allow communication from all of Google's network, because Cloud Functions is a Google-managed service and we can't whitelist only a specific set of IPs (assuming we don't have VPC Service Controls), which would open up a different security risk. We're revisiting newer versions of Gatekeeper at the moment to move this control to a preventive mode. My recommendation for anyone starting out is to go with cluster-native solutions like Gatekeeper from the start.
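The triage logic behind the "reduce decisions" and "SOC telemetry" learnings above can be sketched as a small routing function: low-severity findings stay in the SIEM, behavior-preserving fixes get auto-remediated with a courtesy notification, and everything else goes back to the engineer. The rule names and the auto-fixable set below are hypothetical:

```python
# Illustrative sets; our real lists are policy-driven, not hard-coded.
AUTO_FIXABLE = {"disk_encryption_disabled", "secure_boot_disabled"}
REPORTABLE = {"CRITICAL", "HIGH"}

def route_violation(violation: dict) -> str:
    """Decide what happens to a detected violation."""
    if violation["severity"] not in REPORTABLE:
        return "log-only"  # lands in the SIEM; nobody is paged
    if violation["rule"] in AUTO_FIXABLE:
        # Doesn't change service behavior: fix in the background,
        # but still notify the owner so nothing happens silently.
        return "auto-remediate+notify"
    return "engineer-triage"  # actionable: PR comment / Slack ping

print(route_violation({"rule": "secure_boot_disabled",
                       "severity": "HIGH"}))
# 'auto-remediate+notify'
```

The useful property of keeping this as one explicit routing step is that every finding still reaches the SIEM, while only the genuinely actionable subset ever costs an engineer a decision.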
While I've specifically talked about using this platform to detect and fix security misconfigurations, the platform we built acts as a generic PDP that can be used to detect and prevent violations of any policy. The platform scales well and is extremely cost-effective. I'll talk more about the design and the technical challenges we ran into building the platform in a future blog post. In the meantime, I'd love to hear how you're using OPA to solve other security risks in your organization.