At Datadog, when we decided in 2017 to migrate to a containerized infrastructure, we chose Kubernetes as our container orchestrator. Its API-driven design, its extensibility, and, more importantly, its thriving and growing community convinced us to adopt it fully for our infrastructure and applications (including stateful ones).
Our current infrastructure runs on dozens of Kubernetes clusters across different clouds, adding up to thousands of Kubernetes nodes. The Compute team at Datadog, responsible for this infrastructure, also manages the Kubernetes control plane directly, tweaking the different Kubernetes components, including the API server.
Over the years, Datadog’s engineering teams have contributed new features, corner-case fixes, and bug fixes to different Kubernetes projects in order to run Datadog successfully.
In this article, we highlight some of these contributions, as well as spotlight some Datadog engineers who have made significant contributions to the project.
Autoscaling
As mentioned, Datadog runs a large Kubernetes infrastructure in several public clouds. At this scale, our autoscaling needs are very high and stretch the limits of the different autoscaling solutions. Over the years, Datadog engineers have contributed to several of them.
Vertical Pod Autoscaler
The Vertical Pod Autoscaler (VPA) is a Kubernetes component that automatically suggests (and optionally updates) containers’ CPU and memory requests based on usage statistics.
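To make that concrete, here is a minimal sketch of a VPA object built with the project’s published Go API types. The target Deployment name `web` is hypothetical, and `UpdateModeOff` asks the VPA for recommendations only, without evicting pods:

```go
package main

import (
	"fmt"

	autoscaling "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	vpav1 "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

func main() {
	// "Off" means the VPA only publishes recommendations in its
	// status; it never evicts pods to apply them.
	updateMode := vpav1.UpdateModeOff

	vpa := &vpav1.VerticalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "web-vpa", Namespace: "default"},
		Spec: vpav1.VerticalPodAutoscalerSpec{
			// The workload whose containers should be sized
			// (hypothetical Deployment name).
			TargetRef: &autoscaling.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "web",
			},
			UpdatePolicy: &vpav1.PodUpdatePolicy{UpdateMode: &updateMode},
		},
	}
	fmt.Printf("%s/%s targets %s\n", vpa.Namespace, vpa.Name, vpa.Spec.TargetRef.Name)
}
```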
As part of our internal usage of Kubernetes, we rely on the VPA to build internal tools that help developers select appropriate CPU and memory requests for their workloads.
Through this usage, engineers from the Compute team (the team behind our Kubernetes infrastructure) have proposed improvements and bug fixes to the VPA that benefit both Datadog and the wider community.
One of Datadog’s biggest contributions to the VPA was Lally Singh’s PR to add the ability to use an External Metrics source. Historically, the VPA relied solely on the Kubernetes Metrics Server to gather CPU and memory usage metrics, but as more companies use external observability tools, it makes sense for the VPA to rely on other metrics sources for its recommendations.
As part of the same PR, Lally improved the testing story of the VPA by adding the ability to run it in kind (Kubernetes IN Docker), making it easier and cheaper to run E2E tests for the VPA.
David Benque has also made several contributions to the VPA. Among those, he contributed a new feature that improves the user experience of the VPA when used with the CPU Manager static policy. Thanks to that feature, VPA users can now opt in to get integer CPU recommendations, rather than fractional ones, so they know how many whole CPUs to allocate for their workloads.
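Conceptually, an integer recommendation rounds a fractional CPU value up to whole cores, which is what the static CPU Manager policy needs in order to grant exclusive CPUs. A minimal sketch of that idea (`integerCPU` is a hypothetical helper, not the actual VPA code):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// integerCPU rounds a fractional CPU recommendation up to a whole
// number of cores, e.g. 1.2 cores -> 2 cores. Pods need integer CPU
// requests (and Guaranteed QoS) to receive exclusive cores from the
// static CPU Manager policy.
func integerCPU(recommendation resource.Quantity) resource.Quantity {
	milli := recommendation.MilliValue() // e.g. 1200 for "1200m"
	cores := (milli + 999) / 1000        // ceiling to whole cores
	return *resource.NewQuantity(cores, resource.DecimalSI)
}

func main() {
	rec := resource.MustParse("1200m")
	rounded := integerCPU(rec)
	fmt.Printf("%s -> %s\n", rec.String(), rounded.String()) // 1200m -> 2
}
```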
Other contributions from David include several code refactoring PRs to improve the VPA code base, bug fixes, as well as developer experience improvements.
“Autoscaling for nodes and pods is critical for keeping control of runtime cost. Having the chance to exchange with the SIG Autoscaling and contribute to the associated components is key for getting a better coverage in terms of functionality. Being confronted with community problems allows us to mature our ideas and deliver better solutions that apply to our Datadog environment and that can also benefit the wider community.”
The Datadog Compute team continues to be involved in the VPA community and continues to work on improvements and fixes to the component.
Cluster Autoscaler
The Cluster Autoscaler is a tool that adds nodes to a Kubernetes cluster when pods are pending due to a lack of capacity, and removes nodes when there is spare capacity.
The Cluster Autoscaler uses the different cloud APIs to provision new instances and then register them as nodes in the cluster. Datadog uses the Cluster Autoscaler with Azure, AWS, and Google Cloud through an internal abstraction called NodeGroup. To learn more about how we manage thousands of nodes on dozens of clusters, you can watch the Datadog on Kubernetes Node Management episode.
As we run a significant number of large and dynamic clusters, exposing a range of instance types to our applications across three cloud providers, we had to ensure the fleet could scale well with a growing number of nodes, autoscaling groups, and activity. For this reason, a large part of our contributions focused on improving the Cluster Autoscaler’s performance and scalability.
Benjamin Pineau has worked on several of these improvements across Azure, GCP, and AWS, illustrating the PRs with metrics and dashboards. We present three of them here:
For Azure, on Cluster Autoscaler start (or restart), all VMSS instance caches were refreshed at once, with the same TTL, so they all hit the API at the same time. Adding an optional jitter spreads those calls over time.
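The underlying idea can be sketched as follows (hypothetical names, not the actual cluster-autoscaler code): each cache entry gets a randomized TTL offset, so entries created at the same instant do not all expire, and hit the API, together.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredTTL returns the base TTL plus a random offset in
// [0, jitter), so cache entries created at the same instant
// (e.g. on process start) do not all expire at once.
func jitteredTTL(base, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return base
	}
	return base + time.Duration(rand.Int63n(int64(jitter)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredTTL(5*time.Minute, time.Minute))
	}
}
```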
For AWS, by keeping the Cluster Autoscaler up to date with the upstream APIs, we were able to cut the number of DescribeAutoScalingGroups API calls in half. This is very important to reduce the risk of being rate limited by the AWS API.
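The general pattern behind that reduction is describing many Auto Scaling groups per request instead of querying them one by one. A hedged sketch with aws-sdk-go-v2 (not the actual cluster-autoscaler code; the per-call name limit here is an assumption to verify against current AWS quotas):

```go
package asg

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

// maxNamesPerCall is an assumed per-call limit on ASG names; check
// the current AWS quota before relying on it.
const maxNamesPerCall = 100

// describeAll fetches many ASGs in as few DescribeAutoScalingGroups
// calls as possible by batching names up to the per-call limit.
func describeAll(ctx context.Context, client *autoscaling.Client, names []string) ([]types.AutoScalingGroup, error) {
	var groups []types.AutoScalingGroup
	for start := 0; start < len(names); start += maxNamesPerCall {
		end := start + maxNamesPerCall
		if end > len(names) {
			end = len(names)
		}
		// The paginator follows NextToken in case one batch still
		// spans several pages of results.
		p := autoscaling.NewDescribeAutoScalingGroupsPaginator(client, &autoscaling.DescribeAutoScalingGroupsInput{
			AutoScalingGroupNames: names[start:end],
		})
		for p.HasMorePages() {
			out, err := p.NextPage(ctx)
			if err != nil {
				return nil, err
			}
			groups = append(groups, out.AutoScalingGroups...)
		}
	}
	return groups, nil
}
```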
For Google Cloud, we introduced several changes that improved the Cluster Autoscaler’s startup time in clusters with hundreds of Managed Instance Groups (MIGs) by adding parallelism. One change brought startup time down from 40 minutes to 5 minutes; another brought it down from 6 minutes to 90 seconds.
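The pattern behind both wins is fetching the state of many MIGs concurrently rather than sequentially. A minimal sketch using golang.org/x/sync/errgroup (`fetchMIG` and the concurrency cap are hypothetical; this is not the actual cluster-autoscaler code):

```go
package mig

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// refreshAll fetches all Managed Instance Groups concurrently,
// bounding the number of in-flight GCE API calls.
func refreshAll(ctx context.Context, migs []string, fetchMIG func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(16) // assumed concurrency cap; tune to API quotas

	for _, name := range migs {
		name := name // capture loop variable (pre Go 1.22)
		g.Go(func() error {
			return fetchMIG(ctx, name) // one GCE API call per MIG
		})
	}
	return g.Wait() // first error cancels the remaining fetches
}
```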
Benjamin’s contributions are too numerous to list in full in this post. You can have a look at all of his contributions to the kubernetes and autoscaler repositories.
E2E Framework
At Datadog we automate many of our administration activities in our clusters by extending the Kubernetes API with Custom Resource Definitions and writing custom controllers for them. At the time of writing, Datadog teams write, build, and run 20+ controllers that are critical for our Kubernetes operations.
As more and more controllers are added to our clusters and the existing ones grow in complexity, having a robust end-to-end testing strategy for them is crucial.
At Datadog we chose to use the e2e-framework as our test framework for those controllers and the clusters themselves. While migrating our internal test suite to the framework, Matteo Ruina found several areas of improvement and decided to start contributing to the project. Some of his contributions include:
- Adding a `--context` flag to run the tests against a specific cluster (see the sketch after this list)
- Several documentation updates and improvements.
- Fixing a race condition that prevented tests from executing in parallel (with Philippe Scorsolini)
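As an illustration of how such a suite is wired up, here is a minimal test built on sigs.k8s.io/e2e-framework. Framework flags, including the context flag mentioned above, are parsed by `envconf.NewFromFlags()`, so the same test binary can target different clusters. This is a hedged sketch, not our internal suite:

```go
package e2e

import (
	"context"
	"os"
	"testing"

	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var testenv env.Environment

func TestMain(m *testing.M) {
	// Parses framework flags from the command line, e.g.:
	//   go test ./e2e -args --context my-cluster
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		os.Exit(1)
	}
	testenv = env.NewWithConfig(cfg)
	os.Exit(testenv.Run(m))
}

func TestControllerIsHealthy(t *testing.T) {
	f := features.New("controller health").
		Assess("API is reachable", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			// Building a client exercises the kubeconfig/context
			// the framework flags selected.
			if _, err := cfg.NewClient(); err != nil {
				t.Fatalf("cannot build client: %v", err)
			}
			return ctx
		}).
		Feature()
	testenv.Test(t, f)
}
```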
As the migration continues, we will keep collaborating with SIG Testing on improvements to the e2e-framework.
“Contributing to the e2e-framework was a great opportunity to get involved in a kubernetes-sigs project, and I am looking forward to seeing the framework’s adoption increase.”
Kubernetes CSI Drivers
Traditionally, Kubernetes managed the different storage systems through in-tree volume plugins. The Container Storage Interface (CSI) was introduced (GA in Kubernetes 1.13) to standardize the way storage systems are exposed to workloads in Kubernetes. With CSI, storage providers can develop and maintain their plugins independently, allowing quicker iteration and a leaner Kubernetes code base.
When Datadog started migrating its volumes from the in-tree plugins to CSI drivers, our engineers also decided to contribute improvements to the observability of those drivers. In particular, Baptiste Girard-Carrabin worked on adding OpenTelemetry instrumentation to the drivers.
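Since CSI drivers are gRPC servers, instrumenting one typically means wiring an OpenTelemetry handler into the server. Here is a hedged sketch of that pattern using the otelgrpc contrib package; it illustrates the approach, not Baptiste’s actual changes (the socket path is arbitrary):

```go
package main

import (
	"log"
	"net"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

func main() {
	// CSI drivers listen on a Unix socket that the kubelet dials.
	lis, err := net.Listen("unix", "/tmp/csi.sock")
	if err != nil {
		log.Fatal(err)
	}

	// The stats handler creates a span per CSI RPC (CreateVolume,
	// NodePublishVolume, ...) using the globally configured
	// OpenTelemetry tracer provider.
	srv := grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)

	// Register the CSI Identity/Controller/Node services here,
	// e.g. csi.RegisterIdentityServer(srv, driver) using the
	// github.com/container-storage-interface/spec bindings.

	log.Fatal(srv.Serve(lis))
}
```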
Security
At Datadog, the scale of our Kubernetes environments presents interesting security challenges. We are proud to work with the Kubernetes community to solve them, and share what we learn as we go. Many current and former Datadog employees contribute to Kubernetes security by sharing education, code, and leadership.
Several of our engineers led by Ethan Lowman, together with the containerd community, added image signature verification support to that fundamental part of the container ecosystem. Cédric Van Rompay and Julien Doutre continue to support and refine this work.
Jeremy Fox, Julien Terriac, and Edouard Schweisguth developed KubeHound to help us harden our internal Kubernetes deployments, and continue to improve and share it to benefit the broader Kubernetes user community.
Rory McCune helps to maintain the CIS benchmark for Kubernetes, a vendor-neutral hardening guide that's widely used by cluster operators and security tooling vendors. He also publishes original security research, both individually and together with his hacker crew, SIG Honk.
Tabitha Sable helps the Kubernetes community stay safer through her security research and community leadership. She presents frequently at KubeCon and other conferences, publishes vulnerability write-ups and tooling, and mentors the next generation of Kubernetes security leaders. Tabitha serves the community as co-chair of SIG Security and a member of the Security Response committee.
Several Datadog security researchers contribute to Kubernetes security research as part of Datadog's Security Labs.
Knowledge sharing
Aside from code and documentation contributions, Datadog has long contributed to the ecosystem by sharing our stories of running Kubernetes at scale with the community.
These are some of the talks at KubeCon/CloudNativeCon that Datadog engineers have participated in:
- Everything, Everywhere, All At Once
- Secure Transport for Your Software Supply Chain with TUF
- How to Carefully Replace Thousands of Nodes Every Day
- Logs Told Us It Was DNS, It Felt Like DNS, It Had To Be DNS, It Wasn’t DNS
- Building Container Images In Kubernetes: It’s Been a Journey!
- Image Signing and Runtime Verification at Scale: Datadog's Journey
- Malicious Compliance: Reflections on Trusting Container Scanners
- Mind the Gap! Bringing Together Cloud Services and Managed K8s Environments
- PKI the Wrong Way: Simple TLS Mistakes and Surprising Consequences
- PodSecurityPolicy Replacement: Past, Present, and Future
“We started our migration to Kubernetes in 2018 with version 1.10. We were deploying clusters hosting thousands of nodes which back then was pushing limits. We started working with the community very quickly and they were helpful and receptive to our suggestions. We started by contributing to SIG-Network, especially on kube-proxy and CNI and with SIG-Scalability. Since then, we have continued to build relationships with the ecosystem and we currently actively collaborate on Cilium and SIG-autoscaling for instance.”
The road ahead
Datadog will continue to run its services on Kubernetes, at scale. As our engineers find new scalability problems to solve, they will keep working with the community to propose solutions to those challenges.