Chaos Engineering at Datadog
At Datadog, the reliability and resilience of our infrastructure and services are critical, as our customers depend on us to monitor and secure their own infrastructure and applications. One of the best practices that Datadog engineering teams follow is Chaos Engineering. As part of their resilience efforts, engineering teams deliberately disrupt their infrastructure or services, partially or totally, in a controlled manner, to build confidence in how well the rest of the system behaves, how effective the monitoring around it is, and how quickly they would be able to recover from an incident.
These efforts were mostly team-based, and the experiments were organized manually on a case-by-case basis, with some automation on top. But as the engineering organization at Datadog scaled, and more and more teams wanted to run gamedays frequently, the need for a dedicated project arose.
In this "Datadog on..." episode, engineers from the Chaos Engineering team at Datadog further explained how Chaos Engineering is performed at Datadog:
The Datadog Chaos Controller
The Datadog Chaos Controller is a Kubernetes controller with which you can inject various systemic failures, at scale, without worrying about the implementation details of your Kubernetes infrastructure. By applying resources of type Disruption, teams can control how and when they want their Kubernetes pods and nodes to fail.
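As a rough illustration, a Disruption is a regular Kubernetes custom resource describing which targets to select and which failure to inject. The manifest below is a minimal sketch: the resource name, namespace, and labels are illustrative, and the exact schema should be checked against the project's documentation.

```yaml
# Minimal sketch of a Disruption resource (illustrative names and values;
# verify the exact schema against the chaos-controller documentation).
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure-example   # hypothetical resource name
  namespace: chaos-demo        # hypothetical namespace
spec:
  level: node                  # target nodes rather than pods
  selector:
    app: demo                  # hypothetical label selecting the targets
  count: 1                     # number of targets to disrupt
  nodeFailure:
    shutdown: false            # make the node fail without powering it off
```

Applying such a manifest with kubectl starts the experiment on the selected targets.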
In this spotlight article we talked with Joris Bonnefoy, original author of the Chaos Controller, and Sam Azouzi, maintainer, about the origins of the project and its future.
We also spoke with Nikos Katirtzis from Expedia Group, one of the contributors to the project, about how Expedia is currently using the Chaos Controller as part of its chaos engineering efforts, and about his contribution experience.
Maintainers spotlight
When and why did you decide to start Chaos Controller?
The Chaos Controller was created internally in late 2018. At that time, Datadog was migrating most services to Kubernetes clusters, which introduced a whole new set of possible failure scenarios to explore and prepare for. Kubernetes also offered a brand new way of working, and it was an opportunity to propose a Kubernetes-native platform for creating failures, so that our engineers could focus on what’s important to them: their services’ behavior under unexpected conditions, rather than how to create those unexpected conditions.
Since then, the internal platform has evolved a lot, gaining many advanced features, but the Chaos Controller remains the centerpiece of it all.
As a maintainer of an open source project that has been around for a while, how do you keep a healthy community and project?
The decision to open source the Chaos Controller was never made with the goal of building a community, as some other existing tools have done. Rather, we wanted to propose an alternative vision of a failure injection tool: one used in a tech company with high scalability constraints and already-advanced Kubernetes users.
The steady adoption of, and regular contributions to, the Chaos Controller outside of Datadog seem to demonstrate that there is still a need in this area that is not entirely fulfilled by other projects, and we are glad that our experience can be helpful there.
What would be your recommendation for people who want to contribute to the Chaos Controller but haven't taken the first step yet?
The Chaos Controller is made up of multiple scoped components, such as the controllers, responsible for the custom resources’ lifecycle management, and the injector, responsible for the failure injection logic. Those components can be seen as standalone projects in themselves. While it’s interesting to know how they are interconnected, it is not a requirement for working on one of them, and I’d recommend that someone joining the project focus on one component first to clearly identify its role and implementation.
When did you start contributing to Chaos Controller?
In 2019, I interned at Datadog as part of the Chaos Engineering team. After joining, I was assigned small Chaos Controller issues to work on, which allowed me to gain confidence in the codebase and the ability to start tackling more complex features.
After the internship, I joined Datadog as a Site Reliability Engineer and continued contributing to the project.
What are the contributions you are most proud of?
The Chaos Controller lets you disrupt outgoing traffic using its network disruption functionality.
I worked on several new features for the network disruption functionality, including packet corruption, packet delays (with support for jitter), and packet duplication.
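To make this concrete, here is a hedged sketch of what combining those network features might look like in a single Disruption. The field names and units follow my reading of the project's documentation and should be verified there; the labels and values are illustrative.

```yaml
# Sketch of a network Disruption combining delay, jitter, corruption, and
# duplication (field names/units per my understanding of the docs; labels
# and values are illustrative).
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-degradation-example  # hypothetical resource name
  namespace: chaos-demo              # hypothetical namespace
spec:
  level: pod
  selector:
    app: demo                        # hypothetical label selecting target pods
  count: 1
  network:
    delay: 1000                      # delay added to outgoing packets (ms)
    delayJitter: 100                 # jitter applied around that delay
    corrupt: 5                       # percentage of packets to corrupt
    duplicate: 5                     # percentage of packets to duplicate
```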
What are the features you are most looking forward to in Chaos Controller?
We are introducing eBPF to implement more disruptions, including some very promising work that has been done to add disruptions at the application layer.
External contributions spotlight
How did you first learn about the Chaos Controller?
I follow industry innovation, and I always encourage my colleagues to chat with engineers from other companies. In this specific case, our teams were tasked with creating the vision for chaos engineering in the company. We got in touch with leaders in this space, and Datadog's Chaos Controller looked like a great fit: it was not tied to any technology other than Kubernetes, which was the chosen container orchestration solution in our new runtime platform.
How are you currently using it?
The controller is the main component behind Expedia's Chaos Engineering Platform. We published a blog post about the platform, which has evolved since then. We are now running hundreds of experiments with teams across the company. The scenarios range from verifying CPU-based autoscaling of services, to validating timeouts and fallbacks, to verifying cluster-wide Availability Zone failover. The latter is an exercise that requires coordination between different teams and organisations within the company.
We have also been exploring combining canaries with Chaos Engineering. We have an internal Progressive Deployment solution, which we presented at the DASH conference, and we can target canary workloads using labels. We have a funny name for this: Chaotic Progressive Deployment.
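As an illustration, targeting only canary workloads amounts to pointing the Disruption's label selector at whatever label the progressive deployment tooling puts on canary pods. The label key and value below are hypothetical, as are the other names in this sketch.

```yaml
# Hypothetical sketch: disrupt only canary pods by selecting on a canary label
# (the label key/value depend on the deployment tooling and are illustrative).
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: canary-network-drop    # hypothetical resource name
  namespace: my-service        # hypothetical namespace
spec:
  level: pod
  selector:
    app: my-service            # hypothetical service label
    rollout-phase: canary      # hypothetical label applied only to canary pods
  count: 1
  network:
    drop: 30                   # drop a share of outgoing packets from the canary
```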
What are your main contributions to the project?
My primary contributions have been focused on making the controller work for my own company.
This includes:
- Extending or tweaking the controller to work with our runtime platform, which is built on top of Kubernetes and the Istio service mesh. I recall debugging DNS spoofing and reconciliation issues in the control plane for quite some time.
- Abstracting certain configurations so we can override them with company-specific details.
- Adding support for container state failures (see the sketch after this list).
- Documentation including diagrams and improvements in the local setup.
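As mentioned above, container state failures are one of the disruption kinds the controller supports. The sketch below shows roughly what such a Disruption might look like; the containerFailure field and its forced flag reflect my reading of the project's documentation, and the other names are illustrative.

```yaml
# Sketch of a container failure Disruption (field names per my understanding
# of the chaos-controller docs; labels and names are illustrative).
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: container-failure-example  # hypothetical resource name
  namespace: chaos-demo            # hypothetical namespace
spec:
  level: pod
  selector:
    app: demo                      # hypothetical label selecting target pods
  count: 1
  containerFailure:
    forced: true                   # kill the targeted containers forcefully rather than gracefully
```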
We now have an amazing team who are also contributing to the controller.
What do you hope to see next for the Chaos Controller?
There have been some great additions, such as support for eBPF, network disruptions on cloud-managed services, and per-disruption reporting.
There is not much missing from the controller itself.
What appeals to me is an end-to-end platform that enables failures with deterministic outcomes. Netflix engineers presented the solution they have internally for this at re:Invent. Unfortunately, this requires standardisation, which is hard to achieve in big companies.