Cilium

Datadog infrastructure runs on dozens of Kubernetes clusters, on different clouds, adding up to thousands of Kubernetes nodes.

To manage this complex and heterogeneous Kubernetes environment, Datadog required a networking solution that was fast, worked across different clouds, and could handle the scale we run at.

In 2017, Laurent Bernaille, Principal Software Engineer at Datadog, met Thomas Graf, one of the founders of the Cilium project, an eBPF-based networking solution for Kubernetes. After discussing Datadog's needs and Cilium's vision and roadmap, Laurent decided to give Cilium a try and created a proof of concept to see if Cilium could be the Kubernetes networking solution Datadog was looking for.

Since then, a lot has changed in both the Cilium project and Datadog, and today Cilium is used heavily in our Kubernetes clusters: as the pod and service networking solution, to enforce network policies, and to encrypt host-to-host traffic.

As Cilium is a critical piece of infrastructure for Datadog, and Datadog's use case is a very specific one, we sometimes needed features that were not yet implemented, or hit bugs that we were the first to encounter. In those cases we did what needed to be done: participate in the community, discuss potential changes, and, once they were approved, contribute them upstream.

In this spotlight article we talked with Laurent Bernaille and Hemanth Malla, two of the engineers in the Datadog team who have been regularly contributing to the Cilium project, and with Liz Rice, Chief Open Source Officer at Isovalent, about Cilium, their contributions, and the future ahead.

When looking for a networking solution for Datadog’s Kubernetes clusters, what was it that you saw in the Cilium project that made you give it a try?

We had been running with kube-proxy and the Lyft CNI plugin for more than a year and we were facing a few issues:

- kube-proxy had significant latency issues in iptables mode, so we used IPVS mode. But IPVS mode was still very young and we had to fix several issues; I actually became a maintainer of the kube-proxy IPVS code base.
- We also had scalability issues with the Lyft CNI plugin: IP allocations were performed independently on each host, leading to rate limiting during fast scale-ups.
- Finally, we did not have a good solution for network policies and network-level encryption, which we needed.

Cilium had a very good story for these topics, and eBPF felt like the best approach for load balancing (kube-proxy) and network policies: the solutions we were using had not been designed for these use cases, and eBPF gave us full programmability.


What are your main contributions to Cilium?

I helped with the specification and testing of the AWS IPAM feature, and I contributed mostly to the IPsec encryption support.

Another smaller but interesting contribution was adding an unreachable route for the pod IP when pods are deleted. It came out of a months-long incident investigation, in which a high rate of errors on service updates made it look like a DNS problem, but it wasn't.
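The idea behind that fix can be sketched roughly as follows. This is a hypothetical Python helper, not Cilium's actual implementation (the real change lives in Cilium's Go agent): installing an unreachable route for a deleted pod's IP makes traffic to the stale address fail fast instead of being silently re-routed.

```python
import ipaddress
import subprocess

def unreachable_route_cmd(pod_ip: str) -> list[str]:
    """Build the `ip route` command that installs an unreachable route
    for a deleted pod's IP, so stale connections fail immediately with
    EHOSTUNREACH instead of lingering or being misrouted."""
    addr = ipaddress.ip_address(pod_ip)  # validates the address
    plen = 32 if addr.version == 4 else 128  # host route
    return ["ip", "route", "replace", "unreachable", f"{addr}/{plen}"]

def install_unreachable_route(pod_ip: str) -> None:
    # Requires CAP_NET_ADMIN; would run on the node that hosted the pod.
    subprocess.run(unreachable_route_cmd(pod_ip), check=True)

if __name__ == "__main__":
    print(" ".join(unreachable_route_cmd("10.0.1.5")))
```

Using `replace` rather than `add` keeps the operation idempotent if the route already exists.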


Cilium and eBPF have also proved to be a great way to debug and fix kernel issues. Can you tell us a bit more about that?

One clear example is that having eBPF in the datapath of packets coming from pods helped us debug a kernel bug by modifying the socket buffer ("SKB") with bpftrace, an eBPF tool. To mitigate the issue, Cilium maintainers created an eBPF program to work around it on affected kernels. I gave a talk about this particular scenario at eBPF Summit 2022, in case you are interested in learning more.


What are you looking forward to seeing in the Cilium project?

I am mostly interested in several scalability features, such as support for using CRDs in large clusters to remove the additional etcd dependency. Another nice feature would be the ability to embed identities in packets, allowing identity verification without having to synchronize all IPs and identities within large clusters or across clusters.


When did you start contributing to Cilium and how was that first contribution experience?

I started contributing to Cilium in October 2021, when we found a bug that had been bothering us for a while. I clearly remember my first contribution experience to the Cilium project. I created a draft pull request and joined the weekly community meeting to request some early feedback. The community was very welcoming and helped a lot by walking me through different corner cases for the solution. I even got help with some advanced git, where I needed to move changes from one commit to another.


What are your main contributions to Cilium?

My primary contributions have been to the ENI IPAM mode in Cilium. Most importantly, adding support for prefix delegation and introducing a handshake between the agent and the operator to fix a race condition when releasing excess IPs.
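To see why prefix delegation matters for pod density, here is a back-of-the-envelope sketch. The function names and instance numbers are illustrative assumptions, not Cilium or AWS APIs; the key fact is that with prefix delegation each secondary-IP slot on an ENI can hold a delegated /28 IPv4 prefix (16 addresses) instead of a single IP:

```python
import ipaddress

def pods_without_prefix_delegation(enis: int, ips_per_eni: int) -> int:
    # Each ENI's first address is its primary IP, not usable for pods,
    # so every ENI contributes (ips_per_eni - 1) pod IPs.
    return enis * (ips_per_eni - 1)

def pods_with_prefix_delegation(enis: int, ips_per_eni: int) -> int:
    # Each secondary-IP slot can hold one delegated /28 prefix instead,
    # and a /28 contains 16 addresses.
    prefix_size = ipaddress.ip_network("10.0.0.0/28").num_addresses  # 16
    return enis * (ips_per_eni - 1) * prefix_size

# Illustrative numbers only -- real limits vary per instance type.
print(pods_without_prefix_delegation(3, 10))  # 27
print(pods_with_prefix_delegation(3, 10))     # 432
```

The same slot budget yields 16x the pod IP capacity, which also means far fewer calls to the cloud API during fast scale-ups.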


What are you looking forward to seeing in the Cilium project?

I'm really looking forward to a feature called high-scale IP cache, which significantly improves the scalability of a core Cilium subsystem called ipcache. You can read more about it in this CFP.


What advice would you give to someone who is considering becoming an open-source contributor, but hasn't taken the first step yet?

Attend community meetings, join the Slack channel, and look for help-wanted labels on GitHub issues. More importantly, if you use an open source project and notice something that's not right, don't hesitate to speak to the community about it. It's very likely that the community would love to see the issue fixed but hasn't gotten around to it. Another helpful exercise is to triage open issues in areas you're familiar with. Not only does it save the maintainers a lot of time, but with a little research you might be able to fix some of the issues reported by other users. If you're trying to understand how something works in a large OSS project, start by reading test cases. If you find that there isn't enough test coverage, maybe that's your first contribution?


What’s the relationship between Datadog and the Cilium project?

Datadog was an early adopter of the Cilium project, and is a long-time contributor and valued supporter. It was one of the first companies to run Cilium at significant scale, and it helped prove that the project was production ready. As well as using Cilium, Datadog also contributes back, in code and in many other ways. It's great that Laurent and Hemanth are so involved in the project as committers, and Datadog folks regularly talk about their experiences with Cilium at conferences and in blogs.


What would you highlight from Datadog's contributions?

It would be difficult to go through everything that Datadog has given to the project, but some of the highlights have been captured in conference talks, including All Your Queues Are Belong to Us: Debugging and Mitigating a Kernel Bug with eBPF, where Datadog found and mitigated a kernel bug on the fly with Cilium, and Tales from an eBPF Program's Murder Mystery, where they dove into how different eBPF-based applications interact. Datadog has also written deep-dive blog posts helping the community see how to run Cilium at scale, such as one on key metrics for monitoring Cilium.


What advice would you give to someone who is considering contributing to Cilium, but hasn't taken the first step yet?

As you can see from Datadog's example, there are many ways to get involved in the project. If you want to start with code contributions, check out the contributor guide. If you want to write a blog post or give a talk about Cilium, reach out to the community for help.