Datadog is an observability and security platform that ingests and processes tens of trillions of datapoints per day from millions of hosts across our more than 26,000 customers.
At Datadog, we use Kafka heavily as our messaging persistence layer, and the intake of tens of trillions of datapoints per day translates into double-digit gigabytes per second.
At that scale, Datadog engineers have built internal tooling to manage their Kafka fleet, including handling partition-to-broker mappings, failed broker replacements, storage-based partition rebalancing, and replication auto-throttling.
- topicmappr replaces and extends the kafka-reassign-partition tool bundled with Kafka. It allows for minimal movement broker replacements, cluster storage rebalancing / partition bin-packing, leadership optimization, many-at-once topic management, and more, all with rack awareness support.
- registry is a gRPC+HTTP API service for Kafka that allows granular resource (topics, brokers) lookup and management with custom tagging support.
- autothrottle is a service that automatically paces Kafka replication/recovery throttle rates, powered with metrics using the Datadog API.
- metricsfetcher is a utility that fetches metrics via the Datadog API for Kafka storage rebalancing and partition mapping with topicmappr.
This set of tools for managing Kafka was open sourced and released under the name kafka-kit in late 2017. Since then, both internal and external contributors have kept kafka-kit up to date, fixing bugs and adding new features.
In this spotlight article, we talk with Jamie Alquiza, one of the founders of kafka-kit, about the origins of the project, their contributions, and the future ahead.
Why did you decide to start kafka-kit instead of using existing tools to manage Kafka?
We operated Kafka at a significant scale and quickly found limitations in the existing tooling. In particular, storage rebalancing with rack-aware partition placement and granular controls over topic isolation were not readily available.
How did the decision to open source kafka-kit come about?
After speaking with dozens of engineers working on Kafka outside of Datadog, it became clear that many others faced the same problems we were solving and were interested in how we were doing so. The code was also intentionally designed in a way that wasn’t overly integrated into any in-house tooling or systems, which made it easy to open source.
What advice would you give to someone who is considering contributing to kafka-kit, but hasn't taken the first step yet?
Small changes and bug fixes are welcomed and frequently accepted! One good way of going about larger changes is opening an issue that describes what you intend to do and how you might go about it, mostly because we may have suggestions that could make development or testing easier. We also have a Contributing Guide that can help with first steps.