Go is used heavily at Datadog. As part of our APM, profiling, and security products, we wrote and maintain a Go tracing library that lets Go developers trace requests as they flow across services; find CPU, memory, and synchronization bottlenecks; and even alert on attacks that try to exploit code-level vulnerabilities in Go applications.

But our Go usage only starts there. Many of the Datadog backend services are written in Go, and our engineers stretch the limits of the language to ensure that our applications are as performant as possible.

As part of our library and backend services work, we sometimes find areas of the language that can be improved, as well as bugs or performance issues that require low-level changes to the language itself. Our engineers regularly contribute those improvements to Go.

In this article we highlight some of those contributions.

Reducing Go Execution Tracer Overhead

Go has an execution tracer that is able to provide a moment-to-moment view of what happens in a Go program over some duration. Unfortunately, its large overhead (up to 20% CPU) prevented potential users from turning it on in production, limiting its usefulness.
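For readers who have not used the tracer before, here is a minimal sketch of capturing an execution trace with the standard `runtime/trace` package. The helper names `captureTrace` and `busyWork` are hypothetical and exist only for this illustration; a real service would typically write the trace to a file or ship it to a profiling backend instead of an in-memory buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
	"sync"
)

// captureTrace (hypothetical helper) runs work with the execution tracer
// enabled and returns how many bytes of trace data were produced.
func captureTrace(work func()) (int, error) {
	var buf bytes.Buffer
	// trace.Start returns an error if tracing is already enabled.
	if err := trace.Start(&buf); err != nil {
		return 0, err
	}
	work()
	// trace.Stop returns only after all trace data has been written.
	trace.Stop()
	return buf.Len(), nil
}

// busyWork (hypothetical workload) starts a few goroutines so the
// trace contains some scheduling activity.
func busyWork() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for j := 0; j < 1_000_000; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	wg.Wait()
}

func main() {
	n, err := captureTrace(busyWork)
	if err != nil {
		fmt.Println("could not start tracer:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", n)
}
```

If the trace is written to a file instead, it can be inspected with `go tool trace`.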

At Datadog, we wanted to integrate the execution tracer into our profiling tool, but this overhead made that impossible.

Most of the overhead came from stack unwinding, so in early 2023 two Datadog engineers, Felix Geisendörfer and Nick Ripley, began proposing patches to show that the overhead could be reduced by implementing frame pointer unwinding.

Since then, they have been collaborating with the Go runtime team at Google, and thanks to that work, the go1.21 release featured a new version of the tracer that should incur less than 1% overhead for most applications.

“Fast stack unwinding is a critical ingredient for observing and optimizing Go programs via execution tracing in production. We’re very grateful to the Go team at Google for taking the time to review our patches and help us overcome various challenges when it comes to managing frame pointers. Since go1.21 was released, we have deployed continuous execution tracing for our engineers at Datadog as well as our customers. This has already resulted in many success stories when it comes to root causing incidents and optimizing applications. Going forward, we are excited to continue our collaboration with the Go team and enhance other parts of the runtime with frame pointers as well.”
Felix Geisendörfer, Senior Staff Software Engineer, Datadog

Once those patches were released, the Go Profiling team at Datadog was able to start using the execution tracer, which lets them capture extremely detailed data. This made it possible to build the new profiler timeline feature, which helps you debug difficult latency issues in your Go services. Felix created a fantastic video explaining how to use the feature to solve a P95 latency issue in a service.

Improvements and fixes to the runtime metrics

Because we build monitoring tools, we are always trying to improve the observability of the different parts of our customers' tech stack. In many cases, this involves discovering issues in existing metrics or finding missing ones in upstream projects, and fixing those issues in the projects themselves.

Felix Geisendörfer submitted a proposal to add a way to properly monitor the live heap. After some discussion with the core team, the proposal was accepted and Felix landed several patches to implement the new metrics. This work was released in Go 1.21.
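As an illustration, the sketch below reads a live heap value through the `runtime/metrics` package. It assumes the metric is exposed under the name `/gc/heap/live:bytes`, which is the name we understand Go 1.21 and later to use; the code checks the returned value kind so it degrades gracefully on a runtime that does not support it. The helper `liveHeapAfterGC` and the `retained` variable are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// liveHeapMetric is assumed to be the live heap metric name (Go 1.21+).
const liveHeapMetric = "/gc/heap/live:bytes"

// retained keeps allocations reachable so they count as live heap.
var retained [][]byte

// liveHeapAfterGC (hypothetical helper) forces a GC cycle and then reads
// the live heap metric. The second result is false when the running
// Go version does not expose the metric.
func liveHeapAfterGC() (uint64, bool) {
	// The metric reflects what the previous GC cycle marked as live,
	// so run a cycle first to get an up-to-date value.
	runtime.GC()

	sample := []metrics.Sample{{Name: liveHeapMetric}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return sample[0].Value.Uint64(), true
}

func main() {
	// Hold on to roughly 64 MiB so the live heap is clearly visible.
	for i := 0; i < 64; i++ {
		retained = append(retained, make([]byte, 1<<20))
	}

	live, ok := liveHeapAfterGC()
	if !ok {
		fmt.Println("live heap metric not supported by this Go version")
		return
	}
	fmt.Printf("live heap: %.1f MiB\n", float64(live)/(1<<20))
}
```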

When the Datadog Go tracing team started moving our Go tracing library to the Go runtime/metrics package for reporting runtime metrics, they noticed that one of the metrics (/sched/latencies:seconds) was incorrectly marked as non-cumulative, when in reality the runtime was reporting it as a cumulative metric. This metric is very valuable because it helps us understand how much latency comes from goroutine scheduling in the Go runtime. Nayef Ghattas worked on a patch to change the metric's definition in the runtime to cumulative, aligning it with how the runtime already reported the metric.
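To make the distinction concrete, here is a small sketch of how a consumer might treat this metric. It looks up the `Cumulative` flag in the description returned by `metrics.All()`, and then derives a per-interval count by subtracting two successive reads, which is the usual way to consume a cumulative histogram. The helper names (`isCumulative`, `schedLatencyCount`, `spawnGoroutines`) are hypothetical, and the value of the flag depends on the Go version in use.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"sync"
)

const schedLatencyMetric = "/sched/latencies:seconds"

// isCumulative (hypothetical helper) reports the Cumulative flag that
// the runtime publishes in the description of the named metric.
func isCumulative(name string) (cumulative, found bool) {
	for _, d := range metrics.All() {
		if d.Name == name {
			return d.Cumulative, true
		}
	}
	return false, false
}

// schedLatencyCount (hypothetical helper) returns the total number of
// observations currently recorded in the scheduling latency histogram.
func schedLatencyCount() (uint64, bool) {
	sample := []metrics.Sample{{Name: schedLatencyMetric}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindFloat64Histogram {
		return 0, false
	}
	var total uint64
	for _, c := range sample[0].Value.Float64Histogram().Counts {
		total += c
	}
	return total, true
}

// spawnGoroutines (hypothetical workload) creates scheduling activity.
func spawnGoroutines(n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() { defer wg.Done() }()
	}
	wg.Wait()
}

func main() {
	if c, found := isCumulative(schedLatencyMetric); found {
		fmt.Printf("%s cumulative flag: %v\n", schedLatencyMetric, c)
	}

	before, ok := schedLatencyCount()
	if !ok {
		fmt.Println("metric not supported by this Go version")
		return
	}
	spawnGoroutines(1000)
	after, _ := schedLatencyCount()

	// Because the histogram only grows, the difference between two
	// reads is the number of observations made during the interval.
	fmt.Printf("observations during interval: %d\n", after-before)
}
```

A reporter that trusted a non-cumulative flag would submit the raw counts on every flush and overstate the values, which is why the flag needs to match the runtime's actual behavior.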

Community

Aside from contributing code, Felix Geisendörfer regularly contributes to the Go community by sharing knowledge in conference talks. Some of his Go-related talks are:

GopherCon 2021: Go Profiling and Observability from Scratch

GoLab Webinar: Go Profiling from the Bottom Up