Stateful Workloads on Kubernetes – Webinar by Datadog

Date: May 20, 2024

In a recent Datadog webinar, engineers shared the company’s practices and lessons learned from running stateful workloads such as Kafka and PostgreSQL on Kubernetes. The event, part of Datadog’s ‘Datadog on’ series, was the first of these otherwise online sessions to be held before a live audience, at the Datadog Summit in London.

The presenters, Edward and Martin, introduced themselves and provided context on Datadog’s scale and infrastructure. Datadog, an observability platform, manages telemetry data for over 27,000 customers, translating to tens of trillions of data points per day. This vast amount of data is processed across thousands of Kubernetes clusters, each with thousands of nodes, highlighting the complexity and scale at which Datadog operates.

The session delved into the technical challenges and solutions associated with running stateful workloads on Kubernetes. Stateless applications are relatively straightforward to manage on Kubernetes because pods can be killed and rescheduled freely; stateful workloads introduce complexity because their data must survive pod rescheduling and node failures, so every operational action has to preserve data integrity and avoid data loss.

Martin discussed Kafka, a distributed streaming platform critical to Datadog’s operations. Kafka supports high-throughput, low-latency workloads, making it well suited to Datadog’s needs. Martin detailed how Kafka clusters are managed using Kubernetes StatefulSets, PersistentVolumeClaims (PVCs), and node affinity rules to ensure data persistence and resilience.
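
For concreteness, a minimal sketch of that pattern might look like the manifest below; the names, image, node-pool label, and sizes are hypothetical, not Datadog’s actual configuration. The volumeClaimTemplate gives each broker its own PVC that outlives any individual pod, and the node affinity rule pins brokers to a dedicated node pool.

```yaml
# Hypothetical sketch of a Kafka broker StatefulSet; names, image,
# node-pool label, and sizes are illustrative, not Datadog's manifests.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless      # stable per-broker DNS identities
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      affinity:
        nodeAffinity:              # pin brokers to a dedicated node pool
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-pool # hypothetical node label
                    operator: In
                    values: ["kafka"]
      containers:
        - name: kafka
          image: apache/kafka:3.7.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:            # one PVC per broker, survives pod restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Ti
```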

One significant topic covered was the distinction between local and remote storage. Local storage offers performance benefits but can complicate node replacements, as data needs to be copied back to new nodes. Remote storage, such as cloud block storage, simplifies this process by allowing data volumes to be reattached to new nodes, though it can introduce latency and performance trade-offs.
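
To make the trade-off concrete, here is a minimal sketch of how the two options could be exposed as StorageClasses; the class names are hypothetical, and the provisioners shown (the AWS EBS CSI driver and the static local-volume provisioner) are common examples rather than anything the webinar specified.

```yaml
# Remote block storage: volumes can detach from a failed node and
# reattach to its replacement, so no data copy is needed.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: remote-block            # hypothetical name
provisioner: ebs.csi.aws.com    # AWS EBS CSI driver, as one example
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
---
# Local storage: faster, but the volume lives and dies with its node;
# replacing the node means re-copying the data onto a new disk.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme              # hypothetical name
provisioner: kubernetes.io/no-provisioner  # statically provisioned local PVs
volumeBindingMode: WaitForFirstConsumer
```

In both cases, WaitForFirstConsumer delays volume binding until the pod is scheduled, which matters when volumes are constrained to particular nodes or zones.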

Edward then shifted the focus to PostgreSQL, another critical component of Datadog’s infrastructure. PostgreSQL, with its single-leader architecture, requires careful management to ensure high availability and performance. Edward explained how Datadog leverages ZooKeeper and Patroni for cluster-state management and leader election, ensuring seamless failover and minimal downtime.
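
As a hedged illustration of that setup (not Datadog’s real configuration), a Patroni member pointed at ZooKeeper for distributed consensus might be configured roughly as follows; the hostnames, paths, and timing values are placeholders. Patroni holds a leader key in ZooKeeper with a TTL; if the leader stops refreshing it, the replicas hold an election and a sufficiently caught-up one is promoted.

```yaml
# Minimal Patroni sketch using ZooKeeper as the DCS; all hosts,
# paths, and timing values here are hypothetical placeholders.
scope: pg-cluster              # cluster name shared by all members
name: pg-node-1                # unique name of this member
restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node-1.example.internal:8008
zookeeper:
  hosts:
    - zk-1.example.internal:2181
    - zk-2.example.internal:2181
    - zk-3.example.internal:2181
bootstrap:
  dcs:
    ttl: 30                    # leader key TTL; expiry triggers failover
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # max bytes a replica may lag and still be promoted
postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg-node-1.example.internal:5432
  data_dir: /var/lib/postgresql/data
```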

Datadog’s approach to node lifecycle management was another key topic. The company employs node lifecycle automation to regularly replace nodes, ensuring hardware freshness and reliability. This practice, while beneficial for maintaining performance, necessitates robust data recovery and backup mechanisms to minimize downtime and data loss.
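
A standard Kubernetes guardrail for this kind of rolling node replacement is a PodDisruptionBudget, which caps how many members of a stateful cluster a node drain may evict at once. The sketch below is illustrative, not Datadog’s actual policy.

```yaml
# Hypothetical PodDisruptionBudget: during node drains, allow at most
# one Kafka broker to be evicted at a time so quorum is preserved.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka
```

With this in place, a drain that would take down a second broker while one is still recovering blocks instead of proceeding.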

The webinar also explored the future of Datadog’s stateful workload management. Edward highlighted ongoing improvements in proxy management, aiming to streamline operations and enhance performance. The team is working on integrating authentication, automatic traffic routing, and observability features into their proxy setup, further optimizing their infrastructure.

Throughout the session, the presenters emphasized the importance of Kubernetes as an abstraction layer, providing a consistent operational environment across multiple cloud providers. This consistency allows Datadog to leverage Kubernetes’ extensibility and API-driven nature to build custom solutions tailored to their needs.
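
One concrete form that extensibility takes is the CustomResourceDefinition, which lets a team add its own API objects and reconcile them with a controller. Purely as an illustration of the mechanism (the group, kind, and schema below are invented, and the talk did not describe this specific resource):

```yaml
# Hypothetical CRD illustrating Kubernetes' API extensibility;
# the group, kind, and schema are invented for this example.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: kafkaclusters.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: KafkaCluster
    plural: kafkaclusters
    singular: kafkacluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                brokers:
                  type: integer       # desired broker count
                storageClassName:
                  type: string        # remote or local, per the trade-off above
```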

The webinar concluded with a Q&A session, addressing topics such as the use of open-source operators, the challenges of multi-cloud environments, and the implications of regional versus zonal clusters. The presenters shared valuable insights into Datadog’s practices and encouraged further engagement with the audience through Datadog’s online resources and future webinars.

Overall, the Datadog webinar provided a deep dive into the company’s innovative approaches to managing stateful workloads on Kubernetes, highlighting the challenges, solutions, and future directions in this critical aspect of their infrastructure.
