Enhanced observability for AWS Trainium and AWS Inferentia with Datadog

TutoSartup excerpt from this article:
This post is co-written with Curtis Maher and Anjali Thatte from Datadog… This post walks you through Datadog’s new integration with AWS Neuron, which helps you monitor your AWS Trainium and AWS Inferentia instances by providing deep observability into resource utilization, model execution pe…

This post is co-written with Curtis Maher and Anjali Thatte from Datadog.

This post walks you through Datadog’s new integration with AWS Neuron, which helps you monitor your AWS Trainium and AWS Inferentia instances by providing deep observability into resource utilization, model execution performance, latency, and real-time infrastructure health, enabling you to optimize machine learning (ML) workloads and achieve high-performance at scale.

Neuron is the SDK used to run deep learning workloads on Trainium and Inferentia based instances. AWS AI chips, Trainium and Inferentia, enable you to build and deploy generative AI models at higher performance and lower cost. With the increasing use of large models, requiring a large number of accelerated compute instances, observability plays a critical role in ML operations, empowering you to improve performance, diagnose and fix failures, and optimize resource utilization.

Datadog, an observability and security platform, provides real-time monitoring for cloud infrastructure and ML operations. Datadog is excited to launch its Neuron integration, which pulls metrics collected by the Neuron SDK’s Neuron Monitor tool into Datadog, enabling you to track the performance of your Trainium and Inferentia based instances. By providing real-time visibility into model performance and hardware usage, Datadog helps you achieve efficient training and inference, optimized resource utilization, and the prevention of service slowdowns.

Comprehensive monitoring for Trainium and Inferentia

Datadog’s integration with the Neuron SDK automatically collects metrics and logs from Trainium and Inferentia instances and sends them to the Datadog platform. Upon enabling the integration, users will find an out-of-the-box dashboard in Datadog, making it straightforward to start monitoring quickly. You can also modify preexisting dashboards and monitors, and add news ones tailored to your specific monitoring requirements.

The Datadog dashboard offers a detailed view of your AWS AI chip (Trainium or Inferentia) performance, such as the number of instances, availability, and AWS Region. Real-time metrics give an immediate snapshot of infrastructure health, with preconfigured monitors alerting teams to critical issues like latency, resource utilization, and execution errors. The following screenshot shows an example dashboard.

For instance, when latency spikes on a specific instance, a monitor in the monitor summary section of the dashboard will turn red and trigger alerts through Datadog or other paging mechanisms (like Slack or email). High latency may indicate high user demand or inefficient data pipelines, which can slow down response times. By identifying these signals early, teams can quickly respond in real time to maintain high-quality user experiences.

Datadog’s Neuron integration enables tracking of key performance aspects, providing crucial insights for troubleshooting and optimization:

NeuronCore counters – Monitoring NeuronCore utilization helps make sure that cores are being used efficiently, helping you identify if you need to make adjustments to balance workloads or optimize performance.
Execution status – You can monitor the progress of training jobs, including completed tasks and failed runs. This data makes sure models are being trained smoothly and reliably. If failures increase, it may signal issues with data quality, model configurations, or resource limitations that need to be addressed.
Memory used – You can gain a granular view of memory usage across both the host and Neuron device, including memory allocated for tensors and model execution. This helps you understand how effectively resources are being used, and when it might be time to rebalance workloads or scale resources to prevent bottlenecks from causing disruptions during training.
Neuron runtime vCPU usage – You can keep an eye on vCPU utilization to make sure your models aren’t overburdening the infrastructure. When vCPU usage crosses a certain threshold, you will be alerted to decide whether to redistribute workloads or upgrade instance types to avoid training slowdowns.

By consolidating these metrics into one view, Datadog provides a powerful tool for maintaining efficient, high-performance Neuron workloads, helping teams identify issues in real time and optimize infrastructure as needed. Using the Neuron integration combined with Datadog’s LLM Observability capabilities, you can gain comprehensive visibility into your large language model (LLM) applications.

Get started with Datadog and Inferentia and Trainium

Datadog’s integration with Neuron provides real-time visibility into Trainium and Inferentia, helping you optimize resource utilization, troubleshoot issues, and achieve seamless performance at scale. To get started, see AWS Inferentia and AWS Trainium Monitoring.

To learn more about how Datadog integrates with Amazon ML services and Datadog LLM Observability, see Monitor Amazon Bedrock with Datadog and Monitoring Amazon SageMaker with Datadog.

If you don’t already have a Datadog account, you can sign up for a free 14-day trial today.

About the Authors

Curtis Maher is a Product Marketing Manager at Datadog, focused on the platform’s cloud and AI/ML integrations. Curtis works closely with Datadog’s product, marketing, and sales teams to coordinate product launches and help customers observe and secure their cloud infrastructure.

Anjali Thatte is a Product Manager at Datadog. She currently focuses on building technology to monitor AI infrastructure and ML tooling and helping customers gain visibility across their AI application tech stacks.

Jason Mimick is a Senior Partner Solutions Architect at AWS working closely with product, engineering, marketing, and sales teams daily.

Anuj Sharma is a Principal Solution Architect at Amazon Web Services. He specializes in application modernization with hands-on technologies such as serverless, containers, generative AI, and observability. With over 18 years of experience in application development, he currently leads co-building with containers and observability focused AWS Software Partners.

Enhanced observability for AWS Trainium and AWS Inferentia with Datadog
Author: Curtis Maher