
Enable pod-based GPU metrics in Amazon CloudWatch

In February 2022, Amazon Web Services added support for NVIDIA GPU metrics in Amazon CloudWatch, making it possible to push metrics from the Amazon CloudWatch agent to Amazon CloudWatch and monitor your code for optimal GPU utilization. Since then, this feature has been integrated into many of our managed Amazon Machine Images (AMIs), such as the Deep Learning AMI and the AWS ParallelCluster AMI. To obtain instance-level metrics of GPU utilization, you can use Packer or Amazon EC2 Image Builder to bootstrap your own custom AMI and use it in various managed service offerings like AWS Batch, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS). However, for many container-based service offerings and workloads, it's ideal to capture utilization metrics at the container, pod, or namespace level.
This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.
Solution overview
To demonstrate container-based GPU metrics, we create an EKS cluster with g5.2xlarge instances; however, this will work with any supported NVIDIA accelerated instance family.
We deploy the NVIDIA GPU operator to enable use of GPU resources and the NVIDIA DCGM Exporter to enable GPU metrics collection. Then we explore two architectures. The first one connects the metrics from NVIDIA DCGM Exporter to CloudWatch via a CloudWatch agent, as shown in the following diagram.
The second architecture (see the following diagram) connects the metrics from DCGM Exporter to Prometheus, then we use a Grafana dashboard to visualize those metrics.
Prerequisites
To simplify reproducing the entire stack from this post, we use a container that has all the required tooling (AWS CLI, eksctl, helm, and so on) already installed. In order to clone the container project from GitHub, you will need git. To build and run the container, you will need Docker. To deploy the architecture, you will need AWS credentials. To enable access to Kubernetes services using port-forwarding, you will also need kubectl.
These prerequisites can be installed on your local machine, an EC2 instance with NICE DCV, or AWS Cloud9. In this post, we use a c5.2xlarge Cloud9 instance with a 40 GB local storage volume. When using Cloud9, disable AWS managed temporary credentials by navigating to Cloud9->Preferences->AWS Settings, as shown in the following screenshot.
Build and run the aws-do-eks container
Open a terminal shell in your preferred environment and run the following commands:
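The exact commands may vary between versions of the project; the following is a minimal sketch assuming the standard do-framework scripts shipped with the aws-do-eks repository:

```bash
git clone https://github.com/aws-samples/aws-do-eks
cd aws-do-eks
./build.sh   # build the aws-do-eks container image
./run.sh     # start the container
./exec.sh    # open a shell inside the running container
```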
The result is as follows:
You now have a shell in a container environment that has all the tools needed to complete the tasks below. We will refer to it as “aws-do-eks shell”. You will be running the commands in the following sections in this shell, unless specifically instructed otherwise.
Create an EKS cluster with a node group
This group includes a GPU instance family of your choice; in this example, we use the g5.2xlarge instance type.
The aws-do-eks project comes with a collection of cluster configurations. You can set your desired cluster configuration with a single configuration change.
- In the container shell, run ./env-config.sh and then set CONF=conf/eksctl/yaml/eks-gpu-g5.yaml
- To verify the cluster configuration, run ./eks-config.sh
You should see the following cluster manifest:
- To create the cluster, run the following command in the container
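Assuming the cluster management scripts included in the aws-do-eks project (the script name below may differ between project versions):

```bash
# Creates the EKS cluster defined by the CONF setting (eks-gpu-g5.yaml in this example)
./eks-create.sh
```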
The output is as follows:
- To verify that your cluster is created successfully, run the following command
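One way to do this is to list the nodes together with their instance types (the label column is optional):

```bash
kubectl get nodes -L node.kubernetes.io/instance-type
```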
The output is similar to the following:
In this example, we have one m5.xlarge and one g5.2xlarge instance in our cluster; therefore, we see two nodes listed in the preceding output.
During the cluster creation process, the NVIDIA device plugin is installed automatically. Because we will use the NVIDIA GPU Operator instead, you need to remove this plugin after the cluster is created.
- Delete the plugin with the following command
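eksctl typically deploys the plugin as a DaemonSet in the kube-system namespace; the DaemonSet name below is the common default, so confirm it before deleting:

```bash
# List DaemonSets to confirm the plugin name, then delete it
kubectl -n kube-system get daemonset
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset
```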
We get the following output:
Install the NVIDIA Helm repo
Install the NVIDIA Helm repo with the following command:
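This follows the standard NVIDIA GPU Operator installation instructions:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```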
Deploy the DCGM exporter with the NVIDIA GPU Operator
To deploy the DCGM exporter, complete the following steps:
- Prepare the DCGM exporter GPU metrics configuration
You have the option to edit the dcgm-metrics.csv file. You can add or remove any metrics as needed.
- Create the gpu-operator namespace and DCGM exporter ConfigMap
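A sketch of this step; the ConfigMap name metrics-config is an assumption and must match the name referenced when installing the GPU Operator in the next step:

```bash
kubectl create namespace gpu-operator
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
```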
The output is as follows:
- Apply the GPU operator to the EKS cluster
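The following sketch follows NVIDIA's documented pattern for installing the GPU Operator with a custom DCGM metrics file; the ConfigMap name and csv path are assumptions that match the previous step:

```bash
helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator \
  --set dcgmExporter.config.name=metrics-config \
  --set "dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS" \
  --set "dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv"
```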
The output is as follows:
- Confirm that the DCGM exporter pod is running
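For example:

```bash
kubectl -n gpu-operator get pods | grep dcgm-exporter
```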
The output is as follows:
If you inspect the logs, you should see the “Starting webserver” message:
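For example, the following looks up the exporter pod name and tails its logs:

```bash
kubectl -n gpu-operator logs -f \
  $(kubectl -n gpu-operator get pods | grep dcgm-exporter | awk '{print $1}')
```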
The output is as follows:
NVIDIA DCGM Exporter exposes a Prometheus metrics endpoint, which can be ingested by the CloudWatch agent. To see the endpoint, use the following command:
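For example, listing the services in the gpu-operator namespace:

```bash
kubectl -n gpu-operator get services | grep dcgm
```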
We get the following output:
- To generate some GPU utilization, we deploy a pod that runs the gpu-burn binary
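A hypothetical sketch of this step; the manifest name below is illustrative only, and the actual gpu-burn deployment manifest ships with the aws-do-eks project:

```bash
# Illustrative manifest name; substitute the gpu-burn manifest from the project
kubectl apply -f gpu-burn-deployment.yaml
kubectl get pods | grep gpu-burn
```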
The output is as follows:
This deployment uses a single GPU to produce a continuous pattern of 100% utilization for 20 seconds followed by 0% utilization for 20 seconds.
- To make sure the endpoint works, you can run a temporary container that uses curl to read the content of http://nvidia-dcgm-exporter:9400/metrics
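A minimal sketch using a short-lived pod in the same namespace (the pod name curl-test is arbitrary):

```bash
kubectl -n gpu-operator run -it --rm curl-test --restart=Never \
  --image=curlimages/curl --command -- curl -s http://nvidia-dcgm-exporter:9400/metrics
```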
We get the following output:
Configure and deploy the CloudWatch agent
To configure and deploy the CloudWatch agent, complete the following steps:
- Download the YAML file and edit it
The file contains a cwagent configmap and a prometheus configmap. For this post, we edit both.
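The manifest is published in the amazon-cloudwatch-container-insights samples repository; verify the path below against the current CloudWatch documentation before use:

```bash
curl -O https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-eks.yaml
```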
- Edit the prometheus-eks.yaml file
Open the prometheus-eks.yaml file in your favorite editor and replace the cwagentconfig.json section with the following content:
- In the prometheus config section, append the following job definition for the DCGM exporter
- Save the file and apply the cwagent-dcgm configuration to your cluster
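For example:

```bash
kubectl apply -f prometheus-eks.yaml
```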
We get the following output:
- Confirm that the CloudWatch agent pod is running
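The manifest deploys the agent into the amazon-cloudwatch namespace, so a check along these lines should show a running pod:

```bash
kubectl -n amazon-cloudwatch get pods
```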
We get the following output:
Visualize metrics on the CloudWatch console
To visualize the metrics in CloudWatch, complete the following steps:
- On the CloudWatch console, under Metrics in the navigation pane, choose All metrics
- In the Custom namespaces section, choose the new entry for ContainerInsights/Prometheus
For more information about the ContainerInsights/Prometheus namespace, refer to Scraping additional Prometheus sources and importing those metrics.
- Drill down to the metric names and choose DCGM_FI_DEV_GPU_UTIL
- On the Graphed metrics tab, set Period to 5 seconds
- Set the refresh interval to 10 seconds
You will see the metrics collected from the DCGM exporter, which visualize the gpu-burn pattern switching on and off every 20 seconds.
On the Browse tab, you can see the data, including the pod name for each metric.
The EKS API metadata has been combined with the DCGM metrics data, resulting in pod-based GPU metrics.
This concludes the first approach of exporting DCGM metrics to CloudWatch via the CloudWatch agent.
In the next section, we configure the second architecture, which exports the DCGM metrics to Prometheus, and we visualize them with Grafana.
Use Prometheus and Grafana to visualize GPU metrics from DCGM
Complete the following steps:
- Add the Prometheus community helm chart
This chart deploys both Prometheus and Grafana. We need to make some edits to the chart before running the install command.
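Adding the repository follows the standard prometheus-community instructions:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```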
- Save the chart configuration values to a file in /tmp
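For example:

```bash
helm show values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values
```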
- Edit the chart configuration file
Edit the saved file (/tmp/kube-prometheus-stack.values) and set the following option by looking for the setting name and setting its value:
- Add the following ConfigMap to the additionalScrapeConfigs section
- Deploy the Prometheus stack with the updated values
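A sketch of the install command, assuming the chart is deployed into a prometheus namespace with a generated release name:

```bash
helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values /tmp/kube-prometheus-stack.values
```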
We get the following output:
- Confirm that the Prometheus pods are running
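For example:

```bash
kubectl get pods -n prometheus
```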
We get the following output:
Prometheus and Grafana pods are in the Running state.
Next, we validate that DCGM metrics are flowing into Prometheus.
- Port-forward the Prometheus UI
There are different ways to expose the Prometheus UI running in EKS to requests originating outside of the cluster. We will use kubectl port-forwarding. So far, we have been executing commands inside the aws-do-eks container. To access the Prometheus service running in the cluster, we create a tunnel from the host where the aws-do-eks container is running, by executing the following command outside of the container, in a new terminal shell on the host. We will refer to this as the “host shell”.
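A sketch of the port-forward, run in the host shell; the Prometheus service name is generated by the Helm release, so list the services first and substitute the actual name:

```bash
kubectl -n prometheus get svc
kubectl -n prometheus port-forward svc/<prometheus-service-name> 8080:9090
```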
- Open the Prometheus UI
- If you are using Cloud9, navigate to Preview->Preview Running Application to open the Prometheus UI in a tab inside the Cloud9 IDE, then choose the icon in the upper-right corner of the tab to pop it out into a new window.
- If you are on your local host or connected to an EC2 instance via remote desktop, open a browser and visit the URL http://localhost:8080.
- Enter DCGM to see the DCGM metrics that are flowing into Prometheus
- Select DCGM_FI_DEV_GPU_UTIL, choose Execute, and then navigate to the Graph tab to see the expected GPU utilization pattern
- Stop the Prometheus port-forwarding process
Run the following command line in your host shell:
Now we can visualize the DCGM metrics via Grafana Dashboard.
- Retrieve the password to log in to the Grafana UI
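The kube-prometheus-stack chart stores the Grafana admin password in a secret whose name contains grafana (key admin-password); a sketch of retrieving it:

```bash
kubectl -n prometheus get secrets | grep grafana
kubectl -n prometheus get secret <grafana-secret-name> \
  -o jsonpath='{.data.admin-password}' | base64 --decode; echo
```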
- Port-forward the Grafana service
Run the following command line in your host shell:
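A sketch, substituting the actual Grafana service name from the release (the chart exposes Grafana on service port 80 by default):

```bash
kubectl -n prometheus port-forward svc/<grafana-service-name> 8080:80
```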
- Log in to the Grafana UI
Access the Grafana UI login screen the same way you accessed the Prometheus UI earlier. If you are using Cloud9, select Preview->Preview Running Application, then pop the tab out into a new window. If you are using your local host or an EC2 instance with remote desktop, visit the URL http://localhost:8080. Log in with the user name admin and the password you retrieved earlier.
- In the navigation pane, choose Dashboards
- Choose New and Import
We are going to import the default DCGM Grafana dashboard described in NVIDIA DCGM Exporter Dashboard.
- In the field import via grafana.com, enter 12239 and choose Load
- Choose Prometheus as the data source
- Choose Import
You will see a dashboard similar to the one in the following screenshot.
To demonstrate that these metrics are pod-based, we are going to modify the GPU Utilization pane in this dashboard.
- Choose the pane and the options menu (three dots)
- Expand the Options section and edit the Legend field
- Replace the value there with Pod {{pod}}, then choose Save
The legend now shows the gpu-burn pod name associated with the displayed GPU utilization.
- Stop port-forwarding the Grafana UI service
Run the following in your host shell:
In this post, we demonstrated using open-source Prometheus and Grafana deployed to the EKS cluster. If desired, this deployment can be substituted with Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Clean up
To clean up the resources you created, run the following script from the aws-do-eks container shell:
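A sketch assuming the cluster deletion script shipped with the aws-do-eks project:

```bash
# Deletes the EKS cluster created earlier
./eks-delete.sh
```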
Conclusion
In this post, we utilized NVIDIA DCGM Exporter to collect GPU metrics and visualize them with either CloudWatch or Prometheus and Grafana. We invite you to use the architectures demonstrated here to enable GPU utilization monitoring with NVIDIA DCGM in your own AWS environment.
Additional resources
- Amazon EC2 GPU instances
- NVIDIA DCGM: Manage and Monitor GPUs in Cluster Environments
- kube-prometheus-stack GitHub repo
About the authors
Amr Ragab is a former Principal Solutions Architect, EC2 Accelerated Computing at AWS. He is devoted to helping customers run computational workloads at scale. In his spare time, he likes traveling and finding new ways to integrate technology into daily life.
Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He’s a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open-source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.
Keita Watanabe is a Senior Solutions Architect, Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry's best cloud-based self-managed machine learning solutions. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the e-commerce industry. Keita holds a Ph.D. in Science from the University of Tokyo.