AI-powered event response for Amazon EKS

Cloud environments with dozens of microservices are difficult to manage, and modern DevOps teams must balance rapid deployments with operational stability while monitoring tools surface thousands of daily signals.

AWS DevOps Agent is a fully managed autonomous AI agent that resolves and proactively prevents incidents, continuously improving reliability and performance of applications in AWS, multicloud, and hybrid environments. It brings Kubernetes-native intelligence to incident response: it understands how Pods relate to Deployments, which Services route traffic, what ConfigMaps provide configuration, and how these components interact across your environment. Rather than seeing isolated infrastructure issues, the agent comprehends the architectural relationships that matter most for fast, accurate root cause analysis. In this post, you’ll learn how AWS DevOps Agent integrates with your existing observability stack to provide intelligent, automated responses to system events.

Figure 1: Example target architecture showing how Amazon EKS workloads are deployed and how AWS DevOps Agent interacts with managed services such as Amazon CloudWatch

How AWS DevOps Agent discovers Kubernetes resources

Built on Amazon Bedrock, the agent can analyze complex operational scenarios and correlate data from multiple sources. AWS DevOps Agent combines natural language processing (NLP) of logs and error messages with root cause analysis, powered by machine learning (ML), to automatically identify issues across your infrastructure.

Telemetry-based discovery

The agent analyzes OpenTelemetry data to infer runtime relationships:

  • Service Mesh Analysis: Examines network traffic patterns between pods to identify service-to-service communication
  • Trace Correlation: Uses distributed traces to map request flows across microservices
  • Metric Attribution: Associates performance metrics with specific pods, containers, and nodes

Metadata enrichment

The agent enriches discovered resources with contextual information:

  • Labels and Annotations: Extracts application metadata, ownership information, and deployment details
  • Resource Specifications: Captures CPU/memory requests and limits, health check configurations, and environment variables
  • Network Topology: Maps pod IPs, service cluster IPs, ingress rules, and network policies

Discovery process

When you start an investigation, the agent executes the following discovery workflow:

  1. Initial Scan: Queries the Kubernetes API for all resources in relevant namespaces
  2. Dependency Analysis: Builds a dependency graph showing how resources relate to each other
  3. Telemetry Correlation: Matches discovered resources with their corresponding metrics, logs, and traces
  4. Context Building: Aggregates resource state, recent events, and performance data into a unified view
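
The dependency-analysis step can be illustrated with a small sketch. This is not the agent's implementation. It assumes the resources have already been fetched from the Kubernetes API as plain dictionaries, ignores namespaces for brevity, and links resources using two standard Kubernetes conventions: `metadata.ownerReferences` and Service label selectors. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def resource_key(resource):
    """Return a 'Kind/name' identifier for a resource dictionary."""
    return f"{resource['kind']}/{resource['metadata']['name']}"


def build_dependency_graph(resources):
    """Map each resource to the set of resources it points at.

    Edges come from ownerReferences (child -> owner) and from Service
    selectors (Service -> matching Pods). Illustrative only.
    """
    graph = defaultdict(set)
    pods = [r for r in resources if r["kind"] == "Pod"]

    for res in resources:
        key = resource_key(res)
        graph[key]  # touch the key so every resource appears as a node
        for owner in res["metadata"].get("ownerReferences", []):
            graph[key].add(f"{owner['kind']}/{owner['name']}")

        if res["kind"] == "Service":
            selector = res.get("spec", {}).get("selector") or {}
            for pod in pods:
                labels = pod["metadata"].get("labels", {})
                if selector and all(labels.get(k) == v for k, v in selector.items()):
                    graph[key].add(resource_key(pod))
    return dict(graph)


# Hypothetical sample data shaped like Kubernetes API objects.
SAMPLE_RESOURCES = [
    {"kind": "Deployment", "metadata": {"name": "go-otel-sample-app"}},
    {"kind": "ReplicaSet", "metadata": {
        "name": "go-otel-sample-app-abc",
        "ownerReferences": [{"kind": "Deployment", "name": "go-otel-sample-app"}]}},
    {"kind": "Pod", "metadata": {
        "name": "go-otel-sample-app-abc-1",
        "labels": {"app": "go-otel-sample-app"},
        "ownerReferences": [{"kind": "ReplicaSet", "name": "go-otel-sample-app-abc"}]}},
    {"kind": "Service", "metadata": {"name": "go-otel-sample-app"},
     "spec": {"selector": {"app": "go-otel-sample-app"}}},
]

if __name__ == "__main__":
    for node, deps in sorted(build_dependency_graph(SAMPLE_RESOURCES).items()):
        print(node, "->", sorted(deps))
```

Walking these edges from a failing Pod leads to its ReplicaSet and Deployment, and walking from a Service shows which Pods receive its traffic.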

Implementation details

Prerequisites

Before implementing this solution, verify that you have the following:

Development environment

The setup and the applications mentioned in the following sections are available in an AWS samples repo with deployable scripts and setup instructions.

This section walks you through deploying and configuring the complete AWS DevOps Agent demo environment.

Step 1: Deploy AWS DevOps Agent infrastructure

Begin by deploying the AWS DevOps Agent using the AWS CDK. The infrastructure includes the Agent Space configuration, IAM roles and policies, and integration with your EKS cluster.

Screenshot of the AWS DevOps Agent web interface showing a form to create a new Agent Space named "OTEL-DevOpsAgent-Demo-v1". The form includes fields for agent space name and description, radio button options for configuring IAM roles for AWS resource access and web app access, auto-generated role names, and Cancel and Create action buttons.

Figure 2: Configuring the Agent Space in the AWS Console

Configure the Agent Space in the AWS Console by navigating to the AWS DevOps Agent service. You will then create a new Agent Space for your EKS cluster. Finally, you will set up data source integrations including Prometheus workspace endpoints, Amazon CloudWatch Log groups, and X-Ray service configuration.

Screenshot of the AWS DevOps Agent Capabilities configuration tab for the OTEL-DevOpsAgent-Demo-v1 agent space. The interface displays a Cloud capability section with a primary AWS account (ID: 123456789012) showing a Valid status, a secondary sources section with no entries, and options to add additional source accounts.

Figure 3: Validating connectivity to data sources and access to the AWS account

Validate the deployment by accessing the DevOps Agent web interface, verifying connectivity to data sources, and confirming that the agent can discover your EKS cluster resources.

Screenshot of the AWS DevOps Agent Incident Response Dashboard for the OTEL-DevOpsAgent-Demo-v1 project. The dashboard includes a text area to describe a new investigation, quick-start template buttons for common scenarios (Latest alarm, High CPU usage, Error rate spike), a bar chart showing daily investigation frequency from January 29 to February 4 with one investigation on February 4, and an empty investigations table filtered by Pending Start status.

Figure 4: Checking the Incident Response dashboard for ongoing investigations

Step 2: Set up port forwarding for applications

Configure port forwarding so that the traffic generator can reach your deployed applications. Set up port forwarding for all sample applications using “kubectl” commands. Each application is forwarded to a different local port to simulate a realistic microservices environment:

# Sample Metrics App (port 8000)
kubectl port-forward svc/sample-metrics-app 8000:8000 -n default &
# Python OTEL App (port 8080)
kubectl port-forward svc/otel-sample-app 8080:8000 -n default &
# Go OTEL App (port 8090)
kubectl port-forward svc/go-otel-sample-app 8090:8080 -n default &
# Java OTEL App (port 8081)
kubectl port-forward svc/java-otel-sample-app 8081:8080 -n default &
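
Before starting the traffic generator, it can help to confirm that each forward is accepting connections. The following is a minimal sketch, not part of the samples repo: it only attempts a TCP connection to each local port listed in the commands above, and the helper names are hypothetical.

```python
import socket

# Local ports taken from the port-forward commands above.
FORWARDED_PORTS = {
    "sample-metrics-app": 8000,
    "otel-sample-app": 8080,
    "go-otel-sample-app": 8090,
    "java-otel-sample-app": 8081,
}


def is_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_forwards(ports=FORWARDED_PORTS):
    """Return a {service_name: reachable} mapping for each forwarded port."""
    return {name: is_port_open(port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_forwards().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful TCP connection only shows that the forward is listening. It does not prove that the application behind it is healthy.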

Step 3: Install and configure Traffic Generator

The traffic generator is a Python-based tool that creates realistic load patterns and error scenarios for testing AWS DevOps Agent capabilities. Install the required Python dependencies and make the traffic generator executable:

# Install required Python packages
pip install requests
# Make the script executable
chmod +x traffic-generator.py

The traffic generator supports multiple configuration options for creating different testing scenarios.
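
The walkthrough uses four of these options: `--app`, `--duration`, `--rps`, and `--error-rate`. The sketch below shows how such flags could be parsed and validated with `argparse`. It is an assumption-based reconstruction, not the script from the samples repo, and the defaults and validation rules are illustrative only.

```python
import argparse


def build_parser():
    """Build a parser for the four flags used in this walkthrough.

    Defaults below are assumptions for illustration, not the script's own.
    """
    parser = argparse.ArgumentParser(description="Traffic generator options (sketch)")
    parser.add_argument("--app", default="all",
                        help="target application, for example 'all' or 'java-otel'")
    parser.add_argument("--duration", type=int, default=60,
                        help="test length in seconds")
    parser.add_argument("--rps", type=float, default=10,
                        help="requests per second")
    parser.add_argument("--error-rate", type=float, default=0.0,
                        help="fraction of requests that should fail (0.0 to 1.0)")
    return parser


def parse_options(argv=None):
    """Parse and validate options; argv=None reads from sys.argv."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0.0 <= args.error_rate <= 1.0:
        parser.error("--error-rate must be between 0.0 and 1.0")
    if args.duration <= 0 or args.rps <= 0:
        parser.error("--duration and --rps must be positive")
    return args


# Parse the baseline command line used later in this walkthrough.
baseline_args = parse_options(
    ["--app", "all", "--duration", "900", "--rps", "10", "--error-rate", "0.05"]
)
print(baseline_args.app, baseline_args.duration, baseline_args.rps, baseline_args.error_rate)
```

Note that `--error-rate` is a fraction, so `0.05` means 5% of requests.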

Step 4: Generate baseline traffic

Create baseline operational data by generating steady, normal traffic across all applications. This gives AWS DevOps Agent normal operational patterns to learn from:

# Generate normal baseline traffic
python traffic-generator.py --app all --duration 900 --rps 10 --error-rate 0.05

Terminal output showing the initialization of an EKS Platform Traffic Generator script targeting four OTEL applications — Sample Metrics App, Python OTEL App, Go OTEL App, and Java OTEL App — each configured for 900 seconds at 15 requests per second with a 10% error rate. All four services are confirmed available at their respective localhost ports.

This command generates traffic to all applications for 15 minutes at 10 requests per second with a 5% error rate, simulating normal operational conditions. Monitor the baseline traffic generation by checking application metrics and HPA scaling behavior:

# View HPA status (should show scaling based on metrics)
kubectl get hpa -A
# Check custom metrics availability
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
# View specific metric values
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sample_app_requests_rate" | jq .
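
The custom metrics API returns a `MetricValueList` JSON document whose values are Kubernetes quantity strings such as `1500m` (1.5). The sketch below shows one way to turn that output into per-pod numbers. The sample response is hypothetical (pod name, timestamp, and value are invented), and the quantity parser handles only a simplified subset of decimal suffixes.

```python
import json

# Decimal SI suffixes used by Kubernetes quantities (simplified subset;
# binary suffixes such as Ki/Mi are not handled here).
_SUFFIXES = {"n": 1e-9, "u": 1e-6, "m": 1e-3, "k": 1e3, "M": 1e6, "G": 1e9}


def parse_quantity(text):
    """Convert a Kubernetes quantity string such as '1500m' to a float."""
    text = text.strip()
    if text and text[-1] in _SUFFIXES:
        return float(text[:-1]) * _SUFFIXES[text[-1]]
    return float(text)


def pod_metric_values(metric_value_list_json):
    """Return {pod_name: value} from a MetricValueList JSON document."""
    doc = json.loads(metric_value_list_json)
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in doc.get("items", [])
    }


# Hypothetical response; real pod names, timestamps, and values will differ.
SAMPLE_RESPONSE = json.dumps({
    "kind": "MetricValueList",
    "apiVersion": "custom.metrics.k8s.io/v1beta1",
    "items": [{
        "describedObject": {"kind": "Pod", "namespace": "default",
                            "name": "sample-metrics-app-0"},
        "metricName": "sample_app_requests_rate",
        "timestamp": "2026-02-04T21:30:00Z",
        "value": "1500m",
    }],
})

if __name__ == "__main__":
    for pod, value in pod_metric_values(SAMPLE_RESPONSE).items():
        print(f"{pod}: {value:.3f}")
```

An empty `items` list usually means the metrics adapter has no data for that metric yet, which is worth checking before expecting the HPA to scale.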

System log output from a load testing session displaying real-time performance metrics including elapsed time, total requests, success and failure counts, success rate percentage, and requests per second (RPS). The log spans timestamps from 22:38:49 to 22:40:41, showing a consistent RPS of approximately 14.8–14.9 and alternating cycles of 0%, ~90%, and 100% success rates across approximately 9,000 to 10,650 total requests.

Step 5: Configure AWS DevOps Agent investigation

Set up AWS DevOps Agent to monitor your EKS cluster and prepare for event investigation workflows.

Screenshot of the AWS DevOps Agent Incident Response interface showing the investigation timeline for the "Demo CPU Spike" incident created on 2026-02-04. The timeline displays a user request describing a 97.5% CPU spike on an EKS workload, followed by two assistant responses tracking investigation progress. The investigation was completed at 20:02:40 on the same day.

Figure 5: Overall timeline of an investigation in AWS DevOps Agent

Access the AWS DevOps Agent through the AWS Console:

  1. Navigate to AWS DevOps Agent in the AWS Console.
  2. Select your configured Agent Space.
  3. Select Operator access to open the DevOps Agent web application.
  4. Configure data source connections to verify proper integration.

Figure 6: Validating access to observability data from the cluster

Verify that AWS DevOps Agent can access observability data from your cluster, including metrics from Amazon Managed Service for Prometheus, logs from Amazon CloudWatch Logs, traces from AWS X-Ray, and topology information from your EKS cluster. AWS DevOps Agent can also pull service map information for your Kubernetes resources.

Testing scenarios and use cases

This section demonstrates different testing scenarios that showcase AWS DevOps Agent capabilities in various operational situations.

Scenario 1: Normal load testing

This scenario establishes baseline operational patterns that AWS DevOps Agent can learn from and use for anomaly detection. Generate steady traffic to establish baseline metrics:

python traffic-generator.py --app all --duration 900 --rps 10 --error-rate 0.05

Command-line output report summarizing traffic generation test results for three applications. The Sample Metrics App achieved a 100% success rate across 13,365 requests, while both the Go OTEL App and Java OTEL App recorded 0% success rates with all 13,455 requests failing. The overall success rate was 90.49% at an average of 14.92 requests per second. A green checkmark confirms traffic generation completed for all applications, followed by next steps for monitoring with Prometheus, Grafana, and AWS DevOps Agent.

What this test does: The command runs a 15-minute (900-second) steady traffic test across all applications at 10 requests per second, with a 5% simulated error rate. This low, consistent load represents typical production traffic and gives the agent enough signal to establish a reliable operational baseline.

Screenshot of the AWS DevOps Agent Root Cause analysis tab for the "Demo CPU Spike" incident. The investigation, completed on 2026-02-04 at 20:02:40, identifies one key finding: the HorizontalPodAutoscaler (HPA) for the go-otel-sample-app failed to retrieve the custom metric "go_app_requests_rate" from the custom metrics API 472 times between 16:19:44Z and 18:17:43Z, preventing automatic workload scaling during the incident.

Figure 7: Investigation and root cause found in one of the clusters

What you should observe: During this scenario, the AWS DevOps Agent learns normal operational baselines. In the agent’s investigation dashboard, you will see the following being captured and recorded:

  • Typical request patterns and response times: The agent records average latency and throughput across all services, establishing what “healthy” looks like for your environment.
  • Normal error rates and distribution: With a 5% error rate, the agent learns the expected noise floor for errors, so it can distinguish genuine incidents from routine fluctuations.
  • Resource utilization patterns: CPU, memory, and network usage are tracked per pod and node, giving the agent a reference point for what normal resource consumption looks like under standard load.
  • Service dependency relationships: The agent maps how services communicate with each other, identifying upstream and downstream dependencies that will be critical for root cause analysis in future incidents.

After the test completes, you should see a stable metrics summary in the agent dashboard showing consistent throughput, low error variance, and steady resource utilization — confirming that a reliable baseline has been captured.

Expected outcomes:

During this scenario, the AWS DevOps Agent learns normal operational baselines. These baselines include typical request patterns and response times, normal error rates and distribution, resource utilization patterns, and service dependency relationships.

Scenario 2: Simulated production event

This scenario demonstrates the AWS DevOps Agent’s ability to investigate and analyze events with elevated error rates and performance degradation:

python traffic-generator.py --app java-otel --duration 600 --rps 30 --error-rate 0.25

Figure 8: Details of the nodes and telemetry data relevant to the investigation

What this test does: The command targets the java-otel application specifically, running a 10-minute (600-second) high-load test at 30 requests per second — three times the baseline — with a 25% error rate. This simulates a degraded service experiencing both a traffic surge and a significant increase in failures.
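
To make the difference between the two scenarios concrete, the arithmetic can be worked out directly from the command-line values. This sketch assumes the generator sustains exactly the requested rate, which a real run may not, and `expected_load` is a hypothetical helper.

```python
def expected_load(duration_s, rps, error_rate):
    """Return (total_requests, expected_errors) for one application."""
    total = round(duration_s * rps)
    return total, round(total * error_rate)


# Scenario 1 (baseline, per application) and Scenario 2 (java-otel only).
baseline_total, baseline_errors = expected_load(900, 10, 0.05)
incident_total, incident_errors = expected_load(600, 30, 0.25)

print(baseline_total, baseline_errors)   # 9000 450
print(incident_total, incident_errors)   # 18000 4500

# Errors per second: 0.5 at baseline versus 7.5 during the simulated event.
print(baseline_errors / 900, incident_errors / 600)  # 0.5 7.5
```

Tripling the request rate while raising the error rate fivefold multiplies the absolute number of errors per second by 15, which is the kind of deviation the scenario is designed to produce.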

Figure 9: Review and recommended next steps after root cause analysis

What you should observe: Once the test begins, the AWS DevOps Agent detects the deviation from the established baseline and initiates an investigation. In the agent’s incident response view, you will see the following outcomes:

  • Affected application identification: The agent pinpoints java-otel-app as the impacted service, distinguishing it from other applications running normally in the cluster.
  • Error pattern analysis: The agent breaks down the 25% error rate into specific failure modes (for example, HTTP 500 errors, timeout spikes, or connection refusals), helping you understand not just that errors are occurring, but why and where.
  • Resource utilization correlation: The agent correlates CPU and memory spikes on the affected pods with the observed performance degradation, showing a clear relationship between resource exhaustion and increased error rates.
  • Root cause identification with confidence scoring: The agent presents a ranked list of potential root causes, each with a confidence score, so you can prioritize your investigation. For example, it may identify a memory leak or thread pool exhaustion as the most likely cause with high confidence.

Advanced analysis capabilities: Beyond the immediate incident, the agent performs deeper analysis that you can explore in the investigation timeline view:

  • Cross-service impact correlation: The agent identifies whether the degradation in java-otel-app has cascading effects on dependent services, showing you the full blast radius of the incident.
  • Timeline reconstruction: The agent reconstructs the sequence of events leading up to and during the incident, helping you understand how the situation evolved over time.
  • Dependency mapping: Upstream and downstream service dependencies are visualized, making it clear which services are affected directly and which are at risk.
  • Prioritized remediation recommendations: The agent provides actionable remediation steps ranked by business impact, so your team can address the most critical issues first. Recommendations may include scaling the affected deployment, adjusting resource limits, or rolling back a recent configuration change.
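
The idea of a blast radius can be sketched as a graph traversal: starting from the failing service, follow the call graph backwards to find every service that depends on it directly or transitively. This is an illustration of the concept rather than the agent's algorithm, and the call graph below is hypothetical (`gateway`, `orders`, and `reports` are invented callers).

```python
from collections import deque


def blast_radius(calls, failing_service):
    """Return services that directly or transitively call failing_service.

    `calls` maps each service to the services it calls (caller -> callees).
    """
    # Invert the call graph so we can walk from a callee to its callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected, queue = set(), deque([failing_service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected and caller != failing_service:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical call graph for illustration only.
CALLS = {
    "gateway": ["orders", "reports"],
    "orders": ["java-otel-sample-app"],
    "reports": ["go-otel-sample-app"],
    "java-otel-sample-app": [],
    "go-otel-sample-app": [],
}

if __name__ == "__main__":
    print(sorted(blast_radius(CALLS, "java-otel-sample-app")))
```

In this invented graph, `orders` and `gateway` are at risk when `java-otel-sample-app` degrades, while `reports` is not.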

After the test completes, you should see a full incident report in the agent dashboard summarizing the root cause, affected components, timeline, and recommended next steps — giving your team everything needed to resolve the issue and prevent recurrence.

Expected outcomes:

The agent identifies which application is affected (java-otel-app), analyzes error patterns and rates for specific failure modes, correlates resource utilization with performance degradation, and provides potential root causes with confidence scoring.

Figure 10: Mitigation plan suggested by AWS DevOps Agent

Advanced analysis capabilities:

The agent performs analysis including cross-service impact correlation, timeline reconstruction of event progression, dependency mapping to identify upstream/downstream effects, and prioritized remediation recommendations based on business impact.

AWS DevOps Agent investigation workflow

This section details how to use AWS DevOps Agent for event investigation and analysis.

Screenshot of the "Start an investigation" modal dialog in the AWS DevOps Agent Incident Response Dashboard. The dialog displays pre-filled investigation details describing an increase in application error rates, an investigation starting point noting an error spike for a workload tagged DemosFor: OTEL-DevOpsAgent-AiforOperations within the last 20 minutes, and an incident timestamp of 2026-02-04T21:30:00. Cancel and Start investigating buttons are shown at the bottom.

Figure 11: Starting an investigation from the AWS Console in the AWS DevOps Agent service

Starting an investigation

Access the AWS DevOps Agent web interface and initiate a new investigation:

  1. Investigation Trigger: Choose from predefined scenarios like “High CPU usage,” “Error rate spike,” or “Performance degradation.”
  2. Time Range Selection: Select the time period when you generated traffic or observed issues.
  3. Scope Definition: Provide the AWS Account ID, AWS Region (us-east-1), and specific cluster or application context.
  4. Data Source Configuration: Make sure all observability data sources are properly connected.
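
For the time range, a small helper can derive a window from when a traffic run started and how long it lasted. This is a convenience sketch and not something the agent requires. The function name and the five-minute padding are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def investigation_window(start, duration_s, padding_s=300):
    """Return (from, to) ISO 8601 UTC strings covering a traffic run.

    padding_s widens the window on both sides; 300 seconds is an arbitrary
    illustrative default, not a value required by the agent.
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    begin = start_utc - timedelta(seconds=padding_s)
    end = start_utc + timedelta(seconds=duration_s + padding_s)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return begin.strftime(fmt), end.strftime(fmt)


if __name__ == "__main__":
    # A 600-second run (Scenario 2) starting at 21:30 UTC.
    run_start = datetime(2026, 2, 4, 21, 30, tzinfo=timezone.utc)
    print(investigation_window(run_start, 600))
    # ('2026-02-04T21:25:00Z', '2026-02-04T21:45:00Z')
```

Padding the window slightly helps capture the lead-up to an event as well as its tail.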

Investigation process

AWS DevOps Agent follows a systematic investigation methodology:

Data collection phase:

Screenshot of the AWS DevOps Agent Incident Response interface showing an active investigation for the "Demo Error Spike" incident created on 2026-02-04 at 22:57:20. The investigation timeline on the left shows sequential updates including a user request, assistant responses, and a planning phase. A chat assistant panel on the right displays a welcome message and an input field for asking questions about the investigation.

Figure 12: Investigation timeline produced by AWS DevOps Agent

In this phase, the agent correlates metrics from the Amazon Managed Service for Prometheus workspace and analyzes logs from Amazon CloudWatch Logs for error patterns and anomalies. It also reviews distributed traces from AWS X-Ray for service dependencies, and examines application topology and service relationships to provide comprehensive observability during the investigation.

Analysis phase:

Screenshot of the AWS DevOps Agent Incident Response interface during the "Demo Error Spike" investigation. The investigation timeline shows an Update step and a Fetching data step. An assistant response identifies relevant AWS resources in us-east-1 including the dev-eks-automode EKS cluster, and displays a CloudWatch describe_alarm_history API call targeting the incident window from 2026-02-04T21:00:00Z to 21:57:31Z. A chat assistant panel is visible on the right.

Figure 13: Analysis shown in the AWS Console in the AWS DevOps Agent service

In this phase, the agent identifies patterns and anomalies using ML algorithms. It correlates events across multiple data sources for comprehensive understanding, applies statistical analysis to determine the significance of observed changes, and compares current behavior against established baselines for accurate detection and assessment of system behavior.
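
As one example of what "statistical significance against a baseline" can mean, the sketch below applies a standard two-proportion z-test to error counts. The source does not describe which statistical methods the agent actually uses, so treat this as a generic illustration. The threshold of 3.0 is an assumption, and the sample counts are the idealized figures implied by the two traffic-generator commands.

```python
import math


def two_proportion_z(base_errors, base_total, cur_errors, cur_total):
    """Return the z statistic comparing the current error rate to baseline."""
    p_base = base_errors / base_total
    p_cur = cur_errors / cur_total
    pooled = (base_errors + cur_errors) / (base_total + cur_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if std_err == 0:
        return 0.0
    return (p_cur - p_base) / std_err


def is_significant(z, threshold=3.0):
    """Flag an increase in error rate when z exceeds the (assumed) threshold."""
    return z > threshold


if __name__ == "__main__":
    # Baseline: 5% of 9,000 requests. Event: 25% of 18,000 requests.
    z = two_proportion_z(450, 9000, 4500, 18000)
    print(f"z = {z:.1f}, significant = {is_significant(z)}")
```

With these counts the statistic is far above any common threshold, so the jump from 5% to 25% errors would be flagged rather than treated as routine fluctuation.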

Root cause identification:

Figure 14: Root cause and investigation summary in the AWS Console in the AWS DevOps Agent service

In this phase, the agent provides systematic root cause analysis with confidence scoring, identifies contributing factors and potential trigger events, maps the event timeline to correlated evidence from multiple sources, and suggests the most likely causes based on data correlation and pattern analysis.

Mitigation strategy:

Figure 15: Mitigation summary recommended by AWS DevOps Agent

In this phase, the agent recommends immediate mitigation actions to resolve current issues. It also suggests long-term prevention strategies to avoid recurrence, provides runbook-style guidance for event response teams, and integrates with existing DevOps workflows and tools.

Key features and benefits

Preventing future incidents

AWS DevOps Agent analyzes patterns across your incident investigations to deliver targeted recommendations that continuously improve your operational posture and prevent future incidents.

Screenshot of the AWS DevOps Agent Prevention tab showing a newly started weekly evaluation run for the OTEL-DevOpsAgent-Demo-v1 project. The evaluation has used 3 minutes of a 15-hour budget. The agent summary states no new recommendations were generated for the past week. A recommendation frequency breakdown shows zero items across all four categories: Code optimization, Observability, Infrastructure, and Governance.

Figure 16: Prevention tab for AWS DevOps Agent in the AWS Console

DevOps Agent topology

AWS DevOps Agent Topology automatically discovers and maps your entire infrastructure into an interactive, living blueprint. It reveals not only what resources exist, but how they interconnect, depend on each other, and drive system behavior.

Screenshot of the AWS DevOps Agent Topology graph view for Agent Space 442bc2d6-616a-49dd-b13f-e8a43fab0450, displaying 1,806 total discovered resources. The graph is filtered to show Container resources and visualizes three interconnected sections: a left section with OTEL-DevOpsAgent-Demo-v1 ECS clusters and network components, a middle section with StackSets and multiple standalone resource groups, and a right section with additional standalone resources, connected by relationship lines.

Figure 17: Topology of the Amazon EKS cluster discovered by AWS DevOps Agent

Clean up

To avoid ongoing charges, delete the resources that you created while following this walkthrough.

Remove AWS DevOps Agent resources:

  • Delete investigation data and Agent Space configuration through the AWS Console.
  • Remove IAM roles and policies created specifically for the DevOps Agent.
  • Delete any CloudFormation stacks created during deployment.

Conclusion

As organizations continue to embrace cloud-native architectures and DevOps practices, tools like AWS DevOps Agent will become essential for maintaining competitive advantage in an increasingly complex technological landscape.

Ready to add AI-powered observability to your container infrastructure? Visit the AWS documentation to access implementation guides, or reach out to your AWS account team to discuss how this automated incident response approach can accelerate your cloud modernization journey while reducing operational overhead.


About the authors

Author: Aritra Nag