AWS re:Invent 2024 - Don't get stuck: How connected telemetry keeps you moving forward (COP322)

Updated: November 13, 2025

AWS Events


Summary

The video walks through troubleshooting a production outage, covering common outage triggers such as system changes and instance failures. It stresses identifying and mitigating the trigger quickly and shows how application instrumentation speeds up resolution. It also demonstrates troubleshooting with CloudWatch Logs and traces, including an AI agent that analyzes telemetry and generates hypotheses about the root cause.


Introduction to Troubleshooting

The speaker wakes up to an alert about a problem with a production site. Teammates come online to help, but the team struggles to identify the root cause.

Purpose of Troubleshooting

Explains that the purpose of troubleshooting is to find an action that resolves the issue quickly: fix the immediate problem first, then follow up with long-term improvements.

Identifying Common Triggers

Discusses the five common triggers of outages: changes in the system, reaching limits, dependencies failing, instances failing, and workload changes, emphasizing the need to address these triggers to stop the problem.
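The five triggers can be treated as a checklist to work through during an incident. The following is a minimal Python sketch of that idea, not something shown in the video: the `Trigger` names, the question wording, and the `candidate_triggers` helper are all illustrative assumptions.

```python
from enum import Enum


class Trigger(Enum):
    """The five common outage triggers listed in the summary."""
    CHANGE = "a change in the system"
    LIMIT = "a limit was reached"
    DEPENDENCY = "a dependency is failing"
    INSTANCE = "an instance is failing"
    WORKLOAD = "the workload changed"


# Hypothetical questions an on-call engineer might ask for each trigger.
QUESTIONS = {
    Trigger.CHANGE: "Was anything deployed or reconfigured before the alarm?",
    Trigger.LIMIT: "Is any quota, pool, or capacity limit exhausted?",
    Trigger.DEPENDENCY: "Are calls to a downstream service failing?",
    Trigger.INSTANCE: "Are the errors concentrated on a few hosts?",
    Trigger.WORKLOAD: "Did traffic volume or request shape change?",
}


def candidate_triggers(observations):
    """Return the triggers whose question was answered with True."""
    return [t for t in Trigger if observations.get(t, False)]


if __name__ == "__main__":
    answers = {Trigger.CHANGE: True, Trigger.LIMIT: False}
    for trigger in candidate_triggers(answers):
        print(f"Investigate: {trigger.value}")
```

Unanswered questions default to False here, so only triggers with positive evidence are returned.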

Investigative Techniques

Illustrates how to investigate faster by checking for each trigger systematically, so that the cause can be identified and mitigated quickly.

Benefits of Application Instrumentation

Explains the benefits of application instrumentation, including faster navigation during troubleshooting and better visibility into system performance, and highlights why dimensionality in measurements matters.
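To illustrate why dimensionality matters, the sketch below groups the same measurements by different dimensions to see where faults concentrate. It is an assumption-laden example rather than the speaker's code: the record layout, the service and operation names, and the `fault_rate_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical measurements: each one carries dimensions alongside the outcome.
SAMPLE = [
    {"dimensions": {"service": "bot-service", "operation": "GetConfig"}, "fault": True},
    {"dimensions": {"service": "bot-service", "operation": "GetConfig"}, "fault": True},
    {"dimensions": {"service": "bot-service", "operation": "ListBots"}, "fault": False},
    {"dimensions": {"service": "gateway", "operation": "ListBots"}, "fault": False},
]


def fault_rate_by(measurements, dimension):
    """Return the fraction of faulty measurements for each value of a dimension."""
    totals = defaultdict(int)
    faults = defaultdict(int)
    for measurement in measurements:
        key = measurement["dimensions"][dimension]
        totals[key] += 1
        if measurement["fault"]:
            faults[key] += 1
    return {key: faults[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Slicing by operation isolates the failing call more sharply than slicing by service.
    print(fault_rate_by(SAMPLE, "service"))
    print(fault_rate_by(SAMPLE, "operation"))
```

Without the `operation` dimension, the data would only show that one service is partly unhealthy, not which call is failing.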

Importance of Navigation and Connectivity

Emphasizes that navigation and connectivity keep an investigation from getting stuck: they let you move efficiently between components, and manual instrumentation supports effective tracking and resolution.

Utilizing CloudWatch Logs and Traces

Demonstrates how to use CloudWatch Logs and traces to analyze faults in detail and follow the flow of a request until the issue is identified and resolved.
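Following a request across services depends on logs sharing an identifier with the trace. The sketch below shows that idea with plain JSON log lines and the standard library only; it does not call any CloudWatch API, and the field names (`trace_id`, `timestamp`, `service`) and sample events are assumptions made for illustration.

```python
import json

# Hypothetical structured log lines from several services.
SAMPLE_LOGS = [
    '{"timestamp": 3, "trace_id": "t-1", "service": "bot-schedule-service", "level": "ERROR", "message": "AccessDenied"}',
    '{"timestamp": 1, "trace_id": "t-1", "service": "gateway", "level": "INFO", "message": "request received"}',
    '{"timestamp": 2, "trace_id": "t-2", "service": "gateway", "level": "INFO", "message": "request received"}',
]


def events_for_trace(log_lines, trace_id):
    """Return the parsed log events that carry trace_id, oldest first."""
    events = []
    for line in log_lines:
        event = json.loads(line)
        if event.get("trace_id") == trace_id:
            events.append(event)
    return sorted(events, key=lambda event: event["timestamp"])


if __name__ == "__main__":
    for event in events_for_trace(SAMPLE_LOGS, "t-1"):
        print(event["timestamp"], event["service"], event["level"], event["message"])
```

Sorting by timestamp reconstructs the order in which the request moved through the services, which is what makes the failing hop visible.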

Optimizing Troubleshooting with CloudWatch

Introduces new capabilities in CloudWatch for faster, more efficient troubleshooting, such as filtering noise, summarizing results, and identifying issues through log analysis and queries.
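One way to picture noise filtering and summarization is to collapse log messages that differ only in variable tokens into a shared pattern and then count the patterns. The sketch below is a simplified stand-in written with the standard library; it is not how CloudWatch implements its features, and the regular expression and sample messages are assumptions.

```python
import re
from collections import Counter

# Treat standalone numbers and long hex strings as variable tokens (an assumption).
_VARIABLE = re.compile(r"\b(?:[0-9a-f]{8,}|\d+)\b")


def to_pattern(message):
    """Replace variable tokens with a placeholder so similar messages group together."""
    return _VARIABLE.sub("<*>", message)


def summarize(messages):
    """Return (pattern, count) pairs, most frequent first."""
    return Counter(to_pattern(message) for message in messages).most_common()


if __name__ == "__main__":
    messages = [
        "request 1 took 20 ms",
        "request 2 took 35 ms",
        "AccessDenied for table 7",
    ]
    for pattern, count in summarize(messages):
        print(count, pattern)
```

Rare patterns that appear only during the incident window are usually more interesting than the high-volume ones, so a summary like this helps decide where to look first.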

Utilizing AI Agent for Investigations

Using an AI agent to conduct investigations and sift through telemetry and topology to identify issues.

Starting the Investigation

Starting the investigation process from the entry signals, focusing on the bot service and gateway, and identifying configuration API issues.

Identifying Dependencies

Exploring the bot schedule service as a dependency of the bot service and following causal pathways to uncover errors.
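Following a causal pathway can be sketched as a walk over a dependency graph: start at the entry point, follow only the dependencies that are also faulting, and stop at services that have no faulting dependency of their own. The code below is an illustrative assumption, not the agent's actual algorithm; the topology, the service names, and the `root_cause_candidates` helper are hypothetical.

```python
from collections import deque

# Hypothetical topology: each service maps to the dependencies it calls.
TOPOLOGY = {
    "gateway": ["bot-service"],
    "bot-service": ["bot-schedule-service", "config-api"],
    "bot-schedule-service": ["dynamodb-table"],
}

# Hypothetical set of components currently reporting faults.
FAULTY = {"gateway", "bot-service", "bot-schedule-service", "dynamodb-table"}


def root_cause_candidates(topology, faulty, entry):
    """Walk faulting dependencies from entry; return faulty nodes with no faulty dependency."""
    candidates = []
    seen = {entry}
    queue = deque([entry])
    while queue:
        service = queue.popleft()
        faulty_deps = [dep for dep in topology.get(service, []) if dep in faulty]
        if service in faulty and not faulty_deps:
            candidates.append(service)
        for dep in faulty_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


if __name__ == "__main__":
    print(root_cause_candidates(TOPOLOGY, FAULTY, "gateway"))
```

Healthy dependencies are never enqueued, which keeps the walk on the path that explains the symptoms seen at the entry point.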

Access Denied Error Detection

Detecting access denied messages in the logs of microservices that call services such as DynamoDB, and tracing the errors back to resource policy changes.
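Tying an access denied error to a change can be approximated by finding the first such error and listing the changes made shortly before it. The sketch below assumes simple in-memory records; the timestamps, the change descriptions, the 30-minute window, and both helper names are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical log events and change records.
EVENTS = [
    {"time": datetime(2024, 1, 1, 3, 5), "message": "request completed"},
    {"time": datetime(2024, 1, 1, 3, 12), "message": "AccessDeniedException calling GetItem"},
    {"time": datetime(2024, 1, 1, 3, 10), "message": "AccessDeniedException calling GetItem"},
]
CHANGES = [
    {"time": datetime(2024, 1, 1, 1, 0), "what": "deployed gateway"},
    {"time": datetime(2024, 1, 1, 3, 8), "what": "updated table resource policy"},
]


def first_access_denied(events):
    """Return the earliest event whose message mentions AccessDenied, or None."""
    denied = [event for event in events if "AccessDenied" in event["message"]]
    return min(denied, key=lambda event: event["time"], default=None)


def changes_shortly_before(changes, moment, window=timedelta(minutes=30)):
    """Return changes made within the window leading up to the given moment."""
    return [c for c in changes if moment - window <= c["time"] <= moment]


if __name__ == "__main__":
    first = first_access_denied(EVENTS)
    if first is not None:
        for change in changes_shortly_before(CHANGES, first["time"]):
            print(change["time"].isoformat(), change["what"])
```

The window size is a judgment call: too short misses slow-propagating changes, too long brings back unrelated ones.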

Generating Hypotheses

The AI agent generates hypotheses to explain issues, providing insights into the root cause and recommending actions like reviewing recent changes and implementing change control policies.

Troubleshooting Steps

Exploring five troubleshooting steps to identify root causes, gather information, and investigate using a systematic approach.


FAQ

Q: What is the purpose of troubleshooting?

A: The purpose of troubleshooting is to find actionable solutions to resolve issues quickly.

Q: What are the five common triggers of outages discussed in the scenario?

A: The five common triggers of outages are changes in the system, reaching limits, dependencies failing, instances failing, and workload changes.

Q: How can issues be investigated faster according to the scenario?

A: Issues can be investigated faster by identifying and addressing specific triggers systematically.

Q: What are the benefits of application instrumentation mentioned in the scenario?

A: The benefits of application instrumentation include faster navigation during troubleshooting and better visibility into system performance. The scenario also highlights why dimensionality in measurements matters.

Q: Why are navigation and connectivity significant in troubleshooting?

A: Navigation and connectivity keep an investigation from getting stuck: they let you move efficiently between components, and manual instrumentation supports effective tracking and resolution.

Q: How can CloudWatch Logs and traces be leveraged for troubleshooting?

A: CloudWatch Logs and traces can be used to analyze faults in detail and follow the flow of a request until the issue is identified and resolved.

Q: What new capabilities in CloudWatch were introduced for troubleshooting efficiency?

A: New capabilities in CloudWatch introduced for troubleshooting efficiency include filtering noise, summarizing results, and identifying issues through log analysis and query functionalities.

Q: How was an AI agent utilized in the troubleshooting scenario?

A: An AI agent was used to conduct investigations and sift through telemetry and topology to identify issues, generate hypotheses, provide insights into the root cause, and recommend actions.

Q: What were the steps mentioned in the scenario for identifying root causes during troubleshooting?

A: The steps mentioned for identifying root causes during troubleshooting include starting from the entry signals, focusing on specific services like the bot service and gateway, and exploring causal pathways to uncover errors.
