# Automated Diagnostics

# What is PagerDuty's Automated Diagnostics Solution?

Automated Diagnostics is a solution built by integrating PagerDuty's Incident Response and Runbook Automation products. By automating the retrieval of "diagnostic" data during incidents, you can shorten incidents, reduce the number of people paged to help with resolution, and gather evidence for fixing the root cause after the incident.

# Use Cases

The Automated Diagnostics solution has multiple use cases and benefits. Here are a few of the most common examples:

  1. Improve Triage: surfacing diagnostic data can reduce both the time spent troubleshooting and the number of people pulled into incidents.
  2. Capture Environment State: by capturing the environment or application "state" during an incident, operations engineers and developers have evidence to help them fix code-level bugs and configuration errors, even well after the incident has been resolved.
  3. Real-Time Updates: by querying backend services in real time, an Incident Commander can more easily provide updates to stakeholders during an incident.

For more details on these use cases, see this section of the solution guide.

# Prebuilt Automation

PagerDuty provides a solution that helps users start automating diagnostics quickly. This solution consists of prebuilt Automation Jobs that retrieve data from common infrastructure and services for investigating, debugging, and diagnosing incidents:

Automated Diagnostics within PagerDuty

Verbose Diagnostics in Process Automation

As an example, if an incident is triggered for a service running in Kubernetes, PagerDuty Runbook Automation can retrieve information from logs, APIs, databases, and other sources that support this service. This retrieval can be triggered with the click of a button or through event-driven invocation.
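To make the Kubernetes example concrete, here is a minimal sketch of the kind of read-only script a diagnostic job step might run. It is not one of the prebuilt Automation Jobs: the function names, the `prod` namespace, and the `app=checkout` label selector are hypothetical, and it assumes `kubectl` is installed and already authenticated on the node that runs it.

```python
import subprocess
from typing import Callable, Dict, List


def build_diagnostic_commands(namespace: str, selector: str, tail: int = 50) -> Dict[str, List[str]]:
    """Return read-only kubectl commands keyed by a human-readable label."""
    return {
        "Pod status": ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "wide"],
        "Recent pod logs": ["kubectl", "logs", "-n", namespace, "-l", selector, f"--tail={tail}"],
        "Recent events": ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
    }


def run_command(argv: List[str]) -> str:
    """Run one command and return its stdout, or a short error message on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"ERROR: {exc}"
    if result.returncode != 0:
        return f"ERROR: {result.stderr.strip()}"
    return result.stdout


def collect_diagnostics(
    commands: Dict[str, List[str]],
    runner: Callable[[List[str]], str] = run_command,
) -> str:
    """Run every command and join the outputs into one labeled text report."""
    sections = []
    for label, argv in commands.items():
        sections.append(f"== {label} ==\n{runner(argv).strip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    # Hypothetical namespace and label selector, for illustration only.
    print(collect_diagnostics(build_diagnostic_commands("prod", "app=checkout")))
```

Keeping command construction separate from execution means the same report format can be produced whether the job is started by a button click or by an event, and the `runner` argument can be swapped for a stub when testing.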

# Simplifying and Sharing Diagnostics

Diagnostics retrieved using Runbook Automation can be made available in multiple interfaces, such as PagerDuty's mobile app, Slack, and Microsoft Teams:

Diagnostics in Slack
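One way diagnostic output could be attached to an incident, so that responders see it alongside the incident itself, is as an incident note created through the PagerDuty REST API. The sketch below only builds the request. It is an illustration under stated assumptions, not a description of how Runbook Automation delivers results: the helper name `build_note_request`, the incident ID, the token, and the email address are all placeholders.

```python
import json
import urllib.request


def build_note_request(
    incident_id: str, content: str, api_token: str, from_email: str
) -> urllib.request.Request:
    """Build (but do not send) a request that adds a note to an incident."""
    body = json.dumps({"note": {"content": content}}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.pagerduty.com/incidents/{incident_id}/notes",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            # The REST API expects the email of a valid user in the account.
            "From": from_email,
        },
    )


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    request = build_note_request(
        incident_id="PXXXXXX",
        content="== Pod status ==\ncheckout-7d9f 0/1 CrashLoopBackOff",
        api_token="REPLACE_ME",
        from_email="responder@example.com",
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it keeps the payload easy to inspect and test before any credentials are involved.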

# Examples & Templates

This guide includes a full section on Examples & Best Practices. A preview is shown here:



  - Stopped ECS Task Errors
  - ELB Targets Health
  - CloudWatch Logs
  - Azure Function App Health
  - Azure File Sync
  - Load Balancer Health Probes
  - Load Balancer Health Checks
  - Troubleshoot Firewall Rules
  - GKE Cluster Connectivity
  - Top CPU Consuming Processes
  - Retrieve Errors from Syslog
  - List Top Disk Consuming Files
  - Active Directory Replication Statistics
  - Retrieve IIS Web Server Logs
  - SMB Connection Failures
  - API Health Check
  - Recent Pod Logs
  - Recent Kubernetes Events
  - Pod Status & Errors
  - Retrieve Deployment Diagnostics
  - Top Resource Consuming Queries
  - Blocking Locks
  - Missing Indexes
  - BGP Route Flapping
  - Check Spanning Tree
  - Check Duplex Mismatch
  - Retrieve Application Logs
  - Retrieve Saved Queries
  - Intrinsic Latency Diagnostics Test
  - Check Redis Port Listening
  - Retrieve Redis Memory Statistics
  - Slow Log Entries
  - Check Database Storage Status
  - Query Nginx Status Endpoint
  - Retrieve Error Logs
  - Test Nginx Configuration
  - Retrieve Recent PostgreSQL Logs
  - Test for PostgreSQL Server Running
  - Check Compaction Statistics
  - Describe Kafka Topic
  - View Topic Messages
  - Retrieve Java Thread Dump
  - Retrieve Java Heap Dump
  - RabbitMQ Node Health

Last Updated: 9/28/2023, 11:38:04 PM