# Automated Diagnostics

# What is PagerDuty's Automated Diagnostics Solution?

Automated Diagnostics is a solution provided by integrating PagerDuty's Incident Response and Runbook Automation products. By automating the retrieval of “diagnostic” data during incidents, you can shorten incidents, reduce the number of people paged to help with resolution, and gather evidence for fixing the root cause after the incident.

# Use Cases

The Automated Diagnostics solution has multiple use cases and benefits. Here are a few of the most common examples:

  1. Improve Triage: surfacing diagnostic data can reduce both the time spent troubleshooting and the number of people pulled into incidents.
  2. Capture Environment State: by capturing the environment or application "state" during an incident, operations engineers and developers have evidence to help them fix code-level bugs and configuration errors, sometimes well after the incident has been resolved.
  3. Real-time Updates: by querying backend services in real time, an Incident Commander can more easily provide updates to stakeholders during an incident.

For more details on these use cases, see this section of the solution guide.

# Prebuilt Automation

PagerDuty provides a solution that helps users start automating diagnostics quickly. This solution consists of prebuilt Automation Jobs that retrieve data from common infrastructure and services for investigating, debugging, and diagnosing incidents:

Automated Diagnostics within PagerDuty

Verbose Diagnostics in Process Automation

As an example, if an incident is triggered for a service running in Kubernetes, PagerDuty Runbook Automation can retrieve information from logs, APIs, databases, and other sources that support this service. The retrieval can be triggered with the click of a button or through event-driven invocation.
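To make the Kubernetes example concrete, here is a minimal sketch of what a diagnostics step might collect. It is not the prebuilt Automation Job itself: the function names (`build_k8s_diagnostics`, `collect_diagnostics`), the chosen `kubectl` commands, and the `prod`/`checkout` values are illustrative assumptions.

```python
import subprocess
from typing import Callable, Dict, List


def build_k8s_diagnostics(namespace: str, deployment: str) -> Dict[str, List[str]]:
    """Map a step name to the kubectl command that gathers it (hypothetical job layout)."""
    return {
        "pod_status": ["kubectl", "get", "pods", "-n", namespace, "-o", "wide"],
        "recent_events": ["kubectl", "get", "events", "-n", namespace,
                          "--sort-by=.lastTimestamp"],
        "recent_logs": ["kubectl", "logs", f"deployment/{deployment}",
                        "-n", namespace, "--tail=50"],
    }


def run_command(cmd: List[str]) -> str:
    """Run one command and return its stdout, or an error string on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return f"ERROR: {result.stderr.strip()}"
    return result.stdout


def collect_diagnostics(
    steps: Dict[str, List[str]],
    runner: Callable[[List[str]], str] = run_command,
) -> Dict[str, str]:
    """Run every step and keep going when one fails, so partial data is still returned."""
    report: Dict[str, str] = {}
    for name, cmd in steps.items():
        try:
            report[name] = runner(cmd)
        except Exception as exc:  # e.g. kubectl not installed, or a timeout
            report[name] = f"ERROR: {exc}"
    return report
```

Calling `collect_diagnostics(build_k8s_diagnostics("prod", "checkout"))` would return a dictionary keyed by step name. The `runner` parameter is injectable so the same logic can be exercised without a cluster, and a failure in one step does not discard the output of the others.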

# Simplifying and Sharing Diagnostics

Diagnostics retrieved using Runbook Automation can be made available in multiple interfaces, such as PagerDuty's mobile app, Slack, and Microsoft Teams:

Diagnostics in Slack
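One way to picture the "simplifying" step is as a function that condenses verbose output before it is shared in chat. The sketch below is an assumption about how such a step could look, not the mechanism PagerDuty uses: `summarize`, `build_chat_message`, and the five-line limit are hypothetical, and the payload shape follows the simple `{"text": ...}` form accepted by Slack incoming webhooks.

```python
import json
from typing import Dict

MAX_LINES = 5  # assumed limit; tune to what is readable in a chat thread
FENCE = "`" * 3  # Slack code-block delimiter, built here to avoid a literal fence


def summarize(output: str, max_lines: int = MAX_LINES) -> str:
    """Keep only the last max_lines non-empty lines of a verbose output."""
    lines = [line for line in output.splitlines() if line.strip()]
    if len(lines) <= max_lines:
        return "\n".join(lines)
    omitted = len(lines) - max_lines
    tail = "\n".join(lines[-max_lines:])
    return f"({omitted} earlier line(s) omitted)\n{tail}"


def build_chat_message(incident_title: str, report: Dict[str, str]) -> str:
    """Render a diagnostics report as a JSON payload with one section per step."""
    sections = [f"*Diagnostics for: {incident_title}*"]
    for name, output in report.items():
        sections.append(f"*{name}*\n{FENCE}\n{summarize(output)}\n{FENCE}")
    return json.dumps({"text": "\n".join(sections)})
```

Keeping the tail of each output favors the most recent log lines or events, which are usually the ones a responder reads first. The full output can remain in the automation job's execution log for later analysis.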

# Examples & Templates

This guide includes a full section on Examples & Best Practices. A preview of the available examples is shown here:

  - Stopped ECS Task Errors
  - ELB Targets Health
  - CloudWatch Logs
  - Azure Function App Health
  - Azure File Sync
  - Load Balancer Health Probes
  - Load Balancer Health Checks
  - Troubleshoot Firewall Rules
  - GKE Cluster Connectivity
  - Top CPU Consuming Processes
  - Retrieve Errors from Syslog
  - List Top Disk Consuming Files
  - Active Directory Replication Statistics
  - Retrieve IIS Web Server Logs
  - SMB Connection Failures
  - API Health Check
  - Recent Pod Logs
  - Recent Kubernetes Events
  - Pod Status & Errors
  - Retrieve Deployment Diagnostics
  - Top Resource Consuming Queries
  - Blocking Locks
  - Missing Indexes
  - BGP Route Flapping
  - Check Spanning Tree
  - Check Duplex Mismatch
  - Retrieve Application Logs
  - Retrieve Saved Queries
  - Intrinsic Latency Diagnostics Test
  - Check Redis Port Listening
  - Retrieve Redis Memory Statistics
  - Slow Log Entries
  - Check Database Storage Status
  - Query Nginx Status Endpoint
  - Retrieve Error Logs
  - Test Nginx Configuration
  - Retrieve Recent PostgreSQL Logs
  - Test for PostgreSQL Server Running
  - Check Compaction Statistics
  - Describe Kafka Topic
  - View Topic Messages
  - Retrieve Java Thread Dump
  - Retrieve Java Heap Dump
  - RabbitMQ Node Health

Last Updated: 9/28/2023, 11:38:04 PM