| name | description |
|---|---|
| AWS CloudWatch Investigation | Reusable investigation patterns for AWS CloudWatch: Logs Insights query templates, alarm-to-deployment correlation, blast-radius narrowing decision tree, and PromQL-style metric query patterns for structured incident triage. |
AWS CloudWatch Investigation Skill
Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.
Pattern 1: Logs Insights Query Templates
Error Spike Detection
Find the log streams producing the most error-level messages in a time window, counted per 5-minute bin:
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|critical)/
| stats count(*) as errorCount by bin(5m), @logStream
| sort errorCount desc
| limit 20
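Logs Insights returns each result row as a list of field/value pairs rather than a flat record, which is awkward to sort or filter during triage. The following is a minimal Python sketch for flattening rows, assuming the `results` shape returned by boto3's `get_query_results`; the helper name `rows_to_dicts` and the sample row are illustrative only.

```python
from typing import Dict, List


def rows_to_dicts(results: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Flatten Logs Insights rows ([{field, value}, ...]) into plain dicts."""
    flat = []
    for row in results:
        # @ptr is an internal record pointer that only appears on non-aggregated rows.
        record = {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        flat.append(record)
    return flat


# Hypothetical sample shaped like the `results` key of a query response.
sample = [
    [
        {"field": "bin(5m)", "value": "2024-03-15 14:20:00.000"},
        {"field": "@logStream", "value": "2024/03/15/[$LATEST]abc123"},
        {"field": "errorCount", "value": "42"},
    ]
]
rows = rows_to_dicts(sample)
print(rows[0]["errorCount"])
```

All values arrive as strings, so numeric columns such as errorCount need an explicit conversion before comparison.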
P99 Latency Breakdown by Operation
Identify which operations are driving latency spikes:
fields @timestamp, @duration, operation
| filter ispresent(@duration)
| stats avg(@duration) as avgMs,
pct(@duration, 50) as p50Ms,
pct(@duration, 95) as p95Ms,
pct(@duration, 99) as p99Ms,
count(*) as invocations
by operation
| sort p99Ms desc
| limit 15
Lambda Cold Start Detection
Quantify cold start impact during an incident:
fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count(*) as coldStarts,
avg(@initDuration) as avgInitMs,
max(@initDuration) as maxInitMs,
avg(@duration) as avgDurationMs
by bin(5m)
| sort @timestamp desc
Out-of-Memory (OOM) Detection
Find Lambda functions or containers killed by memory pressure:
fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
| stats count(*) as oomEvents by @logStream, bin(10m)
| sort oomEvents desc
| limit 10
For memory utilization trending before OOM:
fields @timestamp, @maxMemoryUsed, @memorySize
| filter ispresent(@maxMemoryUsed)
| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
by bin(5m)
| sort @timestamp desc
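The utilization figure in the query above is simply max memory used divided by configured memory. As a worked example, the same percentage can be derived from a single Lambda REPORT line. This is a rough sketch: the function name `memory_pct` is made up for illustration, and the sample line uses a placeholder request ID.

```python
import re
from typing import Optional

# Matches the memory fields of a Lambda REPORT log line (fields are tab-separated).
REPORT_RE = re.compile(r"Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB")


def memory_pct(report_line: str) -> Optional[float]:
    """Return peak memory use as a percentage of configured memory, or None if absent."""
    match = REPORT_RE.search(report_line)
    if match is None:
        return None
    size_mb, used_mb = int(match.group(1)), int(match.group(2))
    return used_mb / size_mb * 100


line = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 96 MB"
)
print(memory_pct(line))  # 75.0
```

A function that sits near 100% for several bins before the first OOM event is a stronger signal than a single spike.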
Timeout Detection
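Find invocations that hit, or came close to, the configured timeout. The @duration > 28000 threshold (in milliseconds) assumes a timeout of about 30 seconds; adjust it to the function's actual setting: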
fields @timestamp, @duration, @logStream, @requestId
| filter @message like /Task timed out/ or @duration > 28000
| stats count(*) as timeouts by @logStream, bin(5m)
| sort timeouts desc
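Because the duration threshold depends on each function's configured timeout, it can help to generate the query rather than hard-code it. A small sketch, assuming a 2-second safety margin below the timeout; the helper `timeout_query` and its parameters are hypothetical, not part of any AWS SDK.

```python
def timeout_query(timeout_ms: int, margin_ms: int = 2000, bin_minutes: int = 5) -> str:
    """Build the timeout-detection query for a function with the given timeout."""
    threshold_ms = timeout_ms - margin_ms
    if threshold_ms <= 0:
        raise ValueError("margin_ms must be smaller than timeout_ms")
    return "\n".join(
        [
            "fields @timestamp, @duration, @logStream, @requestId",
            f"| filter @message like /Task timed out/ or @duration > {threshold_ms}",
            f"| stats count(*) as timeouts by @logStream, bin({bin_minutes}m)",
            "| sort timeouts desc",
        ]
    )


# A 30-second timeout reproduces the 28000 ms threshold used in the template.
print(timeout_query(30_000))
```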
Pattern 2: Alarm History to Deploy-Event Correlation
Process
- Get alarm transition time — note the exact timestamp when the alarm entered ALARM state.
- Query CloudTrail for deployment-related events in a window of [alarm_time - 30min, alarm_time]:
# CloudTrail Lake query for deployment events
SELECT eventTime, eventName, userIdentity.arn, requestParameters
FROM <event-data-store-id>
WHERE eventTime > '<alarm_time_minus_30m>'
AND eventTime < '<alarm_time>'
AND eventName IN (
'UpdateFunctionCode', 'UpdateFunctionConfiguration',
'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
'CreateChangeSet', 'ExecuteChangeSet',
'StartPipelineExecution', 'PutImage'
)
ORDER BY eventTime DESC
- Correlation criteria. A deploy is "correlated" if:
  - It targets the same service/resource as the alarm
  - It completed within 15 minutes before the alarm transition
  - The deployer identity matches a CI/CD role (not a human applying a hotfix)
- Strengthening the correlation:
  - Check if the same alarm was healthy in the previous deployment cycle
  - Verify no other environmental changes (scaling events, config changes) in the same window
  - Look for canary/synthetic monitor failures that started at the same time
Output Format
Deploy Correlation:
Event: UpdateFunctionCode
Time: 2024-03-15T14:23:07Z (12 min before alarm)
Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
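The query window and the three correlation criteria can be sketched in Python as below. This is illustrative only: `DeployEvent`, `query_window`, `check_correlation`, and the CI/CD marker substrings are assumptions made for the example, and matching actors by substring is a simplification of whatever role-naming convention an organization actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DeployEvent:
    event_name: str
    event_time: datetime
    actor_arn: str
    resource_arn: str


def query_window(alarm_time: datetime, lookback_minutes: int = 30):
    """Return (start, end) for the CloudTrail query: [alarm_time - 30min, alarm_time]."""
    return alarm_time - timedelta(minutes=lookback_minutes), alarm_time


def check_correlation(event, alarm_time, alarm_resource_arn,
                      cicd_markers=("github-actions", "codepipeline", "codebuild")):
    """Evaluate the three criteria; 'correlated' is True only if all of them hold."""
    lead = alarm_time - event.event_time
    criteria = {
        "same_resource": event.resource_arn == alarm_resource_arn,
        "within_15_min": timedelta(0) <= lead <= timedelta(minutes=15),
        "cicd_actor": any(marker in event.actor_arn for marker in cicd_markers),
    }
    criteria["correlated"] = all(criteria.values())
    return criteria


alarm_time = datetime(2024, 3, 15, 14, 35, 7, tzinfo=timezone.utc)
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:payment-processor"
event = DeployEvent(
    event_name="UpdateFunctionCode",
    event_time=datetime(2024, 3, 15, 14, 23, 7, tzinfo=timezone.utc),
    actor_arn="arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session",
    resource_arn=function_arn,
)
result = check_correlation(event, alarm_time, function_arn)
print(result)
```

The strengthening checks (prior-cycle alarm state, other environmental changes, canary failures) are left out because they need additional data sources.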
Pattern 3: Narrow the Blast Radius Decision Tree
Use this tree to systematically scope an incident from broadest to most specific:
START
|
v
[1] ACCOUNT — Which account(s) show the alarm?
| - Check: Are alarms firing in multiple accounts?
| - If yes → suspect shared service (SSO, networking, shared deployment pipeline)
| - If no → proceed to Region
v
[2] REGION — Which region(s) are affected?
| - Check: Same alarm in other regions?
| - If multi-region → suspect global service (IAM, Route53, S3 global)
| - If single-region → proceed to Service
v
[3] SERVICE — Which service namespace shows degradation?
| - Check CloudWatch namespace: AWS/Lambda, AWS/ECS, AWS/ApiGateway, etc.
| - If multiple services → suspect shared dependency (VPC, NAT, DNS, IAM)
| - If single service → proceed to Operation
v
[4] OPERATION — Which API action or function is failing?
| - For Lambda: which function name?
| - For ECS: which service/task definition?
| - For API GW: which stage/resource/method?
| - If all operations → suspect service-level issue (throttling, quota)
| - If specific operation → proceed to Resource
v
[5] RESOURCE — Which specific resource instance?
- Function ARN, Task ID, DB instance identifier
- This is your investigation target
- Proceed to log and trace analysis scoped to this resource
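The tree above can be expressed as a short function that stops at the first level where the incident fans out. This is a sketch under simplifying assumptions: the caller has already collected the sets of affected accounts, regions, and service namespaces, and the function name and return shape are invented for illustration.

```python
from typing import Set, Tuple


def narrow_blast_radius(
    accounts: Set[str],
    regions: Set[str],
    services: Set[str],
    all_operations_failing: bool,
) -> Tuple[str, str]:
    """Walk the tree top-down; return (level where narrowing stops, what to suspect)."""
    if len(accounts) > 1:
        return "ACCOUNT", "shared service (SSO, networking, shared deployment pipeline)"
    if len(regions) > 1:
        return "REGION", "global service (IAM, Route53, S3 global)"
    if len(services) > 1:
        return "SERVICE", "shared dependency (VPC, NAT, DNS, IAM)"
    if all_operations_failing:
        return "OPERATION", "service-level issue (throttling, quota)"
    return "RESOURCE", "specific resource: scope log and trace analysis to it"


level, suspect = narrow_blast_radius(
    accounts={"123456789012"},
    regions={"us-east-1"},
    services={"AWS/Lambda", "AWS/ApiGateway"},
    all_operations_failing=False,
)
print(level, "->", suspect)
```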
Shared Dependency Investigation
When blast radius spans multiple services, investigate in this order:
- VPC/Networking — NAT Gateway ErrorPortAllocation, packet drops, DNS resolution failures
- IAM/STS — ThrottlingException on AssumeRole, token vending latency
- Downstream dependency — shared database, cache, or external API
- Deployment pipeline — simultaneous deploys across services from same pipeline run
- AWS service event — check AWS Health Dashboard and Service Health for the region
Pattern 4: PromQL-Style Metric Query Patterns
These patterns use CloudWatch metric math and GetMetricData to build composite signals. Express them as metric queries for dashboards or programmatic retrieval.
Error Rate as Percentage
MetricDataQueries:
  - Id: errors
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Errors
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 60
      Stat: Sum
  - Id: invocations
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Invocations
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 60
      Stat: Sum
  - Id: error_rate
    Expression: "errors / invocations * 100"
    Label: "Error Rate %"
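When the two series are retrieved separately and combined client-side, the same arithmetic needs a guard for periods with zero invocations. A minimal sketch, assuming the series are already aligned lists of per-period sums; `error_rate_pct` is a hypothetical helper.

```python
from typing import List, Optional


def error_rate_pct(errors: List[float], invocations: List[float]) -> List[Optional[float]]:
    """Per-period error rate in percent; None where there were no invocations."""
    rates = []
    for err, inv in zip(errors, invocations):
        rates.append(None if inv == 0 else err * 100 / inv)
    return rates


# Three 1-minute periods: healthy, degraded, and idle.
print(error_rate_pct([0, 9, 3], [120, 20, 0]))  # [0.0, 45.0, None]
```

Reporting idle periods as None rather than 0 avoids mistaking "no traffic" for "no errors".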
Latency Anomaly Detection (Compare to Baseline)
MetricDataQueries:
  - Id: current_p99
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Duration
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 300
      Stat: p99
  - Id: baseline_p99
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Duration
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 300
      Stat: p99
    # GetMetricData applies one StartTime/EndTime to every query in a request,
    # so fetch this baseline in a separate request covering the same window last week.
  - Id: anomaly_ratio
    # Within a single request both series are identical, so compute this ratio
    # client-side from the two responses.
    Expression: "current_p99 / baseline_p99"
    Label: "Latency vs Baseline (ratio > 2 = anomaly)"
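GetMetricData takes a single StartTime/EndTime per request, so a week-old baseline has to come from a second request and be joined to the current series afterwards. The following sketch shows one way to do that join, assuming each series has already been turned into a timestamp-to-value mapping; the function name `latency_ratio` and the sample values are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def latency_ratio(
    current: Dict[datetime, float],
    baseline: Dict[datetime, float],
    shift: timedelta = timedelta(days=7),
) -> Dict[datetime, float]:
    """Divide each current p99 by the baseline p99 from exactly `shift` earlier."""
    ratios = {}
    for ts, value in current.items():
        base = baseline.get(ts - shift)
        if base:  # skip periods with a missing or zero baseline
            ratios[ts] = value / base
    return ratios


ts = datetime(2024, 3, 15, 14, 20, tzinfo=timezone.utc)
current = {ts: 2800.0}
baseline = {ts - timedelta(days=7): 700.0}
ratios = latency_ratio(current, baseline)
print(ratios[ts])  # 4.0, above the ratio > 2 anomaly threshold
```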
Throttling Pressure Score
Combine multiple throttling signals into a single pressure metric:
MetricDataQueries:
  - Id: lambda_throttles
    MetricStat:
      Metric: {Namespace: AWS/Lambda, MetricName: Throttles}
      Period: 60
      Stat: Sum
  - Id: api_gw_4xx
    # 4XXError counts every client error, not only 429s, so this term is an upper bound on throttling.
    MetricStat:
      Metric: {Namespace: AWS/ApiGateway, MetricName: 4XXError, Dimensions: [{Name: ApiName, Value: TARGET}]}
      Period: 60
      Stat: Sum
  - Id: dynamo_throttles
    MetricStat:
      Metric: {Namespace: AWS/DynamoDB, MetricName: ThrottledRequests, Dimensions: [{Name: TableName, Value: TARGET}]}
      Period: 60
      Stat: Sum
  - Id: throttle_pressure
    Expression: "lambda_throttles + api_gw_4xx + dynamo_throttles"
    Label: "Combined Throttle Pressure"
Concurrent Execution Headroom
MetricDataQueries:
  - Id: concurrent
    MetricStat:
      Metric: {Namespace: AWS/Lambda, MetricName: ConcurrentExecutions}
      Period: 60
      Stat: Maximum
  - Id: headroom
    # 1000 is the default per-Region concurrency quota; substitute the account's actual limit.
    Expression: "1000 - concurrent"
    Label: "Remaining Concurrency (account limit 1000)"
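Since the concurrency limit varies by account, the headroom arithmetic is easier to reuse with the limit as a parameter. A small sketch; `concurrency_headroom` is a hypothetical helper and the 1000 default only mirrors the value used in the query above.

```python
from typing import Tuple


def concurrency_headroom(max_concurrent: int, account_limit: int = 1000) -> Tuple[int, float]:
    """Return (remaining executions, utilization in percent) for one period."""
    remaining = account_limit - max_concurrent
    utilization_pct = max_concurrent * 100 / account_limit
    return remaining, utilization_pct


remaining, utilization = concurrency_headroom(850)
print(remaining, utilization)  # 150 85.0
```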
Pattern 5: Incident Timeline Reconstruction
Process
Reconstruct a precise timeline by merging data from multiple sources:
- Collect timestamps:
| Source | Query | Yields |
|---|---|---|
| CloudWatch Alarms | Alarm history API | State transition times |
| CloudWatch Metrics | GetMetricData with 1-min period | First anomaly point |
| CloudWatch Logs | Logs Insights with earliest(@timestamp) | First error occurrence |
| CloudTrail | LookupEvents filtered by time | Deployment/change events |
| AWS Health | DescribeEvents | AWS-side incidents |
- Build the timeline:
fields @timestamp, @message
| filter @message like /ERROR|WARN|timeout|refused|denied/
| stats earliest(@timestamp) as firstSeen, latest(@timestamp) as lastSeen, count(*) as occurrences
by @message
| sort firstSeen asc
| limit 20
- Identify the sequence:
Timeline:
T-15m: CloudTrail — UpdateFunctionCode by CI/CD role
T-12m: Logs — first error "Connection refused to payments-api.internal"
T-10m: Metrics — Error count crosses 5/min threshold
T-8m: Alarm — PaymentProcessorErrors enters ALARM
T-5m: Metrics — p99 latency spikes to 28s (timeout)
T-0: Current — error rate at 45%, alarm still firing
- Determine root event — the earliest change that preceded all symptoms. Walk backward from the first symptom to the most recent mutation (deploy, config change, scaling event, or external dependency shift).
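The merge-and-sort step can be sketched as follows, assuming events from each source have already been reduced to (timestamp, source, description) tuples. The function name `build_timeline` and the sample events are illustrative; the first line of the output is the candidate root event.

```python
from datetime import datetime, timezone
from typing import List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, source, description)


def build_timeline(events: List[Event], now: datetime) -> List[str]:
    """Sort events oldest-first and label each with its offset from `now`."""
    lines = []
    for ts, source, description in sorted(events, key=lambda e: e[0]):
        minutes = int((now - ts).total_seconds() // 60)
        label = "T-0" if minutes == 0 else f"T-{minutes}m"
        lines.append(f"{label}: {source}: {description}")
    return lines


now = datetime(2024, 3, 15, 14, 38, tzinfo=timezone.utc)
events = [
    (datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc), "Alarm", "PaymentProcessorErrors enters ALARM"),
    (datetime(2024, 3, 15, 14, 23, tzinfo=timezone.utc), "CloudTrail", "UpdateFunctionCode by CI/CD role"),
    (datetime(2024, 3, 15, 14, 26, tzinfo=timezone.utc), "Logs", "first error: Connection refused"),
]
timeline = build_timeline(events, now)
print("\n".join(timeline))
```

Normalizing every timestamp to UTC before merging avoids ordering mistakes between sources that report in different zones.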
Gotchas
- CloudWatch metric timestamps mark the start of the period. A 1-minute datapoint at 14:05 covers 14:05-14:06.
- CloudTrail events can have a delivery delay of up to 15 minutes. Use eventTime, not ingestion time.
- Log group timestamps depend on the agent/SDK flush interval. Allow for 30-60s of clock skew.
- Alarm state changes lag the underlying anomaly by roughly the evaluation window (period length × number of evaluation periods), so the actual anomaly started earlier than the alarm transition.
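The alarm evaluation delay can be turned into a rough estimate of when the anomaly began. This sketch assumes every evaluated period had to breach before the alarm fired and ignores metric ingestion delay and "M out of N" datapoint settings, so treat the result as an approximation; `estimated_anomaly_start` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def estimated_anomaly_start(
    alarm_time: datetime, period_seconds: int, evaluation_periods: int
) -> datetime:
    """Subtract the full evaluation window (period x evaluation periods) from the alarm time."""
    return alarm_time - timedelta(seconds=period_seconds * evaluation_periods)


alarm_time = datetime(2024, 3, 15, 14, 35, tzinfo=timezone.utc)
# A 60-second period with 3 evaluation periods gives a 3-minute window.
start = estimated_anomaly_start(alarm_time, period_seconds=60, evaluation_periods=3)
print(start.isoformat())  # 2024-03-15T14:32:00+00:00
```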