chore: publish from staged

This commit is contained in:
github-actions[bot]
2026-06-22 01:34:47 +00:00
parent b5eb0d2306
commit 223acfc274
4 changed files with 453 additions and 0 deletions
@@ -0,0 +1,333 @@
---
name: AWS CloudWatch Investigation
description: >
Reusable investigation patterns for AWS CloudWatch: Logs Insights query templates,
alarm-to-deployment correlation, blast-radius narrowing decision tree, and
PromQL-style metric query patterns for structured incident triage.
---
# AWS CloudWatch Investigation Skill
Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.
---
## Pattern 1: Logs Insights Query Templates
### Error Spike Detection
Find the noisiest log streams in a time window by counting error-like messages per log stream in 5-minute buckets:
```
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|critical)/
| stats count(*) as errorCount by bin(5m), @logStream
| sort errorCount desc
| limit 20
```
### P99 Latency Breakdown by Operation
Identify which operations are driving latency spikes:
```
fields @timestamp, @duration, operation
| filter ispresent(@duration)
| stats avg(@duration) as avgMs,
pct(@duration, 50) as p50Ms,
pct(@duration, 95) as p95Ms,
pct(@duration, 99) as p99Ms,
count(*) as invocations
by operation
| sort p99Ms desc
| limit 15
```
### Lambda Cold Start Detection
Quantify cold start impact during an incident:
```
fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count(*) as coldStarts,
avg(@initDuration) as avgInitMs,
max(@initDuration) as maxInitMs,
avg(@duration) as avgDurationMs
by bin(5m)
| sort @timestamp desc
```
### Out-of-Memory (OOM) Detection
Find Lambda functions or containers killed by memory pressure:
```
fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
| stats count(*) as oomEvents by @logStream, bin(10m)
| sort oomEvents desc
| limit 10
```
For memory utilization trending before OOM:
```
fields @timestamp, @maxMemoryUsed, @memorySize
| filter ispresent(@maxMemoryUsed)
| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
by bin(5m)
| sort @timestamp desc
```
### Timeout Detection
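Find invocations that hit or nearly hit the configured timeout. The 28000 ms threshold assumes a 30-second function timeout; adjust it to sit just under your own setting: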
```
fields @timestamp, @duration, @logStream, @requestId
| filter @message like /Task timed out/ or @duration > 28000
| stats count(*) as timeouts by @logStream, bin(5m)
| sort timeouts desc
```
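The templates above can also be run programmatically. The following is a minimal sketch, assuming a boto3 CloudWatch Logs client (`boto3.client("logs")`) is passed in; the function name `run_insights_query`, the one-hour lookback, and the polling limits are illustrative choices, not part of the original patterns.

```python
import time
from datetime import datetime, timedelta, timezone

ERROR_SPIKE_QUERY = """
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|critical)/
| stats count(*) as errorCount by bin(5m), @logStream
| sort errorCount desc
| limit 20
"""


def run_insights_query(logs_client, log_group, query, end,
                       lookback_minutes=60, poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it reaches a terminal status."""
    start = end - timedelta(minutes=lookback_minutes)
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start.timestamp()),  # epoch seconds
        endTime=int(end.timestamp()),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish in time")

    if response["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {response['status']}")

    # Each row is a list of {"field": ..., "value": ...} cells; flatten to dicts.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

Passing the client in keeps the helper easy to exercise against a stub. All values come back as strings, so cast `errorCount` before comparing it numerically.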
---
## Pattern 2: Alarm History to Deploy-Event Correlation
### Process
1. **Get alarm transition time** — note the exact timestamp when the alarm entered ALARM state.
2. **Query CloudTrail** for deployment-related events in a window of [alarm_time - 30min, alarm_time]:
```
-- CloudTrail Lake query for deployment events
SELECT eventTime, eventName, userIdentity.arn, requestParameters
FROM <event-data-store-id>
WHERE eventTime > '<alarm_time_minus_30m>'
AND eventTime < '<alarm_time>'
AND eventName IN (
'UpdateFunctionCode', 'UpdateFunctionConfiguration',
'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
'CreateChangeSet', 'ExecuteChangeSet',
'StartPipelineExecution', 'PutImage'
)
ORDER BY eventTime DESC
```
3. **Correlation criteria** — a deploy is "correlated" if:
- It targets the same service/resource as the alarm
- It completed within 15 minutes before the alarm transition
- The deployer identity matches a CI/CD role (not a human applying a hotfix)
4. **Strengthening the correlation:**
- Check if the same alarm was healthy in the previous deployment cycle
- Verify no other environmental changes (scaling events, config changes) in the same window
- Look for canary/synthetic monitor failures that started at the same time
### Output Format
```
Deploy Correlation:
Event: UpdateFunctionCode
Time: 2024-03-15T14:23:07Z (12 min before alarm)
Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
```
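The correlation criteria can be sketched as a small classifier. This is an illustration under stated assumptions: the `DeployEvent` type, the `ci_role_markers` substrings, and the `MODERATE`/`WEAK`/`NONE` labels are hypothetical. Only `STRONG` and the 15-minute window come from the pattern above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DeployEvent:
    event_name: str
    event_time: datetime
    actor_arn: str
    resource_arn: str


def classify_correlation(event, alarm_time, alarm_resource_arn,
                         ci_role_markers=("github-actions", "codepipeline"),
                         window=timedelta(minutes=15)):
    """Return (label, whole minutes before the alarm) for one deploy event."""
    lead = alarm_time - event.event_time
    in_window = timedelta(0) <= lead <= window
    same_resource = event.resource_arn == alarm_resource_arn
    # Assumption: CI/CD roles are recognizable by a substring of the actor ARN.
    ci_actor = any(marker in event.actor_arn for marker in ci_role_markers)

    if not in_window:
        label = "NONE"
    elif same_resource and ci_actor:
        label = "STRONG"
    elif same_resource:
        label = "MODERATE"
    else:
        label = "WEAK"
    return label, int(lead.total_seconds() // 60)


function_arn = "arn:aws:lambda:us-east-1:123456789012:function:payment-processor"
event = DeployEvent(
    "UpdateFunctionCode",
    datetime(2024, 3, 15, 14, 23, 7, tzinfo=timezone.utc),
    "arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session",
    function_arn,
)
alarm_time = datetime(2024, 3, 15, 14, 35, 7, tzinfo=timezone.utc)
print(classify_correlation(event, alarm_time, function_arn))  # ('STRONG', 12)
```

The strengthening checks in step 4 (prior-cycle health, other environmental changes, canary failures) need extra data sources and are left out of this sketch.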
---
## Pattern 3: Narrow the Blast Radius Decision Tree
Use this tree to systematically scope an incident from broadest to most specific:
```
START
|
v
[1] ACCOUNT — Which account(s) show the alarm?
| - Check: Are alarms firing in multiple accounts?
| - If yes → suspect shared service (SSO, networking, shared deployment pipeline)
| - If no → proceed to Region
v
[2] REGION — Which region(s) are affected?
| - Check: Same alarm in other regions?
| - If multi-region → suspect global service (IAM, Route53, S3 global)
| - If single-region → proceed to Service
v
[3] SERVICE — Which service namespace shows degradation?
| - Check CloudWatch namespace: AWS/Lambda, AWS/ECS, AWS/ApiGateway, etc.
| - If multiple services → suspect shared dependency (VPC, NAT, DNS, IAM)
| - If single service → proceed to Operation
v
[4] OPERATION — Which API action or function is failing?
| - For Lambda: which function name?
| - For ECS: which service/task definition?
| - For API GW: which stage/resource/method?
| - If all operations → suspect service-level issue (throttling, quota)
| - If specific operation → proceed to Resource
v
[5] RESOURCE — Which specific resource instance?
- Function ARN, Task ID, DB instance identifier
- This is your investigation target
- Proceed to log and trace analysis scoped to this resource
```
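The tree can be walked mechanically once firing alarms are tagged with the five levels. A minimal sketch follows, assuming each alarm is a dict with the hypothetical keys `account`, `region`, `service`, `operation`, and `resource`. It treats "more than one distinct value" as the branch condition, which only approximates the "all operations" check at level 4.

```python
LEVELS = ["account", "region", "service", "operation", "resource"]

# Suspects copied from the decision tree for each level that can fan out.
SUSPECTS = {
    "account": "shared service (SSO, networking, shared deployment pipeline)",
    "region": "global service (IAM, Route53, S3 global)",
    "service": "shared dependency (VPC, NAT, DNS, IAM)",
    "operation": "service-level issue (throttling, quota)",
}


def narrow_blast_radius(alarms):
    """Walk the levels top-down and stop at the first one that fans out."""
    if not alarms:
        raise ValueError("need at least one firing alarm")
    scope = {}
    for level in LEVELS:
        values = sorted({alarm[level] for alarm in alarms})
        if len(values) > 1:
            return {"stopped_at": level, "values": values,
                    "suspect": SUSPECTS.get(level), "scope": scope}
        scope[level] = values[0]
    # Every level had a single value: the resource is the investigation target.
    return {"stopped_at": "resource", "values": [scope["resource"]],
            "suspect": None, "scope": scope}


alarms = [
    {"account": "111", "region": "us-east-1", "service": "AWS/Lambda",
     "operation": "payment-processor", "resource": "arn:a"},
    {"account": "111", "region": "us-east-1", "service": "AWS/ECS",
     "operation": "checkout", "resource": "arn:b"},
]
print(narrow_blast_radius(alarms)["suspect"])
```

Here two services in one account and region are alarming, so the walk stops at the service level and points at a shared dependency.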
### Shared Dependency Investigation
When blast radius spans multiple services, investigate in this order:
1. **VPC/Networking** — NAT Gateway ErrorPortAllocation, packet drops, DNS resolution failures
2. **IAM/STS** — ThrottlingException on AssumeRole, token vending latency
3. **Downstream dependency** — shared database, cache, or external API
4. **Deployment pipeline** — simultaneous deploys across services from same pipeline run
5. **AWS service event** — check AWS Health Dashboard and Service Health for the region
---
## Pattern 4: PromQL-Style Metric Query Patterns
These patterns use CloudWatch metric math and GetMetricData to build composite signals, in the spirit of PromQL expressions. Each block below is a YAML rendering of the `MetricDataQueries` request parameter, suitable for dashboards or programmatic retrieval.
### Error Rate as Percentage
```
MetricDataQueries:
  - Id: errors
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Errors
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 60
      Stat: Sum
  - Id: invocations
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Invocations
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 60
      Stat: Sum
  - Id: error_rate
    Expression: "errors / invocations * 100"
    Label: "Error Rate %"
```
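For programmatic retrieval, the same query set can be built as the Python structure that boto3's `get_metric_data` accepts. This is a sketch, not a definitive implementation: the helper name `error_rate_queries` is hypothetical, and setting `ReturnData` to `False` on the two inputs is an added choice so that only the computed percentage is returned.

```python
def error_rate_queries(function_name, period=60):
    """Build MetricDataQueries for a Lambda error-rate percentage."""

    def lambda_sum(query_id, metric_name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            # Inputs feed the expression only; they are not returned themselves.
            "ReturnData": False,
        }

    return [
        lambda_sum("errors", "Errors"),
        lambda_sum("invocations", "Invocations"),
        {
            "Id": "error_rate",
            "Expression": "errors / invocations * 100",
            "Label": "Error Rate %",
        },
    ]


queries = error_rate_queries("payment-processor")
```

The list would then be passed as `MetricDataQueries=queries` alongside `StartTime` and `EndTime` in a `get_metric_data` call on a CloudWatch client.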
### Latency Anomaly Detection (Compare to Baseline)
```
MetricDataQueries:
  - Id: current_p99
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Duration
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 300
      Stat: p99
  - Id: baseline_p99
    MetricStat:
      Metric:
        Namespace: AWS/Lambda
        MetricName: Duration
        Dimensions: [{Name: FunctionName, Value: TARGET}]
      Period: 300
      Stat: p99
    # GetMetricData applies one StartTime/EndTime to the whole request, so fetch
    # this series in a second request covering the same window last week.
  - Id: anomaly_ratio
    # Formula only: evaluate it client-side on the two responses, because within
    # a single request both inputs share the same time range.
    Expression: "current_p99 / baseline_p99"
    Label: "Latency vs Baseline (ratio > 2 = anomaly)"
```
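Because GetMetricData takes a single `StartTime`/`EndTime` per request, one way to obtain the week-old baseline is to issue two calls and divide client-side. The sketch below assumes a boto3 CloudWatch client is passed in, that the window start is aligned to the period so baseline timestamps land exactly seven days earlier, and that results fit in one page (no `NextToken` handling). The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def p99_query(function_name, period=300):
    """One MetricStat query for Lambda p99 Duration."""
    return [{
        "Id": "p99",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": "Duration",
                "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            },
            "Period": period,
            "Stat": "p99",
        },
    }]


def fetch_series(cloudwatch, queries, start, end):
    """Return {timestamp: value} for the first result of one GetMetricData call."""
    response = cloudwatch.get_metric_data(
        MetricDataQueries=queries, StartTime=start, EndTime=end)
    result = response["MetricDataResults"][0]
    return dict(zip(result["Timestamps"], result["Values"]))


def anomaly_ratios(cloudwatch, function_name, start, end, shift=timedelta(days=7)):
    """Ratio of current p99 to the p99 at the same time one `shift` earlier."""
    queries = p99_query(function_name)
    current = fetch_series(cloudwatch, queries, start, end)
    baseline = fetch_series(cloudwatch, queries, start - shift, end - shift)
    ratios = {}
    for timestamp, value in sorted(current.items()):
        base = baseline.get(timestamp - shift)
        if base:  # skip datapoints with a missing or zero baseline
            ratios[timestamp] = value / base
    return ratios
```

Datapoints with a ratio above 2 match the anomaly threshold named in the label above.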
### Throttling Pressure Score
Combine multiple throttling signals into a single pressure metric:
```
MetricDataQueries:
  - Id: lambda_throttles
    MetricStat:
      Metric: {Namespace: AWS/Lambda, MetricName: Throttles}
      Period: 60
      Stat: Sum
  - Id: api_gw_429s
    # 4XXError counts every client error, not only 429 throttles,
    # so treat this term as an upper bound on API Gateway throttling.
    MetricStat:
      Metric: {Namespace: AWS/ApiGateway, MetricName: 4XXError, Dimensions: [{Name: ApiName, Value: TARGET}]}
      Period: 60
      Stat: Sum
  - Id: dynamo_throttles
    MetricStat:
      Metric: {Namespace: AWS/DynamoDB, MetricName: ThrottledRequests, Dimensions: [{Name: TableName, Value: TARGET}]}
      Period: 60
      Stat: Sum
  - Id: throttle_pressure
    Expression: "lambda_throttles + api_gw_429s + dynamo_throttles"
    Label: "Combined Throttle Pressure"
```
### Concurrent Execution Headroom
```
MetricDataQueries:
  - Id: concurrent
    MetricStat:
      Metric: {Namespace: AWS/Lambda, MetricName: ConcurrentExecutions}
      Period: 60
      Stat: Maximum
  - Id: headroom
    # 1000 is the default account-level limit; substitute your account's actual quota.
    Expression: "1000 - concurrent"
    Label: "Remaining Concurrency (account limit 1000)"
```
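The hard-coded limit can be replaced with the account's real quota. A minimal sketch, assuming a boto3 Lambda client (`boto3.client("lambda")`) is passed in; `headroom_queries` is a hypothetical helper name.

```python
def headroom_queries(lambda_client, period=60):
    """Build the headroom queries using the account's actual concurrency limit."""
    settings = lambda_client.get_account_settings()
    limit = settings["AccountLimit"]["ConcurrentExecutions"]
    return [
        {
            "Id": "concurrent",
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda",
                           "MetricName": "ConcurrentExecutions"},
                "Period": period,
                "Stat": "Maximum",
            },
            "ReturnData": False,  # input to the expression only
        },
        {
            "Id": "headroom",
            "Expression": f"{limit} - concurrent",
            "Label": f"Remaining Concurrency (account limit {limit})",
        },
    ]
```

Reading the limit at query time keeps the signal correct after a quota increase, at the cost of one extra API call.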
---
## Pattern 5: Incident Timeline Reconstruction
### Process
Reconstruct a precise timeline by merging data from multiple sources:
1. **Collect timestamps:**
| Source | Query | Yields |
|--------|-------|--------|
| CloudWatch Alarms | Alarm history API | State transition times |
| CloudWatch Metrics | GetMetricData with 1-min period | First anomaly point |
| CloudWatch Logs | Logs Insights with `earliest(@timestamp)` | First error occurrence |
| CloudTrail | LookupEvents filtered by time | Deployment/change events |
| AWS Health | DescribeEvents | AWS-side incidents |
2. **Build the timeline:**
```
fields @timestamp, @message
| filter @message like /ERROR|WARN|timeout|refused|denied/
| stats earliest(@timestamp) as firstSeen, latest(@timestamp) as lastSeen, count(*) as occurrences
by @message
| sort firstSeen asc
| limit 20
```
3. **Identify the sequence:**
```
Timeline:
T-15m: CloudTrail — UpdateFunctionCode by CI/CD role
T-12m: Logs — first error "Connection refused to payments-api.internal"
T-10m: Metrics — Error count crosses 5/min threshold
T-8m: Alarm — PaymentProcessorErrors enters ALARM
T-5m: Metrics — p99 latency spikes to 28s (timeout)
T-0: Current — error rate at 45%, alarm still firing
```
4. **Determine root event** — the earliest change that preceded all symptoms. Walk backward from the first symptom to the most recent mutation (deploy, config change, scaling event, or external dependency shift).
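The merge and the backward walk can be sketched in a few lines. Assumptions: each collected event is a `(timestamp, source, description)` tuple, and only events whose source is listed in `mutation_sources` count as changes. Both conventions are illustrative.

```python
from datetime import datetime, timedelta, timezone


def build_timeline(events, now):
    """Format events oldest-first with offsets relative to `now`."""
    lines = []
    for timestamp, source, description in sorted(events, key=lambda e: e[0]):
        minutes = int((now - timestamp).total_seconds() // 60)
        offset = f"T-{minutes}m" if minutes > 0 else "T-0"
        lines.append(f"{offset}: {source} - {description}")
    return lines


def root_event(events, mutation_sources=("CloudTrail",)):
    """Most recent mutation at or before the first symptom, or None."""
    symptoms = [e for e in events if e[1] not in mutation_sources]
    if not symptoms:
        return None
    first_symptom = min(e[0] for e in symptoms)
    candidates = [e for e in events
                  if e[1] in mutation_sources and e[0] <= first_symptom]
    return max(candidates, key=lambda e: e[0]) if candidates else None


now = datetime(2024, 3, 15, 14, 43, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=8), "Alarm", "PaymentProcessorErrors enters ALARM"),
    (now - timedelta(minutes=15), "CloudTrail", "UpdateFunctionCode by CI/CD role"),
    (now - timedelta(minutes=12), "Logs", "first error: connection refused"),
]
for line in build_timeline(events, now):
    print(line)
print(root_event(events)[2])
```

With the sample events, the deploy at T-15m is the only mutation that precedes the first log error, so it is returned as the root event candidate.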
### Gotchas
- CloudWatch metric timestamps mark the start of the period. A 1-minute datapoint at 14:05 covers 14:05-14:06.
- CloudTrail typically delivers events within about 5 minutes of the API call, but delivery time is not guaranteed and can be longer. Use `eventTime`, not ingestion time.
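- Log events reach CloudWatch only after the agent/SDK flushes its buffer, so the most recent 30-60s may be missing from a query. `@timestamp` is set by the producer, so also allow for clock skew between hosts.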
- Alarm state changes lag the underlying anomaly by roughly period x evaluation periods, because the alarm must see enough breaching datapoints before it transitions. The actual anomaly started earlier.
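The alarm evaluation delay can be turned into a rough lower bound on when the anomaly began. A minimal sketch, assuming a standard-resolution alarm and ignoring metric delivery lag; `earliest_anomaly_start` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def earliest_anomaly_start(alarm_time, period_seconds, evaluation_periods):
    """Subtract the full evaluation window from the alarm transition time."""
    return alarm_time - timedelta(seconds=period_seconds * evaluation_periods)


alarm_time = datetime(2024, 3, 15, 14, 35, tzinfo=timezone.utc)
# A 60-second period evaluated over 3 periods gives a 3-minute window.
print(earliest_anomaly_start(alarm_time, 60, 3))  # 2024-03-15 14:32:00+00:00
```

Use the result as the start of the search window for Logs Insights and metric queries, not as a precise onset time.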