mirror of
https://github.com/github/awesome-copilot.git
synced 2026-06-22 07:27:39 +00:00
334 lines
10 KiB
Markdown
334 lines
10 KiB
Markdown
---
|
|
name: AWS CloudWatch Investigation
|
|
description: >
|
|
Reusable investigation patterns for AWS CloudWatch: Logs Insights query templates,
|
|
alarm-to-deployment correlation, blast-radius narrowing decision tree, and
|
|
PromQL-style metric query patterns for structured incident triage.
|
|
---
|
|
|
|
# AWS CloudWatch Investigation Skill
|
|
|
|
Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.
|
|
|
|
---
|
|
|
|
## Pattern 1: Logs Insights Query Templates
|
|
|
|
### Error Spike Detection
|
|
|
|
Find the top errors in a time window, grouped by error type:
|
|
|
|
```
|
|
fields @timestamp, @message, @logStream
|
|
| filter @message like /(?i)(error|exception|fatal|critical)/
|
|
| stats count(*) as errorCount by bin(5m), @logStream
|
|
| sort errorCount desc
|
|
| limit 20
|
|
```
|
|
|
|
### P99 Latency Breakdown by Operation
|
|
|
|
Identify which operations are driving latency spikes:
|
|
|
|
```
|
|
fields @timestamp, @duration, operation
|
|
| filter ispresent(@duration)
|
|
| stats avg(@duration) as avgMs,
|
|
pct(@duration, 50) as p50Ms,
|
|
pct(@duration, 95) as p95Ms,
|
|
pct(@duration, 99) as p99Ms,
|
|
count(*) as invocations
|
|
by operation
|
|
| sort p99Ms desc
|
|
| limit 15
|
|
```
|
|
|
|
### Lambda Cold Start Detection
|
|
|
|
Quantify cold start impact during an incident:
|
|
|
|
```
|
|
fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
|
|
| filter ispresent(@initDuration)
|
|
| stats count(*) as coldStarts,
|
|
avg(@initDuration) as avgInitMs,
|
|
max(@initDuration) as maxInitMs,
|
|
avg(@duration) as avgDurationMs
|
|
by bin(5m)
|
|
| sort @timestamp desc
|
|
```
|
|
|
|
### Out-of-Memory (OOM) Detection
|
|
|
|
Find Lambda functions or containers killed by memory pressure:
|
|
|
|
```
|
|
fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
|
|
| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
|
|
| stats count(*) as oomEvents by @logStream, bin(10m)
|
|
| sort oomEvents desc
|
|
| limit 10
|
|
```
|
|
|
|
For memory utilization trending before OOM:
|
|
|
|
```
|
|
fields @timestamp, @maxMemoryUsed, @memorySize
|
|
| filter ispresent(@maxMemoryUsed)
|
|
| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
|
|
avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
|
|
by bin(5m)
|
|
| sort @timestamp desc
|
|
```
|
|
|
|
### Timeout Detection
|
|
|
|
Find invocations that hit the configured timeout:
|
|
|
|
```
|
|
fields @timestamp, @duration, @logStream, @requestId
|
|
| filter @message like /Task timed out/ or @duration > 28000
|
|
| stats count(*) as timeouts by @logStream, bin(5m)
|
|
| sort timeouts desc
|
|
```
|
|
|
|
---
|
|
|
|
## Pattern 2: Alarm History to Deploy-Event Correlation
|
|
|
|
### Process
|
|
|
|
1. **Get alarm transition time** — note the exact timestamp when the alarm entered ALARM state.
|
|
2. **Query CloudTrail** for deployment-related events in a window of [alarm_time - 30min, alarm_time]:
|
|
|
|
```
|
|
# CloudTrail Lake query for deployment events
|
|
SELECT eventTime, eventName, userIdentity.arn, requestParameters
|
|
FROM <event-data-store-id>
|
|
WHERE eventTime > '<alarm_time_minus_30m>'
|
|
AND eventTime < '<alarm_time>'
|
|
AND eventName IN (
|
|
'UpdateFunctionCode', 'UpdateFunctionConfiguration',
|
|
'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
|
|
'CreateChangeSet', 'ExecuteChangeSet',
|
|
'StartPipelineExecution', 'PutImage'
|
|
)
|
|
ORDER BY eventTime DESC
|
|
```
|
|
|
|
3. **Correlation criteria** — a deploy is "correlated" if:
|
|
- It targets the same service/resource as the alarm
|
|
- It completed within 15 minutes before the alarm transition
|
|
- The deployer identity matches a CI/CD role (not a human applying a hotfix)
|
|
|
|
4. **Strengthening the correlation:**
|
|
- Check if the same alarm was healthy in the previous deployment cycle
|
|
- Verify no other environmental changes (scaling events, config changes) in the same window
|
|
- Look for canary/synthetic monitor failures that started at the same time
|
|
|
|
### Output Format
|
|
|
|
```
|
|
Deploy Correlation:
|
|
Event: UpdateFunctionCode
|
|
Time: 2024-03-15T14:23:07Z (12 min before alarm)
|
|
Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
|
|
Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
|
|
Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
|
|
```
|
|
|
|
---
|
|
|
|
## Pattern 3: Narrow the Blast Radius Decision Tree
|
|
|
|
Use this tree to systematically scope an incident from broadest to most specific:
|
|
|
|
```
|
|
START
|
|
|
|
|
v
|
|
[1] ACCOUNT — Which account(s) show the alarm?
|
|
| - Check: Are alarms firing in multiple accounts?
|
|
| - If yes → suspect shared service (SSO, networking, shared deployment pipeline)
|
|
| - If no → proceed to Region
|
|
v
|
|
[2] REGION — Which region(s) are affected?
|
|
| - Check: Same alarm in other regions?
|
|
| - If multi-region → suspect global service (IAM, Route53, S3 global)
|
|
| - If single-region → proceed to Service
|
|
v
|
|
[3] SERVICE — Which service namespace shows degradation?
|
|
| - Check CloudWatch namespace: AWS/Lambda, AWS/ECS, AWS/ApiGateway, etc.
|
|
| - If multiple services → suspect shared dependency (VPC, NAT, DNS, IAM)
|
|
| - If single service → proceed to Operation
|
|
v
|
|
[4] OPERATION — Which API action or function is failing?
|
|
| - For Lambda: which function name?
|
|
| - For ECS: which service/task definition?
|
|
| - For API GW: which stage/resource/method?
|
|
| - If all operations → suspect service-level issue (throttling, quota)
|
|
| - If specific operation → proceed to Resource
|
|
v
|
|
[5] RESOURCE — Which specific resource instance?
|
|
- Function ARN, Task ID, DB instance identifier
|
|
- This is your investigation target
|
|
- Proceed to log and trace analysis scoped to this resource
|
|
```
|
|
|
|
### Shared Dependency Investigation
|
|
|
|
When blast radius spans multiple services, investigate in this order:
|
|
|
|
1. **VPC/Networking** — NAT Gateway ErrorPortAllocation, packet drops, DNS resolution failures
|
|
2. **IAM/STS** — ThrottlingException on AssumeRole, token vending latency
|
|
3. **Downstream dependency** — shared database, cache, or external API
|
|
4. **Deployment pipeline** — simultaneous deploys across services from same pipeline run
|
|
5. **AWS service event** — check AWS Health Dashboard and Service Health for the region
|
|
|
|
---
|
|
|
|
## Pattern 4: PromQL-Style Metric Query Patterns
|
|
|
|
These patterns use CloudWatch metric math and GetMetricData to build composite signals. Express them as metric queries for dashboards or programmatic retrieval.
|
|
|
|
### Error Rate as Percentage
|
|
|
|
```
|
|
MetricDataQueries:
|
|
- Id: errors
|
|
MetricStat:
|
|
Metric:
|
|
Namespace: AWS/Lambda
|
|
MetricName: Errors
|
|
Dimensions: [{Name: FunctionName, Value: TARGET}]
|
|
Period: 60
|
|
Stat: Sum
|
|
- Id: invocations
|
|
MetricStat:
|
|
Metric:
|
|
Namespace: AWS/Lambda
|
|
MetricName: Invocations
|
|
Dimensions: [{Name: FunctionName, Value: TARGET}]
|
|
Period: 60
|
|
Stat: Sum
|
|
- Id: error_rate
|
|
Expression: "errors / invocations * 100"
|
|
Label: "Error Rate %"
|
|
```
|
|
|
|
### Latency Anomaly Detection (Compare to Baseline)
|
|
|
|
```
|
|
MetricDataQueries:
|
|
- Id: current_p99
|
|
MetricStat:
|
|
Metric:
|
|
Namespace: AWS/Lambda
|
|
MetricName: Duration
|
|
Dimensions: [{Name: FunctionName, Value: TARGET}]
|
|
Period: 300
|
|
Stat: p99
|
|
- Id: baseline_p99
|
|
MetricStat:
|
|
Metric:
|
|
Namespace: AWS/Lambda
|
|
MetricName: Duration
|
|
Dimensions: [{Name: FunctionName, Value: TARGET}]
|
|
Period: 300
|
|
Stat: p99
|
|
# Use StartTime/EndTime set to same window last week
|
|
- Id: anomaly_ratio
|
|
Expression: "current_p99 / baseline_p99"
|
|
Label: "Latency vs Baseline (ratio > 2 = anomaly)"
|
|
```
|
|
|
|
### Throttling Pressure Score
|
|
|
|
Combine multiple throttling signals into a single pressure metric:
|
|
|
|
```
|
|
MetricDataQueries:
|
|
- Id: lambda_throttles
|
|
MetricStat:
|
|
Metric: {Namespace: AWS/Lambda, MetricName: Throttles}
|
|
Period: 60
|
|
Stat: Sum
|
|
- Id: api_gw_429s
|
|
MetricStat:
|
|
Metric: {Namespace: AWS/ApiGateway, MetricName: 4XXError, Dimensions: [{Name: ApiName, Value: TARGET}]}
|
|
Period: 60
|
|
Stat: Sum
|
|
- Id: dynamo_throttles
|
|
MetricStat:
|
|
Metric: {Namespace: AWS/DynamoDB, MetricName: ThrottledRequests, Dimensions: [{Name: TableName, Value: TARGET}]}
|
|
Period: 60
|
|
Stat: Sum
|
|
- Id: throttle_pressure
|
|
Expression: "lambda_throttles + api_gw_429s + dynamo_throttles"
|
|
Label: "Combined Throttle Pressure"
|
|
```
|
|
|
|
### Concurrent Execution Headroom
|
|
|
|
```
|
|
MetricDataQueries:
|
|
- Id: concurrent
|
|
MetricStat:
|
|
Metric: {Namespace: AWS/Lambda, MetricName: ConcurrentExecutions}
|
|
Period: 60
|
|
Stat: Maximum
|
|
- Id: headroom
|
|
Expression: "1000 - concurrent"
|
|
Label: "Remaining Concurrency (account limit 1000)"
|
|
```
|
|
|
|
---
|
|
|
|
## Pattern 5: Incident Timeline Reconstruction
|
|
|
|
### Process
|
|
|
|
Reconstruct a precise timeline by merging data from multiple sources:
|
|
|
|
1. **Collect timestamps:**
|
|
|
|
| Source | Query | Yields |
|
|
|--------|-------|--------|
|
|
| CloudWatch Alarms | Alarm history API | State transition times |
|
|
| CloudWatch Metrics | GetMetricData with 1-min period | First anomaly point |
|
|
| CloudWatch Logs | Logs Insights with `earliest(@timestamp)` | First error occurrence |
|
|
| CloudTrail | LookupEvents filtered by time | Deployment/change events |
|
|
| AWS Health | DescribeEvents | AWS-side incidents |
|
|
|
|
2. **Build the timeline:**
|
|
|
|
```
|
|
fields @timestamp, @message
|
|
| filter @message like /ERROR|WARN|timeout|refused|denied/
|
|
| stats earliest(@timestamp) as firstSeen, latest(@timestamp) as lastSeen, count(*) as occurrences
|
|
by @message
|
|
| sort firstSeen asc
|
|
| limit 20
|
|
```
|
|
|
|
3. **Identify the sequence:**
|
|
|
|
```
|
|
Timeline:
|
|
T-15m: CloudTrail — UpdateFunctionCode by CI/CD role
|
|
T-12m: Logs — first error "Connection refused to payments-api.internal"
|
|
T-10m: Metrics — Error count crosses 5/min threshold
|
|
T-8m: Alarm — PaymentProcessorErrors enters ALARM
|
|
T-5m: Metrics — p99 latency spikes to 28s (timeout)
|
|
T-0: Current — error rate at 45%, alarm still firing
|
|
```
|
|
|
|
4. **Determine root event** — the earliest change that preceded all symptoms. Walk backward from the first symptom to the most recent mutation (deploy, config change, scaling event, or external dependency shift).
|
|
|
|
### Gotchas
|
|
|
|
- CloudWatch metric timestamps are end-of-period. A 1-minute datapoint at 14:05 covers 14:04-14:05.
|
|
- CloudTrail events can have up to 15-minute delivery delay. Use `eventTime`, not ingestion time.
|
|
- Log group timestamps depend on the agent/SDK flush interval. Allow for 30-60s of clock skew.
|
|
- Alarm state changes have a built-in evaluation delay (periods x evaluation periods). The actual anomaly started earlier.
|