chore: publish from staged

2026-06-22 15:37:44 +00:00 · 2026-06-22 01:34:47 +00:00
parent b5eb0d2306
commit 223acfc274
4 changed files with 453 additions and 0 deletions
@@ -0,0 +1,333 @@
+---
+name: AWS CloudWatch Investigation
+description: >
+  Reusable investigation patterns for AWS CloudWatch: Logs Insights query templates,
+  alarm-to-deployment correlation, blast-radius narrowing decision tree, and
+  PromQL-style metric query patterns for structured incident triage.
+---
+
+# AWS CloudWatch Investigation Skill
+
+Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.
+
+---
+
+## Pattern 1: Logs Insights Query Templates
+
+### Error Spike Detection
+
+Find the top errors in a time window, grouped by error type:
+
+```
+fields @timestamp, @message, @logStream
+| filter @message like /(?i)(error|exception|fatal|critical)/
+| stats count(*) as errorCount by bin(5m), @logStream
+| sort errorCount desc
+| limit 20
+```
+
+### P99 Latency Breakdown by Operation
+
+Identify which operations are driving latency spikes:
+
+```
+fields @timestamp, @duration, operation
+| filter ispresent(@duration)
+| stats avg(@duration) as avgMs,
+        pct(@duration, 50) as p50Ms,
+        pct(@duration, 95) as p95Ms,
+        pct(@duration, 99) as p99Ms,
+        count(*) as invocations
+  by operation
+| sort p99Ms desc
+| limit 15
+```
+
+### Lambda Cold Start Detection
+
+Quantify cold start impact during an incident:
+
+```
+fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
+| filter ispresent(@initDuration)
+| stats count(*) as coldStarts,
+        avg(@initDuration) as avgInitMs,
+        max(@initDuration) as maxInitMs,
+        avg(@duration) as avgDurationMs
+  by bin(5m)
+| sort @timestamp desc
+```
+
+### Out-of-Memory (OOM) Detection
+
+Find Lambda functions or containers killed by memory pressure:
+
+```
+fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
+| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
+| stats count(*) as oomEvents by @logStream, bin(10m)
+| sort oomEvents desc
+| limit 10
+```
+
+For memory utilization trending before OOM:
+
+```
+fields @timestamp, @maxMemoryUsed, @memorySize
+| filter ispresent(@maxMemoryUsed)
+| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
+        avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
+  by bin(5m)
+| sort @timestamp desc
+```
+
+### Timeout Detection
+
+Find invocations that hit the configured timeout:
+
+```
+fields @timestamp, @duration, @logStream, @requestId
+| filter @message like /Task timed out/ or @duration > 28000
+| stats count(*) as timeouts by @logStream, bin(5m)
+| sort timeouts desc
+```
+
+---
+
+## Pattern 2: Alarm History to Deploy-Event Correlation
+
+### Process
+
+1. **Get alarm transition time** — note the exact timestamp when the alarm entered ALARM state.
+2. **Query CloudTrail** for deployment-related events in a window of [alarm_time - 30min, alarm_time]:
+
+```
+# CloudTrail Lake query for deployment events
+SELECT eventTime, eventName, userIdentity.arn, requestParameters
+FROM <event-data-store-id>
+WHERE eventTime > '<alarm_time_minus_30m>'
+  AND eventTime < '<alarm_time>'
+  AND eventName IN (
+    'UpdateFunctionCode', 'UpdateFunctionConfiguration',
+    'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
+    'CreateChangeSet', 'ExecuteChangeSet',
+    'StartPipelineExecution', 'PutImage'
+  )
+ORDER BY eventTime DESC
+```
+
+3. **Correlation criteria** — a deploy is "correlated" if:
+   - It targets the same service/resource as the alarm
+   - It completed within 15 minutes before the alarm transition
+   - The deployer identity matches a CI/CD role (not a human applying a hotfix)
+
+4. **Strengthening the correlation:**
+   - Check if the same alarm was healthy in the previous deployment cycle
+   - Verify no other environmental changes (scaling events, config changes) in the same window
+   - Look for canary/synthetic monitor failures that started at the same time
+
+### Output Format
+
+```
+Deploy Correlation:
+  Event: UpdateFunctionCode
+  Time: 2024-03-15T14:23:07Z (12 min before alarm)
+  Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
+  Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
+  Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
+```
+
+---
+
+## Pattern 3: Narrow the Blast Radius Decision Tree
+
+Use this tree to systematically scope an incident from broadest to most specific:
+
+```
+START
+  |
+  v
+[1] ACCOUNT — Which account(s) show the alarm?
+  |  - Check: Are alarms firing in multiple accounts?
+  |  - If yes → suspect shared service (SSO, networking, shared deployment pipeline)
+  |  - If no → proceed to Region
+  v
+[2] REGION — Which region(s) are affected?
+  |  - Check: Same alarm in other regions?
+  |  - If multi-region → suspect global service (IAM, Route53, S3 global)
+  |  - If single-region → proceed to Service
+  v
+[3] SERVICE — Which service namespace shows degradation?
+  |  - Check CloudWatch namespace: AWS/Lambda, AWS/ECS, AWS/ApiGateway, etc.
+  |  - If multiple services → suspect shared dependency (VPC, NAT, DNS, IAM)
+  |  - If single service → proceed to Operation
+  v
+[4] OPERATION — Which API action or function is failing?
+  |  - For Lambda: which function name?
+  |  - For ECS: which service/task definition?
+  |  - For API GW: which stage/resource/method?
+  |  - If all operations → suspect service-level issue (throttling, quota)
+  |  - If specific operation → proceed to Resource
+  v
+[5] RESOURCE — Which specific resource instance?
+     - Function ARN, Task ID, DB instance identifier
+     - This is your investigation target
+     - Proceed to log and trace analysis scoped to this resource
+```
+
+### Shared Dependency Investigation
+
+When blast radius spans multiple services, investigate in this order:
+
+1. **VPC/Networking** — NAT Gateway ErrorPortAllocation, packet drops, DNS resolution failures
+2. **IAM/STS** — ThrottlingException on AssumeRole, token vending latency
+3. **Downstream dependency** — shared database, cache, or external API
+4. **Deployment pipeline** — simultaneous deploys across services from same pipeline run
+5. **AWS service event** — check AWS Health Dashboard and Service Health for the region
+
+---
+
+## Pattern 4: PromQL-Style Metric Query Patterns
+
+These patterns use CloudWatch metric math and GetMetricData to build composite signals. Express them as metric queries for dashboards or programmatic retrieval.
+
+### Error Rate as Percentage
+
+```
+MetricDataQueries:
+  - Id: errors
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Errors
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 60
+      Stat: Sum
+  - Id: invocations
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Invocations
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 60
+      Stat: Sum
+  - Id: error_rate
+    Expression: "errors / invocations * 100"
+    Label: "Error Rate %"
+```
+
+### Latency Anomaly Detection (Compare to Baseline)
+
+```
+MetricDataQueries:
+  - Id: current_p99
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Duration
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 300
+      Stat: p99
+  - Id: baseline_p99
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Duration
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 300
+      Stat: p99
+    # Use StartTime/EndTime set to same window last week
+  - Id: anomaly_ratio
+    Expression: "current_p99 / baseline_p99"
+    Label: "Latency vs Baseline (ratio > 2 = anomaly)"
+```
+
+### Throttling Pressure Score
+
+Combine multiple throttling signals into a single pressure metric:
+
+```
+MetricDataQueries:
+  - Id: lambda_throttles
+    MetricStat:
+      Metric: {Namespace: AWS/Lambda, MetricName: Throttles}
+      Period: 60
+      Stat: Sum
+  - Id: api_gw_429s
+    MetricStat:
+      Metric: {Namespace: AWS/ApiGateway, MetricName: 4XXError, Dimensions: [{Name: ApiName, Value: TARGET}]}
+      Period: 60
+      Stat: Sum
+  - Id: dynamo_throttles
+    MetricStat:
+      Metric: {Namespace: AWS/DynamoDB, MetricName: ThrottledRequests, Dimensions: [{Name: TableName, Value: TARGET}]}
+      Period: 60
+      Stat: Sum
+  - Id: throttle_pressure
+    Expression: "lambda_throttles + api_gw_429s + dynamo_throttles"
+    Label: "Combined Throttle Pressure"
+```
+
+### Concurrent Execution Headroom
+
+```
+MetricDataQueries:
+  - Id: concurrent
+    MetricStat:
+      Metric: {Namespace: AWS/Lambda, MetricName: ConcurrentExecutions}
+      Period: 60
+      Stat: Maximum
+  - Id: headroom
+    Expression: "1000 - concurrent"
+    Label: "Remaining Concurrency (account limit 1000)"
+```
+
+---
+
+## Pattern 5: Incident Timeline Reconstruction
+
+### Process
+
+Reconstruct a precise timeline by merging data from multiple sources:
+
+1. **Collect timestamps:**
+
+| Source | Query | Yields |
+|--------|-------|--------|
+| CloudWatch Alarms | Alarm history API | State transition times |
+| CloudWatch Metrics | GetMetricData with 1-min period | First anomaly point |
+| CloudWatch Logs | Logs Insights with `earliest(@timestamp)` | First error occurrence |
+| CloudTrail | LookupEvents filtered by time | Deployment/change events |
+| AWS Health | DescribeEvents | AWS-side incidents |
+
+2. **Build the timeline:**
+
+```
+fields @timestamp, @message
+| filter @message like /ERROR|WARN|timeout|refused|denied/
+| stats earliest(@timestamp) as firstSeen, latest(@timestamp) as lastSeen, count(*) as occurrences
+  by @message
+| sort firstSeen asc
+| limit 20
+```
+
+3. **Identify the sequence:**
+
+```
+Timeline:
+  T-15m: CloudTrail — UpdateFunctionCode by CI/CD role
+  T-12m: Logs — first error "Connection refused to payments-api.internal"
+  T-10m: Metrics — Error count crosses 5/min threshold
+  T-8m:  Alarm — PaymentProcessorErrors enters ALARM
+  T-5m:  Metrics — p99 latency spikes to 28s (timeout)
+  T-0:   Current — error rate at 45%, alarm still firing
+```
+
+4. **Determine root event** — the earliest change that preceded all symptoms. Walk backward from the first symptom to the most recent mutation (deploy, config change, scaling event, or external dependency shift).
+
+### Gotchas
+
+- CloudWatch metric timestamps are end-of-period. A 1-minute datapoint at 14:05 covers 14:04-14:05.
+- CloudTrail events can have up to 15-minute delivery delay. Use `eventTime`, not ingestion time.
+- Log group timestamps depend on the agent/SDK flush interval. Allow for 30-60s of clock skew.
+- Alarm state changes have a built-in evaluation delay (periods x evaluation periods). The actual anomaly started earlier.