CloudWatch Logs Insights: 8 Queries I Reuse in Production
- Shameem Abdul Salam
- AWS, Observability
- July 5, 2026
This post is for teams already shipping workloads to AWS who reach for the CloudWatch console when something breaks—but re-type the same Logs Insights query from memory every time. It is not a replacement for APM or full distributed tracing, and it assumes your apps write structured or semi-structured logs to CloudWatch.
If you have ever stared at a flat metric graph while users report errors, these queries are a toolkit for the first fifteen minutes, not the whole observability strategy.
When to reach for Logs Insights
| Signal | Reach for |
|---|---|
| “Errors spiked after deploy” | Logs Insights on app + ALB access logs |
| “ECS tasks keep stopping” | Logs Insights on container logs + ECS event patterns |
| “Is it slow or failing?” | Tail latency query on request duration fields |
| “Steady-state capacity planning” | CloudWatch metrics and dashboards |
| “Cross-service trace of one request” | X-Ray, OpenTelemetry, or your APM |
Logs Insights is best for pattern search, aggregation, and ad-hoc correlation when you know roughly which log group and time window matter.
Setup tips
- Pick a consistent time field: `@timestamp` is the default; some apps log epoch ms in `@message`.
- Parse once in the query (`parse @message /.../`) rather than hoping `like` catches every variant.
- Save working queries in the console; name them for incidents (`incident-top-errors-1h`).
- Link saved queries from runbooks so they are not buried in someone’s browser history.
Replace YOUR_LOG_GROUP and field names with what your stack actually emits.
Query 1 — Top error messages (last hour)
Find the dominant failure strings fast.
fields @timestamp, @message
| filter @message like /(?i)(error|exception|fatal|failed)/
| stats count() as hits by @message
| sort hits desc
| limit 20
Use when alerts fire but the metric does not say which error.
Query 2 — Error rate over 5-minute bins
See if the spike aligns with a deploy or cron.
fields @timestamp, @message
| filter @message like /(?i)(error|exception)/
| stats count() as errors by bin(5m)
| sort bin(5m) desc
Overlay deploy timestamps manually at first; later, annotate from CI.
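To make the manual overlay concrete, here is a minimal Python sketch of what a 5-minute bin amounts to: each timestamp is floored to the start of its window, then counted. The error timestamps and the deploy time are hypothetical, and the sketch assumes bins align to UTC epoch boundaries, which is how `bin(5m)` results typically appear.

```python
from collections import Counter
from datetime import datetime, timezone

BIN_SECONDS = 5 * 60


def bin_start(ts):
    """Floor a timezone-aware datetime to the start of its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BIN_SECONDS, tz=timezone.utc)


# Hypothetical error timestamps and deploy time, for illustration only.
errors = [
    datetime(2026, 7, 5, 10, 1, 12, tzinfo=timezone.utc),
    datetime(2026, 7, 5, 10, 6, 3, tzinfo=timezone.utc),
    datetime(2026, 7, 5, 10, 7, 45, tzinfo=timezone.utc),
    datetime(2026, 7, 5, 10, 8, 30, tzinfo=timezone.utc),
    datetime(2026, 7, 5, 10, 12, 0, tzinfo=timezone.utc),
]
deploy = datetime(2026, 7, 5, 10, 4, 0, tzinfo=timezone.utc)

counts = Counter(bin_start(ts) for ts in errors)
deploy_window = bin_start(deploy)

# Newest window first, mirroring the descending sort in the query.
for window in sorted(counts, reverse=True):
    marker = "  <- deploy landed in this window" if window == deploy_window else ""
    print(window.strftime("%H:%M"), counts[window], marker)
```

In this made-up data the busiest window is the one right after the deploy window, which is the shape you are looking for when lining up the query output against release times.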
Query 3 — ALB 5xx after a deploy
Requires ALB access logs in a dedicated log group. ALB delivers access logs to S3, so this assumes you already forward them to CloudWatch Logs. Adjust status field parsing if your format differs.
fields @timestamp, @message
| parse @message "* * * * * * * * * * * * \"*\" \"*\" *" as
type, time, elb, client, target, request_processing, target_processing,
response_processing, elb_status, target_status, received_bytes, sent_bytes,
request, user_agent, rest
| filter elb_status >= 500 or target_status >= 500
| stats count() as five_xx by bin(5m), target_status
| sort bin(5m) desc
If parsing is painful, use `| filter @message like / 5[0-9]{2} /` as a crude first pass.
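Before adjusting a positional parse pattern, it can help to check locally which position each field occupies. The Python sketch below splits one line with `shlex`, which keeps quoted fields together. The sample line is illustrative (values are made up, laid out in the documented ALB field order), and the `user_agent` name is simply a label for the quoted field that follows the request.

```python
import shlex

# Illustrative line in the ALB access log layout; all values are made up.
SAMPLE = (
    'http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 502 502 34 366 '
    '"GET http://www.example.com:80/checkout HTTP/1.1" "curl/7.46.0" - - '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337262-36d228ad5d99923122bbe354" "-" "-"'
)

# Names for the leading fields, in the order a positional parse would assign them.
LEADING_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing", "target_processing", "response_processing",
    "elb_status", "target_status", "received_bytes", "sent_bytes",
    "request", "user_agent",
]


def split_alb_line(line):
    """Split one access log line on spaces, keeping quoted fields intact."""
    tokens = shlex.split(line)
    # zip stops at the shorter sequence, so trailing fields are ignored here.
    return dict(zip(LEADING_FIELDS, tokens))


fields = split_alb_line(SAMPLE)
for name in ("elb_status", "target_status", "request", "user_agent"):
    print(f"{name}: {fields[name]}")
```

Twelve space-separated fields come before the first quoted one, so the status codes sit in positions nine and ten. If your forwarder prepends anything to each line, the positions shift and the pattern needs the same shift.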
Query 4 — Recent 5xx with request path
fields @timestamp, @message
| filter @message like / 5[0-9]{2} /
| parse @message "* * * * * * * * * * * * \"* * *\" *" as
f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, method, path, protocol, rest
| display @timestamp, method, path, @message
| sort @timestamp desc
| limit 50
Correlate path spikes with a bad release or upstream dependency.
Query 5 — ECS task stop reason (container logs)
Many teams log ECS stop reasons from EventBridge to a log group; if not, search container stderr for OOM and exit codes.
fields @timestamp, @message
| filter @message like /(?i)(oom|out of memory|exit code|essential container|stopped|task failed)/
| stats count() as stops by @message
| sort stops desc
| limit 15
Pair with ECS console “stopped reason” when logs are thin.
Query 6 — OOM and memory hints
fields @timestamp, @message
| filter @message like /(?i)(oom|killed process|cannot allocate memory|java\.lang\.OutOfMemoryError)/
| sort @timestamp desc
| limit 30
Often explains “task ran fine in dev” when prod memory limits differ.
Query 7 — Slow request tail
Assumes your app logs duration in ms (adjust field name).
fields @timestamp, @message
| parse @message "duration_ms=* " as duration
| filter duration > 2000
| stats count() as slow_hits, avg(duration) as avg_ms, max(duration) as max_ms by bin(5m)
| sort bin(5m) desc
If duration is embedded differently, use a regex parse matching your format.
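One low-risk way to settle on a regex is to try it against a few sample lines locally before pasting it into the console. The Python sketch below does that with hypothetical log lines; the `duration_ms` key and the three formats are assumptions, not something your stack necessarily emits. Note that Python spells named groups `(?P<duration>...)`, while a Logs Insights regex parse uses `(?<duration>...)`, so the group syntax needs adjusting when you port the pattern.

```python
import re

# Hypothetical log lines showing a few ways a duration might be embedded.
SAMPLES = [
    "GET /orders 200 duration_ms=2450 user=42",
    "GET /orders 200 duration_ms=87",  # value at end of line, no trailing space
    '{"path": "/orders", "duration_ms": 3100}',  # JSON-style key
]

# Accepts key=value and "key": value; captures integer or decimal milliseconds.
DURATION_RE = re.compile(r'duration_ms"?\s*[=:]\s*(?P<duration>\d+(?:\.\d+)?)')


def extract_duration_ms(line):
    """Return the duration in ms as a float, or None if the line has none."""
    match = DURATION_RE.search(line)
    return float(match.group("duration")) if match else None


durations = [extract_duration_ms(line) for line in SAMPLES]
slow = [d for d in durations if d is not None and d > 2000]

print(durations)  # [2450.0, 87.0, 3100.0]
print(slow)       # [2450.0, 3100.0]
```

The second sample is the case a glob pattern with a trailing space would miss, because nothing follows the value.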
Query 8 — User-agent outliers (bot traffic)
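Useful when cache or rate limits behave oddly after launch. The pattern assumes the ALB access log layout, where the user agent is the second quoted field; adjust it for other formats.
fields @timestamp, @message
| parse @message "* \"*\" \"*\" *" as prefix, request, ua, suffix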
| filter ispresent(ua)
| stats count() as hits by ua
| sort hits desc
| limit 25
Sudden crawler spikes can look like application errors in aggregate metrics.
Saving queries and runbooks
In the Logs Insights console: run query → Save → name with purpose and log group (prod-api-errors-1h). In runbooks, link directly to the saved query URL and note which log groups and which time window to start with.
For on-call: one page titled “First 15 minutes” with queries 1, 2, and 5 beats a forty-page monitoring doc nobody opens.
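If your runbook lives next to scripts, the same queries can also be run outside the console. The sketch below is one possible shape, not a definitive implementation: it assumes `client` behaves like `boto3.client("logs")`, using its `start_query` and `get_query_results` calls, and `run_insights_query` and `TOP_ERRORS` are names invented here for illustration.

```python
import time


def run_insights_query(client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it reaches a terminal status.

    `client` is assumed to behave like boto3.client("logs").
    `start` and `end` are epoch seconds.
    """
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells; flatten to dicts.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete after {max_polls} polls")


# Query 1 from this post, kept as a constant so the runbook script and the
# saved console query stay in sync.
TOP_ERRORS = """fields @timestamp, @message
| filter @message like /(?i)(error|exception|fatal|failed)/
| stats count() as hits by @message
| sort hits desc
| limit 20"""
```

With boto3 installed and credentials configured, a call might look like `run_insights_query(boto3.client("logs"), "YOUR_LOG_GROUP", TOP_ERRORS, time.time() - 3600, time.time())`. Passing the client in keeps the function easy to exercise with a stub.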
What’s next
Not every workload is a long-running ECS service behind an ALB. When jobs are finite, batch-shaped, and compute-heavy—genomics pipelines, ETL, simulations—AWS Batch on EC2 is often the right fit. Next: a deployment checklist for Batch covering Terraform, IAM, KMS, and the RUNNABLE jobs that never start.
If you only do one thing: save one “first 15 minutes of an incident” query and pin it in your team’s runbook.