CloudWatch Logs Insights: 8 Queries I Reuse in Production

This post is for teams already shipping workloads to AWS who reach for the CloudWatch console when something breaks—but re-type the same Logs Insights query from memory every time. It is not a replacement for APM or full distributed tracing, and it assumes your apps write structured or semi-structured logs to CloudWatch.

If you have ever stared at a flat metric graph while users report errors, these queries are a first-fifteen-minutes toolkit, not a whole observability strategy.

When to reach for Logs Insights

Signal	Reach for
“Errors spiked after deploy”	Logs Insights on app + ALB access logs
“ECS tasks keep stopping”	Logs Insights on container logs + ECS event patterns
“Is it slow or failing?”	Tail latency query on request duration fields
“Steady-state capacity planning”	CloudWatch metrics and dashboards
“Cross-service trace of one request”	X-Ray, OpenTelemetry, or your APM

Logs Insights is best for pattern search, aggregation, and ad-hoc correlation when you know roughly which log group and time window matter.

Setup tips

Pick a consistent time field: @timestamp is the default, but some apps log epoch milliseconds inside @message.
Parse once in the query (parse @message /.../) rather than hoping a like filter catches every variant.
Save working queries in the console; name them for incidents (incident-top-errors-1h).
Link saved queries from runbooks so they are not buried in someone’s browser history.

Replace YOUR_LOG_GROUP and field names with what your stack actually emits.
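
Once a query works in the console, the same query string can be run from a script. The following is a minimal sketch, assuming `client` behaves like `boto3.client("logs")` and that polling once per second is acceptable; the helper name `run_insights_query` and its parameters are illustrative, not part of any AWS SDK.

```python
import time


def run_insights_query(client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and return its rows as a list of dicts.

    `client` is assumed to expose start_query / get_query_results the way
    boto3's CloudWatch Logs client does.
    """
    started = client.start_query(
        logGroupName=log_group,
        startTime=int(start_epoch),  # epoch seconds
        endTime=int(end_epoch),      # epoch seconds
        queryString=query,
    )
    query_id = started["queryId"]

    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row arrives as a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")
```

Values come back as strings, so convert counts and durations before doing arithmetic on them.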

Query 1 — Top error messages (last hour)

Find the dominant failure strings fast.

fields @timestamp, @message
| filter @message like /(?i)(error|exception|fatal|failed)/
| stats count() as hits by @message
| sort hits desc
| limit 20

Use when alerts fire but the metric does not say which error.

Query 2 — Error rate over 5-minute bins

See if the spike aligns with a deploy or cron.

fields @timestamp, @message
| filter @message like /(?i)(error|exception)/
| stats count() as errors by bin(5m)
| sort bin(5m) desc

Overlay deploy timestamps manually at first; later, annotate from CI.
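
The manual overlay can be approximated in a few lines once the binned counts are exported. This is a sketch under stated assumptions: the bin timestamps are taken to look like `2024-05-01 12:05:00.000` (check what your results actually contain), and the helper names `parse_bin` and `bins_near_deploys` are hypothetical.

```python
from datetime import datetime, timedelta


def parse_bin(value, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Convert a bin timestamp string to a datetime (format is an assumption)."""
    return datetime.strptime(value, fmt)


def bins_near_deploys(bins, deploys, bin_width=timedelta(minutes=5),
                      window=timedelta(minutes=15)):
    """Return (deploy, bin_start, errors) for bins overlapping the window after a deploy.

    `bins` is an iterable of (bin_start, error_count) tuples;
    `deploys` is an iterable of deploy datetimes.
    """
    matches = []
    for deploy in sorted(deploys):
        window_end = deploy + window
        for bin_start, errors in sorted(bins):
            # A bin counts if any part of it falls in [deploy, deploy + window).
            if bin_start + bin_width > deploy and bin_start < window_end:
                matches.append((deploy, bin_start, errors))
    return matches
```

Feeding it the Query 2 bins and a list of deploy times gives a short table to eyeball: a jump in errors inside the window is a hint, not proof, that the deploy is involved.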

Query 3 — ALB 5xx after a deploy

ALB delivers access logs to S3, so this assumes you forward them into a dedicated CloudWatch log group. Adjust the status field parsing if your format differs.

fields @timestamp, @message
# 12 space-separated fields, then the quoted request; "rest" absorbs everything after it
| parse @message "* * * * * * * * * * * * \"*\" *" as
    type, time, elb, client, target, request_processing, target_processing,
    response_processing, elb_status, target_status, received_bytes, sent_bytes, request, rest
| filter elb_status >= 500 or target_status >= 500
| stats count() as five_xx by bin(5m), target_status
| sort bin(5m) desc

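If parsing is painful, filter with | filter @message like / 5[0-9]{2} / as a crude first pass. It also matches any other space-delimited three-digit field starting with 5, such as a byte count of 512, so treat the result as a hint.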
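
Before trusting a positional parse, it helps to check the field positions against one real line offline. The sketch below assumes the standard ALB access log layout, where twelve space-separated fields precede the quoted request; `SAMPLE_LINE` is a made-up entry for illustration, and the helper names are hypothetical.

```python
import re

# Names for the first 13 fields of an ALB access log entry (assumed layout).
ALB_FIELDS = [
    "type", "time", "elb", "client", "target", "request_processing",
    "target_processing", "response_processing", "elb_status",
    "target_status", "received_bytes", "sent_bytes", "request",
]

# Twelve unquoted fields followed by the quoted request line.
ALB_PREFIX = re.compile(r"(\S+) " * 12 + r'"([^"]*)"')

# Illustrative line only; not taken from a real load balancer.
SAMPLE_LINE = (
    "https 2024-05-01T12:10:03.123456Z app/my-alb/50dc6c495c0c9188 "
    "203.0.113.10:44321 10.0.1.15:8080 0.001 0.250 0.000 502 502 120 310 "
    '"GET https://api.example.com:443/orders?id=7 HTTP/1.1" "curl/8.4.0" '
    "ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2"
)


def parse_alb_line(line):
    """Return the first 13 ALB fields as a dict, or raise ValueError."""
    match = ALB_PREFIX.match(line)
    if match is None:
        raise ValueError("line does not look like an ALB access log entry")
    return dict(zip(ALB_FIELDS, match.groups()))


def is_5xx(entry):
    """True if either status is 5xx; target_status may be '-' with no target."""
    for key in ("elb_status", "target_status"):
        value = entry[key]
        if value.isdigit() and int(value) >= 500:
            return True
    return False
```

If `parse_alb_line` puts the status codes in the wrong keys for one of your own lines, the parse pattern in the query needs the same adjustment.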

Query 4 — Recent 5xx with request path

fields @timestamp, @message
| filter @message like / 5[0-9]{2} /
| parse @message "* * * * * * * * * * * * \"* * *\" *" as
    f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, method, path, protocol, rest
| display @timestamp, method, path, @message
| sort @timestamp desc
| limit 50

Correlate path spikes with a bad release or upstream dependency.

Query 5 — ECS task stop reason (container logs)

Many teams log ECS stop reasons from EventBridge to a log group; if not, search container stderr for OOM and exit codes.

fields @timestamp, @message
| filter @message like /(?i)(oom|out of memory|exit code|essential container|stopped|task failed)/
| stats count() as stops by @message
| sort stops desc
| limit 15

Pair with ECS console “stopped reason” when logs are thin.

Query 6 — OOM and memory hints

fields @timestamp, @message
| filter @message like /(?i)(oom|killed process|cannot allocate memory|java\.lang\.OutOfMemoryError)/
| sort @timestamp desc
| limit 30

Often explains “task ran fine in dev” when prod memory limits differ.

Query 7 — Slow request tail

Assumes your app logs duration in ms (adjust field name).

fields @timestamp, @message
| parse @message "duration_ms=* " as duration
| filter duration > 2000
| stats count() as slow_hits, avg(duration) as avg_ms, max(duration) as max_ms by bin(5m)
| sort bin(5m) desc

If duration is embedded differently, use a regex parse matching your format.
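
A regex is easier to get right against a handful of sample lines before it goes into a query. The sketch below is one way to do that, assuming the `duration_ms=` key from Query 7; the helper names are hypothetical. Python writes named groups as `(?P<name>...)` while the Logs Insights regex form of parse uses `(?<name>...)`, so the helper rewrites that prefix.

```python
import re

# Python-flavoured pattern; the key name duration_ms comes from Query 7.
DURATION_PY = r"duration_ms=(?P<duration>[0-9]+)"


def to_insights_parse(python_pattern):
    """Render a Python-style pattern as a Logs Insights regex parse clause.

    Only the named-group prefix is rewritten; any "/" in the pattern
    would still need escaping by hand.
    """
    return "parse @message /" + python_pattern.replace("(?P<", "(?<") + "/"


def extract_durations(lines, pattern=DURATION_PY):
    """Return the integer durations found in `lines`, skipping non-matches."""
    compiled = re.compile(pattern)
    durations = []
    for line in lines:
        match = compiled.search(line)
        if match:
            durations.append(int(match.group("duration")))
    return durations
```

Unlike the glob pattern in Query 7, which needs a space after the value, this regex also matches a duration at the very end of a line.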

Query 8 — User-agent outliers (bot traffic)

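Useful when cache or rate limits behave oddly after launch. The pattern assumes the user agent is the middle of three adjacent quoted fields, as in some proxy formats; in ALB access logs the user agent is followed by the unquoted ssl_cipher field, so adjust the pattern to your format.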

fields @timestamp, @message
| parse @message "*\" \"*\" \"*\"" as prefix, ua, suffix
| filter ispresent(ua)
| stats count() as hits by ua
| sort hits desc
| limit 25

Sudden crawler spikes can look like application errors in aggregate metrics.

Saving queries and runbooks

In the Logs Insights console: run query → Save → name with purpose and log group (prod-api-errors-1h). In runbooks, link directly to the saved query URL and note which log groups and which time window to start with.
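
Saved queries can also live in version control and be pushed from a script. This is a minimal sketch, assuming `client` behaves like `boto3.client("logs")`; the dictionary `FIRST_15_MINUTES` and the helper `save_queries` are hypothetical names, and the query text is Query 1 from above.

```python
# Query name -> query string; the name follows the naming tip from Setup tips.
FIRST_15_MINUTES = {
    "incident-top-errors-1h": (
        "fields @timestamp, @message\n"
        "| filter @message like /(?i)(error|exception|fatal|failed)/\n"
        "| stats count() as hits by @message\n"
        "| sort hits desc\n"
        "| limit 20"
    ),
}


def save_queries(client, log_group, queries):
    """Create one saved query definition per entry and return name -> id.

    `client` is assumed to expose put_query_definition the way boto3's
    CloudWatch Logs client does.
    """
    ids = {}
    for name, query in queries.items():
        response = client.put_query_definition(
            name=name,
            queryString=query,
            logGroupNames=[log_group],
        )
        ids[name] = response["queryDefinitionId"]
    return ids
```

Calling this twice creates a second definition with the same name; to update an existing one, pass its `queryDefinitionId` back in the same call.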

For on-call: one page titled “First 15 minutes” with queries 1, 2, and 5 beats a forty-page monitoring doc nobody opens.

What’s next

Not every workload is a long-running ECS service behind an ALB. When jobs are finite, batch-shaped, and compute-heavy—genomics pipelines, ETL, simulations—AWS Batch on EC2 is often the right fit. Next: a deployment checklist for Batch covering Terraform, IAM, KMS, and the RUNNABLE jobs that never start.

If you only do one thing: save one “first 15 minutes of an incident” query and pin it in your team’s runbook.