Parsing Web and Application Logs at Scale
Extract status codes, latencies, and client fields from access logs using grep, cut, and awk.
What Is a Web Access Log?
Every HTTP server — Apache, Nginx, Caddy — writes one line to an access log for each request. Understanding the structure of these lines is the foundation of all log analysis work.
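A line in the Combined Log Format, which extends the Common Log Format (CLF) with referer and user-agent fields, contains the fields below, plus two identity fields after the client IP (usually -, or the authenticated user name):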
- Client IP — who made the request
- Timestamp — when it happened
- Request line — method, path, protocol
- Status code — HTTP response (200, 404, 500…)
- Bytes sent — response body size
- Referer — origin page
- User-Agent — browser or bot string
Example line from /var/log/nginx/access.log:
192.168.1.10 - alice [11/Jun/2026:14:32:01 +0000] "GET /api/orders HTTP/1.1" 200 1482 "-" "curl/7.88.1"
At scale, these files grow to millions of lines per day. The goal of this lesson is to extract, filter, and aggregate fields from them efficiently using standard command-line tools (grep, cut, awk) driven from Bash.
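As a first taste of field extraction, here is a minimal sketch that pulls the client IP, status code, and byte count out of the example line. It assumes whitespace splitting and a request line with exactly three tokens (method, path, protocol), which puts the status in field 9 and the bytes in field 10; malformed requests would shift those positions. The function names extract_fields and client_ip are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: field positions assume a well-formed combined-format line
# whose request line is exactly "METHOD PATH PROTOCOL".

# extract_fields (hypothetical helper): print client IP, status, bytes.
extract_fields() {
  awk '{ print $1, $9, $10 }'
}

# client_ip (hypothetical helper): the same first field, using cut.
client_ip() {
  cut -d' ' -f1
}

# The example line from /var/log/nginx/access.log shown above.
sample='192.168.1.10 - alice [11/Jun/2026:14:32:01 +0000] "GET /api/orders HTTP/1.1" 200 1482 "-" "curl/7.88.1"'

printf '%s\n' "$sample" | extract_fields   # prints: 192.168.1.10 200 1482
printf '%s\n' "$sample" | client_ip        # prints: 192.168.1.10
```

cut is enough when only a leading field is needed; awk becomes the better fit once several fields, or fields after the quoted request line, are involved.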
Sampling a Live Log with tail and grep
Before writing any pipeline, inspect the log to understand its shape. tail lets you watch a live stream; grep narrows it to relevant lines immediately.
Common patterns:
- tail -n 1000 access.log — last 1000 lines
- tail -f access.log — follow in real time
- tail -f access.log | grep '" 5' — only 5xx errors as they arrive
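To get a feel for a log's shape before building a pipeline, one option is to summarise the status classes in a tail sample. The sketch below assumes the status code is whitespace-separated field 9; status_mix is a hypothetical helper name, and the demo file holds made-up lines.

```shell
#!/usr/bin/env bash
# status_mix (hypothetical helper): count status classes (2xx, 4xx, ...)
# in the last N lines of a log. Assumes the status code is field 9.
status_mix() {
  local file=$1 n=${2:-1000}
  tail -n "$n" "$file" \
    | awk '{ mix[substr($9, 1, 1) "xx"]++ }
           END { for (k in mix) print k, mix[k] }' \
    | sort
}

# Demo on a throwaway file with made-up lines (not real traffic).
demo_log=$(mktemp)
trap 'rm -f "$demo_log"' EXIT
cat > "$demo_log" <<'EOF'
10.0.0.1 - - [11/Jun/2026:14:32:01 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.88.1"
10.0.0.2 - - [11/Jun/2026:14:32:02 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/7.88.1"
10.0.0.3 - - [11/Jun/2026:14:32:03 +0000] "GET /api/orders HTTP/1.1" 200 1482 "-" "curl/7.88.1"
EOF

status_mix "$demo_log" 1000
# prints:
# 2xx 2
# 4xx 1
```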
The key insight is that grep matches against the entire line, so the pattern needs enough surrounding context to hit only the field you mean. Matching ' 500 ' (with spaces) avoids accidentally matching a URL path that contains the string 500.
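The effect of pattern context can be checked on a few made-up lines. The sketch below compares three patterns; count_matches is a hypothetical helper and the log lines are illustrative data. It also shows one further case: a space-delimited pattern can still hit the bytes field, so including the closing quote of the request line ties the match to the status code.

```shell
#!/usr/bin/env bash
# Three made-up access log lines (illustrative data, not real traffic):
# 500 appears in a path, as a real status code, and as a byte count.
lines='10.0.0.1 - - [11/Jun/2026:14:32:01 +0000] "GET /items/500 HTTP/1.1" 200 1482 "-" "curl/7.88.1"
10.0.0.2 - - [11/Jun/2026:14:32:02 +0000] "GET /api/orders HTTP/1.1" 500 87 "-" "curl/7.88.1"
10.0.0.3 - - [11/Jun/2026:14:32:03 +0000] "GET /health HTTP/1.1" 200 500 "-" "curl/7.88.1"'

# count_matches (hypothetical helper): count lines matching a grep pattern.
count_matches() {
  printf '%s\n' "$lines" | grep -c -- "$1"
}

count_matches '500'      # prints 3: path, status, and byte count all match
count_matches ' 500 '    # prints 2: skips the path, still hits the byte count
count_matches '" 500 '   # prints 1: only the line whose status is 500
```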
#!/usr/bin/env bash
# Watch only HTTP 5xx errors arriving in real time
tail -f /var/log/nginx/access.log \
| grep --line-buffered '" 5[0-9][0-9] '