Building System Health Check and Alert Scripts
Collect load, memory, and disk metrics and trigger threshold-based alerts from scheduled scripts.
Why System Health Checks Matter
Production servers can degrade silently. CPU spikes, memory leaks, and full disks cause outages — but only if nobody notices in time. System health check scripts automate the monitoring loop: collect metrics, compare against thresholds, and fire alerts before users feel the pain.
- Scheduled via cron, they run every few minutes without human attention
- They produce consistent, timestamped output suitable for log aggregation
- Threshold-based logic keeps alerts meaningful — not every hiccup pages the on-call team
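As a rough illustration of the points above, the sketch below emits one timestamped line per metric in a form that log aggregators can parse. The helper name `log_metric`, the key=value layout, and the paths in the crontab comment are all hypothetical, chosen only for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: emit one timestamped key=value line per metric.
# The field layout is an assumption, picked to be easy to grep and parse.
log_metric() {
    local level=$1 metric=$2 value=$3
    printf '%s level=%s metric=%s value=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$metric" "$value"
}

log_metric OK load_pct 12.50

# Example crontab entry (placeholder paths) that runs the script every 5 minutes:
# */5 * * * * /usr/local/bin/healthcheck.sh >> /var/log/healthcheck.log 2>&1
```

Using UTC timestamps keeps lines comparable across hosts in different time zones, which tends to matter once logs from several servers are merged.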
In this lesson you will build a production-grade health check script from scratch, layer by layer, covering load average, memory pressure, and disk utilisation.
Capturing Load Average
Linux exposes the 1-minute, 5-minute, and 15-minute load averages through /proc/loadavg and the uptime command. For scripting, /proc/loadavg is the cleanest source — no locale issues, no parsing variation across distros.
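To make the file layout concrete, here is a minimal sketch that reads all three averages in one pass with the shell's built-in read. The function name `read_loadavg` and its optional path argument are assumptions added for illustration and testing; they are not part of any standard tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the three load averages from a loadavg-style file.
# Defaults to /proc/loadavg; the optional path argument exists for testing.
read_loadavg() {
    local src=${1:-/proc/loadavg}
    local load1 load5 load15 _rest
    # Fields: 1-min, 5-min, 15-min, running/total tasks, most recent PID
    read -r load1 load5 load15 _rest < "$src"
    printf 'load1=%s load5=%s load15=%s\n' "$load1" "$load5" "$load15"
}

# Only call it directly where the file exists (i.e. on Linux).
if [ -r /proc/loadavg ]; then
    read_loadavg
fi
```

Reading with the built-in avoids spawning a process per field, which is a small but welcome saving in a script that runs every few minutes.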
The snippet below reads the 1-minute load average and normalises it by core count so it can be compared against a threshold. cut extracts the first field, nproc reports the number of available CPU cores, and bc performs the floating-point arithmetic that Bash cannot do natively.
#!/usr/bin/env bash
# Read 1-minute load average from /proc/loadavg
LOAD_RAW=$(cut -d' ' -f1 /proc/loadavg)
echo "Raw load average: $LOAD_RAW"
# Number of CPU cores — used to normalise load
CPU_CORES=$(nproc)
echo "CPU cores: $CPU_CORES"
# Compute load percentage (load * 100 / cores) using bc.
# Multiplying first avoids losing digits when scale=2 truncates the division.
LOAD_PCT=$(echo "scale=2; $LOAD_RAW * 100 / $CPU_CORES" | bc)
echo "Load %: $LOAD_PCT"
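Once the percentage is available, the threshold comparison itself could look something like the sketch below. It uses awk rather than bc for the comparison, because relational expressions outside if/while are a GNU bc extension and awk handles floats portably. The helper name `exceeds_threshold`, the variable `LOAD_WARN_PCT`, and the 80% default are assumptions for illustration, not values from the lesson.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when value is strictly above threshold.
# awk does the float comparison; "+ 0" forces numeric interpretation.
exceeds_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !((v + 0) > (t + 0)) }'
}

LOAD_PCT=${LOAD_PCT:-93.75}          # sample value so the sketch runs standalone
LOAD_WARN_PCT=${LOAD_WARN_PCT:-80}   # assumed threshold, tune per host

if exceeds_threshold "$LOAD_PCT" "$LOAD_WARN_PCT"; then
    echo "ALERT: load ${LOAD_PCT}% exceeds ${LOAD_WARN_PCT}%"
else
    echo "OK: load ${LOAD_PCT}% within ${LOAD_WARN_PCT}%"
fi
```

Returning an exit status instead of printing a result lets the helper slot directly into if statements, and the same function can be reused for memory and disk percentages.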