Why Gatus Is My Preferred Health Check Tool (And Why Uptime Monitoring Isn't Enough)
Uptime tools tell you a service is running. Gatus tells you the data pipeline is actually working. How I use 73 custom health checks to monitor infrastructure, data freshness, and pipeline completeness.
Most health check tools answer one question: is the service responding? Gatus answers a different one: is the system actually working? And it lets you define exactly what “working” means for you.
The distinction matters. A Kafka broker can respond to TCP connections while silently dropping messages. A Redis instance can pass a PING check while running out of memory. A data pipeline can be “up” while processing data from three hours ago.
What’s Wrong with Traditional Monitoring
Tools like Uptime Robot, Pingdom, or even basic Prometheus up metrics are service-level checks. They tell you:
- Is port 6379 accepting connections? (Redis is up)
- Does /health return 200? (API is up)
- Is the process running? (Container is up)
What they don’t tell you:
- Is the data in Redis fresh? (Last write was 47 minutes ago)
- Are all 60 derived metrics being computed per entity? (Some stopped updating)
- Is the Kafka consumer actually advancing its offset? (It’s connected but stuck)
- Are supervisor-managed processes in a crash loop? (BACKOFF state, restarting every 5 seconds)
These are business-level health checks — they validate that the system is doing its job, not just that it’s alive.
Why Gatus
Gatus is a lightweight, open-source health check tool that runs as a single binary (or Docker container). What makes it different:
- Custom checks via external endpoints — you can point Gatus at any HTTP endpoint that returns a pass/fail, letting you write arbitrarily complex health logic
- Conditions on response body — check that a JSON response contains specific values, not just a 200 status
- Unified status page — single dashboard for all checks across all groups
- Built-in alerting — Telegram, Slack, PagerDuty, email — with configurable failure/success thresholds
- No database required — stores state in memory or a simple file
The killer feature is the external endpoint pattern. Instead of trying to express complex health logic in Gatus’s YAML config, you write a Python sidecar that runs the checks and exposes results as HTTP endpoints. Gatus just polls them.
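A minimal version of that sidecar pattern (the check name and logic here are illustrative, not my production code) is just an HTTP server that runs a check function and maps the result to a status code Gatus can evaluate:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical check: real logic would query Redis; hardcoded here.
def check_redis_freshness():
    return True, "last write 12s ago"

# Route table: one path per check, polled by Gatus.
CHECKS = {"/health/redis-freshness": check_redis_freshness}

class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        check = CHECKS.get(self.path)
        if check is None:
            self.send_response(404)
            self.end_headers()
            return
        healthy, message = check()
        body = json.dumps({"healthy": healthy, "message": message}).encode()
        # Failing checks return 500 so a plain [STATUS] == 200 condition
        # works; the JSON body allows richer body-level conditions too.
        self.send_response(200 if healthy else 500)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("", 8080), CheckHandler).serve_forever()
```

Adding a check is then one function plus one route entry; Gatus never needs to know how the check is implemented.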
The Architecture
My setup runs 73 checks organised into seven groups:
graph TD
G[Gatus - every 30s] -->|TCP/HTTP| STD[Standard Checks]
G -->|HTTP poll| PY[Python Sidecar]
STD -->|fail x3| TG[Telegram Alert]
PY -->|fail x3| TG
The seven check groups and Python sidecar modules:
| Group | Checks | Source |
|---|---|---|
| Infrastructure | 22 | Standard TCP/HTTP |
| Ingestion | 6 | Standard + Sidecar |
| Processing | 3 | Sidecar (supervisors, kafka_lag) |
| Data Latency | 25 | Sidecar (data_freshness, metrics_completeness) |
| Analytics | 8 | Sidecar (analytics) |
| Execution | 2 | Standard HTTP |
| Ops | 7 | Sidecar (connections, log_health) |
Standard Checks
Infrastructure checks are straightforward Gatus YAML:
endpoints:
  - name: Kafka Broker 1
    group: Infrastructure
    url: "tcp://kafka-1:9092"
    interval: 30s
    conditions:
      - "[CONNECTED] == true"
    alerts:
      - type: telegram

  - name: Redis
    group: Infrastructure
    url: "tcp://redis:6379"
    interval: 30s
    conditions:
      - "[CONNECTED] == true"
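The sidecar-backed checks look much the same on the Gatus side. Assuming the sidecar serves JSON like {"healthy": true} at one path per check (the URL and names below are illustrative), the condition can match on the response body as well as the status:

  - name: Data Freshness - entity-a
    group: Data Latency
    url: "http://health-sidecar:8080/health/data-freshness/entity-a"
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].healthy == true"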
Custom Checks (The Interesting Part)
The Python sidecar runs every 30 seconds and writes results to a shared directory that Gatus polls via HTTP. Here’s what the data freshness check looks like:
def check_data_freshness(entity: str, source: str) -> HealthResult:
    """Check if data for an entity is stale."""
    latest_ts = redis_ts.get_latest_timestamp(
        f"metrics:{entity}:{source}"
    )
    if latest_ts is None:
        return HealthResult(healthy=False, message=f"No data for {entity}")

    age_seconds = time.time() - latest_ts
    threshold = 60 if is_business_hours() else 300
    if age_seconds > threshold:
        return HealthResult(
            healthy=False,
            message=f"{entity} data is {age_seconds:.0f}s stale",
        )
    return HealthResult(healthy=True)
This check knows that 60-second staleness during business hours is critical, but 5-minute staleness outside those hours is normal. A TCP check can’t express that.
The metrics completeness check is even more specific — it verifies that all 60+ per-entity derived metrics have been computed in the last cycle. If one metric group is updating but another isn’t, the check identifies exactly which one stalled.
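A sketch of that completeness logic, assuming a per-entity lookup of when each metric was last computed (the metric names are illustrative, and it returns a plain (healthy, message) tuple rather than my HealthResult type to stay self-contained):

```python
import time

# Hypothetical expected set: 60+ metric names in production.
EXPECTED_METRICS = {"vwap", "spread", "volume_1m"}

def check_metrics_completeness(entity, last_computed, max_age=120):
    """Flag metrics that stopped updating for an entity.

    last_computed maps metric name -> unix timestamp of last computation.
    """
    now = time.time()
    stalled = sorted(
        m for m in EXPECTED_METRICS
        if now - last_computed.get(m, 0) > max_age
    )
    if stalled:
        return False, f"{entity}: stalled metrics: {', '.join(stalled)}"
    return True, f"{entity}: all {len(EXPECTED_METRICS)} metrics current"
```

Because the failure message names the stalled metrics, the alert itself tells you where to look instead of just that something is wrong.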
Supervisor Health
Several services use supervisord to manage multiple processes within a container. The supervisor check connects to the supervisord XML-RPC interface and flags any process in FATAL or BACKOFF state:
def check_supervisors(container: str) -> HealthResult:
    """Check supervisord processes for crash loops."""
    # supervisor_client is an xmlrpc.client.ServerProxy pointing at the
    # container's supervisord RPC interface.
    processes = supervisor_client.getAllProcessInfo()
    failures = [
        p for p in processes
        if p["statename"] in ("FATAL", "BACKOFF")
    ]
    if failures:
        names = ", ".join(p["name"] for p in failures)
        return HealthResult(
            healthy=False,
            message=f"Processes in crash loop: {names}",
        )
    return HealthResult(healthy=True)
A container can be “running” (Docker health check passes) while a critical process inside it is in a BACKOFF crash loop. Without this check, you’d only notice when downstream data stops arriving.
Alerting: Failure Threshold + Resolution
Gatus sends alerts to Telegram with configurable thresholds:
alerting:
  telegram:
    token: "${TELEGRAM_BOT_TOKEN}"
    id: "${TELEGRAM_CHAT_ID}"
    default-alert:
      enabled: true
      failure-threshold: 3
      success-threshold: 2
      send-on-resolved: true
Three consecutive failures trigger an alert. Two consecutive successes send a resolution. This eliminates flapping — a brief network hiccup doesn’t page you, but a sustained failure does.
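The timing those thresholds imply is worth sanity-checking; a back-of-envelope calculation under the settings above:

```python
interval_s = 30          # check interval
failure_threshold = 3    # consecutive failures before alerting
success_threshold = 2    # consecutive successes before resolving

# Sustained-failure time before an alert fires, and sustained-recovery
# time before the resolution message is sent.
time_to_alert = interval_s * failure_threshold      # 90 seconds
time_to_resolve = interval_s * success_threshold    # 60 seconds
```

Halving the interval halves both numbers but doubles the check load, which is usually the wrong trade for business-level checks.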
The send-on-resolved: true setting is underrated. Without it, you’re left wondering whether the alert self-resolved or is still active. With it, every alert has a clear lifecycle: fire → investigate → resolve.
Gatus vs Alternatives
| Feature | Gatus | Uptime Robot | Healthchecks.io | Prometheus + Blackbox |
|---|---|---|---|---|
| Custom check logic | External endpoints | No | Cron-only | Probe modules |
| Business-level checks | Yes (via sidecar) | No | No | Complex (custom exporter) |
| Status page | Built-in | Paid | Basic | Grafana (manual) |
| Self-hosted | Yes | No | Yes | Yes |
| Alert channels | 10+ built-in | Email, SMS | Email, webhooks | Alertmanager |
| Setup complexity | Low (single binary) | SaaS | Low | High (multiple components) |
| Cost | Free | Paid at scale | Paid at scale | Free but complex |
The Prometheus + Blackbox exporter approach is the closest competitor, but it requires writing custom exporters for every business-level check, configuring recording rules, and building Grafana dashboards for the status page. Gatus gives you all of this in a single config file.
Integration with AI Agents
The structured output from Gatus checks feeds directly into AI-assisted diagnosis. Each check failure maps to a specific Loki log query:
Kafka Broker Down → {container_name="kafka-1"} |= "ERROR"
Redis Connection → {container="..._to_redis"} |= "ConnectionError"
Data Latency → {container="..._to_redis"} |~ "timeout|stale"
This mapping is version-controlled. When an AI agent encounters a Gatus failure, it knows exactly which logs to query, which metrics to check, and which containers to inspect. The health check is the entry point; the diagnostic mapping is the playbook.
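A version-controlled mapping like that can be as simple as a dict from check name to Loki query, kept in git alongside the Gatus config and the sidecar (the check names and container labels below are illustrative):

```python
# Hypothetical failure -> Loki query playbook for AI-assisted diagnosis.
DIAGNOSTIC_MAP = {
    "Kafka Broker 1": '{container_name="kafka-1"} |= "ERROR"',
    "Redis Connection": '{container="ingest_to_redis"} |= "ConnectionError"',
}

def queries_for_failure(check_name: str) -> list[str]:
    """Return the Loki queries an agent should run for a failed check."""
    query = DIAGNOSTIC_MAP.get(check_name)
    return [query] if query else []
```

An agent (or a human on call) resolves a failing check name to its queries and starts there instead of grepping blind.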
Lessons Learned
Start with business checks, not infrastructure checks. I built the TCP/HTTP checks first and got a green status page while data was stale. The data freshness and metrics completeness checks were the ones that actually caught real incidents.
The Python sidecar pattern scales. Adding a new check means writing one function that returns pass/fail. No YAML templating, no Prometheus recording rules, no Grafana panel configuration.
Failure thresholds matter more than check intervals. Running checks every 30 seconds with a 3-failure threshold means you get alerted after 90 seconds of sustained failure. That’s the right balance between speed and noise.
Send resolution alerts. Every alert should have a clear end. Otherwise your Telegram channel becomes a list of open questions.
Version-control everything. The Gatus config, the Python sidecar, the diagnostic mapping — all in git. When something breaks, you can diff what changed. When you onboard someone (or an AI agent), the playbook is in the repo.