Platform Engineering · Infrastructure · Observability · Monitoring

Why Gatus Is My Preferred Health Check Tool (And Why Uptime Monitoring Isn't Enough)

Uptime tools tell you a service is running. Gatus tells you the data pipeline is actually working. How I use 73 custom health checks to monitor infrastructure, data freshness, and pipeline completeness.

1 March 2026 · 8 min read

Most health check tools answer one question: is the service responding? Gatus answers a different one: is the system actually working? And it lets you define exactly what “working” means for your system.

The distinction matters. A Kafka broker can respond to TCP connections while silently dropping messages. A Redis instance can pass a PING check while running out of memory. A data pipeline can be “up” while processing data from three hours ago.

What’s Wrong with Traditional Monitoring

Tools like Uptime Robot, Pingdom, or even basic Prometheus up metrics are service-level checks. They tell you:

  • Is port 6379 accepting connections? (Redis is up)
  • Does /health return 200? (API is up)
  • Is the process running? (Container is up)

What they don’t tell you:

  • Is the data in Redis fresh? (Last write was 47 minutes ago)
  • Are all 60 derived metrics being computed per entity? (Some stopped updating)
  • Is the Kafka consumer actually advancing its offset? (It’s connected but stuck)
  • Are supervisor-managed processes in a crash loop? (BACKOFF state, restarting every 5 seconds)

These are business-level health checks — they validate that the system is doing its job, not just that it’s alive.

Why Gatus

Gatus is a lightweight, open-source health check tool that runs as a single binary (or Docker container). What makes it different:

  1. Custom checks via external endpoints — you can point Gatus at any HTTP endpoint that returns a pass/fail, letting you write arbitrarily complex health logic
  2. Conditions on response body — check that a JSON response contains specific values, not just a 200 status
  3. Unified status page — single dashboard for all checks across all groups
  4. Built-in alerting — Telegram, Slack, PagerDuty, email — with configurable failure/success thresholds
  5. No database required — stores state in memory or a simple file

The killer feature is the external endpoint pattern. Instead of trying to express complex health logic in Gatus’s YAML config, you write a Python sidecar that runs the checks and exposes results as HTTP endpoints. Gatus just polls them.
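As a minimal sketch of that pattern (the registry contents, paths, and port here are assumptions, not the author’s actual sidecar): a dictionary of check functions served over HTTP, returning a JSON body plus a non-200 status on failure, so a Gatus condition can key off either the status code or the body.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical check registry: each check returns (healthy, message).
CHECKS = {
    "/healthz/redis-freshness": lambda: (True, "ok"),
}

def run_check(path: str):
    """Resolve a check path to (HTTP status, JSON body) for Gatus to poll."""
    check = CHECKS.get(path)
    if check is None:
        return 404, b"{}"
    healthy, message = check()
    body = json.dumps({"healthy": healthy, "message": message}).encode()
    # Non-200 on failure, so even a bare [STATUS] == 200 condition works
    return (200 if healthy else 503), body

class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = run_check(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("0.0.0.0", 8090), CheckHandler).serve_forever()
```

Adding a check is then one entry in the registry; Gatus never needs to know how the result was computed.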

The Architecture

My setup runs 73 checks organised into seven groups:

graph TD
    G[Gatus - every 30s] -->|TCP/HTTP| STD[Standard Checks]
    G -->|HTTP poll| PY[Python Sidecar]
    STD -->|fail x3| TG[Telegram Alert]
    PY -->|fail x3| TG

The seven check groups and Python sidecar modules:

| Group | Checks | Source |
| --- | --- | --- |
| Infrastructure | 22 | Standard TCP/HTTP |
| Ingestion | 6 | Standard + Sidecar |
| Processing | 3 | Sidecar (supervisors, kafka_lag) |
| Data Latency | 25 | Sidecar (data_freshness, metrics_completeness) |
| Analytics | 8 | Sidecar (analytics) |
| Execution | 2 | Standard HTTP |
| Ops | 7 | Sidecar (connections, log_health) |
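The kafka_lag module’s internals aren’t shown in this post, but its core calculation can be sketched as follows (function names and the threshold are assumptions; a real implementation would fetch partition end offsets and committed consumer offsets from the broker, e.g. via a Kafka admin client):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: how far the consumer trails the log end offset."""
    return {
        partition: end_offsets[partition] - committed.get(partition, 0)
        for partition in end_offsets
    }

def check_kafka_lag(end_offsets: dict, committed: dict, max_lag: int = 1000):
    """Fail if any partition's lag exceeds the threshold."""
    stuck = {p: n for p, n in consumer_lag(end_offsets, committed).items()
             if n > max_lag}
    if stuck:
        return False, f"Lagging partitions: {stuck}"
    return True, "ok"
```

Catching the connected-but-stuck consumer from the earlier bullet list additionally requires comparing committed offsets across successive runs: lag that only ever grows means the offset isn’t advancing.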

Standard Checks

Infrastructure checks are straightforward Gatus YAML:

endpoints:
  - name: Kafka Broker 1
    group: Infrastructure
    url: tcp://kafka-1:9092
    interval: 30s
    conditions:
      - "[CONNECTED] == true"
    alerts:
      - type: telegram

  - name: Redis
    group: Infrastructure
    url: tcp://redis:6379
    interval: 30s
    conditions:
      - "[CONNECTED] == true"

Custom Checks (The Interesting Part)

The Python sidecar runs every 30 seconds and writes results to a shared directory that Gatus polls via HTTP. Here’s what the data freshness check looks like:

import time

# redis_ts, is_business_hours and HealthResult are sidecar helpers defined elsewhere
def check_data_freshness(entity: str, source: str) -> HealthResult:
    """Check if data for an entity is stale."""
    latest_ts = redis_ts.get_latest_timestamp(
        f"metrics:{entity}:{source}"
    )
    if latest_ts is None:
        return HealthResult(healthy=False, message=f"No data for {entity}")

    age_seconds = time.time() - latest_ts
    threshold = 60 if is_business_hours() else 300

    if age_seconds > threshold:
        return HealthResult(
            healthy=False,
            message=f"{entity} data is {age_seconds:.0f}s stale"
        )
    return HealthResult(healthy=True)

This check knows that 60-second staleness during business hours is critical, but 5-minute staleness outside those hours is normal. A TCP check can’t express that.
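On the Gatus side, wiring a sidecar check up is just another endpoint entry that polls the sidecar and asserts on the JSON body (the endpoint name, host, port, and path here are illustrative, not the author’s actual config):

```yaml
endpoints:
  - name: Data Freshness - entity-a
    group: Data Latency
    url: "http://health-sidecar:8090/healthz/freshness/entity-a"
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].healthy == true"
    alerts:
      - type: telegram
```

The `[BODY].healthy` condition is Gatus’s JSONPath-on-response-body feature from the list above: the check fails if the sidecar reports `"healthy": false`, even when the HTTP request itself succeeds.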

The metrics completeness check is even more specific — it verifies that all 60+ per-entity derived metrics have been computed in the last cycle. If one metric group is updating but another isn’t, the check identifies exactly which one stalled.
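The shape of that completeness check can be sketched like this (names and inputs are assumptions; in the real module, the expected set would come from the pipeline’s metric registry and the computed set from scanning what was written in the last cycle):

```python
def check_metrics_completeness(entity: str, computed: set, expected: set):
    """Verify every expected derived metric was computed in the last cycle.

    Reports exactly which metrics stalled, truncated to the first five.
    """
    missing = expected - computed
    if missing:
        return False, f"{entity}: {len(missing)} metrics stalled: {sorted(missing)[:5]}"
    return True, "ok"
```

The payoff is in the failure message: instead of a generic “pipeline unhealthy”, the alert names the exact metric group that stopped updating.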

Supervisor Health

Several services use supervisord to manage multiple processes within a container. The supervisor check connects to the supervisord XML-RPC interface and flags any process in FATAL or BACKOFF state:

import xmlrpc.client

def check_supervisors(container: str) -> HealthResult:
    """Check supervisord processes for crash loops."""
    # supervisord's XML-RPC API, exposed via [inet_http_server] (9001 by default)
    client = xmlrpc.client.ServerProxy(f"http://{container}:9001/RPC2")
    processes = client.supervisor.getAllProcessInfo()
    failures = [
        p for p in processes
        if p["statename"] in ("FATAL", "BACKOFF")
    ]
    if failures:
        names = ", ".join(p["name"] for p in failures)
        return HealthResult(
            healthy=False,
            message=f"Processes in crash loop: {names}"
        )
    return HealthResult(healthy=True)

A container can be “running” (Docker health check passes) while a critical process inside it is in a BACKOFF crash loop. Without this check, you’d only notice when downstream data stops arriving.

Alerting: Failure Threshold + Resolution

Gatus sends alerts to Telegram with configurable thresholds:

alerting:
  telegram:
    token: "${TELEGRAM_BOT_TOKEN}"
    id: "${TELEGRAM_CHAT_ID}"
    default-alert:
      enabled: true
      failure-threshold: 3
      success-threshold: 2
      send-on-resolved: true

Three consecutive failures trigger an alert. Two consecutive successes send a resolution. This eliminates flapping — a brief network hiccup doesn’t page you, but a sustained failure does.

The send-on-resolved: true setting is underrated. Without it, you’re left wondering whether an alert self-resolved or is still active. With it, every alert has a clear lifecycle: fire → investigate → resolve.

Gatus vs Alternatives

| Feature | Gatus | Uptime Robot | Healthchecks.io | Prometheus + Blackbox |
| --- | --- | --- | --- | --- |
| Custom check logic | External endpoints | No | Cron-only | Probe modules |
| Business-level checks | Yes (via sidecar) | No | No | Complex (custom exporter) |
| Status page | Built-in | Paid | Basic | Grafana (manual) |
| Self-hosted | Yes | No | Yes | Yes |
| Alert channels | 10+ built-in | Email, SMS | Email, webhooks | Alertmanager |
| Setup complexity | Low (single binary) | SaaS | Low | High (multiple components) |
| Cost | Free | Paid at scale | Paid at scale | Free but complex |

The Prometheus + Blackbox exporter approach is the closest competitor, but it requires writing custom exporters for every business-level check, configuring recording rules, and building Grafana dashboards for the status page. Gatus gives you all of this in a single config file.

Integration with AI Agents

The structured output from Gatus checks feeds directly into AI-assisted diagnosis. Each check failure maps to a specific Loki log query:

Kafka Broker Down    → {container_name="kafka-1"} |= "ERROR"
Redis Connection     → {container="..._to_redis"} |= "ConnectionError"
Data Latency         → {container="..._to_redis"} |~ "timeout|stale"

This mapping is version-controlled. When an AI agent encounters a Gatus failure, it knows exactly which logs to query, which metrics to check, and which containers to inspect. The health check is the entry point; the diagnostic mapping is the playbook.
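One way to keep that mapping in the repo is as plain data with a single lookup function — a sketch, not the author’s actual file (the Redis container name is illustrative, since the original is elided above):

```python
# Check name -> first-look Loki query. Container names are illustrative.
DIAGNOSTIC_MAP = {
    "Kafka Broker Down": '{container_name="kafka-1"} |= "ERROR"',
    "Redis Connection": '{container="ingest-to-redis"} |= "ConnectionError"',
}

def loki_query_for(check_name: str) -> str:
    """Resolve a failing Gatus check to the log query to run first."""
    # Fall back to a broad error scan for checks with no specific mapping
    return DIAGNOSTIC_MAP.get(check_name, '{job="docker"} |= "ERROR"')
```

Because it is plain data, both humans and agents consume the same playbook, and changes to it show up in `git diff` like any other code.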

Lessons Learned

Start with business checks, not infrastructure checks. I built the TCP/HTTP checks first and got a green status page while data was stale. The data freshness and metrics completeness checks were the ones that actually caught real incidents.

The Python sidecar pattern scales. Adding a new check means writing one function that returns pass/fail. No YAML templating, no Prometheus recording rules, no Grafana panel configuration.

Failure thresholds matter more than check intervals. Running checks every 30 seconds with a 3-failure threshold means you get alerted after 90 seconds of sustained failure. That’s the right balance between speed and noise.

Send resolution alerts. Every alert should have a clear end. Otherwise your Telegram channel becomes a list of open questions.

Version-control everything. The Gatus config, the Python sidecar, the diagnostic mapping — all in git. When something breaks, you can diff what changed. When you onboard someone (or an AI agent), the playbook is in the repo.