# Agentic Ops: Working Backwards from the Metric That Matters
Start from a single business SLA — data freshness under 60 seconds — and trace backwards through dependency trees, metadata layers, known-error memory, and automated fixes to build an AI-operated production platform.
Critical business operations run 24x7, and ops teams exist to keep them that way. Picture a dashboard showing the latest data point for a critical entity where SLAs matter: if something in the pipeline breaks, every downstream calculation — derived metrics, anomaly scores, decision signals — is working with stale data. In real-time systems, stale data does not just look wrong. It costs money. Stories of decisions made on wrong data are depressingly common.
Consider a project with 20+ microservice containers, three databases, a Kafka cluster, real-time data pipelines, and dozens of health checks. The traditional approach to operations — wake up, check dashboards, triage alerts, SSH into boxes, read logs — does not scale. So I built an agentic ops pipeline where AI agents handle monitoring, diagnosis, and remediation autonomously.
But the insight that changed how I structured the whole system was not about AI. It was about direction. Most ops architectures are built bottom-up: start with infrastructure monitoring, add application metrics, wire up alerting, then hope someone connects the dots to business impact. I built mine top-down: start from the number the end user sees, and trace backwards through every system that produces it.
This post walks through that architecture — from business SLAs, through dependency trees that agents traverse, to self-learning error memory that makes the system faster over time. The tools mentioned in this post can be swapped out for equivalents — the principles hold regardless of whether you use Grafana or Datadog, Supervisor or Kubernetes, OpenAI or Anthropic. The key is the structure and flow of information, not the specific tech.
```mermaid
graph TD
    SLA["Business SLA<br/><i>data freshness < 60s</i>"]
    DEP["Dependency Tree<br/><i>YAML — human-crafted</i>"]
    META["Metadata at Each Node<br/><i>logs · metrics · healthchecks</i>"]
    KE["Known-Error Memory<br/><i>pattern → fix, growing over time</i>"]
    AGENT["AI Agent<br/><i>traverse · diagnose · act</i>"]
    FIX["Automated Fix + Verify"]
    REPORT["Report to Human"]
    LEARN["Update Known Errors"]

    SLA -->|"violation detected"| AGENT
    AGENT -->|"walks"| DEP
    DEP -->|"at each node"| META
    META -->|"matches?"| KE
    KE -->|"known fix"| FIX
    META -->|"novel error"| AGENT
    AGENT -->|"LLM reasoning"| FIX
    FIX --> REPORT
    FIX -->|"new pattern"| LEARN
    LEARN -.->|"grows"| KE
```
## The Metrics That Matter
Before instrumenting anything, I asked: what numbers, if wrong or stale, would break the product? The answer gave me four SLAs that the entire monitoring stack exists to protect:
| Metric | SLA | Window | Why It Matters |
|---|---|---|---|
| Redis data freshness | < 60s | Business hours | User sees stale data, makes wrong decision |
| TSDB raw data | < 30s | Business hours | Analytics pipeline computes on old inputs |
| Derived metrics completeness | > 80% of entities | Business hours | Missing signals, blind spots in downstream systems |
| Consumer lag | < 10,000 messages | Always | Pipeline falling behind, cascading staleness |
These are not aspirational targets. They are the boundaries between “the platform is working” and “the platform is silently lying to you.”
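To make the first SLA concrete, here is a minimal sketch of the freshness check in Python. The threshold mirrors the table above, but the business-hours window, the function names, and the timestamp source are illustrative — in practice the timestamp would come from the newest Redis TimeSeries sample:

```python
from datetime import datetime, time, timezone

# Illustrative constants: the 60s threshold comes from the SLA table above;
# the business-hours window is an assumed placeholder.
FRESHNESS_SLA_SECONDS = 60
BUSINESS_HOURS = (time(8, 0), time(18, 0))

def within_business_hours(now: datetime, window=BUSINESS_HOURS) -> bool:
    """The freshness SLA only applies during business hours."""
    return window[0] <= now.time() <= window[1]

def freshness_violation(last_sample_ts: float, now: datetime) -> bool:
    """True if the newest sample is older than the SLA allows."""
    if not within_business_hours(now):
        return False  # outside the SLA window, staleness is tolerated
    age = now.timestamp() - last_sample_ts
    return age > FRESHNESS_SLA_SECONDS

# Example: a sample written 73 seconds ago during business hours breaches the SLA.
now = datetime(2026, 1, 15, 10, 30, tzinfo=timezone.utc)
print(freshness_violation(now.timestamp() - 73, now))  # → True
```

The window check matters: a freshness alert at 3 a.m., when no upstream feed is publishing, would be pure noise.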
## The Dependency Tree
SLAs are numbers on a page until you can trace what produces them. That is the dependency tree — a YAML file that maps every business metric to its full production chain, from the SLA down to the bare-metal resource.
```mermaid
graph TD
    SLA["🎯 Data Freshness SLA < 60s"]
    CONSUMER["Kafka→Redis Consumer<br/><i>container: primary_kafka_to_redis</i><br/><i>supervisor: primary_processes</i>"]
    TOPIC["Kafka Topic<br/><i>entity_filtered_v1</i><br/><i>min ISR: 2</i>"]
    WS["WebSocket Ingestion<br/><i>container: primary_websocket_kafka</i>"]
    EXT["Upstream Data Feed<br/><i>external — detect only</i>"]
    REDIS["Redis 7 + TimeSeries<br/><i>memory: 30GB threshold</i><br/><i>max clients: 300</i>"]

    SLA --> CONSUMER
    SLA --> REDIS
    CONSUMER --> TOPIC
    TOPIC --> WS
    WS --> EXT

    style SLA fill:#D4943A,color:#1A1814,stroke:#D4943A
    style EXT fill:#6B4C2A,color:#E8E6E1,stroke:#6B4C2A
    style REDIS fill:#2A3B4C,color:#E8E6E1,stroke:#2A3B4C
```
Each node in the tree carries its own observability pointers — the Loki query, Prometheus metrics, and healthcheck method an agent needs to inspect it. Here is the full YAML:
```yaml
# dependency_tree.yaml — the map an agent traverses when something breaks
metrics:
  entity_data_freshness:
    description: "Primary entity data freshness in Redis"
    sla:
      threshold: 60s
      window: "PRIMARY_BUSINESS_HOURS"
      gatus_check: "Data Latency / Redis Freshness Primary"
    depends_on:
      - node: kafka_to_redis_consumer
        type: process
        description: "Consumes filtered events from Kafka, writes to Redis TimeSeries"
        container: primary_kafka_to_redis
        supervisor_group: primary_processes
        supervisor_process: kafka_to_redis_entity
        healthcheck:
          method: consumer_lag
          consumer_group: redis-writer-entity
          lag_threshold: 10000
        logs:
          loki_query: '{container_name="primary_kafka_to_redis"}'
        metrics:
          - kafka_consumer_group_lag{group="redis-writer-entity"}
          - process_cpu_seconds_total{process="kafka_to_redis_entity"}
        depends_on:
          - node: kafka_entity_topic
            type: kafka_topic
            description: "Kafka topic receiving filtered entity events"
            topic: entity_filtered_v1
            cluster: kraft-cluster
            min_isr: 2
            healthcheck:
              method: topic_freshness
              max_age: 30s
            metrics:
              - kafka_topic_partition_current_offset
              - kafka_server_BrokerTopicMetrics_MessagesInPerSec
            depends_on:
              - node: websocket_ingestion
                type: process
                description: "WebSocket client streaming upstream data"
                container: primary_websocket_kafka
                supervisor_group: primary_processes
                healthcheck:
                  method: log_recency
                  max_age: 30s
                logs:
                  loki_query: '{container_name="primary_websocket_kafka"}'
                depends_on:
                  - node: upstream_data_feed
                    type: external
                    description: "Upstream WebSocket data feed"
                    healthcheck:
                      method: tcp_connect
                    notes: "External — cannot fix, only detect and alert"
      - node: redis_instance  # parallel dependency — see diagram above
        type: infrastructure
        # ... healthcheck, alerts, metrics omitted for brevity
```
Read it top to bottom and you see the production chain for a single metric: Redis data freshness depends on a Kafka consumer, which depends on a Kafka topic, which depends on a WebSocket ingestion process, which depends on an upstream data feed. The Redis instance itself is a parallel dependency — the consumer needs both a working source (Kafka) and a working destination (Redis).
Four design decisions make this useful for AI agents:
**Tree, not graph.** The structure is deliberately a tree with named node references, not a full dependency graph. Trees are easy for an agent to traverse linearly: start at the SLA and walk down until you find the broken node. That gives a deterministic path and reins in the chaos of stochastic LLM reasoning. Graphs with cycles and multiple paths are more realistic, but they are much harder for an agent to navigate.

**Agent-drafted, human-verified.** The agent does a first pass over the orchestrator configuration (e.g. Supervisor), inspects each module's inputs and outputs, and drafts this YAML. Service discovery can tell you which containers exist; it cannot tell you that `kafka_to_redis_entity` is the critical path for the data freshness SLA. Human verification of the draft is critical.

**Bidirectional traversal.** An agent can go top-down (“what does this metric depend on?”) or bottom-up (“what metrics break if `kafka-1` goes down?”). The first is for diagnosis; the second is for assessing the impact of a change.

**Observability pointers at every node.** Each node carries its own Loki query, Prometheus metrics, and healthcheck method. The agent does not need a separate mapping file or dashboard lookup — everything it needs to inspect a node is embedded in the node definition.
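Both traversal directions are a few lines of Python once the YAML is loaded. This is a sketch, not the production code: the inline `tree` is a trimmed stand-in for the loaded YAML, and `diagnose` and `impacted_by` are illustrative names:

```python
# Trimmed stand-in for dependency_tree.yaml loaded into nested dicts.
tree = {
    "metrics": {
        "entity_data_freshness": {
            "depends_on": [
                {"node": "kafka_to_redis_consumer", "depends_on": [
                    {"node": "kafka_entity_topic", "depends_on": [
                        {"node": "websocket_ingestion", "depends_on": []},
                    ]},
                ]},
                {"node": "redis_instance", "depends_on": []},
            ],
        },
    },
}

def diagnose(spec: dict, is_healthy) -> list:
    """Top-down: follow the first unhealthy child until the chain ends.
    The deepest node on the returned path is the root-cause candidate."""
    for child in spec.get("depends_on", []):
        if not is_healthy(child):
            return [child["node"]] + diagnose(child, is_healthy)
    return []

def impacted_by(node_name: str, tree: dict) -> list:
    """Bottom-up: which metrics break if node_name goes down?"""
    hits = []
    for metric, spec in tree["metrics"].items():
        stack = list(spec.get("depends_on", []))
        while stack:
            n = stack.pop()
            if n["node"] == node_name:
                hits.append(metric)
                break
            stack.extend(n.get("depends_on", []))
    return hits

# Top-down: the consumer branch is unhealthy, so the walk descends into it.
print(diagnose(tree["metrics"]["entity_data_freshness"],
               lambda n: n["node"] != "kafka_to_redis_consumer"))
# Bottom-up: losing redis_instance breaks the freshness metric.
print(impacted_by("redis_instance", tree))
```

Because the structure is a tree, both functions are simple recursions with no visited-set bookkeeping — exactly the determinism argument made above.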
## Metadata at Every Node
The dependency tree tells the agent where to look. But what does it look for at each node? As the agent traverses the tree, it consumes specific metadata to determine whether that node is the problem or just a waypoint.
Here is a real diagnostic flow. The SLA fires: entity data is 73 seconds stale. Gatus health checks validate this every 30 seconds across seven groups covering 73 endpoints. I wrote a detailed post on why Gatus works well here — its flexible BYOC (Bring Your Own Check) nature lets you code custom checks in Python rather than being limited to HTTP/TCP probes.
Back to the example: on a health-check breach, the agent walks the tree top-down:
```mermaid
graph TD
    SLA["❌ SLA Violated<br/><i>data 73s stale (limit 60s)</i>"]
    C["kafka_to_redis_consumer<br/><i>lag: 45k messages</i>"]
    K["kafka_entity_topic<br/><i>offset advancing normally</i>"]
    LOGS["Consumer Logs<br/><i>'Redis connection pool exhausted'</i>"]
    R["redis_instance<br/><i>clients: 298 / 300</i>"]
    ROOT["🔍 Root Cause<br/><i>Redis client pool exhaustion<br/>→ consumer stalls → lag → stale data</i>"]

    SLA -->|"check consumer"| C
    C -->|"check upstream"| K
    C -->|"read logs"| LOGS
    LOGS -->|"check destination"| R
    K -.-|"✅ healthy"| K
    R -->|"near limit"| ROOT

    style K fill:#2A4C2A,color:#E8E6E1,stroke:#2A4C2A
    style ROOT fill:#D4943A,color:#1A1814,stroke:#D4943A
    style SLA fill:#6B2A2A,color:#E8E6E1,stroke:#6B2A2A
```
The agent starts at the root and walks down:
- **Node: `kafka_to_redis_consumer`** — The agent queries `kafka_consumer_group_lag{group="redis-writer-entity"}`. Lag is 45,000 messages. The node looks unhealthy, but the lag could be caused by upstream starvation or downstream pressure. Check both directions.
- **Node: `kafka_entity_topic`** — The agent checks topic freshness. The latest offset is advancing normally, with new messages arriving every second. Kafka is healthy. The problem is the consumer, not the producer.
- **Agent reads logs** — Following the Loki query from the consumer node, `{container_name="primary_kafka_to_redis"} |= "ERROR"`, the result is: `Redis connection pool exhausted, retrying in 5s`. The consumer is alive but blocked.
- **Node: `redis_instance`** — The agent checks `redis_connected_clients`: 298. The threshold is 300. Redis is one connection away from refusing new clients.

**Root cause identified:** Redis client pool exhaustion is causing the Kafka-to-Redis consumer to stall on write retries, which creates consumer lag, which makes the data stale. The problem is not in the data pipeline — it is in the destination.
The traversal logic at each node is straightforward:
```python
def diagnose_node(node: dict) -> DiagnosisResult:
    """Check a single node in the dependency tree."""
    # 1. Run the node's healthcheck
    health = run_healthcheck(node["healthcheck"])
    if health.ok:
        return DiagnosisResult(node=node["node"], status="healthy")

    # 2. Gather context from logs and metrics
    context = {}
    if "logs" in node:
        context["recent_errors"] = query_loki(
            node["logs"]["loki_query"] + ' |= "ERROR"',
            since="15m",
        )
    if "metrics" in node:
        context["metrics"] = {
            m: query_prometheus(m, since="15m")
            for m in node["metrics"]
        }

    # 3. Check known errors before deep analysis
    # (.get with a default — a node without a logs section has no recent_errors)
    known = match_known_error(context.get("recent_errors", []))
    if known:
        return DiagnosisResult(
            node=node["node"],
            status="known_error",
            error=known,
            fix=known.get("fix"),
        )

    # 4. Return context for LLM reasoning
    return DiagnosisResult(
        node=node["node"],
        status="unhealthy",
        context=context,
    )
```
Notice step 3: `match_known_error`. Before the agent spends tokens reasoning about the logs, it checks whether this error pattern has been seen before. If it has, the agent skips analysis and goes straight to the known fix. That lookup is the next piece of the architecture.
## Known Errors and the Self-Learning Loop
Every time the agent diagnoses and resolves an issue, the pattern is stored in a known-error memory. Next time the same log signature appears, the agent skips the expensive reasoning and executes the fix directly. The knowledge base grows monotonically — the agent gets faster over time, never slower.
The memory format combines multiple signals to avoid false matches:
```yaml
errors:
  - id: ke-001
    signature:
      log_pattern: "Redis connection pool exhausted"
      node: kafka_to_redis_consumer
      metric_condition: "redis_connected_clients > 280"
    root_cause: "Too many concurrent consumers holding Redis connections"
    fix:
      type: automated
      steps:
        - action: docker_exec
          container: redis
          command: "redis-cli CLIENT KILL AGE 300"
          purpose: "kill idle connections older than 5 minutes"
        - action: verify
          check: "redis_connected_clients < 200"
          timeout: 60s
    severity: high
    first_seen: "2026-01-15"
    occurrence_count: 7
    resolution_time_avg: "45s"

  - id: ke-002
    signature:
      log_pattern: "WebSocket disconnected.*reconnecting"
      node: websocket_ingestion
      metric_condition: "ws_reconnection_count increase > 3 in 5m"
    root_cause: "Upstream feed intermittent disconnections during high-volume periods"
    fix:
      type: wait_and_verify
      steps:
        - action: wait
          duration: 60s
          purpose: "allow automatic reconnection logic to recover"
        - action: verify
          check: "ws_messages_received rate > 0"
          timeout: 120s
        - action: escalate_if_failed
          message: "WebSocket not recovering — manual intervention needed"
    severity: medium
    occurrence_count: 23
```
Three things make this effective:
**Multi-signal signatures.** A known error is not just a log grep — it combines a log pattern, a specific node in the dependency tree, and a metric condition. The string “connection pool exhausted” means different things in different contexts. Anchoring it to a node and a metric threshold eliminates false matches.

**Fix types reflect reality.** Not every problem has an automated fix. `automated` means the agent executes the steps directly. `wait_and_verify` handles transient issues where built-in retry logic usually recovers — the agent waits, then checks. `escalate` means the agent has learned that this problem requires a human. All three are useful knowledge.

**Controlled agent freedom.** For parts of the codebase not tied to core infrastructure, the agent can raise merge requests and deploy fixes itself after testing — for example, when it learns that a specific error comes down to a wrong variable name or a permissions issue.
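A minimal sketch of what the multi-signal lookup might look like, assuming the YAML entries above have been loaded into dicts. `metric_true` is a stand-in for evaluating a condition string like `redis_connected_clients > 280` against live Prometheus data:

```python
import re

def match_known_error(errors, node, log_lines, metric_true):
    """Return the first entry whose full signature matches: the log pattern,
    the dependency-tree node, AND the metric condition must all agree."""
    for ke in errors:
        sig = ke["signature"]
        if sig["node"] != node:
            continue  # same string at a different node is a different error
        if not any(re.search(sig["log_pattern"], line) for line in log_lines):
            continue
        cond = sig.get("metric_condition")
        if cond and not metric_true(cond):
            continue  # logs match but metrics disagree: do not auto-fix
        return ke
    return None

# ke-001 from the YAML above, as a dict.
ke_001 = {
    "id": "ke-001",
    "signature": {
        "log_pattern": "Redis connection pool exhausted",
        "node": "kafka_to_redis_consumer",
        "metric_condition": "redis_connected_clients > 280",
    },
    "fix": {"type": "automated"},
}

logs = ["ERROR Redis connection pool exhausted, retrying in 5s"]
hit = match_known_error([ke_001], "kafka_to_redis_consumer", logs, lambda c: True)
print(hit["id"])  # → ke-001
```

All three signals are conjunctive on purpose: dropping any one of them reintroduces the false-match problem the signature format exists to solve.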
A human operator builds intuition over months — which alerts are noise, which need immediate action, which resolve themselves. An agent builds the same knowledge base in the same timeframe, but it never forgets a pattern, never second-guesses a proven fix, and shares its memory across every future session. After three months, the known-error memory has about 30 entries. That is the long tail of operational knowledge that usually lives in one person’s head and walks out the door when they leave.
## Automated Fixes and the Closed Loop
For a defined subset of the infrastructure, the agent has full autonomy to detect, diagnose, fix, verify, and report — with the human reviewing after the fact rather than approving before. The key word is “defined.” Not everything gets automated. The boundary is explicit:
| Action | Autonomy Level | Example |
|---|---|---|
| Read logs, metrics, status | Full autonomy | Query Loki, Prometheus, supervisord |
| Restart a process (not container) | After 5 min failure | `supervisorctl restart kafka_to_redis_entity` |
| Config tweak (non-persistent) | Known patterns only | `redis-cli CONFIG SET activedefrag yes` |
| Container restart | Escalate to human | Agent proposes via Telegram, human approves |
| Data deletion | Never autonomous | Always human |
| Kafka cluster operations | Never autonomous | Always human |
The distinction between “restart a process” and “restart a container” matters. A supervisord process restart is surgical — it affects one pipeline stage. A container restart kills every process in that container, potentially including healthy ones. The agent has autonomy for the first, not the second.
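The boundary table can be encoded as a policy gate the agent consults before any action. This is a sketch under assumptions: the action names, enum levels, and the 5-minute conditional rule mirror the table above but are not the production policy:

```python
from enum import Enum

class Autonomy(Enum):
    FULL = "full"                # read-only access: always allowed
    CONDITIONAL = "conditional"  # e.g. process restart after 5 min of failure
    PROPOSE = "propose"          # agent proposes via Telegram, human approves
    NEVER = "never"              # always a human decision

# Illustrative policy table mirroring the autonomy boundary above.
POLICY = {
    "read_observability": Autonomy.FULL,
    "process_restart": Autonomy.CONDITIONAL,
    "config_set_nonpersistent": Autonomy.CONDITIONAL,
    "container_restart": Autonomy.PROPOSE,
    "data_delete": Autonomy.NEVER,
    "kafka_cluster_op": Autonomy.NEVER,
}

def may_execute(action: str, failing_for_s: float = 0.0) -> bool:
    """Gate every agent action through the policy before execution."""
    level = POLICY.get(action, Autonomy.NEVER)  # unknown actions: default deny
    if level is Autonomy.FULL:
        return True
    if level is Autonomy.CONDITIONAL:
        return failing_for_s >= 300  # only after 5 minutes of sustained failure
    return False  # PROPOSE and NEVER both stop the agent here

print(may_execute("process_restart", failing_for_s=400))  # → True
```

The default-deny lookup is the important design choice: an action the policy has never heard of is treated like `data_delete`, not like a read.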
Here is what a typical closed-loop resolution looks like in practice:
```mermaid
graph LR
    ALERT["Gatus Alert"]
    TREE["Walk Dependency<br/>Tree"]
    MATCH["Match Known<br/>Error"]
    FIX["Execute Fix"]
    VERIFY["Verify<br/>Metrics"]
    NOTIFY["Notify<br/>Human"]
    MEMORY["Update<br/>Memory"]

    ALERT --> TREE --> MATCH --> FIX --> VERIFY --> NOTIFY
    VERIFY --> MEMORY
    MEMORY -.->|"next time"| MATCH

    style ALERT fill:#6B2A2A,color:#E8E6E1,stroke:#6B2A2A
    style VERIFY fill:#2A4C2A,color:#E8E6E1,stroke:#2A4C2A
    style NOTIFY fill:#D4943A,color:#1A1814,stroke:#D4943A
```
- **Gatus check fires:** Redis memory usage exceeds 80%.
- **Agent traverses dependency tree:** finds the `redis_instance` node, gathers metrics.
- **Agent matches known error:** ke-003, memory fragmentation. Occurrence count is 4, and all previous resolutions succeeded.
- **Agent executes fix:** `docker exec redis redis-cli CONFIG SET activedefrag yes`.
- **Agent verifies:** polls `redis_mem_fragmentation_ratio` for 5 minutes. The ratio drops from 1.8 to 1.1.
- **Agent reports:** sends a Telegram message with the full resolution chain — what fired, what it found, what it did, what the metrics look like now.
- **Agent updates memory:** increments `occurrence_count` to 5, updates `last_seen`.
I see the notification, glance at the before/after metrics, and move on. The entire cycle took under 5 minutes. The same diagnosis would have taken me 30+ minutes of manual investigation — opening Grafana, querying Loki, checking Redis CLI, figuring out the fix, verifying it worked.
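The verify step in that loop is just a bounded poll: query the metric, check the post-fix condition, give up at a timeout. A hedged sketch, where `read_metric` and `predicate` stand in for a Prometheus query and the known-error `check` expression:

```python
import time

def verify_fix(read_metric, predicate, timeout_s=300.0, poll_s=10.0) -> bool:
    """Poll a metric until the post-fix condition holds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate(read_metric()):
            return True  # fix verified; safe to report success
        time.sleep(poll_s)
    return False  # condition never held; escalate instead of claiming success

# Example: fragmentation ratio recovering after activedefrag is enabled.
samples = iter([1.8, 1.5, 1.1])
print(verify_fix(lambda: next(samples), lambda v: v < 1.2,
                 timeout_s=1.0, poll_s=0.0))  # → True
```

Returning `False` on timeout matters: an unverified fix must be treated as a failure, otherwise the known-error memory would record successes that never happened.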
For incidents the agent cannot resolve — novel errors, problems in the Kafka cluster, anything requiring data-level decisions — it creates a GitHub issue with the full diagnostic context: which SLA was violated, which dependency tree path it traversed, what logs and metrics it collected, and what it tried. When I pick up the issue, the investigation is already done. I just need to make the decision.
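The escalation payload can be as simple as a dict carrying the full diagnostic trail. A sketch with illustrative field names — this is not the GitHub API schema, just the shape of the context the agent hands over:

```python
import json

def escalation_issue(sla, path, evidence, attempted):
    """Assemble the diagnostic context for a human-facing issue."""
    return {
        "title": f"[agent-escalation] SLA violated: {sla}",
        "body": json.dumps({
            "sla": sla,
            "dependency_path": path,       # nodes the agent traversed
            "evidence": evidence,          # logs and metrics it collected
            "attempted_fixes": attempted,  # what it tried before giving up
        }, indent=2),
        "labels": ["ops", "agent-escalation"],
    }

issue = escalation_issue(
    "entity_data_freshness",
    ["kafka_to_redis_consumer", "redis_instance"],
    {"consumer_lag": 45000, "redis_connected_clients": 298},
    ["CLIENT KILL AGE 300 (no effect)"],
)
print(issue["title"])  # → [agent-escalation] SLA violated: entity_data_freshness
```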
## What Changes When You Work Backwards
Working backwards from business metrics changes how you think about operations.
You stop instrumenting things because they are instrumentable and start instrumenting things because they protect an SLA. The dependency tree and known-error memory are not just nice-to-haves. They are the scaffolding that makes the whole system navigable for an agent.
The agent handles roughly 90% of alerts end-to-end. The remaining 10% are genuinely novel — new failure modes, edge cases in upstream data, infrastructure changes that the dependency tree has not been updated to reflect. That ratio improves as the known-error memory grows and as the dependency tree covers more metrics.
## Roadmap
The roadmap is not complete, but here are some initial ideas I am actively working on to evolve this architecture:
- **Proactive anomaly detection** — The dependency tree gives you the exact metric set to train a lightweight model on, which means you can detect drift before it becomes an SLA violation.
- **Multi-agent coordination** — One agent per dependency tree branch, with a coordinator that handles cross-branch failures. The pieces are there, but the coordination protocol needs work.
- **Shared known-error memory across projects** — The same patterns (Redis fragmentation, consumer lag spikes, WebSocket reconnection storms) appear in every real-time platform. Making this memory a shared artifact pays dividends.
- **Cost control and guardrails** — As the agent takes more actions, you need budget limits, action rate limits, and a kill switch to stop runaway costs if things go sideways. Using open-source LLMs is also on the roadmap.