# Agentic Ops: Working Backwards from the Metric That Matters
Start from a single business SLA — data freshness under 60 seconds — and trace backwards through dependency trees, metadata layers, known-error memory, and automated fixes to build an AI-operated production platform.
Critical business operations run 24x7, and ops teams exist to keep them that way. Picture a dashboard showing the latest data point for a critical entity where SLAs matter: if something in the pipeline breaks, every downstream calculation — derived metrics, anomaly scores, decision signals — is working with stale data. In real-time systems, stale data does not just look wrong. It costs money. Stories of decisions made on wrong data are depressingly common.
Consider a project with 20+ microservice containers, three databases, a Kafka cluster, real-time data pipelines, and dozens of health checks. The traditional approach to operations — wake up, check dashboards, triage alerts, SSH into boxes, read logs — does not scale. So I built an agentic ops pipeline where AI agents handle monitoring, diagnosis, and remediation autonomously.
But the insight that changed how I structured the whole system was not about AI. It was about direction. Most ops architectures are built bottom-up: start with infrastructure monitoring, add application metrics, wire up alerting, then hope someone connects the dots to business impact. I built mine top-down: start from the number the end user sees, and trace backwards through every system that produces it.
This post walks through that architecture — from business SLAs, through dependency trees that agents traverse, to self-learning error memory that makes the system faster over time. The tools mentioned in this post can be swapped out for equivalents — the principles hold regardless of whether you use Grafana or Datadog, Supervisor or Kubernetes, OpenAI or Anthropic. The key is the structure and flow of information, not the specific tech.
```mermaid
graph TD
    SLA["Business SLA<br/><i>data freshness < 60s</i>"]
    DEP["Dependency Tree<br/><i>YAML — human-crafted</i>"]
    META["Metadata at Each Node<br/><i>logs · metrics · healthchecks</i>"]
    KE["Known-Error Memory<br/><i>pattern → fix, growing over time</i>"]
    AGENT["AI Agent<br/><i>traverse · diagnose · act</i>"]
    FIX["Automated Fix + Verify"]
    REPORT["Report to Human"]
    LEARN["Update Known Errors"]

    SLA -->|"violation detected"| AGENT
    AGENT -->|"walks"| DEP
    DEP -->|"at each node"| META
    META -->|"matches?"| KE
    KE -->|"known fix"| FIX
    META -->|"novel error"| AGENT
    AGENT -->|"LLM reasoning"| FIX
    FIX --> REPORT
    FIX -->|"new pattern"| LEARN
    LEARN -.->|"grows"| KE
```
## The Metrics That Matter
Before instrumenting anything, I asked: what numbers, if wrong or stale, would break the product? The answer gave me four SLAs that the entire monitoring stack exists to protect:
| Metric | SLA | Window | Why It Matters |
|---|---|---|---|
| Redis data freshness | < 60s | Business hours | User sees stale data, makes wrong decision |
| TSDB raw data | < 30s | Business hours | Analytics pipeline computes on old inputs |
| Derived metrics completeness | > 80% of entities | Business hours | Missing signals, blind spots in downstream systems |
| Consumer lag | < 10,000 messages | Always | Pipeline falling behind, cascading staleness |
These are not aspirational targets. They are the boundaries between “the platform is working” and “the platform is silently lying to you.”
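To make the first SLA concrete, here is a minimal sketch of the freshness check in Python. The threshold mirrors the table above, but the business-hours window, the function names, and the timestamp source are illustrative — in practice the timestamp would come from the newest Redis TimeSeries sample:

```python
from datetime import datetime, time, timezone

# Illustrative constants: the 60s threshold comes from the SLA table above;
# the business-hours window is an assumed placeholder.
FRESHNESS_SLA_SECONDS = 60
BUSINESS_HOURS = (time(8, 0), time(18, 0))

def within_business_hours(now: datetime, window=BUSINESS_HOURS) -> bool:
    """The freshness SLA only applies during business hours."""
    return window[0] <= now.time() <= window[1]

def freshness_violation(last_sample_ts: float, now: datetime) -> bool:
    """True if the newest sample is older than the SLA allows."""
    if not within_business_hours(now):
        return False  # outside the SLA window, staleness is tolerated
    age = now.timestamp() - last_sample_ts
    return age > FRESHNESS_SLA_SECONDS

# Example: a sample written 73 seconds ago during business hours breaches the SLA.
now = datetime(2026, 1, 15, 10, 30, tzinfo=timezone.utc)
print(freshness_violation(now.timestamp() - 73, now))  # → True
```

The window check matters: a freshness alert at 3 a.m., when no upstream feed is publishing, would be pure noise.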
## The Dependency Tree
SLAs are numbers on a page until you can trace what produces them. That is the dependency tree — a YAML file that maps every business metric to its full production chain, from the SLA down to the bare-metal resource.
```mermaid
graph TD
    SLA["🎯 Data Freshness SLA < 60s"]
    CONSUMER["Kafka→Redis Consumer<br/><i>container: primary_kafka_to_redis</i><br/><i>supervisor: primary_processes</i>"]
    TOPIC["Kafka Topic<br/><i>entity_filtered_v1</i><br/><i>min ISR: 2</i>"]
    WS["WebSocket Ingestion<br/><i>container: primary_websocket_kafka</i>"]
    EXT["Upstream Data Feed<br/><i>external — detect only</i>"]
    REDIS["Redis 7 + TimeSeries<br/><i>memory: 30GB threshold</i><br/><i>max clients: 300</i>"]

    SLA --> CONSUMER
    SLA --> REDIS
    CONSUMER --> TOPIC
    TOPIC --> WS
    WS --> EXT

    style SLA fill:#D4943A,color:#1A1814,stroke:#D4943A
    style EXT fill:#6B4C2A,color:#E8E6E1,stroke:#6B4C2A
    style REDIS fill:#2A3B4C,color:#E8E6E1,stroke:#2A3B4C
```
Each node in the tree carries its own observability pointers — the Loki query, Prometheus metrics, and healthcheck method an agent needs to inspect it. Here is the full YAML:
```yaml
# dependency_tree.yaml — the map an agent traverses when something breaks
metrics:
  entity_data_freshness:
    description: "Primary entity data freshness in Redis"
    sla:
      threshold: 60s
      window: "PRIMARY_BUSINESS_HOURS"
      gatus_check: "Data Latency / Redis Freshness Primary"
    depends_on:
      - node: kafka_to_redis_consumer
        type: process
        description: "Consumes filtered events from Kafka, writes to Redis TimeSeries"
        container: primary_kafka_to_redis
        supervisor_group: primary_processes
        supervisor_process: kafka_to_redis_entity
        healthcheck:
          method: consumer_lag
          consumer_group: redis-writer-entity
          lag_threshold: 10000
        logs:
          loki_query: '{container_name="primary_kafka_to_redis"}'
        metrics:
          - kafka_consumer_group_lag{group="redis-writer-entity"}
          - process_cpu_seconds_total{process="kafka_to_redis_entity"}
        depends_on:
          - node: kafka_entity_topic
            type: kafka_topic
            description: "Kafka topic receiving filtered entity events"
            topic: entity_filtered_v1
            cluster: kraft-cluster
            min_isr: 2
            healthcheck:
              method: topic_freshness
              max_age: 30s
            metrics:
              - kafka_topic_partition_current_offset
              - kafka_server_BrokerTopicMetrics_MessagesInPerSec
            depends_on:
              - node: websocket_ingestion
                type: process
                description: "WebSocket client streaming upstream data"
                container: primary_websocket_kafka
                supervisor_group: primary_processes
                healthcheck:
                  method: log_recency
                  max_age: 30s
                logs:
                  loki_query: '{container_name="primary_websocket_kafka"}'
                depends_on:
                  - node: upstream_data_feed
                    type: external
                    description: "Upstream WebSocket data feed"
                    healthcheck:
                      method: tcp_connect
                    notes: "External — cannot fix, only detect and alert"
      - node: redis_instance  # parallel dependency — see diagram above
        type: infrastructure
        # ... healthcheck, alerts, metrics omitted for brevity
```
Read it top to bottom and you see the production chain for a single metric: Redis data freshness depends on a Kafka consumer, which depends on a Kafka topic, which depends on a WebSocket ingestion process, which depends on an upstream data feed. The Redis instance itself is a parallel dependency — the consumer needs both a working source (Kafka) and a working destination (Redis).
Four design decisions make this useful for AI agents:
**Tree, not graph.** The structure is deliberately a tree with named node references, not a full dependency graph. Trees are easy for an agent to traverse linearly: start at the SLA and walk down until you find the broken node. That gives a deterministic path and reins in the chaos of stochastic LLM reasoning. Graphs with cycles and multiple paths are more realistic, but they are much harder for an agent to navigate.

**Agent-drafted, human-verified.** The agent does a first pass over the orchestrator configuration (e.g. Supervisor), inspects each module's inputs and outputs, and drafts this YAML. Service discovery can tell you which containers exist; it cannot tell you that `kafka_to_redis_entity` is the critical path for the data freshness SLA. Human verification of the draft is critical.

**Bidirectional traversal.** An agent can go top-down (“what does this metric depend on?”) or bottom-up (“what metrics break if `kafka-1` goes down?”). The first is for diagnosis; the second is for assessing the impact of a change.

**Observability pointers at every node.** Each node carries its own Loki query, Prometheus metrics, and healthcheck method. The agent does not need a separate mapping file or dashboard lookup — everything it needs to inspect a node is embedded in the node definition.
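Both traversal directions are a few lines of Python once the YAML is loaded. This is a sketch, not the production code: the inline `tree` is a trimmed stand-in for the loaded YAML, and `diagnose` and `impacted_by` are illustrative names:

```python
# Trimmed stand-in for dependency_tree.yaml loaded into nested dicts.
tree = {
    "metrics": {
        "entity_data_freshness": {
            "depends_on": [
                {"node": "kafka_to_redis_consumer", "depends_on": [
                    {"node": "kafka_entity_topic", "depends_on": [
                        {"node": "websocket_ingestion", "depends_on": []},
                    ]},
                ]},
                {"node": "redis_instance", "depends_on": []},
            ],
        },
    },
}

def diagnose(spec: dict, is_healthy) -> list:
    """Top-down: follow the first unhealthy child until the chain ends.
    The deepest node on the returned path is the root-cause candidate."""
    for child in spec.get("depends_on", []):
        if not is_healthy(child):
            return [child["node"]] + diagnose(child, is_healthy)
    return []

def impacted_by(node_name: str, tree: dict) -> list:
    """Bottom-up: which metrics break if node_name goes down?"""
    hits = []
    for metric, spec in tree["metrics"].items():
        stack = list(spec.get("depends_on", []))
        while stack:
            n = stack.pop()
            if n["node"] == node_name:
                hits.append(metric)
                break
            stack.extend(n.get("depends_on", []))
    return hits

# Top-down: the consumer branch is unhealthy, so the walk descends into it.
print(diagnose(tree["metrics"]["entity_data_freshness"],
               lambda n: n["node"] != "kafka_to_redis_consumer"))
# Bottom-up: losing redis_instance breaks the freshness metric.
print(impacted_by("redis_instance", tree))
```

Because the structure is a tree, both functions are simple recursions with no visited-set bookkeeping — exactly the determinism argument made above.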
## Metadata at Every Node
The dependency tree tells the agent where to look. But what does it look for at each node? As the agent traverses the tree, it consumes specific metadata to determine whether that node is the problem or just a waypoint.
Here is a real diagnostic flow. The SLA fires: entity data is 73 seconds stale. Gatus health checks validate this every 30 seconds across seven groups covering 73 endpoints. I wrote a detailed post on why Gatus works well here — its flexible BYOC (Bring Your Own Check) nature lets you code custom checks in Python rather than being limited to HTTP/TCP probes.
Back to the example: on a health-check breach, the agent walks the tree top-down:
```mermaid
graph TD
    SLA["❌ SLA Violated<br/><i>data 73s stale (limit 60s)</i>"]
    C["kafka_to_redis_consumer<br/><i>lag: 45k messages</i>"]
    K["kafka_entity_topic<br/><i>offset advancing normally</i>"]
    LOGS["Consumer Logs<br/><i>'Redis connection pool exhausted'</i>"]
    R["redis_instance<br/><i>clients: 298 / 300</i>"]
    ROOT["🔍 Root Cause<br/><i>Redis client pool exhaustion<br/>→ consumer stalls → lag → stale data</i>"]

    SLA -->|"check consumer"| C
    C -->|"check upstream"| K
    C -->|"read logs"| LOGS
    LOGS -->|"check destination"| R
    K -.-|"✅ healthy"| K
    R -->|"near limit"| ROOT

    style K fill:#2A4C2A,color:#E8E6E1,stroke:#2A4C2A
    style ROOT fill:#D4943A,color:#1A1814,stroke:#D4943A
    style SLA fill:#6B2A2A,color:#E8E6E1,stroke:#6B2A2A
```
The agent starts at the root and walks down:
- **Node: `kafka_to_redis_consumer`** — The agent queries `kafka_consumer_group_lag{group="redis-writer-entity"}`. Lag is 45,000 messages. The node looks unhealthy, but the lag could be caused by upstream starvation or downstream pressure. Check both directions.
- **Node: `kafka_entity_topic`** — The agent checks topic freshness. The latest offset is advancing normally, with new messages arriving every second. Kafka is healthy. The problem is the consumer, not the producer.
- **Agent reads logs** — Following the Loki query from the consumer node, `{container_name="primary_kafka_to_redis"} |= "ERROR"`, the result is: `Redis connection pool exhausted, retrying in 5s`. The consumer is alive but blocked.
- **Node: `redis_instance`** — The agent checks `redis_connected_clients`: 298. The threshold is 300. Redis is one connection away from refusing new clients.

**Root cause identified:** Redis client pool exhaustion is causing the Kafka-to-Redis consumer to stall on write retries, which creates consumer lag, which makes the data stale. The problem is not in the data pipeline — it is in the destination.
The traversal logic at each node is straightforward:
```python
def diagnose_node(node: dict) -> DiagnosisResult:
    """Check a single node in the dependency tree."""
    # 1. Run the node's healthcheck
    health = run_healthcheck(node["healthcheck"])
    if health.ok:
        return DiagnosisResult(node=node["node"], status="healthy")

    # 2. Gather context from logs and metrics
    context = {}
    if "logs" in node:
        context["recent_errors"] = query_loki(
            node["logs"]["loki_query"] + ' |= "ERROR"',
            since="15m",
        )
    if "metrics" in node:
        context["metrics"] = {
            m: query_prometheus(m, since="15m")
            for m in node["metrics"]
        }

    # 3. Check known errors before deep analysis
    # (.get with a default — a node without a logs section has no recent_errors)
    known = match_known_error(context.get("recent_errors", []))
    if known:
        return DiagnosisResult(
            node=node["node"],
            status="known_error",
            error=known,
            fix=known.get("fix"),
        )

    # 4. Return context for LLM reasoning
    return DiagnosisResult(
        node=node["node"],
        status="unhealthy",
        context=context,
    )
```
Notice step 3: `match_known_error`. Before the agent spends tokens reasoning about the logs, it checks whether this error pattern has been seen before. If it has, the agent skips analysis and goes straight to the known fix. That lookup is the next piece of the architecture.
## Known Errors and the Self-Learning Loop
Every time the agent diagnoses and resolves an issue, the pattern is stored in a known-error memory. Next time the same log signature appears, the agent skips the expensive reasoning and executes the fix directly. The knowledge base grows monotonically — the agent gets faster over time, never slower.
The memory format combines multiple signals to avoid false matches:
```yaml
errors:
  - id: ke-001
    signature:
      log_pattern: "Redis connection pool exhausted"
      node: kafka_to_redis_consumer
      metric_condition: "redis_connected_clients > 280"
    root_cause: "Too many concurrent consumers holding Redis connections"
    fix:
      type: automated
      steps:
        - action: docker_exec
          container: redis
          command: "redis-cli CLIENT KILL AGE 300"
          purpose: "kill idle connections older than 5 minutes"
        - action: verify
          check: "redis_connected_clients < 200"
          timeout: 60s
    severity: high
    first_seen: "2026-01-15"
    occurrence_count: 7
    resolution_time_avg: "45s"

  - id: ke-002
    signature:
      log_pattern: "WebSocket disconnected.*reconnecting"
      node: websocket_ingestion
      metric_condition: "ws_reconnection_count increase > 3 in 5m"
    root_cause: "Upstream feed intermittent disconnections during high-volume periods"
    fix:
      type: wait_and_verify
      steps:
        - action: wait
          duration: 60s
          purpose: "allow automatic reconnection logic to recover"
        - action: verify
          check: "ws_messages_received rate > 0"
          timeout: 120s
        - action: escalate_if_failed
          message: "WebSocket not recovering — manual intervention needed"
    severity: medium
    occurrence_count: 23
```
Three things make this effective:
**Multi-signal signatures.** A known error is not just a log grep — it combines a log pattern, a specific node in the dependency tree, and a metric condition. The string “connection pool exhausted” means different things in different contexts. Anchoring it to a node and a metric threshold eliminates false matches.

**Fix types reflect reality.** Not every problem has an automated fix. `automated` means the agent executes the steps directly. `wait_and_verify` handles transient issues where built-in retry logic usually recovers — the agent waits, then checks. `escalate` means the agent has learned that this problem requires a human. All three are useful knowledge.

**Controlled agent freedom.** For parts of the codebase not tied to core infrastructure, the agent can raise merge requests and deploy fixes itself after testing — for example, when it learns that a specific error comes down to a wrong variable name or a permissions issue.
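A minimal sketch of what the multi-signal lookup might look like, assuming the YAML entries above have been loaded into dicts. `metric_true` is a stand-in for evaluating a condition string like `redis_connected_clients > 280` against live Prometheus data:

```python
import re

def match_known_error(errors, node, log_lines, metric_true):
    """Return the first entry whose full signature matches: the log pattern,
    the dependency-tree node, AND the metric condition must all agree."""
    for ke in errors:
        sig = ke["signature"]
        if sig["node"] != node:
            continue  # same string at a different node is a different error
        if not any(re.search(sig["log_pattern"], line) for line in log_lines):
            continue
        cond = sig.get("metric_condition")
        if cond and not metric_true(cond):
            continue  # logs match but metrics disagree: do not auto-fix
        return ke
    return None

# ke-001 from the YAML above, as a dict.
ke_001 = {
    "id": "ke-001",
    "signature": {
        "log_pattern": "Redis connection pool exhausted",
        "node": "kafka_to_redis_consumer",
        "metric_condition": "redis_connected_clients > 280",
    },
    "fix": {"type": "automated"},
}

logs = ["ERROR Redis connection pool exhausted, retrying in 5s"]
hit = match_known_error([ke_001], "kafka_to_redis_consumer", logs, lambda c: True)
print(hit["id"])  # → ke-001
```

All three signals are conjunctive on purpose: dropping any one of them reintroduces the false-match problem the signature format exists to solve.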
A human operator builds intuition over months — which alerts are noise, which need immediate action, which resolve themselves. An agent builds the same knowledge base in the same timeframe, but it never forgets a pattern, never second-guesses a proven fix, and shares its memory across every future session. After three months, the known-error memory has about 30 entries. That is the long tail of operational knowledge that usually lives in one person’s head and walks out the door when they leave.
## Automated Fixes and the Closed Loop
For a defined subset of the infrastructure, the agent has full autonomy to detect, diagnose, fix, verify, and report — with the human reviewing after the fact rather than approving before. The key word is “defined.” Not everything gets automated. The boundary is explicit:
| Action | Autonomy Level | Example |
|---|---|---|
| Read logs, metrics, status | Full autonomy | Query Loki, Prometheus, supervisord |
| Restart a process (not container) | After 5 min failure | `supervisorctl restart kafka_to_redis_entity` |
| Config tweak (non-persistent) | Known patterns only | `redis-cli CONFIG SET activedefrag yes` |
| Container restart | Escalate to human | Agent proposes via Telegram, human approves |
| Data deletion | Never autonomous | Always human |
| Kafka cluster operations | Never autonomous | Always human |
The distinction between “restart a process” and “restart a container” matters. A supervisord process restart is surgical — it affects one pipeline stage. A container restart kills every process in that container, potentially including healthy ones. The agent has autonomy for the first, not the second.
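The boundary table can be encoded as a policy gate the agent consults before any action. This is a sketch under assumptions: the action names, enum levels, and the 5-minute conditional rule mirror the table above but are not the production policy:

```python
from enum import Enum

class Autonomy(Enum):
    FULL = "full"                # read-only access: always allowed
    CONDITIONAL = "conditional"  # e.g. process restart after 5 min of failure
    PROPOSE = "propose"          # agent proposes via Telegram, human approves
    NEVER = "never"              # always a human decision

# Illustrative policy table mirroring the autonomy boundary above.
POLICY = {
    "read_observability": Autonomy.FULL,
    "process_restart": Autonomy.CONDITIONAL,
    "config_set_nonpersistent": Autonomy.CONDITIONAL,
    "container_restart": Autonomy.PROPOSE,
    "data_delete": Autonomy.NEVER,
    "kafka_cluster_op": Autonomy.NEVER,
}

def may_execute(action: str, failing_for_s: float = 0.0) -> bool:
    """Gate every agent action through the policy before execution."""
    level = POLICY.get(action, Autonomy.NEVER)  # unknown actions: default deny
    if level is Autonomy.FULL:
        return True
    if level is Autonomy.CONDITIONAL:
        return failing_for_s >= 300  # only after 5 minutes of sustained failure
    return False  # PROPOSE and NEVER both stop the agent here

print(may_execute("process_restart", failing_for_s=400))  # → True
```

The default-deny lookup is the important design choice: an action the policy has never heard of is treated like `data_delete`, not like a read.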
Here is what a typical closed-loop resolution looks like in practice:
```mermaid
graph LR
    ALERT["Gatus Alert"]
    TREE["Walk Dependency<br/>Tree"]
    MATCH["Match Known<br/>Error"]
    FIX["Execute Fix"]
    VERIFY["Verify<br/>Metrics"]
    NOTIFY["Notify<br/>Human"]
    MEMORY["Update<br/>Memory"]

    ALERT --> TREE --> MATCH --> FIX --> VERIFY --> NOTIFY
    VERIFY --> MEMORY
    MEMORY -.->|"next time"| MATCH

    style ALERT fill:#6B2A2A,color:#E8E6E1,stroke:#6B2A2A
    style VERIFY fill:#2A4C2A,color:#E8E6E1,stroke:#2A4C2A
    style NOTIFY fill:#D4943A,color:#1A1814,stroke:#D4943A
```
- **Gatus check fires:** Redis memory usage exceeds 80%.
- **Agent traverses dependency tree:** finds the `redis_instance` node, gathers metrics.
- **Agent matches known error:** ke-003, memory fragmentation. Occurrence count is 4, and all previous resolutions succeeded.
- **Agent executes fix:** `docker exec redis redis-cli CONFIG SET activedefrag yes`.
- **Agent verifies:** polls `redis_mem_fragmentation_ratio` for 5 minutes. The ratio drops from 1.8 to 1.1.
- **Agent reports:** sends a Telegram message with the full resolution chain — what fired, what it found, what it did, what the metrics look like now.
- **Agent updates memory:** increments `occurrence_count` to 5, updates `last_seen`.
I see the notification, glance at the before/after metrics, and move on. The entire cycle took under 5 minutes. The same diagnosis would have taken me 30+ minutes of manual investigation — opening Grafana, querying Loki, checking Redis CLI, figuring out the fix, verifying it worked.
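The verify step in that loop is just a bounded poll: query the metric, check the post-fix condition, give up at a timeout. A hedged sketch, where `read_metric` and `predicate` stand in for a Prometheus query and the known-error `check` expression:

```python
import time

def verify_fix(read_metric, predicate, timeout_s=300.0, poll_s=10.0) -> bool:
    """Poll a metric until the post-fix condition holds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate(read_metric()):
            return True  # fix verified; safe to report success
        time.sleep(poll_s)
    return False  # condition never held; escalate instead of claiming success

# Example: fragmentation ratio recovering after activedefrag is enabled.
samples = iter([1.8, 1.5, 1.1])
print(verify_fix(lambda: next(samples), lambda v: v < 1.2,
                 timeout_s=1.0, poll_s=0.0))  # → True
```

Returning `False` on timeout matters: an unverified fix must be treated as a failure, otherwise the known-error memory would record successes that never happened.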
For incidents the agent cannot resolve — novel errors, problems in the Kafka cluster, anything requiring data-level decisions — it creates a GitHub issue with the full diagnostic context: which SLA was violated, which dependency tree path it traversed, what logs and metrics it collected, and what it tried. When I pick up the issue, the investigation is already done. I just need to make the decision.
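The escalation payload can be as simple as a dict carrying the full diagnostic trail. A sketch with illustrative field names — this is not the GitHub API schema, just the shape of the context the agent hands over:

```python
import json

def escalation_issue(sla, path, evidence, attempted):
    """Assemble the diagnostic context for a human-facing issue."""
    return {
        "title": f"[agent-escalation] SLA violated: {sla}",
        "body": json.dumps({
            "sla": sla,
            "dependency_path": path,       # nodes the agent traversed
            "evidence": evidence,          # logs and metrics it collected
            "attempted_fixes": attempted,  # what it tried before giving up
        }, indent=2),
        "labels": ["ops", "agent-escalation"],
    }

issue = escalation_issue(
    "entity_data_freshness",
    ["kafka_to_redis_consumer", "redis_instance"],
    {"consumer_lag": 45000, "redis_connected_clients": 298},
    ["CLIENT KILL AGE 300 (no effect)"],
)
print(issue["title"])  # → [agent-escalation] SLA violated: entity_data_freshness
```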
## What Changes When You Work Backwards
Working backwards from business metrics changes how you think about operations.
You stop instrumenting things because they are instrumentable and start instrumenting things because they protect an SLA. The dependency tree and known-error memory are not just nice-to-haves. They are the scaffolding that makes the whole system navigable for an agent.
The agent handles roughly 90% of alerts end-to-end. The remaining 10% are genuinely novel — new failure modes, edge cases in upstream data, infrastructure changes that the dependency tree has not been updated to reflect. That ratio improves as the known-error memory grows and as the dependency tree covers more metrics.
## Roadmap
The roadmap is not complete, but here are some initial ideas I am actively working on to evolve this architecture:
- **Proactive anomaly detection** — The dependency tree gives you the exact metric set to train a lightweight model on, which means you can detect drift before it becomes an SLA violation.
- **Multi-agent coordination** — One agent per dependency tree branch, with a coordinator that handles cross-branch failures. The pieces are there, but the coordination protocol needs work.
- **Shared known-error memory across projects** — The same patterns (Redis fragmentation, consumer lag spikes, WebSocket reconnection storms) appear in every real-time platform. Making this memory a shared artifact pays dividends.
- **Cost control and guardrails** — As the agent takes more actions, you need budget limits, action rate limits, and a kill switch to stop runaway costs if things go sideways. Using open-source LLMs is also on the roadmap.