AI-Native Engineering

Agentic SRE — CultureTech Playbook


Context

“AIOps with LLMs” is a popular promise. The reality: most of the impressive demos are agents that correlate logs and produce a plausible response, with no guarantee of correctness, no rollback, and no audit trail.

For SRE that’s not enough. SRE operates critical infrastructure. An agent that “vibe-checks” a Kafka corruption and restarts the cluster without human confirmation is itself an incident.

This playbook describes how to apply agents to SRE while keeping the operational rigor the discipline demands.

3 fundamental rules

Rule 1: the agent’s state is legible

Every agent must have an explicit finite state machine (FSM). “The LLM reasons” is not enough. The agent needs named states such as:

  • idle: waiting for a signal.
  • triaging: classifying an alert.
  • executing-runbook: executing a step of runbook X.
  • escalating-human: the case exceeds what the agent may handle on its own.

The operating human must be able to answer “what is the agent doing now?” by looking at the state, not inferring from the last log.
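A minimal sketch of such a state machine, using the four states listed above. The transition table, the class name AgentFSM, and the describe() helper are illustrative assumptions; the playbook names the states but does not specify which transitions are legal.

```python
from enum import Enum


class AgentState(Enum):
    IDLE = "idle"
    TRIAGING = "triaging"
    EXECUTING_RUNBOOK = "executing-runbook"
    ESCALATING_HUMAN = "escalating-human"


# Assumed transition table: the playbook does not define one.
ALLOWED = {
    AgentState.IDLE: {AgentState.TRIAGING},
    AgentState.TRIAGING: {
        AgentState.EXECUTING_RUNBOOK,
        AgentState.ESCALATING_HUMAN,
        AgentState.IDLE,
    },
    AgentState.EXECUTING_RUNBOOK: {AgentState.ESCALATING_HUMAN, AgentState.IDLE},
    AgentState.ESCALATING_HUMAN: {AgentState.EXECUTING_RUNBOOK, AgentState.IDLE},
}


class AgentFSM:
    """Holds the agent's current state and rejects undeclared transitions."""

    def __init__(self):
        self.state = AgentState.IDLE
        self.detail = ""

    def transition(self, new_state, detail=""):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.detail = detail

    def describe(self):
        """Answer 'what is the agent doing now?' from the state alone."""
        if self.detail:
            return f"{self.state.value}: {self.detail}"
        return self.state.value
```

An operator dashboard could call describe() directly instead of parsing logs. Any transition missing from the table raises an error rather than passing silently.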

Rule 2: runbooks are contracts, not suggestions

The agent must execute declared runbooks, not invent steps. Each runbook is a graph of steps with:

  • Verifiable precondition.
  • Concrete action.
  • Verifiable postcondition.
  • Rollback mechanism.

The LLM chooses which runbook to execute, not which individual steps to take. This mirrors how a human SRE consults the team wiki: what the agent adds is discipline, not creativity.
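The contract above could be sketched as follows. To keep the example short it assumes a linear list of steps rather than a general graph, and the names Step and run_runbook are hypothetical. Each step carries the four elements listed: precondition, action, postcondition, and rollback.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    precondition: Callable[[], bool]   # must hold before the action runs
    action: Callable[[], None]         # the concrete change
    postcondition: Callable[[], bool]  # must hold after the action runs
    rollback: Callable[[], None]       # undoes the action


def _undo(done: List[Step]) -> None:
    # Roll back completed steps in reverse order.
    for step in reversed(done):
        step.rollback()


def run_runbook(steps: List[Step]) -> bool:
    """Run steps in order; on any failed check, roll back and return False."""
    done: List[Step] = []
    for step in steps:
        if not step.precondition():
            _undo(done)
            return False
        step.action()
        done.append(step)
        if not step.postcondition():
            _undo(done)
            return False
    return True
```

In this sketch a failed postcondition also rolls back the step that just ran, since its action has already been applied.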

Rule 3: the agent is observable like any other system

Every agent action is logged with:

  • Trigger that started it.
  • LLM reasoning (prompt + output).
  • Chosen runbook + execution state.
  • Result (success, failure, escalation).

These records go to the same observability pipeline as the rest of the infrastructure. When the agent fails, it fails with context.
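One way to capture the four fields above is a structured record serialized as a JSON log line. The class name AgentActionRecord, the field names, and the added timestamp are assumptions made for illustration; any structured format your pipeline already ingests would serve.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

VALID_RESULTS = {"success", "failure", "escalation"}


@dataclass
class AgentActionRecord:
    trigger: str          # alert or signal that started the action
    prompt: str           # LLM reasoning: input
    llm_output: str       # LLM reasoning: output
    runbook: str          # chosen runbook
    execution_state: str  # FSM state when the record was written
    result: str           # one of VALID_RESULTS

    def __post_init__(self):
        if self.result not in VALID_RESULTS:
            raise ValueError(f"unknown result: {self.result!r}")

    def to_json(self) -> str:
        """Serialize to one JSON log line with a UTC timestamp."""
        payload = asdict(self)
        payload["timestamp"] = datetime.now(timezone.utc).isoformat()
        return json.dumps(payload, sort_keys=True)
```

Rejecting unknown result values at construction keeps the log queryable by outcome.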

Anti-patterns

Anti-pattern 1: letting the agent decide runbooks dynamically

This happens when the LLM not only chooses the runbook but also writes the steps on the fly. The result is impossible to audit and impossible to test.

Fix: runbooks are versioned artifacts in the repository. The agent picks only from existing runbooks.
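A minimal sketch of that constraint: the LLM's output is treated as a lookup key into a fixed registry, never as executable steps. The registry contents, the file paths, and the names resolve_runbook and UnknownRunbookError are hypothetical.

```python
from typing import Dict

# Hypothetical registry: runbook name -> versioned file in the repository.
RUNBOOK_REGISTRY: Dict[str, str] = {
    "restart-consumer-group": "runbooks/kafka/restart-consumer-group.yaml",
    "scale-consumer": "runbooks/kafka/scale-consumer.yaml",
    "check-broker-health": "runbooks/kafka/check-broker-health.yaml",
}


class UnknownRunbookError(LookupError):
    """Raised when the LLM names a runbook that is not in the registry."""


def resolve_runbook(llm_choice: str, registry: Dict[str, str] = RUNBOOK_REGISTRY) -> str:
    """Map the LLM's choice to a declared runbook, or refuse."""
    name = llm_choice.strip()
    if name not in registry:
        raise UnknownRunbookError(f"runbook not declared: {name!r}")
    return registry[name]
```

An unknown name is a hard error here; in practice it would likely trigger an escalation to a human.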

Anti-pattern 2: unlimited trust

This happens when the agent can execute destructive actions (restart clusters, delete pods, etc.) and no case ever requires human confirmation.

Fix: define a confidence threshold. Below X% confidence, the agent escalates before acting. For specific destructive actions (defined in an explicit list), it always escalates, regardless of confidence.
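The two conditions combine into a single gate, sketched below. The 0.8 threshold is a placeholder for X, and the action names in ALWAYS_CONFIRM are hypothetical; both would come from your own policy.

```python
from typing import FrozenSet

CONFIDENCE_THRESHOLD = 0.8  # placeholder for "X%"; choose per team policy

# Hypothetical explicit list of actions that always need a human.
ALWAYS_CONFIRM: FrozenSet[str] = frozenset(
    {"restart-cluster", "delete-pod", "restart-broker"}
)


def needs_human(
    action: str,
    confidence: float,
    threshold: float = CONFIDENCE_THRESHOLD,
    always_confirm: FrozenSet[str] = ALWAYS_CONFIRM,
) -> bool:
    """Return True when the agent must escalate before acting."""
    if action in always_confirm:
        return True  # listed actions escalate regardless of confidence
    return confidence < threshold
```

The list check comes first so that a high confidence score can never bypass it.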

Anti-pattern 3: a single monolithic agent

This happens when a single “SRE Agent” covers Kafka, Kubernetes, PostgreSQL, and everything else. Complexity explodes and auditing becomes impossible.

Fix: one agent per domain (Themis follows this pattern). The Kafka domain has its own agent with its own FSM and its own runbooks; Kubernetes has another.
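A sketch of per-domain routing, assuming alert names begin with the domain (as in kafka-consumer-lag). That naming convention, the DomainAgent type, and the runbook names under "kubernetes" are illustrative assumptions, not a description of Themis.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class DomainAgent:
    domain: str
    runbooks: FrozenSet[str]  # each domain owns its own runbooks


# Hypothetical agents; each would also own its own FSM instance.
AGENTS: Dict[str, DomainAgent] = {
    "kafka": DomainAgent(
        "kafka",
        frozenset({"restart-consumer-group", "scale-consumer", "check-broker-health"}),
    ),
    "kubernetes": DomainAgent("kubernetes", frozenset({"check-node-health"})),
}


def route_alert(alert_name: str, agents: Dict[str, DomainAgent]) -> Optional[DomainAgent]:
    """Pick the agent whose domain prefixes the alert name, else None."""
    domain = alert_name.split("-", 1)[0]
    return agents.get(domain)
```

A None result means no domain agent claims the alert, which is a natural point to page a human directly.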

Typical use case

The agent receives the alert kafka-consumer-lag > 1h on a critical topic.

  1. State: triaging.
  2. Reasoning: “High lag on topic X. Candidate runbooks: restart-consumer-group, scale-consumer, check-broker-health.”
  3. Action: consult metrics to choose among the three. The agent detects that the broker is at 95% CPU.
  4. State: executing-runbook: check-broker-health.
  5. Runbook action: check the broker process, GC pauses, and network.
  6. Result: 2.5s GC pause detected.
  7. State: escalating-human. Reason: the corrective action (adjusting -XX:MaxGCPauseMillis) requires a broker restart. A restart is not destructive per se, but the explicit list marks it as “requires human confirmation”.
  8. The on-call human gets a Slack message with the original alert, the full reasoning, and the proposed runbook, then approves or adjusts.

Total time from alert to proposal: 90 seconds. Without the agent, the same diagnosis takes an SRE 15-30 minutes.

When NOT to use agentic SRE

  • Small SRE team (<5 people). The overhead of maintaining runbooks and agent observability exceeds the savings.
  • Simple infrastructure (one monolith + DB). There is not enough alert density to justify it.
  • No pre-existing runbook culture. Agentic SRE amplifies good discipline; it doesn’t create it.

Interested in exploring?

If you have an SRE team and want to discuss your specific use case, we offer a free 60-minute Technical Deep Dive.