AI Engineering · Engineering Practices · Code Review · CI/CD · Performance

Patterns of agent-driven code development

Seven lessons from one saturation incident, all about how to keep agent velocity without paying for it in production fires.

23 April 2026 · 11 min read

AI-assisted code development is the new normal. The scale and speed at which agents can write and build things is genuinely attractive. But without the rules and guardrails that were standard in the era of human-written code, the chances of getting things wrong multiply as the volume of code written surpasses what humans can review.

This post is a summary of lessons I learned from a lot of AI-assisted coding: what worked well, what did not, and what I would do differently next time. It’s about what an agent-era code workflow looks like and how to strike the balance between velocity and quality.

Seven lessons, drawn from one saga.

The saga, in three paragraphs

All code written and reviewed by agents.

Pull request (PR) A introduced a multi-window aggregation. The same record now produced outputs for 5 time-windows instead of 1. Each output went through the existing serialization path. PR A was correct. The serializer was unchanged. CI passed. Copilot reviewed and approved. A human approved. Merged.

PR B, two weeks later, added a second pass through the metrics struct for a new audit log. Reviewer noted “small overhead, acceptable.” Merged.

PR C, a week after that, registered 16 new pattern detectors against the per-event dispatch loop. Each detector was cheap on its own. PR C shipped behind a config flag — but the flag defaulted on in production. Reviewer noted “we benchmarked one detector, looks fine.” Merged.

A is fine. B is fine. C is fine. The serializer is unchanged across all three. But A multiplied the per-record path 5×. B added a second invocation of the same hot serializer. C added 16 new dispatches per event. The per-message work is now 5 × 2 × 17 = 170× what it was. The serializer’s existing date-format-spec re-parse cost — invisible at 1× — is now 170× louder. CPU pegs.

This is cumulative regression: an outcome that emerges from the interaction between merges, not from any one merge. By the time you can see it in production, you have weeks of diffs to bisect.

This comes down to a structural fact: agents are good at narrow tasks and cannot see the big picture without being constantly reminded. They can write a function that does what you ask, but they can’t intuit the scale at which it runs or the cumulative effect of multiple changes. The defenses that catch these patterns are different from the ones that caught pre-agent bugs.

If you have shipped agent-written code for a while, some version of this story probably sounds familiar. Now for the new basics.

The agent-era hygiene baseline every team needs

Everything below assumes a baseline of defenses already in place. They’re “obvious things” in the sense that nobody argues against them — but agent-era development punishes their absence harder than human-era did, because the agent has no incentive to do them voluntarily. Lock them down at the tool layer, not the social layer.

Pre-commit hooks that don’t let bad commits through

Every commit on a real codebase should run, before the commit is allowed to land:

  • Format check — cargo fmt --check, black --check, prettier --check. Failing the format check blocks the commit. (Agents will sometimes auto-format their changes and mix unrelated formatting churn into the diff — running fmt locally upfront avoids the noise.)
  • Linter with warnings as errors — cargo clippy -- -D warnings, eslint --max-warnings=0, mypy --strict. The “warning vs error” distinction collapses in the agent era; agents will produce code that compiles cleanly but lints loudly, and “we’ll fix the lints later” never happens. Treat every warning as a build failure.
  • Type check — cargo check, tsc --noEmit, mypy. Catches everything the formatter doesn’t.
  • Unit test runner — cargo test, pytest, npm test. Yes, on every commit. Yes, even though it slows the commit. Yes, the agent will offer to skip it; the answer is no.
  • Doc-comment check — cargo doc with warnings-as-errors. Doc-code drift is a recurring agent failure mode (the agent edits the code but not the doc-comment above it).
  • Secret scanner — detect-secrets, gitleaks, or similar. Agents will sometimes paste credentials into example blocks “just for the docs.” The scanner catches it before it reaches a public repo.
  • Vulnerability audit — cargo audit, pip-audit, npm audit. Run on a less-frequent cadence (weekly is fine) but in pre-commit when dependencies change.
  • Contract / schema staleness checks — if you have generated code (Protobuf, JSON Schema, OpenAPI), a hook that fails when the generated artifact is out of sync with its source. Agents will sometimes hand-edit generated files; the hook catches it.

The single most important rule, written down: the agent must NOT bypass hooks with --no-verify or equivalent. If hooks fail, the underlying issue gets fixed; the bypass is never the answer. This is a hard rule worth putting in your project’s CLAUDE.md or equivalent agent-instruction file, because agents will absolutely offer to bypass when a hook fails and they think the failure is unrelated.

The framework choice doesn’t matter much (pre-commit, husky, lefthook, native git hooks). What matters is that the chain is configured, runs on every commit, and has no escape hatch.
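
If the codebase is primarily Rust, one concrete way to wire this chain with no escape hatch is a small xtask-style runner that the git pre-commit hook invokes. The sketch below is illustrative, not prescriptive: the steps mirror the checklist above, the exact flags may need adjusting for your project, and gitleaks is assumed to be installed on the machine.

// xtask/src/main.rs: a minimal sketch of the pre-commit chain as a cargo xtask.
// The git pre-commit hook itself is a one-liner that runs `cargo run -p xtask --quiet`.
use std::process::{exit, Command};

// Run one step of the chain; any failure blocks the commit.
fn check(name: &str, cmd: &str, args: &[&str]) {
    println!("==> {name}");
    let ok = Command::new(cmd)
        .args(args)
        .status()
        .map(|status| status.success())
        .unwrap_or(false);
    if !ok {
        eprintln!("pre-commit step '{name}' failed; fix the issue rather than bypassing with --no-verify");
        exit(1);
    }
}

fn main() {
    check("format", "cargo", &["fmt", "--", "--check"]);
    check("lint (warnings as errors)", "cargo", &["clippy", "--all-targets", "--", "-D", "warnings"]);
    check("type check", "cargo", &["check", "--all-targets"]);
    check("unit tests", "cargo", &["test"]);
    check("docs", "cargo", &["doc", "--no-deps"]);
    check("secret scan", "gitleaks", &["protect", "--staged"]);
    println!("pre-commit chain passed");
}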

Tests run against real infrastructure, not mocks

Agents love mocks. They’re trained on test code that’s mock-heavy, because mock-heavy tests are easier to write, easier to copy from a tutorial, and run in CI without external setup. Mocks are also where most of the “test passes but production fails” stories come from.

The pattern:

  • Kafka tests that need a kafka cluster — run them against a real broker (docker compose, testcontainers, embedded kraft). Don’t mock KafkaConsumer and assert against the mock; spin up a real one.
  • Redis tests that need redis — run them against a real redis. The behavior differences between MockRedis and the real thing (key expiration timing, pipeline semantics, persistence behavior) are exactly the differences your code has bugs in.
  • Database tests that need a database — run them against a real Postgres / TimescaleDB / whatever you use in production. Schema migrations against a fake DB don’t catch the issues that real ones do.
  • HTTP client tests that talk to internal services — run them against a real fixture service if you have one, or a recorded-and-replayed real conversation (VCR-style). Mocked HTTP clients drift from server behavior the moment the server changes.

The agent’s instinct, when asked to add a test, is to add a unit test with mocks. The discipline:

  • Default to integration tests with real infra (in docker / testcontainers).
  • Reserve unit tests with mocks for pure-logic code where the dependency boundary is genuine and the mock doesn’t paper over a real behavior question.
  • When the agent proposes a mock, ask “is this mock asserting something about how we call the dependency, or is it inventing behavior the dependency might not actually have?” The latter is the bug factory.

In practice, on a real codebase with a docker compose for the dependencies, integration tests cost ~2× the time of unit tests but catch ~10× the bugs. Easy trade.
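
To make the trade concrete, here is a sketch of an integration test against a real Redis rather than a mock. It assumes the docker compose exposes Redis on 127.0.0.1:6379 and that the project uses the redis crate; the key name and TTL are made up. The behavior it asserts, expiration timing, is exactly the kind of thing an in-memory mock tends to get wrong.

// tests/redis_expiry.rs (sketch; requires the docker compose Redis to be running)
use redis::Commands;
use std::{thread, time::Duration};

#[test]
fn expired_keys_actually_disappear() {
    let client = redis::Client::open("redis://127.0.0.1:6379/").unwrap();
    let mut con = client.get_connection().unwrap();

    // Set a key with a 1-second TTL on the real server, not a mock's idea of a TTL.
    let _: () = con.set_ex("session:42", "payload", 1).unwrap();
    let hit: Option<String> = con.get("session:42").unwrap();
    assert_eq!(hit.as_deref(), Some("payload"));

    // After the TTL elapses, the real server must have expired the key.
    thread::sleep(Duration::from_millis(1500));
    let miss: Option<String> = con.get("session:42").unwrap();
    assert_eq!(miss, None);
}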

End-to-end real-data validation

Beyond integration tests against real infra, you also need to validate the whole pipeline against real data — not synthetic data the agent generated.

Capture a slice of real production input (one hour of kafka traffic, one day of HTTP request logs, a sampled trace) into a fixture file. Build a test that:

  1. Spins up your binary (or your pipeline of binaries) against a docker compose of the real dependencies.
  2. Replays the captured input through the binary.
  3. Asserts something concrete about the output: row counts, key invariants, p99 latency, total CPU seconds.

This is the single most powerful test you can have for a streaming or batch pipeline, and most teams don’t have it. Reasons not to: setup cost is real (~1 week the first time). Reasons it’s worth it: it catches everything that:

  • Looks fine in unit tests
  • Looks fine in integration tests
  • Looks fine in code review (human or agent)
  • Manifests only on real production-shaped input

The cumulative regression in this post’s saga would have been caught by an end-to-end replay test on PR A. The aggregate CPU seconds for replaying the fixture would have spiked 5×. None of the unit / integration tests would have flagged it because none of them run the realistic input shape.
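
For a Rust pipeline, the harness can be surprisingly small once the fixture exists. The sketch below is heavily hedged: the fixture path, the fan-out factor, the time budget, and run_pipeline itself are all stand-ins; in the real harness, run_pipeline is the production entry point wired to the docker-compose dependencies rather than the local placeholder shown here.

// tests/replay_capture.rs (sketch; every name and number is illustrative)
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::Instant;

// Placeholder for the real pipeline entry point. Here it just fans each event
// into 5 windowed rows so the sketch is self-contained and compiles on its own.
fn run_pipeline(events: &[serde_json::Value]) -> Vec<serde_json::Value> {
    events
        .iter()
        .flat_map(|event| {
            (0..5).map(move |window| {
                let mut row = event.clone();
                row["window"] = serde_json::json!(window);
                row
            })
        })
        .collect()
}

#[test]
fn replayed_capture_meets_output_and_latency_invariants() {
    let file = File::open("tests/fixtures/capture.jsonl").expect("capture fixture missing");
    let events: Vec<serde_json::Value> = BufReader::new(file)
        .lines()
        .map(|line| serde_json::from_str(&line.unwrap()).unwrap())
        .collect();

    let started = Instant::now();
    let rows = run_pipeline(&events);
    let elapsed = started.elapsed();

    // Assert something concrete about the whole pipeline, not about one function.
    assert_eq!(rows.len(), events.len() * 5, "per-record fan-out drifted");
    assert!(rows.iter().all(|row| row.get("window").is_some()), "output invariant violated");
    assert!(elapsed.as_secs_f64() < 30.0, "replay exceeded the 30s budget: {elapsed:?}");
}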

A pre-push code-reviewer agent (separate from Copilot)

Most teams have Copilot or equivalent reviewing PRs after they’re pushed. That’s good but late. Add a second reviewer agent that runs locally before the push — its only job is to catch the issues a remote-reviewer round would otherwise flag, in time to fix them before they enter the PR review cycle.

Brief shape:

Review the diff I'm about to push. Focus on:
1. Is the public API shape (trait/fn signatures) the canonical idiomatic shape?
2. Every .clone() / .to_string() / Vec::new() in the diff — is it on a hot path?
   If yes, justify or eliminate.
3. Every doc comment in the diff — does it accurately describe the code at HEAD?
4. Top 3 nits a perf-minded reviewer would file.
Not-focus: style, naming, missing tests. If you'd approve for push, say so.

The local reviewer’s nits are cheap to address (your code is fresh; your editor is open). The remote reviewer’s nits cost 30 minutes per round-trip. Move the cheap ones earlier.

This is a velocity multiplier on top of agent-era development specifically — without it, you spend agent-review cycles on issues that should have been caught locally. With it, agent review focuses on the hard cross-PR architectural concerns.

Hard rules in CLAUDE.md (or equivalent agent-instruction file)

The non-negotiable rules need to be written down somewhere the agent reads at the start of every session. Examples that we keep in our equivalent file:

  • Never commit directly to main. Every change goes through a feature branch and a PR. (Agents will absolutely commit to the current branch if it’s main and you don’t tell them not to.)
  • Never bypass hooks (--no-verify, --no-gpg-sign, etc) unless explicitly authorized.
  • Never run destructive operations (git reset --hard, git push --force, rm -rf) without confirmation.
  • No backwards-compatibility shims by default. Fix the call sites; don’t add Option<T> or #[serde(default)] to make the change “safer.”
  • No mocks for system-boundary tests. Use the real infra in docker.
  • File issues for proper fixes; never sed-hack generated code or work around upstream bugs locally.

These rules don’t enforce themselves. But the agent reading them at session start internalizes them, and the rate of “agent did something dangerous because nobody told it not to” drops dramatically.

Persistent memory across sessions

Covered in more depth in Lesson 7 below, but worth flagging here as baseline: the agent’s context starts empty every session. Anything you don’t write down somewhere it will read, you’ll re-derive every session, often inconsistently. Treat memory updates as part of every retrospective, not an optional cleanup task.


Lessons

The six items above aren’t lessons. They’re prerequisites. Without them, the lessons below are advanced techniques on a broken foundation. With them, the rest of the post is about what’s left after the basics are working — which is where most teams adopting agents fast actually end up after the honeymoon.

Lesson 1: Agent velocity raises the cumulative-regression rate

We started the blog with this example. When most pull requests are agent-assisted, the merge rate goes up. That’s the value proposition — write faster, review faster, ship faster. But cumulative-regression risk scales with merge rate, not with diff quality. Three “fine” PRs in three weeks have less compounding surface area than three “fine” PRs in three days.

The agent-era distribution looks something like:

| | Pre-agent typical week | Agent-era typical week |
|---|---|---|
| PRs merged on a fast-moving service | 3-5 | 10-15 |
| Average review-cycle time | 1-3 days | 1-3 hours |
| Probability any single PR has a real bug | ~5% | ~5% (about the same) |
| Probability ANY of the week’s PRs interacts adversely with another | low | meaningfully higher |

The per-PR error rate didn’t change. The number of PRs in the same calendar window did. The growth in pairwise interactions is roughly quadratic: 5 PRs in a week is 10 possible pairings, 15 PRs is 105. That pairing surface is the cumulative-regression surface. Agent velocity is real; the surface area it creates is also real, and that surface gets paid for somewhere.

This isn’t an argument against agent velocity. It’s an argument for matching the velocity gain with proportional defense at the layers that scale: CI gates and production observability, not human review.

Lesson 2: Agents don’t intuit “slow at scale”

The chrono serde_json::to_value(<DateTime>) pattern that ate 50% of CPU in the saga is normal Rust. It’s what every textbook says to do. The agent that wrote the original to_value(metrics) call wrote idiomatic, reasonable code. The agent that reviewed it — same conclusion. The human reviewer who approved — same conclusion.

Idiomatic-at-small-scale is the agent’s default mode. They’ve absorbed the canon: how to serialize a struct to JSON, how to parse a date, how to dispatch over a registry. The canon is correct in the small. At 6,000 calls per second on a hot path, the canon’s cost gets multiplied by 6,000 and matters in ways the canon doesn’t address.

A human engineer with deployment-context intuition might catch this — “wait, this is on the per-message path, let’s check what to_value actually does under the hood.” A human engineer without deployment context (most of them, on most of their reviews) won’t. Neither will the agent.

This isn’t a flaw to fix in agents. It’s a structural feature of “code that’s idiomatic at the language level isn’t necessarily right at your service’s scale.” The fix is a perf gate that treats scale as a first-class CI concern, not a reviewer-intuition concern. (More on this in Lesson 6.)

Lesson 3: Code review’s role shifts when agents write the code

Pre-agent code review focused on:

  • Typos, naming, formatting
  • Line-level correctness
  • Idiomatic patterns
  • Catching obvious bugs

Agents are now competent at all four. A modern Copilot review will spot most of these on the first pass; you’ll spend zero seconds on them.

So what’s left for a human reviewer? Three things, and they’re a different muscle than what most senior engineers spent the last decade developing:

  • Architectural fit. Does this PR’s shape match how the system is supposed to evolve? Are we entrenching a pattern we’ll regret in 6 months? Is there a cleaner abstraction the agent didn’t see because it didn’t have system-wide context?
  • Reasoning verification. Agents now write substantial PR descriptions explaining the trade-offs they considered. The reviewer’s job is to verify the stated reasoning matches the diff. Did the agent claim “we use Mutex to be Send + Sync” while actually using RefCell? Did they claim “no behavior change” while changing a default? The PR description is the agent’s case; the reviewer is the judge. (A compile-time assertion can carry part of this check; see the sketch at the end of this lesson.)
  • Cross-PR effects. What does this change interact with? Does it amplify a cost in another module? This is the only line of defense against cumulative regressions short of running the code at production scale, and it’s the muscle that human reviewers have least practice with. Most reviewers have spent careers reviewing one diff at a time.

These three are the reviewer’s job in the agent era. Everything else is something the agents are good at, and demanding humans do it as well wastes review capacity on duplicated coverage.
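
One way to make the reasoning-verification job cheaper is to turn the claimed property into something the compiler checks. The sketch below uses a hypothetical SharedState type standing in for whatever the PR introduces: if the PR description claims it is Send + Sync but the implementation quietly uses RefCell, this test stops compiling instead of relying on the reviewer to notice.

// A compile-time assertion: it only type-checks if the PR's claim actually holds.
use std::sync::Mutex;

// Stand-in for the type the PR introduces. Swap the Mutex for a RefCell and the
// assertion below becomes a compile error.
#[allow(dead_code)]
struct SharedState {
    counters: Mutex<Vec<u64>>,
}

#[allow(dead_code)]
fn assert_send_sync<T: Send + Sync>() {}

#[test]
fn shared_state_is_send_sync_as_the_pr_description_claims() {
    assert_send_sync::<SharedState>();
}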

Lesson 4: The “two rounds then merge” rule

Iteration with an agent reviewer is unbounded by default. Push a PR; agent finds three things; fix them; push again; agent finds two more (often opinionated polish, not real bugs); fix those; push; agent finds one more nit; and so on. The agent doesn’t get tired. You will.

The discipline I converged on during the saga: after two rounds of Copilot review, merge.

  • Round 1: agent’s first pass on the original push. Catches the obvious issues — there’ll usually be 1-3 real concerns. Address them.
  • Round 2: agent reviews the fix. Catches drift introduced by the round-1 fixes (rare but real), plus any nits the round-1 review missed. Address the real concerns; reply to opinionated nits with reasoning.
  • No round 3. By round 3 you’re polishing. The agent will keep finding things forever; some are real, most are diminishing returns. The engineer makes the merge call: “this is good enough; ship.”

For sub-200-line PRs, the target is one round of agent review then merge — round 2 should be unnecessary if the pre-push self-check was thorough. Two rounds is the ceiling, not the goal.

The deferred-comment cost: any round-2 nit you choose not to address before merge becomes a follow-up issue. File it. Then close it within 14 days, or accept that you’ve created compounding tech debt — which has its own velocity cost in agent-era development (Lesson 5).

Lesson 5: Deferred-comment debt is the velocity cost you didn’t budget for

Higher merge velocity → more PRs → more deferred-fix issues → more agent context cluttered with “still open?” reminders.

The math is simple. If you defer one Copilot nit per PR (which is roughly what “two rounds then merge” produces), and you merge 50 PRs a quarter, you’ve created 50 follow-up issues per quarter. None of them is urgent. All of them are real. Without an SLA, the issue tracker becomes where the agent-flagged tech debt goes to die — and worse, every future agent session reviewing your repo encounters those open issues as context, biasing their reasoning toward “this codebase has a lot of unfinished business.”

Two practices, each cheap:

  • Auto-bumping label. A GitHub Action that pings any open deferred-from-pr issue after 14 days of inactivity, mentioning the original PR’s author. Surfaces stale items without nagging.
  • Quarterly debt sweeps. Once a quarter, dedicate a day to closing as many deferred-from-pr issues as possible. Treat it as a real workstream, not optional cleanup. The cost of not doing this is that future agent reviews keep flagging the same patterns over and over.

This is the part of agent-era development that nobody mentions in the demos. The demo shows velocity; the production reality shows the debt that accumulates from velocity.

Lesson 6: The three performance defenses

Agent-era code review can’t catch cumulative regressions. Lesson 1 explains the structural reason. Three defenses do catch them — they sit at different layers and they’re complementary.

Microbench gate on the CI path

A microbenchmark on each hot path your flamegraph identifies as expensive. Run on every PR. Fail when any benchmark regresses by more than 10% vs the baseline on main.

The microbench won’t catch the cumulative effect of A + B + C combined. It WILL catch the first PR that pushes any single hot path past your threshold. In the saga: PR A’s per-record serialization microbench would have failed at the diff that introduced the 5× fan-out. The reviewer would have asked why; the 5× would be visible; mitigation would land in the same PR.

Tooling exists in every mainstream language:

| Language | Microbench tool |
|---|---|
| Rust | criterion-rs |
| Python | pytest-benchmark, pyperf |
| Java / JVM | JMH |
| Go | go test -bench, benchstat |
| Node.js | mitata, vitest bench |

The discipline is the same regardless of tool: commit a benchmark file alongside the hot-path code; CI runs it; CI fails on a regression > 10%. That 10% is arbitrary but defensible — smaller and you’ll get false-positive noise from runner variance; larger and you’ll miss real regressions.
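
For the Rust case, here is a minimal criterion-rs sketch of what “a benchmark file alongside the hot-path code” can look like. The Metrics struct is an illustrative stand-in for the real per-record struct (the DateTime field needs chrono’s serde feature), and how the 10% threshold is enforced in CI, via criterion baselines or a comparison script, is left to your pipeline.

// benches/serialize_metrics.rs (sketch; needs criterion as a dev-dependency and a
// [[bench]] entry with harness = false in Cargo.toml)
use chrono::{DateTime, Utc};
use criterion::{criterion_group, criterion_main, Criterion};
use serde::Serialize;
use std::hint::black_box;

// Illustrative stand-in for the real per-record struct on the hot path.
#[derive(Serialize)]
struct Metrics {
    event_id: u64,
    observed_at: DateTime<Utc>,
    value: f64,
}

fn bench_serialize_metrics(c: &mut Criterion) {
    let metrics = Metrics { event_id: 42, observed_at: Utc::now(), value: 3.14 };
    c.bench_function("serialize_metrics_to_value", |b| {
        // The exact call the per-message path makes thousands of times per second.
        b.iter(|| serde_json::to_value(black_box(&metrics)).unwrap())
    });
}

criterion_group!(benches, bench_serialize_metrics);
criterion_main!(benches);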

Replay-based macro benchmark

The microbench catches per-function regressions. The macro bench catches the cumulative effect that microbenches miss.

Capture a representative slice of your production input — a recorded hour of kafka traffic, a captured HTTP log, a sampled trace — into a fixture file. Build a test harness that replays the fixture through your binary and asserts something concrete: total CPU seconds consumed, total memory allocated, total wall-clock to drain the queue.

Run on every PR that touches the runtime / hot-path layers. Fail when the asserted aggregate metric regresses by more than 10%.

The cumulative regression in the saga would have failed the macro bench on PR A: the aggregate CPU seconds for replaying the fixture would have jumped 5×, even though every microbench passed. That’s the kind of cumulative signal that only a real-traffic replay catches.

Setup cost is real — capturing the fixture, building the harness, getting the replay deterministic. Budget a week. After that, every future PR is checked against it for free.
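
The gate logic at the end of the harness is the simple part; a hedged sketch is below. How the aggregate cost is measured (process CPU time from rusage, cgroup counters, or plain wall-clock on a dedicated runner) and where the committed baseline number lives are the deployment-specific decisions.

// Sketch of the macro-bench gate: fail the run when the replay's aggregate cost
// regresses more than 10% against the baseline recorded from main. How the two
// numbers are obtained is left to the harness; this is only the comparison.
fn assert_within_budget(measured_cpu_secs: f64, baseline_cpu_secs: f64) {
    let allowed = baseline_cpu_secs * 1.10;
    assert!(
        measured_cpu_secs <= allowed,
        "macro bench regression: {measured_cpu_secs:.1}s vs baseline {baseline_cpu_secs:.1}s (budget {allowed:.1}s)"
    );
}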

Sustained-CPU alert tied to a budget

The first two move detection to CI. The third moves it to production observability — and catches the case where no single PR triggers the bench failure, but 30 days of drift puts the service on the wrong side of the saturation line.

Simple shape:

rate(
  container_cpu_usage_seconds_total{name="events_analytics"}[5m]
) > 0.6

60% is the budget for steady state. If the service is at 40% headroom under normal load, it’s healthy. If it’s persistently over 60%, something is consuming the headroom — even if nothing is technically broken. That’s the early signal.

Most alerting today catches outages — service down, requests 500’ing. Drift is a different failure: the service is up, the responses are correct, but the headroom is gone. A retry storm or a busy weekend tips the drifted service into outage territory. The sustained-CPU alert buys time to investigate before that tipping.

Drift alerts also catch regressions that originated in other teams’ merges — the data team adds a metric, the ingest service fans it out 5× more, and your service drifts up. No one on your team merged anything; the alert still fires.

Lesson 7: Persistent memory is the lesson carrier

Agent-era development has an extra layer that pre-agent dev didn’t: the agent’s context is rebuilt from scratch each session. If the lesson “the chrono format-spec re-parse is 50% of CPU” doesn’t get written down somewhere the agent will read at the start of the next session, the next agent will write the same problematic code again. Same for “we use the warm-cache pattern not raw HashMap.insert in hot paths” — every session needs to be told.

The places to write the lessons:

  • Per-project CLAUDE.md for project-specific conventions (architecture, hot paths to be careful about, naming conventions).
  • Cross-project memory for ecosystem-wide lessons (chrono pitfall, allocator behaviors, deploy quirks).
  • Per-session memory files for working-context state that survives compaction.

This is the velocity multiplier nobody talks about. Without persistent memory, every agent-driven session re-derives the same patterns, often violating the ones a previous session learned. With it, each session’s lessons compound. Treat memory updates as a non-optional part of the post-incident retrospective — the same way most teams treat CHANGELOG updates.

The retrospective discipline: when you correct an agent (or yourself), write down the correction so the next session doesn’t repeat the mistake. The cost is one minute per correction. The savings compound across every future session.

The toolkit inventory: what we have, what we’re missing

Honest accounting from the team that lived through the saga, anonymized to the patterns that generalize. Most teams adopting agents fast will recognize a similar split.

Here is what was in place when the regression hit, what we were still missing, and how to prioritize filling those gaps. The first table is the hygiene baseline; the second table is the next layer of defenses that catch the patterns the first table misses.

What was already in place when the regression hit

| Layer | What we have | Catches |
|---|---|---|
| Pre-commit | fmt, clippy -D warnings, type-check, tests, doc-check, secret scan, vulnerability audit, generated-code staleness | Style drift, type errors, broken tests, leaking secrets, stale generated code |
| Test infra | Real kafka / redis / timescale in docker compose; integration-test default; ~800 unit tests | Most behavior bugs at the function-and-module level |
| Hard rules | “Never commit to main”, “never --no-verify”, “no backward-compat shims”, “no mocks for system-boundary tests” — all in the agent-instruction file | Agent shortcuts that look reasonable in the moment |
| Pre-push reviewer agent | Runs locally before git push; focuses on canonical API shape, hot-path allocations, doc-code drift | The cheap nits that would otherwise eat Copilot review cycles |
| Persistent memory | Per-project + cross-project memory files; updated on every correction | Lessons surviving across agent sessions |
| Schema gen + drift CI | Contracts as source of truth; round-trip tests; pre-commit staleness | Type drift across language boundaries |
| Structured logging | JSON output; Loki / Grafana | Post-incident forensics |
| CPU profiler in-binary | pprof-rs HTTP endpoint exposed on every analytics binary | Live profile capture without redeploy |
| Two-rounds-then-merge rule | Convention; written down | Agent review iterating without converging |
| Worktree isolation for parallel agents | Agent(isolation: "worktree") for independent work | Three concurrent perf fixes from one session |

That’s a substantial baseline. The regression still happened. All ten of those defenses can be in place and a cumulative regression can still surface. The defenses below are what we’re still missing — and what the next quarter is investing in.

What we’re still missing

| Layer | Missing | Why it matters | Cost to add |
|---|---|---|---|
| CI perf gate | A criterion-rs microbench on hot paths, run per PR, fails on > 10% regression | Catches single-PR perf regressions at review time | 1-2 hours per benchmark |
| Macro replay test | A 1-hour traffic capture replayed through the binary in a test harness, asserting total CPU seconds | Catches cumulative regressions across multiple PRs | 1 week to set up first time |
| Sustained-CPU alert | Grafana rule firing when service CPU > 60% for 5 min during business hours | Catches drift that’s already in production | 1 hour |
| Blue-green deploy | Two parallel containers, 5-minute green-soak with shadow consumer-group before promotion | Catches deploy-time regressions before they hit prod | ~1 sprint |
| PR template field for perf claims | Mandatory “expected impact + how verified” on hot-path PRs | Forces the author (human or agent) to think about scale | 30 minutes |
| Deferred-comment SLA tooling | GitHub Action that pings open deferred-from-pr issues after 14 days | Caps the agent-era debt accumulation rate | 1 day |

Reading the two tables together: the “have” column is what catches the obvious things; the “missing” column is what catches drift, cumulative effects, and deploy-time mismatches. These are different failure modes. You need both.

How to prioritize the gaps

If a team is starting from scratch on agent-era development, here is my recommended order, ranked by how quickly each item pays back its cost:

  1. Pre-commit hooks — first day, before any other work. Without these the agent will produce a steady stream of broken commits and the cleanup cost compounds daily.
  2. Real-infra integration tests in docker — within the first week. Without these every behavior bug ships to staging and bounces back.
  3. Hard rules in CLAUDE.md (or equivalent) — within the first week. Cheap, mostly-correct-by-itself, and prevents the most expensive agent shortcuts.
  4. Pre-push code-reviewer agent — within the first month. Velocity multiplier on top of (1)-(3).
  5. Persistent memory discipline — within the first month. Free, just requires the practice.
  6. Sustained-CPU alert — within the first quarter. Catches drift that the others can’t.
  7. Microbench gate on CI — within the first quarter, after you’ve identified the hot paths.
  8. Macro replay test — when you have time and the team is past the honeymoon. Highest setup cost, highest ceiling.
  9. Blue-green deploy — when direct-to-prod deploys stop being tolerable.
  10. PR template + deferred-comment SLA — culture changes; do them when the team is ready to keep them honest.

You don’t need all ten on day one. Layer them as the cost of not having each one becomes visible — and it will become visible, in roughly that order.

What this changes in how the team works

The defenses aren’t optional anymore. Pre-agent, you could get away with code review as the primary line of defense for performance because the merge rate gave reviewers time to think about cross-PR effects and intuit “slow at scale” from deployment context. Agent-era, the merge rate is faster than that intuition can keep up with. The compensating layers have to be added.

Practical recommendations, ranked by cost-of-not-having-them:

  1. Sustained-CPU alert tied to a budget. Cheapest to set up (~1 hour), catches the most failure modes (drift from any cause), no language or service-specific work needed. If you can only do one of these, start here.
  2. Two-rounds-then-merge rule, written down. Cheaper than tooling — it’s a process discipline. Without it, agent review iterates forever.
  3. Microbench gate on CI for hot paths your flamegraph already identified. ~1-2 hours per benchmark, immediate value, fails on the next regression.
  4. Deferred-comment SLA tooling. ~1 day to set up the GitHub Action. Quarterly sweep is a culture decision, not a tooling one.
  5. Replay-based macro benchmark. ~1 week to set up the first time. Highest ceiling on what it can catch; highest setup cost. Worth it after the first three are in place.
  6. Persistent memory discipline. Free, just requires the practice. Pays back across every agent session.

You don’t need all six on day one. The point is to layer the defenses so each one is cheap, and each catches a different failure mode that the others miss.

What I’d say to a team adopting agents fast

Your code review is going to feel like it’s working. Tests are passing, agents are catching obvious bugs, PRs are merging at unprecedented rates. The signal you’ll miss is drift — the slow accumulation of correct-but-cumulative-cost merges that pushes a healthy service into a saturated one over weeks. There’s no warning. Every individual PR will pass review. Every individual PR was, in fact, fine.

The defenses above are how you keep the velocity without paying for it on a Sunday morning. They’re not exciting. They’re not a demo-friendly story. They’re the unglamorous infrastructure that makes agent-driven development sustainable past month two — when the velocity is real, the cumulative regression surface is real, and the question is whether you’ve built the layers that catch what code review can’t.

Code review doesn’t scale to agent velocity. The defenses do. Build them.

Disclosure: My ideas, AI-assisted writing.