Diagnostic kill-switches: A/B in production with one CLI flag
A two-minute production experiment ruled out 31 detection rules as the cause of a CPU-saturation incident, without redeploying or shipping new code. The pattern is one CLI flag per expensive subsystem. This post is about the discipline that pays for itself the first time you need to bisect under fire.
A flamegraph showed 31 detection rules running on every event. CPU was pegged. The hypothesis everyone wanted to confirm: “the rules are the problem.” The hypothesis nobody could confirm without writing code: ditto.
Two minutes later, a single-line edit in the docker-compose file restarted the container without those rules loaded. CPU stayed pegged. Hypothesis ruled out conclusively. We moved on to the actual cause.
This is the value of a diagnostic kill-switch — a runtime flag that disables an expensive subsystem without code changes. One CLI argument, one if-enabled branch, and the cost of mid-incident A/B testing drops from “ship a PR through CI and redeploy” to “edit a string and restart.” This post is about the discipline of having one for every expensive subsystem in any long-running service.
Three terms that get confused
Before going deeper: “kill-switch”, “CLI flag”, and “feature-flag service” overlap, but they’re not the same thing. Quick clarifier:
- Kill-switch is a goal, not a mechanism. It’s the ability to turn a subsystem off without redeploying or modifying source code. “I need a kill-switch for the rule engine” describes what you want, not how you’ll implement it.
- CLI flag is one mechanism for achieving that goal. A command-line argument the binary reads at startup. Toggling it requires a restart. No external dependency. Lives in your compose / Kubernetes manifest, version-controlled in git.
- Feature-flag service is a different mechanism, optimized for a different problem: per-user, per-tenant, or per-percent-of-traffic gating. Toggling is instant, no restart needed. Has an external dependency (LaunchDarkly, Unleash, GrowthBook, or your own). Audit log lives in the service.
The shapes overlap because both let you turn things on and off without modifying code. They differ on four axes:
| | CLI flag | Feature-flag service |
|---|---|---|
| Granularity | Whole binary instance | Per-user / per-tenant / per-% of traffic |
| Latency to flip | Restart (seconds–minutes) | Instant (no restart) |
| External dependency | None | The service must be up |
| Audit trail | Git commit on the compose file | Inside the service |
Which one do I want?
| If your goal is… | Reach for |
|---|---|
| “Turn off this expensive subsystem to bisect a perf incident” | CLI kill-switch (this post) |
| “Roll out a new feature to 10% of users, then 50%, then 100%” | Feature-flag service |
| “A/B test two algorithm variants on real traffic” | Feature-flag service |
| “Disable a noisy subsystem during an outage” | CLI kill-switch |
| “Stage a new sink so we can flip back if it misbehaves” | CLI flag with default-off (or feature flag) |
| “Per-tenant overrides on subsystem behavior” | Feature-flag service, OR per-tenant config in the kill-switch impl |
A kill-switch can be implemented via either mechanism — flip a feature-flag key from true to false and the subsystem stops running, no restart required. The choice between them is about latency, dependency, and granularity, not about the goal.
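To make that point concrete, here is a minimal Rust sketch with hypothetical types (nothing here is a particular SDK's API): the subsystem asks one question, and either mechanism can answer it.

```rust
/// The kill-switch *goal*, independent of mechanism: "is this subsystem on?"
trait KillSwitch {
    fn enabled(&self) -> bool;
}

/// Mechanism 1: CLI flag. Read once at startup; flipping it means a restart.
struct CliFlag { enabled: bool }

impl KillSwitch for CliFlag {
    fn enabled(&self) -> bool { self.enabled }
}

/// Mechanism 2: feature-flag service. Re-checked per call; flips without a restart.
/// `FlagClient` is a stand-in for whatever SDK you actually use.
struct FlagClient;
impl FlagClient {
    fn bool_variation(&self, _key: &str, default: bool) -> bool {
        default // a real SDK evaluates the key against the service here
    }
}

struct ServiceFlag { client: FlagClient, key: &'static str }

impl KillSwitch for ServiceFlag {
    fn enabled(&self) -> bool { self.client.bool_variation(self.key, true) }
}

/// The hot path only asks the question; which mechanism answers it is a
/// latency / dependency / granularity trade-off, not a different goal.
fn run_rules_if_enabled(rules: &dyn KillSwitch) {
    if rules.enabled() {
        // ... run the expensive subsystem ...
    }
}
```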
For the rest of this post, the focus is the kill-switch implemented as a CLI flag. It’s the simplest, cheapest implementation that solves the diagnostic / incident-response use case — and the one mature teams under-invest in most. If you already have a feature-flag service, use it for user-facing rollouts; the patterns below still apply for the binary-internal subsystems where a CLI flag is the better fit.
The example that earned the post
The container had been pegged at 100% CPU for an hour. Phase-1 fixes had landed but barely moved the needle. The next-most-likely culprit was the rules engine — 31 user-defined detection rules, each running on every event for every region, each rebuilding a 333-field metrics map. Plausible. But also: maybe the rules engine had nothing to do with it, and the next round of code changes would chase the wrong frame.
The container already had a --enable-rules flag. (We’d added it months earlier for backtest mode, where you load only the strategies you want to evaluate.) In production it was always set. To test the hypothesis:
# docker-compose.yml — comment out one line
command:
  - --tenants=us-east,us-west,eu,ap
  - --consumer-group=events-analytics-prod-v1
  # - --enable-rules            # ← TEMP: disabled for perf diagnostic
  - --rules-dir=/app/rules
Restart the container. 30 seconds later, the rules engine was off and the binary's log line confirmed it: "rule evaluation disabled (use --enable-rules to enable)". A CPU sample 60 seconds after that: still 101%, 97%, 67%, 100%, 101%. The rules engine was contributing maybe 2–3% of the CPU; the saturation was somewhere else entirely.
That answer would have taken hours to extract by reading code or staring at the flamegraph. With the toggle, it took 2 minutes. Roll the change back, restart, production state is exactly where it was before the experiment.
The pattern
For every subsystem in your binary that:
- Could plausibly be the cause of a future incident, AND
- Is structurally optional (the binary still does something useful without it),
ship a CLI flag that turns it off. Convention I use:
--enable-rules # default: false in tests, true in prod compose
--enable-detectors=PARTIAL # full / partial / off
--enable-observation-sink # default: true
--enable-rollup-publisher # default: true
--enable-correlation-engine # default: false
Boolean toggles are the simple case. For things with multiple modes (e.g. “all detectors / cheap detectors only / off”), use a small enum. Don’t ship a 7-flag matrix where 2 flags would do.
The implementation is trivial:
// CLI parsing (clap, etc.). Passing the bare flag sets it to true; the default is false.
#[arg(long)]
enable_rules: bool,

// At construction time:
let rule_engine = if cli.enable_rules {
    Some(RuleEngine::new(&cli.rules_dir)?)
} else {
    tracing::info!("rule evaluation disabled (use --enable-rules to enable)");
    None
};

// On the hot path:
if let Some(engine) = &rule_engine {
    engine.evaluate(&event);
}
That’s the whole thing. One field that’s Option<T>, one log line so operators can confirm the state, one branch on the hot path that the CPU branch predictor handles in roughly zero cycles.
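For the multi-mode case mentioned above, here is a minimal sketch using clap's ValueEnum derive; DetectorMode and its variants are illustrative names, not the incident binary's:

```rust
use clap::{Parser, ValueEnum};

/// Multi-mode toggle: --enable-detectors=full|partial|off
#[derive(Copy, Clone, Debug, ValueEnum)]
enum DetectorMode {
    /// Run every detector.
    Full,
    /// Run only the cheap detectors.
    Partial,
    /// Detectors disabled entirely.
    Off,
}

#[derive(Parser)]
struct Cli {
    #[arg(long = "enable-detectors", value_enum, default_value = "full")]
    detectors: DetectorMode,
}

// At construction time, match on cli.detectors to decide which detector set
// (if any) gets built; same Option-shaped wiring as the boolean case above.
```

With value_enum, clap rejects anything other than full, partial, or off, which is exactly the property you want operators leaning on mid-incident.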
Where toggles pay off
Three regimes, in increasing order of how rarely they occur and how much they're worth when they do.
Mid-incident bisection
The case from the opening. CPU is pegged or memory is climbing. You have a hypothesis about which subsystem is responsible. Without a toggle, confirming the hypothesis means reading code, writing a fix, sending it through CI, deploying — easily an hour, possibly more, all while the production issue is live. With a toggle, it’s 2 minutes.
Even when the answer is “it wasn’t this subsystem,” the negative result is valuable — you’ve eliminated one hypothesis without the cost of pursuing the wrong fix.
Gradual rollout / shadow mode
When you ship a new subsystem (a new sink, a new detector, a new pricing model), the cleanest staged-rollout is to ship it disabled-by-default, behind a flag, then enable it in stages: dev → staging → one production region → all regions. The flag is the rollback mechanism. If anything misbehaves, flip the flag back; no redeploy needed.
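A sketch of that shape, with hypothetical names (enable_new_sink, NewSink, old_sink) standing in for the real subsystem: the new sink is default-off and additive, and it is never allowed to take the write path down.

```rust
// Hypothetical staged rollout of a new sink behind a default-off CLI flag.
#[arg(long)]
enable_new_sink: bool, // default: false; enabled per environment in the compose file

// At construction time:
let new_sink = if cli.enable_new_sink {
    Some(NewSink::connect(&cli.new_sink_url)?)
} else {
    None
};

// On the write path: the old sink stays authoritative; the new one is additive.
old_sink.write(&record)?;
if let Some(sink) = &new_sink {
    if let Err(e) = sink.write(&record) {
        // Failures in the staged sink are logged, never fatal. That is what
        // makes flipping the flag back a safe rollback.
        tracing::warn!("new sink write failed: {}", e);
    }
}
```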
This is essentially a feature flag, expressed in CLI arguments instead of a hosted service like LaunchDarkly. For features that cross the service boundary (user-facing, consumer-side behavior), a feature-flag service is the right tool. For binary-internal subsystems, a CLI flag is simpler and has zero external dependencies.
Regression bisection across binary versions
If a regression appeared somewhere between v1.0 and v1.5 and you don’t know which release introduced it, toggleable subsystems give you a search axis other than git bisect. Run v1.5 with each subsystem disabled in turn; the regression’s presence/absence under each toggle tells you where to look.
git bisect is more precise but slower (you build and run N intermediate commits). Toggle-bisect is faster but coarser (it eliminates at the subsystem level, not the commit level). They complement each other.
Anti-patterns
Three things to avoid:
1. Toggles that aren’t truly toggleable
The flag exists, but turning it off makes the binary crash, or the binary still loads the subsystem and runs initialization but skips the per-event hot path. Both defeat the diagnostic purpose. The contract for a kill-switch is “if this flag is off, the subsystem is not consuming CPU, memory, or I/O bandwidth” — not just “the visible per-event path is gated.”
Test this by toggling the flag in CI and asserting that the binary’s startup memory / connection count drops. If it doesn’t, the toggle is incomplete.
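One way to encode that contract in the test suite, sketched with hypothetical App and Config names:

```rust
// Sketch: with the flag off, the subsystem must never be constructed at all,
// so it cannot hold memory, threads, or connections. App and Config are
// hypothetical stand-ins for your binary's wiring.
#[test]
fn disabled_rule_engine_is_not_constructed() {
    let config = Config {
        enable_rules: false,
        ..Config::test_default()
    };
    let app = App::build(&config).expect("binary must start cleanly with the flag off");

    assert!(app.rule_engine.is_none(), "kill-switch left the rule engine constructed");
    assert_eq!(app.open_connections(), 0, "disabled subsystem still opened connections");
}
```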
2. Toggles that mutate other state
If --enable-X off causes the binary to clear its persisted state on startup, you’ve turned a diagnostic into a destructive operation. Operators won’t use it during incidents because the cost of being wrong is too high.
The discipline: a kill-switch should be stateless and reversible. Toggling it on and off should put the binary in identical operating state apart from “is the subsystem running.”
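In code, the gap between the destructive anti-pattern and the discipline is small but decisive (rules_state_dir is a hypothetical path):

```rust
// Anti-pattern: the flag does more than gate the subsystem; it mutates state.
if !cli.enable_rules {
    std::fs::remove_dir_all(&cli.rules_state_dir)?; // the toggle is now destructive
}

// Discipline: off means "not constructed, not running", and nothing else.
let rule_engine = cli
    .enable_rules
    .then(|| RuleEngine::new(&cli.rules_dir))
    .transpose()?;
```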
3. Too many overlapping flags
A binary with 20 boolean flags is harder to reason about than one with 5 enum-shaped flags grouping related toggles. Group by subsystem, not by feature.
# Bad — overlapping, confusing
--enable-rules-a --enable-rules-b --enable-rules-c
--detector-vwap-on --detector-cvd-on --detector-zscore-on
# Better
--rules-mode=all|partial|off
--detectors=vwap,cvd,zscore # or "all", "off"
Operators will use 5 grouped flags during an incident. They will not learn 20 individual ones.
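Here is a sketch of how the grouped detectors flag from the “better” example can parse, again with clap; the detector names are illustrative:

```rust
use clap::{Parser, ValueEnum};

/// Illustrative detector names; the real set is whatever your binary ships.
#[derive(Copy, Clone, Debug, PartialEq, Eq, ValueEnum)]
enum Detector {
    Vwap,
    Cvd,
    Zscore,
}

#[derive(Parser)]
struct Cli {
    /// --detectors=vwap,cvd,zscore ; omitting the flag runs with no detectors.
    #[arg(long, value_enum, value_delimiter = ',')]
    detectors: Vec<Detector>,
}
```

The "all" and "off" shorthands would need a few extra lines of custom parsing on top; the comma-separated list is what clap gives you directly, and it already fails loudly on a typo'd detector name.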
The cost-benefit math
Adding a kill-switch to a new subsystem is cheap: one CLI arg, one branch, one log line. Maybe 30 lines of code total including tests. You pay this once.
Using a kill-switch costs 2 minutes during an incident — 30 seconds to edit the compose file, 30 seconds for the container restart, 60 seconds for the CPU samples to stabilize.
Not using one when you need it: hours of investigation chasing the wrong subsystem.
The break-even is one incident. After the first time a kill-switch lets you bisect a hypothesis in 2 minutes, the discipline pays back its cost across the whole codebase. You don’t need to wait for that first incident to commit; the cost is so low that the right time to add a kill-switch is when you create the subsystem.
If you’re starting now: go through your long-running services and list the expensive subsystems. For each one not currently toggleable, that’s an issue to file. Land them as you touch the code for other reasons. Within a quarter you’ll have toggle coverage on the things that matter, and the next incident will cost minutes not hours.
Disclosure: My ideas and analysis. AI assisted with writing, code examples, and diagrams.