In a recent conversation, an engineering leader described success as how quickly managers jump in and close escalations. Crisis skills matter: calm triage, clear communication, decisive calls. The leaders I admire do that and then make tomorrow quieter.
Keep the visual above in mind as you read:
- Treadmill: escalation → status update → next escalation
- Flywheel: escalation → insight → guardrail → runbook → fewer pages
The goal is simple. Keep the ability to respond under pressure. Turn every incident into a small step that reduces the chance and impact of the next one.
What the treadmill looks like
- Status meetings move faster than learning.
- Fixes rely on a few heroes.
- The same alerts reappear with new ticket numbers.
- Confidence rises after a hot fix and fades by the next release.
What the flywheel looks like
- Each escalation produces one clear insight.
- Insights become guardrails: SLOs, timeouts, backpressure, flags, canaries, rollback.
- Guardrails are backed by a short runbook that anyone on call can follow.
- Fewer pages arrive next week. Confidence accumulates.
Turning an escalation into a flywheel step
Capture the insight
One sentence. Example: “Cache stampede on product detail under peak traffic.” Or a 5-whys style root cause like the one below (the missing guardrails are sketched after the list):
- Why did latency and 5xx spike on the product page?
- Many concurrent cache misses triggered recomputes that overloaded the source service.
- Why were there many recomputes at once?
- All pods refreshed the same hot key when the TTL expired at the same moment.
- Why did they all refresh simultaneously?
- No single-flight or request coalescing, and no stale-while-revalidate path.
- Why were those mechanisms missing?
- The caching library used plain GET/SET with a hard TTL and no jitter.
- Why was the library that limited?
- Caching patterns were not standardized; the design checklist lacked “cache truth” and dogpile control, and ownership was unclear.
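The missing mechanisms named in the last two answers are small. A minimal sketch, assuming a simple in-process cache wrapper; the class name, TTL values, and `recompute` callback are illustrative, not a specific library:

```python
# Minimal sketch of the guardrails the 5-whys found missing: single-flight
# (request coalescing), stale-while-revalidate, and TTL jitter. Names and
# numbers are illustrative; this is not a specific caching library.
import random
import threading
import time

class StampedeSafeCache:
    def __init__(self, recompute, ttl=300, stale_grace=60):
        self._recompute = recompute      # expensive source-of-truth call
        self._ttl = ttl                  # base freshness window, seconds
        self._stale_grace = stale_grace  # how long stale values may still be served
        self._entries = {}               # key -> (value, expires_at)
        self._locks = {}                 # key -> per-key lock for single-flight
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        now = time.time()
        entry = self._entries.get(key)
        if entry and now < entry[1]:
            return entry[0]                          # fresh hit

        lock = self._lock_for(key)
        if not lock.acquire(blocking=False):
            # Another caller is already refreshing this key.
            if entry and now < entry[1] + self._stale_grace:
                return entry[0]                      # serve stale while it revalidates
            lock.acquire()                           # no stale value: wait for the refresh
        try:
            entry = self._entries.get(key)
            if entry and time.time() < entry[1]:
                return entry[0]                      # refreshed while we waited
            value = self._recompute(key)             # exactly one recompute per key
            jitter = random.uniform(0.9, 1.1)        # desynchronize hot-key expiry
            self._entries[key] = (value, time.time() + self._ttl * jitter)
            return value
        finally:
            lock.release()
```

Even this much breaks the stampede: one caller recomputes a hot key, everyone else serves the previous value until the refresh lands, and jittered TTLs keep expiries from lining up.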
Add a guardrail
Examples:
- Stale-while-revalidate to limit stampedes
- Idempotent handlers and retry policy
- Circuit breaker and explicit timeouts (sketched after this list)
- SLO with burn alerts
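To make one of these concrete, here is a rough sketch of a circuit breaker with explicit timeouts, assuming the dependency is reached over HTTP via `requests`; the thresholds and URL are made-up illustrations, not tuned values:

```python
# Rough sketch of a circuit breaker wrapped around a call with explicit
# timeouts. Thresholds and the endpoint are illustrative assumptions.
import time
import requests

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold   # consecutive failures before opening
        self.reset_after = reset_after               # seconds before allowing a probe
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None                    # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()         # open: stop hammering the dependency
            raise
        self.failures = 0                            # success closes the circuit
        return result

breaker = CircuitBreaker()

def fetch_product(product_id):
    url = f"https://catalog.internal/products/{product_id}"     # hypothetical endpoint
    def do_request():
        # Explicit (connect, read) timeouts so a slow dependency cannot pin threads.
        resp = requests.get(url, timeout=(1.0, 2.0))
        resp.raise_for_status()                                  # treat 5xx as failure too
        return resp
    return breaker.call(do_request)
```

The point is not this exact class; it is that a slow dependency becomes a bounded, fast failure instead of a pile-up of stuck requests.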
Write the runbook
One page or less. Triggers, commands, owner, rollback, verification checks.
Practice once
A team drill. Prove that anyone on call can execute it.
Measure fewer pages
Track pages per week and time to rollback. Expect the curve to bend.
Guardrail starters that pay off quickly
- Deploy and release are separate. Use feature flags (a sketch follows this list).
- Releases are reversible. Keep a tested rollback for each change.
- Requests fail predictably. Timeouts, budgets, and backoff.
- Data paths are safe. Idempotency for writes and migrations.
- Traffic is honest. Rate limits and backpressure at ingress.
- Observability is ready. Dashboards for latency, errors, saturation, cost, and “what changed”.
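The first two starters fit in a few lines. A minimal sketch, assuming an in-memory dict stands in for your flag service and the pricing functions are placeholders:

```python
# Minimal sketch: deploy and release are separate, and the release is reversible.
# The in-memory dict stands in for a real flag service; flipping the flag is the
# rollback, with no redeploy. Function names and logic are placeholders.
FLAGS = {"new_pricing_engine": False}    # code ships dark; release = flip to True

def flag_enabled(name: str) -> bool:
    return FLAGS.get(name, False)

def legacy_pricing(cart):
    return sum(item["price"] for item in cart)

def new_pricing(cart):
    return round(sum(item["price"] for item in cart) * 0.97, 2)   # hypothetical new logic

def price(cart):
    if flag_enabled("new_pricing_engine"):
        return new_pricing(cart)         # released only to whoever the flag targets
    return legacy_pricing(cart)          # rollback path stays warm and tested
```

Rolling back is flipping the flag, not cutting a new release, which is what makes this guardrail cheap enough to use under pressure.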
A tiny runbook template
- Trigger: when to use this playbook
- Checks: dashboards to open
- Actions: commands or steps
- Decision points: continue or roll back
- Rollback: exact steps
- Validation: what good looks like
- Owner: who has the baton
Metrics that make tomorrow quieter
- Pages per week and per person
- Mean and p95 time to rollback (a computation sketch follows below)
- Incidents per change
- Guardrail changes per week
- Percentage of changes that are reversible
- Preventive work completed each sprint
Keep them lightweight. Learn, do not blame.
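Most of these fall out of incident records you already keep. A rough sketch of the first two metrics, with made-up sample records and assumed field names (`paged_at`, `rolled_back_at`):

```python
# Rough sketch: pages per week and mean/p95 time to rollback from incident
# records. The records and field names below are made up for illustration.
from datetime import datetime
from statistics import quantiles

incidents = [
    {"paged_at": datetime(2024, 5, 6, 9, 12),   "rolled_back_at": datetime(2024, 5, 6, 9, 31)},
    {"paged_at": datetime(2024, 5, 8, 22, 4),   "rolled_back_at": datetime(2024, 5, 8, 22, 10)},
    {"paged_at": datetime(2024, 5, 13, 14, 40), "rolled_back_at": datetime(2024, 5, 13, 15, 55)},
]

# Pages per ISO week.
pages_per_week = {}
for inc in incidents:
    week = inc["paged_at"].isocalendar()[:2]              # (year, week number)
    pages_per_week[week] = pages_per_week.get(week, 0) + 1

# Time to rollback in minutes, then mean and p95.
ttr = [(i["rolled_back_at"] - i["paged_at"]).total_seconds() / 60 for i in incidents]
mean_ttr = sum(ttr) / len(ttr)
p95_ttr = quantiles(ttr, n=20)[-1]                        # 95th percentile

print(pages_per_week)
print(f"mean time to rollback: {mean_ttr:.1f} min, p95: {p95_ttr:.1f} min")
```

The trend matters more than the exact numbers. If the flywheel is turning, both curves bend down.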
Roles: managers vs leaders
- Managers close escalations and keep the queue moving.
- Leaders turn escalations into guardrails, training, and fewer pages across teams. They scale prevention. 
Anti-patterns to retire
- Status without a design change
- Hero mode as a habit
- Postmortems that name people instead of conditions
- Big-bang fixes with no rollback
TL;DR
Firefighting builds short-term credibility. The best leaders turn that credibility into quieter systems by scaling prevention, not just response.
Choose the flywheel.