Treadmills exhaust teams. Flywheels compound trust.

In a recent conversation, an engineering leader defined success as how quickly managers jump in and close escalations. Crisis skills matter: calm triage, clear communication, decisive calls. The leaders I admire do that and then make tomorrow quieter.

Keep the visual above in mind as you read:

  • Treadmill: escalation → status update → next escalation

  • Flywheel: escalation → insight → guardrail → runbook → fewer pages

The goal is simple. Keep the ability to respond under pressure. Turn every incident into a small step that reduces the chance and impact of the next one.


What the treadmill looks like

  • Status meetings move faster than learning.

  • Fixes rely on a few heroes.

  • The same alerts reappear with new ticket numbers.

  • Confidence rises after a hot fix and fades by the next release.

What the flywheel looks like

  • Each escalation produces one clear insight.

  • Insights become guardrails: SLOs, timeouts, backpressure, flags, canaries, rollback.

  • Guardrails are backed by a short runbook that anyone on call can follow.

  • Fewer pages arrive next week. Confidence accumulates.


Turning an escalation into a flywheel step

Capture the insight

One sentence. Example: “Cache stampede on product detail under peak traffic.”
Or write a 5-whys style root cause. Example:
  • Why did latency and 5xx spike on the product page?
  • Many concurrent cache misses triggered recomputes that overloaded the source service.
    • Why were there many recomputes at once?
    • All pods refreshed the same hot key when the TTL expired at the same moment.
      • Why did they all refresh simultaneously?
      • No single-flight or request coalescing, and no stale-while-revalidate path.
        • Why were those mechanisms missing?
        • The caching library used plain GET/SET with a hard TTL and no jitter.
          • Why was the library that limited?
          • Caching patterns were not standardized; the design checklist lacked “cache truth” and dogpile control, and ownership was unclear.
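
The third and fourth answers point at the fix. Here is a minimal sketch, assuming an in-process Python cache; the class name `SingleFlightCache` and its parameters are hypothetical and not taken from any particular library. It shows single-flight loading plus TTL jitter. It does not implement stale-while-revalidate, and it only coalesces callers inside one process. The jitter is what keeps separate pods from expiring the same key at the same moment.

```python
import random
import threading
import time


class SingleFlightCache:
    """Illustrative in-process cache: one recompute per key at a time, jittered TTL."""

    def __init__(self, ttl_seconds, jitter_fraction=0.1, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.jitter_fraction = jitter_fraction
        self.clock = clock
        self._values = {}      # key -> (value, expires_at)
        self._key_locks = {}   # key -> lock guarding the recompute of that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another caller may have refilled the key while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = loader()
            # Spread expiry so replicas do not all expire together.
            jitter = random.uniform(-self.jitter_fraction, self.jitter_fraction)
            expires_at = self.clock() + self.ttl_seconds * (1 + jitter)
            self._values[key] = (value, expires_at)
            return value
```

When several threads ask for the same cold key, the loader runs once and the others wait for its result instead of hitting the source service.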

Add a guardrail

Examples:

  • Stale-while-refresh to limit stampedes
  • Idempotent handlers and retry policy
  • Circuit breaker and explicit timeouts
  • SLO with burn alerts
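
For the last item, a burn alert compares how fast errors arrive with how much error budget the SLO allows. This is a minimal sketch, assuming a request-based SLO. The function names are hypothetical, and the 14.4 default is a commonly cited starting threshold for a one-hour window against a 30-day budget, not a rule. Tune it to your own SLO period.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn fast.

    Each window is an (errors, requests) pair. The long window shows the
    problem is real; the short window shows it is still happening.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

With a 99.9% target, 50 errors in 10,000 requests is a burn rate of roughly 5: the budget is being spent about five times faster than planned.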

Write the runbook

One page or less. Triggers, commands, owner, rollback, verification checks.

Practice once

A team drill. Prove that anyone on call can execute it.

Measure fewer pages

Track pages per week and time to rollback. Expect the curve to bend.

Guardrail starters that pay off quickly

  • Deploy and release are separate. Use feature flags.

  • Releases are reversible. Keep a tested rollback for each change.

  • Requests fail predictably. Timeouts, budgets, and backoff.

  • Data paths are safe. Idempotency for writes and migrations.

  • Traffic is honest. Rate limits and backpressure at ingress.

  • Observability is ready. Dashboards for latency, errors, saturation, cost, and “what changed”.
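
“Requests fail predictably” can be sketched as a retry helper that gives up when an overall time budget runs out. This assumes the wrapped operation is idempotent and accepts a timeout in seconds. `call_with_retries` and `BudgetExceeded` are illustrative names, not a library API.

```python
import random
import time


class BudgetExceeded(Exception):
    """Raised when attempts or the time budget run out before a call succeeds."""


def call_with_retries(operation, budget_seconds, max_attempts=4,
                      base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    deadline = clock() + budget_seconds
    last_error = None
    attempts_made = 0
    for attempt in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        attempts_made += 1
        try:
            # The operation gets the remaining budget as its own timeout.
            return operation(remaining)
        except Exception as error:
            last_error = error
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            delay = random.uniform(0, cap)
            # Never sleep past the deadline.
            sleep(min(delay, max(0.0, deadline - clock())))
    raise BudgetExceeded(f"gave up after {attempts_made} attempt(s)") from last_error
```

Passing the remaining budget down means a slow dependency cannot hold the caller longer than the caller agreed to wait.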


A tiny runbook template

  • Trigger: when to use this playbook

  • Checks: dashboards to open

  • Actions: commands or steps

  • Decision points: continue or roll back

  • Rollback: exact steps

  • Validation: what good looks like

  • Owner: who has the baton
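
If runbooks live next to the code, the template can be checked automatically. This is a small sketch, assuming runbooks are stored as structured data. The `Runbook` class and `missing_sections` helper are hypothetical names whose fields mirror the seven headings above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    trigger: str = ""
    checks: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    decision_points: list = field(default_factory=list)
    rollback: list = field(default_factory=list)
    validation: str = ""
    owner: str = ""


def missing_sections(runbook):
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]
```

A review step could refuse to accept a runbook while `missing_sections` returns anything.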


Metrics that make tomorrow quieter

  • Pages per week and per person

  • Mean and p95 time to rollback

  • Incidents per change

  • Guardrail changes per week

  • Percentage of changes that are reversible

  • Preventive work completed each sprint

Keep them lightweight. Learn, do not blame.
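
The first two metrics need very little tooling. This is a minimal sketch, assuming you can export page dates and rollback durations in minutes from your paging tool. The function names are illustrative, and p95 here uses the simple nearest-rank method.

```python
from collections import Counter
from statistics import mean


def pages_per_week(page_dates):
    """Count pages per ISO week, keyed by (year, week number)."""
    return Counter(tuple(d.isocalendar()[:2]) for d in page_dates)


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("p95 needs at least one value")
    # Integer arithmetic for ceil(0.95 * n) avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rollback_summary(minutes_to_rollback):
    """Mean and p95 time to rollback, in the same unit as the input."""
    return {"mean": mean(minutes_to_rollback), "p95": p95(minutes_to_rollback)}
```

Reporting the p95 next to the mean keeps one slow rollback from hiding behind many fast ones.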


Roles: managers vs. leaders

  • Managers close escalations and keep the queue moving.

  • Leaders turn escalations into guardrails, training, and fewer pages across teams. They scale prevention.

Be both: a manager and a leader.

Anti-patterns to retire

  • Status without a design change

  • Hero mode as a habit

  • Postmortems that name people instead of conditions

  • Big-bang fixes with no rollback


TL;DR

Firefighting builds short-term credibility. The best leaders turn that credibility into quieter systems by scaling prevention, not just response.

Choose the flywheel.
