Big Design Phases Don’t Ship, Daily Risk Reduction Does

Thesis: In complex programs, what moves systems to production isn’t a perfect plan; it’s a daily discipline of reducing risk. “Design” isn’t a kickoff artifact; it’s how we make safer, smaller, more reversible decisions every day.

Related: my earlier post on estimation argued for appetite → slicing → triggers (“estimate to decide, then revisit on signals”). This post is the companion playbook: how to run the day-to-day so delivery stays safe and sane.

Why big design phases stall

  • Assumptions go stale. Dependencies, traffic, and org boundaries shift faster than a long design cycle.

  • Coordination tax grows. More reviewers ≠ more safety; often they just mean slower feedback on the wrong risks.

  • Irreversibility sneaks in. One-way doors (schemas, data moves, infra) get locked before they’re tested in the wild.

The antidote isn’t less design—it’s design practiced as daily risk reduction.


The daily risk-reduction loop

Observe → Decide → Act → Verify.
Short loops, built into normal work, aimed at shrinking the blast radius and increasing reversibility.

  • Observe: SLOs, error budgets, cost, “what changed?” telemetry.

  • Decide: Next step that buys down the biggest risk.

  • Act: Small, reversible change with a rollback.

  • Verify: Canaries, flags, traces, pre-baked dashboards.

Repeat. Relentlessly.
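
To make the loop concrete, here is a minimal sketch of a single pass in Python. Everything in it is an assumption for illustration: the `Risk` record, its `score` scale, and the four callables stand in for whatever telemetry, deploy, and rollback tooling a team already has.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Risk:
    name: str
    score: float  # hypothetical scale: higher means more worrying


def run_loop_once(
    observe: Callable[[], List[Risk]],
    act: Callable[[Risk], None],
    verify: Callable[[Risk], bool],
    rollback: Callable[[Risk], None],
) -> str:
    """Run one Observe -> Decide -> Act -> Verify pass and report the outcome."""
    risks = observe()                        # Observe: gather current signals
    if not risks:
        return "nothing to do"
    top = max(risks, key=lambda r: r.score)  # Decide: biggest risk first
    act(top)                                 # Act: small, reversible change
    if verify(top):                          # Verify: did the signals hold?
        return f"kept: {top.name}"
    rollback(top)                            # Not verified: undo the change
    return f"rolled back: {top.name}"
```

The point of the shape is that rollback is a first-class argument: a step that cannot supply one does not fit the loop.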


Eight habits of teams that ship

  1. Guardrails first
    Write the control points before the code: SLOs, allowed blast radius, rollback criteria, data contracts, privacy/PII rules. Keep a living “risk ledger” with the top 3 unknowns.

  2. Design for reversibility
    Decouple deploy from release (feature flags), canary by segment/region, and make operations idempotent. Use the transactional outbox pattern to keep dual writes and migrations safe.

  3. Contract-first interfaces
    Treat APIs and schemas as contracts. Version on purpose, evolve with additive changes, and use consumer-driven contracts to catch breakage before prod.

  4. Sequence for independence
    Prefer integrable seams over big-bang cutovers. Enable one service or path to go live quietly while others evolve. Align milestones to coherent increments, not org charts.

  5. Instrument before you integrate
    Golden signals (latency, traffic, errors, saturation) plus cost, request tracing, and change logs. If you can’t answer “what changed recently?”, you can’t operate safely.

  6. Practice failure, not just success
    Game days, “rollback in 1” drills, dependency kill-switches, read-only fallbacks. Prove the escape hatches early.

  7. Replan on signals, not calendars
    Define re-estimate triggers (new dependency, defect spike, a slice that keeps rolling over). When a trigger fires, reshape scope or order. No blame, just a new decision. (This pairs with the earlier estimation post.)

  8. Measure risk burndown
    Track unknowns retired, the share of changes that are reversible, incidents per change, time-to-rollback, and cost trend. Precision follows practice.
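
For habit 8, a rough sketch of how a few of those numbers could be computed from a change log. The `Change` record and its field names are hypothetical; a real team would map them onto whatever its deploy tooling already records.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Change:
    reversible: bool                          # shipped behind a flag or with a tested rollback
    caused_incident: bool
    rollback_seconds: Optional[float] = None  # None if the change was never rolled back


def risk_burndown(changes: List[Change]) -> Dict[str, float]:
    """Summarise a batch of changes into a few risk-burndown numbers."""
    if not changes:
        raise ValueError("need at least one change")
    n = len(changes)
    rollbacks = [c.rollback_seconds for c in changes if c.rollback_seconds is not None]
    return {
        "reversible_share": sum(c.reversible for c in changes) / n,
        "incidents_per_change": sum(c.caused_incident for c in changes) / n,
        # NaN when nothing was rolled back in this batch
        "median_time_to_rollback_s": median(rollbacks) if rollbacks else float("nan"),
    }
```

Watching the trend week over week matters more than any single value, which is why the sketch returns plain numbers that are easy to chart.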


Two tiny templates you can copy

A) Daily risk review (≤90 seconds in standup)

  • New dependency or approval discovered?

  • Any SLO or cost threshold breached?

  • Any slice rolled over twice?

  • Biggest irreversibility creeping in?

  • Next step that buys down the top risk?
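
The first three questions are mechanical enough to check before standup. A small sketch, assuming a hypothetical `DailySignals` record filled in from your tracker and dashboards; the last two questions still need a human conversation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailySignals:
    new_dependencies: int     # dependencies or approvals discovered since yesterday
    slo_breached: bool
    cost_breached: bool
    max_slice_rollovers: int  # most times any single slice has rolled over


def fired_triggers(s: DailySignals) -> List[str]:
    """Return the re-plan triggers that fired today, in template order."""
    fired = []
    if s.new_dependencies > 0:
        fired.append("new dependency or approval")
    if s.slo_breached or s.cost_breached:
        fired.append("SLO or cost threshold breached")
    if s.max_slice_rollovers >= 2:
        fired.append("slice rolled over twice")
    return fired
```

An empty list means the review takes ten seconds; a non-empty one is the agenda.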

B) Rollout runbook (skeleton)

  • Scope & success criteria (SLO, blast radius)

  • Flag/segment/canary plan (who/where/how long)

  • Telemetry to watch (dashboards, traces, “what changed”)

  • Rollback command + data repair plan

  • Owner on call + communication path
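
Two pieces of the runbook lend themselves to code: who is in the canary, and when to pull it. The sketch below is one possible approach, not a prescribed one. It assumes hash-based bucketing for the flag and a simple error-rate ratio as the rollback criterion; the function names, the `max_ratio` default, and the `floor` value are all illustrative.

```python
import hashlib


def in_canary(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically decide whether a user sees the flagged change.

    The same user always lands in the same bucket for a given flag, so raising
    `percent` only ever adds users to the canary.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_roll_back(
    canary_error_rate: float,
    baseline_error_rate: float,
    max_ratio: float = 2.0,  # assumed tolerance: canary may be at most 2x baseline
    floor: float = 0.001,    # assumed noise floor so a near-zero baseline does not trip it
) -> bool:
    """Roll back when the canary's error rate clearly exceeds the baseline."""
    return canary_error_rate > max(baseline_error_rate * max_ratio, floor)
```

Writing the rollback criterion down as a function before the rollout removes the temptation to negotiate it during an incident.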


Scenario notes (quick hits)

  • Large data moves: mirror mode, dual-write with checksums, cut by tenant/region, verify rowcounts & drift before commit.

  • Safety/compliance: pre-bake audit evidence, privacy by construction, change windows with go/no-go criteria.

  • Multi-region reads: read-local, write-central; explicit staleness SLAs; promote region only when guardrails hold.

  • Caching at scale: versioned keys, stale-while-revalidate, single-flight/lease to stop dogpiles; choose an invalidation authority (CDC/binlog vs write-through vs events).
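
For the large-data-move note, a minimal sketch of the "verify rowcounts & drift" step. It assumes rows arrive as dictionaries with a primary key named `id`, and it compares per-row SHA-256 checksums; a real migration would likely push this into the database or run it per tenant/region batch.

```python
import hashlib
from typing import Dict, Hashable, Iterable, List, Mapping


def row_checksum(row: Mapping[str, object]) -> str:
    """Checksum a row over its columns in sorted order so column order is irrelevant."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()


def drift_report(
    source: Iterable[Mapping[str, object]],
    target: Iterable[Mapping[str, object]],
    key: str = "id",  # assumed primary-key column
) -> Dict[str, object]:
    """Compare two copies of a table and list what differs."""
    source, target = list(source), list(target)
    src: Dict[Hashable, str] = {r[key]: row_checksum(r) for r in source}
    dst: Dict[Hashable, str] = {r[key]: row_checksum(r) for r in target}
    mismatched: List = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return {
        "rowcount_match": len(source) == len(target),
        "missing_in_target": sorted(src.keys() - dst.keys()),
        "extra_in_target": sorted(dst.keys() - src.keys()),
        "mismatched": mismatched,
    }
```

Note that matching row counts alone prove little: in the test data used to check this sketch, both sides have three rows yet one is missing, one is extra, and one has drifted.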


Anti-patterns to retire

  • Design theater: heavyweight docs that don’t change the next step.

  • Big-bang cutovers: betting the quarter on one push.

  • “We’ll monitor later.” If it isn’t observable, it isn’t ready.

  • Hero mode: speed demanded instead of speed engineered.


TL;DR

Big design phases don’t ship, daily risk reduction does.
Make design the way you operate, not a ceremony. Teams that practice small, reversible steps verified by signals not only ship more; they also break less and sleep better 💤.


If you want the estimation companion (appetite → slicing → triggers), see the earlier post. This piece is the operational half: the habits that turn designs into safe deliveries.
