Thesis: In complex programs, what moves systems to production isn’t a perfect plan; it’s a daily discipline of reducing risk. “Design” isn’t a kickoff artifact; it’s how we make safer, smaller, more reversible decisions every day.
Related: my earlier post on estimation argued for appetite → slicing → triggers (“estimate to decide, then revisit on signals”). This post is the companion playbook: how to run the day-to-day so delivery stays safe and sane.
Why big design phases stall
- Assumptions go stale. Dependencies, traffic, and org boundaries shift faster than a long design cycle.
- Coordination tax grows. More reviewers ≠ more safety; often it’s slower feedback on the wrong risks.
- Irreversibility sneaks in. One-way doors (schemas, data moves, infra) get locked before they’re tested in the wild.
The antidote isn’t less design—it’s design practiced as daily risk reduction.
The daily risk-reduction loop
Observe → Decide → Act → Verify.
Short loops, built into normal work, aimed at shrinking the blast radius and increasing reversibility.
- Observe: SLOs, error budgets, cost, “what changed?” telemetry.
- Decide: Next step that buys down the biggest risk.
- Act: Small, reversible change with a rollback.
- Verify: Canaries, flags, traces, pre-baked dashboards.
Repeat. Relentlessly.
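To make the Verify step concrete, here is a minimal sketch of a canary gate that compares a canary's error rate with the baseline and returns the next action. The names (`Window`, `verify_canary`) and the thresholds are illustrative assumptions, not a recommendation for any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request and error counts observed over one comparison window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def verify_canary(baseline: Window, canary: Window,
                  max_ratio: float = 2.0, min_requests: int = 500) -> str:
    """Return 'wait', 'rollback', or 'promote' for one pass of the loop.

    max_ratio and min_requests are assumed values; tune them to your SLOs.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    # Absolute floor (0.1%) so a zero-error baseline does not make
    # a single canary error an automatic rollback.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "rollback" if canary.error_rate > allowed else "promote"
```

The point of writing the gate down is that "rollback" becomes a pre-agreed outcome of the loop rather than a debate during an incident.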
Eight habits of teams that ship
- Guardrails first
  Write the control points before the code: SLOs, allowed blast radius, rollback criteria, data contracts, privacy/PII rules. Keep a living “risk ledger” with the top 3 unknowns.
- Design for reversibility
  Decouple deploy from release (feature flags), canary by segment/region, make operations idempotent. Use transactional/outbox patterns for dual-writes and safe migrations.
- Contract-first interfaces
  Treat APIs and schemas as contracts. Version on purpose, evolve with additive changes, and use consumer-driven contracts to catch breakage before prod.
- Sequence for independence
  Prefer integrable seams over big-bangs. Enable one service or path to go live quietly while others evolve. Align milestones to coherent increments, not team org charts.
- Instrument before you integrate
  Golden signals (latency, errors, saturation, cost), request tracing, change logs. If you can’t answer “what changed recently?” you can’t operate safely.
- Practice failure, not just success
  Game days, “rollback in 1” drills, dependency kill-switches, read-only fallbacks. Prove the escape hatches early.
- Replan on signals, not calendars
  Define re-estimate triggers (new dependency, defect spike, repeated rollovers). When a trigger fires, reshape scope or order. No blame—just a new decision. (This pairs with the earlier estimation post.)
- Measure risk burndown
  Track unknowns retired, reversibility % of changes, incidents per change, time-to-rollback, and cost trend. Precision follows practice.
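One way to picture "decouple deploy from release" from the reversibility habit: ship the code dark, then release it to a stable percentage of users. The sketch below uses deterministic hash bucketing; the function name `in_rollout` and the flag/user identifiers are hypothetical, and a real flag service would add targeting rules and a kill switch.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically place a user in or out of a percentage rollout.

    The same (flag, user) pair always lands in the same bucket, so raising
    the percentage only adds users and lowering it is an instant rollback.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5 covers buckets 0..499


# Usage: the new code path is deployed but only released to 5% of users.
if in_rollout("new-checkout", "user-42", percent=5):
    pass  # call the new path here
else:
    pass  # call the existing path here
```

Hashing the flag name together with the user ID keeps rollouts independent, so the same users are not always the first to see every change.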
Two tiny templates you can copy
A) Daily risk review (≤90 seconds in standup)
- New dependency or approval discovered?
- Any SLO or cost threshold breached?
- Any slice rolled over twice?
- Biggest irreversibility creeping in?
- Next step that buys down the top risk?
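The first three questions in the review above are mechanical enough to check before standup. A rough sketch, assuming the signals are collected by hand or from your tracker; `DailySignals`, `fired_triggers`, and the field names are hypothetical, and the last two questions stay a human judgment call.

```python
from dataclasses import dataclass, field


@dataclass
class DailySignals:
    """Inputs to the daily review; all fields are assumed, not a standard."""
    new_dependencies: list[str] = field(default_factory=list)
    breached_thresholds: list[str] = field(default_factory=list)
    rollovers_by_slice: dict[str, int] = field(default_factory=dict)


def fired_triggers(signals: DailySignals, max_rollovers: int = 2) -> list[str]:
    """List every re-estimate trigger that fired today, in review order."""
    fired = [f"new dependency: {d}" for d in signals.new_dependencies]
    fired += [f"threshold breached: {t}" for t in signals.breached_thresholds]
    fired += [
        f"slice rolled over {count}x: {name}"
        for name, count in sorted(signals.rollovers_by_slice.items())
        if count >= max_rollovers
    ]
    return fired


today = DailySignals(
    new_dependencies=["legal review"],
    rollovers_by_slice={"import": 2, "export": 1},
)
for trigger in fired_triggers(today):
    print(trigger)
```

An empty list means the plan stands; anything else is the agenda for the next decision.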
B) Rollout runbook (skeleton)
- Scope & success criteria (SLO, blast radius)
- Flag/segment/canary plan (who/where/how long)
- Telemetry to watch (dashboards, traces, “what changed”)
- Rollback command + data repair plan
- Owner on call + communication path
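If the runbook lives next to the code, a small check can refuse a rollout while any section is blank. This is only a sketch: the `RolloutRunbook` fields mirror the skeleton above, and the field names and the `missing_sections` helper are assumptions for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class RolloutRunbook:
    """One field per line of the skeleton; names are illustrative."""
    scope_and_success: str = ""
    canary_plan: str = ""
    telemetry: str = ""
    rollback_command: str = ""
    data_repair_plan: str = ""
    owner_on_call: str = ""
    comms_path: str = ""


def missing_sections(runbook: RolloutRunbook) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(runbook)
            if not getattr(runbook, f.name).strip()]


draft = RolloutRunbook(scope_and_success="p99 < 300 ms, 5% of tenants")
gaps = missing_sections(draft)
if gaps:
    print("Not ready to roll out; missing:", ", ".join(gaps))
```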
Scenario notes (quick hits)
- Large data moves: mirror mode, dual-write with checksums, cut by tenant/region, verify rowcounts & drift before commit.
- Safety/compliance: pre-bake audit evidence, privacy by construction, change windows with go/no-go criteria.
- Multi-region reads: read-local, write-central; explicit staleness SLAs; promote region only when guardrails hold.
- Caching at scale: versioned keys, stale-while-revalidate, single-flight/lease to stop dogpiles; choose an invalidation authority (CDC/binlog vs write-through vs events).
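For the large-data-move note, "verify rowcounts & drift before commit" can be sketched as a per-row checksum comparison between source and target. The assumptions here: rows fit in memory, each row is a tuple of strings whose first element is the primary key, and `drift_report` / `safe_to_cut_over` are hypothetical helper names. A real migration would stream and compare in batches per tenant or region.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, ...]  # assumption: first element is the primary key


def row_digest(row: Row) -> str:
    """Checksum one row; the unit separator avoids ('ab','c') == ('a','bc')."""
    return hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()


def drift_report(source: Iterable[Row], target: Iterable[Row]) -> Dict[str, List[str]]:
    """Compare two copies of a table by key and by per-row checksum."""
    src = {row[0]: row_digest(row) for row in source}
    dst = {row[0]: row_digest(row) for row in target}
    return {
        "missing_in_target": sorted(src.keys() - dst.keys()),
        "unexpected_in_target": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


def safe_to_cut_over(report: Dict[str, List[str]]) -> bool:
    """Only cut over when every drift bucket is empty."""
    return not any(report.values())


source = [("1", "ann", "a@x"), ("2", "bob", "b@x"), ("3", "cy", "c@x")]
target = [("1", "ann", "a@x"), ("2", "bob", "CHANGED"), ("4", "dee", "d@x")]
report = drift_report(source, target)
print(report, safe_to_cut_over(report))
```

Row-count equality falls out of the two key-difference buckets, so a matching count with mismatched contents is still caught.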
Anti-patterns to retire
- Design theater: heavyweight docs that don’t change the next step.
- Big-bang cutovers: betting the quarter on one push.
- “We’ll monitor later.” If it isn’t observable, it isn’t ready.
- Hero mode: speed demanded instead of speed engineered.
TL;DR
Big design phases don’t ship; daily risk reduction does.
Make design the way you operate, not a ceremony. Teams that practice small, reversible steps, verified by signals, not only ship more but also break less and even sleep better 💤.
If you want the estimation companion (appetite → slicing → triggers), see the earlier post. This piece is the operational half: the habits that turn designs into safe deliveries.