I first stumbled onto this idea during a holiday outage that hit two months into my Head of Engineering role, when a silent replay failure taught me more about how the team actually communicated than my entire onboarding had. That incident reinforced something I've seen play out a dozen times since: a P0 is an organizational MRI. It lights up every structural weakness you've been tolerating, from unclear ownership boundaries to missing escalation paths to political dynamics invisible during normal operations.

The Peacetime Org Chart Is Incomplete

Every company has two org charts, and the gap between them is where incidents go to get worse. The first lives in your HR system: clean boxes, dotted lines, reporting relationships that look sensible on paper. It describes authority and reporting, which is real enough during business hours. The second activates when production is on fire and your phone won’t stop buzzing, mapping something different entirely: who actually knows what, and who trusts whom enough to act on it. The two rarely match.

Ambiguous ownership is perfectly tolerable in peacetime. Two teams both think they own the API gateway, and as long as it’s working, the overlap is harmless. The database sits in a gray zone between platform and product engineering, and it’s been stable for months, so who cares? The moment something breaks in that gray zone, though, ambiguity becomes a vacuum, and vacuums during incidents are where minutes turn into hours.

I’ve watched teams burn twenty minutes during a P0 figuring out who should be investigating what, every one of those minutes stretching like an hour. The engineers were plenty competent; the organization had never forced clarity on who owned the intersection of two systems. These intersections stay invisible when things work and become inescapable when they don’t.

Three Patterns That Incidents Expose

After years of running incident retrospectives, the same structural failures surface repeatedly. Good incident response demands both technical and organizational readiness, and the organizational side is chronically underserved. Three patterns show up most often.

The Hero Bottleneck. One person always ends up driving the incident, and it’s usually the person with enough cross-system context to diagnose what’s happening, regardless of whether they’re on-call or own the affected system. Every team knows who this person is, and every team quietly dreads the week that person takes vacation.

If your incident response depends on a specific person being awake, what you’ve built is a single point of failure with a salary.

The Escalation Void. An incident starts at the team level, the team realizes they can't resolve it alone, and everything stalls because there's no defined path for what happens next. They need another team, or someone with authority to make a call that crosses team boundaries, and without a clear escalation structure they either try to solve it themselves for too long or page everyone they can think of, flooding the channel with people who lack context. Both responses waste time, and both are symptoms of the same gap: the organization never designed what's supposed to happen when a problem outgrows a single team.

The Political Handoff. This one is subtle and toxic. One team identifies the root cause in another team's system. Instead of collaborating on a fix, they write a message in the incident channel that's a CYA move: "We've confirmed the issue is in the billing service. Handing off to the billing team." Then they disengage. The billing team, who woke up minutes ago with no context, starts from scratch. The incident timer keeps running while organizational politics play out in real time.

What Incidents Teach You (If You Listen)

The most valuable artifact from any incident is the organizational map it draws for you, and most teams throw it away by focusing exclusively on the technical post-mortem.

After that incident, I started treating every P0 retrospective as an org design review, shifting the lens from the technical failure to what the incident revealed about how we’re structured. Those answers proved more useful than anything I got from planned organizational assessments.

Three metrics reshaped how I read incidents:

Time to first responder with cross-system context. If it takes thirty minutes to get someone in the room who understands how service A talks to service B, you have an ownership gap between those two services. This metric alone exposed three critical gray zones in our architecture that no amount of diagramming had surfaced.

Number of ad-hoc pages. Every time someone bypasses the escalation list to page the person they trust to fix things, that’s a signal that your formal on-call rotation doesn’t match your actual knowledge distribution. Tracking those informal pull-ins reveals where your real subject-matter expertise lives.

Decision latency during cross-team issues. How long does it take to make a call that requires agreement from two team leads? If that number is high, your teams lack trust, shared context, or a clear tiebreaker. This is a management problem that engineering excellence alone can’t solve.
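These three metrics can be extracted from incident timelines automatically. Here is a minimal sketch of that computation; the event schema below is invented for illustration and isn't tied to any particular incident-management tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    """One entry in a hypothetical incident event log."""
    timestamp: datetime
    kind: str                   # "page", "join", "decision_requested", "decision_made"
    person: str = ""
    via_rotation: bool = True   # False = someone bypassed the formal on-call list
    cross_system: bool = False  # responder has context spanning the affected services

def incident_metrics(start: datetime, events: list[Event]) -> dict:
    """Compute the three organizational metrics for a single incident."""
    # 1. Time to first responder with cross-system context.
    cross_joins = [e for e in events if e.kind == "join" and e.cross_system]
    time_to_cross = min((e.timestamp - start for e in cross_joins), default=None)

    # 2. Ad-hoc pages: pages sent outside the formal escalation list.
    ad_hoc = sum(1 for e in events if e.kind == "page" and not e.via_rotation)

    # 3. Decision latency: request-to-decision gap for cross-team calls,
    #    pairing each request with the decision that follows it.
    requested = [e.timestamp for e in events if e.kind == "decision_requested"]
    decided = [e.timestamp for e in events if e.kind == "decision_made"]
    latencies = [d - r for r, d in zip(requested, decided)]

    return {
        "time_to_cross_system_responder": time_to_cross,
        "ad_hoc_pages": ad_hoc,
        "mean_decision_latency": (sum(latencies, timedelta()) / len(latencies)
                                  if latencies else None),
    }
```

Run over a quarter's worth of incidents, even a rough version of this surfaces the patterns: a consistently high time-to-cross-system-responder points at an ownership gap, and a steady stream of ad-hoc pages names your hero bottleneck for you.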

Every incident is an involuntary org design review. You can either treat it as one or waste the signal.

Designing for the Worst Page of Your Life

Once I started reading incidents as organizational signals, I changed how we structured ownership and escalation. The moves themselves are concrete; specificity is what matters when someone is staring at a broken production system with one eye open.

First, we mapped every service and infrastructure component to a single owning team. Where two teams touched the same system, we drew a line and assigned one side to each. Single ownership carries real costs, among them knowledge silos and duplicated effort, but for incident-response clarity those costs are worth paying. The conversations about where to draw those lines were uncomfortable, and that discomfort was the point: we were resolving in a conference room what would otherwise get resolved during a P0.
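One way to keep that single-owner rule from eroding is to encode the map as data and have CI flag any component that falls back into a gray zone. A minimal sketch, with invented service and team names:

```python
# Hypothetical ownership map; service and team names are illustrative.
OWNERS: dict[str, str] = {
    "api-gateway": "platform",
    "billing-service": "billing",
    "orders-db": "platform",
    "checkout-ui": "product",
}

def find_gaps(components: list[str], owners: dict[str, str]) -> list[str]:
    """Return deployed components with no owning team -- the gray
    zones that turn into vacuums during a P0."""
    return [c for c in components if c not in owners]
```

A CI check like `assert not find_gaps(deployed_components, OWNERS)` makes the uncomfortable conversation happen at merge time instead of at three in the morning.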

Second, we built explicit escalation tiers with decision rights at each level. The on-call engineer can restart services, roll back deployments, and toggle feature flags. The team lead can pull in other teams, allocate engineering time, and make calls about data consistency tradeoffs. The engineering manager or I handle customer communication decisions, approve risky mitigation strategies, and break cross-team deadlocks. Authorization should never be a question during an active incident.

The goal is to make the org’s response to incidents reveal competence instead of confusion.

The Real Org Chart

I still run incident retrospectives with a dual lens: the technical failure and the organizational one. Technical fixes follow clear patterns: add a circuit breaker, fix the timeout, update the runbook. Organizational fixes are harder, because they require people to give up comfortable ambiguity in exchange for uncomfortable clarity.

That’s the trade. You can have a clean-looking org chart and chaotic incident response, or you can do the hard work of mapping ownership, defining escalation, and distributing knowledge so that when the next P0 jolts someone awake, the team executes instead of negotiates. The incident will draw your real org chart regardless, and the version it draws at three in the morning tends to be unflattering.