I’d been at Hiro for about a month when the Stacks API stalled over the holiday break. Most of the team was offline, scattered across time zones and family obligations, and I had barely found the light switches. The Stacks API was the backbone of our developer tooling for Bitcoin; I’d been mapping the codebase and sketching plans for 2022, forming early opinions on where the architecture needed to go. Then I was leading an incident response for a system I’d only been looking at for four weeks.
I got in the channel, started debugging alongside the team, and stayed through the late nights and odd hours until the service was restored, reading logs and working through potential fixes while building a mental model in real time. I also recognized quickly that the team needed space to fix the problem, so I took ownership of decisions and external communications, fielding stakeholder questions and posting status updates so the engineers could stay heads-down on the technical work.
In an enterprise, a production incident triggers a well-rehearsed machine: a Security Operations Center staffing a war room, defined on-call rotations, escalation matrices, change management boards, and a communications team drafting status updates. At a startup with fewer than thirty engineers, you get a Discord channel and whoever happens to be around.
That incident taught me more about Hiro’s engineering culture than my entire onboarding had.
The Incident
On December 20th around noon Eastern, unoptimized queries on the /transfers and /nft_events endpoints consumed all available database memory. The API slowed to a crawl and then stalled. We identified the offending queries, deployed a fix in v1.0.0, and took the green API offline to replay events, estimating three to five hours.
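I won't reproduce the exact queries here, but the class of fix is worth sketching. The pattern is to bound the work any single request can ask of the database; here's a minimal sketch assuming node-postgres, with hypothetical table names, column names, and limits rather than the API's real ones:

```typescript
import { Pool } from "pg";

// Hypothetical sketch, not the actual v1.0.0 fix: bound the work any
// single request can ask of the database.
const pool = new Pool({
  // Abort statements that run longer than 5s so one slow endpoint can't
  // pile up queries until shared database memory is exhausted.
  statement_timeout: 5_000,
});

const MAX_PAGE_SIZE = 50; // illustrative cap, not the API's real value

// Table and column names below are assumptions for the example.
export async function getNftEvents(recipient: string, limit = 20, offset = 0) {
  // Clamp client-supplied paging so no request can select an unbounded set.
  const safeLimit = Math.min(Math.max(limit, 1), MAX_PAGE_SIZE);
  const { rows } = await pool.query(
    `SELECT * FROM nft_events
     WHERE recipient = $1
     ORDER BY block_height DESC
     LIMIT $2 OFFSET $3`,
    [recipient, safeLimit, offset]
  );
  return rows;
}
```

The client that takes down your database is rarely malicious, just unbounded, which is why the clamp lives on the server.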
The replay failed silently overnight, and because no alerts were configured for that failure mode, nobody noticed. The system had booted normally and appeared to be running, but it had stalled syncing to the chain tip. The next morning, around 9:14 AM, someone on the team spotted an error in the logs and we developed a v1.0.1 patch. The replay restarted around 3 PM, succeeded around 10 PM, and additional follower pools restored normal performance by roughly 11 PM.
The full analysis is available in the public post mortem. What I keep coming back to is how much the incident revealed beyond the technical failure itself.
Enterprise Muscle Memory vs. Startup Reality
My instincts coming in were shaped by larger organizations where incident roles are well-defined: someone owns communications, someone owns the technical investigation, someone manages stakeholders, and the on-call engineer drives the timeline. Escalation paths are documented, war rooms have standing Zoom links, and change advisory boards review every deployment that touches production.
What I got instead was a handful of engineers in a Discord channel over the holidays, figuring out the blast radius in real time. Discord is noisy on a good day; during an active incident, critical updates about deployment status and database metrics competed with general chatter, and keeping the signal visible took constant effort. There was no escalation matrix to climb: every engineer available was already in the channel, and they were the entire response team.
That speed came at a cost. The event replay ran overnight with zero monitoring, and the failure stayed invisible until someone happened to check the logs the next morning. In an enterprise, alerting infrastructure would have caught that within minutes. We had the speed of a startup without the safety nets that make speed sustainable. The best startup incident response borrows selectively from enterprise patterns, and that’s exactly what we’ve been building in the weeks since.
What the Incident Taught Us
At a startup without redundant failover, the priority is getting the system back to a working state, even if that means applying a patch you don't fully understand yet and untangling the deeper cause afterward. For an availability incident, restore first and explain later; extended downtime compounds fast when there's no automatic failover. The overnight replay failure cost us more time than anything else in this incident, and it happened because the process had never been tested under real failure conditions.
Holiday season meant sparse availability, and both internal stakeholders and external developers building on the API needed constant updates, even when the update was “we still don’t know.” Silence would have been more expensive than bad news; developers needed to know whether to wait or find workarounds. We posted updates to the community forum, and the response was overwhelmingly positive, with people appreciating the transparency even when we didn’t have answers yet.
The event replay process had never been stress-tested. It was documented, but the failure mode we hit, a silent stall that reported normal status, had never been exercised. The procedure assumed a success path that reality didn’t follow, and because we’d never run through a failure scenario, we discovered the gap at the worst possible time.
What We’re Changing
A month out, the incident has become the forcing function for building operational maturity the team hadn't needed yet. We're putting three things in place, and we're already seeing early results.
First, comms protocols for incidents: who posts updates, how frequently, and through which channels. During the December incident, the handful of people responding sometimes duplicated effort because the communication flow was ad hoc, and Discord’s noise made it worse. We now have a dedicated incident channel format with a pinned status thread so critical updates don’t get buried. Even in the smaller issues we’ve handled since, knowing who owns status updates has noticeably reduced the chaos.
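For illustration, the pinned status message boils down to a handful of fields. Here's a sketch in TypeScript; the field names are our own convention for the example, not anything standard:

```typescript
// Hypothetical shape for the pinned status thread.
interface IncidentStatus {
  summary: string;       // one line of current impact, in plain language
  commsOwner: string;    // exactly one person owns posting updates
  nextUpdateBy: string;  // a promised time, even if the update is "still digging"
  workarounds: string[]; // anything developers can do in the meantime
}

export function formatPinnedStatus(s: IncidentStatus): string {
  return [
    `**Status:** ${s.summary}`,
    `**Comms owner:** ${s.commsOwner}`,
    `**Next update by:** ${s.nextUpdateBy}`,
    `**Workarounds:** ${s.workarounds.length > 0 ? s.workarounds.join("; ") : "none yet"}`,
  ].join("\n");
}
```

The field that earns its keep is nextUpdateBy: committing to a time for the next update, even an empty-handed one, is what keeps silence from getting expensive.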
Second, periodic dry runs of our recovery procedures. We ran our first one two weeks ago, replaying the exact event replay process that failed silently in December. It exposed three assumptions in the runbook that would have bitten us again: a hardcoded path that had changed, a timeout value too short for the current chain state, and a missing check for sync status before declaring the replay complete. The gap between documented and tested is enormous, and we’re committing to running these regularly.
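The missing sync check is trivial once written down, which is the point of dry runs. Here's roughly what the runbook now calls for, sketched with endpoint paths and response fields as assumptions (and assuming Node 18+ for fetch): compare the API's indexed height against the node's chain tip before declaring the replay done.

```typescript
// Sketch of the check the runbook was missing. URLs, paths, and response
// fields are assumptions for the example.
const NODE_URL = "http://localhost:20443"; // hypothetical stacks-node RPC
const API_URL = "http://localhost:3999";   // hypothetical API instance
const MAX_LAG_BLOCKS = 1; // how far behind the tip still counts as synced

async function replayLooksComplete(): Promise<boolean> {
  const node = await (await fetch(`${NODE_URL}/v2/info`)).json();
  const api = await (await fetch(`${API_URL}/extended/v1/status`)).json();
  const apiHeight = api.chain_tip?.block_height ?? 0;
  // A replay that exits cleanly but leaves the API below the node's tip
  // has stalled silently: the December failure mode.
  return node.stacks_tip_height - apiHeight <= MAX_LAG_BLOCKS;
}
```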
Third, alerting for the failure modes that bit us. The silent replay stall was a symptom of a broader pattern: our monitoring was optimized for failures we’d already seen, leaving blind spots for novel ones. We’ve added alerts for chain tip sync stalls and replay process health, and we’re building a habit of asking “what would fail silently?” during architecture reviews to surface monitoring gaps before they become incident gaps.
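The stall alert is conceptually the same check running continuously: don't ask whether the process is up, ask whether the height is moving. A minimal watchdog sketch, again with the status endpoint path and response shape as assumptions:

```typescript
// Watchdog for the "what would fail silently?" class: alert when the
// indexed height stops advancing, regardless of process health.
const STATUS_URL = "http://localhost:3999/extended/v1/status"; // assumed path
const CHECK_INTERVAL_MS = 60_000;
const STALL_THRESHOLD_MS = 10 * 60_000; // ten minutes with no new block: page someone

let lastHeight = -1;
let lastAdvance = Date.now();

async function checkOnce(alert: (msg: string) => void): Promise<void> {
  const status = await (await fetch(STATUS_URL)).json();
  const height: number = status.chain_tip?.block_height ?? -1;
  if (height > lastHeight) {
    lastHeight = height;
    lastAdvance = Date.now();
  } else if (Date.now() - lastAdvance > STALL_THRESHOLD_MS) {
    // Up and answering HTTP, but not syncing: healthy to a liveness probe,
    // stuck in every way that matters.
    alert(`chain tip stalled at height ${lastHeight} for over 10 minutes`);
  }
}

setInterval(() => checkOnce(console.error).catch(console.error), CHECK_INTERVAL_MS);
```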
The Gift Disguised as a Crisis
A first fire at a new company compresses months of organizational learning into a few intense days. It reveals who steps up when the process doesn’t tell them to, which systems are more fragile than the architecture diagrams suggest, and whether the team’s instincts under pressure match its ambitions during planning season. I went into that December incident expecting to learn about the Stacks API’s bottlenecks; what I came away with was a map of how the team communicated, where operational assumptions hadn’t been tested, and what scaffolding we needed so the next incident wouldn’t depend on whoever happened to be online.
The technical fix took two days; the organizational changes we kicked off in the month since are already shaping how we plan for 2022.