The first four hours of a supply chain incident determine the next four weeks. When the xz-utils backdoor was disclosed, the teams that had a playbook on file had fixes in production by hour 72, while the teams without one were still arguing in Slack at hour 36 about who owned the Docker base image. We wrote and rehearsed the playbook below across two tabletop exercises and one live event (a compromised transitive dependency in the data pipeline, fortunately low blast radius), and it held up. It is written for organizations of 200 to 2,000 engineers with a regular on-call rotation, and it assumes you already have an incident commander framework in place. What it adds is the supply-chain-specific wrinkle: the compromise is often upstream of you, which means communications, legal, and procurement are in the room from hour one, not day three. This playbook is deliberately opinionated about timing; adapt the phone numbers, keep the clocks.
Who gets paged in the first 15 minutes?
The page fans out to five named roles within 15 minutes of confirmed compromise: incident commander from the Detection & Response on-call, AppSec lead, Platform Security lead, a product engineering director with prod deploy authority, and the communications on-call. The CISO is paged at minute 30 with a written situation report, not before, to avoid pulling them in without context.
Legal counsel is paged only when the compromise implicates data disclosure obligations, typically within 60 minutes if customer data is suspected to be in scope. Procurement is paged when the compromised component is a paid vendor product, because contract and SLA enforcement moves faster with them in the chain from the start. The single most damaging delay we have seen is waiting until hour six to loop in Legal; by then the 72-hour GDPR notification clock has already been running for six hours without Legal in the room.
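To make the fan-out concrete, here is a minimal sketch of the escalation matrix as data, so the order lives in version control rather than in someone's head. The roles and timings mirror the paragraphs above; the field names and conditions are illustrative and not tied to any particular paging tool.

```python
# Hour-one page fan-out as data. Roles and deadlines come from the playbook;
# "condition" is evaluated by the incident commander at page time.
PAGE_FANOUT = [
    {"role": "incident_commander",     "page_by_min": 15, "condition": "always"},
    {"role": "appsec_lead",            "page_by_min": 15, "condition": "always"},
    {"role": "platform_security_lead", "page_by_min": 15, "condition": "always"},
    {"role": "product_eng_director",   "page_by_min": 15, "condition": "always"},
    {"role": "communications_oncall",  "page_by_min": 15, "condition": "always"},
    {"role": "ciso",                   "page_by_min": 30, "condition": "always, with written sitrep"},
    {"role": "legal_counsel",          "page_by_min": 60, "condition": "customer data suspected in scope"},
    {"role": "procurement",            "page_by_min": 15, "condition": "compromised component is a paid vendor product"},
]
```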
What does the first hour of triage look like?
The first hour is identification, scope, and containment decision, in that order. Minutes 0-15: incident commander confirms the compromise against the advisory source (CISA, GitHub Security Advisory, or vendor notification) and opens the incident channel with a published template. Minutes 15-30: AppSec runs the portfolio query to find every asset referencing the compromised component and its version range. Minutes 30-45: Platform Security identifies which of those are in production, exposed to the internet, or holding customer data. Minutes 45-60: the commander convenes the containment call and decides among three paths: hot patch, take offline, or accept risk with compensating control.
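One way we keep the first hour honest is to write it down as a checklist with a single owner and a single exit criterion per block. A sketch of ours, with illustrative field names:

```python
# First-hour triage checklist: one owner and one exit criterion per 15-minute
# block. Timings and owners follow the playbook; field names are illustrative.
FIRST_HOUR = [
    {"minutes": "0-15",  "owner": "incident_commander",
     "exit": "compromise confirmed against advisory, incident channel opened from template"},
    {"minutes": "15-30", "owner": "appsec_lead",
     "exit": "portfolio query complete: every asset on the affected version range listed"},
    {"minutes": "30-45", "owner": "platform_security_lead",
     "exit": "assets flagged as production, internet-exposed, or holding customer data"},
    {"minutes": "45-60", "owner": "incident_commander",
     "exit": "containment path chosen: hot patch, take offline, or compensating control"},
]
```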
The query to identify affected assets should be muscle memory. In our case, a single search across the SBOM catalog by package and version range returns the full blast radius in under 90 seconds, attributes it to product owners, and files draft Jira tickets. If your team cannot answer "which services use package X at version Y" in under five minutes, that is the gap to close before the next incident.
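If you do not have that query today, here is a minimal sketch of it, assuming the SBOM catalog is a directory of CycloneDX JSON files, one per service, and that the vulnerable range can be written as a PEP 440 specifier. The file layout, function name, and range syntax are assumptions for illustration; the point is that the query is boring and should already exist.

```python
import json
from pathlib import Path
from packaging.specifiers import SpecifierSet
from packaging.version import InvalidVersion, Version

def affected_services(sbom_dir: str, package: str, vulnerable_range: str) -> list[str]:
    """Return services whose SBOM lists `package` at a version inside
    `vulnerable_range` (a PEP 440 specifier such as '>=5.6.0,<5.6.2')."""
    spec = SpecifierSet(vulnerable_range)
    hits = []
    for sbom_path in Path(sbom_dir).glob("*.json"):          # one CycloneDX SBOM per service
        for comp in json.loads(sbom_path.read_text()).get("components", []):
            if comp.get("name") != package:
                continue
            try:
                if Version(comp.get("version", "")) in spec:
                    hits.append(sbom_path.stem)
            except InvalidVersion:
                # Unparseable versions get flagged for manual review, not skipped.
                hits.append(f"{sbom_path.stem} (version unparseable, review manually)")
            break                                             # one verdict per service
    return hits

# Example: affected_services("sboms/", "xz-utils", ">=5.6.0,<5.6.2")
```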
How do stakeholder communications flow during the event?
Communications flow on three parallel tracks: internal engineering, executive, and external. Internal engineering gets a Slack update every hour in the incident channel, written by the scribe, following a fixed template of Status, Scope, Actions Taken, Next Update. Executive updates go to a pre-defined email list including the CEO, CFO, General Counsel, CISO, and CTO every two hours for the first eight hours, then every four hours until resolution.
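The hourly update is deliberately terse. One illustrative version of the template, with the four fields above and nothing else:

```python
# Scribe's hourly update, posted to the incident channel. Placeholders are
# filled from the incident record; the wording is illustrative.
HOURLY_UPDATE = """\
INC-{incident_id} update, {timestamp_utc} UTC
Status: {status}
Scope: {affected_count} services affected, {production_count} in production
Actions taken: {actions_taken}
Next update: {next_update_utc} UTC
"""
```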
External communications are authored by the communications on-call in partnership with Legal, and nothing goes external without both signoffs. For regulated industries, the 72-hour GDPR breach notification clock, the SEC's four-business-day material cybersecurity incident disclosure window, and sector-specific obligations such as HIPAA's 60-day breach notification all start from confirmation (for the SEC, from the materiality determination), not from resolution. Track these deadlines in the incident record with explicit UTC timestamps.
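Computing those deadlines should not be a back-of-the-napkin exercise at hour 70. A minimal sketch, assuming the confirmation time is recorded in UTC at incident open; note that the SEC clock runs in business days and the example timestamp is illustrative.

```python
from datetime import datetime, timedelta, timezone

def business_days_after(start: datetime, days: int) -> datetime:
    """Add N business days (Mon-Fri); holidays are not handled in this sketch."""
    current, remaining = start, days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:
            remaining -= 1
    return current

confirmed_utc = datetime(2024, 3, 29, 14, 5, tzinfo=timezone.utc)   # example timestamp
deadlines = {
    "GDPR supervisory authority notification": confirmed_utc + timedelta(hours=72),
    "SEC Form 8-K (if material)":              business_days_after(confirmed_utc, 4),
    "HIPAA breach notification (if PHI)":      confirmed_utc + timedelta(days=60),
}
for obligation, due in sorted(deadlines.items(), key=lambda kv: kv[1]):
    print(f"{due.isoformat()}  {obligation}")
```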
What are the containment options and when do you choose each?
Three containment patterns cover 90% of cases. Hot patch is the default for library-level compromises with a known fixed version: update the dependency, rebuild, and redeploy through the standard pipeline with manual approval gates active. Typical timelines are 4-12 hours for a monorepo and 12-36 hours across a polyrepo estate. Take-offline is chosen when the service is internet-exposed, the exploit is active in the wild, and the fix is not yet available; accept the downtime cost if it is less than the expected breach cost, with roughly $50,000 per hour of downtime as a working figure for most SaaS companies.
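The take-offline decision is back-of-envelope arithmetic, and it is worth doing explicitly rather than by gut feel. A sketch with illustrative numbers, not recommendations:

```python
# All inputs are illustrative estimates made on the containment call.
downtime_cost_per_hour = 50_000          # working figure for most SaaS companies
expected_hours_offline = 8               # estimate until the fix or mitigation lands
prob_exploited_if_left_up = 0.3          # judgment call, given active exploitation
expected_breach_cost = 2_000_000         # IR, legal, notification, churn

downtime_cost = downtime_cost_per_hour * expected_hours_offline           # 400,000
expected_cost_left_up = prob_exploited_if_left_up * expected_breach_cost  # 600,000
take_offline = downtime_cost < expected_cost_left_up                      # True here
```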
Compensating controls (WAF rule, network segmentation, feature flag disable) are chosen when hot patching is blocked by vendor release cadence or when the service is internal with limited blast radius. Document the control, set a sunset date, and create a Jira ticket to remove it within 30 days or escalate. Never leave a compensating control in place past 90 days without CISO signoff.
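The sunset date only matters if something checks it. A small sketch of the periodic check, assuming compensating controls are recorded with an install date and a CISO-signoff flag:

```python
from datetime import date

# Illustrative control records; in practice these come from the incident tracker.
controls = [
    {"id": "WAF-rule-4821", "installed": date(2024, 4, 2), "ciso_signoff": False},
]
today = date.today()
for control in controls:
    age_days = (today - control["installed"]).days
    if age_days > 90 and not control["ciso_signoff"]:
        print(f"{control['id']}: {age_days} days old, past 90-day limit without CISO signoff")
    elif age_days > 30:
        print(f"{control['id']}: {age_days} days old, past 30-day removal target")
```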
When is the incident officially closed?
An incident closes when four conditions are met: all affected assets are patched or mitigated in production, the forensics review has confirmed no evidence of exploitation or data exfiltration, the root-cause write-up is drafted, and the post-incident review meeting is scheduled within 10 business days. The commander declares closure in the incident channel with a closing timestamp and a link to the RCA document.
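The four conditions are easy to encode as a closure gate so nobody has to argue about whether "drafted" counts; the field names below are illustrative.

```python
# Closure gate: every condition must be true before the commander declares closure.
closure_conditions = {
    "all_assets_patched_or_mitigated": True,
    "forensics_confirms_no_exploitation": True,
    "rca_drafted": True,
    "post_incident_review_scheduled": False,
}
can_close = all(closure_conditions.values())
blocking = [name for name, met in closure_conditions.items() if not met]
```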
Do not close an incident with unresolved assets. We once closed at hour 96 with 2% of assets still on the vulnerable version under "accepted risk with compensating control"; a customer audit three months later surfaced them, and explaining the gap cost a quarter of a TPRM analyst's time. Close clean or keep it open.
What does the 30-day follow-up look like?
The 30-day post-incident review covers four artifacts: the detailed timeline, the RCA, the action items with owners and dates, and the playbook delta. The RCA is a blameless Five Whys walk, published internally within 10 business days. Action items are tracked in a dedicated Jira epic with a monthly review until closed; we tolerate a 45-day median close time but escalate anything open at 90 days.
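Tracking the thresholds is a one-liner once open dates are exported from the epic; a sketch, with hypothetical ticket keys:

```python
from datetime import date

# Illustrative export of still-open action items from the dedicated Jira epic.
open_since = {"SUPPLY-101": date(2024, 5, 2), "SUPPLY-102": date(2024, 6, 20)}
today = date.today()
ages = {key: (today - opened).days for key, opened in open_since.items()}
escalate = [key for key, age in ages.items() if age >= 90]   # 90-day escalation rule
```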
The most useful output of the review is usually the playbook delta: what did this incident teach us that the playbook should now say? In our last event, the delta was adding procurement to the hour-one page list, and the next tabletop ran two hours faster as a result. Playbooks are living documents; if yours has not changed in 12 months, it is wrong.
How Safeguard Helps
Safeguard collapses the hour-one portfolio query into a single search: reachability-aware SBOM queries powered by Griffin AI identify every asset using the compromised package and version range, filter to production-reachable paths, and auto-assign Jira tickets to the correct product owners in under two minutes. The compromised packages feed ingests GitHub Security Advisories, CISA alerts, and major vendor notifications, so the incident commander gets a pre-populated scope report instead of writing queries from scratch. TPRM surfaces the vendor angle automatically, including the relevant contract terms and SOC 2 status, and policy gates prevent the compromised version from being re-introduced during the post-incident rush. Executive status exports generate the two-hour update email directly from the incident record.