A supply chain incident is not a single event. It is a slow-motion collision between a third-party advisory, a build pipeline, an asset inventory that nobody trusts, and a Slack channel where everyone is asking the same five questions. The teams that handle these incidents well do not have better instincts. They have a runbook that turns instincts into steps and steps into evidence.
This article walks through a runbook structure that has held up across incidents ranging from a malicious npm package to a compromised CI plugin. The phases are deliberately boring. Boring is the goal. Boring means you can hand the runbook to whoever is on call at 3am and they will not be guessing.
The five phases
A supply chain incident response runbook should have five phases, in this order: triage, scope, contain, remediate, and learn. Each phase has an entry condition, a deliverable, and an exit condition. If you cannot point to the deliverable, you have not finished the phase.
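One way to make the entry condition, deliverable, and exit condition concrete is to write the runbook skeleton down as data. The sketch below is illustrative, not prescribed by any tool; the specific exit conditions are assumptions drawn from the deliverables described later in this article.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    entry_condition: str   # what must be true before the phase starts
    deliverable: str       # the artifact that proves the phase finished
    exit_condition: str    # what must be true before handing off

RUNBOOK = [
    Phase("triage", "advisory or signal received",
          "open-or-decline decision with severity", "incident opened or declined"),
    Phase("scope", "incident opened",
          "scope document: products, projects, environments", "scope lists cross-checked"),
    Phase("contain", "scope cross-checked",
          "active block or policy gate", "no new build can ship the component"),
    Phase("remediate", "containment in place",
          "merged fixes plus verified assets", "all three confirmations recorded"),
    Phase("learn", "remediation verified",
          "timeline, contributing factors, runbook diff", "runbook diff merged"),
]
```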
The mistake most teams make is jumping from triage to remediate. They see a CVE, they grep for the package name, they file a ticket, and they call it done. Then a month later they find the same package in a forgotten container image and realize they never had scope at all.
Phase one: triage
Triage answers a single question. Is this worth opening an incident for? You need three pieces of data to answer it. First, the advisory or signal source. Second, a rough sense of whether the affected component exists in your environment. Third, the severity if it does.
The trap in triage is that you do not yet know whether the component exists. You only know whether it might. The runbook should say: if the component is in your top ecosystems and the advisory is high or critical, open the incident and move to scope. Do not wait for confirmation. Confirmation is the next phase.
Tooling matters here because triage is where teams burn the most calendar time. If your inventory is a spreadsheet, triage takes hours. If your inventory is queryable, triage takes minutes. Use safeguard_search_components with the advisory's package coordinates to get a yes-or-no answer in seconds. If the answer is yes, the incident is open.
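As a concrete sketch, assuming a Python client that exposes the tool as a plain function; the safeguard module, the argument names, and the response shape are illustrative, not a documented API:

```python
# Hypothetical client import; module name and signature are assumptions.
from safeguard import search_components

def should_open_incident(ecosystem: str, package: str, severity: str) -> bool:
    """Triage: open on any inventory match plus high/critical severity."""
    hits = search_components(ecosystem=ecosystem, name=package)
    # Confirming exact affected versions belongs to the scope phase;
    # triage only needs a yes-or-no on presence.
    return bool(hits) and severity.lower() in {"high", "critical"}

if should_open_incident("npm", "event-stream", "critical"):
    print("Incident open: move to scope")
```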
Phase two: scope
Scope answers two questions. Where does the affected component exist, and what does it touch? The deliverable is a scope document with three lists: a list of products, a list of projects within those products, and a list of environments where deployed artifacts contain the component.
The first list comes from your component graph. The second list comes from your project-to-product mapping. The third list comes from your runtime telemetry, which is the part most programs get wrong because they treat build-time SBOMs as if they were deployment manifests. They are not. A component in your repo may not be in your running container, and a component in your container may not be in your running pod.
Use safeguard_get_component_projects to expand from a component to every project that includes it, then safeguard_list_assets filtered by the component coordinates to find runtime instances. Cross-check the two lists. The intersection is your real scope. Components that appear in projects but not assets are likely candidates for safe upgrade. Components that appear in assets but not projects are evidence of drift and should be flagged in the lessons-learned phase.
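In code, the cross-check is two queries and three set operations. A sketch under the same assumptions as above; the function names mirror the tools, but the signatures and fields like id and project_id are invented for illustration:

```python
from safeguard import get_component_projects, list_assets

def build_scope(ecosystem: str, package: str, version: str) -> dict:
    projects = {p["id"] for p in
                get_component_projects(ecosystem=ecosystem, name=package,
                                       version=version)}
    deployed = {a["project_id"] for a in
                list_assets(component=f"{ecosystem}:{package}@{version}")}
    return {
        "real_scope": projects & deployed,    # built and running: contain and remediate
        "safe_upgrade": projects - deployed,  # in repos only: upgrade calmly
        "drift": deployed - projects,         # running but unknown to builds: learn phase
    }
```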
Phase three: contain
Contain answers a different question. What can we stop right now to prevent the situation from getting worse? Containment is not remediation. Containment is a tourniquet. It might be ugly, it might cost performance, it might break a feature flag in staging, but it stops the bleeding.
For supply chain incidents, containment usually takes one of three shapes. You can block the affected component at the registry proxy, so no new build pulls it. You can add a policy gate that fails any pipeline trying to ship the component. Or you can disable the feature path in production that uses the component, if the component is optional.
Block-at-the-proxy is the most powerful option but the most disruptive. Use it when the advisory is critical and you have evidence of active exploitation. Policy gates are the right default for everything else. Create a gate with safeguard_create_policy_gate and link a policy that fails the build on the affected coordinates. Communicate the block so engineering teams do not waste time debugging mysterious failures.
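A gate creation call might look like the following; the parameter names and policy shape are assumptions, so treat it as a sketch of intent rather than the tool's contract:

```python
from safeguard import create_policy_gate

# Containment, not remediation: fail any pipeline that tries to ship
# the affected coordinates, and say so loudly in the failure message.
gate = create_policy_gate(
    name="incident-block-event-stream",
    policy={
        "fail_on": {"ecosystem": "npm", "name": "event-stream",
                    "versions": "<4.0.0"},
        "message": "Blocked by an active incident; check the response channel before retrying",
    },
)
print(f"Gate {gate['id']} is live; announce it in the response channel")
```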
Phase four: remediate
Remediate answers the question of what permanently fixes this. Remediation is not the patch. Remediation is the patch plus the verification plus the asset update plus the closure of the original ticket. Half-finished remediation is the most common cause of repeat incidents.
For each affected project, the remediation work has three parts. First, identify the upgrade path. Second, apply it through your normal pull request process. Third, verify the deployed asset no longer contains the affected version. Use safeguard_get_remediation_plan to generate the upgrade path automatically. Use safeguard_fix_vulnerability to open the pull request. Use safeguard_get_asset after the next deployment to confirm the new asset no longer reports the finding.
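Split as two steps, because the third confirmation can only run after the next deployment. A sketch, with the same caveat that these signatures and response fields are assumed:

```python
from safeguard import get_remediation_plan, fix_vulnerability, get_asset

def open_fix(project_id: str, finding_id: str) -> str:
    """Parts one and two: generate the upgrade path and open the PR."""
    plan = get_remediation_plan(project_id=project_id, finding_id=finding_id)
    pr = fix_vulnerability(project_id=project_id, plan_id=plan["id"])
    return pr["url"]  # merges through the normal review process

def verify_fix(project_id: str, finding_id: str) -> bool:
    """Part three: run after the next deployment, not after the merge."""
    asset = get_asset(project_id=project_id)
    return all(f["id"] != finding_id for f in asset.get("findings", []))
```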
The runbook should require all three confirmations before the project is marked remediated. Skipping the third confirmation is how teams find themselves three weeks later with the same CVE in production because someone reverted the merge during a hotfix.
Phase five: learn
Learn is the phase that gets cut when calendars are tight. Do not cut it. The learn phase produces three artifacts: a timeline, a list of contributing factors, and a list of changes to the runbook.
The timeline should be in machine-readable form so you can compute mean time to detect, scope, contain, and remediate over your incident history. The contributing factors should be specific: not "we did not have visibility" but "we did not scan the image registry, so containers built outside CI were invisible." The runbook changes should be small and pointed: rename a phase, add a check, replace a manual step with an automated one.
Pull the timeline from safeguard_get_audit_logs filtered to the incident time window. Most of your runbook actions leave audit entries, and that is your timeline backbone. Add the human decisions on top, and you have a defensible postmortem with no narrative gaps.
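A sketch of that backbone, assuming each audit entry carries a timestamp and an action that can be mapped to the phase it marks the start of; both fields, and the mapping itself, are illustrative:

```python
from datetime import datetime
from safeguard import get_audit_logs

# Which audit action signals the start of which phase; illustrative only.
PHASE_OF = {"search_components": "triage", "list_assets": "scope",
            "create_policy_gate": "contain", "fix_vulnerability": "remediate"}

def phase_starts(window_start: str, window_end: str) -> dict:
    entries = get_audit_logs(start=window_start, end=window_end)
    starts = {}
    for e in sorted(entries, key=lambda e: e["timestamp"]):
        phase = PHASE_OF.get(e["action"])
        if phase and phase not in starts:
            starts[phase] = datetime.fromisoformat(e["timestamp"])
    return starts  # mean time to scope, contain, and remediate falls out of these
```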
Making the runbook live
A runbook is only useful if people use it. Three habits keep it alive. First, every incident references the runbook explicitly in the response channel, with the current phase named. Second, every incident closes with a runbook diff: what was added, what was removed, what was clarified. Third, the runbook is reviewed quarterly even when no incidents have happened, because the absence of incidents is also data.
The runbook is not a document. It is a control. Treat it like one.