Supply Chain Incident Forensics Playbook

A practical, hour-by-hour forensics playbook for responding to software supply chain incidents, from first alert through root cause and disclosure.

Shadab Khan
Senior Security Engineer

When the pager fires at 2:47 in the morning because a package your team ships has been flagged as malicious on npm, there is no time to invent process. The first thirty minutes will decide whether you spend the next week containing a targeted campaign or apologizing to customers for a week of exfiltration you missed. This playbook is the one I wish I had taped to my monitor the first time I worked a real supply chain incident.

Hour Zero: Triage Without Touching Evidence

The instinct when an alert arrives is to log into the registry, pull the package, and start reading source. Do not do that yet. You are about to become the first witness on a scene that may need to survive discovery, regulatory review, and a post-mortem that you will not enjoy if your evidence is contaminated.

Start a fresh incident channel. I label mine #ir-YYYYMMDD-shortname and pin three things: the initial alert, the assigned IR lead, and a link to the evidence bucket. Every command I run from this point forward is captured verbatim with script /tmp/ir-$(date +%s).log and mirrored into that channel. The channel becomes your timeline.
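
A minimal sketch of that setup; the paths and layout are my own conventions, not a standard:

# record every keystroke and its output for the incident timeline
script /tmp/ir-$(date +%s).log
# one local staging area that syncs to the evidence bucket at the end of each workstream
mkdir -p /evidence/{npm,logs,pcap}
printf 'IR lead: <name>\nInitial alert: <link>\n' > /evidence/README.txt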

Within the first ten minutes you need two decisions. First, is the affected artifact still being distributed? If yes, coordinate with the registry owner (npm security, PyPI admins, or your internal Artifactory team) to freeze the version. Second, do you have a cached copy of the artifact from before the alert? aws s3 cp s3://artifacts-prod/acme-widgets/1.4.2/ /evidence/1.4.2/ --recursive --no-progress gets that copy into evidence before anyone has the chance to overwrite history.
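
Hash what you captured the moment it lands so the copy is defensible later. A short sketch using the same /evidence layout:

# manifest of the pristine copy: file hashes plus a UTC capture timestamp
find /evidence/1.4.2 -type f -exec sha256sum {} + > /evidence/1.4.2.manifest
date -u +"captured %Y-%m-%dT%H:%M:%SZ by $(whoami)" >> /evidence/1.4.2.manifest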

Hour One to Four: Evidence Acquisition

With the artifact frozen, the forensic clock starts. I run three parallel workstreams and I do not let them blur into each other.

The first workstream is artifact capture. Pull every published version of the suspect package, not just the flagged one. Attackers frequently seed a clean version first to build reputation, then publish a compromised minor version, then publish a clean patch to muddy the waters. A quick loop does the job:

for v in $(npm view @acme/widgets versions --json | jq -r '.[]'); do
  npm pack @acme/widgets@$v --pack-destination /evidence/npm/
done
sha256sum /evidence/npm/*.tgz > /evidence/npm/hashes.txt
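
With every version on disk, the fastest route to the payload is usually diffing adjacent versions rather than reading the flagged one cold. A sketch, with illustrative version numbers:

# npm pack names scoped tarballs acme-widgets-<version>.tgz; contents live under package/
cd /evidence/npm
for t in *.tgz; do mkdir -p "${t%.tgz}" && tar -xzf "$t" -C "${t%.tgz}"; done
# diff the flagged version against its predecessor
diff -ru acme-widgets-1.4.1/package acme-widgets-1.4.2/package | less
# install hooks are the classic injection point
grep -En '"(pre|post)install"' acme-widgets-1.4.2/package/package.json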

The second workstream is registry telemetry. Most registries will give you publish logs if you ask through the right channel. For npm, running npm audit signatures against your installed lockfile verifies registry signatures and provenance attestations, which tells you quickly whether the flagged version was published the way you expect. For private registries, pull the raw access logs from your ingress layer. CloudFront logs, ALB logs, and the registry's own audit trail all go into the same S3 prefix with a README describing what each file is.
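
The exact query depends on where your registry sits, but the question is constant: who pulled the compromised version, and when. A sketch against ALB access logs; the bucket, prefix, and tarball path are illustrative:

# stage the log window around the suspicious publish, then sweep for the artifact path
aws s3 cp s3://registry-alb-logs/2024/03/20/ /evidence/logs/ --recursive
zgrep -h 'widgets-1.4.2.tgz' /evidence/logs/*.gz > /evidence/logs/downloads-1.4.2.txt
wc -l /evidence/logs/downloads-1.4.2.txt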

The third workstream is blast radius. Every hour your developers run npm install against unknown state is an hour of potential damage. I use Safeguard's dependency graph view plus a quick grep -r '"@acme/widgets"' /srv/repos/*/package-lock.json sweep across our mirrored repositories to produce a ranked list of downstream consumers. That list becomes the notification queue.
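
To turn the sweep into that queue, I rank repositories by how often their lockfile references the package, a rough proxy for how entangled they are. A sketch, assuming repositories are mirrored under /srv/repos:

# grep -c prints file:count; keep non-zero rows and sort by count, descending
grep -c '"@acme/widgets"' /srv/repos/*/package-lock.json \
  | awk -F: '$2 > 0' | sort -t: -k2 -rn > /evidence/blast-radius.txt
head /evidence/blast-radius.txt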

Hour Four to Twelve: Dynamic Analysis

By now you should have the artifact in an isolated environment. I use a disposable Firecracker microVM with no network egress except a tcpdump collector pointed at a sinkhole. The goal is to let the malicious code execute just enough to reveal its command and control pattern without letting it actually phone home.

# host side: disposable microVM; route tap0 to the sinkhole and drop all other egress
firectl --kernel ./vmlinux --root-drive ./rootfs.ext4 \
  --tap-device tap0/aa:fc:00:00:00:01
# inside the guest: install the suspect tarball and trace what it touches
npm install ./evidence-1.4.2.tgz
strace -f -e trace=network,openat -o /tmp/strace.log \
  node -e "require('@acme/widgets')"

What you are looking for: DNS resolutions to suspicious domains, writes to ~/.ssh/ or ~/.aws/, reads of environment variables with names like NPM_TOKEN or GITHUB_TOKEN, and any spawned child processes. Capture the strace output, the pcap, and any dropped files. A good analysis run should produce a timeline like "T+0.3s: reads ~/.npmrc, T+0.9s: HTTP POST to 185.220.101.44, T+1.2s: appends to .bash_profile."
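
Most of those indicators fall out of the trace and the capture with a couple of passes. A sketch over the files from the run above; the file names are illustrative:

# network syscalls and credential-file reads from the trace
grep -E 'connect\(|sendto\(' /tmp/strace.log
grep -E 'openat\(.*(\.npmrc|\.ssh/|\.aws/)' /tmp/strace.log
# DNS lookups the payload attempted, read back from the sinkhole capture
tcpdump -nn -r /evidence/pcap/run1.pcap port 53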

Hour Twelve to Forty-Eight: Root Cause and Scope

This is the phase where most investigations go wrong, because the urgency of containment has ebbed and people want to declare victory. Resist. The root cause question is not "what was in the package?" It is "how did it get there?" Those are very different investigations.

To answer the how, you need the publish path. Who owns the package? What credentials were used to publish? When were those credentials last rotated? Was MFA enforced? If the answer involves a personal access token that lived in a developer's ~/.npmrc for three years, the scope of your investigation just expanded to that developer's entire laptop history.
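
The registry can answer some of this directly if you have maintainer access; deeper publish history and token-usage data has to come from the registry's security team. A starting point for npm:

# who can publish the package right now
npm owner ls @acme/widgets
# automation tokens attached to the account you are logged in as
npm token list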

I use a simple matrix to track this: each access path (publish token, CI token, maintainer account, signing key) gets a row. Columns are "last rotated," "MFA enforced," "who has custody," and "known compromised." You finish the forensics phase when every row has a defensible answer.
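
A filled-in example makes the format obvious; the entries below are illustrative, not findings from a real incident:

access path          last rotated  MFA enforced  custody          known compromised
publish token        2021-06-14    no            alice ~/.npmrc   yes, revoked at T+3h
CI publish token     2024-02-01    n/a (OIDC)    GitHub Actions   no evidence
maintainer account   2023-11-02    yes           alice            under review
signing key          never         n/a           offline HSM      no evidence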

Day Three Onward: Attribution Caution and Disclosure

Attribution is seductive and rarely helpful in the first week. Write down what you see, not what you conclude. "Command and control domain resolves to ASN 12345, registered 2024-03-20, with a TLS certificate also used by packages X and Y" is a fact. "Nation-state actor" is a headline risk.

Your disclosure should walk customers through three things: what they need to do in the next hour, what they need to do in the next week, and what you are doing differently. Mismatched timelines here are how vendors lose trust.

How Safeguard Helps

Safeguard collapses the first four hours of a supply chain forensics engagement into a workflow that anyone on your team can run. The platform maintains a continuously updated dependency graph across every repository, container, and deployment target, so the blast radius question becomes a single query instead of an afternoon of grep. Evidence capture hooks publish artifact hashes, SBOM deltas, and registry telemetry into immutable storage the moment a package is flagged. When you need to answer "who is still running the compromised version right now," Safeguard shows you the running asset inventory with provenance intact, which turns day-three panic into a checklist you can work through calmly.
