Open Source Security

An npm Incident Response Playbook

When an npm package in your dependency graph is compromised at midnight, you need a playbook, not a brainstorm. Here is the one I wrote after three real incidents.

Nayan Dey
Senior Security Engineer
7 min read

The playbook in this post started as a single-page document I drafted at 3 AM during the event-stream incident in late 2018. It grew through the ua-parser-js takeover in October 2021 and the node-ipc protestware in March 2022, and it took its current shape after the xz-utils panic in March 2024, which, while not npm, forced me to re-examine every assumption in my package-trust model. The underlying pattern across all of these is the same: a package you depend on is now untrustworthy, you have minutes to hours to decide what to do, and the cost of getting it wrong is real.

This post is the playbook I give to engineering teams. It is not abstract; it is the sequence I follow when a real incident hits my own fleet.

Phase 1: Triage (First 30 Minutes)

The first question is: is this real? Not every tweet about an npm compromise is an incident; some are false alarms, some are exaggerations, some are real but affect a version nobody uses.

I confirm realness by checking three sources. First, the GitHub Advisory Database, for a formal advisory. Second, the package's issue tracker and maintainer communications. Third, any CERT bulletin or assigned CVE. If the advisory is not yet in the GitHub Advisory Database but multiple credible sources are reporting the compromise, treat it as real and move on.
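
If you want that first check scripted, the GitHub Advisory Database is queryable through GitHub's global security advisories endpoint. A minimal sketch in TypeScript (Node 18+ for built-in fetch); the package name is a placeholder, and the query parameters reflect my reading of the API, so verify them against the current docs:

    // Ask the GitHub Advisory Database for npm advisories affecting a package.
    const pkg = "example-package"; // placeholder for the package under investigation

    const res = await fetch(
      `https://api.github.com/advisories?ecosystem=npm&affects=${encodeURIComponent(pkg)}`,
      { headers: { Accept: "application/vnd.github+json" } },
    );
    const advisories: { ghsa_id: string; summary: string }[] = await res.json();

    if (advisories.length === 0) {
      console.log("No formal advisory yet; weigh maintainer comms and CERT/CVE feeds.");
    } else {
      for (const a of advisories) console.log(a.ghsa_id, a.summary);
    }

An empty result is not an all-clear; advisories often lag the first credible reports by hours.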

The second question: is it in our graph? Run a recursive dependency search against every lockfile in the organization. For a small org with ten services this is a minute of work. For a fleet with hundreds of repos, you need this pre-built: a searchable index of every package-version tuple across every lockfile. Safeguard maintains this automatically; without it, you need a nightly job that extracts from lockfiles into a database.
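
Here is a minimal sketch of that extraction in TypeScript, assuming a directory of repo checkouts and npm lockfile v2/v3 (the packages map); the root path and package name are placeholders:

    import { readFileSync, readdirSync, statSync } from "node:fs";
    import { join } from "node:path";

    const SUSPECT = "example-package"; // placeholder

    // Walk every checkout and yield each package-lock.json found.
    function* findLockfiles(dir: string): Generator<string> {
      for (const entry of readdirSync(dir)) {
        if (entry === "node_modules" || entry === ".git") continue;
        const full = join(dir, entry);
        if (statSync(full).isDirectory()) yield* findLockfiles(full);
        else if (entry === "package-lock.json") yield full;
      }
    }

    for (const lockfile of findLockfiles("/srv/repos")) {
      const lock = JSON.parse(readFileSync(lockfile, "utf8"));
      // Lockfile v2/v3 keys entries by install path, e.g. "node_modules/example-package".
      for (const [path, meta] of Object.entries<{ version?: string }>(lock.packages ?? {})) {
        if (path.split("node_modules/").pop() === SUSPECT) {
          console.log(`${lockfile}: ${SUSPECT}@${meta.version}`);
        }
      }
    }

The same loop, writing to a database instead of stdout, is the nightly job.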

If the package is not in your graph, close the incident with a note and move on. If it is, continue.

The third question: which versions are affected and which versions do we have? An advisory saying "versions 1.2.3 through 1.3.7 are compromised" gives you a specific matching problem. Extract the version of the package at every occurrence in every lockfile, cross-reference against the vulnerable range, and produce a list of repos that have vulnerable versions.
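
The cross-reference itself is a one-liner with the semver package's satisfies(). A sketch, using the range from that hypothetical advisory and the scan output represented as a plain map:

    import { satisfies } from "semver"; // npm install semver

    const VULNERABLE_RANGE = ">=1.2.3 <=1.3.7"; // straight from the advisory

    // repo -> versions of the package found in that repo's lockfiles
    const found: Record<string, string[]> = {
      "payments-api": ["1.3.0"],
      "web-frontend": ["1.1.9", "1.4.1"],
    };

    const affected = Object.entries(found)
      .filter(([, versions]) => versions.some((v) => satisfies(v, VULNERABLE_RANGE)))
      .map(([repo]) => repo);

    console.log("Repos with vulnerable versions:", affected); // ["payments-api"]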

Phase 2: Containment (30 Minutes To 4 Hours)

Containment depends on whether the compromise is active (malicious code is running in your environment) or passive (a malicious version exists but you have not installed it yet).

For a passive risk, containment is preventing future installs. Options:

  • Add the vulnerable version range to a block list in your proxy registry (exclusion patterns in Artifactory, routing rules in Nexus, package access rules in Verdaccio).
  • Push a commit to every affected repo that pins away from the vulnerable version via overrides (npm), resolutions (Yarn), or pnpm.overrides (pnpm).
  • Update CI to fail if the vulnerable version appears in the lockfile; a sketch of that gate follows this list.
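
The CI gate is the cheapest of the three and worth having even alongside a registry block. A sketch, again assuming lockfile v2/v3 and placeholder advisory details:

    import { readFileSync } from "node:fs";
    import { satisfies } from "semver";

    const SUSPECT = "example-package";          // placeholder
    const VULNERABLE_RANGE = ">=1.2.3 <=1.3.7"; // placeholder

    const lock = JSON.parse(readFileSync("package-lock.json", "utf8"));
    for (const [path, meta] of Object.entries<{ version?: string }>(lock.packages ?? {})) {
      if (
        path.split("node_modules/").pop() === SUSPECT &&
        meta.version &&
        satisfies(meta.version, VULNERABLE_RANGE)
      ) {
        console.error(`Blocked: ${SUSPECT}@${meta.version} in lockfile at "${path}"`);
        process.exit(1); // fail the build
      }
    }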

For an active risk, where CI or production has already installed the vulnerable version, you are in a different scenario. You need to stop the running code, clear caches, and remediate. Stopping the running code is straightforward: kill the deployment. Clearing caches is where it gets hard.

The npm cache lives in ~/.npm/_cacache/. GitHub Actions caches it per cache key. Docker images bake it into layers. Kubernetes pods carry it in their containers' writable layers. Every place you cached a compromised tarball is a place you will re-install it from if you are not careful.

The command npm cache clean --force wipes the local cache but does not touch remote caches. For GitHub Actions, delete the cache keys explicitly, via the Actions API or the repository's cache management UI. For Docker, rebuild affected images from scratch with --no-cache. For Kubernetes, delete affected pods and confirm the new pods pull fresh images.
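
For the Actions step, the REST API exposes delete-by-key for repository caches. A hedged sketch; owner, repo, and key are placeholders, and the token needs permission to manage Actions caches:

    // DELETE /repos/{owner}/{repo}/actions/caches?key=... removes every cache matching the key.
    const owner = "acme-corp";   // placeholder
    const repo = "payments-api"; // placeholder
    const key = "npm-cache-v1";  // the cache key your workflow used

    const res = await fetch(
      `https://api.github.com/repos/${owner}/${repo}/actions/caches?key=${encodeURIComponent(key)}`,
      {
        method: "DELETE",
        headers: {
          Accept: "application/vnd.github+json",
          Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
        },
      },
    );
    console.log(res.ok ? `Deleted caches for key ${key}` : `Failed: ${res.status}`);

Run it for every affected repo; a forgotten cache key is a re-infection path.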

The Lockfile Correction

Once containment is in place, you need to correct every lockfile. The mechanical move is:

  1. Pin away from the vulnerable version in package.json (or via overrides).
  2. Run npm install to regenerate the lockfile.
  3. Verify the lockfile no longer contains a resolution URL for the vulnerable version.
  4. Commit.

For a single repo this is ten minutes. For a fleet, automate it: a scripted PR across every affected repo that runs steps 1 and 2 and opens the PR with a Dependabot-style body describing the incident. Tag every PR with a unified incident tag so you can track closure rate.
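
A sketch of the per-repo step that script runs inside each affected checkout; the package name, range, and pinned version are placeholders, and the PR plumbing is omitted:

    import { readFileSync, writeFileSync } from "node:fs";
    import { execSync } from "node:child_process";
    import { satisfies } from "semver";

    const SUSPECT = "example-package";
    const VULNERABLE_RANGE = ">=1.2.3 <=1.3.7";
    const SAFE_PIN = "^1.4.0"; // first known-good version per the advisory

    // Step 1: pin away from the vulnerable version via an override.
    const pkg = JSON.parse(readFileSync("package.json", "utf8"));
    pkg.overrides = { ...pkg.overrides, [SUSPECT]: SAFE_PIN };
    writeFileSync("package.json", JSON.stringify(pkg, null, 2) + "\n");

    // Step 2: regenerate the lockfile.
    execSync("npm install", { stdio: "inherit" });

    // Step 3: verify no vulnerable resolution survived.
    const lock = JSON.parse(readFileSync("package-lock.json", "utf8"));
    for (const [path, meta] of Object.entries<{ version?: string }>(lock.packages ?? {})) {
      if (path.split("node_modules/").pop() === SUSPECT && meta.version &&
          satisfies(meta.version, VULNERABLE_RANGE)) {
        throw new Error(`Vulnerable resolution still present at "${path}"`);
      }
    }
    // Step 4: commit and open the PR, tagged with the incident id.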

Phase 3: Investigation (2 Hours To 2 Days)

Investigation overlaps with containment. Once you have stopped the bleeding, you need to understand the damage.

The three questions I always answer.

First: what did the malicious code do? Read the compromised version's source. For the ua-parser-js incident in 2021, the injected code dropped a crypto miner on both Windows and Linux and, on Windows, a password stealer. For event-stream, the injected code was targeted specifically at users of a particular Bitcoin wallet library. Understanding the payload tells you what to look for in your own logs.
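
One practical note: read the published artifact, not the GitHub source; the two can differ, and the tarball is what actually ran. Fetching it is two commands, wrapped here in the same TypeScript register as the other sketches (package and version are placeholders; do this on an isolated machine and never npm install the compromised version):

    import { execSync } from "node:child_process";

    // `npm pack <name>@<version>` downloads the exact tarball from the registry.
    execSync("npm pack example-package@1.3.0", { stdio: "inherit" });
    execSync("tar -xzf example-package-1.3.0.tgz", { stdio: "inherit" });
    // Diff the extracted package/ directory against the repo tag to isolate the injection.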

Second: did it run in a context where it could have done harm? For environment-variable exfiltration, look for outbound network connections to unexpected destinations from your CI runners during the window when the compromised version was in use. For file-modification payloads, look for filesystem integrity alarms during the same window.

Third: if it ran, what did it access? The npm install happens with whatever credentials the running process has. CI runners with broad AWS credentials, access to Vault, or access to your source repos would have leaked those credentials if the payload was interested in them. Assume the worst and rotate.

Phase 4: Credential Rotation (If Needed)

If there is any plausible path from the compromised install to credential exposure, rotate. The rotation sequence is:

  1. Identify every credential that was in the process environment or on the filesystem during the window of compromise.
  2. For each, revoke and reissue, with zero-overlap rotation (revoke first, deploy second, accept downtime).
  3. Review access logs for each credential for the compromise window plus a reasonable buffer (I use 14 days on either side).

Credentials to consider: npm tokens, GitHub tokens, AWS keys, Vault tokens, Sentry DSNs, any SaaS API key the CI job had access to. For npm tokens specifically, if the compromised install ran on a CI runner with your publish token in env, your publish token is compromised, full stop. Rotate.
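
For the npm-token case, the rotation itself is three CLI calls; a sketch (the token id is a placeholder you would read off the list output):

    import { execSync } from "node:child_process";

    // List tokens, revoke the exposed one, mint a replacement.
    execSync("npm token list", { stdio: "inherit" });
    execSync("npm token revoke a1b2c3d4", { stdio: "inherit" }); // placeholder id
    execSync("npm token create", { stdio: "inherit" });          // prompts for password/OTP

Then update the CI secret that held the old token; rotation is not done until the consumer is updated too.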

Phase 5: Post-Incident (Within 2 Weeks)

Write the postmortem. I include, specifically: the timeline, the detection path (how we learned about it), the containment actions, the remediation actions, and the preventive changes. I do not include blame on a specific engineer or a specific maintainer; the system failed, and that is always the right framing.

The preventive changes are the most important part. After ua-parser-js I added a block on any npm package that had a maintainer change within the last seven days. After node-ipc I added egress-network controls to our CI runners. After xz-utils I increased the weight of "social-engineering signal" in the package scoring we use for triage.

A Note On Trust Restoration

After an incident with a package, the question of whether to keep using that package is a judgment call. If the maintainer acted promptly and transparently, I usually stay. If the compromise revealed structural problems in how the package was maintained (a single maintainer, no code review, publishing from a personal laptop), I look for alternatives. The 2024 lottie-player incident, where a compromised maintainer account published malicious versions while the maintainer was on holiday, made me re-examine every package in my graph with a single maintainer.

How Safeguard Helps

Safeguard maintains the live graph of every package version across every lockfile in your organization, so the "is it in our graph" question is a one-second search rather than a four-hour script run. When a new advisory lands, Safeguard auto-opens remediation PRs across every affected repo, pinning away from the vulnerable version in the right manifest format for each package manager. The incident runbook is integrated: Safeguard drives the containment sequence, surfaces credentials that were exposed during the compromise window, and tracks closure rate across the fleet. If your org does not have a dedicated incident-response team for supply-chain events, Safeguard gives you the workflow that replaces one.
