A compromised CI/CD pipeline is one of the worst calls a security team can get. The pipeline has credentials to your registries, your cloud accounts, your production deploy paths, and often your code signing keys. When it is compromised, everything downstream has to be treated as potentially tainted until proven otherwise. Here is the investigation sequence my team runs, refined over half a dozen real incidents.
Step One: Freeze the Pipeline, Not the Team
The very first move is to stop new builds without stopping the humans who are trying to understand what happened. On GitHub Actions, disable workflows at the repo level with gh api repos/:owner/:repo/actions/permissions -X PUT -F enabled=false. On GitLab, set the project's CI/CD to disabled through the API. On Jenkins, Jenkins.instance.doQuietDown() in the script console puts the controller into quiet-down mode, so running builds finish but no new ones start. For self-hosted GitHub runners, deregister them so they stop picking up jobs: gh api repos/:owner/:repo/actions/runners/:id -X DELETE.
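For the GitLab step, a minimal sketch of the project-level toggle, assuming an API-scoped token in GITLAB_TOKEN and placeholder host and project ID:
# Disable CI/CD jobs on the project (jobs_enabled=false); substitute your project ID
curl --request PUT \
  --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  --data "jobs_enabled=false" \
  "https://gitlab.example.com/api/v4/projects/12345"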
Do not kill in-flight jobs immediately. Running jobs are evidence. Let them finish or capture their state before you terminate them. The exception is if a job is actively deploying code or moving data — then kill it and document the termination.
In parallel, freeze the permissions of every service account the pipeline uses. For AWS, attach an explicit deny policy rather than deleting keys, so the keys remain on disk for forensic analysis:
aws iam put-user-policy --user-name ci-deployer \
--policy-name IR-FREEZE-$(date +%s) \
--policy-document file://deny-all.json
The deny-all policy is a single statement: Effect Deny, Action *, Resource *. An explicit Deny always overrides any Allow in IAM's evaluation logic, so the keys become inert without being destroyed.
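A minimal sketch of that policy document, written to the file name the command above expects:
cat > deny-all.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IRFreeze",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
EOF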
Step Two: Preserve Runner State
If you run self-hosted runners, each one is a crime scene. Before you wipe or redeploy them, capture the disk and memory state. For cloud runners on EC2, snapshot the EBS volume and take a memory dump with AWS SSM:
aws ec2 create-snapshot --volume-id vol-0abc123 \
--description "IR-evidence runner-prod-7 $(date -u +%FT%T)"
aws ssm send-command --instance-ids i-0abc123 \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["sudo insmod /opt/lime.ko path=/tmp/mem.lime format=lime"]'
For Kubernetes-backed runners, save the pod spec and any persistent volumes, then export the last N lines of logs for every container that ran jobs in the suspect window:
mkdir -p /evidence/logs
kubectl get pods -n actions-runner -o yaml > /evidence/pods.yaml
for pod in $(kubectl get pods -n actions-runner -o name); do
kubectl logs $pod -n actions-runner --tail=10000 --all-containers \
> /evidence/logs/${pod##*/}.log
done
The logs you care about most are the workflow command outputs, because a compromised build step almost always prints something the attacker wanted hidden. Look for base64 blobs longer than 200 characters, outbound network calls to unusual domains, and curl | bash patterns.
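A rough first pass over those exported logs, assuming the /evidence/logs layout from the commands above; the patterns and the allowlist of expected domains are illustrative and should be tuned to your environment:
# Files containing long base64-like blobs (200+ chars), a common exfil wrapper
grep -rlE '[A-Za-z0-9+/=]{200,}' /evidence/logs/
# curl piped straight into a shell
grep -rnE 'curl[^|]*\|[[:space:]]*(sudo[[:space:]]+)?(ba|z)?sh' /evidence/logs/
# Outbound URLs, minus domains you expect to see in normal builds
grep -rhoE 'https?://[^/ "]+' /evidence/logs/ | sort -u | grep -vE '(github\.com|registry\.npmjs\.org)'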
Step Three: Rebuild the Pipeline Timeline
Using the preserved logs, rebuild a minute-by-minute timeline of every pipeline execution in the suspect window. For GitHub Actions, the workflow run API gives you a structured view:
gh api "repos/:owner/:repo/actions/runs?created=>=2024-03-01" --paginate | \
jq '.workflow_runs[] | {id, name, head_sha, actor: .actor.login, started: .run_started_at, status: .conclusion}'
Cross-reference each run against the commits it built. Pay special attention to runs triggered by pull requests from external forks, runs where the workflow YAML was modified in the same PR, and runs that used secrets they should not have needed. The combination of "workflow changed in the same PR" plus "secret used that the old workflow did not reference" is a classic pattern for injection via malicious PR.
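A hedged sketch of that cross-check, using the workflow runs and commits APIs with the same :owner/:repo placeholders as above (head commits from forks are usually reachable through the base repo's commits API; if not, fetch the PR branch and inspect it locally):
gh api "repos/:owner/:repo/actions/runs?event=pull_request" --paginate \
  --jq '.workflow_runs[]
        | select(.head_repository.full_name != .repository.full_name)
        | [.id, .head_sha, .actor.login] | @tsv' |
while IFS=$'\t' read -r run_id sha actor; do
  # Flag fork-PR runs whose head commit also touched workflow YAML
  if gh api "repos/:owner/:repo/commits/$sha" --jq '.files[].filename' 2>/dev/null \
       | grep -q '^\.github/workflows/'; then
    echo "REVIEW run=$run_id sha=$sha actor=$actor (fork PR + workflow change)"
  fi
done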
Step Four: Secrets Triage
Every secret the runner had access to during the suspect window is potentially compromised. Make the list exhaustive — include secrets that were in the environment, secrets that were in mounted volumes, and secrets that were fetched from a secrets manager during the job.
I build a spreadsheet with columns: secret name, storage location, last rotated, scope, compromise status. Status is one of "confirmed compromised" (secret was logged or transmitted), "presumed compromised" (secret was in env during suspect job), or "not in scope." Anything not in the last bucket gets rotated.
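If you would rather keep that triage sheet next to the evidence instead of in a spreadsheet app, a starting point with illustrative placeholder rows:
cat > /evidence/secrets-triage.csv <<'EOF'
secret_name,storage_location,last_rotated,scope,compromise_status
ci-deployer access key,GitHub Actions repo secret,unknown,prod deploy account,presumed compromised
package publish token,GitHub Actions org secret,unknown,registry publish,confirmed compromised
EOF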
For AWS, rotate IAM access keys:
aws iam create-access-key --user-name ci-deployer
# deploy new key via your secret manager
aws iam update-access-key --user-name ci-deployer \
--access-key-id AKIA... --status Inactive
# monitor for breakage, then delete
aws iam delete-access-key --user-name ci-deployer --access-key-id AKIA...
For OIDC-based credentials (which you should be using instead of long-lived keys), the compromise scope is narrower: tighten or revoke the role's trust policy, then confirm that any session credentials captured from the runner no longer work. A quick aws sts get-caller-identity call with those credentials tells you whether they are still live.
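For role-based credentials there is also a containment sketch worth knowing, the same deny-with-a-time-condition pattern AWS uses for its "revoke active sessions" action: an inline policy that denies everything to sessions issued before your cutoff. Role name and timestamp here are placeholders.
cat > revoke-old-sessions.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "DateLessThan": { "aws:TokenIssueTime": "2024-03-15T00:00:00Z" }
      }
    }
  ]
}
EOF
aws iam put-role-policy --role-name ci-deploy-role \
  --policy-name IR-REVOKE-SESSIONS \
  --policy-document file://revoke-old-sessions.json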
Step Five: Artifact Validation
Every artifact produced by the pipeline in the suspect window is suspect. Container images, npm packages, binaries, IaC bundles — all of it. Republishing everything from a clean pipeline is the only safe answer, but before you do that, you need to know what was produced so you can notify consumers of the old artifacts.
Pull the artifact inventory from your registries with timestamps. For container images in ECR:
aws ecr describe-images --repository-name myapp \
--query 'imageDetails[?imagePushedAt>=`2024-03-01`].[imageTags[0],imageDigest,imagePushedAt]' \
--output table
Tag every suspect artifact with a quarantine-<incident-id> marker so that nobody accidentally promotes it. Then rebuild from clean infrastructure and publish the new artifacts with a new version bump, not the same version rewritten.
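In ECR, adding a tag to an existing image means re-pushing its manifest under the new tag. A hedged sketch, with the digest and incident ID as placeholders:
# Fetch the manifest of the suspect image, then push it back under a quarantine tag
MANIFEST=$(aws ecr batch-get-image --repository-name myapp \
  --image-ids imageDigest=sha256:0abc123... \
  --query 'images[0].imageManifest' --output text)
aws ecr put-image --repository-name myapp \
  --image-tag quarantine-IR-2024-001 \
  --image-manifest "$MANIFEST"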
Step Six: Root Cause and Hardening
The root cause of a CI compromise is almost always one of four things: a leaked token, a malicious dependency in the build itself, a misconfigured workflow trigger that allowed untrusted code execution, or a compromise of the runner host. Walk each of these four hypotheses explicitly and rule them in or out with evidence.
The hardening that follows depends on the root cause, but a few items apply universally. Move to short-lived OIDC credentials for everything. Require signed commits for any workflow changes. Isolate production deploy runners from PR builders. Enforce that workflow YAML changes require a separate approver from code changes. Ship runner logs to immutable storage so the next investigation does not start from scratch.
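On the separate-approver point, a minimal sketch assuming GitHub with branch protection set to require review from code owners; the team handle is a placeholder:
cat >> .github/CODEOWNERS <<'EOF'
# Workflow changes need sign-off from release engineering, not just the code reviewer
/.github/workflows/ @yourorg/release-engineering
EOF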
How Safeguard Helps
Safeguard watches your CI/CD pipelines for the tell-tale signs of compromise — unexpected package publishes, unusual secret access patterns, and drift between the built artifact and the source commit — and alerts you before the damage spreads. When an incident does occur, the platform gives you a pre-built timeline of every pipeline run, every artifact produced, and every downstream consumer of those artifacts, which collapses the first day of investigation into a single dashboard. Safeguard also enforces policy gates that can automatically quarantine artifacts built during an incident window, so the question of "what do we unship" becomes a button click rather than a weekend of inventory work.