AWS ECR Image Signing in Production

Image signing in ECR has moved from nice-to-have to table stakes. Here is what it actually takes to run cosign and AWS Signer in production without breaking every deploy.

Nayan Dey
Senior Security Engineer

I have rolled out ECR image signing at three organizations in the last two years and watched another five attempt it. The technology works. The operational story is where most teams get stuck. This post is about what actually happens when you turn on image signing in production, not the tidy version where everything signs on the first try.

AWS has two signing paths for ECR: cosign with a KMS key, and AWS Signer Notation with a signing profile. Both produce OCI artifacts stored alongside the image in ECR, and both come down to verifying a signature over the image digest against public key material. The differences are operational, and they matter.

Why sign ECR images at all?

Because the container image is the only thing that actually gets deployed, and everything before it is advisory. Your SBOM is advisory. Your vulnerability scan is advisory. Your CI pipeline logs are advisory. The image is the artifact. If you cannot prove that a specific image came from a specific pipeline run on a specific commit, you are trusting the tag that points at it, and ECR tags are mutable by default.

The signed image gives you a verifiable chain: this digest was signed by this key at this time, by this pipeline role, against this commit. When an incident happens and you need to answer "did this production pod run code that came from our main branch," the signature is the answer. Without it, you are reading CloudTrail logs and hoping.

AWS Signer versus cosign, practically

AWS Signer Notation was released for ECR general availability in June 2023. It is the AWS-native option. It uses a signing profile, a KMS-backed signing key, and produces Notation v1.0 signatures that IAM policies can gate. The integration with EKS Pod Identity and with the aws-signer-notation-cli is clean. The biggest advantage is that you can tie signing authority directly to IAM, so a compromised pipeline role cannot sign images it was not authorized to sign.
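
For orientation, the Signer path looks roughly like the sketch below. It assumes you have already created a signing profile; the repository, account, region, and profile name are placeholders, and the plugin flags follow the AWS Signer documentation for the Notation plugin.

    # sign the image digest with the AWS Signer plugin for notation;
    # the --id value is the signing profile ARN that IAM can gate
    notation sign "123456789012.dkr.ecr.us-east-1.amazonaws.com/app@sha256:..." \
      --plugin "com.amazonaws.signer.notation.plugin" \
      --id "arn:aws:signer:us-east-1:123456789012:/signing-profiles/ecr_image_signing"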

cosign is the open-source option and the one most teams already know. It produces Sigstore-compatible signatures, integrates with the broader Sigstore ecosystem including transparency logs, and works identically across ECR, GCR, and any other OCI-compliant registry. If you are multi-cloud or you want a transparency log (Rekor), cosign is the answer.

My current recommendation: use cosign with an AWS KMS key reference (cosign sign --key awskms:///alias/cosign-signer ...). You get the Sigstore tooling, KMS key material that never leaves AWS, and you can swap signing backends without re-signing the world. If you are AWS-only and want IAM-gated signing, AWS Signer is fine.
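
In practice that is two commands per image. A minimal sketch, assuming the alias above and an image pinned by digest (the repository and digest are placeholders):

    # sign by digest, never by tag; the tag can move, the digest cannot
    IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/app@sha256:..."
    cosign sign --key awskms:///alias/cosign-signer "$IMAGE"

    # anyone granted kms:Verify and kms:GetPublicKey on the key can check it
    cosign verify --key awskms:///alias/cosign-signer "$IMAGE"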

Key management is the actual problem

The signing key is the thing attackers want. Treat it accordingly.

Create a dedicated KMS customer-managed key in a signing account. The key policy grants kms:Sign only to a specific CodeBuild role or GitHub Actions OIDC role; the KMS Sign API takes no encryption context you can condition on, so pin the signing authority to the source repository in that role's trust policy instead (for GitHub OIDC, a condition on the token's sub claim). Grant kms:Verify and kms:GetPublicKey to the admission controller role in each deployment account. Do not grant kms:Sign to humans. Ever.
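
A sketch of that key policy, with placeholder account IDs and role names. The admin statement keeps the key manageable; the verify statement is what each deployment account's admission controller role gets.

    cat > signing-key-policy.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "KeyAdministration",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
          "Action": "kms:*",
          "Resource": "*"
        },
        {
          "Sid": "PipelineMaySign",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::111111111111:role/ci-image-signer" },
          "Action": "kms:Sign",
          "Resource": "*"
        },
        {
          "Sid": "DeploymentAccountsMayVerify",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::222222222222:role/admission-verifier" },
          "Action": ["kms:Verify", "kms:GetPublicKey"],
          "Resource": "*"
        }
      ]
    }
    EOF
    aws kms put-key-policy --key-id "$SIGNING_KEY_ID" \
      --policy-name default --policy file://signing-key-policy.json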

Rotate the signing key annually. KMS automatic key rotation does not help you here: it is only offered for symmetric encryption keys, not the asymmetric keys you sign with, and in any case rotating the backing material behind a fixed key ARN would leave you with no record of which material signed which image. Manual rotation with a new key alias and a deliberate cutover is the correct pattern.

Keep old public keys around forever. When you rotate, you do not invalidate images signed by the previous key; you add the new key to the verification trust bundle and continue to accept signatures from the old key for images built before the rotation date. The trust bundle is append-only. This is the single most common mistake I see on rotation: teams remove the old key from verification and immediately break every deployment of an image older than the rotation.
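
A minimal sketch of the rotation mechanics under that model, assuming cosign with KMS; the key spec and alias names are placeholders:

    # create the replacement signing key with its own alias
    NEW_KEY_ID=$(aws kms create-key --key-spec ECC_NIST_P256 --key-usage SIGN_VERIFY \
      --description "cosign image signing key (annual rotation)" \
      --query KeyMetadata.KeyId --output text)
    aws kms create-alias --alias-name alias/cosign-signer-2025 --target-key-id "$NEW_KEY_ID"

    # export the new public key and add it to the verification trust bundle
    # alongside the old ones; never remove the old keys, or every image signed
    # before the cutover stops admitting
    cosign public-key --key awskms:///alias/cosign-signer-2025 > cosign-signer-2025.pub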

Verification at admission

Signing is half the work. Verification is the other half, and it has to happen at the point where the image becomes a workload.

For EKS, the standard pattern is a Kyverno or Gatekeeper policy that intercepts pod admission and verifies the signature against a configured public key. Kyverno has native cosign and Notation support as of version 1.11. The policy reads the image reference, pulls the signature artifact from ECR, verifies against the configured key, and admits or denies the pod.
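
A trimmed-down sketch of such a Kyverno policy, assuming key-based cosign verification. The registry path is a placeholder, and the PEM block is wherever you publish the signing public key (for example the output of cosign public-key).

    kubectl apply -f - <<'EOF'
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-ecr-image-signatures
    spec:
      validationFailureAction: Enforce   # deny on failure, do not just audit
      failurePolicy: Fail                # fail closed if the webhook is unavailable
      webhookTimeoutSeconds: 30
      rules:
        - name: verify-cosign-signature
          match:
            any:
              - resources:
                  kinds:
                    - Pod
          verifyImages:
            - imageReferences:
                - "123456789012.dkr.ecr.us-east-1.amazonaws.com/*"
              attestors:
                - entries:
                    - keys:
                        publicKeys: |-
                          -----BEGIN PUBLIC KEY-----
                          <signing public key goes here>
                          -----END PUBLIC KEY-----
    EOF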

For ECS and Lambda, the pattern is weaker because there is no admission controller. What you can do is enforce that only signed images can be referenced in task definitions or Lambda functions, by running a verification step in the deploy pipeline that fails if the image is unsigned. This is weaker because a compromised deploy role could skip it, but it is better than nothing, and combined with a strict IAM boundary on who can update task definitions it is reasonable.
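
The deploy-time gate can be as small as one command that the pipeline runs before it touches the task definition or function configuration. A sketch, with a placeholder image reference:

    # refuse to roll out anything that does not carry a valid signature
    IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/api@sha256:..."
    cosign verify --key awskms:///alias/cosign-signer "$IMAGE" \
      || { echo "unsigned or invalid signature: $IMAGE" >&2; exit 1; }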

For each cluster, configure the verification policy to require signatures from a specific list of keys and to reject images without signatures. The failure mode you want is that an unsigned image produces a pod admission failure with a clear error message. The failure mode you do not want is that an unsigned image runs because the verification controller crashed three hours ago and nobody noticed.

The emergency deploy problem

Every signing rollout eventually hits the same wall: production is on fire, the fix is ready, and the CI pipeline that signs images is broken. What do you do?

You plan for this in advance. The two patterns that work:

Break-glass signing role. A second KMS key, in a separate account, whose use is gated behind a manual approval step with two human approvers and scoped by an AWS Organizations SCP so that only the break-glass role can call kms:Sign. In a true emergency, a release engineer can sign an image with the break-glass key, the verification policy accepts signatures from either the normal key or the break-glass key, and the deploy proceeds. Every break-glass signing event generates a CloudTrail entry and a Slack alert. I have seen this used exactly twice in two years. Both times it saved hours.

Pre-signed image inventory. Every week, the pipeline signs the last twenty passing main-branch images and keeps the signatures warm. If the pipeline dies, you can still roll forward or back among those images without needing to sign fresh.
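
Whichever pattern you lean on, the verification side has to trust both keys. A sketch of the deploy-time check, with hypothetical alias names; in an admission policy the equivalent is listing both public keys in the trusted set.

    # accept a signature from either the normal pipeline key or the break-glass key
    cosign verify --key awskms:///alias/cosign-signer "$IMAGE" \
      || cosign verify --key awskms:///alias/cosign-break-glass "$IMAGE"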

The anti-pattern to avoid: a bypass label on pods that skips signature verification. Once that exists, it will be used for non-emergencies, and within a year half your production pods will have it. Do not ship that escape hatch.

What breaks on day one

When you enable enforcement, the first thing that breaks is your sidecars. Datadog agents, Istio proxies, OpenTelemetry collectors, the hundred other third-party images that run in every pod. You need a policy exception for images from known-good external registries, or you need to mirror and re-sign them into your own ECR. Mirroring is more work but cleaner. Exceptions are faster but grow.
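
If you go the mirroring route, the mechanics are small enough to live in a nightly job. A sketch using crane from go-containerregistry; the upstream image and mirror repository are examples:

    # copy the third-party image into your own ECR, then sign the mirrored digest
    UPSTREAM="public.ecr.aws/datadog/agent:7"
    MIRROR_REPO="123456789012.dkr.ecr.us-east-1.amazonaws.com/mirror/datadog-agent"
    crane copy "$UPSTREAM" "$MIRROR_REPO:7"
    DIGEST=$(crane digest "$MIRROR_REPO:7")
    cosign sign --key awskms:///alias/cosign-signer "$MIRROR_REPO@$DIGEST"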

The second thing that breaks is init containers that were pulled from DockerHub six years ago and nobody remembers why. This is a gift. Delete them.

How Safeguard Helps

Safeguard tracks every image pushed to ECR, verifies the signature against your trust bundle, and alerts when an image is pushed unsigned or when a verification fails at admission. We correlate the signing event back to the pipeline run and the source commit, so the provenance chain stays intact even when your CI logs rotate. Policy gates in Safeguard can block a release if the image is missing a signature, if the signing key is past its rotation date, or if the image references a base layer that itself is unsigned.
