Kubernetes Admission Policy Real-World Deployment

What it actually takes to put Kubernetes admission policy into enforcement mode without breaking deployments: phased rollout, exception workflows, audit-mode hygiene, and policy authoring conventions that survive contact with engineers.

Shadab Khan
Security Engineer
7 min read

Most organisations have a Kubernetes admission controller. Far fewer have one in enforcement mode. Even fewer have one in enforcement mode that does not generate a constant stream of escalations from product teams. The gap between "we have admission policy" and "admission policy is doing useful work" is where most of the practical security value lives, and it is where most rollouts stall.

This post is about the second half of the journey. It assumes you have picked a controller, written a few policies, and put them in audit mode. It covers what it takes to get those policies into enforce mode and keep them there.

The Audit Mode Problem

Audit mode is comforting. The admission controller logs every violation but admits the workload anyway. Security teams can produce a chart that goes up and to the right, showing more policies authored over time. Engineering teams ignore the chart because nothing has actually been blocked.

The problem with audit mode is that it accumulates technical debt invisibly. By the time a policy is ready to flip to enforce, the audit log shows ten thousand violations from two thousand workloads, half of which are now legacy, half of which are owned by teams that have rotated, and all of which are someone's problem to fix before the flip can happen.

The remedy is not better tooling. It is to refuse to leave any policy in audit mode for more than thirty days. If a policy is in audit, there is a calendar entry on day thirty when it either flips to enforce or gets retired. No exceptions, including for the policy author.

Picking The First Enforce Policy

The first policy you flip to enforce sets the tone for everything that comes after. Pick badly and the rollout dies of escalation fatigue. Pick well and engineering teams start treating admission as part of the normal deployment surface.

We used three criteria. The policy must be uncontroversial, meaning it expresses a rule that no engineer would defend in a design review. It must have a clean exception path for the rare legitimate case. And it must produce a clear, copy-pasteable error message that tells the engineer exactly what to change.

Our first enforce policy was: pods must declare a non-root runAsUser. Every legitimate workload in our estate already met it. The handful that did not were either prototypes that should never have shipped or vendor charts whose owners had been waiting for an excuse to upgrade. The flip happened on a Tuesday morning. By Friday the audit channel was quiet.
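
For concreteness, here is roughly what that rule looks like as a Kubernetes-native, CEL-based ValidatingAdmissionPolicy, one of the controller options discussed below. This is a minimal sketch rather than our production policy: the name, message text, and UID check are illustrative, and it deliberately covers only the main container list.

```yaml
# Minimal sketch of the non-root rule as a ValidatingAdmissionPolicy.
# The name and message are illustrative, not a production policy.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-nonroot-runasuser
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        object.spec.containers.all(c,
          has(c.securityContext) &&
          has(c.securityContext.runAsUser) &&
          c.securityContext.runAsUser > 0)
      message: >-
        Every container must set securityContext.runAsUser to a non-root
        UID. Add securityContext.runAsUser with a value greater than zero
        to each container.
```

A ValidatingAdmissionPolicyBinding is still needed to put the policy into effect, and a production version would also honour a pod-level securityContext and walk init containers, a gap we come back to below.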

The Exception Workflow

No admission policy survives without an exception workflow. Pretending otherwise produces underground bypasses, where engineers learn that the path of least resistance is to copy a kubeconfig that has cluster-admin and skip the controller entirely.

The exception workflow that worked for us has four properties. Exceptions are first-class objects in the policy language, not annotations bolted on. They have a stated owner, a stated reason, and a stated expiry, all of which are required at creation time. They are visible in a dashboard that any engineer can see, so the population of exceptions is public knowledge rather than private accumulation. And expiry is enforced; an expired exception causes the policy to start blocking again, with a notification to the owner one week before.
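
What this looks like in practice is easiest to show as an object. The sketch below is hypothetical, not a real API: the group, kind, and field names are illustrative. The point is only that owner, reason, and expiry are required schema fields rather than free-form annotations.

```yaml
# Hypothetical exception object; the API group, kind, and field names
# are illustrative. Owner, reason, and expiry are required by the
# schema, so an exception cannot be created without them.
apiVersion: policy.example.com/v1
kind: PolicyException
metadata:
  name: vendor-agent-runasuser
  namespace: observability
spec:
  policy: require-nonroot-runasuser   # the policy being excepted
  match:
    kinds: ["Pod"]
    names: ["vendor-agent-*"]
  owner: team-observability           # notified one week before expiry
  reason: "Vendor chart runs as root; fix tracked with the upstream vendor"
  expiry: "2025-09-30"                # blocking resumes after this date
```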

The hardest part was teaching teams that an exception is not a defeat. An exception is the system working as intended. The defeat is when an exception lingers past its expiry because no one reviewed it.

Policy Authoring Conventions

Policy languages are unforgiving. Whether you are using OPA, Kyverno, or the Kubernetes-native CEL-based validating admission policy, the language rewards careful authorship and punishes cleverness. We adopted four conventions that paid off.

One. Every policy has a name, a description, and a remediation hint. The remediation hint is the message the engineer sees when the policy fires, and it is treated with the same care as a UX copy review.

Two. Policies are tested. Every policy ships with a fixture pack of pod manifests that should pass and pod manifests that should fail. The test runs in CI on every change to the policy. A policy without tests is not a policy; it is an aspiration. A minimal fixture sketch follows convention four below.

Three. Policies are versioned. The cluster never runs an unversioned policy. Each version has a changelog entry naming what changed and why. When a policy is updated, the previous version remains queryable for thirty days so that audit traces from before the change still resolve.

Four. Policies are owned. Every policy has a single named team that is responsible for it, and that team appears in the failure message. The alternative, pointing a frustrated engineer at a chat channel that no one reads, is the fastest way to burn their goodwill.
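
The fixture pack from convention two can be as plain as a directory of pod manifests paired with an expected verdict. A minimal sketch against the runAsUser policy from earlier; the file layout and naming are illustrative:

```yaml
# fixtures/runasuser/pass-nonroot.yaml -- expected verdict: admit
apiVersion: v1
kind: Pod
metadata:
  name: pass-nonroot
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      securityContext:
        runAsUser: 10001
---
# fixtures/runasuser/fail-root.yaml -- expected verdict: deny
apiVersion: v1
kind: Pod
metadata:
  name: fail-root
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      securityContext:
        runAsUser: 0   # root; the policy should reject this
```

CI evaluates each fixture against the policy and fails the build if any verdict differs from the expectation encoded in the fixture name.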

The Quiet Killers

Three failure modes ate up most of our debugging time in the first six months of enforcement.

Mutating webhooks racing with validating webhooks. A pod gets mutated to add a sidecar, and the validating webhook then evaluates a manifest the user did not write. The error message references a field the user has never seen, and confusion spirals. The fix is to keep the order of webhooks deterministic and to make the mutating webhook explain what it changed.
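
One concrete way to make a mutating webhook announce itself is the warnings field of the admission.k8s.io/v1 AdmissionReview response, which the API server passes back to the client and kubectl prints to the user. A sketch of the response a sidecar-injecting webhook might return; the warning text and placeholder values are ours:

```yaml
# Mutating webhook response, shown as YAML. The patch injects a sidecar;
# the warning tells the user that validation saw a mutated manifest.
apiVersion: admission.k8s.io/v1
kind: AdmissionReview
response:
  uid: "<uid copied from the request>"
  allowed: true
  patchType: JSONPatch
  patch: "<base64-encoded JSON Patch that appends the sidecar container>"
  warnings:
    - "sidecar-injector added container 'proxy'; validation ran against the mutated manifest, not your original"
```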

Init containers being treated as second-class. Many policies were written against the main container spec and ignored init containers entirely. A few attacks took advantage of this. We rewrote every policy to walk both lists, and added a CI test that ensured each policy applied to both.
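
In CEL the rewrite is a one-expression change: concatenate the two container lists, guarding for the optional initContainers field, and apply the same predicate to the result. A sketch of the updated validation, replacing the one in the earlier policy:

```yaml
# Updated validation that walks both main and init containers.
# Only the expression changes from the earlier policy sketch.
validations:
  - expression: >-
      (object.spec.containers +
        (has(object.spec.initContainers) ? object.spec.initContainers : []))
      .all(c,
        has(c.securityContext) &&
        has(c.securityContext.runAsUser) &&
        c.securityContext.runAsUser > 0)
    message: >-
      Every container, including init containers, must set
      securityContext.runAsUser to a non-root UID.
```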

Vendor charts. Helm charts published by external vendors are the most common source of admission failures, because they are written for a permissive cluster and not tested against a hardened one. We built an internal chart preflight tool that ran every chart through our admission policies offline, produced a report, and either patched the chart or quarantined it before it ever hit a real namespace.
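
A rough equivalent of that preflight can be assembled from off-the-shelf pieces: render the chart offline with helm template, then evaluate the rendered manifests against the policy set with the Kyverno CLI (assuming the policies are written as, or mirrored into, Kyverno policies). The CI syntax below is a generic sketch and the paths are illustrative:

```yaml
# Generic CI job sketch: render a vendor chart offline and check the
# result against the admission policies before it reaches a cluster.
preflight-vendor-chart:
  script:
    # Render the chart to plain manifests; no cluster involved.
    - helm template vendor-release ./charts/vendor-agent > rendered.yaml
    # Evaluate every rendered resource against the policy directory;
    # a non-zero exit fails the pipeline if any resource violates policy.
    - kyverno apply ./policies/ --resource rendered.yaml
```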

Measuring The Outcome

The metric that mattered most was not the number of policies in enforce mode. It was the time between a violation being introduced and a violation being caught. In audit mode that time was effectively infinite, because no one read the audit log. In enforce mode it was sub-second, because the deployment failed and the engineer got a message.

A secondary metric was the rate of exception creation. We expected a spike at the start of each policy enforcement, followed by decay as teams either fixed the underlying issue or routinised the exception. Policies whose exception rate did not decay were either too strict or too vague, and we revisited them in the next quarterly review.

A third metric was the rate of bypass attempts. Every cluster has a small number of identities with cluster-admin, and we logged when those identities created pods that would have been blocked by current policy. That rate should be near zero. When it spiked, it was almost always a CI system that had been over-privileged at setup time and never trimmed.
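
Catching those creates needs the raw requests rather than admission verdicts, because by definition the controller admitted them. One way to collect them is a Kubernetes audit policy that records pod creates with full request bodies, which an offline job can then replay against the current policy set. A minimal audit rule sketch:

```yaml
# Minimal audit policy rule: log the full body of every pod create so
# an offline job can re-evaluate it against the current policy set.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    verbs: ["create"]
    resources:
      - group: ""          # core API group
        resources: ["pods"]
```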

How Safeguard Helps

Safeguard ships an admission policy engine with the workflow features that make enforcement actually hold. Policies carry built-in metadata (owners, remediation hints, and test fixtures) and are versioned in the policy store with a queryable history. The exception workflow is first class, with structured expiry and one-week renewal nudges. The dashboard surfaces the policies, the exceptions, and the bypass attempts in a single view, so the security team sees the same posture the engineering teams see. And the chart preflight runs vendor Helm charts and operators against the policy set offline, so the first time a chart meets the policy is in a sandbox rather than in a frustrated incident channel. The result is admission policy that is both enforced and operable, rather than one or the other.
