
Kubernetes Operator Supply Chain Controls

Operators are powerful, privileged, and often under-governed. This post covers the supply chain controls that keep operator installations from becoming the largest attack surface in your cluster.

Shadab Khan
Security Engineer

Kubernetes operators occupy an awkward middle ground in supply chain governance. They are infrastructure software, which means platform teams treat them as a deployment artifact. They are application software, which means security teams sometimes assume the platform team has covered them. They are also, in many cases, the most privileged components in a cluster, capable of creating and destroying resources cluster-wide and of running arbitrary controller logic with broad RBAC.

The result is that operators are often installed with less scrutiny than the applications they manage, despite running with more authority. This post covers the supply chain controls that close that gap.

What An Operator Actually Is

An operator is a controller that watches custom resources, the objects whose kinds are defined by its custom resource definitions (CRDs), and reconciles cluster state to match them. In supply chain terms, it is at minimum a container image, a set of CRD manifests, and an RBAC bundle, usually with a deployment template, all packaged together. The Operator Lifecycle Manager (OLM) bundle format and the Helm chart format are the two common packaging shapes.

The supply chain surface includes every part of that bundle: the image, which runs the controller; the CRDs, which extend the cluster's API; the RBAC, which grants the controller its privilege; and the bundle metadata, which describes how all of it gets installed.

Each of those pieces is a supply chain artifact and deserves the corresponding controls. In practice, most installations verify the image and skip the rest, which leaves the most important attack surface, the RBAC, ungoverned.

Bundle Verification

Operator bundles distributed through OperatorHub or vendor catalogues should be treated like any other third-party artifact. They get pulled into an internal mirror, verified, scanned, and re-attested.

The verification covers the bundle signature where one is available, the source repository if the bundle is built from open source, and the integrity of the bundle's component manifests. A bundle whose listed CRDs do not match the CRDs actually installed is a corrupted bundle, accidentally or otherwise, and should not be installed.
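The manifest integrity check can be sketched as a set comparison. This is a minimal illustration, not a real bundle parser: it assumes the bundle has already been loaded into plain dictionaries, and the function names `crd_mismatch` and `bundle_is_consistent` are hypothetical.

```python
def crd_mismatch(listed_crds, manifests):
    """Compare the CRD names the bundle metadata lists against the
    CRD manifests the bundle actually ships.

    listed_crds: iterable of CRD names taken from the bundle metadata.
    manifests:   list of parsed manifest dicts (assumed already loaded).
    """
    listed = set(listed_crds)
    shipped = {
        m["metadata"]["name"]
        for m in manifests
        if m.get("kind") == "CustomResourceDefinition"
    }
    return {
        "listed_but_missing": sorted(listed - shipped),
        "shipped_but_unlisted": sorted(shipped - listed),
    }


def bundle_is_consistent(listed_crds, manifests):
    """True only when the listed and shipped CRD sets are identical."""
    result = crd_mismatch(listed_crds, manifests)
    return not result["listed_but_missing"] and not result["shipped_but_unlisted"]
```

A bundle that ships a CRD it does not list fails the check just as one that lists a CRD it does not ship, since either mismatch means the metadata cannot be trusted as a description of what gets installed.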

The internal mirror keeps a curated catalogue. New bundles enter the catalogue through a review process that involves the platform team and the security team. Updates to bundles flow through the same process, with the security team's review focusing on the diff in RBAC and CRDs rather than on the image content alone.

The catalogue is the single source of truth for what can be installed in production. Operators outside the catalogue are blocked from production by admission policy. Development clusters have a more permissive setup, but the production gate is firm.
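The production gate can be illustrated with a small decision function. This is a sketch under stated assumptions rather than a real admission controller: it assumes catalogue entries are keyed by image digest, that production images must be pinned by digest, and that the environment label is passed in by the caller. `CatalogueEntry` and `admit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    name: str
    image_digest: str  # assumed form: "sha256:<hex>"


def admit(image_ref, environment, catalogue):
    """Return (allowed, reason) for an operator image reference."""
    if environment != "production":
        # Development clusters are more permissive by design.
        return True, "non-production: catalogue check is advisory"
    if "@sha256:" not in image_ref:
        # Assumption: a mutable tag cannot be matched against the catalogue.
        return False, "production images must be pinned by digest"
    digest = image_ref.split("@", 1)[1]
    for entry in catalogue:
        if entry.image_digest == digest:
            return True, f"matches catalogue entry {entry.name}"
    return False, "digest not in catalogue"
```

Matching on digest rather than on name and tag is a design choice in this sketch: it ties the admission decision to the exact artifact that was reviewed.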

RBAC Scoping

The RBAC granted to an operator is the most consequential part of its supply chain. A typical operator's bundle requests cluster-level read on its watched resources and cluster-level write on the resources it manages. Many bundles request more than they need, either out of laziness or out of an abundance of "future-proofing".

The control we apply is a structured RBAC review for every operator before it is admitted to the catalogue. The review walks the requested permissions and asks two questions per verb. Does the operator demonstrably need this verb to perform its stated function? And is there a less-privileged alternative, such as scoping to a namespace rather than cluster-wide, or to named resources or a label selector rather than to all resources of a kind? (RBAC itself can restrict a rule by resourceNames but not by label, so label-based scoping has to be enforced by admission policy.)

The output of the review is either an approval, a request for the vendor to scope down, or a custom RBAC bundle that we generate ourselves and apply instead of the vendor's default. Custom bundles add maintenance overhead, but for the most powerful operators they are worth it.
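Part of the review can be pre-screened mechanically before a human looks at it. The sketch below assumes the rules are dictionaries in the shape of Kubernetes RBAC policy rules (`apiGroups`, `resources`, `verbs`, optional `resourceNames`); the specific checks and the function name `review_rules` are illustrative choices, not an exhaustive policy.

```python
WRITE_VERBS = {"create", "update", "patch", "delete", "deletecollection"}
READ_VERBS = {"get", "list", "watch"}
ESCALATION_VERBS = {"escalate", "bind", "impersonate"}


def review_rules(rules, cluster_scoped):
    """Return (rule_index, finding) pairs for a reviewer to triage.

    cluster_scoped: True when the rules come from a ClusterRole bound
    cluster-wide, False when they are bound within a single namespace.
    """
    findings = []
    for i, rule in enumerate(rules):
        verbs = set(rule.get("verbs", []))
        resources = set(rule.get("resources", []))
        groups = set(rule.get("apiGroups", []))
        if "*" in verbs:
            findings.append((i, "wildcard verb"))
        if "*" in resources or "*" in groups:
            findings.append((i, "wildcard resource or API group"))
        if verbs & ESCALATION_VERBS:
            findings.append((i, "privilege-escalation verb"))
        if cluster_scoped and "secrets" in resources and verbs & READ_VERBS:
            findings.append((i, "cluster-wide read on secrets"))
        if cluster_scoped and verbs & WRITE_VERBS and not rule.get("resourceNames"):
            findings.append((i, "cluster-wide write without resourceNames"))
    return findings
```

The findings are prompts for the two review questions, not verdicts. An operator may legitimately need a flagged permission, and the reviewer records why.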

The review is repeated when the bundle is updated. RBAC drift inside an operator's permissions is one of the easier supply chain compromises, because the diff is rarely the focus of the upgrade conversation.
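One way to put the RBAC diff in front of the reviewer is to flatten each version's rules into (API group, resource, verb) triples and compare the sets. This is a minimal sketch with hypothetical function names. It assumes rules in the Kubernetes policy-rule shape and does not expand wildcards, so a rule that changes from an explicit list to `*` shows up as one added triple rather than many.

```python
def flatten(rules):
    """Expand policy rules into a set of (apiGroup, resource, verb) triples."""
    return {
        (group, resource, verb)
        for rule in rules
        for group in rule.get("apiGroups", [""])
        for resource in rule.get("resources", [])
        for verb in rule.get("verbs", [])
    }


def rbac_diff(old_rules, new_rules):
    """Return the permissions added and removed between two bundle versions."""
    old, new = flatten(old_rules), flatten(new_rules)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```

Anything in `added` is a new grant that needs the same two review questions as an initial install.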

CRD Review

CRDs are an under-examined surface. A CRD is, in effect, an extension to the cluster's API. Once installed, it can be referenced by any user with the right RBAC, and the operator that owns it can reconcile any object of that kind.

The CRD review for new operators looks at the schema, the validation rules, and the interaction with existing CRDs. Schemas with broad, permissive types such as arbitrary maps or unrestricted strings are flagged for a closer look. Validation rules that are weaker than they should be are documented. Interactions with existing CRDs, particularly cases where two operators might both reconcile the same kind, are surfaced as conflicts.
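The schema part of that review lends itself to a simple lint. The sketch below walks an OpenAPI v3 schema that has already been parsed into a dictionary and flags the permissive shapes mentioned above. The function name `lint_schema` and the exact heuristics (for example, treating a string with none of `maxLength`, `pattern`, `enum`, or `format` as unrestricted) are assumptions made for illustration.

```python
def lint_schema(schema, path="spec"):
    """Return human-readable findings for permissive parts of a CRD schema."""
    findings = []
    kind = schema.get("type")
    if schema.get("x-kubernetes-preserve-unknown-fields") is True:
        findings.append(f"{path}: preserves unknown fields")
    if kind == "object":
        props = schema.get("properties", {})
        extra = schema.get("additionalProperties")
        if extra is True or (not props and extra is None):
            # No declared fields and no typed values: anything goes.
            findings.append(f"{path}: arbitrary map")
        elif isinstance(extra, dict):
            findings += lint_schema(extra, path + ".*")
        for name, sub in props.items():
            findings += lint_schema(sub, f"{path}.{name}")
    elif kind == "array" and isinstance(schema.get("items"), dict):
        findings += lint_schema(schema["items"], path + "[]")
    elif kind == "string":
        if not any(k in schema for k in ("maxLength", "pattern", "enum", "format")):
            findings.append(f"{path}: unrestricted string")
    return findings
```

The output is a list of paths for the reviewer to look at, which keeps the human effort focused on the fields where an attacker-controlled value has the most room.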

CRDs also have a lifecycle. When an operator is removed, its CRDs may persist with stranded objects. The cleanup is rarely automated by default, and the residual objects can be a source of confusion or, occasionally, of policy bypass. The operator removal workflow we document covers the CRD cleanup explicitly.

The Operator Identity

Each operator runs as a service account with the granted RBAC. That service account is a workload identity in the same sense as any other, and it should be governed similarly.

The service account's token should be short-lived, refreshed by the kubelet rather than mounted as a long-lived static token. Audit logs should record actions taken by the service account with the same granularity as actions taken by human users. Anomaly detection should baseline the operator's normal API traffic and alert on deviations.

This last point matters most. A compromised operator typically does not start sending traffic that is novel in shape; it does, however, start sending traffic at unusual times, in unusual volumes, or against unusual subsets of resources. The baseline plus anomaly detection catches what RBAC alone cannot.
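A baseline of this kind does not have to be sophisticated to be useful. The following sketch is an illustration only: it assumes API traffic has already been aggregated into request counts per hour of day, uses a plain z-score for volume, and treats any hour or resource absent from the baseline as unusual. The function names and the threshold of 3.0 are hypothetical.

```python
from statistics import mean, pstdev


def build_baseline(hourly_counts):
    """hourly_counts maps hour of day -> list of observed request counts."""
    return {h: (mean(c), pstdev(c)) for h, c in hourly_counts.items() if c}


def check(baseline, known_resources, hour, count, resource, z_threshold=3.0):
    """Return the list of ways an observation deviates from the baseline."""
    alerts = []
    if resource not in known_resources:
        alerts.append("unusual resource")
    if hour not in baseline:
        alerts.append("unusual time")
    else:
        mu, sigma = baseline[hour]
        # Floor sigma at 1.0 so a perfectly flat baseline cannot divide by zero.
        z = (count - mu) / max(sigma, 1.0)
        if z > z_threshold:
            alerts.append("unusual volume")
    return alerts
```

The three alert types map onto the three deviations described above: unusual times, unusual volumes, and unusual subsets of resources.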

Lifecycle Controls

Operators have an upgrade cadence that often differs from the applications they manage. The supply chain controls have to handle the lifecycle, not just the install.

The pattern we use is as follows. Catalogue updates land in a staging cluster first. The operator runs there for a calibration period, during which any new behaviour, new RBAC, or new CRD interactions are observed. Promotion to production requires a sign-off from the platform team and the security team.

Major version upgrades, where the operator's behaviour or schema is expected to change, get a longer calibration. Patch upgrades, where the changes are bug fixes only, get a shorter one. Either way, the upgrade is a deliberate event with explicit approval, not a silent track-the-latest configuration.
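The calibration length can be derived from the version bump. The sketch below assumes operators follow semantic versioning; the day counts in `CALIBRATION_DAYS` are placeholder values, and the minor-version tier is an addition not described above.

```python
# Placeholder durations; pick values that match your own risk appetite.
CALIBRATION_DAYS = {"major": 14, "minor": 7, "patch": 2}


def parse(version):
    """Parse 'v1.2.3', '1.2.3-rc1', or '1.2.3+build' into (major, minor, patch)."""
    core = version.lstrip("v").split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def upgrade_kind(old, new):
    """Classify an upgrade as 'major', 'minor', or 'patch'."""
    o, n = parse(old), parse(new)
    if n <= o:
        raise ValueError(f"{new} is not an upgrade from {old}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


def calibration_days(old, new):
    return CALIBRATION_DAYS[upgrade_kind(old, new)]
```

A version label is only a hint. If the RBAC or CRD diff of a nominal patch release is non-empty, it is reasonable to treat the upgrade as a larger one regardless of the number.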

Operators that have not been updated for an extended period are flagged. Stale operators are a common source of unfixed CVEs and deprecated CRD versions. The flag triggers a review, which results in either an upgrade, a sunset plan, or a documented justification for the staleness.
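The staleness flag itself is a date comparison. In this sketch the 180-day threshold and the function name `stale_operators` are assumptions; the text above does not define how long an "extended period" is.

```python
from datetime import date, timedelta


def stale_operators(last_updated, today, max_age=timedelta(days=180)):
    """Return the names of operators whose last update is older than max_age.

    last_updated: dict mapping operator name -> date of its last upgrade.
    """
    return sorted(
        name for name, updated in last_updated.items() if today - updated > max_age
    )
```

For example, with `today` set to 1 July 2024, an operator last updated on 1 June 2023 is flagged and one updated on 1 May 2024 is not.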

Removing An Operator

Removing an operator is harder than installing one, because the residual state can be complex. The bundle's deployment is straightforward to delete. The CRDs and the objects they describe are not. The RBAC may have granted permissions to other components that depended on the operator. The webhooks the operator registered may still be active and, with their backing service gone, may now be rejecting or silently skipping admission checks for unrelated workloads.

The removal playbook we document covers all of those: the deployment, the RBAC, the webhooks, the CRDs, and the residual CR objects, in that order. Each step is verified before the next is attempted. The playbook is exercised periodically against a non-critical operator in staging, so that the process is not novel the first time it has to be applied to a real installation.
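The verify-before-proceeding structure of the playbook can be sketched as a small runner. This is an illustration of the control flow only: the step names mirror the order given above, while the `action` and `verify` callables are hypothetical stand-ins for whatever tooling actually deletes and checks each resource type.

```python
REMOVAL_ORDER = ["deployment", "rbac", "webhooks", "crds", "custom-resources"]


def run_playbook(steps, order=REMOVAL_ORDER):
    """Run removal steps in order, stopping at the first failed verification.

    steps: dict mapping step name -> (action, verify), both zero-argument
           callables; verify returns True when the step's resources are gone.
    Returns (completed_steps, failed_step), where failed_step is None on success.
    """
    completed = []
    for name in order:
        action, verify = steps[name]
        action()
        if not verify():
            # Do not continue: later steps assume this one succeeded.
            return completed, name
        completed.append(name)
    return completed, None
```

Returning the list of completed steps alongside the failing one makes it clear where a partially removed operator stands, which is what the staging drill is meant to rehearse.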

How Safeguard Helps

Safeguard treats operators as first-class supply chain surfaces. The internal catalogue stores bundles that have passed image verification, RBAC review, and CRD review, with the review history queryable per operator. Admission policy enforces catalogue-only operator installation in production namespaces. The runtime stack baselines each operator's API traffic and alerts on deviations, correlating findings with the operator's RBAC scope so that the impact of an alert is immediately legible. Lifecycle workflows handle staging promotion, version pinning, and staleness flags, and the removal playbook is built into the platform with verification at each step. The result is operator governance that is at least as strong as application governance, which is what the privilege level requires.
