External Secrets Operator (ESO) has quietly become the default way to get secrets from an external vault into Kubernetes workloads. It fills a gap the ecosystem struggled with for years: first-class integration with cloud secret stores and HashiCorp Vault, native Kubernetes Secret semantics for workloads, and a controller pattern that does not require applications to change. I have rolled it out across several environments, and while it is usually the right answer, getting from "demo working" to "production-ready across fifty clusters" takes more thought than the quickstart suggests.
The architecture in one paragraph
ESO runs as a set of CRDs and a controller. The controller watches ExternalSecret resources, which reference a SecretStore or ClusterSecretStore that describes how to authenticate to the backend (AWS Secrets Manager, Vault, GCP Secret Manager, Azure Key Vault, 1Password, etc.). When an ExternalSecret reconciles, the controller fetches the requested data, optionally transforms it, and materializes a native Secret resource in the target namespace. Workloads mount that Secret like any other. The controller re-reconciles on a schedule, and the native Secret is kept in sync with the upstream.
The elegant part is that nothing in the application pod knows ESO exists. It just reads a standard Kubernetes Secret. The controller absorbs the vault-specific plumbing and can be swapped or reconfigured without touching workloads.
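To make that concrete, here is a minimal sketch of the two resources involved, assuming AWS Secrets Manager as the backend; the resource names and the payments/db key are illustrative:

```yaml
# A minimal sketch: a namespaced SecretStore pointing at AWS Secrets Manager,
# and an ExternalSecret that materializes a native Secret from it.
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-store
  namespace: payments
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: aws-store
  target:
    name: db-credentials        # the native Secret ESO creates
  data:
    - secretKey: password       # key in the Kubernetes Secret
      remoteRef:
        key: payments/db        # key in Secrets Manager
        property: password      # JSON field within that secret
```

The workload then mounts or references db-credentials exactly like a hand-created Secret.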
SecretStore vs ClusterSecretStore
The first design decision most teams get slightly wrong is the choice between SecretStore (namespaced) and ClusterSecretStore (cluster-wide). ClusterSecretStore is tempting because it lets you define the backend once and reference it from any namespace. It is also a blast radius hazard. A misconfigured ExternalSecret in a dev namespace can, under the wrong RBAC, pull production data from the same ClusterSecretStore.
The pattern I recommend is SecretStore per namespace for anything bound to a specific team or environment, and ClusterSecretStore only for genuinely cluster-wide plumbing (cluster telemetry credentials, ingress controller certs, and the like). When you do use ClusterSecretStore, scope the backend identity tightly: an IAM role that can only read a specific prefix in Secrets Manager, a Vault role that only allows reads from a namespaced path, etc.
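Recent ESO releases also let a ClusterSecretStore declare conditions restricting which namespaces may reference it, which softens the blast radius problem. A hedged sketch, assuming your ESO version supports conditions; the label selector and role ARN are placeholders:

```yaml
# If your ESO version supports ClusterSecretStore conditions, restrict which
# namespaces may reference the store, and scope the assumed role tightly.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: platform-certs
spec:
  conditions:
    - namespaceSelector:
        matchLabels:
          tier: platform        # only namespaces with this label may use the store
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      role: arn:aws:iam::111122223333:role/eso-platform-read  # read-only, prefix-scoped role
```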
Authentication patterns that scale
The authentication model is where ESO becomes either a joy or a disaster. Three patterns hold up in production:
IRSA (AWS) or Workload Identity (GCP, Azure) for the controller itself. The ESO controller's service account assumes a cloud identity that has read access to the secret backend. SecretStore resources can then authenticate using that pod identity without storing any long-lived credentials. This is the cleanest model if your cluster is in a single account.
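On EKS, the IRSA wiring is a single annotation on the controller's ServiceAccount. A sketch assuming the Helm chart's typical default names; the account ID and role name are placeholders:

```yaml
# Annotate the ESO controller's ServiceAccount with the IAM role to assume.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-secrets
  namespace: external-secrets
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/eso-secrets-read
```

With this in place, an AWS SecretStore that omits the auth block falls back to the controller's own credential chain, which resolves to the IRSA role.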
Per-namespace service account impersonation. For multi-tenant clusters, bind each SecretStore to a specific service account via the serviceAccountRef. The controller impersonates that service account when fetching secrets, so you get namespace-level IAM enforcement rather than cluster-level trust.
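A sketch of that binding for the AWS provider; the names are illustrative, and the referenced service account is assumed to carry the namespace's own IAM role annotation:

```yaml
# Per-namespace identity: ESO exchanges this SA's token for the IAM role
# annotated on it, so team-a can only read what team-a's role allows.
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: team-a-store
  namespace: team-a
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: team-a-secrets    # SA annotated with team-a's IAM role
```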
Vault Kubernetes auth. If you are using Vault, the Kubernetes auth method lets pods authenticate using their service account JWT. ESO supports this natively. The trick is to scope the Vault role to the specific service account, namespace, and audiences that the ExternalSecret will use. Wildcarding here defeats the whole model.
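A minimal sketch of a Vault-backed SecretStore using Kubernetes auth; the server address, mount path, role, and service account names are all illustrative:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-store
  namespace: team-a
spec:
  provider:
    vault:
      server: https://vault.internal:8200
      path: secret                 # KV mount
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes    # Vault auth method mount
          role: team-a-reader      # Vault role scoped to this SA and namespace
          serviceAccountRef:
            name: team-a-secrets
```

On the Vault side, bind the role to the specific service account and namespace via bound_service_account_names and bound_service_account_namespaces rather than wildcards.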
Avoid the pattern where the SecretStore holds a static token in a Kubernetes Secret. It is supported, and it is a trap. You now have a bootstrap secret problem sitting in etcd, and rotating it requires coordination that nobody does.
Rotation and refresh intervals
ESO does not actually rotate secrets. It reflects the upstream. Rotation is the backend's responsibility, whether that is Secrets Manager's managed rotation, Vault's dynamic credentials, or a human-driven process.
What ESO controls is refresh cadence, via the refreshInterval field. The instinct is to set this low, but there are two reasons to be careful. First, every reconcile is an API call to the backend; across a large cluster with hundreds of ExternalSecret resources, aggressive refresh intervals produce noticeable cost and rate-limit exposure. Second, a short refresh does not give you faster rotation, because workloads do not automatically pick up changes: environment variables are fixed at pod start, and while the kubelet eventually updates mounted Secret volumes, most applications never re-read them. You need a reload mechanism on top: reloader-style annotations, sidecars that watch the filesystem, or application-level refresh.
In practice, one hour is a reasonable default refresh interval for most secrets, with lower values only for short-lived credentials that workloads are designed to consume. Pair ESO with a controller like Stakater Reloader if you need pods to actually pick up new values without human intervention.
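Wired together, that looks something like the following sketch; the Deployment is illustrative, and the annotation shown is Reloader's opt-in-everything mode, which restarts a workload whenever a Secret or ConfigMap it references changes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: payments
  annotations:
    reloader.stakater.com/auto: "true"  # Reloader rolls the pods when referenced Secrets change
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.0     # illustrative image
          envFrom:
            - secretRef:
                name: db-credentials     # the Secret ESO materializes
```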
Templating and composition
One of ESO's most useful features is the template system. An ExternalSecret can pull several values from different backend locations and compose them into a single Kubernetes Secret with fields formatted exactly how the consuming application expects. This is especially useful for legacy applications that expect a specific config file format or an environment variable concatenation.
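For example, a sketch that composes a username and password into the DSN format a legacy application expects; the names and DSN shape are illustrative:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-config
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: aws-store
  target:
    name: app-config
    template:
      engineVersion: v2
      data:
        # Pure formatting: compose two fetched values into the expected DSN.
        dsn: "postgres://{{ .username }}:{{ .password }}@db.internal:5432/payments"
  data:
    - secretKey: username
      remoteRef:
        key: payments/db
        property: username
    - secretKey: password
      remoteRef:
        key: payments/db
        property: password
```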
The temptation is to push business logic into these templates. Resist it. Templates should be formatting, not decision-making. If you find yourself writing conditional Sprig templates, you have drifted into logic that belongs in an init container or the application itself. Complex templates are hard to review, hard to test, and hard to debug when they fail at reconcile time.
Multi-cluster and multi-region
For organizations running fleets of clusters, the operational question is whether each cluster's ESO speaks directly to a single regional backend, or whether you federate. I have seen both work.
The direct model is simpler: each cluster's ESO has its own workload identity, each SecretStore points at the regional vault, and failover is handled by duplicating secrets across regions. The federated model uses a single source of truth, with regional replicas synced by the backend itself (Secrets Manager replication, Vault performance replication), and each cluster's ESO reads from its local replica.
Federated is usually the right answer for large enterprises because it keeps the source of truth singular and the blast radius of a compromised cluster local. Direct is fine for smaller footprints.
Observability is non-negotiable
ESO exposes Prometheus metrics for reconcile success, failure, and latency. Scrape them. The useful dashboards are reconcile error rates by SecretStore (spikes usually mean backend auth problems), secret age distribution (to catch stalled reconciles), and reconcile latency (to catch backend throttling). Alert on reconcile failures at the SecretStore level, because a single failing SecretStore can silently block hundreds of ExternalSecrets from updating.
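As a starting point, something like this PrometheusRule works; metric names vary by ESO version, so verify them against your controller's /metrics endpoint before relying on this sketch:

```yaml
# A hedged sketch of an alert on reconcile failures. Check the metric name
# against your ESO version before deploying.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eso-alerts
  namespace: external-secrets
spec:
  groups:
    - name: external-secrets
      rules:
        - alert: ExternalSecretSyncErrors
          expr: sum(rate(externalsecret_sync_calls_error[10m])) by (namespace) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "ExternalSecret reconciles failing in {{ $labels.namespace }}"
```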
Log aggregation matters too. ESO's controller logs include the backend error details, which are usually the only way to diagnose authentication and authorization issues. Forward them to your central logging system with appropriate retention.
Failure modes to design for
A few real-world failure modes are worth designing against. Backend outages cause reconciles to fail, but existing Secrets keep working because Kubernetes does not delete them on reconcile failure; that is a feature for availability and a risk for secret freshness. Backend throttling at scale can stall large fleets, so stagger your refresh intervals to avoid thundering-herd behavior. A compromised controller is a critical incident because it can read every secret in every configured backend, so treat the ESO namespace and its RBAC as tier-zero infrastructure.
How Safeguard helps
Safeguard complements External Secrets Operator by giving platform teams visibility into which workloads are consuming which secrets across every cluster, tracking rotation freshness against policy, and correlating secret backends into the broader software supply chain inventory. When a new CVE lands on a library that handles credential material, Safeguard identifies the affected services without waiting for manual triage. Multi-cluster ESO deployments get a unified view of SecretStore health, misconfigurations, and exposure, which is where most production incidents actually originate.