GCP Secret Manager Rotation Strategy

A workable rotation strategy for GCP Secret Manager: how to structure secret versions, schedule rotation, coordinate consumers, and avoid the outage patterns that scare teams off rotation in the first place.

Nayan Dey
Senior Security Engineer
7 min read

Most organizations I work with have Secret Manager adoption figured out. The secrets are in there, the IAM is mostly right, and the applications are reading via client libraries rather than shelling out to gcloud secrets versions access. What they do not have figured out is rotation. The rotation cadence ends up being "when the auditor asks," which means the database password is four years old and nobody remembers who generated it.

This post is the playbook I wrote after running two quarters of rotation work on a GCP organization with about 4,200 secrets spread across 70 projects. It is not glamorous, but these are the patterns that worked.

Understand what Secret Manager gives you

Secret Manager separates the secret, which is the logical container, from the secret version, which holds the actual material. Versions are immutable. You add a new version, optionally disable or destroy older versions, and consumers choose which version to read. That structure makes rotation mechanically safe, because you never overwrite the current secret; you add the next one and migrate consumers across.
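To make the secret and version split concrete, here is a minimal sketch of adding the next version without touching the current one. The client is passed in rather than constructed, on the assumption that it exposes add_secret_version(request=...) the way the google-cloud-secret-manager Python client does; the helper names, project ID, and secret ID are illustrative and not from the original playbook.

```python
def version_path(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the resource name a consumer reads; "latest" is the floating alias."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def add_next_version(client, project_id: str, secret_id: str, payload: bytes) -> str:
    """Add a new version next to the existing ones and return its resource name.

    Nothing is overwritten: earlier versions stay readable until they are
    disabled or destroyed in a separate step.

    Assumption: `client` behaves like
    google.cloud.secretmanager.SecretManagerServiceClient.
    """
    parent = f"projects/{project_id}/secrets/{secret_id}"
    response = client.add_secret_version(
        request={"parent": parent, "payload": {"data": payload}}
    )
    return response.name
```

Passing the client in keeps the function easy to exercise with a fake in tests, and it makes the point of the data model visible: rotation is an append, and the consumer decides whether to follow latest or a specific version number.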

Each version is in one of three states: enabled, disabled, or destroyed. Destruction is permanent. Disabling a version takes it out of circulation without losing the material, which is the control you use during rollback. I set versionDestroyTtl to seven days on every secret, so that a destroy call disables the version and schedules its destruction rather than executing it immediately. This has saved more than one engineer from a fat-fingered cleanup command.

Rotation policies on secrets, available since September 2022, let you schedule a Pub/Sub notification at a fixed interval. The notification does not rotate the secret; it just tells you it is time. The actual rotation is your responsibility, which is the part most teams under-invest in.

Classify secrets before you rotate them

Not every secret rotates the same way. Before writing a rotation runbook, classify your secrets into four buckets, because each bucket has a different failure mode.

The first bucket is self-issued credentials: database passwords you set, API keys you generate, tokens you issue from your own identity service. These are the easy case, because you control both the issuance and the consumer side.

The second is externally issued credentials: third-party API keys, OAuth client secrets, vendor-specific tokens. Rotation here requires coordinating with the external issuer and typically has to go through a vendor portal, which is fiddly to automate.

The third is long-lived asymmetric material: signing keys, TLS private keys, SSH host keys. Rotation here is more about key rollover than secret rotation, and the overlap window matters more than for symmetric material.

The fourth is derivative secrets: JWT signing keys, session encryption keys, KMS-wrapped data keys. Rotation here interacts with data-at-rest encryption, so you usually need a dual-write or decrypt-and-re-encrypt strategy.

Do not build one rotation pipeline for all four buckets. You will end up with a pipeline that does the first case poorly and the others not at all. I build one pipeline per bucket and share only the orchestration layer.
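One way to picture "one pipeline per bucket, shared orchestration" is a small registry keyed by classification. This is only a sketch: the enum values, function names, and return strings are hypothetical, and the pipeline bodies are stubs standing in for real rotation logic.

```python
from enum import Enum
from typing import Callable, Dict


class SecretClass(Enum):
    SELF_ISSUED = "self_issued"
    EXTERNALLY_ISSUED = "externally_issued"
    ASYMMETRIC = "asymmetric"
    DERIVATIVE = "derivative"


Pipeline = Callable[[str], str]
_PIPELINES: Dict[SecretClass, Pipeline] = {}


def pipeline_for(secret_class: SecretClass) -> Callable[[Pipeline], Pipeline]:
    """Register a rotation pipeline for exactly one bucket."""
    def register(fn: Pipeline) -> Pipeline:
        _PIPELINES[secret_class] = fn
        return fn
    return register


@pipeline_for(SecretClass.SELF_ISSUED)
def rotate_self_issued(secret_id: str) -> str:
    return f"overlap rotation started for {secret_id}"


@pipeline_for(SecretClass.EXTERNALLY_ISSUED)
def rotate_externally_issued(secret_id: str) -> str:
    return f"runbook ticket opened for {secret_id}"


def rotate(secret_id: str, secret_class: SecretClass) -> str:
    """Shared orchestration: look up the bucket's pipeline and run it."""
    pipeline = _PIPELINES.get(secret_class)
    if pipeline is None:
        # Fail loudly instead of falling back to a pipeline built for another bucket.
        raise NotImplementedError(f"no pipeline registered for {secret_class.value}")
    return pipeline(secret_id)
```

The deliberate choice here is that an unregistered bucket raises instead of reusing the self-issued path, which is the failure the paragraph above warns about.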

The overlap pattern for symmetric secrets

The safe rotation pattern for self-issued symmetric secrets is the overlap pattern. At time T, you add version N+1 with a new value. At time T+X, you flip consumers from reading version N to reading version N+1. At time T+Y, you disable version N. At time T+Z, you destroy it.

The time windows matter. X should be long enough for deployed consumers to pick up the new version, which means it should exceed the longest credential-refresh interval in your application fleet. For a service that caches the secret for 10 minutes, X should be at least an hour. For a service that reads the secret once at process start and caches for the lifetime of the process, X has to be long enough to cover a rolling restart, which for a slow-rolling GKE deployment can mean a full day.

Y should be long enough to catch any consumer that missed the refresh, typically another 24 to 48 hours. Z should be seven days, which matches the versionDestroyTtl I mentioned earlier and gives you a rollback window.
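The windows can be written down as a small planning function. The sketch below assumes each window is measured from the previous step, and it uses a safety factor of 6 over the longest refresh interval so that a 10-minute cache yields a one-hour X; that factor, the 48-hour default for Y, and all names are illustrative assumptions rather than fixed rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class OverlapSchedule:
    add_new: datetime         # T: version N+1 is added
    flip_consumers: datetime  # T+X: consumers move to N+1
    disable_old: datetime     # then Y later: version N is disabled
    destroy_old: datetime     # then Z later: version N is destroyed


def plan_overlap(
    start: datetime,
    longest_refresh: timedelta,
    safety_factor: int = 6,
    grace: timedelta = timedelta(hours=48),
    rollback: timedelta = timedelta(days=7),
) -> OverlapSchedule:
    """Derive the rotation timeline from the slowest consumer's refresh interval."""
    if longest_refresh <= timedelta(0):
        raise ValueError("longest_refresh must be positive")
    flip = start + longest_refresh * safety_factor
    disable = flip + grace
    destroy = disable + rollback
    return OverlapSchedule(start, flip, disable, destroy)
```

For a fleet whose slowest consumer only reloads on restart, pass the duration of a full rolling restart as longest_refresh and lower safety_factor accordingly.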

Cloud Run and Cloud Functions make this pattern easier than it used to be, because a secret reference, whether exposed as an environment variable or mounted as a volume, can point at the latest alias or at a specific version. For services that can tolerate a floating reference, I point at latest and accept that the consumer will pick up the new version the next time it resolves the secret: on read for a mounted volume, or at instance start for an environment variable. For services that cannot, I pin to a specific version and bake the version bump into the deploy.

The dual-key pattern for signing material

For signing keys and other asymmetric material, the overlap pattern is not enough on its own, because verifiers have to accept both the old and new keys during the overlap. The pattern I use is a dual-key pattern: the producer signs with key N, then signs with both N and N+1 during the overlap, then signs with only N+1. Verifiers trust both keys throughout the overlap window, which means key N+1 has to reach every verifier before the producer starts relying on it.

This requires application-level support, which is the honest reason most teams do not rotate signing keys. Building the dual-key capability into your auth library is a one-time investment that pays for itself the first time you have to rotate after a suspected compromise.
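The shape of that dual-key capability can be sketched with the standard library. To keep the example self-contained it uses HMAC-SHA256 as a stand-in for a real asymmetric signature scheme, so it shows the key-ID bookkeeping and not the cryptography you would ship; the key IDs and function names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple

Signature = Tuple[str, str]  # (key id, hex digest)


def sign(message: bytes, keys: Dict[str, bytes], active_key_ids: List[str]) -> List[Signature]:
    """Sign with every active key: [N], then [N, N+1] during overlap, then [N+1]."""
    return [
        (kid, hmac.new(keys[kid], message, hashlib.sha256).hexdigest())
        for kid in active_key_ids
    ]


def verify(message: bytes, signatures: List[Signature], trusted_keys: Dict[str, bytes]) -> bool:
    """Accept the message if any signature checks out against a trusted key."""
    for kid, digest in signatures:
        key = trusted_keys.get(kid)
        if key is None:
            continue  # Unknown key id: skip it instead of failing the whole message.
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, digest):
            return True
    return False
```

A verifier that trusts both keys accepts all three producer phases, while a verifier that was never given N+1 rejects the final phase. That second case is the outage to avoid, and it is why key distribution to verifiers has to come first.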

Automate the notification, not the rotation

I used to think full end-to-end automation was the goal. After two quarters of running rotations, I think the right target is a reliable notification pipeline and a well-scoped rotation runbook.

The reason is that rotation failures are expensive, and the marginal cost of manual rotation for a well-scoped secret is low. A 30-minute runbook that an engineer runs quarterly is cheaper than the incident that happens when the automation misfires during a holiday weekend. The Secret Manager rotation policy publishes to Pub/Sub, a Cloud Function consumes the notification and creates a Jira ticket with the runbook attached, and the engineer executes the runbook.
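The notification-to-ticket step might look roughly like this. It is a sketch of the translation logic only: it assumes the Pub/Sub message carries eventType, secretId, and timestamp attributes with SECRET_ROTATE as the rotation event, and the runbook URLs and ticket fields are hypothetical. The actual call to a ticketing system is left out.

```python
from typing import Mapping, Optional

# Hypothetical fallback; replace with wherever your runbooks actually live.
DEFAULT_RUNBOOK = "https://runbooks.example.com/secret-rotation/generic"


def ticket_from_notification(
    attributes: Mapping[str, str],
    runbooks: Mapping[str, str],
) -> Optional[dict]:
    """Turn a rotation notification into ticket fields, or None for other events."""
    if attributes.get("eventType") != "SECRET_ROTATE":
        return None  # Ignore version-added, version-destroyed, and similar events.
    secret_name = attributes.get("secretId", "")
    short_name = secret_name.rsplit("/", 1)[-1]
    return {
        "summary": f"Rotate secret {short_name}",
        "secret": secret_name,
        "runbook": runbooks.get(short_name, DEFAULT_RUNBOOK),
        "requested_at": attributes.get("timestamp"),
    }
```

Keeping this as a pure function separates the part worth unit testing (which events produce tickets, and with which runbook) from the part that talks to Pub/Sub and the ticketing API.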

For the high-volume, well-understood cases, I do automate. Database passwords for Cloud SQL, for example, are a natural automation target because the API surface is stable and the rollback is clean. For those, the Cloud Function rotates the password in Cloud SQL, writes the new version to Secret Manager, and waits for the consumers to refresh before disabling the old version.

IAM and audit, because rotation is a privileged operation

The service account that performs rotation has roles/secretmanager.secretVersionAdder and roles/secretmanager.secretVersionManager on the secrets it manages, scoped with IAM conditions to the specific secret resource names. It does not have roles/secretmanager.admin, which would let it delete secrets entirely, and it does not have access to read secrets it is rotating unless it needs to, which for most cases it does not.

Audit the rotation service account's actions through Cloud Audit Logs. The events worth alerting on are AddSecretVersion on a secret outside its expected rotation window, DestroySecretVersion ever (I have a standing alert for this), and any SetIamPolicy on a secret resource. These catch both rotation misfires and deliberate tampering.
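Those three alert conditions can be expressed as a small classifier over audit log entries. The sketch assumes each entry is a dict with a protoPayload containing methodName and resourceName, and it matches on the last segment of the method name so it does not depend on the exact service prefix. The in_rotation_window callable is hypothetical; it stands in for however you record each secret's expected rotation window.

```python
from typing import Callable, Mapping, Optional


def alert_reason(
    entry: Mapping,
    in_rotation_window: Callable[[str], bool],
) -> Optional[str]:
    """Return why an audit log entry should alert, or None if it looks routine."""
    payload = entry.get("protoPayload", {})
    method = payload.get("methodName", "").rsplit(".", 1)[-1]
    resource = payload.get("resourceName", "")

    if method == "DestroySecretVersion":
        return f"version destroyed: {resource}"  # Standing alert, no exceptions.
    if method == "SetIamPolicy":
        return f"IAM policy changed: {resource}"
    if method == "AddSecretVersion" and not in_rotation_window(resource):
        return f"version added outside rotation window: {resource}"
    return None
```

In practice the same logic would more likely live in a log-based alert filter; writing it as code first makes the rules easy to review and test.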

Finally, track rotation as a metric, not just a ticket. I publish a custom metric secret_manager/days_since_last_rotation per secret to Cloud Monitoring, and alert when any secret exceeds its target interval. The dashboard showing the distribution of secret ages is the single most effective way I have found to keep rotation discipline over time.
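The value behind that metric is simple to compute. This sketch assumes "last rotation" means the creation time of the newest version and that you have already listed the version creation times as timezone-aware datetimes; the 90-day default target and the function names are illustrative. Publishing the number to Cloud Monitoring is not shown.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional


def days_since_last_rotation(
    version_create_times: Iterable[datetime],
    now: Optional[datetime] = None,
) -> float:
    """Age in days of the newest version, treated here as the last rotation."""
    times = list(version_create_times)
    if not times:
        raise ValueError("secret has no versions")
    now = now or datetime.now(timezone.utc)
    return (now - max(times)).total_seconds() / 86400


def overdue(
    ages_days: Dict[str, float],
    targets_days: Dict[str, float],
    default_target: float = 90,
) -> List[str]:
    """Names of secrets whose age exceeds their target interval, sorted."""
    return sorted(
        name
        for name, age in ages_days.items()
        if age > targets_days.get(name, default_target)
    )
```

Per-secret targets matter because the four classification buckets rarely share one interval; the default only catches secrets nobody has classified yet.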

How Safeguard Helps

Safeguard inventories every Secret Manager secret across your GCP organization, scores its rotation hygiene against your policy, and flags secrets whose age exceeds the target interval for their classification. Rotation events, destruction requests, and IAM policy changes on secrets are correlated with the services that consume them, so you can see the blast radius of a rotation before you run it. When a secret is used by a Cloud Run revision, a Cloud Function, or a GKE workload, Safeguard maps the dependency and tracks whether the consumer has refreshed to the new version. The result is a rotation programme that stays honest without requiring a quarterly audit to keep it that way.
