Oncall Rotation Design For Modern SecOps

Oncall rotations break for SecOps because the work is asynchronous and the alerts are noisy. Here is a rotation design that respects both, with the tooling to back it up.

SecOps oncall rotations are usually copied from the SRE handbook. That copy fails for a reason that should be obvious in retrospect. SRE oncall responds to outages, which are synchronous and ephemeral. SecOps oncall responds to advisories and findings, which are asynchronous and durable. Run an SRE rotation against SecOps work, and within a quarter you will have an exhausted team, missed advisories, and a handoff document that nobody reads.

This article describes a rotation design that fits the actual shape of SecOps work. It is shorter than the oncall rotations you are used to, its handoff is structured, and its noise budget is enforced.

Start with the work, not the schedule

Before designing the rotation, take a week and categorize every interrupt the team handled. Sort each into one of four buckets. Critical advisories that needed action within the day. Routine findings that needed triage but not same-day action. Customer-facing requests for evidence or attestation. And operational tickets like access reviews or scan failures.

The shape of those buckets determines the shape of the rotation. If the critical advisory bucket is small, you do not need twenty-four-by-seven coverage. If the routine findings bucket is large, you need a primary who has uninterrupted blocks of time, not someone wedged between meetings. If the customer-facing bucket is large, you need a secondary who can answer questions without waking up the primary.

Most teams I have worked with discover that critical advisories make up less than ten percent of their oncall work, and routine findings make up more than half. That should drive the rotation design more than any external standard.
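
A minimal sketch of the week-long tally, assuming each interrupt has already been logged with one of four bucket labels. The label names and the sample counts are illustrative assumptions, not data from a real team.

```python
from collections import Counter

# Hypothetical labels for the four buckets described above.
BUCKETS = (
    "critical_advisory",
    "routine_finding",
    "customer_request",
    "operational_ticket",
)


def bucket_shares(interrupts):
    """Return each bucket's share of the logged interrupts as a fraction."""
    counts = Counter(interrupts)
    unknown = set(counts) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unrecognized buckets: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in BUCKETS}
    return {bucket: counts[bucket] / total for bucket in BUCKETS}


# Illustrative week of 25 interrupts (made-up numbers).
week = (
    ["critical_advisory"] * 2
    + ["routine_finding"] * 14
    + ["customer_request"] * 5
    + ["operational_ticket"] * 4
)
shares = bucket_shares(week)
for bucket, share in shares.items():
    print(f"{bucket:20s} {share:.0%}")
```

If the critical share comes out small and the routine share large, as it does in this made-up week, that is the signal to favor uninterrupted blocks for the primary over round-the-clock coverage.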

The two-tier rotation

A two-tier rotation works for most SecOps programs. The primary handles critical advisories and any active incident. The secondary handles routine findings, customer-facing requests, and operational tickets. The split exists so that a single noisy week does not destroy both engineers in the rotation.

Both tiers run on the same schedule, typically a week long. The handoff happens on a fixed day, ideally not Monday, because Monday handoffs collide with weekly planning meetings and the new oncall starts behind. Wednesday is a better choice. The previous week is fresh, the calendar is open, and the new oncall has two business days to settle in before the next weekend.

A week is long enough to see patterns and short enough that nobody dreads it. Anything shorter and the handoffs dominate. Anything longer and the secondary tier becomes a part-time job that nobody wants.
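
The schedule can be sketched as a generator that hands off every Wednesday. Two choices here are assumptions, not part of the design above: the secondary is the next engineer in the roster, and the roster names are placeholders. The backup column follows the rule described under coverage gaps, where the backup is the previous week's primary.

```python
from datetime import date, timedelta
from itertools import islice

WEDNESDAY = 2  # date.weekday() counts Monday as 0


def next_wednesday(start):
    """Return the first Wednesday on or after start."""
    return start + timedelta(days=(WEDNESDAY - start.weekday()) % 7)


def rotation(engineers, start):
    """Yield one dict per week: handoff date, primary, secondary, backup.

    The primary cycles through the roster. The secondary is the next
    engineer in line (an assumption). The backup is the engineer who was
    primary the week before; in the very first week that slot simply
    falls to the last engineer in the roster.
    """
    n = len(engineers)
    if n < 3:
        raise ValueError("need at least three engineers to keep the roles distinct")
    handoff = next_wednesday(start)
    week = 0
    while True:
        yield {
            "handoff": handoff,
            "primary": engineers[week % n],
            "secondary": engineers[(week + 1) % n],
            "backup": engineers[(week - 1) % n],
        }
        handoff += timedelta(weeks=1)
        week += 1


# Placeholder roster; 2025-01-06 is a Monday, so the first handoff is 2025-01-08.
schedule = list(islice(rotation(["ana", "ben", "chi"], date(2025, 1, 6)), 3))
for slot in schedule:
    print(slot["handoff"].isoformat(), slot["primary"], slot["secondary"], slot["backup"])
```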

The noise budget

The hardest part of SecOps oncall is not the volume of alerts. It is the quality. A rotation that sees fifty alerts a week with twenty true positives (forty percent signal) is healthier than a rotation that sees five alerts a week with one true positive (twenty percent signal), because the latter trains the engineer to ignore everything.

The noise budget is the maximum number of false positives the rotation tolerates per week before something has to change. Set the budget at five for the primary tier. If the budget is exceeded, the next sprint must include at least one detection-tuning ticket. The ticket is not optional and does not get deferred.

Track the budget with safeguard_list_findings filtered to the oncall window, then subtract the findings that resulted in a code change, a policy update, or an explicit accepted-risk decision. Anything left is noise. If that count is over five, the rotation is being trained to ignore alerts, and the team is one bad week from missing a real one.
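
The subtraction is simple enough to sketch. The example below assumes the findings for the oncall window have already been fetched and flattened into dictionaries with an "outcome" field. That field and its values are hypothetical, and nothing here reflects the real output shape of safeguard_list_findings.

```python
NOISE_BUDGET = 5  # false positives tolerated per week on the primary tier

# Hypothetical outcome labels meaning a finding led to real work or a decision.
ACTIONED = {"code_change", "policy_update", "accepted_risk"}


def noise_count(findings):
    """Count findings in the window that produced no actioned outcome."""
    return sum(1 for finding in findings if finding.get("outcome") not in ACTIONED)


def needs_tuning_ticket(findings, budget=NOISE_BUDGET):
    """True when noise exceeds the budget, so the next sprint owes a tuning ticket."""
    return noise_count(findings) > budget


# Illustrative week: four actioned findings and seven with no recorded outcome.
findings = (
    [{"id": i, "outcome": "code_change"} for i in range(3)]
    + [{"id": 3, "outcome": "accepted_risk"}]
    + [{"id": i, "outcome": None} for i in range(4, 11)]
)
print(noise_count(findings), needs_tuning_ticket(findings))
```

Findings with a missing or unrecognized outcome count as noise in this sketch. That is deliberate: an unrecorded decision is indistinguishable from an ignored alert.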

The handoff

The handoff is the document that says what the next oncall needs to know. Most handoffs fail because they read like a status update instead of a state transfer. The next oncall does not need to know what you did. They need to know what is still open, what is partially handled, and what they are about to walk into.

Structure the handoff in three sections. Open advisories with current status and next action. Active customer requests with deadline and owner. And outstanding follow-ups from the previous week, including any that the outgoing oncall could not finish.

The handoff happens live, not async. Live means a fifteen-minute call, screen-shared, walking through each open item. The outgoing oncall does not get to log off until the incoming oncall has acknowledged each item. This sounds heavy. It is the cheapest insurance you can buy against missed advisories.

Pull the open items list from safeguard_list_tasks filtered to the oncall window with a status of in-progress. The list should be the spine of the handoff document. Anything missing from that list is something the outgoing oncall forgot to file as a task, which is itself a process bug.
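
One way to keep the handoff a state transfer is to render it from the task list rather than write it by hand. The sketch below assumes the open tasks have already been exported into a simple Task record. The record's fields and the sample tasks are hypothetical, not the shape that safeguard_list_tasks returns.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# The three handoff sections, in the order they are walked through on the call.
SECTIONS = {
    "advisory": "Open advisories",
    "customer_request": "Active customer requests",
    "follow_up": "Outstanding follow-ups",
}


@dataclass
class Task:
    kind: str  # one of the SECTIONS keys
    title: str
    status: str
    next_action: str
    owner: str
    deadline: Optional[date] = None


def render_handoff(tasks):
    """Render a plain-text handoff with one checkbox per open item."""
    unknown = {t.kind for t in tasks} - set(SECTIONS)
    if unknown:
        raise ValueError(f"unrecognized task kinds: {sorted(unknown)}")
    lines = []
    for kind, heading in SECTIONS.items():
        lines.append(heading)
        items = [t for t in tasks if t.kind == kind]
        if not items:
            lines.append("  (none)")
        for t in items:
            due = f", due {t.deadline.isoformat()}" if t.deadline else ""
            lines.append(
                f"  [ ] {t.title}: {t.status}; next: {t.next_action}; owner: {t.owner}{due}"
            )
        lines.append("")
    return "\n".join(lines).rstrip()


# Placeholder tasks for illustration only.
tasks = [
    Task("advisory", "Example advisory A", "patch in review",
         "merge and redeploy", "incoming primary"),
    Task("customer_request", "Example evidence request", "draft sent",
         "attach scan report", "incoming secondary", deadline=date(2025, 1, 15)),
]
doc = render_handoff(tasks)
print(doc)
```

The empty checkbox on each line is there for the live call: the incoming oncall ticks an item only after acknowledging it.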

Coverage gaps

Coverage gaps are the times when the rotation has nobody on it. They happen when somebody calls in sick, takes a last-minute vacation, or leaves the company. They are inevitable. The question is whether the rotation handles them gracefully or whether the gap turns into a missed advisory.

Three rules close most gaps. First, the rotation always has a named backup, who is the previous week's primary. The backup is not on call, but they are reachable, and they have full context from a fresh handoff. Second, the rotation has a documented escalation path that does not require finding the manager on Slack. Third, the rotation tooling auto-escalates any advisory that goes unacknowledged, so that an unmanned hour does not turn into a silent breach.

The escalation path matters more than the backup roster, because backups can themselves be unavailable. The escalation path should end at a named individual, not a group. Groups diffuse responsibility. Names focus it.
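
The rule that a path must end at a name can be checked mechanically. This is a sketch under assumptions: the Contact record, the reachability callback, and the sample names are all hypothetical, and falling back to the final contact when nobody answers is one possible policy, not the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    is_group: bool = False


def validate_escalation_path(path):
    """Raise ValueError unless the path is non-empty and ends at an individual."""
    if not path:
        raise ValueError("escalation path is empty")
    if path[-1].is_group:
        raise ValueError(
            f"path ends at group {path[-1].name!r}; it must end at a named individual"
        )


def first_reachable(path, is_reachable):
    """Return the first contact the reachability check accepts.

    If nobody is reachable, fall back to the final contact, who is
    guaranteed by validation to be a named individual.
    """
    validate_escalation_path(path)
    for contact in path:
        if is_reachable(contact):
            return contact
    return path[-1]


# Placeholder path: backup first, then a team alias, then a named owner.
path = [
    Contact("previous-week primary"),
    Contact("secops-team", is_group=True),
    Contact("named escalation owner"),
]
owner = first_reachable(path, lambda contact: False)
print(owner.name)
```

Running the validation when the roster changes, rather than during an incident, is what keeps a group alias from quietly becoming the last stop.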

Health metrics for the rotation itself

The rotation is a system. Like any system, it has health metrics. Track three. The number of advisories handled per week per primary. The number of pages received outside business hours. And the number of handoffs that took longer than thirty minutes. The first measures load. The second measures sleep. The third measures complexity.

A rotation where a primary handles more than ten advisories a week is overloaded, and the team should spend the next quarter reducing the surface area or hiring. A rotation where pages outside business hours exceed two per week is producing burnout, and the alerting policy needs review. A rotation where handoffs take longer than thirty minutes is accumulating undocumented state, and the runbook needs to absorb the missing pieces.
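
The three thresholds above translate directly into a weekly check. The record type and field names below are assumptions for illustration; the limits themselves are the ones stated in the text.

```python
from dataclasses import dataclass

MAX_ADVISORIES_PER_WEEK = 10
MAX_AFTER_HOURS_PAGES = 2
MAX_HANDOFF_MINUTES = 30


@dataclass
class WeekStats:
    advisories_handled: int  # by the primary, this week
    after_hours_pages: int   # pages received outside business hours
    handoff_minutes: int     # length of the handoff call


def health_flags(stats):
    """Return one message per threshold the week exceeded."""
    flags = []
    if stats.advisories_handled > MAX_ADVISORIES_PER_WEEK:
        flags.append("overloaded: reduce surface area or hire")
    if stats.after_hours_pages > MAX_AFTER_HOURS_PAGES:
        flags.append("burnout risk: review the alerting policy")
    if stats.handoff_minutes > MAX_HANDOFF_MINUTES:
        flags.append("undocumented state: fold the missing pieces into the runbook")
    return flags


# Illustrative week that trips the load and handoff thresholds.
flags = health_flags(WeekStats(advisories_handled=12, after_hours_pages=1, handoff_minutes=45))
for flag in flags:
    print(flag)
```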

A rotation that runs for a year without changing is not stable. It is calcifying. Schedule a rotation retrospective every quarter, change exactly one thing each time, and watch the health metrics for two cycles before changing the next. The rotation that survives is the one that evolves.
