DevSecOps

Buildkite Supply Chain Hardening

A practical hardening guide for Buildkite: agent isolation, pipeline upload security, plugin risks, and the agent-token rotation strategy that keeps the trust model intact.

Nayan Dey
Senior Security Engineer
8 min read

Buildkite has a distinctive place in the CI landscape. The server side is a hosted SaaS, but the agents run in your own infrastructure, so you get the operational simplicity of a managed control plane with the security model of self-hosted compute. That split model has real advantages — you never have to run the Buildkite server, and sensitive code never leaves your network — but it also creates hardening responsibilities that differ from both fully-SaaS and fully-self-hosted CI.

This post is about the supply chain hardening surface for Buildkite agents, as it stands in agent v3.85 and v3.86. It covers agent isolation, the pipeline-upload security model, the plugin ecosystem, and the agent-token rotation strategy that is central to the Buildkite trust model. It is written for teams running Buildkite in production, and it assumes some familiarity with the product.

The agent model and its isolation implications

A Buildkite agent is a Go binary that polls the Buildkite API for jobs, downloads the job's pipeline definition, and executes the steps on the host. The agent runs as a long-lived process, typically as a systemd service or a Kubernetes deployment, and it has persistent access to whatever the host's user account can do. Every job the agent runs executes with those permissions.

This is different from the container-per-job model that Drone and Tekton use. A Buildkite agent by default executes build steps as shell commands on the agent host, not in isolated containers. Job-to-job isolation is a property of the agent configuration, not of the agent binary itself. Two jobs running on the same agent share the same filesystem, the same environment, and the same access to whatever the agent user can do.

The hardening implications are significant. For any multi-tenant Buildkite deployment — where more than one team's pipelines run on the same agent pool — the isolation model needs to be designed explicitly. The common patterns are dedicated agent pools per team (operationally expensive but simple), ephemeral agents via the Buildkite Elastic CI Stack on AWS (each job gets a fresh EC2 instance), and container-based agents where each job runs in its own Docker container (requires careful configuration to avoid Docker socket exposure).

The Elastic CI Stack v6 for AWS is the production-grade option for most teams. It provisions a fresh EC2 instance per job via autoscaling, runs the job on that instance, and terminates the instance when the job completes. The job-to-job isolation is strong — each job runs on dedicated compute — and the attack surface is small. The operational cost is higher than a fixed agent pool, but for any deployment where builds handle sensitive credentials, it is the right default.

Pipeline uploads and the trust model

Buildkite pipelines can be defined in .buildkite/pipeline.yml in the repository, or they can be uploaded dynamically by the pipeline itself using the buildkite-agent pipeline upload command. The dynamic upload is powerful — it lets you generate pipeline steps programmatically — and it is the source of a specific class of supply chain risks.

The pattern to watch for is a pipeline upload that reads untrusted input and generates steps based on it. A common example is a pipeline that reads a file from the repository to determine what tests to run, where the file contents end up interpolated into the uploaded pipeline YAML. If an attacker can modify that file, they can inject new pipeline steps, including steps that exfiltrate secrets or modify artifacts.
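
To make the risk concrete, here is a sketch of the anti-pattern. The file name and step shapes are hypothetical, but the flaw is the one described above: repository content lands unescaped inside the YAML that gets uploaded.

```shell
#!/usr/bin/env bash
# ANTI-PATTERN: interpolating untrusted repo content into pipeline YAML.
# test-suites.txt is a hypothetical file an attacker can modify via a PR.

generate_pipeline() {
  local suites_file="$1"
  echo "steps:"
  while IFS= read -r suite; do
    # The suite name lands unescaped inside the YAML. A crafted "name"
    # containing YAML syntax injects brand-new steps into the uploaded
    # pipeline, with full access to the build environment.
    printf '  - label: "test %s"\n    command: "run-tests %s"\n' "$suite" "$suite"
  done < "$suites_file"
}

# In the vulnerable pipeline this output is piped straight to the agent:
#   generate_pipeline test-suites.txt | buildkite-agent pipeline upload
```

Anything that can write to that file controls the uploaded pipeline, which is why the next section treats every dynamic upload as a trust boundary.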

Buildkite's built-in controls only partially address this: the pipeline settings include a "Teams" scope that restricts who can run pipelines, but nothing restricts what a pipeline can do once it is running. The practical hardening is to treat every buildkite-agent pipeline upload call as a trust boundary. Validate the inputs to the upload, use schema validation to restrict which step types and plugins can appear in the uploaded YAML, and prefer static pipeline definitions in pipeline.yml over dynamic uploads wherever possible.

A concrete approach that works well is to wrap pipeline upload in a shell function that pipes the generated YAML through a validation step before uploading. The validation step parses the YAML, checks that every step conforms to an allowed schema, rejects any step that references a plugin not on the approved list, and fails the build if the validation fails. This adds a consistent trust boundary to every dynamic upload and catches injection bugs before they turn into compromises.
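
A minimal sketch of that wrapper follows. The allow-list file path is an assumption, and the grep-based extraction is deliberately crude; a real implementation should parse the YAML with a proper parser rather than pattern-match it.

```shell
#!/usr/bin/env bash
# Hedged sketch: wrap `buildkite-agent pipeline upload` behind a validation
# pass. ALLOWED_PLUGINS_FILE is a hypothetical allow-list, one org/plugin
# per line. A production version should parse the YAML structurally.
set -euo pipefail

ALLOWED_PLUGINS_FILE="${ALLOWED_PLUGINS_FILE:-.buildkite/allowed-plugins.txt}"

validate_pipeline() {
  local yaml_file="$1"
  # Find every plugin-style reference of the form org/name#ref.
  local refs ref plugin sha
  refs=$(grep -oE '[A-Za-z0-9_-]+/[A-Za-z0-9_-]+#[A-Za-z0-9_.-]+' "$yaml_file" || true)
  for ref in $refs; do
    plugin="${ref%%#*}"
    sha="${ref##*#}"
    # The plugin must be on the approved list.
    grep -qxF "$plugin" "$ALLOWED_PLUGINS_FILE" || {
      echo "unapproved plugin: $plugin" >&2; return 1; }
    # The ref must be a full 40-character commit SHA, not a branch or tag.
    [[ "$sha" =~ ^[0-9a-f]{40}$ ]] || {
      echo "unpinned plugin ref: $ref" >&2; return 1; }
  done
}

safe_pipeline_upload() {
  local yaml_file="$1"
  validate_pipeline "$yaml_file"
  buildkite-agent pipeline upload "$yaml_file"
}
```

Calling safe_pipeline_upload instead of the raw upload gives every dynamic pipeline the same trust boundary, and the build fails loudly when a generated step drifts outside the schema.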

The plugin ecosystem and supply chain risks

Buildkite plugins are git repositories that provide reusable pipeline functionality. A plugin is referenced in a pipeline step with a ref (a branch, tag, or commit SHA), and the agent clones the plugin repository and executes its hooks during the build. The security model here is that the plugin has full access to the build environment, which means a malicious plugin version is an immediate compromise of every build that uses it.

Two patterns matter. First, every plugin reference should be pinned to a commit SHA or a signed tag, never a branch name. A pipeline that references some-org/some-plugin#main pulls the current main-branch code on every build, so a compromise of the plugin repository propagates to your builds immediately. Pinning to a commit SHA ensures that the code running is the code that was reviewed.

Second, plugins that are not maintained by your own organization should be mirrored. The official Buildkite plugins (buildkite-plugins/*) are generally high quality and are reviewed by the Buildkite team, but third-party plugins have variable quality and some have gone unmaintained. Mirroring the plugins you depend on into your own git organization gives you a stable point of reference that cannot be changed out from under you by an upstream compromise, and it lets you review updates on your own schedule.
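
A mirroring step can be as small as the sketch below. The org and repository names are hypothetical; the point is that builds fetch only from a repository you control, pinned to a SHA you reviewed.

```shell
#!/usr/bin/env bash
# Hedged sketch: mirror a third-party plugin into your own git org so
# builds never fetch the upstream repository directly.
set -euo pipefail

mirror_plugin() {
  local upstream="$1"   # e.g. https://github.com/some-org/some-plugin.git
  local mirror="$2"     # e.g. git@github.com:your-org/mirror-some-plugin.git
  local dir
  dir=$(mktemp -d)
  # --mirror copies every ref, so tags and branches survive the copy.
  git clone -q --mirror "$upstream" "$dir"
  git -C "$dir" push -q --mirror "$mirror"
  rm -rf "$dir"
}

# Pipelines then reference the mirror, pinned to a reviewed commit:
#   plugins:
#     - your-org/mirror-some-plugin#<40-char reviewed SHA>: ...
```

Re-running the mirror on your own schedule is the review point: you diff the upstream changes before they can ever reach a build.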

Agent tokens and the rotation strategy

The Buildkite agent-to-server trust relationship is based on an agent registration token. Each agent pool has one, and any agent with a valid token can register, poll for jobs, and execute them. If a token is leaked, an attacker can register a rogue agent that will receive jobs intended for your trusted agents, including any secrets those jobs use.

The Buildkite API supports multiple agent tokens per pool and individual token revocation, which enables a clean rotation strategy. The pattern is:

  1. Generate a new agent token.
  2. Update the agent configuration to use the new token on every agent host.
  3. Restart the agents to pick up the new token.
  4. Revoke the old token.
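
Steps 1 and 4 can be scripted against the Buildkite REST API. The cluster-token endpoint paths below are an assumption based on the Clusters API; verify them against the API version your organization uses before relying on this.

```shell
#!/usr/bin/env bash
# Hedged sketch of the rotation's API half. Endpoint paths are assumptions
# (Clusters API shape); BK_ADMIN_TOKEN is an API access token with agent
# token permissions.
set -euo pipefail

BK_API="${BK_API:-https://api.buildkite.com/v2}"

token_url() {
  # Build the agent-token endpoint for a cluster, optionally for one token.
  local org="$1" cluster="$2" token_id="${3:-}"
  if [ -n "$token_id" ]; then
    echo "$BK_API/organizations/$org/clusters/$cluster/tokens/$token_id"
  else
    echo "$BK_API/organizations/$org/clusters/$cluster/tokens"
  fi
}

create_token() {   # step 1: generate a new agent token
  curl -fsS -X POST -H "Authorization: Bearer $BK_ADMIN_TOKEN" \
    -d '{"description":"rotation '"$(date +%F)"'"}' \
    "$(token_url "$1" "$2")"
}

revoke_token() {   # step 4: revoke the old token once agents are restarted
  curl -fsS -X DELETE -H "Authorization: Bearer $BK_ADMIN_TOKEN" \
    "$(token_url "$1" "$2" "$3")"
}

# Steps 2-3 (distributing the new token and restarting agents) are
# deployment-specific: a CloudFormation parameter update for the Elastic
# CI Stack, or a secret-manager write plus a rolling restart elsewhere.
```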

The rotation should happen on a schedule — quarterly at minimum, more often for high-sensitivity environments — and the process should be automated. A rotation runbook that requires a human to SSH into every agent host is a rotation runbook that nobody will run in practice. The Elastic CI Stack on AWS supports token rotation via a CloudFormation parameter update, which is the cleanest option for AWS deployments.

The agent tokens should also be stored in a secret manager, not in version control. A surprising number of Buildkite deployments have the agent token in a Terraform state file or a provisioning script that is committed to git. Any of those locations is a credential disclosure, and the token is effectively the master credential for every build on that agent pool.
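
One way to keep the token out of config files entirely is to resolve it at agent start time. The sketch below assumes AWS Secrets Manager and a hypothetical secret-naming convention; the agent reads the token from the BUILDKITE_AGENT_TOKEN environment variable.

```shell
#!/usr/bin/env bash
# Hedged sketch: fetch the agent token from AWS Secrets Manager at start
# time instead of storing it in config or version control. The secret
# naming scheme here is an assumption.
set -euo pipefail

secret_name_for_queue() {
  # One token secret per agent queue keeps the blast radius per-pool.
  local queue="$1"
  echo "buildkite/agent-token/$queue"
}

start_agent() {
  local queue="$1"
  BUILDKITE_AGENT_TOKEN=$(aws secretsmanager get-secret-value \
    --secret-id "$(secret_name_for_queue "$queue")" \
    --query SecretString --output text)
  export BUILDKITE_AGENT_TOKEN
  exec buildkite-agent start --tags "queue=$queue"
}
```

Wired into the systemd unit or instance userdata, this means rotation is just a secret-manager write followed by an agent restart, with no host-level config edits.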

Secret injection and the build environment

Buildkite's secret injection is done via environment variables that are populated at job start, either from the Buildkite agent's configuration or from a hook that fetches secrets at runtime. The common pattern is an agent hook that fetches secrets from AWS Secrets Manager or HashiCorp Vault based on the pipeline name and injects them as environment variables for the job.

The hardening advice is to use short-lived credentials wherever possible, scope them narrowly to the pipeline that needs them, and audit the secret access logs on a regular basis. The runtime-fetched secret pattern is much better than long-lived secrets stored in the Buildkite configuration because it creates an audit trail and a rotation story, but it requires operational investment that many teams skip.
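
The runtime-fetch pattern described above is usually implemented as an agent environment hook. The sketch below assumes AWS Secrets Manager and a hypothetical per-pipeline secret path; BUILDKITE_PIPELINE_SLUG is set by the agent for every job.

```shell
#!/usr/bin/env bash
# Hedged sketch of an agent `environment` hook that scopes secrets to the
# pipeline requesting them. Secret paths and storage format are assumptions.
set -euo pipefail

secret_path() {
  # Scope the lookup to the pipeline slug so one pipeline cannot read
  # another pipeline's secrets, and every fetch leaves an audit log entry.
  echo "buildkite/pipelines/$1/env"
}

# In the real hook (hooks/environment on the agent host):
#   eval "$(aws secretsmanager get-secret-value \
#     --secret-id "$(secret_path "$BUILDKITE_PIPELINE_SLUG")" \
#     --query SecretString --output text)"
# where the secret stores `export KEY=value` lines for that pipeline only.
```

The per-pipeline path is the scoping mechanism: the secret manager's access logs then tell you exactly which pipeline read which secret and when.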

How Safeguard Helps

Safeguard integrates with Buildkite by ingesting the pipeline definitions across your organization and flagging steps that use unpinned plugin references, dynamic pipeline uploads without validation, or agent configurations with long-lived agent tokens. The platform also tracks the artifacts Buildkite produces and validates their provenance attestations against your policy gates. Combined with continuous monitoring of agent-token rotation cadence and alerting on secret-access anomalies, this gives you an auditable view of whether your Buildkite estate is operating to the hardening standards in this guide. That visibility matters most in environments where agent pools span multiple teams and operational controls drift over time.
