
AWS Step Functions Workflow Security

Step Functions workflows orchestrate everything from data pipelines to security automations. The workflow IAM role is almost always the most powerful thing in the stack. Here is how to lock it down.

Nayan Dey
Senior Security Engineer

Step Functions is the quiet workhorse of AWS. It orchestrates ETL jobs, fans out batch processing, runs security response playbooks, and coordinates multi-step deployments. Once a workflow is wired up and running, it tends to stay running for years, accruing integrations, taking input from more sources, and gradually expanding its IAM scope every time someone adds one more task. The result is often the most powerful IAM role in the account, buried in a workflow nobody has looked at since 2021.

This post is about Step Functions workflow security from the perspective of an engineer who has to go clean up these workflows. The structure follows how I approach a workflow review.

Map the integrations first

Before anything else, look at what services the workflow talks to. Step Functions has two ways to integrate with other services: optimized integrations (resource ARNs of the form arn:aws:states:::service:action, optionally suffixed with .sync or .waitForTaskToken) and direct AWS SDK service integrations (the newer arn:aws:states:::aws-sdk:service:action style).

For each task state in the workflow, note the service, the action, and what resource patterns the workflow can target. A workflow with tasks like lambda:invoke, ecs:runTask.sync, sns:publish, and dynamodb:putItem talks to Lambda, ECS, SNS, and DynamoDB. The workflow IAM role needs permissions for all of those, and the question is whether the permissions are scoped to the specific resources the workflow uses or whether they are wildcarded.
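As a rough sketch of that inventory step, the snippet below walks a state machine definition that has already been loaded into a Python dict and lists every Task state with the service and action it targets. The function names and the sample definition are mine, for illustration only, and the prefix check assumes the standard "aws" partition.

```python
PREFIX = "arn:aws:states:::"  # assumption: standard "aws" partition


def list_task_resources(definition):
    """Return sorted (state name, Resource) pairs for every Task state,
    including Tasks nested in Parallel branches and Map processors."""
    found = []

    def walk(states):
        for name, state in states.items():
            if state.get("Type") == "Task":
                found.append((name, state.get("Resource", "")))
            for branch in state.get("Branches", []):      # Parallel state
                walk(branch.get("States", {}))
            for key in ("ItemProcessor", "Iterator"):     # Map state (current and legacy field)
                if key in state:
                    walk(state[key].get("States", {}))

    walk(definition.get("States", {}))
    return sorted(found)


def describe_integration(resource):
    """Classify a Task Resource as optimized, aws-sdk, or a direct ARN."""
    if not resource.startswith(PREFIX):
        # For example a Lambda function ARN used directly as the Resource.
        parts = resource.split(":")
        service = parts[2] if len(parts) > 2 else ""
        return {"kind": "direct-arn", "service": service, "action": ""}
    parts = resource[len(PREFIX):].split(":")
    kind = "optimized"
    if parts[0] == "aws-sdk":
        kind, parts = "aws-sdk", parts[1:]
    service = parts[0] if parts else ""
    action = parts[1] if len(parts) > 1 else ""
    return {"kind": kind, "service": service, "action": action}


# Hypothetical definition, trimmed to the fields the walk needs.
definition = {
    "StartAt": "Fetch",
    "States": {
        "Fetch": {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Next": "Fan"},
        "Fan": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {"StartAt": "Run", "States": {"Run": {
                    "Type": "Task", "Resource": "arn:aws:states:::ecs:runTask.sync", "End": True}}},
                {"StartAt": "Save", "States": {"Save": {
                    "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem", "End": True}}},
            ],
        },
    },
}

inventory = [(name, describe_integration(res)) for name, res in list_task_resources(definition)]
for name, info in inventory:
    print(name, info["kind"], info["service"], info["action"])
```

The output is only the starting list. Each row still has to be compared by hand against the statements in the workflow role.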

The common failure: a wildcarded lambda:InvokeFunction on Resource: "*" because at some point the workflow invoked several Lambdas and the author did not want to maintain a list. Every one of those wildcards is a potential privilege escalation path. A compromised workflow can invoke any Lambda in the account, including Lambdas owned by entirely unrelated teams.

Fix: replace every wildcard with an explicit resource list. If the workflow invokes three Lambdas, the role's lambda:InvokeFunction permission names exactly those three function ARNs and nothing else. If the workflow needs to invoke a dynamic Lambda name passed in as workflow input, that is a different (and worse) problem, covered below.
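To find the wildcards in the first place, a small check over the role's policy document is usually enough. The sketch below assumes the policy is already loaded as a Python dict; the function names, Sids, and function ARNs are hypothetical. It flags every Allow statement whose Resource contains a "*", which is deliberately noisy: some partial wildcards (for example a trailing :* that covers the versions of a single function) may be intentional and need a human decision.

```python
def as_list(value):
    """IAM allows a single value or a list for Statement, Action, and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy):
    """Return one finding per Allow statement with a '*' in any Resource."""
    findings = []
    for index, stmt in enumerate(as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        wild = [r for r in as_list(stmt.get("Resource")) if "*" in r]
        if wild:
            findings.append({
                "statement": stmt.get("Sid", index),
                "actions": as_list(stmt.get("Action")),
                "resources": wild,
            })
    return findings


# Hypothetical workflow role policy: one wildcarded statement, one scoped.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "InvokeAnything", "Effect": "Allow",
         "Action": "lambda:InvokeFunction", "Resource": "*"},
        {"Sid": "InvokeKnown", "Effect": "Allow",
         "Action": "lambda:InvokeFunction",
         "Resource": [
             "arn:aws:lambda:us-east-1:123456789012:function:extract",
             "arn:aws:lambda:us-east-1:123456789012:function:transform",
             "arn:aws:lambda:us-east-1:123456789012:function:load",
         ]},
    ],
}

findings = wildcard_findings(policy)
for finding in findings:
    print(finding["statement"], finding["actions"], finding["resources"])
```

The second statement is the shape the fix should end up with: one action, three named functions.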

Input validation and the taint problem

Step Functions workflows accept input. That input can include ARNs, table names, S3 paths, and other parameters. Some workflows use those parameters to construct the targets of subsequent tasks: "run an ECS task on the cluster specified in input.clusterArn," "put an object to the S3 bucket in input.bucketName."

If the workflow role has ecs:RunTask on Resource: "*" because the cluster ARN is an input parameter, the workflow is a confused deputy. Anyone who can invoke the workflow with a crafted input can run ECS tasks on any cluster in the account. The workflow acts on behalf of the caller with its own broader permissions.

Three mitigations, in order of strength:

Restrict workflow invocation. Who can call StartExecution on the workflow? If the answer is "any Lambda in the account via IAM," the attack surface is every principal whose policy allows states:StartExecution on this state machine, plus anything that can assume one of those roles. Scope that permission. The state machine's invocation path should be explicit and minimal.

Constrain inputs at the state machine level. Use the state machine definition to reject inputs that do not match expected patterns, for example with a Choice state at the start that routes anything unexpected to a Fail state. JSONata (the newer expression language) makes input validation straightforward. States that take ARNs should validate the ARN structure, extract the account and region, and reject the input if they do not match expected values.

Use dedicated roles per target. If the workflow legitimately operates on one of several resources, grant the role access to exactly those resources, not a wildcard. If the list is too long to enumerate, the design is probably wrong and the workflow should be split.
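To make the ARN check in the second mitigation concrete, here is the logic written out in Python. This is a sketch of the rule rather than a state machine definition: inside a workflow the same comparison would be expressed in the definition itself, and the function name, account ID, and cluster names are hypothetical. It assumes ARNs in the usual six-field form with region and account populated, so it does not fit services such as S3 whose ARNs leave those fields empty.

```python
def check_arn(arn, *, service, account, region, partition="aws"):
    """Return the resource part of `arn`, or raise ValueError if the ARN is
    malformed or points outside the expected partition/service/region/account."""
    if not isinstance(arn, str):
        raise ValueError("ARN must be a string")
    # arn:partition:service:region:account:resource (resource may contain ':')
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or not parts[5]:
        raise ValueError(f"not a well-formed ARN: {arn!r}")
    _, got_partition, got_service, got_region, got_account, resource = parts
    if (got_partition, got_service, got_region, got_account) != (
        partition, service, region, account
    ):
        raise ValueError(f"ARN is outside the expected scope: {arn!r}")
    return resource


# Accepted: same account and region as the workflow.
resource = check_arn(
    "arn:aws:ecs:us-east-1:123456789012:cluster/batch",
    service="ecs", account="123456789012", region="us-east-1",
)
print(resource)

# Rejected: a cluster in a different account.
try:
    check_arn(
        "arn:aws:ecs:us-east-1:999999999999:cluster/batch",
        service="ecs", account="123456789012", region="us-east-1",
    )
    rejected = False
except ValueError:
    rejected = True
```

A structural check like this narrows the input but does not replace the third mitigation. The role should still name the specific clusters, so that an ARN which passes validation but is not on the list fails at the IAM layer.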

Express workflows versus Standard workflows

Step Functions has two workflow types: Standard (up to one year duration, exactly-once execution, full execution history) and Express (up to five minutes, at-least-once for asynchronous and at-most-once for synchronous, no execution history kept by Step Functions).

The security difference is logging. Standard workflows record every state transition in the execution history, which you can query via GetExecutionHistory, and can additionally send it to CloudWatch Logs. Express workflows keep no execution history in Step Functions, and GetExecutionHistory does not work for them. Their only record is what they ship to CloudWatch Logs, and only if you configure it. Whatever retention you get is the retention you set on that log group.

For security-sensitive workflows, use Standard or configure Express workflows with full logging. The retention on execution history matters: Step Functions keeps the history of a Standard execution only for a limited period after it closes (90 days), and CloudTrail alone does not give you the per-state detail you need to reconstruct what a workflow did.

The logging configuration nobody sets

When you create a state machine, the default logging level is OFF. No execution history goes to CloudWatch Logs. No state inputs or outputs are logged. If an incident happens and you need to know what input a specific execution was invoked with, the answer is usually "we cannot tell" because the history retention has expired and the log level was off.

Set it correctly at creation:

{
  "loggingConfiguration": {
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [{
      "cloudWatchLogsLogGroup": {
        "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/states/MyWorkflow:*"
      }
    }]
  }
}

includeExecutionData: true logs the inputs and outputs of each state. This is what you need for forensics. It also means your logs may contain data that is sensitive (PII, credentials that shouldn't be there in the first place), so the log group should be encrypted with a customer-managed KMS key and retention should be set deliberately.

For highly sensitive workflows, combine ALL with includeExecutionData: false to log state transitions without the data. You lose some forensic detail but avoid the sensitive-data-in-logs problem.
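A review script can apply these rules mechanically. The sketch below takes a loggingConfiguration dict in the same shape as the JSON above and returns a list of findings. The function name and the wording of the findings are my own, and it treats a missing includeExecutionData as something to confirm rather than as an error, since the previous paragraph describes a legitimate reason to turn it off.

```python
def logging_findings(logging_configuration):
    """Return human-readable findings for a state machine loggingConfiguration."""
    findings = []
    level = logging_configuration.get("level", "OFF")
    destinations = logging_configuration.get("destinations") or []

    if level == "OFF":
        findings.append("logging is OFF")
        return findings  # nothing else is meaningful while logging is off
    if level != "ALL":
        findings.append(f"level {level} does not record every state transition")
    if not destinations:
        findings.append("no CloudWatch Logs destination configured")
    if not logging_configuration.get("includeExecutionData", False):
        findings.append("state input and output are not logged; confirm this is deliberate")
    return findings


# The configuration recommended above produces no findings.
recommended = {
    "level": "ALL",
    "includeExecutionData": True,
    "destinations": [{
        "cloudWatchLogsLogGroup": {
            "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/states/MyWorkflow:*"
        }
    }],
}
print(logging_findings(recommended))
print(logging_findings({"level": "OFF"}))
```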

The Lambda task anti-pattern

The most common Step Functions anti-pattern: a workflow where most states are Lambda tasks, each Lambda has its own broad IAM role, and the workflow role has lambda:InvokeFunction on Resource: "*". The workflow is effectively a chain of Lambdas, each running with its own permissions, and the actual authorization boundaries are in the Lambdas rather than in Step Functions itself.

This is workable but it means the security model lives in the Lambdas. Each Lambda needs to be reviewed independently, each role needs to be scoped, and the Step Functions layer is just orchestration. If the orchestration is all there is, Step Functions buys you the execution history and retry semantics but not much else.

The alternative that uses Step Functions better: replace Lambda-wrapping-SDK-call tasks with direct SDK integrations. Instead of a Lambda that does s3.PutObject, use a Task state whose resource is arn:aws:states:::aws-sdk:s3:putObject. The workflow role, not a Lambda role, performs the S3 operation. You have one role to scope instead of ten. You have one audit trail in the workflow's execution history instead of logs scattered across Lambda executions.
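A minimal sketch of what that looks like, written as Python dicts so the state and the matching IAM statement sit side by side. The bucket name, prefix, state names, and the $.key and $.body input fields are hypothetical. The state uses JSONPath-style Parameters and assumes $.body is a string.

```python
import json


def put_object_state(bucket, next_state):
    """Task state that calls S3 PutObject through the AWS SDK integration.
    The bucket is fixed in the definition; only key and body come from input."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:s3:putObject",
        "Parameters": {
            "Bucket": bucket,
            "Key.$": "$.key",
            "Body.$": "$.body",
        },
        "Next": next_state,
    }


def put_object_statement(bucket, prefix):
    """IAM statement for the workflow role, limited to one bucket prefix."""
    return {
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
    }


state = put_object_state("example-reports-bucket", "Notify")
statement = put_object_statement("example-reports-bucket", "reports/")
print(json.dumps({"WriteReport": state}, indent=2))
print(json.dumps(statement, indent=2))
```

Because the key still comes from input, the prefix in the role statement is what bounds where the workflow can write: a key outside reports/ is denied by IAM instead of being trusted.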

This refactor is real work but it produces workflows that are much easier to secure and reason about. I recommend it for any new workflow, and for existing workflows where the Lambda-per-task pattern is causing IAM sprawl.

The .sync variant and running tasks

When a task uses the .sync variant (ecs:runTask.sync, batch:submitJob.sync), Step Functions waits for the downstream operation to complete and surfaces the result. The permissions needed for .sync are broader than the base action: the role also needs events:PutTargets, events:PutRule, and events:DescribeRule to set up the completion notification, along with the service's own actions for checking on and stopping the work (for ECS, ecs:DescribeTasks and ecs:StopTask).

These permissions are documented but easy to miss. The common outcome: a role with wildcarded events:* so that .sync works, which is far broader than needed. The fix: scope the EventBridge permissions to the specific managed rule ARN that Step Functions creates, which is arn:aws:events:region:account:rule/StepFunctionsGetEventsForECSTaskRule (the exact name varies by service).
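The scoped statement is short enough to generate. In the sketch below the function name, region, and account ID are placeholders; the rule name is the ECS one quoted above and has to be swapped for the matching managed rule when the .sync task targets another service.

```python
import json


def sync_events_statement(region, account, rule_name):
    """EventBridge permissions for a .sync task, limited to one managed rule."""
    return {
        "Effect": "Allow",
        "Action": ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
        "Resource": f"arn:aws:events:{region}:{account}:rule/{rule_name}",
    }


ecs_sync = sync_events_statement(
    "us-east-1", "123456789012", "StepFunctionsGetEventsForECSTaskRule"
)
print(json.dumps(ecs_sync, indent=2))
```

This replaces an events:* grant with three actions on a single rule ARN.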

How Safeguard Helps

Safeguard analyzes every Step Functions state machine in your AWS organization and maps each task state to the IAM permissions it actually requires versus what the workflow role is granted. We flag wildcarded Resource: "*" permissions where the state definition targets specific resources, identify state machines with logging disabled or includeExecutionData: false, and surface workflows where the trust path for invocation is broader than the downstream targets would suggest. For workflows that operate on input-supplied ARNs, Safeguard traces the confused-deputy paths and recommends input validation at the state level.
