Best Practices

How to Audit Open Source Licenses for Compliance

A senior engineer's playbook for auditing open source licenses across modern polyglot repos, from SPDX extraction to enforcement in CI and legal reporting.

Shadab Khan
Security Engineer
8 min read

Every modern application ships with thousands of open source components, and each one carries license terms that your legal team is on the hook for. Most engineering organizations discover their licensing problems the week before a funding round, an acquisition, or a government audit — at which point unwinding a GPL contamination or a missing attribution file becomes a painful exercise. I've walked through this process with dozens of engineering teams, and the pattern that produces reliable outcomes is surprisingly mechanical. It just requires treating license compliance as a first-class engineering concern rather than a periodic paperwork exercise.

This guide walks through how to build a repeatable license audit process that runs continuously, produces evidence your legal counsel can defend, and doesn't block developers from shipping code.

Why do open source license audits fail so often?

Most audits fail because they depend on manual inventories that go stale the moment someone merges a dependency bump. Teams run a tool once, email a spreadsheet to legal, and then forget about the problem until the next audit cycle. By then the dependency tree has shifted dramatically, transitive packages have been added with incompatible licenses, and the audit has no relationship to what's actually in production.

The second failure mode is relying on package manager metadata as ground truth. The license field in package.json or pyproject.toml is self-reported by the package author, frequently wrong, and sometimes missing entirely. A LICENSE file in the repo may say Apache 2.0 while the package.json declares MIT, and neither matches the actual code origin. A real audit reconciles all three sources — declared metadata, license files, and source-level notices — and flags the mismatches.

The third failure is ignoring transitive dependencies. Your direct dependency list might be spotless, but the packages it pulls in could include copyleft components, unapproved licenses, or packages with no license at all. Any audit that stops at direct dependencies is missing 80 to 95 percent of the risk surface.

What license categories should my policy define?

A workable policy maps every license into one of four buckets: allowed, allowed-with-notice, restricted, and forbidden. Allowed licenses are permissive (MIT, Apache 2.0, BSD, ISC) and require nothing beyond basic attribution. Allowed-with-notice licenses (MPL 2.0, EPL 2.0, LGPL) are acceptable but trigger distribution obligations you need to track. Restricted licenses (GPL, AGPL, SSPL, Commons Clause) require explicit legal review before inclusion. Forbidden licenses include the JSON license with its "do no evil" clause, custom non-commercial licenses, and the growing set of "fair source" licenses that look open but aren't.

Your policy also needs to account for license exceptions, dual licensing, and license propagation rules. A package dual-licensed under GPL-2.0-or-later and MIT can be used under either license, but your SBOM should document which one you selected. Dynamically linked LGPL code carries different obligations than statically linked LGPL code, and your policy should spell out which linking patterns your architecture uses.

Write this policy in a machine-readable format. SPDX license expressions are the lingua franca here, and tools from license-checker to scancode-toolkit can consume SPDX identifiers directly. A plain YAML file committed to your monorepo beats a PDF owned by legal every time.
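As a sketch, such a policy file might look like the following. The category names, SPDX identifiers, and overall schema here are illustrative, not a prescribed format — the schema is whatever your checker script agrees to parse:

```yaml
# license-policy.yml -- illustrative sketch, not a prescribed schema
allowed:
  - MIT
  - Apache-2.0
  - BSD-2-Clause
  - BSD-3-Clause
  - ISC
allowed-with-notice:
  - MPL-2.0
  - EPL-2.0
  - LGPL-2.1-only
  - LGPL-3.0-only
restricted:            # requires a recorded legal review before merge
  - GPL-2.0-only
  - GPL-3.0-only
  - AGPL-3.0-only
  - SSPL-1.0
forbidden:
  - JSON               # "shall be used for Good, not Evil"
# Dual licensing: record which branch of the expression you elected
elections:
  "GPL-2.0-or-later OR MIT": MIT
```

Because this lives in the repo, policy changes go through pull-request review like any other code, and CI always enforces the version that matches the commit it is checking.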

How do I extract accurate license data from a polyglot codebase?

Start with SPDX as the canonical exchange format and generate an SBOM per artifact, not per repo. A repo might produce three Docker images, a Go binary, and a Python wheel, each with different dependency closures. Tools like syft produce SPDX SBOMs for most ecosystems, and cdxgen handles CycloneDX with similar coverage.

For each artifact, run three parallel scans: package-manager metadata extraction, license file detection, and source-level scanning for copyright headers and license notices. Reconcile the three. Here's a minimal CI gate that catches the most common failure modes:

# .github/workflows/license-audit.yml
name: license-audit
on: [pull_request]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate SBOM
        run: |
          syft packages dir:. -o spdx-json > sbom.spdx.json
      - name: Enforce license policy
        run: |
          jq -r '.packages[] | "\(.name)\t\(.licenseConcluded)"' sbom.spdx.json \
            | python scripts/check_policy.py --policy license-policy.yml \
            | tee license-report.txt
      - name: Fail on forbidden licenses
        run: |
          if grep -qE 'FORBIDDEN|UNKNOWN' license-report.txt; then
            echo "License policy violation"
            exit 1
          fi
      - uses: actions/upload-artifact@v4
        if: always()   # archive the report even when the policy gate fails
        with:
          name: license-report
          path: |
            sbom.spdx.json
            license-report.txt

The key insight is that your audit produces artifacts on every pull request, not once a quarter. Every merged PR has a license report in its CI history, and you can reconstruct the license state of any release from its SBOM.
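The `scripts/check_policy.py` referenced in the workflow isn't shown; a minimal, dependency-free sketch could look like the following. The policy is hard-coded here where a real version would load `license-policy.yml` (e.g. with `yaml.safe_load`), and every identifier is illustrative:

```python
#!/usr/bin/env python3
"""Sketch of a policy checker: reads "name<TAB>license" lines on stdin
and emits one verdict per package. Category names and SPDX identifiers
are illustrative; a real version would load them from license-policy.yml
instead of hard-coding them."""
import sys

POLICY = {
    "allowed": {"MIT", "Apache-2.0", "BSD-3-Clause", "ISC"},
    "allowed_with_notice": {"MPL-2.0", "EPL-2.0", "LGPL-3.0-only"},
    "restricted": {"GPL-3.0-only", "AGPL-3.0-only", "SSPL-1.0"},
}

def verdict(license_id):
    """Map an SPDX license identifier to a policy verdict, failing closed."""
    # jq prints "null" for a missing field, so treat it like NOASSERTION
    if license_id in (None, "", "null", "NONE", "NOASSERTION"):
        return "UNKNOWN"
    for category, ids in POLICY.items():
        if license_id in ids:
            return category.upper()
    return "FORBIDDEN"  # anything unlisted fails closed

def check(lines):
    """Yield 'name<TAB>license<TAB>verdict' rows for tab-separated input."""
    for line in lines:
        if not line.strip():
            continue
        name, _, lic = line.rstrip("\n").partition("\t")
        yield f"{name}\t{lic}\t{verdict(lic)}"

if __name__ == "__main__":
    for row in check(sys.stdin):
        print(row)
```

The grep step in the workflow then fails the build on any FORBIDDEN or UNKNOWN verdict, so a package with missing license data blocks a merge the same way a disallowed license does.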

How do I handle transitive dependencies and license propagation?

Resolve the full transitive closure per artifact and treat each level as a first-class citizen. A Go binary statically links everything it imports, so every transitive dependency's license applies to your redistributed binary. A Python wheel with runtime dependencies has a more nuanced story — obligations may shift depending on how the downstream user installs it.

Track propagation rules explicitly. If a package is GPL-licensed and your policy allows GPL only for build-time tooling, your audit needs to know which parts of your tree are "build-time" versus "runtime" versus "ship-to-customer." Monorepos make this harder because a shared utility library might be used in three different contexts with three different obligations.

Finally, track license changes over time. A package you've depended on for two years can re-license itself from MIT to BSL, and most dependency update tools won't flag this. Your SBOM diff tooling should alert when a package's license identifier changes between versions, not just when its version string changes.
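That diff check can be sketched against two SPDX JSON SBOMs, using only the `name` and `licenseConcluded` fields (file names below are placeholders):

```python
import json

def license_map(sbom_path):
    """Map package name -> concluded license from an SPDX JSON SBOM."""
    with open(sbom_path) as f:
        doc = json.load(f)
    return {p["name"]: p.get("licenseConcluded", "NOASSERTION")
            for p in doc.get("packages", [])}

def license_changes(old, new):
    """Return (name, old_license, new_license) for every package whose
    license identifier changed between two SBOM snapshots -- the
    relicensing signal that version-bump tooling typically misses."""
    return [(name, old[name], lic)
            for name, lic in new.items()
            if name in old and old[name] != lic]
```

Running `license_changes(license_map("release-4.11.spdx.json"), license_map("release-4.12.spdx.json"))` across consecutive release SBOMs surfaces an MIT-to-BSL flip even when the version bump itself looked routine.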

What evidence do auditors and acquirers actually want?

External auditors want three things: a current inventory, a historical trail, and proof of enforcement. The inventory is your SBOM for each shipping artifact. The historical trail shows that SBOMs have been produced continuously, signed, and archived. The proof of enforcement is your CI history — commits that were rejected because they violated the license policy, and documentation of exceptions that were granted by name and date.

For M&A due diligence specifically, expect questions about AGPL in any SaaS-facing code, questions about LGPL linking boundaries, and questions about packages with missing or ambiguous licenses. Having the answers cached in your SBOM metadata — with the exact commit, version, and license expression — turns what could be a month-long forensic exercise into a same-day response.

Attribution is the easy part to forget. Most permissive licenses require you to reproduce the copyright notice and license text in distributed binaries. Generating a NOTICE.md file from your SBOM during build and including it in every container image and released tarball closes this gap with almost no developer friction.
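One way to close that gap, sketched against the SPDX 2.3 JSON field names (`versionInfo`, `licenseConcluded`, `copyrightText`); the rendered format is illustrative:

```python
import json

def build_notice(packages):
    """Render a NOTICE.md body from the packages array of an SPDX JSON
    SBOM. copyrightText may be NOASSERTION when the scanner could not
    recover a notice, in which case only the license line is emitted."""
    lines = ["# Third-Party Notices", ""]
    for p in sorted(packages, key=lambda p: p["name"].lower()):
        lines.append(f"## {p['name']} {p.get('versionInfo', '')}".rstrip())
        lines.append(f"License: {p.get('licenseConcluded', 'NOASSERTION')}")
        notice = p.get("copyrightText", "NOASSERTION")
        if notice not in ("", "NOASSERTION"):
            lines.append(notice)
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    # In CI, feed this the same SBOM the policy gate produced.
    with open("sbom.spdx.json") as f:
        print(build_notice(json.load(f).get("packages", [])))
```

Wiring this into the same build that produces the SBOM means the NOTICE file can never drift from the artifact it ships with.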

How should I operationalize license exceptions?

Exceptions are inevitable. A vendor-critical dependency might be AGPL, or a data science team might need a restricted-license model weight file. Build an exception process with explicit expiration dates and named owners. Store exceptions in the same YAML policy that defines your license categories, and have CI read from both.

Tie every exception to a replacement plan. "We accept this AGPL dependency until 2026-09-30 while we evaluate alternatives" is a defensible position. "We accept this AGPL dependency indefinitely because the team lead approved it in Slack" is not. Auto-expiring exceptions force periodic review and prevent the common failure mode where exceptions pile up until the policy is meaningless.
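A sketch of what a machine-checkable, auto-expiring exception can look like; the record fields and package name are illustrative, and in practice the list would live in the same `license-policy.yml` as the license categories:

```python
from datetime import date

# Illustrative exception records; in practice these sit in
# license-policy.yml next to the license categories, so CI reads both.
EXCEPTIONS = [
    {"package": "some-agpl-dep", "license": "AGPL-3.0-only",
     "owner": "jane.doe", "expires": "2026-09-30",
     "plan": "evaluating permissively licensed alternatives"},
]

def active_exception(package, today=None):
    """Return the exception record covering `package`, or None if no
    exception exists or it has lapsed. Lapsed exceptions fail the build,
    which is what forces the periodic review."""
    today = today or date.today()
    for exc in EXCEPTIONS:
        if exc["package"] == package:
            if date.fromisoformat(exc["expires"]) >= today:
                return exc
            return None  # expired: treat as if no exception exists
    return None
```

The policy gate consults this before flagging a restricted license, so an approved dependency passes CI until its expiry date and then starts failing builds automatically, with the owner and replacement plan already on record.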

Route the exception workflow through the same system that produces your SBOMs so the audit trail is unified. When counsel asks why a specific copyleft dependency shipped in release 4.12, the answer should trace from the SBOM to the exception record to the approver and the expiration date. That traceability is also what turns a routine license question from legal into a thirty-second lookup rather than a multi-day investigation, which is how you get legal to stay out of the engineering team's way between audits.

How Safeguard.sh Helps

Safeguard.sh generates SPDX and CycloneDX SBOMs for every build and applies reachability analysis at 100-level depth so your license audit focuses on code paths that actually execute in production — not the noise from unused transitive dependencies. Griffin AI reviews each dependency's license metadata, license files, and source-level notices in parallel and flags mismatches before they reach your main branch. Eagle continuously monitors the package ecosystem for license changes, relicensing events, and policy violations across your TPRM surface so a BSL flip on a critical dependency is caught the day it ships. Container self-healing rebuilds and redistributes affected images when a license violation is remediated upstream, keeping your attribution manifests in sync with what's actually in production. The result is a continuous, evidence-producing license compliance program that survives audits, acquisitions, and the pace of modern open source.
