Software Heritage and the Case for Source Code Preservation

Software Heritage archives the world's source code. Here is why that matters for supply chain security, reproducibility, and long-term software integrity.

Shadab Khan
Security Architect

Software Heritage is a nonprofit initiative by Inria that aims to collect, preserve, and share all publicly available source code. As of 2023, it has archived over 16 billion unique source files from more than 250 million software projects. It is the largest source code archive in existence, and its relevance to software supply chain security is underappreciated.

The archive matters because software disappears. Repositories get deleted, hosting platforms shut down, maintainers take packages offline, and organizations go bankrupt. When the code that your critical infrastructure depends on vanishes, you have a supply chain problem that no vulnerability scanner can solve.

Why Source Code Disappears

Source code loss might seem like a theoretical concern, but it happens constantly.

Repository deletions: Developers delete repositories for many reasons, including legal disputes, burnout, protest, or simply cleaning up their GitHub profile. When the event-stream package was compromised in 2018, investigators needed access to historical versions of the source code to understand the attack. If those versions had been deleted, the investigation would have been severely hampered.

Platform shutdowns: Google Code shut down in 2016. Gitorious shut down in 2015. Bitbucket dropped Mercurial support in 2020. Each time a platform winds down, projects that were not actively migrated are at risk of loss.

Package removal: The 2016 left-pad incident demonstrated that removing a package from npm can break thousands of downstream projects. But beyond immediate breakage, removal eliminates the ability to audit the code that was previously in your dependency tree.

Yanked versions: Package registries allow maintainers to yank or remove specific versions. Some registries keep a yanked version downloadable and only stop new installs from resolving to it. Where the version is removed outright and it was the one in your production deployment, you can no longer fetch its source from the registry to investigate a security incident.

How Software Heritage Works

Software Heritage operates as a universal archive with a content-addressable storage model. Every source file, directory, revision, release, and snapshot is identified by a cryptographic hash (currently SHA-1 in a Git-compatible form, with plans to transition to SHA-256).

The archive collects code from multiple sources:

  • Git repositories: Mirrored from GitHub, GitLab, Bitbucket, and self-hosted instances
  • Package registries: Source packages from npm, PyPI, CRAN, and others
  • Legacy archives: GNU FTP, historical Debian archives, and other historical sources
  • Manual deposits: Researchers and organizations can deposit code directly

The key technical property is that every object in the archive is immutable and content-addressed. A given hash always resolves to the same content, and any change to the content produces a different hash. This provides the integrity guarantee that supply chain security requires.
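As a minimal sketch of what content addressing means for a single file: the identifier of a file's content is the SHA-1 of the bytes prefixed with a Git-style blob header, so anyone holding the bytes can recompute the identifier and check it. The function names below are illustrative, not part of any Software Heritage library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the content identifier for raw file bytes.

    The hash is computed the way Git hashes a blob:
    SHA-1 over b"blob <length>\\0" followed by the bytes themselves.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def matches(data: bytes, expected_swhid: str) -> bool:
    """Check that the bytes are exactly the content the identifier names."""
    return content_swhid(data) == expected_swhid


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
```

Because the identifier depends only on the bytes, a file fetched from a mirror, a registry, or the archive can be verified the same way.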

SWHIDs: Software Heritage Identifiers

Software Heritage defines a persistent identifier scheme called SWHIDs (Software Heritage Identifiers). A SWHID looks like:

swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

The format encodes the scheme version (1), the object type (cnt for content, dir for directory, rev for revision, rel for release, snp for snapshot), and the hash of the object. SWHIDs are intrinsic: they are derived from the content itself, not assigned by an authority. This means the same code will always have the same SWHID, regardless of where it is hosted.
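The structure of an identifier can be sketched with a small parser. This handles only the core form shown above (scheme, version, type, 40-character hex hash). It does not handle the optional qualifiers that can follow the core identifier, and the names `Swhid` and `parse_swhid` are hypothetical.

```python
import re
from typing import NamedTuple

# Short type codes used in the identifier, mapped to the object kinds.
OBJECT_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}

_CORE_SWHID = re.compile(
    r"swh:(?P<version>1):(?P<type>cnt|dir|rev|rel|snp):(?P<hash>[0-9a-f]{40})"
)


class Swhid(NamedTuple):
    version: int
    object_type: str
    object_hash: str


def parse_swhid(text: str) -> Swhid:
    """Split a core SWHID into its parts, or raise ValueError."""
    match = _CORE_SWHID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a core SWHID: {text!r}")
    return Swhid(
        version=int(match["version"]),
        object_type=OBJECT_TYPES[match["type"]],
        object_hash=match["hash"],
    )


if __name__ == "__main__":
    print(parse_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"))
```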

SWHIDs have been adopted by several standards and tools:

  • SPDX 2.3 and 3.0 support SWHIDs as package identifiers
  • CycloneDX supports SWHIDs in the external references section
  • CodeMeta (software metadata standard) uses SWHIDs for precise version identification

Supply Chain Security Implications

Reproducible Vulnerability Analysis

When a CVE is published for a specific version of a library, you need to inspect that exact version's source code to understand the vulnerability, determine exploitability, and verify the fix. If the source code is only available in the project's current repository, you depend on the project maintaining historical branches and tags.

Software Heritage provides an independent archive where any historical version can be retrieved. This is particularly valuable for:

  • Abandoned projects: If the maintainer has moved on and the repository is deleted, the archived source remains available
  • Disputed versions: If there is a question about what code was actually in a specific release, the archive provides an authoritative reference
  • Long-term analysis: Investigating vulnerabilities in old software that may still be deployed in legacy environments

SBOM Enrichment

SBOMs identify components by name and version, but they do not typically include the source code itself. By linking SBOM entries to Software Heritage identifiers, you create a verifiable reference to the exact source code in each component.

This enrichment enables:

  • Source code audit: For any component in your SBOM, retrieve the exact source code from Software Heritage and audit it
  • Diff analysis: Compare the source code of two versions to understand what changed when a vulnerability was introduced or fixed
  • Provenance verification: Verify that a compiled artifact was built from the source code it claims to be built from, by comparing against the archived source

Build Reproducibility

Reproducible builds verify that a binary artifact was produced from specific source code. Software Heritage provides the stable, content-addressed source reference that reproducible build verification requires. Even if the original source repository is modified or deleted, the archived source remains available as the ground truth.

Incident Response

During a supply chain incident, investigators need to analyze multiple versions of affected packages quickly. If the package has been removed from the registry or the repository has been deleted (which happens -- attackers sometimes clean up after themselves), Software Heritage may be the only source of the code.

Integrating Software Heritage into Your Workflow

SBOM Generation

When generating SBOMs, include Software Heritage identifiers for source-available components. Some SBOM tools are beginning to support this natively, and for others, a post-processing step can add SWHIDs based on the component's source repository URL and version.
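A post-processing step of this kind might look like the sketch below, which assumes an SPDX 2.x JSON document already loaded into Python dictionaries. It attaches a SWHID as an external reference in the PERSISTENT-ID category with reference type `swh`. Check the field values against the SPDX version your tooling emits. How the SWHID for each package is obtained is out of scope here, and the package shown is a made-up example.

```python
import json


def add_swhid_ref(package: dict, swhid: str) -> dict:
    """Attach a SWHID to an SPDX 2.x package entry as an external reference.

    The package dict is modified in place and returned for convenience.
    Adding the same SWHID twice does not create a duplicate entry.
    """
    entry = {
        "referenceCategory": "PERSISTENT-ID",
        "referenceType": "swh",
        "referenceLocator": swhid,
    }
    refs = package.setdefault("externalRefs", [])
    if entry not in refs:
        refs.append(entry)
    return package


if __name__ == "__main__":
    # Hypothetical package entry, for illustration only.
    pkg = {"SPDXID": "SPDXRef-Package-example", "name": "example", "versionInfo": "1.0.0"}
    add_swhid_ref(pkg, "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2")
    print(json.dumps(pkg, indent=2))
```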

Dependency Archival

Proactively trigger archival of your dependencies in Software Heritage. The "Save Code Now" feature allows you to request that a specific repository be archived. For critical dependencies, ensure that the specific versions you depend on are captured.
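Save Code Now requests can also be scripted. The sketch below only builds the HTTP request and does not send it. It assumes the public Web API exposes the save endpoint at the path shown, so confirm the path and any rate limits against the current API documentation before relying on it. The repository URL is a placeholder.

```python
import urllib.request

# Assumed base URL of the public Software Heritage Web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request(origin_url: str, visit_type: str = "git") -> urllib.request.Request:
    """Build (but do not send) a POST request asking the archive to
    visit the given repository URL."""
    url = f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"
    return urllib.request.Request(url, method="POST")


if __name__ == "__main__":
    # Placeholder origin; replace with a dependency's repository URL.
    req = save_request("https://example.org/project.git")
    print(req.get_method(), req.full_url)
    # To actually submit the request:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
```

A request only schedules a visit. It captures whatever the repository contains at that time, so versions that were already deleted upstream cannot be recovered this way.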

Source Verification

For high-assurance environments, use Software Heritage as an independent source of truth. Download a dependency's source code from the archive, build it yourself, and compare the result against the published binary. The comparison is only meaningful when the build is reproducible; otherwise differences in the output say nothing about the source. This provides an additional verification layer beyond package registry integrity.

Policy Requirements

Consider adding Software Heritage archival status to your dependency evaluation criteria. Components whose source code is archived have a preservation guarantee that non-archived components lack. This matters for long-lived systems where dependency continuity is critical.
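An archival-status check could be automated along these lines. The sketch assumes the Web API offers a bulk lookup endpoint that accepts a JSON list of SWHIDs and returns, for each one, an object with a boolean `known` field. Treat the endpoint path and response shape as assumptions to verify against the API documentation. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed bulk lookup endpoint of the public Web API.
KNOWN_ENDPOINT = "https://archive.softwareheritage.org/api/1/known/"


def known_request(swhids: list) -> urllib.request.Request:
    """Build (but do not send) a POST request asking which SWHIDs are archived."""
    body = json.dumps(list(swhids)).encode("utf-8")
    return urllib.request.Request(
        KNOWN_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unarchived(response: dict) -> list:
    """Given the decoded response, list the SWHIDs not reported as known."""
    return sorted(
        swhid for swhid, info in response.items() if not info.get("known", False)
    )


if __name__ == "__main__":
    ids = ["swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"]
    req = known_request(ids)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To run the check for real:
    # with urllib.request.urlopen(req) as resp:
    #     print(unarchived(json.load(resp)))
```

A policy gate could then fail or warn when `unarchived` returns a non-empty list for a release's dependencies.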

Challenges and Limitations

Coverage gaps: Software Heritage does not have everything. Private repositories, proprietary code, and some smaller hosting platforms may not be archived. The archive is continuously growing, but it is not comprehensive.

Timeliness: There is a delay between code being published and code being archived. For time-sensitive incident response, the latest version might not yet be in the archive.

Binary artifacts: Software Heritage archives source code, not compiled artifacts. If you need to analyze a specific binary (a .jar, a .whl, a compiled .so), the archive does not help directly -- you need the source and a reproducible build process.

Scale of integration: Adding SWHID support to existing SBOM tooling and workflows requires engineering effort. The standards support is there, but tooling adoption is still early.

The Bigger Picture

Software Heritage represents a fundamental shift in how we think about software as a shared resource. Source code is a form of knowledge, and like other forms of knowledge, it needs to be preserved for future access.

For supply chain security specifically, the archive provides a backstop against the fragility of the current ecosystem. Repositories disappear, registries have outages, and maintainers walk away. Having an independent, immutable, content-addressed archive of the world's source code is an insurance policy for the entire software supply chain.

How Safeguard.sh Helps

Safeguard.sh supports Software Heritage identifiers in its SBOM processing pipeline. When you upload an SBOM that includes SWHIDs, Safeguard.sh uses them to verify component identity against the archive, ensuring that the component you are tracking is exactly the source code you expect.

For components without SWHIDs, Safeguard.sh can resolve source repository references and check Software Heritage archival status, flagging dependencies whose source code has not been preserved. This adds a resilience dimension to your supply chain risk assessment, helping you identify components where source disappearance could leave you unable to audit, rebuild, or verify your software.
