SBOM generation from source code is a solved problem. If you have access to the project's manifest files -- package.json, requirements.txt, pom.xml, go.mod -- any of a dozen tools can produce a comprehensive SBOM in seconds. But a significant portion of the software deployed in enterprise environments does not come with source code.
Commercial off-the-shelf software arrives as installers, binaries, or container images. Firmware for network devices, IoT hardware, and industrial control systems is distributed as binary blobs. Legacy applications may have lost their build systems years ago. Vendor-supplied libraries are sometimes distributed as pre-compiled objects.
For these cases, binary SBOM analysis is the only option. The technology has been advancing steadily, driven by regulatory mandates (the EU CRA applies to all products with digital elements, not just those with available source) and the practical need to understand what is running in production.
How Binary SBOM Analysis Works
String Extraction and Pattern Matching
This is the simplest and most widely used technique. Compiled binaries often contain embedded strings that reveal their components: version strings ("OpenSSL 1.1.1w"), copyright notices, build identifiers, and library names. Extracting these strings and matching them against known component databases produces a rudimentary SBOM.
This technique is fast and works across all binary formats, but it has significant limitations:
- Not all components leave identifiable strings in the binary.
- String stripping (common in release builds) removes many identifiable markers.
- Version strings can be inaccurate if developers forgot to update them.
- The technique cannot distinguish between statically linked libraries and bundled third-party code.
Despite these limitations, string extraction catches a surprising amount. Studies have shown that string-based analysis can identify 60-70% of known components in typical compiled binaries.
File Hash Matching
For binaries that bundle shared libraries, plugins, or other files as distinct entities, file hash matching compares the hashes of extracted files against databases of known component versions.
This works well for:
- Dynamically linked shared libraries (.so, .dll, .dylib files).
- Bundled JavaScript, CSS, or font files in Electron applications.
- Java JAR files bundled within application archives.
- Python .pyc or .pyo files in frozen applications.
The effectiveness depends on whether the component files are distributed in their original form. Modified or stripped files will not match known hashes.
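A minimal sketch of the technique: hash every extracted file and look the digest up in a component database. The `KNOWN_HASHES` dictionary here is a hypothetical stand-in; real databases catalog millions of digests per ecosystem.

```python
import hashlib
from pathlib import Path

# Hypothetical database: SHA-256 digest -> (component name, version).
KNOWN_HASHES: dict[str, tuple[str, str]] = {}

def sha256_file(path: Path) -> str:
    """Compute the SHA-256 digest of a file, streaming in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def match_extracted_files(root: str) -> list[tuple[str, tuple[str, str]]]:
    """Walk an extracted archive and report files matching known components."""
    matches = []
    for p in Path(root).rglob("*"):
        if p.is_file():
            component = KNOWN_HASHES.get(sha256_file(p))
            if component:
                matches.append((str(p), component))
    return matches
```

Because the match is exact, a single modified byte breaks it, which is precisely why this technique only works for files distributed unmodified.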
Code Fingerprinting
More sophisticated than string or hash matching, code fingerprinting analyzes the actual code patterns in a binary to identify components.
Function-level fingerprinting computes signatures based on the control flow graph, instruction sequences, or other structural properties of individual functions. These signatures are compared against a database of known function signatures from open source libraries.
Snippet matching identifies code regions that match known open source code, even when the code has been compiled with different optimization levels, different compilers, or minor modifications.
Machine learning approaches train classifiers on features extracted from binary code to identify the likely source library. These approaches can handle compiler variations and optimizations that confuse simpler matching techniques.
Code fingerprinting is the most accurate approach for statically linked binaries where components are compiled directly into the executable. It is also the most computationally expensive and requires extensive reference databases of compiled library signatures.
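A toy illustration of function-level fingerprinting, assuming a disassembler has already produced per-function instruction lists: hash the mnemonic sequence while discarding operands, so the signature survives relocation addresses and register-allocation differences. The `SIGNATURE_DB` mapping is a hypothetical reference database; production systems use far more robust structural features.

```python
import hashlib

def function_fingerprint(instructions: list[tuple[str, str]]) -> str:
    """Fingerprint a function by its mnemonic sequence, ignoring operands.

    `instructions` is a list of (mnemonic, operands) tuples, assumed to come
    from a disassembler. Dropping operands makes the signature tolerant of
    address and register differences between builds.
    """
    mnemonics = " ".join(mnemonic for mnemonic, _ in instructions)
    return hashlib.sha256(mnemonics.encode()).hexdigest()[:16]

# Hypothetical reference database: fingerprint -> (library, function name).
SIGNATURE_DB: dict[str, tuple[str, str]] = {}

def identify_functions(binary_functions: dict) -> dict:
    """Map each function in the binary to a known library function, if any."""
    return {name: SIGNATURE_DB.get(function_fingerprint(insns))
            for name, insns in binary_functions.items()}
```

This naive scheme breaks as soon as the optimizer reorders or inlines instructions, which is why real fingerprinting relies on control-flow-graph structure rather than raw instruction order.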
Firmware Unpacking
Firmware images present a unique challenge. They are often compressed, encrypted, or stored in proprietary formats that must be unpacked before analysis can begin.
The firmware analysis pipeline typically follows these steps:
- Identify the firmware format: Is it a flat binary, a filesystem image, a compressed archive, or a layered container?
- Extract the filesystem: Tools like binwalk can identify and extract common firmware filesystem formats (squashfs, cramfs, JFFS2, ext4).
- Analyze extracted components: Once the filesystem is extracted, individual binaries, shared libraries, and configuration files can be analyzed using the techniques described above.
- Identify the base OS: Many firmware images are based on Linux distributions (OpenWrt, Yocto, Buildroot) whose package databases can be matched against extracted files.
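The first step, format identification, typically comes down to scanning for magic bytes at known offsets. A minimal sketch covering a few common formats:

```python
# Magic-byte signatures for common firmware container formats:
# (magic bytes, byte offset within the image, human-readable name).
MAGIC_SIGNATURES = [
    (b"hsqs", 0, "squashfs (little-endian)"),
    (b"sqsh", 0, "squashfs (big-endian)"),
    (b"\x1f\x8b", 0, "gzip-compressed data"),
    (b"\x45\x3d\xcd\x28", 0, "cramfs"),
    (b"ustar", 257, "tar archive"),
]

def identify_format(data: bytes) -> str:
    """Return the first matching container format, or a fallback label."""
    for magic, offset, name in MAGIC_SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown (flat binary or proprietary format)"
```

Tools like binwalk apply the same idea exhaustively, scanning every offset in the image for hundreds of signatures, because firmware frequently embeds one format inside another.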
Firmware analysis often reveals components that the device manufacturer may not have documented: outdated versions of OpenSSL, busybox with known vulnerabilities, or Linux kernel versions with unpatched CVEs.
Container Image Analysis
Container images are a more structured case. They consist of layered filesystem snapshots, and the base image layers typically correspond to known Linux distributions with package managers whose databases can be queried.
Binary SBOM analysis for containers involves:
- Extracting layers: Each layer is a tar archive that can be unpacked.
- Identifying the base OS: /etc/os-release or equivalent files identify the distribution and version.
- Querying package databases: dpkg, rpm, apk, and other package manager databases within the image list installed packages and versions.
- Analyzing application layers: Application-specific layers may contain compiled binaries, bundled libraries, or language-specific packages that require the other analysis techniques described above.
Container image analysis is the most mature form of binary SBOM analysis because the container format provides structure that raw binaries lack.
Accuracy Considerations
Known Unknowns
Every binary SBOM analysis technique has a detection rate less than 100%. The gap between identified components and actual components represents a known unknown. Organizations should treat binary-generated SBOMs as lower bounds -- the actual component inventory is at least this large, and likely larger.
Quality metrics for binary SBOMs should include:
- Identification confidence: How confident is the tool in each component identification? String-matched components with clear version strings are high confidence. Code-fingerprinted components with partial matches are lower confidence.
- Coverage estimate: What percentage of the binary's code was matched to known components? Large unmatched regions suggest unidentified components.
- Version precision: Was the exact version identified, or only a version range?
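These three metrics can be rolled up into a simple quality summary. The record shape, confidence threshold, and version-range convention below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class Identification:
    name: str
    version: str        # exact ("1.1.1w") or a range placeholder ("1.2.x")
    technique: str      # "string", "hash", or "fingerprint"
    confidence: float   # tool-assigned, 0.0 - 1.0

def sbom_quality_summary(identifications: list[Identification],
                         matched_bytes: int, total_bytes: int) -> dict:
    """Summarize confidence, coverage, and version precision for a binary SBOM."""
    high = [i for i in identifications if i.confidence >= 0.8]
    return {
        "components": len(identifications),
        "high_confidence": len(high),
        "coverage": matched_bytes / total_bytes if total_bytes else 0.0,
        "exact_versions": sum(1 for i in identifications if "x" not in i.version),
    }
```

A summary like this makes the "lower bound" framing concrete: 70% coverage means nearly a third of the binary's code was never attributed to any known component.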
False Positives
Binary analysis can produce false positives -- identifying components that are not actually present. This happens when:
- Generic code patterns (sorting algorithms, hash functions) match library fingerprints from multiple sources.
- Vendored code has diverged from its upstream source.
- String patterns match coincidentally.
False positives in a binary SBOM lead to unnecessary vulnerability investigation. While less dangerous than false negatives, they waste time and erode trust in the SBOM.
The Commercial Software Problem
When analyzing commercial software that you did not build, the accuracy of binary SBOM analysis is your only source of component information. You cannot verify against a manifest file because you do not have one. This creates an uncomfortable situation where vulnerability management decisions are based on component identifications of uncertain accuracy.
The practical mitigation is to use multiple analysis techniques and correlate results. A component identified by both string matching and code fingerprinting is more likely correct than one identified by a single technique.
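That correlation step is essentially a voting scheme across techniques. A minimal sketch, assuming each analysis technique emits a set of (component, version) identifications:

```python
from collections import defaultdict

def correlate(results_by_technique: dict[str, set]) -> dict:
    """Label each identified component by how many techniques agree on it.

    `results_by_technique` maps a technique name ("string", "hash",
    "fingerprint") to the set of (component, version) tuples it identified.
    """
    votes = defaultdict(set)
    for technique, components in results_by_technique.items():
        for component in components:
            votes[component].add(technique)
    # Components seen by two or more independent techniques are corroborated;
    # single-source identifications deserve manual verification first.
    return {component: ("corroborated" if len(techniques) >= 2
                        else "single-source")
            for component, techniques in votes.items()}
```

The labels here are illustrative; the useful property is the ranking itself, which tells an analyst where to spend verification effort.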
Regulatory Drivers
EU Cyber Resilience Act
The CRA requires manufacturers to provide SBOMs for products with digital elements. For products that include third-party binary components, the manufacturer must either obtain SBOMs from their suppliers or generate SBOMs from binary analysis. This creates a supply chain incentive: vendors who proactively provide SBOMs for their binary components reduce the burden on their customers.
FDA Medical Device Guidance
Medical device manufacturers are required to provide SBOMs for premarket submissions. Many medical devices include commercial firmware, RTOS components, and third-party libraries available only in binary form. Binary SBOM analysis is the primary method for documenting these components.
Government Procurement
US federal procurement increasingly requires SBOMs from software vendors. Agencies receiving binary software without source code may use binary SBOM analysis to verify vendor-provided SBOMs or to generate their own component inventories.
How Safeguard.sh Helps
Safeguard supports binary SBOM analysis for container images, compiled applications, and firmware images. The platform combines multiple identification techniques -- string extraction, file hash matching, and package database queries -- to produce the most complete component inventory possible from binary artifacts.
Binary-generated SBOMs are treated as first-class SBOMs within the platform, subject to the same continuous vulnerability monitoring, policy evaluation, and VEX enrichment as source-generated SBOMs. Confidence scores for each component identification help teams understand the reliability of their binary SBOMs and prioritize verification efforts for low-confidence identifications.
For organizations navigating CRA compliance or federal procurement requirements for software they consume rather than build, Safeguard's binary analysis capabilities provide the component visibility that regulatory mandates demand.