Every supply chain security platform assigns risk scores. A package gets a number -- 3.2 out of 10, or "High Risk," or a letter grade. The number is supposed to collapse the complexity of supply chain risk into something actionable. Sometimes it does. Often it obscures more than it reveals.
Understanding how these scores are calculated is necessary for interpreting them correctly. A risk score is not a measurement like temperature -- it is a model output based on weighted inputs and assumptions. Changing the weights or the inputs changes the score. Two platforms can score the same package differently because they use different models.
Common Scoring Inputs
Most supply chain risk scoring algorithms use some combination of these input categories:
Known vulnerabilities. The most straightforward input. Does the package or its dependencies have published CVEs? What is the severity (CVSS score) of those CVEs? Are exploits publicly available? This is the input with the strongest signal-to-noise ratio because CVEs are concrete, documented, and independently verified.
Maintainer health. How many maintainers does the package have? When was the last commit? How frequently are issues addressed? Is the maintainer responsive to security reports? A package maintained by a single developer who has not committed in 18 months carries different risk than a package maintained by a team with weekly releases.
Dependency depth and breadth. How many transitive dependencies does the package pull in? Each dependency is a potential vulnerability source. A package with 3 dependencies is inherently less risky than one with 300, all else being equal.
Code quality signals. Does the package have tests? Does it use linting? Does it have a security policy? Are there signed releases? These are proxy indicators -- they do not directly measure security but correlate with development practices that reduce vulnerabilities.
Publication metadata. When was the package first published? How many versions have been released? What is the download trend? New packages with no history are riskier than established packages with stable release cadences.
License compliance. Certain licenses create legal risk that intersects with security risk. A package with no license or a restrictive license may be unmaintained or may have usage restrictions that affect your ability to patch vulnerabilities.
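These input categories can be pictured as a raw signal record that a scorer normalizes before weighting. A minimal sketch, with invented field names and arbitrary scaling constants (the 540-day cutoff echoes the "18 months" example above, the 300-dependency cap echoes the depth example):

```python
from dataclasses import dataclass

@dataclass
class PackageSignals:
    """Raw inputs a scorer might collect for one package (illustrative fields)."""
    critical_cves: int
    days_since_last_commit: int
    maintainer_count: int
    transitive_deps: int
    has_tests: bool
    has_security_policy: bool
    days_since_first_publish: int

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def normalize(sig: PackageSignals) -> dict[str, float]:
    """Map each raw signal onto 0..1, where 1.0 is the riskier end.
    All scaling constants here are arbitrary illustrative choices."""
    return {
        "vulnerabilities": clamp01(sig.critical_cves / 3),
        # Dormancy is riskier when a single maintainer carries the package.
        "maintainer_health": clamp01(sig.days_since_last_commit / 540)
                             * (1.0 if sig.maintainer_count <= 1 else 0.5),
        "dependency_load": clamp01(sig.transitive_deps / 300),
        "code_quality": 0.0 if (sig.has_tests and sig.has_security_policy) else 0.5,
        "maturity": clamp01(1 - sig.days_since_first_publish / 365),
    }
```

Whatever the model that consumes them, normalization choices like these already embed judgment calls about what counts as risky.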
Scoring Models
Weighted additive models are the simplest. Each input category gets a weight, the inputs are normalized to a common scale, and the weighted sum produces the score. OpenSSF Scorecard uses a variant of this approach. The advantage is transparency -- you can see exactly which factors contributed to the score. The disadvantage is that additive models can produce misleading results when factors interact. A package with excellent maintainer health but a critical unpatched CVE should not score well just because the maintainer health score offsets the vulnerability score.
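The compensation problem is easy to see in a toy weighted-sum model. The weights below are illustrative, not OpenSSF Scorecard's or any real platform's; each factor is on a 0-10 scale where 10 is safest:

```python
# Illustrative weights -- not from any real scoring platform.
WEIGHTS = {
    "vulnerabilities": 0.40,
    "maintainer_health": 0.30,
    "code_quality": 0.20,
    "dependency_load": 0.10,
}

def additive_score(factors: dict[str, float]) -> float:
    """Weighted sum of pre-normalized factor scores (0..10, 10 = safest)."""
    return sum(WEIGHTS[name] * value for name, value in factors.items())

# The compensation problem: a critical unpatched CVE (vulnerabilities=0.0)
# is offset by perfect scores elsewhere, yielding a middling overall score.
risky = {"vulnerabilities": 0.0, "maintainer_health": 10.0,
         "code_quality": 10.0, "dependency_load": 10.0}
print(additive_score(risky))  # 6.0 -- looks acceptable despite a critical CVE
```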
Threshold models categorize packages into risk tiers based on binary criteria. If a package has any critical CVE, it is high risk regardless of other factors. If it has no CVEs and meets minimum maintainer health criteria, it is low risk. This avoids the compensation problem of additive models but loses nuance.
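A threshold model can be sketched as a sequence of binary gates evaluated in severity order, so no factor can buy back a failed gate. The cutoffs here are invented for illustration:

```python
def tier(critical_cves: int, high_cves: int,
         days_since_commit: int, maintainers: int) -> str:
    """Assign a risk tier via binary gates; thresholds are illustrative."""
    if critical_cves > 0:
        return "high"      # any critical CVE dominates all other factors
    if high_cves > 0 or maintainers == 0:
        return "medium"
    if days_since_commit <= 365:
        return "low"
    return "medium"        # no CVEs, but dormant maintenance
```

Note what is lost: a package one day past the commit cutoff lands in the same tier as one with an unpatched high-severity CVE.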
Probabilistic models estimate the probability that a package will be involved in a security incident within a given time window. These models use historical data -- which packages have been compromised, what characteristics did they share before compromise -- to predict future risk. They require large training datasets and are sensitive to the base rate problem (compromises are rare events, making prediction difficult).
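One common shape for such a model is logistic regression over pre-compromise characteristics. The coefficients below are hand-set stand-ins for values a real model would learn from historical data; note how the large negative intercept encodes the low base rate of compromise:

```python
import math

# Hand-set stand-ins for fitted coefficients; feature names are invented.
INTERCEPT = -6.0   # encodes the low base rate: P ~ 0.25% with no risk features
COEFFS = {
    "new_maintainer_added": 1.8,
    "install_script_added": 2.4,
    "large_version_jump": 0.9,
}

def compromise_probability(features: dict[str, float]) -> float:
    """Estimate P(security incident within the time window) via a
    logistic function over weighted binary features."""
    z = INTERCEPT + sum(COEFFS[name] * v for name, v in features.items())
    return 1 / (1 + math.exp(-z))
```

Even with every risk feature present, the predicted probability stays well under 50% here, which is the base-rate problem in miniature: rare events keep predicted probabilities low, so these models are better at ranking packages than at producing calibrated alarms.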
Graph-based models consider the package's position in the dependency graph. A vulnerability in a package depended on by 10,000 other packages has a larger blast radius than the same vulnerability in a package with 3 dependents. These models capture systemic risk that package-level scoring misses.
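The blast-radius idea reduces to a reachability count over the reverse dependency graph. A minimal sketch, with an invented graph (package names chosen for illustration only):

```python
from collections import deque

def blast_radius(dependents: dict[str, list[str]], package: str) -> int:
    """Count packages that transitively depend on `package`.
    `dependents` maps each package to its direct dependents."""
    seen: set[str] = set()
    queue = deque([package])
    while queue:                       # breadth-first walk of dependents
        for dep in dependents.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return len(seen)

# Hypothetical graph: a vulnerability in "util-lib" reaches every app.
graph = {
    "util-lib": ["build-tool", "web-framework"],
    "build-tool": ["app-a"],
    "web-framework": ["app-a", "app-b"],
}
```

A graph-based scorer would weight the same CVE more heavily in "util-lib" (four transitive dependents here) than in a leaf package with none.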
Where Scoring Fails
The zero-day blind spot. Risk scores based on known vulnerabilities cannot account for vulnerabilities that have not been discovered. A package with zero CVEs might have zero CVEs because it is secure, or because nobody has audited it. Scoring algorithms that rely heavily on CVE counts systematically underestimate risk for unaudited packages.
Maintainer health is not security. A responsive maintainer with frequent commits might be introducing vulnerabilities with every commit. A dormant project might be stable and secure precisely because it is feature-complete and does not need changes. Maintainer health correlates with the ability to respond to security issues, not with the absence of security issues.
The popularity bias. Popular packages are more scrutinized, which means more vulnerabilities are discovered, which means higher CVE counts, which some scoring models interpret as higher risk. In reality, a popular package with 50 patched CVEs might be more secure than an obscure package with zero CVEs and no security audit history.
Gaming the score. When developers know which factors contribute to a risk score, they optimize for those factors. Adding a SECURITY.md file, enabling a linter, publishing frequently -- these actions improve scores without necessarily improving security. Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure.
Context blindness. A risk score for a package does not account for how you use the package. A cryptographic library with a vulnerability in its RSA implementation is high risk if you call its RSA functions and largely irrelevant if you only use its AES functions. Context-free scoring cannot capture this.
Effective Use of Risk Scores
Despite their limitations, risk scores are useful when interpreted correctly.
Use scores for prioritization, not decisions. A score tells you which packages to investigate first, not what action to take. The highest-risk packages deserve human analysis. The lowest-risk packages can be deferred. But do not auto-block or auto-approve based solely on scores.
Understand your platform's model. Know which inputs your scoring platform uses and how it weights them. If the platform weights maintainer health heavily, understand that dormant-but-stable packages will score poorly. Adjust your interpretation accordingly.
Combine multiple scoring sources. Different platforms use different models with different blind spots. A package that scores well on one platform and poorly on another deserves investigation. The disagreement itself is a signal.
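Treating disagreement as a signal can itself be automated. A trivial sketch, assuming you have already normalized each platform's output onto a common 0..1 scale (each platform's native scale differs in practice, and the spread threshold is arbitrary):

```python
def needs_review(platform_scores: dict[str, float],
                 spread_threshold: float = 0.3) -> bool:
    """Flag a package for human review when normalized scores (0..1)
    from different platforms disagree by more than the threshold."""
    values = list(platform_scores.values())
    return max(values) - min(values) > spread_threshold
```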
Track score changes, not absolute values. A package that drops from 7.0 to 4.0 between versions has changed in a way that warrants investigation, regardless of whether 4.0 crosses your risk threshold. Score deltas often provide more actionable information than absolute scores.
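Delta tracking is simple to implement once you retain score history per package. A sketch, assuming a 0-10 scale and an arbitrary drop threshold:

```python
def score_alerts(history: dict[str, list[float]],
                 drop_threshold: float = 2.0) -> list[str]:
    """Flag packages whose score dropped sharply between consecutive
    observations, regardless of the absolute value reached."""
    flagged = []
    for pkg, scores in history.items():
        for prev, cur in zip(scores, scores[1:]):
            if prev - cur >= drop_threshold:
                flagged.append(pkg)
                break
    return flagged

history = {
    "pkg-a": [7.0, 6.8, 4.0],   # sharp drop between releases -> investigate
    "pkg-b": [4.0, 4.1, 4.0],   # low but stable -> no alert
}
```

Note that "pkg-b" sits below "pkg-a"'s final score the whole time yet raises no alert: the delta, not the level, carries the signal.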
Supplement with manual review. For critical dependencies -- the packages that your application cannot function without -- supplement automated scoring with periodic manual security review. No scoring algorithm replaces human analysis for your most important dependencies.
Building Your Own Scoring Framework
If existing scoring platforms do not fit your risk model, you can build a custom scoring framework that reflects your organization's specific risk tolerances.
Define your input categories based on what matters for your context. A financial services company might weight license compliance and vulnerability patching speed more heavily than a startup. An organization with air-gapped deployments might weight dependency count more heavily because each dependency is a patch management burden.
Weight the inputs based on historical incident data if available, or based on expert judgment if not. Document the weights and the rationale. Review and adjust annually.
Set thresholds that trigger actions -- not just scores that populate dashboards. A score below X requires a security review before the dependency can be added. A score that drops by more than Y between versions triggers an investigation.
How Safeguard.sh Helps
Safeguard.sh incorporates multi-factor risk scoring that evaluates vulnerability data, maintainer health, dependency complexity, and behavioral signals in a unified model. Its scoring is transparent -- teams can see which factors contributed to a package's risk assessment and adjust policy thresholds based on their risk tolerance. Continuous monitoring tracks score changes across every dependency, alerting teams when a previously low-risk package deteriorates. For organizations that need risk scoring they can trust and act on, Safeguard.sh provides the data and the policy enforcement layer to make scores operational.