LinkedIn Data Scraping: 700 Million User Records Sold on the Dark Web

A threat actor scraped data from 700 million LinkedIn users — 93% of the platform's user base — and put it up for sale, reigniting the debate over API abuse and data privacy.

Bob
Cybersecurity Analyst

In June 2021, a user on a popular hacking forum advertised a dataset containing records from 700 million LinkedIn users. LinkedIn had approximately 756 million users at the time, so the dataset covered roughly 93% of the platform's membership (700 / 756 ≈ 0.93). A sample of one million records was posted as proof, and analysis confirmed the data was legitimate and up to date.

This was the second major LinkedIn scraping incident of 2021, coming about two months after a dataset of 500 million records appeared in April.

What Was Scraped

The dataset included:

  • Full names
  • Email addresses
  • Phone numbers
  • Physical addresses
  • Geolocation records
  • LinkedIn usernames and profile URLs
  • Personal and professional experience
  • Gender
  • Other social media account details

Notably, the dataset did not include passwords or financial data. It consisted entirely of information that was technically "visible" through LinkedIn's platform, though not in aggregated form and not intended for mass download.

The seller, using the handle "TomLiner," claimed to have obtained the data through LinkedIn's API. They offered the full dataset for $5,000.

How the Scraping Worked

LinkedIn's API, like those of many social platforms, provided endpoints that returned user profile information. While individual queries were legitimate — returning data that was already visible on the platform — automated tools could iterate through user IDs or search parameters at scale, collecting millions of records over time.

The scraper likely combined several techniques:

  • API enumeration — systematically querying LinkedIn's API with sequential or predictable user identifiers
  • Search result harvesting — automating searches across different criteria (company, location, title) and collecting all returned profiles
  • Data enrichment — combining LinkedIn profile data with information from other sources to build more complete records
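
From the defender's side, the first technique leaves a recognizable trace in access logs. The following is a minimal sketch, not LinkedIn's actual detection logic: it assumes a hypothetical log of (client_id, profile_id) pairs with integer profile IDs, and flags clients whose requests mostly step through IDs one at a time. The function names and thresholds are illustrative assumptions.

```python
from collections import defaultdict


def sequential_run_ratio(ids):
    """Fraction of consecutive requests whose ID is exactly previous + 1."""
    if len(ids) < 2:
        return 0.0
    steps = sum(1 for prev, cur in zip(ids, ids[1:]) if cur - prev == 1)
    return steps / (len(ids) - 1)


def flag_enumerators(log, min_requests=5, threshold=0.8):
    """Return clients whose profile lookups look like sequential enumeration.

    log: iterable of (client_id, profile_id) tuples in arrival order.
    min_requests and threshold are illustrative values, not tuned ones.
    """
    per_client = defaultdict(list)
    for client_id, profile_id in log:
        per_client[client_id].append(profile_id)

    return sorted(
        client_id
        for client_id, ids in per_client.items()
        if len(ids) >= min_requests and sequential_run_ratio(ids) >= threshold
    )


# Hypothetical log: one client walks IDs 100..109, another browses at random.
log = [("client-a", i) for i in range(100, 110)]
log += [("client-b", i) for i in (5, 900, 42, 17, 300, 8)]
flagged = flag_enumerators(log)
```

A real scraper would likely randomize its order or spread requests across many accounts, so a check like this is only one signal among several.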

LinkedIn's official response stated: "This was not a LinkedIn data breach, and no private member account data from LinkedIn was included in what we've been able to review. Any misuse of our members' data, such as scraping, violates LinkedIn terms of service."

The Breach vs. Scraping Debate (Again)

Just as with the Facebook 533-million incident, the characterization of this event became contentious. LinkedIn maintained this was not a breach because:

  • No systems were compromised
  • No authentication was bypassed
  • The data was already "publicly available" on the platform
  • No passwords or private messages were exposed

Critics argued that this framing missed the point entirely:

  • Users did not consent to their data being collected in bulk and sold
  • The aggregation of data creates risks that individual profile views do not
  • Rate limiting and access controls should have prevented this scale of collection
  • "Public on LinkedIn" does not mean "public for any purpose"

The Italian Data Protection Authority (Garante) opened an investigation. Several class-action lawsuits were filed in the United States.

Why Scraped Data Is Dangerous

The combination of professional and personal information in the LinkedIn dataset makes it exceptionally useful for attackers:

Spear phishing. With details about a person's employer, role, colleagues, and professional background, attackers can craft highly convincing phishing emails, for example: "Hi [Name], I saw your talk at [Conference]. I'm working on a similar project at [Company]..."

Business Email Compromise (BEC). Knowing reporting structures and executive identities enables impersonation attacks. An attacker who knows who reports to whom can send a convincing "urgent wire transfer" email from a spoofed executive address.

Credential stuffing. Email addresses from the LinkedIn dump can be cross-referenced with passwords from other breaches. People who reuse passwords are immediately vulnerable.

Social engineering for account takeover. Knowing a target's phone number, email, employer, and personal details gives an attacker everything needed for convincing social engineering calls to customer support teams.

Recruitment scams. When a target's employment history and skills are known, fake job offers become trivially easy to personalize.

The Platform Responsibility Question

The recurring pattern of mass scraping incidents across major platforms — Facebook, LinkedIn, Twitter, Clubhouse — raises a fundamental question about platform responsibility.

When a platform collects hundreds of millions of user profiles and makes them queryable through APIs and search functions, what level of protection is it obligated to provide? The "it's not a breach, it's scraping" response effectively shifts responsibility to users for having put their information on the platform in the first place.

Effective anti-scraping measures exist but require investment:

  • Behavioral analysis to distinguish automated access from human browsing patterns
  • Token-based rate limiting tied to authenticated sessions with strict quotas
  • Honeypot profiles that trigger alerts when accessed by scrapers
  • Data minimization in API responses — returning only what the requester needs
  • Legal enforcement against scraping operations, though jurisdiction makes this challenging
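
Three of these measures are simple enough to sketch in a few lines. The example below is an illustration under stated assumptions, not a description of any platform's implementation: the per-session quota uses a sliding window (one of several ways to implement rate limiting), and the class name, decoy profile IDs, and field allow-list are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SessionQuota:
    """Sliding-window request quota keyed by an authenticated session token."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)

    def allow(self, session_token):
        now = self.clock()
        hits = self._hits[session_token]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical decoy profiles that no legitimate user has a reason to open.
HONEYPOT_PROFILE_IDS = frozenset({"decoy-001", "decoy-002"})


def is_honeypot_hit(profile_id):
    """True if the requested profile is a decoy and should raise an alert."""
    return profile_id in HONEYPOT_PROFILE_IDS


# Hypothetical allow-list of fields an API response may include.
PUBLIC_FIELDS = ("name", "headline")


def minimize(profile, allowed=PUBLIC_FIELDS):
    """Return only the allow-listed fields of a profile record."""
    return {key: profile[key] for key in allowed if key in profile}
```

An allow-list is used for minimization rather than a block-list so that a newly added profile field stays out of responses until someone decides to expose it.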

How Safeguard.sh Helps

Safeguard.sh monitors your organization's exposure across breach and scraping datasets, alerting you when employee data appears in newly circulated dumps. This early warning is critical because scraped professional data fuels the kind of targeted phishing and BEC attacks that bypass technical controls. Our platform also helps enforce security policies that account for the reality of widespread data exposure: requiring phishing-resistant MFA, monitoring for impersonation attempts, and tracking which third-party platforms your organization's data flows through. When data on 93% of a professional network's members is for sale, assuming your people are in the dataset is the only safe posture.
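
As a rough illustration of what an exposure check involves, the sketch below matches records from a dump sample against a known employee list and a set of company email domains. It is a generic example and does not represent the Safeguard.sh API; the record layout (a dict with an "email" key) and the function names are assumptions.

```python
def normalize_email(raw):
    """Lowercase and trim an email address so comparisons are consistent."""
    return raw.strip().lower()


def find_exposed(dump_records, employee_emails, company_domains):
    """Split dump emails into exact employee matches and same-domain matches.

    dump_records: iterable of dicts, each optionally holding an "email" key.
    """
    known = {normalize_email(e) for e in employee_emails}
    domains = {d.lower() for d in company_domains}
    exact, domain_only = set(), set()

    for record in dump_records:
        email = normalize_email(record.get("email") or "")
        if "@" not in email:
            continue  # skip records without a usable address
        if email in known:
            exact.add(email)
        elif email.rsplit("@", 1)[1] in domains:
            domain_only.add(email)

    return {"exact": sorted(exact), "domain_only": sorted(domain_only)}


# Hypothetical dump sample and employee directory.
dump = [
    {"email": " Alice@Example.com "},
    {"email": "ghost@example.com"},
    {"email": "x@other.org"},
    {"name": "record without an email"},
]
result = find_exposed(dump, ["alice@example.com"], ["example.com"])
```

Same-domain addresses that are not in the directory are reported separately because they may belong to former employees or shared mailboxes, which can still be used for impersonation.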
