In June 2021, a user on a popular hacking forum advertised a dataset containing records from 700 million LinkedIn users. Given that LinkedIn had approximately 756 million users at the time, this represented roughly 93% of the platform's user base. A sample of one million records was posted as proof, and analysis confirmed the data was legitimate and up to date.
This was the second major LinkedIn scraping incident of 2021 — just months after a dataset of 500 million records had appeared in April.
What Was Scraped
The dataset included:
- Full names
- Email addresses
- Phone numbers
- Physical addresses
- Geolocation records
- LinkedIn usernames and profile URLs
- Personal and professional experience
- Gender
- Other social media account details
Notably, the dataset did not include passwords or financial data. It consisted entirely of information that was technically "visible" through LinkedIn's platform, though not in aggregated form and not intended for mass download.
The seller, using the handle "TomLiner," claimed to have obtained the data through LinkedIn's API. They offered the full dataset for $5,000.
How the Scraping Worked
LinkedIn's API, like those of many social platforms, provided endpoints that returned user profile information. While individual queries were legitimate — returning data that was already visible on the platform — automated tools could iterate through user IDs or search parameters at scale, collecting millions of records over time.
The scraper likely combined several techniques:
- API enumeration — systematically querying LinkedIn's API with sequential or predictable user identifiers
- Search result harvesting — automating searches across different criteria (company, location, title) and collecting all returned profiles
- Data enrichment — combining LinkedIn profile data with information from other sources to build more complete records
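From the defender's side, the enumeration pattern in the first bullet tends to leave a recognizable trace in access logs: one client requesting profile IDs that climb by one each time. The following is a minimal sketch of that check, not a description of LinkedIn's actual tooling. The log format (`client_id`, numeric `profile_id`), the function names, and the thresholds are all assumptions made for illustration.

```python
from collections import defaultdict


def sequential_ratio(ids):
    """Fraction of consecutive requests whose ID is exactly the previous ID + 1."""
    if len(ids) < 2:
        return 0.0
    steps = sum(1 for prev, cur in zip(ids, ids[1:]) if cur - prev == 1)
    return steps / (len(ids) - 1)


def flag_enumerators(log, min_requests=20, threshold=0.8):
    """Return clients whose request history looks like sequential ID enumeration.

    log: iterable of (client_id, profile_id) tuples in request order.
    min_requests and threshold are illustrative values, not tuned ones.
    """
    by_client = defaultdict(list)
    for client, profile_id in log:
        by_client[client].append(profile_id)
    return sorted(
        client
        for client, ids in by_client.items()
        if len(ids) >= min_requests and sequential_ratio(ids) >= threshold
    )


if __name__ == "__main__":
    log = [("scraper", 1000 + i) for i in range(50)]
    log += [("human", pid) for pid in (5, 902, 17, 4410, 33)]
    print(flag_enumerators(log))
```

A real scraper can shuffle IDs or spread requests across many accounts, so a check like this only catches the naive case. It is a starting signal, not a complete defense.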
LinkedIn's official response stated: "This was not a LinkedIn data breach, and no private member account data from LinkedIn was included in what we've been able to review. Any misuse of our members' data, such as scraping, violates LinkedIn terms of service."
The Breach vs. Scraping Debate (Again)
Just as with the Facebook 533-million incident, the characterization of this event became contentious. LinkedIn maintained this was not a breach because:
- No systems were compromised
- No authentication was bypassed
- The data was already "publicly available" on the platform
- No passwords or private messages were exposed
Critics argued that this framing missed the point entirely:
- Users did not consent to their data being collected in bulk and sold
- The aggregation of data creates risks that individual profile views do not
- Rate limiting and access controls should have prevented this scale of collection
- "Public on LinkedIn" does not mean "public for any purpose"
The Italian Data Protection Authority (Garante) opened an investigation. Several class-action lawsuits were filed in the United States.
Why Scraped Data Is Dangerous
The combination of professional and personal information in the LinkedIn dataset makes it exceptionally useful for attackers:
Spear phishing. With details about a person's employer, role, colleagues, and professional background, attackers can craft highly convincing phishing emails, for example: "Hi [Name], I saw your talk at [Conference]. I'm working on a similar project at [Company]..."
Business Email Compromise (BEC). Knowing reporting structures and executive identities enables impersonation attacks. An attacker who knows who reports to whom can send a convincing "urgent wire transfer" email from a spoofed executive address.
Credential stuffing. Email addresses from the LinkedIn dump can be cross-referenced with passwords from other breaches. People who reuse passwords are immediately vulnerable.
Social engineering for account takeover. Knowing a target's phone number, email, employer, and personal details gives an attacker everything needed for convincing social engineering calls to customer support teams.
Recruitment scams. Fake job offers targeting professionals whose employment history and skills are known become trivially easy to personalize.
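The credential stuffing risk above comes down to a set intersection: addresses that appear in the scraped dataset and also in an unrelated password breach. A security team can run the same cross-reference defensively to decide whose passwords to reset first. This is a rough sketch under stated assumptions: the function name, the inputs (plain lists of email addresses), and the domain filter are hypothetical, and real breach corpora would need far more careful handling.

```python
def at_risk_accounts(scraped_emails, breached_emails, org_domain):
    """Return org addresses present in both a scraped dataset and a password breach.

    All three parameters are illustrative: two iterables of email strings
    and the organization's mail domain, e.g. "example.com".
    """
    def normalize(email):
        # Case and stray whitespace should not hide a match.
        return email.strip().lower()

    scraped = {normalize(e) for e in scraped_emails}
    breached = {normalize(e) for e in breached_emails}
    suffix = "@" + org_domain.lower()
    return sorted(e for e in scraped & breached if e.endswith(suffix))


if __name__ == "__main__":
    scraped = ["Alice@Example.com", "bob@example.com", "carol@other.org"]
    breached = ["alice@example.com ", "carol@other.org"]
    print(at_risk_accounts(scraped, breached, "example.com"))
```

An address in the result does not prove a password was reused. It only marks accounts where reuse would be immediately exploitable.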
The Platform Responsibility Question
The recurring pattern of mass scraping incidents across major platforms — Facebook, LinkedIn, Twitter, Clubhouse — raises a fundamental question about platform responsibility.
When a platform collects hundreds of millions of user profiles and makes them queryable through APIs and search functions, what level of protection is it obligated to provide? The "it's not a breach, it's scraping" response effectively shifts responsibility to users for having put their information on the platform in the first place.
Effective anti-scraping measures exist but require investment:
- Behavioral analysis to distinguish automated access from human browsing patterns
- Token-based rate limiting tied to authenticated sessions with strict quotas
- Honeypot profiles that trigger alerts when accessed by scrapers
- Data minimization in API responses — returning only what the requester needs
- Legal enforcement against scraping operations, though jurisdiction makes this challenging
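Three of the measures above (session-bound rate limiting, honeypot profiles, and data minimization) can be sketched together in a few dozen lines. The example below is a simplified illustration, not how any particular platform implements them. The class names, the honeypot IDs, the field allow-list, and the quota values are all assumptions. Behavioral analysis and legal enforcement are outside what a short example can show.

```python
import time


class TokenBucket:
    """Classic token bucket: up to `capacity` requests, refilled over time."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical decoy profiles that no legitimate user has a reason to open.
HONEYPOT_IDS = {"decoy-001", "decoy-002"}
# Hypothetical allow-list: the only fields this endpoint ever returns.
PUBLIC_FIELDS = ("name", "headline")


class ProfileGateway:
    """Wraps profile lookups with per-session quotas and honeypot blocking."""

    def __init__(self, capacity=5, refill_per_sec=1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}     # session_id -> TokenBucket
        self.blocked = set()  # sessions that touched a honeypot

    def fetch(self, session_id, profile_id, store, now=None):
        """Return (status_code, body) for an authenticated session's request."""
        if session_id in self.blocked:
            return 403, None
        if profile_id in HONEYPOT_IDS:
            self.blocked.add(session_id)  # a real system would also raise an alert
            return 403, None
        bucket = self.buckets.get(session_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self.buckets[session_id] = bucket
        if not bucket.allow(now):
            return 429, None
        record = store.get(profile_id)
        if record is None:
            return 404, None
        # Data minimization: drop everything not on the allow-list.
        return 200, {k: record[k] for k in PUBLIC_FIELDS if k in record}
```

The `now` parameter exists only so the quota logic can be exercised with fixed timestamps; in production the monotonic clock would be used. Keying the bucket on the authenticated session rather than the IP address matters, because scrapers routinely rotate IPs.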
How Safeguard.sh Helps
Safeguard.sh monitors your organization's exposure across breach and scraping datasets, alerting you when employee data appears in newly circulated dumps. This early warning is critical because scraped professional data fuels the kind of targeted phishing and BEC attacks that bypass technical controls. Our platform also helps enforce security policies that account for the reality of widespread data exposure — requiring phishing-resistant MFA, monitoring for impersonation attempts, and tracking which third-party platforms your organization's data flows through. When 93% of a professional network's data is for sale, assuming your people are in the dataset is the only safe posture.