Incident Analysis

DeepSeek ClickHouse Exposure: When the AI Vendor Forgets the Database

In January 2025 Wiz Research found a wide-open ClickHouse instance belonging to AI startup DeepSeek, leaking chat history, API keys, and internal log streams. We unpack the AI-supply-chain implications.

Nayan Dey
Security Researcher
7 min read

On January 29, 2025, Wiz Research published a disclosure detailing an unauthenticated ClickHouse database belonging to the Chinese AI startup DeepSeek, accessible on the public internet at oauth2callback.deepseek.com:9000 and dev.deepseek.com:9000. The exposed cluster permitted arbitrary SQL queries with full control over database operations and contained over a million log lines, including end-user chat history, plaintext API keys, backend system logs, and operational metadata. The exposure occurred at the height of DeepSeek's first viral month (the company's chat application briefly topped the US Apple App Store in late January 2025), and Wiz disclosed responsibly to DeepSeek, which secured the cluster the same day. The incident is the canonical example of the modern AI supply chain's weakest link: not the model, not the inference path, but the analytics database that everyone forgets to lock down. For enterprise security leaders, the most uncomfortable implication is that the prompts employees, contractors, and customers paste into AI provider chat interfaces frequently sit in long-lived logging pipelines at vendors that have not yet built the security-engineering muscle to protect them.

Who is DeepSeek and what did Wiz find?

DeepSeek is a Hangzhou-based AI lab spun out from quantitative trading firm High-Flyer that released the R1 reasoning model in late January 2025. Wiz Research, while routinely scanning attack surfaces of new AI providers, identified the exposed ClickHouse cluster through standard internet reconnaissance — a /play HTTP interface bound to port 8123 and the native protocol on 9000 — with no authentication required. The researchers gained immediate access to internal databases including a log_stream table that contained chat content, internal API references, and metadata describing the DeepSeek inference architecture.

What did the exposed data actually contain?

According to Wiz's published analysis, the exposed databases included over one million log entries spanning at least January 6 to January 29, 2025. Visible content categories included end-user chat prompts and model responses, DeepSeek-issued API keys for paying customers and developers, internal service identifiers and backend hostnames, references to plaintext authentication material and operational tokens, and detailed information that would have enabled an attacker to enumerate DeepSeek's internal microservice architecture. Researchers also noted that the ClickHouse server's clickhouse-client interface accepted arbitrary queries, meaning an attacker could have exfiltrated files from the host filesystem and potentially executed privileged operations within the database environment.
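To make the exposure class concrete, here is a minimal sketch of why an unauthenticated ClickHouse HTTP interface is so dangerous: the server accepts SQL as a plain query parameter, so a single GET request runs as the default user. The host name below is hypothetical, and the snippet is for understanding the failure mode, never for probing systems you do not own.

```python
from urllib.parse import urlencode

def clickhouse_http_url(host: str, sql: str, port: int = 8123) -> str:
    """Build the URL that ClickHouse's HTTP interface accepts for a query.

    On a cluster with no authentication configured, a plain GET to this
    URL executes the SQL as the 'default' user -- the exposure class
    Wiz described in the DeepSeek disclosure.
    """
    return f"http://{host}:{port}/?{urlencode({'query': sql})}"

# Hypothetical internal host, for illustration only.
url = clickhouse_http_url("analytics.example.internal", "SHOW TABLES FROM default")
# On an unauthenticated cluster, fetching this URL returns the table list as text.
```

The same mechanism is why "read-only probe" and "full exfiltration" collapse into one capability: once arbitrary SQL is accepted without identity, the only limit is what the server process can reach.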

How long was the database exposed?

Wiz stated publicly that the database was open and unauthenticated when first observed. DeepSeek did not publish an official acknowledgement of dwell time. Independent researchers analysing the log_stream timestamps, together with Censys and Shodan historical scan data, suggest the cluster had been internet-exposed for at least several weeks before disclosure, with the earliest log entries dating to early January 2025. Whether other parties accessed the cluster before Wiz remains publicly unknown; with no authentication in place, an anonymous visit would have left no audit trail distinguishable from legitimate developer traffic. The absence of any per-session identity binding means even retrospective analysis cannot determine whether nation-state actors, criminal operators, or other security researchers obtained copies of the exposed log content during the dwell window.

What did existing controls miss?

Three failures, none specific to DeepSeek but all amplified by AI gold-rush dynamics. First, a default ClickHouse deployment does not require authentication: the documentation strongly encourages configuring users.xml authentication and IP allow-listing, but a developer running a quick analytics POC can spin up an internet-reachable cluster with zero credentials in under five minutes. Second, attack-surface management at startup scale frequently misses ports outside 80 and 443; 8123 and 9000 are absent from most off-the-shelf scanning baselines. Third, log-content hygiene was absent: the log_stream table held raw chat content and API keys, indicating no field-level redaction anywhere in the logging pipeline. The combination of an exposed port, no authentication, and sensitive log content is what turned a misconfiguration into a million-record data exposure.

# Hardening baseline for analytics databases handling AI telemetry
clickhouse_hardening:
  network:
    listen_host: '127.0.0.1'
    interserver_https_port: enforced
    public_ports_8123_9000: forbidden
    bastion_or_private_link_only: required
  authentication:
    users_xml_auth: required
    default_user_password: prohibited
    ldap_or_jwt_integration: preferred
  data_handling:
    chat_content_field_level_redaction: required
    api_key_redaction_regex_at_ingest: required
    log_retention_days_max: 90
  monitoring:
    failed_auth_alerting: required
    bulk_select_alert_threshold_rows: 10000
    new_database_creation_alert: high
  evidence:
    quarterly_external_scan_report: required
    soc2_control_mapping_cc6_1: required
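The api_key_redaction_regex_at_ingest control in the baseline above can be sketched as a small filter applied before log lines ever reach storage. The token patterns below are illustrative assumptions, not a complete catalogue; real deployments should match the key formats their vendors actually issue.

```python
import re

# Hypothetical credential patterns -- adapt to the formats your vendors issue.
REDACTION_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                # provider-style secret keys
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]{16,}"),  # bearer tokens in headers
]

def redact_log_line(line: str) -> str:
    """Replace anything that looks like a credential before the line is stored."""
    for pattern in REDACTION_PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return line

safe = redact_log_line("POST /v1/chat auth=Bearer abc123def456ghi789 status=200")
```

Running redaction at ingest rather than at query time is the design choice that matters: had DeepSeek's log_stream pipeline applied even a filter this crude, the exposed table would have leaked traffic metadata rather than live API keys.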

What should AI-consuming enterprises do now?

Six steps. First, any API key issued by DeepSeek before January 30, 2025 should be considered burned; rotate it on the assumption that it was logged in plaintext. Second, treat the AI vendors you ingest into your environment as tier-zero data processors with the same review depth as your SIEM or your identity provider. Demand SOC 2 Type II, ISO 27001, and field-level data-handling attestations. Third, inventory which employees and applications send sensitive prompts to which AI providers; the chat-history exposure is the most consequential category because prompts frequently include source code, credentials, and PII pasted by users. Fourth, implement provider-side DLP — egress filtering on prompt content for known-bad providers and contractual zero-retention modes for the rest. Fifth, instrument client-side telemetry so that an API-key leak surfaces an immediate revocation workflow, not a Friday-afternoon trickle of customer-support tickets. Sixth, evaluate AI providers against the new EU AI Act high-risk-system obligations and the NIST AI RMF 1.0 profile, which formalise documentation and incident-disclosure requirements that DeepSeek-class startups need to meet to remain enterprise-viable.
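The provider-side DLP step can be sketched as a minimal egress gate that classifies outbound prompt content and blocks it unless the destination provider is under a contractual zero-retention mode. The detector patterns and the two-state retention flag are simplifying assumptions; a production DLP engine needs far broader coverage and per-provider policy.

```python
import re

# Illustrative detectors only -- real DLP needs much broader coverage.
SENSITIVE_PATTERNS = {
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def classify_prompt(prompt: str) -> list[str]:
    """Return the sensitive-data categories detected in an outbound prompt."""
    return [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(prompt)]

def allow_egress(prompt: str, provider_zero_retention: bool) -> bool:
    """Permit sensitive prompts only toward providers under contractual
    zero-retention; block them everywhere else."""
    hits = classify_prompt(prompt)
    return not hits or provider_zero_retention

allowed = allow_egress("debug this: AKIAABCDEFGHIJKLMNOP",
                       provider_zero_retention=False)
```

The DeepSeek lesson embedded in this sketch: whatever crosses the gate may sit in the provider's analytics pipeline indefinitely, so the filter has to run on your side of the wire.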

What broader implications does this carry for AI provider risk?

DeepSeek's exposure surfaced at a moment of intense scrutiny on AI-provider data handling, and several regulatory and industry threads tightened immediately. The Italian data-protection authority Garante opened a formal investigation into DeepSeek under the EU General Data Protection Regulation in late January 2025 and blocked the service from operating in Italy pending review. Australia, Taiwan, and several US federal agencies issued advisories restricting government use of the DeepSeek consumer applications. The US National Institute of Standards and Technology accelerated work on AI provider risk assessment under the AI Risk Management Framework, with a profile expected in mid-2026 specifically addressing analytics-pipeline data handling at AI providers. Enterprise procurement teams responded by adding contractual zero-retention modes, data-residency clauses, and per-prompt logging restrictions to AI vendor agreements. The DeepSeek case also illustrated a pattern that is now widely acknowledged but rarely fixed: developer-experience tooling at fast-growing AI startups frequently outpaces the maturity of the security-engineering function, and the gap manifests as analytics misconfigurations long before it manifests as inference-layer compromise. Defenders should expect more of these exposures throughout 2026 as new AI providers race for market share.

How Safeguard Helps

Safeguard inventories every AI service-provider integration in your environment, maps each to the SBOM and dependency footprint of its consumer-side SDK, and continuously scores providers against the EU AI Act high-risk-system obligations, NIST AI RMF, and SOC 2 Trust Services Criteria. Griffin AI reachability analysis surfaces which AI provider endpoints can be reached from production source code and which prompts contain sensitive data categories — credentials, PII, regulated content — that should never leave your perimeter. TPRM workflows enforce contractual zero-retention modes and breach-notification SLAs, and continuously verify that DeepSeek-class providers publish attack-surface evidence rather than rely on attestations. Policy gates block new AI provider integrations that lack documented authentication and field-level redaction on the provider's analytics pipeline, and ingest Wiz, Censys, and Shodan exposure feeds so that an exposed-database disclosure surfaces every API key needing rotation within minutes.
