AI data leakage
Company data reaches external models through ordinary work, quietly. Where the paths run, and what narrows them without a ban.
of Polish employees pasted sensitive data into AI tools (ESET / DAGMA, 2025)
name genAI data leaks their top AI concern (World Economic Forum, 2026)
added breach cost where shadow AI runs high (IBM, 2025)
Contracts, source code, and customer records leave through ordinary prompts pasted on ordinary working days.
what AI data leakage is
AI data leakage occurs when company data (any information the organization has an interest in protecting) is entered into an external AI model without a data processing agreement (DPA), without review of the vendor's terms of service, and without the organization knowing that it happened.
The data involved ranges from business-sensitive to legally protected. Source code and product documentation. Customer names, contact details, and correspondence. Financial projections and contract terms. Employee records. Internal strategy documents.
The mechanism is ordinary and daily: an employee needs help drafting a document, improving a translation, summarizing a meeting, or explaining a piece of code. They paste the relevant content into an AI tool and get a useful result. The data has now left the organization under terms the organization never reviewed.
why AI data leakage is a structural problem
The behavior that produces AI data leakage is legitimate and productive. Employees are using AI tools to work faster. The problem is that adoption happened before governance, so there are no clear rules about what can and cannot be shared.
Consumer-tier AI tools, including free versions of widely used platforms, typically retain the right to use inputs to improve their models. An employee using a personal or free-tier account is working under consumer terms of service they have almost certainly not read, not under your data processing agreement.
In 2023, Samsung employees pasted source code and internal data into ChatGPT over a roughly three-week period. The company detected the incidents and restricted AI tool use internally. The case stands out because detection and reporting made it visible. The same pattern occurs in organizations that have no equivalent detection capability and no record of what left.
the concrete risk from AI data leakage
The consequences of uncontrolled AI data leakage depend on what left and where it went.
Regulatory exposure. Personal data about customers or employees entering an AI model without a DPA is a GDPR issue. The question is whether you have a lawful basis for the transfer and whether you can demonstrate it, regardless of whether any harm occurred. In most cases, a consumer-tier AI tool does not provide the DPA terms that would make that transfer lawful.
Intellectual property exposure. Source code, product roadmaps, and unreleased research that enters a third-party model's training data cannot be recalled. The exposure is permanent.
Contractual exposure. NDA-covered information that reaches a third-party model may constitute a breach of the NDA, even if the employee had no intent to disclose. Client contracts increasingly include data handling clauses that personal AI tool use can trigger.
No audit trail. If data does leave through a personal AI account, you have no log of it. In any subsequent review, incident investigation, or audit, you cannot demonstrate what happened because nothing was recorded.
what works
Control starts with naming the data that cannot leave. A classification that maps data categories to permitted AI use (personal data under GDPR, NDA-covered information, financial data not yet public, source code) gives every employee a test they can apply to the specific thing they are about to paste. The wording does most of the work here: "do not share confidential information" is too abstract to act on, while a short list of categories with examples drawn from people's actual work is concrete enough to change behavior. The classification lands better alongside a published approved-tool list that notes which tools have a DPA in place and which run with enterprise mode enabled, because employees who can see the sanctioned option stop improvising.
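As an illustration only, the classification and the approved-tool list can be written down as data plus one small check. Everything in this sketch is an assumption made for the example: the tool names are placeholders, and the rule that some categories never go into any AI tool while others require both a DPA and enterprise mode is one possible policy, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PERSONAL_DATA = "personal data under GDPR"
    NDA_COVERED = "NDA-covered information"
    NONPUBLIC_FINANCIAL = "financial data not yet public"
    SOURCE_CODE = "source code"
    GENERAL = "everything else"


@dataclass(frozen=True)
class Tool:
    name: str
    has_dpa: bool
    enterprise_mode: bool


# Hypothetical approved-tool list; the names are placeholders.
APPROVED_TOOLS = {
    "assistant-enterprise": Tool("assistant-enterprise", has_dpa=True, enterprise_mode=True),
    "translator-team": Tool("translator-team", has_dpa=True, enterprise_mode=False),
}

# Assumed policy for this sketch, not a rule taken from any regulation.
NEVER_SHARE = {Category.NDA_COVERED, Category.NONPUBLIC_FINANCIAL}
NEEDS_DPA_AND_ENTERPRISE = {Category.PERSONAL_DATA, Category.SOURCE_CODE}


def may_paste(category: Category, tool_name: str) -> bool:
    """Return True if the assumed policy allows this category in this tool."""
    if category in NEVER_SHARE:
        return False
    tool = APPROVED_TOOLS.get(tool_name)
    if tool is None:
        # Tools that are not on the list may only receive unclassified content.
        return category is Category.GENERAL
    if category in NEEDS_DPA_AND_ENTERPRISE:
        return tool.has_dpa and tool.enterprise_mode
    return True
```

Calling may_paste(Category.SOURCE_CODE, "translator-team") returns False under these assumptions, because that placeholder tool has a DPA but no enterprise mode. The useful part is less the function than the table behind it: a list short enough to publish and specific enough to check.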
Technical controls hold the line where the stakes are highest. DLP policies configured on the sensitive data categories can flag or block transfers, and browser-based DLP, available through some endpoint management platforms, watches the paste-into-a-web-form pattern that dominates real incidents. It does not catch everything, and it does not need to; the most common leak paths are exactly what it sees.
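To make the flag-or-block idea concrete, here is a minimal sketch of the kind of pattern check a DLP rule performs on pasted text. It is not how any particular DLP product works: the patterns, the category names, and the choice of which category blocks rather than flags are all assumptions for illustration, and real rules need far more care about false positives.

```python
import re

# Illustrative patterns only; production DLP rules are broader and better tuned.
PATTERNS = {
    "email_address": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "iban_like": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}(?: ?[A-Z0-9]{1,3})?\b"),
    "private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}

# Assumed policy: key material is blocked outright, other hits are only flagged.
BLOCK_ON = {"private_key"}


def scan_paste(text: str) -> list[str]:
    """Return the names of all patterns found in the pasted text, sorted."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def decide(text: str) -> str:
    """Map a paste to 'block', 'flag', or 'allow' under the assumed policy."""
    hits = scan_paste(text)
    if any(hit in BLOCK_ON for hit in hits):
        return "block"
    return "flag" if hits else "allow"
```

A paste containing an email address comes back as "flag", one containing a private key header as "block", and ordinary text as "allow". Flagging rather than blocking the lower-risk categories is a design choice that keeps the control from pushing people toward unmonitored workarounds.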
The quieter exposure sits in the OAuth layer. Connector apps holding broad grants to email or documents reach far more data than any employee consciously pastes, so a periodic review of AI-related grants in the identity provider (IdP), trimming the scopes that exceed the task, reduces exposure at the access layer rather than one prompt at a time.
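A grant review of this kind comes down to a set difference: the scopes an app holds minus the scopes its task needs. The sketch below assumes the grants have already been exported from the identity provider into simple records. The app names, the scope strings, and the NEEDED table are hypothetical and do not correspond to any real provider's scope names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    app: str
    user: str
    scopes: frozenset[str]


# Hypothetical table: the scopes each AI connector needs for its stated task.
# An app missing from the table is treated as needing nothing.
NEEDED = {
    "meeting-summarizer": frozenset({"calendar.read"}),
}


def excess_scopes(grant: Grant) -> frozenset[str]:
    """Scopes held by the grant beyond what the app's task requires."""
    return grant.scopes - NEEDED.get(grant.app, frozenset())


def review(grants: list[Grant]) -> list[tuple[str, str, list[str]]]:
    """List grants with excess scopes, those with the most excess first."""
    findings = [
        (g.app, g.user, sorted(excess_scopes(g)))
        for g in grants
        if excess_scopes(g)
    ]
    return sorted(findings, key=lambda finding: len(finding[2]), reverse=True)
```

Treating unknown apps as needing nothing means every scope held by an unreviewed connector surfaces as a finding, which is the conservative default for a periodic review.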
let's start with a conversation
Most first conversations start with not quite knowing what you have or where to begin. That's normal, and it's exactly where we're useful.
Tell us what prompted this. An upcoming audit, an incident, a client's security questionnaire, or just a sense that things have gotten messy.
We'll take it from there

+48 783 762 997
julian@unshadowit.com