The PII paradox: how to safely connect your customer database to an LLM
TL;DR
Connecting raw customer data to an LLM is a compliance and security risk. The solution is a two-layer defence: deterministic tokenisation so the LLM never sees real PII, combined with the right deployment model (self-hosted or private cloud). Together, they let you unlock AI-powered insights without gambling your customers' data.
What is the PII paradox businesses face when adopting LLMs?
The promise is clear: LLMs can deliver instant insights, hyper-personalised communication, and automated data analysis. The problem is equally clear: most businesses hold databases full of Personally Identifiable Information (PII), such as names, emails, and phone numbers, that they cannot legally or safely hand to a third-party AI service.
That tension is the PII paradox. You need the data to get value from the LLM. Feeding the data in exposes you to significant security and compliance risk. Solving it requires two non-negotiable security pillars: Data-Centric Security (Anonymisation) and Infrastructure-Centric Security (Deployment Model).
Why is sending raw PII to a third-party LLM so dangerous?
Third-party LLM providers process your data on infrastructure you do not control. Without contractual guarantees and technical safeguards, that data could be used for model training, logged, or exposed in a breach. Under GDPR, you remain the data controller: you are responsible for what happens to that PII regardless of where you send it.
The risk is not theoretical. LLM security researchers have documented prompt injection attacks, data leakage via model outputs, and training data extraction techniques that can surface PII from model weights [9]. Sending raw customer records into that environment without a data-centric safeguard is a significant gamble.
What is deterministic tokenisation and how does it protect PII?
Deterministic tokenisation is the gold standard for protecting PII before it reaches an LLM [2, 7]. The technique replaces every piece of sensitive data with a consistent, non-sensitive placeholder called a token.
For example, 'Alice Johnson' becomes 'PERSON_12345' and 'alice@example.com' becomes 'EMAIL_54321'. The replacement is deterministic: the same input always produces the same token. A secure proxy or gateway sits between your database and the LLM, performing the swap before data leaves your environment and reversing it after the LLM returns its output. The mapping table that links tokens to real identities never leaves your secure environment.
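A minimal sketch of how such a tokeniser might work is shown here. It assumes a keyed hash (HMAC-SHA-256) as the source of determinism, which is one common approach rather than something the tools cited in this article are known to use. The class name, the key handling, and the token format (a hex suffix rather than the numeric examples above) are all hypothetical.

```python
import hashlib
import hmac


class DeterministicTokeniser:
    """Maps PII values to stable tokens and keeps the reverse mapping locally."""

    def __init__(self, secret_key: bytes):
        # Hypothetical key handling: in practice, load this from a secrets manager.
        self._key = secret_key
        # token -> original value. This mapping table is never sent to the LLM.
        self._vault = {}

    def tokenise(self, value: str, kind: str) -> str:
        # Normalise so trivial variations of the same value share one token.
        normalised = value.strip().lower()
        message = f"{kind}:{normalised}".encode("utf-8")
        digest = hmac.new(self._key, message, hashlib.sha256).hexdigest()
        token = f"{kind}_{digest[:12].upper()}"
        # Keep the first-seen original spelling for later re-identification.
        self._vault.setdefault(token, value)
        return token

    def detokenise(self, token: str) -> str:
        return self._vault[token]


tokeniser = DeterministicTokeniser(b"demo-key-do-not-use")
first = tokeniser.tokenise("alice@example.com", "EMAIL")
second = tokeniser.tokenise("alice@example.com", "EMAIL")
# first == second: the same input always yields the same token.
```

A keyed hash means nobody can recompute a token from a guessed email address without the key, which a plain unkeyed hash would allow.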
How does tokenisation satisfy GDPR and privacy regulations?
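Replacing PII with tokens pseudonymises the data sent to the LLM: it can no longer be attributed to an identifiable individual without the separately stored mapping key. This matches the definition of pseudonymisation in GDPR Article 4(5), which the regulation recognises as a security safeguard [2]. Pseudonymised data is still personal data under GDPR, so your other obligations as data controller continue to apply.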
Critically, the mapping table never leaves your secure environment. The LLM processes only tokens. Your system re-identifies the output after the fact, so the personalised result reaches the right customer without the LLM ever holding the raw PII.
Does tokenisation break the analytical usefulness of an LLM?
No, and this is the clever part. Because the tokenisation is deterministic, 'PERSON_12345' always refers to the same individual across every query. The LLM can still identify patterns, segment customers, and track behaviour at the individual level. It simply does so using tokens rather than real names.
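To illustrate why consistent tokens keep individual-level analysis possible, here is a small sketch that aggregates spend per customer across two separate data exports. The records, amounts, and the spend threshold are made up for illustration.

```python
from collections import defaultdict

# Hypothetical tokenised order records from two separate exports.
january = [("PERSON_12345", 40.0), ("PERSON_67890", 15.0)]
february = [("PERSON_12345", 60.0)]

# Because the token is stable, orders from both months land on the same key.
spend_by_customer = defaultdict(float)
for token, amount in january + february:
    spend_by_customer[token] += amount

# Segment customers without ever knowing who they are.
high_value = [token for token, total in spend_by_customer.items() if total > 50]
```

With random (non-deterministic) placeholders, the January and February orders would carry different tokens and the two exports could not be joined.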
The utility is largely preserved. A task like 'draft a personalised email for EMAIL_54321' still produces a highly relevant output. Your proxy then substitutes the real email address before delivery. The customer receives a personalised message; the LLM never knew who they were.
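The substitution step on the way back can be sketched as follows. The mapping table, the token pattern, and the sample LLM output are hypothetical, and the tokens use this article's example format.

```python
import re

# Hypothetical mapping table; in practice this lives in your secure environment.
VAULT = {
    "PERSON_12345": "Alice Johnson",
    "EMAIL_54321": "alice@example.com",
}

TOKEN_PATTERN = re.compile(r"\b(?:PERSON|EMAIL)_\d+\b")


def detokenise_text(text: str, vault: dict) -> str:
    """Swap every known token in the LLM output back to its real value."""
    # Unknown tokens are left untouched, so a token the model invented
    # is never mapped to a real person.
    return TOKEN_PATTERN.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


llm_output = "Hi PERSON_12345, your receipt was sent to EMAIL_54321."
final_message = detokenise_text(llm_output, VAULT)
```

Leaving unrecognised tokens in place is a deliberate choice here: it makes a model error visible in the output instead of silently mapping it to the wrong customer.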
What are the two secure deployment models for LLMs handling customer data?
Once you have a tokenisation layer in place, your second decision is where the LLM itself runs. There are two primary options [3, 4, 5, 8].
Option A: Self-hosted / on-premise LLM (maximum control)
You deploy an open-source or licensed model directly within your own private infrastructure. Your data, even tokenised, never leaves your network. You have absolute data sovereignty and full control over security, access controls, and model fine-tuning. The trade-off is significant: high upfront investment in specialised hardware (GPUs) and MLOps expertise, plus full operational and scaling responsibility on your team.
Option B: Private cloud LLM (the scalable compromise)
You use dedicated, isolated instances of LLMs from major cloud providers, such as Azure OpenAI or Google Vertex AI. These services typically offer contractual guarantees that your data will not be used for model training and remains isolated within your tenancy. You gain the provider's robust infrastructure and effortless scaling. The trade-off is that data still transits the cloud provider's network, requiring trust in their security posture and contractual commitments.
Which deployment model should most businesses choose?
For most businesses (those without a dedicated MLOps team or GPU infrastructure), a Private Cloud LLM combined with a deterministic tokenisation layer is the recommended path [2, 5]. It offers the best balance of security, scalability, and cost.
Self-hosting makes sense when you have extreme data sovereignty requirements, the technical capability to run and maintain the infrastructure, or compliance mandates that prohibit any third-party data processing. For everyone else, a well-configured private cloud deployment with strong contractual protections is the pragmatic choice.
What does a complete layered defence strategy look like?
The safest strategy is a layered defence, combining the technical safeguard of tokenisation with the operational safeguard of a secure deployment model [6, 7, 9, 10]:
- Tokenisation layer: a secure proxy intercepts all data before it leaves your environment, replacing PII with deterministic tokens.
- Secure transmission: tokenised (pseudonymised) data is sent to the LLM via encrypted channels.
- LLM processing: the model analyses tokens only. It has no access to real customer identities.
- De-anonymisation: your proxy maps tokens back to real identities after the LLM returns its output.
- Delivery: personalised, high-value results reach the customer without the LLM ever handling raw PII.
This architecture means that even if the LLM service were compromised, an attacker would obtain only meaningless tokens, not your customer database.
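The five steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not a production proxy: it handles email addresses only, uses a deliberately simple regular expression, derives tokens with HMAC-SHA-256, and replaces the real model call with a hypothetical `fake_llm` stand-in. All function and variable names are invented for the example.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"demo-key-do-not-use"  # hypothetical; load from a secrets manager
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"EMAIL_[0-9A-F]{12}")


def tokenise_prompt(prompt: str, vault: dict) -> str:
    """Step 1: replace each email address with a deterministic token."""
    def swap(match):
        email = match.group(0)
        digest = hmac.new(SECRET_KEY, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()
        token = f"EMAIL_{digest[:12].upper()}"
        vault[token] = email  # the mapping stays inside your environment
        return token
    return EMAIL_RE.sub(swap, prompt)


def fake_llm(prompt: str) -> str:
    """Steps 2-3: stand-in for the encrypted call to the model, which sees tokens only."""
    token = TOKEN_RE.search(prompt).group(0)
    return f"Draft ready. Send to {token}."


def detokenise(text: str, vault: dict) -> str:
    """Step 4: map tokens back to real values inside your own environment."""
    return TOKEN_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


vault = {}
safe_prompt = tokenise_prompt("Write a renewal reminder for alice@example.com", vault)
reply = fake_llm(safe_prompt)
delivered = detokenise(reply, vault)  # Step 5: what your customer-facing system uses
```

In a real deployment, detection would need to cover names, phone numbers, and other identifiers, which is where dedicated sanitisation tools earn their place.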
What to do this week
- Audit what PII you are currently sending to any LLM. If your team is using ChatGPT, Copilot, or any third-party AI tool with customer data, that is your immediate exposure to assess.
- Map your data flows. Identify every point where customer records touch an AI service: integrations, automations, and manual copy-paste into chat interfaces.
- Evaluate tokenisation options. Tools like Protecto and Kong's AI Gateway offer PII sanitisation layers you can insert in front of your LLM calls without rewriting your stack [2, 6].
- Review your cloud LLM contracts. If you are using Azure OpenAI or Google Vertex AI, confirm the data processing terms and your tenancy isolation guarantees are in writing.
- Do not wait for a breach. GDPR enforcement is active and regulators are catching up with AI-specific data processing. Layering your defences now is considerably cheaper than post-breach remediation.
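As a starting point for the audit step in the list above, a rough scan of logged prompts can show where PII is already leaving your environment. The sketch below assumes you can export prompts as plain strings; the two patterns are deliberately simple and will miss names and many other identifiers, and the sample prompts are invented.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def scan_for_pii(text: str) -> dict:
    """Return the number of matches per PII type found in an outbound prompt."""
    counts = {}
    for kind, pattern in PII_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            counts[kind] = len(hits)
    return counts


logged_prompts = [
    "Summarise the complaint from alice@example.com about her order",
    "Call the customer back on +44 20 7946 0958",
    "Suggest three subject lines for our spring sale",
]
report = {index: scan_for_pii(prompt) for index, prompt in enumerate(logged_prompts)}
flagged = [index for index, counts in report.items() if counts]
```

Treat the output as a lower bound on your exposure: a prompt that passes this scan may still contain a customer's name or address.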
References
[1] Yi Ai, Preventing Sensitive Data Exposure in LLMs. https://yia333.medium.com/preventing-sensitive-data-exposure-in-llms-f3e8ce2dcd01
[2] Protecto, 7 Proven Ways To Safeguard Personal Data In LLMs. https://www.protecto.ai/blog/7-proven-ways-safeguard-llm-personal-data/
[3] Plural, Self-Hosted LLM: A 5-Step Deployment Guide. https://www.plural.sh/blog/self-hosting-large-language-models/
[4] Private AI, BYO LLM: Privacy Concerns and Other Challenges. https://www.private-ai.com/en/blog/byo-llm
[5] Signity Solutions, On Premise vs Cloud Based LLM. https://www.signitysolutions.com/blog/on-premise-vs-cloud-based-llm
[6] KongHQ, PII Sanitization Needed for LLMs and Agentic AI. https://konghq.com/blog/enterprise/building-pii-sanitization-for-llms-and-agentic-ai
[7] DZone, Secure LLM Usage With Reversible Data Anonymization. https://dzone.com/articles/llm-pii-anonymization-guide
[8] Latitude Blog, Cloud vs On-Prem LLMs: Long-Term Cost Analysis. https://latitude-blog.ghost.io/blog/cloud-vs-on-prem-llms-long-term-cost-analysis/
[9] Oligo Security, LLM Security in 2025: Risks, Examples, and Best Practices. https://www.oligo.security/academy/llm-security-in-2025-risks-examples-and-best-practices
[10] Sentra, Safeguarding Data Integrity and Privacy in the Age of LLMs. https://www.sentra.io/blog/safeguarding-data-integrity-and-privacy-in-the-age-of-ai-powered-large-language-models-llms
Where to from here
Book a free 60-minute AI audit and we'll explore exactly which workflows are worth augmenting with AI.
Live with passion & AI,
Brett
Frequently asked questions
What does PII mean in the context of AI and LLMs?
PII stands for Personally Identifiable Information: any data that can identify a specific individual, such as names, email addresses, or phone numbers. When this data is fed into an LLM, it creates significant security and compliance obligations, particularly under regulations like GDPR, because you remain the data controller regardless of where the data is processed.
Can I use ChatGPT or other third-party LLMs with my customer database?
Not safely without a data-centric safeguard in place. Sending raw PII to a third-party LLM means your customer data transits and is processed on infrastructure you do not control. Without contractual guarantees and technical protections like tokenisation, you are exposed to data leakage, potential use of your data for model training, and regulatory liability.
What is deterministic tokenisation and why is it the gold standard for PII protection?
Deterministic tokenisation replaces every piece of PII with a consistent, non-sensitive placeholder called a token. For example, 'Alice Johnson' becomes 'PERSON_12345'. The same input always produces the same token, so the LLM can still analyse patterns and track individual customers without ever seeing real identity data. The mapping key is stored securely in your own environment and never shared with the LLM.
Does tokenising data before sending it to an LLM break its usefulness?
No. Because the tokenisation is deterministic, 'PERSON_12345' always refers to the same individual across every query. The LLM can still identify patterns, segment customers, and generate personalised outputs. Your system then de-tokenises the result before delivery, so the customer receives a fully personalised response without the LLM ever handling raw PII.
What is the difference between a self-hosted LLM and a private cloud LLM?
A self-hosted LLM runs entirely within your own infrastructure. It offers maximum data sovereignty, but at a high upfront hardware and operational cost. A private cloud LLM uses dedicated, isolated instances from providers like Azure OpenAI or Google Vertex AI. It is easier to scale and backed by contractual data protection guarantees, but data still transits the provider's network.
Which deployment model is right for most businesses?
For most businesses without dedicated MLOps capability or GPU infrastructure, a private cloud LLM combined with a deterministic tokenisation layer offers the best balance of security, scalability, and cost. Self-hosting is reserved for organisations with extreme data sovereignty requirements or compliance mandates that prohibit any third-party data processing.
What regulations apply to using customer PII with an LLM in the UK or Australia?
In the UK, UK GDPR applies post-Brexit, alongside the Data Protection Act 2018. In Australia, the Privacy Act 1988 and the Australian Privacy Principles govern how personal information must be handled. Both frameworks hold your organisation accountable (as the data controller under UK GDPR, and as an APP entity under the Privacy Act) regardless of where the data is processed, meaning you can remain liable even if a third-party LLM provider causes the breach.

Brett is a four-time founder (Darra Tyres, Gladfish, EzyTrac, Anaboo) and the operator behind AIOS, Anaboo's AI Operating System. He writes from inside the build, installing AI in his own businesses first and reporting back what actually moves the numbers. Based between Singapore, the UK and Australia.



