AI Chatbot Data Sovereignty: Where Does Your Data Actually Go?

Data sovereignty is the number one objection from privacy officers, IT leaders, and legal teams before any AI chatbot deployment — and it is a legitimate one. When an employee types a question into an AI chatbot — a customer query, a contract excerpt, a patient file — where does that data actually go?

The short answer: with most mainstream AI products (ChatGPT, Copilot, Gemini), your data is handled by providers subject to US jurisdiction. That means the US CLOUD Act applies: American authorities can compel those providers, through a warrant or court order, to hand over data even if it is physically stored in Europe or anywhere else outside the US. This is not a theoretical risk. It is a legal reality that can directly conflict with GDPR and with the confidentiality obligations of regulated industries worldwide.

This guide explains precisely what happens to your data when you use an AI chatbot, why "hosted in Europe" is not sufficient on its own, and which measures — data residency, encryption, anonymization, and sovereign certifications — allow you to deploy an AI chatbot in full compliance with international data protection standards.

Where does your data go when you use an AI chatbot?
The US Cloud Act: why "hosted in Europe" is not enough
GDPR and AI chatbots: what the law actually requires
Sovereignty certifications: SecNumCloud, ISO 27001, and international equivalents
Encryption, anonymization, data residency: the protections that matter
Healthcare, legal, and regulated industries: sector-specific requirements
How to choose a sovereign AI chatbot solution: the DPO checklist
FAQ — AI Chatbots and Data Sovereignty

Where does your data go when you use an AI chatbot?

Before discussing sovereignty, you need to understand the technical data flow. A modern AI chatbot is not software running locally on your server. It is a service that sends your requests to a remote API — that of OpenAI, Anthropic, Google, or another large language model (LLM) provider.

Here is what concretely happens during an exchange with a non-sovereign AI chatbot:

The user types a question into the chatbot interface.
That question (the "prompt") is sent over HTTPS to the LLM provider's API.
The prompt may contain personal data, excerpts from internal documents, customer information, or health data — depending on what the user has entered.
The LLM generates a response and sends it back. The prompt and response may be logged by the provider to improve its models, unless you have explicitly opted out.
The conversation history is stored on the provider's servers, whose location depends on their infrastructure choices.

In the case of OpenAI (ChatGPT, GPT API), data flows to Microsoft Azure servers located in the US or Europe depending on configuration — but in all cases the parent company is American and subject to US law. That is where the Cloud Act becomes relevant.

Data sent to the LLM: what you need to know

A common misconception is conflating chatbot application data (conversation history, RAG knowledge base) with data sent to the LLM on every single request. These are two distinct layers. Even if the knowledge base is hosted in your region, every user request that triggers an LLM call sends text — potentially sensitive text — to an external API.

For RAG solutions, the content retrieved from the knowledge base is injected into the prompt sent to the LLM. In practice, an excerpt from your internal specifications, a client contract, or a medical record can end up in the prompt processed by the model at the provider's infrastructure.
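
To make this concrete, here is a minimal Python sketch of how a RAG pipeline typically assembles a prompt. The function name `build_prompt`, the template wording, and the sample chunks are illustrative assumptions, not any vendor's actual implementation. The point is simply that whatever the retriever returns becomes part of the text sent to the external API.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM API (hypothetical template)."""
    context = "\n\n".join(
        f"[Document {i}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Sample data, invented for illustration only.
chunks = [
    "Contract 2024-118: ACME Corp agrees to pay EUR 40,000 per year.",
    "Internal spec v3: the export module stores files for 90 days.",
]
prompt = build_prompt("What is ACME's annual fee?", chunks)

# Everything in `prompt`, including the contract excerpt, leaves your
# infrastructure once this string is posted to an external LLM endpoint.
print(prompt)
```

Printing the assembled prompt during an audit is a quick way to see exactly which internal excerpts would reach the model provider.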

The three data flow layers you need to audit

Layer	What is involved	Questions to ask your vendor
Chatbot application	Conversation history, user profiles, configuration data	Where is the application hosted? Which cloud provider? In which region?
Knowledge base (RAG)	Uploaded documents, vectorized chunks, knowledge repository	Where are vectors and source documents stored? Who can access them?
LLM API calls	Prompts sent to the model (conversation content + RAG context)	Which LLM is used? Is the API subject to US jurisdiction? Are prompts logged?
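
One way to keep this audit repeatable is to record the answers as structured data. The sketch below is an assumption about how such an inventory could look: the `DataFlowLayer` fields, the provider names, and the region labels are all hypothetical and should be replaced with what your vendor actually tells you.

```python
from dataclasses import dataclass


@dataclass
class DataFlowLayer:
    name: str
    provider: str
    region: str                           # where the data is physically hosted
    provider_under_us_jurisdiction: bool  # answer obtained from the vendor
    data_logged_by_provider: bool         # answer obtained from the vendor


def audit_findings(layers: list[DataFlowLayer]) -> list[str]:
    """Return human-readable findings for layers that need follow-up."""
    findings = []
    for layer in layers:
        if layer.provider_under_us_jurisdiction:
            findings.append(f"{layer.name}: provider is subject to US jurisdiction")
        if layer.data_logged_by_provider:
            findings.append(f"{layer.name}: data is logged by the provider")
    return findings


# Hypothetical deployment, invented for illustration.
layers = [
    DataFlowLayer("Chatbot application", "EU host A", "eu-west", False, False),
    DataFlowLayer("Knowledge base (RAG)", "EU host A", "eu-west", False, False),
    DataFlowLayer("LLM API calls", "US LLM vendor", "eu-west", True, True),
]

for finding in audit_findings(layers):
    print(finding)
```

In this invented example the application and knowledge base raise no findings, while the LLM layer raises two even though it is nominally hosted in an EU region.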

The US Cloud Act: why "hosted in Europe" is not enough

The CLOUD Act (Clarifying Lawful Overseas Use of Data Act), enacted by the United States in March 2018, is the law that creates the main friction point between US-based AI solutions and international data sovereignty.

Its core principle: any provider subject to US jurisdiction, including OpenAI, Microsoft, Amazon (AWS), or Google, can be legally compelled to produce data in its possession, custody, or control when served with a valid US warrant or court order, even if that data is physically stored in Europe, Canada, or anywhere else. The geographic location of the server offers no protection against such an order if the service provider falls under US jurisdiction.

This is not a theoretical concern: organizations in the financial, legal, and industrial sectors have already encountered situations where data held with US-based providers was demanded by US authorities, creating a direct conflict with their local data protection obligations — whether under GDPR in Europe, the PDPA in Southeast Asia, or PIPEDA in Canada.

The European subsidiary trap

Some US providers create European subsidiaries or "EU cloud" offerings to address customer concerns. Microsoft Azure Europe, AWS EU, and Google Cloud EU are common examples. But these subsidiaries remain part of a US corporate group, subject to US law through their parent company. A Cloud Act order directed at the parent can compel the European subsidiary to produce data.

The key distinction: hosted in Europe ≠ sovereign. True data sovereignty means the company operating the service is legally independent of any entity subject to US law. This is what European cloud providers like OVHcloud, Scaleway, or Outscale (Dassault Systèmes) offer — and it is the architectural foundation Heeya built on.

The Cloud Act vs. GDPR structural conflict

GDPR prohibits the transfer of personal data to third countries without adequate safeguards (Chapter V), and its Article 48 provides that a foreign court order is enforceable only when based on an international agreement such as a mutual legal assistance treaty. The Cloud Act can compel exactly such a disclosure on US judicial demand. These two legal frameworks are in structural contradiction. For any organization subject to GDPR, relying on an AI chatbot that is subject to the Cloud Act to process customer, employee, or patient personal data creates a permanent compliance risk, even if no order has yet been issued.

Our guide to GDPR-compliant AI chatbots covers the specific obligations and steps required to remain within the legal framework for EU-based deployments.

GDPR and AI chatbots: what the law actually requires

GDPR does not say "don't use AI chatbots." It establishes principles that every chatbot deployment must satisfy. As the data controller, your organization is accountable for what happens to your users' data, even when a third-party processor (the chatbot vendor) is doing the processing on your behalf.

Core obligations for the data controller

Lawfulness and purpose limitation: data processed by the chatbot must be processed for a specific, explicit, and legitimate purpose. Using a customer support chatbot to train the vendor's AI models without explicit consent violates this principle.
Data minimization: the chatbot should only process data that is strictly necessary. A FAQ chatbot does not need to collect a user's national ID number.
Mandatory DPA: every AI chatbot vendor is a data processor under GDPR. A Data Processing Agreement (DPA) is mandatory before any data is shared. The DPA must specify: categories of data processed, server locations, sub-processors used, and security measures applied.
International transfers: any data flow outside the EU/EEA must be governed by an adequate mechanism (adequacy decision, Standard Contractual Clauses). Calling an LLM API that processes data in the United States (the OpenAI API in its standard configuration, for example) constitutes a transfer to the United States.
Retention periods: conversation history must be deleted according to defined schedules. The vendor must guarantee effective deletion — including backup copies — upon user request.

Shared liability: who is responsible when something goes wrong?

In the event of a data breach via an AI chatbot, data protection authorities consider the data controller (your organization) to remain co-responsible even if the incident originated with the vendor. European data protection authorities have already fined organizations for integrating third-party tools without verifying their data protection compliance. With fines reaching up to EUR 20 million or 4% of global annual turnover, whichever is higher, the question of data sovereignty is not excessive caution. It is a legal requirement.

Sovereignty certifications: SecNumCloud, ISO 27001, and international equivalents

To evaluate the genuine sovereignty level of a cloud provider or SaaS solution, certifications provide a structured framework. Different frameworks apply depending on your region and sector.

SecNumCloud: France's gold standard for sovereignty

SecNumCloud is the qualification awarded by ANSSI (the French national cybersecurity agency) to cloud service providers that meet more than 360 compliance criteria across 14 domains. Version 3.2 of the framework, published in 2022, explicitly includes protections against extra-territorial laws, the Cloud Act in particular.

To obtain SecNumCloud, a provider must:

Host all data in France or the EU.
Be legally independent of any entity subject to non-European legislation that could compromise data sovereignty.
Ensure all operations are performed from European territory by European personnel.
Pass a rigorous technical audit by ANSSI-accredited assessors.

SecNumCloud-qualified providers include OVHcloud, Outscale (Dassault Systèmes), and a handful of others. They form the infrastructure foundation on which sovereign SaaS solutions are built. While SecNumCloud is specific to France, it represents one of the most rigorous data sovereignty frameworks globally and is increasingly referenced as a model by EU regulators.

ISO 27001 and international equivalents

For organizations outside France, ISO/IEC 27001 is the most widely recognized international security certification. It covers information security management broadly but does not address legal sovereignty — a provider can be ISO 27001 certified and still be subject to the Cloud Act. Think of it as a necessary baseline, not a sovereignty guarantee.

Other relevant frameworks by region include:

C5 (Germany): the BSI Cloud Computing Compliance Criteria Catalogue, similar in scope and rigor to SecNumCloud on security controls, with transparency requirements covering jurisdiction and data location.
ENS Alto (Spain): Spain's National Security Scheme, mandatory for providers serving Spanish public administrations.
IRAP (Australia): the Information Security Registered Assessors Program, required for Australian government cloud services.
FedRAMP (US): the US Federal Risk and Authorization Management Program, relevant if you are a US government supplier, but it does not address cross-border jurisdiction conflicts for non-US customers.

Healthcare: HDS and sector-specific requirements

The French HDS (Health Data Host) certification is mandatory for any organization hosting personal health data in France. It guarantees a high level of technical security. But like ISO 27001, HDS does not cover legal sovereignty: a provider can be HDS-certified and still be subject to the Cloud Act if its parent company is American.

For healthcare organizations operating under GDPR or HIPAA, the rule is therefore: sector security certification (HDS, HITRUST, SOC 2 Type II) plus a provider legally independent of US jurisdiction. The two layers are complementary, not interchangeable.

The "trusted cloud" model

Several countries have developed "trusted cloud" frameworks that allow well-known technologies (sometimes originally American) to be operated by legally independent local entities. France's S3NS (Thales operating Google Cloud technology) and Bleu (a Capgemini and Orange joint venture operating Microsoft cloud technology) are prominent examples. This is a pragmatic compromise, though some legal experts maintain that technological dependency always carries residual risk.

Encryption, anonymization, data residency: the protections that matter

Beyond certifications, concrete technical measures can significantly reduce the risks associated with AI chatbot usage — even when the LLM used is an external service.

End-to-end encryption

All data in transit, both between the user and the application and between the application and the LLM, must be encrypted using TLS 1.2 at minimum, ideally TLS 1.3. TLS protects each hop separately: the application and the LLM provider still see the plaintext they need to process. Data at rest (knowledge base, conversation history) should be encrypted with customer-managed keys (BYOK, Bring Your Own Key), so that the hosting provider cannot read stored content in plain text without access to your keys.
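
As a small illustration of the TLS floor, the sketch below uses Python's standard `ssl` module to build a client context that refuses anything older than TLS 1.2. It only covers the client side of one hop; server configuration and key management are out of scope here.

```python
import ssl

# Start from the standard library's secure defaults
# (certificate verification and hostname checking are enabled).
context = ssl.create_default_context()

# Refuse anything older than TLS 1.2.
# Set ssl.TLSVersion.TLSv1_3 instead to require TLS 1.3.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.minimum_version.name)
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

The resulting `context` can be passed to standard clients, for example `urllib.request.urlopen(url, context=context)` or `http.client.HTTPSConnection(host, context=context)`.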

Prompt anonymization and pseudonymization

One of the most powerful techniques for sensitive sectors: anonymize or pseudonymize data before it is sent to an external LLM. For example, before sending a contract excerpt to a model for analysis, a pre-processing step automatically replaces proper names, contract numbers, and identifying data with generic tokens. The LLM processes the tokenized text, and the application layer restores the original values in the response. Because the token mapping is kept on your side and is reversible, this is strictly pseudonymization rather than anonymization in GDPR terms, but it lets organizations use powerful LLMs while sending them far less directly identifiable personal data.
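
The replace-then-restore idea can be sketched in a few lines of Python. This is a deliberately simplified illustration: the two regular expressions, the `CT-` contract-number format, and the token naming are assumptions made for the example, and a real deployment would need much broader detection (typically NER-based tooling) plus secure storage of the mapping.

```python
import re

# Illustrative patterns only; real deployments need far broader detection.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CONTRACT": re.compile(r"\bCT-\d{4,}\b"),  # hypothetical contract-number format
}


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected identifiers with tokens; return the text and the token map."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            value = match.group(0)
            # Reuse the same token when a value appears more than once.
            for token, original in mapping.items():
                if original == value:
                    return token
            token = f"<{label}_{len(mapping) + 1}>"
            mapping[token] = value
            return token
        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original identifiers back into the model's response."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


source = "Contract CT-20417 was signed by jane.doe@example.com."
safe_text, token_map = pseudonymize(source)
print(safe_text)                      # this is what would be sent to the LLM
print(restore(safe_text, token_map))  # round-trips to the original text
```

The mapping never leaves the application layer, which is what makes the restoration step possible and also why the mapping itself must be protected like any other personal data.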

Data residency for the knowledge base

For RAG solutions, the vectorized knowledge base — the documents the chatbot consults when generating responses — must be hosted on a sovereign server. Heeya uses a vector database (Qdrant) hosted in Europe, which means your internal documents never leave European territory. Only the relevant content, selected and optionally anonymized, is injected into the prompt sent to the LLM. If you manage RFP documents and sensitive bids, our article on AI knowledge bases for RFP responses shows how to structure sensitive documents in a compliant environment.

Retention periods and the right to erasure

GDPR — and many other data protection regulations — impose limited retention periods. A compliant AI chatbot must allow you to configure how long conversations are kept (30, 90, 365 days, for example) and trigger effective deletion — including backup copies — on user request (right to erasure). This functionality must be operationally real, not just promised in the terms of service.
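
A minimal sketch of what "operationally real" can mean: two selection rules, one for the retention window and one for an erasure request. The `Conversation` record and its field names are hypothetical, and the sketch only selects what to delete. Actual deletion from the database, the vector store, and backups has to be implemented and verified separately.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Conversation:
    conversation_id: str
    user_id: str
    last_activity: datetime


def expired(conversations: list[Conversation], retention_days: int,
            now: datetime) -> list[Conversation]:
    """Conversations whose last activity is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [c for c in conversations if c.last_activity < cutoff]


def erasure_targets(conversations: list[Conversation],
                    user_id: str) -> list[Conversation]:
    """All conversations belonging to a user who exercised the right to erasure."""
    return [c for c in conversations if c.user_id == user_id]


# Sample records, invented for illustration.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
store = [
    Conversation("c1", "alice", now - timedelta(days=120)),
    Conversation("c2", "bob", now - timedelta(days=10)),
    Conversation("c3", "alice", now - timedelta(days=5)),
]

to_purge = expired(store, retention_days=90, now=now)
to_erase = erasure_targets(store, "alice")
print([c.conversation_id for c in to_purge])
print([c.conversation_id for c in to_erase])
```

With a 90-day window, only the 120-day-old conversation is selected for purging, while an erasure request from the same user selects all of that user's conversations regardless of age.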

Our RAG expertise page details how we architect these security and compliance layers in AI chatbot deployments for our clients. For a comprehensive overview of threats and countermeasures, our guide on AI chatbot data security for enterprises covers attack vectors, encryption, and configuration best practices.

Healthcare, legal, and regulated industries: sector-specific requirements

Certain industries face additional regulatory constraints that make data sovereignty even more critical. Here are the key considerations by sector.

Healthcare

Health data is a "special category" of personal data under GDPR (Article 9), subject to reinforced obligations. Any AI chatbot that processes health data, even indirectly (for example, when a patient describes symptoms), must rely on a certified health data host. In the EU, the applicable certification is HDS (France) or equivalent national requirements. In the US, HIPAA governs protected health information, and a Business Associate Agreement (BAA) is mandatory with any vendor that handles it.

For hospital and clinical environments, the stakes are high: a chatbot deployment without a prior security audit is a major regulatory and operational risk. Healthcare organizations should require both the sector security certification and legal sovereignty from their AI vendors.

Legal sector and law firms

Attorney-client privilege is protected by law in virtually every jurisdiction. Sending client file content to a US LLM API — even for document analysis — can constitute a breach of professional secrecy. Several incidents have been documented where lawyers using ChatGPT for document analysis inadvertently exposed protected information.

Sovereign AI solutions for the legal sector exist: they rely on LLMs hosted in the EU (such as Mistral, or open-source models deployed on sovereign infrastructure) and guarantee the absence of prompt logging for training purposes.

Public sector and regulated government agencies

Government agencies deploying AI chatbots for citizen services, internal HR, or document management typically face strict data residency mandates. In Europe, national cybersecurity agencies (ANSSI in France, BSI in Germany, NCSC in the UK) publish cloud security guidance that effectively requires sovereign infrastructure for sensitive government data. In the US, FedRAMP authorization is a prerequisite for federal deployments.

The general rule across jurisdictions: any data processed on behalf of a government entity must be hosted on certified sovereign infrastructure. Public sector buyers should verify both the technical certification and the provider's legal independence from foreign jurisdictions.

How to choose a sovereign AI chatbot solution: the DPO checklist

Here are the concrete questions to ask any AI chatbot vendor before signing a contract — especially if your organization processes sensitive data.

Questions about hosting and data residency

Where are the servers physically located that host the application and data?
Is the hosting provider a legally independent entity under European (or local) law, with no US parent company?
Does the provider hold relevant certifications for your sector (SecNumCloud, ISO 27001, C5, HDS, HITRUST, SOC 2 Type II)?
Where is the RAG vector database stored? On the same sovereign infrastructure?

Questions about the LLM used

Which LLM is used to generate responses?
Is that LLM accessible via an API subject to US jurisdiction (OpenAI, Anthropic, Google)?
Are prompts sent to the LLM logged by the model provider? For how long?
Is there an EU-based LLM option (Mistral, Aleph Alpha) or an on-premise deployment option?
Is data anonymized before being sent to the LLM?

Contractual and compliance questions

Is a Data Processing Agreement (DPA) provided and available to sign before any data is shared?
Is the list of sub-processors available and kept up to date?
What are the conversation retention periods? Is deletion technically guaranteed, including backups?
Does the vendor commit contractually to never using your data to train its models?
What is the data breach notification procedure (timeline, content, notification channel)?

A solution that answers all of these questions satisfactorily is the operational definition of a sovereign AI chatbot. It is the standard Heeya holds itself to. If you are evaluating a concrete project, our article on the AI chatbot implementation timeline gives a realistic estimate of the steps involved — including compliance audits.

FAQ — AI Chatbots and Data Sovereignty

Where does my data go when I use an AI chatbot?

Your data flows through several layers: the chatbot application (which stores conversation history), the RAG knowledge base (which holds your vectorized documents), and the LLM (which receives your prompts on every exchange). With most mainstream solutions such as ChatGPT or Copilot, prompts are sent to APIs governed by US law. For a sovereign solution, the application and knowledge base must be hosted with a legally independent provider in your region, and the LLM must either be hosted locally or receive only anonymized data.

What is the US Cloud Act and why is it a problem for my data?

The CLOUD Act (2018) is a US law under which any provider subject to US jurisdiction, such as OpenAI, Microsoft, Google, or Amazon, can be compelled by a US warrant or court order to produce data, even if that data is physically stored outside the United States. This creates a direct conflict with GDPR and similar data protection laws, which prohibit the transfer of personal data to third countries without adequate safeguards. Using an AI chatbot whose provider is a US company exposes your data to a permanent legal risk, regardless of where the servers are physically located.

What is SecNumCloud and why does it matter internationally?

SecNumCloud is a qualification awarded by France's national cybersecurity agency (ANSSI) to cloud providers that meet over 360 security and sovereignty criteria. Version 3.2 includes explicit protections against extra-territorial laws like the US Cloud Act. A SecNumCloud-qualified provider must host data in France or the EU, be legally independent of any entity subject to US law, and pass audits by accredited assessors. While specific to France, it is one of the most rigorous data sovereignty frameworks globally and serves as a reference point for EU regulatory discussions on cloud sovereignty.

Is an AI chatbot hosted in the EU automatically GDPR-compliant?

No. EU hosting is a necessary condition but not a sufficient one. A chatbot can be hosted on servers in Ireland or the Netherlands while being operated by a subsidiary of a US company, and therefore remain subject to the Cloud Act. Real GDPR compliance requires: a legally independent EU-based provider, a signed DPA, defined retention periods, no use of your data to train models, and operational management of data subject rights (access, rectification, erasure).

What certifications should I look for when choosing a sovereign AI chatbot vendor?

The right certifications depend on your sector and geography. ISO 27001 is a useful baseline for security management but does not cover legal sovereignty. For EU organizations: SecNumCloud (France) or C5 (Germany) provide the strongest sovereignty guarantees. For healthcare: look for HDS (EU), HITRUST, or SOC 2 Type II in addition to sovereignty requirements. For financial services: check for compliance with your national financial regulator's cloud guidelines. Always combine a security certification with verification that the provider has no parent company subject to US law.

How does prompt anonymization work before sending data to an LLM?

Pre-LLM anonymization means detecting and replacing personally identifiable information (names, contract numbers, addresses, phone numbers) in text before it is sent to the model API. Open-source tools such as Microsoft Presidio, or custom NER models trained on domain-specific data, can automate this detection. The tokenized text is sent to the LLM; the response is then re-contextualized at the application layer with the original identifiers. No detector catches every identifier, so this technique reduces exposure rather than eliminating it, but it substantially lowers the GDPR and sovereignty risk associated with API calls.

Is my data used to train AI models when I use a chatbot?

This depends entirely on the vendor and your contract configuration. Some providers use conversations to improve their models by default. With OpenAI's API and business offerings, for example, training on your data is disabled by default. Sovereign AI solutions contractually commit to never using your data to train their models. This point must appear explicitly in your DPA: if it is not in writing, it is not guaranteed.

Is Heeya GDPR-compliant and where is data hosted?

Yes. Heeya is designed for organizations with data protection compliance requirements. The application and RAG knowledge base are hosted in Europe with legally independent European providers. We provide a GDPR-compliant DPA, configurable retention periods, and we never use client data to train models. For highly sensitive sectors (healthcare, legal, public sector), we can explore configurations using sovereign LLMs (such as Mistral) or on-premise deployments depending on your constraints.
