Personal Data and LLMs: How Well Does Automated Anonymization Actually Work?
A practical evaluation of standard NER libraries for German business texts
Andre Jahn, Jahn Consulting - March 2026
The Problem
Anyone looking to use a Large Language Model like Claude or GPT with enterprise data faces a fundamental question: What happens to the personal data in the text?
The GDPR requires a legal basis for every processing of personal data. The moment a document containing names, addresses, or identifiable personal references is passed to an LLM, processing occurs - regardless of whether the model runs in the cloud or on-premises.
In theory, several paths exist to legitimize this processing. In practice, most of them fail against the realities of everyday business operations.
Why Consent Rarely Works
The obvious approach - asking the data subjects - sounds simple but is problematic in the most common scenarios.
For employee data, consent is barely defensible. The GDPR and European supervisory authorities view the power imbalance between employer and employee as a fundamental problem. Consent given in the context of an employment relationship is considered not truly voluntary - and therefore challengeable. "Do you agree that we process your personnel file through an LLM?" is not a question where an employee decides freely.
For customer data, consent is theoretically feasible but operationally demanding. It must be specific, informed, and voluntary. You need to explain what the LLM does with the data - which is conceptually difficult for a probabilistic language model. Additionally, consent is revocable at any time, and processing must be stopped immediately upon revocation.
And even with consent, Art. 5(1)(c) GDPR applies - the principle of data minimization. Even with a valid legal basis, only the data necessary for the specific purpose may be processed. Feeding an entire case file into an LLM when only a summary is needed is arguably a violation of that principle.
This leaves the most pragmatic path for most enterprise scenarios: Anonymize personal data before the LLM call, let the LLM work with sanitized data, and re-insert the original data in the output.
The Technical Approach: Anonymization and De-Anonymization
The architecture of such a pipeline looks conceptually straightforward:
Original text → PII detection → Anonymization → LLM processing → De-anonymization → Output
At its core lies automated detection of personal data - Named Entity Recognition (NER). Specialized models analyze the text and tag entities such as person names, locations, organizations, and other identifiable information.
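The round trip can be sketched in a few lines of plain Python. The entity spans below are hardcoded for illustration - in a real pipeline they would come from the PII detection step (e.g. Presidio with a spaCy or Flair backend) - and the pseudonym format is an assumption of this sketch, not a Presidio convention:

```python
# Minimal sketch of the anonymize -> de-anonymize round trip.
# Spans (start, end, label) are hardcoded; in practice they come from
# an NER backend. The pseudonym naming scheme is illustrative only.

def anonymize(text, spans):
    """Replace each span with a pseudonym; return sanitized text + mapping.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    mapping = {}
    out = text
    for i, (start, end, label) in enumerate(sorted(spans, reverse=True)):
        pseudonym = f"<{label}_{i}>"
        mapping[pseudonym] = text[start:end]
        out = out[:start] + pseudonym + out[end:]
    return out, mapping

def deanonymize(text, mapping):
    """Re-insert the original values after the LLM call."""
    for pseudonym, original in mapping.items():
        text = text.replace(pseudonym, original)
    return text

text = "Max Müller aus Berlin hat bei der Deutschen Bank angerufen."
spans = [(0, 10, "PER"), (15, 21, "LOC"), (34, 48, "ORG")]

sanitized, mapping = anonymize(text, spans)
# sanitized: "<PER_2> aus <LOC_1> hat bei der <ORG_0> angerufen."
restored = deanonymize(sanitized, mapping)  # identical to the original text
```

The sanitized string is what the LLM sees; the mapping never leaves the internal system.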
The most widely used framework for this purpose is Microsoft Presidio - an open-source tool that combines detection and anonymization and also provides a de-anonymization path. Under the hood, Presidio uses NER models for the actual detection. The default backend is spaCy, the most widely deployed NLP library for production environments. An alternative is Flair, a framework developed by Zalando Research that leverages contextual embeddings and often achieves higher accuracy in entity recognition benchmarks.
The decisive question is: How well do these models perform on German business texts?
The Test
To find out, I tested both models with the same input text - a sentence that occurs daily in German enterprises:
"Max Müller aus Berlin hat am 15. Januar bei der Deutschen Bank angerufen und mit Frau Krause über die Erweiterung der Kreditlinie gesprochen."
(Max Müller from Berlin called Deutsche Bank on January 15th and spoke with Mrs. Krause about extending the credit line.)
I added variations representing typical patterns in German business correspondence: honorific plus surname ("Frau Weise" / Mrs. Weise, "Herrn Dr. Schmidt" / Mr. Dr. Schmidt), abbreviated honorifics ("Fr. Krause"), uncommon surnames ("Frau Waise"), and a contextual person reference without a named entity ("The colleague from accounting who had the incident last month").
Models Tested
- spaCy with the de_core_news_lg model - the largest non-transformer German model, trained on news text
- Flair with the ner-german-large model - a transformer-based model (XLM-RoBERTa), trained on the CoNLL-2003 dataset
Results
| Test Case | spaCy | Flair |
|---|---|---|
| "Max Müller" (first + last name) | ✓ PER | ✓ PER |
| "Frau Krause" (Mrs. + surname) | ✗ not detected | ✓ PER |
| "Frau Waise" (Mrs. + rare surname) | ✗ not detected | ✓ PER |
| "Frau Weise" (Mrs. + surname) | ✗ not detected | ✓ PER |
| "Herrn Dr. Schmidt" (Mr. Dr. + surname) | ✗ not detected | ✓ PER |
| "Fr. Krause / Heinrich&Co" (abbr. + company) | ✗ merged as one entity | ✓ PER + ORG (separated) |
| "Anja Krause" (first + last name) | ✓ PER | ✓ PER |
| "Kollege aus Buchhaltung" (contextual reference) | ✗ not detected | ✗ not detected |
Analysis
spaCy: First Name as Trigger
spaCy recognizes "Max Müller" and "Anja Krause" - classic first-name-plus-surname patterns. But every variation using an honorific instead of a first name is missed: "Frau Krause", "Frau Weise", "Frau Waise", "Herrn Dr. Schmidt" - not a single hit. The most common naming pattern in German business texts is completely ignored.
Particularly problematic is the case "Fr. Krause von Heinrich&Co". spaCy recognizes the entire string "Fr. Krause von Heinrich&Co" as a single person entity. The company disappears as an independent organization. If you anonymize this text, the entire string gets replaced by a single pseudonym - the company reference is lost, and the anonymized text becomes unusable.
The cause is understandable: The de_core_news_lg model is trained on news texts. In newspaper articles, you find "Angela Merkel" or "Chancellor Scholz" - not "Mrs. Merkel" or "Mr. Dr. Schmidt". The model simply hasn't learned the honorific-plus-surname pattern sufficiently.
Flair: Significantly Better, But Not Perfect
Flair recognizes all honorific patterns. "Frau Krause", "Frau Waise", "Herrn Dr. Schmidt" - all correctly identified, all with a confidence score of 1.0. For "Fr. Krause von Heinrich&Co", Flair cleanly separates: "Krause" as PER, "Heinrich&Co" as ORG.
One detail deserves attention: Flair returns only the surname as the entity, not the honorific. "Frau Krause" becomes entity "Krause". For anonymization, this means the replacement logic must consider context - a naive search-and-replace of "Krause" could also hit "Krausestraße" (Krause Street). Position-based replacement using the start and end indices provided by Flair is essential.
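The pitfall can be shown with a hypothetical sentence in which the surname also occurs inside a street name. The person span offsets are hardcoded here to what Flair would report (surname only); the pseudonym is an illustrative placeholder:

```python
# Hypothetical sentence in which "Krause" occurs both as a person
# and inside a street name.
text = "Frau Krause wohnt in der Krausestraße 12."

# Person span as Flair would report it: surname only, with character
# offsets (hardcoded here for illustration).
per_start, per_end = 5, 11  # "Krause"

# Naive search-and-replace also corrupts the street name:
naive = text.replace("Krause", "<PER_0>")
# -> "Frau <PER_0> wohnt in der <PER_0>straße 12."

# Position-based replacement touches only the detected span:
safe = text[:per_start] + "<PER_0>" + text[per_end:]
# -> "Frau <PER_0> wohnt in der Krausestraße 12."
```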
The tradeoff: Flair is significantly slower than spaCy. In benchmarks, spaCy processes the same dataset in three minutes; Flair takes over 100 minutes. For batch processing of large document collections, this is relevant.
The Hard Limit: Contextual Personal References
"The colleague from accounting who had the incident last month" - for both models: empty. No entity detected.
This is not a bug. There is no named entity here. No name, no location, no organization. Yet in an enterprise context, the sentence is potentially personally identifiable - if the accounting department has only three employees and one of them had an incident last month, the person is identifiable.
This kind of contextual personal reference can only be detected by something that understands the sentence - a language model. And that creates the chicken-and-egg problem: To protect data from AI, you need an AI that sees the data.
A possible architectural compromise is a weaker, local model as a pre-filter for semantic detection, whose output then goes to the actual LLM. From a data protection perspective, this would be internal processing. Whether European open-source models are capable enough for this remains an open question.
Practical Implications
Model Selection Is an Architecture Decision
spaCy in its default configuration is insufficient for anonymizing German business texts. This is not a judgment on spaCy's quality as a framework - it is a statement about the trained model and its training dataset. Anyone deploying Presidio with the default backend and relying on the detection rate is operating under a false sense of security.
Flair as the NER backend - specifically the ner-german-large model - delivers significantly better results for German texts. Presidio supports Flair as an alternative backend; switching requires no rebuild of the anonymization pipeline.
Rule-Based Augmentation Remains Necessary
Even with the better model, NER-based detection should be supplemented by rule-based patterns. Email addresses, IBANs, phone numbers, social security numbers - these are structured data that can be detected deterministically via regex, more reliably than any model and fully auditable.
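A minimal sketch of such a regex tier, with deliberately simplified patterns - production rules should be stricter (the IBAN pattern below only covers the German grouping and does not validate the checksum):

```python
import re

# Simplified patterns for illustration only; production rules should be
# stricter (e.g. IBAN checksum validation, international formats).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IBAN": re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b"),  # German IBAN shape
    "PHONE": re.compile(r"\+49[ \d/-]{7,}"),  # rough German phone format
}

def detect_structured_pii(text):
    """Return (label, start, end) for every deterministic match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

sample = "Bitte an max.mueller@example.de, IBAN DE44 5001 0517 5407 3249 31."
```

Because each match carries its character offsets, the output plugs directly into the same position-based replacement used for NER entities, and every hit is explainable by pointing at the rule that fired.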
The recommended pipeline is three-tiered:
- Tier 1 - Regex patterns for structured PII (deterministic, auditable)
- Tier 2 - NER model (Flair ner-german-large) for names, locations, organizations
- Tier 3 - Documented residual risk for contextual personal references that are only semantically detectable
The Residual Risk Must Be Documented
No technical solution achieves 100% detection. This is not a weakness of the tools but a structural property of the task. Contextual personal references, indirect identification through combinations of attributes (the mosaic problem), and temporally bound references remain as residual risk.
This residual risk is documentable and assessable within a Data Protection Impact Assessment (DPIA). The combination of technical measures (regex + NER), organizational measures (training, processes), and documented residual risk is a defensible approach for most enterprise scenarios - and more honest than claiming to have a complete solution.
Reproducibility
The test scripts used are available as a GitHub Gist. Results can be reproduced with the following packages:
```
pip install spacy flair
python -m spacy download de_core_news_lg
```
The Flair model ner-german-large is downloaded automatically on first use (approximately 1.5 GB).
Conclusion
A GDPR-compliant LLM pipeline for enterprise data is technically feasible. The tools exist, the architecture pattern is clear. But the default configuration is insufficient for German business texts - neither in model selection nor in expectations about detection rates.
Anyone taking this path must make three conscious decisions: Which NER model is appropriate for their text domain. Which additional rule-based filters augment the detection. And which residual risk is acceptable and documentable.
The honest answer is not "we solved it," but "here is the boundary of what is technically solvable - and this is how we handle the rest."
---

Andre Jahn is a Solution Architect at Jahn Consulting, advising enterprises on AI governance and enterprise AI architecture.