All posts
Data Cleaning

Free-Text Fields from Hell

How to clean fields that contain everything except what belongs in them — using a local LLM, without a single record leaving the company network and without trusting the model’s hallucinations.

Anyone who migrates data or prepares it for AI systems gets to know one kind of field especially well: the free-text field. Nothing gets filled in more creatively.

Example: a field for customer numbers, labelled “enter customer numbers separated by commas”. What ends up in it are customer numbers — mixed with contacts, phone numbers, half-addresses.

Cleaning data like this with conventional means is practically hopeless. A customer number looks like a postcode, which in turn looks like a phone number. Filtering out the text only helps so far: once it’s gone, the remaining numbers form a new, false context.

This is where AI plays its one real strength — it recognises context, much like a human does. It reads “45312 Teststadt” as a place and “0170 123456” as a phone number, even without a clean field boundary.

That leaves the objection this audience raises immediately: we’re talking about names, addresses, phone numbers. Anyone who sends data like that to Claude or ChatGPT has a GDPR problem.

Data cleaning doesn’t need high-end models. A local LLM is enough, and the data never leaves the company network. In practice that means Ollama as the runtime, plus a model like Qwen2.5.

The process is simple. The data is exported. The AI first gets a basic cleaning prompt. Based on the edge cases in the real data, this prompt is then sharpened step by step — on lines like this one:

12345 Mr X, 45312 Teststadt, Tel. 0170 123456, 34321 - Ms Y

Then comes the part that decides whether the result is usable or not: the check. The AI’s output is verified with classic means — regular expressions — and in both directions. First: are the new values formatted correctly? Second, and this is the important direction: does every newly formatted value appear exactly like that in the original data?

This second direction catches what a language model occasionally does — invent a plausible-looking value that was never in the data. What the reverse check doesn’t confirm hasn’t been cleaned; it has been hallucinated. Without this check, the method wouldn’t be production-ready.

Depending on how badly the data is contaminated, several passes are needed. In the end there is usually a remainder of records that has to be cleaned by hand. The goal is to keep that remainder as small as possible.

Two limits you need to know. AI models recognise patterns well but calculate poorly. Format adjustments — inserting leading zeros, padding digits — therefore don’t belong in the AI step but in a classic script afterwards.

And cleaning never runs directly in the database or the target system. The path is export → cleaning → check → import, with a backup of the original data beforehand. Without exception.