Technology | 6 min read | 27.06.2026 | Max Fey

A CSV looks like the safest file you can move. It is the most dangerous.

A CSV imports cleanly and still leaves wrong data in your system. Here is why that happens, and how we secure every file handoff before it quietly breaks something.

A CSV looks like the safest thing you can move between systems. That is exactly the problem.

Open a CSV and you see rows and columns of plain, readable text. Nothing binary, nothing hidden, nothing that needs a special viewer. That transparency is what makes teams trust it, and trusting a file without checking it is where automations quietly fall apart.

The thing most people miss is that CSV is not really a format. There is a specification, RFC 4180, but it is an informational document rather than a binding standard, almost no tool follows it strictly, and nobody enforces it. The delimiter, the character encoding, the quoting rules, and the line endings all vary between systems. Two tools that both claim to "support CSV" routinely fail to read each other's output. With JSON or a proper API, a number is a number and the encoding is fixed. With CSV, everything is a string and each system guesses what was meant. That guessing is behind almost every CSV mess I have had to clean up.

Four ways a CSV corrupts your data without saying a word

The first is encoding. UTF-8 against the older Windows code pages. If a file is saved as one and read as the other, accented characters turn into garbage and the import never complains. It writes the broken text straight into the database, and you find out when a customer asks why their name is full of symbols.
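The mismatch is easy to reproduce. The following is a minimal sketch, assuming Windows-1252 as the "older code page" and using a made-up sample name; the helper `is_valid_utf8` is a hypothetical name, not part of any library.

```python
# Hypothetical sample value; any non-ASCII text behaves the same way.
name = "Müller"

# The exporting system writes UTF-8 bytes...
raw = name.encode("utf-8")

# ...and the importing system wrongly assumes Windows-1252.
# No exception is raised; the text is simply wrong.
misread = raw.decode("cp1252")
print(misread)  # MÃ¼ller


def is_valid_utf8(data: bytes) -> bool:
    """Strict decode: bytes that are not valid UTF-8 fail loudly."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The opposite direction can be caught, because a lone 0xFC byte
# (the "ü" in Windows-1252) is not valid UTF-8.
print(is_valid_utf8(name.encode("cp1252")))  # False
```

The asymmetry matters: reading UTF-8 as a single-byte code page almost never fails, while a strict UTF-8 decode rejects most mis-encoded input. That is one reason to insist on UTF-8 and decode strictly.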

The second is delimiters. A comma sits inside a value, or a region exports with semicolons instead of commas, and a parser that splits on the wrong character turns one column into two. Every column after it shifts by one, on every row.
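A small sketch of that effect with Python's standard `csv` module. The sample data and the `parse` helper are invented for illustration; the point is only that the wrong delimiter produces rows without raising an error.

```python
import csv
import io

# Hypothetical export from a region that uses semicolons as the delimiter
# and a comma as the decimal separator.
data = "sku;name;price\n00123;Widget;4,50\n"


def parse(text: str, delimiter: str) -> list[list[str]]:
    """Parse CSV text with an explicitly stated delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


wrong = parse(data, ",")   # parser assumes commas
right = parse(data, ";")   # parser is told the real delimiter

print(wrong)  # [['sku;name;price'], ['00123;Widget;4', '50']]
print(right)  # [['sku', 'name', 'price'], ['00123', 'Widget', '4,50']]
```

Neither call fails. The only visible symptom of the wrong guess is a column count that does not match the header, which is why checking it is worth the effort.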

The third, and the worst, is the spreadsheet. The damage here happens before the file even reaches your workflow. A spreadsheet tries to be helpful and rewrites your data on its own. It strips leading zeros from product codes and postal codes. It converts long digit strings like order numbers and barcodes into scientific notation, so 4012345678901 becomes 4.01235E+12 and the original digits are gone. It turns anything that resembles a date into a date, so a part number like 3-11 becomes the eleventh of March. Once the file is saved, those values are not damaged. They are gone, and you cannot rebuild them from the file.
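The loss can be imitated in a few lines. This is not how any spreadsheet is implemented; it is a rough sketch of the effect, assuming a display format with five decimal digits in the mantissa. `guess_number` is a hypothetical helper.

```python
def guess_number(value: str):
    """Mimic naive type detection: whatever parses as a number becomes one."""
    try:
        return int(value)
    except ValueError:
        return value


# A postal code loses its leading zero the moment it is treated as a number.
postal = guess_number("01067")
print(postal)  # 1067

# A 13-digit barcode shown in scientific notation keeps only six digits.
barcode = format(float("4012345678901"), ".5E")
print(barcode)  # 4.01235E+12

# Once that text is what gets saved, the original digits cannot be rebuilt.
restored = int(float(barcode))
print(restored)  # 4012350000000
```

The last line is the important one: the restored value is a different number, and nothing in the file says so.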

The fourth is quoting and special characters. A free-text field with a comma, a quote, or a line break inside it: an address, or a comment with a paragraph in the middle. A parser that cuts at every delimiter tears the row apart, and an embedded line break splits one record into two, so the row count is wrong for the rest of the file.
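A sketch of the difference between cutting at every delimiter and using a quote-aware parser, here Python's standard `csv` module. The two-record sample file is invented for illustration.

```python
import csv
import io

# Hypothetical two-record file. The comment field holds a comma, an escaped
# quote, and a line break, all inside a properly quoted field.
data = 'id,comment\r\n1,"Ring twice, then ""knock""\nLeave at door"\r\n2,ok\r\n'

# Naive approach: split on line breaks, then on commas.
naive_rows = [line.split(",") for line in data.splitlines()]

# Quote-aware approach: newline="" leaves line endings to the csv module.
proper_rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_rows))   # 4 (header + 3 fragments instead of 2 records)
print(len(proper_rows))  # 3 (header + 2 records)
print(proper_rows[1])    # ['1', 'Ring twice, then "knock"\nLeave at door']
```

The naive version reports one row too many and never raises, which is exactly the kind of wrong row count described above.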

Automation makes this worse, not better

A person typing fifty rows by hand stops when a postal code looks too short. An automation imports forty thousand rows at three in the morning and reports success.

That green check mark means "file processed," not "data correct." The scenario did its job. It read rows and wrote records. It has no idea what a valid order number or a real postal code looks like, so it cannot tell that the contents are nonsense. The machine checks the shape, never the meaning. The corruption stays silent and surfaces weeks later, after the bad data has already flowed into invoices, mailings, and reports.

The damage rarely stays in one place

The nasty part about broken data from a CSV is that it does not sit still. The wrong postal code travels from the CRM into the shipping provider, onto the label, and into the returns pile. The order number that got mangled into a float lands in the warehouse system and matches the wrong item at the next sync. The amount with the misread thousands separator goes into the billing run. Every downstream step treats the garbage as truth, because the upstream system reported "success."

Then there are the numbers themselves. 1.000 means one thousand in Germany and exactly one, written to three decimal places, in English-speaking countries. Read the file with the wrong assumption and a thousand euros becomes one euro, or the other way around. In a report someone might catch it. In an automatic billing run, probably not.
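The same string really does yield two different amounts, depending only on which convention the reader assumes. A minimal sketch, with `parse_amount` as a hypothetical helper that takes both separators as explicit arguments instead of guessing:

```python
from decimal import Decimal


def parse_amount(text: str, decimal_sep: str, thousands_sep: str) -> Decimal:
    """Parse an amount using explicitly stated separators."""
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


german = parse_amount("1.000", decimal_sep=",", thousands_sep=".")
english = parse_amount("1.000", decimal_sep=".", thousands_sep=",")

print(german)   # 1000
print(english)  # 1.000
```

`Decimal` is used rather than `float` so that monetary values are not additionally affected by binary rounding. The sketch does not validate that separators appear in sensible positions; a production parser would.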

That is exactly why the point where the file comes in is the only place where the damage is still cheap to stop. An hour of validation at the interface costs less than the question, three weeks later, of which of the 38,000 records can still be saved.

What we do instead

When we control both ends, we skip the file entirely and use an API or JSON rather than a CSV export. A number stays a number, the encoding is fixed, and nobody can open a spreadsheet in the middle.
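To illustrate why JSON is the safer carrier, here is a short sketch with Python's standard `json` module. The field names and values are made up; the point is that the type of each value is part of the payload rather than a guess by the reader.

```python
import json

# Hypothetical order payload. Codes are strings, quantities are numbers,
# and the distinction is written into the data itself.
payload = (
    '{"order_id": "4012345678901", "postal_code": "01067", '
    '"quantity": 12, "amount": 1000.5, "name": "Müller"}'
)

record = json.loads(payload)

print(type(record["postal_code"]).__name__)  # str
print(record["postal_code"])                 # 01067
print(type(record["quantity"]).__name__)     # int
print(record["amount"])                      # 1000.5

# Non-ASCII text survives a round trip unchanged.
print(json.loads(json.dumps(record))["name"])  # Müller
```

Identifiers such as order numbers and postal codes still have to be sent as strings by the producing side. JSON preserves the decision; it does not make it for you.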

When CSV is unavoidable, we pin everything down. The encoding is set to UTF-8 and checked at the top of the file. The delimiter is stated explicitly instead of guessed. And before anything is processed, the file is validated against a schema: the right number of columns, every required field present, formats that actually match. A file that breaks the schema is rejected, not half-imported.
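What such a gate can look like, as a minimal sketch rather than our actual implementation. The column names, the patterns in `SCHEMA`, the semicolon default, and the function name `validate_csv` are all assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical schema: column name -> pattern the whole value must match.
SCHEMA = {
    "order_id": re.compile(r"\d{13}"),
    "postal_code": re.compile(r"\d{5}"),
    "name": re.compile(r".+"),
}


def validate_csv(raw: bytes, delimiter: str = ";") -> list[dict[str, str]]:
    """Return every row, or raise ValueError. Never a partial result."""
    try:
        # Strict decode; "utf-8-sig" also accepts an optional leading BOM.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {exc}") from exc

    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header != list(SCHEMA):
        raise ValueError(f"unexpected header: {header!r}")

    rows = []
    for fields in reader:
        line_no = reader.line_num
        if len(fields) != len(SCHEMA):
            raise ValueError(
                f"line {line_no}: expected {len(SCHEMA)} columns, got {len(fields)}"
            )
        for (column, pattern), value in zip(SCHEMA.items(), fields):
            if not pattern.fullmatch(value):
                raise ValueError(f"line {line_no}: bad {column}: {value!r}")
        rows.append(dict(zip(SCHEMA, fields)))
    return rows
```

The function collects rows in memory and returns them only after the last one has passed, so a caller cannot end up with half a file imported. For very large files the same idea works as two passes: validate first, then process.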

We also read everything as text and parse it ourselves, with the tool's automatic type detection switched off. A postal code is text even when it is all digits. Hand that decision to the tool and it strips the leading zero again.
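In Python, the standard `csv` module already behaves this way: it never converts types, so every value arrives as a string and conversion becomes an explicit decision per column. A sketch with invented column names and sample data:

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample file.
data = "postal_code;quantity;ship_date\n01067;12;2026-06-27\n"

# csv.DictReader performs no type detection: every value is a str.
row = next(csv.DictReader(io.StringIO(data), delimiter=";"))

# Conversion happens here, column by column, with the format stated.
ship_date: date = datetime.strptime(row["ship_date"], "%Y-%m-%d").date()
record = {
    "postal_code": row["postal_code"],  # stays text, leading zero intact
    "quantity": int(row["quantity"]),   # a real count, so a real integer
    "ship_date": ship_date,             # parsed with an explicit format
}

print(record["postal_code"])  # 01067
print(record["quantity"])     # 12
print(record["ship_date"])    # 2026-06-27
```

Tools that do guess types usually have a switch for it. In pandas, for example, passing `dtype=str` (and `keep_default_na=False`, so empty cells stay empty strings) to `read_csv` is the comparable setting.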

And spreadsheets stay out of the pipeline. A spreadsheet is a tool for people, not for machines. The moment someone "just opens it and saves it again," the whole effort is wasted. In a clean automation, no spreadsheet touches the data between source and destination.

On top of that we carry a canary row, a known test record with the tricky values at the start or end of the file: a postal code with a leading zero, an accented character, a number in the local format. If it arrives intact, the file survived the trip. If it arrives broken, the workflow stops before it touches the real data.
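A canary check can be very small. The sketch below assumes the canary is the first row and that rows have already been read as dictionaries of strings; the field names, the values in `CANARY`, and the function name `check_canary` are illustrative.

```python
# Hypothetical canary record; each value is chosen to break in a known way.
CANARY = {
    "postal_code": "01067",  # leading zero
    "name": "Zoë Müller",    # accented characters
    "amount": "1.234,56",    # number in the local (German) format
}


def check_canary(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return the real rows if the first row is an intact canary, else raise."""
    if not rows or rows[0] != CANARY:
        raise ValueError(f"canary row damaged or missing: {rows[:1]!r}")
    return rows[1:]


rows = [
    dict(CANARY),
    {"postal_code": "80331", "name": "Anna Schmidt", "amount": "19,99"},
]
real_rows = check_canary(rows)
print(len(real_rows))  # 1
```

The comparison is deliberately exact. A canary that arrives as `1067` or `ZoÃ« MÃ¼ller` is not "close enough"; it is proof that something in the chain rewrote the file.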

The real problem is not CSV

CSV is just where the failure shows up first, because the format looks so harmless. Behind it sits trust without verification. We treat a file that has passed through unknown hands and unknown programs as if its contents were clean, simply because it opens.

Something that looks like a table is still not a table. It is a text file built on conventions nobody agreed on. Check at every handoff instead of trusting, and you have ruled out most of this class of failure already. Skip that, and you will not notice it at the workflow. You will notice it in the mail that comes back.

#CSV #DataFormats #Encoding #UTF-8 #Excel #DataQuality #Automation