March 27, 2023

Regular expressions are a really useful tool in a programmer’s toolbox. However, they can’t do everything. And one of the things they can’t do is reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):

""","""

This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

E.g. a single value of:

","

How each character is interpreted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escape for the following character or might be end of data’ state. The third quote puts you back into an ‘inside data’ state.
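
To make the idea of states concrete, here is a minimal sketch of a state machine for CSV in Python. The names (State, parse_csv) are made up for illustration, and it assumes comma separators, double-quote escaping and '\n' line endings only. It is lenient with malformed input rather than a complete or validating parser.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # at the start of a data value
    UNQUOTED = auto()         # inside an unquoted data value
    QUOTED = auto()           # the 'inside data' state of a quoted value
    QUOTE_IN_QUOTED = auto()  # saw a quote: escape, or end of data?


def parse_csv(text):
    """Parse CSV text into a list of rows, each a list of strings."""
    rows, row, field = [], [], []
    state = State.FIELD_START
    for ch in text:
        if state is State.QUOTED:
            if ch == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)  # commas and newlines are just data here
        elif state is State.QUOTE_IN_QUOTED and ch == '"':
            field.append('"')     # doubled quote means one literal quote
            state = State.QUOTED
        elif ch == '"' and state is State.FIELD_START:
            state = State.QUOTED  # opening quote of a quoted value
        elif ch == ',':
            row.append(''.join(field))
            field = []
            state = State.FIELD_START
        elif ch == '\n':
            row.append(''.join(field))
            field = []
            rows.append(row)
            row = []
            state = State.FIELD_START
        else:
            field.append(ch)
            state = State.UNQUOTED
    # Flush the last value if the text does not end with a newline.
    if state is not State.FIELD_START or field or row:
        row.append(''.join(field))
        rows.append(row)
    return rows


print(parse_csv('""","""'))
```

Fed the 7-character file above, this returns one row containing one value: quote, comma, quote. The same comma or quote character is handled differently depending on which state the parser is in when it arrives.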

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point will be garbage.

You can write a regex that can handle CSV files where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes and carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
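
As a small illustration of where that subset ends, the sketch below splits one made-up line on commas with a regex and compares the result with Python’s standard csv module, which tracks quoting as it reads. The sample line is invented for this example.

```python
import csv
import io
import re

# Invented sample: the second value contains a comma, the third escaped quotes.
line = 'Smith,"Portland, OR","said ""hi"""'

# Naive approach: treat every comma as a separator.
naive = re.split(r',', line)

# The csv module keeps track of whether it is inside a quoted value.
proper = next(csv.reader(io.StringIO(line)))

print(naive)   # 4 pieces, with the quote characters still attached
print(proper)  # 3 values
```

The regex version cuts "Portland, OR" in two and leaves the quoting characters in the output. The csv module returns the three values intended.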

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.
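
For comparison, a sketch of regex-based TSV parsing in Python. It assumes the simple TSV flavour described above (no tabs or line breaks inside values, no escaping), and the sample data is invented.

```python
import re

# Invented sample data; commas and quotes need no special treatment in TSV.
tsv_text = 'name\tcity\nSmith\tPortland, OR\nJones\tsaid "hi"\n'

# Split into lines, skip empty ones, then split each line on tabs.
rows = [re.split(r'\t', line)
        for line in re.split(r'\r?\n', tsv_text)
        if line]

print(rows)
```

Because a tab or line break can only ever be a separator here, no state is needed to decide what each character means.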

See also on Stack Overflow:

Using regular expressions to parse HTML: why not?