Pairs of quotes can be matched up cheaply on any machine with PCLMULQDQ or an equivalent (the ARM instruction temporarily escapes me). This wonderful instruction gets a great writeup from "Wunkolo" here: https://wunkolo.github.io/post/2020/05/pclmulqdq-tricks/

To my knowledge, I invented or reinvented this trick during the development of simdjson (https://github.com/simdjson/simdjson): by multiplying a bitmask corresponding to the quotes by -1 (an all-ones word) using "carryless multiply", we transform a "quote mask" into a "quoted regions mask". There's even a relatively non-awful way to make the computation 'continuous', as of course we don't just want to handle 64 bytes at a time but to keep matching quotes across the whole buffer. For more details, see the section covering quote matching in our VLDB paper at https://arxiv.org/abs/1902.08318. I hope it makes sense.

Such quote matching doesn't solve the whole problem - naturally we'd like to normalize. But it's a good first step towards finding which ',' characters are truly field separators and which are not (as your article focuses on). The double-quoting convention of CSV means that we effectively 'leave and re-enter' our quoted region every time we encounter such a doubled quote, which works just fine for establishing a mask that distinguishes a quoted comma from a real one.

I started some work on a SIMD CSV parser (imaginatively named "simdcsv") but ran out of steam.

Regards, Geoff Langdale.
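For readers who want to see the idea in the comment above in concrete form, here is a rough sketch in plain Python. It is not code from simdjson or simdcsv: `clmul64_low` is a slow software stand-in for what PCLMULQDQ does in hardware, and the helper names (`quote_mask`, `quoted_regions`, `field_separators`) are made up for this illustration. The sketch assumes 64-byte blocks, `"` as the quote character and `,` as the separator. The cross-block carry (XORing in an all-ones or all-zeros word depending on whether the previous block ended inside quotes) is one plausible way to make the computation 'continuous'; see the paper for the real treatment.

```python
MASK64 = (1 << 64) - 1  # the all-ones 64-bit word, i.e. "-1"


def clmul64_low(a, b):
    """Low 64 bits of the carryless (XOR instead of add) product of a and b.

    Software stand-in for the hardware carryless multiply instruction.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = (a << 1) & MASK64
        b >>= 1
    return result


def quote_mask(block):
    """Bit i is set when byte i of the block (at most 64 bytes) is a quote."""
    mask = 0
    for i, byte in enumerate(block):
        if byte == ord('"'):
            mask |= 1 << i
    return mask


def quoted_regions(block, inside_prev):
    """Turn the quote mask into a quoted-regions mask.

    Multiplying by all-ones makes bit i the XOR of quote bits 0..i, so a bit
    is set from an opening quote up to (not including) the closing quote.
    inside_prev is MASK64 if the previous block ended inside quotes, else 0.
    """
    regions = clmul64_low(quote_mask(block), MASK64) ^ inside_prev
    inside_next = MASK64 if (regions >> 63) & 1 else 0
    return regions, inside_next


def field_separators(data):
    """Positions of commas that fall outside any quoted region."""
    positions = []
    inside = 0
    for start in range(0, len(data), 64):
        block = data[start:start + 64]
        regions, inside = quoted_regions(block, inside)
        for i, byte in enumerate(block):
            if byte == ord(',') and not (regions >> i) & 1:
                positions.append(start + i)
    return positions


print(field_separators(b'a,"b,c",d'))    # [1, 7]
print(field_separators(b'"a""b,c",d'))   # [8]
```

In the second input the doubled quote `""` flips the parity twice, so the region is left and immediately re-entered, and the comma after `b` still counts as quoted. A real SIMD version would build the quote mask with a vector compare and replace the per-byte loops with bitwise operations on whole 64-bit masks.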
Thanks, Geoff, this is great! I wasn't aware of PCLMULQDQ, and that writeup you shared is excellent. I'm definitely going to use this instruction in the future. I was aware of simdjson through Daniel Lemire, but it hadn't occurred to me to look into how you handled quotes. I'll need to study your paper to see what other techniques I've been missing.