Re: Fast CSV processing with SIMD

Pairs of quotes can be matched up cheaply on any machine with
PCLMULQDQ or an equivalent (the ARM instruction temporarily
escapes me).  This wonderful instruction gets a great writeup from
"Wunkolo" here:

To my knowledge, I invented or reinvented this trick during the
development of simdjson (https://github.com/simdjson/simdjson): by
multiplying a bitmask corresponding to the quotes by -1 using
"carryless multiply", we transform a "quote mask" into a "quoted
regions mask". There's even a relatively non-awful way to render the
computation 'continuous' as of course we don't just want to handle 64
bytes at a time, but to keep matching quotes for the whole buffer.
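A rough sketch of the idea, in Python rather than intrinsics so it runs anywhere. It is an illustration under stated assumptions, not simdjson's actual code: the helper names (prefix_xor, clmul_low64, quote_mask, quoted_regions) are made up here, and the shift-and-XOR loop stands in for the single PCLMULQDQ instruction. The low 64 bits of a carryless multiply by all-ones are a running XOR of the input bits, which is what turns a quote mask into a quoted-regions mask. The 'continuous' part is handled by carrying the inside-quotes state of the last bit into the next 64-byte block.

```python
MASK64 = (1 << 64) - 1

def prefix_xor(bits):
    """Bit i of the result is the XOR of input bits 0..i (64-bit wide)."""
    bits &= MASK64
    shift = 1
    while shift < 64:
        bits ^= (bits << shift) & MASK64
        shift <<= 1
    return bits

def clmul_low64(a, b):
    """Reference carryless multiply, keeping only the low 64 bits."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result & MASK64

def quote_mask(block):
    """Bit i is set when byte i of the block is a double quote."""
    mask = 0
    for i, byte in enumerate(block):
        if byte == ord('"'):
            mask |= 1 << i
    return mask

def quoted_regions(data):
    """Return one quoted-regions mask per 64-byte block of data.

    A set bit means the byte lies inside a quoted region; the opening
    quote counts as inside and the closing quote as outside.
    """
    masks = []
    carry = 0  # all-ones if the previous block ended inside quotes
    for start in range(0, len(data), 64):
        block = data[start:start + 64]
        region = prefix_xor(quote_mask(block)) ^ carry
        masks.append(region)
        # Propagate the state of the last bit into the next block.
        carry = MASK64 if (region >> 63) & 1 else 0
    return masks
```

In a real implementation the quote mask would come from a SIMD byte compare and prefix_xor would be one carryless multiply against an all-ones operand; the carry handling stays the same.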

For more details, see the section covering quote matching in our VLDB
paper at https://arxiv.org/abs/1902.08318. I hope it makes sense.

Such quote-matching doesn't solve the whole problem - naturally we'd
like to normalize. But it's a good first step toward finding which ','
characters are truly field separators and which are not (the question
your article focuses on). The double-quoting convention of CSV means
that we effectively 'leave-and-reenter' our quoted region every time we
encounter a doubled quote, which works just fine for establishing a
mask that distinguishes a quoted comma from a real one.
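To make the doubled-quote point concrete, here is a small sketch for a single line of at most 64 bytes. The function names (prefix_xor, byte_mask, field_separators) are invented for this illustration, and the running XOR is computed with plain shifts where real code would use one carryless multiply. A doubled quote toggles the inside-quotes state off and straight back on, so the commas that follow it are still masked out.

```python
MASK64 = (1 << 64) - 1

def prefix_xor(bits):
    """Bit i of the result is the XOR of input bits 0..i (64-bit wide)."""
    bits &= MASK64
    shift = 1
    while shift < 64:
        bits ^= (bits << shift) & MASK64
        shift <<= 1
    return bits

def byte_mask(block, ch):
    """Bit i is set when byte i of the block equals the byte value ch."""
    mask = 0
    for i, byte in enumerate(block):
        if byte == ch:
            mask |= 1 << i
    return mask

def field_separators(block):
    """Return the offsets of commas that are outside quoted regions.

    Assumes the block is at most 64 bytes and starts outside quotes.
    """
    quotes = byte_mask(block, ord('"'))
    commas = byte_mask(block, ord(','))
    inside = prefix_xor(quotes)        # quoted-regions mask
    real = commas & ~inside & MASK64   # keep only unquoted commas
    return [i for i in range(len(block)) if (real >> i) & 1]

line = b'a,"say ""hi"", ok",b'
separators = field_separators(line)
```

For this line only the commas at offsets 1 and 18 survive; the one at offset 13 sits inside the quoted field and is masked out, even though two doubled quotes precede it.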

I started some work on a simd csv parser (imaginatively named
"simdcsv") but ran out of steam.

Geoff Langdale.

Re: Fast CSV processing with SIMD

Thanks, Geoff, this is great! I wasn't aware of PCLMULQDQ, and that 
writeup you shared is excellent. I'm definitely going to use this 
instruction in the future.

I was aware of simdjson through Daniel Lemire, but it hadn't occurred to 
me to look into how you handled quotes. I'll need to study your paper to 
see what other techniques I've been missing.