From: Geoff Langdale
Date: Sat, 4 Dec 2021 22:13:40 +1000
Subject: Re: Fast CSV processing with SIMD
To: ~skeeto/public-inbox@lists.sr.ht

A cheap way of matching up pairs of quotes can be done with any machine with PCLMULQDQ or equivalent (the ARM instruction temporarily escapes me). This wonderful instruction gets a great writeup from "Wunkolo" here:

https://wunkolo.github.io/post/2020/05/pclmulqdq-tricks/

To my knowledge, I invented or reinvented this trick during the development of simdjson (https://github.com/simdjson/simdjson): by multiplying a bitmask corresponding to the quotes by -1 using "carryless multiply", we transform a "quote mask" into a "quoted regions mask". There's even a relatively non-awful way to render the computation 'continuous', as of course we don't just want to handle 64 bytes at a time, but to keep matching quotes for the whole buffer.

For more details, see the section covering quote matching in our VLDB paper at https://arxiv.org/abs/1902.08318. I hope it makes sense.

Such quote-matching doesn't solve the whole problem - naturally we'd like to normalize. But it's a good first step to finding which ',' characters are truly field separators and which are not (as your article focuses on) - the double-quoting convention of CSV means that we effectively 'leave-and-reenter' our quoted region every time we encounter such a quote, which works just fine for establishing a mask that allows us to distinguish a quoted comma from a real one.

I started some work on a SIMD CSV parser (imaginatively named "simdcsv") but ran out of steam.

Regards,
Geoff Langdale.
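
As a rough illustration of the quote-matching trick described above, here is a minimal C sketch. It is not taken from simdjson or simdcsv; the names prefix_xor, byte_mask and csv_separators are made up for this example, and the masks are built with a plain scalar loop rather than SIMD compares. To stay portable it computes the prefix XOR with a shift-and-XOR cascade, which should give the same low 64 bits as a carryless multiply of the quote mask by an all-ones word (PCLMULQDQ via _mm_clmulepi64_si128 on x86; PMULL is the analogous ARM instruction). The carry word is one assumed way to make the computation 'continuous' across 64-byte blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Prefix XOR: bit i of the result is the XOR of bits 0..i of m.
 * This matches the low 64 bits of the carryless product of m and
 * an all-ones word (i.e. "multiplying by -1"). */
static uint64_t prefix_xor(uint64_t m)
{
    m ^= m << 1;
    m ^= m << 2;
    m ^= m << 4;
    m ^= m << 8;
    m ^= m << 16;
    m ^= m << 32;
    return m;
}

/* Scalar stand-in for a SIMD compare + movemask:
 * bit i is set where buf[i] == c. Handles at most 64 bytes. */
static uint64_t byte_mask(const char *buf, size_t len, char c)
{
    uint64_t m = 0;
    assert(len <= 64);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == c) {
            m |= (uint64_t)1 << i;
        }
    }
    return m;
}

/* Returns a mask of the commas in one block (up to 64 bytes) that lie
 * outside quoted regions. *carry must be 0 before the first block; it is
 * updated to all-ones when the block ends inside a quoted region, so the
 * next block starts in the right state. */
static uint64_t csv_separators(const char *buf, size_t len, uint64_t *carry)
{
    uint64_t quotes = byte_mask(buf, len, '"');
    uint64_t commas = byte_mask(buf, len, ',');

    /* Quote mask -> quoted-regions mask. Each region covers the opening
     * quote up to, but not including, the closing quote. A doubled quote
     * ("") leaves and immediately re-enters the region. */
    uint64_t inside = prefix_xor(quotes) ^ *carry;

    /* Broadcast the top bit: all-ones if still inside a quote, else 0. */
    *carry = (uint64_t)0 - (inside >> 63);

    return commas & ~inside;
}
```

For the input a,"b,c",d with carry initially 0, csv_separators returns a mask with bits 1 and 7 set: the comma at offset 4 sits inside the quoted region and is masked out. Short inputs are treated as if padded out to a full 64-byte block.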