~skeeto/public-inbox


Re: w64devkit: a Portable C and C++ Development Kit for Windows

Kurt Jung
Message ID
<CAPhJv9Svc1ujT90x+33U9mTARKr4J+GG26D8arzC_8FJAoAECQ@mail.gmail.com>
DKIM signature
pass
Excellent toolkit for achieving full productivity on Windows. Your writeup
explains your choices very well.

My recent Windows projects have all been amenable to cross-compiling with Go
from Linux and using various tools like goversioninfo and makensys to package
everything for installation. If I ever need to actually develop on Windows I
will definitely turn to your w64devkit.

> I'd love to see a clean, straightforward rewrite of Git in a single
> (appropriate) implementation language.

You may consider including fossil as an alternative. It is a joy to use and its
repositories can be ported to and from git easily.

Re: w64devkit: a Portable C and C++ Development Kit for Windows

Message ID
<20200516014641.67d642sj2ndrhsme@nullprogram.com>
In-Reply-To
<CAPhJv9Svc1ujT90x+33U9mTARKr4J+GG26D8arzC_8FJAoAECQ@mail.gmail.com> (view parent)
DKIM signature
missing
Thanks, Kurt! I didn't know about goversioninfo. That looks like it 
might be useful sometime. I've used windres for C applications, but I 
wouldn't want to rely on it for Go applications.

By "makensys" do you mean NSIS? I've used NSIS in the past, and I mostly 
liked the results. But these days the lack of 64-bit support is a real 
deal-breaker. A 32-bit installer for a 64-bit application confuses 
Windows and makes for a poor experience. I know there are hacks to work 
around this, but I've tried them and they don't work well.

> You may consider including fossil as an alternative.

Every few years I check in on Fossil to see how it's doing. There are a 
couple features where it's better than Git, but it's worse in nearly 
every other way.

Single-file repositories are a great idea, and, given its SQLite 
heritage, of course that's what Fossil would use. With a single-file 
repository, Git's performance on Windows wouldn't be so awful (though I 
wonder if it would even make a difference on Linux), and repositories 
would be more manageable. Though I dislike Fossil's interface for it, 
where the repository file is an explicit parameter.

A built-in ticket system sounds like a good idea. Particularly that 
the tickets live alongside the bugs in the code, so a commit both fixes 
a bug and closes the ticket. Git hosts usually integrate these as well, 
but the ticket system is always second class (not cloneable, etc.). I'd 
need to get experience with it to really see if it's a worthwhile feature. 
I don't so much like most of the other bonus features being part of the 
source control system (wiki, docs, forum, etc.).

Not being so tied to POSIX is of course what I was complaining about 
with Git. However, that's not really fundamental to Git itself but a 
problem of its main implementation. The interfaces, data model, and 
storage format are all essentially OS agnostic.

However, Fossil makes the same fatal mistake as Mercurial: It's 
anti-rebase, treating commit history as sacred and immutable. This is 
absolutely upside-down and wrong-headed. Rebasing is awesome and 
powerful, and it's half of what makes source control worth using. It's 
optional with Git, and the repositories that make good use of rebase are 
*so* much easier to follow and understand.

Anti-rebase also scares away most power users and forever guarantees low 
adoption. It's one of the major reasons Mercurial withered away into 
irrelevance. Sure, most users aren't power users, but power users are 
leaders. In the 5-stage adoption model, they're the innovators and early 
adopters. If you can't convince them, you'll have a much harder time 
convincing everyone else. You can see this sentiment in action here:

https://news.ycombinator.com/item?id=19006036
https://lobste.rs/s/e3blgf/fossil_versus_git

Since it's been a few years, I took a look at the Fossil website again. 
There's enough misinformation to really irritate me. It misuses the word 
"blockchain" to sound trendy (no, citing the incomplete definition on 
Wikipedia doesn't cut it). Complaints about detached heads in Git (a 
very useful feature, and other SCMs are poorer for their lack of it!). 
There's the whole anti-rebase thing all over the place, all in direct 
contradiction with my personal experiences using rebase routinely and 
casually for years now.

Besides that there's the discussion about SHA-1. Git and Fossil are both 
late to the game dealing with SHA-1, so Fossil doesn't have much to brag 
about. What really gets under my skin, though, is how it implies SHA-3 
is the correct path forward and that Git's choice of the "older SHA-2 
algorithm" is somehow mistaken. No, choosing SHA-3 is the mistake! 
There's no reason at all to choose SHA-3. It's slower in software and 
not proven to be more secure. In fact, some cryptographers were calling 
for the SHA-3 competition to result in "no award" because they didn't 
think the results would be useful enough. If you're not satisfied with 
SHA-2, the 
right choice is to choose an even better hash function than SHA-3.

https://www.schneier.com/blog/archives/2012/09/sha-3_will_be_a.html

The Fossil site talks a lot about efficiency, but talk is cheap. I want 
to see some real benchmark comparisons. This is the best I could find, 
strongly suggesting Fossil can't scale up, so all their talk of 
efficiency is irrelevant:

https://lobste.rs/s/sxpmar/game_trees_version_control_system_under#c_6nwn2k

That result is not surprising since SQLite doesn't scale well either. In 
my experience, SQLite databases become uselessly slow starting around 
10GiB or so, so of course it can't handle huge repositories -- huge as in 
many small objects, not a few big objects. At that point it's better to 
hand-roll a custom solution. (And, yes, I'm a big Markus Winand fan and 
know how to use indexes properly, so that's not it.)

So, anyway, Fossil is definitely not for me. :-)

Re: w64devkit: a Portable C and C++ Development Kit for Windows

Kurt Jung
Message ID
<CAPhJv9SZwf-Fsv1-zqOw28U9FMePbi5ie7q01hS2Tc2GP6TdsA@mail.gmail.com>
In-Reply-To
<20200516014641.67d642sj2ndrhsme@nullprogram.com> (view parent)
DKIM signature
pass
Wow -- impressive response. You clearly understand not only the tools
you use but the ones you have rejected.

> By "makensys" do you mean NSIS?

Yes. https://sourceforge.net/projects/nsis/

> But these days the lack of 64-bit support is a real deal-breaker. A
> 32-bit installer for a 64-bit application confuses Windows and
> makes for a poor experience.

I wasn't aware of this shortcoming. I used it to deliver a factory
application that depended on a 32-bit DLL so with regret I compiled
the executable to match. I thought hard and long about
reverse-engineering that DLL (an interface to profilometer
controllers) so everything could be done with Linux, but expedience
prevailed.

> Repositories would be more manageable. Though I dislike Fossil's
> interface for it, where the repository file is an explicit
> parameter.

I'm not sure I follow. Each checkout is associated with a repository
file but that file is usually identified in a configuration file so
it doesn't have to be specified with each command. If you mean you
dislike having to explicitly state it in a configuration file, I
guess that's the price of not having the repository embedded into the
checkout as is done with the .git directory.

> I don't so much like most of the other bonus features being part
> of the source control system (wiki, docs, forum, etc.).

Yes, fossil definitely includes the kitchen sink. In addition to
tickets, I find I use the integrated docs feature a lot -- it's
convenient to have my developer journal and documentation
version-controlled and served along with the rest of the repository.

> However, Fossil makes the same fatal mistake as Mercurial: It's
> anti-rebase, treating commit history as sacred and immutable. This is
> absolutely upside-down and wrong-headed. Rebasing is awesome and
> powerful, and it's half of what makes source control worth using. It's
> optional with Git, and the repositories that make good use of rebase are
> *so* much easier to follow and understand.

Rebasing is admittedly out of my league. Maybe this is a result of
using fossil for so many years and not knowing about its benefits
because it wasn't an option.

> https://news.ycombinator.com/item?id=19006036

Excellent discussion -- thanks for linking to it.

> That result is not surprising since SQLite doesn't scale well
> either. In my experience, SQLite databases become uselessly slow
> starting around 10GiB or so, so of course it can't handle huge
> repositories -- huge as in many small objects, not a few big
> objects.

I wonder how David Crawshaw
(https://crawshaw.io/blog/one-process-programming-notes) would
respond to that claim. He is pushing SQLite far beyond any use I have
made of it.

Re: w64devkit: a Portable C and C++ Development Kit for Windows

Message ID
<20200516214341.2x7vvmvbcpdjn36a@nullprogram.com>
In-Reply-To
<CAPhJv9SZwf-Fsv1-zqOw28U9FMePbi5ie7q01hS2Tc2GP6TdsA@mail.gmail.com> (view parent)
DKIM signature
missing
> I guess that's the price of not having the repository embedded into 
> the checkout as is done with the .git directory.

Yeah, that's basically what I meant. That particular gripe is off the 
cuff, and I can accept that in practice it's a total non-issue. It's 
just something I noticed poking around with Fossil before writing that 
message. It reminded me of Subversion's awful branch interface where to 
make a branch you had to provide the fully-qualified repository URL in 
the commands.

> I wonder how David Crawshaw would respond to that claim. He is pushing 
> SQLite far beyond any use I have made of it.

Thanks for sharing David Crawshaw's article. That's interesting. I'm in 
absolute agreement with "Don't use N computers when 1 will do."

I retried one particular experiment where I had given up on SQLite. It's 
not quite as slow as I remembered, but it still demonstrates my point. 
Have you heard of "Have I Been Pwned"? The database has 555,278,657 
SHA-1 password hashes from real leaks. The hashes are computed by the 
database's maintainer who has access to the actual passwords, and the 
purpose of using SHA-1 is privacy (vs. handing out raw passwords).

https://haveibeenpwned.com/Passwords

You can download it for yourself to do your own offline password checks, 
and this is what services should be using instead of arbitrary password 
complexity rules. Just consider the database a blacklist of banned 
passwords and you're done!
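A quick offline check is then just a hash and a lookup. A minimal 
sketch, assuming the decompressed dump is a file of uppercase hex 
"HASH:COUNT" lines named hashes.txt (a hypothetical name); a one-line 
stand-in file keeps the example self-contained:

```shell
# Stand-in for the real ~25GiB dump. The hash below is the SHA-1 of
# "password"; the count is made up for illustration.
printf '5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:42\n' > hashes.txt

# Reject a candidate if its hash appears in the blacklist. This is a
# linear grep, so it's slow on the full file; a proper lookup is the
# subject of the rest of this thread.
pwned() {
    hash=$(printf '%s' "$1" | sha1sum | cut -d' ' -f1 | tr a-z A-Z)
    grep -q "^$hash:" hashes.txt
}

pwned 'password' && echo 'rejected: breached password'
```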

[Complaint #1: The database contains only ASCII passwords. I suspect 
this is a simple processing error during database construction since 
other obvious mistakes were made (see other complaints). This limits the 
usefulness of the database.]

The distributed dataset is a text file where each line is HASH:COUNT. I 
don't care about the count, so it will be ignored for my purposes. All I 
need is a membership test so that I can blacklist passwords.

[Complaint #2: You can download the data sorted by hash, but, as if to 
frustrate simple solutions, each line has a different length due to the 
count. Otherwise the Nth line would be at offset N*LEN into the file and 
it could be used in-place without pre-processing. I would solve this by 
fixing the count length, like "%40s:%06d".]
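Normalizing the dump into that fixed-length form is a single awk pass. 
A sketch, with a placeholder line standing in for the 555-million-line 
input; note "%06d" assumes no count exceeds six digits, so a real run 
might want a wider field:

```shell
# Reflow each HASH:COUNT line into a fixed 48-byte record (40 hash
# chars + ':' + 6 count digits + newline), so record N starts at byte
# N*48 and the file can be binary-searched in place.
printf 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA:7\n' |
    awk -F: '{ printf "%40s:%06d\n", $1, $2 }'
```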

[Complaint #3: The text file uses CRLF (Windows) instead of LF (everyone 
else). That's literally 530MiB of useless CR bytes in the decompressed 
output. Why?]
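At least the CRs strip trivially on the way through a pipeline. Two 
placeholder lines stand in for "zstd -d <hashes.zst" here:

```shell
# Delete carriage returns in one pass; on the real data this drops
# 530MiB of useless CR bytes before any further processing.
printf 'AAAA:1\r\nBBBB:2\r\n' | tr -d '\r'
```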

[Complaint #4: Despite being a single file, it's distributed as a .7z 
archive. Not only does the 7z command line tool generally suck, it means 
the file can't be part of a pipeline. When there's an updated database, 
I have to unpack it then recompress with something sensible (i.e. zstd). 
That leaves me with two copies instead of one: The good copy and the 
original .7z seeded by my torrent client.]

So, anyway, one obvious thing to do is load it into SQLite and leverage 
its fancy B-tree to do the lookups. It's literally an awk one-liner:

zstd -d <hashes.zst | \
    awk -F: '
      BEGIN {
        print "PRAGMA journal_mode=WAL;"
        print "CREATE TABLE hashes(hash TEXT PRIMARY KEY) WITHOUT ROWID;"
        print "BEGIN;"
      }
      END {
        print "COMMIT;"
      }
      {
        print "INSERT INTO hashes VALUES (\x27" $1 "\x27);"
      }' | \
    sqlite3 db

I enable WAL since that's supposed to be faster. I'm careful to disable 
the rowid (I probably didn't do this last time I tried it) so the table 
is just a cluster index. I won't need an explicit index since the table 
is already an index.

This took my laptop just over an hour to finish. Waiting on SQLite to 
import my data is what usually causes me to abandon it. While I'm 
waiting, I spend my time working out another solution. If my other 
solution is working before SQLite is done, I give up on it.

This isn't a completely fair comparison. I'm driving this by the 
command line sqlite3 program because it's so damn convenient: I didn't 
write a single line of code to a file! But that means no prepared 
statements, and SQLite has to parse and compile each of the 555 million 
lines of SQL. (Doing it the Right Way is more complex than my non-SQLite 
solution.)

How about size? The resulting database was 27GiB. The actual amount of 
data is 10GiB (555 million passwords x 20 bytes of hash). Though I'm 
storing it in hexadecimal because the command line client can't do 
binary blobs, so it's really double this, 21GiB. That's about 30% 
overhead. A 15 minute "VACUUM" brings it down to 24GiB, or 14% overhead 
(b-tree overhead, text pointers?).

Testing password hashes with SELECT is practically instant, so long as 
the database is warmed up by at least one lookup. Otherwise it takes 
about 4.5 seconds to warm up. So at this point performance is basically 
fine. I did another experiment truncating each hash to the first 64 
bits.

CREATE TABLE hashes (hash INTEGER PRIMARY KEY);
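Feeding that schema only requires keeping the first 16 hex digits of 
each hash; SQLite parses a 16-digit 0x literal as a signed 64-bit 
integer. A sketch, with a placeholder line standing in for 
"zstd -d <hashes.zst" and the final "| sqlite3 db" left off so the 
transform itself is visible:

```shell
# Truncate each hash to its first 64 bits and emit integer inserts.
printf 'DEADBEEFDEADBEEFCAFEBABECAFEBABE12345678:42\n' |
    awk -F: '
      BEGIN { print "BEGIN;" }
      { print "INSERT INTO hashes VALUES (0x" substr($1, 1, 16) ");" }
      END   { print "COMMIT;" }'
```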

This took 30 minutes to load, 10 minutes to VACUUM, and the database is 
7.3GiB (130% overhead). Query performance is the same.

The first time I was processing this data and waiting on SQLite, I 
designed and wrote this (well, the unpolished first draft) before SQLite 
was done loading input:

https://github.com/skeeto/pwcheck

It's a dead simple database format, and uses binary search (i.e. no 
index). The database is built as fast as you can write to storage. It 
has 0% overhead. When truncated to 8 byte digests, it's 4.1GiB. Cold 
lookups are 9ms, and hot lookups are 1ms. At only 4.1GiB, the entire 
database can easily remain cached in RAM even on a modest server.
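The lookup side of that scheme is simple enough to sketch even in 
shell: binary search over fixed-width sorted records, with no index 
anywhere. This toy uses 41-byte text records and a three-record 
stand-in database (pwcheck itself works on packed binary digests); the 
"\<" string comparison assumes bash's test builtin:

```shell
# Binary search over a sorted file of fixed 41-byte records (40 chars
# + newline): the sorted file *is* the index.
lookup() {
    file=$1 key=$2 rec=41
    lo=0 hi=$(( $(wc -c < "$file") / rec ))
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        # Read one record directly at its computed offset.
        h=$(dd if="$file" bs=1 skip=$(( mid * rec )) count=40 2>/dev/null)
        if   [ "$h" = "$key" ];  then return 0
        elif [ "$h" \< "$key" ]; then lo=$(( mid + 1 ))
        else hi=$mid
        fi
    done
    return 1
}

# Tiny sorted stand-in for the 555-million-record database:
printf '%s\n' 1111 5555 9999 | awk '{ printf "%040d\n", $1 }' > sorted.db
lookup sorted.db "$(printf '%040d' 5555)" && echo found
```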

This sounds a lot like that OpenBSD developer spending days waiting for 
Fossil to load data that Git can process in just a few minutes (except 
when run on Windows). Git uses a fast hand-rolled solution, and Fossil 
uses the conservative, traditional database solution. The latter is more 
flexible (can change schemas without rewriting a bunch of code), but it 
comes at a cost.

I'll end with some good news about SQLite since I still really do like 
it as a tool. I also tried to load the data into PostgreSQL (via psql) 
using an equivalent table schema and the same technique. After several 
hours of waiting, the database ballooned to 64GiB and it still wasn't 
done, so I killed it.

SQLite is still well ahead of PostgreSQL for these types of problems, 
which I suppose is why David Crawshaw likes it so much!

Re: w64devkit: a Portable C and C++ Development Kit for Windows

Kurt Jung
Message ID
<CAPhJv9SSWkWXjd59zy02+SGXO5_79q+iqEBerkkKn3r55tm+1g@mail.gmail.com>
In-Reply-To
<20200516214341.2x7vvmvbcpdjn36a@nullprogram.com> (view parent)
DKIM signature
pass
> The database has 555,278,657 SHA-1 password hashes from real leaks.

That's a staggering amount of breached data.

> You can download it for yourself to do your own offline password
> checks,

I wish I could, but with my limited bandwidth and data quotas out
here in the sticks it just isn't practical.

> and this is what services should be using instead of arbitrary
> password complexity rules. Just consider the database a blacklist
> of banned passwords and you're done!

I completely agree.

> So, anyway, one obvious thing to do is load it into SQLite and
> leverage its fancy B-tree to do the lookups. It's literally an awk
> one-liner:

> zstd -d <hashes.zst | \
>    awk -F: '
>      BEGIN {
>        print "PRAGMA journal_mode=WAL;"
>        print "CREATE TABLE hashes(hash TEXT PRIMARY KEY) WITHOUT ROWID;"
>        print "BEGIN;"
>      }
>      END {
>        print "COMMIT;"
>      }
>      {
>        print "INSERT INTO hashes VALUES (\x27" $1 "\x27);"
>      }' | \
>    sqlite3 db

This use of awk in a stream is perfect -- really stunning.

Can you explain what the escape characters do? It looks like awk
passes lines like

  INSERT INTO hashes VALUES (\x27a69...99c6\x27);

to sqlite. Do those escapes serve as quotes?

> This took my laptop just over an hour to finish. Waiting on SQLite
> to import my data is what usually causes me to abandon it. While
> I'm waiting, I spend my time working out another solution. If my
> other solution is working before SQLite is done, I give up on it.

Another approach with the sqlite3 tool is the .import dot command.

  CREATE TABLE hashes(hash TEXT PRIMARY KEY, count INT) WITHOUT ROWID;
  .mode csv
  .separator :
  .import hashes.txt hashes

A comment at

https://stackoverflow.com/Questions/364017/faster-bulk-inserts-in-sqlite3

indicates that it reduced a 20 minute load to three seconds.

There are two obvious drawbacks: no streaming from a zstd file, and
you either need to include the unneeded count field, tolerate half a
billion "argument ignored" warnings from sqlite, or trim off the count
in the input file.

> The first time I was processing this data and waiting on SQLite, I
> designed and wrote this (well, the unpolished first draft) before
> SQLite was done loading input:

> https://github.com/skeeto/pwcheck

> It's a dead simple database format, and uses binary search (i.e. no
> index).

Given the problem of checking passwords, this is the perfect solution.

The sqlite documentation states:

  SQLite does not compete with client/server databases. SQLite
  competes with fopen().

For this case, your fopen-style solution is clearly more appropriate.

Re: w64devkit: a Portable C and C++ Development Kit for Windows

Message ID
<20200518174329.e7euaumwoqr3qdj3@nullprogram.com>
In-Reply-To
<CAPhJv9SSWkWXjd59zy02+SGXO5_79q+iqEBerkkKn3r55tm+1g@mail.gmail.com> (view parent)
DKIM signature
missing
> Can you explain what the escape characters do?

The escape is interpreted by awk, and it's just a single quote. I needed 
that because it's triple-nested. The awk "script" is in single quotes, 
awk's strings are in double quotes, and I need single quotes again for 
generating SQL: double quotes are for identifiers, single quotes for 
strings. So ultimately the output from awk is:

INSERT INTO hashes VALUES ('a69...99c6');
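Here's the same escape with one level of nesting removed, on a 
placeholder hash (note that \x escapes in strings are a gawk/mawk 
extension rather than strict POSIX awk):

```shell
# The \x27 escape is consumed by awk, not the shell, and yields a
# literal single quote inside awk's double-quoted string.
echo 'a69f73cca23a9ac5' |
    awk '{ print "INSERT INTO hashes VALUES (\x27" $1 "\x27);" }'
```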

> Another approach with the sqlite3 tool is the .import dot command.

I've used .import often for CSV but I easily forget it can be used for 
more than that. Thanks for the tip!

> There are two obvious drawbacks: no streaming from a zstd file [...] 
> unneeded count field

Oh, we can solve both those problems easily with a FIFO. :-)

mkfifo /tmp/fifo
zstd -d <data.zst | awk '{print $1}' >/tmp/fifo

Then in SQLite:

.import /tmp/fifo hashes

However SQLite was now consuming the input so fast that awk was at risk 
of being a bottleneck, so I replaced awk with a tiny C program that 
consumes virtually zero CPU time. The result: Importing the full-size 
hash now takes 27 minutes, cutting the load time in half! 

Then I tested different settings for "journal_mode" and "synchronous". 
The latter had no impact, but *disabling* WAL brought it down to a 12 
minute load. Lesson learned: WAL is inappropriate for static databases 
like this. The 8-byte truncated hashes are a 10 minute load. I was 
expecting better than that, but at least the truncated database is ⅓ the 
size.

This may be just quick enough that I might not have worked out an 
alternative while waiting on SQLite.

Re: w64devkit: a Portable C and C++ Development Kit for Windows

Kurt Jung
Message ID
<CAPhJv9RNNEWjO_xMEPB0S=tk++W8K=ToiM9=kpTA9=WCfLZOiQ@mail.gmail.com>
In-Reply-To
<20200518174329.e7euaumwoqr3qdj3@nullprogram.com> (view parent)
DKIM signature
pass
>> Can you explain what the escape characters do?

> The escape is interpreted by awk, and it's just a single quote.

Of course -- I saw the 27 and my addled brain confused decimal with
hexadecimal.

> Oh, we can solve both those problems easily with a FIFO. :-)

> mkfifo /tmp/fifo
> zstd -d <data.zst | awk '{print $1}' >/tmp/fifo
> [...]
> .import /tmp/fifo hashes

I kneel at your feet.

> The result: Importing the full-size hash now takes 27 minutes,
> cutting the load time half!

I wonder if that is due primarily to sqlite3 preparing the insert
statement. Also, I suspect the poster of the comment on stackoverflow
who had such a remarkable reduction in load time using .import had
neglected to place all of his inserts into a transaction. SQLite
would have implicitly put each insert statement into its own
transaction.

> This may be just quick enough that I might not have worked out an
> alternative while waiting on SQLite.

Ha! Well, maybe not a fully documented, tested, commercial-grade
solution published on github.