Hi!
I'm trying to import the OSSP archives onto lists.sr.ht,
but I'm getting partial results:
$ grep '^From ' ossp-dev -c
129
but a direct import of the archive
https://lists.sr.ht/~nabijaczleweli/ossp-dev-1
has 124 messages (none of them threaded, idk if any of them should be).
Cracking the mailbox and then pasting it back together with the
recommended suite
$ ./mbox-split ossp-dev 1KiB
$ ./pile-o-emails-to-mbox ossp-dev-reconstructed ossp-dev.*
yields just 18 messages:
https://lists.sr.ht/~nabijaczleweli/ossp-dev-2
Neither of them manages to import the first message of the archive with
Message-ID: <20020523115704.D22880@dev14.dev.de.cw.net>.
I've also tried importing the ossp-users spool, and got an empty archive
https://lists.sr.ht/~nabijaczleweli/ossp-users-1
None of these have a rejected mimetype, and changing the
"Permitted mimetypes" field fails with "internal system error".
But they're all pre-MIME mails, so.
Similarly, importing the first fifth of the ossp-cvs spool yielded no
mails in https://lists.sr.ht/~nabijaczleweli/ossp-cvs-1.
Is there something in the log?
The spools I'm using are available for download at
https://foreign.nabijaczleweli.xyz/pub/ossp-spools
I extracted the importer and looked at the errors (which it swallows):
Error ("<20020828091907.W2689@canonware.com>") reading In-Reply-To: mail: missing '<' in msg-id
Error ("<20020730130258.A71796@canonware.com>") reading In-Reply-To: mail: missing '<' in msg-id
<20020703134036.GC1464@dt4.dev.de.cw.net> unknown charset: unknown charset: message: unhandled charset "iso-8859-1"
Error ("<20020523115704.D22880@dev14.dev.de.cw.net>") reading In-Reply-To: mail: missing '<' in msg-id
And sure enough, there are like 3 mails that have
In-Reply-To: <3D6BE829.CEEA7B8C@packetdesign.com>; from archie@packetdesign.com on Tue, Aug 27, 2002 at 01:59:21PM -0700
But I can correct these manually by turning the trailing text into a comment, (from ...).
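For anyone who would rather not hand-edit, the same correction can be sketched in Go using only the standard library. This is a minimal sketch, not what the importer does: fixInReplyTo is a hypothetical helper, and it assumes the only breakage is a "; free text" tail after the first <msg-id>, which it wraps in a parenthesised comment.

```go
package main

import (
	"fmt"
	"strings"
)

// fixInReplyTo is a hypothetical helper: it keeps the leading <msg-id> and
// turns a trailing "; free text" tail into a parenthesised comment.
// Values that do not match that shape are returned unchanged.
func fixInReplyTo(v string) string {
	v = strings.TrimSpace(v)
	end := strings.IndexByte(v, '>')
	if !strings.HasPrefix(v, "<") || end < 0 {
		return v
	}
	id := v[:end+1]
	rest := strings.TrimSpace(v[end+1:])
	if !strings.HasPrefix(rest, ";") {
		return v
	}
	rest = strings.TrimSpace(rest[1:])
	if rest == "" {
		return id
	}
	// Nested parentheses would need escaping inside a comment; swap them out.
	rest = strings.NewReplacer("(", "[", ")", "]").Replace(rest)
	return id + " (" + rest + ")"
}

func main() {
	in := "<3D6BE829.CEEA7B8C@packetdesign.com>; from archie@packetdesign.com on Tue, Aug 27, 2002 at 01:59:21PM -0700"
	fmt.Println(fixInReplyTo(in))
}
```

Running this over the In-Reply-To value quoted above prints the msg-id followed by the "(from ...)" comment; well-formed headers pass through untouched.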
OTOH, the encoding thing is insurmountable without a patch
because this archive is thousands-deep.
I've found that importing _ "github.com/emersion/go-message/charset"
will, without damaging the envelope, convert the body from ISO-8859-1
(and other common encodings) to UTF-8.
This appears to be the intended behaviour:
https://github.com/emersion/go-message/blob/fa228c85f131cbb252d485042f21713141dd3f69/charset.go#L27-L35
I'm testing if this will work as expected on my test sourcehut instance,
then I'll post a patch.
Best,
On Wed, Sep 04, 2024 at 10:13:10PM +0200, наб wrote:
> I've found that importing _ "github.com/emersion/go-message/charset"
> will, without damaging the envelope, convert the body from ISO-8859-1
> (and other common encodings) to UTF-8.
The full listssrht/api program already does this, so no need for a patch.
As for the ossp-cvs list? It refused to import, no matter how much I split it.
So I split it into 5027 pieces, and uploaded one mail at a time with a
for f in *; do curl; sleep loop (with apologies to any log readers).
This left me with 67 e-mails that didn't upload.
Installing sourcehut (personal record, 70 minutes), I got
2024/09/05 01:25:59 "POST http://127.0.0.1:5106/query HTTP/1.0" from 127.0.0.1 - 200 40B in 19.380792ms
2024/09/05 01:25:59 Error importing message: pq: invalid byte sequence for encoding "UTF8": 0xe4 0xf6 0xfc
2024/09/05 01:25:59 Attempt 1/1 failed (panic: pq: Could not complete operation in a failed transaction), retrying in 2m0s
and, indeed,
$ grep -a +.*Hallo xx0420 | hd
000000 20 20 2b 20 20 20 20 63 68 61 72 20 20 20 6d 73 > + char ms<
000010 67 5b 5d 20 3d 20 22 3c 48 61 6c 6c 6f 3e 20 5d >g[] = "<Hallo> ]<
000020 5d 3e 26 3c 26 3e 57 6f 72 6c 64 3a 20 e4 f6 fc >]>&<&>World: ...<
000030 df 22 3b 0a >.";.<
000034
this is unescaped(!) ISO-8859-1-encoded äöüß. Further,
for f in *; do iconv "$f" > /dev/null 2>&1 && echo "$f"; done
yields no results, and thus all the left-over mails are UTF-8-invalid.
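The kind of massaging this calls for can be sketched in Go with just the standard library. It is a rough sketch and not the exact procedure used here: toUTF8 is a hypothetical helper, and it assumes anything that fails UTF-8 validation is ISO-8859-1, where every byte maps to the code point of the same value.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8 is a hypothetical helper: it returns b unchanged when it is already
// valid UTF-8, and otherwise assumes ISO-8859-1 and re-encodes byte by byte.
func toUTF8(b []byte) []byte {
	if utf8.Valid(b) {
		return b
	}
	out := make([]byte, 0, len(b)*2)
	for _, c := range b {
		// In ISO-8859-1 the byte value is the Unicode code point.
		out = utf8.AppendRune(out, rune(c))
	}
	return out
}

func main() {
	// The offending bytes from the dump above: e4 f6 fc df.
	raw := []byte{'W', 'o', 'r', 'l', 'd', ':', ' ', 0xe4, 0xf6, 0xfc, 0xdf}
	fmt.Printf("%s\n", toUTF8(raw))
}
```

The guess is only safe for archives known to be Latin-1; a mail in some other 8-bit encoding would come out as valid UTF-8 but with the wrong characters.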
Is it broken? not really. Is sourcehut right to reject this? sure.
Are other users likely to have these same issues? Unlikely.
But, with minor massaging and a lot of messaging,
I've managed to fully archive the OSSP mailing lists:
https://sr.ht/~nabijaczleweli/ossp/lists
Thanks,
Hey,
On 9/5/24 2:12 AM, наб wrote:
> Is it broken? not really. Is sourcehut right to reject this? sure.
> Are other users likely to have these same issues? Unlikely.
Thanks for digging into this yourself! Real-world and RFCs are often at
odds when it comes to MIME. If you are interested in making things
better for everyone, I think the best way would be to file an issue with
go-message, as that is essentially what we rely on. Note, however, that
Simon will likely only accept fixes for things that can be reasonably
interpreted to be RFC-conformant. He's made it clear in the past that
he's not interested in supporting unspecified syntax, even if it is or
was widely used.
Cheers,
Conrad
On Thu, Sep 05, 2024 at 12:15:51PM +0200, Conrad Hoffmann wrote:
> On 9/5/24 2:12 AM, наб wrote:
> > Is it broken? not really. Is sourcehut right to reject this? sure.
> > Are other users likely to have these same issues? Unlikely.
> He's made it clear in the past that he's not interested in
> supporting unspecified syntax, even if it is or was widely used.
23:07:41 <emersion> nabijaczleweli: sounds invalid per RFC
No loss, I don't think anyone has produced an email like this for two decades.
Best,