~sircmpwn/sr.ht-discuss


lists.sr.ht imports broken/partial/crash midway? on OSSP archives

Details
Message ID
<fy2e7yvtatzw6ato2zggkr7pokcx6vbhhmprapbqmmz2zlzprf@tarta.nabijaczleweli.xyz>
DKIM signature
pass
Hi!

I'm trying to import the OSSP archives onto lists.sr.ht,
but I'm getting partial results:
  $ grep '^From ' ossp-dev -c
  129
but a direct import of the archive
  https://lists.sr.ht/~nabijaczleweli/ossp-dev-1
has 124 messages (none of them threaded, idk if any of them should be).
Cracking the mailbox and then pasting it back together with the
recommended suite
  $ ./mbox-split ossp-dev 1KiB
  $ ./pile-o-emails-to-mbox ossp-dev-reconstructed ossp-dev.*
yields just 18 messages:
  https://lists.sr.ht/~nabijaczleweli/ossp-dev-2

Neither of them manages to import the first message of the archive with
Message-ID: <20020523115704.D22880@dev14.dev.de.cw.net>.

I've also tried importing the ossp-users spool, and got an empty archive
  https://lists.sr.ht/~nabijaczleweli/ossp-users-1

None of these have a rejected mimetype, and changing the
"Permitted mimetypes" field fails with "internal system error".
But they're all pre-MIME mails, so.

Similarly, importing the first fifth of the ossp-cvs spool yielded no
mails in https://lists.sr.ht/~nabijaczleweli/ossp-cvs-1.

Is there something in the log?

The spools I'm using are available for download at
  https://foreign.nabijaczleweli.xyz/pub/ossp-spools
Details
Message ID
<zxq5vfzvkrlkoljxfr72xp523u4fymr3y2ky4ruyakavhc2geg@tarta.nabijaczleweli.xyz>
In-Reply-To
<fy2e7yvtatzw6ato2zggkr7pokcx6vbhhmprapbqmmz2zlzprf@tarta.nabijaczleweli.xyz>
DKIM signature
pass
I extracted the importer and looked at the errors (which it swallows):
  Error ("<20020828091907.W2689@canonware.com>") reading In-Reply-To: mail: missing '<' in msg-id
  Error ("<20020730130258.A71796@canonware.com>") reading In-Reply-To: mail: missing '<' in msg-id
  <20020703134036.GC1464@dt4.dev.de.cw.net> unknown charset: unknown charset: message: unhandled charset "iso-8859-1"
  Error ("<20020523115704.D22880@dev14.dev.de.cw.net>") reading In-Reply-To: mail: missing '<' in msg-id

And sure enough, there are like 3 mails that have
  In-Reply-To: <3D6BE829.CEEA7B8C@packetdesign.com>; from archie@packetdesign.com on Tue, Aug 27, 2002 at 01:59:21PM -0700

But I can correct this manually to (from ...).
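
A minimal sketch of that manual correction, assuming one message per file
and a sed that supports -E. fix_in_reply_to is a hypothetical helper name,
not part of any sourcehut tooling. It wraps the trailing "; from ..." text
in parentheses so it reads as a header comment, and it only touches header
lines (everything up to the first blank line):

```sh
# Hypothetical helper: reads one message on stdin, writes it to stdout.
# Rewrites   In-Reply-To: <id>; from someone on some date
# into       In-Reply-To: <id> (from someone on some date)
# The range 1,/^$/ limits the substitution to the header block,
# so a body line that happens to look the same is left alone.
fix_in_reply_to() {
    sed -E '1,/^$/ s/^(In-Reply-To: *<[^>]*>); *(from .*)$/\1 (\2)/'
}
```

Usage would be along the lines of "fix_in_reply_to < mail > mail.fixed".
If the trailing text itself contained parentheses they would have to be
balanced or escaped; the example above has none.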

OTOH, the encoding thing is insurmountable without a patch
because this archive is thousands-deep.

I've found that importing _ "github.com/emersion/go-message/charset"
will, without damaging the envelope, convert the body from ISO-8859-1
(and other common encodings) to UTF-8.
This appears to be the intended behaviour:
  https://github.com/emersion/go-message/blob/fa228c85f131cbb252d485042f21713141dd3f69/charset.go#L27-L35

I'm testing if this will work as expected on my test sourcehut instance,
then I'll post a patch.

Best,
Details
Message ID
<zi6do2u4btrqnfwftmiyc3nfzjryb5txsplynxorjgakixe6xb@tarta.nabijaczleweli.xyz>
In-Reply-To
<zxq5vfzvkrlkoljxfr72xp523u4fymr3y2ky4ruyakavhc2geg@tarta.nabijaczleweli.xyz>
DKIM signature
pass
On Wed, Sep 04, 2024 at 10:13:10PM +0200, наб wrote:
> I've found that importing _ "github.com/emersion/go-message/charset"
> will, without damaging the envelope, convert the body from ISO-8859-1
> (and other common encodings) to UTF-8.
The full listssrht/api program already does this, so no need for a patch.

As for the ossp-cvs list? It refused to import, no matter how much I split it.
So I split it into 5027 pieces, and uploaded one mail at a time with a
"for f in *; do curl; sleep" loop (with apologies to any log readers).
This left me with 67 e-mails that didn't upload.
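
A rough sketch of that split-then-upload-one-at-a-time approach, under
stated assumptions: split_mbox and upload_each are hypothetical helper
names, the mbox is assumed to start every message with a "From " line at
the start of a line, and the actual upload command is left to the caller
because the exact curl invocation is not given here.

```sh
# split_mbox MBOX OUTDIR
# Writes each message of MBOX to OUTDIR/msg00001, OUTDIR/msg00002, ...
# A new file is started at every line beginning with "From ".
split_mbox() {
    awk -v dir="$2" '
        /^From / { if (out) close(out); out = sprintf("%s/msg%05d", dir, ++n) }
        out      { print > out }
    ' "$1"
}

# upload_each CMD DELAY FILE...
# Runs CMD on each FILE, sleeping DELAY seconds between attempts.
# CMD is whatever performs the upload (assumption: it exits non-zero
# on failure). Names of files that failed are printed to stdout;
# output from CMD itself is sent to stderr.
upload_each() {
    cmd=$1
    delay=$2
    shift 2
    for f in "$@"; do
        "$cmd" "$f" >&2 || printf '%s\n' "$f"
        sleep "$delay"
    done
}
```

With something like "upload_each my_upload 1 out/* > failed.txt", where
my_upload is a placeholder for the real upload command, failed.txt would
end up listing the mails that still need attention.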

Installing sourcehut (personal record, 70 minutes), I got
  2024/09/05 01:25:59 "POST http://127.0.0.1:5106/query HTTP/1.0" from 127.0.0.1 - 200 40B in 19.380792ms
  2024/09/05 01:25:59 Error importing message: pq: invalid byte sequence for encoding "UTF8": 0xe4 0xf6 0xfc
  2024/09/05 01:25:59 Attempt 1/1 failed (panic: pq: Could not complete operation in a failed transaction), retrying in 2m0s
and, indeed, 
  $ grep -a +.*Hallo xx0420 | hd
  000000 20 20 2b 20 20 20 20 63 68 61 72 20 20 20 6d 73  >  +    char   ms<
  000010 67 5b 5d 20 3d 20 22 3c 48 61 6c 6c 6f 3e 20 5d  >g[] = "<Hallo> ]<
  000020 5d 3e 26 3c 26 3e 57 6f 72 6c 64 3a 20 e4 f6 fc  >]>&<&>World: ...<
  000030 df 22 3b 0a                                      >.";.<
  000034
this is unescaped(!) ISO-8859-1-encoded äöüß. Further,
  for f in *; do iconv "$f" > /dev/null 2>&1 && echo "$f"; done
yields no results, and thus all the left-over mails are UTF-8-invalid.
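
The same check can be written without depending on the locale. This is a
sketch only: is_utf8 and latin1_to_utf8 are hypothetical helper names, and
treating a whole mail as ISO-8859-1 is an assumption that happens to fit
the bytes shown above. It is one possible kind of massaging, not
necessarily what was done for this archive.

```sh
# is_utf8 FILE
# Succeeds if FILE decodes cleanly as UTF-8, fails otherwise.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# latin1_to_utf8
# Filter: reads ISO-8859-1 on stdin, writes UTF-8 on stdout.
# Every byte is a valid ISO-8859-1 character as far as iconv is
# concerned, so this cannot fail on the input, but it will produce
# mojibake if the input was really in some other encoding.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}
```

For example, "for f in *; do is_utf8 "$f" || echo "$f"; done" would list
the files that are not valid UTF-8, and piping the bytes e4 f6 fc df
through latin1_to_utf8 gives the UTF-8 encoding of äöüß.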

Is it broken? not really. Is sourcehut right to reject this? sure.
Are other users likely to have these same issues? Unlikely.

But, with minor massaging and a lot of messaging,
I've managed to fully archive the OSSP mailing lists:
  https://sr.ht/~nabijaczleweli/ossp/lists

Thanks,
Details
Message ID
<94b35fe9-bb4a-409c-894b-f2dadd4eab6d@bitfehler.net>
In-Reply-To
<zi6do2u4btrqnfwftmiyc3nfzjryb5txsplynxorjgakixe6xb@tarta.nabijaczleweli.xyz>
DKIM signature
pass
Hey,

On 9/5/24 2:12 AM, наб wrote:
> Is it broken? not really. Is sourcehut right to reject this? sure.
> Are other users likely to have these same issues? Unlikely.

Thanks for digging into this yourself! Real-world and RFCs are often at 
odds when it comes to MIME. If you are interested in making things 
better for everyone, I think the best way would be to file an issue with 
go-message, as that is essentially what we rely on. Note, however, that 
Simon will likely only accept fixes for things that can be reasonably 
interpreted to be RFC-conformant. He's made it clear in the past that 
he's not interested in supporting unspecified syntax, even if it is or 
was widely used.

Cheers,
Conrad
Details
Message ID
<co5m6blnkl42zwuiosqs6hnd5va3yw6q45rczhks43kqroeycc@tarta.nabijaczleweli.xyz>
In-Reply-To
<94b35fe9-bb4a-409c-894b-f2dadd4eab6d@bitfehler.net>
DKIM signature
pass
On Thu, Sep 05, 2024 at 12:15:51PM +0200, Conrad Hoffmann wrote:
> On 9/5/24 2:12 AM, наб wrote:
> > Is it broken? not really. Is sourcehut right to reject this? sure.
> > Are other users likely to have these same issues? Unlikely.
> He's made it clear in the past that he's not interested in
> supporting unspecified syntax, even if it is or was widely used.
23:07:41 <emersion> nabijaczleweli: sounds invalid per RFC

No loss, I don't think anyone has produced an email like this for two decades.

Best,