~sircmpwn/sr.ht-discuss

Re: error 400 while trying to build my page

Details
Message ID
<167017893029.8.9689300244912766911.81407612@ploum.eu>
On 22/12/04 06:19, Lionel Dricot - Ploum wrote:
>On 22/11/30 03:50, Simon Ser - contact at emersion.fr wrote:
>>On Wednesday, November 30th, 2022 at 16:39, <sourcehut@ploum.eu> wrote:
>>
>>> {"errors":[{"message":"Authentication error: Invalid Authorization header"}]}
>>
>>Hm, it seems like this can only happen with malformed requests: either
>>the Authorization header doesn't contain two space-separated fields,
>>or the first field is neither "Bearer" nor "Internal".
>>
>>Looking at the header more closely, it seems like $OAUTH2_TOKEN is
>>unset... So it's probably set from the build env.
>>
>>Can you try `source ~/.buildenv` before executing curl? (acurl should
>>also be available with this.)
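
For illustration, the check described above could look roughly like the following. This is only a sketch and not the actual sr.ht code: the function name `check_authorization` is made up, and the trimming of surrounding whitespace is an assumption about how the header value reaches the server. Under that assumption, an unset $OAUTH2_TOKEN leaves only a single field and triggers the error.

```python
def check_authorization(header: str) -> str:
    """Return the token from an Authorization header value.

    Hypothetical helper: raises ValueError for the two malformed cases
    described above (wrong number of fields, unknown first field).
    """
    # Assumption: surrounding whitespace is trimmed before parsing.
    fields = header.strip().split(" ")
    if len(fields) != 2:
        raise ValueError("Invalid Authorization header")
    scheme, token = fields
    if scheme not in ("Bearer", "Internal"):
        raise ValueError("Invalid Authorization header")
    return token
```

With a token set, `check_authorization("Bearer abc")` returns `"abc"`; with an empty token the value is just `"Bearer "` and the call raises.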

This is the output when doing things manually:

curl --oauth2-bearer "$OAUTH2_TOKEN" https://pages.sr.ht/publish/ploum.net -Fcontent=@site.tar.gz --verbose > output.txt


A lot of:

* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]

then:

* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* We are completely uploaded and fine
100 93.1M    0     0  100 93.1M      0   908k  0:01:44  0:01:44 --:--:--     0* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sun, 04 Dec 2022 18:32:12 GMT
< Content-Length: 173
<
{ [173 bytes data]
100 93.1M  100   173  100 93.1M      1   905k  0:02:53  0:01:45  0:01:08    39
* Connection #0 to host pages.sr.ht left intact


(and the slowdown is very visible: the upload speed gradually drops to
less than 10 kB/s before the connection closes)

Re: error 400 while trying to build my page

Details
Message ID
<5dda215c-9a79-042e-52a1-b053be853a21@bitfehler.net>
In-Reply-To
<167017893029.8.9689300244912766911.81407612@ploum.eu>
Hey all,

I've been taking a look at this, and I am fairly sure the problem is the 
load on S3/minio. Currently, the following predicates apply:

P1: We do tarball extraction and upload of files in a streaming manner,
     to avoid loading (potentially) big files completely into memory
P2: We use the SHA256 of the tarball as "version"
P3: S3/minio does not have "move" semantics

Combined, we end up with the following happening in the hot loop here:

- SHA sum unknown until we have processed the entire tarball
- each file gets uploaded to a temporary destination
- Once the SHA sum is known, **for each file**
   - _copy_ to final destination
   - _remove_ from temp destination

So, for 2000 files, this is 6000 requests to storage. The only 
improvement to be made without touching any of the predicates is to use 
a single "RemoveObjects" call instead of "RemoveObject" for each one, 
which would bring this down to 4001 requests. Better, but still not great.
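
The arithmetic behind those two numbers, as a small sketch. The function and strategy names are made up for illustration; the per-file costs are the ones listed above (upload, copy, remove).

```python
def storage_requests(num_files: int, strategy: str) -> int:
    """Count storage requests needed to publish `num_files` files."""
    if strategy == "current":
        # Per file: upload to temp, copy to final, remove from temp.
        return 3 * num_files
    if strategy == "multi_remove":
        # Per file: upload to temp, copy to final; then one bulk remove.
        return 2 * num_files + 1
    raise ValueError(f"unknown strategy: {strategy}")


print(storage_requests(2000, "current"))       # 6000
print(storage_requests(2000, "multi_remove"))  # 4001
```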

Regarding the predicates, I think P1 just really makes sense, and P3 we 
cannot realistically change. Hence, I'd like to focus on P2. AFAICT, it 
is an implementation detail and not used to interface with anything. The 
SHA sum ends up as a prefix on S3/minio (string) and in the `version` 
field in the database (VARCHAR).

Unless I am missing something I think we should be easily able to use a 
different identifier here which is known before uploading, so we could 
upload files right to their final destination. My suggestions would be 
either a UUID, or something like the SHA sum of the site ID in the DB 
(which is known at upload time).
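
To make the idea concrete, here is a rough sketch of what "upload right to the final destination" could look like with a UUID as the version. It is not the pages.sr.ht code: `FakeStore`, `HashingReader` and `publish` are hypothetical names, and the in-memory store only stands in for S3/minio so that requests can be counted. The SHA256 is still computed while streaming, in case it is wanted for something else.

```python
import hashlib
import tarfile
import uuid


class HashingReader:
    """Wraps a byte stream and hashes everything that is read from it."""

    def __init__(self, raw):
        self.raw = raw
        self.sha = hashlib.sha256()

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.sha.update(chunk)
        return chunk


class FakeStore:
    """In-memory stand-in for S3/minio that counts requests."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data


def publish(stream, store):
    """Stream a .tar.gz into the store under a prefix known up front."""
    version = uuid.uuid4().hex  # known before the first upload
    reader = HashingReader(stream)
    # "r|gz" reads the archive sequentially, without seeking.
    with tarfile.open(fileobj=reader, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # A real implementation would stream the member instead of
            # calling read(), to avoid holding a big file in memory.
            data = tar.extractfile(member).read()
            store.put(f"{version}/{member.name}", data)
    # Drain any trailing bytes so the hash covers the whole tarball.
    while reader.read(65536):
        pass
    return version, reader.sha.hexdigest()
```

With this shape, each file costs exactly one storage request, and no copy or remove is needed once the SHA sum is known.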

Thoughts?

Conrad

Re: error 400 while trying to build my page

Details
Message ID
<52057283-e8a5-3086-f212-e6c24d744de4@bitfehler.net>
In-Reply-To
<5dda215c-9a79-042e-52a1-b053be853a21@bitfehler.net>
On 12/9/22 10:05, Conrad Hoffmann wrote:
> ...

I got around to some more testing, and here are some numbers.

Taken from my local setup (i.e. NO network latency):

1. publish tarball w/ 2000 1K files + some (2.2MB):
    - current setup: ~12s
    - with single call to multi-remove: ~10s
    - with upload to final destination: ~4s

and to just drive home the point that the number of minio requests is 
the bottleneck:

2. publish tarball w/ 2 1MB files + some (2.2MB):
    - current setup: ~0.3s
    - N/A
    - with upload to final destination: ~0.1s

I think this paints a pretty clear picture. Path forward? Not sure.
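
As a back-of-the-envelope check on test 1, the timings can be divided by the request counts. The counts 6000 and 4001 are from the previous mail; 2000 for the direct upload is an assumption (one upload per file), and all three ignore the "+ some" extra files, so treat the result as rough.

```python
# (approximate storage requests, measured seconds) for test 1
measurements = {
    "current setup": (6000, 12.0),
    "single call to multi-remove": (4001, 10.0),
    "upload to final destination": (2000, 4.0),  # request count assumed
}


def ms_per_request(requests: int, seconds: float) -> float:
    """Average wall-clock time per storage request, in milliseconds."""
    return 1000.0 * seconds / requests


for name, (requests, seconds) in measurements.items():
    print(f"{name}: {ms_per_request(requests, seconds):.1f} ms/request")
```

All three come out at roughly 2 to 2.5 ms per request, which fits the picture that the time scales with the number of requests rather than with the amount of data.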

Suggestion for short term:

1. See if we can get to a setup w/ initial upload to final destination
    (see previous mail)
2. Decide on a sensible timeout so reasonable uploads can make it

Suggestion for long term: difficult. Apparently `mc` uses a certain 
amount of parallelization, we could try that, but it makes things a good 
deal more complicated. Maybe see if Ceph does any better? :)
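
To give an idea of what parallelizing the storage requests might involve, here is a minimal sketch using a thread pool. It says nothing about how `mc` actually does it; `copy_object` is a hypothetical stand-in that just sleeps to simulate per-request latency, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def copy_object(key: str) -> str:
    """Hypothetical stand-in for one copy request to storage."""
    time.sleep(0.002)  # simulated per-request latency
    return key


def copy_all(keys, workers: int = 8):
    """Issue the copy requests concurrently instead of one by one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order; an exception raised by any
        # request is re-raised here when its result is consumed.
        return list(pool.map(copy_object, keys))
```

The extra complexity is mostly in what is not shown: error handling for partially completed batches, and limiting the load this puts on the storage backend.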

Cheers,
Conrad

Re: error 400 while trying to build my page

Details
Message ID
<167062308795.10.10499998898869034424.82765204@ploum.eu>
In-Reply-To
<5dda215c-9a79-042e-52a1-b053be853a21@bitfehler.net>
On 22/12/09 10:05, Conrad Hoffmann - ch at bitfehler.net wrote:
>Hey all,

Thanks a lot for all your investigations. Really interesting.

>Unless I am missing something I think we should be easily able to use
>a different identifier here which is known before uploading, so we
>could upload files right to their final destination. My suggestions
>would be either a UUID, or something like the SHA sum of the site ID
>in the DB (which is known at upload time).
>
>Thoughts?
>
Not sure I really understand the problem.
Could we imagine a method to upload individual files instead of a
tar.gz?

Re: error 400 while trying to build my page

Details
Message ID
<CP0R5ZJUJRNL.386NCU2YD5GZQ@nitro>
In-Reply-To
<5dda215c-9a79-042e-52a1-b053be853a21@bitfehler.net>
On Fri Dec 9, 2022 at 4:05 AM EST, Conrad Hoffmann wrote:
> Unless I am missing something I think we should be easily able to use a 
> different identifier here which is known before uploading, so we could 
> upload files right to their final destination. My suggestions would be 
> either a UUID, or something like the SHA sum of the site ID in the DB 
> (which is known at upload time).

I think using a UUID is a reasonable approach. We could either scrap the
SHA256 entirely or maintain a mapping from SHA256 to UUID in the
database.
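
A sketch of the second option, keeping a mapping from SHA256 to UUID. The table layout and function names are invented for illustration, and sqlite3 is used only to keep the example self-contained. The UUID row is created before the upload starts; the SHA256 is filled in once the whole tarball has been processed.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per published version of a site.
SCHEMA = """
CREATE TABLE versions (
    uuid    TEXT PRIMARY KEY,
    site_id INTEGER NOT NULL,
    sha256  TEXT
)
"""


def begin_upload(db, site_id):
    """Allocate the version identifier before any file is uploaded."""
    version = str(uuid.uuid4())
    db.execute(
        "INSERT INTO versions (uuid, site_id) VALUES (?, ?)",
        (version, site_id),
    )
    return version


def finish_upload(db, version, sha256):
    """Record the tarball checksum once it is known."""
    db.execute(
        "UPDATE versions SET sha256 = ? WHERE uuid = ?", (sha256, version)
    )


def find_by_sha(db, site_id, sha256):
    """Look up the UUID for a given site and tarball checksum."""
    row = db.execute(
        "SELECT uuid FROM versions WHERE site_id = ? AND sha256 = ?",
        (site_id, sha256),
    ).fetchone()
    return row[0] if row else None
```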

Re: error 400 while trying to build my page

Details
Message ID
<CP5SIXUSA8TK.IUJ6NLAHV0EM@taiga>
In-Reply-To
<52057283-e8a5-3086-f212-e6c24d744de4@bitfehler.net>
On Fri Dec 9, 2022 at 10:07 PM CET, Conrad Hoffmann wrote:
> 1. See if we can get to a setup w/ initial upload to final destination
>     (see previous mail)
> 2. Decide on a sensible timeout so reasonable uploads can make it

Let's keep it simple and incremental. I like your pitch for improving
the upload process and changing the way checksums are computed as a good
first step.