On 22/12/04 06:19, Lionel Dricot - Ploum wrote:
>On 22/11/30 03:50, Simon Ser - contact at emersion.fr wrote:
>>On Wednesday, November 30th, 2022 at 16:39, <sourcehut@ploum.eu> wrote:
>>
>>> {"errors":[{"message":"Authentication error: Invalid Authorization header"}]}
>>
>>Hm, it seems like this can only happen with malformed requests: either
>>the Authorization header doesn't contain two space-separated fields,
>>or the first field is neither "Bearer" nor "Internal".
>>
>>Looking at the header more closely, it seems like $OAUTH2_TOKEN is
>>unset... So it's probably set from the build env.
>>
>>Can you try `source ~/.buildenv` before executing curl? (acurl should
>>also be available with this.)
This is the output when doing things manually:
curl --oauth2-bearer "$OAUTH2_TOKEN" https://pages.sr.ht/publish/ploum.net -Fcontent=@site.tar.gz --verbose > output.txt
lots of:
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
then:
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* We are completely uploaded and fine
100 93.1M 0 0 100 93.1M 0 908k 0:01:44 0:01:44 --:--:-- 0* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sun, 04 Dec 2022 18:32:12 GMT
< Content-Length: 173
<
{ [173 bytes data]
100 93.1M 100 173 100 93.1M 1 905k 0:02:53 0:01:45 0:01:08 39
* Connection #0 to host pages.sr.ht left intact
(and the slowdown is very visible: the transfer rate gradually drops to
less than 10 kB/s before the connection closes)
Hey all,
I've been taking a look at this, and I am fairly sure the problem is the
load on S3/minio. Currently, the following predicates apply:
P1: We do tarball extraction and upload of files in a streaming manner,
to avoid loading (potentially) big files completely into memory
P2: We use the SHA256 of the tarball as "version"
P3: S3/minio does not have "move" semantics
Combined, we end up with the following happening in the hot loop here:
- SHA sum unknown until we have processed the entire tarball
- each file gets uploaded to a temporary destination
- Once the SHA sum is known, **for each file**
- _copy_ to final destination
- _remove_ from temp destination
So, for 2000 files, this is 6000 requests to storage. The only
improvement to be made without touching any of the predicates is to use
a single "RemoveObjects" call instead of "RemoveObject" for each one,
which would bring this down to 4001 requests. Better, but still not great.
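To make the hot loop concrete, here is a rough sketch of the promotion step
using the minio-go v7 client, including the batched RemoveObjects variant.
The function, bucket, and prefix names are invented for illustration and are
not the actual pages.sr.ht code:

package pages

import (
    "context"
    "log"

    "github.com/minio/minio-go/v7"
)

// promote moves every object uploaded under tempPrefix to finalPrefix.
// This is the copy-then-remove dance described above.
func promote(ctx context.Context, c *minio.Client, bucket, tempPrefix, finalPrefix string) error {
    objects := c.ListObjects(ctx, bucket, minio.ListObjectsOptions{
        Prefix:    tempPrefix,
        Recursive: true,
    })

    toRemove := make(chan minio.ObjectInfo)
    go func() {
        defer close(toRemove)
        for obj := range objects {
            if obj.Err != nil {
                log.Println("list:", obj.Err)
                continue
            }
            // One request per file: server-side copy to the final prefix.
            _, err := c.CopyObject(ctx,
                minio.CopyDestOptions{Bucket: bucket, Object: finalPrefix + obj.Key[len(tempPrefix):]},
                minio.CopySrcOptions{Bucket: bucket, Object: obj.Key})
            if err != nil {
                log.Println("copy:", err)
                continue
            }
            toRemove <- obj
        }
    }()

    // Instead of one RemoveObject request per file, RemoveObjects batches
    // the deletions into multi-delete requests (up to 1000 keys each).
    for rErr := range c.RemoveObjects(ctx, bucket, toRemove, minio.RemoveObjectsOptions{}) {
        log.Println("remove:", rErr.Err)
    }
    return nil
}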
Regarding the predicates, I think P1 just really makes sense, and P3 we
cannot realistically change. Hence, I'd like to focus on P2. AFAICT, it
is an implementation detail and not used to interface with anything. The
SHA sum ends up as a prefix on S3/minio (string) and in the `version`
field in the database (VARCHAR).
Unless I am missing something I think we should be easily able to use a
different identifier here which is known before uploading, so we could
upload files right to their final destination. My suggestions would be
either a UUID, or something like the SHA sum of the site ID in the DB
(which is known at upload time).
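Either identifier would be cheap to derive before any upload starts; a
minimal sketch of both options, with placeholder names (not the actual code):

package pages

import (
    "crypto/sha256"
    "encoding/hex"
    "strconv"

    "github.com/google/uuid"
)

// Option A: a random UUID per publish, unique by construction.
func versionFromUUID() string {
    return uuid.New().String()
}

// Option B: SHA256 of the site's database ID, known at upload time
// (stable across publishes of the same site).
func versionFromSiteID(siteID int64) string {
    sum := sha256.Sum256([]byte(strconv.FormatInt(siteID, 10)))
    return hex.EncodeToString(sum[:])
}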
Thoughts?
Conrad
On 12/9/22 10:05, Conrad Hoffmann wrote:
> ...
I got around to some more testing, and here are some numbers.
Taken from my local setup (i.e. NO network latency):
1. publish tarball w/ 2000 1K files + some (2.2MB):
- current setup: ~12s
- with single call to multi-remove: ~10s
- with upload to final destination: ~4s
and just to drive home the point that the number of minio requests is
the bottleneck:
2. publish tarball w/ 2 1MB files + some (2.2MB):
- current setup: ~0.3s
- N/A
- with upload to final destination: ~0.1s
I think this paints a pretty clear picture. Path forward? Not sure.
Suggestion for short term:
1. See if we can get to a setup w/ initial upload to final destination
(see previous mail)
2. Decide on a sensible timeout so reasonable uploads can make it
Suggestion for long term: difficult. Apparently `mc` uses a certain
amount of parallelization; we could try that (rough sketch below), but it
makes things a good deal more complicated. Maybe see if Ceph does any
better? :)
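For the record, here is roughly what mc-style bounded parallelism could look
like while still reading the tarball as a stream. Note it buffers each entry
in memory before handing it to a worker, so it bends P1 somewhat; the
concurrency limit and all names are made up:

package pages

import (
    "archive/tar"
    "bytes"
    "context"
    "io"

    "github.com/minio/minio-go/v7"
    "golang.org/x/sync/errgroup"
)

// publishTarball streams the tarball and uploads entries concurrently,
// with at most 8 PutObject calls in flight at a time.
func publishTarball(ctx context.Context, c *minio.Client, bucket, prefix string, r io.Reader) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(8)

    tr := tar.NewReader(r)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        if hdr.Typeflag != tar.TypeReg {
            continue
        }
        // tar.Reader can't be read concurrently, so buffer this entry
        // before the next call to Next() invalidates it.
        buf := new(bytes.Buffer)
        if _, err := io.Copy(buf, tr); err != nil {
            return err
        }
        name := hdr.Name
        g.Go(func() error {
            _, err := c.PutObject(ctx, bucket, prefix+"/"+name,
                bytes.NewReader(buf.Bytes()), int64(buf.Len()),
                minio.PutObjectOptions{})
            return err
        })
    }
    return g.Wait()
}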
Cheers,
Conrad
On 22/12/09 10:05, Conrad Hoffmann - ch at bitfehler.net wrote:
>Hey all,
Thanks a lot for all your investigations. Really interesting.
>Unless I am missing something I think we should be easily able to use
>a different identifier here which is known before uploading, so we
>could upload files right to their final destination. My suggestions
>would be either a UUID, or something like the SHA sum of the site ID
>in the DB (which is known at upload time).
>
>Thoughts?
>
Not sure I really understand the problem.
Could we imagine a method to upload individual files instead of a
tar.gz?
On Fri Dec 9, 2022 at 4:05 AM EST, Conrad Hoffmann wrote:
> Unless I am missing something I think we should be easily able to use a
> different identifier here which is known before uploading, so we could
> upload files right to their final destination. My suggestions would be
> either a UUID, or something like the SHA sum of the site ID in the DB
> (which is known at upload time).
I think using a UUID is a reasonable approach. We could either scrap the
SHA256 entirely or maintain a mapping from SHA256 to UUID in the
database.
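If we keep the SHA256 around, the mapping could amount to recording both
values once a publish finishes. A sketch, assuming invented table/column
names and a stand-in for the existing upload logic:

package pages

import (
    "context"
    "crypto/sha256"
    "database/sql"
    "encoding/hex"
    "io"

    "github.com/google/uuid"
)

// extractAndUpload stands in for the existing streaming extraction and
// per-file upload code; version is used as the storage prefix.
func extractAndUpload(ctx context.Context, r io.Reader, version string) error { return nil }

func publish(ctx context.Context, db *sql.DB, tarball io.Reader, siteID int64) error {
    version := uuid.New().String() // known up front, so files go straight to <version>/...
    h := sha256.New()
    stream := io.TeeReader(tarball, h) // hash the tarball as it streams through

    if err := extractAndUpload(ctx, stream, version); err != nil {
        return err
    }

    // Record both the UUID and the SHA256 (invented column names).
    _, err := db.ExecContext(ctx,
        "UPDATE site SET version = $1, checksum = $2 WHERE id = $3",
        version, hex.EncodeToString(h.Sum(nil)), siteID)
    return err
}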
On Fri Dec 9, 2022 at 10:07 PM CET, Conrad Hoffmann wrote:
> 1. See if we can get to a setup w/ initial upload to final destination
> (see previous mail)
> 2. Decide on a sensible timeout so reasonable uploads can make it
Let's keep it simple and incremental. I like your pitch for improving
the upload process and changing the way checksums are computed; that
sounds like a good first step.