git-server out of memory error

Message-ID: <CAAfTjKCVerbwij6N9UU+TJUDj_6khZLt+CuMt8bAzhwMhGRaaA@mail.gmail.com>

Hi!

We had an incident a few days ago with one of our public instances of the
git-server [1]: the cloud instance ran out of memory.  Fortunately, we
know what triggered it.  A community member attempted to mirror *all* of
the public GitHub repositories under the Ethereum Foundation's GitHub
organisation [2] to that git-server instance, using a Node.js script [3].

That Node.js script is quite simple: it boils down to running these few
commands repeatedly for each repository REPO under [2]:

    (1) git clone https://github.com/ethereum/REPO
    (2) cd REPO
    (3) rad init
    (4) rad push

Of course, (4) is where things get interesting.  On my machine (which was
running both `rad push` and `git-server`), memory utilization of the
git-server itself never exceeded 500MB per thread.  The child processes
are a different story, however: after around 50 repositories have been
pushed, a `git gc` process gets kicked off and its memory usage climbs
to upwards of 2GB of *resident* memory.  A screenshot: [4].
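
For anyone trying to reproduce the observation without a screenshot, a small
filter over `ps` output may help. This is only a sketch: the function name
`git_rss_filter` is made up for illustration, and it assumes `ps` lines of the
form "PID RSS COMMAND...", with RSS reported in KiB as on Linux.

```shell
# Keep only lines whose command is git (e.g. "git", "/usr/bin/git",
# "git-server") and sort them by resident set size, largest first.
# Input format (assumed): PID RSS COMMAND [ARGS...]
git_rss_filter() {
    awk '$3 ~ /(^|\/)git($|-)/ { print }' | sort -k2,2nr
}

# Usage, e.g. while `rad push` is running:
#   ps -eo pid=,rss=,args= | git_rss_filter
```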

@cloudhead has pointed out that it may not make sense to trigger
`git gc` at all if it works across all the objects of those disparate
repositories.  End users will most likely work on them completely
separately, so packing git objects in a way that optimizes fetching
them all at once makes little sense.

There was also mention of a planned "delta islands" feature that would
make the GC aware of namespaces.

My question is: what would be the best way to address the high memory
consumption of `git pack-objects`?  Would it be safe to disable GC
altogether and re-enable it once the delta islands feature lands?


Cheers
— Adam


[1]: https://github.com/radicle-dev/radicle-client-services/tree/master/git-server
[2]: https://github.com/ethereum
[3]: https://github.com/jgresham/radicle-mirroring/
[4]: https://i.imgur.com/wuXdGbp.png

Message-ID: <CAH_DpYRPq3cK5=aHzoXdcwK3e168vmfDrfWpVyBbMS+4mSfEKA@mail.gmail.com>
In-Reply-To: <CAAfTjKCVerbwij6N9UU+TJUDj_6khZLt+CuMt8bAzhwMhGRaaA@mail.gmail.com>

On 24/03/22 03:35pm, Adam Szkoda wrote:
> There was also a mention on a planned feature of "delta islands" that
> make the GC aware of namespaces.

Note that delta islands are a git feature[1]. It would be interesting - if
you are able to reproduce the issue - to see whether writing the namespace
ref trees to `pack.island` in the monorepo config changes the memory
usage. I'm not sure whether it would: delta islands are more about
making the generated packfiles useful in the context of fetching, but
maybe memory usage is related to the size of the generated packfiles?

[1]: https://git-scm.com/docs/git-pack-objects#_delta_islands
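
As a rough sketch of what that experiment could look like: `pack.island` takes
an extended regex matched against ref names, and capture groups form the island
name. The function name below is made up, and the ref layout
`refs/namespaces/<id>/...` is an assumption about the monorepo, not something
confirmed in this thread.

```shell
# Give every namespace its own delta island in the repository at $1.
# ASSUMPTION: namespaced refs live under refs/namespaces/<id>/...
configure_namespace_islands() {
    repo=$1
    # The capture group becomes the island name, so each <id> is one island.
    git -C "$repo" config --replace-all pack.island 'refs/namespaces/([^/]+)/'
    # Make `git repack` behave as if --delta-islands had been passed.
    git -C "$repo" config repack.useDeltaIslands true
}
```

Comparing the peak memory of a repack before and after this change would show
whether islands affect memory usage at all.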

Message-ID: <a1iE5qGXfSFWXxaFHMxAugbwytJSjS3f-czNlO5Jpy5KAKDtjdPcDR6rSmXnQ3xYfmWEmFHlTLGP3Qet6UjieRK4bm_8TTPNfKA2Itf44mk=@radicle.foundation>
In-Reply-To: <CAH_DpYRPq3cK5=aHzoXdcwK3e168vmfDrfWpVyBbMS+4mSfEKA@mail.gmail.com>

Yeah, so I think there are two things:

1) If `pack.island` works as expected, it might reduce the memory footprint, since fewer objects would be packed at once (just speculating).
2) If that's not the case, but the packfiles are useful, we would then opt to run `git gc` much more regularly, perhaps after every push instead of periodically.

------- Original Message -------

On Thursday, March 24th, 2022 at 15:47, Alex Good <alex@memoryandthought.me> wrote:

> On 24/03/22 03:35pm, Adam Szkoda wrote:
> > There was also a mention on a planned feature of "delta islands" that
> > make the GC aware of namespaces.
>
> Note that delta islands are a git feature[1]. It would be interesting - if
> you are able to reproduce the issue - to see if writing the namespace
> ref trees to `pack.island` in the monorepo config changes the memory
> usage. I'm not sure whether it would or not, delta islands are more about
> making the packfiles which are generated useful in the context of
> fetching, but maybe memory usage is related to the size of the generated
> packfiles?
>
> [1]: https://git-scm.com/docs/git-pack-objects#_delta_islands

Message-ID: <20220324165733.GD70004@eagain.st>
In-Reply-To: <CAAfTjKCVerbwij6N9UU+TJUDj_6khZLt+CuMt8bAzhwMhGRaaA@mail.gmail.com>

It is always safe (provided the usual You Know What You're Doing) to modify the
git config of the monorepo. It is less obvious how to provide some good defaults
from our code, which is why we have none.

I am a little skeptical about delta islands, because they require static knowledge
about the data being stored. Packing each namespace into its own island is
probably a good approximation to experiment with -- however, anything below `rad/`
refs would be better served by one global island (otherwise a lot of packfiles
have to be consulted for the very frequent identity operations).
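
A sketch of how that split might be expressed, relying on two documented
properties of `pack.island`: regexes without capture groups all share one
(unnamed) island, and when a ref matches several regexes the last one
configured wins. The function name is made up, and the location of the `rad/`
refs (`refs/namespaces/<id>/refs/rad/`) is an assumption about the layout.

```shell
# One island per namespace, except rad/ refs, which share a global island.
# ASSUMPTION: rad refs live under refs/namespaces/<id>/refs/rad/
configure_islands_with_shared_rad() {
    repo=$1
    # Start from a clean slate so the ordering below is deterministic.
    git -C "$repo" config --unset-all pack.island || true
    # Capture group => island named after the namespace <id>.
    git -C "$repo" config --add pack.island 'refs/namespaces/([^/]+)/'
    # No capture group => shared island; listed last so it takes precedence.
    git -C "$repo" config --add pack.island 'refs/namespaces/[^/]+/refs/rad/'
    git -C "$repo" config repack.useDeltaIslands true
}
```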

I would expect a bigger immediate win from turning GC off and instead setting up
`git-maintenance`[0] to run incremental-repack relatively frequently. Note that:

- this will not (yet) work if you run replication-v3
- it is basically a system configuration (systemd timer or whatever), so not
  something _we_ could currently provide sans packaging

Note that turning repacking off completely would make matters worse, because it
increases the probability of the server having to build huge packfiles on the
fly for clones, and thrash your disks for IOPS.

[0]: https://git-scm.com/docs/git-maintenance
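
A minimal sketch of that setup, using only stock git config keys; the function
name and the repository path in the usage comment are placeholders, and the
caveat about replication-v3 above still applies.

```shell
# Turn off automatic GC in the repository at $1 and enable the
# incremental-repack maintenance task instead.
prefer_incremental_repack() {
    repo=$1
    # `git gc --auto` becomes a no-op.
    git -C "$repo" config gc.auto 0
    # Other git commands no longer trigger `git maintenance run --auto`.
    git -C "$repo" config maintenance.auto false
    # Scheduled maintenance skips full gc and runs incremental-repack.
    git -C "$repo" config maintenance.gc.enabled false
    git -C "$repo" config maintenance.incremental-repack.enabled true
}

# To be run on a schedule (systemd timer, cron, ...):
#   git -C /path/to/monorepo maintenance run --task=incremental-repack
```

Note that incremental-repack only consolidates existing packfiles; if pushes
leave many loose objects behind, the `loose-objects` task may be worth
scheduling as well.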

Message-ID: <CpwNpqguQrw1IC79uRYKYVJirPgKcGDnaMtN2FT5xpgBhuItxjOW5OEPC_Sl4sIu8Hc3oko_71ichiGGxC0F-ap7rmemVJdyo45qyCnRXLo=@radicle.foundation>
In-Reply-To: <20220324165733.GD70004@eagain.st>

Thanks, this matches my intuition. We'll experiment with gc=off and incremental-repack
once it works with replication-v3 (since we will probably move to that soon), and report
back if there are new findings or issues.

------- Original Message -------

On Thursday, March 24th, 2022 at 16:57, Kim Altintop <kim@eagain.st> wrote:

> It is always safe (provided the usual You Know What You're Doing) to modify the
> git config of the monorepo. It is less obvious how to provide some good defaults
> from our code, that's why we have none.
>
> I am little skeptical about delta islands, because they require static knowledge
> about the data being stored. It is probably a good approximation to experiment
> with to pack each namespace into an island -- however, anything below `rad/`
> refs would be better served by one global island (otherwise a lot of packfiles
> have to be consulted for the very frequent identity operations).
>
> I would expect a bigger immediate win by turning GC off, but instead set up
> `git-maintenance`[0] to run incremental-repack relatively frequently. Note that:
>
> - this will not (yet) work if you run replication-v3
> - it is basically a system configuration (systemd timer or whatever), so not
>   something _we_ could currently provide sans packaging
>
> Note that turning repacking off completely would make matters worse, because it
> increases the probability of the server having to build huge packfiles on the
> fly for clones, and thrash your disks for IOPS.
>
> [0]: https://git-scm.com/docs/git-maintenance