~dmbaturin/soupault

Is soupault's capability to process sites that don't fit in RAM important to you?

Message ID: <ebddbb7b-d2f7-413c-bdc7-6973894d7fd8@baturin.org>

Hi everyone,

I'm thinking of one possible simplification of soupault's internals that 
can only come at the cost of an incompatible change.
That change shouldn't affect any of the current users I know of, but I 
want to ask what you all think about it.

The feature in question is the ability to process websites that don't 
fit in RAM.

When I was writing the first soupault version, I did a survey of 
existing SSG implementations
and found that all of them loaded all pages into RAM before doing any 
processing.
I felt quite uneasy about that because:
1. People may use SSGs in limited environments such as CI worker containers.
2. People may want to process something very large — like a static 
mirror of Wikipedia, for example.

My approach was to build a list of all page files and split it into 
content and index pages,
then load content pages into memory one by one, process them, and 
extract their index data,
then use the collected index data for rendering the index pages.
In that approach, once a page is processed, the memory used for its 
source and element tree is safe to reclaim,
so it's possible to process a site of any size, even if it's terabytes 
large
(as long as its page file list fits in RAM).
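
For illustration, here is a minimal OCaml sketch of that single-pass flow.
The types and helper functions are made up for the example and are not 
soupault's actual internals:

    (* Illustrative sketch only: the types and helpers below are stand-ins,
       not soupault's real code. *)
    type index_entry = { url : string; title : string }

    (* Stand-in: treat index.html files as index pages, the rest as content. *)
    let is_index_page path = Filename.basename path = "index.html"

    (* Stand-in for loading one page, running widgets, extracting index data,
       and writing the output. Once this returns, the page source and its
       element tree can be garbage-collected. *)
    let process_content_page path =
      { url = path; title = Filename.remove_extension (Filename.basename path) }

    (* Stand-in for rendering an index page from the collected index data. *)
    let render_index_page index path =
      Printf.printf "rendering %s with %d index entries\n" path (List.length index)

    let build page_files =
      (* Split the page file list into index and content pages. *)
      let index_pages, content_pages = List.partition is_index_page page_files in
      (* Process content pages one at a time, collecting index data as we go. *)
      let index = List.map process_content_page content_pages in
      (* Only now render the index pages, using the complete index. *)
      List.iter (render_index_page index) index_pages

    let () = build [ "site/about.html"; "site/blog/post.html"; "site/index.html" ]

Only the page file list and the accumulated index entries stay in memory 
for the whole run, which is why the site itself can be larger than RAM.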

However, that approach has a huge disadvantage: content pages have no 
access to the index data.
Initially I thought that use cases for accessing index data from content 
pages were rare,
but now it's obvious to me that they are not.
How to do it is the most frequently asked question here, and the use 
cases are numerous:
tag clouds, site-wide navigation sidebars, "see also" blocks...

To work around it, I introduced the two-pass workflow (enabled with 
`index.index_first = true`).
That approach leads to substantial duplication of work, since every 
page is loaded from disk twice:
first to run the widgets that must run before index extraction and to 
extract the index data,
then to process it with the full widget set and save it to disk.
That's really inelegant, but that was the only approach I could think of 
that would still allow
processing larger-than-RAM sites.
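
To make the duplication concrete, here is a rough two-pass sketch in the 
same style (again, the names are invented for the example, not taken from 
soupault's code):

    (* Illustrative two-pass sketch; not soupault's real internals. *)
    type index_entry = { url : string; title : string }

    (* Pass 1: load a page just far enough to extract its index entry. *)
    let extract_entry path =
      { url = path; title = Filename.remove_extension (Filename.basename path) }

    (* Pass 2: load the same page again, run the full widget set with the
       complete index available, and save the output. *)
    let render_page index path =
      Printf.printf "rendering %s with %d index entries available\n"
        path (List.length index)

    let build_two_pass page_files =
      let index = List.map extract_entry page_files in   (* first load of every page *)
      List.iter (render_page index) page_files           (* second load of every page *)

    let () = build_two_pass [ "site/about.html"; "site/blog/post.html" ]

Every page goes through the load-and-parse step twice, but in exchange 
content pages can see the full index on the second pass.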

My impression is that a lot of people want `index.index_first = true` to 
be the default.
In all other SSGs, access to index data from every page (content or 
index) is effortless,
and soupault is an outlier there.
I also still haven't seen anyone try to process a website with gigabytes 
worth of pages.

So now I'm thinking: does anyone actually want to process sites that 
don't fit in memory?
If I give up on that goal, soupault's internals can become much simpler,
and working with index data will become a lot more intuitive for users.

In short, which one of these options would you vote for (and why):

1. Keep the ability to process really large websites in all cases.
2. Keep that ability only in the pre-processor mode, or only when 
indexing is disabled.
3. Remove that ability completely.

Message ID: <y3ucuosj3z7gn74sto2zmptzu4m4d46xcapvbqqxeeh53ao7we@7zc6sstpwfym>
In-Reply-To: <ebddbb7b-d2f7-413c-bdc7-6973894d7fd8@baturin.org>

Hiya Daniil. Happy new year!

> So now I'm thinking: does anyone actually want to process sites that don't
> fit in memory?

To be completely honest with you, I hadn’t realized soupault was able
to do that!

On the one hand, this can be a nice unique selling feature for
soupault. On the other hand, it’s one not a lot of people care about. I
myself can live without it, no questions asked.

> 1. Keep the ability to process really large websites in all cases.

If this is the source of a bad UX for newcomers in practice, as your
message suggests, then this definitely does not sound like the best
option.

> 2. Keep that ability only in the pre-processor mode, or only when indexing
> is disabled.

That’d be a good compromise, in my opinion.

> 3. Remove that ability completely.

But truth be told, I think you can just go with this one if it makes
soupault simpler (both in UX and code, and the latter is always a
welcome bonus).

So yeah, no strong opinion between 2 and 3. 2 is more conservative and
leaves a door open for unique use cases, but 3 is probably the
reasonable choice here.

Cheers,
Thomas

Message ID: <83708c1c-1a65-473e-a278-9ce5715e7d9f@aoirthoir.com>
In-Reply-To: <ebddbb7b-d2f7-413c-bdc7-6973894d7fd8@baturin.org>

There could be a solution.

If you remember, when I needed to process just part of a website, I 
asked for a switch. But then I realized it could be done with the 
site-dir= and build-dir= switches.

So let's say we needed to process a really large site: even if all 
pages had to be in memory, this could be done by running soupault 
multiple times on sections of the website, as sketched below.
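
For example, with a site laid out in per-section directories, the build 
could look something like this (the exact option names and paths here are 
assumptions based on the switches mentioned above):

    # Hypothetical example: build a large site one section at a time,
    # so only one section's pages need to be processed per run.
    soupault --site-dir site/blog --build-dir build/blog
    soupault --site-dir site/docs --build-dir build/docs
    soupault --site-dir site/wiki --build-dir build/wiki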

To be honest, I do like the current method, but it is about to be 
Shabbat where I am, so I will read your email in detail after Shabbat 
and respond more thoroughly.

On 1/5/24 11:45, Daniil Baturin wrote:
> In short, which one of these options would you vote for (and why):
>
> 1. Keep the ability to process really large websites in all cases.
> 2. Keep that ability only in the pre-processor mode, or only when 
> indexing is disabled.
> 3. Remove that ability completely.

Message ID: <64DEFE00-3E98-49EA-A462-748E5E782869@posteo.net>
In-Reply-To: <ebddbb7b-d2f7-413c-bdc7-6973894d7fd8@baturin.org>

It’s easy to download more RAM! 😅

Would instructions on how to set up swap files/page files help? I suppose they wouldn’t in the case of a limited builder, but maybe when processing Wikipedia. Is there anything that could be used to reduce the size of the in-memory representation more generally, to make running into a memory issue less likely?
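
For reference, a swap file on a typical Linux builder is just the standard 
recipe, nothing soupault-specific:

    # Generic Linux example: add a 4 GiB swap file so a large build can
    # spill to disk instead of being killed by the OOM killer.
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile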

I don’t know that I would personally ever run into the ceiling, but it’s worth making sure the bases are covered.
-- 
toastal ไข่ดาว | https://toast.al
PGP: 7944 74b7 d236 dab9 c9ef  e7f9 5cce 6f14 66d4 7c9e

Hristos <me@hristos.co>
Message ID: <b69f8344-f4b8-481a-931b-8989598fb667@hristos.co>
In-Reply-To: <ebddbb7b-d2f7-413c-bdc7-6973894d7fd8@baturin.org>

On 1/5/24 10:45, Daniil Baturin wrote:
> My impression is that a lot of people want `index.index_first = true` to 
> be the default.
> In all other SSGs, access to index data from every page (content or 
> index) is effortless,
> and soupault is an outlier there.
> I also still haven't seen anyone try to process a website with gigabytes 
> worth of pages.
> 
> So now I'm thinking: does anyone actually want to process sites that 
> don't fit in memory?
> If I give up on that goal, soupault's internals can become much simpler,
> and working with index data will become a lot more intuitive for users.
> 
> In short, which one of these options would you vote for (and why):
> 
> 1. Keep the ability to process really large websites in all cases.
> 2. Keep that ability only in the pre-processor mode, or only when 
> indexing is disabled.
> 3. Remove that ability completely.

I'd like to echo what some other folks have said. My thinking is:

It's nice that Soupault could do this, but I think we should carefully 
weigh it against users' expectations.

Of course it's impossible to know if anyone will miss the feature, but 
I'm strongly in favor of approach #3. I think the overall user 
experience will benefit, especially for new users.

-Hristos

Raphaël Bauduin <rb@raphinou.com>
Message ID: <4257c218-088f-40ca-88ff-3dd3ef50d96e@raphinou.com>
In-Reply-To: <b69f8344-f4b8-481a-931b-8989598fb667@hristos.co>

Hi,

As a recent user, I haven't been concerned about this, and it wasn't 
the reason I started to use soupault.
I'm not sure it's related, but I'm currently much more interested in 
conditionally skipping the publication of a page than in handling sites 
that don't fit in RAM.

Have any soupault users managed websites that don't fit in RAM? How big 
would such a website be? Are soupault websites really built in 
constrained environments?

Option 3 might be the more reasonable choice, unless I'm missing 
something about the usefulness of the feature.

Cheers