Hi everyone,
I'm thinking of one possible simplification of soupault's internals that
can only come at the cost of an incompatible change.
That change shouldn't affect any of the current users I know of, but I
want to ask what you all think about it.
The feature in question is the ability to process websites that don't
fit in RAM.
When I was writing the first soupault version, I did a survey of
existing SSG implementations
and found that all of them loaded all pages into RAM before doing any
processing.
I felt quite uneasy about that because:
1. People may use SSGs in limited environments such as CI worker containers.
2. People may want to process something very large — like a static
mirror of Wikipedia, for example.
My approach was to build a list of all page files, split it into content
and index pages.
Then load content pages into memory one by one, process them, and extract index data;
then use the collected index data for rendering the index pages.
In that approach, once a page is processed, the memory used for its
source and element tree is safe to reclaim,
thus it's possible to process a site of any size, even if it's terabytes
large
(as long as its page file list fits in RAM).
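To make that concrete, here is a rough sketch of that single-pass flow in
OCaml. All types and functions here (load_page, run_widgets,
extract_index_entry, save_page, render_index_page) are simplified
placeholders for illustration, not soupault's actual internals:

```ocaml
type page = { path : string; source : string }
type index_entry = { url : string; title : string }

(* Placeholder: read a page from disk (the real thing also parses it
   into an element tree). *)
let load_page path =
  { path; source = In_channel.with_open_text path In_channel.input_all }

(* Placeholders for the widget pipeline, index extraction, and output. *)
let run_widgets page = page
let extract_index_entry page = { url = page.path; title = page.path }
let save_page page = ignore page.source
let render_index_page ~index:_ _path = ()

(* Process one content page and keep only its index entry.
   The [page] value goes out of scope afterwards, so its source and
   element tree can be reclaimed before the next page is loaded. *)
let process_content_page path =
  let page = run_widgets (load_page path) in
  let entry = extract_index_entry page in
  save_page page;
  entry

let build content_pages index_pages =
  (* One pass over content pages: only the index entries stay in memory. *)
  let index = List.map process_content_page content_pages in
  (* Index pages are rendered last, from the collected index data. *)
  List.iter (render_index_page ~index) index_pages
```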
However, that approach has a huge disadvantage: content pages have no
access to the index data.
Initially I thought that use cases for accessing index data from content
pages were rare,
but now it's obvious to me that they are not.
How to do it is the most frequently asked question here, and the use
cases are numerous:
tag clouds, site-wide navigation sidebars, "see also" blocks...
To work around it, I introduced the two-pass workflow (enabled with
`index.index_first = true`).
That approach leads to substantial duplication of work, since pages are
loaded from disk twice:
first to run the widgets that must run before index extraction and to
extract index data,
then to process them with all widgets and save them to disk.
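In the same placeholder style (reusing the definitions from the sketch
above, plus hypothetical run_pre_index_widgets and run_all_widgets
stand-ins), the two-pass flow looks roughly like this; again, this is an
illustration, not soupault's actual code:

```ocaml
(* Placeholders: widgets that must run before index extraction, and the
   full pipeline that runs once index data is available. *)
let run_pre_index_widgets page = page
let run_all_widgets ~index:_ page = page

let build_two_pass content_pages index_pages =
  (* Pass 1: load each page, run only the pre-index widgets, keep the
     index entry, and discard the page. *)
  let index =
    List.map
      (fun path -> extract_index_entry (run_pre_index_widgets (load_page path)))
      content_pages
  in
  (* Pass 2: load each page from disk a second time, run the full widget
     pipeline with the index data now available, and save the result. *)
  List.iter
    (fun path -> save_page (run_all_widgets ~index (load_page path)))
    content_pages;
  (* Index pages are rendered from the collected index data, as before. *)
  List.iter (render_index_page ~index) index_pages
```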
That's really inelegant, but that was the only approach I could think of
that would still allow
processing larger-than-RAM sites.
My impression is that a lot of people want `index.index_first = true` to
be the default.
In all other SSGs, access to index data from every page (content or
index) is effortless,
and soupault is an outlier there.
I also still haven't seen anyone try to process a website with gigabytes
worth of pages.
So now I'm thinking: does anyone actually want to process sites that
don't fit in memory?
If I give up on that goal, soupault's internals can become much simpler,
and working with index data will become a lot more intuitive for users.
In short, which one of these options would you vote for (and why):
1. Keep the ability to process really large websites in all cases.
2. Keep that ability only in the pre-processor mode, or only when
indexing is disabled.
3. Remove that ability completely.
Hiya Daniil. Happy new year!
> So now I'm thinking: does anyone actually want to process sites that don't fit in memory?
To be completely honest with you, I hadn’t realized that soupault was
able to do that!
On the one hand, this can be a nice unique selling feature for
soupault. On the other hand, it’s one not a lot of people care about. I
myself can live without it, no questions asked.
> 1. Keep the ability to process really large websites in all cases.
If this is the source of a bad UX for newcomers in practice, as your
message suggests, then this definitely does not sound like the best
option.
> 2. Keep that ability only in the pre-processor mode, or only when indexing is disabled.
That’d be a good compromise, in my opinion.
> 3. Remove that ability completely.
But truth be told, I think you can just go with this one if it makes
soupault simpler (both in UX and code, and the latter is always a
welcome bonus).
So yeah, no strong opinion between 2 and 3. 2 is more conservative and
leaves a door open for unique use cases, but 3 is probably the
reasonable choice here.
Cheers,
Thomas
There could be a solution.
If you remember, when I needed to process just part of a website, I
asked for a switch. But then I realized it could be done with the
site-dir= and build-dir= switches.
So let us say we needed to process a really large site: even if all
pages were kept in memory, this could be done by running soupault
multiple times, once per section of the website.
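Roughly, I mean something like this. It is just a sketch: it assumes the
--site-dir and --build-dir switches mentioned above and a site/ layout
with one subdirectory per section, and each run would of course only see
its own section's index data:

```ocaml
(* Hypothetical driver: run soupault once per top-level section of
   site/, so each run only needs that section's pages. *)
let () =
  let site_dir = "site" and build_dir = "build" in
  Sys.readdir site_dir
  |> Array.to_list
  |> List.filter (fun d -> Sys.is_directory (Filename.concat site_dir d))
  |> List.iter (fun section ->
         let cmd =
           Printf.sprintf "soupault --site-dir %s --build-dir %s"
             (Filename.quote (Filename.concat site_dir section))
             (Filename.quote (Filename.concat build_dir section))
         in
         if Sys.command cmd <> 0 then
           prerr_endline ("soupault failed for section " ^ section))
```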
To be honest, I do like the current method, but it is about to be
Shabbat where I am, so I will read your email in detail after Shabbat
and respond more thoroughly.
It’s easy to download more RAM! 😅
Could instructions on how to set up swap files/page files help? I suppose they wouldn’t in the case of a limited builder, but maybe they would when processing something like Wikipedia. Is there anything that could be used to reduce the size of the in-memory representation more generally, to make running into a memory issue less likely?
I don’t know that I would personally ever run into the ceiling, but it’s worth making sure the bases are covered.
--
toastal ไข่ดาว | https://toast.al
PGP: 7944 74b7 d236 dab9 c9ef e7f9 5cce 6f14 66d4 7c9e
On 1/5/24 10:45, Daniil Baturin wrote:
> My impression is that a lot of people want `index.index_first = true` to be the default.
> In all other SSGs, access to index data from every page (content or index) is effortless,
> and soupault is an outlier there.
> I also still haven't seen anyone try to process a website with gigabytes worth of pages.
>
> So now I'm thinking: does anyone actually want to process sites that don't fit in memory?
> If I give up on that goal, soupault's internals can become much simpler,
> and working with index data will become a lot more intuitive for users.
>
> In short, which one of these options would you vote for (and why):
>
> 1. Keep the ability to process really large websites in all cases.
> 2. Keep that ability only in the pre-processor mode, or only when indexing is disabled.
> 3. Remove that ability completely.
I'd like to echo what some other folks have said. My thinking is:
It's nice that Soupault could do this, but I think we should carefully
weigh it against users' expectations.
Of course it's impossible to know if anyone will miss the feature, but
I'm strongly in favor of approach #3. I think the overall user
experience will benefit, especially for new users.
-Hristos
Hi,
As a recent user, this hasn't been a concern for me and it wasn't the
reason I started to use soupault.
Not sure it is related, but I'm currently much more interested in
conditionally skipping the publication of a page than in handling sites
that don't fit in RAM.
Have any soupault users managed websites that don't fit in RAM? How big
would such a website be? Are soupault websites really built in
constrained environments?
Option 3 might be the more reasonable choice, unless I'm missing
something about the usefulness of the feature.
Cheers