~dmbaturin/soupault

caching and preprocessors

Details
Message ID
<27df1c33-9327-cdfb-0bd5-d2fd385ff16a@fo.am>
I've been trying to get the new caching feature to work, but keep 
getting errors about missing hash files.

[WARNING] Cache directory for page site/crystal/Calibrating Future 
Experiences.odt does not contain a page source hash file 
(.page_source_hash),cache will be discarded!

The config includes these settings:

   caching = true
   cache_dir = ".soupault-cache"

Pages are generated in the preprocessors stage using an external pandoc
script. The site is generated as expected, and the site structure appears
to be created successfully in the cache dir; however, each of the page
dirs is empty.

Are there other settings or changes required to cache the preprocessor 
results?
Details
Message ID
<fc42fd53-3a71-edf9-673e-7a534a9078e1@baturin.org>
In-Reply-To
<27df1c33-9327-cdfb-0bd5-d2fd385ff16a@fo.am> (view parent)
This is odd. There shouldn't be a need for any other settings, so you 
have likely found a bug.

Is your site source public so that I could test it myself on the same data?

On 1/25/23 10:25, nik.srht@fo.am wrote:
> [...]
Details
Message ID
<1c21d2ac-7a95-16fb-2bdc-2dac78613765@fo.am>
In-Reply-To
<fc42fd53-3a71-edf9-673e-7a534a9078e1@baturin.org> (view parent)
The site isn't public yet, but I'll try making a minimal example to see
if the bug can be replicated.

On 2023-01-25 11:54, Daniil Baturin wrote:
> [...]
nik gaffney <nik@fo.am>
Details
Message ID
<48712b3d-2655-4580-bc4b-75aca6afd081@fo.am>
In-Reply-To
<fc42fd53-3a71-edf9-673e-7a534a9078e1@baturin.org> (view parent)
I took a closer look and managed to confuse myself by setting debug=true,
which appears to avoid the problem.

Maybe the cache is not being built correctly when debug is false?

e.g.

%  soupault --debug

[INFO] Starting soupault 4.4.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[DEBUG] Widget processing order:
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/faustroll.odt
[DEBUG] Saving new page hash to 
.soupault-cache/site/faustroll.odt/.page_source_hash
[INFO] Calling page preprocessor "pandoc --from=odt --to=html 
--wrap=preserve --reference-links --reference-location=block" on page 
site/faustroll.odt
[DEBUG] Saving a cached object to 
.soupault-cache/site/faustroll.odt/631383cb2530d44b9cde4d4b454b38869b26fb667d4c6067548b1ffc8394efca_40053855b355ea57f47456bebc41688c5330c8a296a7bbe9df89cc04834a351d
[INFO] Using the default template for page site/faustroll.odt
[DEBUG] Not inserting index data: indexing is disabled in the configuration
[INFO] Writing generated page to build/faustroll/index.html

% soupault --debug

[INFO] Starting soupault 4.4.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[DEBUG] Widget processing order:
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/faustroll.odt
[DEBUG] Cache for page site/faustroll.odt is considered valid and will 
be used
[DEBUG] Reading a cached object from 
.soupault-cache/site/faustroll.odt/631383cb2530d44b9cde4d4b454b38869b26fb667d4c6067548b1ffc8394efca_40053855b355ea57f47456bebc41688c5330c8a296a7bbe9df89cc04834a351d
[INFO] Using the default template for page site/faustroll.odt
[DEBUG] Not inserting index data: indexing is disabled in the configuration
[INFO] Writing generated page to build/faustroll/index.html

% soupault

[INFO] Starting soupault 4.4.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/faustroll.odt
[INFO] Using the default template for page site/faustroll.odt
[INFO] Writing generated page to build/faustroll/index.html

% soupault

[INFO] Starting soupault 4.4.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/faustroll.odt
[INFO] Using the default template for page site/faustroll.odt
[INFO] Writing generated page to build/faustroll/index.html

% rm -rf .soupault-cache/*

% soupault

[INFO] Starting soupault 4.4.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/faustroll.odt
[INFO] Calling page preprocessor "pandoc --from=odt --to=html 
--wrap=preserve --reference-links --reference-location=block" on page 
site/faustroll.odt
[INFO] Using the default template for page site/faustroll.odt
[INFO] Writing generated page to build/faustroll/index.html

% soupault

[INFO] Starting soupault 4.4.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/faustroll.odt
[WARNING] Cache directory for page site/faustroll.odt does not contain a 
page source hash file (.page_source_hash),cache will be discarded!
[INFO] Calling page preprocessor "pandoc --from=odt --to=html 
--wrap=preserve --reference-links --reference-location=block" on page 
site/faustroll.odt
[INFO] Using the default template for page site/faustroll.odt
[INFO] Writing generated page to build/faustroll/index.html
Details
Message ID
<a6e63dcf-4081-c1d3-4196-d29164a242b8@baturin.org>
In-Reply-To
<48712b3d-2655-4580-bc4b-75aca6afd081@fo.am> (view parent)
Hi Nik,

I fixed the problem. It was a funny case of missing parentheses that
caused bits of the actual logic to be interpreted as part of a debug log
function body:
https://codeberg.org/PataphysicalSociety/soupault/commit/599f0f921c32b0d5daf41e5ba4fa369f55acb15c
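
For readers wondering why turning debug on hid the problem, here is a rough Python analogue of this class of bug. It is not the soupault code: soupault is written in OCaml, where the culprit was missing parentheses rather than indentation. The names `save_page_hash`, `build_page_buggy` and `build_page_fixed` are hypothetical and exist only for this illustration.

```python
# Hypothetical illustration: a cache write that accidentally lives
# inside a debug-only branch, so it only happens when debug is on.

def save_page_hash(cache: dict, page: str) -> None:
    """Stand-in for writing the .page_source_hash file."""
    cache[page] = "hash-of-" + page


def build_page_buggy(cache: dict, page: str, debug: bool) -> None:
    if debug:
        print(f"[DEBUG] Saving new page hash for {page}")
        # Bug: the real work is scoped under the debug check.
        save_page_hash(cache, page)


def build_page_fixed(cache: dict, page: str, debug: bool) -> None:
    if debug:
        print(f"[DEBUG] Saving new page hash for {page}")
    # Fix: the real work runs regardless of the log level.
    save_page_hash(cache, page)


# With debug off, the buggy variant never records the hash.
buggy_cache: dict = {}
build_page_buggy(buggy_cache, "site/faustroll.odt", debug=False)

fixed_cache: dict = {}
build_page_fixed(fixed_cache, "site/faustroll.odt", debug=False)
```

In the buggy variant the cache stays empty unless debug is enabled, which matches the symptom reported in this thread: the hash file was only written when running with `--debug`.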

Could you try building again and let me know if it works for you without 
debug now?

On 1/25/23 12:49, nik gaffney wrote:
> [...]
Details
Message ID
<8fcd5002-d374-d172-b49f-79d85d88b4cf@fo.am>
In-Reply-To
<a6e63dcf-4081-c1d3-4196-d29164a242b8@baturin.org> (view parent)
Thanks! that fixed it.

On 2023-01-26 03:49, Daniil Baturin wrote:
> [...]
Details
Message ID
<a3736be7-c793-51fb-bb51-31b628f33be2@baturin.org>
In-Reply-To
<8fcd5002-d374-d172-b49f-79d85d88b4cf@fo.am> (view parent)
Thanks for testing it! I'm planning to make a release early next week then.

On 1/26/23 10:03, nik gaffney wrote:
> [...]
Details
Message ID
<503b36ff-a348-098e-3d25-f63f0d250f80@fo.am>
In-Reply-To
<a3736be7-c793-51fb-bb51-31b628f33be2@baturin.org> (view parent)
On the topic of caching, do you have any plans to add caching for 
asset_processors?

I'm currently using an external script for the asset_processors which
checks pre and post checksums. It would certainly simplify things if that
were part of the standard build process.

Also, is there a way to ensure asset_processors are run after the
preprocessors have completed?
Details
Message ID
<fb92fc8e-9505-0bfe-45aa-b0e422eefd49@baturin.org>
In-Reply-To
<503b36ff-a348-098e-3d25-f63f0d250f80@fo.am> (view parent)
 >do you have any plans to add caching for asset_processors?

That's complicated. With pages, it's simple since the output path is 
decided by soupault itself.
With asset processors, the user specifies a template for generating a 
complete command.
That is required to accommodate commands with peculiar syntax that makes 
it impossible to just append the output file path,
and to allow original and processed files to have different extensions.
However, it also means that soupault doesn't actually know the output
path and cannot replicate what the user-given command would do.

I agree that it would be nice to cache asset processor outputs but it's 
going to require syntax design changes.
If you have ideas how to best handle that, please share.

 >is there a way to ensure asset_processors are run after the preprocessors have completed?

Asset files are always processed before page files: 
https://codeberg.org/PataphysicalSociety/soupault/src/commit/599f0f921c32b0d5daf41e5ba4fa369f55acb15c/src/soupault.ml#L895-L908
The reason is simply that asset file processing is the same in either
case (if it's to be done at all), while page processing workflows differ
depending on whether index.index_first is enabled.

That said, the decision to make processing pages and assets separate 
steps is strategic, but the order of those steps is trivial.
If there's a compelling reason to switch them, I see no reason why not 
to do it.

On 1/26/23 13:28, nik gaffney wrote:
> [...]
Details
Message ID
<ef7c0998-e75f-545a-d3bf-976e79a67cc9@fo.am>
In-Reply-To
<fb92fc8e-9505-0bfe-45aa-b0e422eefd49@baturin.org> (view parent)
On 2023-01-27 05:07, Daniil Baturin wrote:
>  >do you have any plans to add caching for asset_processors?
> 
> That's complicated. With pages, it's simple since the output path is 
> decided by soupault itself.
> With asset processors, the user specifies a template for generating a 
> complete command.
> That is required to accommodate commands with peculiar syntax that makes 
> it impossible to just append the output file path,
> and to allow original and processed files to have different extensions.
> However, it also means that soupault doesn't actually know the output 
> path and cannot replicate what the user-given command would do.

Admittedly, in the general case it's not so obvious. At the moment I'm
relying on a filter which takes input and output paths. The filter just
checks whether the input file has changed or the output is missing, to
avoid unnecessary work.

e.g.

png = "./filters/process_png '{{source_file_path}}' '{{target_dir}}/{{source_file_name}}'"
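
A filter of that kind could look roughly like the sketch below. This is not the actual `process_png` script, which is not shown in this thread. The function names, the `.sha256` sidecar file and the use of `shutil.copyfile` as a placeholder processor are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def process_if_needed(source: Path, target: Path, process=shutil.copyfile) -> bool:
    """Run process(source, target) unless the target is up to date.

    The source digest from the previous run is kept in a sidecar file
    next to the target (a hypothetical convention for this sketch).
    Returns True if the processor was run.
    """
    stamp = target.with_name(target.name + ".sha256")
    digest = file_digest(source)
    if target.exists() and stamp.exists() and stamp.read_text() == digest:
        return False  # input unchanged and output present: skip the work
    target.parent.mkdir(parents=True, exist_ok=True)
    process(source, target)
    stamp.write_text(digest)
    return True
```

One drawback of this layout is that the sidecar files end up in the build directory; keeping them in a separate cache directory would avoid publishing them.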

> I agree that it would be nice to cache asset processor outputs but it's 
> going to require syntax design changes.

Can't think of anything at the moment that could work in the general 
case without relying on explicit description, but...

e.g. in the above something like

png_cache = ['{{source_file_path}}', 
'{{target_dir}}/{{source_file_name}}', 
'{{target_dir}}/preview_{{source_file_name}}']

which could just check for matching checksums (or some other cache 
invalidation?) before running the asset_processor (which might generate 
assets on paths not specified in the command)

> If you have ideas how to best handle that, please share.

I'll think further about how it could work more generally.

>> is there a way to ensure asset_processors are run after the
>> preprocessors have completed?

> That said, the decision to make processing pages and assets separate 
> steps is strategic, but the order of those steps is trivial.
> If there's a compelling reason to switch them, I see no reason why not 
> to do it.

I would agree that keeping them separate is useful. The only motive I
have is based on a use case where the preprocessor may produce assets
(as a side effect) as well as HTML (output). In particular, converting a
PDF or ODT file, for example, might produce image files, so it would be
useful to run the asset_processor afterwards (rather than running soupault
twice or relying on another explicit build stage).
Details
Message ID
<19eab65b-71fe-e898-564e-f0cc724c647b@baturin.org>
In-Reply-To
<ef7c0998-e75f-545a-d3bf-976e79a67cc9@fo.am> (view parent)
 >png = "./filters/process_png '{{source_file_path}}' '{{target_dir}}/{{source_file_name}}'"

Yes, the problem is that you know that the output path is 
'{{target_dir}}/{{source_file_name}}' but to soupault that command is 
opaque.
To make automatic caching possible, the output path needs to be made 
explicit.

One compatible syntax I can think of is like this:

[asset_processors]
   png = { target_path_template = "{{target_dir}}/{{source_file_name}}.css", command_template = "sass {{source_file_path}} {{target_file_path}}" }

where {{target_file_path}} is generated using target_path_template and 
injected in the command_template environment.

It's much more complicated than the current one, but I can see how I 
could add it — check if the value is a string or an inline table,
then use different ways of constructing the complete command.
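
The string-or-table dispatch described above can be sketched in Python as follows. This only illustrates the idea and is not soupault code. The `render` helper is a simplistic stand-in for a real template engine, and `build_command`, the sample `env` values and the `main.scss` file are hypothetical.

```python
import re
from typing import Optional, Tuple, Union


def render(template: str, env: dict) -> str:
    """Replace each {{name}} placeholder with env[name]."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: env[m.group(1)], template)


def build_command(processor: Union[str, dict], env: dict) -> Tuple[str, Optional[str]]:
    """Return (command, target_path).

    For the plain-string form the output path is unknown, so
    target_path is None and nothing could be cached.
    """
    if isinstance(processor, str):
        return render(processor, env), None
    # Table form: render the target path first, then expose it to the
    # command template as {{target_file_path}}.
    target = render(processor["target_path_template"], env)
    command = render(processor["command_template"], {**env, "target_file_path": target})
    return command, target


# Sample values, made up for this sketch.
env = {
    "source_file_path": "site/css/main.scss",
    "source_file_name": "main.scss",
    "target_dir": "build/css",
}
table_form = {
    "target_path_template": "{{target_dir}}/{{source_file_name}}.css",
    "command_template": "sass {{source_file_path}} {{target_file_path}}",
}
command, target = build_command(table_form, env)
```

With the table form the caller gets the output path back explicitly, which is the piece of information needed to decide whether a cached result can be reused.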

I'm not sure if it's worthwhile, though. I'm by no means against caching 
asset processor outputs, just not sure if trying to embed an asset 
management
system inside of soupault is a good idea or not.

(The source of all problems is that page preprocessors work with stdin
and stdout, while many types of asset processors, such as image converters,
may not even support writing to stdout, and reading potentially very
large files into memory just to postprocess and cache them
can cause lots of problems for users.)

On 1/27/23 10:37, nik gaffney wrote:
> [...]
Details
Message ID
<fa1de904-983c-2506-00b6-6d1ce1ee1c5e@fo.am>
In-Reply-To
<19eab65b-71fe-e898-564e-f0cc724c647b@baturin.org> (view parent)
On 2023-01-27 13:28, Daniil Baturin wrote:
> I'm not sure if it's worthwhile, though. I'm by no means against caching 
> asset processor outputs, just not sure if trying to embed an asset 
> management
> system inside of soupault is a good idea or not.

It's probably more trouble than it's worth. That said, if asset
processing adds a significant build overhead, it might be a good idea to
include some suggestions and/or examples in the docs.

I can look at cleaning up the image processing scripts I'm currently
using as a starting point.

nik