~lioploum/offpunk-devel

Trimming the cache

Message ID: <171752439758.7.8227099294300087127.347071641@ploum.eu>

Being probably the heaviest user of Offpunk, I have seen my cache grow to 
40 GB (probably because I experimented with deeper download levels 
during the 2.0 development).

Besides that, I’m really happy with the cache’s simple design: the 
cache can be completely backed up with a simple "cp -pr" and, if you want 
to merge two caches, you can simply do "cp -pru". According to my 
tests, it works!

Nevertheless, I think it is time to add a "trim cache" feature to 
Offpunk.

## A. Straightforward trimming

My first idea was to do the following:

1. Browse every list, including history and archives, and make a list of 
every link in them and every link in the linked pages: $TOKEEP_LIST
2. Now, go into the cache and delete everything that is older than 
$CACHE_TTL and not in $TOKEEP_LIST
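
The two steps above can be sketched roughly as follows. This is only an illustration, not Offpunk code: `trim_by_age`, `keep_paths` and `ttl_seconds` are hypothetical names, the keep list is assumed to already be a collection of cache file paths (mapping URLs to cache paths is not shown), and "older" is assumed to mean the file modification time.

```python
import os
import time
from pathlib import Path


def trim_by_age(cache_dir, keep_paths, ttl_seconds, now=None):
    """Delete cached files older than ttl_seconds that are not in keep_paths.

    keep_paths stands in for $TOKEEP_LIST and ttl_seconds for $CACHE_TTL.
    Age is judged by modification time (an assumption of this sketch).
    Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    keep = {Path(p).resolve() for p in keep_paths}
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = Path(root) / name
            if path.resolve() in keep:
                continue  # referenced from a list: never trimmed
            if now - path.stat().st_mtime > ttl_seconds:
                path.unlink()
                removed.append(path)
    return removed
```

Empty directories are left in place here; a real implementation would probably want to clean those up too.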

Another possibility is to trim the oldest items for as long as the cache 
is bigger than $MAX_SIZE, ignoring $CACHE_TTL.
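
A minimal sketch of this size-based variant, under stated assumptions: `trim_to_size` and `max_bytes` are hypothetical names, "oldest" is taken to mean the oldest modification time, and protecting list items through an optional `keep_paths` argument is an addition of this sketch (the message does not say whether the keep list still applies here).

```python
import os
from pathlib import Path


def trim_to_size(cache_dir, max_bytes, keep_paths=()):
    """Delete the oldest files until the cache fits in max_bytes ($MAX_SIZE).

    Files listed in keep_paths count toward the total size but are never
    deleted (an assumption of this sketch). Returns the removed paths.
    """
    keep = {Path(p).resolve() for p in keep_paths}
    candidates = []
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = Path(root) / name
            st = path.stat()
            total += st.st_size
            if path.resolve() not in keep:
                candidates.append((st.st_mtime, st.st_size, path))
    candidates.sort(key=lambda entry: entry[0])  # oldest first
    removed = []
    for _mtime, size, path in candidates:
        if total <= max_bytes:
            break
        path.unlink()
        total -= size
        removed.append(path)
    return removed
```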

But, I have another idea:

## B. Separated caches

I was wondering about having two caches. One in .cache ($CACHE1), the 
other in .local/share ($CACHE2).

As soon as you add an item to a list, its cache is duplicated from 
.cache to .local/share. When browsing, Offpunk always checks both and, if 
an item is present in both, takes the newest (and updates the other).
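
The lookup rule described above could look something like this. It is a sketch under assumptions, not an existing Offpunk API: `pin` and `lookup` are hypothetical helpers, items are addressed by a path relative to each cache root, and "newest" is decided by modification time.

```python
import shutil
from pathlib import Path


def pin(rel_path, cache1, cache2):
    """Duplicate an item from cache1 to cache2, as when it is added to a list."""
    src = Path(cache1) / rel_path
    dst = Path(cache2) / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 preserves the modification time
    return dst


def lookup(rel_path, cache1, cache2):
    """Return the freshest copy of rel_path, or None if neither cache has it.

    When both caches hold the item, the newer copy overwrites the older one.
    """
    a = Path(cache1) / rel_path
    b = Path(cache2) / rel_path
    if a.exists() and b.exists():
        a_time, b_time = a.stat().st_mtime, b.stat().st_mtime
        newer, older = (a, b) if a_time >= b_time else (b, a)
        if a_time != b_time:
            shutil.copy2(newer, older)
        return newer
    if a.exists():
        return a
    if b.exists():
        return b
    return None
```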

The advantage of this method is that you could easily back up the parts 
of the cache that are important to you (even putting them on a shared 
disk or something similar). Removing ~/.cache would have very little 
impact on you.

Disadvantage: It will be hard to know when to remove items from $CACHE2


I’m wondering if you have any ideas or thoughts on the matter?

Regards,

Ploum

--
Ploum - Lionel Dricot
Blog: https://www.ploum.net
Books: https://ploum.net/livres.html

Message ID: <20240608201748.GA88522@patsas>
In-Reply-To: <171752439758.7.8227099294300087127.347071641@ploum.eu>

I'm currently synchronizing my offpunk cache between computers using the 
following setup:

* The ~/Sync/Netcache directory is synchronized between computers using 
   Syncthing.
* ~/.cache/offpunk/gemini is a symbolic link to ~/Sync/Netcache/gemini. 
   Same for the gopher and finger caches. The gemini/gopher/finger caches 
   are small enough that I don't mind synchronizing everything. I remove 
   some large binary files occasionally.
* The ~/.cache/offpunk/https directory contains symbolic links only for 
   the domains I want to synchronize. For example 
   ~/.cache/offpunk/https/www.rfc-editor.org is a symbolic link to 
   ~/Sync/Netcache/https/www.rfc-editor.org. Similarly for the http 
   cache. This way I only synchronize what I need from the larger 
   http/https caches.
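
For illustration, the symbolic-link setup above could be scripted along these lines. The helper name `link_cache_entry` is hypothetical, and the sketch assumes the cache layout described in the list (one directory per protocol, then one per domain); it refuses to touch an existing real directory rather than moving data around.

```python
from pathlib import Path


def link_cache_entry(cache_root, sync_root, rel_path):
    """Make cache_root/rel_path a symbolic link to sync_root/rel_path.

    rel_path is e.g. "gemini" or "https/www.rfc-editor.org".
    """
    link = Path(cache_root) / rel_path
    target = Path(sync_root) / rel_path
    target.mkdir(parents=True, exist_ok=True)
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        return link  # already linked, nothing to do
    if link.exists():
        raise FileExistsError(f"{link} is a real directory; move it first")
    link.symlink_to(target, target_is_directory=True)
    return link
```

With the paths from the message, this would be called as `link_cache_entry(Path.home() / ".cache/offpunk", Path.home() / "Sync/Netcache", "https/www.rfc-editor.org")`.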

So it's already possible to synchronize part of the cache but it's 
definitely more involved than adding a URL to a list.

The double cache can make it easier to keep a separate cache of desired 
domains but doesn't make it easier to prune files based on age, size or 
type. It probably makes it a little more difficult for external tools to 
work with the offpunk cache, for example to search through it.

At this point I favor the simpler solution, having a single cache, but I 
don't have a super-strong opinion on it.

All the best,
Sotiris

Message ID: <171793573598.7.10250328058055671139.350851786@ploum.eu>
In-Reply-To: <20240608201748.GA88522@patsas>

On 24 jun 08 10:17, Sotiris Papatheodorou wrote:
>I'm currently synchronizing my offpunk cache between computers using the
>following setup:
>
>* The ~/Sync/Netcache directory is synchronized between computers using
>   Syncthing.
>* ~/.cache/offpunk/gemini is a symbolic link to ~/Sync/Netcache/gemini.
>   Same for the gopher and finger caches. The gemini/gopher/finger caches
>   are small enough that I don't mind synchronizing everything. I remove
>   some large binary files occasionally.
>* The ~/.cache/offpunk/https directory contains symbolic links only for
>   the domains I want to synchronize. For example
>   ~/.cache/offpunk/https/www.rfc-editor.org is a symbolic link to
>   ~/Sync/Netcache/https/www.rfc-editor.org. Similarly for the http
>   cache. This way I only synchronize what I need from the larger
>   http/https caches.
>
>So it's already possible to synchronize part of the cache but it's
>definitely more involved than adding a URL to a list.
>
>The double cache can make it easier to keep a separate cache of desired
>domains but doesn't make it easier to prune files based on age, size or
>type. It probably makes it a little more difficult for external tools to
>work with the offpunk cache, for example to search through it.
>
>At this point I favor the simpler solution, having a single cache, but I
>don't have a super-strong opinion on it.

Thanks a lot for that use case. Very interesting. I tend to agree with 
you: having two separate caches will probably be a recipe for problems.

Offering a way to trim the cache based on "last-seen but not in any 
list" seems the best and most intuitive way to go forward.

But it doesn’t help your own use case much and, TBH, I’m not sure how it 
could be helped. I need to think a bit about the problem.

Any ideas are welcome!

-- 
Ploum - Lionel Dricot
Blog: https://www.ploum.net
Books: https://ploum.net/livres.html

Message ID: <20240609210438.GA154104@patsas>
In-Reply-To: <171793573598.7.10250328058055671139.350851786@ploum.eu>

On 2024-06-09, Ploum wrote:
>Offering a way to trim the cache based on "last-seen but not in any
>list" seems the best and most intuitive way to go forward.
>
>But it doesn’t help your own use case much and, TBH, I’m not sure how it
>could be helped. I need to think a bit about the problem.

I imagine the cache trimming would be something initiated manually by 
the user, in which case I could just ignore it and use a different 
trimming strategy.

There are quite a few parameters the cache trimming can be based on:

* File age.
* File size.
* File media type.
* Domain name.
* URL pattern.

and probably more, plus combinations of the above. A fully-featured trim 
command starts looking like a variant of the Unix find command. But "old 
but not in any list" is simple and might be enough for most use cases.
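
To give an idea of what combining such parameters might look like, here is a small sketch. Everything in it is hypothetical (the `TrimRule` class, its field names, `find_candidates`); it guesses the media type from the file extension and treats domain and URL pattern as a glob on the path relative to the cache root, assuming the protocol/domain directory layout. It only lists candidates and deletes nothing.

```python
import fnmatch
import mimetypes
import os
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class TrimRule:
    """All criteria that are set must match (logical AND)."""
    min_age_seconds: Optional[float] = None
    min_size_bytes: Optional[int] = None
    media_type_prefix: Optional[str] = None  # e.g. "image/"
    path_pattern: Optional[str] = None       # e.g. "https/example.org/*"

    def matches(self, rel_path, st, now):
        if self.min_age_seconds is not None:
            if now - st.st_mtime < self.min_age_seconds:
                return False
        if self.min_size_bytes is not None:
            if st.st_size < self.min_size_bytes:
                return False
        if self.media_type_prefix is not None:
            guessed, _encoding = mimetypes.guess_type(rel_path)
            if guessed is None or not guessed.startswith(self.media_type_prefix):
                return False
        if self.path_pattern is not None:
            if not fnmatch.fnmatch(rel_path, self.path_pattern):
                return False
        return True


def find_candidates(cache_dir, rule, now=None):
    """Return the sorted relative paths of cached files matching rule."""
    now = time.time() if now is None else now
    cache_dir = Path(cache_dir)
    hits = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = Path(root) / name
            rel = path.relative_to(cache_dir).as_posix()
            if rule.matches(rel, path.stat(), now):
                hits.append(rel)
    return sorted(hits)
```

For example, `find_candidates(cache, TrimRule(media_type_prefix="image/", min_size_bytes=1_000_000))` would list large images, which is close in spirit to a `find` invocation with `-size` and `-name` tests.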

All the best,
Sotiris