I just had a very stupid idea, so I need to share it with you:
What if the cache was handled by git?
Each single domain would be a separate git repository. Each new sync in
that domain would be a new commit to the repository.
Potential benefits:
- Keeping the history of pages modified over time
- information about when a page was seen for the first time (no FS
creation time available under Linux…)
- through a mechanism yet to be invented, allowing people to share the
cache of a given domain without connecting to the Internet.
- In the long term, distributing your content directly through git (the
local git cache is then replaced by a clone of an "official" git)
- Besides adding a .git folder in each domain, there’s virtually no
change to the cache. You can still explore it with grep/ls and modify
it without caring about git.
Drawbacks:
- Lot of complexity.
- Increase in cache storage.
- Very hard to remove permanently part of the cache to save space.
So, yeah, it looks like a stupid idea.
But, for some reason, I keep thinking about it…
What’s your take?
--
Ploum - Lionel Dricot
Blog: https://www.ploum.net
Livres: https://ploum.net/livres.html
On 2023-12-15, Ploum wrote:
>Each single domain would be a separate git repository. Each new sync in
>that domain would be a new commit to the repository.
It's actually an interesting idea. Each computer becomes a small Wayback
Machine; I think it's worth a try.
I'll attempt to add the functionality in my offpunk-compatible (minus my
bugs) caching proxy. Its architecture allows different storage types for
the cache so this shouldn't be too difficult. It will likely take a few
months though, as I'm quite busy at the moment.
On 2023-12-15, Ploum wrote:
>- through a mechanism yet to be invented, allowing people to share the
>cache of a given domain without connecting to the Internet.
The "without connecting to the Internet" part already works. Git allows
remote URLs to point to the filesystem so you could pull from e.g.
/media/flash-drive/example.com. The only problem here is that each user
would have a different git commit history. Merging those requires using
the --allow-unrelated-histories flag but will probably work fine most of
the time.
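To illustrate, here is a rough sketch of that sharing flow. The paths,
filenames and page contents are all made up, and the flash drive is
simulated by a second local repository:

```shell
set -e
tmp=$(mktemp -d)

# "My" cache of example.com, with its own commit history.
mkdir -p "$tmp/mine/example.com"
cd "$tmp/mine/example.com"
git init -q
git config user.email me@example.invalid
git config user.name me
echo "page as I saw it" > index.gmi
git add .
git commit -qm 'offpunk sync'

# A friend's cache of the same domain, e.g. on a mounted flash drive.
mkdir -p "$tmp/drive/example.com"
cd "$tmp/drive/example.com"
git init -q
git config user.email friend@example.invalid
git config user.name friend
echo "a page I never saw" > other.gmi
git add .
git commit -qm 'offpunk sync'

# Pull the friend's cache straight from the filesystem: no network
# involved. The two histories share no common ancestor, hence the flag.
cd "$tmp/mine/example.com"
git fetch -q "$tmp/drive/example.com"
git merge -q --no-edit --allow-unrelated-histories FETCH_HEAD
```

After the merge, "my" cache contains the union of both caches, and the
merge commit records when the two histories were joined.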
On 2023-12-15, Ploum wrote:
>- Lot of complexity.
If I'm not mistaken, updating the cache would be as simple as:
cd ~/.cache/offpunk/gemini/example.com
git add .
git commit -m 'offpunk sync'
Of course an extra (optional) dependency is extra complexity by itself.
The most complex functionality would be exposing the git history in the
client interface to allow easily viewing previous versions of a page.
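As a sketch of the plumbing such a feature could sit on (the repository,
filenames and contents here are invented for the example):

```shell
set -e
cache=$(mktemp -d)
cd "$cache"
git init -q
git config user.email cache@example.invalid
git config user.name offpunk-cache

# First sync of a page...
echo "old content" > index.gmi
git add .
git commit -qm 'offpunk sync'

# ...and a later sync where the page changed upstream.
echo "new content" > index.gmi
git add .
git commit -qm 'offpunk sync'

# List the syncs that touched this page.
git log --format='%h %ad' --date=short -- index.gmi

# Show the page exactly as it was one sync ago.
git show HEAD~1:index.gmi   # prints "old content"
```

A client would only need to wrap these two commands to offer a
"previous versions of this page" menu.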
On 2023-12-15, Ploum wrote:
>- Increase in cache storage.
This would be especially problematic for changing binary files.
Incremental updates to text files, however, should be quite efficient.
I'd like to run the git-based cache for some time and get some actual
statistics.
On 2023-12-15, Ploum wrote:
>- Very hard to remove permanently part of the cache to save space.
It's possible to rewrite git history and squash all commits into a
single one, erasing old versions of files. Running git gc after that
should reduce disk usage.
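One possible way to do that squash, sketched here with a throwaway
repository (an orphan branch replaces the old history; other approaches
such as git rebase or git filter-repo exist too):

```shell
set -e
cache=$(mktemp -d)
cd "$cache"
git init -q -b main
git config user.email cache@example.invalid
git config user.name offpunk-cache

# Simulate a few syncs of a page that changes every time.
for i in 1 2 3; do
    echo "content $i" > page.gmi
    git add .
    git commit -qm "offpunk sync $i"
done

# Replace the whole history with a single commit of the current files.
git checkout -q --orphan squashed
git commit -qm 'squashed offpunk cache'
git branch -D main
git branch -m main

# Make the old commits unreachable and actually reclaim the space.
git reflog expire --expire=now --all
git gc --prune=now --quiet
```

The reflog expiry step matters: without it, git gc still considers the
old commits reachable and keeps them on disk.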
Sotiris
On 23/12/29 03:45, Sotiris Papatheodorou wrote:
>On 2023-12-15, Ploum wrote:
>>Each single domain would be a separate git repository. Each new sync in
>>that domain would be a new commit to the repository.
>
>It's actually an interesting idea. Each computer becomes a small Wayback
>Machine; I think it's worth a try.
>
>I'll attempt to add the functionality in my offpunk-compatible (minus my
>bugs) caching proxy. Its architecture allows different storage types for
>the cache so this shouldn't be too difficult. It will likely take a few
>months though, as I'm quite busy at the moment.
I would be very interested in your experience. Of course, no pressure
here. This is just for fun and to experiment. That’s why I shared the
idea: to see if it was totally stupid or if it could give another idea
to someone else.
>On 2023-12-15, Ploum wrote:
>>- through a mechanism yet to be invented, allowing people to share the
>>cache of a given domain without connecting to the Internet.
>
>The "without connecting to the Internet" part already works. Git allows
>remote URLs to point to the filesystem so you could pull from e.g.
>/media/flash-drive/example.com. The only problem here is that each user
>would have a different git commit history. Merging those requires using
>the --allow-unrelated-histories flag but will probably work fine most of
>the time.
I didn’t know about --allow-unrelated-histories. Seems really
interesting.
Also a good read by Solderpunk:
gemini://zaibatsu.circumlunar.space/~solderpunk/gemlog/low-budget-p2p-content-distribution-with-git.gmi
>On 2023-12-15, Ploum wrote:
>>- Lot of complexity.
>
>If I'm not mistaken, updating the cache would be as simple as:
>
> cd ~/.cache/offpunk/gemini/example.com
> git add .
> git commit -m 'offpunk sync'
>
>Of course an extra (optional) dependency is extra complexity by itself.
>The most complex functionality would be exposing the git history in the
>client interface to allow easily viewing previous versions of a page.
Git is such a powerful tool. But also a very complex one. A strong
requirement would be to not force users to learn git.
I believe "history" is an advanced feature; the most interesting one,
for me, would be merging your cache with someone else’s.
This should be done with constraints:
- always keep versions of a page you encountered yourself
- always keep the correct chronological order (and how could that
possibly be guessed?)
- keep the resulting cache below a certain size (at some point, size
will start to become a problem)
Those are all very very interesting problems.
>On 2023-12-15, Ploum wrote:
>>- Increase in cache storage.
>
>This would be especially problematic for changing binary files.
>Incremental updates to text files, however, should be quite efficient.
>I'd like to run the git-based cache for some time and get some actual
>statistics.
>
>On 2023-12-15, Ploum wrote:
>>- Very hard to remove permanently part of the cache to save space.
>
>It's possible to rewrite git history and squash all commits into a
>single one, erasing old versions of files. Running git gc after that
>should reduce disk usage.
Yes, this is definitely possible. The question is how to do it properly.
But that’s a nice-to-have problem.
I mean, if the most pressing problem is the size of the cache, it means
we have solved all the other problems *and* that the git experiment
proved useful enough that offpunk should migrate to a git-based cache.
Which is not for the short term but is quite exciting!
Thanks for replying with such nice ideas.