~lioploum/offpunk-devel


Using git as a cache?

Details
Message ID
<170263580983.10.5122506765204732465.231439883@ploum.eu>
I just had a very stupid idea so I need to share it with you:

What if the cache was handled by git?

Each single domain would be a separate git repository. Each new sync in 
that domain would be a new commit to the repository.



Potential benefits:

- Keeping the history of pages modified over time
- information about when a page was seen for the first time (no FS 
   creation time available under Linux…)
- through a mechanism yet to be invented, allowing people to share the cache 
   of a given domain without connecting to the Internet.
- In the long term, distributing your content directly through git (the 
   local git cache is then replaced by a clone of an "official" git repository)
- Besides adding a .git folder in each domain, there’s virtually no 
   change in the cache. You can still explore it with grep/ls and have it 
   modified without caring about git.
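
As a rough illustration of the "first seen" benefit: assuming each sync really is one commit, the date a page entered the cache could be read back from the commit that first added the file. Everything in this sketch is hypothetical (the temporary directory standing in for a domain folder, the file name index.gmi, the demo identity and the fixed dates); it only shows that plain git already records the information.

```shell
#!/bin/sh
set -eu
# Demo identity so commits work without any git configuration (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Temporary directory standing in for one domain folder of the cache.
cache=$(mktemp -d)
cd "$cache"
git init -q

# First sync: the page is seen for the first time.
echo "hello" > index.gmi
git add .
GIT_AUTHOR_DATE='2023-12-15T10:00:00+01:00' git commit -q -m 'offpunk sync'

# Second sync: the page has changed.
echo "hello again" > index.gmi
git add .
GIT_AUTHOR_DATE='2023-12-29T14:45:00+01:00' git commit -q -m 'offpunk sync'

# --diff-filter=A keeps only commits that added the file; the last line
# of the output is the oldest such commit, i.e. the first sighting.
first_seen=$(git log --diff-filter=A --format=%aI -- index.gmi | tail -n 1)
echo "index.gmi first seen: $first_seen"
```

The date is only as good as the sync schedule: it says when the page was first committed to the cache, not when it was published.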


Drawbacks:

- Lot of complexity.
- Increase in cache storage.
- Very hard to remove permanently part of the cache to save space.

So, yeah, it looks like a stupid idea.

But, for some reason, I keep thinking about it…

What’s your take?

--  

Ploum - Lionel Dricot
Blog: https://www.ploum.net
Livres: https://ploum.net/livres.html
Details
Message ID
<20231229134530.GA4846@patsas>
In-Reply-To
<170263580983.10.5122506765204732465.231439883@ploum.eu>
On 2023-12-15, Ploum wrote:
>Each single domain would be a separate git repository. Each new sync in
>that domain would be a new commit to the repository.

It's actually an interesting idea. Each computer becomes a small Wayback 
Machine, I think it's worth a try.

I'll attempt to add the functionality in my offpunk-compatible (minus my 
bugs) caching proxy. Its architecture allows different storage types for 
the cache so this shouldn't be too difficult. It will likely take a few 
months though, I'm quite busy at the moment.


On 2023-12-15, Ploum wrote:
>- through a mechanism yet to be invented, allowing people to share the 
>cache of a given domain without connecting to the Internet.

The "without connecting to the Internet" part already works. Git allows 
remote URLs to point to the filesystem so you could pull from e.g. 
/media/flash-drive/example.com. The only problem here is that each user 
would have a different git commit history. Merging those requires using 
the --allow-unrelated-histories flag but will probably work fine most of 
the time.
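
A minimal sketch of what that could look like, using two throwaway repositories in a temporary directory: one stands in for the local cache and the other for /media/flash-drive/example.com. The file names and the demo identity are made up for the example, and the two caches deliberately hold different files so no conflict arises. It uses fetch followed by merge rather than pull, which avoids depending on how pull is configured to reconcile divergent branches.

```shell
#!/bin/sh
set -eu
# Demo identity so commits work without any git configuration (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
mine="$work/cache/example.com"          # stands in for the local cache
theirs="$work/flash-drive/example.com"  # stands in for the flash drive

# Two independent caches of the same domain, each with its own history.
for repo in "$mine" "$theirs"; do
    mkdir -p "$repo"
    git -C "$repo" init -q
done
echo "# my page" > "$mine/index.gmi"
echo "# their page" > "$theirs/about.gmi"
git -C "$mine" add .
git -C "$mine" commit -q -m 'offpunk sync'
git -C "$theirs" add .
git -C "$theirs" commit -q -m 'offpunk sync'

# Fetch straight from the filesystem path, then merge the two
# histories even though they share no common ancestor.
git -C "$mine" fetch -q "$theirs" HEAD
git -C "$mine" merge -q --allow-unrelated-histories \
    -m 'merge cache from flash drive' FETCH_HEAD

ls "$mine"
```

When both caches contain the same page with different content, the merge will stop on a conflict, so a real implementation would need a policy for choosing a version.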


On 2023-12-15, Ploum wrote:
>- Lot of complexity.

If I'm not mistaken, updating the cache would be as simple as:

	cd ~/.cache/offpunk/gemini/example.com
	git add .
	git commit -m 'offpunk sync'

Of course an extra (optional) dependency is extra complexity by itself. 
The most complex functionality would be exposing the git history in the 
client interface to allow easily viewing previous versions of a page.
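
For the history part, plain git already provides the building blocks a client would have to wrap. The sketch below assumes one commit per sync, as described above; the temporary directory, the file name index.gmi and the demo identity are placeholders for illustration, not anything offpunk does today.

```shell
#!/bin/sh
set -eu
# Demo identity so commits work without any git configuration (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Temporary directory standing in for ~/.cache/offpunk/gemini/example.com
cache=$(mktemp -d)
cd "$cache"
git init -q

# Two syncs of the same page with different content.
echo "first version" > index.gmi
git add .
git commit -q -m 'offpunk sync'
echo "second version" > index.gmi
git add .
git commit -q -m 'offpunk sync'

# List every recorded version of the page, newest first.
git log --format='%h %ad' --date=iso -- index.gmi

# Print the page as it was one sync ago, without touching the working copy.
previous=$(git show HEAD~1:index.gmi)
echo "$previous"
```

Since git show reads from the object store, the files on disk stay at the latest version and can still be explored with grep and ls.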


On 2023-12-15, Ploum wrote:
>- Increase in cache storage.

This would be especially problematic for changing binary files. 
Incremental updates to text files however should be quite efficient. I'd 
like to run the git-based-cache for some time and get some actual 
statistics.

On 2023-12-15, Ploum wrote:
>- Very hard to remove permanently part of the cache to save space.

It's possible to rewrite git history and squash all commits into a 
single one, erasing old versions of files. Running git gc after that 
should reduce disk usage.
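
One possible way to do this, sketched on a throwaway repository: create a single parentless commit that holds the current tree, point the branch at it, expire the reflog and let git gc prune whatever is no longer reachable. The directory, file name and demo identity are placeholders, and this is only one of several ways to squash a history.

```shell
#!/bin/sh
set -eu
# Demo identity so commits work without any git configuration (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Temporary directory standing in for one domain folder of the cache.
cache=$(mktemp -d)
cd "$cache"
git init -q

echo "old version" > index.gmi
git add .
git commit -q -m 'offpunk sync'
old_blob=$(git rev-parse HEAD:index.gmi)   # object we expect to disappear

echo "new version" > index.gmi
git add .
git commit -q -m 'offpunk sync'

# Build one parentless commit from the current tree and move the
# current branch to it. The working copy is left untouched.
new=$(git commit-tree -m 'squashed cache' 'HEAD^{tree}')
git update-ref HEAD "$new"

# Drop reflog entries that still reference the old commits, then
# prune the objects that have become unreachable.
git reflog expire --expire=now --all
git gc -q --prune=now

git log --oneline
```

After this the repository has a single commit and the old version of the page is no longer stored, which also means it can no longer be merged cleanly with clones made before the squash.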


Sotiris
Details
Message ID
<170386370813.8.4931456628944056547.239531895@ploum.eu>
In-Reply-To
<20231229134530.GA4846@patsas>
On 23/12/29 03:45, Sotiris Papatheodorou wrote:
>On 2023-12-15, Ploum wrote:
>>Each single domain would be a separate git repository. Each new sync in
>>that domain would be a new commit to the repository.
>
>It's actually an interesting idea. Each computer becomes a small Wayback
>Machine, I think it's worth a try.
>
>I'll attempt to add the functionality in my offpunk-compatible (minus my
>bugs) caching proxy. Its architecture allows different storage types for
>the cache so this shouldn't be too difficult. It will likely take a few
>months though, I'm quite busy at the moment.

I would be very interested in your experience. Of course, no pressure 
here. This is just for fun and to experiment. That’s why I shared the 
idea: to see if it was totally stupid or if it could give another idea 
to someone else.
>
>
>On 2023-12-15, Ploum wrote:
>>- through a mechanism yet to be invented, allowing people to share the
>>cache of a given domain without connecting to the Internet.
>
>The "without connecting to the Internet" part already works. Git allows
>remote URLs to point to the filesystem so you could pull from e.g.
>/media/flash-drive/example.com. The only problem here is that each user
>would have a different git commit history. Merging those requires using
>the --allow-unrelated-histories flag but will probably work fine most of
>the time.

I didn’t know about --allow-unrelated-histories. It seems really 
interesting.

Also a good read by Solderpunk:
gemini://zaibatsu.circumlunar.space/~solderpunk/gemlog/low-budget-p2p-content-distribution-with-git.gmi
>
>
>On 2023-12-15, Ploum wrote:
>>- Lot of complexity.
>
>If I'm not mistaken, updating the cache would be as simple as:
>
>	cd ~/.cache/offpunk/gemini/example.com
>	git add .
>	git commit -m 'offpunk sync'
>
>Of course an extra (optional) dependency is extra complexity by itself.
>The most complex functionality would be exposing the git history in the
>client interface to allow easily viewing previous versions of a page.

Git is such a powerful tool. But also a very complex one. A strong 
requirement would be to not force users to learn git.

I believe that "history" is an advanced feature but the most interesting 
feature would be, for me, merging your cache with someone else’s cache. 
This should be done with constraints:

- always keep versions of a page you encountered yourself
- always keep the correct chronological order (and how could it possibly be 
   guessed?)
- keep the resulting cache below a certain size (at some point, size 
   will start to become a problem)


Those are all very very interesting problems.
>
>
>On 2023-12-15, Ploum wrote:
>>- Increase in cache storage.
>
>This would be especially problematic for changing binary files.
>Incremental updates to text files however should be quite efficient. I'd
>like to run the git-based-cache for some time and get some actual
>statistics.
>
>On 2023-12-15, Ploum wrote:
>>- Very hard to remove permanently part of the cache to save space.
>
>It's possible to rewrite git history and squash all commits into a
>single one, erasing old versions of files. Running git gc after that
>should reduce disk usage.

Yes, this is definitely possible. The question is how to do it properly. 
But that’s a nice-to-have problem.

I mean, if the most pressing problem is the size of the cache, it means 
that we have solved all other problems *and* that we consider the git 
experiment useful and that offpunk should migrate to a git-based cache.

Which is not for the short term but quite exciting!

Thanks for replying with such nice ideas.