~lioploum/offpunk-devel


Cache format specification

Details
Message ID
<Y8qk1jr52yGcnJSO@junk>
I'm looking into implementing an offpunk-compatible page cache for my 
Gemini client for old Kindles and paid some more attention to the 
offpunk cache format. From my understanding the current cache paths look 
like this:

<scheme>/<hostname>/<path split on slashes>/<query split on slashes>

e.g. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
gemini/kennedy.gemi.dev/search/foo/bar
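
A minimal Python sketch of the mapping as I understand it. The function
name current_cache_path is made up for illustration and this is not
offpunk's actual code; the index-file handling is left out.

```python
from urllib.parse import urlsplit


def current_cache_path(url: str) -> str:
    """Map a URL to a cache path following the layout described above.

    Hypothetical helper: it mirrors the description in this message,
    not offpunk's implementation.
    """
    parts = urlsplit(url)
    # Port and userinfo are dropped: only the bare hostname is kept.
    components = [parts.scheme, parts.hostname or ""]
    # Path and query are both split on slashes and simply concatenated.
    components += [c for c in parts.path.split("/") if c]
    components += [c for c in parts.query.split("/") if c]
    return "/".join(components)


print(current_cache_path("gemini://kennedy.gemi.dev:1966/search?foo/bar"))
# prints gemini/kennedy.gemi.dev/search/foo/bar
```

With this sketch, gemini://kennedy.gemi.dev/search/foo/bar (no port, no
query) yields exactly the same path.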


# Non-default port numbers are lost
Non-default port numbers aren't saved in the path, so URLs that differ
only in their port map to the same cache path. It can be an issue if a
host is serving different content over different ports on the same
scheme, although this doesn't sound common.


# The query is ambiguous
The query component can't be differentiated from a path component, 
making the mapping from path to URL non-unique.


# The query is needlessly split
I think the query component should always be a single file. A query for
foo/bar is completely distinct from a query for foo/baz, so there's no
need for them to share a directory. Of course the / must be quoted in
this case.


# The userinfo component is lost
Although Gemini disallows the userinfo component and Gopher also seems 
to not support it, it's a thing in http and https.


A first proposal, assuming a UNIX-like filesystem where all characters 
apart from slash (/) and null (0x00) are allowed in filenames (this is a 
big assumption, ignores shell meta-characters, won't work on Windows 
etc.):

<scheme>/<username>@<hostname>:<port>/<path>/?<query>


<username>@ omitted if no userinfo component is present. Question marks 
(?) are percent-encoded to %3F. The password part of the userinfo
component should never be part of the cache path.

:<port> omitted if the default port for the scheme is used.

<path> split on slashes as now and with question marks (?) 
percent-encoded to %3F. Question marks are used to detect if a query 
component is present. The directory to index file translation happens 
the same as now.

?<query> omitted if no query component is present. Slashes (/) in the 
query are percent-encoded to %2F. The query component should be the only 
file that starts with a question mark (it's not allowed in the scheme or 
hostname).

The fragment component is ignored as it is now.

e.g. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
gemini/kennedy.gemi.dev:1966/search/?foo%2Fbar

This should allow a unique mapping from URL to path and back.
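
A rough Python sketch of the proposal in both directions, to make it
concrete. The names proposed_cache_path, url_from_cache_path and
DEFAULT_PORTS are made up for illustration, the list of default ports
is an assumption, and the directory to index file translation is left
out.

```python
from urllib.parse import urlsplit

# Well-known default ports; which schemes to list here is an assumption.
DEFAULT_PORTS = {"gemini": 1965, "gopher": 70, "finger": 79,
                 "http": 80, "https": 443}


def proposed_cache_path(url: str) -> str:
    """URL -> cache path under the proposed layout (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.username:
        # Only the username is kept; the password never reaches the path.
        host = parts.username.replace("?", "%3F") + "@" + host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host += f":{parts.port}"
    components = [parts.scheme, host]
    components += [c.replace("?", "%3F") for c in parts.path.split("/") if c]
    if parts.query:
        # The whole query becomes a single file name starting with "?".
        components.append("?" + parts.query.replace("/", "%2F"))
    return "/".join(components)


def url_from_cache_path(path: str) -> str:
    """Cache path -> URL, the reverse direction (hypothetical helper)."""
    scheme, host, *rest = path.split("/")
    query = ""
    if rest and rest[-1].startswith("?"):
        # Note: a query that already contained a literal "%2F" would not
        # round-trip; the proposal as written does not cover that case.
        query = rest.pop().replace("%2F", "/")
    return f"{scheme}://{host}/" + "/".join(rest) + query


url = "gemini://kennedy.gemi.dev:1966/search?foo/bar"
path = proposed_cache_path(url)
print(path)                       # gemini/kennedy.gemi.dev:1966/search/?foo%2Fbar
print(url_from_cache_path(path))  # gemini://kennedy.gemi.dev:1966/search?foo/bar
```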


Thoughts?

All the best,
Sotiris
Details
Message ID
<Y8rrp5b9QrAkX7R0@sysrq.in>
In-Reply-To
<Y8qk1jr52yGcnJSO@junk> (view parent)
I vouch for using an SQLite-powered database (it's in the standard
library).
Details
Message ID
<Y8r76ir3GuOQxsOr@junk>
In-Reply-To
<Y8rrp5b9QrAkX7R0@sysrq.in> (view parent)
On 2023-01-21, Anna (cybertailor) Vyalkova wrote:
>I vouch for using an SQLite-powered database (it's in the standard
>library).

This is probably the best solution for a robust, general-purpose
archiver. However, I don't think offpunk strives to be that, and I must
say I really like being able to browse and search through the archive
without specialized tools. I could live with some rare edge cases.

All the best,
Sotiris
Details
Message ID
<167432466346.6.4883081606761446069.93546095@ploum.eu>
In-Reply-To
<Y8qk1jr52yGcnJSO@junk> (view parent)
On 23/01/20 04:27, Sotiris Papatheodorou - sotiris at papatheodorou.xyz wrote:
>I'm looking into implementing an offpunk-compatible page cache for my
>Gemini client for old Kindles and paid some more attention to the
>offpunk cache format. From my understanding the current cache paths
>look like this:
>
Hey,

I will not dig into the technical discussion now but want to point out a
few choices I made and explain my vision.


Firstly, a very hard requirement is to use the filesystem without a
database. I really want to keep that requirement because it is a lot
more in line with the minimalist philosophy and allows the cache to be
human readable.

The cache should be usable with cd/cat/find/grep. So I’m really against
SQLite or anything like that (that was one key principle I had in mind
when I started offpunk).

Another long-term goal is for a cache to be mergeable/synchronizable
with another cache (think two users connecting their laptops to share
their caches).


Another point I’ve briefly stated is that I would like to create a
standalone tool called "netcache" that would handle the cache. This
would be a UNIX tool where you could request any URL and it would serve
you the content or mark the content as "to-be-fetched". A kind of wget
or curl with a cache. That would completely abstract the notion of
cache and network from Offpunk, which would only focus on displaying
content. I wanted to create a prototype in Python (extracting code from
Offpunk) and, if the concept is good, potentially rewrite it in Rust.

Doing that requires, of course, thinking about the format of the cache,
which is a really interesting discussion.

So my first question to you, Sotiris, is:

Would you prefer to collaborate on that "netcache" tool that could be
used by your own Gemini client or would you need to rewrite the whole
code anyway?


Now, before giving a detailed answer to your mail, I just want to point
out that, yeah, I made some mistakes as I didn’t think of all the edge
cases. One key error was the handling of errors themselves (which should
be in a different folder, not in the content folder). Now, the problem
is how to handle backward compatibility if we change anything.

An important point to note: despite all its shortcomings, the current
implementation works surprisingly well! A lot of ad-hoc fixes were made
during the first months and, since then, I’ve yet to encounter a use case
which is broken because of the definition of the cache (I mean:
encountering it in real life). So I would be really careful before
changing anything. If it looks dumb but it works, it is probably not so
dumb after all ;-)
Details
Message ID
<167438664067.9.14636745893120331843.93665846@ploum.eu>
In-Reply-To
<Y8qk1jr52yGcnJSO@junk> (view parent)
On 23/01/20 04:27, Sotiris Papatheodorou - sotiris at papatheodorou.xyz wrote:
>I'm looking into implementing an offpunk-compatible page cache for my
>Gemini client for old Kindles and paid some more attention to the
>offpunk cache format. From my understanding the current cache paths
>look like this:
>
><scheme>/<hostname>/<path split on slashes>/<query split on slashes>
>
>i.e. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
>gemini/kennedy.gemi.dev/search/foo/bar
>

Please see my previous answer for background. Here, I will only answer
individual elements.

>
># Non-default port numbers are lost
>Non-default port numbers aren't saved in the path making the mapping
>from URL to path non-unique. It can be an issue if a host is serving
>different content over different ports on the same scheme, although
>this doesn't sound common.
>

Indeed. I didn’t even think of this use case. I assumed that every
hostname has only one server per protocol. I don’t think that I’ve ever
seen a host serve different content for the same URL over the same
protocol on two different ports.

While this is theoretically possible, I find it very implausible and
quite dumb. The only use case would be to purposely confuse a user by
serving different content at what looks like the same URL.

Could we assume that this use case is particularly malicious?

A simple workaround, anyway, would be to save the port in the path only
when it is not the standard one.

>
># The query is ambiguous
>The query component can't be differentiated from a path component,
>making the mapping from path to URL non-unique.
>

Yes, I have known that since the start. But I wanted a simple filesystem
mapping, so I started with that, figuring we would see how it works.

And it works surprisingly well, so I didn’t see a reason to change it.

>
># The query is needlessly split
>I think the query component should always be a single file. A query
>for foo/bar is completely distinct from a query for foo/baz, there's
>no need for them to share a directory. Of course the / must be quoted
>in this case.
>

Your point is valid.

On the other hand: quoting is hard and makes the path a lot more complex
when browsing "by hand". Splitting the query is also quite intuitive
(even if it has no logical foundation). It also keeps folder names
readable and shorter (which might be a requirement for some
filesystems).

I don’t know if it is worth the effort to change the current behaviour
as long as we don’t identify real-life problems.

>
># The userinfo component is lost
>Although Gemini disallows the userinfo component and Gopher also seems
>to not support it, it's a thing in http and https.

Indeed. Something which has mostly disappeared because of cookie-based
identification.
>
>
>A first proposal, assuming a UNIX-like filesystem where all characters
>apart from slash (/) and null (0x00) are allowed in filenames (this is
>a big assumption, ignores shell meta-characters, won't work on Windows
>etc.):
>
><scheme>/<username>@<hostname>:<port>/<path>/?<query>
>
>
><username>@ omitted if no userinfo component is present. Question
>marks are (?) percent-encoded to %3F. The password part of the
>userinfo component should never be part of the cache path.

For browsability, I would put the username as a subfolder of the
hostname. This is only a suggestion and the use case should be rare
enough that we should not care too much about it.

>
>:<port> omitted if the default port for the scheme is used.

Exactly.

>
><path> split on slashes as now and with question marks (?)
>percent-encoded to %3F. Question marks are used to detect if a query
>component is present. The directory to index file translation happens
>the same as now
>
>?<query> omitted if no query component is present. Slashes (/) in the
>query are percent-encoded to %2F. The query component should be the
>only file that starts with a question mark (it's not allowed in the
>scheme or hostname).
>
>The fragment component is ignored as it is now.
>
>i.e. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
>gemini/kennedy.gemi.dev:1966/search/?foo%2Fbar
>
>This should allow a unique mapping from URL to path and back.
>
>
>Thoughts?

Important points for me:

1. Backward compatibility. Implementing a change should have a minimal
impact on the current cache and should be transparent for the user. I
think your proposal will only require redownloading a few URLs (those
with a non-standard port and those with queries with several arguments).
I guess that this is an acceptable trade-off.


2. I’m more and more inclined to develop this as a standalone tool.
Solarpunk had the idea of making this kind of tool as a cache proxy,
while I strongly prefer a UNIX tool and, potentially, an importable
library. I guess those three solutions are complementary.


Thanks for bringing up the discussion, it is a very important one!


PS: for professional reasons, I will not be able to start coding
netcache for the next two months. I hope to have finished my
professional project by then and to split the offpunk source code.

>
>All the best,
>Sotiris



--
Ploum - Lionel Dricot
Blog: https://www.ploum.net
Livres: https://ploum.net/livres.html
Details
Message ID
<Y9LGF4a2XeCBJ5xr@junk>
In-Reply-To
<167438664067.9.14636745893120331843.93665846@ploum.eu> (view parent)
offpunk %s:^(finger|gemini|gopher|spartan):\/\/.+
transmission-gtk %s:^magnet:.+
Details
Message ID
<Y9LvBFDsuNNakpwB@junk>
In-Reply-To
<167438664067.9.14636745893120331843.93665846@ploum.eu> (view parent)
Sorry for the previous garbage email.

On 2023-01-20, Ploum wrote:
>Now, before giving a detailed answer to your mail, just want to point 
>out that, yeah, I did some mistake as I didn't think of all the edge 
>cases.

I didn't mean my email to be a criticism of the current cache. Knowing
you're planning on implementing a separate netcache tool, I thought it
would be best to figure out the cache format sooner rather than later
since it would potentially be a big breaking change.


On 2023-01-20, Ploum wrote:
>Would you prefer to collaborate on that "netcache" tool that could be 
>used by your own gemini client or would you need to rewrite the whole 
>code anyway?

My Gemini client is written in Go because it makes it easy to
cross-compile for the Kindle. I won't be able to easily use a library
written in Python or Rust, so I'm planning on implementing a similar
netcache library in Go. However, I'll probably contribute to the Python
version since I'm using offpunk on my other computers. I also think
having two implementations of the netcache will help find edge cases and
make the cache more portable.


On 2023-01-20, Ploum wrote:
>One key error was the handling of error themselves (which should be in 
>a different folder, not in the content folder).

Agreed, it would be best to separate error messages from the cache.


On 2023-01-20, Ploum wrote:
>Now, the problem is to handle potential back-compatibility if we change 
>anything.

I think the only change in the cache that could be automated with a 
script would be the URL/QP encoding of certain characters if 
implemented.

Quoting certain characters will be necessary to support certain 
filesystems (e.g. FAT32, NTFS) or operating systems (although I must say 
I don't care much about Windows or MacOS). For the Kindle client I'll 
have to save the cache on a FAT32 filesystem which seems to disallow 
several characters (" * / : < > ? \ | + , . ; = [ ]). I'll implement 
URL-encoding of filenames just for this client if this isn't desired for 
netcache.
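
A minimal Python sketch of what such filename quoting could look like
(the Kindle client itself is in Go). The names FORBIDDEN and
quote_component are made up, the character set is the list above taken
at face value, and "%" is added by this sketch so the encoding can be
reversed.

```python
from urllib.parse import unquote

# Characters listed above as disallowed on FAT32, plus "%" itself so
# that decoding is unambiguous (the "%" is an addition of this sketch).
FORBIDDEN = set('"*/:<>?\\|+,.;=[]%')


def quote_component(name: str) -> str:
    """Percent-encode one path component (hypothetical helper)."""
    # Every forbidden character is ASCII, so a two-digit hex code suffices.
    return "".join(f"%{ord(c):02X}" if c in FORBIDDEN else c for c in name)


quoted = quote_component("?foo/bar")
print(quoted)           # %3Ffoo%2Fbar
print(unquote(quoted))  # ?foo/bar
```

Taking the list literally also encodes the dots in hostnames, which
hurts readability, so the exact set of characters is worth checking on
the target filesystem.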


On 2023-01-22, Ploum wrote:
>Could we assume that this usecase is particularly malicious?
>
>A simple workaround would be anyway to save the port in the path only 
>in the case that the port is not the standard one.

Agreed on both points.


On 2023-01-22, Ploum wrote:
>For browsability purpose, I would put the username as a subfolder of 
>hostname. This is only a suggestion and the usecase should be rare 
>enough that we should not care too much about it.

I don't have a strong opinion on this, but putting the username in a
subdirectory of the hostname will make it indistinguishable from the
first path component. Whether that matters depends on my next point.

One important point to decide on, I think, is whether the cache format
should allow uniquely mapping a path to a URL. It seems there hasn't been
a need so far, so probably not. But maybe there's another use case that
will benefit from it.


On 2023-01-22, Ploum wrote:
>PS: for professional reasons, I will not be able to start coding 
>netcache for the next two months. Hope to have finished my professional 
>project then and split offpunk source code.

No worries, I'll also be rather busy in the coming months.

All the best,
Sotiris
Details
Message ID
<ZHDZqJjXrNHqbCQv@sax>
In-Reply-To
<Y8qk1jr52yGcnJSO@junk> (view parent)
Sotiris Papatheodorou (2023-01-20 16:27:34 +0200) wrote:

> I'm looking into implementing an offpunk-compatible page cache for my Gemini
> client for old Kindles and paid some more attention to the offpunk cache
> format. From my understanding the current cache paths look like this:
> 
> <scheme>/<hostname>/<path split on slashes>/<query split on slashes>
> 
> i.e. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
> gemini/kennedy.gemi.dev/search/foo/bar
> 
> […] A first proposal, assuming a UNIX-like filesystem where all characters apart
> from slash (/) and null (0x00) are allowed in filenames (this is a big
> assumption, ignores shell meta-characters, won't work on Windows etc.):
> 
> <scheme>/<username>@<hostname>:<port>/<path>/?<query>
> 
> […] i.e. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
> gemini/kennedy.gemi.dev:1966/search/?foo%2Fbar […]

Hi!  Ploum pointed me to this thread a while ago, so I wanted to add my two
cents pointing out some issues that may have been overlooked (or maybe they
are already handled in offpunk's code).

- Is there any kind of URL canonicalization taking place before store/load?
  E.g. `gemini://Foo.bar/x/../baz?` -> `gemini://foo.bar/baz`.

- How are folder components handled?  E.g. `gemini://foo.bar/log` yields
  `text/gemini` content (like a list of log entries), and
  `gemini://foo.bar/log/entry.gmi` does too.
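
To illustrate what I mean by canonicalization, here is a minimal Python
sketch. The function name `canonicalize` is made up, this is not taken
from offpunk, and it ignores IPv6 literals and percent-encoding
normalization.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Return a normalized form of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep any userinfo as-is, lowercase the host, keep an explicit port.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + (parts.hostname or "")
    if parts.port is not None:
        netloc += f":{parts.port}"
    path = parts.path
    if path:
        # Collapse "." and ".." segments, preserving a trailing slash.
        path = posixpath.normpath(path)
        if parts.path.endswith("/") and path != "/":
            path += "/"
    # An empty query ("?" with nothing after it) and the fragment are dropped.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


print(canonicalize("gemini://Foo.bar/x/../baz?"))  # gemini://foo.bar/baz
```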

Besides that, I find Sotiris's proposals quite sensible, and some character
quoting may be unavoidable to support some (FAT) filesystems even on Unix
devices, as pointed out.

Cheers!

-- 
Ivan Vilata i Balaguer -- https://elvil.net/
Details
Message ID
<168534981678.7.5942434265369833939.135258097@ploum.eu>
In-Reply-To
<ZHDZqJjXrNHqbCQv@sax> (view parent)
On 23/05/26 06:09, Ivan Vilata i Balaguer - ivan at selidor.net wrote:
>Sotiris Papatheodorou (2023-01-20 16:27:34 +0200) wrote:
>
>> I'm looking into implementing an offpunk-compatible page cache for my Gemini
>> client for old Kindles and paid some more attention to the offpunk cache
>> format. From my understanding the current cache paths look like this:
>>
>> <scheme>/<hostname>/<path split on slashes>/<query split on slashes>
>>
>> i.e. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
>> gemini/kennedy.gemi.dev/search/foo/bar
>>
>> […] A first proposal, assuming a UNIX-like filesystem where all characters apart
>> from slash (/) and null (0x00) are allowed in filenames (this is a big
>> assumption, ignores shell meta-characters, won't work on Windows etc.):
>>
>> <scheme>/<username>@<hostname>:<port>/<path>/?<query>
>>
>> […] i.e. gemini://kennedy.gemi.dev:1966/search?foo/bar becomes
>> gemini/kennedy.gemi.dev:1966/search/?foo%2Fbar […]
>
>Hi!  Ploum pointed me to this thread a while ago, so I wanted to add my two
>cents pointing out some issues that may have been overlooked (or maybe they
>are already handled in offpunk's code).
>
>- Is there any kind of URL canonicalization taking place before store/load?
>  E.g. `gemini://Foo.bar/x/../baz?` -> `gemini://foo.bar/baz`.

Other than making the protocol a folder, there’s no intelligence. So
gemini://Foo.bar/x/../baz becomes gemini/Foo.bar/baz (because the path
is treated as a filesystem path, ".." is automatically taken into
account).
>
>- How are folder components handled?  E.g. `gemini://foo.bar/log` yields
>  `text/gemini` content (like a list of log entries), and
>  `gemini://foo.bar/log/entry.gmi` does too.

The whole thing is done by the "get_cache_path()" method. 

As for folders, the rule is a pure hack. Each protocol has a hardcoded
"index" file name: http -> index.html, gemini -> index.gmi,
gopher -> gophermap, finger -> index.txt.

By default, it’s "index.gmi".

If a URL ends with a "/", the index for the protocol is appended.

gemini://Foo.bar/x/ -> gemini/Foo.bar/x/index.gmi
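
As a rough illustration of that rule (a sketch, not the actual
get_cache_path() code; the names INDEX_NAMES, DEFAULT_INDEX and
add_index are made up):

```python
# Index file names per scheme, as listed above. Only the schemes named
# in this message are included; anything else falls back to the default.
INDEX_NAMES = {
    "http": "index.html",
    "gemini": "index.gmi",
    "gopher": "gophermap",
    "finger": "index.txt",
}
DEFAULT_INDEX = "index.gmi"


def add_index(scheme: str, cache_path: str) -> str:
    """Append the scheme's index file name to a path ending in "/"."""
    if cache_path.endswith("/"):
        return cache_path + INDEX_NAMES.get(scheme, DEFAULT_INDEX)
    return cache_path


print(add_index("gemini", "gemini/Foo.bar/x/"))  # gemini/Foo.bar/x/index.gmi
```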

Now, there’s the case where you may have previously accessed the URL
gemini://Foo.bar/x (and thus "x" was considered a file, not a
folder).

Well, in that case, the "x" file is simply deleted and an "x" folder is
created.

I never took the time to keep the "x" file and move it to "x/index.gmi"
because that would mess with the cache timestamp (it would look like the
content was accessed recently even if that is not the case). And I
received exactly 0 complaints about that behaviour, so I thought "if
it’s broken but nobody notices it, better not fix it" ;-)
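
A simplified sketch of that file-to-folder replacement, for anyone
reimplementing it. This is not the actual offpunk code and the name
write_cache_file is made up:

```python
import tempfile
from pathlib import Path


def write_cache_file(path: Path, content: bytes) -> None:
    """Write `content` to `path`, replacing any file that blocks a folder.

    Hypothetical helper: if an ancestor of `path` exists as a regular
    file (the "x" case above), it is deleted so that a directory of the
    same name can be created. Its content is lost.
    """
    for ancestor in path.parents:
        if ancestor.is_file():
            ancestor.unlink()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    # First visit: gemini://Foo.bar/x is stored as a plain file.
    write_cache_file(cache / "gemini" / "Foo.bar" / "x", b"old page")
    # Later visit: gemini://Foo.bar/x/ needs "x" to be a directory.
    write_cache_file(cache / "gemini" / "Foo.bar" / "x" / "index.gmi", b"index")
    print((cache / "gemini" / "Foo.bar" / "x").is_dir())  # True
```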

>
>Besides that, I find Sotiri's proposals quite sensible, and some character
>quoting may be unavoidable to support some (FAT) filesystems even on Unix
>devices, as pointed out.
>
>Cheers!
>
>-- 
>Ivan Vilata i Balaguer -- https://elvil.net/



-- 
Ploum - Lionel Dricot
Blog: https://www.ploum.net
Livres: https://ploum.net/livres.html
Maeve Sproule <code@sprock.dev>
Details
Message ID
<59038d66-c92e-fbfd-14a1-48740af99e6e@sprock.dev>
In-Reply-To
<168534981678.7.5942434265369833939.135258097@ploum.eu> (view parent)
On 29/05/2023 02.43, Ploum wrote:
> On 23/05/26 06:09, Ivan Vilata i Balaguer - ivan at selidor.net wrote:
>> - Is there any kind of URL canonicalization taking place before store/load?
>>   E.g. `gemini://Foo.bar/x/../baz?` -> `gemini://foo.bar/baz`.
> 
> Other than making the protocol a folder, there’s no intelligence. So
> gemini://Foo.bar/x/../baz becomes gemini/Foo.bar/baz"  (because the path
> is considered as a path so ".." are automatically taken into account.

A quick test shows that the hostname does in fact seem to be 
canonicalized, either by urllib.parse or by the IDN handling code.

As to the behaviour of "..", it is probably a good idea to at least 
check that the path is within the expected cache directory (after OS 
path/symlink resolution), since the current code allows links with ".." 
to cause offpunk to write arbitrary files outside the cache. I don't 
know how much free time I have right now to work on this, but I can 
write up a patch to add this check to offpunk some time in the next few 
days, if you'd like.
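
For reference, a minimal sketch of the kind of check I have in mind.
This is not the patch itself, and the names is_inside_cache, cache_root
and relative_path are only illustrative:

```python
import os
import tempfile


def is_inside_cache(cache_root: str, relative_path: str) -> bool:
    """Return True if `relative_path` resolves to a location under `cache_root`.

    Hypothetical helper, not part of offpunk.
    """
    root = os.path.realpath(cache_root)
    # realpath collapses ".." and follows symlinks before the comparison.
    target = os.path.realpath(os.path.join(root, relative_path))
    try:
        return os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised e.g. on Windows when the paths are on different drives.
        return False


with tempfile.TemporaryDirectory() as cache:
    print(is_inside_cache(cache, "gemini/foo.bar/baz"))                  # True
    print(is_inside_cache(cache, "gemini/foo.bar/../../../etc/passwd"))  # False
```

A write would then only go ahead when the check returns True.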

-- 
sprock