~bitfehler/m2dir

Thoughts

Details
Message ID
<17c32401239f7b0b.f35d3f6599202235.cdc008ecf268f5a1@mba>
As discussed, here are some of my thoughts on the spec: After having 
read through the spec, I generally think this is an improvement on 
Maildir in almost every way. I feel like you've hit a sweet spot between 
keeping things simple while still allowing some flexibility in the spec. 

> Each message is a file, with an immutable name

Why immutable? I get that the hash should be immutable, but is there a
good reason for requiring this for the human-readable (or any other) 
part as well?

> Depending on the setup, an m2store root may itself be an m2dir.

Given that this is not allowed when mirroring IMAP (as you
explain further down), is there a good reason for allowing this? I'm
thinking it might simplify implementations if an m2store is not allowed
to be an m2dir itself.

> An m2store root must contain an empty file .m2dir.root to enable 
> discovery by other m2dir-compatible applications.

How about `.m2store` instead, to stick with the terminology?

> Every file in the m2dir represents an email.

What happens if the file does not comply with the filename rules?

> Email metadata (such as flags) is stored in a subdirectory .meta [...]

This is very clever, but I'm a bit worried about the number of filesystem
operations this would require. It's probably not a big deal, but are
there any more performant options? I know arbitrary metadata can be
attached to files, but that might only be supported on ext3, ext4, etc.
An alternative would be to also allow Maildir-type metadata as part of 
the path for e.g. known flags, but that might just make everything more
messy. Just brainstorming here.

> Each type of metadata is stored in its own file, following the naming 
> convention:
>
> `.meta/.<UNIQUE_ID>.<EXTENSION>`

Why are we hiding files within the `.meta` folder? I can imagine 
people wanting to manually check some metadata in their m2dir, running 
`ls <m2dir>/.meta` and being confused that nothing gets listed.

> 1. Check if target directory is a valid m2dir (contains .m2dir marker 
>    file)
>    - If yes, deliver message to this directory
>
> 2. Check if target directory is a valid m2store (contains .m2dir.root 
>    marker file)
>    - If no, abort delivery with error
>
> 3. Check if target directory contains an entry .delivery that is valid 
>    according to the rules described in the [Default 
>    Target][#default-target] section.
>    - If yes, deliver the mail to the target specified by the .delivery 
>      entry
>    - If no, abort delivery with error

This was a bit confusing to me. I'm guessing this list is meant to
short-circuit? If so, I think we should add "If no, proceed to next 
step" and "If yes, proceed to next step" to steps 1 and 2, respectively.

> This temporary file's name must start with a period (.) in order to be 
> ignored by compliant applications.

Maybe the spec should require some uniqueness strategy as well, to avoid
multiple m2dir-supporting applications writing something to the same
path simultaneously? This could probably be something very simple like
checking if a file exists already and increasing a counter if it does.

> <DATE> is the date from the email's Date header in the following format:
>
> `YYYY-MM-DD_hh:mm`

I think RFC 3339 up to seconds should be used here. The `T` isn't that
difficult to ignore, and the seconds would help sort emails that appear
within the same minute. 

Touching on your point below that the date should be naïve and in the 
user's local timezone: I think since we're at the point where the spec
recommends a standard format for the human-centric part, it should
include timezone information. Some applications might begin depending on 
that part for extracting the date, and so a naïve date will definitely 
become confusing. RFC 3339 requires the timezone, so with it there 
should be no ambiguities.
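
To make the two candidate formats concrete, here is a minimal sketch that renders one `Date` header both ways. The function names and the sample header are hypothetical and used only for illustration; the draft layout is the `YYYY-MM-DD_hh:mm` quoted above, converted to a time zone the caller passes in (standing in for "the user's local timezone").

```python
from datetime import timedelta, timezone, tzinfo
from email.utils import parsedate_to_datetime


def draft_date(date_header: str, local_tz: tzinfo) -> str:
    """Naive date in the draft layout YYYY-MM-DD_hh:mm, shifted to local_tz."""
    # Note: a header with a "-0000" zone parses to a naive datetime, which
    # astimezone() would then interpret as system-local time.
    parsed = parsedate_to_datetime(date_header)
    return parsed.astimezone(local_tz).strftime("%Y-%m-%d_%H:%M")


def rfc3339_date(date_header: str) -> str:
    """RFC 3339 timestamp with seconds and the sender's UTC offset."""
    return parsedate_to_datetime(date_header).isoformat(timespec="seconds")


# Illustrative header, not taken from a real message.
header = "Thu, 04 Apr 2024 19:31:05 +0200"
print(draft_date(header, timezone(timedelta(hours=-5))))  # 2024-04-04_12:31
print(rfc3339_date(header))  # 2024-04-04T19:31:05+02:00
```

Two messages sent within the same minute produce identical draft dates but distinct RFC 3339 strings, which is the sorting argument made above.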

---

All in all, I'm super pumped about this format! I can't wait to start
using it.

-- 
Magnus
Details
Message ID
<1cc7bc55-4032-47e2-a1c5-bab005acf689@bitfehler.net>
In-Reply-To
<17c32401239f7b0b.f35d3f6599202235.cdc008ecf268f5a1@mba>
Hey Knut,

thanks for sharing your feedback, some excellent questions in there :)

As is customary, I'll try to address them inline:

On 4/4/24 7:31 PM, Knut Magnus Aasrud wrote:
> As discussed, here are some of my thoughts on the spec: After having 
> read through the spec, I generally think this is an improvement on 
> Maildir in almost every way. I feel like you've hit a sweet spot between 
> keeping things simple while still allowing some flexibility in the spec.
> 
>> Each message is a file, with an immutable name
> 
> Why immutable? I get that the hash should be immutable, but is there a
> good reason for requiring this for the human-readable (or any other) 
> part as well?

Indeed, this could probably be phrased differently. The intention was to 
make clear that "standard" operations on messages and/or m2dirs never 
modify the filename (unlike maildir, where setting flags changes the 
filename). This is important for tools maintaining an index on top of 
the storage (think notmuch).

The spec is intended to give enough leeway that changing the 
human-readable part is possible, but I would say it should be 
discouraged. At the very least, it would always have to happen with 
explicit consent from the user, so that any index or such could be re-built.

What I envisioned was something like: a user comes up with an 
alternative convention for the human-readable part. They want to test it 
and are aware that renaming will break any index they may have. They 
rename all files, update or rebuild any indices, and continue.

Other than that, I think renaming _should_ not happen. Since you're 
asking about this, you probably have some use case in mind, though, 
right? I'd be curious to learn what that is :)

>> Depending on the setup, an m2store root may itself be an m2dir.
> 
> Given that this is not allowed when mirroring IMAP (as you
> explain further down), is there a good reason for allowing this? I'm
> thinking it might simplify implementations if an m2store is not allowed
> to be an m2dir itself.

Good point, not sure. I initially disallowed it, however the early 
versions were much too IMAP-centric. I started allowing it when adding 
email delivery. I then added the "default delivery target" stuff, which 
takes care of this to some extent. I am leaning towards still allowing 
it for the following reasoning, which I admit... might be semi-sound:

* Maybe there will be a remote mail store (not IMAP) at some point that
   has one true root for its hierarchy, at which point everyone would
   like to make that root their "~/Mail", and not "~/Mail/Inbox"
* This is of course purely speculative, but in practice (read: IMAP) the
   spec makes it clear that you should not do it, but stops short of
   nailing this door shut in case of future developments.

Thin, I know. I'd be curious, though: do you think the benefit would be 
that great? I'd be very understanding of any tool that for example 
simply ignored the root even if it were an m2dir. From my experience, 
the major pain was more to determine what the actual INBOX is (and all 
the various cases that showed up in practice). In this case it would be 
clear that the root is not the INBOX, but the folder INBOX is. I think 
as long as the ambiguity is eliminated, handling this should be acceptable?

>> An m2store root must contain an empty file .m2dir.root to enable 
>> discovery by other m2dir-compatible applications.
> 
> How about `.m2store` instead, to stick with the terminology?

Agreed.

>> Every file in the m2dir represents an email.
> 
> What happens if the file does not comply with the filename rules?

Indeed, I missed specifying this. I'd say it's a violation of the spec 
and therefore should be an error. Unless you have compelling reasons for 
allowing this?

>> Email metadata (such as flags) is stored in a subdirectory .meta [...]
> 
> This is very clever, but I'm a bit worried about the number of filesystem
> operations this would require. It's probably not a big deal, but are
> there any more performant options? I know arbitrary metadata can be
> attached to files, but that might only be supported on ext3, ext4, etc.
> An alternative would be to also allow Maildir-type metadata as part of 
> the path for e.g. known flags, but that might just make everything more
> messy. Just brainstorming here.

Ah, yes :)

I tried various things. The maildir "flags-in-filename" stuff is 
fundamentally incompatible with arbitrary flags/keywords, which I really 
want to support. Also, changing the filename on flag changes can be 
annoying for indices on top of all this (see first answer).

I actually did one implementation that used POSIX extended attributes (I 
assume that's the "arbitrary metadata" you are referring to). It is of 
course faster, for sure. However, I tested a folder with several 
thousand mails in it, and the "separate file in .meta/" variant was 
still totally fast enough.

And extended attributes come with a _lot_ of drawbacks: making a backup 
of your emails? Better be very careful that your tool honors them [1]! 
Backing up to S3? No dice.

[1] 
https://wiki.archlinux.org/title/Extended_attributes#Preserving_extended_attributes

And, last but not least, it would exclude a lot of OS/filesystems. So, 
yes, this was considered, but consciously decided against.
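
As a rough illustration of the "separate file in `.meta/`" variant, the sketch below stores flags next to a message without ever touching the message's own filename. Only the path convention `.meta/.<UNIQUE_ID>.<EXTENSION>` comes from the draft quoted above; the `flags` extension, the one-flag-per-line content, and all function names are assumptions made for this example.

```python
import os


def meta_path(m2dir: str, unique_id: str, extension: str) -> str:
    # Draft convention quoted in this thread: .meta/.<UNIQUE_ID>.<EXTENSION>
    return os.path.join(m2dir, ".meta", f".{unique_id}.{extension}")


def write_flags(m2dir: str, unique_id: str, flags: set[str]) -> None:
    """Store arbitrary flags/keywords for one message (assumed layout)."""
    os.makedirs(os.path.join(m2dir, ".meta"), exist_ok=True)
    path = meta_path(m2dir, unique_id, "flags")  # "flags" is an assumed extension
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        # ASSUMPTION: one flag per line; the thread does not define the content.
        f.write("".join(f"{flag}\n" for flag in sorted(flags)))
    os.replace(tmp, path)  # swap the new file in atomically


def read_flags(m2dir: str, unique_id: str) -> set[str]:
    """Return the stored flags, or an empty set if none were written."""
    try:
        with open(meta_path(m2dir, unique_id, "flags"), encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()
```

Because these are ordinary files, a plain recursive copy or an object-store upload carries the metadata along, which is the backup argument made above against extended attributes.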

>> Each type of metadata is stored in its own file, following the naming 
>> convention:
>>
>> `.meta/.<UNIQUE_ID>.<EXTENSION>`
> 
> Why are we hiding files within the `.meta` folder? I can imagine 
> people wanting to manually check some metadata in their m2dir, running 
> `ls <m2dir>/.meta` and being confused that nothing gets listed.

Agree, and I honestly don't really remember the history of that right 
now. So all for it.

>> 1. Check if target directory is a valid m2dir (contains .m2dir marker 
>>    file)
>>    - If yes, deliver message to this directory
>>
>> 2. Check if target directory is a valid m2store (contains .m2dir.root 
>>    marker file)
>>    - If no, abort delivery with error
>>
>> 3. Check if target directory contains an entry .delivery that is valid 
>>    according to the rules described in the [Default 
>>    Target][#default-target] section.
>>    - If yes, deliver the mail to the target specified by the .delivery 
>>      entry
>>    - If no, abort delivery with error
> 
> This was a bit confusing to me. I'm guessing this list is meant to
> short-circuit? If so, I think we should add "If no, proceed to next 
> step" and "If yes, proceed to next step" to steps 1 and 2, respectively.

Yes, it's meant to short-circuit. I'll think about a better approach for 
phrasing the entire thing, or might just pick up your suggestion.
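
The short-circuit reading of the three steps can be sketched as follows. The marker names are the ones from the draft quoted above. The validity rules for `.delivery` live in the spec's Default Target section and are not reproduced in this thread, so treating it as a symlink to an m2dir is purely an assumption here; `DeliveryError` and `resolve_delivery_target` are hypothetical names.

```python
import os

M2DIR_MARKER = ".m2dir"
M2STORE_MARKER = ".m2dir.root"  # draft name; the thread proposes ".m2store"
DELIVERY_ENTRY = ".delivery"


class DeliveryError(Exception):
    """Raised when no delivery target can be determined."""


def resolve_delivery_target(target: str) -> str:
    """Return the directory a message addressed to `target` should land in."""
    # Step 1: an m2dir takes the message directly, nothing else is checked.
    if os.path.isfile(os.path.join(target, M2DIR_MARKER)):
        return target

    # Step 2: anything that is not an m2store either is an error.
    if not os.path.isfile(os.path.join(target, M2STORE_MARKER)):
        raise DeliveryError(f"{target} is neither an m2dir nor an m2store")

    # Step 3: follow the store's default target.
    # ASSUMPTION: .delivery is a symlink pointing at an m2dir.
    entry = os.path.join(target, DELIVERY_ENTRY)
    resolved = os.path.realpath(entry)
    if os.path.islink(entry) and os.path.isfile(os.path.join(resolved, M2DIR_MARKER)):
        return resolved
    raise DeliveryError(f"{target} has no valid {DELIVERY_ENTRY} entry")
```

Written as early returns, the "proceed to next step" branches are implicit: step 2 is reached only when step 1 said no, and step 3 only when step 2 said yes.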

>> This temporary file's name must start with a period (.) in order to be 
>> ignored by compliant applications.
> 
> Maybe the spec should require some uniqueness strategy as well, to avoid
> multiple m2dir-supporting applications writing something to the same
> path simultaneously? This could probably be something very simple like
> checking if a file exists already and increasing a counter if it does.

I indeed didn't spell it out there, but my implicit understanding was 
that this should use typical means of secure temporary file creation, 
like mkstemp(3), just with the additional constraint of it being a 
dot-file. I suppose this could be elaborated on.

I'll just note that having multiple processes deliver mails to the same 
folder at the same time can still be... difficult. But I decided it is 
not the responsibility of the storage format to handle all the edge 
cases of this... :)
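
For illustration, this is roughly what "mkstemp, but as a dot-file" could look like. It is a sketch under assumptions, not the spec's delivery procedure: `deliver` and `final_name` are hypothetical, the final name is supplied by the caller, and the hard-link step assumes a filesystem that supports `link(2)`.

```python
import os
import tempfile


def deliver(m2dir: str, final_name: str, raw_message: bytes) -> str:
    """Write a message under a hidden temporary name, then publish it."""
    # mkstemp picks a unique name and creates the file exclusively; the "."
    # prefix keeps the partial file invisible to compliant readers.
    fd, tmp_path = tempfile.mkstemp(prefix=".", dir=m2dir)
    final_path = os.path.join(m2dir, final_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(raw_message)
            tmp.flush()
            os.fsync(tmp.fileno())
        # link() raises FileExistsError instead of overwriting an existing
        # message, unlike rename().
        os.link(tmp_path, final_path)
    finally:
        # The temporary name is removed whether or not publishing succeeded.
        os.unlink(tmp_path)
    return final_path
```

This covers name collisions between concurrent writers of temporary files; it does not address the harder multi-process delivery cases mentioned above.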

>> <DATE> is the date from the email's Date header in the following format:
>>
>> `YYYY-MM-DD_hh:mm`
> 
> I think RFC 3339 up to seconds should be used here. The `T` isn't that
> difficult to ignore, and the seconds would help sort emails that appear
> within the same minute.

_Now_ it becomes interesting!

The whole human-readable part is of course subjective to a certain 
extent, and thus hard to reason about. Like, I personally just hate the 
`T` :) I also don't see any value in the seconds, but can accept the 
ordering argument. It was very obvious that there would be a lot of 
opinions on this part, and hence I made clear in the spec that an 
application should allow for some flexibility. Which brings me to the 
next part:

> Touching on your point below that the date should be naïve and in the 
> user's local timezone: I think since we're at the point where the spec
> recommends a standard format for the human-centric part, it should
> include timezone information. Some applications might begin depending on 
> that part for extracting the date, and so a naïve date will definitely 
> become confusing. RFC 3339 requires the timezone, so with it there 
> should be no ambiguities.

That (extracting date from filename) would be a huge no-no in my book. 
It's exactly what I want to avoid, so folks have the freedom to play 
with the human-readable part if they so desire. It even already says 
that "an application must not attempt to parse the human-centric part or 
derive any properties from it." [2]

[2] https://man.sr.ht/~bitfehler/m2dir/#human-centric-part-of-filename

Do you have a specific use-case in mind when bringing this up? If so, 
maybe there could be a better way to accommodate it? What good would the 
date be without any other information?

For example, the `.meta/` folder is intentionally designed to be 
extendable. An application has to parse the date when delivering a mail. 
It could write the exact date someplace in there? But again, why the 
date and not e.g. the subject?

> All in all, I'm super pumped about this format! I can't wait to start
> using it.

Excellent! Thank you so much for this valuable feedback, and I am super 
excited to see what others will do with this!

Give me just a few more days for a rudimentary cleanup and then I'll 
publish the Rust crate that I wrote. It evolved with the spec, so it's a 
little messy, but whatever :)

Cheers,
Conrad
Details
Message ID
<17c3a03696e2c0e1.f35d3f6599202235.cdc008ecf268f5a1@mba>
In-Reply-To
<1cc7bc55-4032-47e2-a1c5-bab005acf689@bitfehler.net>
> The spec is intended to give enough leeway that changing the 
> human-readable part is possible, but I would say it should be 
> discouraged. At the very least, it would always have to happen with 
> explicit consent from the user, so that any index or such could be 
> re-built.

That is a very smart recommendation for implementors!

> Other than that, I think renaming _should_ not happen. Since you're 
> asking about this, you probably have some use case in mind, though, 
> right? I'd be curious to learn what that is :)

No, just curious. I think I, and most users, would not really touch the
m2dir files that much, but it is nice that it is allowed.

> >> Depending on the setup, an m2store root may itself be an m2dir.
> > 
> > Given that this is not allowed when mirroring IMAP (as you
> > explain further down), is there a good reason for allowing this? I'm
> > thinking it might simplify implementations if an m2store is not allowed
> > to be an m2dir itself.
> 
> Good point, not sure. I initially disallowed it, however the early 
> versions were much too IMAP-centric. I started allowing it when adding 
> email delivery. I then added the "default delivery target" stuff, which 
> takes care of this to some extent. I am leaning towards still allowing 
> it for the following reasoning, which I admit... might be semi-sound:
> 
> * Maybe there will be a remote mail store (not IMAP) at some point that
>    has one true root for its hierarchy, at which point everyone would
>    like to make that root their "~/Mail", and not "~/Mail/Inbox"
> * This is of course purely speculative, but in practice (read: IMAP) the
>    spec makes it clear that you should not do it, but stops short of
>    nailing this door shut in case of future developments.
> 
> Thin, I know. I'd be curious, though: do you think the benefit would be 
> that great? I'd be very understanding of any tool that for example 
> simply ignored the root even if it were an m2dir. From my experience, 
> the major pain was more to determine what the actual INBOX is (and all 
> the various cases that showed up in practice). In this case it would be 
> clear that the root is not the INBOX, but the folder INBOX is. I think 
> as long as the ambiguity is eliminated, handling this should be acceptable?

Suggestion: The spec allows any directory outside of an m2store context
to be an m2dir, only requiring that there is an empty `.m2dir` file
there and that all files within it follow the file naming scheme.
Dotfiles and directories are ignored. As such, any m2store root that
also has an `.m2dir` file in it is also an m2dir. This should be 
allowed, but in accordance with IMAP, let's have the spec discourage it.

> >> Every file in the m2dir represents an email.
> > 
> > What happens if the file does not comply with the filename rules?
> 
> Indeed, I missed specifying this. I'd say it's a violation of the spec 
> and therefore should be an error. Unless you have compelling reasons for 
> allowing this?

Nope, I think that is very sensible. I would even go as far as stating
that if an m2dir contains a (non-dotfile) file that does not follow the
naming scheme, it is not an m2dir.
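
A minimal sketch of that stricter reading: a directory counts as an m2dir only if it carries the `.m2dir` marker and every visible regular file passes the naming check, with dotfiles and directories skipped. The actual filename rules are not spelled out in this thread, so the check is passed in as a caller-supplied predicate; `is_m2dir` and `is_valid_name` are hypothetical names.

```python
import os
from typing import Callable


def is_m2dir(path: str, is_valid_name: Callable[[str], bool]) -> bool:
    """Check the marker file and the names of all visible regular files."""
    if not os.path.isfile(os.path.join(path, ".m2dir")):
        return False
    with os.scandir(path) as entries:
        for entry in entries:
            # Dotfiles (including .meta and the marker) and directories
            # are ignored.
            if entry.name.startswith(".") or entry.is_dir():
                continue
            # One stray file is enough to disqualify the whole directory.
            if not is_valid_name(entry.name):
                return False
    return True
```

A more lenient tool could collect the offending names and report them as errors instead of rejecting the directory outright.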

> [...]
>
> And extended attributes come with a _lot_ of drawbacks: making a backup 
> of your emails? Better be very careful that your tool honors them [1]! 
> Backing up to S3? No dice.
> 
> [1] 
> https://wiki.archlinux.org/title/Extended_attributes#Preserving_extended_attributes
> 
> And, last but not least, it would exclude a lot of OS/filesystems. So, 
> yes, this was considered, but consciously decided against.

I'm convinced! `.meta` is a very nice solution indeed.

> >> This temporary file's name must start with a period (.) in order to be 
> >> ignored by compliant applications.
> > 
> > Maybe the spec should require some uniqueness strategy as well, to avoid
> > multiple m2dir-supporting applications writing something to the same
> > path simultaneously? This could probably be something very simple like
> > checking if a file exists already and increasing a counter if it does.
> 
> I indeed didn't spell it out there, but my implicit understanding was 
> that this should use typical means of secure temporary file creation, 
> like mkstemp(3), just with the additional constraint of it being a 
> dot-file. I suppose this could be elaborated on.
> 
> I'll just note that having multiple processes deliver mails to the same 
> folder at the same time can still be... difficult. But I decided it is 
> not the responsibility of the storage format to handle all the edge 
> cases of this... :)

Yes, I guess that will have to be solved based on the use-cases of the
implementor. They could probably also figure out how to solve the
temporary file problem, so it might not be necessary to spell out in the
spec.

> >> <DATE> is the date from the email's Date header in the following format:
> >>
> >> `YYYY-MM-DD_hh:mm`
> > 
> > I think RFC 3339 up to seconds should be used here. The `T` isn't that
> > difficult to ignore, and the seconds would help sort emails that appear
> > within the same minute.
> 
> _Now_ it becomes interesting!
> 
> The whole human-readable part is of course subjective to a certain 
> extent, and thus hard to reason about. Like, I personally just hate the 
> `T` :) I also don't see any value in the seconds, but can accept the 
> ordering argument. It was very obvious that there would be a lot of 
> opinions on this part, and hence I made clear in the spec that an 
> application should allow for some flexibility. 

True, the idea of adding a customizable human-centric part is a stroke 
of genius! However, I think either the spec has to recommend something 
that would work very generally and be based on other standards, or it 
should not recommend something at all. But of course, any spec is bound
to be colored by the author, and your recommendations are very sensible.
(I will probably include seconds in my personal store, though 😉.)

> Which brings me to the next part:
> 
> > Touching on your point below that the date should be naïve and in the 
> > user's local timezone: I think since we're at the point where the spec
> > recommends a standard format for the human-centric part, it should
> > include timezone information. Some applications might begin depending on 
> > that part for extracting the date, and so a naïve date will definitely 
> > become confusing. RFC 3339 requires the timezone, so with it there 
> > should be no ambiguities.
> 
> That (extracting date from filename) would be a huge no-no in my book. 
> It's exactly what I want to avoid, so folks have the freedom to play 
> with the human-readable part if they so desire. It even already says 
> that "an application must not attempt to parse the human-centric part or 
> derive any properties from it." [2]
> 
> [2] https://man.sr.ht/~bitfehler/m2dir/#human-centric-part-of-filename

Yeah, you're right, I'm just pointing out the fact that since the parsed
date is already there, it might be tempting for people to use it instead
of parsing the email themselves (faster). However, I guess then they
don't comply with the spec since it is discouraged, so my point doesn't
really hold. I'm convinced!

> > All in all, I'm super pumped about this format! I can't wait to start
> > using it.
> 
> Excellent! Thank you so much for this valuable feedback, and I am super 
> excited to see what others will do with this!
> 
> Give me just a few more days for a rudimentary cleanup and then I'll 
> publish the Rust crate that I wrote. It evolved with the spec, so it's a 
> little messy, but whatever :)

Looking forward to it! I was super pumped to write something myself, but
then I'll have a look at your crate instead! I would love to help out
with a separate pair of eyes on the API if I can. Maybe give me view 
access to the repo and I can look through and supply some patches?

-- 
Cheers,
Magnus
Details
Message ID
<89d4cbf8-ffd2-419d-af00-1a454e05614d@bitfehler.net>
In-Reply-To
<17c3a03696e2c0e1.f35d3f6599202235.cdc008ecf268f5a1@mba>
Hey,

thanks again for all the input. A few final notes (context shortened 
for "brevity"):

On 4/6/24 9:27 AM, Knut Magnus Aasrud wrote:
...
> Suggestion: The spec allows any directory outside of an m2store context
> to be an m2dir, only requiring that there is an empty `.m2dir` file
> there and that all files within it follow the file naming scheme.
> Dotfiles and directories are ignored. As such, any m2store root that
> also has an `.m2dir` file in it is also an m2dir. This should be
> allowed, but in accordance with IMAP, let's have the spec discourage it.

Yes, this is pretty much what I was aiming for. I'll give that section a 
once-over to make that more clear.

...
> True, the idea of adding a customizable human-centric part is a stroke
> of genius! However, I think either the spec has to recommend something
> that would work very generally and be based on other standards, or it
> should not recommend something at all. But of course, any spec is bound
> to be colored by the author, and your recommendations are very sensible.

It is actually great talking this through with someone, because frankly, 
you are totally right. As you may have guessed, the spec actually 
started out simply requiring the (now) "recommended" format. It quickly 
became clear that finding one format for everyone is a fool's errand, so 
the variable format was born. And reading your feedback makes me realize 
that I was simply having a hard time letting go of "my" format. But 
that's silly :)

I will update the section accordingly, maybe just keeping bits and 
pieces as an example of how the human-readable part can be used to help 
human operators.

> (I will probably include seconds in my personal store, though 😉.)

Yes, you totally should! I really want to see what people come up with! 
As you say, this is the much more important part of the spec than the 
very subjective format I initially came up with.

...
> Looking forward to it! I was super pumped to write something myself, but
> then I'll have a look at your crate instead! I would love to help out
> with a separate pair of eyes on the API if I can. Maybe give me view
> access to the repo and I can look through and supply some patches?

Here is what I'll do: my Mondays are always really busy, but I should 
manage to publish an update to the spec on Tuesday that incorporates all 
the things we hashed out in this thread. I'll also push the code 
someplace for you to see by then. However, I realized that my code was 
written with two major constraints: first, it evolved through various 
stages along with the spec itself; and second, I have always tried to 
keep the interface somewhat close to what I was using for maildir in my 
existing code, so I could more easily play with the code in my existing 
tools.

I don't think these are major deficits, but it might of course be fun to 
see what kind of API you'd come up with using a green-field 
approach. Just a thought, though, I'll leave that up to you.

Cheers,
Conrad
Details
Message ID
<acbf17fb-c0c9-4b48-8521-f55a9d67c69e@aasrud.com>
In-Reply-To
<89d4cbf8-ffd2-419d-af00-1a454e05614d@bitfehler.net>
> Here is what I'll do: my Mondays are always really busy, but I should 
> manage to publish an update to the spec on Tuesday that incorporates 
> all the things we hashed out in this thread. I'll also push the code 
> someplace for you to see by then.

Great stuff, but don't feel any obligation to rush. We all have busy days 
at work and such, and this is all fun!

> [...] it might of course be fun to see what kind of API you'd come 
> up with using a green-field approach. Just a thought, though, I'll leave 
> that up to you.

I'll try to sketch out an API that would make sense for me, and we'll see 
if it holds up. Maybe a merge of the two would work well, let's see!

-- 
Magnus