robots.txt for Gemini formalised

Details
Message ID
<C79XPGLWFE5E.3L6WKME61MN8U@stilgar>
Hi folks,

There is now (finally!) an official reference on the use of robots.txt
files in Geminispace.  Please see:

gemini://gemini.circumlunar.space/docs/companion/robots.gmi

I attempted to take into account previous discussions on the mailing
list and the currently declared practices of various well-known Gemini
bots (broadly construed).

I don't consider this "companion spec" to necessarily be finalised at
this point, but I am primarily interested in hearing suggestions for
change from either authors of software which tries to respect robots.txt
who are having problems caused by the current specification, or from
server admins who are having bot problems who feel that the current
specification is not working for them.

The biggest gap that I can currently see is that there is no advice on
how often bots should re-query robots.txt to check for policy changes.
I could find no clear advice on this for the web, either.  I would be
happy to hear from people who've written software that uses robots.txt
with details on what their current practices are in this respect.

Cheers,
Solderpunk
Sean Conner <sean@conman.org>
Details
Message ID
<20201122225942.GB1721@brevard.conman.org>
In-Reply-To
<C79XPGLWFE5E.3L6WKME61MN8U@stilgar> (view parent)
It was thus said that the Great Solderpunk once stated:
> Hi folks,
> 
> There is now (finally!) an official reference on the use of robots.txt
> files in Geminispace.  Please see:
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

  Nice.

> I attempted to take into account previous discussions on the mailing
> list and the currently declared practices of various well-known Gemini
> bots (broadly construed).
> 
> I don't consider this "companion spec" to necessarily be finalised at
> this point, but I am primarily interested in hearing suggestions for
> change from either authors of software which tries to respect robots.txt
> who are having problems caused by the current specification, or from
> server admins who are having bot problems who feel that the current
> specification is not working for them.

  Right now, there are two things I would change.

	1. Add "allow".  While the initial spec [1] did not have an allow
	   rule, a subsequent draft proposal [2] did, which Google is
	   pushing (as of 2019) to become an RFC [3].

	2. I would specify virtual agents as:

		Virtual-agent: archiver
		Virtual-agent: indexer

	   This makes it easier to add new virtual agents, separates the
	   namespace of agents from the namespace of virtual agents, and is
	   allowed by all current and proposed standards [4].

	   The rule I would follow is:

		Definitions:  
			specific user agent is one that is not '*'
			specific virtual agent is one that is not '*'
			generic user agent is one that is specified as '*'
			generic virtual agent is one that is '*'

		A crawler should use a block of rules:

			if it finds a specific user agent (most targeted)
			or it finds a specific virtual agent
			or it finds a generic virtual agent
			or it finds a generic user agent (least targeted)

	   I'm wavering on the generic virtual agent bit, so if you think
	   that makes this too complicated, fine, I think it can go.
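
The precedence rule above could be sketched roughly as follows. This is only an illustration, not part of any spec: the function name `select_block`, the dictionary layout of a block, and the case-insensitive comparison are all assumptions made for the example.

```python
def select_block(blocks, user_agent, virtual_agents=()):
    """Pick the most targeted rule block for a crawler.

    Each block is assumed (hypothetically) to be a dict with the keys
    "user_agents", "virtual_agents" and "rules".
    """
    name = user_agent.lower()
    kinds = {v.lower() for v in virtual_agents}

    def rank(block):
        users = {u.lower() for u in block["user_agents"]}
        virtuals = {v.lower() for v in block["virtual_agents"]}
        if name in users - {"*"}:
            return 0  # specific user agent (most targeted)
        if kinds & (virtuals - {"*"}):
            return 1  # specific virtual agent
        if "*" in virtuals:
            return 2  # generic virtual agent
        if "*" in users:
            return 3  # generic user agent (least targeted)
        return None   # block does not apply to this crawler

    ranked = [(rank(b), i) for i, b in enumerate(blocks)]
    ranked = [(r, i) for r, i in ranked if r is not None]
    # Lowest rank wins; ties go to the block that appears first.
    return blocks[min(ranked)[1]] if ranked else None
```

Dropping the generic virtual agent case would just mean deleting the rank 2 branch.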

> The biggest gap that I can currently see is that there is no advice on
> how often bots should re-query robots.txt to check for policy changes.
> I could find no clear advice on this for the web, either.  I would be
> happy to hear from people who've written software that uses robots.txt
> with details on what their current practices are in this respect.

  The Wikipedia page [5] lists a non-standard extension "Crawl-delay" which
informs a crawler how often they should make requests.  It might be easy to
add a field saying how often to fetch a resource.  A sample file:

# The GUS agent, plus any agent that identifies as an "indexer" is allowed
# one path in an otherwise disallowed place, and only fetch items in 10
# second increments.

User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10

# Agents that fetch feeds should only grab every 6 hours.  "Check" is
# allowed as agents should ignore fields they don't understand.

Virtual-agent: feed
Disallow: /private
Check: 21600

# And a fallback.  Here we don't allow any old crawler into the private
# space, and we force them to use 20 seconds between fetches.

User-agent: *
Disallow: /private
Crawl-delay: 20
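
A crawler that does not know about "Check" could read a file like the one above with a parser along these lines. This is a minimal sketch under stated assumptions: the names `parse_robots`, `KNOWN_FIELDS` and `AGENT_FIELDS` are made up for the example, "Virtual-agent" and "Check" are proposals from this thread rather than standard fields, and the handling of "#" is deliberately naive.

```python
# Fields this hypothetical crawler understands; everything else is ignored.
KNOWN_FIELDS = {"allow", "disallow", "crawl-delay"}
AGENT_FIELDS = {"user-agent": "user_agents", "virtual-agent": "virtual_agents"}

SAMPLE = """\
User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10

Virtual-agent: feed
Disallow: /private
Check: 21600

User-agent: *
Disallow: /private
Crawl-delay: 20
"""


def parse_robots(text):
    """Split robots.txt text into blocks of agents plus rules."""
    blocks = []
    current = None
    in_agent_lines = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in AGENT_FIELDS:
            # Consecutive agent lines share one block.
            if not in_agent_lines:
                current = {"user_agents": [], "virtual_agents": [], "rules": []}
                blocks.append(current)
                in_agent_lines = True
            current[AGENT_FIELDS[field]].append(value)
        else:
            in_agent_lines = False
            # Unknown fields (such as "Check" here) are silently skipped.
            if current is not None and field in KNOWN_FIELDS:
                current["rules"].append((field, value))
    return blocks
```

A feed reader that does understand "Check" would only need to add it to its own set of known fields.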

  -spc

[1]	gemini://gemini.circumlunar.space/docs/companion/robots.gmi

[2]	http://www.robotstxt.org/norobots-rfc.txt

[3]	https://developers.google.com/search/reference/robots_txt

[4]	Any field not understood by a crawler should be ignored.

[5]	https://en.wikipedia.org/wiki/Robots_exclusion_standard
Details
Message ID
<C7A60Q2JCSQP.3UAHW6DREZ4RD@taiga>
In-Reply-To
<C79XPGLWFE5E.3L6WKME61MN8U@stilgar> (view parent)
Feedback:

A web portal is a regular user agent, not a robot.

Maybe we could normalize robots fetching robots.txt with the query
string set to some useful identifying information? This would allow
gemini administrators to make bot-specific rules, understand the
behavior of their logs, and get in touch with the operator if
necessary.
John Cowan <cowan@ccil.org>
Details
Message ID
<CAD2gp_ToHLfz8P4g3u6=RDaNcY+75BNP1OgK-XAj8gZqp+mFBg@mail.gmail.com>
In-Reply-To
<C7A60Q2JCSQP.3UAHW6DREZ4RD@taiga> (view parent)
On Sun, Nov 22, 2020 at 6:03 PM Drew DeVault <sir at cmpwn.com> wrote:


> A web portal is a regular user agent, not a robot.
>

Agreed.  However, the spec says "publicly serve the result", and a *public*
proxy can pound a Gemini server if a lot of Web clients are accessing it
concurrently.  It should be able to find out whether the server is robust
to such operations or not.

By the same token, a public Gopher proxy (if there are any) should respect
"Disallow: gopherproxy".

Other points:
+1 for Allow:
+1 for Virtual-Agent
+1 for ignoring unknown lines
Unsure what the difference is between Crawl-Delay: and Check:, but having a
retry delay is a Good Thing

Additionally:  "Agent:" should specify a SHA-256 hash of the client cert
used by particular crawlers rather than a random easy-to-forge name.  Thus
GUS should crawl using a cert and publicly post the hash of this cert.
Then callers with that cert are necessarily GUS, since the cert itself is
not published.  (Of course it's still possible for a server to steal GUS's
client cert.)
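
For illustration, the hash a crawler would publish could be computed like this. It is a sketch only: it assumes the hash is taken over the whole DER-encoded certificate (the proposal above does not say whether it is the certificate or just the public key), and the function names are made up for the example.

```python
import hashlib
import hmac
import ssl


def cert_fingerprint(der_bytes: bytes) -> str:
    """Hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def pem_fingerprint(pem_text: str) -> str:
    """Same digest, starting from a PEM-encoded certificate."""
    return cert_fingerprint(ssl.PEM_cert_to_DER_cert(pem_text))


def agent_matches(der_bytes: bytes, declared_hash: str) -> bool:
    """Compare a presented client cert against a hash listed in robots.txt."""
    return hmac.compare_digest(cert_fingerprint(der_bytes), declared_hash.lower())
```

On the server side, a Python TLS server can obtain the DER bytes of the presented client certificate with `SSLSocket.getpeercert(binary_form=True)`.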


> Maybe we could normalize robots fetching robots.txt with the query
> string set to some useful identifying information? This would allow
> gemini administrators to make bot-specific rules, understand the
> behavior of their logs, and get in touch with the operator if
> necessary.
>

The trouble is that completely different pages can be returned with
different query strings that are entirely unrelated to actual searching, so
it's inappropriate to usurp the query string for this purpose.  That's not
to say that agent control can't rely on the query string.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent
six five six five six five six five and six
   --blazoning the U.S. flag <http://web.meson.org/blazonserver>
Details
Message ID
<C7A8253W74PS.3J3SBB2IKJCAE@nitro>
In-Reply-To
<CAD2gp_ToHLfz8P4g3u6=RDaNcY+75BNP1OgK-XAj8gZqp+mFBg@mail.gmail.com> (view parent)
On Sun Nov 22, 2020 at 7:30 PM EST, John Cowan wrote:
> Additionally: "Agent:" should specify a SHA-256 hash of the client cert
> used by particular crawlers rather than a random easy-to-forge name.
> Thus
> GUS should crawl using a cert and publicly post the hash of this cert.
> Then callers with that cert are necessarily GUS, since the cert itself
> is
> not published. (Of course it's still possible for a server to steal
> GUS's
> client cert.)

This doesn't seem very useful, as bad robots can simply ignore the rules
in robots.txt.
John Cowan <cowan@ccil.org>
Details
Message ID
<CAD2gp_RHZ760cmJsr5TFbYwDRfpLUjiDsTSTv0ZMU-oYvb4D+w@mail.gmail.com>
In-Reply-To
<C7A8253W74PS.3J3SBB2IKJCAE@nitro> (view parent)
Of course they can: that's always true, as the pre-spec already says.   The
idea is to give crawlers (etc.) that want to keep to the rules some way to
clearly and uniquely identify themselves to servers.

On Sun, Nov 22, 2020 at 7:39 PM Adnan Maolood <me at adnano.co> wrote:

> On Sun Nov 22, 2020 at 7:30 PM EST, John Cowan wrote:
> > Additionally: "Agent:" should specify a SHA-256 hash of the client cert
> > used by particular crawlers rather than a random easy-to-forge name.
> > Thus
> > GUS should crawl using a cert and publicly post the hash of this cert.
> > Then callers with that cert are necessarily GUS, since the cert itself
> > is
> > not published. (Of course it's still possible for a server to steal
> > GUS's
> > client cert.)
>
> This doesn't seem very useful, as bad robots can simply ignore the rules
> in robots.txt.
>
Details
Message ID
<X7sS8dSqOUjSqvYX@goldfish.localdomain>
In-Reply-To
<C79XPGLWFE5E.3L6WKME61MN8U@stilgar> (view parent)
This looks great! I'm excited to see this companion spec become more
formalized, and really like the categorical virtual agent design. One
thing that stuck out to me after a first read was the `webproxy` user
agent. What would you think of something like the following instead:

`proxy`
`proxy-web`
`proxy-gopher`

The prefixed design doesn't have any drawbacks as far as I can tell,
and would allow for more intuitively designed blocking/allowing
hierarchies. E.g., if there were 4 different types of proxies in use,
and you only wanted to allow one, you could be more restrictive with
`proxy` and less restrictive with the more precise, suffixed user
agent of the type you are okay with (e.g., `proxy-gopher`).

Emphasis on "more intuitively designed" - I realize you could
technically accomplish this in the current design by simply adding
`proxy` to the mix, but I think the prefix-based organization makes it
clearer and a bit more intuitive.
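
One possible reading of the prefixed scheme, sketched in code: a bot looks for the most specific name first and falls back to its prefixes. The helper names and the `policy` mapping (agent name to a list of disallowed paths) are hypothetical and only serve the example.

```python
def candidate_names(agent):
    """Most specific name first, then each shorter dash-separated prefix."""
    parts = agent.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def rules_for(agent, policy):
    """Return the disallow list of the most specific matching entry."""
    for name in candidate_names(agent):
        if name in policy:
            return policy[name]
    return policy.get("*", [])


# Block every kind of proxy except Gopher proxies.
policy = {"proxy": ["/"], "proxy-gopher": []}
print(rules_for("proxy-gopher", policy))  # the more precise entry wins
print(rules_for("proxy-web", policy))     # falls back to "proxy"
```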

Warm regards,
Natalie
Robert khuxkm Miles <khuxkm@tilde.team>
Details
Message ID
<fc318c9a12adbe40159c501c002c5681@tilde.team>
In-Reply-To
<C7A60Q2JCSQP.3UAHW6DREZ4RD@taiga> (view parent)
November 22, 2020 6:02 PM, "Drew DeVault" <sir at cmpwn.com> wrote:

> Feedback:
> 
> A web portal is a regular user agent, not a robot.

Just throwing in here for consideration that I agree with Drew, a proxy is not a robot by default. Are we implying that a browser must also follow robots.txt to be well-behaved? If so, I might just block AV-98 from reading my capsule. :)

What I would recommend in lieu of robots.txt proxy rules is normalizing using robots.txt on the web side of a proxy to prevent web spiders from inadvertently crawling gemspace. For instance, proxy.vulpes.one blocks every robot user agent from indexing any part of the site.

Is there any good usecase for a proxy User-Agent in robots.txt, other than blocking web spiders from being able to crawl gemspace? If not, I would be in favor of dropping that part of the definition.

Just my two cents,
Robert "khuxkm" Miles
Sean Conner <sean@conman.org>
Details
Message ID
<20201123020541.GE1721@brevard.conman.org>
In-Reply-To
<fc318c9a12adbe40159c501c002c5681@tilde.team> (view parent)
It was thus said that the Great Robert khuxkm Miles once stated:
> 
> Is there any good usecase for a proxy User-Agent in robots.txt, other than
> blocking web spiders from being able to crawl gemspace? If not, I would be
> in favor of dropping that part of the definition.

  I'm in favor of dropping that part of the definition as it doesn't make
sense at all.  Given a web based proxy at <https://example.com/gemini>, web
crawlers will check for <https://example.com/robots.txt> for guidance, not
<https://example.com/gemini?gemini.conman.org/robots.txt>.  Web crawlers
will not be able to crawl gemini space for two main reasons:

        1. Most server certificates are self-signed and opt out of the CA
           business.  And even if a crawler were to accept self-signed
           (or non-standard CA signed) certificates, then---

        2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
           fail anyway.

  -spc
Robert khuxkm Miles <khuxkm@tilde.team>
Details
Message ID
<3d8fa90114f88eb4daabf8e5298bbe99@tilde.team>
In-Reply-To
<20201123020541.GE1721@brevard.conman.org> (view parent)
November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote:

> It was thus said that the Great Robert khuxkm Miles once stated:
> 
>> Is there any good usecase for a proxy User-Agent in robots.txt, other than
>> blocking web spiders from being able to crawl gemspace? If not, I would be
>> in favor of dropping that part of the definition.
> 
> I'm in favor of dropping that part of the definition as it doesn't make
> sense at all. Given a web based proxy at <https://example.com/gemini>, web
> crawlers will check for <https://example.com/robots.txt> for guidance, not
> <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers
> will not be able to crawl gemini space for two main reasons:
> 
> 1. Most server certificates are self-signed and opt out of the CA
> business. And even if a crawler were to accept self-signed
> (or non-standard CA signed) certificates, then---
> 
> 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
> fail anyway.
> 
> -spc

Well, the argument is that the crawler would access <https://example.com/gemini?gemini://gemini.conman.org/>, and from there it could access <https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>, and then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and so on. However, I'd argue that the onus falls on example.com to set a robots.txt rule in <https://example.com/robots.txt> to prevent web crawlers from indexing anything with their proxy.

Just my two cents,
Robert "khuxkm" Miles
Details
Message ID
<C7AB750HR1S5.12F8WY1S6S2PQ@taiga>
In-Reply-To
<3d8fa90114f88eb4daabf8e5298bbe99@tilde.team> (view parent)
A web portal is a one-to-one mapping of a user request to a gemini
request. It's not an automated process. It's a genuine user agent, an
agent of a user. The level of traffic you'd receive from a web portal is
similar to the amount of traffic you'd receive from any other user
agent, and rate controls or access blocking don't make sense.

As the maintainer of such a web portal, I officially NACK any suggestion
that it should obey robots.txt, and will not introduce such a feature.
Sean Conner <sean@conman.org>
Details
Message ID
<20201123033109.GF1721@brevard.conman.org>
In-Reply-To
<C7AB750HR1S5.12F8WY1S6S2PQ@taiga> (view parent)
It was thus said that the Great Drew DeVault once stated:
> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user. The level of traffic you'd receive from a web portal is
> similar to the amount of traffic you'd receive from any other user
> agent, and rate controls or access blocking don't make sense.
> 
> As the maintainer of such a web portal, I officially NACK any suggestion
> that it should obey robots.txt, and will not introduce such a feature.

  What's the IP address of your web portal, so I can block it and prevent
the various webbots that will go through your web portal and index the
Gemini content without my consent?

  -spc
Robert khuxkm Miles <khuxkm@tilde.team>
Details
Message ID
<4785addcaf3ad9e9aeef53924f1ddcbd@tilde.team>
In-Reply-To
<20201123033109.GF1721@brevard.conman.org> (view parent)
November 22, 2020 10:31 PM, "Sean Conner" <sean at conman.org> wrote:

> It was thus said that the Great Drew DeVault once stated:
> 
>> A web portal is a one-to-one mapping of a user request to a gemini
>> request. It's not an automated process. It's a genuine user agent, an
>> agent of a user. The level of traffic you'd receive from a web portal is
>> similar to the amount of traffic you'd receive from any other user
>> agent, and rate controls or access blocking don't make sense.
>> 
>> As the maintainer of such a web portal, I officially NACK any suggestion
>> that it should obey robots.txt, and will not introduce such a feature.
> 
> What's the IP address of your web portal, so I can block it and prevent
> the various webbots that will go through your web portal and index the
> Gemini content without my consent?
> 
> -spc

I assume Drew's smart enough to block web bots from crawling his gemini portal. Just saying.

Just my two cents,
Robert "khuxkm" Miles
Details
Message ID
<C7AC110NKPTT.240NYV5BJ3KTL@taiga>
In-Reply-To
<20201123033109.GF1721@brevard.conman.org> (view parent)
On Sun Nov 22, 2020 at 10:31 PM EST, Sean Conner wrote:
> What's the IP address of your web portal, so I can block it and prevent
> the various webbots that will go through your web portal and index the
> Gemini content without my consent?

It's not an indexer. It's a user agent. And its IP address is
173.195.146.137.

Dick.
John Cowan <cowan@ccil.org>
Details
Message ID
<CAD2gp_TjRggAZLvi8WFL+XR6nqG4iTnwj_jyni-zc3JwK_FzSA@mail.gmail.com>
In-Reply-To
<C7AB750HR1S5.12F8WY1S6S2PQ@taiga> (view parent)
On Sun, Nov 22, 2020 at 10:07 PM Drew DeVault <sir at cmpwn.com> wrote:

> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user.
>

It is the agent of an arbitrarily large number of users.  That's the
difference between, say, an email user agent and an email gateway to a
non-Internet email system.  There is no reason to impose even soft
regulation on the former.  There is every reason to allow regulation of the
latter.


John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
The experiences of the past show that there has always been a discrepancy
between plans and performance.        --Emperor Hirohito, August 1945
Sean Conner <sean@conman.org>
Details
Message ID
<20201123045619.GG1721@brevard.conman.org>
In-Reply-To
<4785addcaf3ad9e9aeef53924f1ddcbd@tilde.team> (view parent)
It was thus said that the Great Robert khuxkm Miles once stated:
> November 22, 2020 10:31 PM, "Sean Conner" <sean at conman.org> wrote:
> 
> > It was thus said that the Great Drew DeVault once stated:
> > 
> >> A web portal is a one-to-one mapping of a user request to a gemini
> >> request. It's not an automated process. It's a genuine user agent, an
> >> agent of a user. The level of traffic you'd receive from a web portal is
> >> similar to the amount of traffic you'd receive from any other user
> >> agent, and rate controls or access blocking don't make sense.
> >> 
> >> As the maintainer of such a web portal, I officially NACK any suggestion
> >> that it should obey robots.txt, and will not introduce such a feature.
> > 
> > What's the IP address of your web portal, so I can block it and prevent
> > the various webbots that will go through your web portal and index the
> > Gemini content without my consent?
> > 
> > -spc
> 
> I assume Drew's smart enough to block web bots from crawling his gemini
> portal. Just saying.
> 
> Just my two cents,

  Drew's proxy is a webserver in its own right:

	https://git.sr.ht/~sircmpwn/kineto/tree/master/main.go

  It checks for a GET request for "/favicon.ico" but not to "/robots.txt".
Every other GET request is immediately proxied to a gemini server.  I think
it was meant to run locally, but he made an instance available on the public
Internet.

  -spc
Details
Message ID
<C7ADNRBHP020.3J0RKW0UMS3JC@taiga>
In-Reply-To
<CAD2gp_TjRggAZLvi8WFL+XR6nqG4iTnwj_jyni-zc3JwK_FzSA@mail.gmail.com> (view parent)
On Sun Nov 22, 2020 at 11:51 PM EST, John Cowan wrote:
> It is the agent of an arbitrarily large number of users.

So is every other user agent. It will never make more requests than
there are users who are asking for content. It is not special.
Details
Message ID
<08fde749-b8f8-9ddc-2e8c-9b7e6af459f6@emilis.net>
In-Reply-To
<CAD2gp_ToHLfz8P4g3u6=RDaNcY+75BNP1OgK-XAj8gZqp+mFBg@mail.gmail.com> (view parent)
On 11/23/20 2:30 AM, John Cowan wrote:
>
> By the same token, a public Gopher proxy (if there are any) should 
> respect "Disallow: gopherproxy".
>
> Other points:
> +1 for Allow:
> +1 for Virtual-Agent
> +1 for ignoring unknown lines
> Unsure what the difference is between Crawl-Delay: and Check:, but 
> having a retry delay is a Good Thing

A small nit-pick: if we use "Virtual-Agent" and "Crawl-Delay", we should 
at least use "gopher-proxy" instead of "gopherproxy".


--
Emilis Dambauskas
gemini://tilde.team/~emilis/
Details
Message ID
<C7B0TQPHY8QB.2U0TVW8ZG9KPZ@taiga>
In-Reply-To
<08fde749-b8f8-9ddc-2e8c-9b7e6af459f6@emilis.net> (view parent)
-1 to Virtual-Agent

I think that this is best formalized as an addendum to the existing
robots.txt conventions, which simply details a gemini-specific
interpretation as such.
Details
Message ID
<20201124102902.GA17441@localhost.localdomain>
In-Reply-To
<20201123045619.GG1721@brevard.conman.org> (view parent)
Hi

I suppose I am chipping in a bit too late here, but I think
the robots.txt thing was always a rather ugly mechanism - a
bit of an afterthought.

Consider the gemini://example.com/~somebody/personal.gmi -
if somebody wishes to exclude personal.gmi from being
crawled they need write access to example.com/robots.txt,
and how do we go about making sure that ~somebodyelse,
also on example.com doesn't overwrite robots.txt with
their own rules ?

Then there is the problem of transitivity - if we
have a portal, proxy or archive - how does it relay
the information to its downstream users ? See also
the exchange between Sean and Drew...

So the way I remember it, robots.txt was a quick hack
to prevent spiders getting trapped in a maze of
cgi generated data, and so hammering the server.
It wasn't designed to solve matters of privacy
and redistribution.

I have pitched this idea before: I think a footer containing
the license/rules under which a page can be distributed/cached
is more sensible than robots.txt. This approach is:

* local to the page (no global /robots.txt)
* persistent (survives being copied, mirrored & re-exported)
* sound (one knows the conditions under which this can be redistributed)

I speak under correction, but I believe a decent amount of the
public web was mined for faces to train the neural networks
that now make totalitarian surveillance possible. Had these
been labelled "CC ND (no derivative work)" then there
would be legal impediment - not to the regimes now, but to
the universities and research labs which pioneered this.

We now have people more aware of this problem, and some
of us wish to put up material limited to gemini-space only,
and not export it to the web. A footer line "-- GMI: A. User"
could prohibit export to the web, while one "-- CC-SA: J. Soap"
would permit it...

regards

marc
Details
Message ID
<df7439d5-fafd-a070-4762-16c30033e15c@qwertqwefsday.eu>
In-Reply-To
<20201124102902.GA17441@localhost.localdomain> (view parent)
On 24.11.2020, marc wrote:
> I suppose I am chipping in a bit too late here, but I think
> the robots.txt thing was always a rather ugly mechanism - a
> bit of an afterthought.

+1 that the robots.txt solution feels a lot like a hack.
  
> So the way I remember it, robots.txt was a quick hack
> to prevent spiders getting trapped in a maze of
> cgi generated data, and so hammering the server.
> It wasn't designed to solve matters of privacy
> and redistribution.

There is a more modern alternative to robots.txt which is the X-Robots-Tag
HTTP header and sounds like what you are trying to do here.

That said, there are probably people who will not want special headers to be
added [1], although I personally think that something like you suggest would not
be that "exploitable". Especially because it is just part of the document's text.

[1] See the first sentence of §2.4 of the Gemini FAQ
     gemini://gemini.circumlunar.space/docs/faq.gmi
     https://gemini.circumlunar.space/docs/faq.html
Details
Message ID
<20201124123109.b08d27b9dbe0285829cd51b2@gmail.com>
In-Reply-To
<20201124102902.GA17441@localhost.localdomain> (view parent)
On Tue, 24 Nov 2020 11:29:02 +0100
marc <marcx2 at welz.org.za> wrote:

> Consider the gemini://example.com/~somebody/personal.gmi -
> if somebody wishes to exclude personal.gmi from being
> crawled they need write access to example.com/robots.txt,
> and how do we go about making sure that ~somebodyelse,
> also on example.com doesn't overwrite robots.txt with
> their own rules ?

How the server produces responses to robots.txt requests is an
implementation detail. robots.txt can easily be implemented such that
the server responds with access information provided by files in
subdirectories. For example: a system directory corresponding to
/~somebody/ contains a file named ".disallow" containing
"personal.gmi". When the server builds a response to /robots.txt, it
considers the content of all ".disallow" files and includes Disallow
lines corresponding to their content. This way, individual users on a
multi-user system can decide for themselves the access policy for their
content without shared access to a canonical robots.txt.
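
A rough sketch of how a server might assemble such a response. It assumes that the directory layout under a hypothetical document root mirrors URL paths one-to-one, and that each ".disallow" file lists one relative path per line; the function name `build_robots_txt` is invented for the example.

```python
from pathlib import Path


def build_robots_txt(root, user_agent="*"):
    """Build robots.txt text from every ".disallow" file below root."""
    root = Path(root)
    lines = [f"User-agent: {user_agent}"]
    for disallow_file in sorted(root.rglob(".disallow")):
        rel = disallow_file.parent.relative_to(root).as_posix()
        prefix = "/" if rel == "." else f"/{rel}/"
        for entry in disallow_file.read_text().splitlines():
            entry = entry.strip()
            # Skip blanks, comments, and attempts to reach outside the directory.
            if not entry or entry.startswith("#") or ".." in entry:
                continue
            lines.append(f"Disallow: {prefix}{entry.lstrip('/')}")
    return "\n".join(lines) + "\n"
```

With a file "~somebody/.disallow" containing "personal.gmi", the generated response would include the line "Disallow: /~somebody/personal.gmi".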

> I have pitched this idea before: I think a footer containing
> the license/rules under which a page can be distributed/cached
> is more sensible than robots.txt. This approach is:
> 
> * local to the page (no global /robots.txt)
> * persistent (survives being copied, mirrored & re-exported)
> * sound (one knows the conditions under which this can be redistributed)

What if my document is a binary file of some sort that I can not add a
footer to? The only ways to address this consistently for all document
types are to

a) Include the information in the response, *distinct* from its body
b) Provide the information in a sidecar file or sideband communication
   channel

-- 
Philip
Nick Thomas <gemini@ur.gs>
Details
Message ID
<9204cd8ffec11727b17764729e60e69be7b8517d.camel@ur.gs>
In-Reply-To
<C79XPGLWFE5E.3L6WKME61MN8U@stilgar> (view parent)
Hi,

On Sun, 2020-11-22 at 17:31 +0100, Solderpunk wrote:
> Hi folks,
> 
> There is now (finally!) an official reference on the use of
> robots.txt
> files in Geminispace.  Please see:
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Thanks for this. One change that I'd be interested in is adding a
statement that if there is no `robots.txt` for the site, we assume an
implicit disallow-all for all the virtual-agents except proxies.

Presumed consent, with opt-outs for the tiny minority of people who
have the time and mental space to work out how to get those opt-outs to
apply, is standard behaviour on the web, but it's not behaviour I like.
GitHub recently dumped code of mine into an arctic vault, for instance;
the archive.org snapshots of geminispace have similar dynamics. We can
do better by asking people to opt *in* to these kinds of things if they
want it, rather than to opt *out* if they don't.

I exclude Virtual-Agent: webproxy here because the likely use of such a
proxy is transient, rather than persistent. It seems odd to me that it
sits alongside indexing, archival, and research, all of which lead to
durable artifacts on success. It does complicate things a little to
treat it differently, though.

Thoughts? I appreciate this would impact on the ability of archivists
or researchers to capture geminispace, but I see that as a feature,
rather than an unfortunate side-effect :). 

/Nick
A. E. Spencer-Reed <easrng@gmail.com>
Details
Message ID
<CAEzvDCUbHEVujhRzMD9uGhNLCKa454K_GN=6ELOdjwNbrnn6SQ@mail.gmail.com>
In-Reply-To
<9204cd8ffec11727b17764729e60e69be7b8517d.camel@ur.gs> (view parent)
On Tue, Nov 24, 2020 at 6:42 AM Nick Thomas <gemini at ur.gs> wrote:

> Thoughts? I appreciate this would impact on the ability of archivists
> or researchers to capture geminispace, but I see that as a feature,
> rather than an unfortunate side-effect :).

I don't agree with archiving being disallowed by default. archive.org
and others have saved me so many times, I can't imagine why one would
not want an archive. If there is a reason I would much prefer an
opt-out system for it.
Why do you dislike archival?
James Tomasino <tomasino@lavabit.com>
Details
Message ID
<b0cde998-0ce5-9442-795a-ece30e89f18d@lavabit.com>
In-Reply-To
<CAEzvDCUbHEVujhRzMD9uGhNLCKa454K_GN=6ELOdjwNbrnn6SQ@mail.gmail.com> (view parent)
Just an FYI on the recent discussion around implied license for search engines and archival: These aren't rules baked into a spec, they're implications of the DMCA in the US and relevant case law, such as BLAKE A. FIELD vs GOOGLE (2006). The existence of a mechanism to disallow indexing was vital to that decision establishing implied license. Search engines, whether they be our lovely friend GUS or some future behemoth, can gather, index, and cache as they see fit because there is a mechanism for you to say no. That mechanism is the robots.txt and they have a strong case saying that the rules which govern it are already well established.

As much as I'd love to wave a magic wand and say, "it's all opt-in here" we don't really have any legal footing to do so.
Jason McBrayer <jmcbray@carcosa.net>
Details
Message ID
<87wnyasvb1.fsf@dorothy.carcosa.net>
In-Reply-To
<C7AB750HR1S5.12F8WY1S6S2PQ@taiga> (view parent)
"Drew DeVault" <sir at cmpwn.com> writes:

> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user.

I believe the concern is not that a web portal will archive pages, or
run on its own as an automated process, but that it will be used by a
third-party web bot (i.e., one not run by the owner of the portal) to
crawl Gemini sites and index them on the web.

> As the maintainer of such a web portal, I officially NACK any
> suggestion that it should obey robots.txt, and will not introduce such
> a feature.

It seems to me that the correct thing is for people that run web portals
to have a very strong robots.txt on /their/ web site, and additionally,
to be proactive about blocking web bots that don't observe robots.txt. I
think people want to block web portals in their Gemini robots.txt
because they don't trust web portal authors to do those two things. I
understand the feeling, but they're still trusting web portal authors to
obey robots.txt, which is honestly more work.

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |
Details
Message ID
<C7BJWJDU98AZ.AJ6OKBU5HUEO@taiga>
In-Reply-To
<87wnyasvb1.fsf@dorothy.carcosa.net> (view parent)
On Tue Nov 24, 2020 at 9:06 AM EST, Jason McBrayer wrote:
> I believe the concern is not that a web portal will archive pages, or
> run on its own as an automated process, but that it will be used by a
> third-party web bot (i.e., one not run by the owner of the portal) to
> crawl Gemini sites and index them on the web.

Aha, this is a much better point. One which should probably be addressed
in the robots.txt specification.

> It seems to me that the correct thing is for people that run web portals
> to have a very strong robots.txt on /their/ web site, and additionally,
> to be proactive about blocking web bots that don't observe robots.txt. I
> think people want to block web portals in their Gemini robots.txt
> because they don't trust web portal authors to do those two things. I
> understand the feeling, but they're still trusting web portal authors to
> obey robots.txt, which is honestly more work.

Web portals are users, plain and simple. Anyone who blocks a web portal
is blocking legitimate users who are engaging in legitimate activity.
This is a dick move and I won't stand up for anyone who does it.

However, the issue of web crawlers hitting geminispace through a web
portal is NOT that, and I'm glad you brought it up. I'm going to forbid
web crawlers from crawling my gemini portal.
James Tomasino <tomasino@lavabit.com>
Details
Message ID
<dff8f69a-2d26-015b-3a36-50076151249b@lavabit.com>
In-Reply-To
<CAEzvDCUbHEVujhRzMD9uGhNLCKa454K_GN=6ELOdjwNbrnn6SQ@mail.gmail.com> (view parent)
On 11/24/20 1:15 PM, A. E. Spencer-Reed wrote:
> On Tue, Nov 24, 2020 at 6:42 AM Nick Thomas <gemini at ur.gs> wrote:
> 
>> Thoughts? I appreciate this would impact on the ability of archivists
>> or researchers to capture geminispace, but I see that as a feature,
>> rather than an unfortunate side-effect :).
> I don't agree with archiving being disallowed by default. archive.org
> and others have saved me so many times, I can't imagine why one would
> not want an archive. If there is a reason I would much prefer an
> opt-out system for it.
> Why do you dislike archival?

Denying archival is already possible with robots.txt in its present form. We don't need to edit the spec for that either. If you want to avoid the internet archive you can use:

User-agent: ia_archiver
Disallow: /
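
As a quick sanity check, the two lines above can be fed to Python's standard-library parser. This is only a sketch: the path is made up, "some-indexer" is a hypothetical bot name, and `urllib.robotparser` was written for the web, so its matching may differ in detail from what a given Gemini crawler does.

```python
from urllib.robotparser import RobotFileParser

# The rule set from the message above, unchanged.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if `user_agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# "/log.gmi" is a hypothetical page on the capsule.
print("ia_archiver:", allowed("ia_archiver", "/log.gmi"))
print("some-indexer:", allowed("some-indexer", "/log.gmi"))
```

Only the named agent is excluded; any bot that does not match "ia_archiver" falls through to the default, which is to allow.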
Details
Message ID
<20201124151649.GA20449@localhost.localdomain>
In-Reply-To
<20201124123109.b08d27b9dbe0285829cd51b2@gmail.com> (view parent)
Hi

> How the server produces responses to robots.txt requests is an
> implementation detail. robots.txt can easily be implemented such that
> the server responds with access information provided by files in
> subdirectories. For example: a system directory corresponding to
> /~somebody/ contains a file named ".disallow" containing
> "personal.gmi". When the server builds a response to /robots.txt, it
> considers the content of all ".disallow" files and includes Disallow
> lines corresponding to their content. This way, individual users on a
> multi-user system can decide for themselves the access policy for their
> content without shared access to a canonical robots.txt.

Note that the apache people worry about just doing a
stat() for .htaccess along a path. This proposal requires an
opendir() for *every* directory in the exported hierarchy.

I concede that this isn't impossible - it is potentially expensive,
messy or nonstandard (and yes, there are inotify tricks or
serving the entire site out of a database, but that isn't a
common thing).
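
For concreteness, a minimal sketch of the quoted proposal might look like the following. The ".disallow" file name comes from the proposal; the function name and the choice of a single "User-agent: *" group are assumptions made for illustration. Note that `os.walk` visits every directory under the export root, which is exactly the cost being discussed.

```python
import os
import tempfile


def build_robots_txt(export_root: str) -> str:
    """Collect per-directory ".disallow" files into one robots.txt body."""
    lines = ["User-agent: *"]
    for dirpath, dirnames, filenames in os.walk(export_root):
        dirnames.sort()  # make the traversal order deterministic
        if ".disallow" not in filenames:
            continue
        rel = os.path.relpath(dirpath, export_root)
        prefix = "/" if rel == "." else "/" + rel.replace(os.sep, "/") + "/"
        with open(os.path.join(dirpath, ".disallow"), encoding="utf-8") as fh:
            for entry in fh:
                entry = entry.strip()
                if entry:  # one relative path per non-blank line
                    lines.append("Disallow: " + prefix + entry)
    return "\n".join(lines) + "\n"


# Demo with a throwaway tree: /~somebody/.disallow contains "personal.gmi".
with tempfile.TemporaryDirectory() as root:
    home = os.path.join(root, "~somebody")
    os.mkdir(home)
    with open(os.path.join(home, ".disallow"), "w", encoding="utf-8") as fh:
        fh.write("personal.gmi\n")
    demo_output = build_robots_txt(root)

print(demo_output, end="")
```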

> > I have pitched this idea before: I think a footer containing
> > the license/rules under which a page can be distributed/cached
> > is more sensible than robots.txt. This approach is:
> > 
> > * local to the page (no global /robots.txt)
> > * persistent (survives being copied, mirrored & re-exported)
> > * sound (one knows the conditions under which this can be redistributed)
> 
> What if my document is a binary file of some sort that I can not add a
> footer to? The only ways to address this consistently for all document
> types are to
> 
> a) Include the information in the response, *distinct* from its body
> b) Provide the information in a sidecar file or sideband communication
>    channel

So I think this is the interesting bit of the discussion -
the tradeoff of keeping this information inside the file or
in a sidechannel. You are of course correct that not every
file format permits embedding such information, and that
is the one side of the tradeoff.... the other side is
the argument for persistence - having the data in another
file (or in a protocol header) means that is likely to be
lost.

And my view is that caching/archiving/aggregating/protocol
translation all involve making copies, where a careless or
inconsiderate intermediate is likely to discard information
not embedded in the file. For instance, if a web frontend
serves gemini://example.org/private.gmi as
https://example.com/gemini/example.org/private.gmi
how good are the odds that this frontend fetches
gemini://example.org/robots.txt, rewrites the urls in there
from /private.gmi to /gemini/example.org/private.gmi and
merges it into its own /robots.txt ? And does it before
any crawler request is made... 
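
To make the rewriting step concrete, here is a rough sketch of what such a frontend would have to do for one capsule. The function name and the "/gemini" mount point are assumptions taken from the example URL above; fetching the upstream file and merging the groups of many capsules into the portal's own /robots.txt are left out.

```python
def rewrite_robots_for_portal(robots_txt: str, host: str, mount: str = "/gemini") -> str:
    """Prefix every Allow/Disallow path with the portal's path for `host`."""
    out = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        name = field.strip().lower()
        path = value.strip()
        if sep and name in ("allow", "disallow") and path.startswith("/"):
            # /private.gmi -> /gemini/example.org/private.gmi
            out.append(f"{field.strip()}: {mount}/{host}{path}")
        else:
            # User-agent lines, comments and blank lines pass through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


upstream = "User-agent: *\nDisallow: /private.gmi\n"
print(rewrite_robots_for_portal(upstream, "example.org"), end="")
```

Even this small step has to happen for every proxied host, and before the first web crawler arrives, which is the point being made about the odds.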

A pragmatist's argument: The web and geminispace are a graph
of links, and all the interior nodes have to be markup, so those
are covered, and they control the reachability - without
a link you can't get to the terminal/leaf node. And even if
this is bypassed (robots.txt isn't really a defence against hotlinking
either) most other terminal nodes are images or video, which typically have
ways of adding meta information (exif, etc).

regards

marc
Nick Thomas <gemini@ur.gs>
Details
Message ID
<c81a0b58f64239ddc35ef4c8f45fd0d01ca342aa.camel@ur.gs>
In-Reply-To
<CAEzvDCUbHEVujhRzMD9uGhNLCKa454K_GN=6ELOdjwNbrnn6SQ@mail.gmail.com> (view parent)
(I could be a lot better at using mailing lists. I think this message
was sent privately in error).

On Tue, 2020-11-24 at 08:15 -0500, A. E. Spencer-Reed wrote:
> Why do you dislike archival?

Thanks for weighing in!

In short, because the purposes to which the archive can be put, and the
motives of the archiver, are not clear at time of robots.txt-mediated
archival.

For myself, I'm happy with some types of archival, and not happy with
some other types. Some people would be happy to be included in every
archive going; others, in none of them. Given this variability, we must
take a stance on what to assume if robots.txt isn't present. I also
don't think this variability is amenable to capture with more
fine-grained virtual agents.

The current internet-draft for robots.txt says, in 2.2.1:

>  If no group satisfies either condition, or no groups are present at
> all, no rules apply.

( https://tools.ietf.org/html/draft-koster-rep-00 )

This is pretty standard on the Web and, entirely coincidentally, a huge
boon to Google et al. Importing robots.txt the way we do in the
companion specification also imports this line.

However, unlike the Web, Gemini "takes user privacy very seriously".
Archives *can* be injurious to user privacy - if you need convincing on
this point, there are a range of cases and examples around GDPR "right
to be forgotten" stuff. From my perspective, Gemini is importing a line
from the internet-draft that is directly contrary to its mission.

Combining Gemini's mission with that realisation means that if no
statement has been made about whether the given user (server operator
in this specific case) is OK with their content being archived, the
presumption should be that they are not OK with it. We should value
user privacy above archiver convenience.

In effect, we add a second exception to the protocol that amends 2.2.1
to end "if no rules are specified, this robots.txt file MUST be
assumed".

On a practical level, being excluded from search engines by-default
drives the discoverability of robots.txt, and server software could
easily include flags like --permit-indexing or --permit-archival to
streamline that discoverability. I don't think that opt-in rates would
be similar to current opt-out rates on the Web.

/Nick
Details
Message ID
<a1773196-446b-8984-da84-3908e4a37fa7@qwertqwefsday.eu>
In-Reply-To
<c81a0b58f64239ddc35ef4c8f45fd0d01ca342aa.camel@ur.gs> (view parent)
Nick Thomas wrote:
> I don't think that opt-in rates would be similar to current opt-out rates
> on the Web.
This can probably be summed up with one question:
Why do we want a robots.txt in the first place? After all, if there were no
reasons against archival et al., we would not need a robots.txt at all. And
IMHO this also is the reason why it should rather be an opt-in system.

-- 
You can verify the digital signature on this email with the public key available
through web key discovery.
Nick Thomas <gemini@ur.gs>
Details
Message ID
<88e2d72e36fe044250c6bb8b28fb901d0de57af6.camel@ur.gs>
In-Reply-To
<b0cde998-0ce5-9442-795a-ece30e89f18d@lavabit.com> (view parent)
On Tue, 2020-11-24 at 13:31 +0000, James Tomasino wrote:
> 
> As much as I'd love to wave a magic wand and say, "it's all opt-in
> here" we don't really have any legal footing to do so.
> 

James and I talked a bit more about this one on IRC. Key to this
argument, AIUI, is how robots.txt (or the lack of it) is treated for
FTP, which lacks any mention of it in the spec but has apparently been
given weight in DMCA-related rulings involving it.

I'm not sure I agree with the reasoning, which goes something like "the
robots.txt Internet-Draft is already de-jure part of Gemini, and we
can't change that", but IANAL ^^. In particular, I've been thinking
about this almost entirely in GDPR terms so far, and have a bunch of
DMCA-related reading to do now.

In the event that it *is* accurate, we talked about an alternative way
to implement the functionality.  Rather than having the gemini
robots.txt spec say "if the client doesn't receive a robots.txt, it
must assume this one", the *server* could be made to return a defined
robots.txt response body if it would otherwise issue a 51 response to
`/robots.txt`

(51 may be too specific, it could be 5x, but I don't *think* it would
be appropriate in response to 4x responses, which crawlers would be
expected to retry).

Of course, any server could do that already today, so the ask is to put
a recommendation about it into "server best practice", perhaps
incorporating the `--permit-indexing` and `--permit-archiving` flags I
talked about in another post.

Another advantage of this approach is that it becomes opaque to crawler
authors whether the user has explicitly selected a preference or not.
I'm also inclined to trust server implementors over crawler
implementors.

/Nick

p.s. there was also some question as to whether someone hosting gemini
content was a "gemini user", in the way we use that term on the project
homepage. To me, it seems like a reasonable extrapolation, but perhaps
it's a topic that deserves more debate or clarification.
James Tomasino <tomasino@lavabit.com>
Details
Message ID
<27fb8ce5-6a52-a105-7740-be04cc4ba3c1@lavabit.com>
In-Reply-To
<88e2d72e36fe044250c6bb8b28fb901d0de57af6.camel@ur.gs> (view parent)
On 11/24/20 5:12 PM, Nick Thomas wrote:
> On Tue, 2020-11-24 at 13:31 +0000, James Tomasino wrote:
>> As much as I'd love to wave a magic wand and say, "it's all opt-in
>> here" we don't really have any legal footing to do so.
>>
> James and I talked a bit more about this one on IRC. Key to this
> argument, AIUI, is how robots.txt (or the lack of it) is treated for
> FTP, which lacks any mention of it in the spec but has apparently been
> given weight in DMCA-related rulings involving it.
> 
> I'm not sure I agree with the reasoning, which goes something like "the
> robots.txt Internet-Draft is already de-jure part of Gemini, and we
> can't change that", but IANAL ^^. In particular, I've been thinking
> about this almost entirely in GDPR terms so far, and have a bunch of
> DMCA-related reading to do now.

In addition to FTP, gopher adopted the robots.txt standard almost immediately:

https://groups.google.com/g/comp.internet.net-happenings/c/Iv8ylGxvoh8?pli=1

You can read the IETF spec for the Robots Exclusion Protocol here:
https://tools.ietf.org/html/draft-rep-wg-topic-00

As you'll note in "2.3.  Access method", their documentation isn't scheme specific and they even list FTP as a valid option.

This is the document that will be used in court by anyone defending an indexer and any exclusion you want to obtain for Gemini would need to happen there. Having a contradictory statement in the Gemini spec will not stand up against the history and precedent of this one.

If you want to implement stronger protections in Gemini then I'd suggest adding a note in the best-practices document for server creators to (as Nick suggested) serve a robots.txt if no such file exists with the contents:

User-agent: *
Disallow: /

That achieves your aim of block-by-default and the opt-in would be the creation of a robots.txt file of your own.
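
A minimal sketch of that server-side fallback, assuming a file-backed server with a single document root. The function name is hypothetical, and a real server would also need to handle permissions, MIME parameters and so on.

```python
import os
import tempfile

# Served only when the capsule has no robots.txt of its own.
DEFAULT_ROBOTS = b"User-agent: *\nDisallow: /\n"


def robots_response(docroot: str) -> bytes:
    """Return a complete Gemini response for a request to /robots.txt."""
    try:
        with open(os.path.join(docroot, "robots.txt"), "rb") as fh:
            body = fh.read()
    except FileNotFoundError:
        # Instead of answering "51 Not Found", fall back to disallow-all.
        body = DEFAULT_ROBOTS
    # Gemini response header: <status> <meta> CRLF, followed by the body.
    return b"20 text/plain\r\n" + body


with tempfile.TemporaryDirectory() as docroot:
    fallback = robots_response(docroot)  # no file yet
    with open(os.path.join(docroot, "robots.txt"), "wb") as fh:
        fh.write(b"User-agent: *\nDisallow: /private/\n")
    explicit = robots_response(docroot)  # operator has opted in to rules of their own

print(fallback)
print(explicit)
```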
Details
Message ID
<C7BQ6YBE90F9.AS7TK1X4E9BF@stilgar>
In-Reply-To
<C7BJWJDU98AZ.AJ6OKBU5HUEO@taiga> (view parent)
On Tue Nov 24, 2020 at 3:07 PM CET, Drew DeVault wrote:

> Web portals are users, plain and simple. Anyone who blocks a web portal
> is blocking legitimate users who are engaging in legitimate activity.
> This is a dick move and I won't stand up for anyone who does it.

This has actually long been a bit of a contentious point in the
Gopherverse, and we have inherited a bit of the controversy, if I
remember much earlier discussions accurately.  There are some people
(a vocal minority?  I'm not sure), who feel that public web proxies
exposing their Gopherhole/capsule to the entire browser-using world are
negating the agency they exercised in very deliberately putting some
content up only on Gopher/Gemini and not the web.  Web proxies force
them to be visible in (and linkable from) a space that they are actively
trying not to participate in.

While I am aware of the ultimate futility of trying to control where
publicly served online content ends up, I have some sympathy for this
perspective (perhaps even more so now that we have very nice tools like
your own Kineto by which people who *do* want their content to be
accessible from a browser can achieve this easily).  When the first web
portals for Gemini turned up, some people expressed interest in being
able to opt out, to keep their Gemini-only content truly Gemini-only,
and at least one of those early web portals (portal.mozz.us) agreed to
respect those wishes.  The webproxy user agent I put into the first
robots.txt draft is actually just codifying what portal.mozz.us has
already been doing for many months.  I did not expect its inclusion to
be so controversial.  I *did* try to word it carefully so that personal
webproxies which, e.g. run on a user's local machine and are not
publicly accessible need not abide by robots.txt, as those are really
just roundabout Gemini clients.
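
As a rough illustration of that distinction, a portal's decision could be sketched as below. The function and parameter names are hypothetical, and Python's `urllib.robotparser` is only standing in for whatever parser a portal actually uses.

```python
from urllib.robotparser import RobotFileParser


def portal_may_serve(robots_txt: str, path: str, publicly_accessible: bool) -> bool:
    """Decide whether a web portal should relay `path` from a capsule."""
    if not publicly_accessible:
        # A personal proxy on the user's own machine is treated as an
        # ordinary Gemini client and does not consult robots.txt.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Public portals identify themselves with the "webproxy" user agent.
    return parser.can_fetch("webproxy", path)


CAPSULE_ROBOTS = "User-agent: webproxy\nDisallow: /\n"

print(portal_may_serve(CAPSULE_ROBOTS, "/diary.gmi", publicly_accessible=True))
print(portal_may_serve(CAPSULE_ROBOTS, "/diary.gmi", publicly_accessible=False))
print(portal_may_serve("", "/diary.gmi", publicly_accessible=True))
```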

Cheers,
Solderpunk

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Robert khuxkm Miles <khuxkm@tilde.team>
Details
Message ID
<a75cf3477ad8e01859ea55cb1c87d937@tilde.team>
In-Reply-To
<88e2d72e36fe044250c6bb8b28fb901d0de57af6.camel@ur.gs> (view parent)
I am personally against this idea of forcing (or even normalizing) browsers giving special treatment to a request for a URL based on what the server would normally respond (I'm not even going to entertain the idea of pretending the internet draft doesn't apply to us). This is what I assume it would look like in spec (or best practices, or wherever you want to put it):

> When a client makes a request for a URI with a path component of "/robots.txt", and the server would normally respond to such a request with a 51 Not Found status code, it should instead respond with a 20 status code, a MIME type of text/plain, and content of "User-Agent: *\r\nDisallow: /\r\n". This prevents capsules from being indexed without consent.

Doesn't that just *feel* like a hack to you?

I did some research with GUS's known-hosts list. Of the 362 hosts known to GUS, only 36 have a robots.txt file, so any choice made as to what the default robots.txt should be will affect around 90% of Geminispace (not to mention any new hosts to come). Notably, of the 36 hosts to impose a robots.txt, 7 of them completely block archiving (although that number is skewed, as I know that at least 3 of those hosts are run by the same person, and 2 of those hosts are run by another person). This means that the share of hosts that don't want to be archived could be anywhere from 2% (all of the hosts who don't have a robots.txt are fine with being archived) to 20% (the sample of people who have robots.txt is representative of the whole population), or even 91% (everybody without a robots.txt doesn't want to be archived). I don't feel comfortable making a declaration either way, but this is food for thought.
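
The three scenarios can be recomputed from the counts quoted above. This sketch takes the counts at face value, without de-duplicating hosts by operator, so the exact values differ slightly from the rounded figures in the message.

```python
hosts_total = 362        # hosts known to GUS
hosts_with_robots = 36   # hosts that serve a robots.txt
hosts_blocking = 7       # of those, hosts that block archiving entirely

hosts_without_robots = hosts_total - hosts_with_robots

# Share of Geminispace affected by whatever the default is.
affected = hosts_without_robots / hosts_total

# Lower bound: everyone without a robots.txt is fine with archiving.
lower = hosts_blocking / hosts_total

# Middle case: hosts with a robots.txt are representative of all hosts.
representative = hosts_blocking / hosts_with_robots

# Upper bound: everyone without a robots.txt objects to archiving.
upper = (hosts_blocking + hosts_without_robots) / hosts_total

for label, share in [
    ("affected by the default", affected),
    ("lower bound", lower),
    ("representative sample", representative),
    ("upper bound", upper),
]:
    print(f"{label}: {share:.1%}")
```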

Just my two cents,
Robert "khuxkm" Miles

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Nick Thomas <gemini@ur.gs>
Details
Message ID
<9cea8f8a58cf006ec109b235e5a891aebb68681b.camel@ur.gs>
In-Reply-To
<a75cf3477ad8e01859ea55cb1c87d937@tilde.team> (view parent)
On Tue, 2020-11-24 at 19:08 +0000, Robert "khuxkm" Miles wrote:

> Doesn't that just *feel* like a hack to you?

It definitely feels hackish when worded like this :).

The precise technical form is secondary to the outcome (as I see it) of
protecting users from a privacy-hostile default in the robots.txt
specification. I appreciate that you're currently an opt-out, rather
than opt-in, advocate, but I'd still appreciate any ideas you have to
make it nicer *if* gemini ends up going for opt-in.

An alternative form that just came to mind is a server implementation
recommendation like this:

```
Geminispace crawlers use the /robots.txt request path to determine
whether a capsule can be accessed for archival, indexing, research, and
other purposes. This can have privacy implications for the user, so
servers should not start unless they have an explicit signal on how to
handle requests to the /robots.txt path.

For example, this signal may be the availability of any content for the
/robots.txt path, a user-added database entry indicating that the path
should receive a 5x response, or a non-default configuration parameter
specifying that it's OK to skip the check.

If no such signal is present, the server should emit an error message
and either exit immediately, or allow the user to specify how the path
should be handled.
```

As a new server operator with no idea about `robots.txt`, I'd run, say:

```
$ agate [::]:1995 mysite cert.pem key.rsa ur.gs

No robots.txt file present! Please create mysite/robots.txt, or re-run
Agate with --permit-robots to allow your content to be archived,
indexed, or otherwise used by automated crawlers of Geminispace
```

*I'd* think "hang on, I don't want my content to be archived" and go
off to learn about this robots.txt thing; others might shrug and just
add the --permit-robots flag. 
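
A sketch of how a server might implement that startup check. This is not Agate's actual behaviour or command line: the `--permit-robots` flag is the hypothetical one from the example above, and the argument layout is simplified to a single document root.

```python
import argparse
import os
import sys
import tempfile


def main(argv: list) -> int:
    """Refuse to start unless there is an explicit signal about robots.txt."""
    parser = argparse.ArgumentParser(description="hypothetical Gemini server")
    parser.add_argument("docroot")
    parser.add_argument("--permit-robots", action="store_true")
    args = parser.parse_args(argv)

    has_file = os.path.isfile(os.path.join(args.docroot, "robots.txt"))
    if not has_file and not args.permit_robots:
        print(
            f"No robots.txt file present! Please create {args.docroot}/robots.txt, "
            "or re-run with --permit-robots to allow automated crawlers.",
            file=sys.stderr,
        )
        return 1
    # A real server would start listening here.
    return 0


with tempfile.TemporaryDirectory() as docroot:
    results = [main([docroot]), main([docroot, "--permit-robots"])]
    open(os.path.join(docroot, "robots.txt"), "w").close()  # operator adds a file
    results.append(main([docroot]))

print(results)
```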

>  Of the 362 hosts known to GUS, only 36 have a robots.txt file, so
> any choice made as to what the default robots.txt should be will
> affect around 90% of Geminispace 

Thanks for running the numbers on this. I agree with everything you
said based on them. That any change affects such a large proportion of
existing geminispace is especially worth emphasising.

/Nick

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

John Cowan <cowan@ccil.org>
Details
Message ID
<CAD2gp_RFkZTho34Vad3KnNROxCwqbaKj9NBWt-ceBRjd-ubQxQ@mail.gmail.com>
In-Reply-To
<9cea8f8a58cf006ec109b235e5a891aebb68681b.camel@ur.gs> (view parent)
On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote:


> >  Of the 362 hosts known to GUS, only 36 have a robots.txt file, so
> > any choice made as to what the default robots.txt should be will
> > affect around 90% of Geminispace
>
> Thanks for running the numbers on this. I agree with everything you
> said based on them. That any change affects such a large proportion of
> existing geminispace is especially worth emphasising.
>

Why is that a Good Thing?  It's another piece of bureaucracy: 90% of hosts
were happy to be archived before, so now they have to write a robots.txt
file.  Although small for any one server operator, it is large when
multiplied by the number of servers there *will be*.  "Small Internet" does
not mean "Internet with only a few servers", AFAIK.

Two things about the Internet Archive:

1) It is a U.S. public library, which gives it special rights when it comes
to making copies.

2) Though it does not respect robots.txt, it is happy to make your content
invisible to archive users by informal request (or, of course, by a DMCA
takedown notice).



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent
six five six five six five six five and six
   --blazoning the U.S. flag <http://web.meson.org/blazonserver>

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

James Tomasino <tomasino@lavabit.com>
Details
Message ID
<e6706f44-88c1-91c6-f705-6528d4d7a1ee@lavabit.com>
In-Reply-To
<CAD2gp_RFkZTho34Vad3KnNROxCwqbaKj9NBWt-ceBRjd-ubQxQ@mail.gmail.com> (view parent)
On 11/24/20 11:44 PM, John Cowan wrote:
> 2) Though it does not respect robots.txt, it is happy to make your
> content invisible to archive users by informal request (or, of course,
> by a DMCA takedown notice).

The Internet Archive does respect robots.txt, though they're not happy
about it and have written on the subject a few times. I included a
snippet in an earlier email with their user-agent.

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Nick Thomas <gemini@ur.gs>
Details
Message ID
<cfd687d61a8c3a685c4e74682403918404135710.camel@ur.gs>
In-Reply-To
<CAD2gp_RFkZTho34Vad3KnNROxCwqbaKj9NBWt-ceBRjd-ubQxQ@mail.gmail.com> (view parent)
On Tue, 2020-11-24 at 18:44 -0500, John Cowan wrote:
> On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote:
> 
> > Thanks for running the numbers on this. I agree with everything you
> > said based on them. That any change affects such a large proportion
> > of
> > existing geminispace is especially worth emphasising.
> > 
> 
> Why is that a Good Thing? 

I very intentionally *didn't* say it was a good thing :). There are
many ways to interpret the data, but I'm still glad we have it.

> It's another piece of bureaucracy: 90% of hosts
> were happy to be archived before

You're presuming consent here. We don't actually *know* that said 90%
of hosts are happy to be archived; we only know that 90% of hosts
haven't included a robots.txt file, which could be for any one of a
multitude of reasons.

*If* a not-insignificant proportion of those hosts without robots.txt
files would actually prefer not to be included in archives when asked,
the current situation is not serving their privacy well, and gemini is
supposed to be protective of user privacy. *If* an overwhelming majority
of them simply don't care, then sure, the argument for it starts to
look a bit niche. Talking in IRC earlier today, I hand-waved a 5%
threshold for the first condition and 1% for the second.

A personal example: *I* didn't have a robots.txt file on my capsule
until today, but I don't want to be included in archives for various
reasons. Presuming consent from the lack of a robots.txt file would
have incorrectly guessed my preference, and harmed my privacy. Who else
in that 90% is like me? We don't know.

> so now they have to write a robots.txt
> file.  Although small for any one server operator, it is large when
> multiplied by the number of servers there *will be*.  "Small
> Internet" does
> not mean "Internet with only a few servers", AFAIK.

Yes, there is a convenience/privacy trade-off here. I interpret
gemini's mission to favour privacy over convenience when the two come
into conflict.

> Two things about the Internet Archive:
> 
> 1) It is a U.S. public library, which gives it special rights when it
> comes
> to making copies.

Certainly true, and there will be cases where, even when you do have
a wonderfully hand-crafted robots.txt file like the one I made today, an
archiver determines that they can legally scrape you anyway. Others
will scrape illegally, whether through malice or ignorance.

Meanwhile, Google, the Internet Archive, and a bunch of other people
respect robots.txt even when they might not be legally *required* to
via GDPR-like provisions. A control doesn't have to be perfect to be
desirable. This argument comes up in the context of "right to be
forgotten" quite a lot ^^.

> 2) Though it does not respect robots.txt, it is happy to make your
> content
> invisible to archive users by informal request (or, of course, by a
> DMCA
> takedown notice).

As I understand it, archive.org does respect robots.txt in general, but
has exceptions for certain sites it's identified it has a public
interest justification for. That includes the US military, but probably
doesn't include any currently-existing gemini site.

/Nick
Details
Message ID
<20201125191537.ed5f74098e7469e28b39efa6@gmail.com>
In-Reply-To
<20201124151649.GA20449@localhost.localdomain> (view parent)
On Tue, 24 Nov 2020 16:16:49 +0100
marc <marcx2 at welz.org.za> wrote:

> Note that the apache people worry about just doing a
> stat() for .htaccess along a path. This proposal requires an
> opendir() for *every* directory in the exported hierarchy.

Apache is designed to be able to serve large enterprises with high
request loads. The cause for their concern seems unlikely to apply to
multi-user Gemini hosts.

> I concede that this isn't impossible - it is potentially expensive,
> messy or nonstandard (and yes, there are inotify tricks or
> serving the entire site out of a database, but that isn't a
> common thing).

It's very much a matter of implementation. For example, if high
performance is a concern you can regenerate the information once per
minute rather than on a per-request basis, or on request from the users,
via a Gemini endpoint.

That is, however, a good argument for an Allow directive to complement
Disallow: one could disallow by default and explicitly allow only
resources lower down in the hierarchy. That permits a "better safe than
sorry" approach that "prevents" a crawler from picking up resources
before the new robot rules have been picked up.
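
To illustrate what disallow-by-default with explicit Allow lines would look like, here is a sketch using Python's `urllib.robotparser`. The paths and the "indexer" agent name are made up, and Allow is not something every Gemini crawler can be assumed to understand. Python's parser applies the first matching rule, so the Allow line is listed before the catch-all Disallow.

```python
from urllib.robotparser import RobotFileParser

# Everything is off limits except what is explicitly allowed.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch `path`."""
    return parser.can_fetch("indexer", path)


print("/public/index.gmi:", allowed("/public/index.gmi"))
print("/drafts/notes.gmi:", allowed("/drafts/notes.gmi"))
```

With this layout, a resource that appears before the generated rules are refreshed stays blocked until someone allows it.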

> So I think this is the interesting bit of the discussion -
> the tradeoff of keeping this information inside the file or
> in a sidechannel. You are of course correct that not every
> file format permits embedding such information, and that
> is the one side of the tradeoff.... the other side is
> the argument for persistence - having the data in another
> file (or in a protocol header) means that is likely to be
> lost.

What you're proposing is doubly effective in that data that isn't there
*can't* be lost! :)

I appreciate your point, but "not every file format" is an
understatement. It's really only one file format that is controlled by
the Gemini spec right now: text/gemini. That's where we could add such
information and define it to be meaningful.

> And my view is that caching/archiving/aggregating/protocol
> translation all involve making copies, where a careless or
> inconsiderate intermediate is likely to discard information
> not embedded in the file.

A careless or inconsiderate intermediate is likely to discard
information, full stop. It's only respectful and considerate robots
that will recognize either approach.

> For instance, if a web frontend
> serves gemini://example.org/private.gmi as
> https://example.com/gemini/example.org/private.gmi
> how good are the odds that this frontend fetches
> gemini://example.org/robots.txt, rewrites the urls in there
> from /private.gmi to /gemini/example.org/private.gmi and
> merges it into its own /robots.txt ? And does it before
> any crawler request is made...

On the other hand, how likely is it that a web crawler will interpret
robot instructions from text/gemini-turned-html documents?

> A pragmatist's argument: The web and geminispace are a graph
> of links, and all the interior nodes have to be markup, so those
> are covered, and they control the reachability - without
> a link you can't get to the terminal/leaf node. And even if
> this is bypassed (robots.txt isn't really a defence against hotlinking
> either) most other terminal nodes are images or video, which typically have
> ways of adding meta information (exif, etc).

Do you propose to standardize extensions to Exif/ID3/Vorbis
comments/PDF metadata etc. as well as text/gemini? None of these
currently has a standard way to specify a robots policy; it seems
understood that it's not a concern of the file itself whether a crawler
should be able to download it if the file is ever served over a
crawlable graph.

Hotlinking is a different concern altogether. The purpose of robots.txt
is not to disallow hotlinking.

-- 
Philip

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Details
Message ID
<5FBF8055.4050204@marmaladefoo.com>
In-Reply-To
<cfd687d61a8c3a685c4e74682403918404135710.camel@ur.gs> (view parent)

On 25-Nov-2020 00:18, Nick Thomas wrote:
>
> You're presuming consent here. We don't actually *know* that said 90%
> of hosts are happy to be archived; we only know that 90% of hosts
> haven't included a robots.txt file, which could be for any one of a
> multitude of reasons.
>
> *If* a not-insignificant proportion of those hosts without robots.txt
> files would actually prefer not to be included in archives when asked,
> the current situation is not serving their privacy well, and gemini is
> supposed to be protective of user privacy. *If* an overwhelming majority
> of them simply don't care, then sure, the argument for it starts to
> look a bit niche. Talking in IRC earlier today, I hand-waved a 5%
> threshold for the first condition and 1% for the second.
>
> A personal example: *I* didn't have a robots.txt file on my capsule
> until today, but I don't want to be included in archives for various
> reasons. Presuming consent from the lack of a robots.txt file would
> have incorrectly guessed my preference, and harmed my privacy. Who else
> in that 90% is like me? We don't know.
>
Hello all

Personally, I'm not really that interested in the legal arguments back 
and forth about archiving and access. Yes there are some legal case 
precedents in this area in some jurisdictions, but I would say that by 
and large that ship has sailed. Sorry about that folks. The web is the 
de-facto baseline reference in this respect, whether we like it or not.

If you *publish* information on the internet, there *will* be actors who 
will re-purpose it. Gemini is no different to the web in this.

If any of us have information that is to be preserved as private, I 
cannot see how you can expect that to be achieved if you publish on the 
public internet (i.e. servers that do not require authentication). If 
you want to hide something, use authentication or a private channel.

Yes, there is robots.txt, which is an opt-out mechanism from general
robot access to a server's content. It is established practice and good
actors will respect it. But it cannot be a mechanism to preserve privacy.

My take on the whole "Gemini preserves privacy better" claim is that it
is really about clients. We don't have extended headers, cookies or agent
names in requests. So to that extent, client privacy is maintained better
than on the web, where the expectation is of long-term, cross-session
tracking. Thankfully, we don't have that.
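
To make the point about requests concrete, here is a minimal sketch in
Python. It assumes the request format from the Gemini specification: one
line holding the absolute URL (at most 1024 bytes, UTF-8) followed by
CRLF. The function name and the example.org URL are made up for
illustration.

```python
def build_gemini_request(url: str) -> bytes:
    """Return the complete request a client sends for `url`.

    The whole request is the URL plus CRLF. There is no place to put
    headers, cookies or a user-agent string.
    """
    encoded = url.encode("utf-8")
    if len(encoded) > 1024:
        raise ValueError("Gemini request URLs are limited to 1024 bytes")
    return encoded + b"\r\n"


if __name__ == "__main__":
    # Hypothetical URL, used only to show the shape of a request.
    print(build_gemini_request("gemini://example.org/docs/"))
```

Everything the server learns from the request itself is the URL being
asked for (plus whatever the TLS connection reveals, such as the IP
address or an optional client certificate).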

I don't see it as Gemini's role to attempt to set a cultural/legal
privacy framework for those who choose to publish on Gemini. We
cannot imagine we can break new ground in this respect. We can, however,
make an effort to have privacy as a side effect of technical design in the
protocol itself, and within the Gemini community we can look out for
risks in exposing such personal information via the protocol.

If Gemini ever becomes interesting enough to the outside world that some 
case goes to court (what a publicity success that would be!), surely the 
existing infrastructure of public server hypertext systems, namely the 
web, will be the established precedent.

So I support use of robots.txt, but if none exists, the presumption,
as on the web, is that access and usage are allowed. If some actor
doesn't follow a server's robots.txt, I'm sad about it, but we should
ultimately expect it.
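
As a rough sketch of what that presumption means for a bot author, the
Python below treats a missing robots.txt as "allowed" by default. It is
not taken from any existing crawler: the function name and the
`default_allow` parameter are hypothetical, and the parsing is an
assumption (only `User-agent: *` groups and `Disallow:` lines, matched as
plain path prefixes). Passing `default_allow=False` models the opt-in
position argued elsewhere in this thread.

```python
from typing import Optional


def may_fetch(robots_txt: Optional[str], path: str,
              default_allow: bool = True) -> bool:
    """Decide whether a bot may fetch `path` on a host.

    `robots_txt` is the body of the host's robots.txt, or None if the
    host does not serve one.
    """
    if robots_txt is None:
        # No robots.txt at all: fall back to the presumed policy.
        return default_allow

    applies = False          # does the current rule group cover all bots?
    in_agent_lines = False   # inside a run of consecutive User-agent lines?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            matches = value == "*"
            applies = (applies or matches) if in_agent_lines else matches
            in_agent_lines = True
        else:
            in_agent_lines = False
            if (field.lower() == "disallow" and applies
                    and value and path.startswith(value)):
                return False
    return True


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private\n"
    print(may_fetch(rules, "/private/notes.gmi"))
    print(may_fetch(None, "/index.gmi"))
    print(may_fetch(None, "/index.gmi", default_allow=False))
```

The only line that changes between the opt-out and opt-in positions is
the fallback for a missing file; everything else is ordinary rule
matching.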

  - Luke

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Details
Message ID
<_MlwWaJv1RqLJ_aLuOlJo04ZyexXidF00axz4dxtIks9dd3iBRfZkJ4OGs8r3JKk1-IBBVMaSF0JqA3FFLWAKAyIogdOsgy3LF0D9aQMvhA=@protonmail.com>
In-Reply-To
<5FBF8055.4050204@marmaladefoo.com> (view parent)
DKIM signature
missing
My arguments weren't just about privacy. They were also about copyright.
Sharing on the internet is fine, but copyright still applies.

Secondly, you can share something for free online for a short period of time and
then remove it after that time limit. This was done with a lot of books during a
portion of the Covid pandemic we are in. To say that archives should be able to
permanently cache this without explicit permission makes no logical sense.

Anyway, back to my original argument: caching should be opt-in. It makes the most sense.
*Granting permission* to use, modify, or distribute something should be opt-in, not opt-out.

Christian Seibold

------- Original Message -------

On Thursday, November 26th, 2020 at 4:15 AM, Luke Emmet <luke at marmaladefoo.com> wrote:

> On 25-Nov-2020 00:18, Nick Thomas wrote:
>
> > You're presuming consent here. We don't actually know that said 90%
> > of hosts are happy to be archived; we only know that 90% of hosts
> > haven't included a robots.txt file, which could be for any one of a
> > multitude of reasons.
> >
> > If a not-insignificant proportion of those hosts without robots.txt
> > files would actually prefer not to be included in archives when asked,
> > the current situation is not serving their privacy well, and gemini is
> > supposed to be protective of user privacy. If an overwhelming majority
> > of them simply don't care, then sure, the argument for it starts to
> > look a bit niche. Talking in IRC earlier today, I hand-waved a 5%
> > threshold for the first condition and 1% for the second.
> >
> > A personal example: I didn't have a robots.txt file on my capsule
> > until today, but I don't want to be included in archives for various
> > reasons. Presuming consent from the lack of a robots.txt file would
> > have incorrectly guessed my preference, and harmed my privacy. Who else
> > in that 90% is like me? We don't know.
>
> Hello all
>
> Personally, I'm not really that interested in the legal arguments back
> and forth about archiving and access. Yes, there are some legal case
> precedents in this area in some jurisdictions, but I would say that by
> and large that ship has sailed. Sorry about that folks. The web is the
> de-facto baseline reference in this respect, whether we like it or not.
>
> If you publish information on the internet, there will be actors who
> will re-purpose it. Gemini is no different to the web in this.
>
> If any of us have information that is to be preserved as private, I
> cannot see how you can expect that to be achieved if you publish on the
> public internet (i.e. servers that do not require authentication). If
> you want to hide something, use authentication or a private channel.
>
> Yes, there is robots.txt, which is an opt-out mechanism from general
> robot access to a server's content. It is established practice and good
> actors will respect it. But it cannot be a mechanism to preserve privacy.
>
> My take on the whole "Gemini preserves privacy better" is really about
> clients. We don't have extended headers, cookies or agent names in
> requests. So to that extent, client privacy is maintained better than
> on the web, where the expectation is of long term, cross-session
> tracking. Thankfully, we don't have that.
>
> I don't see it as Gemini's role to attempt to set a cultural/legal
> privacy framework for servers who are choosing to publish on Gemini. We
> cannot imagine we can break new ground in this respect. We can, however,
> make an effort to have this as a side effect of technical design in the
> protocol itself, and within the Gemini community we can look out for
> risks in exposing such personal information via the protocol.
>
> If Gemini ever becomes interesting enough to the outside world that some
> case goes to court (what a publicity success that would be!), surely the
> existing infrastructure of public server hypertext systems, namely the
> web, will be the established precedent.
>
> So I support use of robots.txt, but if none exists, the presumption -
> like the web - is that access and usage is allowed. If some actor
> doesn't follow a server's robots.txt, I'm sad about it, but we should
> ultimately expect it.
>
>   - Luke

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Details
Message ID
<20201126162440.GA27406@localhost.localdomain>
In-Reply-To
<5FBF8055.4050204@marmaladefoo.com> (view parent)
DKIM signature
missing
Hi

> I don't see it as Gemini's role to attempt to set a cultural/legal privacy
> framework for servers who are choosing to publish on Gemini. We cannot
> imagine we can break new ground in this respect.

That seems ... rather defeatist. 

Alasdair Gray provides an inspirational quote for a situation like this:

  "Work as if you live in the early days of a better nation"

(apparently later he wanted to say world, but nation had stuck...)

Gemini is still a young project, where a different culture
and nicer norms could be established...

Long ago, before the web, when the internet was young,
somebody grabbed the jokes from rec.humor.funny (I think; it might
have been another newsgroup) and published them in book form. Some
posters were outraged at the copyright violation, others
flattered.

Had the individual posters just had a way of telling us
how their material could have been re-used, there would
have been no controversy, and maybe this would have
laid the groundwork for a different way of aggregating
online material, with internet editors neatly assembling
"best-ofs" or "my conversations-with-..." and people
optimising their comments for quotability or adding
footnotes and expansions to posts they were keen to
improve... instead of just feeble likes.

TLDR: I can imagine it. 

regards

marc

Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Details
Message ID
<5FBFE383.10608@marmaladefoo.com>
In-Reply-To
<20201126162440.GA27406@localhost.localdomain> (view parent)
DKIM signature
missing

On 26-Nov-2020 16:24, marc wrote:
>> I don't see it as Gemini's role to attempt to set a cultural/legal privacy
>> framework for servers who are choosing to publish on Gemini. We cannot
>> imagine we can break new ground in this respect.
> That seems ... rather defeatist.
>
> Alasdair Gray provides an inspirational quote for a situation like this:
>
>    "Work as if you live in the early days of a better nation"

Well, I wasn't expecting to have my Utopian credentials questioned ;-)

After all, I am a proponent of Gemini like everyone else here, pushing 
against the flow.

But it's true I'm probably towards the pragmatic end of the scale, and I
like to see people discussing subjects I find to be productive. Trying
to establish alternative IPR legal precedents, contrary to the flow of
what happens on the web, seems like a lot of work to me, and we can spin a
lot of cycles doing so. But if it rings your bell, by all means continue.

I'm all for building a nice culture among the gemini-folk, but wider 
cultural changes happen slowly in my experience.

  - Luke