~natpen/gus


[PATCH] correctly handle robots.txt

Message-ID: <20210212070534.14511-1-rwagner@rw-net.de>
Patch: +26 -8
Honor the robots.txt entries for "indexer" and "gus" as well
as the default "*" section.

The robot_file_map.p must be deleted on a live instance
after this change has been applied, to refetch all robots.txt
files, as previously only empty files were stored.
---
 docs/handling-robots.md | 14 ++++++++++++++
 gus/crawl.py            | 20 ++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)
 create mode 100644 docs/handling-robots.md

diff --git a/docs/handling-robots.md b/docs/handling-robots.md
new file mode 100644
index 0000000..4a80383
--- /dev/null
+++ b/docs/handling-robots.md
@@ -0,0 +1,14 @@
+# robots.txt handling
+
+robots.txt is fetched for each (sub)domain before actually crawling the content.
+
+GUS honors the following User-agents:
+* indexer
+* gus
+* *
+
+## robots.txt caching
+
+Every fetched robots.txt is cached in `index/robot_file_map.p`, even if it was empty/missing.
+
+To force a refetch of _all_ robots.txt for _all_ capsules, simply delete the file named above and run a crawl.
diff --git a/gus/crawl.py b/gus/crawl.py
index 7095add..bdc0c75 100644
--- a/gus/crawl.py
+++ b/gus/crawl.py
@@ -392,6 +392,7 @@ def fetch_robots_file(robot_host):
     )
     rp = GeminiRobotFileParser(robot_url)
     rp.read()
+    return rp
 
 
 def get_robots_file(robot_host):
@@ -443,12 +444,15 @@ def crawl_page(
     robots_file = get_robots_file(gr.normalized_host)
     crawl_delay = None
     if robots_file is not None:
-        # keep overwriting the value of can_fetch with more specific user-agent values
-        # last one should win, and if not present, RobotFileParser will just return
-        # the higher level's value again
-        can_fetch = robots_file.can_fetch("*", gr.normalized_url)
-        can_fetch = robots_file.can_fetch("indexer", gr.normalized_url)
-        can_fetch = robots_file.can_fetch("gus", gr.normalized_url)
+        logging.debug("Found robots.txt for %s", gr.normalized_url)
+        # only fetch if both user-agents are allowed to fetch
+        # RobotFileParser will return the higher level value (*) if no specific
+        # value is found, but has no understanding the "gus" is a more specific
+        # form of an indexer
+        logging.debug("can_fetch indexer: %s",robots_file.can_fetch("indexer", gr.normalized_url))
+        logging.debug("can_fetch gus: %s",robots_file.can_fetch("gus", gr.normalized_url))
+        can_fetch = (robots_file.can_fetch("indexer", gr.normalized_url) and
+            robots_file.can_fetch("gus", gr.normalized_url))
 
         # same approach as above - last value wins
         crawl_delay = robots_file.crawl_delay("*")
@@ -456,8 +460,8 @@ def crawl_page(
         crawl_delay = robots_file.crawl_delay("gus")
 
         if not can_fetch:
-            logging.debug(
-                "Blocked by robots files, skipping: %s",
+            logging.info(
+                "Blocked by robots.txt, skipping: %s",
                 gus.lib.logging.strip_control_chars(url),
             )
             return
-- 
2.30.1
Message-ID: <YCaEyU3RchH1HzQH@goldfish.localdomain>
In-Reply-To: <20210212070534.14511-1-rwagner@rw-net.de>
On Fri, Feb 12, 2021 at 08:05:34AM +0100, René Wagner wrote:
> diff --git a/gus/crawl.py b/gus/crawl.py
> index 7095add..bdc0c75 100644
> --- a/gus/crawl.py
> +++ b/gus/crawl.py
> @@ -392,6 +392,7 @@ def fetch_robots_file(robot_host):
>      )
>      rp = GeminiRobotFileParser(robot_url)
>      rp.read()
> +    return rp

Oh my goodness, was the issue really just that I forgot to return rp?
:facepalm: good find!! I spent way too long scratching my head about
this one.

>  def get_robots_file(robot_host):
> @@ -443,12 +444,15 @@ def crawl_page(
>      robots_file = get_robots_file(gr.normalized_host)
>      crawl_delay = None
>      if robots_file is not None:
> -        # keep overwriting the value of can_fetch with more specific user-agent values
> -        # last one should win, and if not present, RobotFileParser will just return
> -        # the higher level's value again
> -        can_fetch = robots_file.can_fetch("*", gr.normalized_url)
> -        can_fetch = robots_file.can_fetch("indexer", gr.normalized_url)
> -        can_fetch = robots_file.can_fetch("gus", gr.normalized_url)
> +        logging.debug("Found robots.txt for %s", gr.normalized_url)
> +        # only fetch if both user-agents are allowed to fetch
> +        # RobotFileParser will return the higher level value (*) if no specific
> +        # value is found, but has no understanding the "gus" is a more specific
> +        # form of an indexer
> +        logging.debug("can_fetch indexer: %s",robots_file.can_fetch("indexer", gr.normalized_url))
> +        logging.debug("can_fetch gus: %s",robots_file.can_fetch("gus", gr.normalized_url))
> +        can_fetch = (robots_file.can_fetch("indexer", gr.normalized_url) and
> +            robots_file.can_fetch("gus", gr.normalized_url))

I'm not sure I agree with this change to the logic. I think last-wins,
with increasing specificity (as implemented before), will allow valid
robots.txt configurations that the new implementation will incorrectly
disallow. For example, imagine the case where someone wants to
disallow all indexers except GUS, because they like/trust GUS. That
was possible with the old implementation, because the most-specific
setting (i.e., GUS) would override the previous, less-specific setting
(i.e., indexer).
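
For concreteness, a robots.txt expressing that intent could look
something like this (hypothetical example, not from a real capsule):

    User-agent: indexer
    Disallow: /

    User-agent: gus
    Disallow:

With last-wins overriding, the final can_fetch("gus", ...) lookup
returns True and GUS may crawl; with the new AND of "indexer" and
"gus", the "indexer" disallow wins and GUS would skip the capsule.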
Message-ID: <20210212141429.BDC9D63C08BB@dd22736.kasserver.com>
In-Reply-To: <YCaEyU3RchH1HzQH@goldfish.localdomain>
Hi Natalie,

thanks for the feedback!

>>  def get_robots_file(robot_host):
>> @@ -443,12 +444,15 @@ def crawl_page(
>>      robots_file = get_robots_file(gr.normalized_host)
>>      crawl_delay = None
>>      if robots_file is not None:
>> -        # keep overwriting the value of can_fetch with more specific user-agent values
>> -        # last one should win, and if not present, RobotFileParser will just return
>> -        # the higher level's value again
>> -        can_fetch = robots_file.can_fetch("*", gr.normalized_url)
>> -        can_fetch = robots_file.can_fetch("indexer", gr.normalized_url)
>> -        can_fetch = robots_file.can_fetch("gus", gr.normalized_url)
>> +        logging.debug("Found robots.txt for %s", gr.normalized_url)
>> +        # only fetch if both user-agents are allowed to fetch
>> +        # RobotFileParser will return the higher level value (*) if no specific
>> +        # value is found, but has no understanding the "gus" is a more specific
>> +        # form of an indexer
>> +        logging.debug("can_fetch indexer: %s",robots_file.can_fetch("indexer", gr.normalized_url))
>> +        logging.debug("can_fetch gus: %s",robots_file.can_fetch("gus", gr.normalized_url))
>> +        can_fetch = (robots_file.can_fetch("indexer", gr.normalized_url) and
>> +            robots_file.can_fetch("gus", gr.normalized_url))
> 
> I'm not sure I agree with this change to the logic. I think last-wins,
> with increasing specificity (as implemented before), will allow valid
> robots.txt configurations that the new implementation will incorrectly
> disallow. For example, imagine the case where someone wants to
> disallow all indexers except GUS, because they like/trust GUS. That
> was possible with the old implementation, because the most-specific
> setting (i.e., GUS) would override the previous, less-specific setting
> (i.e., indexer).

You are right, my implementation doesn't work in the use case you described.
The previous implementation is flawed in another way:
if someone disallows "indexer" and does not set a value for "gus",
RobotFileParser will just return the value for "*" when asked for "gus".
This means the "indexer" section is not honored at all and the content is
fetched instead of respecting the "disallow" for "indexer" and moving on.
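
A quick way to see this with the stdlib parser (rough sketch; I'm
assuming GeminiRobotFileParser inherits this lookup behaviour
unchanged):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: indexer",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow:",
    ])

    # "indexer" is explicitly disallowed ...
    print(rp.can_fetch("indexer", "gemini://example.org/foo"))  # False
    # ... but there is no "gus" section, so the lookup falls back to
    # "*" and reports the page as fetchable.
    print(rp.can_fetch("gus", "gemini://example.org/foo"))      # True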

The hierarchy RobotFileParser uses is
*
|_ indexer
|_ gus

but we need
*
|_ indexer
   |_ gus

If RobotFileParser used the hierarchy we need, we could simply
fetch the value for "gus" and be done.

I'm not quite sure how to tackle the situation: if we just use "gus" we will
often end up using the value for "*" and ignoring "indexer" completely.
I have no clue whether "gus" is used by any capsule at all.
Need to think more about it...

Regards
René
Message-ID: <YCaSnh528QXgrOb9@goldfish.localdomain>
In-Reply-To: <20210212141429.BDC9D63C08BB@dd22736.kasserver.com>
On Fri, Feb 12, 2021 at 03:14:29PM +0100, René Wagner wrote:
> Hi Natalie,
>
> thanks for the feedback!

My pleasure! Thank YOU for finally bringing the necessary brainpower
to solve this bug :P

> You are right, my implementation doesn't work in the use-case you described.
> The previous implementation is flawed in another way:
> if someone disallows "indexer" and does not set a value for "gus",
> RobotFileParser will just return the value for "*" when asked for "gus".
> This means that the "indexer" is not honored at all and the content is fetched
> instead of honoring the "disallow" for "indexer" and move along.

Ahhhh, I didn't realize that flaw existed in the prior implementation.
I wonder if we could modify the original implementation by inserting a
small conditional in front of the overrides to check if the
more-specific value is even present in the robots.txt file (if it's
not, then we should skip the override, since it would just errantly
override with the value of `*`).

Hmm. That's my initial thought for the "best" way to handle it, but I
think I first need to check whether there's even an easy/supported way
to check if a user-agent is explicitly specified with Python's
robots.txt functionality.

What do you think?
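
Something along these lines is what I have in mind (untested sketch; it
assumes GeminiRobotFileParser exposes the stdlib parser's undocumented
`entries` list, and `has_explicit_section` is just a made-up helper
name):

    def has_explicit_section(robots_file, agent):
        # True if any section of the parsed robots.txt names this user-agent
        agent = agent.lower()
        return any(
            agent == ua.lower()
            for entry in getattr(robots_file, "entries", [])
            for ua in entry.useragents
        )

    can_fetch = robots_file.can_fetch("*", gr.normalized_url)
    if has_explicit_section(robots_file, "indexer"):
        can_fetch = robots_file.can_fetch("indexer", gr.normalized_url)
    if has_explicit_section(robots_file, "gus"):
        can_fetch = robots_file.can_fetch("gus", gr.normalized_url)

That would keep the last-wins specificity, but only let a more specific
section override when it is actually present in the file.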
Message-ID: <20210212185133.6386C63C08BB@dd22736.kasserver.com>
In-Reply-To: <YCaSnh528QXgrOb9@goldfish.localdomain>
Good evening ;)

> Ahhhh, I didn't realize that flaw existed in the prior implementation.
> I wonder if we could modify the original implementation by inserting a
> small conditional in front of the overrides to check if the
> more-specific value is even present in the robots.txt file (if it's
> not, then we should skip the override, since it would just errantly
> override with the value of `*`).
> 
> Hmm. That's my initial thought for the "best" way to handle it, but I
> think I first need to check if there's even an easy/supported way to
> check if a user-agent is explicitly specific with Python's robots.txt
> functionality.
> 
> What do you think?

I've had some additional thoughts about it and haven't found any other
solution than what you proposed - at least if we want to stick with
both user-agents: we need to know whether the robots.txt contains a
"gus" section or not and combine that information with the values
fetched previously.

According to the robotparser docs it doesn't support anything useful
for our use case.
I think I'll need a night's sleep on this... :-D

regards
René