~natpen/gus

This thread contains a patchset. You're looking at the original emails, but you may wish to use the patch review UI.

[PATCH gus] robots.txt sections "*" and "indexer" are honored

Message ID: <20210222180602.2290-1-rwagner@rw-net.de>
DKIM signature: pass
Patch: +4 -13
For ease of implementation, we no longer honor the "gus" section.
It's probably barely used anyway.
---
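For reviewers, the fallback behaviour this change relies on can be checked with a
small standalone snippet (illustrative only, not part of this patch; the
robots.txt lines and URLs below are made up):

from urllib.robotparser import RobotFileParser

# Made-up robots.txt with only a "*" section and no "indexer" section.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# With no "indexer" section present, RobotFileParser answers from "*":
print(rp.can_fetch("indexer", "gemini://example.org/private/"))  # False
print(rp.can_fetch("indexer", "gemini://example.org/"))          # True
print(rp.crawl_delay("indexer"))                                  # 10
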
 docs/handling-robots.md |  1 -
 gus/crawl.py            | 16 ++++------------
 2 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/docs/handling-robots.md b/docs/handling-robots.md
index 4a80383..e1c602b 100644
--- a/docs/handling-robots.md
+++ b/docs/handling-robots.md
@@ -4,7 +4,6 @@ robots.txt is fetched for each (sub)domain before actually crawling the content.

 GUS honors the following User-agents:
 * indexer
-* gus
 * *
 
 ## robots.txt caching
diff --git a/gus/crawl.py b/gus/crawl.py
index bdc0c75..c953888 100644
--- a/gus/crawl.py
+++ b/gus/crawl.py
@@ -445,19 +445,11 @@ def crawl_page(
     crawl_delay = None
     if robots_file is not None:
         logging.debug("Found robots.txt for %s", gr.normalized_url)
-        # only fetch if both user-agents are allowed to fetch
-        # RobotFileParser will return the higher level value (*) if no specific
-        # value is found, but has no understanding the "gus" is a more specific
-        # form of an indexer
-        logging.debug("can_fetch indexer: %s",robots_file.can_fetch("indexer", gr.normalized_url))
-        logging.debug("can_fetch gus: %s",robots_file.can_fetch("gus", gr.normalized_url))
-        can_fetch = (robots_file.can_fetch("indexer", gr.normalized_url) and
-            robots_file.can_fetch("gus", gr.normalized_url))
-
-        # same approach as above - last value wins
-        crawl_delay = robots_file.crawl_delay("*")
+        # only fetch if allowed for user-agents * and indexer
+        # RobotFileParser will return the higher level value (*) if
+        # no indexer section is found
+        can_fetch = robots_file.can_fetch("indexer", gr.normalized_url)
         crawl_delay = robots_file.crawl_delay("indexer")
-        crawl_delay = robots_file.crawl_delay("gus")
 
         if not can_fetch:
             logging.info(
-- 
2.30.1