~natpen/gus

This thread contains a patchset.

[PATCH] Add global feed parsing, gemlog feed parsing

Message ID: <20201213223602.4046-1-alex@alexwennerberg.com>
DKIM signature: pass
Patch: +103 -12
* Look for gemlog feeds by searching for datetime logs
* Parse feed items from gemlogs, rss, atom, add to separate table
* Reformat existing feed finder to use new logic
* Add new endpoint to list items from aggregated list of feeds, sorted
by date

This isn't quite production-ready, but it's ready for feedback! Here are
some open questions/issues:

1. Why is "rss.xml" excluded from crawling?

2. I currently pull all urls from rss/atom/gemlog feeds and show them to
the user. Maybe I should parse these links as gemini resources and only
show those to the user? I'm still getting used to the structure of GUS.

3. Should I refactor/rename some of the table names, resources, etc.?

4. The feed title is stored in the feeditem table -- maybe it should be
moved to the page table & accessed with a join? This could be related
to 2.

5. Are you using any code formatting? I noticed black was a dependency,
but running it caused a number of changes to your code -- are you using
a different configuration? 

6. I only tested this on a limited, contrived data set -- I'm not sure
how comprehensive my error handling is.

Hope you find this helpful! Let me know if you have any questions.
All the best,
Alex
---
 gus/crawl.py                  | 11 +++++---
 gus/lib/db_model.py           | 12 ++++++++-
 gus/lib/feed.py               | 50 +++++++++++++++++++++++++++++++++++
 serve/models.py               | 20 ++++++++------
 serve/templates/all_posts.gmi | 11 ++++++++
 serve/templates/index.gmi     |  1 +
 serve/views.py                | 10 +++++++
 7 files changed, 103 insertions(+), 12 deletions(-)
 create mode 100644 gus/lib/feed.py
 create mode 100644 serve/templates/all_posts.gmi

diff --git a/gus/crawl.py b/gus/crawl.py
index 0b2c757..a513474 100644
--- a/gus/crawl.py
+++ b/gus/crawl.py
@@ -15,6 +15,7 @@ import peewee
from . import constants
from gus.lib.db_model import init_db, Page, Link, Crawl
from gus.lib.gemini import GeminiResource, GeminiRobotFileParser
from gus.lib.feed import get_feed_items
import gus.lib.logging

# hack: the built-in methods in urllib need to know the
@@ -294,7 +295,6 @@ def index_prompt(resource, response):
    page.save()
    return page


def index_content(resource, response):
    logging.debug(
        "Indexing content for: %s",
@@ -334,10 +334,9 @@ def index_content(resource, response):
                    existing_change_frequency, "content"
                )
    page = Page(**doc)
    page.save()
    return page, is_different


def should_skip(resource):
    should_skip = False
    for excluded_prefix in EXCLUDED_URL_PREFIXES:
@@ -626,6 +625,12 @@ def crawl_page(
                timestamp=datetime.utcnow(),
            )
            page_crawl.save()
        for feed_item in page.feed_items:
            feed_item.delete_instance()
        feed_items = get_feed_items(gr, response)
        for feed_item in feed_items:
            feed_item.page = page
            feed_item.save()
    else:
        logging.warn(
            "Got unhandled status: %s: %s",
diff --git a/gus/lib/db_model.py b/gus/lib/db_model.py
index bf1bdd2..0a3afc1 100644
--- a/gus/lib/db_model.py
+++ b/gus/lib/db_model.py
@@ -17,7 +17,7 @@ def init_db(filename=":memory:"):
    """
    Bind an SQLite database to the Peewee ORM models.
    """
-    models = [Crawl, Link, Page, Search, Thread, ThreadPage]
+    models = [Crawl, Link, Page, Search, Thread, ThreadPage, FeedItem]
    db = SqliteDatabase(filename)
    db.bind(models)
    db.create_tables(models)
@@ -54,6 +54,16 @@ class Page(Model):
    indexed_at = DateTimeField(null=True)


class FeedItem(Model):
    """
    Items in feeds
    """
    page = ForeignKeyField(Page, backref="feed_items", on_delete="CASCADE")
    title = TextField(null=True)
    feed_title = TextField(null=True) # TODO normalize
    date = DateTimeField(null=True) 


class Link(Model):
    """
    Hyperlinks between pages in Geminispace
diff --git a/gus/lib/feed.py b/gus/lib/feed.py
new file mode 100644
index 0000000..867719e
--- /dev/null
+++ b/gus/lib/feed.py
@@ -0,0 +1,50 @@
"""
Resources for parsing feeds
"""

import re
from gus.lib.db_model import FeedItem
import feedparser
import datetime

GEMLOG_LINK_PATTERN = re.compile(
    "^=> ?(.*) ([0-9]{4}-[0-9]{2}-[0-9]{2}) ?-? ?(.*)", re.MULTILINE
)


def get_feed_items(resource, response):
    """
    Gets all feed items from a Gemini Resource by parsing its content
    """
    feed_items = []
    feed_title = resource.get_friendly_title(response.content)
    if response.content_type == "text/gemini":
        matches = re.finditer(GEMLOG_LINK_PATTERN, response.content)
        for match in matches:
            feed_item = FeedItem()
            feed_item.url = match[1]
            try:
                feed_item.date = datetime.datetime.strptime(match[2], "%Y-%m-%d")
            except ValueError:
                continue
            feed_item.title = match[3]
            feed_item.feed_title = feed_title
            feed_items.append(feed_item)
    elif response.content_type in [
        "application/xml",
        "application/atom+xml",
        "application/rss+xml",
    ]:
        feed_obj = feedparser.parse(response.content)
        feed = feed_obj.feed
        for entry in feed_obj.entries:
            feed_items.append(
                FeedItem(
                    url=entry.link,
                    # Only grab YMD
                    date=datetime.datetime(*entry.updated_parsed[:3]).isoformat(),
                    feed_title=feed.title,
                    title=entry.title,
                )
            )
    return feed_items
diff --git a/serve/models.py b/serve/models.py
index 54a016d..1275be5 100644
--- a/serve/models.py
+++ b/serve/models.py
@@ -176,19 +176,23 @@ ORDER BY t.thread_length DESC, t.updated_at DESC, t.id ASC, tp.address ASC"""

    def get_feeds(self):
        feeds_query = Page.raw(
            """SELECT DISTINCT p.*
            """SELECT DISTINCT fetchable_url, url 
FROM page AS p
JOIN indexable_crawl AS c
ON c.page_id == p.id
WHERE p.url LIKE '%atom.xml'
OR p.url LIKE '%feed.xml'
OR p.url LIKE '%.rss'
OR p.url LIKE '%.atom'
OR p.content_type IN ('application/atom+xml', 'application/rss+xml')
JOIN feeditem AS fi
ON fi.page_id == p.id
"""
        )
        return feeds_query.execute()

    def get_recent_posts(self):
        feeds_query = Page.raw(
            """SELECT * from feeditem 
ORDER BY date desc
"""
        )
        return feeds_query.execute()


    def get_newest_hosts(self):
        newest_hosts_query = Page.raw(
            """
diff --git a/serve/templates/all_posts.gmi b/serve/templates/all_posts.gmi
new file mode 100644
index 0000000..f7397ea
--- /dev/null
+++ b/serve/templates/all_posts.gmi
@@ -0,0 +1,11 @@
{% include 'fragments/header.gmi' %}

## All Gemini Posts

Below are the aggregated posts from all public feeds of which GUS is aware.

{% for post in all_posts %}
{{ "=> {} {} ({}) {}".format(post.url, post.date[:10], post.feed_title, post.title) }}
{% endfor %}

{% include 'fragments/footer.gmi' %}
diff --git a/serve/templates/index.gmi b/serve/templates/index.gmi
index 43992d0..337f558 100644
--- a/serve/templates/index.gmi
+++ b/serve/templates/index.gmi
@@ -6,6 +6,7 @@
=> /statistics Geminispace Statistics
=> /known-hosts Known Gemini Hosts
=> /known-feeds Known Gemini Feeds
=> /all-posts Recent Gemini Feed Posts
=> /newest-hosts Newest Gemini Hosts
=> /newest-pages Newest Gemini Pages

diff --git a/serve/views.py b/serve/views.py
index d659b89..f649eec 100644
--- a/serve/views.py
+++ b/serve/views.py
@@ -133,6 +133,16 @@ def known_feeds(request):
    return Response(Status.SUCCESS, "text/gemini", body)


@app.route("/all-posts", strict_trailing_slash=False)
def known_feeds(request):
    body = render_template(
        "all_posts.gmi",
        all_posts=gus.get_recent_posts(),
        index_modification_time=gus.statistics["index_modification_time"],
        quote=random.choice(constants.QUOTE_BANK),
    )
    return Response(Status.SUCCESS, "text/gemini", body)

@app.route("", strict_trailing_slash=False)
def index(request):
    body = render_template(
-- 
2.20.1
Message ID: <C7S0P4HPYM8O.2E8UB3RZZEJDX@debian-thinkpad>
In-Reply-To: <20201213223602.4046-1-alex@alexwennerberg.com>
DKIM signature: fail
I used gemini://testuser.flounder.online for testing, if you're interested.
Message ID: <C7S0VK06AVLF.ZPWTDJYX4Z7U@debian-thinkpad>
In-Reply-To: <C7S0P4HPYM8O.2E8UB3RZZEJDX@debian-thinkpad>
DKIM signature: fail
Already found an issue here -- it looks like I forgot to add url to the
DB model. Still getting used to the email flow here; I think I should
create a new v2 version of this patch?

I can fix that after you are able to take a look at this -- maybe I can
address other issues as well

Thanks!

Alex
Remco
Message ID: <87mtyfz7i1.fsf@rwv.io>
In-Reply-To: <20201213223602.4046-1-alex@alexwennerberg.com>
DKIM signature: pass
Hi Alex,

I really like this addition and I am very curious about what it will
look and feel like in production.  Here are a couple of things to think
about before we can push this to production:

- "all" posts could be "a lot"

  Please consider implementing pagination or at least put a limit on it.
  If you put a limit on it, include it in the text to the visitor "the
  100 most recent posts" or whatever.

- what happens when a page disappears?

  Currently the "indexable_crawl" SQL view is used to determine if a
  page is still likely to exist.  It has a confusing name and it is
  pretty slow but that's what we have at the moment.

- consider adding some unit tests

  There aren't many yet but GUS could benefit from some more tests.  The
  feed.py file is a nice candidate for some tests.

As for your questions, I hope Nat can answer them.  I'll include some
questions and comments on your code below.

2020/12/13 23:38, alex wennerberg:

> diff --git a/gus/lib/db_model.py b/gus/lib/db_model.py
> index bf1bdd2..0a3afc1 100644
> --- a/gus/lib/db_model.py
> +++ b/gus/lib/db_model.py
> @@ -54,6 +54,16 @@ class Page(Model):
>      indexed_at = DateTimeField(null=True)
>
>
> +class FeedItem(Model):
> +    """
> +    Items in feeds
> +    """
> +    page = ForeignKeyField(Page, backref="feed_items", on_delete="CASCADE")
> +    title = TextField(null=True)
> +    feed_title = TextField(null=True) # TODO normalize

Please don't introduce TODOs, either implement them or put them in your
personal notes.  Also, this is probably not the right place to do
normalization.  Either normalize in "get_feed_items" or don't normalize
and add an accessor function here like "normalized_title".  I'd go for
the latter option because it allows for reimplementing it without having
to touch the db.

> +    date = DateTimeField(null=True)
> +
> +
>  class Link(Model):
>      """
>      Hyperlinks between pages in Geminispace
> diff --git a/gus/lib/feed.py b/gus/lib/feed.py
> new file mode 100644
> index 0000000..867719e
> --- /dev/null
> +++ b/gus/lib/feed.py
> @@ -0,0 +1,50 @@
> +"""
> +Resources for parsing feeds
> +"""
> +
> +import re
> +from gus.lib.db_model import FeedItem
> +import feedparser
> +import datetime
> +
> +GEMLOG_LINK_PATTERN = re.compile(
> +    "^=> ?(.*) ([0-9]{4}-[0-9]{2}-[0-9]{2}) ?-? ?(.*)", re.MULTILINE

According to the spec at:

  gemini://gemini.circumlunar.space/docs/specification.gmi

a link arrow (=>) can be follow by any number of whitespace characters,
using something like "^=>\s*" is more accurate.  Also, the URL part
should not contain whitespace and at least one character, regex engines
being greedy this regex might capture more than you've bargained for,
use "(\S+)" instead.

As for the "title" part, according to the spec it's optional:

> that users may e.g. use labels with a separator

so maybe the whole last part should be optional.  Also use "\s?" instead
of " ?" because some people really love tabs.  And the "-" separator
could be any separator.
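To illustrate the difference (a sketch; the sample link line and the
tightened pattern are mine, combining the suggestions above):

```python
import re

old = re.compile(r"^=> ?(.*) ([0-9]{4}-[0-9]{2}-[0-9]{2}) ?-? ?(.*)", re.MULTILINE)
new = re.compile(r"^=>\s*(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2})\s?-?\s?(.*)", re.MULTILINE)

# A tab after the arrow is valid per the spec, but the greedy ".*" in
# the old pattern captures the tab as part of the URL group.
line = "=>\tgemini://example.org/post.gmi 2020-12-13 My post"

assert old.search(line).group(1) == "\tgemini://example.org/post.gmi"
# "\S+" keeps the URL group clean.
assert new.search(line).group(1) == "gemini://example.org/post.gmi"
assert new.search(line).group(3) == "My post"
```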

> diff --git a/serve/models.py b/serve/models.py
> index 54a016d..1275be5 100644
> --- a/serve/models.py
> +++ b/serve/models.py
> @@ -176,19 +176,23 @@ ORDER BY t.thread_length DESC, t.updated_at DESC, t.id ASC, tp.address ASC"""
>
>      def get_feeds(self):
>          feeds_query = Page.raw(
> -            """SELECT DISTINCT p.*
> +            """SELECT DISTINCT fetchable_url, url
>  FROM page AS p
> -JOIN indexable_crawl AS c

See beginning of this email about "what happens when a page disappears".

> -ON c.page_id == p.id
> -WHERE p.url LIKE '%atom.xml'
> -OR p.url LIKE '%feed.xml'
> -OR p.url LIKE '%.rss'
> -OR p.url LIKE '%.atom'
> -OR p.content_type IN ('application/atom+xml', 'application/rss+xml')
> +JOIN feeditem AS fi
> +ON fi.page_id == p.id
>  """
>          )
>          return feeds_query.execute()
>
> +    def get_recent_posts(self):
> +        feeds_query = Page.raw(
> +            """SELECT * from feeditem
> +ORDER BY date desc
> +"""

Put a limit on this, see earlier comment.

> +        )
> +        return feeds_query.execute()

Please consider using "iterator" instead of "execute".  The latter will
instantiate an object for all pages in the result set immediately but
"iterator" will only do so for the current item when iterating.

> diff --git a/serve/templates/index.gmi b/serve/templates/index.gmi
> index 43992d0..337f558 100644
> --- a/serve/templates/index.gmi
> +++ b/serve/templates/index.gmi
> @@ -6,6 +6,7 @@
>  => /statistics Geminispace Statistics
>  => /known-hosts Known Gemini Hosts
>  => /known-feeds Known Gemini Feeds
> +=> /all-posts Recent Gemini Feed Posts

Why not call the link "/recent-posts" too?

>  => /newest-hosts Newest Gemini Hosts
>  => /newest-pages Newest Gemini Pages
>
> diff --git a/serve/views.py b/serve/views.py
> index d659b89..f649eec 100644
> --- a/serve/views.py
> +++ b/serve/views.py
> @@ -133,6 +133,16 @@ def known_feeds(request):
>      return Response(Status.SUCCESS, "text/gemini", body)
>
>
> +@app.route("/all-posts", strict_trailing_slash=False)
> +def known_feeds(request):

Ah, copy-paste error.  You probably meant "def all_posts"?  ;-)

That's it.

Cheers,
R.
Message ID: <C7YPG3JL9N0T.2LKJJPT9OENN@localhost>
In-Reply-To: <87mtyfz7i1.fsf@rwv.io>
DKIM signature: pass
Working on a v2 of this patch! I've knocked out some of the simpler
fixes. Once Nat is able to answer some of the design questions, I'll
send over the v2 patch.

On Tue Dec 15, 2020 at 3:30 AM PST, Remco wrote:
> Hi Alex,
>
> I really like this addition and I am very curious about what it will
> look and feel like in production. Here's a couple of things to think
> about before we can push this to production:
>
> - "all" posts could be "a lot"

Replaced "all" with "recent"

> Please consider implementing pagination or at least put a limit on it.
> If you put a limit on it, include it in the text to the visitor "the
> 100 most recent posts" or whatever.

Added a limit of 50 

> - what happens when a page disappears?
>
> Currently the "indexable_crawl" SQL view is used to determine if a
> page is still likely to exist. It has a confusing name and it is
> pretty slow but that's what we have at the moment.

I'll have to consider this!

> - consider adding some unit tests
>
> There aren't many yet but GUS could benefit from some more tests. The
> feed.py file is a nice candidate for some tests.

Good idea.

> Please don't introduce TODOs, either implement them or put them in your
> personal notes. Also, this is probably not the right place to do
> normalization. Either normalize in "get_feed_items" or don't normalize
> and add an accessor function here like "normalized_title". I'd go for
> the latter option because it allows for reimplementing it without having
> to touch the db.

Removed the TODO. Will think about architecture in v2 of this patch.

> > +import re
> > +from gus.lib.db_model import FeedItem
> > +import feedparser
> > +import datetime
> > +
> > +GEMLOG_LINK_PATTERN = re.compile(
> > +    "^=> ?(.*) ([0-9]{4}-[0-9]{2}-[0-9]{2}) ?-? ?(.*)", re.MULTILINE
>
> According to the spec at:
>
> gemini://gemini.circumlunar.space/docs/specification.gmi
>
> a link arrow (=>) can be follow by any number of whitespace characters,
> using something like "^=>\s*" is more accurate. Also, the URL part
> should not contain whitespace and at least one character, regex engines
> being greedy this regex might capture more than you've bargained for,
> use "(\S+)" instead.

Makes sense! I'll add some unit tests for this too

> As for the "title" part, according to the spec it's optional:
>
> > that users may e.g. use labels with a separator
>
> so maybe the whole last part should be optional. Also use "\s?" instead
> of " ?" because some people really love tabs. And the "-" separator
> could be any separator.

I believe my title should be fine, since .* can be blank.

Here's the new regex:

=>\s*(\S+)\s([0-9]{4}-[0-9]{2}-[0-9]{2})\s?-?\s?(.*)

> > +        )
> > +        return feeds_query.execute()
>
> Please consider using "iterator" instead of "execute". The latter will
> instantiate an object for all pages in the result set immediately but
> "iterator" will only do so for the current item when iterating.

This makes sense to me, but I was just copying the structure of the other queries on this file -- 
shouldn't they also be modified accordingly?

> > +=> /all-posts Recent Gemini Feed Posts
>
> Why not call the link "/recent-posts" too?

Done

> >  => /newest-hosts Newest Gemini Hosts
> >  => /newest-pages Newest Gemini Pages
> >
> > diff --git a/serve/views.py b/serve/views.py
> > index d659b89..f649eec 100644
> > --- a/serve/views.py
> > +++ b/serve/views.py
> > @@ -133,6 +133,16 @@ def known_feeds(request):
> >      return Response(Status.SUCCESS, "text/gemini", body)
> >
> >
> > +@app.route("/all-posts", strict_trailing_slash=False)
> > +def known_feeds(request):
>
> Ah, copy-paste error. You probably meant "def all_posts"? ;-)

Correct. I'm surprised the routing still worked here? Regardless, fixed

Thanks for the comments! Like I said, I'll tackle some of the more challenging questions once we
hear back from Natalie

All the best,

Alex
Message ID: <C7YQH7LDBWXU.Y07RBCE6QCST@localhost>
In-Reply-To: <C7YPG3JL9N0T.2LKJJPT9OENN@localhost>
DKIM signature: pass
It looks like a lot of my replies got mangled in the patch UI, but the
raw email should be viewable. Still getting used to this workflow.

Alex
Remco
Message ID: <87tuse1ap8.fsf@rwv.io>
In-Reply-To: <C7YPG3JL9N0T.2LKJJPT9OENN@localhost>
DKIM signature: pass
2020/12/21 21:04, alex wennerberg:

> Here's the new regex:
>
> =>\s*(\S+)\s([0-9]{4}-[0-9]{2}-[0-9]{2})\s?-?\s?(.*)

Almost, the spec states:

> =>[<whitespace>]<URL>[<whitespace><USER-FRIENDLY LINK NAME>]
>
> where:
>
> - <whitespace> is any non-zero number of consecutive spaces or tabs
> - Square brackets indicate that the enclosed content is optional.
> - <URL> is a URL, which may be absolute or relative.

So adding a plus after the second \s should cover all bases:

=>\s*(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2})\s?-?\s?(.*)
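A quick check of that difference (the sample line is invented): the
single \s fails when more than one whitespace character separates the
URL and the date, while \s+ accepts it:

```python
import re

v1 = re.compile(r"=>\s*(\S+)\s([0-9]{4}-[0-9]{2}-[0-9]{2})\s?-?\s?(.*)")   # single \s
v2 = re.compile(r"=>\s*(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2})\s?-?\s?(.*)")  # \s+

# Two spaces between URL and date: a single \s cannot span them both.
line = "=> gemini://example.org/post.gmi  2020-12-13 - My post"

assert v1.search(line) is None
m = v2.search(line)
assert m.group(1) == "gemini://example.org/post.gmi"
assert m.group(2) == "2020-12-13"
assert m.group(3) == "My post"
```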

>> > +        )
>> > +        return feeds_query.execute()
>>
>> Please consider using "iterator" instead of "execute". The latter will
>> instantiate an object for all pages in the result set immediately but
>> "iterator" will only do so for the current item when iterating.
>
> This makes sense to me, but I was just copying the structure of the
> other queries on this file -- shouldn't they also be modified
> accordingly?

My limited contributions to GUS are my only experience with peewee, and
I don't know what the downsides of iterators are in Python (not my
go-to language).  But, to me, it seems "iterator" should always be
preferred over "execute" unless you'd be pulling in a strictly limited
result set.

So, the answer is probably: in some cases, yes.

Cheers,
R.