~natpen/gus


Hello world & Gemfeeds

Message ID
<C7OERXV215RA.1Q84F3JQFK04M@debian-thinkpad>
DKIM signature
fail
Wanted to kick things off in this mailing list

Now that Gemfeeds have been standardized (gemini://gemini.circumlunar.space/docs/companion/subscription.gmi) -- I am thinking that
we could extend the feed crawler to look for gemfeeds

Some implementation ideas:
- In the crawler, check if a gemini page can be parsed as a feed by
  looking for links that start with YYYY-MM-DD
- Maybe create a new table called feeds with a list of all the pages and
  when they were updated based on YYYY-MM-DD dates or by
  atom.xml/rss dates
- Maybe list feeds reverse chronologically by last updated date.
- Maybe aggregate _all_ feeds together into one list
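The first idea could be sketched roughly like this -- a minimal, illustrative Python check (the function name, threshold, and regex are my assumptions, not existing GUS code), treating a page as a gemfeed when enough of its gemtext link lines have labels starting with an ISO date:

```python
import re

# Matches a gemtext link line whose label begins with YYYY-MM-DD,
# e.g. "=> 2020-12-09.gmi 2020-12-09 Post title"
DATE_LINK = re.compile(r"^=>\s*\S+\s+(\d{4}-\d{2}-\d{2})")

def looks_like_feed(gemtext, min_dated_links=2):
    """Heuristic: treat a page as a gemfeed if at least
    `min_dated_links` of its link lines carry a date-prefixed label."""
    dated = 0
    for line in gemtext.splitlines():
        if DATE_LINK.match(line):
            dated += 1
    return dated >= min_dated_links
```

The threshold is just a guard against a single incidental dated link being mistaken for a feed.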

Alex
Message ID
<X9LkMOeOO9qlvFfk@goldfish.localdomain>
In-Reply-To
<C7OERXV215RA.1Q84F3JQFK04M@debian-thinkpad> (view parent)
DKIM signature
pass
On Wed, Dec 09, 2020 at 04:51:21PM +0000, alex wennerberg wrote:
>
> Wanted to kick things off in this mailing list

Wonderful! Thank you for the push to finally start using this.

> Now that Gemfeeds have been standardized (gemini://gemini.circumlunar.space/docs/companion/subscription.gmi) -- I am thinking that
> we could extend the feed crawler to look for gemfeeds

I'm open to this. I think we could do it with a low-complexity
implementation, and not much runtime/crawltime/indextime cost, so it
seems like an okay thing to add.

> Some implementation ideas:
> - In the crawler, check if a gemini page can be parsed as a feed by
>   looking for links that start with YYYY-MM-DD
> - Maybe create a new table called feeds with a list of all the pages and
>   when they were updated based on YYYY-MM-DD dates or by
>   atom.xml/rss dates

I'm less sold on this storage design. A page could hypothetically also
*stop* being a feed, so there's a temporal element we should try to
capture as well. I think my counter-proposal would be to simply add a boolean
`is_feed` column to the existing Page table. That table is already
"stateful" in that it gets overwritten with the latest page "state"
every time it gets crawled.
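To make the counter-proposal concrete, here is a minimal sqlite sketch -- the table and column layout below is illustrative only, not GUS's actual schema -- showing the single boolean column getting overwritten with the latest state on each crawl:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative stand-in for the existing Page table (real schema differs).
con.execute("CREATE TABLE page (url TEXT PRIMARY KEY, content TEXT)")
# The counter-proposal: one boolean column, no separate feeds table.
con.execute("ALTER TABLE page ADD COLUMN is_feed INTEGER NOT NULL DEFAULT 0")

# Each crawl simply overwrites the row with the latest page state,
# so a page that *stops* being a feed gets is_feed flipped back to 0.
con.execute(
    "INSERT INTO page (url, content, is_feed) VALUES (?, ?, ?) "
    "ON CONFLICT(url) DO UPDATE SET "
    "content = excluded.content, is_feed = excluded.is_feed",
    ("gemini://example.org/gemlog/", "# My gemlog", 1),
)
```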

This would then presumably drive down its change_frequency value, so
it would get crawled more often. I don't think we would need to do
anything special about last updated date apart from this (there's
already good tracking of crawl activity timestamps in the Crawl
table).

That gets us to the point of being able to crawl feed pages
effectively. Now, separately, I have some thoughts about how to work
with the database schema to facilitate display of a feed *posts* page
in the actual user-facing frontend.

I think for such a user-facing page, performance-wise, it would
probably be easiest if there were an additional column in the Page
table called `post_date`, which would be nullable. After, or during, a
crawl, all the posts could be parsed out of each feed page, and for
each post's row in the Page table, it could have a non-null post_date
value set. That way, in the frontend page for viewing feeds and posts,
all we would have to do is a database query for pages with a non-null
post_date ordered by post_date descending.
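That frontend query could end up as simple as the sketch below (again with an illustrative stand-in schema, not GUS's real one): every non-feed, non-post page just keeps a NULL post_date and falls out of the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative Page table with the proposed nullable post_date column.
con.execute("CREATE TABLE page (url TEXT PRIMARY KEY, post_date TEXT)")
con.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [
        ("gemini://example.org/", None),  # ordinary page: no post_date
        ("gemini://example.org/2020-12-01.gmi", "2020-12-01"),
        ("gemini://example.org/2020-12-09.gmi", "2020-12-09"),
    ],
)
# The whole feeds/posts page boils down to this one query:
posts = con.execute(
    "SELECT url, post_date FROM page "
    "WHERE post_date IS NOT NULL ORDER BY post_date DESC"
).fetchall()
```

ISO YYYY-MM-DD strings sort correctly as text, so no date parsing is needed in the query itself.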

Hopefully that made sense. I feel like I just wrote a lot of text
liable to be easily misunderstood. I'll close by saying thank you for
suggesting this, and please feel free to propose alternative
implementations (in text or code!).

Natalie