We discussed on IRC the possibility of having RFC documents. I am not
sure yet how the process will be structured.
In the meantime, here is a list of possible topics we may discuss:
- Web Archive Feed: daily (year-month-day) incremental snapshots,
implemented by medium to large website owners, that would allow search
engines to index their sites without crawling every page (related
standards: WARC, RSS, Atom).
- Sensimark: a category hierarchy similar to Wikipedia Vital Articles
[0] and a "fingerprinting" method that would allow both humans and
robots to figure out what topics are covered by a search engine or
website (related tools: word2vec, BERT, classification). That could be
at least two or three documents: 1) one for the first one or two levels
of categories, 2) another for the fingerprinting method, and 3) yet
another for really niche topics, e.g. "black and white manga between
1950 and 1980", which will be difficult to document for all niches and
topics.
- Search Query Forwarding: explain how, and under what circumstances, a
search engine can request another search engine to answer a query; more
generally, how resource sharing works.
- Web Archive Exchange: search engines may publish their own web crawl
results so that other search engines can use them to seed their index
or for full scans (deep search).
- Peer Archive eXchange & Discovery: similar to BitTorrent PEX, but for
search engines.
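To make the Web Archive Feed idea concrete, here is a minimal sketch of
how a site might enumerate day-stamped snapshot URLs for consumers. The
URL layout and `.warc.gz` naming are assumptions for illustration, not a
proposed spec:

```python
# Hypothetical sketch of a "web archive feed": one incremental snapshot
# URL per day, which a search engine could fetch instead of crawling.
from datetime import date, timedelta

def feed_entries(base_url, start, days):
    """Yield one snapshot URL per day, oldest first (layout is assumed)."""
    for i in range(days):
        day = start + timedelta(days=i)
        yield f"{base_url}/snapshots/{day.isoformat()}.warc.gz"

entries = list(feed_entries("https://example.com", date(2020, 1, 1), 3))
# e.g. https://example.com/snapshots/2020-01-01.warc.gz, then the next two days
```

A real feed would likely wrap these URLs in RSS or Atom entries with
checksums and byte sizes.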
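For the Sensimark fingerprinting method, a toy sketch of the general
idea: embed a document's words, then score the result against category
centroids by cosine similarity. The vectors and category names below are
made up for illustration; a real system would use word2vec or BERT
embeddings and the actual category hierarchy:

```python
# Toy fingerprinting sketch: assign a document to the top-level category
# whose centroid is closest (cosine similarity) to the document centroid.
# All vectors and categories here are invented for illustration.
import math

VECTORS = {
    "physics":  [0.9, 0.1, 0.0],
    "quark":    [0.8, 0.2, 0.0],
    "painting": [0.0, 0.1, 0.9],
    "fresco":   [0.1, 0.0, 0.8],
}
CATEGORIES = {"Science": ["physics", "quark"], "Arts": ["painting", "fresco"]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(words):
    vecs = [VECTORS[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def fingerprint(doc_words):
    """Return the best-matching category for the known words in the document."""
    doc = centroid([w for w in doc_words if w in VECTORS])
    return max(CATEGORIES, key=lambda c: cosine(doc, centroid(CATEGORIES[c])))
```

The same scores, kept per category instead of reduced to an argmax,
would give the "fingerprint" a human or robot can inspect.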
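For Search Query Forwarding, one open question is how to bound
forwarding chains. A hypothetical message shape with a hop counter (all
field names are assumptions, not a proposal):

```python
# Hypothetical forwarded-query message: a hop counter prevents a query
# from being relayed between engines indefinitely. Field names are
# invented for illustration.
import json

def make_forwarded_query(query, origin, hops_left=2):
    """Build a forwardable query message; refuse once hops are exhausted."""
    if hops_left <= 0:
        raise ValueError("hop limit exhausted; answer locally or fail")
    return json.dumps({
        "type": "query",
        "q": query,
        "origin": origin,
        "hops_left": hops_left - 1,
    })

msg = json.loads(make_forwarded_query("black and white manga", "engine-a.example"))
```

An RFC on this topic would also need to cover authentication and how
results are attributed to the answering engine.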
[0] https://en.wikipedia.org/wiki/Wikipedia:Vital_articles (see also [1])
[1] https://cloud.google.com/natural-language/docs/categories