Hello,
I stumbled upon your post [0]. Since I am working on a search engine
myself, I figured it was a good idea to chime in. The search engine is
called babelia.
a) babelia instances will advertise their topical content via
   classifications such as the Wikipedia vital articles hierarchy.
   Granted, this is not good enough for niche topics, but it is a
   head start compared to randomly asking instances whether they
   know about "those keywords", which is, by the way, not great
   privacy-wise. Instead, babelia tries to guess the topic of the
   query, matches it with one or more categories, and, given those
   categories, recommends instances to the user. This is crucial for
   privacy, but it also avoids cluttering the federation with random
   queries, hence better performance. I prototyped the first part of
   that algorithm in sensimark [1]. sensimark learns how to
   categorize documents into categories (if you read around the web,
   this is considered a beginner task in machine learning).
TODO: The part that is missing is matching queries to categories.
That is a different problem, because you cannot build an interesting
word2vec representation from just a couple of words; sensimark works
on paragraphs. A rough sketch of the paragraph-to-category part is
below; the query side remains open.
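
To make that concrete, here is a minimal sketch of the
paragraph-to-category shape of the idea. This is not sensimark's
actual pipeline: the vector file, the categories and their example
paragraphs are placeholders, and it assumes pretrained word2vec
vectors loaded with gensim.

    # Sketch of paragraph -> category matching; NOT sensimark's actual
    # pipeline. The vector file and the category examples are made up.
    import numpy as np
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def embed(text):
        """Average the word vectors of the known words of TEXT."""
        words = [w for w in text.lower().split() if w in vectors]
        if not words:
            return None
        return np.mean([vectors[w] for w in words], axis=0)

    # One centroid per top-level category, computed from example
    # paragraphs (the examples below are placeholders).
    categories = {
        "technology": embed("computer software hardware network programming"),
        "biology": embed("cell organism species evolution gene dna"),
    }

    def guess_category(paragraph):
        """Return the category whose centroid is closest (cosine) to PARAGRAPH."""
        vector = embed(paragraph)
        if vector is None:
            return None
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        return max(categories, key=lambda name: cosine(vector, categories[name]))

A two or three word query does not give the average enough signal,
which is exactly why matching queries to categories is still a TODO.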
b) I dare to think I have figured out the indexing / querying
   strategy. I tried (!) to summarize it at [2]. The gist of the
   idea is to store the whole document in the forward index, and
   every downcased word in the reverse index (also known as the
   postings list). Stemming and lemmatization are done at query time
   with query expansion, and the same goes for synonym handling (but
   that is already documented elsewhere). The other idea is to rely
   on something like map-reduce on a single machine. I am not
   interested in multi-machine clusters, and scaling vertically
   seems possible given AMD Ryzen and the like. After candidate
   selection, the map-reduce will score and filter results and keep
   the top N results. Pagination is not planned; see below. While I
   am at it: look into Aho-Corasick, and into compiling boolean
   keyword queries into finite state machines.
c) To work around the fact that pagination is not supported, and
   also to take advantage of Common Crawl (CC), it will be possible
   to ask for deep searches. That kind of search simply scans the
   whole index or CC. I prototyped that: it takes around 15 days
   with 40 cores to scan through the whole CC, which is
   reasonable. Mind that CC is much more than a few terabytes (but I
   do not have the exact number at hand).
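
For illustration, here is what such a scan could look like over
locally downloaded WARC files, assuming the warcio library; the
paths and the match() predicate are placeholders, and this is not
the actual prototype.

    # Sketch of a "deep search" over Common Crawl WARC files downloaded
    # locally; match() and the glob pattern are placeholders.
    import glob
    from multiprocessing import Pool
    from warcio.archiveiterator import ArchiveIterator

    def match(payload):
        # Whatever the deep search is looking for; a plain substring here.
        return b"babelia" in payload

    def scan(path):
        """Scan one WARC file, return the URLs of the matching responses."""
        hits = []
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                if match(record.content_stream().read()):
                    hits.append(record.rec_headers.get_header("WARC-Target-URI"))
        return hits

    if __name__ == "__main__":
        # One worker per core, one WARC file at a time: the whole scan
        # is embarrassingly parallel, hence "15 days with 40 cores".
        with Pool() as pool:
            for hits in pool.imap_unordered(scan, glob.glob("*.warc.gz")):
                for url in hits:
                    print(url)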
Regarding the N-tier strategy: you will definitely need some
moderation before anything enters the index. A good way to extend
upon Wikipedia and Stack Overflow is to follow links from those
websites. Even if you correctly implement PageRank, you cannot (I
guess) blindly trust it. You would still need to build a graph /
network of the websites, and a way to navigate that graph. I agree
with Drew: a crawler is a requirement for that kind of project! You
can live without it for some time by re-using content from existing
datasets, like kiwix.org, but certainly not in the long run. It might
sound like feature creep, but in the context of community instances
it makes sense to have a feed reader with link sharing.
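
As a sketch of that seed-and-follow-links idea, assuming requests,
BeautifulSoup and networkx, with a placeholder seed list and no
politeness (robots.txt, rate limiting) whatsoever:

    # Crawl one hop out from a few seed pages, build a site graph, and
    # rank it with PageRank. Seeds and depth are placeholders.
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup
    import networkx as nx

    SEEDS = ["https://en.wikipedia.org/wiki/Special:Random"]  # placeholder

    def outbound_sites(url):
        """Return the set of sites that URL links to."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        sites = set()
        for anchor in soup.find_all("a", href=True):
            target = urlparse(urljoin(url, anchor["href"]))
            if target.scheme in ("http", "https") and target.netloc:
                sites.add(target.netloc)
        return sites

    graph = nx.DiGraph()
    for seed in SEEDS:
        source = urlparse(seed).netloc
        for site in outbound_sites(seed):
            graph.add_edge(source, site)

    # PageRank gives a prior, but as said above it cannot be blindly
    # trusted: moderation decides what actually enters the index.
    for site, rank in sorted(nx.pagerank(graph).items(), key=lambda x: -x[1]):
        print(site, rank)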
To give you an idea of how feasible this is: Wikipedia + Stack
Overflow are around 100G, but Wikidata is 3T.
Last but not least, do not re-invent the wheel regarding storage:
use an ordered key-value store (see https://okvs.dev).
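
To show what the storage layer could look like, here is a small
sketch of a postings layout on top of an ordered key-value store. It
uses LMDB through the py-lmdb bindings purely as a stand-in, and the
key layout is made up; any ordered store with range scans would do.

    # Postings on top of an ordered key-value store (LMDB as stand-in).
    import lmdb

    env = lmdb.open("index.lmdb", map_size=2**30)

    def key(word, uid):
        # Keys sort by word first, then by document uid.
        return b"posting\x00" + word.encode("utf-8") + b"\x00" + uid.encode("utf-8")

    def add(txn, word, uid):
        txn.put(key(word, uid), b"")

    def lookup(txn, word):
        """Range scan: yield every uid whose document contains WORD."""
        prefix = b"posting\x00" + word.encode("utf-8") + b"\x00"
        cursor = txn.cursor()
        if cursor.set_range(prefix):
            for k, _ in cursor:
                if not k.startswith(prefix):
                    break
                yield k[len(prefix):].decode("utf-8")

    with env.begin(write=True) as txn:
        add(txn, "hello", "doc-1")
        add(txn, "hello", "doc-2")

    with env.begin() as txn:
        print(list(lookup(txn, "hello")))   # ['doc-1', 'doc-2']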
[0] https://drewdevault.com/2020/11/17/Better-than-DuckDuckGo.html
[1] https://github.com/amirouche/sensimark/
[2] https://hyper.dev/blog/do-it-yourself-a-search-engine.html
[3] I did not mention it, but I have zero clue how to split CJK
languages into words.
We created a mailing list about this topic at https://peacesear.ch/
(direct link: https://sr.ht/~peacesearch/).
Amirouche Amazigh BOUBEKKI ~ https://hyper.dev