~sircmpwn/public-inbox

We can do better than DuckDuckGo [et al.]

Details
Message ID
<50ccff96d2c8f76177e2c81256116e87@hyper.dev>
DKIM signature
missing
Hello,

I stumbled upon your post [0]. Since I am working on a search engine
myself, I figured it was a good idea to chime in.  The search engine is
called babelia.

   a) babelia instances will advertise their topical content via
      classifications such as the Wikipedia vital articles
      hierarchy. Granted, this is not good enough for niche topics, but
      it beats randomly asking instances whether they know about "those
      keywords", which is, by the way, not great privacy-wise.  Instead,
      babelia tries to guess the topic of the query, matches it with
      one or more categories, and given those categories, recommends
      some instances to the user.  This is crucial for privacy, but
      also to avoid cluttering the federation with random queries,
      hence for performance.  I prototyped the first part of that
      algorithm in sensimark [1]. sensimark learns how to sort
      documents into categories (if you read around the web, this is
      considered a newbie task in machine learning).
      TODO: The part that is missing is matching queries to categories.
      That is a different problem, because you cannot build an
      (interesting?) word2vec representation from a couple of words;
      sensimark works on paragraphs.
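To make the query-to-category matching concrete, here is a minimal
Python sketch: it averages per-word vectors into a query centroid and
picks the category whose centroid is nearest by cosine similarity.
The tiny VECTORS and CATEGORIES tables are made-up stand-ins for a
trained word2vec model and the vital-articles clusters; this is
exactly the part that breaks down for very short queries.

```python
import math

# Toy word vectors standing in for a trained word2vec model
# (illustrative values only; not sensimark's actual model).
VECTORS = {
    "python":   (0.9, 0.1),
    "compiler": (0.8, 0.2),
    "recipe":   (0.1, 0.9),
    "baking":   (0.2, 0.8),
}

# Category centroids, e.g. derived from vital-article clusters.
CATEGORIES = {
    "computing": (0.85, 0.15),
    "cooking":   (0.15, 0.85),
}

def centroid(words):
    """Average the vectors of the known words in a short query."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))

def cosine(a, b):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def guess_category(query):
    """Return the best-matching category for a whitespace-split query."""
    q = centroid(query.lower().split())
    if q is None:
        return None
    return max(CATEGORIES, key=lambda c: cosine(q, CATEGORIES[c]))
```

With real embeddings the centroid of two or three query words is noisy,
which is why sensimark prefers whole paragraphs.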

   b) I dare to think I figured out the indexing / querying strategy. I
      tried (!) to summarize it at [2]. The gist of the idea is to
      store the whole document in the forward index, and every
      downcased word in the reverse index (also known as the
      postings). Stemming and lemmatization are done at query time
      with query expansion, and so is synonym handling (but that is
      already documented elsewhere). The other idea is to rely on
      something like map-reduce on a single machine. I am not
      interested in multi-machine clusters; scaling vertically seems
      possible given AMD Ryzen and such. After candidate selection,
      the map-reduce pass will score and filter results and keep the
      top N results.  Pagination is not planned. See below.  While I
      am at it: look into Aho-Corasick, and compiling boolean keyword
      queries to a finite state machine.
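A minimal Python sketch of that forward/reverse index split with
query-time expansion; the SYNONYMS table stands in for real stemming
and lemmatization, and heapq.nlargest plays the role of the reduce
step that keeps only the top N (no pagination):

```python
from collections import defaultdict
import heapq

# Stand-in for real query expansion (stemming, lemmatization,
# synonyms), all of which would run at query time.
SYNONYMS = {"car": {"car", "automobile"}}

class Index:
    def __init__(self):
        self.forward = {}                 # doc id -> whole document
        self.inverted = defaultdict(set)  # downcased word -> doc ids

    def add(self, uid, text):
        self.forward[uid] = text
        for word in text.lower().split():
            self.inverted[word].add(uid)

    def search(self, query, n=10):
        # Query-time expansion: each keyword becomes a set of variants.
        expanded = [SYNONYMS.get(w, {w}) for w in query.lower().split()]
        # Candidate selection: docs matching every expanded keyword.
        candidates = None
        for variants in expanded:
            uids = set().union(*(self.inverted[v] for v in variants))
            candidates = uids if candidates is None else candidates & uids
        # Scoring pass ("reduce"): naive term counts, keep the top N.
        def score(uid):
            words = self.forward[uid].lower().split()
            return sum(words.count(v) for vs in expanded for v in vs)
        return heapq.nlargest(n, candidates or (), key=score)
```

Usage: `Index().add(...)` a few documents, then a query for "car" also
matches documents containing "automobile" thanks to the expansion step.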

   c) To work around the fact that pagination is not supported, and
      also to take advantage of Common Crawl (CC), it will be possible
      to ask for deep searches. Those kinds of searches simply scan
      the whole index or CC.  I prototyped that: it takes around 15
      days with 40 cores to scan through the whole CC.  That is
      reasonable. Mind that there is much more than a few terabytes
      in CC (but I do not have the exact number at hand).
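A deep search is then just a brute-force sequential scan, something
like the sketch below; a real one would shard the documents across
cores (as in the 40-core CC scan) instead of a single loop:

```python
def deep_search(documents, keywords):
    """Brute-force scan: yield the uid of every document whose text
    contains all keywords.  `documents` is an iterable of (uid, text)
    pairs streamed off disk, so the whole corpus never sits in memory.
    A real deep search would split this loop across cores, e.g. with
    multiprocessing, one shard per worker."""
    needles = [k.lower() for k in keywords]
    for uid, text in documents:
        haystack = text.lower()
        if all(n in haystack for n in needles):
            yield uid
```

Because it is a generator over a stream, the cost is pure throughput:
scan time scales linearly with corpus size and inversely with cores,
which is where the 15-days-on-40-cores estimate comes from.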

Regarding the N-tier strategy: you will definitely need some moderation
to enter the index. A good way to extend upon Wikipedia and
Stack Overflow is to follow links from those websites.  Even if you
correctly implement PageRank, you cannot (I guess) blindly trust it.
You would still need to build a graph / network of the websites, and a
way to navigate that graph. I agree with Drew: a crawler for that kind
of project is a requirement!  You can live without it for some time by
re-using content from existing datasets, like kiwix.org, but certainly
not in the long run. It might sound like feature creep, but in the
context of community instances, it makes sense to have a feed reader
with link sharing.
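Following links out from trusted seeds can be sketched as a bounded
breadth-first walk.  Here `links` is a hypothetical precomputed
adjacency map (a real crawler would build it); the returned depth lets
a moderator review the outer rings before admitting them to the index:

```python
from collections import deque

def expand_seeds(seeds, links, max_depth=2):
    """Return every site reachable from the seeds within max_depth
    hops, mapped to its distance from the nearest seed.  Sites at
    depth 0 are the trusted seeds themselves; larger depths need
    more moderation before entering the index."""
    seen = {site: 0 for site in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        depth = seen[site]
        if depth == max_depth:
            continue  # do not expand past the trust horizon
        for target in links.get(site, ()):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen
```

The depth acts as a crude trust signal: unlike raw PageRank, nothing
enters the index without being a bounded number of hops from a
hand-picked seed.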

To give you an idea of how feasible this is: Wikipedia + Stack Overflow
are around 100 GB, but Wikidata is 3 TB.

Last but not least, do not re-invent the wheel regarding storage: use
an Ordered Key-Value store.
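The reason an ordered key-value store fits so well: postings can be
encoded as composite keys ("word" + separator + doc id), and a single
lexicographic prefix scan retrieves every posting for a word.  A toy
sketch, with a sorted list standing in for a real OKVS backend such as
LMDB or FoundationDB:

```python
import bisect

class TinyOKVS:
    """Minimal stand-in for an ordered key-value store: keys are kept
    sorted, so prefix range scans are cheap, which is the primitive
    real OKVS backends (LMDB, FoundationDB, WiredTiger) expose."""

    def __init__(self):
        self._keys = []  # always sorted

    def put(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def scan(self, prefix):
        """Yield every key starting with `prefix`, in order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i]
            i += 1

def posting_key(word, uid):
    # NUL never appears in a downcased word, so it is a safe separator.
    return f"{word}\x00{uid}"
```

One ordered keyspace therefore serves both the reverse index (prefix
scans per word) and the forward index (one key per document), with no
bespoke on-disk format to invent.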


[0] https://drewdevault.com/2020/11/17/Better-than-DuckDuckGo.html

[1] https://github.com/amirouche/sensimark/

[2] https://hyper.dev/blog/do-it-yourself-a-search-engine.html

[3] I did not mention it, but I have zero clue how to split CJK
     languages into words.

-- Amirouche ~ https://hyper.dev
Details
Message ID
<K-1YKOds2ByHlrMGGt2M0rmYzAUHBjtVHHx1fVldg9IayPFGnnhKjeGSeBh16aHoTQTgqH91eyRCKquOijm96LvhimjWVld8zFbwp5OS1_c=@hyper.dev>
In-Reply-To
<50ccff96d2c8f76177e2c81256116e87@hyper.dev> (view parent)
DKIM signature
pass
On Tuesday, November 9th, 2021 at 6:06 PM, Amirouche <amirouche@hyper.dev> wrote:

> [...]
>
> Last but not least, do not re-invent the wheel regarding storage use
> an Ordered Key-Value store (see https://okvs.dev :))
>
> [...]

We created a mailing list about this topic at https://peacesear.ch/

direct link: https://sr.ht/~peacesearch/

Amirouche Amazigh BOUBEKKI ~ https://hyper.dev