On a call yesterday Fintan, Alexis, Ele, and I discussed what the future of the
Link network looks like. This summary of the discussion should serve as a
starting point for more detailed design work.
== The gossip network
The Link network was originally designed as a peer-to-peer network using a
gossip protocol to discover data. From the perspective of a user, the idea was
that you would announce data to the gossip network and interested peers would
replicate it. If you wanted to fetch data, you would request it from the gossip
network and the request would be gossiped around until a peer which has the data
responds.
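
As a rough sketch of that message flow (the type and field names here are
illustrative assumptions, not the actual Link wire protocol):

    // Illustrative sketch of the announce/request gossip messages described
    // above; names and fields are assumptions, not the real wire format.

    /// Opaque identifier for a piece of data (e.g. a URN plus a ref or hash).
    #[derive(Clone, Debug, PartialEq, Eq)]
    pub struct DataId(pub String);

    #[derive(Clone, Debug)]
    pub enum Gossip {
        /// "I have this data": flooded to the network in the hope that
        /// interested peers will notice and replicate it.
        Have { id: DataId, origin: String },
        /// "Does anyone have this data?": gossiped from peer to peer until
        /// someone who holds the data responds, or the request dies out.
        Want { id: DataId, origin: String },
    }
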
In practice this model has turned out to be complicated and unreliable. This is
for a few reasons:
* In the modern internet most nodes are behind a NAT device, and we do not yet
have NAT traversal implemented. Consequently, peers which announce that they have
some data may not be reachable by peers which are interested in that data.
* Maintaining the gossip cluster requires a lot of resources. This has meant that
operators are reluctant to run the gossip components of Link.
* The semantics of the announce model mean that it is impossible to know whether
your data has in fact been replicated by anyone, which in turn leads to an
impression of the network as being unreliable.
* When requesting data it is hard to understand why data is not available. The
only information we have is that no one responded to the WANT message.
== Seed Nodes
The problems with gossip replication have led to the development of an
alternative model based on "seeds". A seed node is a node which is highly
available™ at a routable IP address. In this model, rather than peers announcing
changes to the gossip network and hoping that interested seeds see the
announcement, they instead connect directly to a seed and request that the seed
replicate their data. Likewise, when peers wish to request data, they contact
their seeds directly and request it from them.
Compared to the gossip network this model allows the application to guarantee
that data is replicated and makes diagnosing errors much simpler. Furthermore,
it is much easier to see how seed nodes can be incentivised when users have a
direct relationship with them. One can imagine running a business operating
seeds for people.
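
As a rough illustration of the direct-to-seed model (the trait and method names
below are assumptions for the sketch, not the actual Link API):

    // Hypothetical peer <-> seed interface; names are assumptions, not the
    // actual Link API.
    use std::net::SocketAddr;

    pub struct Urn(pub String);
    pub struct PeerId(pub String);

    #[derive(Debug)]
    pub enum SeedError {
        NotAccepted(String),
        Unreachable(SocketAddr),
    }

    pub trait Seed {
        /// Ask the seed to replicate the data identified by `urn` from the
        /// calling peer. A successful return means the seed now holds the
        /// data, which is what gives the replication guarantee mentioned above.
        fn replicate(&mut self, urn: &Urn, from: &PeerId) -> Result<(), SeedError>;

        /// Fetch the data identified by `urn` directly from the seed.
        fn fetch(&self, urn: &Urn) -> Result<Vec<u8>, SeedError>;
    }
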
== Network Requirements
The seed architecture has emerged rather than being overtly designed to meet the
goals of the Radicle project. It is pragmatic and has allowed us to start
dogfooding, but we should articulate what the properties we care about for a
network architecture are, then evaluate whether this seed architecture meets
those properties.
The primary overall goal of the Radicle network is to allow users to fetch and
update identities (``Project``s and ``Person``s) and the data associated with those
identities, without allowing network operators to hold that data hostage. To that
end I think the following properties are important:
* Identities are self-authenticating: it doesn't matter where you got the
identity data from, you can always validate it locally. Therefore identities
can be mirrored by any server (see the sketch after this list).
* The only information you need to fetch an identity from the network is its
URN. This means that if a particular node is refusing to host a URN for some
reason, you can easily switch hosting to another node without users who depend
on the URN needing to do anything.
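
To make the first property concrete, here is a simplified sketch of local
validation. The real check in Link involves the full revision history and the
delegates' signatures; the stand-in hash and signature check below are
assumptions for illustration only.

    // Simplified sketch of why identities are self-authenticating.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Stand-in for a URN: a content hash of the initial identity revision.
    #[derive(PartialEq, Eq)]
    pub struct Urn(u64);

    pub struct IdentityDoc {
        /// Serialised first revision of the identity document.
        pub initial_revision: Vec<u8>,
        /// Stand-in for "the delegate signatures over this document verify".
        pub delegate_sigs_ok: bool,
    }

    fn content_hash(bytes: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        bytes.hash(&mut h);
        h.finish()
    }

    /// It does not matter which server handed us `doc`: if the hash of its
    /// initial revision matches the URN and the delegate signatures verify,
    /// the identity is valid, so any node can mirror it.
    pub fn validate(urn: &Urn, doc: &IdentityDoc) -> bool {
        Urn(content_hash(&doc.initial_revision)) == *urn && doc.delegate_sigs_ok
    }
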
We can see that the seed architecture currently falls short of the second
requirement, because in order to obtain an identity you need to know its URN
_and_ the seed(s) where it is published.
== URN Lookup
It seems that a simple extension to the seed architecture which would allow us
to meet our network requirements is a component that provides a lookup from URN
to a set of seeds (sketched after the list below). As well as solving the core
requirement, this has a set of other nice properties:
* Applications could use this lookup to find seeds which should be notified when
new COBs are published by a peer not in the tracking graph of the seed. This
provides a way for non-tracked peers to e.g. submit issues for a project.
* Indexer nodes could use this lookup to watch for updates to projects in order
to provide global views of the network.
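
A minimal sketch of the interface such a component might expose (the names are
assumptions; it deliberately says nothing about how the mapping is maintained,
which is the design question below):

    // Hypothetical URN -> seeds lookup interface; names are assumptions.
    use std::collections::HashSet;
    use std::net::SocketAddr;

    #[derive(Clone, PartialEq, Eq, Hash)]
    pub struct Urn(pub String);

    #[derive(Clone, PartialEq, Eq, Hash)]
    pub struct SeedAddr {
        /// The seed's PeerId (public key).
        pub peer_id: String,
        /// A routable address for the seed.
        pub addr: SocketAddr,
    }

    pub trait UrnLookup {
        /// Which seeds claim to host this URN?
        fn resolve(&self, urn: &Urn) -> HashSet<SeedAddr>;

        /// Advertise that `seed` hosts `urn`.
        fn publish(&mut self, urn: Urn, seed: SeedAddr);
    }
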
There are various designs which could achieve a URN lookup; this thread should be
for discussing those designs. I don't have a good sense of what the requirements
for this particular component should be, so I think we should discuss that as
well as outline in detail what some solutions would look like.
Some ideas we discussed that need more elaboration:
* A DHT
* Some sort of DNS based system
* ENS
On Wed Jun 29, 2022 at 5:16 PM IST, Kim Altintop wrote:
> This is the depth of your understanding? Wow.
I'd like to remind anyone and everyone that we're still *people*
working on this project. People who are trying to do their best and
deserve some respect.
In the same vein, I respect your knowledge and realise that you found
the original post disappointing. We would gladly hear out any and all
critique from you. I, however, don't think that a two-sentence reply
that only serves to insult the OP, and everyone mentioned in the post,
is warranted. It's hurtful at worst and unconstructive at best.
Even having to formulate this reply has put me on edge...
Sorry you feel edgy. But "wow" is the only response I have on recognising the
pattern: something is not going quite like we wanted, some complaints are made,
some symptoms are observed, some explanations are brought forward. But then,
instead of scrutinising that -- and, most importantly, taking into account the
path that led here -- a shortcut is taken.
The term for that is "jumping to conclusions", I believe, and it has hurt the
Radicle project on more than one occasion.
I can start from the top, though:
> == The gossip network
Gossip was initially chosen for much the same reasons IPFS relies on "pub/sub"
for many meaningful applications: git repositories employ a dynamic compaction
scheme, and so are not well suited for "chunked" storage and distribution as
employed by most file sharing applications. There is also no inherent "data
economics" which would incentivise nodes to replicate data they are not
otherwise interested in.
Hence, the sensible choice is to make it so nodes cluster around "topics" they
are interested in, in order to reduce fanout. That, however, turned out to not
be so easy (IPFS' "gossipsub" presents but one set of tradeoffs), and there was
never enough time to work on it.
> == Seed Nodes
It is not true that they "emerged"; this concept has been there since the beginning.
One reason is that there must be some kind of entry nodes, another that there is
otherwise no way to reason about availability of the data. If connectivity of a
seed is based on content (as opposed to random), discovery of replicas is quite
possible without resorting to a DHT overlay.
It is true that it was not sufficiently understood that equating peer-to-peer
with laptop-to-laptop is just not a good "requirement" and will inevitably lead
to disappointment. Usable peer-to-peer networks are server-to-server networks
(mostly), and anyone who tells you otherwise is trying to sell you something.
It was also not sufficiently understood that there is no working around the fact
that git is not "decentralized" in the sense "web3" enthusiasts like to use that
term: git patches (commits) do not commute, and so any kind of shared
maintainership requires "consensus". That will not become automated while the
cheapest and easiest to understand consensus algorithm is a "centralized" git
server.
On the up side, this also makes everything much easier because data just needs
to be mirrored along a tree with that server as the root. The remaining question
is how to propagate (proposals for) updates in the other direction (ie.
_towards_ the root). But that's for another time.
> == Network Requirements
> == URN Lookup
It's quite a stretch to conflate name resolution (technically: name independent
routing) with indexing.
I am not very interested in indexing, because that's essentially a solved
problem. It is also inherently "centralized", at least if you want it to be
within competitive cost and latency bounds.
As for name resolution, well that's the culprit of distributed systems, isn't
it? The problem here is that we have two problems:
1. Mapping a name for a datum to machines which have it
2. Mapping a name for a machine to its network address(es)
I have repeatedly said and written that this is fairly straightforward to
implement using just the routing datastructure of a DHT (not the k/v storage).
But it depends on a threshold of stable nodes in a network, and on proactive
content advertisement (what IPFS calls "providers"). The latter poses some
questions regarding the bounds of the set of names a machine could possibly
hold.
If we relax the "requirement" to (morally) content-address git repositories, we
can get away with just 2., which is a lot simpler but implies explicit peering.
Which I don't think is so bad, given your set of complaints.
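
For clarity, the two mappings spelled out as types (the names are mine and
purely illustrative, not a concrete proposal):

    // The two sub-problems as types; names are illustrative only.
    use std::collections::HashSet;
    use std::net::SocketAddr;

    pub struct Urn(pub String);                 // name of a datum

    #[derive(Clone, PartialEq, Eq, Hash)]
    pub struct PeerId(pub String);              // name of a machine

    pub trait ProviderRecords {
        /// 1. Which machines claim to hold this datum? Requires proactive
        ///    advertisement (what IPFS calls "providers").
        fn providers(&self, urn: &Urn) -> HashSet<PeerId>;
    }

    pub trait PeerRouting {
        /// 2. Machine name -> network addresses. This alone can be answered
        ///    from the routing structure of a Kademlia-style DHT, without
        ///    using its k/v storage.
        fn addresses(&self, peer: &PeerId) -> Vec<SocketAddr>;
    }
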
More interesting, however, is to take a step back and revisit why we couldn't
just dump the object database into IPFS in the first place. Is there not some
way to tweak compaction such that we could? It turns out there is.
But that's for another time.
So, the part I don't understand with using something like Kademlia is
that it seems like it's designed for a case where nodes don't get to
decide what content they store, and this is what allows the optimization
of finding nodes in log time. Am I missing something?
Using the routing structure would basically mean that we have the ability to
look up IPs, given a PeerId, right? We'd still need a way for peers to
announce which projects they have, and it doesn't look like Kademlia is
designed for that (as it's deterministic with Kademlia, not arbitrary).
------- Original Message -------
On Thursday, June 30th, 2022 at 14:53, Kim Altintop <kim@eagain.io> wrote:
It just occurred to me that we *could* use the DHT part of Kademlia if the DHT had the type:
HashMap<ProjectId, Set<PeerId>>
So first a `lookupValue(id) -> Set<PeerId>` would be issued, and with the result, a `lookupPeer(id)`
would be issued (e.g. with a random peer in the result set), and finally a connection would be
established and the project would be fetched.
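Roughly, as a sketch (the function names follow the ones above and are
illustrative, not a real API):

    // Sketch of the proposed two-step lookup; names are illustrative.
    use std::collections::HashSet;
    use std::net::SocketAddr;

    #[derive(Clone, PartialEq, Eq, Hash)]
    pub struct ProjectId(pub String);

    #[derive(Clone, PartialEq, Eq, Hash)]
    pub struct PeerId(pub String);

    pub trait Dht {
        /// Value lookup: ProjectId -> peers advertising the project.
        fn lookup_value(&self, id: &ProjectId) -> HashSet<PeerId>;
        /// Kademlia node lookup: PeerId -> network addresses.
        fn lookup_peer(&self, id: &PeerId) -> Vec<SocketAddr>;
    }

    /// Resolve a project to an address we could connect to and fetch from.
    pub fn resolve_project(dht: &impl Dht, id: &ProjectId) -> Option<SocketAddr> {
        // 1. Who has the project?
        let providers = dht.lookup_value(id);
        // 2. Pick one of them (e.g. at random) and find its addresses.
        let peer = providers.into_iter().next()?;
        // 3. Connect to the address and fetch the project (not shown here).
        dht.lookup_peer(&peer).into_iter().next()
    }
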
Thoughts?
------- Original Message -------
On Wednesday, August 10th, 2022 at 18:28, Alexis Sellier <alexis@radicle.foundation> wrote:
> It just occurred to me that we *could* use the DHT part of Kademlia if the DHT had the type:
>
> HashMap<ProjectId, Set<PeerId>>
Yes, this has been dubbed “provider cache” on various occasions, borrowing IPFS jargon.
rust-libp2p essentially does S/Kademlia:
https://git.gnunet.org/bibliography.git/plain/docs/SKademlia2007.pdf
Last time I checked they had even added circuit information in there.
I would think the hopepunchr lib can be used to do all that.
It doesn't yet support WebRTC transport from browsers, but I'm looking forward to that.
------- Original Message -------
On Friday, August 12th, 2022 at 8:14 PM, Kim Altintop <kim@eagain.io> wrote: