~sircmpwn/sr.ht-discuss

Keeping the Source Flowing

Details
Message ID
<a2df9a03-dd38-3ebe-2de0-08708e1a6d97@cepheide.org>
This is a bit off-topic, but as it may affect sr.ht itself and lots of 
its community, I think it's worth mentioning. If there are other public 
venues where this is being discussed, I'd be glad to know.

https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/

In the last couple of years, many free software projects moved away from 
Github to Gitlab, and then to Codeberg. Sourcehut remains a bit of a 
niche compared to the last two, but this upcoming move of Gitlab to 
remove dormant repositories could precipitate a new exodus, as people 
will look for solutions for long-term hosting of historical or stable code.

I know the Software Heritage project (Zack Cced) exists and will 
probably benefit in a way from this situation if it has the capacity to 
host that many repositories. The Register article says that Gitlab could 
save a million dollars a year, so this service to keep less used source 
code available to the public is not something easy to replace.

I wonder whether there's interest in the community in tackling this issue 
of preserving the memory of the code, and in striking a balance between 
long-term availability and perusal of the code. As the article also 
mentions, the criterion of 'inactivity' may not be entirely reliable, since 
some code that has reached stability barely needs to be changed at all.

And since we're here, maybe Drew can enlighten us about the cost of 
hosting code and how (self-)hosting code with SourceHut compares with 
the Gitlab approach, since, in the end, we should be having many 
instances of sr.ht services in the wild.

And Zack, maybe you could give a hint about the kind of costs that 
keeping free software memory entails?

Regards,

==
hk
Details
Message ID
<CLYWZTQKSET2.19WOMBEWUEFE2@taiga>
In-Reply-To
<a2df9a03-dd38-3ebe-2de0-08708e1a6d97@cepheide.org>
We aren't storing anything like what GitLab does, but I find the million
dollar figure highly questionable. We have no plans to implement
anything similar to what GitLab proposed in the foreseeable future, and
we would certainly strive to make the data remain available should the
need ever arise -- either through Software Heritage, archive.org, or
cold storage.
Details
Message ID
<ecf31adb-eeed-4f0a-b94b-7f29b0d2dd3e@www.fastmail.com>
In-Reply-To
<a2df9a03-dd38-3ebe-2de0-08708e1a6d97@cepheide.org>
Hey there,

On Sat, Aug 6, 2022, at 13:00, hellekin wrote:
> This is a bit off-topic, but as it may affect sr.ht itself and lots of 
> its community, I think it's worth mentioning. If there are other public 
> venues where this is being discussed, I'd be glad to know.
>
> https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
>
> In the last couple of years, many free software projects moved away from 
> Github to Gitlab, and then to Codeberg. Sourcehut remains a bit of a 
> niche compared to the last two, but this upcoming move of Gitlab to 
> remove dormant repositories could precipitate a new exodus, as people 
> will look for solutions for long-term hosting of historical or stable code.
>
> I know the Software Heritage project (Zack Cced) exists and will 
> probably benefit in a way from this situation if it has the capacity to 
> host that many repositories. The Register article says that Gitlab could 
> save a million dollars a year, so this service to keep less used source 
> code available to the public is not something easy to replace.

To the best of my knowledge, Software Heritage (SWH) has a different mission. It does not host code repositories per se, whether active or inactive. It references and hosts source code for the sake of knowledge and preservation, not to help deal with inactive code on Gitlab.

What I'm not sure I understand is this: are you suggesting that Sourcehut host thousands of inactive repos?

I don't think it's the role of Sourcehut but I may be wrong.

All the best,

Charles.

>
>
> ==
> hk
Details
Message ID
<20220806151524.z7sfxjopv52shcx7@microconnector.local>
In-Reply-To
<a2df9a03-dd38-3ebe-2de0-08708e1a6d97@cepheide.org>
On Sat, Aug 06, 2022 at 01:00:57PM +0200, hellekin wrote:
> This is a bit off-topic, but as it may affect sr.ht itself and lots of its
> community, I think it's worth mentioning. If there are other public venues
> where this is being discussed, I'd be glad to know.
> 
> https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
> 

But do not neglect the follow-up article[1].

[1]: https://www.theregister.com/2022/08/05/gitlab_reverses_deletion_policy/
Details
Message ID
<87iln4y31k.fsf@gnu.org>
In-Reply-To
<a2df9a03-dd38-3ebe-2de0-08708e1a6d97@cepheide.org>
hellekin <hellekin@cepheide.org> writes:

> I know the Software Heritage project (Zack Cced) exists and will
> probably benefit in a way from this situation if it has the capacity
> to host that many repositories. 

The mission of Software Heritage is to archive repositories that are
hosted _elsewhere_, not to be the primary host for those.

It's good to have both a good forge (SourceHut) and a steady archive
(SWH): that way we depend less on possibly bad policies of mainstream
forges and we have a guarantee that all source code ever published is
never lost.

Both projects fill complementary missions and need to be supported,
but for distinct reasons IMO.

-- 
 Bastien
Details
Message ID
<20220808070755.xpdsgvfd7s2yyhfu@upsilon.cc>
In-Reply-To
<a2df9a03-dd38-3ebe-2de0-08708e1a6d97@cepheide.org>
Thanks Hellekin for starting this thread and Cc:-ing me.

On Sat, Aug 06, 2022 at 01:00:57PM +0200, hellekin wrote:
> I know the Software Heritage project (Zack Cced) exists and will probably
> benefit in a way from this situation if it has the capacity to host that
> many repositories. The Register article says that Gitlab could save a
> million dollars a year, so this service to keep less used source code
> available to the public is not something easy to replace.

As Bastien pointed out, we do not aim to be a primary hosting place for
source code, but you're totally right that, as part of our mission, we
do want to archive in the long-term as much FOSS source code as we can
manage.

The list of forges/code hosting platforms we regularly crawl is visible
at https://archive.softwareheritage.org/ and I'm sure you'll notice that
Sourcehut is currently not there. We do want to archive Sourcehut
properly though (provided that the platform operators are OK with that
of course), and we started tracking this a few months ago at
https://forge.softwareheritage.org/T4346.

The way we usually archive large hosting platforms is by implementing a
dedicated "lister" component (that's our jargon for this) that can
provide either incrementally or fully a list of all the publicly
available VCS repos hosted there. Ideally, the listing should include
metadata such as "last modified" timestamps for each repo so that
crawling can be more targeted. (This is not strictly needed for
platforms which are not as big as github or the gitlab.com instance, but
it is nice to have anyway, as it reduces the burden on both our side
and the hoster's side.)
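
To make the lister idea above concrete, here is a minimal Python sketch of the difference between a full and an incremental listing. It is an illustration under stated assumptions, not Software Heritage's actual lister code: the `RepoEntry` record, the `list_origins` function, and the example.org URLs are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class RepoEntry:
    """One publicly listed repository (hypothetical record shape)."""
    url: str
    vcs: str                 # e.g. "git" or "hg"
    last_modified: datetime  # timestamp reported by the hosting platform


def list_origins(
    entries: Iterable[RepoEntry],
    since: Optional[datetime] = None,
) -> Iterator[RepoEntry]:
    """Yield the repositories a crawler should visit.

    With since=None this is a full listing. With a timestamp it is an
    incremental listing: only repos modified after `since` are yielded.
    """
    for entry in entries:
        if since is None or entry.last_modified > since:
            yield entry


# Hypothetical catalog standing in for whatever the platform's API returns.
catalog = [
    RepoEntry("https://git.example.org/~alice/stable-lib", "git",
              datetime(2019, 3, 1, tzinfo=timezone.utc)),
    RepoEntry("https://git.example.org/~bob/active-tool", "git",
              datetime(2022, 8, 5, tzinfo=timezone.utc)),
]

last_run = datetime(2022, 1, 1, tzinfo=timezone.utc)
full = [e.url for e in list_origins(catalog)]
incremental = [e.url for e in list_origins(catalog, since=last_run)]
```

Here the full listing returns both repositories, while the incremental one skips the repository that has not changed since the previous run. That skipping is what lowers the load on both sides.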

Last time we looked into this we didn't find an API that does this for
Sourcehut, but it is possible we didn't look in the right place. Tips
welcome! Lacking this we can resort to web crawling, but that won't be
ideal. I don't think that we at Software Heritage can, at this point,
commit the resources to develop the needed Sourcehut API ourselves, but
we'll be happy to discuss a spec and test it out if there is interest in
doing so on your side.

Meanwhile, individual Git repos hosted on Sourcehut can still be archived
on demand using our "Save code now" feature
https://save.softwareheritage.org/ , which has both a Web UI and an
associated API. (But no, a big loop around it for all Sourcehut repos
will not be ideal :-))
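
For a single repository, a request to the "Save code now" API could look roughly like the Python sketch below. The endpoint path (`/api/1/origin/save/<visit_type>/url/<origin_url>/`) reflects my understanding of the public Software Heritage Web API and should be checked against the current documentation. The helper names and the `~someuser/somerepo` origin are hypothetical, and rate limits or authentication are not handled.

```python
import json
import urllib.request

# Assumed API root; verify against the current Software Heritage docs.
SWH_API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the URL for a save request for one origin (no network access)."""
    return f"{SWH_API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


def submit_save_request(origin_url: str, visit_type: str = "git") -> dict:
    """POST the save request and return the decoded JSON response.

    This performs a real HTTP request, so it is defined but not called here.
    """
    request = urllib.request.Request(
        save_request_url(origin_url, visit_type), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical Sourcehut repository, used only to show the URL shape.
example_origin = "https://git.sr.ht/~someuser/somerepo"
example_url = save_request_url(example_origin)
```

This is meant for occasional, on-demand saves of individual repositories, not as a substitute for a proper lister.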

Happy to discuss any of this further with interested folks at Sourcehut.

PS I'm in-between holidays and will be back for good only after mid-August.
-- 
Stefano Zacchiroli . zack@upsilon.cc . upsilon.cc/zack          _. ^ ._
Full professor of Computer Science              o     o   o     \/|V|\/
Télécom Paris, Polytechnic Institute of Paris     o     o o    </>   <\>
Co-founder & CTO Software Heritage            o o o     o       /\|^|/\
Former Debian Project Leader & OSI Board Director               '" V "'
Details
Message ID
<CM0IRZXJIT40.1TIAGJJ6PWKLT@taiga>
In-Reply-To
<20220808070755.xpdsgvfd7s2yyhfu@upsilon.cc>
On Mon Aug 8, 2022 at 9:07 AM CEST, Stefano Zacchiroli wrote:
> The way we usually archive large hosting platforms is by implementing a
> dedicated "lister" component (that's our jargon for this) that can
> provide either incrementally or fully a list of all the publicly
> available VCS repos hosted there. Ideally, the listing should include
> metadata such as "last modified" timestamps for each repo so that
> crawling can be more targeted. (This is not strictly needed for
> platforms which are not as big as github or the gitlab.com instance, but
> it is a nice to have anyway as it reduces the burden also on both ours
> and the hoster side.)

Such a feature is not entirely trivial for SourceHut, as we don't
provide a list of all "public" repositories for privacy reasons -- if
you don't know someone's username, you cannot find their repositories.

However, we do have the project hub, which provides an index of all
public "projects" (distinct from repos), which themselves may contain
git or hg repositories.

There is no easy feature for enumerating these and the repositories
within them, but the upcoming project hub GraphQL API could presumably
be used for this.

I don't have much time to improve support specifically for Software
Heritage, but if an interested user wants to champion it, by all means
do so :)
Details
Message ID
<20220808090858.qiwivlwlrxktruto@upsilon.cc>
In-Reply-To
<CM0IRZXJIT40.1TIAGJJ6PWKLT@taiga>
On Mon, Aug 08, 2022 at 10:55:05AM +0200, Drew DeVault wrote:
> Such a feature is not entirely trivial for SourceHut, as we don't
> provide a list of all "public" repositories for privacy reasons -- if
> you don't know someone's username, you cannot find their repositories.

Understandable.

As a FOSS developer myself I guess I'd like to be able to configure this
(all repos / individual repos publicly listed or not), but as a default
yours totally makes sense.

> However, we do have the project hub, which provides an index of all
> public "projects" (distinct from repos), which themselves may contain
> git or hg repositories.
> 
> There is no easy feature for enumerating these and the repositories
> within them, but the upcoming project hub GraphQL API could presumably
> be used for this.

Sounds good. We're ourselves migrating our API to GraphQL for v2, and
also already using it to extract useful metadata from GitHub repos, so
once this is available we can easily build on it.

> I don't have much time to improve support specifically for Software
> Heritage, but if an interested user wants to champion it, by all means
> do so :)

That's totally fair. And, to be clear, it was not my intention to ask
Sourcehut (or you specifically) to do anything ad-hoc to support our own
non-profit mission :-) We just live in a limited-resource world and we
cannot commit to working on the Sourcehut side of this either ATM.

Maybe some contributor (of either project!) will pick this up; or we can
just reconsider it in the future when suitable resources come up. In the
meantime do not hesitate to ping me (or anyone else among our devs on
#swh-devel Libera/Matrix) if/when you want to discuss tech details.

I'll point to this discussion from our issue tracker. Feel free to point
to it from the relevant issue on your side if appropriate.

Cheers
-- 
Stefano Zacchiroli . zack@upsilon.cc . upsilon.cc/zack          _. ^ ._
Full professor of Computer Science              o     o   o     \/|V|\/
Télécom Paris, Polytechnic Institute of Paris     o     o o    </>   <\>
Co-founder & CTO Software Heritage            o o o     o       /\|^|/\
Former Debian Project Leader & OSI Board Director               '" V "'