This is a bit off-topic, but as it may affect sr.ht itself and lots of
its community, I think it's worth mentioning. If there are other public
venues where this is being discussed, I'd be glad to know.
https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
In the last couple of years, many free software projects moved away from
Github to Gitlab, and then to Codeberg. Sourcehut remains a bit of a
niche compared to the last two, but this upcoming move of Gitlab to
remove dormant repositories could precipitate a new exodus, as people
will look for solutions for long-term hosting of historical or stable code.
I know the Software Heritage project (Zack Cced) exists and will
probably benefit in a way from this situation if it has the capacity to
host that many repositories. The Register article says that Gitlab could
save a million dollars a year, so this service of keeping less-used source
code available to the public is not something easy to replace.
I wonder whether there's interest in the community to tackle this issue
of keeping memory of the code and strike a balance between long-term
availability and perusing of the code. As the article also mentions,
the criterion of 'inactivity' may not be entirely reliable, since
some code that has reached stability barely needs to be changed at all.
And since we're here, maybe Drew can enlighten us about the cost of
hosting code and how (self-)hosting code with SourceHut compares with
the Gitlab approach, since, in the end, we should be having many
instances of sr.ht services in the wild.
And Zack, maybe you could give a hint about the kind of costs keeping
free software memory entails?
Regards,
==
hk
We aren't storing anything like what GitLab does, but I find the million
dollar figure highly questionable. We have no plans to implement
anything similar to what GitLab proposed in the foreseeable future, and
we would certainly strive to make the data remain available should the
need ever arise -- either through software heritage, archive.org, or
cold storage.
Hey there,
On Sat, Aug 6, 2022, at 13:00, hellekin wrote:
> This is a bit off-topic, but as it may affect sr.ht itself and lots of
> its community, I think it's worth mentioning. If there are other public
> venues where this is being discussed, I'd be glad to know.
>
> https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
>
> In the last couple of years, many free software projects moved away from
> Github to Gitlab, and then to Codeberg. Sourcehut remains a bit of a
> niche compared to the last two, but this upcoming move of Gitlab to
> remove dormant repositories could precipitate a new exodus, as people
> will look for solutions for long-term hosting of historical or stable code.
>
> I know the Software Heritage project (Zack Cced) exists and will
> probably benefit in a way from this situation if it has the capacity to
> host that many repositories. The Register article says that Gitlab could
> save a million dollars a year, so this service to keep less used source
> code available to the public is not something easy to replace.
To the best of my knowledge, SWH has a different mission. It does not host active code repositories, or inactive ones, per se. It references and hosts source code for the sake of knowledge and preservation, not in order to deal with inactive code on Gitlab.
What I'm not sure I understand is this: are you suggesting Sourcehut should host thousands of inactive repos?
I don't think that's the role of Sourcehut, but I may be wrong.
All the best,
Charles.
hellekin <hellekin@cepheide.org> writes:
> I know the Software Heritage project (Zack Cced) exists and will
> probably benefit in a way from this situation if it has the capacity
> to host that many repositories.
The mission of Software Heritage is to archive repositories that are
hosted _elsewhere_, not to be the primary host for those.
It's good to have both a good forge (SourceHut) and a steady archive
(SWH): that way we depend less on possibly bad policies of mainstream
forges and we have a guarantee that all source code ever published is
never lost.
Both projects fill complementary missions and need to be supported,
but for distinct reasons IMO.
--
Bastien
Thanks Hellekin for starting this thread and Cc:-ing me.
On Sat, Aug 06, 2022 at 01:00:57PM +0200, hellekin wrote:
> I know the Software Heritage project (Zack Cced) exists and will probably
> benefit in a way from this situation if it has the capacity to host that
> many repositories. The Register article says that Gitlab could save a
> million dollars a year, so this service to keep less used source code
> available to the public is not something easy to replace.
As Bastien pointed out, we do not aim to be a primary hosting place for
source code, but you're totally right that, as part of our mission, we
do want to archive in the long-term as much FOSS source code as we can
manage.
The list of forges/code hosting platforms we regularly crawl is visible
at https://archive.softwareheritage.org/ and I'm sure you'll notice that
Sourcehut is currently not there. We do want to archive Sourcehut
properly though (provided that the platform operators are OK with that
of course) and we have started tracking this at
https://forge.softwareheritage.org/T4346 a few months ago.
The way we usually archive large hosting platforms is by implementing a
dedicated "lister" component (that's our jargon for this) that can
provide either incrementally or fully a list of all the publicly
available VCS repos hosted there. Ideally, the listing should include
metadata such as "last modified" timestamps for each repo so that
crawling can be more targeted. (This is not strictly needed for
platforms which are not as big as github or the gitlab.com instance, but
it is nice to have anyway, as it reduces the burden on both our side
and the hoster's.)
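
To make the "lister" idea above concrete, here is a minimal sketch in Python. It is not Software Heritage's actual lister code: the `RepoEntry` record, the `list_repos` function and the example.org URLs are all hypothetical, and the only assumption is that the hosting platform can supply a URL, a VCS type and a "last modified" timestamp for each public repo.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class RepoEntry:
    """One publicly listed repository (hypothetical record shape)."""
    url: str
    vcs: str  # e.g. "git" or "hg"
    last_modified: datetime


def list_repos(entries: Iterable[RepoEntry],
               since: Optional[datetime] = None) -> Iterator[RepoEntry]:
    """Yield repositories to crawl, oldest change first.

    since=None gives a full listing. Otherwise only repositories
    modified strictly after `since` are yielded (incremental listing).
    """
    selected = [e for e in entries
                if since is None or e.last_modified > since]
    yield from sorted(selected, key=lambda e: e.last_modified)


if __name__ == "__main__":
    # Made-up data standing in for what a forge API would return.
    forge = [
        RepoEntry("https://example.org/~alice/stable-lib", "git",
                  datetime(2019, 3, 1, tzinfo=timezone.utc)),
        RepoEntry("https://example.org/~bob/active-app", "hg",
                  datetime(2022, 8, 7, tzinfo=timezone.utc)),
    ]
    last_visit = datetime(2022, 1, 1, tzinfo=timezone.utc)
    for repo in list_repos(forge, since=last_visit):
        print(repo.vcs, repo.url)
```

The timestamp is what makes the incremental pass cheap: a stable repo that has not changed since the last visit is simply skipped rather than re-crawled, which is the reduced burden on both sides mentioned above.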
Last time we looked into this we didn't find an API that does this for
Sourcehut, but it is possible we didn't look in the right place. Tips
welcome! Lacking this we can resort to web crawling, but that won't be
ideal. I don't think we at Software Heritage can commit the resources
at this point to develop the needed Sourcehut API ourselves, but we'll
be happy to discuss a spec and test it out if there is interest in
doing so on your side.
Meanwhile individual Git repos hosted on Sourcehut can still be archived
on demand using our Save code now feature
https://save.softwareheritage.org/ , which has both a Web UI and an
associated API. (But no, a big loop around it for all Sourcehut repos
will not be ideal :-))
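
For a single repository, a Save Code Now request through the API could look like the sketch below. It assumes the Software Heritage Web API route `/api/1/origin/save/<visit_type>/url/<origin_url>/` called with POST; please check the current API documentation before relying on it. The helper name `save_request` and the example repository are hypothetical, and the request is only built, not sent.

```python
import urllib.request

# Assumption: base path of the Software Heritage Web API, version 1.
SWH_API = "https://archive.softwareheritage.org/api/1"

# Sourcehut hosts git and hg repositories, so this sketch accepts only those.
VISIT_TYPES = {"git", "hg"}


def save_request(origin_url: str,
                 visit_type: str = "git") -> urllib.request.Request:
    """Build (but do not send) a Save Code Now request for one repo."""
    if visit_type not in VISIT_TYPES:
        raise ValueError(f"unsupported visit type: {visit_type}")
    # Assumption: the origin URL is embedded as-is in the path.
    endpoint = (f"{SWH_API}/origin/save/{visit_type}/url/"
                f"{origin_url.rstrip('/')}/")
    return urllib.request.Request(endpoint, method="POST")


if __name__ == "__main__":
    # Hypothetical repository, for illustration only.
    req = save_request("https://git.sr.ht/~someuser/somerepo")
    print(req.get_method(), req.full_url)
    # Sending is left out on purpose; it would be:
    #   urllib.request.urlopen(req)
```

This is meant for the occasional repository. As noted above, looping it over every Sourcehut repo is not the intended use.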
Happy to discuss any of this further with interested folks at Sourcehut.
PS I'm in-between holidays and will be back for good only after mid-August.
--
Stefano Zacchiroli . zack@upsilon.cc . upsilon.cc/zack
Full professor of Computer Science
Télécom Paris, Polytechnic Institute of Paris
Co-founder & CTO Software Heritage
Former Debian Project Leader & OSI Board Director
On Mon Aug 8, 2022 at 9:07 AM CEST, Stefano Zacchiroli wrote:
> The way we usually archive large hosting platforms is by implementing a
> dedicated "lister" component (that's our jargon for this) that can
> provide either incrementally or fully a list of all the publicly
> available VCS repos hosted there. Ideally, the listing should include
> metadata such as "last modified" timestamps for each repo so that
> crawling can be more targeted. (This is not strictly needed for
> platforms which are not as big as github or the gitlab.com instance, but
> it is a nice to have anyway as it reduces the burden also on both ours
> and the hoster side.)
Such a feature is not entirely trivial for SourceHut, as we don't
provide a list of all "public" repositories for privacy reasons -- if
you don't know someone's username, you cannot find their repositories.
However, we do have the project hub, which provides an index of all
public "projects" (distinct from repos), which themselves may contain
git or hg repositories.
There is no easy feature for enumerating these and the repositories
within them, but the upcoming project hub GraphQL API could presumably
be used for this.
I don't have much time to improve support specifically for Software
Heritage, but if an interested user wants to champion it, by all means
do so :)
On Mon, Aug 08, 2022 at 10:55:05AM +0200, Drew DeVault wrote:
> Such a feature is not entirely trivial for SourceHut, as we don't
> provide a list of all "public" repositories for privacy reasons -- if
> you don't know someone's username, you cannot find their repositories.
Understandable.
As a FOSS developer myself, I guess I'd like to be able to configure this
(all repos / individual repos publicly listed or not), but as a default
yours totally makes sense.
> However, we do have the project hub, which provides an index of all
> public "projects" (distinct from repos), which themselves may contain
> git or hg repositories.
>
> There is no easy feature for enumerating these and the repositories
> within them, but the upcoming project hub GraphQL API could presumably
> be used for this.
Sounds good. We're ourselves migrating our API to GraphQL for v2, and
also already using it to extract useful metadata from GitHub repos, so
once this is available we can easily build on it.
> I don't have much time to improve support specifically for Software
> Heritage, but if an interested user wants to champion it, by all means
> do so :)
That's totally fair. And, to be clear, it was not my intention to ask
Sourcehut (or you specifically) to do anything ad hoc to support our own
non-profit mission :-) We just live in a limited-resource world and we
cannot commit to working on the Sourcehut side of this either ATM.
Maybe some contributor (of either project!) will pick this up; or we can
just reconsider it in the future when suitable resources come up. In the
meantime do not hesitate to ping me (or anyone else among our devs on
#swh-devel Libera/Matrix) if/when you want to discuss tech details.
I'll point to this discussion from our issue tracker. Feel free to point
to it from the relevant issue on your side if appropriate.
Cheers
--
Stefano Zacchiroli
On 8/8/22 11:08, Stefano Zacchiroli wrote:
>> I don't have much time to improve support specifically for Software
>> Heritage, but if an interested user wants to champion it, by all means
>> do so :)
>
> That's totally fair. And, to be clear, it was not my intention to ask
> Sourcehut (or you specifically) to do anything ad-hoc to support our own
> non-profit mission :-) We just live in a limited-resource world and we
> cannot commit either to work on the Sourcehut-side of this ATM.
>
> Maybe some contributor (of either project!) will pick this up; or we can
> just reconsider it in the future when suitable resources come up. In the
> meantime do not hesitate to ping me (or anyone else among our devs on
> #swh-devel Libera/Matrix) if/when you want to discuss tech details.
>
> I'll point to this discussion from our issue tracker. Feel free to point
> to it from the relevant issue on your side if appropriate.
>
> Cheers
Thank you all for picking up this thread. I find it a very valuable
conversation for the sake of free software.
Note that this kind of project -- figuring a way to add SH to SH, ahem,
SourceHut to Software Heritage -- is probably something that can get
traction and funding from the upcoming NGI Zero Foundation fund. As with
previous NGI Zero funds, it will be something like 25-50K per iteration,
with a maximum of 200,000 € per project (either SH). So if someone wants
to pick this up, you're welcome to follow up here since it's already
documented with Software Heritage.
Cheers,
==
hk
On Thu, Aug 11, 2022 at 12:57:15PM +0200, hellekin wrote:
> Note that this kind of project -- figuring a way to add SH to SH, ahem,
> SourceHut to Software Heritage -- is probably something that can get
> traction and funding from the upcoming NGI Zero Foundation fund. As previous
> NGI Zero funds, it will be something like 25-50K per iteration, with a
> maximum of 200,000 € per project (either SH). So if someone wants to pick
> this up, you're welcome to follow up here since it's already documented with
> the Software Heritage.
In the past we've used a number of (cascading) grants, including some
from NGI Zero, to outsource development of Software Heritage adapters to
technologies we didn't have in-house expertise in. See:
https://www.softwareheritage.org/grants/ . I can confirm they can work
very well, but note that the funding is only half of the story. One also
needs enough mentoring/code reviewing power within the project that will
be receiving the code. We can ensure that for code that will land in our
code base, but we couldn't do the same for code that will land in
SourceHut (the API backend in this case). If someone on the SourceHut
side is interested and can commit the relevant effort, let me know and
we'll be happy to support (and/or participate in, whatever is most
appropriate) a grant application about this.
Cheers
--
Stefano Zacchiroli