~sircmpwn/sr.ht-discuss

Can anyone help with https://git.sr.ht/~etalab/codegouvfr-fetch-data ?

Message-ID: <87pmr0yrwn.fsf@bzg.fr>

Hi all,

as you've read in Drew's anniversary blog post, one of the services of
the French administration (Etalab/DINUM) is hosting some of its projects
on SourceHut: https://sr.ht/~etalab/

Our main project is the one related to https://code.gouv.fr where we
expose published repositories from the French public sector.

The very first step when building this website is to collect data from
several forges: github.com, gitlab.com and GitLab instances.

See this file for the list of forges we collect data from:
https://git.sr.ht/~etalab/codegouvfr-fetch-data/blob/master/platforms.csv

We don't collect data from SourceHut yet, partly because I don't speak
Python, and partly because we did not have any account to collect from
on sr.ht.

If someone is familiar with the SourceHut GraphQL API and willing to
help us harvest metadata on SourceHut accounts/repositories, it'd be
much appreciated!

Some basic guidance here:
https://git.sr.ht/~etalab/codegouvfr-fetch-data#todo

The data model for organizations/groups and repositories is here:
https://git.sr.ht/~etalab/codegouvfr-fetch-data/tree/master/schemas/

See for example the code for fetching data from GitLab:
https://git.sr.ht/~etalab/codegouvfr-fetch-data/tree/master/item/gitlab.py

For sr.ht, we plan to relax our policy of harvesting organizational
accounts only, since the distinction does not exist on sr.ht yet (we
are not sure we need it; manually checking who owns the account is
probably good enough for us).

Don't hesitate to directly send patches to the project's list:
https://lists.sr.ht/~etalab/codegouvfr-devel

Or to ping me privately.  Thanks!

-- 
 Bastien

Message-ID: <CFRCS4647REQ.AU42DJJOOQXM@taiga>
In-Reply-To: <87pmr0yrwn.fsf@bzg.fr>

Can you enumerate exactly what data you hope to collect, so that we
needn't find out by reverse engineering the GitHub integration?

Message-ID: <87a6i4ypo7.fsf@gnu.org>
In-Reply-To: <CFRCS4647REQ.AU42DJJOOQXM@taiga>

"Drew DeVault" <sir@cmpwn.com> writes:

> Can you enumerate exactly what data you hope to collect, so that we
> needn't find out by reverse engineering the GitHub integration?

Yes, here it is:

For accounts (aka "organizations"):

  - URL (as primary key)
  - avatar_url (no-op?)
  - creation_date
  - description (srht "bio"?)
  - email
  - is_verified (no-op?)
  - location
  - login (e.g. "etalab" or "~etalab"?)
  - name (no-op: not distinct from login on srht?)
  - platform (= SourceHut)
  - repositories_count
  - website

For repositories:

  - repository_url (primary key)
  - creation_date
  - description
  - forks_count
  - homepage
  - is_archived (no-op)
  - is_fork
  - language
  - last_modification (= last commit date)
  - last_update (= last settings update)
  - license
  - name
  - open_issues_count (no-op since SourceHut trackers relate to projects?)
  - organization_name (or srht account's name?)
  - platform (= SourceHut)
  - software_heritage_exists (if on https://www.softwareheritage.org)
  - software_heritage_url (if on https://www.softwareheritage.org)
  - stars_count (no-op)
  - topics (= srht tags?)

You can browse the current data we have.

For organizations:
https://code.gouv.fr/data/organizations/json/all.json

For repositories:
https://code.gouv.fr/data/repositories/json/all.json

HTH,

-- 
 Bastien

Message-ID: <CFRXCG30AU37.1Y94JKY0K3M8G@taiga>
In-Reply-To: <87a6i4ypo7.fsf@gnu.org>

Sweet. You can use the GraphQL API for this. Docs are available here:

https://man.sr.ht/graphql.md
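
As a rough illustration of what a call could look like from Python, here
is a minimal sketch using only the standard library. It assumes the
GraphQL endpoint lives at https://meta.sr.ht/query and that a personal
access token is sent as a Bearer token; both should be checked against
the docs above. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: endpoint path, to be confirmed in man.sr.ht/graphql.md.
META_ENDPOINT = "https://meta.sr.ht/query"


def build_graphql_request(endpoint, query, token, variables=None):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    payload = {"query": query, "variables": variables or {}}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is passed as an OAuth 2.0 Bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_query(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the payload easy
to inspect or test without touching the network.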

For accounts, consult the following GraphQL resolvers:

https://git.sr.ht/~sircmpwn/meta.sr.ht/tree/master/api/graph/schema.graphqls#L67

I would suggest seeking to store the sr.ht data as it is, rather than
trying to shoehorn it into a similar shape as the GitHub data.

We don't provide a URL, but you can construct one yourself fairly easily
(https://sr.ht/~$username). Created, email, location, description (bio),
and website (url) should be straightforward. There
are no fields for avatar_url or is_verified; all accounts can be assumed
to be verified. You may want to store canonicalName, which includes the
~ prefix identifying it as a user (rather than an organization).
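
The mapping described above might be sketched as follows. The user
fields (canonicalName, created, email, url, location, bio) are the ones
named in this message; the root field `user(username: ...)` in the query
and the output key names are assumptions for illustration, so check them
against the linked schema before relying on them.

```python
# Assumption: the root field name and argument; verify in schema.graphqls.
ACCOUNT_QUERY = """
query($username: String!) {
  user(username: $username) {
    canonicalName created email url location bio
  }
}
"""


def account_record(user):
    """Map a meta.sr.ht user object (a dict) to a flat account record."""
    canonical = user["canonicalName"]  # includes the "~" prefix, e.g. "~etalab"
    return {
        # No URL is provided by the API, so it is constructed here.
        "url": f"https://sr.ht/{canonical}",
        "canonical_name": canonical,
        "login": canonical.lstrip("~"),
        "creation_date": user.get("created"),
        "description": user.get("bio"),
        "email": user.get("email"),
        "location": user.get("location"),
        "website": user.get("url"),
        "platform": "SourceHut",
        # No avatar field exists; all accounts are assumed to be verified.
        "avatar_url": None,
        "is_verified": True,
    }
```

Keeping canonical_name alongside login leaves the original sr.ht value
available even if the record is later reshaped.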

You can use the git.sr.ht GraphQL API to get info on repos:

https://git.sr.ht/~sircmpwn/git.sr.ht/tree/master/api/graph/schema.graphqls#L87

We don't provide the license directly, but I would be interested in
seeing this feature added if you're open to writing the patch.
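
For the repository side, a harvesting loop could look roughly like the
sketch below. It assumes that repository listings come back in pages
shaped like {"results": [...], "cursor": ...}, with a null cursor on the
last page, and that repositories expose name, description, created and
updated; all of this should be confirmed in the linked schema. The
`fetch_page` callable and the output key names are hypothetical.

```python
def iter_repositories(fetch_page):
    """Yield repositories from successive pages until the cursor runs out.

    fetch_page(cursor) is a caller-supplied function that returns one page
    as a dict with "results" (a list) and "cursor" (None on the last page).
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["results"]
        cursor = page.get("cursor")
        if cursor is None:
            break


def repository_record(canonical_name, repo):
    """Map a git.sr.ht repository object (a dict) to a flat record."""
    return {
        "repository_url": f"https://git.sr.ht/{canonical_name}/{repo['name']}",
        "name": repo["name"],
        "description": repo.get("description"),
        "creation_date": repo.get("created"),
        # Assumption: "updated" corresponds to the last settings update.
        "last_update": repo.get("updated"),
        "organization_name": canonical_name,
        "platform": "SourceHut",
        # The license is not provided directly by the API.
        "license": None,
    }
```

Passing the page fetcher in as a function keeps the pagination logic
independent of how the HTTP call is made.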

Feel free to follow up on IRC or via email if you have any questions.

Message-ID: <87pmqyygxf.fsf@gnu.org>
In-Reply-To: <CFRXCG30AU37.1Y94JKY0K3M8G@taiga>

Thank you very much for the directions!  Really appreciated.

I will see how I can move forward on this.

-- 
 Bastien