~sircmpwn/godocs.io


[PATCH gddo] gddo-server: Fetch modules in the background

Details
Message ID
<20210620232026.3330-1-me@adnano.co>
Patch: +50 -1
In a separate goroutine, continuously fetch the oldest module in the
database from the module proxy and update it if necessary. Note that the
--crawl-interval flag must be specified to enable background fetching.

---
A few remaining questions:

- Should we crawl the imports of packages (to discover new packages not
  in the database)? If we do, perhaps this should be opt-in as this
  could cause the database to grow exponentially, which wouldn't be
  desirable for small installations.
- Is PostgreSQL's ORDER BY fast enough for large databases? Or is speed
  not really a concern?
- Should we log FETCH messages, or should we keep silent except when
  errors occur?

 gddo-server/fetch.go          | 18 ++++++++++++++++++
 gddo-server/main.go           |  8 +++++++-
 internal/database/database.go | 25 +++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/gddo-server/fetch.go b/gddo-server/fetch.go
index a425d24..ba795b7 100644
--- a/gddo-server/fetch.go
+++ b/gddo-server/fetch.go
@@ -174,3 +174,21 @@ func (s *Server) updateMeta(ctx context.Context, modulePath string) error {
 	}
 	return nil
 }
+
+// fetchOldest updates the oldest module in the database if necessary.
+func (s *Server) fetchOldest(ctx context.Context) {
+	modulePath, err := s.db.Oldest(ctx)
+	if err != nil {
+		log.Printf("Error retrieving oldest module: %v", err)
+		return
+	}
+	if modulePath == "" {
+		// No modules in the database yet
+		return
+	}
+	log.Println("FETCH", modulePath)
+	if err := s.fetch(ctx, modulePath, proxy.LatestVersion); err != nil {
+		log.Printf("Error fetching %s: %v", modulePath, err)
+		return
+	}
+}
diff --git a/gddo-server/main.go b/gddo-server/main.go
index b5f8300..4b6bf93 100644
--- a/gddo-server/main.go
+++ b/gddo-server/main.go
@@ -32,7 +32,13 @@ func main() {
 	if err != nil {
 		log.Fatal("error creating server:", err)
 	}
-	// TODO: Crawl old modules in the background.
+
+	// Update modules in the background
+	go func() {
+		for range time.Tick(s.cfg.CrawlInterval) {
+			s.fetchOldest(ctx)
+		}
+	}()
 
 	var wg sync.WaitGroup
 	defer wg.Wait()
diff --git a/internal/database/database.go b/internal/database/database.go
index 4eac497..452befe 100644
--- a/internal/database/database.go
+++ b/internal/database/database.go
@@ -675,3 +675,28 @@ func (db *Database) PutMeta(ctx context.Context, meta source.Meta) error {
 		return nil
 	})
 }
+
+// Oldest returns the module path of the oldest module in the database
+// (i.e., the module with the smallest updated timestamp).
+func (db *Database) Oldest(ctx context.Context) (string, error) {
+	var modulePath string
+	err := db.withTx(ctx, nil, func(tx *sql.Tx) error {
+		rows, err := tx.QueryContext(ctx,
+			`SELECT module_path FROM modules ORDER BY updated LIMIT 1;`)
+		if err != nil {
+			return err
+		}
+		defer rows.Close()
+
+		if rows.Next() {
+			if err := rows.Scan(&modulePath); err != nil {
+				return err
+			}
+		}
+		return rows.Err()
+	})
+	if err != nil {
+		return "", err
+	}
+	return modulePath, nil
+}
-- 
2.32.0
Details
Message ID
<CC8UKA7X3GBQ.12OL57T060LCV@taiga>
In-Reply-To
<20210620232026.3330-1-me@adnano.co> (view parent)
On Sun Jun 20, 2021 at 7:20 PM EDT, Adnan Maolood wrote:
> In a separate goroutine, continuously fetch the oldest module in the
> database from the module proxy and update it if necessary. Note that the
> --crawl-interval flag must be specified to enable background fetching.
>
> ---
> A few remaining questions:
>
> - Should we crawl the imports of packages (to discover new packages not
>   in the database)? If we do, perhaps this should be opt-in as this
>   could cause the database to grow exponentially, which wouldn't be
>   desirable for small installations.

Can we estimate this based on the number of packages in the production
db today? Come up with some kind of model for predicting its behavior?

> - Is PostgreSQL's ORDER BY fast enough for large databases? Or is speed
>   not really a concern?

That should not be an issue. We can always add a B-tree index if
necessary.

> - Should we log FETCH messages, or should we keep silent except when
>   errors occur?

A one-liner ("Crawler successfully updated $pkgname") would probably not
be an issue.
Details
Message ID
<CC8UPX1DLFXA.3M7UBQQNXXCOH@nitro>
In-Reply-To
<CC8UKA7X3GBQ.12OL57T060LCV@taiga> (view parent)
On Sun Jun 20, 2021 at 7:50 PM EDT, Drew DeVault wrote:
> Can we estimate this based on the number of packages in the production
> db today? Come up with some kind of model for predicting its behavior?

As of right now, there are 15022 unique packages that are imported by
packages in the database. Of these 15022 packages, 1693 (about 11.3%)
are not already in the database. Now, we can't tell how many packages
these 1693 import without adding them to the database or fetching them
from the module proxy.

---
To retrieve the number of unique imported packages in the database, this
command was used:

	psql -A -t -c "SELECT imported_path FROM imports;" | sort -u | wc -l

To retrieve the number of unique imported packages that are NOT already
stored in the database, this command was used:

	psql -A -t -c "SELECT imported_path FROM imports WHERE NOT EXISTS (SELECT FROM packages WHERE import_path = imported_path);" | sort -u | wc -l

Also, note that these numbers may be off by one package as the "C"
pseudo-package is not present in the database but is still included in
package imports.
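
For anyone who wants to reproduce the counts without the shell pipeline, the same set logic can be sketched in Go. This is only an illustration: importStats is a hypothetical helper, the paths in main are made-up sample data, and unlike the commands above it skips the "C" pseudo-package so that the off-by-one does not occur.

```go
package main

import "fmt"

// importStats reports how many distinct import paths appear in imported
// and how many of those are absent from known. It mirrors the two psql
// commands above, except that the cgo pseudo-package "C" is ignored.
func importStats(imported, known []string) (unique, missing int) {
	knownSet := make(map[string]struct{}, len(known))
	for _, p := range known {
		knownSet[p] = struct{}{}
	}
	seen := make(map[string]struct{}, len(imported))
	for _, p := range imported {
		if p == "C" {
			continue // not a real package
		}
		if _, dup := seen[p]; dup {
			continue // same effect as sort -u
		}
		seen[p] = struct{}{}
		unique++
		if _, ok := knownSet[p]; !ok {
			missing++
		}
	}
	return unique, missing
}

func main() {
	// Made-up sample data, not taken from the production database.
	imported := []string{"fmt", "C", "example.com/a", "fmt", "example.com/b"}
	known := []string{"fmt", "example.com/a"}

	unique, missing := importStats(imported, known)
	fmt.Printf("unique=%d missing=%d\n", unique, missing)

	// The share quoted above: 1693 of 15022 imported packages.
	fmt.Printf("%.1f%%\n", 100*float64(1693)/float64(15022))
}
```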
Details
Message ID
<CC9AFB3CLPIN.1FVEBGYG1DDH8@taiga>
In-Reply-To
<20210620232026.3330-1-me@adnano.co> (view parent)
Thanks!

To git@git.sr.ht:~sircmpwn/gddo
   22cd4e8..b136038  master -> master