~rjarry/dlrepo

dlrepo: fs: fix post-process of deduplicated files v2 APPLIED

Julien Floret: 1
 fs: fix post-process of deduplicated files

 5 files changed, 53 insertions(+), 17 deletions(-)
#775719 .build.yml success
Export patchset (mbox)
How do I use this?

Copy & paste the following snippet into your terminal to import this patchset into git:

curl -s https://lists.sr.ht/~rjarry/dlrepo/patches/32801/mbox | git am -3
Learn more about email & git

[PATCH dlrepo v2] fs: fix post-process of deduplicated files Export this patch

When a new artifact is uploaded via an HTTP PUT request, the server
calculates its digest and compares it with the digest present in the
request. The .digests file of the format is updated.
Then the artifact goes through the deduplication process: the file in
the job folder is hardlinked to a file named after its digest in the
global blobs folder.

After all artifacts of a given format have been uploaded, an HTTP PATCH
request can be issued on the format to remove the "dirty" flag. This
will run the post-process script if the DLREPO_POST_PROCESS_CMD
environment variable is defined.

The problem here is that the post-process script is able to modify
artifacts files in-place (for example, when signing an RPM file).
Because the file is hardlinked to a blob, its modification affects all
files hardlinked to it.

Real-life example:

Suppose we upload job1/fmt1 and job2/fmt2 simultaneously, and both
contain an identical foo.rpm file (of sha256 aaaaa).
We also have a post-process file that signs RPM files.

fmt1/foo.rpm and fmt2/foo.rpm are uploaded. No post-process has been
done yet.
We have the following data tree:

data/.blobs/sha256/
`-- aa
    `-- aaaaa (sha256: aaaaa)

branches/branch/tag/
|-- job1
|   |-- fmt1
|   |   |-- foo.rpm (hardlinked with aaaaa)
|   |   `-- .digests <- {"foo.rpm": "sha256:aaaaa"}
|   `-- .stamp
|-- job2
|   |-- fmt2
|   |   |-- foo.rpm (hardlinked with aaaaa)
|   |   `-- .digests <- {"foo.rpm": "sha256:aaaaa"}

Now, the post-process of fmt1 is run and foo.rpm is signed. Its new
sha256 is bbbbb; ALL FILES HARDLINKED TO IT have been modified.
So we have now:

data/.blobs/sha256/
|-- aa
|   `-- aaaaa (sha256: bbbbb) <-- WRONG
`-- bb
    `-- bbbbb (sha256: bbbbb) (hardlinked with aaaaa)

branches/branch/tag/
|-- job1
|   |-- fmt1
|   |   |-- foo.rpm (hardlinked with aaaaa)
|   |   `-- .digests <- {"foo.rpm": "sha256:bbbbb"}
|   `-- .stamp
|-- job2
|   |-- fmt2
|   |   |-- foo.rpm (hardlinked with aaaaa)
|   |   `-- .digests <- {"foo.rpm": "sha256:aaaaa"} <-- WRONG

This is a mess. Note that all 4 files are hardlinked together now.

Finally, the post-process of fmt2 is run. Since it is already signed,
the file is untouched and its digest is not recalculated, consequently
the fs stays in the same state.

To fix this problem, we must never modify a hardlinked file. After this
patch, when an HTTP POST is issued on a newly uploaded format, before
running the post-process we replace all hardlinks in the format by
temporary local copies, so that the post-process can safely modify them.
Then after the post-process, we deduplicate the files by removing the
copies and creating the hardlinks between the format folder and the
.blobs folder.

To avoid making copies of all formats, even those not concerned by the
post-process script, we introduce a new optional configuration variable:
DLREPO_POST_PROCESS_FILTER. If set, it must contain a regular expression
for matching files inside the format folder.
If any file of the format matches the filter, the post-process script is
run. Note that all files of the format are copied, not only those who
match the filter. Otherwise, if by mishap the regex were incomplete and
the post-process were to touch other files, the database would be
broken.
If DLREPO_POST_PROCESS_FILTER is not set, the post-process script is run
on all formats.

Fixes: fd4d5c28600c ("format: add post-process command support")
Signed-off-by: Julien Floret <julien.floret@6wind.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Thomas Faivre <thomas.faivre@6wind.com>
---
v2: changed the example value for DLREPO_POST_PROCESS_FILTER as
suggested by Robin.
dlrepo/fs/__init__.py      | 26 ++++++++++++++++++++------
 dlrepo/fs/fmt.py           | 28 ++++++++++++++++++++--------
 docs/dlrepo-api.7.scdoc    |  7 ++++---
 docs/dlrepo-config.5.scdoc |  8 ++++++++
 etc/default                |  1 +
 5 files changed, 53 insertions(+), 17 deletions(-)

diff --git a/dlrepo/fs/__init__.py b/dlrepo/fs/__init__.py
index 9e6212da8519..bc91e3773ce6 100644
--- a/dlrepo/fs/__init__.py
+++ b/dlrepo/fs/__init__.py
@@ -78,6 +78,9 @@ class AbstractRepository:
    def link_blob_ignore_quota(self, digest: str, target: Path):
        raise NotImplementedError()

    def copy_blob(self, digest: str, target: Path):
        raise NotImplementedError()


# --------------------------------------------------------------------------------------
class ArtifactRepository(AbstractRepository):
@@ -154,10 +157,13 @@ class ArtifactRepository(AbstractRepository):
        )

    def link_blob(self, digest: str, target: Path):
        _hardlink(self.blob_path(digest), target)
        _copy_or_link(self.blob_path(digest), target)

    def link_blob_ignore_quota(self, digest: str, target: Path):
        _hardlink(self.blob_path(digest), target)
        _copy_or_link(self.blob_path(digest), target)

    def copy_blob(self, digest: str, target: Path):
        _copy_or_link(self.blob_path(digest), target, copy=True)


# --------------------------------------------------------------------------------------
@@ -271,12 +277,15 @@ class UserRepository(AbstractRepository):

    def link_blob(self, digest: str, target: Path):
        try:
            _hardlink(self.base.blob_path(digest), target, self)
            _copy_or_link(self.base.blob_path(digest), target, self)
        finally:
            self.disk_usage_save()

    def link_blob_ignore_quota(self, digest: str, target: Path):
        _hardlink(self.base.blob_path(digest), target)
        _copy_or_link(self.base.blob_path(digest), target)

    def copy_blob(self, digest: str, target: Path):
        _copy_or_link(self.base.blob_path(digest), target, copy=True)


# --------------------------------------------------------------------------------------
@@ -309,7 +318,9 @@ async def _stream_to_file(


# --------------------------------------------------------------------------------------
def _hardlink(src: Path, dst: Path, user_repo: Optional[UserRepository] = None):
def _copy_or_link(
    src: Path, dst: Path, user_repo: Optional[UserRepository] = None, copy: bool = False
):
    if not src.is_file():
        raise FileNotFoundError()
    dst.parent.mkdir(mode=0o755, parents=True, exist_ok=True)
@@ -319,7 +330,10 @@ def _hardlink(src: Path, dst: Path, user_repo: Optional[UserRepository] = None):
        dst.unlink()
    if user_repo is not None:
        user_repo.disk_usage_add(src.stat().st_size)
    os.link(src, dst)
    if copy:
        shutil.copyfile(src, dst)
    else:
        os.link(src, dst)


# --------------------------------------------------------------------------------------
diff --git a/dlrepo/fs/fmt.py b/dlrepo/fs/fmt.py
index 6d67ffbeb331..59dac2d20098 100644
--- a/dlrepo/fs/fmt.py
+++ b/dlrepo/fs/fmt.py
@@ -7,6 +7,7 @@ import json
import logging
import os
from pathlib import Path
import re
from typing import Awaitable, Callable, Dict, Iterator, List, Tuple

from cachetools import LRUCache, cachedmethod
@@ -143,20 +144,30 @@ class ArtifactFormat(SubDir):
                pass

    POST_PROCESS_CMD = os.getenv("DLREPO_POST_PROCESS_CMD")
    POST_PROCESS_FILTER = os.getenv("DLREPO_POST_PROCESS_FILTER")

    async def post_process(self):
        if not self.POST_PROCESS_CMD:
            return

        digests = self.get_digests()
        if self.POST_PROCESS_FILTER and not any(
            re.match(self.POST_PROCESS_FILTER, f) for f in digests.keys()
        ):
            return

        LOG.debug(
            "running post process command: %s (cwd %s)",
            self.POST_PROCESS_CMD,
            self._path,
        )
        digests = self.get_digests()
        # record content modification time to only recalculate digest if needed

        ctimes = {}
        for f in digests.keys():
            # make a temporary copy of all hardlinked files so that post-process can
            # safely modify them
            self.root().copy_blob(digests[f], self._path / f)
            # record content modification time to only recalculate digest if needed
            ctimes[f] = (self._path / f).stat().st_ctime

        try:
@@ -180,13 +191,14 @@ class ArtifactFormat(SubDir):
            fpath = self._path / f
            if f in ctimes and f in digests and ctimes[f] == fpath.stat().st_ctime:
                # file was not modified by post processing, avoid recalculating digest
                new_digests[f] = digests[f]
                continue
                digest = new_digests[f] = digests[f]
            else:
                # new or modified file
                digest = await loop.run_in_executor(None, file_digest, "sha256", fpath)
                digest = f"sha256:{digest}"
                new_digests[f] = digest

            # new or modified file
            digest = await loop.run_in_executor(None, file_digest, "sha256", fpath)
            digest = f"sha256:{digest}"
            new_digests[f] = digest
            # deduplicate files, remove temporary copies
            blob = self.root().blob_path(digest)
            if blob.is_file():
                # replace file with hardlink to existing blob
diff --git a/docs/dlrepo-api.7.scdoc b/docs/dlrepo-api.7.scdoc
index 4e75f883e61b..5bde72944a04 100644
--- a/docs/dlrepo-api.7.scdoc
+++ b/docs/dlrepo-api.7.scdoc
@@ -682,9 +682,10 @@ Check if the specified _{format}_ exists and is not _dirty_.

Clear the _dirty_ flag of the specified _{format}_. This flag is automatically
set to _true_ when new files are uploaded. If a *DLREPO_POST_PROCESS_CMD* is
configured (see *dlrepo-config*(5)), the command will be executed synchronously
with the artifact format folder as working directory before returning the
response to the client.
configured, and if the format matches the optional *DLREPO_POST_PROCESS_FILTER*
(see *dlrepo-config*(5)), the command will be executed synchronously with the
artifact format folder as working directory before returning the response to the
client.

After the command is finished, all file digests in the _{format}_ will be
refreshed since they may have been modified by the post process command. The
diff --git a/docs/dlrepo-config.5.scdoc b/docs/dlrepo-config.5.scdoc
index c014963a33ca..13a36185b278 100644
--- a/docs/dlrepo-config.5.scdoc
+++ b/docs/dlrepo-config.5.scdoc
@@ -80,6 +80,14 @@ running the daemon process.

	Example: _/etc/dlrepo/post-process.sh_

*DLREPO_POST_PROCESS_FILTER* (optional)
	Regular expression for selecting on which artifact formats
	_DLREPO_POST_PROCESS_CMD_ must be run. If set, the command is run if at
	least one file in the format folder matches the regular expression.
	If not set, _DLREPO_POST_PROCESS_CMD_ is run on all formats.

	Example: _'^(.+\.rpm|Release)$'_

*DLREPO_PUBLIC_URL* (optional)
	If set, it will replace the default URL in the script contents that are
	returned from the _/cli_ URL.
diff --git a/etc/default b/etc/default
index d9568baf05d8..a8de461cf244 100644
--- a/etc/default
+++ b/etc/default
@@ -9,6 +9,7 @@
#DLREPO_LOG_LEVEL=WARNING
#DLREPO_USER_QUOTA=10737418240
#DLREPO_POST_PROCESS_CMD=
#DLREPO_POST_PROCESS_FILTER=
#DLREPO_PUBLIC_URL=
#DLREPO_TEMPLATES_DIR=
#DLREPO_STATIC_DIR=
-- 
2.30.2
dlrepo/patches/.build.yml: SUCCESS in 1m19s

[fs: fix post-process of deduplicated files][0] v2 from [Julien Floret][1]

[0]: https://lists.sr.ht/~rjarry/dlrepo/patches/32801
[1]: mailto:julien.floret@6wind.com

✓ #775719 SUCCESS dlrepo/patches/.build.yml https://builds.sr.ht/~rjarry/job/775719
Julien Floret, Jun 07, 2022 at 10:48: