Julien Floret: 1 fs: fix post-process of deduplicated files 5 files changed, 53 insertions(+), 17 deletions(-)
Copy & paste the following snippet into your terminal to import this patchset into git:
curl -s https://lists.sr.ht/~rjarry/dlrepo/patches/32738/mbox | git am -3Learn more about email & git
When a new artifact is uploaded via an HTTP PUT request, the server calculates its digest and compares it with the digest present in the request. The .digests file of the format is updated. Then the artifact goes through the deduplication process: the file in the job folder is hardlinked to a file named after its digest in the global blobs folder. After all artifacts of a given format have been uploaded, an HTTP PATCH request can be issued on the format to remove the "dirty" flag. This will run the post-process script if the DLREPO_POST_PROCESS_CMD environment variable is defined. The problem here is that the post-process script is able to modify artifacts files in-place (for example, when signing an RPM file). Because the file is hardlinked to a blob, its modification affects all files hardlinked to it. Real-life example: Suppose we upload job1/fmt1 and job2/fmt2 simultaneously, and both contain an identical foo.rpm file (of sha256 aaaaa). We also have a post-process file that signs RPM files. fmt1/foo.rpm and fmt2/foo.rpm are uploaded. No post-process has been done yet. We have the following data tree: data/.blobs/sha256/ `-- aa `-- aaaaa (sha256: aaaaa) branches/branch/tag/ |-- job1 | |-- fmt1 | | |-- foo.rpm (hardlinked with aaaaa) | | `-- .digests <- {"foo.rpm": "sha256:aaaaa"} | `-- .stamp |-- job2 | |-- fmt2 | | |-- foo.rpm (hardlinked with aaaaa) | | `-- .digests <- {"foo.rpm": "sha256:aaaaa"} Now, the post-process of fmt1 is run and foo.rpm is signed. Its new sha256 is bbbbb; ALL FILES HARDLINKED TO IT have been modified. So we have now: data/.blobs/sha256/ |-- aa | `-- aaaaa (sha256: bbbbb) <-- WRONG `-- bb `-- bbbbb (sha256: bbbbb) (hardlinked with aaaaa) branches/branch/tag/ |-- job1 | |-- fmt1 | | |-- foo.rpm (hardlinked with aaaaa) | | `-- .digests <- {"foo.rpm": "sha256:bbbbb"} | `-- .stamp |-- job2 | |-- fmt2 | | |-- foo.rpm (hardlinked with aaaaa) | | `-- .digests <- {"foo.rpm": "sha256:aaaaa"} <-- WRONG This is a mess. Note that all 4 files are hardlinked together now. Finally, the post-process of fmt2 is run. Since it is already signed, the file is untouched and its digest is not recalculated, consequently the fs stays in the same state.
Nice catch. Thanks for this awesome description of the problem.
To fix this problem, we must never modify a hardlinked file. After this patch, when an HTTP POST is issued on a newly uploaded format, before running the post-process we replace all hardlinks in the format by temporary local copies, so that the post-process can safely modify them. Then after the post-process, we deduplicate the files by removing the copies and creating the hardlinks between the format folder and the .blobs folder. To avoid making copies of all formats, even those not concerned by the post-process script, we introduce a new optional configuration variable: DLREPO_POST_PROCESS_FILTER. If set, it must contain a regular expression for matching files inside the format folder.
I was wondering about that option. A regexp may be overkill in that case. However it is very flexible and a simple shell glob may not be enough to express what files are to be post-processed.
If any file of the format matches the filter, the post-process script is run. Note that all files of the format are copied, not only those who match the filter. Otherwise, if by mishap the regex were incomplete and the post-process were to touch other files, the database would be broken.
Fair enough. Depending on the number of files in the format, I don't know if you would save much time by only copying the files that are matched by the regexp.
If DLREPO_POST_PROCESS_FILTER is not set, the post-process script is run on all formats. Fixes: fd4d5c28600c ("format: add post-process command support") Signed-off-by: Julien Floret <julien.floret@6wind.com> Acked-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Thomas Faivre <thomas.faivre@6wind.com> --- dlrepo/fs/__init__.py | 26 ++++++++++++++++++++------ dlrepo/fs/fmt.py | 28 ++++++++++++++++++++-------- docs/dlrepo-api.7.scdoc | 7 ++++--- docs/dlrepo-config.5.scdoc | 8 ++++++++ etc/default | 1 + 5 files changed, 53 insertions(+), 17 deletions(-) diff --git a/dlrepo/fs/__init__.py b/dlrepo/fs/__init__.py index 9e6212da8519..bc91e3773ce6 100644 --- a/dlrepo/fs/__init__.py +++ b/dlrepo/fs/__init__.py @@ -78,6 +78,9 @@ class AbstractRepository: def link_blob_ignore_quota(self, digest: str, target: Path): raise NotImplementedError() + def copy_blob(self, digest: str, target: Path): + raise NotImplementedError() + # -------------------------------------------------------------------------------------- class ArtifactRepository(AbstractRepository): @@ -154,10 +157,13 @@ class ArtifactRepository(AbstractRepository): ) def link_blob(self, digest: str, target: Path): - _hardlink(self.blob_path(digest), target) + _copy_or_link(self.blob_path(digest), target) def link_blob_ignore_quota(self, digest: str, target: Path): - _hardlink(self.blob_path(digest), target) + _copy_or_link(self.blob_path(digest), target) + + def copy_blob(self, digest: str, target: Path): + _copy_or_link(self.blob_path(digest), target, copy=True) # -------------------------------------------------------------------------------------- @@ -271,12 +277,15 @@ class UserRepository(AbstractRepository): def link_blob(self, digest: str, target: Path): try: - _hardlink(self.base.blob_path(digest), target, self) + _copy_or_link(self.base.blob_path(digest), target, self) finally: self.disk_usage_save() def link_blob_ignore_quota(self, digest: str, target: Path): - _hardlink(self.base.blob_path(digest), target) + _copy_or_link(self.base.blob_path(digest), target) + + def copy_blob(self, digest: str, target: Path): + _copy_or_link(self.base.blob_path(digest), target, copy=True) # -------------------------------------------------------------------------------------- @@ -309,7 +318,9 @@ async def _stream_to_file( # -------------------------------------------------------------------------------------- -def _hardlink(src: Path, dst: Path, user_repo: Optional[UserRepository] = None): +def _copy_or_link( + src: Path, dst: Path, user_repo: Optional[UserRepository] = None, copy: bool = False +): if not src.is_file(): raise FileNotFoundError() dst.parent.mkdir(mode=0o755, parents=True, exist_ok=True) @@ -319,7 +330,10 @@ def _hardlink(src: Path, dst: Path, user_repo: Optional[UserRepository] = None): dst.unlink() if user_repo is not None: user_repo.disk_usage_add(src.stat().st_size) - os.link(src, dst) + if copy: + shutil.copyfile(src, dst) + else: + os.link(src, dst) # -------------------------------------------------------------------------------------- diff --git a/dlrepo/fs/fmt.py b/dlrepo/fs/fmt.py index 6d67ffbeb331..59dac2d20098 100644 --- a/dlrepo/fs/fmt.py +++ b/dlrepo/fs/fmt.py @@ -7,6 +7,7 @@ import json import logging import os from pathlib import Path +import re from typing import Awaitable, Callable, Dict, Iterator, List, Tuple from cachetools import LRUCache, cachedmethod @@ -143,20 +144,30 @@ class ArtifactFormat(SubDir): pass POST_PROCESS_CMD = os.getenv("DLREPO_POST_PROCESS_CMD") + POST_PROCESS_FILTER = os.getenv("DLREPO_POST_PROCESS_FILTER") async def post_process(self): if not self.POST_PROCESS_CMD: return + digests = self.get_digests() + if self.POST_PROCESS_FILTER and not any( + re.match(self.POST_PROCESS_FILTER, f) for f in digests.keys() + ): + return + LOG.debug( "running post process command: %s (cwd %s)", self.POST_PROCESS_CMD, self._path, ) - digests = self.get_digests() - # record content modification time to only recalculate digest if needed + ctimes = {} for f in digests.keys(): + # make a temporary copy of all hardlinked files so that post-process can + # safely modify them + self.root().copy_blob(digests[f], self._path / f) + # record content modification time to only recalculate digest if needed ctimes[f] = (self._path / f).stat().st_ctime try: @@ -180,13 +191,14 @@ class ArtifactFormat(SubDir): fpath = self._path / f if f in ctimes and f in digests and ctimes[f] == fpath.stat().st_ctime: # file was not modified by post processing, avoid recalculating digest - new_digests[f] = digests[f] - continue + digest = new_digests[f] = digests[f] + else: + # new or modified file + digest = await loop.run_in_executor(None, file_digest, "sha256", fpath) + digest = f"sha256:{digest}" + new_digests[f] = digest - # new or modified file - digest = await loop.run_in_executor(None, file_digest, "sha256", fpath) - digest = f"sha256:{digest}" - new_digests[f] = digest + # deduplicate files, remove temporary copies blob = self.root().blob_path(digest) if blob.is_file(): # replace file with hardlink to existing blob diff --git a/docs/dlrepo-api.7.scdoc b/docs/dlrepo-api.7.scdoc index 4e75f883e61b..5bde72944a04 100644 --- a/docs/dlrepo-api.7.scdoc +++ b/docs/dlrepo-api.7.scdoc @@ -682,9 +682,10 @@ Check if the specified _{format}_ exists and is not _dirty_. Clear the _dirty_ flag of the specified _{format}_. This flag is automatically set to _true_ when new files are uploaded. If a *DLREPO_POST_PROCESS_CMD* is -configured (see *dlrepo-config*(5)), the command will be executed synchronously -with the artifact format folder as working directory before returning the -response to the client. +configured, and if the format matches the optional *DLREPO_POST_PROCESS_FILTER* +(see *dlrepo-config*(5)), the command will be executed synchronously with the +artifact format folder as working directory before returning the response to the +client. After the command is finished, all file digests in the _{format}_ will be refreshed since they may have been modified by the post process command. The diff --git a/docs/dlrepo-config.5.scdoc b/docs/dlrepo-config.5.scdoc index c014963a33ca..017ae7b96de2 100644 --- a/docs/dlrepo-config.5.scdoc +++ b/docs/dlrepo-config.5.scdoc @@ -80,6 +80,14 @@ running the daemon process. Example: _/etc/dlrepo/post-process.sh_ +*DLREPO_POST_PROCESS_FILTER* (optional) + Regular expression for selecting on which artifact formats + _DLREPO_POST_PROCESS_CMD_ must be run. If set, the command is run if at + least one file in the format folder matches the regular expression. + If not set, _DLREPO_POST_PROCESS_CMD_ is run on all formats. + + Example: ^(.*\.rpm|Release)$
Since this is a shell variable, I don't know how quoting is interpreted by systemd. I figure in the same way than any POSIX shell. Can you at least change the example value to this? _'^(.+\.rpm|Release)$'_ That way, users will have a good starting point.
+ *DLREPO_PUBLIC_URL* (optional) If set, it will replace the default URL in the script contents that are returned from the _/cli_ URL. diff --git a/etc/default b/etc/default index d9568baf05d8..a8de461cf244 100644 --- a/etc/default +++ b/etc/default @@ -9,6 +9,7 @@ #DLREPO_LOG_LEVEL=WARNING #DLREPO_USER_QUOTA=10737418240 #DLREPO_POST_PROCESS_CMD= +#DLREPO_POST_PROCESS_FILTER= #DLREPO_PUBLIC_URL= #DLREPO_TEMPLATES_DIR= #DLREPO_STATIC_DIR=
Please preserve your commit log for v2! Thanks a lot.
-- 2.30.2
builds.sr.ht <builds@sr.ht>dlrepo/patches/.build.yml: SUCCESS in 1m18s [fs: fix post-process of deduplicated files][0] from [Julien Floret][1] [0]: https://lists.sr.ht/~rjarry/dlrepo/patches/32738 [1]: mailto:julien.floret@6wind.com ✓ #773377 SUCCESS dlrepo/patches/.build.yml https://builds.sr.ht/~rjarry/job/773377
Julien Floret, Jun 03, 2022 at 11:18: