This small series is for fixing potential orphan hardlinks when the
same file is uploaded several times into different places.
Patch 1 is a preliminary code cleanup without functional changes.
Patch 2 actually fixes the bug by adding a check before moving the
uploaded file into the .blobs folder.
Julien Floret (2):
fs: rm return value for finalize_upload
fs: ensure upload never replaces an existing blob
dlrepo/fs/__init__.py | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
--
2.39.2
[PATCH dlrepo 1/2] fs: rm return value for finalize_upload
When a file is uploaded, it is first uploaded to .uploads/<uuid>, then
moved to .blobs/<digest>.
The .blobs/<digest> file is then hardlinked to the actual file path
into the job.
If however the same file is uploaded more than once, without checking
that its digest already exists on the server, then any existing
.blobs/<digest> file is replaced with the same newly uploaded
file. As a consequence, the existing hardlinks become orphaned.
What's more, if the most recent job is deleted afterwards, the
associated .blobs/<digest> file will have no other hardlink to it, so
it will be deleted at the next run of the cleanup task.
This is catastrophic for containers, because it breaks container
pull. Indeed "docker pull", for example, downloads its layers
from .blobs/<digest>. If the blob has been cleaned up, the pull
command will fail with the "unknown blob" error message, even if the
missing blob can be seen under the job's container format.
The problem can be easily reproduced by running multiple parallel
"dlrepo-cli upload" of the same big file into different jobs.
To avoid this, let's check if the blob already exists before moving
the uploaded file into it. If yes, just keep the existing blob and
remove the newly uploaded one.
Fixes: bd1c23893882 ("server: add filesystem api")
Signed-off-by: Julien Floret <julien.floret@6wind.com>
Acked-by: Thomas Faivre <thomas.faivre@6wind.com>
---
dlrepo/fs/__init__.py | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/dlrepo/fs/__init__.py b/dlrepo/fs/__init__.py
index 00c68f00f92f..789eff455055 100644
--- a/dlrepo/fs/__init__.py+++ b/dlrepo/fs/__init__.py
@@ -439,6 +439,10 @@ async def _check_and_move(src: Path, dst: Path, digest: str):
except OSError:
pass
raise ValueError(f"Received data does not match digest: {digest}")
- dst.parent.mkdir(mode=0o755, parents=True, exist_ok=True)- src.rename(dst)- dst.chmod(0o644)+ if dst.exists():+ # never replace an existing blob, we would lose any existing hard links+ src.unlink()+ else:+ dst.parent.mkdir(mode=0o755, parents=True, exist_ok=True)+ src.rename(dst)+ dst.chmod(0o644)
--
2.39.2
Julien Floret <julien.floret@6wind.com> wrote:
> This small series is for fixing potential orphan hardlinks when the> same file is uploaded several times into different places.> Patch 1 is a preliminary code cleanup without functional changes.> Patch 2 actually fixes the bug by adding a check before moving the> uploaded file into the .blobs folder.>> Julien Floret (2):> fs: rm return value for finalize_upload> fs: ensure upload never replaces an existing blob
Acked-by: Robin Jarry <robin@jarry.cc>
Applied, thanks.
To git@git.sr.ht:~rjarry/dlrepo
7d14913bec77..998ec62a1d22 main -> main