~lkcamp/discussion

Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Vinicius Peixoto <vpeixoto@lkcamp.dev>
Message ID: <d673f289-2385-4949-ac80-f3a502d4deb2@lkcamp.dev>

Hi all,

I noticed this report from syzbot when going through the preliminary 
tasks for the Linux Kernel Mentorship Program, and thought I'd take a 
stab at solving it. I apologize in advance for any mistakes as I'm still 
very new to kernel development. Either way, here's my analysis:

From what I can tell by looking at the reproducer from syzbot, it is 
trying to mount a file filled with bogus data as an ocfs2 disk, and this 
is triggering an assertion in jbd2_cleanup_journal_tail, which in turn 
causes a panic.

The problematic call stack goes roughly like this:

mount_bdev
   -> ocfs2_mount_volume
     -> ocfs2_check_volume
       -> ocfs2_journal_load
         -> jbd2_journal_load
           -> journal_reset (fails)
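
(For reference, the -EINVAL mentioned below comes from a sanity check
near the top of journal_reset(); this is roughly paraphrased from the
jbd2 code, so the details may differ slightly:)

         unsigned long long first = be32_to_cpu(sb->s_first);
         unsigned long long last = be32_to_cpu(sb->s_maxlen);

         if (first + JBD2_MIN_JOURNAL_BLOCKS > last + 1) {
                 printk(KERN_ERR "JBD2: Journal too short (blocks %llu-%llu).\n",
                        first, last);
                 journal_fail_superblock(journal);
                 return -EINVAL;
         }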

Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2: 
Journal too short (blocks 2-1024)"); this leaves journal->j_head == 
NULL. However, jbd2_journal_load clears the JBD2_ABORT flag right before 
calling journal_reset. This leads to a problem later when 
ocfs2_mount_volume tries to flush the journal as part of the cleanup 
when aborting the mount operation:

   -> ocfs2_mount_volume (error; goto out_system_inodes)
     -> ocfs2_journal_shutdown
       -> jbd2_journal_flush
         -> jbd2_cleanup_journal_tail (J_ASSERT fails)

This failure happens because of the following code:

         if (is_journal_aborted(journal))
                 return -EIO;

         if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
                 return 1;
         J_ASSERT(blocknr != 0);

Since JBD2_ABORT was cleared in jbd2_journal_load earlier, we enter 
jbd2_journal_get_log_tail, which will set *blocknr = journal->j_head 
(which is NULL) and then trigger the assertion, causing a panic.
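
(For reference: with a freshly-initialised journal there are no
checkpoint, committing or running transactions, so
jbd2_journal_get_log_tail() falls through to roughly the following --
paraphrased; the real function does this under the journal locks:)

         } else {
                 /* no transactions at all: fall back to the log head */
                 *tid = journal->j_transaction_sequence;
                 *block = journal->j_head;  /* never initialised, still 0 */
         }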

I confirmed that setting the JBD2_ABORT flag in journal_reset before 
returning -EINVAL fixes the problem:

         static int journal_reset(journal_t *journal)
                         journal_fail_superblock(journal);
         +               journal->j_flags |= JBD2_ABORT;
                         return -EINVAL;

You can find a proper patch file + the syzbot re-test result in [1]. 
However, I'm not entirely sure whether this is the correct decision, and 
I wanted to confirm that this is an appropriate solution before sending 
a proper patch to the mailing list.

Thanks in advance,
Vinicius

[1] https://syzkaller.appspot.com/bug?extid=8512f3dbd96253ffbe27

Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Theodore Ts'o <tytso@mit.edu>
Message ID: <20240826133208.GB424729@mit.edu>
In-Reply-To: <d673f289-2385-4949-ac80-f3a502d4deb2@lkcamp.dev>

On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> journal_reset. This leads to a problem later when ocfs2_mount_volume tries
> to flush the journal as part of the cleanup when aborting the mount
> operation:
> 
>   -> ocfs2_mount_volume (error; goto out_system_inodes)
>     -> ocfs2_journal_shutdown
>       -> jbd2_journal_flush
>         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> ...
>
> I confirmed that setting the JBD2_ABORT flag in journal_reset before
> returning -EINVAL fixes the problem:
> 
>         static int journal_reset(journal_t *journal)
>                         journal_fail_superblock(journal);
>         +               journal->j_flags |= JBD2_ABORT;
>                         return -EINVAL;
> 
> You can find a proper patch file + the syzbot re-test result in [1].
> However, I'm not entirely sure whether this is the correct decision, and I
> wanted to confirm that this is an appropriate solution before sending a
> proper patch to the mailing list.

The reason why this isn't an issue with ext4 is that the normal
"right" way to handle this is: if jbd2_journal_load() returns an
error, call jbd2_journal_destroy() to release the data structures
associated with the journal --- and then don't try to use the journal
afterwards.
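
In other words, something like this (a minimal sketch of the pattern,
not the literal ext4 code):

	err = jbd2_journal_load(journal);
	if (err) {
		ext4_msg(sb, KERN_ERR, "error loading journal");
		jbd2_journal_destroy(journal);
		return err;	/* ...and never touch the journal again */
	}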

The weird thing is that there are two code paths in ocfs2 that call
jbd2_journal_load().  One is in ocfs2_replay_journal(), which does
what ext4 does.  The other is ocfs2_journal_load(), which does *not*
do this, and this is the one you saw in the syzkaller reproducer.  It
looks like one codepath is used to replay the ocfs2 journal for some
other node, and the other is to load (and presumably later, replay)
the journal for the mounting node.

There are also some other things which look very confusing, such as
the fact that ocfs2_journal_shutdown calls igrab:

	/* need to inc inode use count - jbd2_journal_destroy will iput. */
	if (!igrab(inode))
		BUG();

... which is *weird*.  Normally the journal inode refcount is expected
to be bumped before calling jbd2_journal_init_inode() --- which
normally is done by an iget() function, and is needed to make sure the
journal inode isn't released out from under the jbd2 layer.  It looks
like ocfs2 uses the journal inode for other stuff, so it gets the
journal inode without calling something like iget().  Which is OK, I
guess, but it means that there are a whole bunch of places where it
has to call BUG_ON(!igrab(journal->j_inode)) before calling
jbd2_journal_destroy().  It would seem like the *right* thing to do is
to bump the refcount in ocfs2_journal_init(), and if for some reason
the igrab fails, it can just return an error early, instead of having
to resort to BUG_ON() later.  And if you don't realize that you have
to do this weird igrab() before calling jbd2_journal_destroy(), you'll
end up leaking the journal inode.
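
Something along these lines, purely as a sketch (the exact init
function and error code here are illustrative):

	/* in ocfs2_journal_init(), once the journal inode is looked up: */
	if (!igrab(inode))
		return -ESTALE;	/* fail early instead of BUG() later */
	/* from here on, jbd2_journal_destroy() can safely iput() it */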

Anyway, I think the right thing to do is to restructure how ocfs2
calls the jbd2 journal, but that's going to require a more careful
examination of the code paths.  Your patch is a bit of a blunt force
hack that should be harmless, but it's probably not the best way to
fix the problem, although doing it "right" would be more
complicated....

					- Ted

Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Joel Becker <jlbec@evilplan.org>
Message ID: <ZszY3SHWTp7XfS3z@google.com>
In-Reply-To: <20240826133208.GB424729@mit.edu>

On Mon, Aug 26, 2024 at 09:32:08AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> > Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> > However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> > journal_reset. This leads to a problem later when ocfs2_mount_volume tries
> > to flush the journal as part of the cleanup when aborting the mount
> > operation:
> > 
> >   -> ocfs2_mount_volume (error; goto out_system_inodes)
> >     -> ocfs2_journal_shutdown
> >       -> jbd2_journal_flush
> >         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > ...
> The reason why this isn't an issue with ext4 is because the normal
> "right" way to do this is if jbd2_journal_load() returns an error, is
> to call jbd2_journal_destroy() to release the data structures with the
> journal --- and then don't try to use the journal afterwards.
> 
> The weird thing is that there are two code paths in ocfs2 that calls
> jbd2_journal_load().  One is in ocfs2_replay_journal() which does what
> ext4 does.  The other is ocfs2_journal_load() which does *not* do
> this, and this is the one which you saw in the syzkaller reproducer.
> It looks like one codepath is used to replay the ocfs2 for some other
> node, and the is to load (and presumably later, replay) the journal
> for the mounting node.

You are correct, Ted, that one path is for the local journal and the
other is to recover remote journals for other nodes that may have
crashed.

I think the big ordering issue is that we set
osb->journal->j_state=OCFS2_JOURNAL_LOADED in ocfs2_journal_init(),
before we've attempted any replay.  Later in ocfs2_journal_shutdown(),
we check this state and decide to perform cleanup.

Instead, we should not set OCFS2_JOURNAL_LOADED until
ocfs2_journal_load() has called jbd2_journal_load().  Only then do we
actually know we have loaded a valid journal.

Something like:

```
	status = jbd2_journal_load(journal->j_journal);
	if (status < 0) {
		mlog(ML_ERROR, "Failed to load journal!\n");
		BUG_ON(!igrab(journal->j_inode));
		jbd2_journal_destroy(journal->j_journal);
		iput(journal->j_inode);
		goto done;
	}
	journal->j_state = OCFS2_JOURNAL_LOADED;
```

With code like this, when jbd2_journal_load() fails, the later
ocfs2_journal_shutdown() will exit early, because the journal state is
not OCFS2_JOURNAL_LOADED.
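
Roughly, the guard on the shutdown side then does the work (sketch
only, not the exact ocfs2_journal_shutdown() code):

```
	/* nothing was loaded, so there is nothing to flush or destroy */
	if (journal->j_state != OCFS2_JOURNAL_LOADED)
		return;
```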

I think this is the right spot; a quick audit of the paths (it has been
a while) doesn't find any other outstanding state; the rest of journal
startup, such as the commit thread etc., only happens after this.

> jbd2_journal_destroy().  It would seem like the *right* thing to do is
> to bump the refcount in ocfs2_journal_init(), and if for some reason
> the igrab fails, it can just return an error early, instead of having
> to resort to BUG_ON() later, and if you don't realize that you have to
> do this weird igrab() before calling jbd2_journal_destroy(), you'll
> end up leaking the journal inode.

There are interactions of journal inodes for nodes we don't own, and
also connections to cluster locks for our own journal (don't replay
ourselves while another node has locked it and is recovering us).  So we
do have some state to keep track of.  But it's been so long that I don't
recall if there was a specific reason we do this late via igrab(), or if
it's just that we should have done as you describe and missed it.
You'll note that I copied the igrab/iput game in my snippet above.

Should someone try to audit the igrab/iput thing later?  Yes.  But it's
not a necessary part of this fix.

Thanks,
Joel

-- 

"War doesn't determine who's right; war determines who's left."

			http://www.jlbec.org/
			jlbec@evilplan.org

Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Jan Kara <jack@suse.cz>
Message ID: <20240828105452.z6tqwq47bcnmrevd@quack3>
In-Reply-To: <20240826133208.GB424729@mit.edu>

On Mon 26-08-24 09:32:08, Theodore Ts'o wrote:
> On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> > Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> > However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> > journal_reset. This leads to a problem later when ocfs2_mount_volume tries
> > to flush the journal as part of the cleanup when aborting the mount
> > operation:
> > 
> >   -> ocfs2_mount_volume (error; goto out_system_inodes)
> >     -> ocfs2_journal_shutdown
> >       -> jbd2_journal_flush
> >         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > ...
> >
> > I confirmed that setting the JBD2_ABORT flag in journal_reset before
> > returning -EINVAL fixes the problem:
> > 
> >         static int journal_reset(journal_t *journal)
> >                         journal_fail_superblock(journal);
> >         +               journal->j_flags |= JBD2_ABORT;
> >                         return -EINVAL;
> > 
> > You can find a proper patch file + the syzbot re-test result in [1].
> > However, I'm not entirely sure whether this is the correct decision, and I
> > wanted to confirm that this is an appropriate solution before sending a
> > proper patch to the mailing list.
> 
> The reason why this isn't an issue with ext4 is because the normal
> "right" way to do this is if jbd2_journal_load() returns an error, is
> to call jbd2_journal_destroy() to release the data structures with the
> journal --- and then don't try to use the journal afterwards.

Yep. OCFS2 guys are actually looking into fixing this in OCFS2
(https://lore.kernel.org/ocfs2-devel/20240819131120.746077-1-sunjunchao2870@gmail.com/)
Not sure what the status is, though. Julian, did you send v2 of your fix?

								Honza

> The weird thing is that there are two code paths in ocfs2 that calls
> jbd2_journal_load().  One is in ocfs2_replay_journal() which does what
> ext4 does.  The other is ocfs2_journal_load() which does *not* do
> this, and this is the one which you saw in the syzkaller reproducer.
> It looks like one codepath is used to replay the ocfs2 for some other
> node, and the is to load (and presumably later, replay) the journal
> for the mounting node.
> 
> There are also some other things which look very confusing, such as the
> fact that ocfs2_journal_shutdown calls igrab:
> 
> 	/* need to inc inode use count - jbd2_journal_destroy will iput. */
> 	if (!igrab(inode))
> 		BUG();
> 
> ... which is *weird*.  Normaly the journal inode refcount is expected
> to be bumped before calling jbd2_journal_init_inode() --- which
> normally is done by an iget() function, and is needed to make sure the
> journal inode isn't released out from under the jbd2 layer.  It looks
> like ocfs2 uses the journal inode for other stuff so get the journal
> inode without calling something like iget().  Which is OK, I guess,
> but it means that there are a whole bunch of places where it has to
> call !BUG_ON(igrab(journal->j_inode) before calling
> jbd2_journal_destroy().  It would seem like the *right* thing to do is
> to bump the refcount in ocfs2_journal_init(), and if for some reason
> the igrab fails, it can just return an error early, instead of having
> to resort to BUG_ON() later, and if you don't realize that you have to
> do this weird igrab() before calling jbd2_journal_destroy(), you'll
> end up leaking the journal inode.
> 
> Anyway, I think the right thing to do is to restructure how ocfs2
> calls the jbd2 journal, but that's going to require a more careful
> examination of the code paths.  Your patch is a bit of a blunt force
> hack that should be harmless, but it's probably not the best way to
> fix the problem, although doing it "right" would be more
> complicated....
> 
> 					- Ted
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Julian Sun <sunjunchao2870@gmail.com>
Message ID: <c2724c5c2c17127866e6ed5c69b8f1f24d4c2db3.camel@gmail.com>
In-Reply-To: <20240828105452.z6tqwq47bcnmrevd@quack3>

On Wed, 2024-08-28 at 12:54 +0200, Jan Kara wrote:
> On Mon 26-08-24 09:32:08, Theodore Ts'o wrote:
> > On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > > Since the disk data is bogus, journal_reset fails with -EINVAL
> > > ("JBD2:
> > > Journal too short (blocks 2-1024)"); this leaves journal->j_head
> > > == NULL.
> > > However, jbd2_journal_load clears the JBD2_ABORT flag right
> > > before calling
> > > journal_reset. This leads to a problem later when
> > > ocfs2_mount_volume tries
> > > to flush the journal as part of the cleanup when aborting the
> > > mount
> > > operation:
> > > 
> > >   -> ocfs2_mount_volume (error; goto out_system_inodes)
> > >     -> ocfs2_journal_shutdown
> > >       -> jbd2_journal_flush
> > >         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > > ...
> > > 
> > > I confirmed that setting the JBD2_ABORT flag in journal_reset
> > > before
> > > returning -EINVAL fixes the problem:
> > > 
> > >         static int journal_reset(journal_t *journal)
> > >                         journal_fail_superblock(journal);
> > >         +               journal->j_flags |= JBD2_ABORT;
> > >                         return -EINVAL;
> > > 
> > > You can find a proper patch file + the syzbot re-test result in
> > > [1].
> > > However, I'm not entirely sure whether this is the correct
> > > decision, and I
> > > wanted to confirm that this is an appropriate solution before
> > > sending a
> > > proper patch to the mailing list.
> > 
> > The reason why this isn't an issue with ext4 is because the normal
> > "right" way to do this is if jbd2_journal_load() returns an error,
> > is
> > to call jbd2_journal_destroy() to release the data structures with
> > the
> > journal --- and then don't try to use the journal afterwards.
> 
> Yep. OCFS2 guys are actually looking into fixing this in OCFS2
> (https://lore.kernel.org/ocfs2-devel/20240819131120.746077-1-sunjunchao2870@gmail.com/)
> Not sure what's the status though. Julian, did you send v2 of your
> fix?

Yeah, I have already submitted v2[1] of the patch; it is awaiting
review.

[1]: https://lore.kernel.org/ocfs2-devel/20240823083150.17590-1-sunjunchao2870@gmail.com/
> 
>                                                                 Honza
> 
> > The weird thing is that there are two code paths in ocfs2 that
> > calls
> > jbd2_journal_load().  One is in ocfs2_replay_journal() which does
> > what
> > ext4 does.  The other is ocfs2_journal_load() which does *not* do
> > this, and this is the one which you saw in the syzkaller
> > reproducer.
> > It looks like one codepath is used to replay the ocfs2 for some
> > other
> > node, and the is to load (and presumably later, replay) the journal
> > for the mounting node.
> > 
> > There are also some other things which look very confusing, such as
> > the
> > fact that ocfs2_journal_shutdown calls igrab:
> > 
> >         /* need to inc inode use count - jbd2_journal_destroy will
> > iput. */
> >         if (!igrab(inode))
> >                 BUG();
> > 
> > ... which is *weird*.  Normaly the journal inode refcount is
> > expected
> > to be bumped before calling jbd2_journal_init_inode() --- which
> > normally is done by an iget() function, and is needed to make sure
> > the
> > journal inode isn't released out from under the jbd2 layer.  It
> > looks
> > like ocfs2 uses the journal inode for other stuff so get the
> > journal
> > inode without calling something like iget().  Which is OK, I guess,
> > but it means that there are a whole bunch of places where it has to
> > call !BUG_ON(igrab(journal->j_inode) before calling
> > jbd2_journal_destroy().  It would seem like the *right* thing to do
> > is
> > to bump the refcount in ocfs2_journal_init(), and if for some
> > reason
> > the igrab fails, it can just return an error early, instead of
> > having
> > to resort to BUG_ON() later, and if you don't realize that you have
> > to
> > do this weird igrab() before calling jbd2_journal_destroy(), you'll
> > end up leaking the journal inode.
> > 
> > Anyway, I think the right thing to do is to restructure how ocfs2
> > calls the jbd2 journal, but that's going to require a more careful
> > examination of the code paths.  Your patch is a bit of a blunt
> > force
> > hack that should be harmless, but it's probably not the best way to
> > fix the problem, although doing it "right" would be more
> > complicated....
> > 
> >                                         - Ted

Thanks,
-- 
Julian Sun <sunjunchao2870@gmail.com>

Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail

Julian Sun <sunjunchao2870@gmail.com>
Message ID: <80ad8de73bc8bea5d77f0802e65e970d5993c65a.camel@gmail.com>
In-Reply-To: <ZszY3SHWTp7XfS3z@google.com>

On Mon, 2024-08-26 at 12:34 -0700, Joel Becker wrote:
> On Mon, Aug 26, 2024 at 09:32:08AM -0400, Theodore Ts'o wrote:
> > On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > > Since the disk data is bogus, journal_reset fails with -EINVAL
> > > ("JBD2:
> > > Journal too short (blocks 2-1024)"); this leaves journal->j_head
> > > == NULL.
> > > However, jbd2_journal_load clears the JBD2_ABORT flag right
> > > before calling
> > > journal_reset. This leads to a problem later when
> > > ocfs2_mount_volume tries
> > > to flush the journal as part of the cleanup when aborting the
> > > mount
> > > operation:
> > > 
> > >   -> ocfs2_mount_volume (error; goto out_system_inodes)
> > >     -> ocfs2_journal_shutdown
> > >       -> jbd2_journal_flush
> > >         -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > > ...
> > The reason why this isn't an issue with ext4 is because the normal
> > "right" way to do this is if jbd2_journal_load() returns an error,
> > is
> > to call jbd2_journal_destroy() to release the data structures with
> > the
> > journal --- and then don't try to use the journal afterwards.
> > 
> > The weird thing is that there are two code paths in ocfs2 that
> > calls
> > jbd2_journal_load().  One is in ocfs2_replay_journal() which does
> > what
> > ext4 does.  The other is ocfs2_journal_load() which does *not* do
> > this, and this is the one which you saw in the syzkaller
> > reproducer.
> > It looks like one codepath is used to replay the ocfs2 for some
> > other
> > node, and the is to load (and presumably later, replay) the journal
> > for the mounting node.
> 
> You are correct, Ted, that one path is for the local journal and the
> other is to recover remote journals for other nodes that may have
> crashed.
> 
> I think the big ordering issue is that we set
> osb->journal->j_state=OCFS2_JOURNAL_LOADED in ocfs2_journal_init(),
> before we've attempted any replay.  Later in
> ocfs2_journal_shutdown(),
> we check this state and decide to perform cleanup.
> 
> Instead, we should not set OCFS2_JOURNAL_LOADED until
> ocfs2_journal_load() has called jbd2_journal_load().  Only then do we
> actually know we have loaded a valid journal.

Yeah, that's right. We should not set OCFS2_JOURNAL_LOADED in
ocfs2_journal_init(). Instead, we should set a different flag there,
like OCFS2_JOURNAL_INIT, to indicate that resources have been
allocated. This way, we can clean them up properly in
ocfs2_journal_shutdown(). We should distinguish between these two
states to ensure the correct exit procedure when an error occurs, just
like this patch[1] does.

[1]: https://lore.kernel.org/ocfs2-devel/20240823083150.17590-1-sunjunchao2870@gmail.com/
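
Roughly the idea (the flag names here are illustrative, not
necessarily the ones the patch uses):

	enum ocfs2_journal_state {
		OCFS2_JOURNAL_FREE = 0,
		OCFS2_JOURNAL_INIT,	/* ocfs2_journal_init() done: inode
					   and jbd2 structures exist */
		OCFS2_JOURNAL_LOADED,	/* jbd2_journal_load() succeeded */
	};

ocfs2_journal_shutdown() can then tear down exactly as much as was set
up for whichever state it finds.
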
> 
> Something like:
> 
> ```
>         status = jbd2_journal_load(journal->j_journal);
>         if (status < 0) {
>                 mlog(ML_ERROR, "Failed to load journal!\n");
>                 BUG_ON(!igrab(journal->j_inode));
>                 jbd2_journal_destroy(journal->j_journal);
>                 iput(journal->j_inode)
>                 goto done;
>         }
>         journal->j_state = OCFS2_JOURNAL_LOADED;
> ```
> 
> With code like this, when jbd2_journal_load() fails, the future
> ocfs2_journal_shutdown() will exit early, because
> !OCFS2_JOURNAL_LOADED.
> 
> I think this is the right spot; a quick audit of the paths (it has
> been
> a while) doesn't find any other outstanding state; the rest of
> journal
> startup, such as the commit thread etc, only happen after this.
> 
> > jbd2_journal_destroy().  It would seem like the *right* thing to do
> > is
> > to bump the refcount in ocfs2_journal_init(), and if for some
> > reason
> > the igrab fails, it can just return an error early, instead of
> > having
> > to resort to BUG_ON() later, and if you don't realize that you have
> > to
> > do this weird igrab() before calling jbd2_journal_destroy(), you'll
> > end up leaking the journal inode.
> 
> There are interactions of journal inodes for nodes we don't own, and
> also connections to cluster locks for our own journal (don't replay
> ourselves while another node has locked it and is recovering us).  So
> we
> do have some state to keep track of.  But it's been so long that I
> don't
> recall if there was a specific reason we do this late via igrab(), or
> if
> it's just that we should have done as you describe and missed it.
> You'll note that I copied the igrab/iput game in my snippet above.
> 
> Should someone try to audit the igrab/iput thing later?  Yes.  But
> it's
> not a necessary part of this fix.
> 
> Thanks,
> Joel
> 

Thanks,
-- 
Julian Sun <sunjunchao2870@gmail.com>