

Re: [PATCH] amd64: optimize loading +0 into floating point registers

Message ID: <3IZQYCSM1UCM4.3KEAJ4MQE5UY9@mforney.org>
Patch: +19 -5
Revisiting this, motivated by this recent blog post by Brian Callahan:
https://briancallahan.net/blog/20220330.html

Quentin Carbonneaux <quentin@c9x.me> wrote:
> I am not opposed to this optimization but I find the implementation
> a bit too hacky. An alternative approach I can suggest is to add a
> 'zero' instruction that just zeroes its return temporary. This
> instruction could easily work with both integer and sse registers.
> It could be 'xzero' as well if it is only used for amd64.

I'm not sure what you find hacky about this approach. The job of
isel() is to replace instructions that cannot be encoded in the
target ISA with an equivalent sequence of instructions that can.
In this case, we *do* have an amd64 instruction that copies zero
into a float register, and that is `pxor dst, dst`. We just need
to let that case pass through isel() and handle it in emit().

Adding a new instruction '%r =d zero' that is equivalent to one we
already have ('%r =d copy 0') seems worse to me. What if we also
wanted to optimize loading other floating point constants? On arm64,
we can move a limited set of constants into a register with fmov.
It would seem silly to me to add new instructions like `%r =d one`
for this.
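
For reference, the encodings in question (illustrative, AT&T syntax; byte
counts per the Intel SDM):

```
mov   $0, %eax       # b8 00 00 00 00   5 bytes
xor   %eax, %eax     # 31 c0            2 bytes
xorps %xmm0, %xmm0   # 0f 57 c0         3 bytes
pxor  %xmm0, %xmm0   # 66 0f ef c0      4 bytes (one byte longer, but per [1]
                     #                   it can run on all vector ALU ports)
```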

Below is the patch (the last one!) I've been using in my QBE branch.
It handles integer register zeroing as well as floating point. You
can apply it with `git am --scissors`.

-- 8< --
From: =?UTF-8?q?=C3=89rico=20Nogueira?= <erico.erc@gmail.com>
Date: Sun, 11 Jul 2021 19:19:12 -0300
Subject: [PATCH] amd64: optimize loading 0 into registers

Loading +0 into a floating point register can be done using pxor or
xorps instructions. Per [1], we went with pxor because it can run on all
vector ALU ports, even if it's one byte longer.

Similarly, an integer register can be zeroed with xor, which has a
smaller encoding than mov with 0 immediate.

To implement this, let fixarg pass through Ocopy when the value is
+0 for floating point, and change emitins to emit pxor/xor when it
encounters a copy from 0.

Co-authored-by: Michael Forney <mforney@mforney.org>

[1] https://stackoverflow.com/questions/39811577/does-using-mix-of-pxor-and-xorps-affect-performance/39828976
---
 amd64/emit.c | 12 ++++++++++++
 amd64/isel.c | 12 +++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/amd64/emit.c b/amd64/emit.c
index b8e9e8e..388b8b3 100644
--- a/amd64/emit.c
+++ b/amd64/emit.c
@@ -450,6 +450,18 @@ emitins(Ins i, Fn *fn, FILE *f)
 		if (req(i.to, i.arg[0]))
 			break;
 		t0 = rtype(i.arg[0]);
+		if (t0 == RCon
+		&& fn->con[i.arg[0].val].type == CBits
+		&& fn->con[i.arg[0].val].bits.i == 0) {
+			if (isreg(i.to)) {
+				if (KBASE(i.cls) == 0)
+					emitf("xor%k %=, %=", &i, fn, f);
+				else
+					emitf("pxor %D=, %D=", &i, fn, f);
+				break;
+			}
+			i.cls = KWIDE(i.cls) ? Kl : Kw;
+		}
 		if (i.cls == Kl
 		&& t0 == RCon
 		&& fn->con[i.arg[0].val].type == CBits) {
diff --git a/amd64/isel.c b/amd64/isel.c
index 4181e26..d4f0b69 100644
--- a/amd64/isel.c
+++ b/amd64/isel.c
@@ -69,7 +69,7 @@ fixarg(Ref *r, int k, Ins *i, Fn *fn)
 	r1 = r0 = *r;
 	s = rslot(r0, fn);
 	op = i ? i->op : Ocopy;
-	if (KBASE(k) == 1 && rtype(r0) == RCon) {
+	if (KBASE(k) == 1 && rtype(r0) == RCon && fn->con[r0.val].bits.i != 0) {
 		/* load floating points from memory
 		 * slots, they can't be used as
 		 * immediates
@@ -84,13 +84,15 @@ fixarg(Ref *r, int k, Ins *i, Fn *fn)
 		a.offset.label = intern(buf);
 		fn->mem[fn->nmem-1] = a;
 	}
-	else if (op != Ocopy && k == Kl && noimm(r0, fn)) {
+	else if (op != Ocopy && ((k == Kl && noimm(r0, fn)) || (KBASE(k) == 1 && rtype(r0) == RCon))) {
 		/* load constants that do not fit in
 		 * a 32bit signed integer into a
-		 * long temporary
+		 * long temporary OR
+		 * load positive zero into a floating
+		 * point register
 		 */
-		r1 = newtmp("isel", Kl, fn);
-		emit(Ocopy, Kl, r1, r0, R);
+		r1 = newtmp("isel", k, fn);
+		emit(Ocopy, k, r1, r0, R);
 	}
 	else if (s != -1) {
 		/* load fast locals' addresses into
-- 
2.34.1

Re: [PATCH] amd64: optimize loading +0 into floating point registers

Message ID: <968cb24b-6687-4231-bcda-fa7622fd31b3@www.fastmail.com>
In-Reply-To: <3IZQYCSM1UCM4.3KEAJ4MQE5UY9@mforney.org>
This is indeed well motivated by Brian Callahan's post.

On Mon, Apr 4, 2022, at 04:59, Michael Forney wrote:
> +		if (t0 == RCon
> +		&& fn->con[i.arg[0].val].type == CBits
> +		&& fn->con[i.arg[0].val].bits.i == 0) {
> +			if (isreg(i.to)) {
> +				if (KBASE(i.cls) == 0)
> +					emitf("xor%k %=, %=", &i, fn, f);
> +				else
> +					emitf("pxor %D=, %D=", &i, fn, f);
> +				break;
> +			}
> +			i.cls = KWIDE(i.cls) ? Kl : Kw;
> +		}

However, I find it a bit dangerous to turn the mov into
an xor in emit.c. The former does not set the flags while
the latter does. I could not expose a bug without changes
to qbe, but if copy elimination is disabled, the following
program is compiled incorrectly (with the patch applied):

data $zero = {w 0}

export
function w $main() {
@start
        %z =w loadw $zero
        jnz %z, @out, @loop
@loop
        %i0 =w phi @start 10, @loop %i1
        %i1 =w sub %i0, 1
        %x0 =w copy 0
        jnz %i1, @loop, @out
@out
        %i2 =w phi @start 0, @loop %i1
        %x1 =w phi @start 1, @loop %x0
        %r =w add %i2, %x1
        ret %r
}
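
To make the failure concrete (hypothetical register assignment, for
illustration only): in @loop, isel lets jnz %i1 reuse the flags set by
the sub, so an xor emitted for the copy in between clobbers them:

```
subl $1, %ecx       # %i1 =w sub %i0, 1  -- sets ZF when %i1 reaches 0
xorl %eax, %eax     # %x0 =w copy 0 rewritten to xor -- forces ZF=1
jnz  .Lloop         # jnz %i1 now reads the xor's flags: never taken
```

With the mov, the loop runs its ten iterations; with the xor, it exits
after the first one.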

One solution I'm keen to explore is to introduce a new
generic isel pass before the arch-specific isels. There,
we could emit a 'zero' instruction that is known to
clobber flags. This would also be a fine place to turn
multiplications by powers of two into shifts and make
other miscellaneous improvements. Maybe even match &
replace bitwise rotations to get better perf on
crypto code.

Re: [PATCH] amd64: optimize loading +0 into floating point registers

Brian Callahan <bcallah@posteo.net>
Message ID: <4c2c5cee-02a9-230b-26e8-fcf6338b4fd9@posteo.net>
In-Reply-To: <968cb24b-6687-4231-bcda-fa7622fd31b3@www.fastmail.com>
Hi all --

(Apologies, I am not subscribed to this list and simply clicked the
"Reply to this thread" button on https://lists.sr.ht/~mpu/qbe/patches/30785)

You're right to point out the xor vs mov flags difference. This is why I
tried to make clear in the two blog posts that my little teaching
peephole optimizer was for the combination of cproc+qbe, not qbe
alone, though perhaps I did not do such a good job of making that clear. I
am nearly positive that cproc cannot emit code that ever depends on this
flags difference (though I am sure Michael will correct me if I am wrong
here).

In any event, I use QBE quite a bit so if my post motivates some
additional optimizations, that is a good result. I even have an OpenBSD
package ready to go that I'll get around to importing someday :)

~Brian

Re: [PATCH] amd64: optimize loading +0 into floating point registers

Message ID: <d79c7774-b715-4284-851d-15464c6b4703@www.fastmail.com>
In-Reply-To: <4c2c5cee-02a9-230b-26e8-fcf6338b4fd9@posteo.net>
Hi Brian,

On Wed, Apr 13, 2022, at 04:06, Brian Callahan wrote:
> I am nearly positive that cproc cannot emit code that ever depends on this
> flags difference (though I am sure Michael will correct me if I am wrong
> here).

It is pretty hard to tell for sure whether your optimization is correct
in all cases. It may be today, but there is no guarantee it'll
remain correct. As you can read in my previous message, it is
merely a matter of disabling copy elimination in qbe to expose a
bug. (And you can view disabling copy elimination as using 'qbe -O0'
instead of 'qbe -O1'; both should give the same results.)

Your peephole optimizer could be modified to detect whether flags are
'live' when hitting a 'mov $0, %r__'. That'd make a nice followup
blog post :).

There is a safe way to do this xor optimization in qbe, and it is
to do it right before instruction selection for jumps happens.
I unfortunately do not have the throughput to look at this before
the 1.0 release.

> In any event, I use QBE quite a bit so if my post motivates some
> additional optimizations, that is a good result.

Yes! I really appreciate that you went through the bother of
measuring the wins, that's the best way to justify putting
effort into it.

> I even have an OpenBSD
> package ready to go that I'll get around to importing someday :)

Maybe wait for the 1.0 release; it'll be available soonish.