amd64: optimize loading +0 into floating point registers (v1, NEEDS REVISION)

Michael Forney: 1
 amd64: optimize loading +0 into floating point registers

 2 files changed, 19 insertions(+), 5 deletions(-)
Hi all --

(Apologies, I am not subscribed to this list and simply clicked the
"Reply to this thread" button on https://lists.sr.ht/~mpu/qbe/patches/30785)

You're right to point out the xor vs mov flags difference. This is why I
tried to make clear in the two blog posts that my little teaching
peephole optimizer was for the combination of cproc+qbe, not qbe
alone, though perhaps I did not do such a good job of making that clear. I
am nearly positive that cproc cannot emit code that ever depends on this
flags difference (though I am sure Michael will correct me if I am wrong).

In any event, I use QBE quite a bit, so if my post motivates some
additional optimizations, that is a good result. I even have an OpenBSD
package ready to go that I'll get around to importing someday :)
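
For readers following along, the flags difference in question can be
sketched in AT&T syntax (labels and registers here are illustrative, not
taken from any real compiler output). A peephole that rewrites a mov
into an xor between a compare and its branch changes behavior:

```
	cmpl	%esi, %edi
	movl	$0, %eax	# safe: mov leaves EFLAGS alone, jl still tests the cmpl
	jl	.Lnegative

	cmpl	%esi, %edi
	xorl	%eax, %eax	# clobbers EFLAGS (ZF=1, SF=OF=CF=0)
	jl	.Lnegative	# wrong: jl tests SF != OF, now never taken
```

This is why the substitution is only valid when no later instruction
reads the flags produced before the mov.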
Re: [PATCH] amd64: optimize loading +0 into floating point registers

Revisiting this, motivated by this recent blog post by Brian Callahan:

Quentin Carbonneaux <quentin@c9x.me> wrote:
> I am not opposed to this optimization but I find the implementation
> a bit too hacky. An alternative approach I can suggest is to add a
> 'zero' instruction that just zeroes its return temporary. This
> instruction could easily work with both integer and sse registers.
> It could be 'xzero' as well if it is only used for amd64.

I'm not sure what you find hacky about this approach. The job of
isel() is to replace instructions that cannot be encoded in the
target ISA with an equivalent sequence of instructions that can.
In this case, we *do* have an amd64 instruction that copies zero
into a float register, and that is `pxor dst, dst`. We just need
to let that case pass through isel() and handle it in emit().
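
For concreteness, here is a sketch of the intended lowering (the QBE IL
spelling and the register choices are illustrative; the allocator may
pick different registers):

```
# QBE IL (input)        # emitted amd64, with this patch
%f =d copy d_0          pxor %xmm0, %xmm0
%n =w copy 0            xorl %eax, %eax
```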

Adding a new instruction '%r =d zero' that is equivalent to one we
already have ('%r =d copy 0') seems worse to me. What if we also
wanted to optimize loading other floating point constants? On arm64,
we can move a limited set of constants into a register with fmov.
It would seem silly to me to add new instructions like `%r =d one`
for this.

Below is the patch (the last one!) I've been using in my QBE branch.
It handles integer register zeroing as well as floating point. You
can apply it with `git am --scissors`.

-- 8< --
From: Érico Nogueira <erico.erc@gmail.com>
Date: Sun, 11 Jul 2021 19:19:12 -0300
Subject: [PATCH] amd64: optimize loading 0 into registers

Loading +0 into a floating point register can be done using pxor or
xorps instructions. Per [1], we went with pxor because it can run on all
vector ALU ports, even if it's one byte longer.

Similarly, an integer register can be zeroed with xor, which has a
smaller encoding than mov with 0 immediate.

To implement this, let fixarg pass through Ocopy when the value is
+0 for floating point, and change emitins to emit pxor/xor when it
encounters a copy from 0.

Co-authored-by: Michael Forney <mforney@mforney.org>

[1] https://stackoverflow.com/questions/39811577/does-using-mix-of-pxor-and-xorps-affect-performance/39828976
 amd64/emit.c | 12 ++++++++++++
 amd64/isel.c | 12 +++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/amd64/emit.c b/amd64/emit.c
index b8e9e8e..388b8b3 100644
--- a/amd64/emit.c
+++ b/amd64/emit.c
@@ -450,6 +450,18 @@ emitins(Ins i, Fn *fn, FILE *f)
 		if (req(i.to, i.arg[0]))
 			break;
 		t0 = rtype(i.arg[0]);
+		if (t0 == RCon
+		&& fn->con[i.arg[0].val].type == CBits
+		&& fn->con[i.arg[0].val].bits.i == 0) {
+			if (isreg(i.to)) {
+				if (KBASE(i.cls) == 0)
+					emitf("xor%k %=, %=", &i, fn, f);
+				else
+					emitf("pxor %D=, %D=", &i, fn, f);
+				break;
+			}
+			i.cls = KWIDE(i.cls) ? Kl : Kw;
+		}
 		if (i.cls == Kl
 		&& t0 == RCon
 		&& fn->con[i.arg[0].val].type == CBits) {
diff --git a/amd64/isel.c b/amd64/isel.c
index 4181e26..d4f0b69 100644
--- a/amd64/isel.c
+++ b/amd64/isel.c
@@ -69,7 +69,7 @@ fixarg(Ref *r, int k, Ins *i, Fn *fn)
 	r1 = r0 = *r;
 	s = rslot(r0, fn);
 	op = i ? i->op : Ocopy;
-	if (KBASE(k) == 1 && rtype(r0) == RCon) {
+	if (KBASE(k) == 1 && rtype(r0) == RCon && fn->con[r0.val].bits.i != 0) {
 		/* load floating points from memory
 		 * slots, they can't be used as
 		 * immediates
@@ -84,13 +84,15 @@ fixarg(Ref *r, int k, Ins *i, Fn *fn)
 		a.offset.label = intern(buf);
 		fn->mem[fn->nmem-1] = a;
 	}
-	else if (op != Ocopy && k == Kl && noimm(r0, fn)) {
+	else if (op != Ocopy && ((k == Kl && noimm(r0, fn)) || (KBASE(k) == 1 && rtype(r0) == RCon))) {
 		/* load constants that do not fit in
 		 * a 32bit signed integer into a
-		 * long temporary
+		 * long temporary OR
+		 * load positive zero into a floating
+		 * point register
 		 */
-		r1 = newtmp("isel", Kl, fn);
-		emit(Ocopy, Kl, r1, r0, R);
+		r1 = newtmp("isel", k, fn);
+		emit(Ocopy, k, r1, r0, R);
 	}
 	else if (s != -1) {
 		/* load fast locals' addresses into
This is indeed well motivated by Brian Callahan's post.