Date: Sun, 03 Apr 2022 19:59:46 -0700
To: ~mpu/qbe@lists.sr.ht
Cc: Érico Nogueira
Subject: Re: [PATCH] amd64: optimize loading +0 into floating point registers
From: Michael Forney
References: <20210804083330.26855-1-mforney@mforney.org>
In-Reply-To:
Message-Id: <3IZQYCSM1UCM4.3KEAJ4MQE5UY9@mforney.org>
User-Agent: mblaze/1.2

Revisiting this, motivated by this recent blog post by Brian Callahan:
https://briancallahan.net/blog/20220330.html

Quentin Carbonneaux wrote:
> I am not opposed to this optimization but I find the implementation
> a bit too hacky. An alternative approach I can suggest is to add a
> 'zero' instruction that just zeroes its return temporary. This
> instruction could easily work with both integer and sse registers.
> It could be 'xzero' as well if it is only used for amd64.

I'm not sure what you find hacky about this approach. The job of isel()
is to replace instructions that cannot be encoded in the target ISA
with an equivalent sequence of instructions that can. In this case, we
*do* have an amd64 instruction that copies zero into a float register,
and that is `pxor dst, dst`. We just need to let that case pass through
isel() and handle it in emit().

Adding a new instruction '%r =d zero' that is equivalent to one we
already have ('%r =d copy 0') seems worse to me.

What if we also wanted to optimize loading other floating point
constants? On arm64, we can move a limited set of constants into a
register with fmov. It would seem silly to me to add new instructions
like `%r =d one` for this.

Below is the patch (the last one!)
I've been using in my QBE branch. It handles integer register zeroing
as well as floating point. You can apply it with `git am --scissors`.

-- 8< --
From: Érico Nogueira
Date: Sun, 11 Jul 2021 19:19:12 -0300
Subject: [PATCH] amd64: optimize loading 0 into registers

Loading +0 into a floating point register can be done using pxor or
xorps instructions. Per [1], we went with pxor because it can run on all
vector ALU ports, even if it's one byte longer.

Similarly, an integer register can be zeroed with xor, which has a
smaller encoding than mov with 0 immediate.

To implement this, let fixarg pass through Ocopy when the value is +0
for floating point, and change emitins to emit pxor/xor when it
encounters a copy from 0.

Co-authored-by: Michael Forney

[1] https://stackoverflow.com/questions/39811577/does-using-mix-of-pxor-and-xorps-affect-performance/39828976
---
 amd64/emit.c | 12 ++++++++++++
 amd64/isel.c | 12 +++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/amd64/emit.c b/amd64/emit.c
index b8e9e8e..388b8b3 100644
--- a/amd64/emit.c
+++ b/amd64/emit.c
@@ -450,6 +450,18 @@ emitins(Ins i, Fn *fn, FILE *f)
 		if (req(i.to, i.arg[0]))
 			break;
 		t0 = rtype(i.arg[0]);
+		if (t0 == RCon
+		&& fn->con[i.arg[0].val].type == CBits
+		&& fn->con[i.arg[0].val].bits.i == 0) {
+			if (isreg(i.to)) {
+				if (KBASE(i.cls) == 0)
+					emitf("xor%k %=, %=", &i, fn, f);
+				else
+					emitf("pxor %D=, %D=", &i, fn, f);
+				break;
+			}
+			i.cls = KWIDE(i.cls) ? Kl : Kw;
+		}
 		if (i.cls == Kl
 		&& t0 == RCon
 		&& fn->con[i.arg[0].val].type == CBits) {
diff --git a/amd64/isel.c b/amd64/isel.c
index 4181e26..d4f0b69 100644
--- a/amd64/isel.c
+++ b/amd64/isel.c
@@ -69,7 +69,7 @@ fixarg(Ref *r, int k, Ins *i, Fn *fn)
 	r1 = r0 = *r;
 	s = rslot(r0, fn);
 	op = i ? i->op : Ocopy;
-	if (KBASE(k) == 1 && rtype(r0) == RCon) {
+	if (KBASE(k) == 1 && rtype(r0) == RCon && fn->con[r0.val].bits.i != 0) {
 		/* load floating points from memory
 		 * slots, they can't be used as
 		 * immediates
@@ -84,13 +84,15 @@ fixarg(Ref *r, int k, Ins *i, Fn *fn)
 		a.offset.label = intern(buf);
 		fn->mem[fn->nmem-1] = a;
 	}
-	else if (op != Ocopy && k == Kl && noimm(r0, fn)) {
+	else if (op != Ocopy && ((k == Kl && noimm(r0, fn)) || (KBASE(k) == 1 && rtype(r0) == RCon))) {
 		/* load constants that do not fit in
 		 * a 32bit signed integer into a
-		 * long temporary
+		 * long temporary OR
+		 * load positive zero into a floating
+		 * point register
 		 */
-		r1 = newtmp("isel", Kl, fn);
-		emit(Ocopy, Kl, r1, r0, R);
+		r1 = newtmp("isel", k, fn);
+		emit(Ocopy, k, r1, r0, R);
 	}
 	else if (s != -1) {
 		/* load fast locals' addresses into
-- 
2.34.1
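
For illustration only (this is not part of the patch): the commit
message says "+0" rather than "0" because the patch compares the
constant's raw bits against 0, and pxor can only produce the all-zero
bit pattern. IEEE 754 negative zero has the sign bit set, so it fails
that test and would still be loaded from memory. A minimal C sketch of
the distinction, assuming IEEE 754 floats as on amd64; the helper names
`is_all_zero_bits_d` and `is_all_zero_bits_s` are hypothetical and do
not exist in QBE:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not a QBE function: returns 1 when the
 * double's bit pattern is all zeros, i.e. the value is exactly +0.0.
 */
int
is_all_zero_bits_d(double d)
{
	uint64_t bits;

	/* the comparison below only makes sense for 64-bit doubles */
	assert(sizeof d == sizeof bits);
	memcpy(&bits, &d, sizeof bits);
	return bits == 0;
}

/* Same check for single precision. */
int
is_all_zero_bits_s(float f)
{
	uint32_t bits;

	assert(sizeof f == sizeof bits);
	memcpy(&bits, &f, sizeof bits);
	return bits == 0;
}
```

Under these assumptions, `is_all_zero_bits_d(0.0)` is true while
`is_all_zero_bits_d(-0.0)` is false, even though `0.0 == -0.0` compares
equal as floating point values.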