I pushed a lightly edited version of this patch on master.
The only subtlety I noted is that it may call getcon() with
an unsigned long long that overflows int64_t (namely 2**63).
I think on all supported platforms the implementation-defined
behavior will be the right one, though.
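For reference, the conversion is easy to see in isolation; a minimal
standalone sketch (plain C, nothing QBE-specific; C makes an
out-of-range conversion to a signed type implementation-defined):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* 2**63 fits in unsigned long long but not in int64_t. */
	unsigned long long v = 1ULL << 63;

	/* Implementation-defined conversion; on the two's-complement
	 * targets QBE supports it wraps to INT64_MIN, i.e. the bit
	 * pattern we want in the constant anyway. */
	int64_t s = (int64_t)v;

	printf("%llu -> %lld\n", v, (long long)s);
	return 0;
}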
Hi,
Following up on that, I think we should not systematically
go to 'shl' for the multiplications. Shifts on amd64 are
annoying due to their register constraints, so we will
likely get better performance if we compile them to leas
and adds when possible.
If you pass 0 as the first (base) argument to lea (which costs a
register and an instruction to materialise it), you get *4 and *8 too.
And if you pass -x as the first argument (again an extra register and
instruction), you get *7.
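To spell the menu out (a rough sketch, not code from isel.c; the helper
name and the instruction sequences in the comments are mine):

#include <stdint.h>

/* Small multipliers amd64 can handle without shl, as discussed above:
 *   *2         : add %r,%r  (or lea (%r,%r))
 *   *3, *5, *9 : lea (%r,%r,S) with scale S = k-1
 *   *4, *8     : lea (%z,%r,k) with %z zeroed first (extra reg + insn),
 *                or simply a shift
 *   *7         : %t = -%r, then lea (%t,%r,8)       (extra reg + insn)
 */
int
lea_friendly_mul(uint64_t k)
{
	switch (k) {
	case 2: case 3: case 4: case 5:
	case 7: case 8: case 9:
		return 1;
	default:
		return 0;
	}
}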
The other question is what the right split is here between simpl()
and isel(). And how does it interact with isel's generation of compound
load/store instructions, which can use [base + index*N] addressing? I
noticed QBE was generating these quite nicely :)
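For anyone following along, this is the kind of fusion I mean
(illustrative only; I have not pasted actual QBE output for this exact
snippet):

#include <stdio.h>

/* An indexed load like this can be selected as a single instruction
 * using the [base + index*8] addressing mode, e.g. on amd64 something
 * along the lines of:  movq (%rdi,%rsi,8), %rax
 * instead of materialising the address with a separate shift and add. */
long
load_elem(long *a, long i)
{
	return a[i];
}

int
main(void)
{
	long a[4] = {10, 20, 30, 40};
	printf("%ld\n", load_elem(a, 2));
	return 0;
}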
Passes the "standard" test suite.
(cproc bootstrap, hare[c] make test, roland units, linpack/coremark run)
However, the linpack benchmark is now notably slower; Coremark is ~2% faster.
As noted before, linpack timing is dubious, and maybe my cheap (AMD) laptop
just prefers mul to shl.
Anecdotally it seems like linpack over-corrects for "overhead". I need
to look at the source code, but FWIW most of the GCM advantage in
linpack showed up at larger "overhead" values, and the reduction in real
overhead from this patch also lowers the measured "overhead". With some
debug output enabled, it seems almost all of the mul/udiv/urem promotions
in linpack are in initialisation code, not the actual workload.
TL;DR: I think linpack is being too clever and over-correcting for its
self-measured "overhead". The bad news (for me) is that there is still no
compelling advantage for GVN/GCM.
---
simpl.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 58 insertions(+), 3 deletions(-)
diff --git a/simpl.c b/simpl.c
index 7001301..7e10f2b 100644
--- a/simpl.c
+++ b/simpl.c
@@ -30,15 +30,52 @@ blit(Ref sd[2], int sz, Fn *fn)
}
}
+static int
+isconbits(Fn *fn, Ref r)
+{
+	return rtype(r) == RCon && fn->con[r.val].type == CBits;
+}
+
+static int
+ispow2(uint64_t v)
+{
+	return (v & (v - 1)) == 0;
+}
+
+static int LOG2_TAB64[64] = {
+	63, 0, 58, 1, 59, 47, 53, 2,
+	60, 39, 48, 27, 54, 33, 42, 3,
+	61, 51, 37, 40, 49, 18, 28, 20,
+	55, 30, 34, 11, 43, 14, 22, 4,
+	62, 57, 46, 52, 38, 26, 32, 41,
+	50, 36, 17, 19, 29, 10, 13, 21,
+	56, 45, 25, 31, 35, 16, 9, 12,
+	44, 24, 15, 8, 23, 7, 6, 5};
+
+/* caveat: only works on 64-bit systems */
+static int log2_64 (uint64_t value)
+{
+	assert(sizeof(unsigned long) == 8);
+	value |= value >> 1;
+	value |= value >> 2;
+	value |= value >> 4;
+	value |= value >> 8;
+	value |= value >> 16;
+	value |= value >> 32;
+	return LOG2_TAB64[((value - (value >> 1))*0x07EDD5E59A4E28C2UL) >> 58];
+}
Since you only apply this function to powers of two we
can simplify it quite a bit I think. Something like
static int
log2_64(uint64_t value)
{
	return LOG2_TAB64[(value*0x07EDD5E59A4E28C2UL) >> 58];
}
Indeed, nice! Although for full disclosure I did have some remorse
about this all being too cute. Maybe the more obvious
shift/compare/add chain would just be better(tm).
should be equivalent for powers of 2. I also think that
this code is portable (under the reasonable assumption that
the multiplication won't be promoted to an int type with
more than 64 bits!).
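For comparison, the boring chain mentioned above would be something
along these lines (just a sketch; like the patch, it assumes the
argument is already known to be a nonzero power of two):

#include <stdint.h>

/* Obvious alternative to the De Bruijn table: shift the single set bit
 * down to position 0, counting the steps.  At most 63 iterations for a
 * 64-bit value, so speed hardly matters here. */
int
log2_pow2(uint64_t v)
{
	int n;

	for (n = 0; v > 1; n++)
		v >>= 1;
	return n;
}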
+
static void
ins(Ins **pi, int *new, Blk *b, Fn *fn)
{
ulong ni;
+	Con *c;
Ins *i;
+	int shift;

i = *pi;
/* simplify more instructions here;
-	 * copy 0 into xor, mul 2^n into shift,
+	 * copy 0 into xor, mul 2^n into shift (DONE),
* bit rotations, ... */
switch (i->op) {
case Oblit1:
@@ -53,12 +90,30 @@ ins(Ins **pi, int *new, Blk *b, Fn *fn)
}
blit((i-1)->arg, rsval(i->arg[0]), fn);
*pi = i-1;
+		return; /* finished */
+	case Omul:
+	case Oudiv:
+	case Ourem:
+		if (!KBASE(i->cls))