Hi everyone,
I wanted to ask you all something, that is probably going to sound
nuts. The way the VM is currently implemented(in most cases), is for
it to check for stack overflow and underflow during each opcode.
But what if, we didn't.
As uxntal arity checking tools are getting better, and since we have
the system/wst and system/rst ports to check for stack depth when and
where we need to. Do we really need to check on opcode? Maybe not. I
kind of like the idea of a uxn core that cannot error, error handling
is kind of mess and I have yet to see someone other than me use it.
But not only that, it's slow as hell, and since this is destined to be
interpreted, maybe this is not the best use of cycles.
Here's a conversation someone sent me about this:
https://groups.google.com/g/comp.lang.forth/c/qEndNz42bw0
Here's a thread on the fedi about this:
https://merveilles.town/@bd/110996467642216936
So, basically, the stacks would be circular, underflow would just mean
that the pointer is at 255, and overflow would loop it back around.
The programmer would need to write their own checks, I've been doing
some experiments but roughly, it looks like:
@on-reset ( -> )
#01 DUP
#02 .System/wst DEI NEQ ?{ ( error handler ) }
BRK
I'm still just trying to understand how this would impact it all, but
I'd like to know what you all think, it sounds a bit crazy at first
but think about it and let me know what you think.
Have a good one :)
Dll
> I wanted to ask you all something, that is probably going to sound> nuts. The way the VM is currently implemented(in most cases), is for> it to check for stack overflow and underflow during each opcode.> > But what if, we didn't.
I like this, even if just for the heck of it... :-)
It removes needless error checking. Also drop div-by-zero (just return
0), and you can use the catch vector for more interesting things.
Also we finally have a true 1-byte NOP (POPk). There are infinite ways of
screwing up on a low-level, untyped target machine, so error checks look
sort of arbitrary and artificial anyway.
felix
hell yea, that's the spirit.
Do you think there's any use for circular stacks, like some sort of
utility would could make from it? Looking it up there seems to be a
couple of things for caching. I can imagine pushing to the
near-infinite stack when streaming bytes from a file, and having a
kind of cache to grab things from the past. Any other idea?
On 9/2/23, felix.winkelmann@bevuta.com <felix.winkelmann@bevuta.com> wrote:
>> I wanted to ask you all something, that is probably going to sound>> nuts. The way the VM is currently implemented(in most cases), is for>> it to check for stack overflow and underflow during each opcode.>>>> But what if, we didn't.>> I like this, even if just for the heck of it... :-)>> It removes needless error checking. Also drop div-by-zero (just return> 0), and you can use the catch vector for more interesting things.> Also we finally have a true 1-byte NOP (POPk). There are infinite ways of> screwing up on a low-level, untyped target machine, so error checks look> sort of arbitrary and artificial anyway.>>> felix>>
On Sat, 2023-09-02 at 12:49 -0700, Devine Lu Linvega wrote:
> Do you think there's any use for circular stacks, like some sort of> utility would could make from it?
GA144 (144 forth-y cores) uses circular stacks but they are much
shorter than in UXN.
What for?
E.g.: Using a circular stack to infinitely replay a pattern to send to
a neighbour.
Opinion part:
I do not like dropping stack and div/0 checks in UXN.
Fact:
UXN is your toy.
Have fun!
I think circular stacks would be good. I am not sure if they are
specifically "useful" (it's easy enough to use a ring buffer of
whatever size you want in memory if that's your thing) but I also
can't imagine a correct program getting anywhere close to using 256
bytes of wst or rst so it's a moot point either way. Getting rid of
unnecessary checks to speed up the stack is all to the good.
I don't like the idea of getting rid of div-by-zero errors as much
because these do signal program errors that might otherwise be hard to
catch. Also, I don't think checks on division are very expensive
relative to the cost of division itself (unlike overflow/underflow
checks).
If we have to give up division by zero I'd prefer to return a weirder
value (such as #ff / #ffff) which would be more obviously wrong
(since they are impossible to get unless they are the numerator and
the denominator is 1).
Finally, I agree with Felix that it would be very cool to finally have
a true NOP in POPk!
-- d_m
On Sat, Sep 02, 2023 at 12:49:19PM -0700, Devine Lu Linvega wrote:
> hell yea, that's the spirit.> > Do you think there's any use for circular stacks, like some sort of> utility would could make from it? Looking it up there seems to be a> couple of things for caching. I can imagine pushing to the> near-infinite stack when streaming bytes from a file, and having a> kind of cache to grab things from the past. Any other idea?> > On 9/2/23, felix.winkelmann@bevuta.com <felix.winkelmann@bevuta.com> wrote:> >> I wanted to ask you all something, that is probably going to sound> >> nuts. The way the VM is currently implemented(in most cases), is for> >> it to check for stack overflow and underflow during each opcode.> >>> >> But what if, we didn't.> >> > I like this, even if just for the heck of it... :-)> >> > It removes needless error checking. Also drop div-by-zero (just return> > 0), and you can use the catch vector for more interesting things.> > Also we finally have a true 1-byte NOP (POPk). There are infinite ways of> > screwing up on a low-level, untyped target machine, so error checks look> > sort of arbitrary and artificial anyway.> >> >> > felix> >> >
Hi!
BD here, I feel I should clarify my position a bit more, and email gives
me a lot more characters to work with than Mastodon does.
First some context: I've been working on improving performance of the
UXN core, particularly on lower powered systems. To that end I wrote
a purely ARMv7tdmi version in ARM assembly [1] and another one in
C [2], trying to port my initial ARM core ideas. Both of these
implementations are very new and may be riddled with bugs, but seems
that so far they are working fine with a few test roms and more complex
programs (like noodle & oquonie). The C core was written in a couple of
afternoons so more testing is needed. Also, disregard the ugly code,
I used a whole lot of macros and generous amount of copy-pasting and
search and replace to have the core done quickly.
To get some performance numbers I ran the fib.rom through several core
implementations on the GBA, as it allows me to pretty accurately measure
the number of cycles and I'm quite familiar with the system (plus, this
is specially the kind of device we may want that extra performance).
First, the reference implementation [3], which was a bit of a regression
from the previous uxn-fast core we used to have, but it's pretty much
the same included in the SDL and x11 emulators (at least conceptually).
This resulted in an execution time of 290271842 cycles and I will
consider it the baseline for the rest of the core comparisons (1.0x).
The ARM assembly implementation run the same evaluation in 101461703
cycles (2.86x faster). Conversely, the equivalent fast C core results in
158904434 cyles (1.82x faster than the reference implementation but
0.64x slower than the assembly code). Both of these implementation
disregard stack bound checking, but what it the effect of a couple of
underflow/overflow checks such as:
```
if (w < wst) { return; }
if (w >= wst + 256) { return; }
```
On the tight eval loop?
If we add that to the fast C core keeping the rest of the things the
same, we measure 173294820 cycles (0.91x from the non-boundcheck version),
in other words, the performance degrades by 10% on an already decently
fast implementation.
Now, I'm not a world-class performance expert and I'm definitely not
claiming my code or implementations are as optimal as they can be, but
this seems like significant performance loss and it can make the
difference in more demanding roms.
Also, note that this performance may not directly translate into the FPS
of a program, after all, graphics and drawing routines add a significant
amount of strain to limited processors, so even the eval loop may be
quite fast, if drawing is slow it will drag everything down.
This doesn't mean that we have to get rid of bound-checking all
together, however (and this is also a consideration on screen devices
that try to write outside of the screen boundary) we could also just
disable these bound checks on "release" mode. A proper error message
like ("crash because overflow on wst") can be very helpful when
developing, but in my eyes the user doesn't care if an application
crashes with a segfault or with a "error: we made a whoopsie" message.
If we are not able to recover from a wrong program state, this kind of
stack checking serve no purpose. More static validation is certainly
preferable to crashing however!
With regards to circular stacks, I like the idea, but bear in mind this
can cause all sorts of issues as well, we need to know what we are
dealing with here. For example, if we can consider a program that wraps
around valid, it may be more difficult to find a fault in the logic of
our program. If this is by design and can be another tool to play around
with, I say go for it and have fun, maybe there are some interesting
things one can do with these! However, if the goal is to prevent
crashing, but we are still leaving the program in an invalid state (for
example by overriding the last stack value with an underflowed stack),
then I think crashing is a better option.
For me, division by zero should always be a crash unless you can return
some form of NaN or infinite value or something. It is invalid behaviour
and can be impossible to detect via static analysis. In this case,
however there may be programs that are able to recover from a that state
so a proper check here may be fine, and it definitely is insignificant
compared to the cost of a division (at least on the GBA, since it lacks
hardware div).
Best,
Bad Diode
[1]: https://git.badd10de.dev/uxngba/tree/src/uxn-core.s
[2]: https://git.sr.ht/~rabbits/uxn-playdate/tree/main/item/src/uxn-core.c
[3]: https://git.badd10de.dev/uxngba/tree/src/uxn.c?id=4305d585d87c6bbc197b0fe2d9f144d6d0b772e2
Hi bd!
Thanks for clarifying and counting cycles, that paints a much clearer picture :)
I've spent most of yesterday playing with this and being to handle
testing code at the uxntal level is very nice, having to flag the
emulator with debug wouldn't be as practical. Oquonie already has a
debug/release file setup, and so it was easy to add stack depth
checking to my vectors.
At runtime, I found that at the root level of each vector, DEI2 both
wst/rst and comparing against #0101 was really nice, what I do is
print the stack(#010e DEO) and trigger a crash(#010f DEO) on
imbalance. This replicates the earlier behavior somewhat, and finding
where the imbalance occurred wasn't to hard to pinpoint that way.
I'm thinking that I might write a guide on how to use uxnbal that
shows how to flag imbalance at compiletime. It'll be needed.
Personally, div by zero returning 0 would be beneficial for me, since
in nearly every case that I could find in my code, it occurs when I
iterate over values and shift them, sometimes they become zero, and I
have to add this extra code to jump over the div and return 0. But if
everyone feels that div by zero should be more obvious, it could do
some other thing. But I'm thinking maybe it should be handled by the
program.
The idea of circular stack is already pushing for a more careful
handling of exceptions by the programmer, this whole idea has been
helping elucidate some incoherence in uxn's design, where being
interpreted, should try to be as lean as it could possibly be made,
and I found that removing that stuff really aligns with the original
design.
Thanks for all your feedback on this :)
-- Bathtub Catapult Airlines
On 9/3/23, Bad Diode <bd@badd10de.dev> wrote:
> Hi!>> BD here, I feel I should clarify my position a bit more, and email gives> me a lot more characters to work with than Mastodon does.>> First some context: I've been working on improving performance of the> UXN core, particularly on lower powered systems. To that end I wrote> a purely ARMv7tdmi version in ARM assembly [1] and another one in> C [2], trying to port my initial ARM core ideas. Both of these> implementations are very new and may be riddled with bugs, but seems> that so far they are working fine with a few test roms and more complex> programs (like noodle & oquonie). The C core was written in a couple of> afternoons so more testing is needed. Also, disregard the ugly code,> I used a whole lot of macros and generous amount of copy-pasting and> search and replace to have the core done quickly.>> To get some performance numbers I ran the fib.rom through several core> implementations on the GBA, as it allows me to pretty accurately measure> the number of cycles and I'm quite familiar with the system (plus, this> is specially the kind of device we may want that extra performance).> First, the reference implementation [3], which was a bit of a regression> from the previous uxn-fast core we used to have, but it's pretty much> the same included in the SDL and x11 emulators (at least conceptually).> This resulted in an execution time of 290271842 cycles and I will> consider it the baseline for the rest of the core comparisons (1.0x).>> The ARM assembly implementation run the same evaluation in 101461703> cycles (2.86x faster). Conversely, the equivalent fast C core results in> 158904434 cyles (1.82x faster than the reference implementation but> 0.64x slower than the assembly code). Both of these implementation> disregard stack bound checking, but what it the effect of a couple of> underflow/overflow checks such as:>> ```> if (w < wst) { return; }> if (w >= wst + 256) { return; }> ```>> On the tight eval loop?>> If we add that to the fast C core keeping the rest of the things the> same, we measure 173294820 cycles (0.91x from the non-boundcheck version),> in other words, the performance degrades by 10% on an already decently> fast implementation.>> Now, I'm not a world-class performance expert and I'm definitely not> claiming my code or implementations are as optimal as they can be, but> this seems like significant performance loss and it can make the> difference in more demanding roms.>> Also, note that this performance may not directly translate into the FPS> of a program, after all, graphics and drawing routines add a significant> amount of strain to limited processors, so even the eval loop may be> quite fast, if drawing is slow it will drag everything down.>> This doesn't mean that we have to get rid of bound-checking all> together, however (and this is also a consideration on screen devices> that try to write outside of the screen boundary) we could also just> disable these bound checks on "release" mode. A proper error message> like ("crash because overflow on wst") can be very helpful when> developing, but in my eyes the user doesn't care if an application> crashes with a segfault or with a "error: we made a whoopsie" message.> If we are not able to recover from a wrong program state, this kind of> stack checking serve no purpose. More static validation is certainly> preferable to crashing however!>> With regards to circular stacks, I like the idea, but bear in mind this> can cause all sorts of issues as well, we need to know what we are> dealing with here. For example, if we can consider a program that wraps> around valid, it may be more difficult to find a fault in the logic of> our program. If this is by design and can be another tool to play around> with, I say go for it and have fun, maybe there are some interesting> things one can do with these! However, if the goal is to prevent> crashing, but we are still leaving the program in an invalid state (for> example by overriding the last stack value with an underflowed stack),> then I think crashing is a better option.>> For me, division by zero should always be a crash unless you can return> some form of NaN or infinite value or something. It is invalid behaviour> and can be impossible to detect via static analysis. In this case,> however there may be programs that are able to recover from a that state> so a proper check here may be fine, and it definitely is insignificant> compared to the cost of a division (at least on the GBA, since it lacks> hardware div).>> Best,>> Bad Diode>> [1]: https://git.badd10de.dev/uxngba/tree/src/uxn-core.s> [2]: https://git.sr.ht/~rabbits/uxn-playdate/tree/main/item/src/uxn-core.c> [3]:> https://git.badd10de.dev/uxngba/tree/src/uxn.c?id=4305d585d87c6bbc197b0fe2d9f144d6d0b772e2>
Can arity checkers detect infinite recursion? My intuition is that this is the halting problem, and therefore impossible. So removing bounds checks on stack ops leaves the door open for stack corruption and therefore very odd and unpredictable behaviour in such cases.
In my limited experience writing uxntal I found the stack overflow and underflow the most reliable signal that something went wrong.
I’m also not convinced that bounds checking on stack operations is expensive enough to matter on modern machines. I imagine (based on very limited knowledge) that most CPUs would pipeline such check such that they essentially become free. But measurements would be required to say for sure. All that is to say that the efficiency argument doesn’t really stir me.
Sorry to be negative about this fun idea! I really love and embrace the philosophy of eliminating the possibility of errors wherever possible. I’m just not sure about this.
nf
On Sun, Sep 3, 2023, at 02:28, Devine Lu Linvega wrote:
> Hi everyone,>> I wanted to ask you all something, that is probably going to sound> nuts. The way the VM is currently implemented(in most cases), is for> it to check for stack overflow and underflow during each opcode.>> But what if, we didn't.>> As uxntal arity checking tools are getting better, and since we have> the system/wst and system/rst ports to check for stack depth when and> where we need to. Do we really need to check on opcode? Maybe not. I> kind of like the idea of a uxn core that cannot error, error handling> is kind of mess and I have yet to see someone other than me use it.> But not only that, it's slow as hell, and since this is destined to be> interpreted, maybe this is not the best use of cycles.>> Here's a conversation someone sent me about this:> https://groups.google.com/g/comp.lang.forth/c/qEndNz42bw0>> Here's a thread on the fedi about this:> https://merveilles.town/@bd/110996467642216936>> So, basically, the stacks would be circular, underflow would just mean> that the pointer is at 255, and overflow would loop it back around.> The programmer would need to write their own checks, I've been doing> some experiments but roughly, it looks like:>> @on-reset ( -> )> #01 DUP> #02 .System/wst DEI NEQ ?{ ( error handler ) }> BRK>> I'm still just trying to understand how this would impact it all, but> I'd like to know what you all think, it sounds a bit crazy at first> but think about it and let me know what you think.>> Have a good one :)> Dll
I know that I'm quite late to the discussion, but I also wondered about the performance gains. A circular stack would still need a MOD (or an & bitmask), which wouldn't be that much better than the (theoretical) 4 instructions needed to check for an overflow or underflow.
Especially, I don't think that there are instructions that can push or pop more than 3 things at a time? you could probably calculate the amount of instructions you can run for without the possibility of an overflow.
not to mention, you can probably just get away with padding the stacks on both the top and bottom, and then only checking for an overflow (e.g.) every 8 clock cycles or so.
sorry if this doesn't make a lot of sense, I am deprived of coffee while writing :3
phyto
On 3 September 2023 20:30:18 UTC, nf <nf@wh3rd.net> wrote:
>Can arity checkers detect infinite recursion? My intuition is that this is the halting problem, and therefore impossible. So removing bounds checks on stack ops leaves the door open for stack corruption and therefore very odd and unpredictable behaviour in such cases. >>In my limited experience writing uxntal I found the stack overflow and underflow the most reliable signal that something went wrong. >>I’m also not convinced that bounds checking on stack operations is expensive enough to matter on modern machines. I imagine (based on very limited knowledge) that most CPUs would pipeline such check such that they essentially become free. But measurements would be required to say for sure. All that is to say that the efficiency argument doesn’t really stir me. >>Sorry to be negative about this fun idea! I really love and embrace the philosophy of eliminating the possibility of errors wherever possible. I’m just not sure about this. >>nf >>On Sun, Sep 3, 2023, at 02:28, Devine Lu Linvega wrote:>> Hi everyone,>>>> I wanted to ask you all something, that is probably going to sound>> nuts. The way the VM is currently implemented(in most cases), is for>> it to check for stack overflow and underflow during each opcode.>>>> But what if, we didn't.>>>> As uxntal arity checking tools are getting better, and since we have>> the system/wst and system/rst ports to check for stack depth when and>> where we need to. Do we really need to check on opcode? Maybe not. I>> kind of like the idea of a uxn core that cannot error, error handling>> is kind of mess and I have yet to see someone other than me use it.>> But not only that, it's slow as hell, and since this is destined to be>> interpreted, maybe this is not the best use of cycles.>>>> Here's a conversation someone sent me about this:>> https://groups.google.com/g/comp.lang.forth/c/qEndNz42bw0>>>> Here's a thread on the fedi about this:>> https://merveilles.town/@bd/110996467642216936>>>> So, basically, the stacks would be circular, underflow would just mean>> that the pointer is at 255, and overflow would loop it back around.>> The programmer would need to write their own checks, I've been doing>> some experiments but roughly, it looks like:>>>> @on-reset ( -> )>> #01 DUP>> #02 .System/wst DEI NEQ ?{ ( error handler ) }>> BRK>>>> I'm still just trying to understand how this would impact it all, but>> I'd like to know what you all think, it sounds a bit crazy at first>> but think about it and let me know what you think.>>>> Have a good one :)>> Dll
On 2023-09-14 20:51, phyto wrote:
>I know that I'm quite late to the discussion, but I also wondered about>the performance gains. A circular stack would still need a MOD (or an>& bitmask), which wouldn't be that much better than the (theoretical)>4 instructions needed to check for an overflow or underflow.
It doesn't have to add any instructions if you just take into account
the u8 wraparound and index with that, since the stacks are 256 bytes
long:
```
#include<stdio.h>
unsigned char wst[256];
int
main() {
for (int i = 0; i < 256; i++) {
wst[i] = i;
}
unsigned char idx = 0;
printf("%02x\n", wst[idx]);
idx--;
printf("%02x\n", wst[idx]);
printf("%02x\n", wst[255]);
return 0;
}
```
This should print out:
```
00
ff
ff
```
--
Bad Diode