Hi!
This cast does very strange things that I do not understand:
let evil = (("a": (int | str)): int);
The variable "evil" now behaves very strangely.
// All equality tests seem to pass.
assert(evil == 2);
assert(evil == 99);
// This reads "evil" as a different number sometimes, even though it
// shouldn't have changed. I've had it output both 0 and 32766
// in the same program invocation for example.
fmt::println(evil)!;
fmt::println(evil)!;
I wonder what's causing this. I guess something strange is happening
with the generated assembly code.
$ harec -v
harec 0.24.2-rc2-108-gbbdf6bd
$ hare version
hare 0.24.2-rc1-152-g2459709e
QBE commit:
327736b3a
I also didn't know that a cast could be done from a tagged union type to
a member type, as opposed to using a type assertion. I assume that's
essentially the "unsafe" version that doesn't check before casting?
On Fri Jan 3, 2025 at 5:15 PM EST, Luna wrote:
> Hi!
>
> This cast does very strange things that I do not understand:
>
> let evil = (("a": (int | str)): int);
>
> The variable "evil" now behaves very strangely.
>
> // All equality tests seem to pass.
> assert(evil == 2);
> assert(evil == 99);
>
> // This reads "evil" as a different number sometimes, even though it
> // shouldn't have changed. I've had it output both 0 and 32766
> // in the same program invocation for example.
> fmt::println(evil)!;
> fmt::println(evil)!;
I don't know exactly what's causing this strange behavior, but it isn't
a bug because this is UB, since you're reading padding bytes.
Assuming align(int) == 4 && align(str) == 8, the representation of
(int | str) is:
union {
	struct {
		tag: u32,
		val: int,
	} int_,
	struct {
		tag: u32,
		// 4 bytes of padding
		val: str,
	} str_,
}
So when casting `"a": (int | str)` to int, it grabs the padding bytes in
between the tag and the str value. The representation of the padding
bytes is indeterminate, and not guaranteed to be consistent.
It's still interesting that the behavior is so strange though; I
wouldn't have expected that.
> I also didn't know that a cast could be done from a tagged union type to
> a member type, as opposed to using a type assertion. I assume that's
> essentially the "unsafe" version that doesn't check before casting?
Yup, exactly
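For illustration (just a sketch, the variable names are made up):

let x: (int | str) = 42;
// `as` checks the tag at runtime and aborts if x doesn't actually
// hold an int.
let checked = x as int;
// `:` just reinterprets the storage without looking at the tag; fine
// here, but UB if x held a str instead.
let unchecked = x: int;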
> union {
> 	struct {
> 		tag: u32,
> 		val: int,
> 	} int_,
> 	struct {
> 		tag: u32,
> 		// 4 bytes of padding
> 		val: str,
> 	} str_,
> }
Uh, pretend the syntax of that is correct and not a weird mix of Hare
and C syntax lol
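Something closer to actual Hare syntax would be (the type name is made
up, and the exact layout is an implementation detail of harec):

type int_or_str = union {
	int_: struct {
		tag: u32,
		val: int,
	},
	str_: struct {
		tag: u32,
		// 4 bytes of padding here
		val: str,
	},
};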
On Fri Jan 3, 2025 at 5:33 PM EST, Sebastian wrote:
> I don't know exactly what's causing this strange behavior, but it isn't
> a bug because this is UB, since you're reading padding bytes.
Don't worry, I wasn't planning on doing that cast in real code (well,
not intentionally).
Here's how I came across this behavior; it felt a bit unintuitive.
I've a few times come across situations where I can yield data of
multiple types out of a block, and hare doesn't do a good job at
inferring the type of the block. For example:
const a = {
	yield if (condition) (2, true) else ("a", false);
};
Hare decides that "a" should be of the type
((int, bool) | (str, bool)), but I want it to be
((int | str), bool). I can do that by saying what type "a" should be:
const a: ((int | str), bool) = {
	yield if (condition) (2, true) else ("a", false);
};
but in some cases that can make the first line a bit long. I found I
could also do this:
const a = {
	yield if (condition) (2, true) else ("a", false);
}: ((int | str), bool);
which as I understand it allows the block to coerce things yielded from
it into the casted type, instead of casting them after yielding. So it
effectively does the same thing.
Here's another one that gets weird:
{
	yield if (true) ("a", true) else (2, false);
}: ((int | str), bool)
I think this still works as intended. The yielded tuple can be coerced
into the right type and this code can be valid. However, what about
this:
{
	const tuple = ("a", true);
	yield if (true) tuple else (2, false);
}: ((int | str), bool)
I think that this one is UB from my testing, even though it looks so
similar to the previous one. I understand that hare can't figure out how
to do the same type coercion because the tuple is being yielded as a
variable here instead of a literal. I can't even figure out from looking
at the code what it's doing to the types to end up where it does, though.
I'm starting to think it's bad practice to do casting of blocks like
this instead of having the type coercion happen implicitly, because the
compiler doesn't tell you what's wrong if you accidentally do a weird
cast of a tagged union. But also the way casting and type inference
works seems to get really weird and confusing in more complex cases.
> > Hi!
> >
> > This cast does very strange things that I do not understand:
> >
> > let evil = (("a": (int | str)): int);
> >
> > The variable "evil" now behaves very strangely.
> >
> > // All equality tests seem to pass.
> > assert(evil == 2);
> > assert(evil == 99);
> >
> > // This reads "evil" as a different number sometimes, even though it
> > // shouldn't have changed. I've had it output both 0 and 32766
> > // in the same program invocation for example.
> > fmt::println(evil)!;
> > fmt::println(evil)!;
>
> I don't know exactly what's causing this strange behavior, but it isn't
> a bug because this is UB, since you're reading padding bytes.
For the curious: this happens because QBE correctly notices the variable
is set from uninitialized memory in the memopt pass, then copy
elimination propagates the undefined value to assert's comparison and
constant folding just converts all operations on undefined values
(including jumps) to nops. All of this can be seen by passing -dMCF to
QBE and observing changes between intermediate versions of the IR.
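If you want to reproduce that yourself, the rough workflow is to emit
the IL with harec and feed it to qbe with those debug flags, something
like this (file names made up; the exact invocation may differ):

$ harec -o evil.ssa evil.ha
$ qbe -dMCF evil.ssa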
Hi,
You found some interesting cases! This really helps a lot with improving
things in the compiler.
> I've a few times come across situations where I can yield data of
> multiple types out of a block, and hare doesn't do a good job at
> inferring the type of the block. For example:
>
> const a = {
> 	yield if (condition) (2, true) else ("a", false);
> };
>
> Hare decides that "a" should be of the type
> ((int, bool) | (str, bool)), but I want it to be
> ((int | str), bool). I can do that by saying what type "a" should be:
>
> const a: ((int | str), bool) = {
> 	yield if (condition) (2, true) else ("a", false);
> };
>
> but in some cases that can make the first line a bit long. I found I
> could also do this:
>
> const a = {
> 	yield if (condition) (2, true) else ("a", false);
> }: ((int | str), bool);
>
> which as I understand it allows the block to coerce things yielded from
> it into the casted type, instead of casting them after yielding. So it
> effectively does the same thing.
Not always exactly the same thing. By specifying the type in the binding, you
declare the variable to be of that type, and providing an initializer
expression that is not *assignable* to that type is a compile error. By
providing the type after the block, the type of the variable is
determined via inference, and the block is casted to the specified
type, producing a compile error only if the block isn't *castable* to that
type. Rules for castability are less strict, and casts may be invalid
and produce UB (as your initial post shows). So specifying the type of
the binding is more restrictive, but also safer.
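To make the difference concrete with the types from your first mail
(variable names are just placeholders):

// Compile error: (int | str) is not assignable to int.
// const a: int = { yield "a": (int | str); };

// Compiles, because (int | str) is castable to int, but reading the
// result is UB (your "evil" example).
const b = { yield "a": (int | str); }: int;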
The apparent syntactic similarity between the two is a bit unfortunate,
and we're likely to change that at some point. Ideally we'd also have a
way to do the less restrictive kind of thing but limited only to casts
that can be done safely.
> Here's another one that gets weird:
>
> {
> 	yield if (true) ("a", true) else (2, false);
> }: ((int | str), bool)
>
> I think this still works as intended. The yielded tuple can be coerced
> into the right type and this code can be valid. However, what about
> this:
>
> {
> 	const tuple = ("a", true);
> 	yield if (true) tuple else (2, false);
> }: ((int | str), bool)
>
> I think that this one is UB from my testing, even though it looks so
> similar to the previous one. I understand that hare can't figure out how
> to do the same type coercion because the tuple is being yielded as a
> variable here instead of a literal. I can't even figure out from looking
> at the code what it's doing to the types to end up where it does, though.
Changing the cast to a binding with a set type (let a: ((int | str),
bool) = ...) explains the situation a bit. The error I get by doing so
is:
"Initializer of type ((str, bool) | ((int | str), bool)) is not
assignable to binding type ((int | str), bool)"
So, what really happens is that the compiler knows that the block is
expected to be of type ((int|str), bool), and passes that expectation to
yield, which passes that on to if, which passes that to both branches.
But the type of the `tuple` variable is set in stone once the variable
is created, and it is (str, bool). The type of (2, false) allows for
more flexibility and is successfully set to ((int | str), bool). Then the
if expression merges the types of its branches, and this unfortunately
results in ((str, bool) | ((int | str), bool)). This is also the result
type of the block. And it is then casted to ((int | str), bool). That
cast is invalid, because the actual value returned from the block is the
one from the first branch, of type (str, bool). And we get UB, like
you said.
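One way to sidestep this with the current compiler is to give `tuple`
the intended type up front, so both branches of the if end up with the
same type (a sketch, not thoroughly tested):

{
	const tuple: ((int | str), bool) = ("a", true);
	yield if (true) tuple else (2, false);
}: ((int | str), bool)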
I had a hard time figuring this all out, even though I've written a
significant portion of the compiler code that is involved here, which is
saying something about the complexity involved in resolving this case.
The worst offender and the most "magic" thing in the above
procedure is the automatic promotion of (2, false) into ((int | str),
bool) imo. I've long felt this is a place where the compiler tries to be
too "smart". But until now, noone actually hit that in real code, at
least to my knowledge. So yeah I think this case really shows that
should not happen on its own.
Another part of improving all of this will also be making casts behave
better, so that even if the type inference screws up, you don't silently
end up with UB.
> I'm starting to think it's bad practice to do casting of blocks like
> this instead of having the type coercion happen implicitly, because the
> compiler doesn't tell you what's wrong if you accidentally do a weird
> cast of a tagged union. But also the way casting and type inference
> works seems to get really weird and confusing in more complex cases.
Yeah for now I suggest avoiding such casts and preferring bindings with
explicit type instead. And when in doubt about the type of some
expression, try splitting it into a couple of parts and binding each of
them to a variable with an explicit type, to see where the disagreement
between your understanding and the compiler's inference lies.
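For example, something like this (a sketch; `condition` stands for
whatever bool expression you actually have):

// If the compiler disagrees with you about the type of the if
// expression, the error now points directly at this binding.
const pair: ((int | str), bool) =
	if (condition) ("a", true) else (2, false);
const a = pair;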
Thanks for the insight on the way this all works, it's interesting!
On Fri Jan 3, 2025 at 10:58 PM EST, Bor Grošelj Simić wrote:
> The apparent syntactic similarity between the two is a bit unfortunate,
> and we're likely to change that at some point. Ideally we'd also have a
> way to do the less restrictive kind of thing but limited only to casts
> that can be done safely.
I feel like the problem is that the cast operation does a lot more in Hare
than it does in other languages, where often the scariest thing you
can do with a cast is truncating an integer or doing an obviously unsafe
pointer cast. Hare seems to allow a lot more things to be done with the
cast operator, including some things that rarely make any sense.
Also, why can't more unusual casts like ones involving tagged unions be
exclusive to the "as" operator? That's the operator I choose to use for
things like a tagged union cast in almost every case anyways.
> The worst offender and the most "magic" thing in the above
> procedure is the automatic promotion of (2, false) into ((int | str),
> bool) imo. I've long felt this is a place where the compiler tries to be
> too "smart". But until now, no one actually hit that in real code, at
> least to my knowledge. So yeah, I think this case really shows that this
> should not happen on its own.
I agree, anything magical enough that you can rely on it without easily
being able to fully understand the rules it's based on is probably
flawed by design. I think the compiler should have some ability to do
the sorts of type inference with tagged unions and everything that it's
doing now, because otherwise hare's expression-oriented design would
become painful to work with, but the rules it uses should be simple
enough that a sane human can read any code and follow the process the
compiler would use to apply those rules.
> > The apparent syntactic similarity between the two is a bit unfortunate,
> > and we're likely to change that at some point. Ideally we'd also have a
> > way to do the less restrictive kind of thing but limited only to casts
> > that can be done safely.
>
> I feel like the problem is that the cast operation does a lot more in Hare
> than it does in other languages, where often the scariest thing you
> can do with a cast is truncating an integer or doing an obviously unsafe
> pointer cast. Hare seems to allow a lot more things to be done with the
> cast operator, including some things that rarely make any sense.
My thinking is probably severely affected by the amount of Hare code I've
written and read over the years, but I tend to think of casts as bridges
over the gaps the type system is unable to smooth out on its own, with
the programmer manually telling the compiler how to interpret pieces of
memory. The fact that this is limited to pointer casts in a lot of
languages is mostly an implementation detail. It can throw off
newcomers with experience from those languages though.
> Also, why can't more unusual casts like ones involving tagged unions be
> exclusive to the "as" operator? That's the operator I choose to use for
> things like a tagged union cast in almost every case anyways.
The reason is that very often, the programmer knows the correct type to
use, and the compiler doesn't, for example in this
only-slightly-contrived case:
fn (var: (void | int)) int = {
	if (var is void) {
		return 0;
	} else {
		return var: int;
	};
};
Using `as` instead of `:` in the cast here would result in redundant,
unreachable code (the `as` keyword ensures execution terminates if the
expectation is not met), because clearly `var` must be holding an int if
we reached the cast, but the compiler can't know that. In this
particular case it may be possible for the compiler to figure that out,
but like, in full generality, that's impossible. But yeah, use of `:` is
unsafe in most cases (even with integers!). As mentioned earlier, we are
considering changing some things around casts to make them harder to
misuse and to make their semantic meaning clearer in some cases, but the
option to use plain `:` (or some other semantically equivalent thing)
will remain for most kinds of casts.
> > The worst offender and the most "magic" thing in the above
> > procedure is the automatic promotion of (2, false) into ((int | str),
> > bool) imo. I've long felt this is a place where the compiler tries to be
> > too "smart". But until now, no one actually hit that in real code, at
> > least to my knowledge. So yeah, I think this case really shows that this
> > should not happen on its own.
This part of my analysis was actually a bit incorrect. That automatic
promotion still feels odd to me, but the thing that went the most wrong
is that the compiler allows one branch of the if to take the ((int |
str), bool) type hint and the other one to ignore it.
To fix this, I proposed this patch:
https://lists.sr.ht/~sircmpwn/hare-dev/patches/56767
In short, your example would result in a compile error with the new
behavior, because the true branch of the if-else expression is not
assignable to the type hint given by the block cast.
> I agree, anything magical enough that you can rely on it without easily
> being able to fully understand the rules it's based on is probably
> flawed by design. I think the compiler should have some ability to do
> the sorts of type inference with tagged unions and everything that it's
> doing now, because otherwise hare's expression-oriented design would
> become painful to work with, but the rules it uses should be simple
> enough that a sane human can read any code and follow the process the
> compiler would use to apply those rules.
> Using `as` instead of `:` in the cast here would result in redundant,
> unreachable code (the `as` keyword ensures execution terminates if the
> expectation is not met), because clearly `var` must be holding an int if
> we reached the cast, but the compiler can't know that. In this
> particular case it may be possible for the compiler to figure that out,
> but like, in full generality, that's impossible. But yeah, use of `:` is
> unsafe in most cases (even with integers!). As mentioned earlier, we are
> considering changing some things around casts to make them harder to
> misuse and to make their semantic meaning clearer in some cases, but the
> option to use plain `:` (or some other semantically equivalent thing)
> will remain for most kinds of casts.
Why not accept that as a trade-off for safer code? I think that's what's
done with array indexing, where runtime bounds checking is done for
safety reasons even though it's slower. You could even let the compiler
turn off the runtime type check in production code if performance is an
issue.
> fn (var: (void | int)) int = {
> 	if (var is void) {
> 		return 0;
> 	} else {
> 		return var: int;
> 	};
> };
Taking this example, when I have written similar code in hare I have
written `return var as int`. Part of that is that I didn't realize you
could do that cast with a ":", but more than that I don't trust myself
not to make a mistake or an invalid assumption about what the cast will
do in all situations. If the compiler checks that for me, even just at
runtime in a development build, I can catch a lot of nasty bugs before
they have a chance to happen.
> > Using `as` instead of `:` in the cast here would result in redundant,
> > unreachable code (the `as` keyword ensures execution terminates if the
> > expectation is not met), because clearly `var` must be holding an int if
> > we reached the cast, but the compiler can't know that. In this
> > particular case it may be possible for the compiler to figure that out,
> > but like, in full generality, that's impossible. But yeah, use of `:` is
> > unsafe in most cases (even with integers!). As mentioned earlier, we are
> > considering changing some things around casts to make them harder to
> > misuse and to make their semantic meaning clearer in some cases, but the
> > option to use plain `:` (or some other semantically equivalent thing)
> > will remain for most kinds of casts.
>
> Why not accept that as a trade-off for safer code? I think that's what's
> done with array indexing, where runtime bounds checking is done for
> safety reasons even though it's slower. You could even let the compiler
> turn off the runtime type check in production code if performance is an
> issue.
We do something very similar with indexing. It is possible, but generally
ill-advised, to cast arrays/slices to unbounded arrays, which do not
perform bounds checks on indexing/slicing. The difference is that with
indexing, disabling the checks is syntactically much heavier and you
already get discouraged by that, while for casts we haven't yet found a
good way to make the difference in safety nicely visible in the syntax.
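For reference, the unchecked-indexing escape hatch looks roughly like
this (a sketch):

let buf: [4]int = [1, 2, 3, 4];
// Bounds-checked: aborts at runtime if the index is out of range.
let a = buf[3];
// Casting to a pointer to an unbounded array drops the check; the cast
// at least makes the unsafety somewhat visible in the syntax.
let p = &buf: *[*]int;
let b = p[3];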
> > fn (var: (void | int)) int = {
> > 	if (var is void) {
> > 		return 0;
> > 	} else {
> > 		return var: int;
> > 	};
> > };
>
> Taking this example, when I have written similar code in hare I have
> written `return var as int`. Part of that is that I didn't realize you
> could do that cast with a ":", but more than that I don't trust myself
> not to make a mistake or an invalid assumption about what the cast will
> do in all situations. If the compiler checks that for me, even just at
> runtime in a development build, I can catch a lot of nasty bugs before
> they have a chance to happen.
Just like with checked indexing, writing it as `var as int` is 100% the
correct way to do this, unless you're absolutely sure disabling the
check has enough performance gain that you are willing to sacrifice
safety for that gain. Which definitely does happen, but is very rare.
In the stdlib (which has >70kloc) we use unsafe indexing in like 10
places at most, and most if not all of those are not even because of
performance, but because we interface with system APIs that were
designed for C. For unsafe tagged casts, I'm not sure what the number
is, but I doubt it's much more than that.
I considered a build mode that checks even these unsafe casts (and some
other things in a similar spirit) before. It'd be sort of like UBsan.
I'd like to have that at some point. Making the compiler itself emit the
checks is nearly trivial, but it's not obvious how to expose this
functionality (probably a new flag?). The runtime library would need
to be adjusted for this in some way. And then the build driver should
know about it and expose it to the user somehow (some kind of new
special build tag? I'm not a fan of new special build tags).