> Hold off on merging this until I send a spec update, I just realized right now I
> forgot to do that. Specifically, I imagine the spec will need to be more
> explicit about what kind of stuff is allowed in rune/string constants, and the
> behavior of \x in rune/string constants will need to be changed. Actually I
> should probably talk about that:
>
> So, some context: \u is kinda weird. It behaves differently depending on where
> it's used. In rune literals, it denotes the value the rune should hold, so
> '\u1234' is represented as 0x1234. But in string literals, it's converted to
> UTF-8. This is weird, but I feel like it's still correct.
I don't particularly like this. The name \u implies it has something to do with
unicode in all contexts. Perhaps a less weird alternative would be to have \u
behave identically in str and rune and instead allow arbitrary \x sequences up
to 4 bytes in length in runes, consistently with how \x allows one to disregard
the rules in string literals.
> \x is also weird. Prior to this commit, it behaved pretty much identically to
> \u: in rune literals, it was used literally as the value; in string literals, it
> was converted to UTF-8. Which is different from how \x behaves in other similar
> languages: C, Go, Rust, and Zig all have \x denote a literal byte, whereas \u
> denotes a Unicode codepoint. So I've changed \x to denote literal bytes in
> strings
This is a huge can of worms. Right now just about everything and everyone
assumes string literals are valid UTF-8. I have stated before that I do not
agree with the current situation and I'm happy you're opening this discussion,
but moving in this direction also amplifies some problems that I see around
Hare's current str type (and I used to like Hare's way of doing string types, I
even argued with Zig people about the awesomeness of our design at some point).
Right now `str` serves a double purpose, as a type for string literals and as a
type for constant byte sequences of valid UTF-8. We already sometimes employ
various hacks to get around the second intended purpose, but before this
proposed change at least the two purposes had a non-empty intersection because
the literals were supposed to be valid UTF-8.
Now we're explicitly adding a way to make the string literals disobey UTF-8
rules, making the two purposes independent of each other and that makes me
question this whole idea of having a single type doing both of those things.
So there are plenty of options for what to do: we could dedicate the str type
just to string literals and use []u8 for all byte data at runtime, no matter
whether it's valid UTF-8 or not; we could do it the other way around; or we
could have a separate builtin type for each of those things, I suppose. All of
those are quite disruptive, unfortunately. We could also do nothing. I
personally prefer the first one.
(Sorry, this got a bit longer than expected and also not necessarily immediately
relevant to merging or not merging this patch.)
Re: [PATCH harec 3/3] lex: disallow sign in hex literal
'\u aa' was also considered valid before this patch (strto* functions allow
whitespace before the number), so the commit message, comments and tests should
also reflect that.
Re: [PATCH harec 2/3] Improve handling of invalid UTF-8 in rune/str literals
On Tue Sep 19, 2023 at 5:32 PM EDT, Bor Grošelj Simić wrote:
> I don't particularly like this. The name \u implies it has something to do with
> unicode in all contexts. Perhaps a less weird alternative would be to have \u
> behave identically in str and rune and instead allow arbitrary \x sequences up
> to 4 bytes in length in runes, consistently with how \x allows one to disregard
> the rules in string literals.
\u does have to do with Unicode in all contexts though. It's just that
codepoints are stored differently in rune than in str.
> Right now `str` serves a double purpose, as a type for string literals and as a
> type for constant byte sequences of valid UTF-8. We already sometimes employ
> various hacks to get around the second intended purpose, but before this
> proposed change at least the two purposes had a non-empty intersection because
> the literals were supposed to be valid UTF-8.
I don't really understand the issue here? I don't think str really has a
double purpose: its purpose is to store UTF-8 encoded text.
> Now we're explicitly adding a way to make the string literals disobey UTF-8
> rules, making the two purposes independent of each other and that makes me
> question this whole idea of having a single type doing both of those things.
String literals still can't disobey UTF-8 rules. e.g. "\xaa" is still
invalid, since it doesn't form a valid UTF-8 codepoint. The only
difference is that \x now denotes bytes rather than codepoints, but the
compiler still enforces that those bytes form valid UTF-8.
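
To illustrate the distinction drawn above, here is a minimal sketch in C. The names utf8_encode and utf8_valid are hypothetical and are not taken from harec; this only shows the idea that \u yields a codepoint which is UTF-8 encoded in a string, while \x yields raw bytes that are accepted only if the result is well-formed UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode codepoint cp as UTF-8 into out (which must
 * hold at least 4 bytes). Returns the number of bytes written, or 0 for
 * surrogates and values above U+10FFFF. */
size_t
utf8_encode(uint32_t cp, uint8_t *out)
{
	assert(out != NULL);
	if (cp < 0x80) {
		out[0] = (uint8_t)cp;
		return 1;
	} else if (cp < 0x800) {
		out[0] = (uint8_t)(0xC0 | (cp >> 6));
		out[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			return 0; /* surrogates are not valid codepoints */
		}
		out[0] = (uint8_t)(0xE0 | (cp >> 12));
		out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		out[0] = (uint8_t)(0xF0 | (cp >> 18));
		out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (uint8_t)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* Hypothetical helper: check that the n bytes at s are well-formed UTF-8,
 * rejecting stray continuation bytes, truncated sequences, overlong
 * encodings, surrogates, and values above U+10FFFF. */
bool
utf8_valid(const uint8_t *s, size_t n)
{
	/* Smallest codepoint that legitimately needs a sequence of length i */
	static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i = 0;
	while (i < n) {
		uint8_t b = s[i];
		size_t len;
		uint32_t cp;
		if (b < 0x80) {
			len = 1; cp = b;
		} else if ((b & 0xE0) == 0xC0) {
			len = 2; cp = b & 0x1F;
		} else if ((b & 0xF0) == 0xE0) {
			len = 3; cp = b & 0x0F;
		} else if ((b & 0xF8) == 0xF0) {
			len = 4; cp = b & 0x07;
		} else {
			return false; /* continuation byte or invalid lead byte */
		}
		if (n - i < len) {
			return false; /* truncated sequence */
		}
		for (size_t j = 1; j < len; j++) {
			if ((s[i + j] & 0xC0) != 0x80) {
				return false;
			}
			cp = (cp << 6) | (uint32_t)(s[i + j] & 0x3F);
		}
		if (cp < min_cp[len] || cp > 0x10FFFF
				|| (cp >= 0xD800 && cp <= 0xDFFF)) {
			return false;
		}
		i += len;
	}
	return true;
}
```

Under this sketch, the single byte from "\xaa" fails validation because 0xAA is a continuation byte with no lead byte, whereas the two bytes from "\xc2\xaa" pass and are the same bytes that encoding U+00AA produces.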
Re: [PATCH harec 3/3] lex: disallow sign in hex literal
On Tue Sep 19, 2023 at 9:32 PM UTC, Bor Grošelj Simić wrote:
> '\u aa' was also considered valid before this patch (strto* functions allow
> whitespace before the number), so the commit message, comments and tests should
> also reflect that.
does the spec allow this? if so, it should be disallowed there as well
Re: [PATCH harec 2/3] Improve handling of invalid UTF-8 in rune/str literals
On Tue Sep 19, 2023 at 9:29 PM UTC, Sebastian wrote:
> String literals still can't disobey UTF-8 rules. e.g. "\xaa" is still
> invalid, since it doesn't form a valid UTF-8 codepoint. The only
> difference is that \x now denotes bytes rather than codepoints, but the
> compiler still enforces that those bytes form valid UTF-8.
+1 to doing this. if someone needs invalid utf8, they should use a u8
array literal instead
Re: [PATCH harec 3/3] lex: disallow sign in hex literal
On Tue Sep 19, 2023 at 5:36 PM EDT, Ember Sawady wrote:
> On Tue Sep 19, 2023 at 9:32 PM UTC, Bor Grošelj Simić wrote:
> > '\u aa' was also considered valid before this patch (strto* functions allow
> > whitespace before the number), so the commit message, comments and tests should
> > also reflect that.
>
> does the spec allow this? if so, it should be disallowed there as well
No, it was a bug caused by strto* allowing the string to begin with
arbitrary whitespace. Similar to how strto* allows a sign at the
beginning.
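
For reference, the strto* leniency described above can be reproduced with a few lines of C. The helper parse_hex_strict is hypothetical and is not the code from this patch; it is only a sketch of one way to reject leading whitespace and signs by checking every character before handing the string to strtoul.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: accept s only if it consists of exactly ndigits
 * hexadecimal digits (ndigits <= 8) and nothing else. strtoul alone would
 * also accept leading whitespace and a leading '+' or '-'. */
bool
parse_hex_strict(const char *s, size_t ndigits, uint32_t *out)
{
	assert(s != NULL && out != NULL);
	for (size_t i = 0; i < ndigits; i++) {
		/* A NUL terminator is not a hex digit, so short input stops here */
		if (!isxdigit((unsigned char)s[i])) {
			return false;
		}
	}
	if (s[ndigits] != '\0') {
		return false; /* trailing characters after the digits */
	}
	*out = (uint32_t)strtoul(s, NULL, 16);
	return true;
}
```

With plain strtoul, both " aa" and "+aa" parse as 0xaa in base 16, which is how escapes such as '\u aa' slipped through; the strict helper rejects both and accepts only inputs like "00aa".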
Re: [PATCH harec 2/3] Improve handling of invalid UTF-8 in rune/str literals
On Tue Sep 19, 2023 at 11:29 PM CEST, Sebastian wrote:
> On Tue Sep 19, 2023 at 5:32 PM EDT, Bor Grošelj Simić wrote:
> > I don't particularly like this. The name \u implies it has something to do with
> > unicode in all contexts. Perhaps a less weird alternative would be to have \u
> > behave identically in str and rune and instead allow arbitrary \x sequences up
> > to 4 bytes in length in runes, consistently with how \x allows one to disregard
> > the rules in string literals.
>
> \u does have to do with Unicode in all contexts though. It's just that
> codepoints are stored differently in rune than in str.
Right. +1 to this part then.
>
> > Right now `str` serves a double purpose, as a type for string literals and as a
> > type for constant byte sequences of valid UTF-8. We already sometimes employ
> > various hacks to get around the second intended purpose, but before this
> > proposed change at least the two purposes had a non-empty intersection because
> > the literals were supposed to be valid UTF-8.
>
> I don't really understand the issue here? I don't think str really has a
> double purpose: its purpose is to store UTF-8 encoded text.
>
> > Now we're explicitly adding a way to make the string literals disobey UTF-8
> > rules, making the two purposes independent of each other and that makes me
> > question this whole idea of having a single type doing both of those things.
>
> String literals still can't disobey UTF-8 rules. e.g. "\xaa" is still
> invalid, since it doesn't form a valid UTF-8 codepoint. The only
> difference is that \x now denotes bytes rather than codepoints, but the
> compiler still enforces that those bytes form valid UTF-8.
Sorry, I can't read. Disregard what I said. I still have mixed feelings
regarding the str type and the UTF-8 invariant, but that truly has no relevance
for this patch then.