~technomancy/fennel

Raw string syntax proposal for Fennel

Message-ID: <CAAKhXobkDSx=209XXRrxxdzOGbADXUZUaM2ETQyuj34M3qWCRg@mail.gmail.com>

I often write strings that contain double quotes.  This includes
messages to be displayed with `print` or `io`, and especially
docstrings.  While such strings are printed correctly in the REPL or
in a program log, they are hard to read and edit in the sources.
Fortunately, Emacs has a lot of facilities, like the separedit.el [1]
package, that allow editing nested strings and have everything escaped
automatically.  But not everyone uses Emacs, and this still leaves the
problem of reading such strings.

Lua has raw strings denoted with the `[[]]` syntax, which can contain
nested square brackets too.  This is done by adding `=` signs between
the opening brackets and matching the same number between the closing
ones: the raw string `[==[[[string]]]==]` contains `[[string]]`.  Such
strings also have some useful properties, like ignoring escape
sequences and the first newline, which makes it easier to write
multi-line strings in deeply indented code without having the first
line on the same line as the string start:

    some_very_long_function_name[[
    string with some long lines that exceed 80 character recommended
    width.  Especially when first line is placed on the same line as
    the call]]
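
To make the three properties above concrete (level matching, no escape processing, dropped first newline), here is a rough model of how a Lua long string is read.  It is written in Python rather than Lua or Fennel only so that it runs standalone; `read_lua_long_string` is a hypothetical name, and the sketch assumes the newline after the opening bracket is a bare "\n".

```python
def read_lua_long_string(src: str) -> str:
    """Model of reading a Lua long string such as [==[ ... ]==]."""
    if not src.startswith("["):
        raise ValueError("not a long string")
    # Count the '=' signs between the two opening brackets.
    level = 0
    while src[1 + level:2 + level] == "=":
        level += 1
    if src[1 + level:2 + level] != "[":
        raise ValueError("not a long string")
    body_start = 2 + level
    # Only a closing bracket with the same number of '=' ends the string.
    closer = "]" + "=" * level + "]"
    body_end = src.find(closer, body_start)
    if body_end == -1:
        raise ValueError("unfinished long string")
    body = src[body_start:body_end]
    # A newline directly after the opening bracket is dropped.
    if body.startswith("\n"):
        body = body[1:]
    # No escape processing: backslashes stay as they are.
    return body
```

For instance, the input `[==[[[string]]]==]` yields `[[string]]`, and a backslash followed by `n` inside the brackets stays two characters.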

So in Fennel, when we put strings in tables, or write function
docstrings that contain inner strings, it would be handy to have such
syntax to make things easier to read and write.  Unfortunately, we
can't use Lua's `[[]]` directly, because square brackets are already
used for sequential collections and for destructuring, so reusing them
would be very confusing.  So I propose different delimiter variants for
raw strings.  Here are several to choose from:

    1. r" "r - r is for raw string.
    2. @r" "r - parser macro style.
    3. r#" "# - Rust style [2].
    4. <" "> - Quote tag style

Options 1 and 2 can be increased in depth by increasing the number of
`r` symbols around the string.  For example, here's how documentation
for raw strings can be written in a raw string:

    rrrr"
    A raw string starts with one or more r symbols followed by a
    double quote, and ends with a double quote followed by the same
    number of r symbols as the opening quote.  For example,
    here's a raw string that contains an ordinary string:

    r"String "with quotes" in raw string"r

    Note, that there's no need to escape inner double quotes.  That
    would result in such Lua raw string:

    [[String "with quotes" in raw string]]

    Raw string can contain a raw string inside as well:

    rr"a r"raw string with "with" string"r within a raw string"rr

    The escaping is not needed as we start and end the raw string with
    a matching number of r symbols.  The string above would result in
    the following Lua raw string:

    [[a r"raw string with "with" string"r within a raw string]]

    Note, that printed variant can be copied back to Fennel and read
    without any modifications.
    "rrrr
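
The reading rule for variant 1 can be sketched as follows.  This is only an illustration in Python, not Fennel's parser; `read_r_string` is a hypothetical name, and the input is assumed to be exactly one literal with nothing after it.

```python
def read_r_string(src: str) -> str:
    """Return the contents of a proposed r"..."r literal (sketch only)."""
    # Count the r symbols in front of the opening double quote.
    depth = 0
    while src[depth:depth + 1] == "r":
        depth += 1
    if depth == 0 or src[depth:depth + 1] != '"':
        raise ValueError("not a raw string literal")
    # The string ends at the first double quote followed by `depth` r's.
    closer = '"' + "r" * depth
    start = depth + 1
    end = src.find(closer, start)
    if end == -1:
        raise ValueError("unterminated raw string literal")
    return src[start:end]
```

One consequence of this rule: at depth 1 the contents cannot include a double quote directly followed by `r` (as in `"run"`), because that already looks like the closing delimiter.  Raising the depth to `rr` avoids the clash.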

This would be almost the same if we use variant 2, except each string
would have to be prefixed with @, which is not ideal in my opinion.
Another example, with raw strings in tables:

(local raw-strings {:_VERSION "v0.0.1"
                    :_DESCRIPTION rr"
Raw string syntax proposal for Fennel language.  With raw strings we
can have "strings" within raw strings, and r"raw strings"r too"rr})

This example also shows how raw strings allow us to start the string
on a new line: the resulting string is printed without the leading
newline, as with Lua's `[[]]` strings.

Variant 3 is taken directly from Rust, just to show how this problem
is tackled in a completely different language.  I don't think we
should use this variant, as the `#` symbol is already used for
`hash-fn` and `auto-gensym`.

Variant 4 is another option to consider, as we can increase nesting
level by specifying more angle brackets around the string:

    <<"raw string with <"raw string">">>
    rr"raw string with r"raw string"r"rr

It is also a bit easier to see the correct string end, but I don't
really like this variant, because most raw strings are of depth 1,
and at that depth it looks like the string is being compared with `<`.
The 1st variant is, in my opinion, also easier to type, and should be
as easy to parse as any other variant, probably even easier than
variant 2.  It also doesn't require a parser macro implementation
at all.

There's one concern, though.  As always, I guess.

If a raw string contains square brackets, the Fennel compiler can no
longer simply produce a `[[]]` string, and has to check the raw string
for `[` or `]`.  But as far as I can see this is already done, because
docstrings are already emitted as `[[]]` strings, so the same could be
done for raw strings too.  So a raw string with brackets like this:

    r"[[]]"r

Should produce such Lua string:

    [=[[[]]]=]
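
A minimal sketch of how the delimiter level could be chosen from the contents, written in Python only for illustration.  `emit_lua_long_string` is a hypothetical name, not part of the Fennel compiler.  Two details are assumptions beyond the description above: contents ending in `]` are checked as well, since the trailing bracket could otherwise complete the closing delimiter too early, and a leading newline is doubled, because Lua drops a newline that directly follows the opening bracket.

```python
def emit_lua_long_string(content: str) -> str:
    """Wrap content in the shortest Lua long-bracket pair that is safe."""
    level = 0
    while True:
        closer = "]" + "=" * level + "]"
        # Searching content + closer also rejects a level whose closer
        # would be completed early by a trailing "]" or "]=" in content.
        if (content + closer).find(closer) == len(content):
            break
        level += 1
    opener = "[" + "=" * level + "["
    # Lua skips a newline right after the opener, so add one to keep it.
    if content.startswith("\n"):
        opener += "\n"
    return opener + content + closer
```

With this rule the contents `[[]]` come out as `[=[[[]]]=]`, matching the example above.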


I'm open to suggestions and criticism, so if you have any thoughts I
would be glad to hear them!

[1]: https://github.com/twlz0ne/separedit.el
[2]: https://doc.rust-lang.org/rust-by-example/std/str.html#literals-and-escapes


-- 
Best regards,
Andrey Listopadov

Message-ID: <878s89yq0f.fsf@whirlwind>
In-Reply-To: <CAAKhXobkDSx=209XXRrxxdzOGbADXUZUaM2ETQyuj34M3qWCRg@mail.gmail.com>

Andrey Orst <andreyorst@gmail.com> writes:

> So I propose different delimiter variants for raw strings.  Here are
> several to choose from:
>
>     1. r" "r - r is for raw string.
>     2. @r" "r - parser macro style.
>     3. r#" "# - Rust style [2].
>     4. <" "> - Quote tag style

Thanks for bringing this up.

I don't want to rule this out completely, but my first reaction here
is "we already have two notations for strings, and I'm not totally
convinced that's not too many already". In particular the first three
here evoke a very strong negative reaction.

> Options 1 and 2 can be increased in depth by increasing the number of
> `r` symbols around the string.

This seems like a downside to me. The notation is already unpredictable
enough that allowing nesting just makes it worse. If we introduce a new
string type, the rules around it need to be extremely simple. Raw
strings should be raw; the idea of treating certain things differently
inside a raw string specially seems to defeat the purpose. The only
thing that's not interpreted as string content should be the closing
delimiter. If you need a string which contains the closing delimiter,
you'll just have to use normal strings with escaping.

Honestly my preference would be to use «foo» or 「bar」 or possibly even
“baz” but ... I expect some resistance to that. The first in particular
is very easy to input in Emacs, but not everyone uses Emacs. =) And I
don't know how easy they are to input in other systems. But I would
imagine that Fennel support modes for various languages could have
commands for inserting these for you.

> If a raw string contains square brackets, the Fennel compiler can no
> longer simply produce a `[[]]` string, and has to check the raw string
> for `[` or `]`.  But as far as I can see this is already done, because
> docstrings are already emitted as `[[]]` strings, so the same could be
> done for raw strings too.  So a raw string with brackets like this:

The idea of emitting strings differently depending on what notation was
used to input them seems problematic. Right now a string in the AST is
just a string. If some strings must be treated differently from others,
suddenly we have to carry along metadata with it now? Even if we solve
that within the compiler, suddenly macros now have to be updated to
reflect the fact that strings are sometimes represented as
not-strings. Seems like this would be very error-prone and confusing.

Anyway, I'm interested in hearing what others think here too.

-Phil

Message-ID: <CAAKhXobDg8NtSGyW1PNuYwOvxMSk51r+x940pRgZK7Mbk6knoQ@mail.gmail.com>
In-Reply-To: <878s89yq0f.fsf@whirlwind>

On Sun, Jan 31, 2021 at 8:28 PM Phil Hagelberg <phil@hagelb.org> wrote:
> This seems like a downside to me. The notation is already unpredictable
> enough that allowing nesting just makes it worse. If we introduce a new
> string type, the rules around it need to be extremely simple. Raw
> strings should be raw; the idea of treating certain things differently
> inside a raw string specially seems to defeat the purpose. The only
> thing that's not interpreted as string content should be the closing
> delimiter. If you need a string which contains the closing delimiter,
> you'll just have to use normal strings with escaping.

Perhaps I wasn't clear enough here.  Lua raw strings can be nested
too, and this is done by using [=[ and ]=] delimiters, where the number
of = signs can be increased to contain nested raw strings.  The
notation I'm suggesting here is a plain mapping onto Lua raw string
notation, except that instead of [=[ we use r" to open the string, and
instead of ]=] we use a matching "r.  The ability to nest raw strings
is the main point of a string being raw -- it preserves everything
inside, even other raw strings.

> The idea of emitting strings differently depending on what notation was
> used to input them seems problematic.

This is a side effect of supporting a feature of the host language
but with a different syntax.  Lua programmers do this manually: when
square brackets need to be used in a raw string, different raw string
delimiters are entered by hand.  Since the proposed syntax doesn't
allow controlling which Lua delimiter is used, mangling is required.  I
mean, if a Lua programmer writes a Lua string like this (for some
reason): [=[[[]][[]]]=], the [=[ and ]=] are inserted manually.  Since
we can't allow such syntax, when writing r"[[]][[]]"r the compiler has
to mangle r" into [=[ and "r into ]=] to produce a correct Lua raw
string.

Also

> depending on what notation was used

A slight correction here - depending on string contents, not the
notation. The notation implies mangling, but the mangling is always
based on contents, just as with regular Fennel identifier mangling.


-- 
Best regards,
Andrey Listopadov

Message-ID: <877dnsznbu.fsf@whirlwind>
In-Reply-To: <CAAKhXobDg8NtSGyW1PNuYwOvxMSk51r+x940pRgZK7Mbk6knoQ@mail.gmail.com>

Andrey Orst <andreyorst@gmail.com> writes:

> Perhaps I wasn't clear enough here.  Lua raw strings can be nested
> too, and this is done by using [=[ and ]=] delimiters, where the number
> of = signs can be increased to contain nested raw strings.  The
> notation I'm suggesting here is a plain mapping onto Lua raw string
> notation, except that instead of [=[ we use r" to open the string, and
> instead of ]=] we use a matching "r.

I see. IMO this is a flaw of the Lua design. I think it would be better if
raw strings were consistently and completely raw.

>> The idea of emitting strings differently depending on what notation was
>> used to input them seems problematic.
>
> This is the side effect of supporting a feature of the host language
> but with a different syntax.

I'm not sure that logic holds up. The point of Fennel is to offer new
notation while preserving Lua runtime semantics. String syntax is not
a part of runtime semantics. There are some places where we must keep
features from Lua even if we don't like them, such as multiple return
values. But I don't believe the same applies to syntactic features.

If we decide to add raw strings, we should do it on our own terms.

>> depending on what notation was used
>
> A slight correction here - depending on string contents, not the
> notation. The notation implies mangling, but the mangling is always
> based on contents, just as with regular Fennel identifier mangling.

If it's the contents of the string which determine how it's emitted, that
means that sometimes Fennel's normal "" strings would be compiled into
raw Lua strings; is that what you meant? Can you be more specific about
what kind of contents would trigger it to be emitted as a raw string and
when it would be emitted as a regular string?

But again, I'm interested in hearing from others too.

-Phil

Message-ID: <CAAKhXoabCFE0htRxww7WsNHVE8iTK5D94iA+jaBw0O2uXb2cLg@mail.gmail.com>
In-Reply-To: <877dnsznbu.fsf@whirlwind>

On Mon, Feb 1, 2021 at 2:41 AM Phil Hagelberg <phil@hagelb.org> wrote:
>
> I see. IMO this is a flaw of the Lua design. I think it would be better if
> raw strings were consistently and completely raw.

But Lua raw strings are consistently and completely raw... I'm not
sure that I understand your complaint here.
And this is basically how raw strings are done in other languages
which have raw strings:

Perl: qq@"string"@ or qq#"string"#
Kakoune: %{string} or %$string$ or %🦀string🦀
Rust: r#"string"# or r###"string"###
Open-std: R"[string]"

To name a few.

> I'm not sure that logic holds up. The point of Fennel is to offer new
> notation while preserving Lua runtime semantics. String syntax is not
> a part of runtime semantics.

Lua's [[ ]] strings have runtime semantics I suppose? Like ignoring
leading newlines and escape sequences. Maybe not runtime, but there
are semantic differences from ordinary strings.

> If it's the contents of the string which determine how it's emitted, that
> means that sometimes Fennel's normal "" strings would be compiled into
> raw Lua strings; is that what you meant?

No. Ordinary "" strings and colon strings are always emitted the way
they are now.  Only the raw strings are subject to delimiter mangling:

Fennel -> Lua

r"hello"r -> [[hello]]
because no nested brackets were found.

r"a pair of ]] closing brackets"r -> [=[a pair of ]] closing brackets]=]
because two closing brackets were found.

r"a ]] and ]=] in a string"r -> [==[a ]] and ]=] in a string]==]
because we can't use both ]=] and ]] to close the string

This, AFAICS, is the only rule needed for choosing whether to emit the
string in [[ ]] or in [=[ ]=]. And this exactly matches Lua semantics.

> Can you be more specific about
> what kind of contents would trigger it to be emitted as a raw string and
> when it would be emitted as a regular string?

Raw strings are only emitted when raw strings are used explicitly via
r" "r. The contents of a raw string are never transformed; only the
delimiters are transformed to Lua ones, and the transformation is based
on the contents (in order to produce correct delimiters in a few edge
cases).

Here's a complete workflow:

1. Parser sees r" (or rr"; the number of r symbols before the double
   quote is arbitrary).
2. Parser treats everything that follows as string contents until it
   sees "r (with a matching number of r symbols).
3. Compiler checks whether the string contains the closing delimiter
   (starting with ]], then ]=], and so on).
   a. If the string doesn't contain the delimiter, the compiler emits
      the string enclosed in this delimiter in Lua.
   b. If the string contains the delimiter, the compiler adds = between
      the brackets and goes back to step 3.

Here's the code which illustrates the algorithm and base idea:

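;; Illustration only: takes the raw string as text (delimiters included)
;; and prints the Lua long string it would be compiled to.
(fn parse-raw-string [str]
  ;; This sketch starts at level 1 ([=[ ... ]=]) rather than at [[ ]].
  (var start-lua-delimiter "[=[")
  (var end-lua-delimiter "]=]")
  ;; start: opening delimiter such as rr"; end: matching closer such as "rr;
  ;; str: the contents with both delimiters stripped.
  (let [start (string.match str "^[r]+\"")
        end (.. "\"" (string.match start "^[r]+"))
        str (string.gsub str (.. start "(.*)" end) "%1")]
    ;; Add one more = while the closing delimiter occurs in the contents.
    (while (string.find str end-lua-delimiter)
      (set start-lua-delimiter (string.gsub start-lua-delimiter "=" "==" 1))
      (set end-lua-delimiter (string.gsub end-lua-delimiter "=" "==" 1)))
    (print (.. start-lua-delimiter str end-lua-delimiter))))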

And here are some examples:

(parse-raw-string "r\"a raw string\"r")
[=[a raw string]=]

(parse-raw-string "rrr\"a raw string\"rrr")
[=[a raw string]=]

(parse-raw-string "rr\"a raw string with r\"raw string\"r\"rr")
[=[a raw string with r"raw string"r]=]

(parse-raw-string "r\"a raw string with ]=] delimiter\"r")
[==[a raw string with ]=] delimiter]==]

(parse-raw-string "r\"a raw string with ]=] and ]==] delimiters\"r")
[===[a raw string with ]=] and ]==] delimiters]===]

(parse-raw-string "r\"a raw string only with ]==] delimiter\"r")
[=[a raw string only with ]==] delimiter]=]

All strings produced by this function can be copied into a Lua REPL and
evaluated without problems, because these are raw strings.

Now of course, since I'm passing raw strings in ordinary strings I
have to escape quotes inside, but for a real raw string implementation
this will not be necessary. So a call to print in Fennel

(print rr"a r"raw string"r in a raw string"rr)

will be compiled to the following Lua:

print[=[a r"raw string"r in a raw string]=]

I hope this is a clearer illustration, as I think there is a
misunderstanding of what I'm proposing here. Sorry for the inconvenience!


-- 
Best regards,
Andrey Listopadov

Message-ID: <874kivzq93.fsf@whirlwind>
In-Reply-To: <CAAKhXoabCFE0htRxww7WsNHVE8iTK5D94iA+jaBw0O2uXb2cLg@mail.gmail.com>

Andrey Orst <andreyorst@gmail.com> writes:

> On Mon, Feb 1, 2021 at 2:41 AM Phil Hagelberg <phil@hagelb.org> wrote:
>>
>> I see. IMO this is a flaw of the Lua design. I think it would be better if
>> raw strings were consistently and completely raw.
>
> But Lua raw strings are consistently and completely raw... I'm not
> sure that I understand your complaint here.

Oh, I see; I misread the bit about how to nest them.

I think it would be better to have the simplest possible rules and that
the opening and closing delimiters should always be the same, which
means that you can't nest them. I think raw strings are special-purpose,
and if you need nesting you can fall back to normal strings which are
much more flexible with escape sequences.

> Lua's [[ ]] strings have runtime semantics I suppose? Like ignoring
> leading newlines and escape sequences. Maybe not runtime, but there
> are semantic differences from ordinary strings.

Yes, they have different semantics, but they are not runtime semantics.
That makes it different from a feature like multiple return values,
where we must support it for compatibility with existing Lua code. If
you pass a string to an existing Lua function, there is no way for that
Lua code to know if it was originally written as a raw string or a
normal string.

>> If it's the contents of the string which determine how it's emitted, that
>> means that sometimes Fennel's normal "" strings would be compiled into
>> raw Lua strings; is that what you meant?
>
> No. Ordinary "" strings and colon strings are always emitted the way
> they are now.  Only the raw strings are subject to delimiter mangling.

I see. So this raises an implementation problem. How can we represent
the raw string inside the AST so that it is both A) passed to macro code
or other AST-using functionality as a real string and B) able to be
emitted as a raw Lua string?

The closest analogy right now would be [] vs {}, which we track with
metatables even though they are both indistinguishable as tables at
runtime. But we can't set the metatable of a string.

So do we represent raw strings in the AST as tables? That seems like it
would just lead to more confusion in macros; they check to make sure
their argument is a table when it's actually a raw string, but it gets
treated like a table.

Or we could put them in a global weak table, but now we have a weird
situation where calling gsub on the string would turn it from a raw
string into a normal string; that doesn't make sense.

It's difficult for me to imagine how this could be implemented without
causing more problems than it solves.

-Phil

Message-ID: <CAAKhXoYc4UXA=tXkmD0Kd8=hC+C6TH1m3Fr5Kt9Rq-0wZu045w@mail.gmail.com>
In-Reply-To: <874kivzq93.fsf@whirlwind>

On Mon, Feb 1, 2021 at 7:50 PM Phil Hagelberg <phil@hagelb.org> wrote:
> I see. So this raises an implementation problem. How can we represent
> the raw string inside the AST so that it is both A) passed to macro code
> or other AST-using functionality as a real string and B) able to be
> emitted as a raw Lua string?
>
> The closest analogy right now would be [] vs {}, which we track with
> metatables even though they are both indistinguishable as tables at
> runtime. But we can't set the metatable of a string.
>
> So do we represent raw strings in the AST as tables? That seems like it
> would just lead to more confusion in macros; they check to make sure
> their argument is a table when it's actually a raw string, but it gets
> treated like a table.

This is indeed a very concerning point, which I hadn't thought of
before.  Now I see why we can't simply produce [[ ]] strings, so
perhaps raw strings should just be escaped correctly by the compiler.
This implies only one escape level:

r"raw string\n"r -> "raw string\\n"
r"raw "string" with string"r -> "raw \"string\" with string"
rr"raw r"string"r with raw string"rr -> "raw r\"string\"r with raw string"
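
The escaping variant can be sketched in a few lines.  This is a Python illustration only; `raw_to_escaped` is a hypothetical name, and turning a literal newline into the two characters backslash and `n` is an assumption that the examples above do not cover.

```python
def raw_to_escaped(content: str) -> str:
    """Turn raw string contents into an ordinary double-quoted literal."""
    escaped = (
        content.replace("\\", "\\\\")  # backslashes first, so they stay literal
        .replace('"', '\\"')           # then inner double quotes
        .replace("\n", "\\n")          # assumption: real newlines become \n
    )
    return '"' + escaped + '"'
```

Backslashes are doubled before anything else so that the backslashes added for quotes and newlines are not doubled again.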

The nesting part is IMO important, because it allows easy extraction
of such strings via the very simple patterns available in Lua.

This way raw strings are exactly strings, and no manual escaping is
needed for inner strings and inner raw strings.
But now I agree that we already have 2 different string variants, and
adding another one just to remove the need for manual escaping seems
handy, but a bit bloated.
Anyway, if someone has additional thoughts I would be glad to hear them.

-- 
Best regards,
Andrey Listopadov