~skeeto/public-inbox


Re: Let's implement buffered, formatted output

Alexander Shpilkin <ashpilkin@gmail.com>
Message ID
<CAAiuFs8=G1BiDiMi=05h-7UG+Zkzr3Ej1M4L+gFBnL4o5TMHKg@mail.gmail.com>
[Technical note:  The apostrophe in the post title is misinterpreted by
Sourcehut's search page behind the "existing discussions" link.]

One issue with the suggested approach is that it can't reorder
insertions at runtime, so any attempts at localization will fail
miserably[1].  (Like security problems, i18n ones are mostly a matter of
hidden assumptions, not code, to the point where it's occasionally
easier to rip out and rewrite the problematic parts than to fix them.)

I've felt that printf's DSL approach is too dynamic and too costly for
quite a while (cf djb's formatted output functions), but for some
reason it took your post for me to finally admit that it might be
necessary to treat this reordering as control flow.
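
For comparison, POSIX printf() can reorder insertions, but only from
inside the format DSL, via positional conversions.  A minimal sketch,
assuming a POSIX libc (the `%n$` forms are not part of ISO C);
format_owner() and its arguments are names made up for illustration:

```c
#include <stdio.h>

/* Hypothetical helper: tmpl would come from a translation catalog and
 * picks the order of the two insertions itself via %1$s and %2$s. */
static int format_owner(char *buf, size_t len, const char *tmpl,
                        const char *user, const char *item) {
	/* A non-literal format string: this is the dynamism that makes the
	 * DSL approach hard to check at compile time. */
	return snprintf(buf, len, tmpl, user, item);
}
```

With the template "%1$s's %2$s" and the arguments "Ann" and "files"
this produces "Ann's files"; swapping in "%2$s of %1$s" produces
"files of Ann" without touching the call site.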

The nicest way to do this without code generation[1] is probably to call
the output function repeatedly and have it return the insertion number,
similar to getopt().  Unfortunately, the caller will then have to write
those insertion numbers in a switch once again.

I wondered if I could hack up some macros that would do this numbering
automatically.  The answer is, well, I can, but the result is more evil
than I'd like.  Still, here it is in case anyone's interested:

---8<--- cut here ---8<---
#if 0 /* shell polyglot */
${CC-cc} ${CFLAGS--g -O2 -Wall} -o "${0%.c}" "$0" && exec "${0%.c}" "$@"
#endif

#ifdef __COUNTER__
#define tangle(D, S) tangle0(__COUNTER__, D, S)
#else
#define tangle(D, S) tangle0(__LINE__, D, S)
#endif

#define tangle0(C, D, S) tangle1(C, D, S)
#define tangle1(C, D, S) \
	if (0) _tglend ## C: ; else \
	for (FILE *_tgldst = (D);;) for (const char *_tglstr = (S);;) \
	switch (tangle(_tgldst, &_tglstr)) \
	case 0: if (sizeof(enum { _tglnum = 1 })) goto _tglend ## C
#define with \
	else case _tglnum: if (sizeof(enum { _tglnum = _tglnum + 1 }))

/* demo */

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

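/* Copy literal text from *ps to fp.  On '#' followed by a nonzero byte N,
 * advance *ps past both and return N.  A '#' followed by a NUL byte emits
 * a literal '#' and scanning resumes after the NUL.  Return 0 once the
 * template is exhausted.  Every '#' must be followed by one more byte. */
int (tangle)(FILE *fp, const char **restrict ps) {
	for (;;) {
		const char *s = *ps;
		/* no more insertions: flush the tail and finish */
		if (!(*ps = strchr(*ps, '#'))) { fputs(s, fp); return 0; }
		fwrite(s, 1, *ps - s, fp); *ps += 2;
		int n = *(const unsigned char *)(*ps - 1);
		if (n) return n; else fputc('#', fp);
	}
}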

#define tglstr(S) fputs((S), _tgldst)

int main(int argc, char **argv) {
	/* the template can be localized */
	tangle(stdout, "#\1: Hello #\2!\n");
	with tglstr(argv[0]);
	with tglstr(argc > 1 ? argv[1] : "world");
}
---8<--- cut here ---8<---

[1] It's possible to assemble strings from pieces the Right Way, but it
    takes quite a bit of complexity; see https://projectfluent.org/

-- 
Cheers,
Alex

Re: Let's implement buffered, formatted output

Message ID
<20230502184129.k3g7mlfvkyyqhzgv@nullprogram.com>
In-Reply-To
<CAAiuFs8=G1BiDiMi=05h-7UG+Zkzr3Ej1M4L+gFBnL4o5TMHKg@mail.gmail.com> (view parent)
> The apostrophe in the post title is misinterpreted

Thanks for the heads up! I wasn't aware. I'll need to sort that out.

> it can't reorder insertions at runtime

Good point. Encodings aren't difficult, but I lack the requisite i18n 
knowledge and experience to work out a good solution myself for these 
higher level issues. (Though I did pick up Esperanto a few years ago to 
help get insights on exactly this.) My typical localization interactions 
are when it's in the way, behaving badly. Like being unable to parse 
floats (strtod) because the application may not be using the C locale, or 
significant performance penalties when dealing with text.

Your macro hacks are certainly interesting! Though I'd hate using such 
complex macros in a real program. Thanks for sharing, though!

Re: Let's implement buffered, formatted output

Alexander Shpilkin <ashpilkin@gmail.com>
Message ID
<CAAiuFs-B86B1VvhsmgCCXLMAqOg0QTFgoWJgLdndPBCdjeNj+Q@mail.gmail.com>
In-Reply-To
<20230502184129.k3g7mlfvkyyqhzgv@nullprogram.com> (view parent)
Quoting Christopher Wellons (2023-05-02 21:41:29)
> Encodings aren't difficult [...].

[Localization rant moved to different reply.]

> Your macro hacks are certainly interesting! Though I'd hate using such
> complex macros in a real program.

Yeah.  It's not *gross*, and the general structure is fairly normal as
custom control structures[1] go, even if it's awkward, but the
fragility around "else" in user code both inside and after "with" blocks
makes me uncomfortable.  Consider it an experiment in making use of the
strange scoping in `if` and friends that C ≥ '99 inherited from C++98.
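
The scoping in question: since C99 every selection statement is its own
block, so a declaration smuggled into the condition is visible in the
branches and gone afterwards.  A minimal sketch of the trick the `with`
macro relies on (`depth` and probe() are names made up here):

```c
enum { depth = 1 };

static int probe(void) {
	int seen = depth;                        /* outer constant */
	/* The enumerator's scope starts only after "depth + 1", so the
	 * initializer still refers to the outer depth. */
	if (sizeof(enum { depth = depth + 1 }))
		seen = seen * 10 + depth;            /* inner constant, scoped to the if */
	return seen * 10 + depth;                /* outer constant again */
}
```

Under C99 rules the three reads see 1, 2 and 1, so probe() should
return 121.  Note this is C only: C++ does not allow defining a type
inside sizeof.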

I feel I might've obscured my point with the macros, though, so here's
how you can use the same function without any:

	const char *fmt = "#\1: Hello #\2!\n"; int ins = 0;
	while ( (ins = (tangle)(stdout, &fmt)) ) switch (ins) {
	case 1: fputs(argv[0], stdout); break;
	case 2: fputs(argc > 1 ? argv[1] : "world", stdout); break;
	}

Wordier than a printf(), but not terrible, I think.  I still like the
anaphoric (so to say) destination in the macro version, though.
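
To make the localization point concrete: with this calling convention
the order of insertions is decided by the template alone.  Below is a
self-contained sketch using a simplified stand-in for tangle(), here
called next_insertion() (a hypothetical name).  Unlike the original it
has no escape for a literal '#' in mid-template and treats a trailing
'#' as literal text; greet() is likewise made up for the demo.

```c
#include <stdio.h>
#include <string.h>

/* Copy literal text up to the next "#\N" marker to out and return N,
 * or copy the rest of the template and return 0 when no marker is left. */
static int next_insertion(FILE *out, const char **fmt) {
	const char *s = *fmt;
	const char *hash = strchr(s, '#');
	if (!hash) { fputs(s, out); *fmt = s + strlen(s); return 0; }
	fwrite(s, 1, (size_t)(hash - s), out);
	if (!hash[1]) { fputc('#', out); *fmt = hash + 1; return 0; }
	*fmt = hash + 2;
	return (unsigned char)hash[1];
}

/* The call site names each insertion once; the template orders them. */
static void greet(FILE *out, const char *fmt,
                  const char *prog, const char *name) {
	int ins;
	while ((ins = next_insertion(out, &fmt))) switch (ins) {
	case 1: fputs(prog, out); break;
	case 2: fputs(name, out); break;
	}
}
```

Calling greet() with "#\1: Hello #\2!\n" and with a reordered template
such as "Hello #\2, says #\1.\n" goes through the same switch; only the
order in which the cases fire changes.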

In the meantime I've also made a version that allows you to write
`tangle(Str(argv[0]) ": Hello " Str(argv[1]) "!\n")` and still end up
with a format string behind the scenes, but that one *is* very much
gross, so I'll sit on it some more.

> Thanks for sharing, though!

Glad to provide a bit of distraction :)

[1] https://www.chiark.greenend.org.uk/~sgtatham/mp/

-- 
Cheers,
Alex

Localization rant [Was: Re: Let's implement buffered, formatted output]

Alexander Shpilkin <ashpilkin@gmail.com>
Message ID
<CAAiuFs98zXuxThHUbEKEfr1xSEYdtZm+TLbRLShR-SM9nuru2g@mail.gmail.com>
In-Reply-To
<20230502184129.k3g7mlfvkyyqhzgv@nullprogram.com> (view parent)
Quoting Christopher Wellons (2023-05-02 21:41:29)
> > it can't reorder insertions at runtime
>
> Good point. Encodings aren't difficult, but I lack the requisite i18n
> knowledge and experience to work out a good solution myself for these
> higher level issues. (Though I did pick up Esperanto a few years ago to
> help get insights on exactly this.)

It isn't the encodings that get you.  Those are indeed not difficult,
aside from particularly cursed ones, of which only a couple are in
active use (EUC-JP, Shift JIS, GB18030, ... that may be it?).

Instead, it's the little assumptions.  Like the assumption that the
number should always be inserted before the singular or plural noun
form (what I was objecting to in your system).  Or indeed that there
is only a singular and a plural noun form (Esperanto is the same as
English here, but gettext[1] has you covered on examples and is in
general much better than many commercial systems).  Or that the
"Tuesday" in "today is Tuesday" and "this will happen next Tuesday" is
the same (Google Translate tells me Esperanto will trip this; CLDR
failed here for many years, but not any more).  Or that knowing the
user's name is "John Doe" is enough to say "John Doe's documents"
(don't even try) or even "Hi John Doe" (vocatives do exist).
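
To illustrate the plural-form point: the gettext manual linked as [1]
expresses each language's selection rule as a C expression, and the one
it lists for Russian picks among three forms.  A sketch
(plural_index_ru() is a name made up here; the expression follows the
manual's entry):

```c
/* Index of the noun form to use next to the number n in Russian:
 * 0 for 1, 21, 31, ...; 1 for 2-4, 22-24, ...; 2 for everything else
 * (including 0, 5-20 and 11-14). */
static int plural_index_ru(unsigned long n) {
	return n % 10 == 1 && n % 100 != 11 ? 0
	     : n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20) ? 1
	     : 2;
}
```

A template system that only offers "singular" and "plural" slots has
nowhere to put the third form.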

I'll stop here; I'm sure anybody who has touched localization has a
litany of complaints they could recite for you (once you get past the
learned helplessness caused by pre-localization code freezes).

In any case, not hardcoding the insertion order is the absolute minimum
for user-facing text.  Thankfully, that's not hard.

> My typical localization interactions
> are when it's in the way, behaving badly. Like being unable to parse
> floats (strtod) because the application may not be using the C locale,

That's an own goal on the part of ISO and POSIX C, like much of the rest
of their locale infrastructure.  People have occasionally used
locale-specific formats even when not dealing with user-facing data
directly (like Excel's semicolon-separated CSVs in locales with decimal
commas), and it has rarely ended well.  The *_l functions can help.
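
A minimal sketch of locale-independent float parsing, assuming a
POSIX.1-2008 libc.  strtod_l() itself is, as far as I know, a common
extension rather than part of that standard, so this version swaps in
newlocale()/uselocale() around plain strtod() instead; strtod_c() is a
made-up name, and on some systems the declarations live in <xlocale.h>.

```c
#define _POSIX_C_SOURCE 200809L  /* for newlocale() and uselocale() */
#include <locale.h>
#include <stdlib.h>

/* Hypothetical wrapper: parse a double with the "C" locale's decimal
 * point regardless of the current global or thread locale. */
static double strtod_c(const char *s, char **end) {
	locale_t c = newlocale(LC_ALL_MASK, "C", (locale_t)0);
	if (c == (locale_t)0)
		return strtod(s, end);   /* assumption: fall back to the current locale */
	locale_t old = uselocale(c); /* affects the calling thread only */
	double v = strtod(s, end);
	uselocale(old);
	freelocale(c);
	return v;
}
```

Creating the locale object on every call is wasteful; a real program
would probably create it once and keep it around.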

> or significant performance penalties when dealing with text.

Unless your application is in the business of processing text, you
shouldn't need functions more advanced than memcpy().  Not wcwidth(),
that one's broken.  Not even strlen() unless it's to size a buffer.
(One operation that you might actually need for templating is
"bidi-isolate untrusted text"; it doesn't seem to be common in
libraries for some reason.)  If it is in that business, see above, see
the scripts page[2], and throw {w,}ctype in the trash where it belongs.
Encodings are, again, the least of your problems.
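
A minimal sketch of the bidi-isolation operation for UTF-8 output,
assuming the usual approach of wrapping the text in U+2068 FIRST STRONG
ISOLATE and U+2069 POP DIRECTIONAL ISOLATE.  isolate() is a made-up
name, and a robust version would likely also have to deal with stray
directional controls inside the untrusted text itself.

```c
#include <stdio.h>

#define FSI "\xE2\x81\xA8"  /* U+2068 FIRST STRONG ISOLATE, as UTF-8 */
#define PDI "\xE2\x81\xA9"  /* U+2069 POP DIRECTIONAL ISOLATE, as UTF-8 */

/* Write the untrusted string into dst wrapped in an isolate pair.
 * Returns what snprintf() returns: the length needed, excluding NUL. */
static int isolate(char *dst, size_t len, const char *untrusted) {
	return snprintf(dst, len, FSI "%s" PDI, untrusted);
}
```

Note that this needs nothing beyond byte copying, in keeping with the
memcpy() remark above.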

[1] https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html

[2] https://r12a.github.io/scripts/index.html

-- 
Cheers,
Alex