~sircmpwn/hare-dev

This thread contains a patchset. You're looking at the original emails, but you may wish to use the patch review UI. Review patch
9 5

[PATCH hare v2] bufio::scanner: add EOF_REST scan option

Details
Message ID
<20240804100416.12411-2-contact@willowbarraco.fr>
DKIM signature
pass
Download raw message
Patch: +45 -4
EOF_REST is the previous scanner behavior. Now with the new default
value, the scanner will return EOF if it reachs the source EOF without
encountering the delims.

This new option is necessary to parse uncomplete buffers in a
non-blocking way (ex: in hare-ev).

Also, my opinion is that it is a more previsible behavior.
scan_bytes a "foo bar" buffer with a "nak" token should not result with a
"foo bar" str. The user program might expect that the full buffer was
"foo barnak" which might not be the case.

The scanner is a good place to cover this need. read_line behavior could
be changed for consistency, but this is not a good idea because both methods
drains the source while returning EOF. The scanner will preserve the temporary
buffer, while read_line will just swallow the rest.

This is a breaking change. Programs that relies on the previous behavior
may now have a confusing scanner result that might be difficult to
debug.

The changes to the tests represent correctly the change of behavior:

unix::hosts::next need line breaks, because /etc/hosts-formatted formated lines
ends with line breaks. "127" is not invalid, it is an incomplete line.
The same conclusion for format::ini.

Signed-off-by: Willow Barraco <contact@willowbarraco.fr>
---
 bufio/scanner.ha           | 23 ++++++++++++++++++++++-
 bufio/scanner_test+test.ha | 19 +++++++++++++++++++
 format/ini/+test.ha        |  3 ++-
 unix/hosts/test+test.ha    |  4 ++--
 4 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/bufio/scanner.ha b/bufio/scanner.ha
index 199eeb93..8786f33b 100644
--- a/bufio/scanner.ha
@@ -25,6 +25,17 @@ export type scanner = struct {
	pending: []u8,
	// User-confirmed maximum size of read buffer
	maxread: size,
	// Change some scanning behaviors
	opts: scan_options,
};

// Specifies the behaviours while scanning.
//
// EOF_REST change how the scanner works when reading EOF. By default, if the
// delims isn't found, it return EOF. With EOF_REST, it return the rest;
export type scan_options = enum uint {
	DEFAULT = 0,
	EOF_REST = 1 << 0,
};

// Creates a new [[scanner]] which will allocate and maintain a read buffer for
@@ -38,6 +49,7 @@ export type scanner = struct {
export fn newscanner(
	src: io::handle,
	maxread: size = types::SIZE_MAX,
	opts: scan_options = scan_options::DEFAULT,
) scanner = {
	return scanner {
		stream = &scanner_vtable,
@@ -46,6 +58,7 @@ export fn newscanner(
		maxread = maxread,
		start = 0,
		pending = [],
		opts = opts,
	};
};

@@ -53,7 +66,11 @@ export fn newscanner(
// return [[errors::overflow]] if the buffer length is reached, but will not
// perform any allocations. The user should not call [[finish]] after use unless
// they wish to free the underlying buffer through bufio.
export fn newscanner_static(src: io::handle, buffer: []u8) scanner = {
export fn newscanner_static(
	src: io::handle,
	buffer: []u8,
	opts: scan_options = scan_options::DEFAULT,
) scanner = {
	return scanner {
		stream = &scanner_vtable,
		src = src,
@@ -61,6 +78,7 @@ export fn newscanner_static(src: io::handle, buffer: []u8) scanner = {
		maxread = len(buffer),
		start = 0,
		pending = [],
		opts = opts,
	};
};

@@ -164,6 +182,9 @@ export fn scan_bytes(

		match (scan_readahead(scan)?) {
		case io::EOF =>
			if (scan.opts & scan_options::EOF_REST == 0) {
				return io::EOF;
			};
			if (len(scan.pending) == 0) {
				return io::EOF;
			};
diff --git a/bufio/scanner_test+test.ha b/bufio/scanner_test+test.ha
index b3fadf0f..42ee21d0 100644
--- a/bufio/scanner_test+test.ha
@@ -6,6 +6,7 @@ use encoding::utf8;
use io;
use memio;
use strings;
use types;

@test fn read_byte() void = {
	let buf = memio::fixed([1, 3, 3, 7]);
@@ -180,3 +181,21 @@ use strings;

	assert(scan_line(&scanner) is io::EOF);
};

@test fn scan_uncomplete_line() void = {
	let buf = memio::dynamic();
	let scan = newscanner(&buf);

	assert(scan_line(&scan) is io::EOF);

	io::write(&buf, strings::toutf8("hello"))!;
	io::seek(&buf, 0, io::whence::SET)!;

	assert(scan_line(&scan) is io::EOF);

	io::write(&buf, strings::toutf8("\n"))!;
	io::seek(&buf, -1, io::whence::CUR)!;

	let line = scan_line(&scan) as const str;
	assert(strings::compare(line, "hello") == 0);
};
diff --git a/format/ini/+test.ha b/format/ini/+test.ha
index 64013ced..c67d0668 100644
--- a/format/ini/+test.ha
+++ b/format/ini/+test.ha
@@ -13,7 +13,8 @@ name=Sourcehut
description=The hacker's forge
[harelang.org]
name=Hare
description=The Hare programming language"));
description=The Hare programming language
"));
	const sc = scan(&buf);
	defer finish(&sc);

diff --git a/unix/hosts/test+test.ha b/unix/hosts/test+test.ha
index cdd18fa4..40661a9f 100644
--- a/unix/hosts/test+test.ha
+++ b/unix/hosts/test+test.ha
@@ -48,10 +48,10 @@ def HOSTS_FILE = `
};

@test fn errors() void = {
	const s = "127";
	const s = "127\n";
	assert(next(&read(&memio::fixed(strings::toutf8(s))))
		is ip::invalid);
	const s = "127.0.0.1";
	const s = "127.0.0.1\n";
	assert(next(&read(&memio::fixed(strings::toutf8(s))))
		is invalid);
};
-- 
2.46.0

[hare/patches] build success

builds.sr.ht <builds@sr.ht>
Details
Message ID
<D371KGVVGCRQ.23M30XWGTD1B4@fra02>
In-Reply-To
<20240804100416.12411-2-contact@willowbarraco.fr> (view parent)
DKIM signature
missing
Download raw message
hare/patches: SUCCESS in 1m49s

[bufio::scanner: add EOF_REST scan option][0] v2 from [Willow Barraco][1]

[0]: https://lists.sr.ht/~sircmpwn/hare-dev/patches/54276
[1]: contact@willowbarraco.fr

✓ #1293380 SUCCESS hare/patches/alpine.yml  https://builds.sr.ht/~sircmpwn/job/1293380
✓ #1293382 SUCCESS hare/patches/netbsd.yml  https://builds.sr.ht/~sircmpwn/job/1293382
✓ #1293383 SUCCESS hare/patches/openbsd.yml https://builds.sr.ht/~sircmpwn/job/1293383
✓ #1293381 SUCCESS hare/patches/freebsd.yml https://builds.sr.ht/~sircmpwn/job/1293381
Lorenz (xha) <me@xha.li>
Details
Message ID
<ZrDDR6827AWbMCC8@xha.li>
In-Reply-To
<20240804100416.12411-2-contact@willowbarraco.fr> (view parent)
DKIM signature
pass
Download raw message
hi,

i think that we discussed this on IRC already; i guess i could
not convince you that its not a good idea :D

i think that the scanner is too complicated already for what it is
doing. it took me three hours to completely review the last patch
that improved its performance. that patch had a bad bug that is not
visible on first sight, and this one does aswell. i think this
should prove my point that we should not comlicated it futher.

this belongs in hare-ev imo. but if we really reach the concensus
that this should go in the stdlib, please review the logic in the
scanner carefully and maby consider making this option a boolean?

thank you!
Details
Message ID
<D37ZL38FODDO.3TW9AT1WKH3VD@willowbarraco.fr>
In-Reply-To
<ZrDDR6827AWbMCC8@xha.li> (view parent)
DKIM signature
pass
Download raw message
I've sent this rebased version because it has been requested here:

https://lists.sr.ht/~sircmpwn/hare-dev/%3C20240803140652.21676-2-contact@willowbarraco.fr%3E#%3CD3711A3U57KD.19O4MZF00EX9J@cmpwn.com%3E

Also about adding complexity, I'm personally in favor of changing the
behavior, and to not add options. The current behavior feels wrong to
me, and it make bufio almost useless for most of my use-cases.
Details
Message ID
<6803088d-4e93-4e9d-994c-7cee6cb28830@spxtr.net>
In-Reply-To
<D37ZL38FODDO.3TW9AT1WKH3VD@willowbarraco.fr> (view parent)
DKIM signature
pass
Download raw message
I agree with changing the default behavior and not adding an option. The
user can use bufio::scan_buffer() if they want the trailing data.
Lorenz (xha) <me@xha.li>
Details
Message ID
<ZrEGyqIzrlGjcbpW@xha.li>
In-Reply-To
<6803088d-4e93-4e9d-994c-7cee6cb28830@spxtr.net> (view parent)
DKIM signature
pass
Download raw message
On Mon, Aug 05, 2024 at 05:23:28PM +0100, spxtr wrote:
> I agree with changing the default behavior and not adding an option. The
> user can use bufio::scan_buffer() if they want the trailing data.

the problem is that in 99% of these cases they *do* want the trailing data.
i think changing the default behavior is a bad idea.
Details
Message ID
<D38Z7XDEPWHO.G0VCPXWKQ446@cmpwn.com>
In-Reply-To
<20240804100416.12411-2-contact@willowbarraco.fr> (view parent)
DKIM signature
pass
Download raw message
Here's my analysis of this change.

So, some comments on the lay of the land. Firstly, the use-case which
calls for this change is to scan partial buffers while they are still
being filled, e.g. by incoming network traffic, or a pipe, or just a
multi-reader/single-writer file. This is an important use-case and I
don't think that modifying bufio to accomodate this is an issue.

To acknowledge the main present-day use-case for which the scanner is
defined: to read text files line-wise, supporting functionality like the
/etc/hosts parser, format::ini parser, etc. By convention and arguably
by standard, on Unix all text files end in a line feed; most but
probably not all text editors uphold this and it is consistent with the
POSIX standard:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

Other functionality in the stdlib will return an empty token when
tokenizeing a string which ends with a delimiter, e.g. strings::tokenize
will return "" on the last iteration of
strings::tokenize("hello world ", " "); However, this should not be the
default behavior of bufio::scan_line. In my view the main purpose of
this function is to enumerate lines in a Unix text file, and LF followed
by EOF does not denote a line per the standard, and a "line" which does
not end in a line feed is not a line per the standard. Despite the
standard, however, it is not unheard of for text files with unterminated
lines to exist. Stuff like vim will unconditionally add a line feed to
every file when you :w, but not every editor does this.

So this context is important for understanding what ought to be the
default. By this line of reasoning I think that because lines are
*standardized* as ending in LF, and almost all editors uphold this
behavior, and files which do not end in LF are very uncommon (but not
unheard of), that it is a sensible default to discard trailing
characters, which is to say drop unlines from scan_line. However, it is
somewhat inconsistent with strings::tokenize, and I'm not sure the
argument applies to other sorts of tokens besides lines -- so we do need
a flag to control this behavior.

And regarding the patch itself, I think the docs and naming conventions
need to explain how this works a lot better.

I propose the following:

// Options which fine-tune the behavior of a [[scanner]].
type scan_options = enum uint {
	DEFAULT = EOF_DISCARD,
	// Upon encountering EOF, all bytes or characters between the
	// final token and EOF are discarded and EOF is returned
	// immediately.
	//
	// This option is recommended for use-cases where the user is
	// scanning over a file or buffer which may contain partial
	// content, and the user wishes to consume as many tokens as
	// possible and assume that additional data may follow EOF
	// before a new delimiter is written.
	//
	// This is the default behavior. Note that on Unix, text files
	// are always terminated with a new line, and [[scan_line]] will
	// enumerate all well-formed lines in a file with this flag --
	// however, when scanning ill-formed text files which include
	// text following the final line feed, this additional text will
	// be discarded.
	EOF_DISCARD = 0,
	// Upon encountering EOF, all bytes or characters between the
	// final token and EOF are treated as a token and returned to
	// the caller before returning EOF.
	//
	// This is recommended for use-cases where EOF is effectively
	// considered an additional delimiter between tokens, or where
	// the remainder of the file following the final delimiter is
	// meaningful.
	EOF_GREEDY = 1 << 0,
};

That said... I can be convinced that GREEDY should be the default.

-- 
Drew DeVault
Details
Message ID
<30e5a5c0-63bc-4d74-af9c-81290b9022d9@spxtr.net>
In-Reply-To
<D38Z7XDEPWHO.G0VCPXWKQ446@cmpwn.com> (view parent)
DKIM signature
pass
Download raw message
>	DEFAULT = EOF_DISCARD,
>	// Upon encountering EOF, all bytes or characters between the
>	// final token and EOF are discarded and EOF is returned
>	// immediately.

I think in most cases we do not want it to actually discard the
remaining data, but instead just leave it in the buffer. If more data
come in afterward then we want to start from the last delimiter.


>  export fn newscanner(
> 	src: io::handle,
> 	maxread: size = types::SIZE_MAX,
>+	opts: scan_options = scan_options::DEFAULT,
> ) scanner = {

As a general style thing, I prefer to not have multiple optional
function params in one function. I would tentatively suggest documenting
that the user can just set the flag themselves, like

	let scanner = bufio::newscanner(h);
	scanner.opts = EOF_GREEDY;
	// ...


And, really minor thing, but I'm always a bit suspicious of general
"options" enums with only two choices. Unless you plan on adding more
options later, what is wrong with a bool?
Details
Message ID
<D3P5RUAKWMA2.2IHICXA736MYI@willowbarraco.fr>
In-Reply-To
<30e5a5c0-63bc-4d74-af9c-81290b9022d9@spxtr.net> (view parent)
DKIM signature
pass
Download raw message
On Sun Aug 25, 2024 at 2:41 PM CEST, spxtr wrote:
> >	DEFAULT = EOF_DISCARD,
> >	// Upon encountering EOF, all bytes or characters between the
> >	// final token and EOF are discarded and EOF is returned
> >	// immediately.
>
> I think in most cases we do not want it to actually discard the
> remaining data, but instead just leave it in the buffer. If more data
> come in afterward then we want to start from the last delimiter.

That is what it does. It will returns EOF, leaving the internal buffer
untouched. The remainings will eventually be returned if the delim is
matched later, after more reads.

>
>
> >  export fn newscanner(
> > 	src: io::handle,
> > 	maxread: size = types::SIZE_MAX,
> >+	opts: scan_options = scan_options::DEFAULT,
> > ) scanner = {
>
> As a general style thing, I prefer to not have multiple optional
> function params in one function. I would tentatively suggest documenting
> that the user can just set the flag themselves, like
>
> 	let scanner = bufio::newscanner(h);
> 	scanner.opts = EOF_GREEDY;
> 	// ...
>
>
> And, really minor thing, but I'm always a bit suspicious of general
> "options" enums with only two choices. Unless you plan on adding more
> options later, what is wrong with a bool?

The idea is to minimize the impact over the API. It is less intrusive to
add/remove bitwise options, than to add/remove booleans arguments.
Details
Message ID
<D3SG4KC7H5LQ.3VA1UO11MM0MT@willowbarraco.fr>
In-Reply-To
<20240804100416.12411-2-contact@willowbarraco.fr> (view parent)
DKIM signature
pass
Download raw message
Anyone still got strong reason to nak this? Or can I send a v3 that
include the doc, and naming changes as recommended?
Reply to thread Export thread (mbox)