Hi all,
I can't seem to find it now, but I believe Drew has previously mentioned
intentions to re-work the harec test suite into something more general.
I am planning on starting a fresh write of the hosted typechecker and I
would like to start by creating a test suite of sorts.
For my initial hacking I had thrown together this little driver [1], but
I think something a _little_ bit more sophisticated (and a lot less
slapped-together) would be wise.
[1]: https://git.sr.ht/~roselandgoose/test_hare_unit
Any thoughts or suggestions?
Cheers,
RKL
> I can't seem to find it now, but I believe Drew has previously mentioned
> intentions to re-work the harec test suite into something more general.
> I am planning on starting a fresh write of the hosted typechecker and I
> would like to start by creating a test suite of sorts.
I don't know what Drew had in mind, but in my opinion the most important thing
you can do to make the hosted tests better than the bootstrap ones is writing A LOT
of tests for individual components (i guess these are called "unit tests"?). We
have a spec that is most of the time very clear about the desired behavior, and
we should have tests for individual bits of that behavior. At least in
principle a lot of that is easy to make tests for - testing type compatibility
just means writing a test for each entry in the relevant table from the spec
and so on. In practice, we're not doing this in bootstrap harec because it
would require calling the compiler's internal functions directly from the
tests, and that's not really possible with the current approach. In the
hosted checker this should be much easier, fortunately. Having a lot of
unit tests is imo more important than importing the old tests and slowly
making them pass one by one.
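For instance (a throwaway Python sketch with made-up type names and a toy
assignability rule, nothing from harec or the hosted checker), such
table-driven tests could look like:

```python
# Hypothetical sketch: "assignable" and the type names stand in for
# whatever the real typechecker exposes; the rule here is a toy.

def assignable(to, frm):
    """Toy assignability rule: identical types, or a small
    hard-coded set of widening conversions."""
    widening = {("i64", "i32"), ("i64", "i16"), ("i32", "i16")}
    return to == frm or (to, frm) in widening

# One row per entry in the spec's compatibility table.
CASES = [
    ("i32", "i32", True),
    ("i64", "i32", True),
    ("i32", "i64", False),
    ("i16", "i32", False),
]

def run_table():
    """Return the table rows where the checker disagrees with the spec."""
    return [(to, frm) for to, frm, want in CASES
            if assignable(to, frm) != want]
```

The point being that each row is one spec table entry, so a failure points
straight at the violated rule.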
One thing bootstrap harec does somewhat well is making sure invalid code is
actually rejected - we have an established way of doing that and we're also
strict about requiring those tests during code review. There is however no way
to track which error is associated with which test, and that means tests
frequently get outdated and irrelevant. Solving this will require some
thinking.
The third, and hardest, thing to make tests for is the generated code's
runtime behavior. We actually have relatively few such tests in harec, but again, testing
this kind of thing more thoroughly would require us to write some tooling first
(and unfortunately in this case rewriting things in hare won't give us this
tooling for free). As long as you're just working on the typechecker and not
hosted codegen, you don't need to concern yourself with this part.
Hope this helps a bit :)
I agree with Bor's assessment with respect to the value of unit tests
and their applicability to the bootstrap compiler vs the hosted
compiler. However, there is also room for a different kind of test.
Essentially, unit tests are intrusive tests; Hare accommodates these
better than C. However, we can also make a non-intrusive test suite:
because Hare is a standardized language, the behavior of the compiler
should fit within an envelope defined by the specification. Given a set
of test cases that exercise a Hare compiler's conformance to the
standard plus a compiler-specific harness that can feed the test inputs
in and read the expected outputs out, one can build a conformance test
suite which is applicable to any Hare compiler (and moreover would be an
indispensable tool in building a new implementation).
This is what I wanted to build with a new Hare compiler test suite.
Thanks Bor & Drew!
> In hosted checker [calling compiler's internal functions directly from
> the tests] should be much easier fortunately. Having a lot of unit tests is imo
> more important than importing the old tests and slowly making them pass one by
> one.
I think you're right. You've convinced me to build up the (intrusive)
unit tests alongside the typechecker.
> One thing bootstrap harec does somewhat well is making sure invalid code is
> actually rejected - we have an established way of doing that and we're also
> strict about requiring those tests during code review. There is however no way
> to track which error is associated with which test, and that means tests
> frequently get outdated and irrelevant. Solving this will require some
> thinking.
I have some thoughts on how to achieve this... Hopefully today I'll have
time to flesh them out here:
https://git.sr.ht/~roselandgoose/hare2c2/tree/main/item/hare/check/errors.ha

> because Hare is a standardized language, the behavior of the compiler
> should fit within an envelope defined by the specification. Given a set
> of test cases that exercise a Hare compiler's conformance to the
> standard plus a compiler-specific harness that can feed the test inputs
> in and read the expected outputs out, one can build a conformance test
> suite which is applicable to any Hare compiler (and moreover would be an
> indispensable tool in building a new implementation).
What kinds of expected outputs do you have in mind - would you want to
capture standard out, for example, or might exit codes be sufficient?
I'm currently only picturing using 1. the exit code of each (compiler-
specific) harness, to indicate either successful compilation or
rejection during translation, and 2. the exit code of the compiled
binary, to indicate execution phase aborts and (relying on the design of
the test inputs) computation results and side-effects.
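To make that concrete, here's a rough Python sketch of what I'm picturing.
To be clear, check_case, compile_cmd, and run_cmd are all made-up names,
not an actual harness:

```python
import subprocess

def check_case(compile_cmd, expect_reject=False, run_cmd=None, expect_exit=0):
    """Judge one conformance case using only the two exit codes above.

    compile_cmd: the compiler-specific harness; a nonzero exit code
    means the test input was rejected during translation.
    run_cmd: the compiled binary (None for rejection cases); its exit
    code encodes execution-phase aborts and, by design of the test
    input, computation results and side effects.
    """
    cc = subprocess.run(compile_cmd).returncode
    if expect_reject:
        return cc != 0
    if cc != 0:
        return False
    return subprocess.run(run_cmd).returncode == expect_exit
```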
Let me know if I'm already off course.
Cheers,
RKL
Hi Bor,
>> One thing bootstrap harec does somewhat well is making sure invalid code is
>> actually rejected - we have an established way of doing that and we're also
>> strict about requiring those tests during code review. There is however no way
>> to track which error is associated with which test, and that means tests
>> frequently get outdated and irrelevant. Solving this will require some
>> thinking.
> I have some thoughts on how to achieve this... Hopefully today I'll have
> time to flesh them out here:
> https://git.sr.ht/~roselandgoose/hare2c2/tree/main/item/hare/check/errors.ha
Well I ran out of time last week, but I have now prototyped some
riggings. Let me know what sorts of cases the following would be
insufficient for:
https://git.sr.ht/~roselandgoose/hare2c2/tree/main/item/hare/check/
I've got an error type with an identifying enum that is (hopefully) very
easy to add to (with a code-gen'd errcode_tostr function to get the doc
comment for each enum value). When compiled with +test, the error type
also contains the pc of the function that initialized it. Then finally
I have a function which wraps debug::symbol_by* to check if an error was
initialized by a given function.
I figure a test which asserts that the resulting error is (or contains)
an error with a given enum value, that was initialized by a particular
function, should keep rejection tests fairly tightly tied to the errors
they expect. Perhaps overly so? Would appreciate any harec maintainer's
thoughts.
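As a rough analogy in Python (made-up names throughout; the inspect module
stands in for the recorded pc plus the debug::symbol_by* lookup), the
scheme looks something like:

```python
import inspect

class CheckError(Exception):
    """Analog of the prototype's error type: an identifying code plus
    the name of the function that constructed the error (standing in
    for the stored pc and the symbol lookup)."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code
        # Record which function initialized this error.
        self.origin = inspect.stack()[1].function

def check_assignment(to, frm):
    """Toy checker rule that rejects mismatched assignments."""
    if to != frm:
        raise CheckError("error::assignability")

def expect_error(fn, code, origin):
    """Rejection-test helper: assert both the error's enum value and
    the function that initialized it."""
    try:
        fn()
    except CheckError as e:
        return e.code == code and e.origin == origin
    return False
```

A rejection test then pins down both which error fired and where it came
from, which is the tight coupling I was describing.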
Cheers,
RKL
Hm, that is an interesting, albeit hacky, approach. I think it might be
a bit too intrusive/deep. If we rename a function, split its code up
across several functions, or do other kinds of refactoring, we shouldn't
necessarily see conformance tests start failing.
Still of the opinion that the compiler internals should be mostly opaque
to the test suite.
Thanks for taking a look at this Drew, although I know it's now been a
while. I'm bumping this thread now, in part, because ~spxtr has been
asking about conformance tests on IRC.
>> https://git.sr.ht/~roselandgoose/hare2c2/tree/44823c1d9e3dc7f3c5749b60e9f571014f71f99d/item/hare/check
(link updated to reflect the status of the code at the time)
>> I've got an error type with an identifying enum that is (hopefully) very
>> easy to add to (with a code-gen'd errcode_tostr function to get the doc
>> comment for each enum value).
>
>> I figure a test which asserts that the resulting error is (or contains)
>> an error with a given enum value, that was initialized by a particular
>> function, should keep rejection tests fairly tightly tied to the errors
>> they expect. Perhaps overly so? Would appreciate any harec maintainer's
>> thoughts.
> Hm, that is an interesting, albeit hacky, approach. I think it might be
> a bit too intrusive/deep. If we rename a function, split its code up
> across several functions, or do other kinds of refactoring, we shouldn't
> necessarily see conformance tests start failing.
> Still of the opinion that the compiler internals should be mostly opaque
> to the test suite.
It's good to have more experienced eyes on my code vis-à-vis hackiness...
My aim with the intrusiveness of that code (storing the pc of the error
initializer) was to address this note from Bor about unit tests:
> There is however no way to track which error is associated with which
> test, and that means tests frequently get outdated and irrelevant.
> Solving this will require some thinking.
It seems more thinking is required. *shrug*
Anyways, spxtr: I'm curious if you have any thoughts on conformance test
outputs?
>> because Hare is a standardized language, the behavior of the compiler
>> should fit within an envelope defined by the specification. Given a set
>> of test cases that exercise a Hare compiler's conformance to the
>> standard plus a compiler-specific harness that can feed the test inputs
>> in and read the expected outputs out, one can build a conformance test
>> suite which is applicable to any Hare compiler (and moreover would be an
>> indispensable tool in building a new implementation).
> What kinds of expected outputs do you have in mind - would you want to
> capture standard out, for example, or might exit codes be sufficient?
> I'm currently only picturing using 1. the exit code of each (compiler-
> specific) harness, to indicate either successful compilation or
> rejection during translation, and 2. the exit code of the compiled
> binary, to indicate execution phase aborts and (relying on the design of
> the test inputs) computation results and side-effects.
On IRC you mentioned testing granularity being desirable. I've been
operating under the assumption that testing the behavior of an
incomplete compiler with any granularity is best done with (intrusive)
unit tests.
This notion came from the difficulty I had trying to use the harec test
suite while developing my (previous) prototype of a hosted typechecker.
Even the tests written _specifically_ to test a given language feature
(an expression kind, a particular type-related algorithm, etc.)
typically involved several other language features, often implicitly.
I imagine that the fewer outputs a conformance test API has, the more
language features (and cleverness) are required for test writing. This
suggests to me that any testing granularity that can be achieved by a
non-intrusive test suite will only be of use to fairly full-fledged
compiler implementations - compilers that generate code for most
language features, anyways. And inversely that it wouldn't be very
useful to either of our projects just yet.
(Though I am very very excited by your particular project!! :D )
I'm interested to hear any thoughts you have on the topic.
Sincerely, and with enthusiasm,
Rosie Keith Languet
I want a test suite that can tell me that a given program is a
conformant Hare implementation according to the spec.
> I'm currently only picturing using 1. the exit code of each (compiler-
> specific) harness, to indicate either successful compilation or
> rejection during translation, and 2. the exit code of the compiled
> binary, to indicate execution phase aborts and (relying on the design
> of the test inputs) computation results and side-effects.
This is exactly what I had in mind.
The Hare spec makes very limited claims about how Hare programs must
output information. There's nothing about being required to have a print
function, or anything like that. Instead, 5.5 says that if constraints
are violated, this should be indicated in "whatever manner is
semantically appropriate", which for a Unix program means a nonzero exit
code. This is an easy output for conformance tests to check.
The spec also says that the implementation shall display diagnostic
messages, both for translation and execution phase errors. This is
difficult to test because it's (purposefully) underspecified. "lol array
access oob" is a conformant diagnostic message, as is a proper stack
trace with line numbers and such. (I would argue that this requirement
should be dropped to a "may" for the execution phase, but w/e).
For translation time, the spec states "shall display an error indicating
which constraint was invalidated". Unless we standardize on an error
message format, this is again underspecified and difficult to test for
an opaque compiler.
For a test that can answer whether *any* program is a conformant Hare
implementation, exit codes are really all that you can go off of.
> On IRC you mentioned testing granularity being desirable. I've been
> operating under the assumption that testing the behavior of an
> incomplete compiler with any granularity is best done with (intrusive)
> unit tests.
I think I know what you mean. The conformance test suite I have in mind
is not a proper unit test from the compiler's perspective: it mushes
lexing, parsing, checking, eval, and codegen all into one pass/fail.
So in that sense it is basically as non-granular as a test suite could be.
By granular I mean eg rather than one big test for for loops (like
12-loops.ha), there are test cases for each kind of accumulator loop
(for (true), for (let i = 0z; i < 10; i += 1), etc), a test case for
for-each with .., another for &.., another for =>, another involving
break, another for break with label, another with continue, and so on.
This granularity is not strictly necessary, but it makes it way easier
to bring up a new Hare implementation (eg can test old-school for loops
even before implementing tagged unions, which are needed for iterator
loops).
> This notion came from the difficulty I had trying to use the harec
> test suite while developing my (previous) prototype of a hosted
> typechecker. Even the tests written _specifically_ to test a given
> language feature (an expression kind, a particular type-related
> algorithm, etc.) typically involved several other language features,
> often implicitly.
Same.
I would like each test to only require the feature under test, but this
is impossible in practice. For instance, a possible test for for loops
would be
export fn main() void = {
	let sum = 0;
	for (let i = 1; i <= 9; i += 1) {
		sum += i;
	};
	assert(sum == 45);
};
This supposed test for for expressions also has a function declaration,
two compound expressions, binary arithmetic and comparisons, an assert
statement, a superfluous binding, and so on.
(There will be a test harness that makes it so you don't need to write
out the full program for each test case).
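For illustration, such a wrapper could be as simple as this Python sketch
(TEMPLATE and wrap are invented names, not the actual harness):

```python
# Each test case supplies only the body; the harness expands it into a
# complete Hare program before handing it to the compiler under test.
TEMPLATE = """export fn main() void = {{
\t{body}
}};
"""

def wrap(body):
    """Expand a bare test body into a full Hare program."""
    return TEMPLATE.format(body=body)
```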
I think this is fine. We should try to write conformance tests to use
only the language features that are necessary (and ideally "simpler"
than the feature you are testing), but accept that complete granularity
is not possible for this type of test. The conformance test suite is not
a complete replacement for intrusive unit testing.
AFAICT there are two main issues with this test type.
1. Many details of the spec can only be tested indirectly. There's no
explicit assert_type(expr, int), or assert_flexible_bounds(expr, lo,
hi), and so on. I suspect these can all be worked around, but I haven't
looked at all the cases, and the workarounds might not be pretty.
2. Basically all of the tests really ought to be duplicated as
intrusive unit tests for a given implementation. Proper unit tests can
and should check flexible bounds directly, types, etc.
The dream would be to somehow write the tests in a structured format so
that effort doesn't need to be duplicated between intrusive unit tests
and opaque conformance tests. I doubt that this is possible to do
conveniently.
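Just to sketch what I mean (purely speculative Python; every field name
and error code here is invented), such a structured format might look
like:

```python
# One case description serving both runners: an opaque conformance
# runner reads only the source and exit-code expectations, while an
# intrusive unit-test runner can additionally assert internal detail.
CASES = [
    {
        "name": "for-accumulator",
        "body": "let sum = 0; for (let i = 1; i <= 9; i += 1) sum += i; assert(sum == 45);",
        "translate": "accept",
        "exit": 0,
    },
    {
        "name": "assign-mismatch",
        "body": 'let x: int = "oops";',
        "translate": "reject",
        "error": "assignability",
    },
]

def conformance_view(case):
    """What an opaque runner can use: source plus exit-code expectations."""
    return case["body"], case["translate"], case.get("exit")

def unit_view(case):
    """Extra detail only an intrusive runner can assert (error codes)."""
    return case.get("error")
```

The duplication problem then reduces to writing each case once and
projecting out the two views, though whether that stays convenient at
scale is exactly what I'm unsure about.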
I'll put together a quick test harness and a handful of tests and see
how it goes.
:),
spxtr