~vdupras/duskos-discuss

Feedback on your new README

Message ID: <69cfdb97-49d6-dccd-c7bb-df3b40eab7e6@gmx.de>

Hello Virgil,


> So, unless someone tell me about some option I don't know about, DuskCC is quite
> innovative on the aspect of self-hosting path length.

I took this as an invitation to comment on your README :)

Just to be clear, even with my figures, DuskOS is still pretty
innovative on the aspect of self-hosting path length.


> ### Shortest path to self-hosting for a C compiler
>
> Dusk OS self-hosts in about 1000 lines of assembly and a few hundred lines of
> Forth (the exact number depends on the target machine). From there, it
> bootstraps to a C compiler, which is roughly 3000 lines of Forth code (including
> arch-specific backend and assembler). To my knowledge, Dusk OS is unique in that
> regard.

The main question here is whether your compiler is a "C compiler" or a "C
subset compiler". The usual metric is to try to compile and run some
simple but realistic C programs written by others, and check whether they
compile and run without errors. If they do not, the question becomes how
large a patch is needed to make them compile and run, while still running
well when compiled with other C compilers.

If you want to take the challenge, some programs that are easy to test yet
useful (algorithmic) are the following tools from mescc-tools-extra. That
project contains tools that, while not essential during bootstrapping, are
quite useful, and that use a subset of C which most or even all C subset
compilers (e.g. M2-Planet, 8cc, chibicc) can compile without trouble.

https://github.com/oriansj/mescc-tools-extra
- ungz.c
- untar.c
- sha256sum.c
- sha3sum.c

You may notice that there is no xz decoder included, and this is because
most XZ decoders use more advanced C features which those subset C
compilers don't support.

<https://github.com/schierlm/xzdec-min> (shameless plug) is a version I
stripped down some time ago to be able to build with (some older version
of) tinycc. Getting this (or another version) to compile and work would
be quite impressive (and probably also useful, as people might find xz
compressed archives everywhere).

> You can pick any C compiler that requires POSIX and it will automatically
> require order of magnitudes more lines of code to bootstrap because you need
> that POSIX system in addition to the C compiler. So even if you pick a small C
> compiler such as tcc, you still need a POSIX system to build it, which is
> usually in the millions of LOCs.

Here I would rather look at small POSIX systems and not at the usual ones.
Speaking of the x86 architecture, Fiwix (fiwix.org) is mostly Linux
compatible, yet still consists of fewer than 50K LOC of C and assembly.

And when you look at those smaller C (subset) compilers, if you add an
assembler and/or a small C library (if they don't include one), they
weigh in at around 20k to 30k LOC of C code each (for supporting only one
architecture, x86, but that is currently the same with Dusk OS).

There are also ambitions to run M2-Planet natively on x86, but they do
not work yet, so I would not count them here. M2-Planet built for the
UEFI boot services environment does work, however. But I am not sure if
you would have to count the firmware size in that case.

> You can try to dive further down the history lane for even simpler systems such
> as CP/M and BDS C. But even then, you're still looking at 25 000 lines of
> assembler for BDS C and it's going to lack backends for modern CPUs.

I would assume that Dusk OS' C compiler would land at similar ballpark
figures (something between 15K and 30K LOC) once it supports enough of
the C standard to be able to compile foreign code with no or minimal
modifications.


Regards,


Michael

Message ID: <5db5a6f9-ab44-4e56-9d27-1fa6fc12edec@app.fastmail.com>
In-Reply-To: <69cfdb97-49d6-dccd-c7bb-df3b40eab7e6@gmx.de>

On Sat, Nov 19, 2022, at 9:06 AM, Michael Schierl wrote:
> Hello Virgil,
>
>
>> So, unless someone tell me about some option I don't know about, DuskCC is quite
>> innovative on the aspect of self-hosting path length.
>
> I took this as an invitation to comment on your README :)

Of course it is, thanks for your feedback!

> Just to make clear, even with my figures, DuskOS is still pretty
> innovative on the aspect of self-hosting path length.
>
>
>> ### Shortest path to self-hosting for a C compiler
>>
>> Dusk OS self-hosts in about 1000 lines of assembly and a few hundred lines of
>> Forth (the exact number depends on the target machine). From there, it
>> bootstraps to a C compiler, which is roughly 3000 lines of Forth code (including
>> arch-specific backend and assembler). To my knowledge, Dusk OS is unique in that
>> regard.
>
> The main question here is if your compiler is a "C compiler" or a "C
> subset compiler".

It's true, "C compiler" is a misnomer. I call it an "almost C" elsewhere and I
should do so in this section too. Although, in the strict sense, DuskCC is a C
subset, it's also designed with the idea that every feature that it doesn't
implement "as-is" has a substitute, making porting of any C app convenient.

> And the usual metric is to try to compile and run some
> simple but realistic C programs written by others, and check if they
> compile and run without errors. And if not, how large a patch would be
> to make them compile and run it, [...]

I agree, that's my main tradeoff with DuskCC: simplicity vs patch size.

> [...] while still running well when compiled with other C compilers.

That, however, is not part of DuskCC design goals. A C application ported to
DuskCC is not going to compile on a regular C compiler. The semantics of the
stdlib, especially with memory allocation and I/O, are different.

> If you want to take the challenge, some easy yet useful (algorithmic) -
> and at the same time easy to test - programs would be the following
> tools from mescc-tools-extra, which is a project that contains some
> tools that while not being essential during bootstrapping, are quite
> useful, and which use a subset of C which most or even all C subset
> compilers (e.g. M2-Planet, 8cc, chibicc) can compile without trouble.
>
> https://github.com/oriansj/mescc-tools-extra
> - ungz.c
> - untar.c
> - sha256sum.c
> - sha3sum.c
>
> You may notice that there is no xz decoder included, and this is because
> most XZ decoders use more advanced C features which those subset C
> compilers don't support.
>
> <https://github.com/schierlm/xzdec-min> (shameless plug) is a version I
> stripped down some time ago to be able to build with (some older version
> of) tinycc. Getting this (or another version) to compile and work would
> be quite impressive (and probably also useful as people might find xz
> compressed archives everywhere).

Thanks for the challenge idea, I accept it :) I briefly looked at ungz.c and I
don't see any roadblock. Memory allocation and I/O will have to change, of
course, but the rest of the code looks like something that DuskCC can handle.
Such porting operations are very good opportunities to find compiler bugs.

>> You can pick any C compiler that requires POSIX and it will automatically
>> require order of magnitudes more lines of code to bootstrap because you need
>> that POSIX system in addition to the C compiler. So even if you pick a small C
>> compiler such as tcc, you still need a POSIX system to build it, which is
>> usually in the millions of LOCs.
>
> Here I would rather look at small POSIX systems and not at usual ones.
> Speaking of the x86 architecture, Fiwix (fiwix.org) is mostly Linux
> compatible, yet still consists of only less than 50K LOC of C and assembly.

Thanks for this reference, I didn't know about it and it looks very interesting.
I'll look at it soon and update my comparison references.

> And when you look at those smaller C (subset) compilers, if you add
> assembler and/or a small c-library (if they don't include one), they
> weigh in around 20k to 30k LOC of C code each (for supporting only one
> architecture, x86, but that is currently the same with Dusk OS).

Can tcc compile Fiwix? The makefile seems hardcoded on gcc and I didn't see
compiler compatibility notes in the readme. It might need some work to have a
short self-hosting path.

> There are also ambitions to run M2-Planet natively on x86, but they do
> not work yet, so I would not count them here. M2-Planet built for the
> UEFI boot services environment does work, however. But I am not sure if
> you would have to count the firmware size in that case.

Yeah, I feel I should mention M2-Planet in my comparison, but it's just a
different beast. Its goal, as I understand it, is to bootstrap to modern C from
the smallest binary seed possible, whereas my goal is to have the shortest
self-hosting loop possible.

Sure, a 1K seed is really small, but it doesn't give you a usable system. I
don't know how to compare those two projects.

I hadn't looked at M2's C compiler until now. I find this project hard to
approach, with sparse and confusing documentation. It's certainly impressive to
have an i386 C compiler in 4300 lines of assembly, but from what I see of it,
it supports quite a restricted subset of C. For example, I see in cc_x86.s:4209 that
types *and* their indirection levels are hardcoded together, supporting only
void, int, char*, char**. In that regard, DuskCC's subset seems much more
comprehensive. But maybe I'm reading it wrong...

>
>> You can try to dive further down the history lane for even simpler systems such
>> as CP/M and BDS C. But even then, you're still looking at 25 000 lines of
>> assembler for BDS C and it's going to lack backends for modern CPUs.
>
> I would assume that Dusk OS' C compiler would land at similar ballpark
> figures (something between 15K and 30K LOC) once it supports enough of
> the C standard to be able to compile foreign code with no or minimal
> modifications.

I'm still optimistic that I'll stay below the 3K (excluding arch-specific
backends and assemblers) lines for DuskCC. There aren't that many features left
to implement. There's certainly a fair number of bugs left to iron out, but
I've been steadily squashing them and I'm happy to see that those fixes don't
affect complexity much. As long as I don't hit a bug uncovering a major design
oversight, I should be good.

Regards,
Virgil

Message ID: <74f586ac-299f-a267-5b8f-d80fb7c934e5@gmx.de>
In-Reply-To: <5db5a6f9-ab44-4e56-9d27-1fa6fc12edec@app.fastmail.com>

Hello Virgil,


Am 19.11.2022 um 16:52 schrieb Virgil Dupras:

>> [...] while still running well when compiled with other C compilers.
>
> That, however, is not part of DuskCC design goals. A C application ported to
> DuskCC is not going to compile on a regular C compiler. The semantics of the
> stdlib, especially with memory allocation and I/O, are different.

Hmm. If the differences are just in the libs, they could probably be worked
around with a different include path or some #ifdefs? (M2libc also comes
with bootstrappable.{c,h}, which includes a C implementation of the
M2-Planet builtin functions so that the program compiles with other C
compilers that don't have them).
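
A minimal sketch of the #ifdef approach, assuming a hypothetical macro `HYPOTHETICAL_SUBSET_CC` that only the subset compiler's build would define. The helper names `scratch_alloc`, `scratch_free` and `sum_via_scratch` are made up for illustration; they are not part of DuskCC, M2libc or bootstrappable.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifdef HYPOTHETICAL_SUBSET_CC
/* Assumption: the subset compiler's build defines this macro and supplies
   its own scratch_alloc()/scratch_free() on top of the target system. */
#error "provide the target's own scratch_alloc()/scratch_free() here"
#else
/* Hosted build: map the same two helpers onto the standard library, so the
   program can be run under a debugger, coverage tools or valgrind. */
static void *scratch_alloc(size_t n) { return calloc(1, n); }
static void scratch_free(void *p) { free(p); }
#endif

/* Application code only calls the helpers, never calloc() directly.
   Returns the sum of the n bytes at src, or -1 if allocation fails. */
static long sum_via_scratch(const unsigned char *src, size_t n)
{
    unsigned char *tmp;
    long sum = 0;
    size_t i;

    assert(src != NULL);
    tmp = scratch_alloc(n ? n : 1);   /* avoid a zero-sized request */
    if (tmp == NULL)
        return -1;
    memcpy(tmp, src, n);              /* work on a private copy */
    for (i = 0; i < n; i++)
        sum += tmp[i];
    scratch_free(tmp);
    return sum;
}
```

With this layout, the code that differs between the two environments is confined to the `#ifdef` block, and everything below it is shared.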

From my point of view, the advantage of programming in a C subset over
using a different procedural language like Oberon is that, for
debugging, you can put the program in your favourite C compiler and then
use all the tools you are used to (i.e. debugger, test coverage
analysis, valgrind) on it to find the bugs. And that option will be
gone if it "is not going to compile" there.

> Thanks for the challenge idea, I accept it :) I briefly looked a ungz.c and I
> don't see any road blocker. Memory allocation and I/O will have to change, of
> course, but the rest of the code look like something that DuskCC can handle.

I'm interested in seeing the final diff, in case I decide to port
something to DuskCC myself.

>> Here I would rather look at small POSIX systems and not at usual ones.
>> Speaking of the x86 architecture, Fiwix (fiwix.org) is mostly Linux
>> compatible, yet still consists of only less than 50K LOC of C and assembly.
>
> Can tcc compile it? The makefile seems hardcoded on gcc and I didn't see
> compiler compatibility notes in the readme. It might need some work to have a
> short self-hosting path.

If you had asked this question a few days ago, I'd have had to say "I don't
know." But
<https://logs.guix.gnu.org/bootstrappable/2022-11-18.log#205903> makes
me believe the answer is yes.

(I'm curious whether rickmasters will try to upstream their changes to
the Fiwix code or if they prefer to maintain a fork. In any case, "it
has been done")

> Yeah, I feel I should mention M2-Planet in my comparison, but it's just a
> different beast. Its goal, as I understand it, is to bootstrap to modern C from
> the smallest binary seed possible,

That would be the goal of the stage0 project. It tries not only to use
the smallest binary seed possible, but also to minimize the amount of
"original" (probably not very well reviewed) additional source code,
which is why it uses other projects like GNU Mes and M2-Planet on its
bootstrapping path, resulting in a path that is not the shortest possible.

M2-Planet is one step of that project (in particular a self-hosting,
reproducible C subset (cross) compiler written in its own C subset, which
also requires the M1-Macro assembler and the hex2 linker, both of which can
be built with M2-Planet as well). M2libc is the C library used for it.

Stage0 uses M2-Planet as one of its steps, by having minimal C compilers
written in assembly for its supported architectures that are written to
just compile M2-Planet and nothing else.

(The next step is to compile GNU Mes (scheme interpreter) with
M2-Planet, then use mescc (C compiler written in scheme) to compile tinycc).

> Sure, a 1K seed is really small, but it doesn't give you a usable system. I
> don't know how to compare those two projects.

In fact, tools like sectorforth or sectorlisp are equally small and
probably give you a more usable system (on their own) than stage0's
seed. It is just that nobody tried to bootstrap to modern C from there
(which is certainly possible if you invest enough effort).

Stage0 is more about picking what is there and building some minimal
glue between it and its own binary seed.

> I hadn't looked at M2's C compiler until now.

I'd start at https://github.com/oriansj/M2-Planet :)

> I find this project hard to
> approach, with sparse and confusing documentation. It's certainly impressive to
> have a i386 C compiler in 4300 lines of assembly, but from what I see of it,
> it's a quite restricted subset of it. For example, I see in cc_x86.s:4209

Hmm. You are looking at a different cc_x86.s than me
(https://github.com/oriansj/stage0-posix-x86/blob/master/NASM/cc_x86.S#L4209).
Yet from the file name, this is not M2-Planet but the "compiler" to
bootstrap it on x86 posix.

M2-Planet is more like 15k lines of C code (for all supported
architectures combined, excluding C library and assembler).

> I'm still optimistic that I'll stay below the 3K (excluding arch-specific
> backends and assemblers) lines for DuskCC. There aren't that many features left
> to implement. There's certainly a fair chunks of bugs left to iron out, but
> I've been steadily squashing them and I'm happy to see that those fixes don't
> affect complexity much. As long as I don't hit a bug uncovering a major design
> oversight, I should be good.


Good luck :)


Michael

Message ID: <51dafc24-020e-415f-8c06-698eced3dcee@app.fastmail.com>
In-Reply-To: <74f586ac-299f-a267-5b8f-d80fb7c934e5@gmx.de>

On Sat, Nov 19, 2022, at 7:20 PM, Michael Schierl wrote:
>> I hadn't looked at M2's C compiler until now.
>
> I'd start at https://github.com/oriansj/M2-Planet :)

Sorry, I mixed them up. I was indeed looking at stage0's C compiler and I didn't
know that its design goal was to minimally compile M2-Planet.

Thanks for your explanation of the relationship between all those projects. It's
much clearer to me now.

Virgil

Message ID: <a534fbda-37fb-4db6-aa38-8c31f1bbea64@app.fastmail.com>
In-Reply-To: <69cfdb97-49d6-dccd-c7bb-df3b40eab7e6@gmx.de>

On Sat, Nov 19, 2022, at 9:06 AM, Michael Schierl wrote:
> The main question here is if your compiler is a "C compiler" or a "C
> subset compiler". And the usual metric is to try to compile and run some
> simple but realistic C programs written by others, and check if they
> compile and run without errors. And if not, how large a patch would be
> to make them compile and run it, while still running well when compiled
> with other C compilers.
>
> If you want to take the challenge, some easy yet useful (algorithmic) -
> and at the same time easy to test - programs would be the following
> tools from mescc-tools-extra, which is a project that contains some
> tools that while not being essential during bootstrapping, are quite
> useful, and which use a subset of C which most or even all C subset
> compilers (e.g. M2-Planet, 8cc, chibicc) can compile without trouble.
>
> https://github.com/oriansj/mescc-tools-extra
> - ungz.c
> - untar.c
> - sha256sum.c
> - sha3sum.c

I've just pushed a version of ar/ungz.c that contains most of the code of the
original and DuskCC can compile it without error. I haven't yet verified that
the resulting binaries are correct though. Still a WIP.

Although I'm not finished porting ungz, I'd like to come back early to
this discussion. It's true that patch size is a good measure of
how practical DuskCC is, but it's going to be artificially high in the case of ungz
because, from what I can understand, M2-Planet CC is quite limited and ungz has
been borked to work with this compiler.

Yes, having a simpler syntax is supposed to make the compiler's job easier, but
in this case, it transforms a stack-only piece of code[1] into code full of
calls to calloc(), and Dusk has no malloc().
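
To illustrate the difference, here is a small sketch in standard C of the same bit-reading state kept on the stack and on the heap. It is loosely modelled on the kind of LSB-first bit reader a deflate decoder needs; the names `bitstate`, `take_bits`, `first_bits_stack` and `first_bits_heap` are hypothetical and not taken from puff.c or ungz.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Decoder state: input cursor plus a small bit buffer. */
struct bitstate {
    const unsigned char *in;
    size_t inlen;           /* bytes available at in */
    size_t incnt;           /* bytes consumed so far */
    unsigned long bitbuf;   /* bits fetched but not yet returned */
    int bitcnt;             /* number of valid bits in bitbuf */
};

/* Return the next `need` bits (0..16), least significant bit first,
   or -1 if the input runs out. */
static long take_bits(struct bitstate *s, int need)
{
    unsigned long val = s->bitbuf;

    assert(need >= 0 && need <= 16);
    while (s->bitcnt < need) {
        if (s->incnt == s->inlen)
            return -1;
        val |= (unsigned long)s->in[s->incnt++] << s->bitcnt;
        s->bitcnt += 8;
    }
    s->bitbuf = val >> need;
    s->bitcnt -= need;
    return (long)(val & ((1UL << need) - 1));
}

/* Stack-only variant: the state is a local variable, no allocator needed. */
static long first_bits_stack(const unsigned char *in, size_t n, int need)
{
    struct bitstate s = { in, n, 0, 0, 0 };
    return take_bits(&s, need);
}

/* Heap variant: same logic, but it now depends on calloc()/free(). */
static long first_bits_heap(const unsigned char *in, size_t n, int need)
{
    struct bitstate *s = calloc(1, sizeof *s);
    long r;

    if (s == NULL)
        return -1;
    s->in = in;
    s->inlen = n;
    r = take_bits(s, need);
    free(s);
    return r;
}
```

Both variants return the same values; only the second one requires the target system to provide an allocator.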

I haven't begun debugging the resulting functions yet, but I don't think I'll do
it. I'll start over directly from zlib instead. That being said, I think that
the current diff between ar/ungz.c and the original gives a good idea of
DuskCC's capabilities.

Regards,
Virgil

[1]: https://github.com/madler/zlib/blob/master/contrib/puff/puff.c

Message ID: <238053a4-239f-4df1-abbc-875a4143cba5@app.fastmail.com>
In-Reply-To: <a534fbda-37fb-4db6-aa38-8c31f1bbea64@app.fastmail.com>

On Mon, Nov 28, 2022, at 3:20 PM, Virgil Dupras wrote:
> I haven't begun debugging the resulting functions yet, but I don't think I'll do
> it. I'll start over directly from zlib instead.

How long has it been? An hour? Yeah, it was much easier this time around[1] and
the diff is much closer to the original.

(compile only though, haven't actually tested it)

[1]: https://git.sr.ht/~vdupras/duskos/commit/dbf1f13

Message ID: <0a60f31f-2e2c-4431-909b-7fe0d63f5f81@app.fastmail.com>
In-Reply-To: <238053a4-239f-4df1-abbc-875a4143cba5@app.fastmail.com>

and... mission accomplished[1]! with a DuskCC line count of... 1166 ;)

Sure, there's a few TODOs in the ported puff.c, but nothing major. Once that's
done, the diff with the original will be pretty slim!

C'mon Michael, color yourself impressed :)

[1]: https://git.sr.ht/~vdupras/duskos/commit/156823c

Message ID: <31b30807-f1f6-e596-0827-91529520c603@gmx.de>
In-Reply-To: <0a60f31f-2e2c-4431-909b-7fe0d63f5f81@app.fastmail.com>

Hello,


Am 29.11.2022 um 21:29 schrieb Virgil Dupras:
> and... mission accomplished[1]! with a DuskCC line count of... 1166 ;)
>
> Sure, there's a few TODOs in the ported puff.c, but nothing major. Once that's
> done, the diff with the original will be pretty slim!

Looked at the diff against the original puff.c, and it does not look too
bad. The MAXBITSPLUSONE define is really ugly in my opinion, but I
think I can guess why it is there (there is no constant expression
evaluation in DuskCC, and allocating an array on the stack requires the
size of the array to be known at compile time). Other minimal C subsets
avoid this by not allowing array allocations on the stack at all (which
is no big deal if you have malloc). But I'm sure you will find a general
solution for that which does not need too many lines in DuskCC.
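
To make the point concrete, here is a sketch in standard C, assuming (as guessed above) that the port replaces array sizes written as `MAXBITS + 1` with a hand-computed `MAXBITSPLUSONE`. `MAXBITS` is 15 in upstream puff.c; the function names and the `maxbits_check` typedef are hypothetical:

```c
#include <assert.h>

#define MAXBITS 15          /* maximum bits in a code, as in upstream puff.c */
#define MAXBITSPLUSONE 16   /* hand-computed MAXBITS + 1 (assumed workaround) */

/* Compile-time guard for regular C compilers: the array size becomes
   negative, and compilation fails, if the two defines ever drift apart. */
typedef char maxbits_check[(MAXBITSPLUSONE == MAXBITS + 1) ? 1 : -1];

/* Variant A: the array size is a constant expression the compiler folds.
   Every length[i] must be in 0..MAXBITS. Returns the longest length used. */
static int longest_length_a(const short *length, int n)
{
    short count[MAXBITS + 1];
    int len, sym, max = 0;

    for (len = 0; len <= MAXBITS; len++)
        count[len] = 0;
    for (sym = 0; sym < n; sym++) {
        assert(length[sym] >= 0 && length[sym] <= MAXBITS);
        count[length[sym]]++;
    }
    for (len = 1; len <= MAXBITS; len++)
        if (count[len])
            max = len;
    return max;
}

/* Variant B: identical logic, but the size is a plain literal, so the
   compiler never has to evaluate an expression while parsing the type. */
static int longest_length_b(const short *length, int n)
{
    short count[MAXBITSPLUSONE];
    int len, sym, max = 0;

    for (len = 0; len <= MAXBITS; len++)
        count[len] = 0;
    for (sym = 0; sym < n; sym++) {
        assert(length[sym] >= 0 && length[sym] <= MAXBITS);
        count[length[sym]]++;
    }
    for (len = 1; len <= MAXBITS; len++)
        if (count[len])
            max = len;
    return max;
}
```

The cost of variant B is that the second define has to be kept in sync by hand, which the typedef guard at least catches when the file is built with a regular C compiler.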

Now, for something simple, replace blk.fs in your disk image by
blk.fs.gz and try to get it decompressed so that it can be used in the
cos emulator :D


Regards,


Michael

Message ID: <63f7cd88-7233-491b-b519-21936a366a3d@app.fastmail.com>
In-Reply-To: <31b30807-f1f6-e596-0827-91529520c603@gmx.de>

On Tue, Nov 29, 2022, at 5:04 PM, Michael Schierl wrote:
> Hello,
>
>
> Am 29.11.2022 um 21:29 schrieb Virgil Dupras:
>> and... mission accomplished[1]! with a DuskCC line count of... 1166 ;)
>>
>> Sure, there's a few TODOs in the ported puff.c, but nothing major. Once that's
>> done, the diff with the original will be pretty slim!
>
> Looked at the diff against original puff.c, and it does not look too
> bad. Really ugly in my opinion is the MAXBITSPLUSONE define, but I
> assume I can guess why this is here (there is no constant expression
> evaluation in DuskCC, and allocating an array on the stack requires the
> size of the array to be known at compile time).

Actually DuskCC already optimizes constant expressions away, just not at the
type parsing level. Solving this TODO is only a matter of reorganizing the code
a little bit.

> Now, for something simple, replace blk.fs in your disk image by
> blk.fs.gz and try to get it decompressed so that it can be used in the
> cos emulator :D

I don't see why it wouldn't be possible. It would certainly be a good test for
ungz though. But before I deal with bigger workloads, I want to change the puff
in-memory algorithm into a streaming one. I really take offense to the waste
represented by these in-memory buffers for an algo that supports streaming by
design.
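
As a rough illustration of what a streaming output stage can look like, here is a sketch in standard C. It relies on the fact that deflate back-references never reach further back than 32 KiB, so a fixed ring buffer of that size is enough and each byte can be handed to a consumer as soon as it is produced. All names (`window`, `window_put`, `window_copy`, `collector`, `collect_byte`) are hypothetical, and this is not the code from the Dusk OS commit:

```c
#include <assert.h>
#include <stddef.h>

#define WSIZE 32768u   /* maximum deflate back-reference distance */

typedef void (*sink_fn)(void *ctx, unsigned char byte);

struct window {
    unsigned char buf[WSIZE];   /* last WSIZE bytes of output */
    size_t pos;                 /* next write position in buf */
    size_t total;               /* bytes produced so far */
    sink_fn sink;               /* consumer called for every byte */
    void *ctx;
};

static void window_init(struct window *w, sink_fn sink, void *ctx)
{
    w->pos = 0;
    w->total = 0;
    w->sink = sink;
    w->ctx = ctx;
}

/* Emit one literal byte: remember it in the ring and pass it on. */
static void window_put(struct window *w, unsigned char byte)
{
    w->buf[w->pos] = byte;
    w->pos = (w->pos + 1) % WSIZE;
    w->total++;
    w->sink(w->ctx, byte);
}

/* Emit len bytes copied from dist bytes back. Source and destination may
   overlap, which is how deflate encodes runs. Returns -1 on a bad distance. */
static int window_copy(struct window *w, size_t dist, size_t len)
{
    if (dist == 0 || dist > WSIZE || dist > w->total)
        return -1;
    while (len--) {
        size_t from = (w->pos + WSIZE - dist) % WSIZE;
        window_put(w, w->buf[from]);
    }
    return 0;
}

/* Example consumer for demonstration: stores bytes in a small array. */
struct collector {
    unsigned char data[64];
    size_t n;
};

static void collect_byte(void *ctx, unsigned char byte)
{
    struct collector *c = ctx;
    assert(c->n < sizeof c->data);
    c->data[c->n++] = byte;
}
```

Emitting `a`, `b`, `c` with `window_put()` and then calling `window_copy(&w, 3, 6)` delivers `abcabcabc` to the sink. In a real port the sink would write to a file or device, and memory use stays at the size of the window regardless of the size of the archive.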

Regards,
Virgil

Message ID: <4cdc757c-f65a-4c7b-926a-02878df039e1@app.fastmail.com>
In-Reply-To: <63f7cd88-7233-491b-b519-21936a366a3d@app.fastmail.com>

better now: https://git.sr.ht/~vdupras/duskos/commit/fdd210a