~dmbaturin/soupault

HTML-compliant ToC and ToC data structure available to plugins

Hi everyone,

Recently I noticed that the HTML standard still doesn't allow a <ul> or
<ol> element to have another <ul> or <ol> directly inside it.
They can only have <li> elements as children, and nested lists should
go inside those <li>s.
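
To make the rule concrete, here is a rough sketch of a checker that flags any <ul> or <ol> with a direct child other than <li>. It is written in Python with the standard html.parser module purely for illustration and has nothing to do with soupault itself. The names ListNestingChecker and find_bad_nesting are made up, and it assumes every non-void element has an explicit end tag, which real-world HTML does not guarantee.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ListNestingChecker(HTMLParser):
    """Collect (parent, child) pairs where a <ul>/<ol> has a non-<li> child.

    Simplifying assumption: all non-void elements are closed explicitly.
    """

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, innermost last
        self.problems = []  # (list tag, offending child tag)

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        # Besides <li>, the standard permits <script> and <template> here.
        if parent in ("ul", "ol") and tag not in ("li", "script", "template"):
            self.problems.append((parent, tag))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def find_bad_nesting(html):
    checker = ListNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

A list nested directly in another list is reported as ("ul", "ul"), while the same list wrapped in an <li> yields no problems.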

Then I noticed that a lot of tables of contents (and nested lists in
general) around the web are non-compliant as well.
However, HTML-compliant nesting in fact has advantages for automated
rewriting.
It's also quite a bit harder to produce if you work with a flat list of
headings, which may be the reason why so many ToC's use invalid HTML.

If you have a tree of document sections, generating valid nested lists
from it is trivial. Many other things are also much simpler.
However, making a tree from a flat list isn't easy. It's pretty much an
LR(1) parser that parser generators can't help you with.
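
The easy direction, going from a tree to valid nested lists, can be sketched in a few lines. This is an illustration in Python, not soupault code. The (title, children) tuple shape and the render_list name are assumptions made for the example.

```python
from html import escape


def render_list(nodes, indent=0):
    """Render a list of (title, children) pairs as a valid nested <ul>.

    Child lists are emitted inside their parent's <li>, never as a
    sibling of it, which is what the HTML standard requires.
    """
    if not nodes:
        return ""
    pad = "  " * indent
    lines = [pad + "<ul>"]
    for title, children in nodes:
        if children:
            lines.append(f"{pad}  <li>{escape(title)}")
            lines.append(render_list(children, indent + 2))
            lines.append(f"{pad}  </li>")
        else:
            lines.append(f"{pad}  <li>{escape(title)}</li>")
    lines.append(pad + "</ul>")
    return "\n".join(lines)


tree = [("Chapter one", [("Section one", [])]), ("Chapter two", [])]
print(render_list(tree))
```

The recursion mirrors the tree directly, so there is no bookkeeping about when to open or close a list.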

I've made up a general algorithm for building a headings tree.
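
For readers who want a feel for the problem, below is a minimal stack-based sketch of one way to turn a flat list of headings into a tree. It is not necessarily the algorithm soupault uses. It is written in Python for illustration, and the (level, title) input format and the build_tree name are assumptions. Heading levels are assumed to be integers from 1 upward.

```python
def build_tree(headings):
    """Turn a flat list of (level, title) pairs into nested (title, children) nodes."""
    root = []
    # Stack of (level, children_list) for the sections that are still open.
    # The level-0 entry is a sentinel that is never popped.
    stack = [(0, root)]
    for level, title in headings:
        # Close every open section that cannot be an ancestor of this heading.
        while stack[-1][0] >= level:
            stack.pop()
        node = (title, [])
        stack[-1][1].append(node)       # attach to the nearest shallower section
        stack.append((level, node[1]))  # this heading is now the innermost open one
    return root


flat = [(1, "Chapter one"), (2, "Section one"), (1, "Chapter two")]
print(build_tree(flat))
```

With this approach a skipped level (an h3 directly under an h1) simply becomes a child of the nearest shallower heading, and a document that starts with an h2 keeps that h2 at the top level.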

First, there's now a "valid_html" boolean option in the toc widget config:

[widgets.contents]
  widget = "toc"
  valid_html = true

It's false by default for now, but I'm tempted to make it true by
default. Hristos and I tested it on our websites and couldn't see any
visual problems.
It may break some styling, but I don't think I've seen any CSS that
relies on the old nesting to work, so it may be better to change it
before anyone does that.

For a live test, I've made a collapsible tree ToC on
https://baturin.org/docs/iproute2/#Table%20of%20contents
The new option forces correct nesting. Then a plugin converts <li>
elements that have a <ul> in them to HTML5 <details>/<summary> to make
them collapsible.
https://raw.githubusercontent.com/dmbaturin/iproute2-cheatsheet/master/plugins/collapsible-list.lua

But wait, there's more! To make it easier to write custom ToC plugins,
the headings tree is also available from Lua.

A new HTML.get_headings_tree function returns a nested table like

  {{"<h1>Chapter one</h1>", {"<h2>Section one</h2>", {}}},
   {"<h1>Chapter two</h1>", {}}}

The first element of each entry is actually a reference to an element
tree item, not a string, but you get the idea.

Let me know if you think valid_html should be the new default, and
whether you have any issues with that code.