~dmbaturin/soupault

How to best unwrap an element using plugins?

Details
Message ID
<yBO6DQuH7mRnFnN8AfzMJnqu4055QOxwucDV@teff>
Let’s say I get some trash markup like

[source,html]
----
<div>
  <span><span class="useless">Hello</span></span>
</div>
----

What is the best way to remove the `.useless` element?

[source,html]
----
<div>
  <span>Hello</span>
</div>
----

I could `HTML.select(page, "span > span.useless")`, grab its inner_html, remove the child, and then insert the contents into the parent, but something about that seems off, particularly with text nodes. There is a `wrap` widget but no `unwrap`. This raises the question: what if the parent has other contents? Should everything in the parent be removed and replaced with the selected element's contents, or should the element be skipped when there are other elements and it's not purely a useless wrapper?

Follow-up question: is this something that would see enough demand to warrant extending the HTML API (and adding a widget)?

-- 
toastal ไข่ดาว | https://toast.al
PGP: 7944 74b7 d236 dab9 c9ef  e7f9 5cce 6f14 66d4 7c9e
Details
Message ID
<0370f1d3-6461-1fff-77e7-d196920365cc@baturin.org>
In-Reply-To
<yBO6DQuH7mRnFnN8AfzMJnqu4055QOxwucDV@teff>
Hi Toastal,

 >but something about it seems off—particularly with text nodes

Could you clarify: does HTML.children not consider text nodes children
in that case?
From a quick test (directly with lambdasoup), it should work as expected.

top # Soup.parse "<p>foo <span>bar</span> baz</p>"
      |> Soup.select_one "p" |> Option.get
      |> Soup.children |> Soup.to_list |> List.map Soup.to_string ;;
- : string list = ["foo "; "<span>bar</span>"; " baz"]

Or is something else going wrong?
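
To make the text-node point concrete, here is a minimal sketch, again written directly against lambdasoup rather than soupault's Lua plugin API. It relies on lambdasoup's `Soup.unwrap`, which replaces a node with its own children in place. The helper name `unwrap_all` is made up for illustration, and whether the same behaviour is reachable from a Lua plugin is not shown here.

```ocaml
(* Sketch only: uses lambdasoup directly, not soupault's Lua plugin API.
   [unwrap_all] is a hypothetical helper name. *)
let unwrap_all selector soup =
  (* Materialise the matches as a list first, so the tree is not mutated
     while the selection is still being traversed. *)
  Soup.select selector soup
  |> Soup.to_list
  |> List.iter Soup.unwrap

let () =
  let soup =
    Soup.parse
      "<div><span><span class=\"useless\">Hello <b>big</b> world</span></span></div>"
  in
  unwrap_all "span > span.useless" soup;
  (* Print the <div> so the result can be inspected: the text nodes and
     the <b> element should now sit directly inside the outer <span>. *)
  Soup.select_one "div" soup |> Option.get |> Soup.to_string |> print_endline
```

Because the children are moved as nodes rather than round-tripped through an HTML string, text nodes and nested elements keep their order.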

 >what if there are more contents in the parent?
 >Remove all contents and replace with the selected element or skip if
 >there are other elements and its not purely a uselessly wrapped element.

I think the real issue here is that the answer will be different for 
different people.
Perhaps an "unwrap" plugin should have an option to check the child 
count, e.g. max_children, by default set to 1.
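
One possible reading of that option, sketched directly with lambdasoup rather than as an actual soupault plugin: unwrap a matched element only when its parent has at most `max_children` child nodes. The function name `unwrap_if_sole` is hypothetical, and counting all child nodes of the parent (text nodes included) is an assumption; a real plugin might count only elements, or count something else entirely.

```ocaml
(* Sketch only. [unwrap_if_sole] is a hypothetical name, and this reads
   "child count" as the number of child nodes (elements and text)
   of the wrapper's parent. *)
let unwrap_if_sole ?(max_children = 1) selector soup =
  Soup.select selector soup
  |> Soup.to_list
  |> List.iter (fun wrapper ->
       match Soup.parent wrapper with
       | None -> ()  (* no parent element: leave it alone *)
       | Some parent ->
         (* Skip wrappers that share their parent with other content. *)
         if Soup.count (Soup.children parent) <= max_children then
           Soup.unwrap wrapper)

let () =
  let soup =
    Soup.parse
      "<div><span><span class=\"useless\">Hello</span></span>\
       <p>Hi <span class=\"useless\">there</span></p></div>"
  in
  unwrap_if_sole "span.useless" soup;
  (* The first wrapper is its parent's only child and gets unwrapped;
     the second shares its parent with a text node and is kept. *)
  Soup.select_one "div" soup |> Option.get |> Soup.to_string |> print_endline
```

Note that whitespace-only text nodes also count as children under this reading, so indented markup may need a higher limit or an element-only count.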

 >is this something that would see enough demand to warrant extending
 >the HTML API (and widget?)

I've been quite reluctant to add new built-in widgets ever since the Lua
API was implemented.
Since you can't easily remove widgets or change their behavior, I think 
they should always start their life as plugins,
and if the demand is high and the design seems right, be converted to 
built-ins.

Adding new functions to the HTML API is less of a problem if they are 
granular and their design is uncontroversial,
but I'm not sure yet if the unwrap functionality is uncontroversial.

I think it should be a plugin at first. I'll happily add it to the
plugin list if you send it to me.


On 11/16/21 22:29, toastal wrote:
> Let’s say I get some trash markup like
>
> [source,html]
> ----
> <div>
>    <span><span class="useless">Hello</span></span>
> </div>
> ----
>
> What is the best way to remove the `.useless` element?
>
> [source,html]
> ----
> <div>
>    <span>Hello</span>
> </div>
> ----
>
> I could `HTML.select(page, "span > span.useless")` and then grab its inner_html, remove the child, and then spit the contents into the parent, but something about it seems off—particularly with text nodes. There is a `wrap` widget but not `unwrap`. This would beg the question: what if there are more contents in the parent? Remove all contents and replace with the selected element or skip if there are other elements and its not purely a uselessly wrapped element.
>
> Follow up question: is this something that would see enough demand to warrant extending the HTML API (and widget?)?
>