~lioploum/offpunk-devel

Bundle unmerdify with offpunk v1 PROPOSED

This patch supersedes the prior one I sent regarding the unmerdify
integration. After discussing with Ploum, I’ve decided to put unmerdify
in a single file bundled directly with offpunk to simplify dependency
and file management.

I’ve tried to keep the changes to the offpunk files as small as
possible, and I’ve reverted some non-required changes I made in the
previous patch to ease review.

For unmerdify to work, you first need to clone this repository to your
disk: https://github.com/fivefilters/ftr-site-config.

Then provide the path to offpunk:

    ./offpunk.py --ftr-site-config /path/to/ftr-site-config


You should be able to `go https://ploum.net` and see by default the
expected homepage content instead of the truncated one that is displayed
without `unmerdify`.

Vincent Jousse (1):
  feat: add unmerdify in one file

 ansicat.py   | 169 +++++++++++++--
 offpunk.py   |  52 ++---
 opnk.py      |  39 +++-
 unmerdify.py | 574 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 785 insertions(+), 49 deletions(-)
 create mode 100755 unmerdify.py

-- 
2.48.1
Hi,

I see that you're the upstream maintainer for unmerdify.

Since unmerdify.py is not included in the wheel package and is only used 
from a repo checkout, and since there's also a `ModuleNotFoundError` 
check, I assume that distro maintainers are expected to package 
unmerdify themselves.

So my questions are:

- Are the single-file and the regular versions of unmerdify 
   interchangeable? Will they be kept in sync with each other?

- Are you going to eventually tag releases for unmerdify?

- Will there be a clearer indication that unmerdify.py is a 
   third-party bundled dependency?

Cc-ing offpunk-packagers list because it is relevant.
Hi Anna,

Thanks for raising this.

Vincent recently managed to put all of unmerdify into a single Python 
file, which is nice.

My offer is to merge unmerdify into the offpunk project as an offpunk 
component. That would greatly reduce the packaging burden.

But, of course, this is something Vincent should consider and think 
about, as I don’t know his plans for the future of unmerdify or whether 
he plans to use it in other, unrelated projects.

But, as unmerdify is still not used in Offpunk, this is not an urgent 
task.

Ploum

On 25 Mar 08 06:49, Anna Vyalkova wrote:
Hi,

Those are great questions that we indeed need to address, and I’m open 
to suggestions here.

My first approach was to only provide the regular version of unmerdify, 
tag versions, and make it a dependency of offpunk.

After some discussions with Ploum, who wanted as few dependencies as 
possible for offpunk and was not convinced by the multi-file approach of 
unmerdify, I tried to bundle unmerdify with offpunk as a proof of concept.

I suppose that we now need to decide what to do with unmerdify. Do we 
make it a dependency of offpunk? And if so, do we keep the one-file 
approach or not?

If we decide to bundle it with offpunk, we should keep the one-file 
approach, delete the regular unmerdify repo, and move everything to the 
offpunk one. But that would make unmerdify less reusable and less 
visible to people not using offpunk.

What do you think?

On 25 Mar 16 09:14, Vincent Jousse wrote:
How do I use this?

Copy & paste the following snippet into your terminal to import this patchset into git:

    curl -s https://lists.sr.ht/~lioploum/offpunk-devel/patches/57341/mbox | git am -3

[PATCH 1/1] feat: add unmerdify in one file

---
 ansicat.py   | 169 +++++++++++++--
 offpunk.py   |  52 ++---
 opnk.py      |  39 +++-
 unmerdify.py | 574 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 785 insertions(+), 49 deletions(-)
 create mode 100755 unmerdify.py

diff --git a/ansicat.py b/ansicat.py
index dfbd30a..9ba4a76 100755
--- a/ansicat.py
+++ b/ansicat.py
@@ -23,6 +23,13 @@ try:
except ModuleNotFoundError:
    _HAS_READABILITY = False

try:
    import unmerdify

    _HAS_UNMERDIFY = True
except ModuleNotFoundError:
    _HAS_UNMERDIFY = False

try:
    # if bs4 version >= 4.11, we need to silent some xml warnings
    import bs4
@@ -183,6 +190,7 @@ class AbstractRenderer:
        self.center = center
        self.last_mode = "readable"
        self.theme = offthemes.default
        self.ftr_site_config = None

    def display(self, mode=None, directdisplay=False):
        wtitle = self.get_formatted_title()
@@ -1075,6 +1083,21 @@ class ImageRenderer(AbstractRenderer):


class HtmlRenderer(AbstractRenderer):
    def __init__(
        self, content, url, center=True, ftr_site_config=None, source_url=None
    ):
        super().__init__(content, url, center)

        self.loaded_site_config = None

        if ftr_site_config is not None and _HAS_UNMERDIFY:
            config_files = unmerdify.get_config_files(ftr_site_config)
            self.loaded_site_config = unmerdify.load_site_config_for_url(
                config_files, source_url
            )
            if self.loaded_site_config is None:
                print(f"-> Unable to find site config for url {source_url}")

    def get_mime(self):
        return "text/html"

@@ -1380,17 +1403,29 @@ class HtmlRenderer(AbstractRenderer):
                    for child in element.children:
                        recursive_render(child, indent=indent)

        # the real render_html hearth
        if mode in ["full", "full_links_only"]:
            summary = body
        elif _HAS_READABILITY:
            try:
                readable = Document(body)
                summary = readable.summary()
            except Exception:
                summary = body
        else:
            summary = body
        # the real render_html heart
        if mode not in ["full", "full_links_only"]:
            parsed_summary = None

            if _HAS_UNMERDIFY and self.loaded_site_config is not None:
                parsed_summary = unmerdify.get_body(self.loaded_site_config, body)

                if parsed_summary is None:
                    print(
                        f"-> Impossible to get content from `unmerdify`, falling back to readability"
                    )

            # Unmerdify parsing failed, fallback to readability
            if _HAS_READABILITY and parsed_summary is None:
                try:
                    readable = Document(body)
                    parsed_summary = readable.summary()
                except Exception as err:
                    pass

            if parsed_summary is not None:
                summary = parsed_summary

        soup = BeautifulSoup(summary, "html.parser")
        # soup = BeautifulSoup(summary, 'html5lib')
        if soup:
@@ -1398,6 +1433,7 @@ class HtmlRenderer(AbstractRenderer):
                recursive_render(soup.body)
            else:
                recursive_render(soup)

        return r.get_final(), links


@@ -1488,7 +1524,9 @@ def get_mime(path, url=None):
    return mime


def renderer_from_file(path, url=None, theme=None):
def renderer_from_file(
    path, url=None, theme=None, ftr_site_config=None, source_url=None
):
    if not path:
        return None
    mime = get_mime(path, url=url)
@@ -1501,13 +1539,35 @@ def renderer_from_file(path, url=None, theme=None):
                f.close()
        else:
            content = path
        toreturn = set_renderer(content, url, mime, theme=theme)

        toreturn = set_renderer(
            content,
            url,
            mime,
            theme=theme,
            ftr_site_config=ftr_site_config,
            source_url=source_url,
        )
    else:
        toreturn = None
    return toreturn


def set_renderer(content, url, mime, theme=None):
def get_renderer(func, content, url, ftr_site_config=None, source_url=None):
    if ftr_site_config is None or func != HtmlRenderer:
        renderer = func(content, url)
    else:
        renderer = func(
            content,
            url,
            ftr_site_config=ftr_site_config,
            source_url=source_url,
        )

    return renderer


def set_renderer(content, url, mime, theme=None, ftr_site_config=None, source_url=None):
    renderer = None
    if mime == "Local Folder":
        renderer = FolderRenderer("", url, datadir=xdg("data"))
@@ -1522,7 +1582,13 @@ def set_renderer(content, url, mime, theme=None):
        current_mime = mime_to_use[0]
        func = _FORMAT_RENDERERS[current_mime]
        if current_mime.startswith("text"):
            renderer = func(content, url)
            renderer = get_renderer(
                func,
                content,
                url,
                ftr_site_config=ftr_site_config,
                source_url=source_url,
            )
            # We double check if the renderer is correct.
            # If not, we fallback to html
            # (this is currently only for XHTML, often being
@@ -1530,11 +1596,24 @@ def set_renderer(content, url, mime, theme=None):
            if not renderer.is_valid():
                func = _FORMAT_RENDERERS["text/html"]
                # print("Set (fallback)RENDERER to html instead of %s"%mime)
                renderer = func(content, url)

                renderer = get_renderer(
                    func,
                    content,
                    url,
                    ftr_site_config=ftr_site_config,
                    source_url=source_url,
                )
        else:
            # TODO: check this code and then remove one if.
            # we don’t parse text, we give the file to the renderer
            renderer = func(content, url)
            renderer = get_renderer(
                func,
                content,
                url,
                ftr_site_config=ftr_site_config,
                source_url=source_url,
            )
            if not renderer.is_valid():
                renderer = None
    if renderer and theme:
@@ -1542,7 +1621,16 @@ def set_renderer(content, url, mime, theme=None):
    return renderer


def render(input, path=None, format="auto", mime=None, url=None, mode=None):
def render(
    input,
    path=None,
    format="auto",
    mime=None,
    url=None,
    mode=None,
    ftr_site_config=None,
    source_url=None,
):
    if not url:
        url = ""
    else:
@@ -1550,7 +1638,9 @@ def render(input, path=None, format="auto", mime=None, url=None, mode=None):
    if format == "gemtext":
        r = GemtextRenderer(input, url)
    elif format == "html":
        r = HtmlRenderer(input, url)
        r = HtmlRenderer(
            input, url, ftr_site_config=ftr_site_config, source_url=source_url
        )
    elif format == "feed":
        r = FeedRenderer(input, url)
    elif format == "gopher":
@@ -1563,9 +1653,13 @@ def render(input, path=None, format="auto", mime=None, url=None, mode=None):
        r = PlaintextRenderer(input, url)
    else:
        if not mime and path:
            r = renderer_from_file(path, url)
            r = renderer_from_file(
                path, url, ftr_site_config=ftr_site_config, source_url=source_url
            )
        else:
            r = set_renderer(input, url, mime)
            r = set_renderer(
                input, url, mime, ftr_site_config=ftr_site_config, source_url=source_url
            )
    if r:
        r.display(directdisplay=True, mode=mode)
    else:
@@ -1609,6 +1703,18 @@ def main():
        help="Which mode should be used to render: normal (default), full or source.\
                                With HTML, the normal mode try to extract the article.",
    )
    parser.add_argument(
        "--ftr_site_config",
        type=str,
        help="If using unmerdify you need to specify the path to the https://github.com/fivefilters/ftr-site-config directory locally.",
    )

    parser.add_argument(
        "--source_url",
        type=str,
        help="If using unmerdify you need to provide the source url of the html to auto detect the config file to load.",
    )

    parser.add_argument(
        "content",
        metavar="INPUT",
@@ -1617,7 +1723,24 @@ def main():
        default=sys.stdin,
        help="Path to the text to render (default to stdin)",
    )

    args = parser.parse_args()

    if args.ftr_site_config is not None or args.source_url is not None:
        if not _HAS_UNMERDIFY:
            print("You need to install `unmerdify` in order to use this mode.")
            sys.exit(1)

        if args.ftr_site_config is None or args.source_url is None:
            print(
                "When using the `unmerdify` mode you need to specify the `--ftr_site_config` and `--source_url` parameters."
            )
            sys.exit(1)

        if not os.path.isdir(args.ftr_site_config):
            print("`--ftr_site_config` must be an existing directory.")
            sys.exit(1)

    # Detect if we are running interactively or in a pipe
    if sys.stdin.isatty():
        # we are interactive, not in stdin, we can have multiple files as input
@@ -1635,6 +1758,8 @@ def main():
                    url=args.url,
                    mime=args.mime,
                    mode=args.mode,
                    ftr_site_config=args.ftr_site_config,
                    source_url=args.source_url,
                )
        else:
            print("Ansicat needs at least one file as an argument")
@@ -1650,6 +1775,8 @@ def main():
                url=args.url,
                mime=args.mime,
                mode=args.mode,
                ftr_site_config=args.ftr_site_config,
                source_url=args.source_url,
            )


diff --git a/offpunk.py b/offpunk.py
index c91c72b..37db443 100755
--- a/offpunk.py
+++ b/offpunk.py
@@ -144,7 +144,7 @@ def needs_gi(inner):


class GeminiClient(cmd.Cmd):
    def __init__(self, completekey="tab", synconly=False):
    def __init__(self, completekey="tab", synconly=False, ftr_site_config=None):
        cmd.Cmd.__init__(self)
        # Set umask so that nothing we create can be read by anybody else.
        # The certificate cache and TOFU database contain "browser history"
@@ -188,6 +188,8 @@ class GeminiClient(cmd.Cmd):
            "default_protocol": "gemini",
        }
        self.redirects = offblocklist.redirects
        self.ftr_site_config = ftr_site_config

        for i in offblocklist.blocked:
            self.redirects[i] = "blocked"
        term_width(new_width=self.options["width"])
@@ -282,7 +284,9 @@ class GeminiClient(cmd.Cmd):
        # If launched without argument, we return the renderer for the current URL
        if not url:
            url = self.current_url
        return self.opencache.get_renderer(url, theme=self.theme)
        return self.opencache.get_renderer(
            url, theme=self.theme, ftr_site_config=self.ftr_site_config
        )

    def _go_to_url(
        self,
@@ -361,10 +365,14 @@ class GeminiClient(cmd.Cmd):
        elif not self.offline_only:
            # A cache is always valid at least 60seconds
            params["validity"] = 60
        # Use cache or mark as to_fetch if resource is not cached
        if handle and not self.sync_only:
            # Use cache or mark as to_fetch if resource is not cached
            displayed, url = self.opencache.opnk(
                url, mode=mode, grep=grep, theme=self.theme, **params
                url,
                mode=mode,
                grep=grep,
                theme=self.theme,
                ftr_site_config=self.ftr_site_config,
                **params,
            )
            modedurl = mode_url(url, mode)
            if not displayed:
@@ -617,7 +625,7 @@ class GeminiClient(cmd.Cmd):
        alias : show all existing aliases
        alias ALIAS : show the command linked to ALIAS
        alias ALIAS CMD : create or replace existing ALIAS to be linked to command CMD"""
        #building the list of existing commands to avoid conflicts
        # building the list of existing commands to avoid conflicts
        commands = []
        for name in self.get_names():
            if name.startswith("do_"):
@@ -633,19 +641,18 @@ class GeminiClient(cmd.Cmd):
        elif len(line.split()) == 1:
            alias = line.strip()
            if alias in commands:
                print("%s is a command and cannot be aliased"%alias)
                print("%s is a command and cannot be aliased" % alias)
            elif alias in _ABBREVS:
                print("%s is currently aliased to \"%s\"" %(alias,_ABBREVS[alias]))
                print('%s is currently aliased to "%s"' % (alias, _ABBREVS[alias]))
            else:
                print("there’s no alias for \"%s\""%alias)
                print('there’s no alias for "%s"' % alias)
        else:
            alias, cmd = line.split(None,1)
            alias, cmd = line.split(None, 1)
            if alias in commands:
                print("%s is a command and cannot be aliased"%alias)
                print("%s is a command and cannot be aliased" % alias)
            else:
                _ABBREVS[alias] = cmd
                print("%s has been aliased to \"%s\""%(alias,cmd))
        
                print('%s has been aliased to "%s"' % (alias, cmd))

    def do_offline(self, *args):
        """Use Offpunk offline by only accessing cached content"""
@@ -1421,9 +1428,7 @@ Use "view XX" where XX is a number to view information about link XX.
            list_path = self.list_path(list)
        if not list_path:
            print(
                "List %s does not exist. Create it with "
                "list create %s"
                "" % (list, list)
                "List %s does not exist. Create it with list create %s" % (list, list)
            )
            return False
        else:
@@ -1552,9 +1557,7 @@ Use "view XX" where XX is a number to view information about link XX.
        list_path = self.list_path(list)
        if not list_path:
            print(
                "List %s does not exist. Create it with "
                "list create %s"
                "" % (list, list)
                "List %s does not exist. Create it with list create %s" % (list, list)
            )
        elif not line.isnumeric():
            print("go_to_line requires a number as parameter")
@@ -1570,9 +1573,7 @@ Use "view XX" where XX is a number to view information about link XX.
        list_path = self.list_path(list)
        if not list_path:
            print(
                "List %s does not exist. Create it with "
                "list create %s"
                "" % (list, list)
                "List %s does not exist. Create it with list create %s" % (list, list)
            )
        else:
            url = "list:///%s" % list
@@ -2060,6 +2061,11 @@ def main():
        action="store_true",
        help="display available features and dependancies then quit",
    )
    parser.add_argument(
        "--ftr-site-config",
        type=str,
        help="If you want to use `unmerdify`, you need to specify the path to the https://github.com/fivefilters/ftr-site-config directory locally.",
    )
    parser.add_argument(
        "url",
        metavar="URL",
@@ -2082,7 +2088,7 @@ def main():
                os.makedirs(f)

    # Instantiate client
    gc = GeminiClient(synconly=args.sync)
    gc = GeminiClient(synconly=args.sync, ftr_site_config=args.ftr_site_config)
    torun_queue = []

    # Interactive if offpunk started normally
diff --git a/opnk.py b/opnk.py
index c614607..a2f726b 100755
--- a/opnk.py
+++ b/opnk.py
@@ -142,7 +142,7 @@ class opencache:
            if previous:
                print("Previous handler was %s" % previous)

    def get_renderer(self, inpath, mode=None, theme=None):
    def get_renderer(self, inpath, mode=None, theme=None, ftr_site_config=None):
        # We remove the ##offpunk_mode= from the URL
        # If mode is already set, we don’t use the part from the URL
        inpath, newmode = unmode_url(inpath)
@@ -155,6 +155,7 @@ class opencache:
            # default mode is readable
            mode = "readable"
        renderer = None

        path = netcache.get_cache_path(inpath)
        if path:
            usecache = inpath in self.rendererdic.keys() and not is_local(inpath)
@@ -175,7 +176,13 @@ class opencache:
                else:
                    usecache = False
            if not usecache:
                renderer = ansicat.renderer_from_file(path, url=inpath, theme=theme)
                renderer = ansicat.renderer_from_file(
                    path,
                    url=inpath,
                    theme=theme,
                    source_url=inpath,
                    ftr_site_config=ftr_site_config,
                )
                if renderer:
                    self.rendererdic[inpath] = renderer
                    self.renderer_time[inpath] = int(time.time())
@@ -189,7 +196,16 @@ class opencache:
        else:
            return None

    def opnk(self, inpath, mode=None, terminal=True, grep=None, theme=None, **kwargs):
    def opnk(
        self,
        inpath,
        mode=None,
        terminal=True,
        grep=None,
        theme=None,
        ftr_site_config=None,
        **kwargs,
    ):
        # Return True if inpath opened in Terminal
        # False otherwise
        # also returns the url in case it has been modified
@@ -211,7 +227,9 @@ class opencache:
        else:
            print("%s does not exist" % inpath)
            return False, inpath
        renderer = self.get_renderer(inpath, mode=mode, theme=theme)
        renderer = self.get_renderer(
            inpath, mode=mode, theme=theme, ftr_site_config=ftr_site_config
        )
        if renderer and mode:
            renderer.set_mode(mode)
            self.last_mode[inpath] = mode
@@ -328,10 +346,21 @@ def main():
        help="maximum age, in second, of the cached version before \
                                redownloading a new version",
    )

    parser.add_argument(
        "--ftr-site-config",
        type=str,
        help="If using the `unmerdify` mode, you need to specify the path to the https://github.com/fivefilters/ftr-site-config directory locally.",
    )
    args = parser.parse_args()
    cache = opencache()
    for f in args.content:
        cache.opnk(f, mode=args.mode, validity=args.cache_validity)
        cache.opnk(
            f,
            mode=args.mode,
            validity=args.cache_validity,
            ftr_site_config=args.ftr_site_config,
        )


if __name__ == "__main__":
diff --git a/unmerdify.py b/unmerdify.py
new file mode 100755
index 0000000..f4a7f43
--- /dev/null
+++ b/unmerdify.py
@@ -0,0 +1,574 @@
#!/usr/bin/env python3

import argparse
import fileinput
import glob
import logging
import logging.config
import os
import re
from copy import deepcopy
from dataclasses import dataclass
from urllib.parse import urlparse

from lxml import etree

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "default": {
            "format": "[%(asctime)s] [%(levelname)8s] [%(filename)s:%(lineno)s - %(funcName).20s…] %(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S",
        }
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "default",
        }
    },
    "loggers": {"": {"handlers": ["stdout"], "level": "ERROR"}},
}


logging.config.dictConfig(LOGGING)


LOGGER = logging.getLogger(__name__)


def set_logging_level(level):
    LOGGING["loggers"][""]["level"] = level
    logging.config.dictConfig(LOGGING)


@dataclass
class Command:
    """Class for keeping track of a command item."""

    name: str
    accept_multiple_values: bool = False
    is_bool: bool = False
    xpath_value: bool = False
    has_capture_group: bool = False
    special_command: bool = False
    ignore: bool = False


COMMANDS: list[Command] = [
    Command("author", accept_multiple_values=True),
    Command("autodetect_on_failure", is_bool=True),
    Command("body", accept_multiple_values=True),
    Command("date", accept_multiple_values=True),
    Command("find_string", accept_multiple_values=True),
    Command("http_header", has_capture_group=True, special_command=True),
    Command("if_page_contains", special_command=True),
    Command("login_extra_fields", accept_multiple_values=True),
    Command("login_password_field"),
    Command("login_uri"),
    Command("login_username_field"),
    Command("native_ad_clue", accept_multiple_values=True),
    Command("next_page_link", accept_multiple_values=True),
    Command("not_logged_in_xpath"),
    Command("parser"),
    Command("prune", is_bool=True),
    Command("replace_string", has_capture_group=True, accept_multiple_values=True),
    Command("requires_login", is_bool=True),
    Command("src_lazy_load_attr"),
    Command("single_page_link", accept_multiple_values=True),
    Command("skip_json_ld", is_bool=True),
    Command("strip", accept_multiple_values=True),
    Command("strip_id_or_class", accept_multiple_values=True),
    Command("strip_image_src", accept_multiple_values=True),
    Command("test_contains", special_command=True),
    Command("test_url", accept_multiple_values=True, special_command=True),
    Command("tidy", is_bool=True),
    Command("title", accept_multiple_values=True),
    Command("wrap_in", has_capture_group=True, special_command=True),
]

COMMANDS_PER_NAME: dict[str, Command] = {
    COMMANDS[i].name: COMMANDS[i] for i in range(0, len(COMMANDS))
}


def get_config_files(
    site_config_dir: str, include_config_dir: bool = True
) -> list[str]:
    """
    Read the *.txt files from the site_config directory and returns the file list.

    Parameters:
        site_config_dir (str): The path to the directory containing the config files
        include_config_dir (bool): Should the config_dir be included in the returned list

    Returns:
        filenames (list[str]): The list of filenames found with the .txt extension
    """
    filenames: list[str] = []

    for file in glob.iglob(f"{site_config_dir}/*.txt", include_hidden=True):
        if file.endswith("LICENSE.txt"):
            continue

        if include_config_dir:
            filenames.append(file)
        else:
            filenames.append(file.removeprefix(f"{site_config_dir}/"))

    filenames.sort()
    return filenames


def get_host_for_url(url: str) -> str:
    parsed_uri = urlparse(url)
    return parsed_uri.netloc


def get_possible_config_file_names_for_host(
    host: str, file_extension: str = ".txt"
) -> list[str]:
    """
    The five filters config files can be of the form

    - .specific.domain.tld (for *.specific.domain.tld)
    - specific.domain.tld (for this specific domain)
    - .domain.tld (for *.domain.tld)
    - domain.tld (for domain.tld)
    """

    parts = host.split(".")

    if len(parts) < 2:
        raise ValueError(
            f"The host must be of the form `host.com`. It seems that there is no dot in the provided host: {host}"
        )

    tld = parts.pop()
    domain = parts.pop()

    first_possible_name = f"{domain}.{tld}{file_extension}"
    possible_names = [first_possible_name, f".{first_possible_name}"]

    # While we still have parts in the domain name, prepend the part
    # and create the 2 new possible names
    while len(parts) > 0:
        next_part = parts.pop()
        possible_name = f"{next_part}.{possible_names[-2]}"
        possible_names.append(possible_name)
        possible_names.append(f".{possible_name}")

    # Put the most specific file names first
    possible_names.reverse()

    return possible_names


def get_config_file_for_host(config_files: list[str], host: str) -> str | None:
    possible_config_file_names = get_possible_config_file_names_for_host(host)

    for config_file in config_files:
        basename = os.path.basename(config_file)
        for possible_config_file_name in possible_config_file_names:
            if basename == possible_config_file_name:
                return config_file


def parse_site_config_file(config_file_path: str) -> dict | None:
    config = {}
    with open(config_file_path, "r") as file:
        previous_command = None
        while line := file.readline():
            line = line.strip()

            # skip comments, empty lines
            if line == "" or line.startswith("#") or line.startswith("//"):
                continue

            command_name = None
            command_value = None
            pattern = re.compile(r"^([a-z_]+)(?:\((.*)\))*:[ ]*(.*)$", re.I)

            result = pattern.search(line)

            if not result:
                logging.error(
                    f"-> 🚨 ERROR: unknown line format for line `{line}` in file `{config_file_path}`. Skipping."
                )
                continue

            command_name = result.group(1).lower()
            command_arg = result.group(2)
            command_value = result.group(3)

            # strip_attr is now an alias for strip, for example:
            # strip_attr: //img/@srcset
            if "strip_attr" == command_name:
                command_name = "strip"

            command = COMMANDS_PER_NAME.get(command_name)

            if command is None:
                logging.warning(
                    f"-> ⚠️ WARNING: unknown command name for line `{line}` in file `{config_file_path}`. Skipping."
                )
                continue

            # Check for commands where we accept multiple statements but we don't have args provided
            # It handles `replace_string: value` and not `replace_string(test): value`
            if (
                command.accept_multiple_values
                and command_arg is None
                and not command.special_command
            ):
                config.setdefault(command_name, []).append(command_value)
            # Single value command that should evaluate to a bool
            elif command.is_bool and not command.special_command:
                config[command_name] = "yes" == command_value or "true" == command_value
            # handle replace_string(test): value
            elif command.name == "replace_string" and command_arg is not None:
                config.setdefault("find_string", []).append(command_arg)
                config.setdefault("replace_string", []).append(command_value)
            # handle http_header(user-agent): Mozilla/5.2
            elif command.name == "http_header" and command_arg is not None:
                config.setdefault("http_header", []).append(
                    {command_arg: command_value}
                )
            # handle if_page_contains: Xpath value
            elif command.name == "if_page_contains":
                # Previous command should be applied only if this expression is true
                previous_command_value = config[previous_command.name]

                # Move the previous command into the "if_page_contains" command
                if (
                    previous_command.accept_multiple_values
                    and len(previous_command_value) > 0
                ):
                    config.setdefault("if_page_contains", {})[command_value] = {
                        previous_command.name: previous_command_value.pop()
                    }

                # Remove the entire key entry if the values are now empty
                if len(previous_command_value) == 0:
                    config.pop(previous_command.name)

            # handle if_page_contains: Xpath value
            elif command.name == "wrap_in":
                config.setdefault("wrap_in", []).append((command_arg, command_value))
            elif command.name == "test_url":
                config.setdefault("test_url", []).append(
                    {command.name: command_value, "test_contains": []}
                )
            elif command.name == "test_contains":
                test_url = config.get("test_url")
                if test_url is None or len(test_url) == 0:
                    logging.error(
                        "-> 🚨 ERROR: No test_url found for given test_contains. Skipping."
                    )
                    continue

                test_url[-1]["test_contains"].append(command_value)
            else:
                config[command_name] = command_value

            previous_command = command

    return config if config != {} else None


def load_site_config_for_host(config_files: list[str], host: str) -> dict | None:
    logging.debug(f"-> Loading site config for {host}")
    config_file = get_config_file_for_host(config_files, host)

    if config_file:
        logging.debug(f"-> Found config file, loading {config_file} config.")
        return parse_site_config_file(config_file)
    else:
        logging.debug(f"-> No config file found for host {host}.")


def load_site_config_for_url(config_files: list[str], url: str) -> dict | None:
    return load_site_config_for_host(config_files, get_host_for_url(url))


# Content extractor code


def replace_strings(site_config: dict, html: str) -> str:
    replace_string_cmds = site_config.get("replace_string", [])
    find_string_cmds = site_config.get("find_string", [])

    if len(replace_string_cmds) == 0 and len(find_string_cmds) == 0:
        return html

    if len(replace_string_cmds) != len(find_string_cmds):
        logging.error(
            "🚨 ERROR: `replace_string` and `find_string` counts are not the same but must be, skipping string replacement."
        )
    else:
        nb_replacement = 0

        for replace_string, find_string in zip(replace_string_cmds, find_string_cmds):
            nb_replacement += html.count(find_string)
            html = html.replace(find_string, replace_string)

        logging.debug(
            f"Replaced {nb_replacement} string{'' if nb_replacement == 1 else 's'} using replace_string/find_string commands."
        )

    logging.debug(f"Html after string replacement: {html}")

    return html


def wrap_in(site_config: dict, lxml_tree):
    for tag, pattern in site_config.get("wrap_in", []):
        logging.debug(f"Wrap in `{tag}` => `{pattern}`")
        elements = lxml_tree.xpath(pattern)
        for element in elements:
            parent = element.getparent()
            new_element = etree.Element(tag)
            new_element.append(deepcopy(element))
            parent.replace(element, new_element)


def strip_elements(site_config: dict, lxml_tree):
    for pattern in site_config.get("strip", []):
        remove_elements_by_xpath(pattern, lxml_tree)


def strip_elements_by_id_or_class(site_config: dict, lxml_tree):
    for pattern in site_config.get("strip_id_or_class", []):
        # Some entries contain " or '
        pattern = pattern.replace("'", "").replace('"', "")
        remove_elements_by_xpath(
            f"//*[contains(concat(' ',normalize-space(@class), ' '),' {pattern} ') or contains(concat(' ',normalize-space(@id),' '), ' {pattern} ')]",
            lxml_tree,
        )


def strip_image_src(site_config: dict, lxml_tree):
    for pattern in site_config.get("strip_image_src", []):
        # Some entries contain " or '
        pattern = pattern.replace("'", "").replace('"', "")
        remove_elements_by_xpath(f"//img[contains(@src,'{pattern}')]", lxml_tree)


def get_body_element(site_config: dict, lxml_tree):
    body_contents = []

    for pattern in site_config.get("body", []):
        elements = lxml_tree.xpath(pattern)

        for body_element in elements:
            body_contents.append(body_element)

    if len(body_contents) == 1:
        return body_contents[0]

    if len(body_contents) > 1:
        body = etree.Element("div")
        for element in body_contents:
            body.append(element)

        return body


def get_body_element_html(site_config: dict, lxml_tree):
    body = get_body_element(site_config, lxml_tree)
    if body is not None:
        return etree.tostring(body, encoding="unicode")


def remove_hidden_elements(lxml_tree):
    remove_elements_by_xpath(
        "//*[contains(@style,'display:none') or contains(@style,'visibility:hidden')]",
        lxml_tree,
    )


def remove_a_empty_elements(lxml_tree):
    remove_elements_by_xpath(
        "//a[not(./*) and normalize-space(.)='']",
        lxml_tree,
    )


def remove_elements_by_xpath(xpath_expression, lxml_tree):
    elements = lxml_tree.xpath(xpath_expression)
    for element in elements:
        if isinstance(element, etree._Element):
            element.getparent().remove(element)
        else:
            logging.error(
                f"🚨 ERROR: remove by xpath, element is not a Node, got {type(element)}."
            )


def get_xpath_value_for_command(
    site_config: dict, command_name: str, lxml_tree
) -> str | None:
    command_xpaths = site_config.get(command_name, [])

    for command_xpath in command_xpaths:
        value = get_xpath_value(site_config, command_xpath, lxml_tree)
        if value is not None:
            return value


def get_multiple_xpath_values_for_command(
    site_config: dict, command_name: str, lxml_tree
) -> list[str]:
    command_xpaths = site_config.get(command_name, [])
    values = []

    for command_xpath in command_xpaths:
        values = values + get_multiple_xpath_values(
            site_config, command_xpath, lxml_tree
        )

    return values


def get_xpath_value(site_config: dict, xpath: str, lxml_tree):
    elements = lxml_tree.xpath(xpath)

    if isinstance(elements, str) or isinstance(elements, etree._ElementUnicodeResult):
        return str(elements)

    for element in elements:
        # Return the first entry found
        if isinstance(element, str) or isinstance(element, etree._ElementUnicodeResult):
            return str(element)
        else:
            value = etree.tostring(element, method="text", encoding="unicode").strip()
            return " ".join(value.split()).replace("\n", "")


def get_multiple_xpath_values(site_config: dict, xpath: str, lxml_tree):
    values = []

    elements = lxml_tree.xpath(xpath)

    if isinstance(elements, str) or isinstance(elements, etree._ElementUnicodeResult):
        return [str(elements)]

    for element in elements:
        # Collect every matching entry
        if isinstance(element, str) or isinstance(element, etree._ElementUnicodeResult):
            values.append(str(element))
        else:
            value = etree.tostring(element, method="text", encoding="unicode").strip()
            value = " ".join(value.split()).replace("\n", "")
            values.append(value)

    return values


def get_body(site_config: dict, html: str):
    html = replace_strings(site_config, html)
    html_parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)

    tree = etree.fromstring(html, html_parser)

    wrap_in(site_config, tree)
    strip_elements(site_config, tree)
    strip_elements_by_id_or_class(site_config, tree)
    strip_image_src(site_config, tree)
    remove_hidden_elements(tree)
    remove_a_empty_elements(tree)

    return get_body_element_html(site_config, tree)


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Get the content, only the content: unenshittificator for the web"
    )

    parser.add_argument(
        "ftr_site_config",
        type=str,
        help="The path to the https://github.com/fivefilters/ftr-site-config directory, or a path to a config file.",
    )

    parser.add_argument(
        "-u",
        "--url",
        type=str,
        help="The url you want to unmerdify.",
    )

    parser.add_argument(
        "files",
        metavar="FILE",
        nargs="*",
        help="Files to read, if empty, stdin is used.",
    )

    parser.add_argument(
        "-l",
        "--loglevel",
        default=logging.ERROR,
        choices=logging.getLevelNamesMapping().keys(),
        help="Set log level",
    )

    # @TODO: extract open graph information if any
    #  https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L1241
    args = parser.parse_args()

    set_logging_level(args.loglevel)

    if os.path.isdir(args.ftr_site_config) and args.url is None:
        logging.error(
            "ERROR: You must provide a URL with --url if you don't provide a specific config file.",
        )
        return 1

    if os.path.isdir(args.ftr_site_config):
        config_files = get_config_files(args.ftr_site_config)
        loaded_site_config = load_site_config_for_url(config_files, args.url)
    else:
        loaded_site_config = parse_site_config_file(args.ftr_site_config)

    if loaded_site_config is None:
        logging.error(f"Unable to load site config for `{args.ftr_site_config}`.")
        return 1

    html = ""
    # Pass '-' as the only file when argparse received none, so fileinput reads from stdin
    for line in fileinput.input(
        files=args.files if len(args.files) > 0 else ("-",),
        openhook=fileinput.hook_encoded("utf-8"),
    ):
        html += line

    html_replaced = replace_strings(loaded_site_config, html)
    html_parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)

    tree = etree.fromstring(html_replaced, html_parser)

    title = get_xpath_value_for_command(loaded_site_config, "title", tree)
    logging.debug(f"Got title `{title}`.")

    authors = get_multiple_xpath_values_for_command(loaded_site_config, "author", tree)

    logging.debug(f"Got authors {authors}.")

    date = get_xpath_value_for_command(loaded_site_config, "date", tree)

    logging.debug(f"Got date `{date}`.")

    body_html = get_body(loaded_site_config, html)

    print(body_html)

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
-- 
2.48.1