This patch supersedes the prior one I sent regarding the unmerdify integration.

After discussing with Ploum, I’ve decided to put unmerdify in a single file bundled directly with offpunk to ease dependency and file management.

I’ve tried to keep the changes to offpunk files as small as possible, and I’ve reverted some non-required changes I made in the previous patch to facilitate the review.

For unmerdify to work, you first need to clone this repository to your disk: https://github.com/fivefilters/ftr-site-config.

Then provide the path to offpunk:

./offpunk.py --ftr-site-config /path/to/ftr-site-config

You should then be able to `go https://ploum.net` and see, by default, the expected homepage content instead of the truncated one that is displayed without `unmerdify`.

Vincent Jousse (1):
  feat: add unmerdify in one file

 ansicat.py   | 169 +++++++++++++--
 offpunk.py   |  52 ++---
 opnk.py      |  39 +++-
 unmerdify.py | 574 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 785 insertions(+), 49 deletions(-)
 create mode 100755 unmerdify.py

-- 
2.48.1
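For readers who want to see what the ftr-site-config path is used for: the patch picks a config file by host name, trying the most specific name first. A minimal sketch of that naming scheme follows, based on the docstring of `get_possible_config_file_names_for_host` in the patch. The function name `candidate_config_names` is hypothetical, and the sketch assumes the URL has no port or credentials in its host part.

```python
from urllib.parse import urlparse


def candidate_config_names(url: str, extension: str = ".txt") -> list[str]:
    """Return possible site-config file names for a URL, most specific first.

    Hypothetical helper: it mirrors the naming scheme described in the patch,
    it is not the patch's own function.
    """
    host = urlparse(url).netloc
    parts = host.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a host such as `example.com`, got {host!r}")

    names = []
    # Walk from the two-label domain (domain.tld) up to the full host name.
    for start in range(len(parts) - 2, -1, -1):
        name = ".".join(parts[start:]) + extension
        # Exact-host file name, then the dot-prefixed wildcard form.
        names += [name, f".{name}"]

    # Most specific names first.
    names.reverse()
    return names


print(candidate_config_names("https://www.ploum.net/"))
```

The first name in the list that exists in the cloned directory would be the natural choice; whether a given site has a config at all depends on the contents of the ftr-site-config checkout.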
Hi,

I see that you're the upstream maintainer for unmerdify.

Since unmerdify.py is not included in the wheel package and is only used from a repo checkout, and there's also a `ModuleNotFoundError` check, I gather that distro maintainers are expected to package unmerdify.

So my questions are:

- Are the single-file and the regular version of unmerdify interchangeable? Will they be kept in sync with each other?
- Are you going to eventually tag releases for unmerdify?
- Will there be a clearer indication that unmerdify.py is a third-party bundled dependency?

Cc-ing the offpunk-packagers list because it is relevant.
Hi Anna,

Thanks for raising this.

Vincent recently managed to put all of unmerdify into a single Python file, which is nice.

My offer is to merge unmerdify into the offpunk project as an offpunk component. That would greatly reduce the packaging burden.

But, of course, this is something Vincent should think about, as I don’t know his plans for the future of unmerdify or whether he intends to use it in other, unrelated projects.

But, as unmerdify is still not used in Offpunk, this is not an urgent task.

Ploum

On 25 Mar 08 06:49, Anna Vyalkova wrote:
Hi,

Those are great questions that we do need to address, and I’m open to suggestions here.

My first approach was to provide only the regular version of unmerdify, tag versions, and make it a dependency of offpunk. After some discussions with Ploum, who wanted offpunk to have as few dependencies as possible and was not convinced by unmerdify’s multi-file approach, I tried to bundle unmerdify with offpunk as a proof of concept.

I suppose that we now need to decide what to do with unmerdify. Do we make it a dependency of offpunk? And if so, do we keep the one-file approach or not?

If we decide to bundle it with offpunk, we should keep the one-file approach, delete the regular unmerdify repo and move everything to the offpunk one. But that would make unmerdify less reusable and less visible to people not using offpunk.

What do you think?
If unmerdify is bundled with Offpunk, the maintenance burden falls mainly on my shoulders, and the project infrastructure is already there.

Also, like ansicat/netcache, it can be made a standalone tool that other projects can use.

With Offpunk being the first test case for unmerdify, I believe it makes sense to bundle it with offpunk to increase its use and testing. As it is a new library, most people will not have it, and most Offpunk users will not use it for a long time.

Also, if unmerdify proves to be successful and widely useful, the project could be spun off later to meet the needs of people wanting to use unmerdify without using offpunk.

On the other hand, bundling it with offpunk makes unmerdify harder to advertise to other projects that might use it.

The real question: do you want to maintain unmerdify, make it a standalone project and advertise it so it will be used in other places? Or do you consider it more of a fun experiment that you don’t want to invest too much in?

I know it is a hard question ;-)
On 25 Mar 16 09:14, Vincent Jousse wrote:
Copy & paste the following snippet into your terminal to import this patchset into git:
curl -s https://lists.sr.ht/~lioploum/offpunk-devel/patches/57341/mbox | git am -3
---
 ansicat.py   | 169 +++++++++++++--
 offpunk.py   |  52 ++---
 opnk.py      |  39 +++-
 unmerdify.py | 574 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 785 insertions(+), 49 deletions(-)
 create mode 100755 unmerdify.py

diff --git a/ansicat.py b/ansicat.py
index dfbd30a..9ba4a76 100755
--- a/ansicat.py
+++ b/ansicat.py
@@ -23,6 +23,13 @@ try:
 except ModuleNotFoundError:
     _HAS_READABILITY = False
 
+try:
+    import unmerdify
+
+    _HAS_UNMERDIFY = True
+except ModuleNotFoundError:
+    _HAS_UNMERDIFY = False
+
 try:
     # if bs4 version >= 4.11, we need to silent some xml warnings
     import bs4
@@ -183,6 +190,7 @@ class AbstractRenderer:
         self.center = center
         self.last_mode = "readable"
         self.theme = offthemes.default
+        self.ftr_site_config = None
 
     def display(self, mode=None, directdisplay=False):
         wtitle = self.get_formatted_title()
@@ -1075,6 +1083,21 @@ class ImageRenderer(AbstractRenderer):
 
 
 class HtmlRenderer(AbstractRenderer):
+    def __init__(
+        self, content, url, center=True, ftr_site_config=None, source_url=None
+    ):
+        super().__init__(content, url, center)
+
+        self.loaded_site_config = None
+
+        if ftr_site_config is not None and _HAS_UNMERDIFY:
+            config_files = unmerdify.get_config_files(ftr_site_config)
+            self.loaded_site_config = unmerdify.load_site_config_for_url(
+                config_files, source_url
+            )
+            if self.loaded_site_config is None:
+                print(f"-> Unable to find site config for url {source_url}")
+
     def get_mime(self):
         return "text/html"
 
@@ -1380,17 +1403,29 @@ class HtmlRenderer(AbstractRenderer):
                     for child in element.children:
                         recursive_render(child, indent=indent)
 
-        # the real render_html hearth
-        if mode in ["full", "full_links_only"]:
-            summary = body
-        elif _HAS_READABILITY:
-            try:
-                readable = Document(body)
-                summary = readable.summary()
-            except Exception:
-                summary = body
-        else:
-            summary = body
+        # the real render_html heart
+        if mode not in ["full", "full_links_only"]:
+            parsed_summary = None
+
+            if _HAS_UNMERDIFY and self.loaded_site_config is not None:
+                parsed_summary = unmerdify.get_body(self.loaded_site_config, body)
+
+                if parsed_summary is None:
+                    print(
+                        f"-> Impossible to get content from `unmerdify`, falling back to readability"
+                    )
+
+            # Unmerdify parsing failed, fallback to readability
+            if _HAS_READABILITY and parsed_summary is None:
+                try:
+                    readable = Document(body)
+                    parsed_summary = readable.summary()
+                except Exception as err:
+                    pass
+
+            if parsed_summary is not None:
+                summary = parsed_summary
+
         soup = BeautifulSoup(summary, "html.parser")
         # soup = BeautifulSoup(summary, 'html5lib')
         if soup:
@@ -1398,6 +1433,7 @@ class HtmlRenderer(AbstractRenderer):
                 recursive_render(soup.body)
             else:
                 recursive_render(soup)
+
         return r.get_final(), links
 
 
@@ -1488,7 +1524,9 @@ def get_mime(path, url=None):
     return mime
 
 
-def renderer_from_file(path, url=None, theme=None):
+def renderer_from_file(
+    path, url=None, theme=None, ftr_site_config=None, source_url=None
+):
     if not path:
         return None
     mime = get_mime(path, url=url)
@@ -1501,13 +1539,35 @@ def renderer_from_file(path, url=None, theme=None):
                 f.close()
         else:
             content = path
-        toreturn = set_renderer(content, url, mime, theme=theme)
+
+        toreturn = set_renderer(
+            content,
+            url,
+            mime,
+            theme=theme,
+            ftr_site_config=ftr_site_config,
+            source_url=source_url,
+        )
     else:
         toreturn = None
     return toreturn
 
 
-def set_renderer(content, url, mime, theme=None):
+def get_renderer(func, content, url, ftr_site_config=None, source_url=None):
+    if ftr_site_config is None or func != HtmlRenderer:
+        renderer = func(content, url)
+    else:
+        renderer = func(
+            content,
+            url,
+            ftr_site_config=ftr_site_config,
+            source_url=source_url,
+        )
+
+    return renderer
+
+
+def set_renderer(content, url, mime, theme=None, ftr_site_config=None, source_url=None):
     renderer = None
     if mime == "Local Folder":
         renderer = FolderRenderer("", url, datadir=xdg("data"))
@@ -1522,7 +1582,13 @@ def set_renderer(content, url, mime, theme=None):
         current_mime = mime_to_use[0]
         func = _FORMAT_RENDERERS[current_mime]
         if current_mime.startswith("text"):
-            renderer = func(content, url)
+            renderer = get_renderer(
+                func,
+                content,
+                url,
+                ftr_site_config=ftr_site_config,
+                source_url=source_url,
+            )
             # We double check if the renderer is correct.
             # If not, we fallback to html
             # (this is currently only for XHTML, often being
@@ -1530,11 +1596,24 @@ def set_renderer(content, url, mime, theme=None):
             if not renderer.is_valid():
                 func = _FORMAT_RENDERERS["text/html"]
                 # print("Set (fallback)RENDERER to html instead of %s"%mime)
-                renderer = func(content, url)
+
+                renderer = get_renderer(
+                    func,
+                    content,
+                    url,
+                    ftr_site_config=ftr_site_config,
+                    source_url=source_url,
+                )
         else:
             # TODO: check this code and then remove one if.
             # we don’t parse text, we give the file to the renderer
-            renderer = func(content, url)
+            renderer = get_renderer(
+                func,
+                content,
+                url,
+                ftr_site_config=ftr_site_config,
+                source_url=source_url,
+            )
             if not renderer.is_valid():
                 renderer = None
     if renderer and theme:
@@ -1542,7 +1621,16 @@ def set_renderer(content, url, mime, theme=None):
     return renderer
 
 
-def render(input, path=None, format="auto", mime=None, url=None, mode=None):
+def render(
+    input,
+    path=None,
+    format="auto",
+    mime=None,
+    url=None,
+    mode=None,
+    ftr_site_config=None,
+    source_url=None,
+):
     if not url:
         url = ""
     else:
@@ -1550,7 +1638,9 @@ def render(input, path=None, format="auto", mime=None, url=None, mode=None):
     if format == "gemtext":
         r = GemtextRenderer(input, url)
     elif format == "html":
-        r = HtmlRenderer(input, url)
+        r = HtmlRenderer(
+            input, url, ftr_site_config=ftr_site_config, source_url=source_url
+        )
     elif format == "feed":
         r = FeedRenderer(input, url)
     elif format == "gopher":
@@ -1563,9 +1653,13 @@ def render(input, path=None, format="auto", mime=None, url=None, mode=None):
         r = PlaintextRenderer(input, url)
     else:
         if not mime and path:
-            r = renderer_from_file(path, url)
+            r = renderer_from_file(
+                path, url, ftr_site_config=ftr_site_config, source_url=source_url
+            )
         else:
-            r = set_renderer(input, url, mime)
+            r = set_renderer(
+                input, url, mime, ftr_site_config=ftr_site_config, source_url=source_url
+            )
     if r:
         r.display(directdisplay=True, mode=mode)
     else:
@@ -1609,6 +1703,18 @@ def main():
         help="Which mode should be used to render: normal (default), full or source.\
                 With HTML, the normal mode try to extract the article.",
     )
+    parser.add_argument(
+        "--ftr_site_config",
+        type=str,
+        help="If using unmerdify you need to specify the path to the https://github.com/fivefilters/ftr-site-config directory locally.",
+    )
+
+    parser.add_argument(
+        "--source_url",
+        type=str,
+        help="If using unmerdify you need to provide the source url of the html to auto detect the config file to load.",
+    )
+
     parser.add_argument(
         "content",
         metavar="INPUT",
@@ -1617,7 +1723,24 @@ def main():
         default=sys.stdin,
         help="Path to the text to render (default to stdin)",
     )
+
     args = parser.parse_args()
+
+    if args.ftr_site_config is not None or args.source_url is not None:
+        if not _HAS_UNMERDIFY:
+            print("You need to install `unmerdify` in order to use this mode.")
+            sys.exit(1)
+
+        if args.ftr_site_config is None or args.source_url is None:
+            print(
+                "When using the `unmerdify` mode you need to specify the `--ftr_site_config` and `--source_url` parameters."
+            )
+            sys.exit(1)
+
+        if not os.path.isdir(args.ftr_site_config):
+            print("`--ftr_site_config` must be an existing directory.")
+            sys.exit(1)
+
     # Detect if we are running interactively or in a pipe
     if sys.stdin.isatty():
         # we are interactive, not in stdin, we can have multiple files as input
@@ -1635,6 +1758,8 @@ def main():
                     url=args.url,
                     mime=args.mime,
                     mode=args.mode,
+                    ftr_site_config=args.ftr_site_config,
+                    source_url=args.source_url,
                 )
         else:
             print("Ansicat needs at least one file as an argument")
@@ -1650,6 +1775,8 @@ def main():
             url=args.url,
             mime=args.mime,
             mode=args.mode,
+            ftr_site_config=args.ftr_site_config,
+            source_url=args.source_url,
         )
 
 
diff --git a/offpunk.py b/offpunk.py
index c91c72b..37db443 100755
--- a/offpunk.py
+++ b/offpunk.py
@@ -144,7 +144,7 @@ def needs_gi(inner):
 
 
 class GeminiClient(cmd.Cmd):
-    def __init__(self, completekey="tab", synconly=False):
+    def __init__(self, completekey="tab", synconly=False, ftr_site_config=None):
         cmd.Cmd.__init__(self)
         # Set umask so that nothing we create can be read by anybody else.
         # The certificate cache and TOFU database contain "browser history"
@@ -188,6 +188,8 @@ class GeminiClient(cmd.Cmd):
             "default_protocol": "gemini",
         }
         self.redirects = offblocklist.redirects
+        self.ftr_site_config = ftr_site_config
+
         for i in offblocklist.blocked:
             self.redirects[i] = "blocked"
         term_width(new_width=self.options["width"])
@@ -282,7 +284,9 @@ class GeminiClient(cmd.Cmd):
         # If launched without argument, we return the renderer for the current URL
         if not url:
             url = self.current_url
-        return self.opencache.get_renderer(url, theme=self.theme)
+        return self.opencache.get_renderer(
+            url, theme=self.theme, ftr_site_config=self.ftr_site_config
+        )
 
     def _go_to_url(
         self,
@@ -361,10 +365,14 @@ class GeminiClient(cmd.Cmd):
         elif not self.offline_only:
             # A cache is always valid at least 60seconds
             params["validity"] = 60
-        # Use cache or mark as to_fetch if resource is not cached
-        if handle and not self.sync_only:
+            # Use cache or mark as to_fetch if resource is not cached
             displayed, url = self.opencache.opnk(
-                url, mode=mode, grep=grep, theme=self.theme, **params
+                url,
+                mode=mode,
+                grep=grep,
+                theme=self.theme,
+                ftr_site_config=self.ftr_site_config,
+                **params,
             )
             modedurl = mode_url(url, mode)
             if not displayed:
@@ -617,7 +625,7 @@ class GeminiClient(cmd.Cmd):
         alias : show all existing aliases
         alias ALIAS : show the command linked to ALIAS
         alias ALIAS CMD : create or replace existing ALIAS to be linked to command CMD"""
-        #building the list of existing commands to avoid conflicts
+        # building the list of existing commands to avoid conflicts
         commands = []
         for name in self.get_names():
             if name.startswith("do_"):
@@ -633,19 +641,18 @@ class GeminiClient(cmd.Cmd):
         elif len(line.split()) == 1:
             alias = line.strip()
             if alias in commands:
-                print("%s is a command and cannot be aliased"%alias)
+                print("%s is a command and cannot be aliased" % alias)
             elif alias in _ABBREVS:
-                print("%s is currently aliased to \"%s\"" %(alias,_ABBREVS[alias]))
+                print('%s is currently aliased to "%s"' % (alias, _ABBREVS[alias]))
             else:
-                print("there’s no alias for \"%s\""%alias)
+                print('there’s no alias for "%s"' % alias)
         else:
-            alias, cmd = line.split(None,1)
+            alias, cmd = line.split(None, 1)
             if alias in commands:
-                print("%s is a command and cannot be aliased"%alias)
+                print("%s is a command and cannot be aliased" % alias)
             else:
                 _ABBREVS[alias] = cmd
-                print("%s has been aliased to \"%s\""%(alias,cmd))
-
+                print('%s has been aliased to "%s"' % (alias, cmd))
 
     def do_offline(self, *args):
         """Use Offpunk offline by only accessing cached content"""
@@ -1421,9 +1428,7 @@ Use "view XX" where XX is a number to view information about link XX.
         list_path = self.list_path(list)
         if not list_path:
             print(
-                "List %s does not exist. Create it with "
-                "list create %s"
-                "" % (list, list)
+                "List %s does not exist. Create it with list create %s" % (list, list)
             )
             return False
         else:
@@ -1552,9 +1557,7 @@ Use "view XX" where XX is a number to view information about link XX.
         list_path = self.list_path(list)
         if not list_path:
             print(
-                "List %s does not exist. Create it with "
-                "list create %s"
-                "" % (list, list)
+                "List %s does not exist. Create it with list create %s" % (list, list)
             )
         elif not line.isnumeric():
             print("go_to_line requires a number as parameter")
@@ -1570,9 +1573,7 @@ Use "view XX" where XX is a number to view information about link XX.
         list_path = self.list_path(list)
         if not list_path:
             print(
-                "List %s does not exist. Create it with "
-                "list create %s"
-                "" % (list, list)
+                "List %s does not exist. Create it with list create %s" % (list, list)
             )
         else:
             url = "list:///%s" % list
@@ -2060,6 +2061,11 @@ def main():
         action="store_true",
         help="display available features and dependancies then quit",
     )
+    parser.add_argument(
+        "--ftr-site-config",
+        type=str,
+        help="If you want to use `unmerdify`, you need to specify the path to the https://github.com/fivefilters/ftr-site-config directory locally.",
+    )
     parser.add_argument(
         "url",
         metavar="URL",
@@ -2082,7 +2088,7 @@ def main():
             os.makedirs(f)
 
     # Instantiate client
-    gc = GeminiClient(synconly=args.sync)
+    gc = GeminiClient(synconly=args.sync, ftr_site_config=args.ftr_site_config)
     torun_queue = []
 
     # Interactive if offpunk started normally
diff --git a/opnk.py b/opnk.py
index c614607..a2f726b 100755
--- a/opnk.py
+++ b/opnk.py
@@ -142,7 +142,7 @@ class opencache:
         if previous:
             print("Previous handler was %s" % previous)
 
-    def get_renderer(self, inpath, mode=None, theme=None):
+    def get_renderer(self, inpath, mode=None, theme=None, ftr_site_config=None):
         # We remove the ##offpunk_mode= from the URL
         # If mode is already set, we don’t use the part from the URL
         inpath, newmode = unmode_url(inpath)
@@ -155,6 +155,7 @@ class opencache:
                 # default mode is readable
                 mode = "readable"
         renderer = None
+
         path = netcache.get_cache_path(inpath)
         if path:
             usecache = inpath in self.rendererdic.keys() and not is_local(inpath)
@@ -175,7 +176,13 @@ class opencache:
                 else:
                     usecache = False
             if not usecache:
-                renderer = ansicat.renderer_from_file(path, url=inpath, theme=theme)
+                renderer = ansicat.renderer_from_file(
+                    path,
+                    url=inpath,
+                    theme=theme,
+                    source_url=inpath,
+                    ftr_site_config=ftr_site_config,
+                )
                 if renderer:
                     self.rendererdic[inpath] = renderer
                     self.renderer_time[inpath] = int(time.time())
@@ -189,7 +196,16 @@ class opencache:
         else:
             return None
 
-    def opnk(self, inpath, mode=None, terminal=True, grep=None, theme=None, **kwargs):
+    def opnk(
+        self,
+        inpath,
+        mode=None,
+        terminal=True,
+        grep=None,
+        theme=None,
+        ftr_site_config=None,
+        **kwargs,
+    ):
         # Return True if inpath opened in Terminal
         # False otherwise
         # also returns the url in case it has been modified
@@ -211,7 +227,9 @@ class opencache:
             else:
                 print("%s does not exist" % inpath)
                 return False, inpath
-        renderer = self.get_renderer(inpath, mode=mode, theme=theme)
+        renderer = self.get_renderer(
+            inpath, mode=mode, theme=theme, ftr_site_config=ftr_site_config
+        )
         if renderer and mode:
             renderer.set_mode(mode)
             self.last_mode[inpath] = mode
@@ -328,10 +346,21 @@ def main():
         help="maximum age, in second, of the cached version before \
                 redownloading a new version",
     )
+
+    parser.add_argument(
+        "--ftr-site-config",
+        type=str,
+        help="If using the `unmerdify` mode, you need to specify the path to the https://github.com/fivefilters/ftr-site-config directory locally.",
+    )
     args = parser.parse_args()
     cache = opencache()
     for f in args.content:
-        cache.opnk(f, mode=args.mode, validity=args.cache_validity)
+        cache.opnk(
+            f,
+            mode=args.mode,
+            validity=args.cache_validity,
+            ftr_site_config=args.ftr_site_config,
+        )
 
 
 if __name__ == "__main__":
diff --git a/unmerdify.py b/unmerdify.py
new file mode 100755
index 0000000..f4a7f43
--- /dev/null
+++ b/unmerdify.py
@@ -0,0 +1,574 @@
+#!/usr/bin/env python3
+
+import argparse
+import fileinput
+import glob
+import logging
+import logging.config
+import os
+import re
+from copy import deepcopy
+from dataclasses import dataclass
+from urllib.parse import urlparse
+
+from lxml import etree
+
+LOGGING = {
+    "version": 1,
+    "disable_existing_loggers": False,
+    "formatters": {
+        "default": {
+            "format": "[%(asctime)s] [%(levelname)8s] [%(filename)s:%(lineno)s - %(funcName).20s…] %(message)s",
+            "datefmt": "%Y-%m-%d %H:%M:%S",
+        }
+    },
+    "handlers": {
+        "stdout": {
+            "class": "logging.StreamHandler",
+            "stream": "ext://sys.stdout",
+            "formatter": "default",
+        }
+    },
+    "loggers": {"": {"handlers": ["stdout"], "level": "ERROR"}},
+}
+
+
+logging.config.dictConfig(LOGGING)
+
+
+LOGGER = logging.getLogger(__name__)
+
+
+def set_logging_level(level):
+    LOGGING["loggers"][""]["level"] = level
+    logging.config.dictConfig(LOGGING)
+
+
+@dataclass
+class Command:
+    """Class for keeping track of a command item."""
+
+    name: str
+    accept_multiple_values: bool = False
+    is_bool: bool = False
+    xpath_value: bool = False
+    has_capture_group: bool = False
+    special_command: bool = False
+    ignore: bool = False
+
+
+COMMANDS: list[Command] = [
+    Command("author", accept_multiple_values=True),
+    Command("autodetect_on_failure", is_bool=True),
+    Command("body", accept_multiple_values=True),
+    Command("date", accept_multiple_values=True),
+    Command("find_string", accept_multiple_values=True),
+    Command("http_header", has_capture_group=True, special_command=True),
+    Command("if_page_contains", special_command=True),
+    Command("login_extra_fields", accept_multiple_values=True),
+    Command("login_password_field"),
+    Command("login_uri"),
+    Command("login_username_field"),
+    Command("native_ad_clue", accept_multiple_values=True),
+    Command("next_page_link", accept_multiple_values=True),
+    Command("not_logged_in_xpath"),
+    Command("parser"),
+    Command("prune", is_bool=True),
+    Command("replace_string", has_capture_group=True, accept_multiple_values=True),
+    Command("requires_login", is_bool=True),
+    Command("src_lazy_load_attr"),
+    Command("single_page_link", accept_multiple_values=True),
+    Command("skip_json_ld", is_bool=True),
+    Command("strip", accept_multiple_values=True),
+    Command("strip_id_or_class", accept_multiple_values=True),
+    Command("strip_image_src", accept_multiple_values=True),
+    Command("test_contains", special_command=True),
+    Command("test_url", accept_multiple_values=True, special_command=True),
+    Command("tidy", is_bool=True),
+    Command("title", accept_multiple_values=True),
+    Command("wrap_in", has_capture_group=True, special_command=True),
+]
+
+COMMANDS_PER_NAME: dict[str, Command] = {
+    COMMANDS[i].name: COMMANDS[i] for i in range(0, len(COMMANDS))
+}
+
+
+def get_config_files(
+    site_config_dir: str, include_config_dir: bool = True
+) -> list[str]:
+    """
+    Read the *.txt files from the site_config directory and returns the file list.
+
+    Parameters:
+        site_config_dir (str): The path to the directory containing the config files
+        include_config_dir (bool): Should the config_dir be included in the returned list
+
+    Returns:
+        filenames (list[str]): The list of filenames found with the .txt extension
+    """
+    filenames: list[str] = []
+
+    for file in glob.iglob(f"{site_config_dir}/*.txt", include_hidden=True):
+        if file.endswith("LICENSE.txt"):
+            continue
+
+        if include_config_dir:
+            filenames.append(file)
+        else:
+            filenames.append(file.removeprefix(f"{site_config_dir}/"))
+
+    filenames.sort()
+    return filenames
+
+
+def get_host_for_url(url: str) -> str:
+    parsed_uri = urlparse(url)
+    return parsed_uri.netloc
+
+
+def get_possible_config_file_names_for_host(
+    host: str, file_extension: str = ".txt"
+) -> list[str]:
+    """
+    The five filters config files can be of the form
+
+    - .specific.domain.tld (for *.specific.domain.tld)
+    - specific.domain.tld (for this specific domain)
+    - .domain.tld (for *.domain.tld)
+    - domain.tld (for domain.tld)
+    """
+
+    parts = host.split(".")
+
+    if len(parts) < 2:
+        raise ValueError(
+            f"The host must be of the form `host.com`. It seems that there is no dot in the provided host: {host}"
+        )
+
+    tld = parts.pop()
+    domain = parts.pop()
+
+    first_possible_name = f"{domain}.{tld}{file_extension}"
+    possible_names = [first_possible_name, f".{first_possible_name}"]
+
+    # While we still have parts in the domain name, prepend the part
+    # and create the 2 new possible names
+    while len(parts) > 0:
+        next_part = parts.pop()
+        possible_name = f"{next_part}.{possible_names[-2]}"
+        possible_names.append(possible_name)
+        possible_names.append(f".{possible_name}")
+
+    # Put the most specific file names first
+    possible_names.reverse()
+
+    return possible_names
+
+
+def get_config_file_for_host(config_files: list[str], host: str) -> str | None:
+    possible_config_file_names = get_possible_config_file_names_for_host(host)
+
+    for config_file in config_files:
+        basename = os.path.basename(config_file)
+        for possible_config_file_name in possible_config_file_names:
+            if basename == possible_config_file_name:
+                return config_file
+
+
+def parse_site_config_file(config_file_path: str) -> dict | None:
+    config = {}
+    with open(config_file_path, "r") as file:
+        previous_command = None
+        while line := file.readline():
+            line = line.strip()
+
+            # skip comments, empty lines
+            if line == "" or line.startswith("#") or line.startswith("//"):
+                continue
+
+            command_name = None
+            command_value = None
+            pattern = re.compile(r"^([a-z_]+)(?:\((.*)\))*:[ ]*(.*)$", re.I)
+
+            result = pattern.search(line)
+
+            if not result:
+                logging.error(
+                    f"-> 🚨 ERROR: unknown line format for line `{line}` in file `{config_file_path}`. Skipping."
+                )
+                continue
+
+            command_name = result.group(1).lower()
+            command_arg = result.group(2)
+            command_value = result.group(3)
+
+            # strip_attr is now an alias for strip, for example:
+            # strip_attr: //img/@srcset
+            if "strip_attr" == command_name:
+                command_name = "strip"
+
+            command = COMMANDS_PER_NAME.get(command_name)
+
+            if command is None:
+                logging.warning(
+                    f"-> ⚠️ WARNING: unknown command name for line `{line}` in file `{config_file_path}`. Skipping."
+                )
+                continue
+
+            # Check for commands where we accept multiple statements but we don't have args provided
+            # It handles `replace_string: value` and not `replace_string(test): value`
+            if (
+                command.accept_multiple_values
+                and command_arg is None
+                and not command.special_command
+            ):
+                config.setdefault(command_name, []).append(command_value)
+            # Single value command that should evaluate to a bool
+            elif command.is_bool and not command.special_command:
+                config[command_name] = "yes" == command_value or "true" == command_value
+            # handle replace_string(test): value
+            elif command.name == "replace_string" and command_arg is not None:
+                config.setdefault("find_string", []).append(command_arg)
+                config.setdefault("replace_string", []).append(command_value)
+            # handle http_header(user-agent): Mozilla/5.2
+            elif command.name == "http_header" and command_arg is not None:
+                config.setdefault("http_header", []).append(
+                    {command_arg: command_value}
+                )
+            # handle if_page_contains: Xpath value
+            elif command.name == "if_page_contains":
+                # Previous command should be applied only if this expression is true
+                previous_command_value = config[previous_command.name]
+
+                # Move the previous command into the "if_page_contains" command
+                if (
+                    previous_command.accept_multiple_values
+                    and len(previous_command_value) > 0
+                ):
+                    config.setdefault("if_page_contains", {})[command_value] = {
+                        previous_command.name: previous_command_value.pop()
+                    }
+
+                # Remove the entire key entry if the values are now empty
+                if len(previous_command_value) == 0:
+                    config.pop(previous_command.name)
+
+            # handle wrap_in(tag): Xpath value
+            elif command.name == "wrap_in":
+                config.setdefault("wrap_in", []).append((command_arg, command_value))
+            elif command.name == "test_url":
+                config.setdefault("test_url", []).append(
+                    {command.name: command_value, "test_contains": []}
+                )
+            elif command.name == "test_contains":
+                test_url = config.get("test_url")
+                if test_url is None or len(test_url) == 0:
+                    logging.error(
+                        "-> 🚨 ERROR: No test_url found for given test_contains. Skipping."
+                    )
+                    continue
+
+                test_url[-1]["test_contains"].append(command_value)
+            else:
+                config[command_name] = command_value
+
+            previous_command = command
+
+    return config if config != {} else None
+
+
+def load_site_config_for_host(config_files: list[str], host: str) -> dict | None:
+    logging.debug(f"-> Loading site config for {host}")
+    config_file = get_config_file_for_host(config_files, host)
+
+    if config_file:
+        logging.debug(f"-> Found config file, loading {config_file} config.")
+        return parse_site_config_file(config_file)
+    else:
+        logging.debug(f"-> No config file found for host {host}.")
+
+
+def load_site_config_for_url(config_files: list[str], url: str) -> dict | None:
+    return load_site_config_for_host(config_files, get_host_for_url(url))
+
+
+# Content extractor code
+
+
+def replace_strings(site_config: dict, html: str) -> str:
+    replace_string_cmds = site_config.get("replace_string", [])
+    find_string_cmds = site_config.get("find_string", [])
+
+    if len(replace_string_cmds) == 0 and len(find_string_cmds) == 0:
+        return html
+
+    if len(replace_string_cmds) != len(find_string_cmds):
+        logging.error(
+            "🚨 ERROR: `replace_string` and `find_string` counts are not the same but must be, skipping string replacement."
+ ) + else: + nb_replacement = 0 + + for replace_string, find_string in zip(replace_string_cmds, find_string_cmds): + nb_replacement += html.count(find_string) + html = html.replace(find_string, replace_string) + + logging.debug( + f"Replaced {nb_replacement} string{'s'[:nb_replacement ^ 1]} using replace_string/find_string commands." + ) + + logging.debug(f"Html after string replacement: {html}") + + return html + + +def wrap_in(site_config: dict, lxml_tree): + for tag, pattern in site_config.get("wrap_in", []): + logging.debug(f"Wrap in `{tag}` => `{pattern}`") + elements = lxml_tree.xpath(pattern) + for element in elements: + parent = element.getparent() + newElement = etree.Element(tag) + newElement.append(deepcopy(element)) + parent.replace(element, newElement) + + +def strip_elements(site_config: dict, lxml_tree): + for pattern in site_config.get("strip", []): + remove_elements_by_xpath(pattern, lxml_tree) + + +def strip_elements_by_id_or_class(site_config: dict, lxml_tree): + for pattern in site_config.get("strip_id_or_class", []): + # Some entries contain " or ' + pattern = pattern.replace("'", "").replace('"', "") + remove_elements_by_xpath( + f"//*[contains(concat(' ',normalize-space(@class), ' '),' {pattern} ') or contains(concat(' ',normalize-space(@id),' '), ' {pattern} ')]", + lxml_tree, + ) + + +def strip_image_src(site_config: dict, lxml_tree): + for pattern in site_config.get("strip_image_src", []): + # Some entries contain " or ' + pattern = pattern.replace("'", "").replace('"', "") + remove_elements_by_xpath(f"//img[contains(@src,'{pattern}')]", lxml_tree) + + +def get_body_element(site_config: dict, lxml_tree): + body_contents = [] + + for pattern in site_config.get("body", []): + elements = lxml_tree.xpath(pattern) + + for body_element in elements: + body_contents.append(body_element) + + if len(body_contents) == 1: + return body_contents[0] + + if len(body_contents) > 1: + body = etree.Element("div") + for element in elements: + 
body.append(element) + + return body + + +def get_body_element_html(site_config: dict, lxml_tree): + body = get_body_element(site_config, lxml_tree) + if body is not None: + return etree.tostring(body, encoding="unicode") + + +def remove_hidden_elements(lxml_tree): + remove_elements_by_xpath( + "//*[contains(@style,'display:none') or contains(@style,'visibility:hidden')]", + lxml_tree, + ) + + +def remove_a_empty_elements(lxml_tree): + remove_elements_by_xpath( + "//a[not(./*) and normalize-space(.)='']", + lxml_tree, + ) + + +def remove_elements_by_xpath(xpath_expression, lxml_tree): + elements = lxml_tree.xpath(xpath_expression) + for element in elements: + if isinstance(element, etree._Element): + element.getparent().remove(element) + else: + logging.error( + f"🚨 ERROR: remove by xpath, element is not a Node, got {type(element)}." + ) + + +def get_xpath_value_for_command( + site_config: dict, command_name: str, lxml_tree +) -> str | None: + command_xpaths = site_config.get(command_name, []) + + for command_xpath in command_xpaths: + value = get_xpath_value(site_config, command_xpath, lxml_tree) + if value is not None: + return value + + +def get_multiple_xpath_values_for_command( + site_config: dict, command_name: str, lxml_tree +) -> list[str]: + command_xpaths = site_config.get(command_name, []) + values = [] + + for command_xpath in command_xpaths: + values = values + get_multiple_xpath_values( + site_config, command_xpath, lxml_tree + ) + + return values + + +def get_xpath_value(site_config: dict, xpath: str, lxml_tree): + elements = lxml_tree.xpath(xpath) + + if isinstance(elements, str) or isinstance(elements, etree._ElementUnicodeResult): + return str(elements) + + for element in elements: + # Return the first entry found + if isinstance(element, str) or isinstance(element, etree._ElementUnicodeResult): + return str(element) + else: + value = etree.tostring(element, method="text", encoding="unicode").strip() + return " ".join(value.split()).replace("\n", 
"") + + +def get_multiple_xpath_values(site_config: dict, xpath: str, lxml_tree): + values = [] + + elements = lxml_tree.xpath(xpath) + + if isinstance(elements, str) or isinstance(elements, etree._ElementUnicodeResult): + return str(elements) + + for element in elements: + # Return the first entry found + + if isinstance(element, str) or isinstance(element, etree._ElementUnicodeResult): + values.append(str(element)) + else: + value = etree.tostring(element, method="text", encoding="unicode").strip() + value = " ".join(value.split()).replace("\n", "") + values.append(value) + + return values + + +def get_body(site_config: dict, html: str): + html = replace_strings(site_config, html) + html_parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True) + + tree = etree.fromstring(html, html_parser) + + wrap_in(site_config, tree) + strip_elements(site_config, tree) + strip_elements_by_id_or_class(site_config, tree) + strip_image_src(site_config, tree) + remove_hidden_elements(tree) + remove_a_empty_elements(tree) + + return get_body_element_html(site_config, tree) + + +def main() -> int: + parser = argparse.ArgumentParser( + description="Get the content, only the content: unenshittificator for the web" + ) + + parser.add_argument( + "ftr_site_config", + type=str, + help="The path to the https://github.com/fivefilters/ftr-site-config directory, or a path to a config file.", + ) + + parser.add_argument( + "-u", + "--url", + type=str, + help="The url you want to unmerdify.", + ) + + parser.add_argument( + "files", + metavar="FILE", + nargs="*", + help="Files to read, if empty, stdin is used.", + ) + + parser.add_argument( + "-l", + "--loglevel", + default=logging.ERROR, + choices=logging.getLevelNamesMapping().keys(), + help="Set log level", + ) + + # @TODO: extract open graph information if any + # https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L1241 + args = parser.parse_args() + + set_logging_level(args.loglevel) + + if 
os.path.isdir(args.ftr_site_config) and args.url is None:
+        logging.error(
+            "ERROR: You must provide a URL with --url if you don't provide a specific config file.",
+        )
+        return 1
+
+    if os.path.isdir(args.ftr_site_config):
+        config_files = get_config_files(args.ftr_site_config)
+        loaded_site_config = load_site_config_for_url(config_files, args.url)
+    else:
+        loaded_site_config = parse_site_config_file(args.ftr_site_config)
+
+    if loaded_site_config is None:
+        logging.error(f"Unable to load site config for `{args.ftr_site_config}`.")
+        return 1
+
+    html = ""
+    # When argparse got no files, pass '-' as the only file so that fileinput reads from stdin
+    for line in fileinput.input(
+        files=args.files if len(args.files) > 0 else ("-",),
+        openhook=fileinput.hook_encoded("utf-8"),
+    ):
+        html += line
+
+    html_replaced = replace_strings(loaded_site_config, html)
+    html_parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)
+
+    tree = etree.fromstring(html_replaced, html_parser)
+
+    title = get_xpath_value_for_command(loaded_site_config, "title", tree)
+    logging.debug(f"Got title `{title}`.")
+
+    authors = get_multiple_xpath_values_for_command(loaded_site_config, "author", tree)
+
+    logging.debug(f"Got authors {authors}.")
+
+    date = get_xpath_value_for_command(loaded_site_config, "date", tree)
+
+    logging.debug(f"Got date `{date}`.")
+
+    body_html = get_body(loaded_site_config, html)
+
+    print(body_html)
+
+    return 0
+
+
+if __name__ == "__main__":
+    # Propagate main()'s return value as the process exit status
+    raise SystemExit(main())
--
2.48.1
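
An aside on the text cleanup in get_xpath_value and get_multiple_xpath_values: the sketch below uses only the standard library and suggests that `" ".join(value.split())` already removes newlines, so the trailing `.replace("\n", "")` looks redundant. `normalize_text` and `sample` are hypothetical names used for illustration only; they are not part of the patch.

```python
def normalize_text(value: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty
    # strings, so the joined result cannot contain a newline.
    return " ".join(value.split())


# Hypothetical input resembling text extracted from an HTML node.
sample = "  Hello\n\n   world,\tthis is\r\n unmerdify  "
normalized = normalize_text(sample)
print(repr(normalized))
```

If that holds for the inputs unmerdify sees, the `.replace` call could probably be dropped without changing behavior, but that is a suggestion to check rather than a confirmed finding.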