~lioploum/offpunk-devel

Integrating `unmerdify` to allow custom HTML parsing rules

Details
Message ID
<4841675b-b03e-4c80-9677-ddc18d840656@jousse.org>
Hi there,

Following some recent discussions on the mailing list, I took the time
to integrate `unmerdify` into offpunk.

`unmerdify` (https://codeberg.org/vjousse/unmerdify) is a Python library
that uses the rules from https://github.com/fivefilters/ftr-site-config
to extract the content of a web page. At the moment these rules are
mainly used by wallabag (https://github.com/wallabag/wallabag/).

The idea is to use `unmerdify` if it is installed and if a rule exists
for the current URL. If `unmerdify` can't find a rule for the current
URL, offpunk falls back to its current behavior of using Readability.
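
As a rough illustration of that decision (a sketch only, not the code
from the pull request), the fallback could look like the following.
The function names are hypothetical, and the lookup assumes that rule
files in the local ftr-site-config copy are named "<hostname>.txt";
the actual rule matching is done by `unmerdify` itself and may differ.

```python
import importlib.util
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def unmerdify_installed() -> bool:
    """Return True if the unmerdify package can be imported."""
    return importlib.util.find_spec("unmerdify") is not None


def find_rule_file(url: str, config_dir: Optional[str]) -> Optional[Path]:
    """Hypothetical helper: look for a rule file matching the URL's host.

    Assumption: rules are stored as "<hostname>.txt" inside config_dir.
    """
    if not config_dir:
        return None
    host = urlparse(url).hostname or ""
    candidates = [host]
    if host.startswith("www."):
        # Also try the host without its "www." prefix.
        candidates.append(host[4:])
    for name in candidates:
        if not name:
            continue
        path = Path(config_dir) / f"{name}.txt"
        if path.is_file():
            return path
    return None


def choose_extractor(
    url: str,
    config_dir: Optional[str],
    have_unmerdify: Optional[bool] = None,
) -> str:
    """Return "unmerdify" when it is usable for this URL, else "readability"."""
    if have_unmerdify is None:
        have_unmerdify = unmerdify_installed()
    if have_unmerdify and find_rule_file(url, config_dir) is not None:
        return "unmerdify"
    return "readability"
```

The `have_unmerdify` parameter only exists so the decision can be
exercised without the library being installed.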

You can find the current integration here: 
https://codeberg.org/vjousse/offpunk/pulls/1

You can git clone the project locally, switch to the feat/add-unmerdify 
branch and test the integration.

git clone git@codeberg.org:vjousse/offpunk.git
cd offpunk
git checkout feat/add-unmerdify
pip install -r feat/add-unmerdify
# or: pip install git+ssh://git@codeberg.org/vjousse/unmerdify.git

You will then need to have a local copy of the rules available here: 
https://github.com/fivefilters/ftr-site-config

After that, running offpunk with the --ftr-site-config flag pointing to
that local copy should do the trick:

./offpunk.py --ftr-site-config /home/vjousse/usr/projects/unmerdify/ftr-site-config

You can then go to https://ploum.net and check that the page is
rendered as expected and not truncated, as is the case with the
current version of offpunk.

Don't hesitate to provide feedback. Later on, I will send a patch to
this mailing list to integrate it into the master branch on sr.ht.

Regards,
Vince