~lioploum/offpunk-devel


Integrating `unmerdify` to allow custom HTML parsing rules

Message ID: <4841675b-b03e-4c80-9677-ddc18d840656@jousse.org>

Hi there,

Following some recent discussions on the mailing list, I took the time
to integrate `unmerdify` into offpunk.

`unmerdify` (https://codeberg.org/vjousse/unmerdify) is a Python library
that uses the rules from https://github.com/fivefilters/ftr-site-config
to extract the content of a web page (at the moment the rules are mainly
used by wallabag, https://github.com/wallabag/wallabag/).

The idea is to use `unmerdify` if it is installed and if a rule for the
current URL exists. Offpunk will fall back to the current behavior of
using Readability if `unmerdify` can't find a rule for the current URL.
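
As a rough illustration of that fallback (a sketch only, not the code from the pull request): the helper names `has_site_rule` and `pick_extractor` are hypothetical, and the sketch assumes the rules directory holds one `<hostname>.txt` file per site, which is how ftr-site-config appears to be laid out.

```python
import os
from urllib.parse import urlparse


def has_site_rule(url, rules_dir):
    """Return True if rules_dir contains a rule file for the URL's host.

    Assumption: rule files are named "<hostname>.txt", with an optional
    leading "www." stripped from the hostname.
    """
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return bool(host) and os.path.isfile(os.path.join(rules_dir, host + ".txt"))


def pick_extractor(url, rules_dir, unmerdify_available):
    """Choose "unmerdify" only when it is installed and a rule exists."""
    if unmerdify_available and rules_dir and has_site_rule(url, rules_dir):
        return "unmerdify"
    # No rule (or no unmerdify): keep the current Readability behavior.
    return "readability"
```

The real integration would call the chosen extractor instead of returning its name; returning a string just keeps the decision easy to test.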

You can find the current integration here: 
https://codeberg.org/vjousse/offpunk/pulls/1

You can git clone the project locally, switch to the feat/add-unmerdify 
branch and test the integration.

git clone git@codeberg.org:vjousse/offpunk.git
cd offpunk
git checkout feat/add-unmerdify
pip install -r feat/add-unmerdify  # or: pip install git+ssh://git@codeberg.org/vjousse/unmerdify.git

You will then need to have a local copy of the rules available here: 
https://github.com/fivefilters/ftr-site-config

After that, calling offpunk with the --ftr-site-config flag pointing to
that local copy should do the trick:

./offpunk.py --ftr-site-config /home/vjousse/usr/projects/unmerdify/ftr-site-config

You can try going to https://ploum.net and check that the page is
rendered as expected and not truncated, as is the case with the
current version of offpunk.

Don't hesitate to provide feedback. I will later send a patch to this
mailing list to integrate it into the sr.ht master branch.

Regards,
Vince

Message ID: <173799058276.7.3632282113790686329.579172841@ploum.eu>
In-Reply-To: <4841675b-b03e-4c80-9677-ddc18d840656@jousse.org>
Sender timestamp: 1737990572

Hi Vincent,

I’ve downloaded unmerdify and ftr-site-config. I’ve checked your offpunk
branch, but there are tons of changes which seem unrelated to unmerdify.

Would you mind sending a minimal patch for it? Or maybe we can work on
it together.

The idea would be to add unmerdify support as an alternative to 
python-readability. 

There would be no arguments at launch but simply two options in offpunk
itself: unmerdify_path and ftr_path. When both are set to valid paths,
offpunk would transparently use unmerdify instead of readability.

(offpunk could also automatically check $PATH for unmerdify in case it
is installed through a package manager)
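
To make the proposal concrete, here is a minimal sketch of that check. It is only an illustration under assumptions: `options` is a plain dict standing in for offpunk's option store, the function names are hypothetical, and the $PATH lookup assumes a packaged unmerdify would be reachable under the name "unmerdify".

```python
import os
import shutil


def resolve_unmerdify(options):
    """Return (unmerdify_path, ftr_path) if both are usable, else None."""
    ftr_path = options.get("ftr_path")
    # Fall back to a $PATH lookup when unmerdify_path is not set
    # (assumption: the packaged tool is called "unmerdify").
    unmerdify_path = options.get("unmerdify_path") or shutil.which("unmerdify")
    if not unmerdify_path or not ftr_path:
        return None
    # Both options must point to something that actually exists.
    if not os.path.exists(unmerdify_path) or not os.path.isdir(ftr_path):
        return None
    return unmerdify_path, ftr_path


def extractor_name(options):
    """Use unmerdify transparently when configured, readability otherwise."""
    return "unmerdify" if resolve_unmerdify(options) else "readability"
```

With neither option set, `extractor_name({})` returns "readability", so users who configure nothing see no change in behavior.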


That would be a perfect "first" step and that could be 100% releasable
as is.

The next step would be to automatically fetch the ftr config for a
visited site and cache it, but we will keep that for further development
once we are sure unmerdify is working well.
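
For that later step, the caching part could look roughly like the sketch below. Everything in it is an assumption for discussion: the function name is hypothetical, the cache layout (one `<hostname>.txt` file per site) simply mirrors ftr-site-config, and the network access is left to a `fetch` callable supplied by the caller.

```python
import os


def cached_site_config(host, cache_dir, fetch):
    """Return the rule text for host, fetching it at most once.

    `fetch` is any callable taking a hostname and returning the rule text,
    or None when no rule exists; the network part is left out on purpose.
    `host` is expected to be a bare hostname (no path separators).
    """
    path = os.path.join(cache_dir, host + ".txt")
    if os.path.isfile(path):
        # Cache hit: never touch the network again for this site.
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch(host)
    if text is not None:
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    return text
```

Sites without a rule are not cached here, so they would be looked up again on each visit; whether to remember such misses is a design choice left open.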


For me to progress on this, could you simply send me the patch with what 
you did (and without all the added cruft)?




-- 
Ploum - Lionel Dricot

Blog: https://www.ploum.net
Bikepunk: https://bikepunk.fr/

Message ID: <7311b077-1093-44d8-8c5d-fba90a0fde2a@jousse.org>
In-Reply-To: <173799058276.7.3632282113790686329.579172841@ploum.eu>
Sender timestamp: 1737995786

Hi,

I’ll have a look. The changes are due to automatic linting by ruff; I’ll
have to find a way to deactivate it for offpunk, I guess ^^

I’ll keep you posted.


Message ID: <173799426996.7.17794930333288806276.579252624@ploum.eu>
In-Reply-To: <7311b077-1093-44d8-8c5d-fba90a0fde2a@jousse.org>
Sender timestamp: 1737994264

On 25 Jan 27 04:36, Vincent Jousse wrote:
>Hi,
>
>I’ll have a look, the changes are due to automatic linting by ruff, I’ll
>have to find a way to deactivate it for offpunk I guess ^^

I’m not against patches that do some cleaning, but these should be kept
separate from functional patches.