Vitaly Ovchinnikov: 1 hyperlinks: better parsing of emails without mailto prefixes 2 files changed, 36 insertions(+), 1 deletions(-)
Copy & paste the following snippet into your terminal to import this patchset into git:
curl -s https://lists.sr.ht/~rjarry/aerc-devel/patches/44615/mbox | git am -3
Add some new tests from the emails I have and make them work by adjusting the code that looks for hyperlinks. The idea is to treat "inline" emails (those without mailto:) a little differently and stop a little earlier while looking for their ends.

Signed-off-by: Vitaly Ovchinnikov <v@postbox.nz>
---
The new test pieces are from an HTML-only email where the email address was used without the mailto: prefix a couple of times, this way:

<a href="mailto:a@abc.com">a@abc.com</a>

I am not quite sure about the `inlineEmail` line, as I took the algorithm "as is" from the current code. It works fine, though.

Another option could be using the processed HTML (if any) while searching for links. This might be a little more predictable, as the other software does all the dirty work. On the other hand, we might lose some links this way.

Overall, I believe many more checks should be added here in the future. The `urichars` slice should probably be replaced with something else, as we now have Unicode domain names.

Anyway, this particular fix solves a number of particular problems I have, so why not? If it breaks something, let's add that to the tests and fix the whole thing again :)
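To illustrate the "stop a little earlier" idea, here is a minimal standalone sketch. It is not the aerc implementation: `linkEnd`, `uriChars`, and `inlineStops` are hypothetical names, the character set is a simplified assumption, and the bracket counters of the real code are collapsed into a plain set of stop bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// uriChars is an assumed, simplified set of bytes that may appear in a link.
// The real urichars slice in aerc may differ.
const uriChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" +
	"0123456789-_.,~:;/?@!$&%*+=()[]<>'"

// inlineStops lists bytes that end an "inline" email (one without mailto:).
const inlineStops = "&()[]<>"

// linkEnd returns the index at which a link starting at s[0] ends.
// For inline emails it stops at the first byte found in inlineStops.
func linkEnd(s string, inlineEmail bool) int {
	j := 0
	for j < len(s) && strings.IndexByte(uriChars, s[j]) != -1 {
		if inlineEmail && strings.IndexByte(inlineStops, s[j]) != -1 {
			break
		}
		j++
	}
	return j
}

func main() {
	text := `a@abc.com</a><br/><p>more text</p>`
	fmt.Println(text[:linkEnd(text, true)])  // a@abc.com
	fmt.Println(text[:linkEnd(text, false)]) // a@abc.com</a><br/><p>more
}
```

Without the inline-email rule, the scan in this sketch runs through the HTML tags until the first space, which is the kind of over-long match the new tests are meant to catch.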
Acked-by: Robin-Jarry <robin@jarry.cc> Applied. Thanks!
 lib/parse/hyperlinks.go      | 17 ++++++++++++++++-
 lib/parse/hyperlinks_test.go | 20 ++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/lib/parse/hyperlinks.go b/lib/parse/hyperlinks.go
index dd334dd..1200f93 100644
--- a/lib/parse/hyperlinks.go
+++ b/lib/parse/hyperlinks.go
@@ -37,6 +37,9 @@ func HttpLinks(r io.Reader) (io.Reader, []string) {
 		scheme = j - i
 		j = scheme
 
+		// "inline" email without a mailto: prefix - add some extra checks for those
+		inlineEmail := len(match) > 4 && match[2] == -1 && match[4] == -1
+
 		for !emitUrl && j < len(b) && bytes.IndexByte(urichars, b[j]) != -1 {
 			switch b[j] {
 			case '[':
@@ -69,9 +72,21 @@ func HttpLinks(r io.Reader) (io.Reader, []string) {
 				} else {
 					j++
 				}
+			case '&':
+				if inlineEmail {
+					emitUrl = true
+				} else {
+					j++
+				}
 			default:
 				j++
 			}
+
+			// we don't want those in inline emails
+			if inlineEmail && (paren > 0 || ltgt > 0 || bracket > 0) {
+				j--
+				emitUrl = true
+			}
 		}
 
 		// Heuristic to remove trailing characters that are
@@ -91,7 +106,7 @@ func HttpLinks(r io.Reader) (io.Reader, []string) {
 			continue
 		}
 		url := string(b[:j])
-		if match[2] == -1 && match[4] == -1 {
+		if inlineEmail {
 			// Email address with missing mailto: scheme. Add it.
 			url = "mailto:" + url
 		}
diff --git a/lib/parse/hyperlinks_test.go b/lib/parse/hyperlinks_test.go
index cedad64..00a0676 100644
--- a/lib/parse/hyperlinks_test.go
+++ b/lib/parse/hyperlinks_test.go
@@ -114,6 +114,26 @@ func TestHyperlinks(t *testing.T) {
 			text: "You can reach me via the somewhat strange, but nonetheless valid, email mailto:~mpldr/list@[2001:db8::7]?subject=whazzup%3F",
 			links: []string{"mailto:~mpldr/list@[2001:db8::7]?subject=whazzup%3F"},
 		},
+		{
+			name: "simple email in <a href>",
+			text: `<a href="mailto:a@abc.com" rel="noopener noreferrer">`,
+			links: []string{"mailto:a@abc.com"},
+		},
+		{
+			name: "simple email in <a> body",
+			text: `<a href="#" rel="noopener noreferrer">a@abc.com</a><br/><p>more text</p>`,
+			links: []string{"mailto:a@abc.com"},
+		},
+		{
+			name: "emails in <a> href and body",
+			text: `<a href="mailto:a@abc.com" rel="noopener noreferrer">b@abc.com</a><br/><p>more text</p>`,
+			links: []string{"mailto:a@abc.com", "mailto:b@abc.com"},
+		},
+		{
+			name: "email in <...>",
+			text: `<div>01.02.2023, 10:11, "Firstname Lastname" <a@abc.com>:</div>`,
+			links: []string{"mailto:a@abc.com"},
+		},
 	}
 
 	for i, test := range tests {
-- 
2.39.2 (Apple Git-143)
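The `inlineEmail` condition in the diff relies on a property of Go's `regexp` package: in the index slice returned by the `Find...SubmatchIndex` methods, a capture group that did not take part in the match is reported as -1. The sketch below shows that behaviour with a deliberately simplified pattern. The regular expression and the `isInlineEmail` helper are assumptions made for illustration and are not the ones used in `lib/parse/hyperlinks.go`.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRe is a hypothetical, simplified pattern with two capture groups:
// group 1 is an http(s) scheme, group 2 is an optional mailto scheme.
var linkRe = regexp.MustCompile(
	`(?:(https?)://\S+)|(?:(?:(mailto):)?[a-z0-9._-]+@[a-z0-9.-]+)`)

// isInlineEmail reports whether the first link in s matched with neither
// scheme group participating, i.e. a bare email address.
func isInlineEmail(s string) bool {
	match := linkRe.FindStringSubmatchIndex(s)
	// match[2] and match[4] are the start offsets of groups 1 and 2;
	// -1 means the group did not participate in the match.
	return len(match) > 4 && match[2] == -1 && match[4] == -1
}

func main() {
	fmt.Println(isInlineEmail("write to a@abc.com")) // true
	fmt.Println(isInlineEmail("mailto:a@abc.com"))   // false
	fmt.Println(isInlineEmail("https://example.org")) // false
}
```

The `len(match) > 4` guard also covers the no-match case, where the returned slice is nil.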
builds.sr.ht <builds@sr.ht>
aerc/patches: SUCCESS in 5m16s

[hyperlinks: better parsing of emails without mailto prefixes][0] from [Vitaly Ovchinnikov][1]

[0]: https://lists.sr.ht/~rjarry/aerc-devel/patches/44615
[1]: mailto:v@postbox.nz

✓ #1056813 SUCCESS aerc/patches/openbsd.yml https://builds.sr.ht/~rjarry/job/1056813
✓ #1056812 SUCCESS aerc/patches/alpine-edge.yml https://builds.sr.ht/~rjarry/job/1056812
Vitaly Ovchinnikov, Sep 12, 2023 at 23:19: