~rjarry/aerc-devel

aerc: hyperlinks: better parsing of emails without mailto prefixes v1 APPLIED

Vitaly Ovchinnikov: 1
 hyperlinks: better parsing of emails without mailto prefixes

 2 files changed, 36 insertions(+), 1 deletions(-)
#1056812 alpine-edge.yml success
#1056813 openbsd.yml success

[PATCH aerc] hyperlinks: better parsing of emails without mailto prefixes

Add some new tests from the emails I have and make them work by
adjusting the code that looks for hyperlinks.

The idea is to treat "inline" emails (those without mailto:) a little
differently and stop a little earlier while looking for their ends.

Signed-off-by: Vitaly Ovchinnikov <v@postbox.nz>
---

The new test pieces are from an HTML-only email where the email address
was used without a mailto: prefix a couple of times, like this:
<a href="mailto:a@abc.com">a@abc.com</a>

I am not quite sure about the `inlineEmail` line, as I took the
algorithm "as is" from the current code. It works fine, though.
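
For readers unfamiliar with the check: Go's regexp package reports
submatch positions as index pairs, and a capture group that did not
participate in the match gets -1. A minimal sketch, assuming a
hypothetical pattern (`linkRe` below is NOT the regex aerc uses) where
group 1 is a scheme prefix and group 2 is a mailto: prefix:

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRe is a hypothetical pattern for illustration only, not aerc's regex.
// Group 1 captures an optional "http(s)://" prefix, group 2 an optional
// "mailto:" prefix; the remainder matches a simple ASCII email address.
var linkRe = regexp.MustCompile(`(?:(https?://)|(mailto:))?[\w.+-]+@[\w-]+(?:\.[\w-]+)+`)

// isInlineEmail mirrors the condition from the patch: match[2] and match[4]
// are the start offsets of groups 1 and 2, and -1 means the group did not
// take part in the match, i.e. neither prefix was present.
func isInlineEmail(match []int) bool {
	return len(match) > 4 && match[2] == -1 && match[4] == -1
}

func main() {
	for _, s := range []string{"write to a@abc.com", "mailto:a@abc.com"} {
		m := linkRe.FindStringSubmatchIndex(s)
		fmt.Printf("%q inline=%v\n", s, isInlineEmail(m))
	}
}
```

With this assumed pattern, the bare address is reported as inline and
the mailto: one is not.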

Another option could be to use the processed HTML (if any) when
searching for links. This might be a little more predictable, as the
other software does all the dirty work. On the other hand, we might lose
some links this way.

Overall, I believe many more checks should be added here in the
future. The `urichars` slice should probably be replaced with something
else, as we now have Unicode domain names. Anyway, this particular fix
solves a number of specific problems I have, so why not? If it breaks
something, let's add that to the tests and fix the whole thing again :)
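
To illustrate the "stop a little earlier" idea in isolation, here is a
simplified sketch. It is not the patched code: `uriChars` is an assumed
stand-in for the real `urichars` slice, and `inlineEmailEnd` is a
hypothetical helper that ends an inline email at the first '&' or
bracket-like character instead of tracking nesting counters.

```go
package main

import (
	"bytes"
	"fmt"
)

// uriChars is an assumed, simplified stand-in for the `urichars` slice in
// lib/parse/hyperlinks.go; the real set may differ.
var uriChars = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" +
	"0123456789-._~:/?#@!$&'*+,;=%()[]<>")

// inlineEmailEnd is a hypothetical helper. Given text that starts at an
// email address without a mailto: prefix, it returns the offset where the
// address ends: at the first byte outside uriChars, or at the first '&',
// angle bracket, parenthesis, or square bracket.
func inlineEmailEnd(b []byte) int {
	j := 0
	for j < len(b) && bytes.IndexByte(uriChars, b[j]) != -1 {
		switch b[j] {
		case '&', '<', '>', '(', ')', '[', ']':
			// These are valid URI bytes but should not end up
			// inside an inline email address.
			return j
		}
		j++
	}
	return j
}

func main() {
	for _, s := range []string{"a@abc.com&gt;:</div>", "b@abc.com</a><br/>"} {
		fmt.Println("mailto:" + s[:inlineEmailEnd([]byte(s))])
	}
}
```

Both inputs are taken from the new test cases; the sketch cuts each
address off right before the `&gt;` entity or the closing tag.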

 lib/parse/hyperlinks.go      | 17 ++++++++++++++++-
 lib/parse/hyperlinks_test.go | 20 ++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/lib/parse/hyperlinks.go b/lib/parse/hyperlinks.go
index dd334dd..1200f93 100644
--- a/lib/parse/hyperlinks.go
+++ b/lib/parse/hyperlinks.go
@@ -37,6 +37,9 @@ func HttpLinks(r io.Reader) (io.Reader, []string) {
 		scheme = j - i
 		j = scheme
 
+		// "inline" email without a mailto: prefix - add some extra checks for those
+		inlineEmail := len(match) > 4 && match[2] == -1 && match[4] == -1
+
 		for !emitUrl && j < len(b) && bytes.IndexByte(urichars, b[j]) != -1 {
 			switch b[j] {
 			case '[':
@@ -69,9 +72,21 @@ func HttpLinks(r io.Reader) (io.Reader, []string) {
 				} else {
 					j++
 				}
+			case '&':
+				if inlineEmail {
+					emitUrl = true
+				} else {
+					j++
+				}
 			default:
 				j++
 			}
+
+			// we don't want those in inline emails
+			if inlineEmail && (paren > 0 || ltgt > 0 || bracket > 0) {
+				j--
+				emitUrl = true
+			}
 		}
 
 		// Heuristic to remove trailing characters that are
@@ -91,7 +106,7 @@ func HttpLinks(r io.Reader) (io.Reader, []string) {
 			continue
 		}
 		url := string(b[:j])
-		if match[2] == -1 && match[4] == -1 {
+		if inlineEmail {
 			// Email address with missing mailto: scheme. Add it.
 			url = "mailto:" + url
 		}
diff --git a/lib/parse/hyperlinks_test.go b/lib/parse/hyperlinks_test.go
index cedad64..00a0676 100644
--- a/lib/parse/hyperlinks_test.go
+++ b/lib/parse/hyperlinks_test.go
@@ -114,6 +114,26 @@ func TestHyperlinks(t *testing.T) {
 			text:  "You can reach me via the somewhat strange, but nonetheless valid, email mailto:~mpldr/list@[2001:db8::7]?subject=whazzup%3F",
 			links: []string{"mailto:~mpldr/list@[2001:db8::7]?subject=whazzup%3F"},
 		},
+		{
+			name:  "simple email in <a href>",
+			text:  `<a href="mailto:a@abc.com" rel="noopener noreferrer">`,
+			links: []string{"mailto:a@abc.com"},
+		},
+		{
+			name:  "simple email in <a> body",
+			text:  `<a href="#" rel="noopener noreferrer">a@abc.com</a><br/><p>more text</p>`,
+			links: []string{"mailto:a@abc.com"},
+		},
+		{
+			name:  "emails in <a> href and body",
+			text:  `<a href="mailto:a@abc.com" rel="noopener noreferrer">b@abc.com</a><br/><p>more text</p>`,
+			links: []string{"mailto:a@abc.com", "mailto:b@abc.com"},
+		},
+		{
+			name:  "email in &lt;...&gt;",
+			text:  `<div>01.02.2023, 10:11, "Firstname Lastname" &lt;a@abc.com&gt;:</div>`,
+			links: []string{"mailto:a@abc.com"},
+		},
 	}
 
 	for i, test := range tests {
-- 
2.39.2 (Apple Git-143)
aerc/patches: SUCCESS in 5m16s

[hyperlinks: better parsing of emails without mailto prefixes][0] from [Vitaly Ovchinnikov][1]

[0]: https://lists.sr.ht/~rjarry/aerc-devel/patches/44615
[1]: mailto:v@postbox.nz

✓ #1056813 SUCCESS aerc/patches/openbsd.yml     https://builds.sr.ht/~rjarry/job/1056813
✓ #1056812 SUCCESS aerc/patches/alpine-edge.yml https://builds.sr.ht/~rjarry/job/1056812
Vitaly Ovchinnikov, Sep 12, 2023 at 23:19: