~edwardloveall/scribe

Handling link.medium.com

Martin Puppe <dev@mpuppe.de>
Message-ID: <63f25535f998c70011966c6eab6f9e790af6df0a.camel@mpuppe.de>

Thank you for merging my patches yesterday.

I have been thinking about link.medium.com. Users get URLs with this
domain, for example https://link.medium.com/AXEtCilplkb, when they
click the Twitter button on the Medium page of a post and share the
post on Twitter. There may be other ways to get such URLs. Currently,
Scribe cannot handle them because the path does not contain a post ID.

These URLs are shortened URLs, and Medium redirects any request for
them to the "real" URL of the post. For privacy reasons, I personally
think it would be desirable that Scribe also handle these URLs[^1].
Scribe would have to make a request for the shortened URL to get the
real URL. It could then redirect the user to the corresponding Scribe
URL. What do you think? Do you agree? Or is this outside the scope of
the project?

I see three problems that have to be considered if it is decided that
Scribe should resolve shortened URLs. The first problem is the
trickiest and most of the following is about that.

The first problem is that Scribe does not know the complete original
URL. It only gets the path, not the host part of the original URL. The
question is then how Scribe can discriminate shortened (link.) URLs
from regular URLs. Redirector (definitely) and Privacy Redirect (maybe)
could be made to submit the original host part as an additional query
parameter. Scribe could then use this information to discriminate.
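
A minimal sketch of that idea, in Python rather than Crystal. The
parameter name `original_host` and the function name are hypothetical;
neither Redirector nor Privacy Redirect sends such a parameter today.

```python
from urllib.parse import parse_qs, urlsplit

SHORT_LINK_HOST = "link.medium.com"
HOST_PARAM = "original_host"  # hypothetical query parameter name


def is_short_link(request_url: str) -> bool:
    """Return True if the redirecting extension reported link.medium.com
    as the host of the URL the user originally opened."""
    query = parse_qs(urlsplit(request_url).query)
    original_host = query.get(HOST_PARAM, [""])[0]
    return original_host.lower() == SHORT_LINK_HOST
```

A request without the parameter is treated as a regular URL, so existing
redirect rules would keep working unchanged.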

Alternatively, one could try to discriminate without any additional
information. Here are some shortened URLs from the first few posts of
the front page:

* https://link.medium.com/old9A0e6Xjb  
* https://link.medium.com/uIRDIB6blkb  
* https://link.medium.com/P7MvrqCiyib  
* https://link.medium.com/0jKuhQkwlkb  
* https://link.medium.com/8tO698PZ7jb  
* https://link.medium.com/NmO7jB8unkb  
* https://link.medium.com/YR7f4Ad7ekb

And here are some post IDs:

* 7036d2d2b0d8
* 7c5211d9ee82
* 715ce90e6cdc
* d20f8a7136bf
* eb2c769b8fb2
* ebf14148ca3b
* 3e9d9de98384
* 3619fd56d07d  

A regular expression that covers the paths of the shortened URLs would
look like this: /[a-zA-Z0-9]+$/. Compare this to the regular expression
for post IDs: /[a-f0-9]+$/. There are 16 out of 62 symbols that may
appear in both kinds of strings. Assuming an even distribution and a
length of eleven characters, the probability that the path of a
shortened URL contains only these 16 characters is
p = (16/62)^11 ≈ 3.3807e-7, which is pretty low. Now, the more URLs we
look at, the higher the probability that we encounter at least one that
contains only the 16 ambiguous characters. For n URLs, the probability
that we encounter only URLs that can be clearly discriminated is
q = (1 - p)^n. It obviously converges to 0. But how fast does it
converge? Here are some numbers:

* n =          1: q ≈ 99.99997 %
* n =      1,000: q ≈ 99.97 %
* n =     10,000: q ≈ 99.66 %
* n =    100,000: q ≈ 96.68 %
* n =  1,000,000: q ≈ 71.31 %
* n =  2,000,000: q ≈ 50.86 %
* n = 10,000,000: q ≈  3.40 %
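
The numbers above can be reproduced with a few lines of Python. This is
only a sketch of the arithmetic under the same assumptions (uniformly
distributed characters, paths of eleven characters); the names `p` and
`q` follow the text.

```python
# Probability that an 11-character path drawn uniformly from 62 symbols
# uses only the 16 symbols that are also valid in a post ID.
p = (16 / 62) ** 11


def q(n: int) -> float:
    """Probability that none of n shortened URLs is ambiguous."""
    return (1 - p) ** n


if __name__ == "__main__":
    print(f"p = {p:.4e}")
    for n in (1, 1_000, 10_000, 100_000, 1_000_000, 2_000_000, 10_000_000):
        print(f"n = {n:>10,}: q = {q(n) * 100:.5f} %")
```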

I think it is safe to say that most users will never in their lifetime
submit a URL to Scribe where we cannot clearly discriminate whether it
is a shortened URL or not. And I have not even considered that
non-shortened URLs usually contain more than just the post ID. In any
case, if the URL is ambiguous, we can treat it as a regular URL, and if
that fails, we can fall back to treating it as a shortened URL.
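
The classification and the fallback could look roughly like this. It is
a sketch in Python, not Scribe's actual code: `classify`, `handle`,
`load_post`, and `resolve_short_link` are hypothetical names, and the
two patterns are applied to a whole path segment rather than only to
its end, since the end-anchored post ID pattern would also match the
trailing "b" of a path such as YR7f4Ad7ekb.

```python
import re
from typing import Callable, Optional

HEX_ONLY = re.compile(r"[a-f0-9]+")       # characters allowed in a post ID
ALNUM_ONLY = re.compile(r"[a-zA-Z0-9]+")  # characters seen in shortened paths


def classify(path: str) -> str:
    """Return 'short', 'ambiguous', or 'regular' for a request path."""
    segments = [segment for segment in path.split("/") if segment]
    if len(segments) == 1:
        if HEX_ONLY.fullmatch(segments[0]):
            return "ambiguous"  # could be a bare post ID or a short link
        if ALNUM_ONLY.fullmatch(segments[0]):
            return "short"
    return "regular"


def handle(
    path: str,
    load_post: Callable[[str], Optional[str]],
    resolve_short_link: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Treat ambiguous paths as regular first, then fall back."""
    kind = classify(path)
    if kind == "short":
        return resolve_short_link(path)
    result = load_post(path)
    if result is None and kind == "ambiguous":
        return resolve_short_link(path)
    return result
```

Only the ambiguous case ever costs a second lookup, and by the estimate
above that case should be very rare.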

The second problem is how to fit redirecting into the existing
architecture. Currently, detecting the post ID happens in
src/actions/articles/show.cr without any HTTP requests. Requests to
Medium servers only happen in src/clients/medium_client.cr. When we
want to resolve shortened URLs, we have to make one or two HTTP
requests before we get the real URL and thus the post ID. Where should
these requests be made?

The third problem is how to get the real URL. I have not yet been able
to get the real URL by following redirects with a terminal client
(HTTPie). But of course, this should be solvable. If browsers can do
it, any other HTTP client can do it as well. I would have to look into
it some more.

Martin

[^1]: While writing this email, I have thought a bit more about privacy
and how to achieve it. I am not sure anymore if I personally think
privacy should be a primary goal of alternative frontends[^2]. If the
end is privacy, there might be better means to achieve it. I will have
to think about it some more. In the meantime, I still wanted to offer
my thoughts on shortened URLs because there are reasons other than
privacy for Scribe to handle these URLs.

[^2]: Not that my personal opinion should necessarily matter for
Scribe. It is your project.
Edward Loveall
Message-ID: <D32A3D61-1190-4F6C-9C1F-16464E83FE68@edwardloveall.com>
In-Reply-To: <63f25535f998c70011966c6eab6f9e790af6df0a.camel@mpuppe.de>

Amazing Martin. I appreciate all your considerations, math, and quest
for privacy no matter what that might mean to you or anyone else.

Not to skip past your solutions, but what if scribe had a `link`
subdomain i.e. `link.scribe.rip`? Lucky
[supports subdomains](https://github.com/luckyframework/lucky/pull/1537)
and so an action could handle detecting and redirecting to the main
scribe site. Would that work with Redirector and Privacy Redirect?

> I have not yet been able to get the real URL by following redirects
> with a terminal client

I was able to do this with curl:

```
$ curl -I "https://link.medium.com/YR7f4Ad7ekb"

HTTP/1.1 307 Temporary Redirect
Location: https://rsci.app.link/YR7f4Ad7ekb?_p=c61129cb9e1c65f6eb038ffced

... other headers ...
```

Following that url produces:

```
$ curl -I "https://rsci.app.link/YR7f4Ad7ekb?_p=c61129cb9e1c65f6eb038ffced"

HTTP/2 307
location: https://medium.com/p/staking-and-farming-3619fd56d07d?source=social.tw&_branch_match_id=977968740893479025

... other headers ...
```

That response has the Medium URL Scribe needs. Technically that will
also redirect a couple more times depending on whether there's a domain
connected to that account, but as far as Scribe is concerned, that's all
we need to do. I tried a few of the links you provided and they all
followed this two-step pattern. Logic to resolve these redirects could
be written in Crystal.
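
As a rough illustration of that logic, here is a sketch in Python
rather than Crystal. The names `head_location` and `resolve_post_id` are
hypothetical, the hop limit of 5 is arbitrary, and it assumes the
servers answer HEAD requests with a `Location` header as in the curl
output above, which may not hold for every client.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Optional
from urllib.parse import urlsplit

POST_ID_AT_END = re.compile(r"([a-f0-9]+)$")


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx then surfaces as HTTPError


def head_location(url: str) -> Optional[str]:
    """Send one HEAD request and return its Location header, if any."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=10) as response:
            return response.headers.get("Location")
    except urllib.error.HTTPError as error:
        return error.headers.get("Location")


def resolve_post_id(
    short_url: str,
    get_location: Callable[[str], Optional[str]] = head_location,
    max_hops: int = 5,
) -> Optional[str]:
    """Follow redirects until one points at medium.com, then return the
    post ID at the end of that URL's path (None if nothing is found)."""
    url = short_url
    for _ in range(max_hops):
        location = get_location(url)
        if location is None:
            return None
        parts = urlsplit(location)
        if parts.hostname == "medium.com":
            match = POST_ID_AT_END.search(parts.path)
            return match.group(1) if match else None
        url = location
    return None
```

Passing `get_location` as a parameter keeps the redirect-following
logic testable without network access.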

All that said, I kind of agree with you that this might be overkill.
I'm not opposed to it, but is it worth the time/complexity/effort? There
are a couple other things I'd like to clean up before this, but I might
get around to it. I'd love to hear what you think about this solution
first.

Thanks again.

Edward

On 16 Oct 2021, at 11:01, Martin Puppe wrote:

> Thank you for merging my patches yesterday.
> [...]
Martin Puppe <dev@mpuppe.de>
Message-ID: <d90e5d40ebd5f01233302f4731a6225c111adc9f.camel@mpuppe.de>
In-Reply-To: <D32A3D61-1190-4F6C-9C1F-16464E83FE68@edwardloveall.com>

Sorry for the late reply. I have been incredibly busy the last two
weeks.

On Saturday, 16.10.2021 at 13:18 -0400, Edward Loveall wrote:
> Amazing Martin. I appreciate all your considerations, math, and quest
> for privacy no matter what that might mean to you or anyone else.
> 
> Not to skip past your solutions, but what if scribe had a `link`
> subdomain i.e. `link.scribe.rip`? Lucky
> [supports subdomains](https://github.com/luckyframework/lucky/pull/1537)
> and so an action could handle detecting and redirecting to the main
> scribe site. Would that work with Redirector and Privacy Redirect?

Your solution should work with both. In Redirector, a second rule would
be necessary. In Privacy Redirect, the logic in function
`redirectMedium` in src/pages/background/background.js[^1] would have
to be changed.

[^1]: https://github.com/SimonBrazell/privacy-redirect/pull/311/files 


> > I have not yet been able to get the real URL by following redirects
> > with a terminal client
> 
> I was able to do this with curl:
> 
> ```
> $ curl -I "https://link.medium.com/YR7f4Ad7ekb"
> 
> HTTP/1.1 307 Temporary Redirect
> Location: https://rsci.app.link/YR7f4Ad7ekb?_p=c61129cb9e1c65f6eb038ffced
> 
> ... other headers ...
> ```

Weirdly, I get a response with “HTTP/1.1 200 OK” with curl or HTTPie.
If I look at the network monitor in Firefox I get redirects with code
307 like you. I don’t really understand it.

> All that said, I kind of agree with you that this might be overkill.
> I'm not opposed to it, but is it worth the time/complexity/effort?
> There are a couple other things I'd like to clean up before this, but
> I might get around to it. I'd love to hear what you think about this
> solution first.

Actually, the first solution I had thought of was letting Scribe
handle *all* subdomains. But then, setting up DNS and certificates
would have been more difficult. Let’s Encrypt, for example, issues
wildcard certificates only via the DNS-01 challenge. That is why I
threw that idea overboard. Handling only the link subdomain makes it
much simpler, but I did not think of that. That’s just one more set of
DNS records and one more (non-wildcard) certificate.

Your solution would work equally well. It just has different
trade-offs. It’s conceptually simpler and probably more robust. On the
other hand, as stated above, it would need a second rule in Redirector
and more complex logic in Privacy Redirect. Also a second set of DNS
records and a second certificate.

In the end, I think it’s all overkill. In my personal opinion, privacy
does not have to be a main goal for Scribe [^2]. I personally do not
use Scribe, Nitter, Libreddit for privacy. I use them for the cleaner
and faster user experience. For privacy, I use stuff like uBlock Origin
and Temporary Containers (with automatic mode enabled). And if I were
really concerned, I would use Tor.

[^2]: If it were a goal, the next thing that would have to be tackled
is the images, which are still loaded from medium.com.

Martin
Edward Loveall
Message-ID: <AE164316-0F99-44C7-BCA6-C2C757585E4E@edwardloveall.com>
In-Reply-To: <d90e5d40ebd5f01233302f4731a6225c111adc9f.camel@mpuppe.de>

Yeah, those are really good points. I did build Scribe in part for
privacy, but it's not really achieving that goal, as you pointed out
with images and embeds. My main goal is to get people to stop using
Medium. Then all of these problems disappear 😄

I'll put this on the back burner for now as it doesn't seem like a 
priority and it does add some complications. Thanks for talking this 
through with me.

Edward