~ireas/rusty-man-dev

This thread contains a patchset. You're looking at the original emails, but you may wish to use the patch review UI. Review patch
3 2

[PATCH] Handle std/core/alloc duplicates

Details
Message ID
<20210406235104.1851010-1-evan@new-schmidt.com>
DKIM signature
pass
Download raw message
Patch: +220 -6
This dedups items from alloc or core that are re-exported under std,
selecting the non-std variant.

Its implemented by updating the sort and dedup passes in a way that
groups identical items from the offending namespaces next to each other
(with std last), and equates them by ignoring the root namespace,
respectively.

A side effect of this method is that items shared between alloc and
core (like `fmt::Result`) are also deduplicated.

Since the std variants are effectively only removed when the alloc/core
variants are present, any item removed by the deduplication can still be
viewed by specifying it's full path, e.g. `rusty-man std::vec::Vec`.

Beyond removing the std duplicates, this preserves the previous sorting
order by sorting again after deduplicating.
---
Hi! I came across this lovely project when I was looking at the cursive
docs, and #17 was one of the first things I found I was missing, so I
decided to try implementing it.

I tried a couple methods, including deduplicating only if there was a
single item in the results, but this ended up seeming the closest to
my read of the original ticket and most consistent/reliable.

I also tried to keep the new methods attached to the relevant objects
but still private - let me know what you think of that organization.

I also updated the README example, it didn't match the current behavior
from what I found.

That said, this is the first time I've done this kind of list sorting
and filtering in rust, I'd be happy to try something else!

Some examples of the results:

u8 (before):

    $ rusty-man u8
    Found mulitple matches for u8 – select one of:
    
    [ 0 ] core::u8 (Module): The 8-bit unsigned integer type.
    [ 1 ] std::u8 (Primitive): The 8-bit unsigned integer type.
    [ 2 ] std::u8 (Module): The 8-bit unsigned integer type.
    
    >

u8 (after):

    $ ./target/debug/rusty-man u8
    Found multiple matches for u8 – select one of:
    
    [ 0 ] core::u8 (Module): The 8-bit unsigned integer type.
    [ 1 ] std::u8 (Primitive): The 8-bit unsigned integer type.
    
    >

Vec (before):

    $ rusty-man Vec
    Found mulitple matches for Vec – select one of:
    
    [ 0 ] alloc::vec::Vec (Struct): A contiguous growable array type, written <code>Vec<T></code> but …
    [ 1 ] std::vec::Vec (Struct): A contiguous growable array type, written <code>Vec<T></code> but …
    
    >

Vec (after):

    $ # just opens alloc::vec::Vec
    $ ./target/debug/rusty-man Vec


 README.md     |  10 +--
 src/source.rs | 216 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 220 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 4d5df1a..6b8bbf2 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ ## Example Usage
$ rusty-man kuchiki::NodeRef
```

You don’t have to specificy the full item name:
You don’t have to specify the full item name:
```
$ rusty-man NodeRef
```
@@ -45,11 +45,11 @@ ## Example Usage
If there are multiple matches for the keyword, rusty-man will show you a list
of all matching items:
```
$ rusty-man --source my/other/crate/target/doc u8
Found mulitple matches for u8 – select one of:
$ rusty-man u8
Found multiple matches for u8 – select one of:

[ 0 ] core::u8: The 8-bit unsigned integer type.
[ 1 ] std::u8: The 8-bit unsigned integer type.
[ 0 ] core::u8 (Module): The 8-bit unsigned integer type.
[ 1 ] std::u8 (Primitive): The 8-bit unsigned integer type.

> 1
```
diff --git a/src/source.rs b/src/source.rs
index ea1cafd..a3bb655 100644
--- a/src/source.rs
+++ b/src/source.rs
@@ -68,10 +68,74 @@ pub fn search(&self, name: &doc::Name) -> anyhow::Result<Vec<index::IndexItem>>
            .map(|i| i.find(name))
            .collect::<Vec<_>>()
            .concat();
        Self::dedup(&mut items);
        items.sort_unstable();
        items.dedup();

        Ok(items)
    }

    fn dedup(items: &mut Vec<index::IndexItem>) {
        items.sort_unstable_by(index::IndexItem::cmp_normalize_std);
        items.dedup_by(|l, r| index::IndexItem::dup_root(l, r) || l.eq(&r));
    }
}

// extensions for deduping std/allow/core re-exports
impl index::IndexItem {
    const DUPE_ROOTS: [&'static str; 3] = ["std", "alloc", "core"];

    /// Check if two index items are the same, re-exported from `alloc`/`core` to `std`
    fn dup_root(left: &Self, right: &Self) -> bool {
        left.ty == right.ty
            && left.name.rest() == right.name.rest()
            && Self::DUPE_ROOTS.contains(&left.name.first())
            && Self::DUPE_ROOTS.contains(&right.name.first())
    }

    /// An alternate ordering of `IndexItem` that sorts by:
    /// 1. items with "crate" names from `std`, `alloc`, and `core` ahead of items from other modules
    /// 2. type
    /// 3. `name.rest()`
    /// 4. `name.first()`
    /// 5. description
    ///
    /// The intention is to use this for deduplicating common re-exported items by ordering them next to each other.
    fn cmp_normalize_std(left: &Self, right: &Self) -> std::cmp::Ordering {
        use std::cmp::Ordering::*;

        // order std duplicate modules higher than all others
        match (
            Self::DUPE_ROOTS.contains(&left.name.first()),
            Self::DUPE_ROOTS.contains(&right.name.first()),
        ) {
            (true, false) => return Less,
            (false, true) => return Greater,
            _ => (),
        }

        // matching types should be next to each other (some modules share names with types)
        match left.ty.cmp(&right.ty) {
            Equal => (),
            c => return c,
        }

        match left.name.rest().cmp(&right.name.rest()) {
            Equal => (),
            c => return c,
        }

        match left.name.first().cmp(&right.name.first()) {
            Equal => (),
            c => return c,
        }

        match left.description.cmp(&right.description) {
            Equal => (),
            c => return c,
        }

        Equal
    }
}

impl DirSource {
@@ -260,3 +324,153 @@ pub fn get_source<P: AsRef<path::Path>>(path: P) -> anyhow::Result<Box<dyn Sourc
        ))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_dedup_vec() {
        use super::doc::ItemType::*;
        use index::IndexItem;;

        let items = vec![
            IndexItem {
                name: "std::vec::Vec".to_owned().into(),
                ty: Typedef,
                description: "".to_string(),
            },
            IndexItem {
                name: "alloc::vec::Vec".to_owned().into(),
                ty: Typedef,
                description: "".to_string(),
            },
        ];

        let mut dedupe_items = items.clone();
        Sources::dedup(&mut dedupe_items);
        assert_eq!(
            &items[1..2],
            dedupe_items.as_slice(),
            "std variant should be removed when both are present"
        );

        let mut items = items;
        items.pop();
        let original_items = items.clone();
        Sources::dedup(&mut items);
        assert_eq!(
            original_items, items,
            "std variant should NOT be removed when alone"
        );
    }

    #[test]
    fn test_dedup_u8() {
        use super::doc::ItemType::*;
        use index::IndexItem;;

        let mut items = vec![
            IndexItem {
                name: "core::u8".to_owned().into(),
                ty: Module,
                description: "".to_string(),
            },
            IndexItem {
                name: "std::u8".to_owned().into(),
                ty: Primitive,
                description: "".to_string(),
            },
            IndexItem {
                name: "std::u8".to_owned().into(),
                ty: Module,
                description: "".to_string(),
            },
        ];

        assert_eq!(
            true,
            items
                .iter()
                .find(|i| i.name.first() == "std" && i.ty == Module)
                .is_some()
        );
        Sources::dedup(&mut items);
        assert_eq!(
            None,
            items
                .iter()
                .find(|i| i.name.first() == "std" && i.ty == Module)
        );
    }

    #[test]
    fn test_dedup_result() {
        use super::doc::ItemType::*;
        use index::IndexItem;;

        let mut items = [
            ("alloc::fmt::Result", Typedef),
            ("anyhow::Result", Typedef),
            ("bincode::Result", Typedef),
            ("clap::Result", Typedef),
            ("core::fmt::Result", Typedef), // duplicate
            ("core::result::Result", Enum),
            ("darling::Result", Typedef),
            ("darling::error::Result", Typedef),
            ("darling_core::error::Result", Typedef),
            ("regex_syntax::Result", Typedef),
            ("serde_json::Result", Typedef),
            ("serde_json::error::Result", Typedef),
            ("std::fmt::Result", Typedef),    // duplicate
            ("std::io::Result", Typedef),     // unique
            ("std::result::Result", Enum),    // duplicate
            ("std::thread::Result", Typedef), // unique
            ("syn::Result", Typedef),
            ("syn::parse::Result", Typedef),
            ("walkdir::Result", Typedef),
            ("xml::reader::Result", Typedef),
            ("xml::writer::Result", Typedef),
        ]
        .iter()
        .map(|(name, ty)| {
            let name = name.to_string().into();
            let ty = ty.to_owned();
            IndexItem {
                name,
                ty,
                description: "".to_string(),
            }
        })
        .collect::<Vec<_>>();

        let removed = [
            "std::fmt::Result",
            "std::result::Result",
            "core::fmt::Result",
        ];

        assert_eq!(
            removed.len(),
            items
                .iter()
                .filter(|i| removed.contains(&i.name.full()))
                .count(),
            "All expected duplicates should be present before dedup"
        );
        let initial_len = items.len();

        Sources::dedup(&mut items);

        assert_eq!(
            initial_len - removed.len(),
            items.len(),
            "Only expected duplicates should be removed"
        );
        assert_eq!(
            None,
            items.iter().find(|i| removed.contains(&i.name.full())),
            "All expected duplicates should be removed"
        );
    }
}
-- 
2.31.1
Details
Message ID
<20210407204107.GA1835@ireas.org>
In-Reply-To
<20210406235104.1851010-1-evan@new-schmidt.com> (view parent)
DKIM signature
pass
Download raw message
Hi Evan,

On 2021-04-06 16:51:04, Evan Lloyd New-Schmidt wrote:
> Hi! I came across this lovely project when I was looking at the cursive
> docs, and #17 was one of the first things I found I was missing, so I
> decided to try implementing it.

Great!  Please let me know if you have any other issues.  I haven’t
worked on this project in a while, but I’m still interested in
continuing its development and improving its usability.

Thank you for the patch!  It looks very good to me.  I have some
questions and comments:

As far as I see, the item that is retained is implicitly determined by
alphabetical order, i. e. prefering alloc over core over std, right?  I
think that’s the right thing to do, but it would be good to add a
comment that mentions the significance of the alphabetical order.

Currently, we have to make sure that dup_root and cmp_normalize_std use
the same ordering so that the de-duplication works.  Did you consider
using sort_unstable_by_key and dedup_by_key instead? The key function
could be similar to:
    |item| {
        let is_stdlib_reexport = DUPE_ROOTS.contains(&item.name.first());
	let unique_name = if is_stdlib_reexport {
	    item.name.last()
	} else {
	    item.name.as_ref()
	};
        (is_stdlib_reexport, unique_name, item.ty, &item.description)
    }

Also, would it make sense to ignore the description when comparing the
items?

> I also tried to keep the new methods attached to the relevant objects
> but still private - let me know what you think of that organization.

If I had implemented this, I would probably have written free functions
in source.rs.  I’ll have to think about this a bit more, but that’s no
big issue.

> I also updated the README example, it didn't match the current behavior
> from what I found.

Thanks!

> +    fn dedup(items: &mut Vec<index::IndexItem>) {
> +        items.sort_unstable_by(index::IndexItem::cmp_normalize_std);
> +        items.dedup_by(|l, r| index::IndexItem::dup_root(l, r) || l.eq(&r));

Why do we need eq?  Shouldn’t dup_root return true if l.eq(&r)?

> +// extensions for deduping std/allow/core re-exports
> +impl index::IndexItem {
> +    const DUPE_ROOTS: [&'static str; 3] = ["std", "alloc", "core"];

Nit:  &[&str] is less verbose.

> +#[cfg(test)]
> +mod tests {
> +    use super::*;
> +
> +    #[test]
> +    fn test_dedup_vec() {
> +        use super::doc::ItemType::*;
> +        use index::IndexItem;;

Trailing semicolon.

> +    #[test]
> +    fn test_dedup_u8() {
> +        use super::doc::ItemType::*;
> +        use index::IndexItem;;

Trailing semicolon.

> +    #[test]
> +    fn test_dedup_result() {
> +        use super::doc::ItemType::*;
> +        use index::IndexItem;;

Trailing semicolon.

Thanks again for the patch!
Robin
Details
Message ID
<ef22a1dc-7a7e-9f62-b601-40fc236c2251@new-schmidt.com>
In-Reply-To
<20210407204107.GA1835@ireas.org> (view parent)
DKIM signature
pass
Download raw message
Hi Robin,

> Please let me know if you have any other issues.

Will do!

> As far as I see, the item that is retained is implicitly determined by
> alphabetical order, i. e. prefering alloc over core over std, right?  I
> think that’s the right thing to do, but it would be good to add a
> comment that mentions the significance of the alphabetical order.

That is correct, and fair point on having an explicit comment - I'll add 
that in.

> Currently, we have to make sure that dup_root and cmp_normalize_std use
> the same ordering so that the de-duplication works.  Did you consider
> using sort_unstable_by_key and dedup_by_key instead? The key function
> could be similar to:
>      |item| {
>          let is_stdlib_reexport = DUPE_ROOTS.contains(&item.name.first());
> 	let unique_name = if is_stdlib_reexport {
> 	    item.name.last()
> 	} else {
> 	    item.name.as_ref()
> 	};
>          (is_stdlib_reexport, unique_name, item.ty, &item.description)
>      }

That's very interesting - I tried using the _key variants initially, and 
I thought about using tuples to set the fields' sort order, but I never 
thought to return the calculated boolean and name variants! That looks 
much more elegant IMO.

Unfortunately, when I tried implementing this version I got "error: 
lifetime may not live long enough". I tried specifying some manually, 
but it appears that the *_by_key methods don't allow key functions that 
return references to the slice's items - this question on S.O. has some 
more info: https://stackoverflow.com/a/56106352

If I understand correctly, to use the key method we'd have to either:
1. Clone the string slices that are returned, potentially with the 
cached sort? (although it appears there is no unstable cached sort).
2. Rework the design of doc::Name somehow to use an external reference, 
which seems a little unreasonable.

I'm not sure if the cloning would have much of a performance impact with 
the fairly small number of search results, what do you think?

Regardless of whether the key methods are used, with this kind of 
approach to removal I think we'll always need two different comparison 
functions - since in the sorting step the crate names need to be sorted, 
but in the dedup step they need to be equal in order to remove always 
remove std. I think the key methods might make that clearer and also 
allow more reuse though, so I'll work on a solution using them.

> Also, would it make sense to ignore the description when comparing the
> items?

I wasn't sure about that - I would *think* that any item with the same 
relative path and type will be the same, but I decided to stick with the 
conservative route. For std, alloc, and core I think it's probably fine 
to ignore it.
>> +    fn dedup(items: &mut Vec<index::IndexItem>) {
>> +        items.sort_unstable_by(index::IndexItem::cmp_normalize_std);
>> +        items.dedup_by(|l, r| index::IndexItem::dup_root(l, r) || l.eq(&r));
> 
> Why do we need eq?  Shouldn’t dup_root return true if l.eq(&r)?

Good question! I had to think a little more about this. With how it's 
written currently, yes - dup_root will always return false for items 
that don't have roots in DUPE_ROOTS. But there's no reason it couldn't 
be a broader equality comparison now - previously I was using it for two 
different purposes that .

> Nit:  &[&str] is less verbose.

It looks like that static lifetime inference doesn't apply to associated 
constants in impl blocks for some reason, I get lifetime errors if I 
don't add 'static to both `&`s. Moving it to the module scope works with 
that form though, so I'll move it there.

> Trailing semicolon.

Gosh, I thought I had looked all over this - and I'm a little surprised 
rustfmt didn't change it. I'll clean those up.

> Thanks again for the patch!

Sure thing! Thanks for creating such a helpful tool - just the other 
week I was writing something in C and remembered how nice it is to have 
every library function just a `man` call away, and how I missed that 
with Rust. Having a terminal companion for the full-featured web 
interface is very nice - its already come in handy for writing this patch.
Details
Message ID
<20210409070323.GA1919@ireas.org>
In-Reply-To
<ef22a1dc-7a7e-9f62-b601-40fc236c2251@new-schmidt.com> (view parent)
DKIM signature
pass
Download raw message
On 2021-04-08 17:41:54, Evan Lloyd New-Schmidt wrote:
> Unfortunately, when I tried implementing this version I got "error: lifetime
> may not live long enough". I tried specifying some manually, but it appears
> that the *_by_key methods don't allow key functions that return references
> to the slice's items - this question on S.O. has some more info:
> https://stackoverflow.com/a/56106352

That’s a bummer.  Maybe it is possible to use a key function together
with sort_by?  E. g. something like this (where get_tuple is the
function I outlined previously):

    vec.sort_by(|l, r| get_tuple(l).cmp(&get_tuple(r)))
    vec.dedup_by(|l, r| get_tuple(l).eq(&get_tuple(r)))

Then there should be no lifetime issues AFAIS.

> I'm not sure if the cloning would have much of a performance impact with the
> fairly small number of search results, what do you think?

It shouldn’t be a big issue.  I’d like to explore other approaches
first, but if there is no simple workaround, it’s okay to clone the
data.

> Regardless of whether the key methods are used, with this kind of approach
> to removal I think we'll always need two different comparison functions -
> since in the sorting step the crate names need to be sorted, but in the
> dedup step they need to be equal in order to remove always remove std. I
> think the key methods might make that clearer and also allow more reuse
> though, so I'll work on a solution using them.

Good point.  I think this could be fixed by adding a then_with call to
the sort function:
    vec.sort_by(|l, r| {
        get_tuple(l)
	    .cmp(&get_tuple(r))
	    .then_with(|| get_preference(l).cmp(&get_preference(r)))
    })

get_preference could either use the implicit alphabetical order, or use
the index of the root module in DUPE_ROOTS as an explicit ordering.

> > > +    fn dedup(items: &mut Vec<index::IndexItem>) {
> > > +        items.sort_unstable_by(index::IndexItem::cmp_normalize_std);
> > > +        items.dedup_by(|l, r| index::IndexItem::dup_root(l, r) || l.eq(&r));
> > 
> > Why do we need eq?  Shouldn’t dup_root return true if l.eq(&r)?
> 
> Good question! I had to think a little more about this. With how it's
> written currently, yes - dup_root will always return false for items that
> don't have roots in DUPE_ROOTS. But there's no reason it couldn't be a
> broader equality comparison now - previously I was using it for two
> different purposes that .

I see,  makes sense.

> > Nit:  &[&str] is less verbose.
> 
> It looks like that static lifetime inference doesn't apply to associated
> constants in impl blocks for some reason, I get lifetime errors if I don't
> add 'static to both `&`s. Moving it to the module scope works with that form
> though, so I'll move it there.

Oh, I didn’t know that.  But I think it’s fine to have the constant in
the module scope.

/Robin
Reply to thread Export thread (mbox)