~olly/yoyo

1

Patch: stable topological sort using the Gapotchenko algorithm

Kruis, Anselm <anselm.kruis@atos.net>
Details
Message ID
<c213104fbca14f11a1941fd7686cdfde@atos.net>
DKIM signature
missing
Download raw message
Hi,

below is the patch. 

-----Ursprüngliche Nachricht-----
Von: Kruis, Anselm 
Gesendet: Mittwoch, 14. April 2021 00:44
An: '~olly/yoyo@lists.sr.ht' <~olly/yoyo@lists.sr.ht>
Cc: 'Olly Cope' <olly@ollycope.com>
Betreff: Re: Sorting migrations from multiple sources

Hi Olly,

Many thanks for your reply and your patch. Please apologize my late reply, I was on vacation for during the last two weeks.
Your patch works for me. It is indeed much simpler than mine. 

Another observation: the function topological_sort() might be to aggressive. Given I have three migrations in this order by ID:
"m1.sql", "m2.sql", "m3.sql" and "m3" depends on "m1"
Currently topological_sort() yields "m1", "m3", "m2". But the initial order "m1", "m2", "m3" already satisfies the dependency
condition.

Of the various possible topological sorts we usually want the one, which is most similar to the original sequence of migrations.
To define "most similar" one can use the Damerau-Levenshtein-distance or the Levenshtein-distance.
See Oleksiy Gapotchenkos blog http://blog.gapotchenko.com/stable-topological-sort for an algorithm.
If you are interested, I just coded a patch.

Kind regards
  Anselm

diff -r 312182a8c03f yoyo/migrations.py
--- a/yoyo/migrations.py	Mon Jan 18 15:13:41 2021 +0000
+++ b/yoyo/migrations.py	Wed Apr 14 01:00:56 2021 +0200
@@ -756,72 +756,66 @@
     return heads
 
 
+def _gapotchenko_topological_sort(iterable, is_arrow, *, raise_on_cycle=False):
+    # iterable: an ordered iterable of vertices
+    # is_arrow: a callable(vertex1, vertex2) that is True, if there
+    #           is an arrow from vertex1 to vertex2
+
+    vertices = list(iterable)
+
+    # compute the transitive closure (= reach-ability)
+    # using a recursive depth first search (DFS)
+    tc = defaultdict(set)
+    def dfs(s, v):
+         # Mark reachability from start to v as true.
+        s.add(id(v))
+        # Find all the vertices reachable through v
+        for vv in vertices:
+            if id(vv) not in s and is_arrow(v, vv):
+                dfs(s, vv)
+    for v in vertices:
+        dfs(tc[id(v)], v)
+
+    # And now the algorithm given by Oleksiy Gapotchenko
+    # in http://blog.gapotchenko.com/stable-topological-sort
+    # This implementation is a bit optimised and can optionally
+    # raise on cycles
+    while True:
+        for i, vi in enumerate(vertices):
+            for j in range(i):
+                vj = vertices[j]
+                if is_arrow(vj, vi):
+                    if id(vj) not in tc[id(vi)]:
+                        # vj is not in the transitive closure of vi --> no cycle
+                        del vertices[i]
+                        vertices.insert(j, vi)
+                        break # restart
+                    # it is a cycle
+                    if raise_on_cycle:
+                        # raise ValueError("graph contains a cycle", vi)
+                        raise exceptions.BadMigration(
+                            "Circular dependencies among these migrations {}".format(
+                                ", ".join(m.id for m in vertices if id(m) in tc[id(vi)])
+                            )
+                )
+            else:
+                if raise_on_cycle and is_arrow(vi, vi):
+                    # a degenerate cycle
+                    raise exceptions.BadMigration(
+                        "Circular dependencies among these migrations {}".format(vi.id)
+                    )
+                continue
+            break
+        else:
+            return vertices
+
+
 def topological_sort(
     migration_list: Iterable[Migration],
 ) -> Iterable[Migration]:
 
-    # Make a copy of migration_list. It's probably an iterator.
-    migration_list = list(migration_list)
-
-    # Track graph edges in two parallel data structures.
-    # Use OrderedDict so that we can traverse edges in order
-    # and keep the sort stable
-    forward_edges = defaultdict(
-        OrderedDict
-    )  # type: Dict[Migration, Dict[Migration, int]]
-    backward_edges = defaultdict(
-        OrderedDict
-    )  # type: Dict[Migration, Dict[Migration, int]]
-
-    def sort_by_stability_order(
-        items, ordering={m: index for index, m in enumerate(migration_list)}
-    ):
-        return sorted(
-            (item for item in items if item in ordering), key=ordering.get
-        )
-
-    for m in migration_list:
-        for n in sort_by_stability_order(m.depends):
-            forward_edges[n][m] = 1
-            backward_edges[m][n] = 1
-
-    def check_cycles(item):
-        stack: List[Tuple[Migration, List[Migration]]] = [(item, [])]
-        while stack:
-            n, path = stack.pop()
-            if n in path:
-                raise exceptions.BadMigration(
-                    "Circular dependencies among these migrations {}".format(
-                        ", ".join(m.id for m in path + [n])
-                    )
-                )
-            stack.extend((f, path + [n]) for f in forward_edges[n])
-
-    seen = set()
-    for item in migration_list:
-
-        if item in seen:
-            continue
-
-        check_cycles(item)
-
-        # if item is in a dedepency graph, go back to the root node
-        while backward_edges[item]:
-            item = next(iter(backward_edges[item]))
-
-        # is item at the start of a dependency graph?
-        if forward_edges[item]:
-            stack = [item]
-            while stack:
-                m = stack.pop()
-                yield m
-                seen.add(m)
-                for child in list(reversed(list(forward_edges[m]))):
-                    if all(
-                        dependency in seen
-                        for dependency in backward_edges[child]
-                    ):
-                        stack.append(child)
-        else:
-            yield item
-            seen.add(item)
+    # Make a copy of migration_list and do an initial sort by id
+    migration_list = list(sorted(list(migration_list), key=lambda m: m.id))
+    return _gapotchenko_topological_sort(migration_list,
+                                         lambda x,y: y in x.depends,
+                                         raise_on_cycle=True)
Details
Message ID
<20210414074238.77h4y5idr42l6li4@hooper>
In-Reply-To
<c213104fbca14f11a1941fd7686cdfde@atos.net> (view parent)
DKIM signature
missing
Download raw message
Thanks! This looks really good. I'm excited to have a look at that 
algorithm, it's new to me and looks pretty interesting. I'll review your 
patch for inclusion as soon as I can (but I have a lot on right now with 
other work so it might not be until next week). 

Olly.

On 13/04/2021, Kruis, Anselm wrote:
>Hi,
>
>below is the patch.
>
>-----Ursprüngliche Nachricht-----
>Von: Kruis, Anselm
>Gesendet: Mittwoch, 14. April 2021 00:44
>An: '~olly/yoyo@lists.sr.ht' <~olly/yoyo@lists.sr.ht>
>Cc: 'Olly Cope' <olly@ollycope.com>
>Betreff: Re: Sorting migrations from multiple sources
>
>Hi Olly,
>
>Many thanks for your reply and your patch. Please apologize my late reply, I was on vacation for during the last two weeks.
>Your patch works for me. It is indeed much simpler than mine.
>
>Another observation: the function topological_sort() might be to aggressive. Given I have three migrations in this order by ID:
>"m1.sql", "m2.sql", "m3.sql" and "m3" depends on "m1"
>Currently topological_sort() yields "m1", "m3", "m2". But the initial order "m1", "m2", "m3" already satisfies the dependency
>condition.
>
>Of the various possible topological sorts we usually want the one, which is most similar to the original sequence of migrations.
>To define "most similar" one can use the Damerau-Levenshtein-distance or the Levenshtein-distance.
>See Oleksiy Gapotchenkos blog http://blog.gapotchenko.com/stable-topological-sort for an algorithm.
>If you are interested, I just coded a patch.
>
>Kind regards
>  Anselm
>
>diff -r 312182a8c03f yoyo/migrations.py
>--- a/yoyo/migrations.py	Mon Jan 18 15:13:41 2021 +0000
>+++ b/yoyo/migrations.py	Wed Apr 14 01:00:56 2021 +0200
>@@ -756,72 +756,66 @@
>     return heads
>
>
>+def _gapotchenko_topological_sort(iterable, is_arrow, *, raise_on_cycle=False):
>+    # iterable: an ordered iterable of vertices
>+    # is_arrow: a callable(vertex1, vertex2) that is True, if there
>+    #           is an arrow from vertex1 to vertex2
>+
>+    vertices = list(iterable)
>+
>+    # compute the transitive closure (= reach-ability)
>+    # using a recursive depth first search (DFS)
>+    tc = defaultdict(set)
>+    def dfs(s, v):
>+         # Mark reachability from start to v as true.
>+        s.add(id(v))
>+        # Find all the vertices reachable through v
>+        for vv in vertices:
>+            if id(vv) not in s and is_arrow(v, vv):
>+                dfs(s, vv)
>+    for v in vertices:
>+        dfs(tc[id(v)], v)
>+
>+    # And now the algorithm given by Oleksiy Gapotchenko
>+    # in http://blog.gapotchenko.com/stable-topological-sort
>+    # This implementation is a bit optimised and can optionally
>+    # raise on cycles
>+    while True:
>+        for i, vi in enumerate(vertices):
>+            for j in range(i):
>+                vj = vertices[j]
>+                if is_arrow(vj, vi):
>+                    if id(vj) not in tc[id(vi)]:
>+                        # vj is not in the transitive closure of vi --> no cycle
>+                        del vertices[i]
>+                        vertices.insert(j, vi)
>+                        break # restart
>+                    # it is a cycle
>+                    if raise_on_cycle:
>+                        # raise ValueError("graph contains a cycle", vi)
>+                        raise exceptions.BadMigration(
>+                            "Circular dependencies among these migrations {}".format(
>+                                ", ".join(m.id for m in vertices if id(m) in tc[id(vi)])
>+                            )
>+                )
>+            else:
>+                if raise_on_cycle and is_arrow(vi, vi):
>+                    # a degenerate cycle
>+                    raise exceptions.BadMigration(
>+                        "Circular dependencies among these migrations {}".format(vi.id)
>+                    )
>+                continue
>+            break
>+        else:
>+            return vertices
>+
>+
> def topological_sort(
>     migration_list: Iterable[Migration],
> ) -> Iterable[Migration]:
>
>-    # Make a copy of migration_list. It's probably an iterator.
>-    migration_list = list(migration_list)
>-
>-    # Track graph edges in two parallel data structures.
>-    # Use OrderedDict so that we can traverse edges in order
>-    # and keep the sort stable
>-    forward_edges = defaultdict(
>-        OrderedDict
>-    )  # type: Dict[Migration, Dict[Migration, int]]
>-    backward_edges = defaultdict(
>-        OrderedDict
>-    )  # type: Dict[Migration, Dict[Migration, int]]
>-
>-    def sort_by_stability_order(
>-        items, ordering={m: index for index, m in enumerate(migration_list)}
>-    ):
>-        return sorted(
>-            (item for item in items if item in ordering), key=ordering.get
>-        )
>-
>-    for m in migration_list:
>-        for n in sort_by_stability_order(m.depends):
>-            forward_edges[n][m] = 1
>-            backward_edges[m][n] = 1
>-
>-    def check_cycles(item):
>-        stack: List[Tuple[Migration, List[Migration]]] = [(item, [])]
>-        while stack:
>-            n, path = stack.pop()
>-            if n in path:
>-                raise exceptions.BadMigration(
>-                    "Circular dependencies among these migrations {}".format(
>-                        ", ".join(m.id for m in path + [n])
>-                    )
>-                )
>-            stack.extend((f, path + [n]) for f in forward_edges[n])
>-
>-    seen = set()
>-    for item in migration_list:
>-
>-        if item in seen:
>-            continue
>-
>-        check_cycles(item)
>-
>-        # if item is in a dedepency graph, go back to the root node
>-        while backward_edges[item]:
>-            item = next(iter(backward_edges[item]))
>-
>-        # is item at the start of a dependency graph?
>-        if forward_edges[item]:
>-            stack = [item]
>-            while stack:
>-                m = stack.pop()
>-                yield m
>-                seen.add(m)
>-                for child in list(reversed(list(forward_edges[m]))):
>-                    if all(
>-                        dependency in seen
>-                        for dependency in backward_edges[child]
>-                    ):
>-                        stack.append(child)
>-        else:
>-            yield item
>-            seen.add(item)
>+    # Make a copy of migration_list and do an initial sort by id
>+    migration_list = list(sorted(list(migration_list), key=lambda m: m.id))
>+    return _gapotchenko_topological_sort(migration_list,
>+                                         lambda x,y: y in x.depends,
>+                                         raise_on_cycle=True)
Reply to thread Export thread (mbox)