epilys

Detecting cycles in tag parent-child relationship in sqlite3

Problem statement

Story tags can have multiple parents. This establishes a hierarchy where a tag can have a tree of descendants. We want this directed graph to be acyclic (DAG) because cycles don’t make sense in a topical hierarchy, and when we query the graph we would have to be extra cautious to not fall in infinite loops.

Database schema

Here are the tables created by django:

The table sic_tag_parents contains one row for each directed edge.

Recursive Common Table Expression (CTE) to the rescue

sqlite3 provides us with the ability to perform recursive queries with Common Table Expressions. We can use this to write a query that returns all paths within the graph. Then, when we insert a new parent-child relationship, we check if there already exists a directed path from the child to its parent.

Unfortunately it’s not possible to make a table CHECK CONSTRAINT since the constraint must refer to the table, which exists only after the CREATE TABLE statement is finished. We can instead make the check in a BEFORE INSERT trigger and RAISE(ABORT) if a cycle is detected, thus preventing the insertion.

Firstly, the trigger shall execute a SELECT query for each inserted row. In order to raise an exception, we have to SELECT it:

Secondly, in order to abort only when a specific condition is met, we can use a WHERE EXISTS:

In the EXISTS expression we shall put a recursive CTE that returns something (anything) if a cycle exists. If anything is returned, EXISTS evaluates to true and RAISE is selected.

This is the CTE query:

Line-by-line:

This defines a temporary table/view called w with these columns as an expression. This is similar to defining regular views as the result of a SELECT statement.

Here we select every edge as the start of a potential path. These edges are paths of length 1, and are also acyclic, hence we select 0 as cycle. This line is performed one time at the beginning of the CTE, it’s the basis of the recursion.

The base select is united with the select that follows:

Here we JOIN sic_tag_parents with w itself, by adding edges to the already existing paths in the previous iteration of w. The path is built as a string of comma separated ids. cycle is true if to_tag_id, the parent, already exists in the previous path. The WHERE NOT cycle is necessary to prevent infinite recursion.

Finally, we select the paths that begin with the parent and contain the child.

The entire trigger:

Appendix

We can visualise the paths with a CTE.

First, create a utility view that associates edges with their corresponding tag names:

Sample output:

sqlite> SELECT * FROM sic_tag_parents_name LIMIT 2;
from_name    from_tag_id  to_name           to_tag_id
-----------  -----------  ----------------  ---------
programming  57           computer science  102
debugging    137          programming       57

Then write a similar CTE as a view that creates a path string out of the tag names:

CREATE TEMPORARY VIEW sic_tag_parents_paths AS 
WITH RECURSIVE w(parent, last_visited, parent_name, last_visited_name, already_visited, cycle, length) AS (
      SELECT DISTINCT to_tag_id AS parent, from_tag_id AS last_visited, to_name AS parent_name, from_name AS last_visited_name, to_name AS already_visited, 0 AS cycle, 1 as length FROM sic_tag_parents_name

      UNION ALL

      SELECT t.to_tag_id AS parent, t.from_tag_id AS last_visited, t.to_name, t.from_name,  t.to_name || ' > ' || already_visited, already_visited LIKE '%'||t.to_name||"%", length+1 AS length FROM sic_tag_parents_name AS t JOIN w ON w.last_visited = t.to_tag_id
      WHERE NOT cycle
)
SELECT * FROM w;

The output:

sqlite> select distinct(last_visited_name || ' > ' || already_visited) as path, length from sic_tag_parents_paths order by length desc limit 25;
path                                                                 length
-------------------------------------------------------------------  ------
c > programming languages > programming > computer science           3
c++ > programming languages > programming > computer science         3
clojure > programming languages > programming > computer science     3
elixir > programming languages > programming > computer science      3
erlang > programming languages > programming > computer science      3
go > programming languages > programming > computer science          3
haskell > programming languages > programming > computer science     3
javascript > programming languages > programming > computer science  3
java > programming languages > programming > computer science        3
lisp > programming languages > programming > computer science        3
lua > programming languages > programming > computer science         3
python > programming languages > programming > computer science      3
ruby > programming languages > programming > computer science        3
rust > programming languages > programming > computer science        3
swift > programming languages > programming > computer science       3
emacs > commandline > unix > operating systems                       3
emacs > commandline > unix                                           2
debugging > programming > computer science                           2
programming languages > programming > computer science               2
vcs > programming > computer science                                 2
commandline > unix > operating systems                               2
linux > unix > operating systems                                     2
openbsd > unix > operating systems                                   2
dragonflybsd > unix > operating systems                              2
sqlite3 > databases > computer science                               2