Make And Enquiry

Recursive Database Queries with PostgreSQL

19th November 2009

Whilst working on a new website with social networking features, we needed to come up with an efficient way to retrieve a list of friends of a given user.  The added difference that has flummoxed many an SQL database in the past, is that we also need to retrieve the friends of that users friends, and their friends, etc.

The ultimate goal was to retrieve a list of users who were within five degrees of the primary user, along with the shortest route through the friends network to get to that person.  This can be a bit of a hard problem to visualise, so lets map out an example.

If Colin is friends with Alex who is then friends with Chris, then we want our search to discover that Chris is linked to Colin, with a path of Colin -> Alex -> Chris, and a depth of 2.

In addition, should Alex also be friends with Peter, who is then friends with Chris (giving a path of Colin -> Alex -> Peter -> Chris and a depth of 3) then we only want our search return the previous, shorter route.

There have been numerous attempts to solve this kind of problem using SQL in the past, using multiple joins, pre-calculating the data in another table, recursively searching the database from the application, etc.  However PostgreSQL 8.4 now adds another tool to this armory, WITH RECURSIVE (Common Table Expressions).

The WITH statement allows you to effectively allows creation of temporary tables for use only in the current statement.  These can be used to create queries that are more easy to understand but can also speed up certain queries.  Depesz has a great overview here.

Adding the RECURSIVE clause alters the behavior of the WITH statement to allow us to solve our problem.  Lets start by throwing the solution out there, and then we can discuss how it works:

WITH RECURSIVE friendlist(fid, path, depth) AS (
    SELECT f.friend_id, ARRAY[f.person_id, f.friend_id], 1 FROM friend AS f WHERE person_id = 7470
UNION ALL
    SELECT f.friend_id, path || f.friend_id, l.depth + 1
    FROM friend AS f, friendlist AS l
    WHERE f.person_id = l.fid AND depth < 5
)
SELECT * FROM (
    SELECT DISTINCT ON (fid) fid, path, depth
    FROM friendlist
    ORDER BY fid, depth ASC
) AS r ORDER BY depth ASC;

To test the statement I created a table called friend with two fields, person_id and friend_id.  This table literally records that the user person_id is friends with user friend_id.  I then populated the table with a couple of million random rows, and began hacking away to find a solution to the problem.

The first line in our solution defines the dataset friendlist with an id (stores the friends id), a path to that user and how many steps you need to go through to reach them.  Friendlist is effectively a set of temporary tables that last the lifetime of the statement, allowing the recursive magic to happen.

Line two contains the first SELECT statement, which is used to pre-populate friendlist.  When entries are added to friendlist they are effectively added twice - once to a version of friendlist that forms the returned result set, and once to an active version of friendlist that is used to control the recursive processes.  We'll come back to that.  In our example I'm starting the search by selecting all the friends of person 7470.

The UNION ALL tells the database to merge every result from the subsequent select - the alternative is to just specifiy UNION which will only add unique entries to the result set.

The second SELECT is where the magic happens - for every row that gets stored in friendlist (the active version, said we'd come back to that!), this second SELECT statement is run.  So each time a new link to a friend is found and added to the result set, this SELECT statement is run again to find any friends of the that just found friend.  This is the recursive part.

As with any recursive algorithm it is important to make sure that it cannot infinite loop, that the process will always end rather than be able to run forever.  With a tree you might check to make sure that a node you wish to visit is not already in the path, for example.  In this instance we are only interested in following the data to a maximum depth from the originating user, hence only finding friends with a depth less than 5.

The final SELECT statement is used for processing and returning the results from friendlist.  For a simple query this could have just been a SELECT * FROM friendlist, but we want to make sure we only return the shortest route to a given user, not all routes, and we thought it would be nice to sort the data by depth, listing your closest friends first.

The inner SELECT DISTINCT ON (fid)… statement limits the result set to only the shortest route to a given user, through sorting by depth and using the DISTINCT ON functionality to only return the first result for a given friends user id.

The outer SELECT sorts the results by depth, as the inner SELECT will return in fid order.

Our example query returns all 9,000 results from our example data set in well under a second on my laptop, so with a bit of caching will be more than fast enough for our needs.  Removing the path information seriously speeds up the query, should that information be redundant for your needs (under half a second for the same data set on the same laptop).

Hopefully this will help someone out there - whilst some other databases have had this feature for a while it has already proved a valuable addition to PostgreSQL for some of the sites we are working on.

 

More information is available within the PostgreSQL online documentation.



Comments to “Recursive Database Queries with PostgreSQL”

No Comments have been posted yet.


Leave a Reply
 





  • captcha