Keeping an ordered collection in PostgreSQL
I’m hitting an issue I’m quite sure a thousand developers encountered before me: how to keep an ordered collection in an SQL database.
Keeping an ordered collection seems trivial at first, but can soon become quite complicated.
Lets summarize: what we have is a table A, referencing a table B via a foreign key, and we want to keep the B rows belonging to A sorted.
Table "public.posts"
Column | Type | Modifiers
---------+---------+----------------------------------------------------
id | integer | not null default nextval('posts_id_seq'::regclass)
content | text |
Indexes:
"posts_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "comments" CONSTRAINT "comments_post_id_fkey"
FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE
Table "public.comments"
Column | Type | Modifiers
---------+---------+-------------------------------------------------------
id | integer | not null default nextval('comments_id_seq'::regclass)
content | text |
post_id | integer |
Indexes:
"comments_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"comments_post_id_fkey" FOREIGN KEY (post_id)
REFERENCES posts(id) ON DELETE CASCADE
A post
has many comments
. So far, so good. Now, in a usual commenting system, comments are not ordered by the user (they are sorted by date and maybe by thread).
In my case, comments
has to be an ordered array in the post. Like so:
{
"id": 3,
"content": "My awesome post",
"comments": [
{ "id": 3, "content": "This should be the first comment"},
{ "id": 6, "content": "This should be the second comment"}
]
}
In a document database, this wouldn’t be a problem, but in PostgreSQL this is not supported out of the box.
There are basically three options:
- to store the order as an array
- to use a linked list
- to use an order column in
comments
1. to store the order as an array
Storing the order as an array would mean to have something like [3, 6]
in
my post row. This is a nice solution for small arrays, but it has two
downsides: “no possibility to put a unique constraint on the members” and “no
possibility to put a foreign key constraint on the members”.
2. to use a linked list
The linked list is the only solution that scale. If you have a million row, you can insert one in the middle with only one (or two if your linked list is two-way) updates and one insert.
The downsides of the linked list is that it is hard to ensure integrity, like no cycling reference or ensuring the list is not fragmented. The other big downside is that it requires quite some work to re-build the array from the list.
3. to use an order column in comments
The last solution is to use an order column in comments. It has the downsides that it requires renumbering all the comments for a post on any update/insert operation, but it makes building the array easy and fast.
Let’s add a position
to our comments
.
alter table comments add position int not null;
Also add a constraint to ensure each comment has it’s own position:
alter table comments add constraint "position"
unique(post_id, position)
deferrable initially deferred;
We defer the constraint to allow shuffling the position in a transaction.
So far, so good.
Now, let’s look at each update operations, I ommited the where post_id = xx
in each operation, but this is required.
INSERT
Inserting a comment requires making room for it and insert, like:
begin;
# n is the new comment index, must be 0 <= n <= count(comments)
update comments set position = position + 1 where position >= n;
insert into comments ...;
commit;
DELETE
Deleting a comment requires to fill the gap it creates:
begin;
delete from comments where id = ...;
# n is the comment index
update comments set position = position - 1 where position >= n;
commit;
UPDATE
Updating (reordering) a comment requires more renumbering operations:
begin;
# n is the old index
update comments set position = position - 1 where position >= n;
# if the post_id change, do the following operations on the new post_id
# m is the new index
update comments set position = position + 1 where position >= m;
update comments set position = m where id = theid;
commit;
Conclusion
I tried to optimize it, but I couldn’t find a way to make it simpler. I thought about using a 64 bit int, and always insert in between, for example, with 7 bit, the first position would be 64 and if you want to insert before, it’s 32, this allows to keep gaps for insert and reordering, but if you hit noperations where n is the number of bit in your position field, you are stuck and have to redistribute the values over the entire table. It might be more efficient with some applications, but in my case the user is going to do a lot of reshuffling, and will often trigger redistribution.