Partitioning and constraints part 1 - UNIQUE
Partitioning in PostgreSQL has been artisanal work for a long time now. And despite the discussion that has been running for a few months on PostgreSQL’s hackers mailing list, it will probably stay this way for some time yet, simply because it requires a lot of brainstorming and work.
Nowadays, I believe the current state of partitioning under PostgreSQL is quite well documented and under control most of the time. You can find a lot of information about it online, starting with the PostgreSQL documentation itself, but also about tooling, extensions, etc.
However, there’s still a dark side of partitioning under PostgreSQL, not well covered or understood: the constraints related to it. More specifically, unique constraints covering all partitions of a partitioned table, and how to refer to them from foreign keys. This series of articles explains how to implement them by hand ourselves, thanks to some great PostgreSQL features, detailing how to avoid major traps. You will see that crafting these «constraint» wannabes requires some attention, but is definitely doable, in a clean way.
As this subject requires quite some detail and explanation, I decided to split it into multiple articles. This first part is about creating a UNIQUE constraint across all partitions of a table. The next one covers how to reference a partitioned table from a foreign key. And maybe some others, depending on motivation, inspiration and feedback.
Study case
I chose to illustrate this article with a table partitioned by date range. This is a fairly frequent practice and adapting this article to another partitioning scheme is quite easy anyway.
Range partitioning on the PK itself poses no challenge: each value of the PK can only live in strictly one child, each of them enforcing the PK internally. So the uniqueness of this PK across the partitions is already enforced by the CHECK and UNIQUE constraints of each partition. That’s why my study case partitions by range on a timestamptz column. As the CHECKs do not apply to the primary key values, each of its values can land in any partition, which can lead to duplicate values residing in different partitions. Exactly what we want to avoid.
So here is the dummy schema, with a table master partitioned across five children using ts as the date range partitioning key.
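Here is a minimal sketch of what such a schema can look like; the child names, the monthly ranges and the default values are my assumptions:

```sql
-- The parent table: all reads go through its inheritance tree.
CREATE TABLE master (
  id      integer PRIMARY KEY,
  ts      timestamptz NOT NULL,
  comment text
);

-- One child per month, each enforcing its own CHECK and PRIMARY KEY.
CREATE TABLE child_1 (
  CHECK (ts >= '2015-01-01' AND ts < '2015-02-01')
) INHERITS (master);
ALTER TABLE child_1 ADD PRIMARY KEY (id);
ALTER TABLE child_1 ALTER COLUMN ts SET DEFAULT '2015-01-15';

-- ...and so on for child_2 to child_5, each with its own range and default.
```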
The SET DEFAULT on ts in each partition is only there to keep the other commands simple to read. Note that I do not create the usual trigger on INSERT and UPDATE on the master table: it is out of the scope of this article, will not be needed, and adds no challenge to the subject.
The naive solution
Of course, the whole trick revolves around triggers. We have to check the uniqueness of a PK value across all partitions after any INSERT or UPDATE on any of them, for each row. Let’s dive in and get wet with a first naive version of such a trigger:
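Here is a sketch of what such a naive trigger can look like. The function name master_id_pkey() appears later in the article; the per-partition trigger names are my assumptions:

```sql
CREATE OR REPLACE FUNCTION master_id_pkey()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- The row is already written when this AFTER trigger fires, so finding
  -- more than one occurrence anywhere in the inheritance tree means a
  -- duplicate key exists in some sibling.
  IF (SELECT count(*) FROM master WHERE id = NEW.id) > 1 THEN
    RAISE EXCEPTION 'duplicate key value violates unique constraint on master(id)'
      USING DETAIL = format('Key (id)=(%s) already exists.', NEW.id);
  END IF;

  RETURN NULL;
END
$$;

-- One trigger per partition; repeat for child_2 to child_5.
CREATE TRIGGER child_1_id_pkey
  AFTER INSERT OR UPDATE ON child_1
  FOR EACH ROW EXECUTE PROCEDURE master_id_pkey();
```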
Obviously, each partition needs the trigger to check that the PK value it is about to write does not already exist in one of its siblings. The trigger function itself is quite easy to understand: if we find a row with the same value as the PK, raise an exception. Tests sound promising:
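Something like this quick session, with illustrative values (a sketch, not the original output):

```sql
INSERT INTO child_1 (id) VALUES (1);  -- accepted
INSERT INTO child_2 (id) VALUES (1);
-- ERROR:  duplicate key value violates unique constraint on master(id)
-- DETAIL:  Key (id)=(1) already exists.
```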
OK, it works as expected. But there are two big issues with this situation. The first one is that a race condition involving two or more transactions can break our home-made unique constraint:
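The interleaving is easy to reproduce with two psql sessions; a sketch:

```sql
-- Session 1                    -- Session 2
BEGIN;
INSERT INTO child_1 (id)
  VALUES (2);                   BEGIN;
                                INSERT INTO child_2 (id)
                                  VALUES (2);
-- Both triggers ran before either transaction committed, so neither
-- saw the other's uncommitted row: both INSERTs pass the check.
COMMIT;                         COMMIT;
-- id=2 now exists in both child_1 and child_2.
```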
The second issue is that real constraints can be deferred, which means constraints are disabled during a transaction and enforced on user request or, by default, at the end of the transaction. In other words, using deferred constraints allows you to violate them temporarily during a transaction, as long as everything is respected at the end. For more information about this mechanism, see the SET CONSTRAINTS, CREATE TABLE and… the CREATE TRIGGER pages.
Yes, the documentation says triggers can be deferred when defined as CONSTRAINT TRIGGER. So we can solve this issue by recreating our triggers:
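For each partition, something along these lines (a sketch):

```sql
DROP TRIGGER child_1_id_pkey ON child_1;

CREATE CONSTRAINT TRIGGER child_1_id_pkey
  AFTER INSERT OR UPDATE ON child_1
  DEFERRABLE INITIALLY IMMEDIATE
  FOR EACH ROW EXECUTE PROCEDURE master_id_pkey();
```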
INITIALLY IMMEDIATE means the constraint trigger will be executed right after the related statement. The opposite behavior, DEFERRED, fires the trigger at the very end of the transaction, unless the user decides to run SET CONSTRAINTS { ALL | name [, ...] } IMMEDIATE somewhere during the transaction.
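For instance, with triggers declared INITIALLY DEFERRED, a transaction can pull the checks back to statement time (a minimal sketch):

```sql
BEGIN;
SET CONSTRAINTS ALL IMMEDIATE;        -- also fires any pending checks now
INSERT INTO child_1 (id) VALUES (3);  -- checked right after the statement
COMMIT;
```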
Deferring the trigger to avoid the race condition?
Now, if you step back a second to look at what we have, you might wonder if forcing our constraint triggers to be DEFERRABLE INITIALLY DEFERRED would solve the race condition. As constraints are checked at the very end of the transaction, maybe this would work by kind of serializing each transaction and their constraints? Short answer: no.
For one, deferred constraints come with a performance cost we might not want to pay on every transaction. But most importantly, if you declare your trigger as deferrable, anyone can set it to IMMEDIATE, even if it is declared INITIALLY DEFERRED. So this is definitely not a viable solution. But anyway, setting that aside for the purpose of the study, does it work?
Again, no. Even if it solves the “human timing race condition”, there’s another very small window where a race condition is possible in the core of PostgreSQL, when multiple transactions do not conflict and get committed all together at the exact same time. The idea itself sounds suspicious anyway, too fragile: if there are no good ol’ locks floating around, there’s a race condition close enough to break things. It is pretty easy to prove with a bash loop hammering each partition with 100 INSERTs using colliding values as PK. Note that the triggers have been altered to INITIALLY DEFERRED:
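Recreating the triggers as deferred looks like this; the bash loop itself, a plain series of concurrent psql INSERTs, is not reproduced here:

```sql
DROP TRIGGER child_1_id_pkey ON child_1;

CREATE CONSTRAINT TRIGGER child_1_id_pkey
  AFTER INSERT OR UPDATE ON child_1
  DEFERRABLE INITIALLY DEFERRED  -- checks now wait for COMMIT
  FOR EACH ROW EXECUTE PROCEDURE master_id_pkey();
```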
Well, that’s pretty bad, we have a bunch of duplicated keys: 23 of them appear in two different partitions, three others in three different partitions, and one even in all of them! I could find duplicates like that every time I ran this scenario. Note that out of 500 inserts, only 209 survived in total. That makes 291 exceptions raised out of the 324 expected, counting the duplicated keys that were not caught.
Isolation level?
Well, last chance. If that many transactions were committed at the exact same time, maybe we can force them to serialize with the SERIALIZABLE isolation level?
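One way to force it for every new session, assuming a database named partdemo (the name is mine):

```sql
ALTER DATABASE partdemo
  SET default_transaction_isolation TO 'serializable';
```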
After applying the preceding query, I re-ran the same scenario as the previous test: only 76 rows survived out of the 500 INSERTs, all of them unique. At last! OK, this reflects what we had in mind previously, but we had to force PostgreSQL to really serialize transactions; any other isolation level will just fail. And by the way, this works with both IMMEDIATE and DEFERRED triggers, as transactions are virtually serialized or rolled back. The log file confirms a lot of serialization conflicts were raised: grep’ing it shows 415 serialization exceptions and only 9 from our trigger.
This solution works, but having to stay in SERIALIZABLE mode to achieve our goal is a heavy constraint to carry. Moreover, we have the same problem as with DEFERRED triggers: as a simple user can change their own isolation level, any bug in the application or any uninformed user can lead to scenarios with silent duplications. Fortunately, another simpler and safer solution exists.
Real solution: adding locks
The SERIALIZABLE solution works because, to emulate serial execution of all transactions, it takes predicate locks behind the scenes to detect serialization anomalies. What about taking care of this ourselves? We are used to locks; we know they work fine.
The best solution sounds like acquiring a lock before being able to write the value. This actually boils down to forcing transactions that conflict on a lock to serialize themselves, instead of having the engine do all the work for everyone. The question now is: «how can we hold a lock on something that doesn’t exist yet?» The answer is: advisory locks. Advisory locks offer applications a lock mechanism and manager on arbitrary integer values. They do not apply to real objects, transactions or rows. As the documentation says: «It is up to the application to use them correctly».
The idea now is simply to acquire an advisory lock on the same value as NEW.id in the trigger function. It should do the trick, cleanly and safely:
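A sketch of the reworked function, using the transaction-scoped flavor of advisory locks so the lock is released automatically at COMMIT or ROLLBACK:

```sql
CREATE OR REPLACE FUNCTION master_id_pkey()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- Concurrent writers of the same key serialize here: the second one
  -- blocks until the first transaction commits or rolls back.
  PERFORM pg_advisory_xact_lock(NEW.id);

  -- Once we hold the lock, any committed duplicate is visible to this
  -- fresh snapshot (in read committed), so the check is race-free.
  IF (SELECT count(*) FROM master WHERE id = NEW.id) > 1 THEN
    RAISE EXCEPTION 'duplicate key value violates unique constraint on master(id)'
      USING DETAIL = format('Key (id)=(%s) already exists.', NEW.id);
  END IF;

  RETURN NULL;
END
$$;
```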
And with this version of master_id_pkey(), in “read committed” isolation level, the same scenario as in the previous chapter, executing 500 INSERTs concurrently with conflicting keys, left not a single duplicate behind.
Sounds good. What about a small pgbench scenario?
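Something along these lines, wrapping the INSERT in a DO block so a rejected key does not abort the pgbench clients; the file names, the key range and the wrapper are my assumptions:

```sql
-- conflict_child_1.sql (one such file per partition): each transaction
-- inserts a random, highly conflicting key in [1,1000].
DO $$
BEGIN
  INSERT INTO child_1 (id) VALUES (1 + (random() * 999)::int);
EXCEPTION WHEN raise_exception THEN
  RAISE WARNING 'rejected: %', SQLERRM;  -- lands in the server log
END
$$;

-- Run from the shell, e.g.:
--   pgbench -n -c 5 -T 300 -f conflict_child_1.sql ... -f conflict_child_5.sql
```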
After this five-minute run with five workers inserting highly conflicting data as fast as they can, we have 48,351 rows in the partitions, 82,557 conflicting rows were rejected, and there is not a single duplicate in the table.
I couldn’t find any duplicated value after stressing this solution. Whatever the number of queries per parallel session, whatever the pgbench scenario, I had no unique violation across partitions, as expected. This works in any transaction isolation level, and users cannot turn it off by mistake. This is safe…
…Well, as long as a superuser or the owner of the table does not disable the trigger, obviously. But hey, they can drop the unique constraint on a normal table as well, right?
Wow, at last, finished. What? No? I can hear you thinking it only applies to integers. OK, bonus.
Supporting other types
Supporting a unique constraint on integers was straightforward using advisory locks. But how can this apply to other types, like text for instance? Easy: hash it1! For the purpose of this last chapter, let’s add a unique constraint on comment:
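A sketch of the corresponding trigger function; the function and trigger names are mine, and each partition also gets its own UNIQUE constraint on comment:

```sql
ALTER TABLE child_1 ADD UNIQUE (comment);  -- and so on for each partition

CREATE OR REPLACE FUNCTION master_comment_key()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- hashtext() maps any text to a 32-bit integer we can lock on.
  PERFORM pg_advisory_xact_lock(hashtext(NEW.comment));

  IF (SELECT count(*) FROM master WHERE comment = NEW.comment) > 1 THEN
    RAISE EXCEPTION 'duplicate key value violates unique constraint on master(comment)'
      USING DETAIL = format('Key (comment)=(%s) already exists.', NEW.comment);
  END IF;

  RETURN NULL;
END
$$;

-- One constraint trigger per partition, as before.
CREATE CONSTRAINT TRIGGER child_1_comment_key
  AFTER INSERT OR UPDATE ON child_1
  DEFERRABLE INITIALLY IMMEDIATE
  FOR EACH ROW EXECUTE PROCEDURE master_comment_key();
```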
If you followed this far, there is no need to play the “find the error” game to identify the most important change here: the lock is taken on the result of the simple text-to-integer hash function hashtext, already provided in PostgreSQL’s core.
OK, I can hear optimization freaks crying. Theoretically, two different strings can collide. But this hash function is supposed to compute uniform results among 4 billion possible values. I can live with the probability of two concurrent writes involving two different strings colliding here. That “1 case out of 4 billion” is already enough for me, and those colliding strings would also have to show up at the exact same time (at least within the same few milliseconds). And even if you are unlucky enough to experience this, the two transactions will just be serialized, not a big deal.
And if you are really not comfortable with this, you understood the trick here anyway: find a way to hold a lock somewhere to avoid concurrency. Use some other hashing function, create an extension with its own lock machinery in memory, write to an unlogged table (erk), whatever you want.
Time to test now.
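After another run of the colliding-inserts scenario against comment, a simple aggregate is enough to look for any value that survived in more than one place:

```sql
-- Any row returned here would be a cross-partition unique violation.
SELECT comment, count(*)
  FROM master
 GROUP BY comment
HAVING count(*) > 1;
```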
Wow (again), only 29,785 rows out of 93,902 transactions survived this intense colliding scenario, and we only find unique values across all partitions, as expected. Aaaaand, grep’ing the log file, I can find 64,117 rejected rows…
Conclusion
Such a long way already. Thank you for reading this far. At first I thought I could write about unique and foreign keys in the same article, but look at what we already covered… We talked about constraint triggers, race conditions, isolation levels, advisory locks and hashing… phew!
I do realize the solution provided here requires some skills and attention. It is not all magic and easy to play with. As long as this feature is not handled directly by the core, partitioning will require people to craft their tools themselves. In the meantime, it is a nice subject to learn more about these concepts and your favorite RDBMS, and to play with them.
I think this way of guaranteeing uniqueness over several partitions is bulletproof. If you think you have found a loophole, please send me some feedback; I’ll be pleased to learn about it.
And don’t forget, we are not done! Lots of fun with foreign keys in the next part! Stay tuned!
1. No data has been harmed during this test. ↩