Choosing Primary Keys and Unique IDs Wisely

Think twice before making this decision

Making a choice of Primary Key or unique ID? Well, think twice…

We are a RAD (rapid application development) platform provider. People use the platform to define objects, fields, relationships, CRUD UI, and the behavior of their apps on user actions. For the sake of this article, let me call these resources the metadata of the application. Once they have defined all the metadata, they can deploy their apps on our cloud-based platform, and users can then log in and start creating data.

Our platform uses IDs (64-bit long numbers) as the identifier for all metadata and data in the system. Once you develop your application in one running instance of the platform, an ID is generated for the metadata and data of the application. From then on, this ID is the only way to uniquely identify these resources. Typically, you would develop an application in your dev environment and then move it to production. Since it is a RAD platform, we wanted to keep this process simple and hassle-free. So we devised a technique where, in order to move an application from one system to another, you just export the application as an XML file and import it into the other system. This way you can easily push updates to your production system. And here the concept of the “Original ID” comes in handy. You can think of the Original ID as another ID that uniquely identifies metadata across different running instances of the platform.

Since the target system is agnostic of the application coming in, we want to make sure it understands which metadata is newly imported and which is already present in the system and needs only an update. So whenever we create a piece of metadata in the source system, a new ID is generated; this ID is the primary key for the resource in that system, but it is also assigned as the Original ID of the resource. When we move the resource from the source system to a target system, existing IDs there may collide with the migrated resource IDs. Hence we generate new IDs for the primary key column of the resource, but we keep the Original IDs in the target system exactly as generated in the original system. This way, wherever a resource moves, we assume that any resource with the same Original ID is the exact same resource from the same source system.
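A minimal sketch of that import rule in Python. All the names here are hypothetical stand-ins for illustration; the real platform works on XML exports against a database, not in-memory dicts:

```python
import itertools

# Hypothetical stand-ins for the target system's metadata store.
_next_id = itertools.count(1000)   # the target system's own ID generator
metadata_by_original_id = {}       # Original ID -> record living in the target

def import_metadata(records):
    """Apply the rule: same Original ID means same resource, so update in
    place; anything unseen gets a fresh local primary key but keeps its
    Original ID from the source system."""
    for record in records:
        existing = metadata_by_original_id.get(record["original_id"])
        if existing is not None:
            keep_id = existing["id"]      # the target's own PK never changes
            existing.update(record)
            existing["id"] = keep_id
        else:
            new_record = dict(record, id=next(_next_id))
            metadata_by_original_id[new_record["original_id"]] = new_record
```

Re-importing the same export is then idempotent: the resource is updated in place and its primary key in the target system stays stable.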

Now the real problem comes when our customers need more flexibility. They may want to move their production systems to completely new environments, such as a different publicly or privately hosted multi-tenant cloud where 1,000 such apps of different customers are already running. This creates a real mess, because Original IDs are not globally unique. As long as a customer was moving his apps within his own systems, it was all good; but if he wants to move outside, he is stuck, because his Original IDs might collide with the Original IDs of some other customer’s resources. And with the assumption that the same Original ID means the same resource, it would all fall apart.

The problem was huge: if we wanted to grow big, this limitation had to go away. The obvious solution seemed to be using UUIDs as Original IDs instead of 64-bit long IDs. It was simple mathematics: the probability of the same random UUID being generated twice is about as bad as someone getting hit by a meteor while walking down the road. Or in other words, only after generating 1 billion UUIDs every second for the next 100 years would the probability of creating just one duplicate reach roughly 50%.
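That figure is easy to sanity-check with the birthday bound. A random (version 4) UUID carries 122 random bits, so the probability of at least one duplicate among n IDs is approximately 1 − exp(−n²/2N). Plugging in the numbers from the claim gives a probability in the same ballpark as the quoted 50% (it comes out a bit above, around 0.6):

```python
import math

N = 2 ** 122                            # possible version 4 UUID values
rate = 10 ** 9                          # 1 billion UUIDs per second
n = rate * 100 * 365 * 24 * 3600        # ...sustained for 100 years

# Birthday approximation: P(at least one duplicate) ~= 1 - exp(-n^2 / 2N)
p_collision = 1 - math.exp(-n * n / (2 * N))
print(round(p_collision, 2))            # around 0.6, same ballpark as ~50%
```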

Wow, that sounded like a plan. We just needed to change the logic for generating the Original IDs and we would be all set for a thousand years. But as they say, “life is not obligated to give us what we expect.” One of our engineers said, “OK, that’s fine, we can use UUIDs, but what happens to the data? We can move the metadata that defines the applications, but how do we move our data? Do you suggest that we change the primary key of our data to a UUID? That will kill the performance of our platform.”

I have to say, that was a smart argument. We can expect our customers to move their data as well as their metadata to other systems. But as soon as they move their data, we are stuck with a new problem: there will be primary key violations. There will be existing data with the same IDs as the data coming in. We can’t simply generate new IDs for this data, because they are used as foreign keys in other tables. So what do we do, change the data IDs to UUIDs as well? Well, no, that’s not a good idea, because it would simply increase the work for our database server. It would now have to index random 128-bit values. The index would be more fragmented and larger, so less of it would fit in memory. This would definitely bring down the performance of our system.

But then another smart guy said, “I don’t think this will be a problem if we use Type 1 UUIDs, which are sequential and will result in less data fragmentation.” Another guy said, “And what about making them sequential and storing them as binary? That would be even faster, wouldn’t it?”
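For reference, here is what those options look like in Python’s standard `uuid` module, as a sketch. One caveat worth a comment: a Type 1 UUID embeds a timestamp, but its textual layout puts the fast-changing “time-low” field first, which is why binary storage schemes often also reorder the bytes before indexing:

```python
import uuid

u1 = uuid.uuid1()   # Type 1: embeds a timestamp, so values cluster in time
u4 = uuid.uuid4()   # Type 4: 122 fully random bits

# Either kind can be stored two ways:
as_text = str(u1)       # 36-character string, e.g. for a char(36) column
as_binary = u1.bytes    # 16 raw bytes, e.g. for a binary(16) column

assert len(as_text) == 36
assert len(as_binary) == 16
assert uuid.UUID(bytes=as_binary) == u1   # round-trips losslessly
```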

Well, I was not a fan of binary IDs, because we display IDs in our UI. This means we would have to convert them back and forth between binary and string. Nonetheless, even if we did that, what would the impact be? I was interested in the impact on performance. So I researched the use of UUIDs as primary keys a bit and also ran a few tests.

Let me share the results:

[Chart: GUID primary key vs. BIGINT primary key inserting 1 million records, showing the performance degradation of INSERTs when a UUID is used instead of a 64-bit LONG as the PK.]

This is exactly what the first guy was saying. We cannot replace long IDs with UUIDs for data, because data is huge in volume. That was a simple decision, but we would need an alternative; let’s talk about that later. We can, however, use UUIDs for Original IDs, because an Original ID identifies metadata, and metadata is not that huge. And we saw that, within limits, UUIDs perform well enough. But then another question was raised: which type of UUID should we use — Type 1, Type 4, or Type 1 stored as binary?

Then I ran the following tests, which came out in favor of storing UUIDs as strings instead of binary:

[Chart: inserting 2 million rows cumulatively into Table1 with a binary(16) UUID column as PK and into Table2 with a char(36) UUID column as PK.]

But this gave us the idea of storing a UUID as a 22-character string instead of a 36-character one. This was another improvement we could make over storing UUIDs as plain strings: we just need to encode the UUID’s 16 raw bytes as a Base64 string, and that is enough to compress the text form.
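A sketch of that encoding in Python (the helper names are mine; the URL-safe Base64 alphabet is used so the result is also safe in URLs and filenames). Sixteen bytes encode to 24 Base64 characters, the last two of which are always `=` padding, so stripping the padding leaves exactly 22 characters:

```python
import base64
import uuid

def uuid_to_short(u: uuid.UUID) -> str:
    """Encode a UUID's 16 bytes as 22 URL-safe Base64 characters."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")

def short_to_uuid(s: str) -> uuid.UUID:
    """Decode the 22-character form back into a UUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))
```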

Phew, one problem solved. But what about the other? What do we do to make the IDs of data universally unique? Well, our bright manager had a great idea. He said, “We don’t need to make the IDs of data universally unique. We need to make them unique only when a customer brings his data into a new system.”

That statement brought a wide smile to my face. I knew what he was pointing at. We have unique IDs for customers in our systems. Why not make the data ID and the customer ID a composite key? This will always make sure that when a new customer arrives with data in a new system, we just need to create a new ID for the customer, and the combination of both IDs will let us uniquely identify all the data system-wide for that customer.
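A minimal sketch of the composite-key idea, with in-memory dicts standing in for tables (names hypothetical):

```python
# Rows are keyed by (customer_id, record_id): record IDs only have to be
# unique per customer, not globally.
rows = {}

def import_customer_data(new_customer_id, customer_rows):
    """Land a migrating customer's rows under a freshly minted customer ID.

    The record IDs come in untouched, so foreign keys inside the customer's
    own data keep working; the new customer ID alone is what keeps every
    composite key unique system-wide.
    """
    for row in customer_rows:
        rows[(new_customer_id, row["id"])] = row
```

Two customers can then bring rows with identical record IDs into the same system without any collision, because the customer ID disambiguates them.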

But in comes our architect, who says, “I don’t trust databases. How would I know whether composite keys perform as well as single-column 64-bit LONG keys?”

I said, let’s prove it. And I ran another test:

[Chart: average time in seconds to select 2M records from a DB that is gradually growing in size.]
[Chart: average time in seconds to insert 2M records into a DB that is gradually growing in size.]

The graphs above clearly show that a composite key performs on par with a single-column 64-bit long ID as the primary key, even on huge data.

Hurrah! We just solved the biggest problem on earth 😀 Let’s have a good sleep tonight.

P.S. All tests were run on a production server profile, with a dedicated MySQL v5.7 database server running on a multi-core machine with 32 GB of memory.

A few references: