Generating cluster wide unique IDs

generating cluster wide unique IDs

All enterprise applications handle tons and tons of data and all this data needs a ID that is unique cluster-wide. In relational database we create primary keys for the same purpose. Some databases support inbuilt column types (AUTO_INCREMENT/AUTO_NUMBER) for generating a monotonically increasing 64 bit long number. While some people prefer to generate ids in their application layer in order to gain more control over the generation and then persist the records using data layer. The second approach however requires you to cache the latest generated number and not lose the track of id’s already generated ID’s (mostly through some kind of persistence) to avoid Primary key collisions.

Both above approaches have their Pros and Cons in itself but they both have a common drawback that these are not resilient in case of distributed architecture. Just think of data shards spread across multiple database nodes how would technique 1 make sure the tables in different nodes do not produce same auto_increment number OR imagine a topology where you are running your application on multiple nodes and these different nodes are serving different parts of the system then how would technique 2 fulfill needs of all nodes.

So what is the distributed approach?

Well there is no single approach that can fulfil all the requirements but following are the few most popular approaches which are used in many large scale applications.

UUID

One very simple approach would be generate UUID’s. UUID’s are 128 bits hexadecimal numbers which have a very very very low probability of getting generated twice. So one can simply use any UUID generator and use them for the primary keys.

UUID’ s require no coordination between different nodes and can be generated independently. But then remember they are big in size and they don’t index well (mentioned in my post earlier).

Ticket Servers

This is one of the very famous approaches where you can simply maintain a table to store just the latest generated ID and every time a node asks for ID they make a ‘select for update’ on this table, update the value with a incremented value and use the selected value as the next ID.

This approach is resilient and distributed in nature. The ID generation can be separated from the actual data store. However there is a risk of Single Point of Failure as all the nodes rely on this table for the next ID and if this service goes down your app may stop functioning properly.

Also this approach might not be suitable in case where the writes per second are very high because that will overload the Ticket Server and also degrade your app performance.

Twitter Snowflake (Unique ID generator)

This approach tackles the problem of SPOF as well as the latency issues. Here the ID is generated as a concatenation of timestamp, node ID and Sequence number.

The timestamp is the System time measured as number of millisec since EPOCH. 41 bits are allotted to timestamp.

Node ID can be assigned to any physical node when during its startup and it can be retrieved from a shared cache in the cluster. Node ID can occupy next 10 bits.

And the Sequence number can be a monotonically increasing 12 bit number.


These are some famous approaches but you can also check out few other distributed ID generator provided by vendors like Zookeeper and Hazelcast.

Before closing this blog I would also like to warn people using a 64 bit number as an ID to be careful when using them on UI. Because any browser will truncate a 64 bit number. This is because any JavaScript Number supports only 2⁵³-1 as the maximum safe integral value. So it is always safe to send your ID’s as Strings instead of numbers.

Hope it helps you decide what approach you would like to take in generating ID’s for your application.

Cheers!!! Happy coding.