What is a UUID?
A UUID (Universally Unique Identifier) is a 128-bit number used to uniquely identify information in computer systems.
The key features of a UUID are:
Universally Unique: The probability of the same UUID being generated twice is extremely low.
128-bit: UUIDs are 128 bits long (16 bytes).
Identification: UUIDs are used to identify information. This could be identifying a server, user, file, database row, API request, etc.
Standardized Format: UUIDs follow the format defined in RFC 4122. In text form they are written as 32 hexadecimal digits in five groups separated by hyphens.
By using an algorithm to generate a UUID rather than sequences like auto-incrementing integers, the identifier can be generated in a distributed environment without central coordination.
This makes UUIDs very useful in systems with multiple servers/databases or in peer-to-peer applications where central coordination is not feasible. The uniqueness and standardized format are the foundations that enable UUIDs to identify information across different systems.
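As a minimal sketch of generation without coordination, the snippet below uses Python's standard `uuid` module. Each call stands in for a separate machine creating an identifier on its own; the list name `ids` is just an illustrative choice.

```python
import uuid

# Each call stands in for a separate machine creating an ID on its own:
# no shared counter and no database round trip is involved.
ids = [uuid.uuid4() for _ in range(3)]

for value in ids:
    print(value)
```

Nothing here talks to a central service, which is the property that makes the approach attractive for multi-server and peer-to-peer systems.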
History and Origins
Universally unique identifiers (UUIDs) were originally used in Apollo Computer's Network Computing System (NCS) in the 1980s and were later adopted by the Open Software Foundation's Distributed Computing Environment (DCE). The goal was to create an identifier format that could be generated independently in distributed systems without any centralized coordination.
In 2005, the Internet Engineering Task Force (IETF) published the UUID specification as RFC 4122. This formally defined the format and generation rules for UUIDs. UUIDs have been widely adopted in many technologies like COM/DCOM, CORBA, Cocoa, LDAP, and more.
The RFC 4122 standard generalized and expanded the format beyond the original DCE specification while keeping backward compatibility. It defined different variants and versions of UUIDs to support different generation methods. This allowed UUIDs to be used more flexibly across diverse systems and platforms.
UUID Format
A UUID is a 128-bit number. Its text representation writes those 128 bits as a string of 32 hexadecimal digits grouped into 5 sections separated by hyphens.
The full format of a UUID is:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Where each x is a hexadecimal digit (0-9 or a-f).
The binary layout of a UUID contains:
16 octets / 128 bits total
The octets are transmitted in big endian byte order. This means the most significant byte is transmitted first.
The 1st section contains 32 bits (4 octets / bytes).
The 2nd, 3rd, and 4th sections contain 16 bits each (2 octets).
The 5th section contains 48 bits (6 octets).
In summary, the binary wire format of a UUID is a 128 bit / 16 byte string transmitted in big endian byte order. This binary data is generally represented using hexadecimal digits grouped into 5 sections separated by hyphens.
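The relationship between the hyphenated text form and the 16-octet binary form can be checked with a short sketch. The sample value below is a made-up UUID used only for illustration.

```python
import uuid

# Hypothetical sample value, used only for illustration.
sample = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

groups = str(sample).split("-")
group_bits = [len(g) * 4 for g in groups]  # each hex digit encodes 4 bits

raw = sample.bytes  # 16 octets, most significant byte first

print(groups)
print(group_bits)
print(len(raw), raw.hex())
```

The five text groups carry 32, 16, 16, 16, and 48 bits, and the binary form is the same digits packed into 16 octets in big endian order.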
UUID Variants
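UUID variants indicate different layouts of UUIDs. The variant is encoded in the most significant bits of octet 8 (counting from 0). RFC 4122 defines three main variants and reserves a fourth bit pattern (111) for future use: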
Variant 0: Reserved
Variant 0 UUIDs are reserved for backward compatibility with the original Apollo NCS format. Their variant field starts with a "0" bit. These are not commonly used today.
Variant 1: Common Variant
Variant 1 is the most common variant used today and is the layout specified by RFC 4122. This is the variant generated by typical UUID algorithms. The two most significant bits of octet 8 are "10". The UUID version number is stored separately, in the four most significant bits of octet 6.
Variant 2: Microsoft Backward Compatibility
Variant 2 UUIDs are reserved for backward compatibility with legacy Microsoft products. The three most significant bits of octet 8 are "110".
This pattern covers GUIDs created by early Microsoft COM/DCOM software, which stored some fields in little-endian rather than big endian byte order.
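RFC 4122 places the variant bits at the top of octet 8 (counting from 0). A minimal sketch of that classification follows; the function name `variant_name` and the returned labels are illustrative choices, not part of any standard API.

```python
import uuid

def variant_name(value: uuid.UUID) -> str:
    """Classify a UUID by the most significant bits of octet 8 (zero-based)."""
    octet = value.bytes[8]
    if octet & 0x80 == 0x00:      # bit pattern 0xxxxxxx
        return "NCS backward compatibility"
    if octet & 0xC0 == 0x80:      # bit pattern 10xxxxxx
        return "RFC 4122"
    if octet & 0xE0 == 0xC0:      # bit pattern 110xxxxx
        return "Microsoft backward compatibility"
    return "reserved for future use"  # bit pattern 111xxxxx

print(variant_name(uuid.uuid4()))
```

Python's `uuid.UUID` objects also expose a `variant` attribute that performs the same classification, so the hand-written version is only useful for seeing the bit tests.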
UUID Versions
There are 5 key versions of UUIDs that each generate IDs differently:
Version 1: Timestamp and MAC Address
Version 1 UUIDs are generated based on the timestamp, MAC address, and a few other values. Specifically, it uses:
A 60-bit timestamp value that represents the number of 100-nanosecond intervals since the start of the Gregorian calendar on 15 October 1582. This provides uniqueness over time.
A 48-bit node ID, which provides uniqueness over space. This is usually the Ethernet MAC address of a network interface card on the machine that generated the UUID.
A 14-bit clock sequence, typically initialized to a random number. It guards against duplicates when the clock is set backward or the node ID changes.
A 4-bit version number set to 0b0001 to indicate version 1.
A 2-bit variant field set to 10 to indicate the RFC 4122 variant.
Version 1 UUIDs are very popular and work well in distributed systems since they contain a timestamp and MAC address. The timestamp range is finite, however: RFC 4122 notes that the value rolls over around the year 3400, while an implementation that uses all 60 bits as an unsigned count would not wrap until roughly the year 5236.
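The fields packed into a version 1 UUID can be read back out. The sketch below relies on Python's standard `uuid` module; note that `uuid.uuid1()` may fall back to a random node ID when it cannot determine a MAC address, so the node printed is not guaranteed to be a real hardware address.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Start of the Gregorian calendar, the epoch for version 1 timestamps.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

value = uuid.uuid1()

# value.time is the 60-bit count of 100 ns intervals since the epoch.
# Integer-divide by 10 to get microseconds, the finest unit timedelta keeps.
created = GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)

print("version:       ", value.version)
print("timestamp:     ", created.isoformat())
print("clock sequence:", value.clock_seq)
print("node:          ", f"{value.node:012x}")
```

Because the creation time and node ID are recoverable like this, version 1 UUIDs leak that information to anyone who sees them, which is worth weighing when identifiers are exposed publicly.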
Version 2: DCE Security
Version 2 UUIDs are designed for use with Distributed Computing Environment (DCE) security. They are laid out like version 1 UUIDs, except that the low 32 bits of the timestamp are replaced by a local identifier such as a POSIX user or group ID, and the low 8 bits of the clock sequence are replaced by a local domain number that says what kind of identifier it is.
RFC 4122 does not specify version 2 in detail, and it is rarely used today.
Version 3: MD5 Hash
Version 3 UUIDs are generated by taking a namespace UUID and name, combining them, and creating an MD5 hash. The result is a version 3 UUID that is deterministic and unique based on the namespace and name.
Version 3 UUIDs require an application to have a unique namespace and provide a name in that namespace, such as a DNS domain name. They work well for creating UUIDs unique within a system.
Version 4: Random
Version 4 UUIDs are built from randomly or pseudorandomly generated bits. This provides uniqueness since the random values have a very low probability of collision.
To generate a version 4 UUID, 122 random bits are generated along with a 4-bit version number of 0b0100, and a 2-bit variant 10 to indicate RFC 4122 UUIDs.
Version 4 UUIDs are simple and efficient to generate but have a slight probability of collisions since they are random.
Version 5: SHA-1 Hash
Version 5 UUIDs work similarly to version 3, but use the SHA-1 hashing algorithm instead of MD5. A namespace UUID and name are combined and hashed with SHA-1, and the first 128 bits of the 160-bit digest are used to build the version 5 UUID.
Like version 3 UUIDs, version 5 requires an application-specific namespace and name value. The result is a UUID unique within that namespace.
Version 1: Based on Time and MAC Address
UUID Version 1 is one of the most commonly used versions today. It is based on the generation of a timestamp, MAC address, and some other data.
The timestamp component contains the number of 100-nanosecond intervals since the start of the Gregorian calendar on 15 October 1582 at midnight UTC. The timestamp provides up to 100 nanosecond resolution, allowing a single node to produce up to 10 million distinct timestamps per second. However, since the timestamp is limited to 60 bits, it eventually wraps around: RFC 4122 cites a rollover around the year 3400 AD, and an unsigned 60-bit count lasts until roughly 5236 AD. After the rollover, timestamps would begin to repeat.
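The rollover date follows from simple arithmetic on the timestamp width. The sketch below is an approximation: it uses an average Gregorian year and a rounded fractional start year, and the helper name `rollover_year` is hypothetical.

```python
# Arithmetic sketch for the range of the version 1 timestamp.
TICKS_PER_SECOND = 10_000_000              # one tick = 100 ns
SECONDS_PER_YEAR = 365.2425 * 24 * 3600    # average Gregorian year
EPOCH_YEAR = 1582.8                        # 15 October 1582 as a rough fractional year

def rollover_year(bits: int) -> float:
    """Approximate calendar year in which a tick counter of `bits` bits wraps."""
    years = (2 ** bits) / TICKS_PER_SECOND / SECONDS_PER_YEAR
    return EPOCH_YEAR + years

print(int(rollover_year(60)))  # all 60 bits used as an unsigned count
print(int(rollover_year(59)))  # range if the top bit is treated as a sign bit
```

The 59-bit case lands in the early 3400s, which matches the commonly quoted figure, while the full unsigned 60-bit range extends past the year 5200.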
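The node ID is usually the host's Ethernet MAC address, since MAC addresses are assigned to be unique across manufacturers. However, if the host does not have an Ethernet adapter, or does not want to expose its address, a random 48-bit value can be used instead, with the multicast bit set so that it cannot clash with a real network card address. The node ID contributes 48 bits that distinguish one generator from another.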
The timestamp (60 bits), clock sequence (14 bits), and node ID (48 bits) together fill the 122 bits not taken by the version and variant fields. These values are structured rather than random, but as long as node IDs are distinct and clocks behave, the probability of generating duplicate UUIDs is negligible. Overall, Version 1 offers very good uniqueness while also containing useful timestamp information, making it very popular for distributed and decentralized systems.
Versions 3 and 5: Namespace name-based
Versions 3 and 5 of UUIDs are name-based and derived from a namespace identifier and name.
The namespace identifier is a UUID that specifies the namespace from which the name will be generated. The name is an arbitrary string that is used alongside the namespace to generate the UUID.
Version 3 uses the MD5 hashing algorithm to generate a 128-bit hash from the namespace ID and name. Version 5 does the same but uses SHA-1 instead of MD5, keeping only the first 128 bits of the 160-bit SHA-1 digest.
The generated hash is used to construct a new UUID containing the hashed values along with some additional information like the version number. This results in a UUID that is unique within the context of the namespace.
The benefit of name-based UUIDs is that they allow creating UUIDs in a deterministic way when an application needs to generate identifiers that are unique in a closed context or namespace. For example, this can be useful for generating unique keys in a database without requiring a central issuing authority.
Since name-based UUIDs rely on hashing the namespace and name, even small differences in the input values will produce completely different UUIDs. This helps avoid collisions and duplicates. At the same time, the same input will always generate the same UUID.
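The name-based construction can be sketched by hand and compared against Python's built-in `uuid.uuid5`. The function name `name_based_v5` and the name "example.com" are illustrative; `uuid.NAMESPACE_DNS` is the DNS namespace UUID predefined by RFC 4122 and exposed by the standard library.

```python
import hashlib
import uuid

def name_based_v5(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Build a version 5 UUID step by step."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])           # SHA-1 yields 20 bytes; keep the first 16
    raw[6] = (raw[6] & 0x0F) | 0x50        # version 5 in the high 4 bits of octet 6
    raw[8] = (raw[8] & 0x3F) | 0x80        # variant bits 10 at the top of octet 8
    return uuid.UUID(bytes=bytes(raw))

manual = name_based_v5(uuid.NAMESPACE_DNS, "example.com")
library = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

print(manual)
print(manual == library)
print(uuid.uuid3(uuid.NAMESPACE_DNS, "example.com"))  # MD5-based counterpart
```

Running the function twice with the same inputs returns the same UUID, while changing the name by a single character yields an unrelated value.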
Version 4: Random
The version 4 UUID is generated using random numbers. This version uses a random or pseudorandom number generator to produce the 122 random bits of the UUID.
The four most significant bits of octet 6 hold the UUID version number (4), and the two most significant bits of octet 8 hold the variant bits (10, for variant 1). The remaining 122 bits are randomly generated by the algorithm, with no structure or order.
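A version 4 UUID can be assembled by hand in a few lines. This is a sketch rather than a replacement for `uuid.uuid4()`; the function name `random_v4` is hypothetical, and `os.urandom` is used as the random source.

```python
import os
import uuid

def random_v4() -> uuid.UUID:
    """Assemble a version 4 UUID from 16 random bytes."""
    raw = bytearray(os.urandom(16))
    raw[6] = (raw[6] & 0x0F) | 0x40   # overwrite 4 bits with version 0100
    raw[8] = (raw[8] & 0x3F) | 0x80   # overwrite 2 bits with variant 10
    return uuid.UUID(bytes=bytes(raw))

value = random_v4()
print(value, value.version)
```

Six of the 128 bits are overwritten with fixed values, which is why only 122 bits remain random. In the text form, the version shows up as the first digit of the third group.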
Since version 4 UUIDs consist of random bits, there is a possibility of generating the same UUID multiple times, known as a collision. However, because 122 bits allow for 2^122 or about 5.3 x 10^36 possible combinations, collisions are extremely unlikely to occur in practice.
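To put "extremely unlikely" into numbers, the birthday-bound approximation p ≈ 1 − exp(−n(n−1) / 2N) estimates the chance of at least one collision among n random values drawn from N possibilities. The sketch below assumes a perfectly uniform random source; the function name `collision_probability` is illustrative.

```python
import math

RANDOM_BITS = 122
SPACE = 2 ** RANDOM_BITS   # number of possible version 4 UUIDs

def collision_probability(count: int) -> float:
    """Birthday-bound estimate of at least one collision among `count` UUIDs."""
    exponent = -count * (count - 1) / (2 * SPACE)
    return -math.expm1(exponent)   # expm1 keeps precision for tiny values

print(f"{SPACE:.3e}")
print(collision_probability(10 ** 9))    # one billion UUIDs
print(collision_probability(10 ** 12))   # one trillion UUIDs
```

Even at a trillion identifiers the estimate stays far below one in a trillion, provided the generator's randomness is sound.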
The randomness and lack of structure in version 4 UUIDs make them well-suited for applications that need to generate identifiers quickly and do not require other properties like ordering. They can be generated efficiently without needing to access an external source like a MAC address or clock.
However, the randomness also means that version 4 UUIDs reveal no information about the time or source of their creation. They can still be sorted, but the resulting order is arbitrary and says nothing about when each one was created.
Overall, version 4 provides a simple and efficient way to generate UUIDs with a very low probability of collisions. The tradeoff is a lack of structure and absence of metadata like timestamps or source IDs.
Uses
UUIDs are used in a variety of applications and systems:
Distributed Databases
In distributed databases, UUIDs are commonly used as unique keys. Since the keys are generated independently without a central authority, there is a very low probability of key collisions between different nodes. This allows distributed databases like Cassandra, HBase, and MongoDB to scale horizontally without running into primary key conflicts.
Security
In security systems, UUIDs are sometimes used as identifiers for sessions, authorization codes, and similar records. Only version 4 UUIDs drawn from a cryptographically secure random number generator are hard to guess, and RFC 4122 itself warns against treating UUIDs as security capabilities. Secrets such as passwords and bearer tokens are better generated with a dedicated cryptographic token API.
Caching
UUIDs are useful for identifying cached objects in systems like memcached and Redis. Random UUID keys spread evenly across hash-based shards, which avoids the hot spots that sequential keys can cause under range-based partitioning. Random keys are also hard for an outsider to predict, which makes deliberate key collisions harder to engineer.
Drawbacks of UUIDs
UUIDs have some drawbacks that are worth considering when deciding whether to use them:
Large size: At 128 bits or 16 bytes, UUIDs take up a lot more space than simpler unique identifiers like auto-incrementing integers. This can have performance implications for databases and applications that store many UUIDs.
Chance of collisions: UUID version 4 is randomly generated, so there is a possibility that two UUIDs could collide and duplicate. However, with 2^122 possible UUIDs, the chance is extremely small.
Version 1 timestamp resolution: UUID version 1 uses the system time plus a clock sequence as part of generating the identifier. However, the timestamp only has 100 nanosecond resolution, and many system clocks are coarser than that. A generator asked for several UUIDs within a single tick must wait for the clock to advance or otherwise vary the value, or it risks producing duplicates.
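The size drawback is easy to quantify. The sketch below compares the common storage forms of a UUID; the 8-byte figure for an auto-incrementing key assumes a 64-bit integer column and is included only for comparison.

```python
import uuid

value = uuid.uuid4()

binary_size = len(value.bytes)   # compact binary form
text_size = len(str(value))      # canonical hyphenated text form
hex_size = len(value.hex)        # hex digits without hyphens
bigint_size = 8                  # assumed 64-bit auto-increment key, for comparison

print(binary_size, text_size, hex_size, bigint_size)
```

Storing UUIDs as text more than doubles the footprint relative to the 16-byte binary form, so a binary or native UUID column type is usually the better choice where the database offers one.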
While UUIDs have some potential drawbacks, they are still very useful for generating identifiers that have an extremely high probability of being unique across space and time. The large size makes the chance of collisions negligible in most cases. Overall, UUIDs are a robust solution for many distributed systems and applications that require unique IDs.
A final fun fact: for correctly generated UUIDs, an accidental collision is so improbable that it may well never have happened in the history of the world! Despite most companies adopting the UUID format, the number of possibilities is just that high.
Published on
Jan 4, 2024
in
Data
David Dobrynin
CTO