Nov 4, 2022 3 min read programming

Unicode surrogates and exponential growth

An interesting fact about Unicode is that it assigns code points for things that aren't themselves characters. Some of these things are special values called surrogates. These are used in UTF-16 to represent characters outside the Basic Multilingual Plane, or BMP.^[1] There are 2,048 such code points. 1,024 are called "high surrogates" and 1,024 are called "low surrogates." To represent a character outside the BMP you use a high surrogate followed by a low surrogate. Each Unicode plane consists of 65,536 code points, and the BMP is no exception. So there are 65,536 - 2,048 = 63,488 non-surrogate code points, also called Unicode scalar values.^[2]

An interesting thing about this is that this "costs" 11 bits to represent, but not really. It "costs" 11 bits in the sense that you need at least $2^{11} = 2048$ code points to represent all possible surrogates. But it doesn't in the sense that you will have way more than $2^5 = 32$ code points left over. The math checks out, but this is sort of unintuitive, so it seems worth exploring in more detail.

One way of thinking about this is adding bits. Every bit you add doubles the number of values you can represent, but you still only need the same number of values (not bits) to represent the 2,048 code points you allocated. In this frame you think of the extra bits as defining an "index" over $2^{11}$-bit spaces. You have 5 leftover bits to define that index, which means you have $2^5 = 32$ possible indices. You only need to use one possible index to store the 2,048 code points; that leaves you with $2^5 - 1 = 31$ remaining possible indices, and each of those indices is associated with a completely separate space of 2,048 code points, which means you have $31 \cdot 2,048 = 63,488$ remaining code points you can assign as you like.

I like to visualize this in terms of imaginary extra bits. Imagine you have just 11 bits, 11 "slots" or "places" for ones and zeros, so that you can only represent 2,048 values. You can imagine an infinite line of additional slots (or bits) that are locked at 0; only the finitely many slots you actually have available can be set to 1.^[3]^[4] Adding additional bits gives you your "index over 2,048-codepoint-spaces." The way I think of this, the 11 trailing bits are "used up" only in the context of the zero-index space. Adding the 5 additional bits means you now have a bunch of spaces where those 11 bits can be used to mean completely different things.

Another way to put it: As you add more bits beyond the initial 11, that enables "dual use" of the initial 11 bits.^[5]

The surrogate code points themselves fall within the range of the Basic Multilingual Plane but aren't technically assigned characters -- they're just reserved for use in UTF-16, if I understand correctly. ↩︎
Technically, I think there should be 63,487 non-surrogate code points, because one code point -- U+FFFE -- is not actually a valid Unicode code point. U+FEFF is the byte order mark. The byte-flipped opposite U+FFFE is declared invalid so that if you see a pair of bytes 0xFE and 0xFF, in either order, then you know whether you're looking at a Big Endian or a Little Endian encoding of UTF-16 -- because there is only one valid way to read this pair of bytes. This is pretty cool! ↩︎
These zeros don't add any more information so you're still stuck with the same finite information capacity; this is just a convenient visualization to understand what's going on. ↩︎
Only $9.99 a month to unlock additional slots. Act now during our flash sale event and get $64$ ($\ll \infty$) bits for just $4.99! ↩︎
I say "dual use" in quotes because it's more than dual -- you get to use the same range of bits not just twice but 32 times total. ↩︎

You might also like...

Things I learned calling shm_open() on a Mac

Lessons learned from battling Godot 3 for twelve hours

Beware map indexing in C++

Eval game rule 10

You could have invented PEM encoding