The length of an SMS Part 1: Bits of Encoding

If you’ve ever wondered why a text message (or SMS, for Short Message Service) is split in 2 (or 3 or 5), or why some texts are shorter than others, this post will go into some quirks of SMS and why that is.

This was originally one long post, but it became unwieldy so we’re splitting it in 2, just like an SMS.

Bits and bytes

My dog took a byte out of my hard drive. Now it’s in bits.

Hopefully you’ve heard of computers being run by 1s and 0s — those are bits.

A bit by itself is mostly useless (it can only represent 2 things), but combined together in an unimaginable number of ways, bits give us the wonders of modern computing, from Wikipedia to Twitter to Zevvle and all the other networks in between. One of the most popular ways to group bits is in chunks of 8, called a byte.

And because each bit in a byte has 2 possible states, a byte has 256 possible values (2^8), from 00000000 to 11111111.
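To make that arithmetic concrete, here is a tiny Python sketch. The variable names are just for illustration; the only thing it assumes is that a byte is 8 bits.

```python
BITS_PER_BYTE = 8

# Every extra bit doubles the number of patterns, so 8 bits give 2^8.
values_per_byte = 2 ** BITS_PER_BYTE

# The smallest and largest byte values, written out as 8 binary digits.
smallest = format(0, "08b")
largest = format(values_per_byte - 1, "08b")

print(values_per_byte, smallest, largest)  # 256 00000000 11111111
```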

Alphabets and encoding

Because computers don’t understand letters, numbers, pictures or much else to be honest (in fact they’re really stupid), we need to represent those things in terms they do understand: bits (and bytes). We need to encode letters and numbers with 0s and 1s. When you see the letter ‘Z,’ the computer sees 01011010 (probably).
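If you'd like to see those bits for yourself, a rough sketch in Python looks like this. The helper name `to_bits` is made up for this example, and it assumes plain ASCII text (one byte per character), which is why the post says "probably".

```python
def to_bits(text: str) -> list[str]:
    """Return one 8-digit binary string per character (ASCII text only)."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

z_bits = to_bits("Z")
print(z_bits)  # ['01011010']
```

Other encodings may store the same letter differently, which is the whole point of the next few sections.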

Encoding is a way of saying “represent information.” We do it all the time: when you write the letter ‘Z,’ you’re encoding this concept of ‘Z’ with your handwriting, and because we humans are good at pattern matching, you could probably write ‘Z’ a hundred different ways and it’d be legible.

But computers are more precise and need stricter rules. It can’t be “Well it roughly represents Z, so what’s the problem?” An encoding needs to be defined down to the bit: a set of rules that says “These bits in this specific order represent this specific thing.” Because we mostly agree on the same encodings, what I see on my screen as I’m writing is the same as what you’re reading right now (hopefully).

Here be encoding dragons

This may all sound fine and dandy, if we can agree on the encoding. But we’re human and that’s not always the case. You may have seen something like this before:

[Image: iMessage encoding problem]

That question mark box is your computer or phone saying “Argh! I don’t understand this sequence of 0s and 1s.” This often happens with older software trying to represent modern encodings. And it gets worse:

[Image: the last paragraph using the wrong encoding.]
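Both kinds of breakage are easy to reproduce. The sketch below is only an illustration: the word "café" and the choice of Latin-1 and ASCII as the "wrong" encodings are assumptions for the example, not what happened in the screenshots.

```python
original = "café"

# Write the text out as UTF-8 bytes: the "é" becomes two bytes.
raw = original.encode("utf-8")  # b'caf\xc3\xa9'

# Read those bytes back with the wrong rulebook (Latin-1): each byte is
# treated as its own character, so "é" turns into "Ã©".
garbled = raw.decode("latin-1")

# A decoder that cannot make sense of a byte at all can substitute
# U+FFFD, the replacement character often drawn as a question mark box.
replaced = raw.decode("ascii", errors="replace")

print(len(original), len(garbled), len(replaced))  # 4 5 5
```

Nothing about the bytes changed between the three versions; only the rules used to read them did.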

One encoding to rule them all

Because encoding causes headaches, in 1987 a few engineers from Xerox and Apple set out to create one encoding standard to rule all encoding standards. They called this standard Unicode for a “unique, unified, universal encoding.” That’s a lot of encoding.

Unicode defines a massive list of 1,114,112 numbers (what they call code points) that are used for all sorts of letters, numbers, symbols and emojis. Enough to cover all living characters and ones we haven’t even invented yet. As of May 2019, only 137,929 code points had been used.

However, Unicode doesn’t specify how those numbers are encoded into bytes, so it’s technically not an encoding. All it says is that the number 90 means Z, 900 is ΄ and 9000 is ⌨.
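Python's built-in `ord` and `chr` convert between characters and code points, so the examples above can be checked directly. This is a small sketch using only the standard library; `unicodedata.name` gives the official name of each character.

```python
import sys
import unicodedata

# Python strings cover the full Unicode range, 0 up to sys.maxunicode.
total_code_points = sys.maxunicode + 1

# Character -> number, and number -> character.
z_number = ord("Z")
names = {n: unicodedata.name(chr(n)) for n in (90, 900, 9000)}

print(total_code_points, z_number)  # 1114112 90
for number, name in names.items():
    print(number, name)
```

Note that no bytes are involved here at all: these are just numbers and the characters they stand for.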

Enter the UTF

The Unicode Transformation Format, or UTF, defines a way to map those 1,114,112 different Unicode possibilities to actual bytes the computer understands. Using 2 bytes isn’t enough as that only allows 65,536 variants (2^16), so you’d need 3 bytes at a minimum. But 3 is an awkward number for computers, thus 4 is the next best.
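The "3 bytes at a minimum" figure can be worked out in a couple of lines. A quick sketch, with names chosen for this example only:

```python
import math

CODE_POINTS = 1_114_112          # size of the Unicode code space
largest = CODE_POINTS - 1        # code points are numbered from 0

two_byte_values = 2 ** 16        # what 2 bytes can distinguish
bits_needed = largest.bit_length()
bytes_needed = math.ceil(bits_needed / 8)

print(two_byte_values, bits_needed, bytes_needed)  # 65536 21 3
```

So the largest code point needs 21 bits, which is more than 2 bytes (16 bits) and fits in 3 bytes (24 bits).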

Using 4 bytes per character is how UTF-32 encoding works (4 bytes being 32 bits). But that means a lot of wasted space — by stuffing the extra space with 0s, it’d mean encoding ‘Z’ with 24 extra 0s:

The letter ‘Z’ encoded in UTF-32: 00000000 00000000 00000000 01011010

Just using the English alphabet, what could be a 1 MB file would have to be saved as 4 MB… Because of that, UTF-32 is hardly used.
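Here is a short sketch of that 4x overhead. It uses Python's "utf-32-be" codec (big-endian, with no byte order mark in front) so the byte counts stay easy to read; the sample sentence is just an example.

```python
def as_bits(data: bytes) -> str:
    """Format bytes as space-separated groups of 8 binary digits."""
    return " ".join(format(byte, "08b") for byte in data)

z_utf32 = "Z".encode("utf-32-be")
print(as_bits(z_utf32))  # 00000000 00000000 00000000 01011010

# Plain English text: 1 byte per character in ASCII, 4 in UTF-32.
sentence = "My dog took a byte out of my hard drive."
ascii_size = len(sentence.encode("ascii"))
utf32_size = len(sentence.encode("utf-32-be"))
print(ascii_size, utf32_size)  # 40 160
```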

At some point, someone had a bright idea: “Well, what if we use only the number of bytes we need? If we can get away with 1 let’s use 1, and if we need 4 then we’ll just use 4.”

And so variable-width encoding was born. The most common of these, UTF-8, does exactly that — uses 1 byte when it can, and 2, 3 or 4 bytes when it must. The telephone emoji is one such 4-byte example:

[Image: the telephone emoji encoded in UTF-8 (Apple's design).]

Whereas the letter ‘Z’ uses just 1 byte: 01011010.
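To see the variable widths in action, here is a small sketch that encodes a few characters in UTF-8 and counts the bytes. The sample characters are my own picks, and the emoji is assumed to be the telephone receiver (U+1F4DE), since the post doesn't say which telephone emoji was pictured. The older telephone symbol (U+260E) would come out at 3 bytes instead.

```python
import unicodedata

# Illustrative samples: Z, e with acute, euro sign, telephone receiver.
SAMPLES = ["Z", "\u00e9", "\u20ac", "\U0001F4DE"]

def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of a character as groups of 8 binary digits."""
    return " ".join(format(byte, "08b") for byte in char.encode("utf-8"))

# Map each character's official name to its UTF-8 length in bytes.
byte_counts = {unicodedata.name(c): len(c.encode("utf-8")) for c in SAMPLES}

for name, count in byte_counts.items():
    print(f"{name}: {count} byte(s)")

print(utf8_bits("Z"))           # 01011010
print(utf8_bits("\U0001F4DE"))  # 11110000 10011111 10010011 10011110
```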

However, encodings don’t specify how something should look. That’s up to the computer and font system, which is why the telephone emoji above looks different on Android and iOS: Google and Apple have designed (and copyrighted!) different emojis for the same code point.

There’s just one problem: how do you define the boundaries between characters if they’re all different sizes? We’ll leave that for another time!

In part 2, we’ll talk about how this is all relevant to SMS, why one message is sometimes split into 2 or more and why some messages are different lengths…

Stay tuned!

Nick Goodall