The length of an SMS Part 2: More Encoding Adventures
Did you read part 1? If not, you should definitely read part 1. To sum up:
- Computers only understand 1s and 0s (bits). A byte is 8 bits.
- Computers use bytes to represent things we understand, like the alphabet and emoji. How they do that is called encoding.
- Unicode exists: a universally(ish)-understood way to ‘encode’ our system of letters, numbers, glyphs and characters into bits and bytes.
In the early ’80s, engineers working on GSM (back then a standards committee called the Groupe Spécial Mobile) were developing some fancy new tech to send messages using the telephone system. At the time the network was mostly used for voice calls, but the part of the system used for signalling (e.g., to notify a phone of an incoming call) didn’t see much use, and they figured it’d be a cost-effective way to send messages, provided they were as short as possible: originally 128 bytes.
Given the space constraints, how do you fit as many character combinations as you can? If UTF-8 had existed they might’ve gone with that, or with some other scheme that spends a whole byte on each character, which allows up to 256 characters (a byte having 2^8, or 256, possible values). But it turns out they didn’t need all those possibilities. By using 1 less bit per character (7 instead of 8) they could squeeze out some extra space for the message…
Yet another encoding
And so that’s what happened: they developed a new encoding, GSM-7 [1], which uses 7 bits per character, giving 128 possible values (2^7 = 128). This was enough for our Western alphabets, including upper and lowercase letters, numbers, punctuation and also some Greek letters.
The original size of an SMS was 128 bytes (1024 bits), meaning you could fit 146 GSM-7 characters (1024 ÷ 7 ≈ 146.3, rounded down). However, they later extended this to 140 bytes (1120 bits), bringing the maximum message length to 160 characters (1120 ÷ 7 = 160).
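To make that arithmetic concrete, here is a minimal Python sketch. The function names `gsm7_capacity` and `pack_septets` are ours, made up for illustration, and the packing shown is the usual least-significant-bit-first layout for 7-bit SMS text. It also leans on the fact that the lowercase letters a to z have the same numeric codes in GSM-7 as in ASCII, so we can feed it plain ASCII bytes for a quick demo.

```python
def gsm7_capacity(payload_bytes: int) -> int:
    """How many whole 7-bit characters fit in a payload of this many bytes."""
    return (payload_bytes * 8) // 7


def pack_septets(septets) -> bytes:
    """Pack 7-bit values back to back, least significant bits first."""
    out = bytearray()
    acc = 0    # bits waiting to be written out
    nbits = 0  # how many bits are currently waiting in acc
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial byte, zero-padded
    return bytes(out)


print(gsm7_capacity(128))            # 146
print(gsm7_capacity(140))            # 160
print(pack_septets(b"hello").hex())  # e8329bfd06
print(len(pack_septets([0] * 160)))  # 140
```

Five characters fit in five bytes here only because the last byte is mostly padding; keep going and every eighth character comes along for free.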
128 available characters was okay, but… what about other alphabets? Emojis? The rest of our literature?!
And another standard [2]
Remember UTF-8? That’s the encoding frontrunner of today, but it wasn’t always that way. Alongside Unicode exists the Universal Coded Character Set, or UCS. It’s almost identical to Unicode, and also isn’t itself an encoding — it just maps characters to numbers, or code points. The simplest of the UCS encodings is UCS-2 which represents a code point with 2 bytes.
When SMS was ready to level up into thousands of characters, UCS-2 was ready. If you send an SMS with a special character like ♛ (black chess queen), it’s likely encoded with UCS-2. But there’s a tradeoff: although you get access to 65,408 more characters, they all use more space. In other words, your messages have to be shorter. With only 140 bytes in an SMS, this means 70 characters with UCS-2 vs the 160 in GSM-7.
But that’s not all. Unlike UTF-8 which is variable-width, UCS-2 is quite strict and only uses 2 bytes, meaning it can only represent 65,536 different characters, and none of those include emojis (😢). So…
Back to the UTF
To get around the 65 thousand character limit of UCS-2, another encoding — UTF-16 — was created in its wake, which can vary its width to access the rest of the 1.1 million Unicode/UCS code points. It’ll use 2 bytes when it can and 4 bytes when it must, giving us all those delicious emojis.
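You can see the 2-versus-4 byte difference for yourself with a few lines of Python. This is only a sketch: `utf16_bytes` and `fits_in_ucs2` are names we made up, and the check for UCS-2 simply asks whether the code point fits in 16 bits.

```python
def utf16_bytes(ch: str) -> int:
    """Number of bytes UTF-16 needs for a single character."""
    return len(ch.encode("utf-16-be"))


def fits_in_ucs2(ch: str) -> bool:
    """UCS-2 has exactly 2 bytes per character, so the code point must fit in 16 bits."""
    return ord(ch) <= 0xFFFF


for ch in ("A", "♛", "😢"):
    print(f"U+{ord(ch):04X}", utf16_bytes(ch), fits_in_ucs2(ch))
# U+0041 2 True
# U+265B 2 True
# U+1F622 4 False
```

The chess queen squeaks in under the 16-bit limit; the crying face doesn't, so UTF-16 spends 4 bytes on it.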
So which encoding will your SMS use? Ultimately that’s up to the network — UCS-2 is still everywhere and whether the operator supports UTF-16 is up to them and their equipment. Not to mention the handset — some older phones won’t even know about UTF-16!
What about really long messages?
The engineers tasked with all this could have set a hard limit on SMS length. “No 161-character SMSs,” they might say. But no, these people were ambitious and settled on something else — Concatenated SMS. ‘Concatenated’ means linked together, and that’s exactly how this works: split longer messages into sections that can fit over the network, and then somehow link them together.
To achieve this linking, each part of the message gets something called a User Data Header, or UDH [3]. It’s a bit of hidden information put at the start of the message which tells the network or phone how many parts there are, where this part fits into the sequence and a reference for the whole message. But this isn’t free: it uses 6 bytes which, as you may have guessed, shortens the size of your SMS!
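For the curious, those 6 bytes can be sketched like this. We’re assuming the common layout with an 8-bit message reference (a header length byte, an element identifier of 0x00, an element length of 3, then the reference, the total number of parts and this part’s position). The function name `build_concat_udh` is ours, not something from a library.

```python
def build_concat_udh(ref: int, total: int, seq: int) -> bytes:
    """Build a 6-byte concatenation header (assumed 8-bit reference layout)."""
    if not 0 <= ref <= 255:
        raise ValueError("reference must fit in one byte")
    if not 1 <= seq <= total <= 255:
        raise ValueError("need 1 <= seq <= total <= 255")
    return bytes([
        0x05,   # length of the header that follows (5 bytes)
        0x00,   # element identifier: concatenated message, 8-bit reference
        0x03,   # length of the element data (3 bytes)
        ref,    # reference shared by every part of the same message
        total,  # how many parts there are
        seq,    # where this part fits in the sequence (starting at 1)
    ])


udh = build_concat_udh(ref=0x2A, total=3, seq=1)
print(udh.hex(), len(udh))  # 0500032a0301 6
```

Every part of the same message carries the same reference and total, and only the last byte changes from part to part.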
With 6 bytes fewer, this brings each message’s total capacity down to 134 bytes. That means 153 characters with GSM-7 (134 × 8 ÷ 7 ≈ 153.1, rounded down) or 67 characters with UCS-2 (134 ÷ 2 = 67).
|Encoding|Character size|SMS length|Concatenated SMS|
|---|---|---|---|
|GSM-7|7 bits|160|153|
|UCS-2|2 bytes (16 bits)|70|67|
|UTF-16|2 or 4 bytes|70*|67*|

* Fewer if the message contains 4-byte characters, such as emoji.
When you send 1 message but it shows you used more, this is why. Using even 1 emoji will change the encoding and shorten the message length.
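Here’s a rough Python sketch of how you might estimate the number of parts yourself, using the limits from this post. The names `LIMITS`, `utf16_units` and `count_parts` are made up for illustration, and the sketch simplifies things: you tell it which encoding is in play rather than it detecting that, and it doesn’t worry about keeping a 4-byte character from being split across two parts.

```python
import math

# (single SMS limit, per-part limit for Concatenated SMS), in characters
LIMITS = {
    "GSM-7": (160, 153),
    "UCS-2": (70, 67),
}


def utf16_units(text: str) -> int:
    """Length in 2-byte units, so a 4-byte emoji counts as 2."""
    return len(text.encode("utf-16-be")) // 2


def count_parts(length: int, encoding: str) -> int:
    """How many SMS parts a message of this length needs."""
    single, concat = LIMITS[encoding]
    if length <= single:
        return 1
    return math.ceil(length / concat)


print(count_parts(160, "GSM-7"))  # 1
print(count_parts(161, "GSM-7"))  # 2
print(utf16_units("hi 😢"))        # 5
print(count_parts(71, "UCS-2"))   # 2
```

Notice that going from 160 to 161 characters doesn’t just spill 1 character into a second message: both parts now carry a header, so each holds a little less than a standalone SMS.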
We hope this has made things a little clearer — an SMS can fit 140 bytes, but how many characters that means depends on the encoding and whether the message is split into multiple parts.
[1] Technically GSM 03.38, but that’s a mouthful.
[2] UCS isn’t actually a standard, it’s something defined in a standard, namely ISO 10646. We didn’t want to confuse things toooo much.
[3] Picture messages (MMS) also have User Data Headers and they can be different lengths. For Concatenated SMS they’re always 6 bytes.