2013: Advent Computing: More bits

December 3, 2013

Yesterday I talked about encoding characters, i.e. letters and numbers and symbols, in bytes. One byte can represent any number between 0 and 255, and the near-universal standard known as ASCII gives a specific character for any number between 0 and 127.
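
To make that concrete, here’s a quick sketch in Python (just my illustration of the mapping, nothing specific to ASCII itself):

    # Each ASCII character corresponds to a number between 0 and 127.
    print(ord('A'))              # 65
    print(chr(65))               # A
    print('A'.encode('ascii'))   # b'A' - one byte holding the value 65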

If you write 0 and 127 in binary, using all eight bits in each byte, you get 00000000 and 01111111. The thing to note here is that the first bit is 0 for both of these numbers, and indeed for every number in between. We can therefore use that first bit as a “flag” to signal that something special is going on with the character that follows.
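
You can check that flag bit directly; here’s a small Python sketch (the values I test are just examples I picked):

    # Plain ASCII values fit in seven bits, so the top (eighth) bit is 0.
    for value in (0, ord('A'), 127, 0xA3):
        top_bit_set = (value & 0b10000000) != 0
        print(f"{value:3d} = {value:08b}  top bit set: {top_bit_set}")
    # 0xA3 (163) has its top bit set, so it is not plain ASCII -
    # something special is going on there.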

Originally, “something special” meant the character came from some local character scheme, one that would normally be country specific. British documents might use an encoding that included a “£” symbol, while Germans would be more interested in vowels with umlauts.

There are a few problems with this scheme, however. The first is that if a German ever wants to send an email to a Brit, they need to find some way of agreeing whose encoding to use. While there were a few ways of doing this, it often went wrong, resulting in odd characters in place of quote marks and so forth. The second is that nobody thought about how this would work for languages like Chinese, which scoff at the idea of a mere 256 characters.
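
To show how that goes wrong, here’s a small Python sketch (the two encodings, Latin-1 and the old IBM code page 437, are my own choice of example):

    # The same byte means different things under different national encodings.
    raw = bytes([0xA3])
    print(raw.decode('latin-1'))   # £  (what the Brit meant)
    print(raw.decode('cp437'))     # ú  (what an old DOS machine would show)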

That’s where Unicode comes in, which assigns a number to each of a huge range of symbols, including letters with umlauts, pound signs, Chinese characters, and some more exciting things (such as a “combining” character for putting one mark on top of another, which lets you add an umlaut to a letter that doesn’t normally have one). There is room for over a million such numbers, though only a fraction of them have been assigned to symbols so far.
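
Here’s what that looks like in Python (my own illustrative pick of characters):

    # Every Unicode character has its own number (its "code point").
    print(ord('£'))     # 163
    print(ord('€'))     # 8364
    print(ord('中'))    # 20013
    # A combining umlaut (U+0308) stacks on whatever character precedes it.
    print('x' + '\u0308')    # an x with an umlaut no keyboard offers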

But that, clearly, won’t fit in a single byte. There are a number of ways of dealing with this, the most common of which (at least for western writing) is called “UTF-8”. The “8” means the default is for a character to be eight bits long, but, just like before, if the first bit is a “1”, we do something more complicated. This time, that initial “1” flag means we’ll use several bytes to describe a single character. A euro symbol, for example, is represented by the number 8364, clearly bigger than the 255 a single byte can hold, so we write it as three bytes: 11100010 10000010 10101100. The number of leading “1”s in the first byte tells us how many bytes the entire character will take, and each of the following bytes starts with “10” so it can’t be mistaken for the start of a new character.
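
Here’s that euro example worked through in Python (again, just a sketch of what’s described above):

    # The euro sign is code point 8364, too big for one byte,
    # so UTF-8 spreads it across three.
    encoded = '€'.encode('utf-8')
    print(len(encoded))                            # 3
    print(' '.join(f'{b:08b}' for b in encoded))   # 11100010 10000010 10101100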

But there are multiple ways of writing Unicode, too, so while this deals with the problem of one encoding lacking characters that exist in another, it doesn’t solve the problem of working out which encoding to use in the first place. We’re getting better at that: the encoding is normally declared somewhere using plain ASCII, which is common to most ways of converting symbols to numbers, but you will still occasionally see web pages or emails with something like “€ú&” in place of the correct punctuation.
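
That garbage is what you get when a program guesses the wrong encoding. A quick Python sketch (the choice of Windows-1252 as the wrong guess is mine):

    # The euro sign's three UTF-8 bytes, read back with the wrong
    # encoding, become three separate (wrong) characters.
    utf8_bytes = '€'.encode('utf-8')       # b'\xe2\x82\xac'
    print(utf8_bytes.decode('cp1252'))     # â‚¬ - three symbols instead of one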

And now you know why.

One Response to “More bits”

  • duckbunny says:

    So that’s why when the punctuation goes wrong you get three symbols in place of the one! I am excited to know this!
