Those are the sound bites you hear when you deal with software internationalization. It's a huge subject to fit into a small post, but I'll give you a summary with enough knowledge to move on. Let's start.
What is ASCII?
This one is probably the easiest to answer. ASCII is a table of 128 codes, that is 2⁷, which means you need 7 bits to represent each code.
Each code is translated to a character, and characters may be printable or non-printable. Non-printable characters are also known as control characters.
In the ASCII table, the control characters are the codes below 32, and 32 is the decimal code of the space character.
If you have a bash terminal, you may type ascii (provided the utility is installed) and you'll get a view of the table:
I've marked two very well-known control characters: Carriage Return and Line Feed.
ASCII stands for American Standard Code for Information Interchange and it was published in 1963 by the American Standards Association (the predecessor of ANSI). I've added this information so you notice why it contains the entire English alphabet, and nothing else. Thus, it gets tricky if you want to write, say, in German or Portuguese. Characters such as ß or é are quite common, and that's when code pages come around…
What are code pages?
A code page is simply a number assigned to identify a list of 128 additional characters (on top of ASCII), a.k.a. a character set.
As I said, we have one spare bit in the ASCII table and a lot of characters missing. So the solution seems easy: let's place the missing characters in those 128 newly available codes. And it did work out, because everyone who didn't speak English took exactly that approach. So for each language, roughly speaking, you have a code page, i.e. a number that identifies the character set for the language you are interested in.
Historically, OEM code pages were created first. These codes were registered by companies like HP or Dell, which licensed MS-DOS for distribution with their hardware.
Indeed, two groups of code pages exist: Windows code pages and OEM code pages.
For instance, Windows code page 28591 is equivalent to the standard ISO-8859-1, a.k.a. Latin-1.
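To see why code pages matter, here's a minimal Python sketch (Python ships codecs for these standards) showing that one and the same byte means different characters under different code pages:

```python
raw = bytes([0xE1])  # a single byte with decimal value 225

# Under ISO-8859-1 (Latin-1) this byte is "á"...
print(raw.decode("iso-8859-1"))  # á

# ...but under ISO-8859-5 (Cyrillic) it is "с" (Cyrillic small letter es)
print(raw.decode("iso-8859-5"))  # с
```

Without knowing which code page a file was written with, the bytes alone are ambiguous.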
You may see all the characters that are defined in this standard by typing man iso_8859-1 in your bash terminal. Here’s a sample:
Oct Dec Hex Char Description
240 160 A0 NO-BREAK SPACE
241 161 A1 ¡ INVERTED EXCLAMATION MARK
242 162 A2 ¢ CENT SIGN
243 163 A3 £ POUND SIGN
335 221 DD Ý LATIN CAPITAL LETTER Y WITH ACUTE
336 222 DE Þ LATIN CAPITAL LETTER THORN
337 223 DF ß LATIN SMALL LETTER SHARP S
340 224 E0 à LATIN SMALL LETTER A WITH GRAVE
341 225 E1 á LATIN SMALL LETTER A WITH ACUTE
343 227 E3 ã LATIN SMALL LETTER A WITH TILDE
344 228 E4 ä LATIN SMALL LETTER A WITH DIAERESIS
375 253 FD ý LATIN SMALL LETTER Y WITH ACUTE
376 254 FE þ LATIN SMALL LETTER THORN
377 255 FF ÿ LATIN SMALL LETTER Y WITH DIAERESIS
Now we have indeed used the Most Significant Bit.
So we can encode the string “Olá mundo!” using ISO-8859-1:
And check if we’re right:
And it matches.
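The two steps above, encoding and checking, can be sketched in Python:

```python
text = "Olá mundo!"

# encode the string with the ISO-8859-1 code page: one byte per character
encoded = text.encode("iso-8859-1")
print(encoded.hex(" "))  # 4f 6c e1 20 6d 75 6e 64 6f 21

# round-trip check: decoding the bytes gives the original string back
assert encoded.decode("iso-8859-1") == text
```

Note that á came out as 0xE1, i.e. decimal 225, just as the man page table shows.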
Now, you might be wondering: what happens if you want to write “Алло. Wie heißt du? Como estás?” to a file?
Well, let’s see. The character á has code 225 in ISO-8859-1 but no representation at all in ISO-8859-5, in which case you would get a ? character, or something else entirely; it depends on the mood of your editor. On the other hand, л has code 219 in ISO-8859-5 but no representation in ISO-8859-1…
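A quick Python sketch shows that no single ISO-8859 code page can hold that mixed string, whichever one you pick, some character is missing:

```python
s = "Алло. Wie heißt du? Como estás?"

# Latin-1 has no Cyrillic letters; ISO-8859-5 has no ß or á.
for codec in ("iso-8859-1", "iso-8859-5"):
    try:
        s.encode(codec)
    except UnicodeEncodeError as err:
        print(f"{codec}: cannot encode character at position {err.start}")
```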
Maybe it’s time to speak about Unicode.
What is Unicode, anyway?
Unicode is an abstract representation of nearly all characters that exist. Every single character has a numeric code point assigned, usually written in hexadecimal.
For instance, A is represented by U+0041, л is U+043B and á is U+00E1.
So by using Unicode you can now represent the string above as:
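A minimal Python sketch prints the code points of that string (in Python, ord() returns a character's code point):

```python
s = "Алло. Wie heißt du? Como estás?"

# print each character's Unicode code point in the usual U+XXXX notation
print(" ".join(f"U+{ord(ch):04X}" for ch in s))
# the first four code points are U+0410 U+043B U+043B U+043E
```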
At this point it is important to note that code points say nothing about how they should be translated into bits. This is merely an abstraction that allows everyone to write a document using the same code point references, no matter what characters are written.
Everything is good, but we really need to save this to disk. How do we do that? How do we encode?!
What is an encoding?
An encoding tells you how to turn those code points into bits.
And there are lots of encodings available. UTF-8 is the most common one, so let’s look into it. How does it work?
UTF-8 can use from 1 up to 4 bytes to encode any of the code points available in Unicode, and there are more than 130 thousand of them as of version 11.0.
The first bits of the first byte tell how many bytes are used to encode a code point. If more than one byte is needed, the following bytes are prefixed with 10.
The following table summarizes this, where x marks the bits available for encoding the code point:

1 byte   0xxxxxxx                               7 bits
2 bytes  110xxxxx 10xxxxxx                     11 bits
3 bytes  1110xxxx 10xxxxxx 10xxxxxx            16 bits
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   21 bits
So let’s take for instance the encoding of € character using UTF-8:
€ has the code point U+20AC.
which in binary is 10 0000 1010 1100, i.e. 14 bits.
From the table above one can see that 3 bytes are needed to fit those 14 bits.
Thus, the UTF-8 encoding comes as:
11100010 10000010 10101100
that is 0xE2 0x82 0xAC
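You can verify this worked example in Python, whose str.encode() does the bit-packing for you:

```python
euro = "€"
assert ord(euro) == 0x20AC  # the code point U+20AC

encoded = euro.encode("utf-8")
print(encoded.hex(" "))  # e2 82 ac

# the three bytes, written out bit by bit as in the table above
assert encoded == bytes([0b11100010, 0b10000010, 0b10101100])
```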
You may have noticed that if you encode “Hello World” in ASCII or UTF-8, the result is exactly the same. This is possible because the code points of the English alphabet match exactly the decimal codes of the ASCII table. This was the crucial point for compatibility between old-fashioned ASCII files and the needs that UTF-8 addressed.
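This compatibility is easy to check in Python:

```python
s = "Hello World"

# ASCII text is byte-for-byte identical when encoded as UTF-8,
# because the 1-byte UTF-8 pattern 0xxxxxxx is exactly ASCII's 7 bits
assert s.encode("ascii") == s.encode("utf-8") == b"Hello World"
```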
If you’re still with me and you want to find out more, I strongly advise you to read this famous must-read article on the subject.