Handling encodings (UTF-8)
From time to time I’m asked to extend the number of encodings supported by TextMate. My answer is normally that the user should be using UTF-8, so here’s a bit of history, reasons for using UTF-8, and tips about handling it in miscellaneous contexts.
Initially we had ASCII, which defines 128 characters (some of them control characters). Each character can be represented with 7 bits, and you can see all of them by running man ascii in your shell.
Since ASCII only contains the letters A-Z (without diacritics), several 8-bit extensions were made (e.g. CP-1252, MacRoman, ISO-8859-1), but 8 bits isn’t enough to also add e.g. Greek letters, so multiple variants exist (MacRoman, MacGreek, MacTurkish, …).
The different 8-bit encodings are generally not interchangeable without loss, so a new standard had to be created (Unicode), which is a superset of all the existing character sets.
Unicode code points are usually stored in 32 bits (the code space itself runs from U+0000 to U+10FFFF), which gives it plenty of room to grow. By contrast, the default encoding for documents transferred over HTTP (ISO-8859-1) does not contain the € symbol, and has no room to add it.
So Unicode should sell itself, seeing how it’s the only way to actually represent all the characters you can type both now and in the future.
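To make the point about 8-bit encodings concrete, here is a small Python sketch that tries to encode a single character in a few of the encodings mentioned above. The helper name encodings_of is made up for this illustration, and the codec names are the ones Python's standard library happens to use.

```python
def encodings_of(ch: str) -> dict:
    """Map encoding name -> encoded bytes, or None if the character
    cannot be represented in that encoding (hypothetical helper)."""
    result = {}
    for name in ("ascii", "iso-8859-1", "cp1252", "mac_roman", "utf-8"):
        try:
            result[name] = ch.encode(name)
        except UnicodeEncodeError:
            result[name] = None  # no slot for this character
    return result


euro = encodings_of("€")
e_acute = encodings_of("é")

for name in euro:
    print(f"{name:12} €={euro[name]!r:18} é={e_acute[name]!r}")
```

The same character gets different byte values in different 8-bit encodings, and some encodings have no byte for it at all, which is why converting between them can lose information.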
But a byte is 8 bits (an octet), and there is a lot of software which treats strings as octet streams, some of which expects to find miscellaneous tokens in these strings represented using their ASCII values (e.g. parsers, compilers and interpreters).
This is where UTF-8 enters the picture. UTF-8 is an 8-bit representation of Unicode, and when it comes to new protocols, RFC 2277 from the IETF says: Protocols MUST be able to use the UTF-8 charset.
Besides being an 8-bit encoding and being able to represent Unicode, it has a few other very nice properties:
- Every ASCII character is represented as an ASCII character in UTF-8.
- Every UTF-8 byte which looks like an ASCII character is an ASCII character.
- A random 15-byte sequence of bytes in the range 0x17-0xFF has a probability of 0.000081 of being valid UTF-8 (the probability gets lower the longer the sequence is, and is also lower for actual text).
Properties 1 and 2 are important for staying compatible with our existing ASCII-heavy software. E.g. a C compiler would generally only know about ASCII, but since strings and comments are treated as byte streams, we can use UTF-8 for our entire source and put non-ASCII characters in both our strings and comments.
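Properties 1 and 2 can be checked directly. The following Python sketch is only an illustration: the function names and the sample source line are invented here. It verifies that ASCII characters keep their byte values and that every byte belonging to a non-ASCII character is 0x80 or above, so a byte-oriented parser looking for a quote character cannot be fooled.

```python
def ascii_is_preserved(text: str) -> bool:
    """Property 1: each ASCII character encodes to the same single byte."""
    return all(
        ch.encode("utf-8") == bytes([ord(ch)])
        for ch in text
        if ord(ch) < 0x80
    )


def non_ascii_stays_high(text: str) -> bool:
    """Property 2: no byte of a multi-byte sequence is in the ASCII range."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) >= 0x80
        for byte in ch.encode("utf-8")
    )


# A made-up line of source code with non-ASCII text inside a string literal.
source_line = 'puts("blåbærgrød €");'
encoded = source_line.encode("utf-8")

# Scanning the raw bytes for the ASCII quote finds only the real delimiters.
quote_positions = [i for i, b in enumerate(encoded) if b == ord('"')]
print(ascii_is_preserved(source_line), non_ascii_stays_high(source_line))
print(quote_positions)
```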
Property 3 turns out to be attractive because it means we can heuristically recognize UTF-8 with near 100% certainty by checking whether the file is valid. Some software thinks it’s a good idea to embed a BOM (byte order mark) at the beginning of a UTF-8 file, but it is not: the file can already be recognized, and a BOM puts three bytes at the beginning of the file which a program that uses the file may not expect (e.g. the shell interpreter looks for #! as the first two bytes of an executable).
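The validity check is short to write. This is a minimal sketch in Python, relying on the strict UTF-8 decoder in the standard library; the function names are invented for the example. It also shows why a BOM gets in the way of the #! check.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the three bytes a UTF-8 "BOM" adds


def is_valid_utf8(data: bytes) -> bool:
    """Heuristic from the text: treat data as UTF-8 if it decodes cleanly."""
    try:
        data.decode("utf-8")  # strict by default: raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True


def strip_bom(data: bytes) -> bytes:
    """Drop a leading BOM, if present, so byte-level checks see the real start."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data


script = UTF8_BOM + b"#!/bin/sh\necho hello\n"
print(is_valid_utf8("café €".encode("utf-8")))       # valid UTF-8
print(is_valid_utf8("café".encode("iso-8859-1")))    # lone 0xE9 byte is invalid
print(script[:2] == b"#!", strip_bom(script)[:2] == b"#!")
```

Note that pure ASCII text also passes the check, which is fine since ASCII is valid UTF-8.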
Serving HTML as UTF-8
What I hear the most is that some browsers do not support UTF-8. This is not true (since at least version 4 of IE/NS), but you need to include the encoding in the HTTP response headers. If you’re using Apache and the default charset is not set to UTF-8, you can add the following to your
You can also set it for specific extensions, e.g.:
AddCharset utf-8 .txt .html
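If the pages come from your own code rather than from static files, the same rule applies: state the charset in the Content-Type response header. Below is a minimal sketch as a Python WSGI application. The name app and the page content are placeholders for illustration, not part of any particular setup.

```python
def app(environ, start_response):
    """Tiny WSGI application that serves one HTML page as UTF-8."""
    body = "<!DOCTYPE html><title>Price</title><p>10 €</p>".encode("utf-8")
    headers = [
        # The charset parameter is what tells the browser how to decode the body.
        ("Content-Type", "text/html; charset=utf-8"),
        # Content-Length counts bytes, not characters (€ is three bytes).
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local test, the standard library's wsgiref.simple_server.make_server can serve this application.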
Receiving user data as UTF-8
If you accept data from the user via an HTML form, you should add accept-charset="utf-8" to the form element, e.g.:
<form accept-charset="utf-8" …> … </form>
This will ensure that data is sent as UTF-8, and no, you cannot rely on the encoding if you do not supply this! Nor can you rely on all users limiting their use of characters to the ASCII subset which is common for the majority of encodings.
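On the receiving side it can still be worth checking that what arrived really is UTF-8 rather than trusting the client. A minimal sketch in Python, assuming a URL-encoded form body (application/x-www-form-urlencoded); the function name decode_form is invented for the example:

```python
from urllib.parse import parse_qs


def decode_form(body: bytes) -> dict:
    """Decode a URL-encoded form body, insisting on valid UTF-8.

    Raises UnicodeDecodeError if a field is not valid UTF-8
    (for example if the browser sent it in some 8-bit encoding).
    """
    return parse_qs(
        body.decode("ascii"),     # the percent-encoded body itself is ASCII
        keep_blank_values=True,
        encoding="utf-8",
        errors="strict",          # the default is "replace", which hides problems
    )


print(decode_form(b"name=Allan&price=10+%E2%82%AC"))
```

A body such as price=%80 (the € sign in CP-1252) makes decode_form raise instead of silently producing a replacement character.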
To make LaTeX interpret your document as UTF-8, add this at the top:
By default Terminal.app should already be set to use UTF-8 (Window Settings → Display). Since HFS+ file names are returned as UTF-8, this makes sense not only so that you can grep files in UTF-8, but also because ls returns data in UTF-8 (since it’s dumping file system names).
In addition to the display preference, you should also add the following line to your profile (e.g. ~/.bash_profile for bash users):
In fact, without it Subversion will fail to work for repositories which use non-ASCII characters (when it re-encodes file names to the local system).
Other programs also use the variable, e.g. vim will only interpret UTF-8 multi-byte sequences correctly with the variable set.
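To see roughly what such programs do with the environment, here is a Python sketch that works out the character set from the locale variables. It assumes the usual POSIX convention, where LC_ALL overrides LC_CTYPE, which overrides LANG, and where locale names look like language_TERRITORY.codeset. The text above does not name the variable, so treat this precedence as an assumption; the function name is made up for the example.

```python
import os


def charset_from_locale_env(env) -> "str | None":
    """Return the codeset part of the effective character-type locale.

    Assumes the POSIX precedence LC_ALL > LC_CTYPE > LANG; empty values
    count as unset. Returns None if no codeset is given (e.g. "C").
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            break
    else:
        return None
    value = value.split("@", 1)[0]  # drop a modifier such as "@euro"
    if "." not in value:
        return None
    return value.split(".", 1)[1]


print(charset_from_locale_env(os.environ))
```

If this prints None or something other than UTF-8 in your shell, programs that follow the locale have no reason to treat your input as UTF-8.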
Converting between encodings
If you need to convert between encodings, you can use iconv, e.g.:
ls|iconv -f utf-8 -t ucs-2|xxd
This converts the output of ls to UCS-2 (16-bit Unicode) and produces a hex dump of it.
iconv has a transliteration feature if you need to use a lossy encoding, e.g.:
echo "that's nice…"|iconv -f utf-8 -t ASCII//TRANSLIT
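The same conversions can be sketched in Python. Two caveats: Python has no codec called ucs-2, so the sketch uses utf-16-be, which gives the same bytes for characters in the Basic Multilingual Plane. It also has no built-in equivalent of //TRANSLIT, so the lossy step below uses Unicode compatibility decomposition (NFKD) and drops whatever is still non-ASCII. That is a different technique from iconv's and will not give identical output. Both function names are invented for the example.

```python
import unicodedata


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one encoding to another (raises if not representable)."""
    return data.decode(src).encode(dst)


def to_ascii_lossy(text: str) -> str:
    """Approximate transliteration: decompose, then drop non-ASCII leftovers.

    Characters without an ASCII decomposition simply disappear.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(convert("é".encode("utf-8"), "utf-8", "utf-16-be").hex())
print(to_ascii_lossy("naïve…"))
```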