UTF-8 Everywhere

A new manifesto:
I agree with most of it. I’ve been through this madness many times in the past, working with text strings in many different locales. UTF-8 is fine. And the arguments that UTF-8 takes more space are nonsense, for two reasons:

  • Text is small.
  • If you need to make it even smaller, you compress it.

Text is small. According to Google, about 129 million books had been published as of 2010. The average book runs about 64,000 words, or something under 400,000 characters. So let’s assume the worst case happens and UTF-8 encoding gives us 4 bytes per character. That’s 1.6 million bytes per book, or roughly 200 TB for everything published in modern times (the past 400 years).
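The arithmetic above is easy to check. A quick sketch, using only the figures already given in the text (the 4 bytes/character figure is UTF-8’s worst case):

```python
# Back-of-the-envelope: worst-case UTF-8 size of everything ever published.
BOOKS = 129_000_000        # books published as of 2010, per Google's estimate
CHARS_PER_BOOK = 400_000   # ~64,000 words at roughly 6 characters each
WORST_CASE_BYTES = 4       # a UTF-8 code point is at most 4 bytes

per_book = CHARS_PER_BOOK * WORST_CASE_BYTES   # bytes per book
total_tb = BOOKS * per_book / 1e12             # terabytes for the whole corpus
print(f"{per_book / 1e6:.1f} MB per book, {total_tb:.0f} TB total")
# prints "1.6 MB per book, 206 TB total"
```

It lands at just over 200 TB, and that is with every single character taking the 4-byte worst case, which real text never does.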

Text compresses quite well. Unless you’re writing literal nonsense, text is going to compress incredibly well. English text can be compressed to under 2 bits per character, and the same very likely holds for other languages. Even if it’s 4 bits per character, our corpus of “everything published in modern times” will fit in 25 TB.

At 4 bits per character, the average book of 400,000 characters fits in 200 KB. I have desktop icons that are bigger than 200 KB.

The advantages of UTF-8 are tremendous. It’s byte-strings through most of your plumbing, except for the parts that have to construct or parse text. And at that point, you can use whatever internal representation makes you happy.
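The reason byte-string plumbing just works is a deliberate property of the encoding: every byte of a multi-byte UTF-8 sequence is ≥ 0x80, so code that searches or splits on ASCII delimiters can treat the text as opaque bytes without ever slicing a character in half. A small illustration:

```python
# Byte-oriented code handles UTF-8 safely because ASCII bytes (< 0x80)
# never appear inside a multi-byte sequence. So splitting on an ASCII
# comma works on the raw bytes, with no decoding in the middle.
line = "naïve,café,日本語".encode("utf-8")
fields = line.split(b",")                        # pure byte-string plumbing
decoded = [f.decode("utf-8") for f in fields]    # decode only at the edges
print(decoded)
# prints "['naïve', 'café', '日本語']"
```

Decoding happens only at the boundary where you actually need characters; everything in between moves bytes around.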

But outside of those handful of routines, use UTF-8.