I started playing around with Khan Academy’s exercise framework. If I can figure it out well enough, I’d like to apply it to teaching a foreign language (Thamil) rather than math. I had gotten the basic case figured out: a single-question, question-and-answer exercise. From there, I inserted Thamil characters and saved the HTML doc in UTF-16 encoding, and at that point I ran into serious errors. Fortunately, it didn’t take much more than learning the basics of Unicode and a little poking around to figure it out, fix the issue, and get a better understanding of how it all fits together.
First, here is what the first sample exercise looked like. You have to type the word “green” (no quotes) to get the problem right.
With the first step done, I decided to mix in a little Thamil. Here’s what I got instead:
All seemed well except for the representation of the Thamil characters. The file was saved using the gEdit text editor in the UTF-8 encoding. I vaguely remembered UTF-8 as being something like a fancy ASCII. Assuming that Thamil characters fell in a part of the spec where at least 2 bytes per character were necessary, I tried again, this time saving the file in UTF-16:
Either the KA code or JS apparently didn’t like pages in UTF-16. The page showed the original HTML as it was before KA’s “macro” code had run. The browsers did detect the encoding as UTF-16, but the KA code most likely failed to execute on the page. At this point, I needed to review Unicode for more clues.
Background on Unicode
Probably the best 2 quick references for Unicode in the context of using it in programs are the following:
- UTF-8 and Unicode FAQ
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (by Joel Spolsky)
People may have come across Joel Spolsky’s post already, as I did a year or two ago. It’s only now that I came across the UTF-8 and Unicode FAQ for the first time, but I must admit that I found it to explain things better, even the quirks, although YMMV. Even with that knowledge seemingly refreshed, it still seemed as if UTF-16 was the only encoding that made sense to use, and the situation remained puzzling.
The breakthrough came when I saw that the UTF-8 file rendered fine in Mac OS X. That meant that: 1) UTF-8 was not an impediment to rendering Thamil and other “complex” scripts, and 2) UTF-8 and UTF-16 are just two different byte encodings of the same Unicode codepoints, and Unicode-aware OSes have long been able to decode either one. For the “complex scripts” (‘Indic’, Arabic, etc.), UTF-8 simply uses a multi-byte sequence for each codepoint; a Thamil codepoint takes 2 bytes in UTF-16 and 3 bytes in UTF-8. Also, for Thamil and other scripts, one glyph (what gets displayed on the screen as a single “character”) can be made up of one or more codepoints.
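To make the byte counts concrete, here is a small Python sketch (Python 3.8 or later). It is only an illustration and not part of the KA framework; the sample word, “Thamil” written in Thamil script, and the variable names are my own choices for the example.

```python
import unicodedata

# The word "Thamil" in Thamil script, written as escapes so the example
# does not depend on how this source file itself is encoded.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Show each codepoint, its Unicode name, and its UTF-8 bytes in hex.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))

utf8_bytes = word.encode("utf-8")
utf16_bytes = word.encode("utf-16-le")  # little-endian, no byte order mark

print(len(word))         # number of codepoints
print(len(utf8_bytes))   # UTF-8 uses 3 bytes per Thamil codepoint
print(len(utf16_bytes))  # UTF-16 uses 2 bytes per Thamil codepoint

# Both byte streams decode back to exactly the same codepoints.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == word
```

The five codepoints in this word are drawn as three glyphs, because the vowel sign and the final dot mark each combine with the consonant before them.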
In light of this, I checked what the default encodings were in the browsers I was using. Sure enough, my browsers were still using ISO-8859-1/Latin-1 as the default encoding for some reason (I never changed it from the defaults?). That encoding has no Thamil characters at all, so the browser was reading each byte of the UTF-8 Thamil text as a separate Latin-1 character and displaying a jumble.
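The garbling can be reproduced outside the browser. The following Python sketch is an approximation of what a browser stuck on Latin-1 does with UTF-8 bytes, not the browser’s actual code path; the variable names are mine.

```python
thamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # "Thamil" in Thamil script
raw = thamil.encode("utf-8")               # the bytes saved by the editor

# Reading the UTF-8 bytes as Latin-1 turns every byte into its own character.
garbled = raw.decode("latin-1")
print(repr(garbled))

# Reading the very same bytes as UTF-8 recovers the original text.
print(raw.decode("utf-8") == thamil)

# Going the other way fails outright: Latin-1 has no Thamil characters.
try:
    thamil.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

Nothing is wrong with the file in this scenario; only the decoder’s choice of encoding differs.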
For good measure, I inserted the line
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
inside the HTML head element to give the browser a hint of what encoding to use in case it was set to “Auto-detect”.
Update (12/1/11): It turns out that the Apache web server can (and does) by default declare the encoding of the files it serves as extended ASCII / Latin-1, i.e. ISO-8859-1, in the HTTP Content-Type header. When this conflicts with the encoding declared by the meta tag, browsers will generally use the web server’s reported encoding and not the one in the meta tag. This can be fixed by adding a .htaccess file that overrides the default Apache encoding setting with UTF-8 (for example, with the AddDefaultCharset directive).
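The precedence described in this update can be sketched in Python. This is a deliberately simplified model, not the browsers’ real detection algorithm (which also looks at byte order marks and other signals). The function names and the windows-1252 fallback are assumptions made for the example.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, None if absent


def meta_charset(html_bytes):
    """Look for a charset declaration near the start of the page."""
    match = re.search(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes, fallback="windows-1252"):
    """Simplified precedence: HTTP header first, then meta tag, then fallback."""
    return header_charset(content_type) or meta_charset(html_bytes) or fallback


page = b'<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>'

# Server declares ISO-8859-1: the meta tag is overruled.
print(effective_charset("text/html; charset=ISO-8859-1", page))

# Server declares no charset: the meta tag gets its say.
print(effective_charset("text/html", page))
```

In this model, the meta tag only matters when the server says nothing about the charset, which is why fixing the server-side setting is the more reliable fix.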