Character sets, or, all change for the new standard.


Ever wondered why, as you browse from web site to web site, the way quotation marks appear varies apparently at random? Well, as with so many things, it's the result of the international standards process, which has produced, in the form of ISO-10646 (``Unicode''), a standard character set which is not compatible with ASCII, which has been in use for thirty years. Naturally, there are web pages which represent this as a Good Thing. This is not one of them.

| What you want | What you write in ASCII | What the ASCII is now supposed to mean | Unicode, the emperor's new clothes | ... and in HTML | How browsers actually display this |
| --- | --- | --- | --- | --- | --- |
| open single quote | Open single quote. | A grave accent, despite the fact that this is obviously useless unless combined with another character. | U+2018 | &#8216; or &lsquo; | Who can tell? It probably depends on which screwed-up font you're using today. Netscape doesn't understand the named entity and will display it as a literal. |
| close single quote | Close single quote. | A symmetric apostrophe. | U+2019 | &#8217; or &rsquo; | Probably the same as the other quotation mark, but who can tell? |
| open double quote | Two open single quotes, ``. | Two grave accents in a row. Huh? | U+201C | &#8220; or &ldquo; | Probably displays as a symmetric quotation mark; you might as well use the ASCII ". Naturally Netscape doesn't understand the named entity (are you spotting a pattern yet?). |
| close double quote | Two close single quotes, ''. | Two apostrophes in a row. | U+201D | &#8221; or &rdquo; | As for the left-hand one. |
| em dash | Two hyphens in a row, --. | Two hyphens in a row. (To be fair, this is what it's always meant, but it's an old convention.) | U+2014 | &#8212; or &mdash; | Here we are, a genuine advance: a widely used character in Unicode which can't be expressed very well in ASCII. And -- guess what? -- the browsers don't get it right. Netscape displays the character code as a `?' and the named entity as a literal. |

When I discovered this nonsense I wrote a bit of code which would rewrite the ASCII equivalents described above -- which, obviously, I still use when writing stuff, just as you do in TeX -- into their `correct' Unicode equivalents when the pages are delivered to the clients. This was, as I should have known before I started, a complete waste of time, because the browsers don't get it right. That's hardly a surprise.
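For illustration, a rewrite of that sort might look something like the following Python sketch. It is not the code referred to above; the function names are invented here, and it assumes it is handed runs of plain text rather than HTML markup (it would happily mangle quoted attribute values inside tags).

```python
# A minimal sketch, not the original code: map the TeX-style ASCII
# conventions onto the Unicode characters listed in the table above.

# Order matters: the two-character sequences must be replaced before
# the single characters they are built from.
_REPLACEMENTS = [
    ("``", "\u201c"),  # open double quote
    ("''", "\u201d"),  # close double quote
    ("--", "\u2014"),  # em dash
    ("`", "\u2018"),   # open single quote
    ("'", "\u2019"),   # close single quote (also catches apostrophes)
]


def ascii_to_unicode(text):
    """Rewrite ASCII quote and dash conventions as Unicode characters."""
    for ascii_seq, unicode_char in _REPLACEMENTS:
        text = text.replace(ascii_seq, unicode_char)
    return text


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal HTML character code."""
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in text)


if __name__ == "__main__":
    sample = "``Well,'' she said -- `more or less'."
    print(to_numeric_entities(ascii_to_unicode(sample)))
```

Emitting numeric character codes keeps the delivered page itself pure ASCII, whatever the browser then makes of them.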

The problem with sticking to the ASCII conventions is, of course, that users of Microsoft Windows have nasty almost-Unicode fonts, which display the ASCII open quote character as a grave accent and the ASCII close quote character as a symmetric apostrophe. I don't know whether it's possible to fix this. (Note that if you use the more recent version of XFree86, it installs fonts which look like the Microsoft Windows ones. It is possible to replace these with the ones from the previous version.)

Obviously the Unicode people aren't going to change their minds about this one, so we're stuck with the interpretation they've selected. I don't know what the correct way to handle this is; clearly if you use the `new' way of doing things you will look like an illiterate moron to users with older browsers ? because your text will look ?like this? ? with a ? every time you want to put in an ?unusual character? ? like a quotation mark. Doing things the old way is probably not supportable either, though it has the nice property that all the characters are things which you can enter with your keyboard. Arguably I could program my web server to deliver slightly different content to different clients, but that's completely bogus.
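Bogus or not, such a scheme might be sketched as follows in Python. Everything here is an assumption for the sake of illustration: that the page is held in its Unicode form, that the client's Accept-Charset request header is a usable hint, and the function names themselves. The header says nothing about which fonts the client has, so it cannot actually guarantee that anything displays correctly.

```python
# A hedged sketch: fall back to the ASCII conventions unless the client
# claims, via Accept-Charset, to accept UTF-8.

# Translation table from Unicode punctuation back to the ASCII conventions.
_ASCII_FALLBACK = {
    0x2018: "`",
    0x2019: "'",
    0x201C: "``",
    0x201D: "''",
    0x2014: "--",
}


def client_accepts_utf8(accept_charset):
    """Return True if the header value lists utf-8 or * with non-zero q."""
    if not accept_charset:
        return False
    for item in accept_charset.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        params = params.strip()
        quality = 1.0
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if name in ("utf-8", "*") and quality > 0:
            return True
    return False


def render_for_client(unicode_text, accept_charset):
    """Return the text unchanged, or downgraded to ASCII conventions."""
    if client_accepts_utf8(accept_charset):
        return unicode_text
    return unicode_text.translate(_ASCII_FALLBACK)
```

A client sending `Accept-Charset: iso-8859-1` would get the ASCII form, and one listing `utf-8` would get the text untouched; whether either then looks right on screen is another matter.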

Why the Unicode people couldn't have left things alone and made their character set ASCII-compatible I do not know. It's sad that something so simple is so broken.

Copyright (c) 2002 Chris Lightfoot. All rights reserved.