You Can Quote Me


by Puddintane

If you’ve poked around the Web for any length of time, I’m sure you’ve seen pages filled with weird ‘letters’ that look something like this:

á¢â‚¬Å“Fred, Iá¢â‚¬â„¢d like to talk to you,á¢â‚¬ ¿ said Erin. á¢â‚¬Å“Those ferschlugginer á¢â‚¬ËœCode Pagesá¢â‚¬â„¢ I hear about are one of those maddening kludges left over from the dark ages of á¢â‚¬ËœData Processingá¢â‚¬â„¢ that we all sometimes struggle with when publishing on the Web. I doná¢â‚¬â„¢t understand what the heck the guys who invented them thought they were doing. What were they, á¢â‚¬ËœMoronsá¢â‚¬â„¢?á¢â‚¬ ¿

The odd character groupings, of course, are what they call ‘curly quotes.’
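For the curious, gibberish of this sort usually appears when text saved as UTF-8 is read back as though it were in one of the old single-byte code pages. Here is a minimal sketch in Python 3, assuming the misreading code page is Windows-1252 (my assumption for illustration; the sample above looks as though it was mangled more than once, so it won't match exactly). The function names `garble` and `repair` are hypothetical.

```python
def garble(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


# An opening double curly quote (U+201C) and a curly apostrophe (U+2019).
original = "I don\u2019t know, \u201cFred"
damaged = garble(original)

print(damaged)          # each curly quote turns into three odd characters
print(repair(damaged))  # the original text comes back
```

The closing double quote is left out on purpose: one of its three UTF-8 bytes has no meaning in Windows-1252, so Python's strict decoder refuses it rather than guessing.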

Actually, once upon a time, there was a very compelling reason for ‘code pages.’ In the glorious days of computing yesteryear, there were arcane clanking engines of data input called ‘Keypunch Machines,’ and yes, Dear, Granny once worked with an IBM 026 Keypunch Machine on a daily basis. That particular model was capable of printing only fifty-six encoded ‘characters,’ so the designers had to choose very carefully from amongst all the possible characters there are in every language. In English, for example, there are twenty-six ‘letters,’ and they come in both upper and lowercase versions, making a total of fifty-two actual ‘characters.’ You can easily deduce from this that very few encodings are left over for numerals, much less punctuation marks and anything else that the heart might desire. Other languages might have either more or fewer ‘letters’ that required encoding. The ‘solution’ to this problem was predictable: the character set was reduced by allowing only uppercase letters to be encoded, leaving a bit of room for Arabic numerals and punctuation marks. The tiny bits of paper punched out of the ‘cards’ were known as ‘chad,’ and in IBM shops the container built into a keypunch machine to catch the chad was often called a ‘bit bucket,’ if you’ve ever wondered where the term came from.

Mind you, these general sorts of limitations weren’t, and aren’t, limited to IBM keypunch machines. An ordinary ‘typewriter’ keyboard (similar to the one in front of you, if you’re reading this on a computer screen) is similarly limited in the total number of actual keys, with different ‘cases’ and characters accessed by various combinations of ‘shift keys.’ A manual typewriter keyboard of that era had forty-four letter keys or fewer, so fifty-two separate codes probably seemed like lots of space to play around in at the time. Alternate forms of individual letters and numbers could be accessed with ‘control codes’ such as those used by early teletype machines (and yes, Virginia, Granny has spent many years working with those as well), so the early ‘computer codes’ such as ASCII and IBM’s EBCDIC had special codes for ‘shift in’ and ‘shift out.’ These were required by early computer interfaces, which often, in later years, used teletype machines as input devices in addition to, or instead of, keypunch machines. The typewriters of that distant era were purely manual machines which produced ‘output’ in the form of pieces of paper with ink on them, which could only be read by human beings at the time, whilst teletypes were designed for ‘electronic’ communication over wires to begin with, a ‘high-tech’ version of the telegraph. Indeed, the very early electronic ‘computer terminals’ were often known as ‘glass teletypes’ by the cognoscenti, because we remembered our own history.

The operating system BC runs under is almost certainly some version of Berkeley Unix under some descendant of the ‘C shell,’ an interface to the Unix (or derivative) operating system developed by Bill Joy on a Lear Siegler ADM-3, a ‘glass teletype.’ Granny knows this because she used to share the terminal room at Cory Hall with him in the interminable (if you’ll pardon the pun) late-night sessions a UC Berkeley degree required, although of course she was a lowly undergraduate at the time.

Well, since character sets were small, and the number of characters required for individual projects was large, the concept of ‘code pages’ was developed: a method whereby almost any given ‘basket’ of available characters could be provided, including accented characters for languages which used them, Greek letters for mathematical functions or scholarship, and many other things. In practice this turned into an incredible mess of ‘proprietary’ encodings, with multiple vendors developing essentially the same sets of characters, but using their own wacky ideas of how to arrange them, on the same general principle as modern razor ‘cartridges,’ which I’m sure you’ve noticed only fit the handles they were designed for, so one is forced to purchase replacement cartridges from the same company which sold one the handle at a discount price.

Microsoft (thank you, Bill Gates) was particularly profligate in producing proprietary variants of non-ASCII code pages, so most of the problems one encounters are down to Microsoft’s use of proprietary ‘in-house’ encodings in preference to international standards such as Unicode.

In fact, their latest versions of the Windows/Vista operating system still use the old proprietary Microsoft encodings by default, last I heard, although it’s possible to change that behaviour through system preferences.

Big Closet uses Unicode encoding by default, in a format known as ‘UTF-8,’ which uses variable-width character encodings to allow users to input and read the character sets used by almost every language in the world, including Chinese, Japanese, Arabic, Russian, and countless more. Most modern operating systems come pre-installed with a fall-back Unicode font which allows the system to create and display almost any character one desires or encounters, as long as it’s truly Unicode-compliant.
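To see what ‘variable-width’ means in practice, here is a small Python 3 sketch that counts how many bytes UTF-8 spends on a handful of characters. The sample characters and the helper name `utf8_length` are my own choices for illustration, not anything Big Closet itself uses.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


# Latin A, e-acute, left curly quote, a CJK ideograph, and an emoji,
# written as escapes so the file itself stays pure ASCII.
samples = ["A", "\u00e9", "\u201c", "\u4e2d", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")
```

Plain ASCII letters stay at one byte apiece, which is why old English-only text is already valid UTF-8, while curly quotes need three bytes each.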

Unfortunately, in the case of Windows and its descendants, it’s probably not, as mentioned above. One has to force Windows and Vista to behave correctly, which one can do either by setting a system preference or by communicating the fact that one is using Unicode in the page header on every webpage, usually like this:

<?xml version="1.0" encoding="utf-8"?>

or like this:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

which arcane formula tells the browser what to expect.

The sure way to avoid this problem is to insert all the ‘extended characters’ using HTML ‘escape sequences,’ which one can easily find on the web.

&#8220;Double curly quotes&#8221; “Included text”

&#8216;Single curly quotes&#8217; ‘Included text’
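If typing those numbers by hand sounds tedious, a few lines of Python 3 can produce them. This is only a sketch: the function name `to_numeric_escapes` is hypothetical, and it relies on the standard `xmlcharrefreplace` error handler, which turns every non-ASCII character into a decimal escape of the kind shown above.

```python
import html


def to_numeric_escapes(text: str) -> str:
    """Replace every non-ASCII character with an &#NNNN; escape sequence."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


escaped = to_numeric_escapes("\u201cIncluded text\u201d")
print(escaped)                 # pure ASCII, safe under any code page
print(html.unescape(escaped))  # what a browser would display
```

Note that this leaves ordinary ASCII characters such as `&` and `<` alone, so it is not a substitute for proper HTML escaping of those.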

Here’s a comprehensive list, handily arranged into more or less coherent bundles, so one can avoid looking at characters one might not be interested in, like the Khmer script used in Cambodia. Note too that the less common Asiatic languages tend to be poorly supported in most so-called ‘complete’ Unicode fonts supplied by manufacturers, unless one has installed ‘multilingual support’ or a similar option. You can look at the supported fonts in the notes at the top of each ‘glop’ of characters, what the Unicode people call a ‘range.’

Alan Wood's Unicode Resources

Look at the Characters section first, probably. You'll find the most common (and quite a few uncommon) characters there. Note that Unicode is not a panacea, though. Some languages aren't supported, although there is a registry of "Private Use Area Ranges" that covers things like the Elvish Tengwar, or Klingon whatever they are. The Unicode authorities demand a ‘corpus’ of text and signs of actual use before they make an official ‘range,’ so start writing Elvish, or Klingonese, today!

Note too that there are many resources available that allow you to inspect the actual character sets installed on your own machine. I use one called Popchar from Ergonis Software that seems both useful and stable for my own platform (Mac OS), but there are several other options, some better and more stable than others.

MS Windows/Vista systems have the Character Map utility included in every release, which is handy enough to be usable, if not quite the delight to use that Popchar is, for example.

The ‘escape sequences’ almost always work correctly, because they use pure ASCII characters to indicate the actual character one wants in an unambiguous way. To make life less tedious than remembering and typing in the arbitrary numbers, HTML defines many of the more common characters as more or less mnemonic ‘named entities’ which you can see here: HTML Named Entities
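Python’s standard library happens to carry a table of the classic HTML named entities, so one can look up the mnemonic for a character without consulting a chart. A minimal sketch, assuming Python 3; the helper `to_named_entity` is hypothetical and falls back to a numeric escape when no name exists in that table.

```python
import html
from html.entities import codepoint2name


def to_named_entity(ch: str) -> str:
    """Return the named entity for one character, or a numeric escape."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"


for ch in ["\u201c", "\u201d", "\u2018", "\u2019", "\u4e2d"]:
    print(f"U+{ord(ch):04X} -> {to_named_entity(ch)}")

# Going the other way, html.unescape understands both kinds of escape.
print(html.unescape("&ldquo;Included text&rdquo;"))
```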

Alternatively, you can create your own macros to do whatever you want by using a utility like Typinator (for the Mac, Windows, and other platforms), which allows you to enter arbitrary strings with simple mnemonic commands. This is very handy when (for example) you make a pretty title header for a story and wish to be able to enter it in exactly the same way each and every time without remembering which file you have the template stored in.
