You Can Quote Me



by Puddintane

If you’ve poked around the Web for any length of time, I’m sure you’ve seen pages filled with weird ‘letters’ that look something like this:

á¢â‚¬Å“Fred, Iá¢â‚¬â„¢d like to talk to you,á¢â‚¬ ¿ said Erin. á¢â‚¬Å“Those ferschlugginer á¢â‚¬ËœCode Pagesá¢â‚¬â„¢ I hear about are one of those maddening kludges left over from the dark ages of á¢â‚¬ËœData Processingá¢â‚¬â„¢ that we all sometimes struggle with when publishing on the Web. I doná¢â‚¬â„¢t understand what the heck the guys who invented them thought they were doing. What were they, á¢â‚¬ËœMoronsá¢â‚¬â„¢?á¢â‚¬ ¿

The odd character groupings, of course, are what they call ‘curly quotes.’
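If you're curious how that gibberish gets manufactured, here's a small sketch (in Python, purely for illustration) that mangles a clean sentence by saving it as UTF-8 and then misreading the bytes as Microsoft's code page 1252:

```python
# A sketch of how 'curly quote' mojibake arises: text saved as UTF-8
# but decoded with Windows code page 1252 (cp1252), using nothing but
# Python's standard codecs.
text = "\u201cFred, I\u2019d like to talk to you,\u201d said Erin."

# Decode the UTF-8 bytes *incorrectly* as cp1252. Byte 0x9D (part of
# the closing quote) has no cp1252 mapping, which is why real-world
# garbage often sprouts stray junk right at quote boundaries.
garbled = text.encode("utf-8").decode("cp1252", errors="replace")
print(garbled)

# Run the damage through a second round trip and the debris multiplies,
# producing strings very like the sample quoted above:
double = garbled.encode("utf-8").decode("cp1252", errors="replace")
print(double)
```

One round trip turns each opening quote into the familiar three-character 'â€œ' debris; the second round trip roughly doubles the mess, which is why badly mangled pages look so spectacularly wrong.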

Actually, once upon a time, there was a very compelling reason for ‘code pages.’ In the glorious days of computing yesteryear, there were arcane clanking engines of data input called ‘Keypunch Machines,’ and yes, Dear, Granny once worked with an IBM 026 Keypunch Machine on a daily basis. That particular model was capable of printing only fifty-six encoded ‘characters,’ so the designers had to choose very carefully from amongst all the possible characters there are in every language. In English, for example, there are twenty-six ‘letters,’ and they come in both upper and lowercase versions, making a total of fifty-two actual ‘characters.’ You can easily deduce that this leaves very few encodings left over for numerals, much less punctuation marks and anything else the heart might desire. Other languages might have either more or fewer ‘letters’ that required encoding. The ‘solution’ to this problem was predictable: the character set was reduced by allowing only uppercase letters to be encoded, leaving a bit of room for Arabic numerals and punctuation marks. The tiny bits of paper punched out of the ‘cards’ were known as ‘chad,’ and in IBM shops the container built into a keypunch machine to catch the chad was often called a ‘bit bucket,’ if you’ve ever wondered where that term came from.

Mind you, these general sorts of limitations weren’t, and aren’t, limited to IBM keypunch machines. An ordinary ‘typewriter’ keyboard — similar to the one in front of you, if you’re reading this on a computer screen — is similarly limited in the total number of actual keys, with different ‘cases’ and characters accessed by various combinations of ‘shift keys.’ A manual typewriter keyboard of that era had forty-four letter keys or fewer, so fifty-two separate codes probably seemed like lots of space to play around in at the time. Alternate forms of individual letters and numbers could be accessed with ‘control codes’ such as those used by early teletype machines — and yes, Virginia, Granny has spent many years working with those as well — so the early ‘computer codes’ such as ASCII and IBM’s EBCDIC had special codes for ‘shift in’ and ‘shift out,’ because they were required by early computer interfaces. In later years, those interfaces often used teletype machines as input devices in addition to — or instead of — keypunch machines, since the typewriters of that distant era were purely manual machines which produced ‘output’ in the form of pieces of paper with ink on them, which could only be read by human beings, whilst teletypes were designed for ‘electronic’ communication over wires to begin with, a ‘high-tech’ version of the telegraph. Indeed, the very early electronic ‘computer terminals’ were often known as ‘glass teletypes’ by the cognoscenti, because we remembered our own history.
The operating system BC runs under is almost certainly some version of Berkeley Unix, under some descendant of the ‘C shell,’ an interface to the Unix (or derivative) operating system developed by Bill Joy on a Lear Siegler ADM-3, a ‘glass teletype.’ Granny knows this because she used to share the terminal room at Cory Hall with him during the interminable (if you’ll pardon the pun) late-night sessions a UC Berkeley degree required, although of course she was a lowly undergraduate at the time.

Well, since character sets were small, and the range of characters required for individual projects was large, the concept of ‘code pages’ was developed: a method whereby almost any given ‘basket’ of available characters could be provided, including accented characters for languages which used them, Greek letters for mathematical functions or scholarship, and many other things. In practice this turned into an incredible mess of ‘proprietary’ encodings, with multiple vendors developing essentially the same sets of characters, but using their own wacky ideas of how to arrange them, on the same general principle as modern razor ‘cartridges,’ which I’m sure you’ve noticed only fit the handles they were designed for, so one is forced to purchase replacement cartridges from the same company which sold one the handle at a discount price.

Microsoft (thank you, Bill Gates) was particularly profligate in producing proprietary variants of non-ASCII code pages, so most of the problems one encounters are down to Microsoft’s use of proprietary ‘in-house’ encodings in preference to international standards such as Unicode.

In fact, their latest versions of the Windows/Vista operating system still use the old proprietary Microsoft encodings by default, last I heard, although it’s possible to change that behaviour through system preferences.

Big Closet uses Unicode encoding by default, in a format known as ‘UTF-8,’ which uses variable-width character encodings to allow users to input and read the character sets used by almost every language in the world, including Chinese, Japanese, Arabic, Russian, and countless more. Most modern operating systems come pre-installed with a fall-back Unicode font which allows the system to create and display almost any character one desires or encounters, as long as it’s truly Unicode-compliant.
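To see the ‘variable-width’ part in action, here's a quick Python sketch (illustrative only) counting the bytes UTF-8 spends on characters from different scripts:

```python
# UTF-8 spends between one and four bytes per character, depending on
# how far into the Unicode range the character sits.
samples = {
    "A": 1,    # plain ASCII
    "é": 2,    # accented Latin
    "Я": 2,    # Cyrillic
    "中": 3,   # Chinese
    "’": 3,    # the curly apostrophe itself
    "😀": 4,   # emoji, beyond the Basic Multilingual Plane
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):05X} {ch!r} -> {len(encoded)} byte(s)")
```

Plain English text costs exactly one byte per character, which is why UTF-8 won out over the fixed-width alternatives: old ASCII files are already valid UTF-8.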

Unfortunately, in the case of Windows and its descendants, it’s probably not, as mentioned above. One has to force Windows and Vista to behave correctly, which one can do either by setting a system preference or by communicating the fact that one is using Unicode in the page header on every webpage, usually like this:

<?xml version="1.0" encoding="utf-8"?>

or like this:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

which arcane formula tells the browser what to expect.

The sure way to avoid this problem is to insert all the ‘extended characters’ using HTML ‘escape sequences,’ which one can easily find on the web.

&#8220;Double curly quotes&#8221; “Included text”

&#8216;Single curly quotes&#8217; ‘Included text’
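If you'd like to convince yourself that those numeric sequences really do name the curly quotes, Python's standard html module can expand them (a sketch for illustration):

```python
import html

# Numeric character references name a character by its Unicode code
# point, using nothing but plain ASCII, so they survive any 8-bit
# encoding confusion intact.
assert html.unescape("&#8220;") == "\u201c"   # left double curly quote
assert html.unescape("&#8221;") == "\u201d"   # right double curly quote
assert html.unescape("&#8216;") == "\u2018"   # left single curly quote
assert html.unescape("&#8217;") == "\u2019"   # right single / apostrophe

# The hexadecimal form works too: 8217 decimal is 2019 hex.
assert html.unescape("&#x2019;") == html.unescape("&#8217;")
print("all escape sequences resolved correctly")
```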

Here's a comprehensive list, handily arranged into more or less coherent bundles, so one can avoid looking at characters one might not be interested in, like the Khmer script used in Cambodia. Note too that the less common Asiatic languages tend to be poorly supported in most so-called ‘complete’ Unicode fonts supplied by manufacturers, unless one has installed ‘multilingual support’ or a similar option. You can look at the supported fonts in the notes at the top of each ‘glop’ of characters, what the Unicode people call a ‘range.’

Alan Wood's Unicode Resources

Look at the Characters section first, probably. You'll find the most common (and quite a few uncommon) characters there. Note that Unicode is not a panacea, though. Some languages aren't supported, although there is a registry of "Private Use Area Ranges" that covers things like the Elvish Tengwar, or Klingon, whatever they are. The Unicode authorities demand a ‘corpus’ of text and signs of actual use before they make an official ‘range,’ so start writing Elvish, or Klingonese, today!
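For the curious, Python's unicodedata module can confirm that the Private Use Area really is set aside and never officially assigned (a quick illustrative check):

```python
import unicodedata

# The Private Use Areas are ranges Unicode promises never to assign,
# so schemes like the Tengwar and Klingon registrations can park there
# without colliding with official characters.
pua_char = "\ue000"   # first code point of the BMP Private Use Area
assert unicodedata.category(pua_char) == "Co"   # 'Co' = Other, private use

# By contrast, an officially assigned character has a formal name:
assert unicodedata.name("\u2019") == "RIGHT SINGLE QUOTATION MARK"
print("U+E000 is private use; U+2019 is official")
```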

Note too that there are many resources available that allow you to inspect the actual character sets installed on your own machine. I use one called Popchar from Ergonis Software that seems both useful and stable for my own platform (Mac OS), but there are several other options, some better and more stable than others.

MS Windows/Vista systems have the Character Map utility included in every release, which is handy enough to be usable, if not quite the delight to use that Popchar is, for example.

The ‘escape sequences’ almost always work correctly, because they use pure ASCII characters to indicate the actual character one wants in an unambiguous way. To make life less tedious than remembering and typing in the arbitrary numbers, HTML defines many of the more common characters as more or less mnemonic ‘named entities,’ which you can see here: HTML Named Entities
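A quick Python check (illustrative only) that the named entities expand to exactly the same characters as their numeric cousins:

```python
import html

# Named entities are mnemonic ASCII stand-ins for the same characters
# the numeric references reach: easier to remember than the numbers.
pairs = {
    "&ldquo;": "&#8220;",    # left double curly quote
    "&rdquo;": "&#8221;",    # right double curly quote
    "&lsquo;": "&#8216;",    # left single curly quote
    "&rsquo;": "&#8217;",    # right single curly quote / apostrophe
    "&mdash;": "&#8212;",    # em dash
    "&hellip;": "&#8230;",   # horizontal ellipsis
}
for named, numeric in pairs.items():
    assert html.unescape(named) == html.unescape(numeric)
    print(f"{named:10s} == {numeric}")
```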

Alternatively, you can create your own macros to do whatever you want by using a utility like Typinator (for the Mac, Windows, and other platforms), which allows one to enter arbitrary strings with simple mnemonic commands. This is very handy when one (for example) makes a pretty title header for a story and wishes to be able to enter it in exactly the same way each and every time, without remembering which file the template is stored in.

Comments

I'd forgotten most of that stuff

My first experience with a computer was designing systems on a CDC-3800. Data was entered using 80-column IBM punch cards. Once fed in, the computer generated machine code based on "octals" (numbers from 000-777).

Those were the days, my friend.
We thought they'd never end.

LOL

Loved the history lesson.

Hugs,
Erica

Am I ever glad that

that antiquated system has been superseded by modern HTML

    Stanman
May Your Light Forever Shine

My first experience ...

... with using a remote computer was via a teletype and an audio coupler (a device in which you placed a standard telephone handset so the audio code could be sent and interpreted by the remote machine). I was using it for electronic circuit analysis and it really wasn't all that effective, quite apart from being very, very slow. However, it was a step up from a deck of 80-column Hollerith cards and an overnight wait for the results (that was if you'd not made a mistake in punching the cards). The data were usually encoded with ASCII characters.

I know the machines I first got involved with back in 1961 didn't have VDUs or keyboards (unless you counted the array of switches and push buttons). I used to test them in the factory before they were delivered. One of my colleagues was a musician and programmed one to play Mozart's Clarinet Concerto, which was pretty cool.

I just can't think how powerful personal computers will be in the next 50 years. Unfortunately, I won't be around to know.

Robi

For those that don't know

For those that don't know what the audio coupler was, it was the original 'dial-up modem'. You would dial up (with a rotary dial, not touch-tone) a distant computer on a standard phone and listen for the weird burping tones that indicated you got through, then push the phone handset into a cradle on your modem. Instead of sending/receiving electronic signals, it actually had a speaker/mic set-up to send/receive sound to a standard phone, which it modulated/demodulated (where mo/dem came from) into code. These technological wonders ran at 300 baud: not 1M baud, not 100k baud, not even the 56k baud of modern 'dial-up'. This speed is similar to the speed of a moderately fast typist, and you could actually watch the cursor move across the screen as it received text from the computer.

Some notes

Those funny characters in your first quote are because the browser didn't recognize it as UTF-8. There are a number of reasons why the browser might not recognize it, most of which come down to either misconfiguration of one variety or another on either end, or the browser's attempt to auto-detect because of rampant misconfiguration.

The various code pages are for non-Latin alphabets. There are code pages for Cyrillic (Russian), Hebrew, Greek and quite a few others. For some reason known only to the gods of the standards process, the standards committees left those curly quotes off of the various code pages, which is the entire reason for Microsoft's non-standard code pages.
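You can verify the omission directly. In this Python sketch (for illustration), the standard ISO 8859-1 code page simply refuses a curly quote, while Microsoft's cp1252 variant tucks it into the otherwise-reserved 80-9F range:

```python
# The standard ISO 8859-1 ('Latin-1') code page has no slot for curly
# quotes, while Microsoft's cp1252 variant parks them in 0x80-0x9F.
try:
    "\u2019".encode("iso-8859-1")
    print("unexpected: Latin-1 accepted a curly apostrophe")
except UnicodeEncodeError:
    print("ISO 8859-1 cannot represent a curly apostrophe")

# cp1252 can: the right single quote lives at byte 0x92.
assert "\u2019".encode("cp1252") == b"\x92"
print("cp1252 encodes it as byte 0x92")
```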

Code pages preceded Unicode by well over a decade. While Microsoft presents a handy target, they're not responsible - I first encountered code pages while working on IBM mainframes long before Bill Gates got the contract for the operating system for the first IBM PC. I don't know whether IBM invented the concept or not, but they had to have it in order to sell computers to people who didn't want to use English, strange as that might seem.

Unicode was an attempt, mostly successful, to unify the various languages. The UTF-8 encoding was contributed by Plan 9, and came in towards the end of the process, which is why curly quotes are three bytes instead of the more logical two.
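A quick Python check of that byte count (illustrative only):

```python
# The curly quotes sit in Unicode's General Punctuation block
# (U+2000-U+206F), which is above U+07FF, the highest code point
# UTF-8 can fit into two bytes, so each one costs three bytes.
for ch in "\u2018\u2019\u201c\u201d":
    encoded = ch.encode("utf-8")
    assert len(encoded) == 3
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```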

The meta tag only tells the browser what to expect if the page is accessed from disk, or if the server stays silent. (It needs to be right up front to actually work, though.) For reasons known only to the internet standards bodies, when the server does declare a character set encoding via an HTTP header, the browser is supposed to use that and completely ignore whatever you put in the HTML as a hint.

The only way I know of absolutely, 100% guaranteeing that your page will render the way you want, regardless of how the server is (mis)configured or the vagaries of different browsers, is to stick to the 96 characters of plain old boring ASCII and use character entities (those funny "&#8217;" or "&#x2652;" sequences) for everything else. That is, of course, easier said than done.

Xaltatun

>> For some reason known only to the gods of the standards

Puddintane's picture

Actually, there's a very coherent reason: Many languages use entirely different ‘quotation marks,’ such as « Français », „Deutsch“, and many more. In fact, quotation marks are amongst the least stable characters there are. The HTML <Q> tag is supposed to expand into properly balanced quotation marks in whatever language is defined in the header section, but it often goes quite wrong in practice*, because browser designers mostly like to ignore the difficult bits, and properly displaying quotes is very difficult. The conventions used in many languages are fairly arcane, have exceptions, and some writers love to disobey them. James Joyce, for example, loved one (of several) French quotation schemes, and drove his typesetters crazy thereby.

-------

* The word wrong in the preceding sentence is supposed to be quoted, but probably isn't.

-

Cheers,

Puddin'

A tender heart is an asset to an editor: it helps us be ruthless in a tactful way.
--- The Chicago Manual of Style

Actually, I do know

Actually, I do know the reasons where I said "known only to the standards committees." I was trying to be a bit humorous and keep the post to a reasonable length by avoiding details I suspected most of the readers wouldn't care about.

The real reason the standards committee left the punctuation off is that they decided not to use the 80-9F range, and hence didn't have room for all the accented characters and the punctuation as well. The reason they didn't use the 80-9F range was to make the page 7-bit compatible for telecommunications systems that only handled 7 bits - which were still in use when those standards were first created. If they'd used that range they'd have risked sending unintended control characters down the wire with not very pleasant results for the networks.

Microsoft didn't worry about that nicety, so they simply used the 80-9F area for punctuation when they created their variants. Today it's a dead issue - there are so few (if any) systems left that can only handle 7 bits that nobody worries about it.
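To see why that range was dangerous, here's a Python sketch (illustrative only) of what a 7-bit channel does to one of Microsoft's curly quotes: stripping the high bit turns it into a live flow-control character.

```python
# Why the standards bodies avoided 0x80-0x9F: a 7-bit link simply
# drops the high bit, turning those bytes into control codes.
smart_quote = "\u201c".encode("cp1252")[0]   # left double quote, 0x93 in cp1252
stripped = smart_quote & 0x7F                # what a 7-bit channel delivers
assert stripped == 0x13                      # DC3, better known as XOFF

# XOFF tells many serial devices to stop sending: a quotation mark
# that silently pauses the whole connection.
print(f"0x{smart_quote:02X} stripped to 0x{stripped:02X} (DC3/XOFF)")
```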

Xaltatun

Not quite so fast!

http://en.wikipedia.org/wiki/UTF-7

Yes, Unicode even defines a transformation format for 7 bit systems...
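Python's standard codecs still include it, so here's an illustrative round trip showing that UTF-7 output stays entirely within 7-bit ASCII:

```python
# UTF-7 wraps non-ASCII characters in modified-Base64 runs delimited
# by '+' and '-', so the whole stream stays within 7-bit ASCII.
original = "\u201cHello\u201d"              # curly-quoted text
encoded = original.encode("utf-7")
print(encoded)

# Every byte is 7-bit safe, so nothing gets mangled on an old link:
assert all(b < 0x80 for b in encoded)

# And the transform is lossless:
assert encoded.decode("utf-7") == original
```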

And you better not be one of those "everybodies" who ignore that format if you're writing an IRC client. You're welcome to use UTF-8 internally, but as soon as you start transmitting to another client you'd better put it into a UTF-7 transform first!

And if you're writing a bot or a server it's UTF-7 ALL THE TIME.

And IRC isn't the only one...

So no... there are still people who concern themselves over a 7 bit constraint. And they're all in areas where text REALLY matters. Other people can't easily fix a format screw up in the fields where this is still a constraint.

Yes, the fields where this issue exists are becoming rapidly fewer, but to say "nobody" cares anymore is very naive.

Oh, and the reason the constraint exists ISN'T because of the systems used as client systems, it's because of the transmission protocols that have been in use from the times that the 7 bit constraint was an even bigger deal than it is today. These protocols STILL aren't upgraded, and so anything that needs to be transmitted over such interfaces must STILL remain constrained. It is slowly changing, but this involves a lot of infrastructure and isn't going to be cheap or easy to change. And definitely not fast.

Abigail Drew.

Ancient memories... thank you Puddin'

It has been... a long time since I used key-punch and tele-typewriters (60 wpm & 100 wpm).

Back in... okay, I'll say it... Feb of 1968 I designed, etched, and constructed a fully electronic encoder and decoder for the Baudot codes we used. The encoder was a piece of cake; the decoder took a bit of thinking, but at least they both worked, and the encoder could be set by a switch to either of the standard speeds (above). Used the 20 mA standard, not the 60 mA one.

Those circuit board etchings were accomplished in one of the latrines at the communication school at Ft. Monmouth, NJ. The two circuit devices were handed off to the research facility there shortly before I went TDY to an AFB for advanced carrier repair training & where I aided the instructors rather than sleep through the course.

Hell, I had my commercial ticket, my amateur ticket, two engineering degrees (finished when I was 18), and generally liked electronic equipment more than people at that time. I suppose I was a sort of a prehistoric nerd but before today's sort of computers came along. God how I wanted to have something like the laptop I'm using to write this but back in 1967. Would have had to keep it well hidden so I wouldn't be burned at the stake for witchcraft though.

At any rate... THANK YOU for the glimpse back into history. It brought back a lot of memories for me.

A.

hmmm. laptop computer and solar cell recharging station in 1600's...... hmmm again. would need two or three spare hard drives with op-sys already on them, and maybe six or seven spare memory cards for when the originals failed (a guaranteed happening). Need to think about that story line a bit.