ASCII and the scourge of lexical sorting.

Author: 

I suppose that there has always been a past. That everything we live with has a history and that we all have to deal with the consequences and the compromises that that past brought. There's an old aphorism that I learned from one of my more memorable mentors: "Yesterday's solution is tomorrow's problem."

If you are as old as I am or if you have read some history then you might recall that in the 1900s New York City had a terrible pollution problem. Every horse dropped 50 pounds of "apples" every day. And there were hundreds of thousands of horses in New York. There were articles in the paper talking about how the city could not survive the onslaught of manure, much less the stench. A few years later the problem was solved. The automobile solved New York's pollution problem. Fast forward to today. Yesterday's solution becomes tomorrow's problem.

ASCII is the American Standard Code for Information Interchange. It's some geeky detail down in the bowels of computer organization and architecture. It is the historical way that computers store letters, numbers, punctuation, white space and all those other little pictures on your screen. It's the American Standard because of squatting rights and despotic hegemony and industrial sabotage. And all those other historical adventures that brought us here. Pretty much, it was once the most common way that computers knew how to deal with textual information.

It included lots of useful things. One was that it preserved compatibility with the standard that came before it. It also provided a simple way to sort American English words. And it was everywhere that there was a computer. So it was what was used.

ASCII has a pretty serious problem when it comes to sorting though. Specifically when sorting numbers. With letters it is not unreasonable for "aa" to come before "ab" and for "gh" to come before "gi". This is not usually true for numbers. With single digits '1' comes before '2' and '7' comes before '8'. But ordering numbers bigger than '9' is different. ASCII works great for ordering multi-letter words: 'Abbie' comes before 'Annie' and ASCII "does the right thing". The same is not true of multi-digit numbers. As numbers, 2 comes before 10. But an ASCII sort compares one character at a time, and the character '1' comes before the character '2', so the order we get is '1', '10', '2'. That's confusing. That's a problem.
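
For the curious, here is a small Python sketch of the effect. The variable names are made up for illustration; the only assumption is that the values are stored as text, which is what a title field would hold.

```python
# Numbers stored as text are compared one character at a time,
# so "10" lands before "2" because the character '1' sorts before '2'.
numbers_as_text = ["2", "10", "1"]

lexical_order = sorted(numbers_as_text)           # plain string sort
numeric_order = sorted(numbers_as_text, key=int)  # compare by numeric value

# Words behave the way we expect under the same rules.
names_in_order = sorted(["Annie", "Abbie"])

print(lexical_order)   # ['1', '10', '2']
print(numeric_order)   # ['1', '2', '10']
print(names_in_order)  # ['Abbie', 'Annie']
```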

Technology can come to our rescue again. Technology can provide another way to sort our titles so that "wonderful story part 10" does not show up between "wonderful story part 1" and "wonderful story part 2". But extra steps have to be taken to make that happen. We need our technologists to find a solution to this crisis. I know that they are up to it.
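
One such "other way" is often called a natural sort. What follows is only a sketch in Python, not how this site or any particular platform does it. The helper name natural_key is invented here. The idea is to split each title into text chunks and digit chunks, then compare the digit chunks as whole numbers.

```python
import re

def natural_key(title):
    """Build a sort key in which runs of digits compare as integers."""
    # Splitting on a captured group keeps the digit runs in the result.
    # Text chunks sit at even positions, digit chunks at odd positions.
    parts = re.split(r"(\d+)", title)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

titles = [
    "wonderful story part 10",
    "wonderful story part 2",
    "wonderful story part 1",
]

natural_order = sorted(titles, key=natural_key)
for title in natural_order:
    print(title)
```

The extra step is the key function: the titles themselves stay exactly as the author typed them, and only the comparison changes.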

---

Just to be doubly sure that we are all in on the joke: I'm writing the above with my tongue firmly planted in my cheek. Those who operate and maintain this website deserve all the highest praise as well as sufficient financial compensation. If ever there were people deserving of canonization it is them. I'd love to find some way to provide in-kind assistance.

Comments

Seeing typos

Why are typos so obvious after you post and where is the edit button?

Your friend
Crash

go to

your blog list, select the post, at the top you can choose to view, edit or delete, select edit - job's done!

Mads


Madeline Anafrid Bell

There is a reason for leading zeros

but some platforms seem to ignore them (hey MS, I'm looking at you) unless you are very specific in your formatting (Excel columns with text numbers)

The key to ASCII or any text-based sorting, and to avoiding the 1, 10, 2 problem, is to just use a numbering scheme that does not have variable-length values, and leading zeros don't count.
In your case, I'd start at 100. 101 will follow and 110 will follow 109.
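
A rough Python sketch of both fixed-width ideas, zero padding and starting the count at 100. The function names and the "part" prefix are invented for illustration, and the width of three digits is an assumption that only holds until the numbers outgrow it.

```python
def padded_label(n, width=3):
    """Chapter label with leading zeros, e.g. 'part 007'."""
    return f"part {n:0{width}d}"

def offset_label(n, start=100):
    """Chapter label that starts counting at 100 instead of 1."""
    return f"part {start + n - 1}"

chapters = [10, 2, 1, 11]

# Every label has the same length, so a plain text sort
# agrees with the numeric order.
padded_sorted = sorted(padded_label(n) for n in chapters)
offset_sorted = sorted(offset_label(n) for n in chapters)

print(padded_sorted)  # ['part 001', 'part 002', 'part 010', 'part 011']
print(offset_sorted)  # ['part 100', 'part 101', 'part 109', 'part 110']
```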

But... Yes there is one. I can name one platform where that won't work because of the endianness of the CPU and the inherent byte and word lengths used in the architecture. Luckily, most of us won't encounter that these days.

Samantha

Putting off problems

A coworker once blamed Eisenhower for saying: "Plans are useless, but planning is essential." I'm not sure that it was actually that Eisenhower that said it. Maybe it was someone else with the same name. My point is that if we have a plan then we have a shared context against which we can make changes.

My other point is that eventually you'll run out of zeros.

And then there is the "Star Wars problem". Release order or number order? Which is the best order to watch the movies? I'll always argue that the best order is release order. Even if we are talking about "Star Wars". Or even "Watch number 7 then stop". Which is probably the best way to maintain sanity; Hollywood has to squeeze things till there is no goodness left.
But if you only watch "A New Hope" then you'll miss "The Phantom Menace" and Trisha Biggar's costumes for Natalie Portman. We have to make our choices and we have to live with our consequences. And Jar Jar.

I guess I write all this just to say: It's complected. We're not computers. We're people who sometimes pretend to be computers pretending to be people. Or something like that.

I hope your day is going well too.

Peace, Love, Grace

Your friend
Crash

Standard site method

...which, of course, only works once you know about it!

Of course you have to have the same title for every part of your epic.

After that, the chapter numbers come, usually between some kind of delimiter characters. I tend to use -33-, other people have used =12= or even spaces will do.

That works right up until part 10. Then you have to go down to a lowdown field on the submission form, which is "Weighting". The general rule is: set the weight to be one less than the number of digits in your chapter number. It defaults to zero, so that works for small stories. For 10-99, set it to 1; for 100-999, set it to 2; if you are writing "Bike", set it to 3, and so on.

In reality this is just another sort key, so it makes sure that anything with a lower weight gets sorted before anything with a higher weight. It doesn't matter what numbers you use, just make sure that the more digits in the part number, the larger the weight.
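
To make the "just another sort key" idea concrete, here is a Python sketch. It is a guess at the behaviour described above, not the site's actual code: the helper weight_for and the (title, weight) tuples are invented for illustration, and the weights follow the examples given (0 for 1-9, 1 for 10-99, 2 for 100-999).

```python
def weight_for(chapter_number):
    """Weight per the examples above: 0 for 1-9, 1 for 10-99, 2 for 100-999."""
    return len(str(chapter_number)) - 1

# Each post is a (title, weight) pair, using -N- as the delimiter style.
posts = [(f"Epic -{n}-", weight_for(n)) for n in (10, 2, 100, 1, 11)]

# Sort on weight first, then fall back to the plain text sort of the title.
ordered = sorted(posts, key=lambda post: (post[1], post[0]))

for title, weight in ordered:
    print(weight, title)
```

Within one weight every chapter number has the same number of digits, so the ordinary text sort gets the order right inside each group.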

Penny

Best Explanation

Of weighting I've read. I've always been kind of iffy about it, but now I get how it is used.

Thank you, Penny!


"Life is not measured by the breaths you take, but by the moments that take your breath away.”
George Carlin

ASCII is not the problem

Words are arranged left to right, numbers need to be sorted right to left. The compare routine needs to understand it is comparing a "human readable" number.

Gumby - I'm flexible

"Imagination is more important, than knowledge" - Albert Einstein

“The most exciting phrase to hear in science, the one that heralds
new discoveries, is not ‘Eureka!’, but ‘that’s funny…’” - Isaac Asimov

Roman Numerals

And then you have roman numerals or "one" "two" "three" and so on.
The IMHO best way is to have a separate field where you can store a serial number to sort by in addition to a series title (same value for the whole series) and a chapter title (purely informative). That way you can also handle appendices and front matter like dramatis personae. But then the user interface gets a bit more complicated ...
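
A minimal sketch of that record layout in Python. The class name Part and its field names are hypothetical, and the sample data is made up. The point is only that the sort never looks at the chapter title.

```python
from dataclasses import dataclass

@dataclass
class Part:
    series: str         # same value for every part of the series
    serial: int         # used only for ordering
    chapter_title: str  # purely informative, can be anything

parts = [
    Part("Epic", 3, "Chapter II"),
    Part("Epic", 1, "Dramatis Personae"),
    Part("Epic", 4, "Appendix A"),
    Part("Epic", 2, "Chapter I"),
]

# Order by series, then by serial number.
ordered = sorted(parts, key=lambda p: (p.series, p.serial))
titles = [p.chapter_title for p in ordered]
print(titles)
```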

ASCII and ye shall transceive ...

Long ago, my workplace had both worlds, ASCII (Windows PCs) and EBCDIC (IBM mainframe). I won't explain EBCDIC because I >like< being allowed on this site ...

My assignment was to compress files, and fearing that someone would not know how to decompress them, I put in a header referencing the book where I got the compression algorithm. Then, to 'cover' the case of the file ending up on the 'other' machine, I added a second header of "ASCII message follows". In EBCDIC....

This was long before (early 1980's) "download and install ZIP" was an option ...
---
And now we have UNICODE...