Weird encodings of non-ASCII characters on BigCloset

I've been noticing that when stories have expressions in languages that use non-ASCII letters, e.g., French, Irish, or Vietnamese, the characters are frequently mangled, usually replaced by several odd characters and spaces, such as rá¨vá¨rence or Người yáªu.

What's more, sometimes it looks OK in the story but is mangled in the list of titles, or vice versa. For example, Drea's story Dainéal’s Dream, a title which displays correctly in some places but in others, such as links to the story, is mangled: DainÃ©al’s Dream.

I'm assuming this is due to different character encodings being used in different places, or to the writer using one encoding while BigCloset assumes a different one on upload. If it's a language I'm familiar with, I can usually figure out what should be there, but if it's one I don't know, the text is so mangled that I can't even use a translation program to figure out what is meant.

Does anyone know how this happens? Or how to interpret the character encoding or convert it either to the original encoding or to UTF-8?
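
To illustrate the simplest case my assumption would predict, here's a quick Python sketch (latin-1 is just a stand-in guess for whichever 8-bit encoding is actually involved):

    text = "révérence"

    # The suspected mix-up: UTF-8 bytes decoded with the wrong 8-bit codec.
    mangled = text.encode("utf-8").decode("latin-1")
    print(mangled)                                    # rÃ©vÃ©rence

    # When the damage really is that simple, it can be undone:
    print(mangled.encode("latin-1").decode("utf-8"))  # révérence

The third-party ftfy library (its fix_text function) automates guesses like this for the common mojibake patterns.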

In all cases, I'm looking at BigCloset pages and stories using Firefox. If I download the pages as HTML, the HTML always specifies that the encoding for the page is UTF-8.

Comments

Decodings

The decodings of non-ASCII characters are done in browsers, so it is close to impossible for us to control that. Standards exist, but some Small-and-Squishy software companies like to use their own encodings, different from everyone else's, just to confuse things, I suspect. If you use programs written by Small-and-Squishy to produce content and don't use, or even know about, the ways to set the encoding scheme, then your content will appear strangely full of these odd characters in some browsers.

We do try to filter some of them out by various means, but it ain't easy or simple.

Hugs,
Erin

= Give everyone the benefit of the doubt because certainty is a fragile thing that can be shattered by one overlooked fact.

The sooner we forget all about ASCII the better

{Slashdot... are you listening}
I first encountered ASCII FIFTY (50) years ago, when I was learning to program in BASIC on a Teletype that punched tape in 7-bit ASCII.
It was a great improvement over EBCDIC (IBM-speak).

Around 1980 I got involved with a project that resulted in the DEC Multinational Character Set, which is regarded as a forerunner of ISO 8859-1 and, through it, of Unicode (UTF-8, UTF-16 and the rest). I was working for DEC at the time, on character sets for the VT100 and VT200 terminal families.

ASCII in its original form could not handle umlauts: 7 bits gives a maximum of 128 characters. When the first 32 are non-printing control characters such as STX/ETX (Start of Text, End of Text), there is not a lot left (see the tally below) for:
26 Latin capitals,
26 Latin lower-case letters,
digits and punctuation characters.
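
The tally, as a quick Python check (the grouping follows the list above):

    total   = 2 ** 7     # 128 code points in 7-bit ASCII
    control = 32 + 1     # codes 0-31 plus DEL (127) are non-printing
    letters = 26 + 26    # capitals plus lower case
    digits  = 10
    print(total - control - letters - digits)   # 33 slots for space and punctuation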

The world now mostly runs on UTF-8, which matches 7-bit ASCII exactly for the first 128 characters but allows the stream to include things like emoji by using multi-byte sequences. It also allows us to embed Cyrillic and Arabic text in documents with ease.
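
A quick Python illustration of those multi-byte sequences:

    for ch in "A", "é", "д", "😀":
        print(ch, ch.encode("utf-8"))
    # A  b'A'                    1 byte, identical to ASCII
    # é  b'\xc3\xa9'             2 bytes
    # д  b'\xd0\xb4'             2 bytes (Cyrillic)
    # 😀 b'\xf0\x9f\x98\x80'     4 bytes (emoji)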

Erin, I know, goes back to the dark ages like me and can remember punched cards and the like, but it is time to kill off ASCII (American Standard Code for Information Interchange).
Sorry for the end-of-week rant, but every time I go into my office I see the big fat volume of Unicode character definitions on my bookshelf. Yes, I should get rid of it, but...
Samantha

Mismatched encoding of letters like á, é, ú?

If the web page specifies UTF-8 (with the "meta" tag), then my browser decodes the bytes as UTF-8. (I don't know what it does when it gets invalid UTF-8.)
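
(For what it's worth, modern browsers substitute U+FFFD, the black-diamond replacement character, for invalid byte sequences; a Python sketch of the same behaviour:)

    bad = b"r\xe9v\xe9rence"    # Latin-1 bytes for "révérence", mislabelled as UTF-8
    print(bad.decode("utf-8", errors="replace"))   # r�v�rence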

I dug into the cases where I was pretty sure I knew what the correct letter was, and in one case, if I looked at the Unicode values of the two visible characters and treated them as a UTF-8 byte sequence, I got the character I expected: Ã© = c3 a9, which is the UTF-8 encoding of code point e9, which is é (e with acute accent), and that was the correct letter in this case.
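
That check is easy to script; a small Python sketch using the same two characters:

    visible = "Ã©"                          # the two characters as displayed
    raw = bytes(ord(c) for c in visible)    # b'\xc3\xa9'
    print(raw.decode("utf-8"))              # é  (code point U+00E9)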

That would make sense if someone uploaded UTF-8 and BigCloset thought it was, say, Windows code page 1252.
For what it's worth, if I create an HTML file with accented letters (in UTF-8) and leave out the <meta charset="utf-8" /> specification, I see that sort of thing. Note that every web page from BigCloset that I've looked at includes that meta tag, so BigCloset seems to think it's always sending UTF-8 and is telling my browser that.

That doesn't work for the others -- they all begin with á (hex e1), and in UTF-8, e1 starts a three-byte sequence for code points U+1000 to U+1FFF, which excludes pretty much any letter used in a Western European language. (Things like em-dashes and left and right curly quotes start with e2, not e1.)
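
That range does cover the extended Latin letters Vietnamese uses, which is my best guess at where e1 bytes could legitimately come from; a one-line Python check:

    print(b"\xe1\xbb\x9d".decode("utf-8"))   # ờ (U+1EDD), as in "Người"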

What's funny is that a given word may be encoded correctly in one place on a page and incorrectly in another place on the same page.

I don't know whether there are any encoding issues when typing (or cut-and-pasting) stuff into an input box, like this one. I've never noticed a problem with non-ASCII characters, but I haven't done any exhaustive tests, either.

It's a mystery....

Remember

Remember, your browser does the decoding.

Hugs,
Erin

= Give everyone the benefit of the doubt because certainty is a fragile thing that can be shattered by one overlooked fact.

It used to be that I could

It used to be that I could force Firefox to switch encodings when I got to a site that was a bit... mangled.

Unfortunately, the oh-so-smart "community programmers" (meaning the people being paid at Mozilla, rather than the actual people who try to contribute) removed that option, and I now regularly hit sites with corrupted text that I can't manually correct - because Firefox Knows What You Want.

-
You can tell when Firefox decided to ignore their actual manifesto: it was when they added Pocket to the base Firefox, rather than as an add-on. You see, they're not supposed to add anything to the base Firefox that does not have a fully public API for what's sent and received. Pocket, as far as I know, still has not released its API data - they just bribed enough people at the Mozilla project to add it, and the admins carefully either don't answer, or delete, anyone who questions them about not following their own policies.


I'll get a life when it's proven and substantiated to be better than what I'm currently experiencing.

I think not so much as

I think it's not so much to confuse things as to make money: "Want to see HTML produced by our product rendered correctly? Use our browser. Want to view and edit texts produced by our word processor? Use our word processor. Etc."
There's the Free Software Foundation (https://www.fsf.org/), which has tried (and still tries) for years to stop such proprietary abuse. There should be tips on that site on how to move away from M$$$$, so that in the end you'd only need win dross to start the games on your PC that require such an abomination.

Maybe your settings

I also use Firefox - currently 91.9.1esr as part of Debian Bullseye - and I don't see most of the effects you seem to.

It is possible that there are Settings boxes or choices you have which are different to mine. (My setup is all en_GB.UTF8).

For example, in the 'General' tab, if you go to 'Fonts and Colours' -> 'Advanced' there is a tick box at the bottom: 'Allow pages to choose their own fonts, instead of your selections above'. (Mine is ticked.)

Further down, in 'Language', I have 'English (United Kingdom)' selected. The button next to that says 'Alternatives', which I take to mean that I can choose more than one language to be displayed. I understand this might be necessary in places like e.g. Israel or Canada among others.

What you have in that dialog depends on which language packs you have loaded - and, I'm guessing, that will also affect the fonts available for displaying content.

Hope this may help.

Penny

Not a font issue

It's not about fonts -- they only control the style of the letters.

The language selection might affect the decoding if you're using one of the old 8-bit character sets, like the various ISO-8859 specs, where the same hex value can refer to different characters depending upon which spec is used. But the point of Unicode is that there is no language setting: every character in every language supported by Unicode has a unique value (that's why 21 bits are needed to encode them all). e9 is always lower-case e with acute accent; 0434 is always Cyrillic lower-case "d". If the HTML page specifies UTF-8, then any language setting is irrelevant to how the text is decoded. (It might affect what language the error messages are in, though.)
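
Python's standard unicodedata module confirms those fixed identities:

    import unicodedata

    print(unicodedata.name("\u00e9"))   # LATIN SMALL LETTER E WITH ACUTE
    print(unicodedata.name("\u0434"))   # CYRILLIC SMALL LETTER DE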

That may be so, but

It is no good decoding a UTF-8 symbol if your browser then has no means to display that symbol.

I would guess that, if it doesn't find a match in any of the fonts available to it, then there is a chain of least-worst options that it follows, probably ending with either a black or hollow square or one with the hex inside.

For example, I might access a Japanese or Korean website which is coded in UTF-8, but I certainly do not have fonts for those languages loaded. I'm probably going to see rows of boxes on my display.
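
If anyone wants to check whether a particular font file actually covers a character, here's a sketch using the third-party fontTools package (the font path is only an example and will differ per system):

    from fontTools.ttLib import TTFont

    font = TTFont("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf")
    cmap = font["cmap"].getBestCmap()   # maps code point -> glyph name
    print(ord("한") in cmap)            # False: no Hangul glyph, expect a box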

Penny