User:ArielGlenn/Unicode hell

From Wiktionary, the free dictionary

Fun facts about Unicode and polytonic Greek

So one day I was writing a little script to transliterate Ancient Greek text to Roman characters, following the information on User:Atelaes's About Ancient Greek page. Once I thought I had it working pretty well, I decided I needed a good chunk of text to test, so I went to the one place I knew had some polytonic text: τα μούτρα του George Le Nonce. I cut and pasted a heading, κατηγορίες ἱστολογημάτων, fed it into my transliteration script, and was horrified to see the output: katēgor es histologēm tōn

Where did the accented vowels go? To investigate further, I found another page with some text from Euripides on it, and tried that. It soon became clear that *only the vowels with a single acute accent* were causing problems.

So, what was going on?

Well, here's ά as I type it, and here's what it really is:
$ echo -n ά | od -c
0000000 316 254
0000002

And here's what happens when I use the other ά from the blog:
$ echo -n ά | od -c
0000000 341 275 261
0000003

That's right, they are from two different code blocks! Augh! One is from the monotonic Greek code block, and the other is from the *polytonic* Greek code block of Unicode. You can go look at the two code blocks in Wikipedia's article on the Greek alphabet.
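To see exactly which two characters are involved, here is a minimal Python sketch using only the standard library. The constant names and the `describe` helper are made up for illustration; the characters are written as escapes so that nothing along the way can quietly swap one for the other.

```python
import unicodedata

# Written as escapes so no editor or wiki can silently normalize them.
ALPHA_TONOS = "\u03ac"  # the two-byte character (316 254 in the od dump)
ALPHA_OXIA = "\u1f71"   # the three-byte character (341 275 261 in the od dump)


def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes in octal) for one character."""
    octal = " ".join(f"{b:03o}" for b in ch.encode("utf-8"))
    return f"U+{ord(ch):04X}", unicodedata.name(ch), octal


for ch in (ALPHA_TONOS, ALPHA_OXIA):
    print(describe(ch))
```

The octal strings it prints should line up with the two `od -c` dumps, and the names show that Unicode calls one accent "tonos" and the other "oxia".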

OK, so here is a fun experiment you can try. Take the word γάρ (typed on a Linux system in Greek keyboard layout). This has α with acute accent, *not* with oxeia (how do you spell that in English? I mean οξεία, anyway). Now do a Google search with that: "Results 1 - 10 of about 208,000 for γάρ. (0.03 seconds)", with the first result being the (unaccented) γαρ at el.wiktionary.org.

OK, now do a Google search for the word with the οξεία: "Results 1 - 10 of about 137,000 for γάρ. (0.10 seconds)", and the first entry is at perseus.tufts.edu. Specifically, search for *this* γάρ with site:en.wiktionary.org and you will get: "Your search - γάρ site:en.wiktionary.org - did not match any documents." But search for the other one, with the same modifier, and you get: "Results 1 - 9 of 9 from en.wiktionary.org for γάρ. (0.03 seconds)" (look at that, it found nine pages).

No good? You are right. It's no good. Do I have a solution? Nope, not a one.

Note that the dictionary lookup at Perseus has the same issue in reverse: it only finds things typed with characters from the extended (polytonic) code block. You have been warned!

Note that these shenanigans have been going on for years. Here are a couple of links to discussions about this very issue, still unresolved, at the hellug mailing list and at the linux-utf8 mailing list.

For a comprehensive discussion of all things Greek and Unicode, look at this site on Greek Unicode Issues.

Update

Wikimedia software apparently (or maybe it's the browser, but I doubt it) silently takes the α with οξεία and converts it to the ordinary ά, so on this page you can't actually tell the difference. But if you go to the blog mentioned above and cut and paste, you too can try these tests.
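That silent conversion is consistent with Unicode NFC normalization, which folds each oxia vowel into its tonos twin. The sketch below shows the fold with Python's standard `unicodedata` module, and how a transliteration script could use the same call to protect itself. The `fold_oxia` and `transliterate` names and the three-entry `TABLE` are hypothetical stand-ins, not the actual script or the real transliteration scheme.

```python
import unicodedata


def fold_oxia(text):
    """NFC-normalize text; vowels WITH OXIA come out as their WITH TONOS twins."""
    return unicodedata.normalize("NFC", text)


# Hypothetical fragment of a transliteration table, keyed only on the
# monotonic-block (tonos) forms. "\u00e1" is a Latin a with acute.
TABLE = {"γ": "g", "\u03ac": "\u00e1", "ρ": "r"}


def transliterate(text):
    """Transliterate after folding; anything not in the table becomes '?'."""
    return "".join(TABLE.get(ch, "?") for ch in fold_oxia(text))


# The same word typed both ways: tonos alpha, then oxia alpha.
print(transliterate("γ\u03acρ"), transliterate("γ\u1f71ρ"))
```

Without the `fold_oxia` call, the second spelling would hit the `"?"` fallback for the accented vowel, which is the same kind of hole the transliterated heading showed.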

Update two

Some fun can be had by looking at the earlier versions of the Corinth article. Early on, User:Muke entered translations for Ancient Greek and (Modern) Greek. He used both accents: [1] is the last revision that preserves this difference. The next revision, [2], has both accents as ό, which surely was a result of a change in the MediaWiki software. Just as well, since we can't search for the other version of the letter any more...

Update three

As of Fedora 9, entering accented characters from the polytonic layout no longer produces three-byte characters. This means that you can no longer type the οξεία versions to search with in Google! Cut and paste from the Unicode chart here: [3] if you get truly desperate.
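If cutting and pasting from the chart gets old, the three-byte forms can also be generated from what the keyboard does produce. This is only a sketch: the function names are invented here, and the table is derived from Python's bundled Unicode data by looking for letters in the U+1F00 to U+1FFF range whose name ends in OXIA and which NFC normalization turns into a single different character.

```python
import unicodedata


def build_tonos_to_oxia():
    """Map each tonos letter to its oxia twin in the U+1F00..U+1FFF range."""
    mapping = {}
    for cp in range(0x1F00, 0x2000):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        # Letters only; this skips spacing accents such as GREEK OXIA.
        if not (name.startswith("GREEK") and " LETTER " in name
                and name.endswith("OXIA")):
            continue
        twin = unicodedata.normalize("NFC", ch)
        if twin != ch and len(twin) == 1:
            mapping[twin] = ch
    return mapping


TONOS_TO_OXIA = build_tonos_to_oxia()


def to_oxia(text):
    """Replace every tonos letter with its oxia twin; leave the rest alone."""
    return "".join(TONOS_TO_OXIA.get(ch, ch) for ch in text)


# Show the UTF-8 bytes; the accented vowel should now take three of them.
print(to_oxia("γ\u03acρ").encode("utf-8"))
```

The result of `to_oxia` is what would need to be pasted into a search box that only matches the polytonic-block characters. Anything that normalizes its input will of course turn it straight back.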