Wiktionary talk:Transliteration and romanization/Archive 1

From User Talk:Danny

Re your Hebrew and Yiddish contributions, how would you feel about adding transliterations to go with them? Eclecticology 22:50 Dec 29, 2002 (UTC)

Sure. I just don't know the system that is preferred. Danny

So this is another one of those questions where I put my foot in my mouth! :-]
My quick answer would be whatever system is used by the Library of Congress for transliterating book titles. Eclecticology

I went with easy pronunciation. I have a copy of several transliteration schemes at work, but unfortunately, I don't like any of them, particularly for Hebrew. How is the system I used? Danny

It's a step in the right direction. Of course I say this as a complete non-reader of Hebrew. It seems too that we may need to distinguish between transliterations and pronunciation guides, and my impression is that you tried to do the latter. Transliterations (or romanizations in a language like Chinese) give a guide to representing the other language in another script, which for practical purposes here means the latin script. A pronunciation guide will give an idea about how to pronounce a language, and will often even be needed between two languages using the same script. Transliterations tend to be more formal ways to represent things, and may have a role in how Wiktionary can deal with words in other scripts. I welcome Wiktionary articles with titles in other scripts for the interesting problems that they give us to solve. Eclecticology

There are, of course, very specific problems with transliterating Hebrew. First of all, there are sounds that should be distinguished, even though modern Hebrew speakers do not distinguish them (one of these sounds was apparently lost as early as biblical times, hence the shibboleth/sibboleth story in Judges). Second, there are no vowels in written Hebrew. Points called nikkud, which are used to represent vowels, are early medieval. Third, certain Hebrew letters fricate, depending on their position in a syllable, so that the same letter can have two sounds. Fourth, there are the two shewa sounds, weak and strong, the first of which is essentially silent, the second of which is not but often ends up silent. Put it all together and you get quite a mess. Not to mention that modern Hebrew has no formalized spelling rules, though it is preferred to note long vowels. Take dog, for instance: I gave a pronunciation (KEH-lev), whereas the letters, K-L-B, can also be read kah-LEIV (the biblical name Caleb--note the b in the English version vs. the v in the pronunciation), k'LEIV (meaning "like a heart"), kah-LEIV (like the heart), or theoretically k'LOOV (cage). I've seen a lot of methods of transliteration--I even helped develop a couple--but I've never found one I liked. Danny

from Jehovah

Transliteration of Hebrew יהוה. Causative form (Hiphil) of the verb "havah" (הוה) "to be / to become". "He causes to be" or "He comes to be". The word deliberately uses the vowel sounds from "adonai" (אדני) "lord".

from Wiktionary:Beer parlour archive/July-September 04

Searching for words in non-Latin-alphabet languages

The Main Page states that the purpose of the English wiktionary is to define words from all languages in English (maybe I'm simplifying a bit). I see a problem for people trying to search for a word in, say, Russian, Greek, Chinese, or Hindi. The words have been entered in the alphabet, syllabary, etc., of the target language, and they are not searchable with Latin letters. I know some Greek and Hebrew, but I don't really know how to set my keyboard to enter these characters, and I'm completely lost in Russian, Chinese, and other languages. I suspect the average user would have much more difficulty than I.

Do we have a method to allow people to search for foreign words using the Latin alphabet? If so, what is it? If not, are we severely limiting the Wiktionary? Is there any simple method to link the Latin version of a word to the native version, to make the Wiktionary more searchable and useful? -- RSvK 00:05, 26 Jul 2004 (UTC)

Your observations raise some very interesting questions and problems. To some extent they overlap with the questions at Wiktionary talk:Categories where a partial solution may be found, but some more profound questions also arise. How do we ensure exact 1 to 1 correspondence between the original script and Latin script? In Greek both κσ and ξ may be represented by ks. How can we ensure that the process can be perfectly reversed? How do we cope with tonality in Chinese? There have already been objections to having tone marks appear in article titles. A quick look at "ma" in my Chinese dictionary shows that it can be represented by 16 different characters across the four available tones. To answer your questions: No, we don't have a functioning system to do what you describe, but it would be nice to have it. Yes, we are severely limiting Wiktionary, and ... No, there is no simple solution. Eclecticology 03:22, 26 Jul 2004 (UTC)
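Editorial illustration of the reversibility point above: a minimal Python sketch, assuming a toy letter table invented for this example (it is not a published romanization standard). It shows how two different Greek letter sequences collapse to the same Latin string, so the Latin form alone cannot be mapped back.

```python
# Toy table for illustration only; not any official romanization scheme.
GREEK_TO_LATIN = {
    "κ": "k", "σ": "s", "ς": "s", "ξ": "ks",
    "α": "a", "ε": "e", "η": "e", "ο": "o", "ω": "o", "λ": "l",
}

def romanize(word):
    """Replace each Greek letter by its Latin value; pass other characters through."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in word)

def collisions(words):
    """Group words by romanized form and keep only the forms shared by several words."""
    by_latin = {}
    for word in words:
        by_latin.setdefault(romanize(word), []).append(word)
    return {latin: group for latin, group in by_latin.items() if len(group) > 1}

print(collisions(["ξ", "κσ", "α"]))
```

With this table, both "ξ" and "κσ" come out as "ks", which is exactly the kind of collision that rules out a perfect reverse mapping.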

Some entries today raise the question again. A one-to-one correspondence and perfect reversal are not necessary, just a way to get from a Latin representation of a foreign word to the definition. For example, if someone sees the word "kalos" somewhere in Latin letters and wants to know what it means, it doesn't do much good to have the definition available only from the Greek letters. RSvK 04:02, 7 Aug 2004 (UTC)

I agree that perfect reversal is not needed. What is needed is a standard that everybody agrees upon. Chinese has Pinyin and Japanese has Romaji and they are standardized so that the way a word is spelled in Latin letters is most often the only common way to spell the word in Latin letters, or at least people know when an outdated system is used. For other languages this is not the case. Granted that people have some need to find the Libyan leader's name using Latin characters. This does not mean that we should create an article for every way it can be spelled in Latin characters. Likewise, if we choose just one way, people using the other ways won't find it. While Greek may not be as bad as Arabic or Hebrew, along with Russian and the other Cyrillic-using languages, Greek just doesn't have an agreed standard of romanisation. You can only take a guess at a spelling. With Chinese and Japanese you can know the spelling. — Hippietrail 04:55, 7 Aug 2004 (UTC)

That's kind of what I was afraid of. There really is no one-to-one correspondence between Latin letters and other systems, particularly if we want to make the system usable--such as Latin 'e' equating both to Greek 'eta' and 'epsilon'.

Would it work for now to just enter the Latin version with a redirect to the other language version of a given word? -- RSvK 05:31, 26 Jul 2004 (UTC)

You can always use ē for eta, and ō for omega. The enormous scope of the problem can be gleaned at http://lcweb.loc.gov/catdir/cpso/roman.html Your suggestion probably will work as long as there isn't a word already with that spelling in another language. Eclecticology 08:51, 26 Jul 2004 (UTC)

Yes, this is a problem of enormous scope. The problem is whether the foreign non-Latin language dictionaries are going to be of any use to people who don't know that non-Latin method of writing. The web site above seems to indicate that the problem is solvable--the romanization table for Greek is quite straightforward. If I see a word in a newspaper, in Latin letters, from Russian, Greek, Hindustani, Chinese, etc., there ought to be a way to look it up. Yes, there are some multiple spellings. If they are common in Latin letter sources, we should probably add them.

I propose that we add some kind of stub entries for these words. If nothing else, just indicate the language and include a link to the entry in the language's own script. Otherwise the Wiktionary's usefulness becomes greatly diminished.

Yes, this means a lot of entries, but I don't see any way around it.

It also means a lot of mess when there are several ways to romanise a word with some people adding one or another on some entries, and some adding some other on other entries. Find us some romanisation standards for the languages you care about and we can then do it properly. — Hippietrail 02:04, 8 Aug 2004 (UTC)

Eclecticology lists a romanization standard for Greek above here, which looks reasonably straightforward. I suggest we need it not only for Modern Greek but for Classical and Koine Greek, because of the large number of words in English derived from Greek. RSvK 20:39, 14 Aug 2004 (UTC)

I would prefer some person or persons with knowledge of various ages of Greek to look at several systems before just assuming the first one is going to be best. We should strive for the best at Wiktionary. I am positive that there are more than a couple of "standards" in use for various jobs and by various groups. I'll try to dig some up so they can be compared. — Hippietrail 00:31, 15 Aug 2004 (UTC)

It might be useful also in the foreign script to include a Latin-letter transliteration.

We already do this. It goes after the headword, in parentheses. For languages like Chinese which have more than one recognized system, we also add the transliteration in each system with a label, when we know. — Hippietrail 02:04, 8 Aug 2004 (UTC)

For words with the same spelling in other languages, just have the usual separate language sections. RSvK 14:38, 7 Aug 2004 (UTC)

One (ambitious) way around it would be to build in intelligence to the search interface that matches inquiries that are not found in the dictionary to possible alternative romanisations. This would work something like the Google search engine which sometimes gives you a prompt at the top of the results page asking 'Did you mean ~~~~?' We would have to think carefully about possible perceptions of favouring particular romanisations. Oska 01:56, 8 Aug 2004 (UTC)
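Editorial illustration of the "Did you mean" idea above: a minimal Python sketch that uses fuzzy string matching from the standard library (`difflib.get_close_matches`). The index contents and the `lookup` function are hypothetical, made up for this example; nothing here describes how the site's search actually works.

```python
import difflib

# Hypothetical index: plain Latin spelling -> entries in the original script.
ROMANIZED_INDEX = {
    "kalos": ["καλός"],
    "logos": ["λόγος"],
    "eran": ["ἐρᾶν"],
}

def lookup(query):
    """Return exact hits, or close romanized spellings when there is no exact hit."""
    key = query.lower()
    if key in ROMANIZED_INDEX:
        return {"found": ROMANIZED_INDEX[key], "did_you_mean": []}
    # cutoff=0.6 is difflib's default similarity threshold; tune as needed.
    suggestions = difflib.get_close_matches(key, list(ROMANIZED_INDEX), n=3, cutoff=0.6)
    return {"found": [], "did_you_mean": suggestions}

print(lookup("kalos"))
print(lookup("kallos"))
```

Because suggestions are ranked by similarity alone, no particular romanization scheme is favoured over another, which speaks to the concern about perceived bias.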

The discouraging part of it is the "Hard work" :-)

We're in amazingly good shape for Chinese. See Wiktionary:Chinese Pinyin index and the more general Wiktionary:Chinese index. Pinyin romanization (not the same as transliteration, which only applies to alphabetic or syllabic scripts) works from a limited number of acceptable morphemes. Should we be simply transferring, for example, everything for the Pinyin "can" (pronounced /tsan/ not /kan/) to the page for can? Those Pinyin index pages could be much improved, but they provide a tremendous resource to work from.

Each alphabetic script brings its own problems. Some have several ways of being transliterated. For the Russian letter Я we can use "ia", "ja", "ya" or "ia͡". The Chicago Manual of Style recommends using the system used by the U.S. Board on Geographic Names. It is the one that uses "ya", whereas the Library of Congress is the one that prefers the ligatured version. The average contributor will be glad to not have to figure out how to type that! It comes down to agreeing that we are going to use a certain way of transliterating and sticking to it.

If we confine ourselves to three languages (Russian, Chinese, and modern Greek) until the thing is debugged, the others should come a lot easier. Eclecticology 04:51, 8 Aug 2004 (UTC)
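Editorial illustration of the Я problem above: a minimal Python sketch that enumerates every Latin spelling a word can receive when one letter has several accepted values. The `VARIANTS` table is a small assumption covering only three letters, and the ligatured Library of Congress form is left out because it is hard to type.

```python
from itertools import product

# Assumed per-letter options; only the letters needed for the example.
VARIANTS = {
    "я": ["ia", "ja", "ya"],
    "м": ["m"],
    "а": ["a"],
}

def candidate_spellings(word):
    """List every Latin spelling obtainable by picking one option per letter."""
    options = [VARIANTS.get(ch, [ch]) for ch in word]
    return ["".join(choice) for choice in product(*options)]

print(candidate_spellings("яма"))
```

The number of candidates is the product of the option counts per letter, so it grows quickly for longer words; that is one argument for agreeing on a single scheme rather than creating an entry for every spelling.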

I disagree with the whole premise of this discussion.

"they are not searchable with Latin letters" — they're not written with Latin letters either. Why should they be searchable with Latin letters?
"link the Latin version of a word with the native version" — it's not our place to invent new spellings, Latin or otherwise.
"if someone sees the word kalos somewhere in Latin letters" — that's what =Alternative Spellings= is for, but again, it's not our job to make such things. (I don't deny that it happens a lot. Sometimes there are legitimate examples, such as Kat' exochen.)

The best solution is to make it easier for people to enter text in foreign languages. For example if you go to the Esperanto Wikipedia you'll notice that above the search box is the label "Serĉu ĉ ĝ ĥ ĵ ŝ ŭ" (search), giving the user easy entry to all the Esperanto accents by copy and paste. If you edit a page on the Italian Wikipedia you are greeted with the text "Clicca uno di questi caratteri speciali per inserirlo nel testo: È à é è ì ó ò ù – «» “” ‘’ [[]] {{}}" (click one of these special characters to insert it into the text), with the relevant characters being clickable for automatic insertion. Either of these options could be doable, all we need is to make a page to help enable international character input. We don't need to dumb down Wiktionary for the technically impaired when we have the option to technically empower people instead. —Muke Tever 21:30, 15 Aug 2004 (UTC)

Nobody's suggesting that we invent new spellings. If we treat the Latin transliteration as an alternative spelling it would link or redirect to the orthography of that language. We are not talking about Esperanto or Italian where there are only a handful of specially accented characters. The proposed approach gives the user another series of options. So does your proposal. Once you have developed your proposal to the point where the "technically impaired" among us can easily look up a Bulgarian word your efforts will be very much appreciated. In the meantime, we hope to accommodate those whose knowledge of the language is limited to a rudimentary decoding of their alphabet. Eclecticology 22:30, 15 Aug 2004 (UTC)

True, in Esperanto or Italian there are only a handful of specially-accented characters, but while in those languages they may appear in many words of a search, in Wiktionary the problem is offset by the fact that a person is generally only going to be searching for one word at a time. My "proposal" only requires a page somewhere like Wiktionary:Cyrillic with a table like

Bulgarian capitals	А	Б	В	Г	Д	Е	Ж	З	И	Й	К	Л	М	Н	О	П	Р	С	Т	У	Ф	Х	Ц	Ч	Ш	Щ	Ъ	Ь	Ю	Я
Bulgarian smalls	а	б	в	г	д	е	ж	з	и	й	к	л	м	н	о	п	р	с	т	у	ф	х	ц	ч	ш	щ	ъ	ь	ю	я
Transliteration	a	b	v	g	d	e	ž	z	i	j	k	l	m	n	o	p	r	s	t	u	f	x	c	č	š	št	ŭ	ʼ	ju	ja

I already use similar beasts myself, at User:Muke/grc (greek capitals including archaic characters) and User:Muke (a less-developed version, Gothic).
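Editorial illustration: the Bulgarian table above can be read as a plain lookup. The following minimal Python sketch copies the letter values from that table; the function name and the handling of capitals (capitalizing the Latin value) are assumptions made for the example.

```python
# Letter values copied from the Bulgarian table in this discussion.
SMALLS = "а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ь ю я".split()
LATIN = "a b v g d e ž z i j k l m n o p r s t u f x c č š št ŭ ʼ ju ja".split()
TABLE = dict(zip(SMALLS, LATIN))

def transliterate(text):
    """Transliterate Bulgarian Cyrillic letter by letter; other characters pass through."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in TABLE:
            latin = TABLE[lower]
            # Assumed convention: a capital Cyrillic letter gives a capitalized Latin value.
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            out.append(ch)
    return "".join(out)

print(transliterate("България"))
```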

I don't like the pages at Romanized titles because it is actually harder for the user: someone running across a (say) Greek word in situ may want to know what it means and it is easier for them to enter it directly than to have to figure out what each individual character means first.

The biggest reason I don't like this Romanized-title talk is because I see pages like erān coming into existence which 1) contain diacritics not in the original orthography, 2) represent effort not put into the page with the proper spelling, ἐρᾶν (which at the moment doesn't even exist, so people who actually know the word can't find it), 3) don't have any proof — Google:erān knows of none either — that Latin spelling of this word is even used (which is indeed tantamount to inventing new spellings). Indeed, erān doesn't even mention that the word isn't in its original script (though that is a fixable problem).

What I really don't want to see is a policy that endorses misspellings like this in a dictionary. If a transliteration setup like what is being proposed is brought into effect, it should 1) clearly notate that the transliteration is not a real word and 2) be in ASCII, as the primary call for it is people who can't type extended characters, which applies equally well to ā as it does for λ; this does indeed offer grounds for ambiguity on the Latin side, but as the existence of the Latin page itself should be comparable to disambiguation ("Eran is also a Greek word; see ἐρᾶν") this shouldn't be a problem. (As already noted, proper transliterations with fun diacritics appear on the native script page itself anyway.) —Muke Tever 23:13, 15 Aug 2004 (UTC)
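Editorial illustration of the "be in ASCII" suggestion above: a minimal Python sketch that strips diacritics using the standard `unicodedata` module, so that a transliteration like erān folds to eran. The `DISAMBIGUATION` mapping and the `see_also` helper are hypothetical, and letters with no Unicode decomposition would pass through unchanged.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical mapping from a plain Latin title to native-script entries.
DISAMBIGUATION = {"eran": ["ἐρᾶν"]}

def see_also(transliteration):
    """Find native-script entries for a transliteration, ignoring diacritics and case."""
    return DISAMBIGUATION.get(strip_diacritics(transliteration).lower(), [])

print(strip_diacritics("erān"))
print(see_also("Erān"))
```

Folding several accented transliterations onto one plain key is what creates the ambiguity on the Latin side mentioned above; the lookup returns a list so that one plain title can point to several native-script entries.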

OK. I can go along with the essentials of your alternative as satisfying the purpose of this exercise. It is directed at people who don't know the language at all, or in the case of Cyrillic, Devanagari or Arabic scripted languages may not even know what language they are looking at. I can accept that the transliterated link be written with a simplified, and unaccented script. On the page itself, and under the appropriately placed language the indication would be "Transliteration of ..." and list the various possibilities with only the briefest indication of what each means.

For Cyrillic, the transliteration needs to be uniform without regard to which language is being used. It will not do to have "Щ" sometimes appear as "šč" or "št", but it should always be "shch". Eclecticology 23:56, 16 Aug 2004 (UTC)

Well, you said Bulgarian, so I went for Bulgarian :) where "Щ" is indeed always "št" and not "šč" (ignorable pedantic side note: the form of the letter is originally a ш-т ligature, the descender being originally central and Cyrillic ligatures being written top-to-bottom instead of side-to-side as in Latin, so št is actually a more-original value than the Russian).

Actually, for an audience who doesn't know the language, it might be better not to confuse them with (possibly contradictory) transliterations at all—when entering Cyrillic text to find a page, it doesn't matter what the value of "Щ" is; the page oughtta should tell them exactly what "Щ" is supposed to stand for.

Transliterations could be useful to people who have the phonetic form of a word but don't know what the spelling should look like. But in this case the user has the same problem as if he didn't know how to spell "bureaucrat": we don't, as we may when browsing a print dictionary, have the capacity to search "nearby" words to find which is our word properly spelt. (And languages like Russian, while perhaps better than English, are still not phonetically spelled.)

The "Transliteration of..." section btw is a Very Good Idea. —Muke Tever 02:36, 17 Aug 2004 (UTC)

Blushing about Bulgarian: I should have stuck with Russian. I do tend to think in terms of the written language. Spoken language opens up a lot of new problems like mishearings or dialectal differences. That might need to wait until we have voice recognition software installed. ;-) Searching nearby words gets a lot more tricky when the language itself is a variable.

For this purpose a very rough one-fits-all transliteration is what we need. The instruction page would make it clear that the series of letters that they are typing may not accurately reflect the pronunciation of the language. The key question that we are trying to answer is "How do I look up a word written in a language and script that I do not know or even recognize?" Eclecticology 00:49, 18 Aug 2004 (UTC)

Merge debate

The following discussion has been moved from Wiktionary:Requests for moves, mergers and splits.

This discussion is no longer live and is left here as an archive. Please do not modify this conversation, but feel free to discuss its conclusions.

Wiktionary talk:Policy - Transliteration/Collection of past discussions


Not really sure where this should be, but it isn't here. Bad capitalization, bad entry title. If it's already archived elsewhere, I think it can be deleted, otherwise this is a Beer Parlor style discussion, not talk about a Wiktionary page (which has never existed). Mglovesfun (talk) 17:34, 9 March 2010 (UTC)

It's an archive of the talkpage for [[Wiktionary:Policy - Transliteration]] which latter is now a redirect to [[Wiktionary:Transliteration and romanization]], so this page, if it's seen as needing moving, can be moved to [[Wiktionary talk:Transliteration and romanization/Archive 1]], with a link to it from [[Wiktionary talk:Transliteration and romanization]].—msh210℠ 19:42, 9 March 2010 (UTC)

Done, I'll consider this closed unless someone wants to dispute it. Mglovesfun (talk) 15:00, 13 March 2010 (UTC)