Wiktionary talk:Khmer romanization

Khmer module

@Wyang do you think you can put Module:km-translit on your to-do list? :) User:Stephen G. Brown might also help. --Anatoli (обсудить/вклад) 14:45, 17 June 2014 (UTC)

No problem :) Wyang (talk) 23:48, 17 June 2014 (UTC)
Thank you. :) I hope it's possible, I'll assist where I can. --Anatoli (обсудить/вклад) 00:11, 18 June 2014 (UTC)

Automatic pronunciation and romanisation for Khmer

Previous discussions:

@Stephen G. Brown, Atitarev, Octahedron80, Hippietrail, Nisetpdajsankha, វ័ណថារិទ្ធ, Judexvivorum

Hi all. As some of you may have noticed, I am currently working on the infrastructure for the automatic romanisation and pronunciation of Khmer on Wiktionary, as well as a revamp of the page Wiktionary:Khmer romanization. The aim of this is to achieve automatic Khmer romanisation on Wiktionary pages with no or very minimal need for manual input, similar to what we currently have for Thai. The module used as the backend for this will be Module:km-pron, to be developed similarly to the now-mature Module:th-pron.

The Khmer script is not very phonetic, so purely predicting the pronunciation from orthography is not feasible. The renowned Chuon Nath Dictionary and other dictionaries make use of phonetic respellings ― though seemingly inconsistently ― to indicate the pronunciation of irregular words, and this principle will also be used as the basis for this project. I think it is possible to accurately derive IPA pronunciations from respellings (provided the respellings are defined clearly), again in a fashion akin to the existing Thai infrastructure, although there are dictionaries in Thai that systematically annotate any word whose orthography is not in a phonetically respelt form, with the phonetically respelt form of the word, which appear to be lacking for Khmer. (or maybe there are also Khmer dictionaries as such?) Some examples of phonetically respelt forms of Khmer words, and an attempt to syllabify based on respellings, can be found on Module:km-pron/testcases (in test_syllabify).

We also need to decide on our approach to Khmer romanisation, if we can achieve automatic pronunciation on Khmer entries. At present the Khmer romanisations are quite heterogeneous: this romanisation guide appears to recommend the UN system, Stephen has been using an IPA-derived romanisation, and some entries use other systems or have no romanisation. Technically, it seems none of the commonly used schemes (UN, Geographical, BGN/PCGN, ALA) can be fully automated, as they all rely on both orthographical and pronunciation information of the word simultaneously, to varying extents. Several possibilities exist for romanisation, IMO:

  1. Devise a Wiktionary transcription for Khmer, based on the UN romanisation scheme, the Geographical Department scheme and/or the IPA, and clearly explain its use on Wiktionary:Khmer romanization. I've tentatively listed some UN-derived transcriptions we could use for consonants, in the column Transcription, open to feedback of course. The template {{km-IPA}} will be added to all Khmer entries, and the romanisation will be extracted from the template call on the entry, exactly like Thai {{th-pron}}. Such a transcription system could again be applied in various manners, for example in a Thai-like structure, which attempts to romanise any Khmer word by extracting its phonetic respelling and transcription from its entry and, if successful, uses that transcription; if unsuccessful:
    1. Set the romanisation to nil, and display no romanisation; or
    2. Delegate the romanisation to the current Module:km-translit, which is buggy but could produce something better than nothing.
  2. Adopt a strict one-to-one transliteration system and apply it in all cases. This method would be doable, though the output would not be as helpful as a transcription output.
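A minimal sketch of the lookup-then-fallback flow described in option 1, written in Python for illustration only (the real backend would be a Lua module). The function name `romanise`, the `entry_transcriptions` mapping and the `fallback` callable are all hypothetical and are not taken from Module:km-pron or Module:km-translit.

```python
from typing import Callable, Mapping, Optional


def romanise(
    term: str,
    entry_transcriptions: Mapping[str, str],
    fallback: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return a romanisation for `term`, or None if none is available.

    `entry_transcriptions` stands in for "the transcription extracted from
    the term's entry" (hypothetical; a real module would read the page).
    `fallback` stands in for an automatic transliterator (hypothetical).
    """
    found = entry_transcriptions.get(term)
    if found:
        # Sub-case "successful": use the transcription from the entry.
        return found
    if fallback is not None:
        # Sub-option 2: delegate to the automatic transliterator.
        return fallback(term)
    # Sub-option 1: no romanisation is displayed.
    return None
```

Passing `fallback=None` gives the behaviour of sub-option 1, and passing a transliteration function gives sub-option 2, so the two variants differ only in one argument.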

If we think this approach for pronunciation and romanisation is worth adopting, the pertinent tasks for the time being would include arranging and codifying the respelling–pronunciation correspondences in the form of tables on Wiktionary:Khmer romanization, and after that is complete, implementing those rules in Module:km-pron to achieve the automatic conversion.

I apologise for this relatively long post. Any comments, suggestions, technical help, criticisms, moral support, etc. would be welcome. Thank you!

Wyang (talk) 10:33, 10 February 2018 (UTC)

Great efforts, Frank! I will be able to assist to some extent (but probably not on the same level as Thai) when I learn how to convert phonetic systems to the phonetic respelling. I've got a phonetic Khmer dictionary, which I can use, but there are dictionaries and textbooks out there. The Sealang dictionary, which uses an IPA system, can be used as well.--Anatoli T. (обсудить/вклад) 11:36, 10 February 2018 (UTC)
Thanks Anatoli!
@Atitarev, Stephen G. Brown, Octahedron80, Judexvivorum, Hippietrail, Nisetpdajsankha, វ័ណថារិទ្ធ
I have finished revamping the guide page Wiktionary:Khmer romanization, and the automatic pronunciation/romanisation template {{km-IPA}} and module Module:km-pron, which were written based on the IPA and transcriptions on that page, have passed preliminary testing (Module:km-pron/testcases) and should be mostly ready for use. Examples of how the pronunciation template would look on Khmer entries can be found on the template page.
I have tentatively proposed a transcription system for Khmer, also outlined on the guide page, based on Stephen's existing system, IPA and the UN system. This transcription can be accurately generated on every Khmer entry that will use the pronunciation template. The details of the transcription scheme are up for discussion as this is only a draft scheme, but I suggest we develop modules such as Module:km-headword, in parallel to Module:th-headword, to automatically romanise linked Khmer terms (in {{l}}, {{m}}, {{cog}}, etc.) and have headword-line templates display romanisations by extracting the transcription from the Khmer entries, like Thai. This will remove most, if not all, need for manual transcription/transliteration and greatly simplify our work. The terms that fail automatic romanisation can still use Module:km-translit as a fallback.
Any thoughts would be greatly appreciated. Thanks! Wyang (talk) 13:35, 12 February 2018 (UTC)

I just wanted to put in my two cents. I think this is great. It seems to work very well for Thai. (It would be great to see more work done for Lao by the way, which in theory has a nice logical orthography, but in practice turns out to have a ton of very difficult edge cases!)

It's over two years since I was in Cambodia learning Khmer and I've never found anyone to practice with since. I do have a self-study textbook, a small dictionary with some grammar, and a large pocket dictionary. I also have a Lonely Planet Southeast Asia Phrasebook with a Khmer section.

I have to say I vastly prefer the romanization system User:Stephen G. Brown has developed over the years to the ones in any of my books and to the one we've been auto-generating here for some time. My next preference is a simple phonemic IPA-based system. Not a too-narrow one full of diacritics.

I think we could aim at something between what we do for Chinese and Thai, and what we do for English pronunciation. We can support multiple "official" transcriptions that are automatically generated from the most detailed one, or from one we design ourselves if none of the others are sufficiently detailed, or straight from the IPA. In the case of English pronunciation transcription we devised our own system to complement IPA since many people have an irrational dislike for IPA and since all American dictionaries use somewhat similar non-IPA systems. If we were to go down that path we could use Stephen's system, or develop something using that as a starting point.

In short I think Stephen's input here would be key. I've always been impressed by his knowledge and work here. Combining his skills with you guys' technical automation/scripting skills we could come up with something that is the best anywhere on the internet.

I'm really looking forward to seeing how this goes. (And please put Lao on the list for attention soon.) — hippietrail (talk) 00:11, 13 February 2018 (UTC)

Just a couple of remarks. My biggest gripe with choosing transcription systems is that most of the existing systems were developed at a time when Unicode fonts did not exist, and when fonts for most languages that used scripts other than Roman were difficult or impossible to come by and use. Unicode and the development of fonts for almost every script makes those old transcription systems obsolete. It is no longer necessary to have unique Roman letters for each and every native glyph. No longer do we need to transcribe Korean as L in every case, because we have the original Korean font right there next to it.
The only advantage that I can see for the UN system for Khmer is that it has been used for place names on some maps for several decades. So a person might google a UN transcription such as lékhtoŭ and possibly find it here. However, the vast majority of Khmer words are not place names and are not transcribed on maps. Besides, it is difficult for most people to type some of the letters, such as ŭ, so they would search for u instead. Now there is an addition to the UN system, referred to as provisional. I don't know what they mean by provisional. I don't know how the originators of the UN system came up with those equivalents. Their pronunciation is cryptic to say the least. I really dislike those old systems. They were always inadequate, and now they are unnecessary.
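The point about people typing u instead of ŭ can be illustrated with a small diacritic-folding sketch. This is an assumption about how a search could be made tolerant, not a description of how Wiktionary search actually works; the function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches_search(query: str, romanisation: str) -> bool:
    """Compare two strings while ignoring diacritics and letter case."""
    return (
        strip_diacritics(query).casefold()
        == strip_diacritics(romanisation).casefold()
    )
```

With this folding, a query of lekhtou would match the UN-style lékhtoŭ, since é and ŭ decompose to a base letter plus a combining mark.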
I do like the idea of transliterating as ʋ.
I noticed a couple of problems in Wiktionary:Khmer romanization. First, the letters អ and ឣ. One is a consonant, the other a vowel, and they are identical in appearance. Traditionally they have not been considered separate letters, but one letter used in two ways. Unicode provides both, and this has led to big problems with misspellings. Native Cambodian speakers can't get them right. Today, only one of them is in use. The other is deprecated. We should be using អ (the consonant) exclusively, never ឣ.
Another problem I see is some of the subscript forms on Table 2. For example, ហ្ហ្គ (the subscript, not the letter). That subscript comprises +. I don't think such subscript clusters are possible. Ignoring subscripts that appear to the left or right (such as យ្យ, រ្រ, ស្ស), there is only one subscript that can appear together with a second subscript, and that is វ្វ. Even វ្វ is rare. One example is ឧច្ឆ្វាស (uccʰvaasa) (ឧចឆវាស), Ucchvasa (hell for murderers and those who eat impure meat), from Sanskrit उच्छ्वास (ucchvāsa, death, breath, expiration, exhalation). Considering how bad ឧច្ឆ្វាស (’ŏchâchhvas) looks, I think the use of វ្វ as a secondary subscript is probably not used in native Khmer words. It probably requires a special font for words borrowed from Sanskrit. —Stephen (Talk) 08:15, 13 February 2018 (UTC)
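One way to look for the doubled subscripts discussed above is to search for a consonant followed by two or more COENG + consonant pairs. The sketch below is illustrative only: the names are hypothetical, it treats U+1780 to U+17A2 as the consonant range, and it does not distinguish subscripts drawn below from those drawn to the left or right.

```python
import re

COENG = "\u17D2"              # KHMER SIGN COENG, introduces a subscript
CONSONANT = "[\u1780-\u17A2]"  # Khmer consonant letters (assumed range)

# A base consonant followed by at least two subscript consonants.
STACKED = re.compile(f"{CONSONANT}(?:{COENG}{CONSONANT}){{2,}}")


def find_stacked_subscripts(word: str) -> list:
    """Return every cluster in `word` that carries two or more subscripts."""
    return STACKED.findall(word)
```

Run over a word list, a check like this would surface forms such as the Sanskrit loan quoted above, while ordinary single-subscript words are not reported.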
@Hippietrail, Stephen G. Brown
Thank you both for your comments. I have the same feeling about the existing major romanisation schemes for Khmer ― none of the systems is pure transliteration or transcription and all involve a mix of both, making them very difficult to automate when we have both the orthographical spelling and the phonetic respelling of a word. The UN system makes use of many (perhaps too many) diacritics, which can be quite confusing, and transcriptions in the UN system could be difficult to understand.
I have added some notes to Wiktionary:Khmer romanization to explain that entries should not use the independent vowel ឣ, but rather អ, and that a respelling should always be used in the template {{km-IPA}} if the word is spelt with an independent vowel letter. Good point regarding the subscript forms too; I removed the subscript forms of the composite consonant letters. I was thinking something like sf- in Western loanwords, where the composite consonant could be written as a subscript, but such words probably do not exist at all.
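The rule about ឣ versus អ could be checked mechanically. A rough sketch, assuming only the Unicode code points U+17A3 (the deprecated independent vowel) and U+17A2 (the consonant letter); the function names are hypothetical and this is not part of any existing module.

```python
DEPRECATED_QAQ = "\u17A3"  # KHMER INDEPENDENT VOWEL QAQ (deprecated)
LETTER_QA = "\u17A2"       # KHMER LETTER QA, the form to use instead


def has_deprecated_qaq(text: str) -> bool:
    """True if the text contains the deprecated look-alike letter."""
    return DEPRECATED_QAQ in text


def normalise_qaq(text: str) -> str:
    """Replace the deprecated letter with the consonant letter."""
    return text.replace(DEPRECATED_QAQ, LETTER_QA)
```

Because the two characters look identical, a check on code points is about the only reliable way to spot the misspelling.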
Stephen, do you have suggestions regarding the romanisation system to use on Wiktionary? The proposed scheme is in the columns "Wiktionary Transcription" on the page Wiktionary:Khmer romanization, and examples of romanisations can be found on Module:km-pron/testcases (the third table test_transcript). It is largely consistent with the system you have been using for Khmer entries, with some small differences. Transcriptions of the vowels are mainly aligned with their IPA pronunciation, and I tried to minimise the use of vowels with diacritics in the system, with the exception of the three short diphthongs ĕə, ŭə and ŏə, which can be contrastive with the long diphthongs, e.g. vs. ŭə. This is not an urgent issue, as the module backend Module:km-pron can be easily modified if we would like to alter the transcriptions of certain consonants or vowels.
What do we think about the idea of automating romanisation in the link templates, by having the template extract romanisation from the Khmer entry? For example, {{m|km|បរិយោសាន}} (current output: បរិយោសាន (paʾreyaosaan)) will make the template extract the romanisation paʾreʾyaosaan from the បរិយោសាន entry directly, rather than sending it to Module:km-translit for the auto-transliteration (to produce bârĭyoŭsan). This way we can ensure that the romanisation produced by the template is always correct. This is similar to how links to Thai terms, such as {{m|th|ประธานาธิบดี}} (output: ประธานาธิบดี (bprà-taa-naa-tí-bɔɔ-dii)), get their correct romanisations from the target entry (bprà-taa-naa-tí-bɔɔ-dii), rather than an auto-transliteration module.
If we think the automatic pronunciation and romanisation by {{km-IPA}} is desirable, I will start to apply the template on Khmer entries and gradually remove the existing manual transcriptions to make it entirely automatic like Thai. Wyang (talk) 09:40, 13 February 2018 (UTC)
Seems okay to me. I guess if there are technical reasons why we need some diacritics, then we have to accept them. Automating romanization in link templates is fine with me (as far as I can imagine it). Not too sure about respellings, though. That sounds rather difficult. What about a template that would produce respellings from a manual romanization? —Stephen (Talk) 12:05, 13 February 2018 (UTC)
@Stephen G. Brown
It is not really difficult using the reference tables on Wiktionary:Khmer romanization. Most of the respellings can be found in the Chuon Nath Dictionary, available on Tovnah Khmer Online Dictionaries (KOD) and Sealang. For example, the word បដិសេធ (bɑɑdeseet) is listed to have the respelling ប៉ៈដិសែត in that dictionary, and this respelling can be used directly in the pronunciation template to generate the pronunciation. In some cases, minor modifications are needed on the respelling: ប្រធានាធិបតី (prɑthiəniəthɨppaʾdəy) has the respelling of ប្រធានាធិប៉ៈដី in the Chuon Nath Dictionary, but needs to be slightly modified to become the actual respelling ប្រ់ធានាធិបប៉ៈដី (short syllable of ប្រ, and repeated in the syllable ធិប) to be used in the template. I have been using the Khmer script picker tool to input and modify Khmer texts. I will also see if I can draft up a clearer respelling guide or conversion template as well.
Thanks for the feedback on the proposed romanisation scheme. I will request in the Beer Parlour for the automatic romanisation to be enabled. Wyang (talk) 12:44, 13 February 2018 (UTC)

How should we add a test case?

I was going to add testcases for two Khmer words that come up early when you start learning the language that are very difficult for English speakers to pronounce: the words for I and delicious.

The former, ខ្ញុំ, is already included, but the latter, ឆ្ងាញ់, is not yet.

I see the format is different from the equivalent page for Lao. The format isn't difficult but as I'm very rusty with Khmer and the details of how we transliterate it here, I'm not sure what to put in all of the fields. Would it be of benefit to use ឆ្ងាញ់ as a short tutorial example on how to add a testcase either right here, or on the testcase page? By the way it would be beneficial if the testcase page had a direct link to the testcase data page so people can add testcases if they have language knowledge but not so much module/Lua/scripting/hacking knowledge. — hippietrail (talk) 00:27, 13 February 2018 (UTC)

@Hippietrail Oops, sorry, I forgot to include the link for the testcase data: Module:km-pron/testcases/data. I have added the case of ឆ្ងាញ់ to the list: diff. You can just add new words and leave out some fields (transcription, IPA, etc.) if you are unsure. Wyang (talk) 00:36, 13 February 2018 (UTC)
Here's another term: ប្រធានាធិបតី. --Lo Ximiendo (talk) 00:54, 13 February 2018 (UTC)
And another: ឧសភា. --Lo Ximiendo (talk) 00:56, 13 February 2018 (UTC)
@Lo Ximiendo Both have been added. Wyang (talk) 01:52, 13 February 2018 (UTC)
Then this: ហ្វៃហ្វា. --Lo Ximiendo (talk) 04:22, 13 February 2018 (UTC)
@Lo Ximiendo Done. Wyang (talk) 13:09, 13 February 2018 (UTC)

Display of diacritics

(Notifying Stephen G. Brown, Wyang, Octahedron80): Hi. I am thinking of replacing instances of  ំ with in Wiktionary:Khmer_romanization#Diacritics - just the first column. Any objections? They get left-aligned in the table, though.--Anatoli T. (обсудить/вклад) 11:47, 17 March 2018 (UTC)

No objections from me, go ahead please! Wyang (talk) 11:50, 17 March 2018 (UTC)
Done. Not perfect, though. --Anatoli T. (обсудить/вклад) 12:30, 17 March 2018 (UTC)