Module talk:as-translit/testcases

From Wiktionary, the free dictionary

@Wyang Please help. DerekWinters (talk) 04:13, 25 July 2017 (UTC)

@DerekWinters lol Aryaman (मुझसे बात करो) 04:21, 25 July 2017 (UTC)
Does Assamese do schwa-dropping? The module doesn't as of now. —Aryaman (मुझसे बात करो) 04:28, 25 July 2017 (UTC)
It does, but Anatoli asked me to do a bare-bones version. The schwa-dropping is confusing for sure, but is similar to Bengali's. Also, many Sanskrit terms are borrowed by way of Bengali, which influences the schwa-dropping. DerekWinters (talk) 04:35, 25 July 2017 (UTC)
Oh, that's what I thought. Schwas are the hardest part of these translit modules... —Aryaman (मुझसे बात करो) 04:43, 25 July 2017 (UTC)
@DerekWinters, Aryamanarora More testcases please! I will try to figure out the rules from the testcases... Wyang (talk) 06:02, 25 July 2017 (UTC)
@Wyang I added some more testcases that should be fairly representative of the set. Xobdo does a pretty good job (if not perfect, with some inconsistencies) of showing how a word will be pronounced, but not in proper IPA style. DerekWinters (talk) 16:06, 25 July 2017 (UTC)
@DerekWinters Thanks. I added some rules for Assamese. I see that the transcriptions on this page are not consistent with Wiktionary:Assamese transliteration; for example, স is s here. Should the policy page be modified to agree with this, or vice versa? For now I've converted the testcases to conform to the policy page. Wyang (talk) 07:58, 26 July 2017 (UTC)
@Wyang There are many cases where the expected pronunciation of a letter will change in Assamese, almost always in tatsam words. Thus, it will be puspô, spôrxô, etc. DerekWinters (talk) 14:18, 26 July 2017 (UTC)
@DerekWinters, Aryamanarora So... is there a way to know from orthography which pronunciation these letters will have? Would we like this module to be a transliterative or transcriptive module? (Wiktionary:Assamese transliteration still needs to be modified if we decide to use a more transcriptive system of romanisation.) Wyang (talk) 22:05, 26 July 2017 (UTC)
@Wyang: I don't know much about Assamese, but looking at the test cases, this assimilation to "s" only occurs at the beginning of a consonant cluster (and clusters generally occur only in tatsam words). IMO this module should be as transcriptive as is reasonably possible. —Aryaman (मुझसे बात करो) 22:49, 26 July 2017 (UTC)
@Wyang When শ, ষ, স are the first member of a conjunct (almost only, if not only, in tatsam words) they are to be transliterated as 's', as in কৃষ্ণ (krisnô) or সংস্কৃত (xôngskrit). When ব is the second (or last) member of a conjunct (also almost only in tatsam words) it is to be transliterated as w, acting essentially like ৱ, as in ত্বৰিৎ (twôrit) or ধ্বংস (dhwôngxô). য as a second member, as in ব্যৱহাৰ (byôwôhar), acts like a য়. That should be about it. DerekWinters (talk) 02:08, 27 July 2017 (UTC)
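The conjunct rules in the comment above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of Module:as-translit: the function name `conjunct_overrides` is hypothetical, the input is assumed to be plain code points with no ZWJ/ZWNJ, and only the three rules just described are covered.

```python
VIRAMA = "\u09cd"                            # ্ (hoxonto)
SIBILANTS = {"\u09b6", "\u09b7", "\u09b8"}   # শ, ষ, স
BA = "\u09ac"                                # ব
YA = "\u09af"                                # য


def conjunct_overrides(word):
    """Return {index: latin} for letters whose value changes inside a conjunct.

    Hypothetical helper: letters not listed keep their default value
    (for example x for the sibilants, b for ব).
    """
    overrides = {}
    for i, ch in enumerate(word):
        # "First member": the letter is followed by virama + another letter.
        starts_conjunct = i + 2 < len(word) and word[i + 1] == VIRAMA
        # "Second (or last) member": the letter is preceded by a virama.
        follows_virama = i > 0 and word[i - 1] == VIRAMA
        if ch in SIBILANTS and starts_conjunct and not follows_virama:
            overrides[i] = "s"
        elif ch == BA and follows_virama:
            overrides[i] = "w"
        elif ch == YA and follows_virama:
            overrides[i] = "y"
    return overrides
```

For কৃষ্ণ the only override is the ষ at index 2 (read as s); for ধ্বংস it is the ব at index 2 (read as w), while the final স keeps its default x.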
@DerekWinters, Aryamanarora Ah okay, I added the rule for শ / ষ / স. There is still one failed testcase: why is য different in শূন্য and আশ্চৰ্য? Wyang (talk) 12:37, 27 July 2017 (UTC)
@Wyang This is where the way Unicode works is not exactly ideal. If you look, the য in শূন্য is the squiggle (jôfôla in Bengali) and really should be considered a separate letter, but in আশ্চৰ্য it is still a form of য. I guess the only time it'll be pronounced j as a second member of a conjunct is if it's ৰ ্ য. DerekWinters (talk) 12:50, 27 July 2017 (UTC)
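A minimal sketch of that exception, assuming the word is handled as a plain sequence of code points: য after a virama is read as y unless the first member of the conjunct is ৰ, in which case it is read as j. The function name `ya_value_in_conjunct` is made up for illustration and is not taken from the module.

```python
VIRAMA = "\u09cd"  # ্ (hoxonto)
RA = "\u09f0"      # ৰ
YA = "\u09af"      # য


def ya_value_in_conjunct(word, i):
    """Return the Latin value of the য at index i, the second member of a conjunct."""
    if i < 2 or word[i] != YA or word[i - 1] != VIRAMA:
        raise ValueError("index does not point at a য in second position of a conjunct")
    # Only ৰ + ্ + য keeps the full-letter value j; elsewhere it acts like য় (y).
    return "j" if word[i - 2] == RA else "y"
```

With the two testcases from the question, the য of শূন্য (index 4) comes out as y and the য of আশ্চৰ্য (index 6) comes out as j.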
Terrific. I implemented that exception rule. All the testcases are displaying correctly now. Wyang (talk) 12:59, 27 July 2017 (UTC)
@Wyang Perfect! I think we can implement it now! DerekWinters (talk) 15:02, 27 July 2017 (UTC)

@Sagir Ahmed Msa নমস্কাৰ! Could you take a look at the test cases and see if the expected transliteration is correct? DerekWinters (talk) 17:47, 9 August 2017 (UTC)

User:Sagir Ahmed Msa Hi! Yes, Assamese does schwa dropping; the vowel is actually /ɔ/ in Assamese. In native and other Indo-Aryan words the /ɔ/ is mostly dropped at the end of words, but after conjunct consonants it is not dropped. For example: কাম is /kam/ (kam), but কৰ্ম is /kɔɹmɔ/ (kormo). In বান্ধ /bandʱ/ (bandh), the vowel is dropped. This is the only native/Indo-Aryan word where the conjunct consonant drops the vowel, I guess, but maybe some speakers pronounce উত্তৰাখণ্ড and ঝাৰখণ্ড as "Uttorakhond" and "Zharkhond" respectively, while খণ্ড is always "Khondo" (maybe Hindi/English influence?). And sometimes it depends on the context. For example: অসম /ɔχɔm/ (oxom) means "Assam" but অসম /ɔːxomo/ (o xomo) means "non plain/equal", which is a Sanskrit loanword; the native version for "equal" is সমান /xɔman/ (xoman). জন is /zɔnɔ/ in "Jana gana mana.." and জনসংখ্যা (/zɔnɔxɔiŋkʰa/); but it's /zɔn/ in classifiers like মানুহজন (manuhzon) "the man". Another rule for native words: it's not dropped after the -ব (-bo) suffix in verbs, like কৰিব (koribo), লাগিব (lagibo), হ’ব (hóbo) etc. I think /ɔ/ is almost always present after ব (bo) in native words (at the end). A native word where it's dropped is ৰবাব (robab) "pomelo". I can't remember more.
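The word-final part of this description could be approximated with a heuristic like the one below. It is a rough sketch and not the module's logic: the names `keeps_final_vowel` and `verb_suffix_ba` are hypothetical, the exception set holds only the one word named above, and context-dependent cases such as অসম or জন cannot be decided from spelling alone, so they are out of scope.

```python
VIRAMA = "\u09cd"  # ্ (hoxonto)
BA = "\u09ac"      # ব
# বান্ধ (bandh): the one native conjunct-final word said to drop the vowel.
KNOWN_EXCEPTIONS = {"\u09ac\u09be\u09a8\u09cd\u09a7"}


def is_consonant(ch):
    """True for the consonant letters ক..হ plus ৰ, ৱ, ড়, ঢ়, য়."""
    return "\u0995" <= ch <= "\u09b9" or ch in "\u09f0\u09f1\u09dc\u09dd\u09df"


def keeps_final_vowel(word, verb_suffix_ba=False):
    """Guess whether the final inherent vowel is pronounced in a native word.

    verb_suffix_ba must be supplied by the caller: the spelling alone cannot
    tell the verbal suffix -ব (koribo) from a stem-final ব (robab).
    """
    if not word or not is_consonant(word[-1]):
        return False  # the word ends in a vowel sign or other mark
    if word in KNOWN_EXCEPTIONS:
        return False
    if verb_suffix_ba and word[-1] == BA:
        return True
    # The vowel is kept after a word-final conjunct (consonant + virama + consonant).
    return len(word) >= 3 and word[-2] == VIRAMA and is_consonant(word[-3])
```

Under these assumptions কাম gives False (kam), কৰ্ম gives True (kormo), and কৰিব gives True only when the caller marks it as a verb form.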

In foreign/non-Indo-Aryan words the vowel is always dropped; in fact the alphabet sometimes works like a semi-abjad (which happens with other Indic alphabets of modern Indo-Aryan languages too), especially in Arabic and Persian loanwords, and it is heavily influenced by the Bengali alphabet and the Urdu alphabet. For example আহমেদ (ahmed, not ahomed), ইছলাম (islam, not isolam). It also happens in native words, but mostly in combined words, like ঘৰখন (ghorkhon), which is "ghor (house)" and "khon (the)", কলগছ (kolgos), which is "kol (banana)" and "gos (tree)" etc. And it also happens in other foreign words, like ফুটবল (futbol, not futobol), ঈদগাহ (idgah, not idogah) etc. In foreign words even conjunct consonants drop the vowel. For example: ইংলেণ্ড (inglend), ফ্ৰান্স/ফ্ৰাঞ্চ (frans) etc. The vowels ’ (অ’) /o/ and ও /ʊ/ are used for foreign words that end with "o".

Next: ব/ৱ and য/য় (after "virama/hoxonto"). In native and other Indo-Aryan words, they follow many rules.

1) Beginning. When ্ব is at the beginning of a word, it's silent, but শ, ষ and স (x) change to "s". For example: ধ্বংস is "dhongxo" /dʱɔŋxɔ/, not "dhwongxo", and ত্বৰিৎ is "torit" /tɔɹit/, not "tworito". (Sometimes the w is pronounced though, due to Hindi influence; the word স্বাগতম is sometimes pronounced as "swagotom" on news channels. But we can ignore it.) Other words: স্বাধীন (sadhin), স্ব (so), জ্বৰ (zor) etc. In the case of ্য, it's pronounced as "y" or "e", or is silent, depending on the vowel. For example: with the "a" vowel = জ্ঞান (gyan, the ঞ works like য in this case), ত্যাগ (tyag), শ্যামল (syamol) etc.; with the "o" (অ) vowel = ব্যক্তি (bekti); with other vowels = জ্যোতি (zúti) /zʊti/.

2) Inside: in the case of ব, the other consonant becomes double. For example: ঈশ্বৰ (issor), বিশ্বাস (bissax). Although in some words the ব (bo) remains "bo", like আহ্বান (ahban), গহ্বৰ (gohbor). When I checked old Assamese coins of the Ahom kingdom, I saw a dot below ব, and it was র (wo), and the dot was used in conjuncts as well. The word স্বৰ্গদেউ was written like স্বৰ্গদের with a dot below স্ব. I think it differentiated between ব (bo) and ৱ (wo). (Maybe that dot was no longer used due to Bengali influence? Or did it stop being used during the Ahom kingdom?) In the case of ্য, an "i" comes before the conjunct and the other consonant becomes doubled, for example: ধন্যবাদ (dhoinnobad, not dhonyobad), কল্যাণ (koillan, not kolyan), বিজ্ঞান (biggan, not bigyan or biznian) etc.

3) At the end: it is the same as 2). For example: বিশ্ব (bisso), ঘনত্ব (ghonotto), শস্য (xoisso) etc.
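The three positional rules can be summarised in a small sketch that works on Latin values rather than on the script. Everything here is an assumption made for illustration: the function name `realise_second_member`, the idea of passing the first consonant as a romanised letter, and the single-entry set of consonants after which ব stays "b". Word-initial ্য is left out because its value depends on the following vowel, and aspirated first members (dh, th and so on) are not handled.

```python
# After h the ব stays "b": আহ্বান (ahban), গহ্বৰ (gohbor).
B_KEPT_AFTER = {"h"}


def realise_second_member(first, second, initial):
    """Return (inserted_vowel, consonants) for a conjunct ending in ব or য.

    first   -- Latin value of the first consonant, e.g. "s" or "n"
    second  -- "b" for ্ব, "y" for ্য
    initial -- True when the conjunct is at the beginning of the word
    """
    if second == "b":
        if initial:
            return ("", first)            # ্ব is silent word-initially
        if first in B_KEPT_AFTER:
            return ("", first + "b")
        return ("", first + first)        # the other consonant is doubled
    if second == "y":
        if initial:
            raise ValueError("word-initial ya depends on the following vowel")
        return ("i", first + first)       # "i" before the conjunct, then doubling
    raise ValueError("second must be 'b' or 'y'")
```

For example, the ন্য of ধন্যবাদ yields ("i", "nn"), which slots in as dho + i + nn + obad, and the শ্ব of ঈশ্বৰ yields ("", "ss") as in issor.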

For foreign words ব is both w and b. য is y.

(I used a different romanisation: o = ô (অ), ó = o (অ’), e = ê (এ), é = e (এ’), ú = û (ও).)


Thanks.

@Sagir Ahmed Msa ধন্যবাদ! That was a lot of really good information. The romanization we're using is what I found on Wiktionary, and it also keeps it in line with Bengali, Oriya, etc.
Could you add/change the expected values in the module directly? That would really help. DerekWinters (talk) 15:27, 11 August 2017 (UTC)

@Wyang, Atitarev, DerekWinters I think this is ready, and we have an Assamese contributor that probably needs it. —Aryaman (मुझसे बात करो) 19:08, 15 September 2017 (UTC)

@DerekWinters Great job! I don't know much about Assamese but thanks for looking into it. The schwa-dropping can be left for future generations to work out :). --Anatoli T. (обсудить/вклад) 22:07, 15 September 2017 (UTC)

@Atitarev: That might be sooner than you think :) only 6 failed testcases right now. Where is DerekWinters though? —Aryaman (मुझसे बात करो) 00:20, 22 September 2017 (UTC)

@Aryaman Hi, can we use a different romanisation for vowels? In the current romanisation, the commonly used vowels অ এ ঐ ও ঔ use diacritics/accents but the rare ones অ’ and এ’ don't. These diacritics aren't available in the Google Indic keyboard (which is very good for many Indic languages including Assamese, because of its better transliteration), so I need to switch keyboards many times. IPA is already given in Wiktionary:Assamese transliteration, Help:IPA/Assamese and Assamese language, and many sounds still differ even between closely related languages that use the same or a similar romanisation; for example, in Assamese "t th d dh" are alveolar, but in Bengali and Odia they are dental. I suggest:

অ = o
অ’ = ó
এ = e
এ’ = é
ঐ = oi
ও = ü (or ú)
ঔ = ou

Typing is easier if we don't use diacritics for commonly used vowels (no need to hold and select the correct accent.)
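To see what the proposal would do to existing romanisations, here is a small sketch that rewrites a string from the current scheme into the suggested one. It assumes the correspondences given in the earlier note (ô for অ, o for অ’, ê for এ, e for এ’, û for ও) and picks ü rather than ú for ও; the names `CURRENT_TO_PROPOSED` and `to_proposed` are hypothetical.

```python
import unicodedata

# All replacements happen in a single pass, so ô -> o is not
# subsequently turned into ó by the o -> ó rule.
CURRENT_TO_PROPOSED = str.maketrans({
    "\u00f4": "o",        # ô (অ)  -> o
    "o": "\u00f3",        # o (অ’) -> ó
    "\u00ea": "e",        # ê (এ)  -> e
    "e": "\u00e9",        # e (এ’) -> é
    "\u00fb": "\u00fc",   # û (ও)  -> ü (the proposal also allows ú)
})


def to_proposed(translit):
    """Rewrite a current-scheme romanisation into the proposed scheme."""
    # NFC makes sure accented vowels are single code points before mapping.
    return unicodedata.normalize("NFC", translit).translate(CURRENT_TO_PROPOSED)
```

With this mapping krisnô becomes krisno and xôngskrit becomes xongskrit, while a plain o in the current scheme (the rare অ’) picks up an accent and becomes ó.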

-- Sagir

conjuncts

@User:Aryamanarora Hi, can you make the ৰ্ক to rk (instead of rko) rule effective only in word-final position? The same goes for ৰ্খ ৰ্চ ৰ্ছ ৰ্জ ৰ্ট etc., which are rare in Indo-Aryan words but used in other loanwords. Sagir Ahmed Msa (talk) 20:11, 22 October 2017 (UTC)

@Sagir Ahmed Msa: Done. Aryaman (मुझसे बात करो) 21:13, 22 October 2017 (UTC)
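The requested behaviour might look like the check below. It is a guess at the shape of the rule and not the change that was actually made to the module. The function name and the consonant set are hypothetical: the set lists only the six letters named in the request and ignores the "etc.", and the example word পাৰ্ক (park) is supplied here for illustration and does not come from the thread.

```python
VIRAMA = "\u09cd"  # ্ (hoxonto)
RA = "\u09f0"      # ৰ
# Second members named in the request: ক, খ, চ, ছ, জ, ট.
LOANWORD_SECOND_MEMBERS = {
    "\u0995", "\u0996", "\u099a", "\u099b", "\u099c", "\u099f",
}


def drops_vowel_after_final_r_conjunct(word):
    """True if the word ends in ৰ + ্ + one of the listed consonants.

    Only then is the final inherent vowel dropped (rk rather than rko);
    the same conjunct elsewhere in the word is left alone.
    """
    return (
        len(word) >= 3
        and word[-3] == RA
        and word[-2] == VIRAMA
        and word[-1] in LOANWORD_SECOND_MEMBERS
    )
```

The check is true for পাৰ্ক, false for কৰ্ম (which keeps its vowel, kormo) and false when ৰ্ক is followed by a vowel sign.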