Module talk:hi-translit

Module testing

Testing मानक (mānak) हिन्दी (hindī): [DIRECT MODULE CALL REDACTED], expected: "mānak hindī". --Anatoli (обсудить/вклад) 02:14, 24 April 2013 (UTC)

Testing conjuncts: [DIRECT MODULE CALL REDACTED] --Anatoli (обсудить/вклад) 02:25, 24 April 2013 (UTC)

Testing some text:

Original: हिन्दी संवैधानिक रूप से भारत की प्रथम राजभाषा और भारत की सबसे अधिक बोली और समझी जानेवाली भाषा है। चीनी के बाद यह विश्व में सबसे अधिक बोली जाने वाली भाषा भी है।

हिन्दी और इसकी बोलियाँ उत्तर एवं मध्य भारत के विविध राज्यों में बोली जाती हैं । भारत और अन्य देशों में में ६० करोड़ से अधिक लोग हिन्दी बोलते, पढ़ते और लिखते हैं। फ़िजी, मॉरिशस, गयाना, सूरीनाम की अधिकतर और नेपाल की कुछ जनता हिन्दी बोलती है।

हिन्दी राष्ट्रभाषा, राजभाषा, सम्पर्क भाषा, जनभाषा के सोपानों को पार कर विश्वभाषा बनने की ओर अग्रसर है। भाषा विकास क्षेत्र से जुड़े वैज्ञानिकों की भविष्यवाणी हिन्दी प्रेमियों के लिए बड़ी सन्तोषजनक है कि आने वाले समय में विश्वस्तर पर अन्तर्राष्ट्रीय महत्त्व की जो चन्द भाषाएँ होंगी उनमें हिन्दी भी प्रमुख होगी।:

Translit: [DIRECT MODULE CALL REDACTED]. --Anatoli (обсудить/вклад) 02:36, 24 April 2013 (UTC)

To do

  1. Remove the inherent vowel (a) from consonants before diacritics, virama and at the end of words (or default them to be without "a" and add it when necessary).
  2. Try to implement the rules of dropping the inherent a.

Should the default consonants be without "a"? --Anatoli (обсудить/вклад) 02:42, 24 April 2013 (UTC)

Not if the same system is to be used for Sanskrit. If you do remove the inherent vowel for Hindi, at least make a separate Module:sa-translit where it isn't removed. —Angr 12:00, 22 June 2013 (UTC)
Fair enough, but I'm not currently working on this. I don't know how to do it, or whether it's possible. --Anatoli (обсудить/вклад) 12:09, 22 June 2013 (UTC)

This should be transliterated as "shr", as per w:Hunterian transliteration. I don't want to break anything, so can anyone add it? Aryamanarora (talk) 19:52, 27 October 2015 (UTC)

We don't use Hunterian transliteration but a variety of IAST, so "śra" is correct. See also WT:HI TR. --Anatoli T. (обсудить/вклад) 22:40, 27 October 2015 (UTC)
Ahh, okay, thanks for the answer :) Aryamanarora (talk) 00:40, 28 October 2015 (UTC)

Double width tilde

There's a problem with the combining double-width tilde (U+0360) that's used on the vowels ऐं a͠i and औं a͠u; the combining character is being inserted before the letters when the correct codepoint placement for a double-width combining character is between the two letters. This means it appears straddling the a and the letter before it. (I suspect that the person who made the mistake was using a font that misrendered combining characters too far to the right, causing them to put it in the wrong place). 86.138.252.217 08:30, 12 May 2016 (UTC)[reply]

@Wyang I noticed this as well. —Aryamanarora (मुझसे बात करो) 14:48, 12 May 2016 (UTC)
I changed the placement of the tilde to between the letters. The person who originally suggested placing the tilde before the first letter probably had an aesthetic concern (compare ha͠i and h͠ai). Wyang (talk) 09:21, 15 May 2016 (UTC)
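
For anyone checking the code point order by hand, here is a minimal Python sketch (the module itself is Lua, and the helper name nasal_diphthong is hypothetical). U+0360 COMBINING DOUBLE TILDE attaches to the letter before it and extends over the next one, so it has to sit between the two vowel letters:

```python
import unicodedata

DOUBLE_TILDE = "\u0360"  # COMBINING DOUBLE TILDE


def nasal_diphthong(first: str, second: str) -> str:
    """Join two vowel letters with one double tilde spanning both (hypothetical helper)."""
    return first + DOUBLE_TILDE + second


correct = "h" + nasal_diphthong("a", "i")  # h, a, U+0360, i: tilde spans "ai"
misplaced = "h" + DOUBLE_TILDE + "ai"      # h, U+0360, a, i: tilde spans "ha"

# Listing the code point names makes the difference visible in any font.
for label, text in (("correct", correct), ("misplaced", misplaced)):
    print(label, [unicodedata.name(ch) for ch in text])
```

Printing the names rather than the rendered string avoids depending on how a particular font positions the mark.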

ज्ञ (jña) transliteration

@Atitarev, Wyang Wouldn't it be more accurate (and helpful) to transliterate this as gy? —Aryamanarora (मुझसे बात करो) 19:27, 9 April 2017 (UTC)

I don't have a strong opinion on this. I like IAST but we don't follow it 100% for Hindi.--Anatoli T. (обсудить/вклад) 22:01, 9 April 2017 (UTC)
@Aryamanarora It seems to be listed in the translit as gy, yet it still transliterates as . How is that the case? — AWESOME meeos * ([nʲɪ‿bʲɪ.spɐˈko.ɪtʲ]) 22:54, 9 April 2017 (UTC)
@Awesomemeeos I kind of understand Lua, and it seems to me this line text = gsub(text, 'ज्ञ', conv) at the end overrides that somehow? —Aryamanarora (मुझसे बात करो) 22:55, 9 April 2017 (UTC)
@Aryamanarora I think you're right about that... but the question is whether we should keep the cluster untouched... You should comment out the cluster line until further notice.
Since this module very much generates a transcription rather than a transliteration, I think gy would be better. But then, this module isn't entirely transcriptive either... Wyang (talk) 09:59, 10 April 2017 (UTC)
@Aryamanarora, Wyang, DerekWinters Hi. I have formed my opinion now on the transliteration of ज्ञ. I vote for preserving "jñ" as in the Oxford Hindi-English dictionary. ज्ञान (jñān) is transliterated "jñān" (as the current default). Our current Hindi transliteration is otherwise quite phonetic. I would actually prefer that we also use this dictionary to decide how to handle shwa-dropping in ambiguous cases. --Anatoli T. (обсудить/вклад) 12:12, 11 April 2017 (UTC)
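
To illustrate why a separate substitution pass for the cluster decides the outcome, here is a small Python sketch. It is not the module's Lua code: the tables are toy ones, the consonants are emitted without an inherent vowel for brevity, and the function and parameter names are made up. Whichever pass sees ज्ञ as a whole determines the result, regardless of what the per-letter table says:

```python
# Toy per-letter tables (assumed for illustration; the real module's are far larger).
CONSONANTS = {"ज": "j", "ञ": "ñ", "न": "n"}
VOWEL_SIGNS = {"ा": "ā"}
VIRAMA = "्"


def translit(text, clusters=None):
    """Transliterate letter by letter after an optional whole-cluster pass."""
    # Pass 1: whole clusters are replaced first, so they win over per-letter rules.
    for cluster, roman in (clusters or {}).items():
        text = text.replace(cluster, roman)
    # Pass 2: everything still in Devanagari is converted one character at a time.
    out = []
    for ch in text:
        if ch == VIRAMA:
            continue  # the virama only suppresses a vowel; it emits nothing itself
        out.append(CONSONANTS.get(ch) or VOWEL_SIGNS.get(ch) or ch)
    return "".join(out)


print(translit("ज्ञान"))                  # per-letter result keeps j + ñ
print(translit("ज्ञान", {"ज्ञ": "gy"}))  # the cluster pass overrides it
```

Either outcome is a one-line change in such a design, which is why the choice between jñ and gy is a policy question rather than a technical one.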

ॲ

@Aryamanarora The Unicode Standard says this character is specific to Marathi:

Independent vowel for Marathi
0972 ॲ DEVANAGARI LETTER CANDRA A

I could imagine it being used for Hindi, but have you seen it in actual use for Hindi? There may be (possibly prescriptive) rules regarding when to use ॲ and ऑ, but people seem to spell a single English word in Devanagari in many different ways, so those rules aren't always applied.

Also, are Marathi modules still reliant on Module:hi-translit, since it says

It is also used to transliterate Marathi (mr), Awadhi (awa), Haryanvi (bgc), Braj (bra) and Garhwali (gbm).

and because the Marathi modules are still in development?

Kutchkutch (talk) 21:00, 8 November 2017 (UTC)

@Kutchkutch: Yes, the "also used to transliterate" list is always up to date, because it is generated by a module that searches for the languages that use this transliteration module. (That is, the languages that have "hi-translit" in their language data file. Marathi's data file is in Module:languages/data2.) — Eru·tuon 23:35, 8 November 2017 (UTC)[reply]
Hi @Erutuon:. Thanks for the detailed answer!
I've noticed that you've contributed to Marathi phonology.
I was thinking of making a Module:typing-aids/data/omr based on Module:typing-aids/data/inc-pra for typing in the Modi script, but as Aryamanarora correctly pointed out, none of the available Modi fonts work properly, so there's no point until the Noto fonts support it. Kutchkutch (talk) 00:45, 9 November 2017 (UTC)
@Kutchkutch: Sorry, forgot to reply to this. The only reason I added the character was that Marathi translit was relying on this module. I think MOD:mr-translit is functioning reasonably well now, though, so I've switched Marathi to rely on that. I will remove ॲ from this module now, since Hindi (like the closely related Hindi belt languages that use this module) does not use the chandra a. —Aryaman (मुझसे बात करो) 01:30, 9 November 2017 (UTC)
@Aryamanarora: Thanks for clarifying! If Marathi's reliance on MOD:hi-translit is removed, then any future change to MOD:hi-translit that affects MOD:mr-translit will need to be copied manually.
There's always a chance Bombay/Bambaiya Hindi might use it through close contact with Marathi, but that would require evidence of repeated usage and not just one instance. Kutchkutch (talk) 01:53, 9 November 2017 (UTC)
@Kutchkutch: No problem! I'm not sure about Bambaiya, since it's so hard to find it written down. —Aryaman (मुझसे बात करो) 02:04, 9 November 2017 (UTC)
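
The automatically generated "also used to transliterate" list described earlier in this thread can be pictured roughly as follows. This is only a Python sketch with a toy data table: the "translit" field name, the table layout and the function name are assumptions for illustration, not the actual structure of the language data modules.

```python
# Toy stand-in for per-language data. The language names come from the list quoted
# in this thread; the "translit" field and this layout are assumed for illustration.
LANGUAGE_DATA = {
    "hi": {"name": "Hindi", "translit": "hi-translit"},
    "awa": {"name": "Awadhi", "translit": "hi-translit"},
    "bra": {"name": "Braj", "translit": "hi-translit"},
    "mr": {"name": "Marathi", "translit": "mr-translit"},  # after the switch
}


def languages_using(module_name, data=LANGUAGE_DATA, exclude=()):
    """Return the sorted names of languages whose entry names the given module."""
    return sorted(
        entry["name"]
        for code, entry in data.items()
        if entry.get("translit") == module_name and code not in exclude
    )


print(languages_using("hi-translit", exclude=("hi",)))
```

Because the list is derived from the data on every call, switching one language to another module removes it from the list without any edit to the documentation.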

Words like ख़ुशबुओं

@Atitarev, AryamanA This word is transliterated with final -on, not . Is this correct? Benwing2 (talk) 01:44, 10 August 2020 (UTC)

@Benwing, AryamanA: No, should be .
We also have a bit of discrepancy between दाँत (dā̃t) and दान्त (dānt), pronounced the same way, the former should also be "dānt" in this case. Anusvara (a single dot ं) or chandrabindu (a semicircle with a dot ँ) are pronounced the same way as their alt forms with न (n) or म (m) in certain positions. --Anatoli T. (обсудить/вклад) 01:54, 10 August 2020 (UTC)[reply]
@Atitarev, Benwing2: It should always be nasalized word-finally, I'll fix the module. The example Anatoli gave is a bit tricky; this is an issue of orthographic standardization (or lack thereof), and I'm not sure what to do about it. I believe long vowels force the chandrabindu to behave like the anusvara. —AryamanA (मुझसे बात करेंयोगदान) 02:55, 10 August 2020 (UTC)[reply]
Fixed (only word final). —AryamanA (मुझसे बात करेंयोगदान) 03:28, 10 August 2020 (UTC)

थ़

@AryamanA, Benwing2: Is the letter थ़ used for loanwords from English? E.g. थ़ैंक्स and थ़ैंक्यू: hardly any Google hits, but plenty for the nuqtaless spellings थैंक्स (tha͠iks) and थैंक्यू (tha͠ikyū). --Anatoli T. (обсудить/вклад) 03:49, 10 August 2020 (UTC)

@Atitarev: It's not actually used in mainstream spelling, the nuqtaless one is standard. I'm not really sure why it exists. —AryamanA (मुझसे बात करेंयोगदान) 14:41, 10 August 2020 (UTC)[reply]
@AryamanA, Benwing2 I wonder if we should adopt nuqta'ed forms as main entries for these rare loanwords, for pronunciation and transliteration purposes. -Anatoli T. (обсудить/вклад) 04:15, 11 August 2020 (UTC)
@Atitarev, AryamanA I guess Aryamanarora's point is that these words aren't normally pronounced with [θ] either. Benwing2 (talk) 04:26, 11 August 2020 (UTC)
@AryamanA, Benwing2: But they are, aren't they? I think it's just the letter थ़ that is not common out there. That's what you hear in Bollywood movies. I'm not sure it's code-switching; it's too common for that. Hindi Wikipedia uses English "thin" as an example of where it should be used.--Anatoli T. (обсудить/вклад) 04:32, 11 August 2020 (UTC)
@AryamanA, Benwing2: Like with a few other nuqta'ed letters, both pronunciations may be valid, e.g. ख़ूबानी (xūbānī) can be read as its nuqtaless form खुबानी (khubānī) both /xuː.bɑː.niː/ and /kʰuː.bɑː.niː/. --Anatoli T. (обсудить/вклад) 04:45, 11 August 2020 (UTC)[reply]
@AryamanA, Benwing2: I went ahead and created थ़ैंक्स and थैंक्स (tha͠iks) as example entries. Perhaps they should be reversed in what is the main entry or थ़ैंक्स should only be used as a phonetic respelling. What do you think is right, @AryamanA? In any case, it's an example where spelling and pronunciations may differ. --Anatoli T. (обсудить/вклад) 07:13, 11 August 2020 (UTC)[reply]
@Atitarev: I think I can safely say that no Hindi speaker, educated in English or not, has the /θ/ in their phoneme inventory. I don't think the entries for थ़ैंक्स etc. are necessary, nor have I ever seen the use of the nuqta on th (so I doubt it would pass CFI). They also are not necessary for phonetic respelling IMO, given, like I said, /θ/ isn't a phoneme in Standard Indian English even. —AryamanA (मुझसे बात करेंयोगदान) 14:33, 11 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Thanks. I was wrong again. I have reversed entries and made थ़ैंक्स just a hard redirect. It must be code-switching then when people use /θ/ in English words. --Anatoli T. (обсудить/вклад) 23:09, 11 August 2020 (UTC)[reply]
@Atitarev, Benwing2: It's no worry, thanks for understanding. Personally, even when codeswitching, I use थ in my speech (not थ़), so I don't think it's very common. (I have a General American accent in English, but it becomes more Indian English-like when codeswitching with Hindi). —AryamanA (मुझसे बात करेंयोगदान) 23:22, 11 August 2020 (UTC)[reply]

॰ changes

@AryamanA, Atitarev Aryaman, I don't think your recent change involving ॰ (see Talk:अवकलन) is correct. Its former purpose was to force a pronounced schwa without deleting any other schwas. I created various phonetic respellings last night based on this understanding, and they are no longer correct. Its new effect in अवकलन is equivalent to a virama, meaning it essentially isn't adding anything new, and there's no longer a way of forcing a pronounced schwa. Even in its old usage it was problematic because if placed between an anusvara and a ग, it would wrongly cause the anusvara to be transliterated as n instead of . IMO it's also counterintuitive to place it before the consonant after which the schwa should be retained, instead of at the position of the retained schwa. Finally, it's being transliterated as a ., which means it can't be used in subst= respellings in {{ux}}/{{uxi}}. I think the solution to all these issues is to support a * placed at the position of the retained schwa, which transliterates to a while not causing schwa deletion elsewhere. Benwing2 (talk) 01:33, 13 August 2020 (UTC)[reply]

@Benwing2, AryamanA: I agree with @Benwing2. --Anatoli T. (обсудить/вклад) 02:26, 13 August 2020 (UTC)
@Atitarev, Benwing2: Ah, looks like I misunderstood earlier; all that the ॰ does, as I have seen in dictionaries, is indicate a morpheme boundary which affects the schwa deletion such that each morpheme is schwa deleted as an independent unit. So yeah, we don't really have a schwa adder, and I am good with adding * as one. In that case, I agree. —AryamanA (मुझसे बात करेंयोगदान) 02:52, 13 August 2020 (UTC)[reply]
@Benwing2, AryamanA: Thanks. @AryamanA, do you want to keep ॰ in cases like प्रजनन (prajanan) (i.e. प्र॰जनन)? It makes sense, but if we use *, it would be easier (no need to know the morphemes or have a thorough knowledge of Hindi) and consistent, and the case described above could not arise. Also, I think it's fair to use the symbol ॰ for its intended purpose and transliterate it accordingly. What do you think? Basically we would mostly need only two additional symbols for respellings: a virama and *. Even the virama, if it's hard to type, could be replaced with a different symbol (e.g. ^) for ease of typing. It's not exposed to users, anyway. --Anatoli T. (обсудить/вклад) 03:04, 13 August 2020 (UTC)
@Benwing2, Atitarev: I agree that the virama (with a potential equivalent alternative such as ^) and a schwa-adding * are sufficient, and definitely easier to use! BTW, I recently published some research on modelling schwa deletion (https://www.aclweb.org/anthology/2020.acl-main.696/), but the model we developed was in Python--it did outperform the Wiktionary hi-translit significantly, perhaps it would be a good idea for me to port it to Lua in the coming days? —AryamanA (मुझसे बात करेंयोगदान) 03:32, 13 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Yes, shwa-deletion algorithms are interesting and we should probably use the latest knowledge on this. I won't be programming here, but I think the light shwa is probably worth considering for transliteration purposes as well, later on. There's still too much to do, it seems. Thanks for the link, I'll have a read. --Anatoli T. (обсудить/вклад) 03:38, 13 August 2020 (UTC)
@Atitarev: Yeah, machine learning has made the problem a lot easier. Basically, we used ML to generate a decision tree for schwa deletion based on the McGregor dictionary's transliterations. Only issue really is porting the code, which is kind of a pain. Also been working on Bengali with a Bengali linguist, so maybe will be working on improving infrastructure in that in the near future. —AryamanA (मुझसे बात करेंयोगदान) 04:40, 13 August 2020 (UTC)[reply]
@AryamanA I'm a bit wary of using ML for schwa deletion because it introduces unpredictability. I'd be interested in knowing how well your model performed but in practice it won't be perfect, and as a result there may be no practical way for someone to know how the transliteration would turn out without previewing the page and checking what got produced. (OTOH I'd also be interested in hearing how the Chinese-language Wiktionary community fares with transliteration, since AFAIK they use a large lookup table to handle this.) The best of both worlds would be a smarter algorithm than what we have that still maintains a certain amount of predictability. Benwing2 (talk) 05:00, 13 August 2020 (UTC)[reply]
@Benwing2: Yeah, that is certainly a worry... the model gets 98% accuracy per-schwa (on test set, so overall it will perform better than that since it will encounter words it was already trained on) whereas Wiktionary had 94%, but there's certainly some Wiktionary got right that the model got wrong. I will try porting it some day. It's just such an intractable problem! No good linguistic approaches to it, and rule-based systems are also imperfect. And yet, native speakers have no trouble at all figuring it out without even knowing the process exists. Going off on a bit of a tangent, but I think the solution probably lies in syllable weight and stress, which so far are notoriously understudied in Hindi, and not so easy to implement computationally. —AryamanA (मुझसे बात करेंयोगदान) 05:10, 13 August 2020 (UTC)[reply]
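
The two respelling markers discussed in this thread can be sketched in Python as below. This is a deliberately crude toy, not the module's algorithm or the ML model mentioned above: it works on a romanised spelling in which every inherent vowel is written out as "a", it assumes single-letter consonants, and its only rules are (1) drop a unit-final schwa unless it is the unit's only vowel and (2) drop a schwa in the context V C _ C V, scanning right to left. All function names are made up. ॰ splits the word into units that are processed independently, and * protects the schwa it follows.

```python
BOUNDARY = "\u0970"  # the sign ॰, used here as a morpheme boundary
KEEP = "*"           # placed right after a schwa that must be retained
VOWELS = set("aāiīuūeo")


def _tokens(unit):
    """Split a unit into [char, protected] pairs; '*' protects the preceding char."""
    toks = []
    for ch in unit:
        if ch == KEEP and toks:
            toks[-1][1] = True
        else:
            toks.append([ch, False])
    return toks


def _is_vowel(tok):
    return tok[0] in VOWELS


def _is_schwa(tok):
    return tok[0] == "a" and not tok[1]  # protected schwas never count as deletable


def delete_schwas(unit):
    """Apply the two toy deletion rules to one unit."""
    toks = _tokens(unit)
    # Rule 1: a final schwa goes, unless it is the unit's only vowel (toy assumption).
    if toks and _is_schwa(toks[-1]) and any(_is_vowel(t) for t in toks[:-1]):
        toks.pop()
    # Rule 2: a medial schwa goes in the context V C _ C V, scanning right to left.
    for i in range(len(toks) - 1, 1, -1):
        if (_is_schwa(toks[i]) and i + 2 < len(toks)
                and not _is_vowel(toks[i - 1]) and _is_vowel(toks[i - 2])
                and not _is_vowel(toks[i + 1]) and _is_vowel(toks[i + 2])):
            del toks[i]
    return "".join(ch for ch, _ in toks)


def respell(word):
    """Process each ॰-delimited unit independently and join the results."""
    return "".join(delete_schwas(unit) for unit in word.split(BOUNDARY))


print(respell("prajanana"))             # no markers
print(respell("pra" + BOUNDARY + "janana"))  # morpheme boundary after "pra"
print(respell("praja*nana"))            # schwa after "j" protected
```

Under these toy rules the unmarked spelling loses the schwa after "j", while both the boundary and the * marker yield prajanan. The * version needs no knowledge of where the morphemes are, which is the convenience argued for above.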

॰ in translit

(moved from Talk:अवकलन)

@AryamanA, Atitarev The translit says avkalan but the phonetic respelling results in avakalan. Which is correct? Benwing2 (talk) 00:15, 13 August 2020 (UTC)

@Benwing2, Atitarev: Fixed, was an issue with the ॰ in translit. Now schwa dropping is done after converting it, so it will behave the same as "-" does. —AryamanA (मुझसे बात करेंयोगदान) 00:27, 13 August 2020 (UTC)

anusvara after long vowels

@Atitarev, AryamanA We have काँत (kā̃t) with chandrabindu indicated as a nasal vowel, but कींत (kī̃t) with anusvara indicated as n. Is this correct? I read somewhere that anusvara is used after vowels that rise above the line and chandrabindu after vowels that don't rise above the line, but otherwise they are equivalent. Should we fix this module to indicate both of them as nasal vowels after long vowels? Benwing2 (talk) 23:14, 15 August 2020 (UTC)[reply]

@Atitarev, AryamanA And what about after e, o, ai, au, which are long but not transliterated with a macron? Benwing2 (talk) 23:15, 15 August 2020 (UTC)[reply]
@Benwing2, AryamanA: No difference in pronunciation. See also Talk:बांह. --Anatoli T. (обсудить/вклад) 23:27, 15 August 2020 (UTC)[reply]
@Atitarev, AryamanA How are these words pronounced? The example of बांह (bā̃h) is pronounced with a nasal vowel before h, but does this apply also before stop sounds? Benwing2 (talk) 00:05, 16 August 2020 (UTC)[reply]
@Atitarev, Benwing2: They cause nasalization + assimilation of a nasal consonant to the next sound. So काँत is /kɑ̃ːn̪t̪/. —AryamanA (मुझसे बात करेंयोगदान) 00:07, 16 August 2020 (UTC)[reply]
@Atitarev, AryamanA In that case they should all be transliterated with n/ṅ/ṇ/ñ/m not a tilde, right? Benwing2 (talk) 00:08, 16 August 2020 (UTC)[reply]
Can you clarify all the cases where anusvara/chandrabindu are pronounced as nasal vowels with or without a nasal consonant? I can then fix the module appropriately. Benwing2 (talk) 00:10, 16 August 2020 (UTC)[reply]
@Benwing2, Atitarev: I fixed it myself, and updated the nasal_assim table in hi-translit as well for ह. —AryamanA (मुझसे बात करेंयोगदान) 00:15, 16 August 2020 (UTC)[reply]

@Atitarev, Benwing2: So I read a paper about this "Nasal Epenthesis in Hindi" by Manjari Ohala and John Ohala (who just recently passed away unfortunately). They observed insertion of nasal consonants after chandrabindu only before voiced stops, and now that I think about it I agree with that, e.g. साँप (sā̃p) is not sāmp, it's a pure nasal for me. I've updated the module to reflect that now. —AryamanA (मुझसे बात करेंयोगदान) 17:06, 23 August 2020 (UTC)[reply]
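
The observation in the last comment (a nasal consonant is inserted after a nasalised long vowel only before voiced stops) can be written down as a short Python sketch. It illustrates the rule as stated in this thread and is not the module's code; the function name and the romanised input format are assumptions.

```python
COMBINING_TILDE = "\u0303"

# Homorganic nasal for each voiced stop (romanised); other consonants get none.
HOMORGANIC_NASAL = {
    "g": "ṅ", "gh": "ṅ",
    "j": "ñ", "jh": "ñ",
    "ḍ": "ṇ", "ḍh": "ṇ",
    "d": "n", "dh": "n",
    "b": "m", "bh": "m",
}


def nasalised(long_vowel, following):
    """Return the nasal vowel, plus a homorganic nasal before a voiced stop."""
    return long_vowel + COMBINING_TILDE + HOMORGANIC_NASAL.get(following, "")


# साँप: voiceless stop follows, so the vowel is purely nasal.
print("s" + nasalised("ā", "p") + "p")
# Before a voiced stop such as d, a homorganic n is inserted after the nasal vowel.
print("c" + nasalised("ā", "d") + "d")
```

Keeping the voiced stops in a lookup table means other consonants, including ह, fall through to the pure nasal vowel without needing a rule of their own.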

a after non-native consonant clusters

(moved from Talk:चश्मदीद) @Atitarev, AryamanA MacGregor's dictionary seems to imply this should be caśmdīd. Benwing2 (talk) 21:41, 16 August 2020 (UTC)[reply]

@Atitarev, Benwing2: Sorry, I made an edit to hi-IPA that broke the weakened schwa. The śm has a weak schwa after it, as it isn't a normal Hindi consonant cluster. —AryamanA (मुझसे बात करेंयोगदान) 23:48, 16 August 2020 (UTC)[reply]
@AryamanA, Atitarev So it should be transliterated with an a after śm? Benwing2 (talk) 00:03, 17 August 2020 (UTC)[reply]
@Atitarev, Benwing2: Yes, in my opinion. McGregor is sometimes too generous with consonant clusters, e.g. he transliterates अमन (amn) by hypercorrection to the Arabic. —AryamanA (मुझसे बात करेंयोगदान) 00:14, 17 August 2020 (UTC)[reply]
@AryamanA, Atitarev In that case presumably this also applies to क़िस्म and lots of other Arabic-derived words we are currently transliterating without a? Benwing2 (talk) 00:20, 17 August 2020 (UTC)[reply]
Just to clarify, such words all have manual translit without the final a. Benwing2 (talk) 00:21, 17 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Yes, it seems a little inconsistent about how we treat the weakened schwa. I have already provided a number of respellings with virama to words like क़िस्म (qisma) - "qism". --Anatoli T. (обсудить/вклад) 00:34, 17 August 2020 (UTC)[reply]
@AryamanA, Atitarev I have done the same. Benwing2 (talk) 01:08, 17 August 2020 (UTC)[reply]
(edit conflict) @AryamanA, Benwing2: If I am not mistaken, the previous approach was to suppress "a" in transliterations (maybe that should have been automated?) but display a /ᵊ/ in IPA automatically. I'm not sure myself. If virama is added to hi-IPA, then obviously, no vowel is displayed in pronunciations. Now I think no virama should be added to hi-IPA in this case (क़िस्म). --Anatoli T. (обсудить/вклад) 01:10, 17 August 2020 (UTC)[reply]
@AryamanA, Atitarev IMO it depends on what the underlying phonemic identity of reduced schwa is. If it's phonemic we should display it as ă; if it's an allophone of regular schwa, display as a, if it's predictably inserted, display as nothing. Benwing2 (talk) 01:16, 17 August 2020 (UTC)[reply]
@Benwing2, AryamanA: Well the phonemic identity of all consonants, which don't have a virama (virama may be invisible, if they are part of the consonant conjunct) is "consonant + an inherent 'a'", so you will see that some online dictionaries, like https://www.shabdkosh.com/ don't bother and display all inherent "a". We are more phonetic. --Anatoli T. (обсудить/вклад) 01:27, 17 August 2020 (UTC)[reply]
@AryamanA, Atitarev Not sure I believe that there's an underlying phonemic schwa in all clusters without written virama, as schwa deletion seems far too unpredictable. Maybe you're thinking of the writing system, not the phonology? Benwing2 (talk) 01:30, 17 August 2020 (UTC)[reply]
@Benwing2, Atitarev: Agree with Benwing here, but I don't think Atitarev was suggesting adding the schwa every time it is written. क़िस्म I feel does not have a (weakened) schwa following it. I think it has to do with which clusters are "easy" for Hindi speakers (which may vary by dialect and education). I believe Kachru's Hindi grammar looks at acceptable clusters in some part of it, but there hasn't really been much work on Hindi phonotactics to help us out. —AryamanA (मुझसे बात करेंयोगदान) 02:24, 17 August 2020 (UTC)[reply]
@Benwing2, AryamanA: There is a high level of predictability in shwa removal, but the underlying phonemic value of each consonant letter is "Ca" (consonant + shwa). The "a" (shwa) is removed, and is not phonemic, in the following cases:
  1. Between the consonants of a cluster (conjunct), where there is an invisible virama (e.g. between s and m in क़िस्म).
  2. When a consonant is followed by an explicit vowel sign.
  3. In final position with a visible virama.
All other cases of shwa dropping are not phonemic and follow an algorithm, perfect or imperfect, occasionally with an optional or weakened shwa in various positions. भारत (bhārat) is written bh-ā-ra-ta: the "a" in bha has no value, as it is deleted by the ā sign, but the "a" in ta is removed by the algorithm. भारतीय (bhārtīya) is spelled "bh-ā-ra-tī-ya". This time "a" is (optionally) dropped after "r" because it's followed by the next syllable. The "a" in "ya" is also weakened/optional. McGregor romanises it as "bhārătīyă". It's a matter of which approach and algorithm is taken. Even this simple example needs to be decided on. It can be transliterated as "bhārtīy", "bhārătīyă" or "bhārtīya", depending on who you ask. --Anatoli T. (обсудить/вклад) 23:39, 17 August 2020 (UTC)
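
The three orthographic cases listed in the comment above can be sketched in Python as follows. The tables are tiny toy ones and the function name is made up. The sketch covers only the spelling-level step, in which a consonant letter carries "a" unless a vowel sign or virama follows it; the later, algorithmic schwa dropping that turns bhārata into bhārat is deliberately left out.

```python
# Toy tables covering only the letters needed for the examples (assumed, not complete).
CONSONANTS = {"भ": "bh", "र": "r", "त": "t", "य": "y", "स": "s", "म": "m", "क": "k"}
VOWEL_SIGNS = {"ा": "ā", "ि": "i", "ी": "ī"}
VIRAMA = "्"


def orthographic(word):
    """Romanise with every inherent 'a' written, except where the spelling removes it."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # The inherent vowel survives unless a vowel sign or a virama follows.
            if nxt != VIRAMA and nxt not in VOWEL_SIGNS:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch != VIRAMA:
            out.append(ch)  # pass anything unknown through unchanged
    return "".join(out)


print(orthographic("भारत"))    # bh-ā-ra-ta
print(orthographic("भारतीय"))  # bh-ā-ra-tī-ya
print(orthographic("किस्म"))    # virama between s and m, so no vowel there
```

Everything that happens after this step (final a, the a after r in भारतीय, weakened shwa) is the part the thread is still trying to agree on.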
@AryamanA, Atitarev OK, if we follow this strictly it means we have no reliable references for transliteration (except maybe Kachru's Hindi grammar which I don't have access to), which means it's going to be very difficult for anyone but you (or another native speaker maybe) to know how to transliterate such terms. For example, I just encountered जशन (jaśn) transliterated jaśn when the default would be jaśan. What is correct here? Are there any rules indicating which clusters take weakened schwas and which ones don't? If there are no rules and no references, there's no realistic way for Anatoli or I to help out fixing up entries. Also we need to agree how to transliterate weakened schwas. What is their phonemic status? Sorry to bother you with questions like this; I just feel we need to iron out the principles rather than go with our intuition on a case-by-case basis. Benwing2 (talk) 04:43, 17 August 2020 (UTC)[reply]
@Benwing2, AryamanA: Good point, @Benwing2. We need to iron out the rules but I can imagine it may be even difficult for a native speaker. We could use a scenario-by-scenario approach, agree on it and stick to it. Post agreed rules in a central place. We also need to make sure that the three parts agree - headword transliteration (if required, or automated), declension + generated IPA.
I find Rupert Snell's approach is, at least consistent, matches what we have agreed on so far, but my dictionary and textbooks are far from comprehensive and you guys probably don't have access to it. (He is the author of Teach Yourself Hindi series, including a dictionary).
One of the latest Oxford Hindi-English dictionaries came with transliterations, covering weakened shwa. It's in my possession (I have just got it back). I can report how certain words are transliterated there. I still struggle when searching in Devanagari, though.
Oxford gives जशन transliterated as "jaśan" with a note P. = jaśn. It looks like an irregular pronunciation here. --Anatoli T. (обсудить/вклад) 05:05, 17 August 2020 (UTC)[reply]
@Benwing2, AryamanA: Pardon me. The pronunciation of जशन is default here, so should be "jaśan". P. stands for "Persian" and "jaśn" is just a note how the word is pronounced in Persian. IMO, if anyone says "jaśn", imitates the original pronunciation. Pls confirm. --Anatoli T. (обсудить/вклад) 05:12, 17 August 2020 (UTC)[reply]
क़िस्म (qisma) is given as "qism", not "qismă" by Oxford. A weakened shwa is represented as "ă" in the dictionary. --Anatoli T. (обсудить/вклад) 05:30, 17 August 2020 (UTC)[reply]

@Benwing2, AryamanA: Another cluster: अक़्ल is transliterated as "aql" by Oxford. Pls let me know your thoughts. --Anatoli T. (обсудить/вклад) 07:10, 17 August 2020 (UTC)

@Benwing2, Atitarev: I say aqal... So, I noticed some patterns. Some of these clusters are resolved by schwa-insertion. So क़िस्म is either qism (if that cluster exists in your dialect, which it does in mine) or qisam (if it doesn't), but never qismă. Here's an idea: add whatever examples you come across with strange clusters to User:AryamanA/Hindi_clusters and I will transcribe how I say them. Maybe we can find some sort of system for it. —AryamanA (मुझसे बात करेंयोगदान) 07:28, 17 August 2020 (UTC)[reply]
@Benwing2, AryamanA: OK. You can in turn ask me about a small number of words you're particularly interested in; I will also search through recent pings and discussions. क़िस्म couldn't be qisam since it has a virama between s and m, no? Does it seem that Oxford's translit is closer to your dialect, so far? I recommend investing in it, if you haven't already. It's a quality dictionary, especially compared with any other Hindi dictionaries available, IMO. Not digitised, I am afraid. --Anatoli T. (обсудить/вклад) 07:40, 17 August 2020 (UTC)
@Atitarev, Benwing2: I will be adding hopefully as large a range of clusters as possible (it's 4 am here so probably not now!) I do in fact own a copy of the McGregor dictionary :) I love it, it is a great resource. (Assuming you are talking about that one, it's digitised at https://dsalsrv04.uchicago.edu/dictionaries/mcgregor/. They haven't publicized the link on their main site for some reason.) —AryamanA (मुझसे बात करेंयोगदान) 07:44, 17 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Cool! I did use it in the past but wasn't sure it's the same version and quality. Since everyone can access it, Why don't we base the transliteration (not necessarily IPA) rules entirely or almost entirely on McGregor's OHED and focus on other important things? --Anatoli T. (обсудить/вклад) 07:57, 17 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Re: अक़्ल, funny that in Urdu, pronunciation "aqal" for عقل is considered illiterate per Fallon (same site). --Anatoli T. (обсудить/вклад) 08:06, 17 August 2020 (UTC)[reply]
@Atitarev, Benwing2: Guess I can't read or write now 🤷🏽‍♂️ and yes, standardizing on McGregor is probably the best solution lol. (On a serious note, being able to say some of these weird clusters does indicate higher education/prestige/etc.) —AryamanA (मुझसे बात करेंयोगदान) 08:12, 17 August 2020 (UTC)[reply]
@AryamanA, Benwing2: McGregor uses weakened shwa to romanise व्यवसाय (vyavasāy): "vyavăsāy". Rupert Snell uses "vyavasāy", which one should we use for transliteration? --Anatoli T. (обсудить/вклад) 11:16, 17 August 2020 (UTC)[reply]
@Atitarev, Benwing2: I say vyavsāy, and a cursory look at Youtube confirms that that is the most common one. —AryamanA (मुझसे बात करेंयोगदान) 15:16, 17 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Thanks. Now we have a discrepancy with the dictionary or we need to decide on weakened shwa treatment.
Another cluster, normally transliterated as "vowel + tr": पात्र (pātra), मित्र (mitra). Should we transliterate them as "pātr"/"mitr" or "pātra"/"mitra"? McGregor uses "pātr"/"mitr". R Snell uses "pātra"/"mitra". This is a very common ending, let's decide.
BTW, pls acknowledge my post at 23:39, 17 Aug above. --Anatoli T. (обсудить/вклад) 10:42, 18 August 2020 (UTC)[reply]

@AryamanA, Benwing2: Hi. The module doesn't drop the shwa after "y" for some reason. कोयला is transliterated as "koylā" in McGregor or Snell. --Anatoli T. (обсудить/вклад) 08:36, 21 August 2020 (UTC)[reply]

@Atitarev, Benwing2: Fixed it! Let me know if it broke anything. I removed the special exceptions for how य is treated, it should be the same as any other consonant now. —AryamanA (मुझसे बात करेंयोगदान) 18:41, 21 August 2020 (UTC)[reply]
@AryamanA, Atitarev This is definitely going to break some things. In general, when making changes like this, we need to set up tracking to find the places where the pronun changes and decide for each one what to do. This is how I handled such changes to e.g. Module:ru-pron in the past. Benwing2 (talk) 20:12, 22 August 2020 (UTC)[reply]
@AryamanA, Benwing2: This was a positive and correct change. "y" should be treated like other consonants for shwa dropping but it wasn't always. Please add tracking, if it's required and doable. Also, it would be good to show the "respelled", either as a new column or as in Module:ru-pron/testcases, e.g. "Зимба́бве (respelled Зимба́бвэ)". --Anatoli T. (обсудить/вклад) 23:11, 22 August 2020 (UTC)[reply]
@AryamanA, Atitarev I wrote a script to find the places where the translit changed as a result of the above change. Following is the list of all such cases. Aryaman, if you could, please add a notation e.g. 'old' or '*' by all the ones below where the old translit is the correct one; these ones need manual translit and pronun added. Thanks!

Benwing2 (talk) 02:23, 27 August 2020 (UTC)[reply]

@Benwing2: All fixed! Thanks for going through it. I just went ahead and fixed the ones that needed it manually. —AryamanA (मुझसे बात करेंयोगदान) 04:37, 27 August 2020 (UTC)[reply]
@AryamanA Thank you! I notice that for words like गोपनीयता (gopnīytā), the etym is given as गोपनीय (gopnīya) + ता (), where the full word appears as gopnīytā but the constituent part as gopnīya with extra final a. Is that correct? Benwing2 (talk) 04:50, 27 August 2020 (UTC)[reply]
@Benwing2, AryamanA It was a correct fix to treat "y" like other consonants for shwa dropping. @AryamanA has fixed इंजीनियरी (iñjīniyarī) where shwa is required (also in McGregor). I have changed डायरी (ḍāyarī), making some assumptions (please confirm). The rest seems good now. --Anatoli T. (обсудить/вклад) 10:27, 27 August 2020 (UTC)[reply]
@Benwing2, Atitarev: Final īy should cause a weakened schwa, but that schwa is dropped when the suffix is added, so this seems like proper behaviour to me. I fixed डायरी (ḍāyrī) since it can be either with or without that schwa. —AryamanA (मुझसे बात करेंयोगदान) 15:01, 27 August 2020 (UTC)[reply]

Again on the weakened shwa

@Benwing2, AryamanA: McGregor romanises कंपनी as "kampănī", Snell as "kampanī". The module currently gives "kampnī".

What is our preference in transliterations? A new symbol would solve this problem, IMO, in both transliterations and pronunciations.

Would transliteration "kampănī" (using a new symbol "ă" for the weakened shwa) and pronunciation like /kəmpᵊniː/ be acceptable? --Anatoli T. (обсудить/вклад) 07:53, 22 August 2020 (UTC)[reply]

@Atitarev, Benwing2: I think perhaps we need a symbol for the weakened schwa then. It's a little frustrating since Hindi phonologies don't really mention how the weakened schwa occurs. I suspect it has something to do with unstressed syllables. —AryamanA (मुझसे बात करेंयोगदान) 14:31, 22 August 2020 (UTC)[reply]
@Benwing2, AryamanA: OK, thanks, @AryamanA. @Benwing2, what do you think is best? Alternatively, * could be added to the respelling and hi-IPA, so that we get transliteration "kampanī" and alt IPA like this: /kəm.pniː/, [kə̃m.pn̪iː], /kəm.pə.niː/, [kə̃m.pə.n̪iː]. --Anatoli T. (обсудить/вклад) 23:18, 22 August 2020 (UTC)[reply]
@Atitarev: I actually do prefer having the two pronunciations. Word-final weakened schwa is enough of a complication as is. —AryamanA (मुझसे बात करेंयोगदान) 23:54, 22 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Done. --Anatoli T. (обсудить/вклад) 01:37, 23 August 2020 (UTC)[reply]
@AryamanA, Benwing2: I have given the same treatment to उपग्रह, transliterated by McGregor as "upă-grah". In this case, the module did the reverse by default ("upagrah"), it added a shwa. I have given both readings/pronunciations - both "upgrah" and "upagrah". One manual translit was just to maintain the order and in case we change the module. Change, if you think it's not right. --Anatoli T. (обсудить/вклад) 01:57, 23 August 2020 (UTC)[reply]
@Atitarev: This I think is more reasonably weakened schwa-worthy, because it's a morpheme boundary between उप- and ग्रह. Maybe this can be a rule actually: only have weakened schwas marked definitely at the ends of morphemes/words. —AryamanA (मुझसे बात करेंयोगदान) 02:11, 23 August 2020 (UTC)[reply]
@AryamanA: Could you please clarify what your actual preference is here? We don't have a symbol for the weakened schwa but we do have a symbol in IPA, which can't be forced(?), so what transliteration and IPA do you want to see in this case? --Anatoli T. (обсудить/вклад) 02:19, 23 August 2020 (UTC)[reply]
@AryamanA, Benwing2: Also, I thought providing both readings would be a compromise for lacking a weakened shwa handling as per above case with "kampnī"/"kampanī", correct me if I am wrong. --Anatoli T. (обсудить/вклад) 02:22, 23 August 2020 (UTC)[reply]
@Atitarev: Okay actually I misunderstood, sorry! As is, this is perfect, since the weakened schwa here is not easily predictable. —AryamanA (मुझसे बात करेंयोगदान) 02:51, 23 August 2020 (UTC)[reply]
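As a rough illustration of the "give both readings" compromise, the following Python sketch expands a respelling that marks an unpredictable schwa as optional. The "(a)" marker and the function `expand_optional_schwa` are invented for this example; they are not syntax accepted by the module or by hi-IPA.

```python
# Hypothetical sketch: "(a)" marks a schwa that may or may not be pronounced.
# Every combination of dropped and retained schwas is generated.
from itertools import product


def expand_optional_schwa(pattern):
    """Return all readings of `pattern`, dropped-schwa variants first."""
    parts = pattern.split("(a)")
    slots = len(parts) - 1              # number of optional schwas
    variants = []
    for choice in product(["", "a"], repeat=slots):
        reading = parts[0]
        for filler, rest in zip(choice, parts[1:]):
            reading += filler + rest
        variants.append(reading)
    return variants


if __name__ == "__main__":
    print(expand_optional_schwa("kamp(a)nī"))
    print(expand_optional_schwa("up(a)grah"))
```

A word with no marker comes back unchanged as a single reading, so the same call could be applied to every entry.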

@AryamanA, Benwing2: I have added a similar handling (as "kampnī"/"kampanī") to रेस्तराँ (restarā̃) romanised as "restărāṁ" by McGregor and "restrā̃" by Snell. रेस्तरां (restarā̃) is a redirect page made by @AryamanA. Alt forms - रेस्टरंट (resṭaraṇṭ), रेस्टरेंट (resṭarẽṭ) are also romanised with a weakened shwa symbol by McGregor - "resṭăraṇṭ" and "resṭăreṁṭ" --Anatoli T. (обсудить/вклад) 01:17, 1 September 2020 (UTC)[reply]

inconsistent nasal translit

@AryamanA, Atitarev The inconsistent nasal translit has returned. ऊंचाई is transliterated uñcāī but alternative ऊँचाई is transliterated ū̃cāī. Benwing2 (talk) 02:54, 24 August 2020 (UTC)[reply]

@Benwing2: Fixed: ऊंचाई (ūñcāī), ऊँचाई (ū̃cāī). The entry actually has a short vowel in the altform which is wrong haha, that confused me so much. I've added relevant testcases as well at MOD:hi-translit/testcases. —AryamanA (मुझसे बात करेंयोगदान) 03:31, 24 August 2020 (UTC)[reply]

ँ and ं

@Benwing2, Bhagadatta: The candrabindu and the anusvāra are different from each other. Why can't it simply be ँ = (curly line) and ं = what it is already? In words like आँगन, आँत, the transliteration is wrong. We can't really say they (anusvāra and anunāsika) are the same. See these two words: मांगलिक and आंतरिक. Both of these are borrowings from Sanskrit. मांगलिक = म् + आ + ङ् + ग् + अ + ल् + इ + क् (+ अ) and आँगन = आँ + ग् + अ + न. It isn't like आंग and आँग are always the same. आंतरिक = आ + न् + त् + अ + र् + इ + क् (+ अ) and आँत = आँ + त. Also, the hrasva a doesn't mean that there is always न् when it is succeeded by त. For example, अँतड़ी. It isn't pronounced as अ + न् + त् + अ + ड़् + ई. So I think it is best if ँ is always a curly line. Alternative spellings of candrabindu words with the anusvāra can have manual transliteration as in उसांस. Thanks and regards - द्विशकारःवार्त्तायोगदानानिसंरक्षितावलयःविद्युत्पत्त्रम् 06:02, 29 December 2020 (UTC)[reply]

Also @AryamanA, Atitarev. - द्विशकारःवार्त्तायोगदानानिसंरक्षितावलयःविद्युत्पत्त्रम् 09:41, 29 December 2020 (UTC)[reply]
@शब्दशोधक: This was discussed with @Benwing2, AryamanA in Talk:उसांस but I'm not sure what the outcome is. The transliteration of anusvāra depends on the position but it may be manually transliterated (and respelled) as candrabindu when it's an alternative form of the spelling with candrabindu, so there's no easy solution. We might copy that discussion to a visible place, like here. --Anatoli T. (обсудить/вклад) 09:52, 29 December 2020 (UTC)[reply]
@Atitarev: The easiest solution is this only - ँ = a curly line and ं depends on its place (which is there currently). Is it possible that आंगन = āṅgan and आँगन = ā̃gan? If it is, then that's the best. As I explained above, मांगलिक (with anusvāra) = māṅgalik and आँगन = ā̃gan. In both of these words, the ं and ँ are preceded by ā(आ) and succeeded by ga(ग), but their pronunciations are different. In such words, anusvāra is ङ्, while ँ is as it is (@Bhagadatta, do you agree?). If it isn't possible, then there's no problem, I'll manually transliterate them, but if it is possible, then why shouldn't it be rectified? Pinging @Benwing2, AryamanA in the hope that they'll do something. Thanks and regards - द्विशकारःवार्त्तायोगदानानिसंरक्षितावलयःविद्युत्पत्त्रम् 12:54, 29 December 2020 (UTC)[reply]
@शब्दशोधक, Atitarev, Benwing2, Bhagadatta: In standard Hindi, अँतड़ी is indeed pronounced अन्तड़ी, and आँग and आंग are both pronounced like आङ्ग. Before voiced consonants, ं and ँ both sound like ं (you can see the paper Nasal Epenthesis in Hindi, Ohala and Ohala, 1991 for more info on this). Currently, the module is doing more of a transcription, and it is my opinion that reflecting the phonetics is more useful for Wiktionary users as no other dictionary currently provides these distinctions. —AryamanA (मुझसे बात करेंयोगदान) 17:10, 29 December 2020 (UTC)[reply]
It would be interesting if you could upload a recording of yourself pronouncing आंगन and आँगन. For me, they would both be आङगन. If they're different for you, then perhaps the nasal rules that the module is applying are too specific to Delhi Hindi and we should not be doing them. —AryamanA (मुझसे बात करेंयोगदान) 17:11, 29 December 2020 (UTC)[reply]
I agree, they're both आङ्गन to me. @शब्दशोधक How are they different? Do you nasalize the आ instead of actually pronouncing the ङ? -- Bhagadatta(talk) 02:48, 30 December 2020 (UTC)[reply]
@Bhagadatta, AryamanA: Maybe it is different here. Yes, I and everyone else here nasalise आ instead of saying आङ्. Same with अँतड़ी. I don't say अन्तड़ी but I say अँतड़ी, with nasalisation and no nakāra. So, if it's only different for me, then the transliteration system seems fine. - द्विशकारःवार्त्तायोगदानानिसंरक्षितावलयःविद्युत्पत्त्रम् 02:54, 30 December 2020 (UTC)[reply]
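The spelling-based proposal made earlier in this thread (candrabindu always a nasalised vowel, anusvāra a nasal that depends on the following consonant) can be sketched in Python as below. This is an illustration under stated assumptions, not the module's behaviour: the function `nasal_readings` is hypothetical, "~" stands in for the tilde on the vowel, and the fallback for an anusvāra that is not followed by a stop is a guess.

```python
# Hypothetical sketch of the spelling-based rule: report how each nasal sign
# in a Devanagari word would be read.
HOMORGANIC = {}
for nasal, row in [
    ("ṅ", "कखगघङ"),   # velar row
    ("ñ", "चछजझञ"),   # palatal row
    ("ṇ", "टठडढण"),   # retroflex row
    ("n", "तथदधन"),   # dental row
    ("m", "पफबभम"),   # labial row
]:
    for consonant in row:
        HOMORGANIC[consonant] = nasal

CANDRABINDU = "\u0901"
ANUSVARA = "\u0902"


def nasal_readings(word):
    """Return one reading per nasal sign, in order of appearance."""
    readings = []
    for i, ch in enumerate(word):
        if ch == CANDRABINDU:
            readings.append("~")                 # always vowel nasalisation
        elif ch == ANUSVARA:
            following = word[i + 1] if i + 1 < len(word) else ""
            # Assumption: outside the five stop rows, fall back to "~".
            readings.append(HOMORGANIC.get(following, "~"))
    return readings


if __name__ == "__main__":
    for word in ["मांगलिक", "आँगन", "ऊंचाई", "आंतरिक"]:
        print(word, nasal_readings(word))
```

A pronunciation-based rule of the kind argued for in the replies would instead give the candrabindu in आँगन the same reading as the anusvāra in आंगन, so the two approaches differ exactly on such pairs.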
@AryamanA I'm wondering why ए ऐ ओ औ are in the short_vowel category? Because at सौंफ I noticed some weird behaviour: सौँफ is transcribed /sɔːmpʰ/, but सौँफ़ is /sɔ̃ːf/, /sɔ̃ːpʰ/. And when I do some more tests I notice that साँफ is transcribed /sɑ̃ːpʰ/. Is there a reason to not put ए ओ etc. in the long vowel category or is this an error? Exarchus (talk) 19:24, 7 January 2024 (UTC)[reply]
I think I figured out now why ए ऐ ओ औ are 'special': they are often written with anusvara because the candrabindu would get in the way of the vowel diacritic. After doing a lot of manual edits on native words with ẽ/õ, I figured out a way to reduce the manual work to just the loanwords such as एजेंट, in my proposal they would have to be respelled with homorganic nasal + virama to force the nasal consonant. The alternative is a lot of work on native words that are much more numerous. Exarchus (talk) 22:54, 11 January 2024 (UTC)[reply]

Substituting "ṇṛ" with "nz"

Why are instances of "ṇṛ" substituted with "nz"? It causes problems with Hindi words like मांड़ी (mā̃ṛī). Sbb1413 (he) (talkcontribs) 11:41, 10 June 2023 (UTC)[reply]

Since no one has responded, I have boldly commented out the line. Sbb1413 (he) (talkcontribs) 14:00, 15 November 2023 (UTC)[reply]