Wiktionary:Beer parlour/2023/September: difference between revisions

Revision as of 02:36, 16 September 2023

Nuqtaless forms in Hindi

Nuqtaless terms like अर्ज are treated only as alternative spellings here. Those words without nuqta do not exist only because of poor typesetting; they are also pronounced without the nuqta sounds. 'arz' is also pronounced as 'arj'. So in those entries the native pronunciation should be given preference, and in declension sections a transliteration reflecting the non-nuqta variant should be used, or perhaps both variants. कालमैत्री (talk) 02:38, 1 September 2023 (UTC)[reply]

No. The transliterations should be distinguished by the spelling. Where it may make sense to automatically include and prioritise the nuktaless forms is the pronunciation sections. --RichardW57m (talk) 10:44, 1 September 2023 (UTC)[reply]

This is what I said: अर्ज should be transliterated as arj, which it isn't, just like in other nuqtaless entries. कालमैत्री (talk) 11:35, 1 September 2023 (UTC)[reply]

@RichardW57m कालमैत्री (talk) 11:35, 1 September 2023 (UTC)[reply]

@कालमैत्री I am inclined to agree with Richard here if I understand what you say correctly. I think the way it's currently done is correct; dictionaries should show the forms with nuqta except in the pronunciation sections (where the pronunciation as 'arj' is already given as an alternative). The only case I think it makes sense not to have the nuqtaless form be a soft redirect is if it's taken on meanings other than the nuqta-full form. Benwing2 (talk) 20:19, 2 September 2023 (UTC)[reply]

@Benwing2 I agree with you, but there are already many entries without nuqta. So should they not show the transliteration of the non-nuqta form? Perhaps there is a misunderstanding; I am talking about the non-nuqta entries, not the nuqta ones: the former currently show the transliteration of the nuqta form. कालमैत्री (talk) 02:29, 3 September 2023 (UTC)[reply]

@कालमैत्री Since it appears from the pronunciation that the nuqtaless forms are mere spelling variants of the forms with nuqta, I don't agree that the translit should be based on the nuqtaless form. Unless the pronunciation is consistently different between nuqta-full and nuqtaless forms, the translits should be the same. This is analogous to how we handle Russian written forms with е in place of ё. Benwing2 (talk) 02:43, 3 September 2023 (UTC)[reply]

@Benwing2 They are not mere spelling variants but pronunciation variants too, as regional Hindi speakers use the pronunciation of the nuqtaless variant. So both transliterations could be used in the non-nuqta entry. Or is this unnecessary? कालमैत्री (talk) 02:52, 3 September 2023 (UTC)[reply]

@कालमैत्री I don't think it's necessary to include both, as the nuqtaless pronunciation is optional. Benwing2 (talk) 02:56, 3 September 2023 (UTC)[reply]

@Benwing2 Well, the nuqta pronunciation is similarly optional in those entries. The अंग्रेज entry uses pronunciation audio without nuqta sounds. कालमैत्री (talk) 03:11, 3 September 2023 (UTC)[reply]

@कालमैत्री, @Benwing2, @RichardW57m: The situation with nuqta and nuqta-less forms is indeed very similar to that of Russian ё (jo) / е (je) words. Regardless of the pronunciation, the spelling with е (je) is more common in regular running Russian texts for native speakers.

ё is standard, е is non-standard or just a relaxed spelling of ё: свёкла (svjókla, “beetroot”) and свекла́ (sveklá)
е is standard, афе́ра (aféra, “shady deal”) and афёра (afjóra)

What can be done for Hindi is to provide alternative entry lines where both transliteration and spelling are nuqtaless. Please take a look at this revision with my new changes of अर्ज (arj) with nuqtaless and alt. form handling. Also drawing the attention of @AryamanA. Anatoli T. ^{(обсудить}/^вклад) 06:28, 3 September 2023 (UTC)[reply]
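As an illustration of how a nuqtaless spelling relates to its nuqta form, here is a minimal Python sketch. It assumes the only difference between the two spellings is the Devanagari nukta sign (U+093C), whether typed as a separate sign or contained in a precomposed letter. The name strip_nuqta is made up for this example and is not an existing Wiktionary module or template.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(word: str) -> str:
    """Return the nuqtaless spelling of a Devanagari word (hypothetical helper)."""
    # NFD splits precomposed nuqta letters into base letter + nukta sign,
    # so both ways of typing the nuqta are handled by the same filter.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if ch != NUQTA)


# Usage: the nuqta spelling of "arz", with a separately typed nukta sign.
print(strip_nuqta("\u0905\u0930\u094d\u091c\u093c"))
```

This only derives the spelling. Whether the transliteration should follow the nuqtaless spelling or the nuqta form is the open question in this thread.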

@Atitarev, Benwing2, कालमैत्री: IMO this format is a bit cluttered. I would prefer just giving nuqtaless form as the definition, both pronunciations in IPA, but only the nuqtaless transliteration in the headword. This is just a special case of alt form so having both alt form and nuqtaless form as defns is redundant. —Aryaman^A ^{(मुझसे बात करें • योगदान)} 21:03, 3 September 2023 (UTC)[reply]

@AryamanA, @Benwing2, @कालमैत्री: Thanks for your response, Aryaman. I can revert my edit later but I've got some obvious questions:

In the case of अर्ज़ (arz) vs अर्ज (arj), the nuqtaless form is not only an alternative spelling but a spelling that matches the pronunciation. Is that always the case? And are there words or specific nuqta letters where this is not true? For example, is फिल्म (philm) ever pronounced as /pʰɪlm/, not /fɪlm/, as opposed to फ़िल्म (film)?

Since I don't know enough Hindi to judge, I'll use another analogy in Russian "ё" vs "е" spellings.

Unlike свёкла/свекла and афера/афёра, where one pronunciation is proscribed but acceptable, самолёт (samoljót) is ALWAYS pronounced as if it were spelled so, [səmɐˈlʲɵt], even when it's spelled самолет (samolet) (not to be confused with spellings and pronunciations in other languages, such as Bulgarian).

So, in the case of свёкла/свекла, афера/афёра - two definition lines with two distinct pronunciations are appropriate.

In case of самолёт/самолет, only a soft-redirect is used.

Hope it's not confusing, please advise your thoughts. Anatoli T. ^{(обсудить}/^вклад) 01:52, 4 September 2023 (UTC)[reply]

@Atitarev Yes, film is pronounced as philm in villages and also by those who speak a different dialect (however, adding a separate entry for the other might be worthless). As for whether it should have the nuqta or nuqtaless transliteration, I don't know. कालमैत्री (talk) 04:20, 4 September 2023 (UTC)[reply]

@कालमैत्री @Atitarev @AryamanA Correct me if I'm wrong but I don't think फिल्म vs. फ़िल्म ever really represent distinct pronunciations. As the last comment says, the word film (spelled either way) can be pronounced philm in villages and some dialects. So it is correct to indicate one as an alt form of the other. Benwing2 (talk) 04:50, 4 September 2023 (UTC)[reply]

@Benwing2. The question was whether nuqtaless फिल्म (philm) should be {{hi-noun|g=f|tr=film}} or just {{hi-noun|g=f}} (automatically transliterated as "philm"), or should have two definition lines, which AryamanA opposes. AryamanA's simpler suggestion, to have it both ways in the pronunciation section with no manual translit in the headword, will work for me as well. I've made अर्ज (arj) simpler in this revision.

Anatoli T. ^{(обсудить}/^вклад) 05:04, 4 September 2023 (UTC)[reply]

@Atitarev I see, yes I agree with not having two POS headers or definition lines. I would probably rather include the manual translit since the pronunciation is not determined by whether there's a nuqta or not, and having a difference in translit could wrongly lead someone to believe this. Benwing2 (talk) 05:22, 4 September 2023 (UTC)[reply]

@Benwing2: But having the same transliteration for two different spellings could lead people to believe there was a speck of dust on the screen. The punctilious would use the different spellings to indicate whether /f/ was permitted or not. --RichardW57m (talk) 10:39, 4 September 2023 (UTC)[reply]

Automatic transliteration of katakana and hiragana

(Notifying Eirikr, TAKASUGI Shinji, Atitarev, Fish bowl, Poketalker, Cnilep, Marlin Setia1, Huhu9001, 荒巻モロゾフ, 片割れ靴下, Onionbar, Shen233, Alves9, Cpt.Guapo, Sartma, Lugria, LittleWhole, Chuterix, Mcph2): Is there any reason we don't have automatic transliteration of katakana and hiragana? It seems silly that we have to add manual transliterations to things like {{l|ja|アメリカ}} and {{l|ja|すし}}. —Mahāgaja · talk 07:44, 1 September 2023 (UTC)[reply]

Probably because of potential word boundaries / spacing. —Fish bowl (talk) 07:45, 1 September 2023 (UTC)[reply]

Is that more of an issue for katakana/hiragana than it is for hangeul, which does have automatic transliteration? —Mahāgaja · talk 07:48, 1 September 2023 (UTC)[reply]

Hangeul has spacing and word boundaries are clearer. AG202 (talk) 05:29, 3 September 2023 (UTC)[reply]

Instead of {{l|ja|アメリカ}}, I just use {{ja-r|アメリカ}}, which gives アメリカ (amerika). Mcph2 (talk) 07:48, 1 September 2023 (UTC)[reply]

OK, but in translation tables we (have to?) use {{t}}, which also doesn't support automatic transliteration. —Mahāgaja · talk 07:50, 1 September 2023 (UTC)[reply]

Oh, @Theknightwho has been working on automatic Japanese transliteration these days, he might have a solution. Mcph2 (talk) 07:59, 1 September 2023 (UTC)[reply]

Spacing can be manually added, e.g. {{ja-r|あいうえお}}: あいうえお (ai ueo). Mcph2 (talk) 08:03, 1 September 2023 (UTC)[reply]

Well, exactly. Are there any issues to having automatic transliteration in {{t}} (for example) that {{ja-r}} hasn't already solved? And it could work for the hiragana transliteration of kanji terms in {{t}} as well. At the moment, we have to write {{t+|ja|子猫|tr=こねこ, koneko}} at kitten#Translations, but surely it should be doable to just write {{t+|ja|子猫|tr=こねこ}} and have a module generate the romaji koneko automatically. —Mahāgaja · talk 08:13, 1 September 2023 (UTC)[reply]

@Mahāgaja -- Please don't use both kana and romaji in translation tables. The layout is already very tight, kana is unusable and potentially confusing to much of our readership, and kana text adds nothing useful anyway that we can't get from romanization.

If we don't have automatic kana → romaji conversion, please use {{t+|ja|子猫|tr=koneko}} instead.

If we do have automatic kana → romaji conversion, @User:Theknightwho, for translation tables especially, please don't use ruby -- again, the layout of translation tables is very tight and ruby text above kanji pushes things around in unhappy ways, much of our readership cannot read kana and would find this confusing, there are other usability problems (such as cut-and-paste issues discussed elsewhere), and ruby doesn't add any useful information anyway that cannot be gleaned from the romanization. ‑‑ Eiríkr Útlendi │^{Tala við mig} 17:52, 1 September 2023 (UTC)[reply]

@Eirikr There's no ruby there at the moment, and your concerns are exactly why I think we need to discuss things first before implementing big changes like that. I don't think it's insurmountable, but I haven't had the time to look into it yet. That being said, I'm not sure I completely agree with you re the value of rubytext, but that's a separate issue that we've already talked about before. Theknightwho (talk) 17:59, 1 September 2023 (UTC)[reply]

@Eirikr: I almost never add Japanese translations myself, but the facts on the ground are that almost all Japanese lines in translation boxes that involve kanji have both hiragana and romaji in the transliteration field. —Mahāgaja · talk 20:36, 1 September 2023 (UTC)[reply]

I haven't made any comprehensive effort to check EN entries for JA translations. Those that I've encountered have been scattershot, with kana present more frequently in what appeared to be older edits.

At any rate, I am strongly opposed to including kana in the parens in translation tables -- these are not useful for most readers, and the romanization suffices. I am baffled that people add the kana; it seems editors get lost in the "cool" factor of another script, and don't consider usability / usefulness. By way of counterexample, we don't include bopomofo for Chinese, for instance. ‑‑ Eiríkr Útlendi │^{Tala við mig} 21:31, 1 September 2023 (UTC)[reply]

They're pretty useful to me... AG202 (talk) 05:31, 3 September 2023 (UTC)[reply]

@Mahagaja I would wait till User:Theknightwho comes back on line, he is in the middle of implementing this. I think it works already if you explicitly specify the script as Hrkt. Benwing2 (talk) 08:49, 1 September 2023 (UTC)[reply]

@Fish bowl @Mahagaja @Mcph2 @Benwing2 It is actually already enabled if you manually specify the script as Hira, Kana or Hrkt: {{l|ja|^アメリカ|sc=Kana}} gives アメリカ (Amerika). However, this is a stopgap measure and I would prefer if we don't use it generally in entries, as adding script codes to everything would clutter up entries; it's only there so that non-Lua templates can use {{xlit}}. For now, it's best to stick with {{ja-r}} - not least because spaces aren't supported yet.

The reason for this is because I recently split Module:ja-translit in two: the old kana_to_romaji function has been replaced by Module:Hrkt-translit (Hrkt being the ISO code for all kana combined). I then moved the old module to Module:Jpan-translit, which works by scraping pages for readings in a similar fashion to the way Chinese transliteration works. The reading it generates is then given to Module:Hrkt-translit. Jpan-translit is not currently enabled, because it's pending further discussion about how we handle terms with multiple readings.

The reason for this new system is because the two modules work in very different ways, and it means we can avoid wasting resources if we know for certain that a given term is going to be in kana. There's also the fact that some languages (e.g. Ainu) don't use kanji at all, and so it makes sense to have kana transliteration be handled in a standalone way.

Just as a word of caution: don't confuse Kana (the code) with Kana (the script name). Unfortunately, the ISO picked Kana as the script code for katakana, and Hrkt for what they call "Japanese syllabaries" (i.e. hiragana + katakana, with hentaigana grouped under hiragana). I've given Hrkt the name "Kana" because it's the most accurate name for what it actually refers to, and I don't . It won't make any difference 99% of the time, but it's good to be aware just in case. Theknightwho (talk) 09:01, 1 September 2023 (UTC)[reply]

@Theknightwho Thanks for the summary. Can you answer the question of when we can expect {{l|ja|^アメリカ}} to work right without explicitly specifying the script code, and what needs to be done and what issues resolved in order for this to happen? Can't you just either rely on the autodetection of the script or make the translit module check the contents of the text being transliterated, so that if it sees it's all Kana (Hiragana or Katakana), it goes ahead and transliterates, and otherwise fails? Benwing2 (talk) 09:08, 1 September 2023 (UTC)[reply]
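The second option in the comment above (transliterate only when the text is entirely kana, otherwise fail) could look roughly like the Python sketch below. The function names are hypothetical, and using the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks as the test is an assumption for illustration; the actual modules are written in Lua and may detect scripts differently.

```python
from typing import Callable, Optional


def is_all_kana(text: str) -> bool:
    """True if every character lies in the Unicode Hiragana or Katakana block."""
    return bool(text) and all(
        "\u3040" <= ch <= "\u309f" or "\u30a0" <= ch <= "\u30ff" for ch in text
    )


def transliterate_if_kana(
    text: str, translit: Callable[[str], str]
) -> Optional[str]:
    """Run `translit` only on pure-kana text; return None (fail) otherwise."""
    if is_all_kana(text):
        return translit(text)
    return None  # kanji or mixed text: leave it to a manual transliteration


# Usage with a stand-in transliterator (identity function, for illustration only).
print(transliterate_if_kana("アメリカ", lambda s: s))
print(transliterate_if_kana("子猫", lambda s: s))
```

Returning nothing for text that contains kanji matches the idea of a placeholder that fails until the question of multiple readings is settled.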

And is it possible to add romaji automatically to the hiragana transliteration in cases like {{t+|ja|子猫|tr=こねこ}} that I mentioned above? BTW, I had never heard the term hentaigana before, and I have to say it doesn't mean what I was expecting it to mean! —Mahāgaja · talk 09:11, 1 September 2023 (UTC)[reply]

@Benwing2 My intention was that it'd be as soon as Module:Jpan-translit is enabled. At the moment, anything entered as hiragana, katakana (or a mix of the two) will always be detected as Jpan. We could use Module:Hrkt-translit for Jpan as a stopgap, and any incomplete transliterations should return nothing. Alternatively, we could make a specific code override a general code if there's a tiebreak, which would have the same result. That may be preferable, as it means script codes will be more accurate in general.

@Mahāgaja Not yet - the transliteration module can't override manual transliterations. We should be able to integrate the features of {{ja-r}} into the general link modules pretty soon, though, and at that point we should be able to update everything via bot (like we did with Mandarin). There'll need to be a few minor changes to make the syntax compatible, though, which is the main barrier at the moment. That will need to wait until I've finished my major rewrite of Module:languages and Module:links, and I don't want to add any new features to the current versions because they're already too complicated/messy as it is. That should hopefully be done by the end of the month, if not sooner, and at that point I can start working on this. No promises, though. Theknightwho (talk) 09:27, 1 September 2023 (UTC)[reply]

@Theknightwho I thought about this a bit. Changing the script detection to return Hrkt or something else other than Jpan is likely to break people's .CSS files that customize based on the Jpan script code. I would use Module:Hrkt-translit as Module:Jpan-translit and have it fail for now if it encounters Kanji. That puts a placeholder for when you resolve the issue of how to handle cases with more than one pronunciation. Benwing2 (talk) 20:15, 2 September 2023 (UTC)[reply]

Transliterating Japanese kana (both hiragana and katakana) is long overdue. It is, in fact, even simpler than Korean hangeul, but the following considerations should always be made:

Spacing, capitalisations and irregular reading for particles は (wa) (spelled as "ha") and へ (e) (spelled as "he") 東京(とうきょう)は日本(にほん)の首都(しゅと)です。 (Tōkyō wa Nihon no shuto desu.), どこへ行(い)くの？ (doko e iku no?). Notice spacing in kana spellings, ^ and separation of particles.
Morpheme boundaries and diphthong readings: 昨日(きのう) (kinō) vs 争(あらそ)う (arasou) and 新潟(にいがた) (Nīgata) vs 新(あたら)しい (atarashii). Notice the use of "." in kana. Please compare "ō" vs "ou" and "ī" vs "ii", the difference in pronunciations/transliteration mostly depends on morpheme boundaries.

Anatoli T. ^{(обсудить}/^вклад) 06:40, 3 September 2023 (UTC)[reply]
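The two considerations above can be sketched in Python as follows. This is only an illustration under stated assumptions: the kana table is partial (just the kana needed for these examples), "." is taken as the morpheme-boundary marker and a space as the word separator, capitalisation with ^ is not handled, and none of the names below correspond to the real {{ja-r}} or module code.

```python
# Partial kana table: only what the examples need (an assumption, not a full chart).
KANA = {
    "あ": "a", "い": "i", "う": "u", "が": "ga", "き": "ki", "く": "ku",
    "こ": "ko", "し": "shi", "そ": "so", "た": "ta", "ど": "do", "に": "ni",
    "の": "no", "は": "ha", "へ": "he", "ら": "ra",
}
# Irregular readings when the kana stands alone as a particle.
PARTICLES = {"は": "wa", "へ": "e"}
# Vowel sequences written as long vowels inside a single morpheme.
LONG_VOWELS = {"ou": "ō", "ii": "ī"}


def romanize_word(word: str) -> str:
    """Romanize one word; '.' marks a morpheme boundary (assumed convention)."""
    if word in PARTICLES:
        return PARTICLES[word]
    parts = []
    for morpheme in word.split("."):
        roman = "".join(KANA[kana] for kana in morpheme)
        # Long vowels are merged only within a morpheme, never across a '.'.
        for plain, long_form in LONG_VOWELS.items():
            roman = roman.replace(plain, long_form)
        parts.append(roman)
    return "".join(parts)


def romanize_phrase(phrase: str) -> str:
    """Romanize a phrase whose words are separated by spaces."""
    return " ".join(romanize_word(word) for word in phrase.split())


print(romanize_word("きのう"), romanize_word("あらそ.う"))
print(romanize_phrase("どこ へ いく の"))
```

Because the long-vowel merge runs per morpheme, the boundary marker is what keeps あらそ.う as "arasou" while きのう becomes "kinō".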

There are many languages (Yiddish is a notable example) where automatic transliteration has to be overridden for some words, so that shouldn't be a problem. We can use |tr=wa with templates like {{l}}, {{m}} and {{t}}, and use |subst=は//わ with {{ux}} and the "cite-" and "quote-" families of templates. —Mahāgaja · talk 19:25, 3 September 2023 (UTC)[reply]

@Mahagaja: Sure, that's all doable. Both {{ja-r}} and {{ja-x}} can handle irregular particle readings, as you can see in the examples or it can be done with substitutions as you suggested. Anatoli T. ^{(обсудить}/^вклад) 23:02, 4 September 2023 (UTC)[reply]

Splitting Quechua

Honestly, handling an entire family of mutually unintelligible languages which have their own ISO codes for a while now as four languages (based on the country and historical period they are/were spoken in) doesn't seem like a good idea in general. We'll need most of the codes mentioned here, but probably with slightly different names. If nobody has any fundamental issues with the split itself, I could start drawing up a list of codes and (proposed) names.

Related to this, I also believe we should prohibit the creation of lemmata of Standard Kichwa, as this case is almost identical to Standard Moroccan Amazigh: there are no speakers, and it is an artificially created mix of the Ecuadorian Quechua varieties in use that only manages to make speakers unconfident in their own language use. Thadh (talk) 10:38, 1 September 2023 (UTC)[reply]

But are there not readers and writers? --RichardW57m (talk) 10:50, 1 September 2023 (UTC)[reply]

But so are there of Klingon and Na'vi. That doesn't make it a language worthy of inclusion in the mainspace. Thadh (talk) 13:08, 1 September 2023 (UTC)[reply]

@Thadh Are there any native speakers of Standard Kichwa per se, or are they all native speakers of one of the languages it aims to standardise? Theknightwho (talk) 16:31, 2 September 2023 (UTC)[reply]

@Theknightwho: They are all speakers of the distinct dialects, and according to the literature I've read, the speakers suffer quite a lot from the prescriptive nature of the standard (i.e. think their language 'isn't correct'). Thadh (talk) 01:15, 3 September 2023 (UTC)[reply]

Bokmål and MSA have no native speakers either; the Klingon and Na'vi comparison is fatuous. Do you have evidence of the stated effects of Standard(ized) Kichwa? In the meantime, I stand with @AG202's stance. ~ Blansheflur ｡･:*:･ﾟ❀,｡ 21:24, 3 September 2023 (UTC)[reply]

See the introduction of Aschmann's A reference grammar of Ecuadorian Quichua. I'll cite a couple of passages:

"“Unified Quichua” is a special form of Quichua which has been devised in recent decades, a certain amount of literature has been produced in it (including a Bible translation called Pachacamacpac Quillcashca Shimi), and educational programs have been carried out in it. […] Unified Quichua was in its origin an artificial language, a mixture of features from various Quichua languages, with all the Spanish borrowings replaced with old (obsolete) Quichua words which the people do not know or whose meanings have changed. (Many of these obsolete words are still contemporary in other regions, some being used in other Ecuadorian Quichua languages, others being Peruvian Quechua words, and others being coined based on existing Quichua forms.) […] One unfortunate effect of Unified Quichua has been to make those who speak Quichua as their native language feel like they do not speak it well, because they don’t speak it like the academicians say they should! In reality, the native Quichua speakers represent the continuous, native tradition of the language. Another negative result has been that the Quichua young people, who are in some cases being taught the Unified Quichua in school, feel like their grandparents speak the language incorrectly, whereas in reality their grandparents are the ones who speak the language best!"

Aschmann also references the paper by Grzech et al., which write the following in their conclusion:

" […] At the same time, the linguistic features of Unified Kichwa fail to adequately represent the language which these speakers – acutely aware of linguistic micro-variation and reliant on it for constructing social belonging – perceive as their own. […] The standard currently in place is divisive and remains largely unused, mostly due to the purist ideology from which it is derived."

There are more comments of this sort, but I believe this is more than enough to conclude that Unified Kichwa is pretty similar to Standard Moroccan Amazigh and also not something we'd want in our mainspace. Thadh (talk) 22:18, 3 September 2023 (UTC)[reply]

If there's been an agreement not to include SMA lemmas (as it sounds), then I stand with you, as those two cases are most comparable. So yes, I support everything you've put forth. ~ Blansheflur ｡･:*:･ﾟ❀,｡ 22:41, 3 September 2023 (UTC)[reply]

Support splitting Quechua. Vininn126 (talk) 18:00, 1 September 2023 (UTC)[reply]

Support - no reason why Quechua should be handled as a single language. Theknightwho (talk) 16:31, 2 September 2023 (UTC)[reply]

For context for others, see: the prior discussion on Kichwa. I support splitting Quechua, but I don't think I'd support prohibiting the creation of Standard Kichwa. Even if it's not necessarily spoken, it's still written and read, and it seems like a similar situation to Modern Standard Arabic or any other created standard variety created specifically to try and "unite" other lects, for better or for worse. The fact that it makes speakers unconfident in their own usage is unfortunate, but that shouldn't stop us from including the entries if they are cited in usage (similar to how it's been made clear that we include derogatory terms). At best, we could add some kind of label or usage note to disambiguate "standard" terms. AG202 (talk) 05:43, 3 September 2023 (UTC)[reply]

I am convinced by User:AG202's argument that we should not prohibit adding Unified Quichua/Kichwa lemmas. Instead I think we should have a label "Unified Kichwa" or similar to identify them. It reminds me a bit of Rumantsch Grischun and Standard Basque, each of which is somewhat controversial and for which similar complaints have been made to the complaints being made here about Unified Kichwa, yet we don't prohibit them. In general we are a descriptive dictionary, and prohibiting a language because some people don't like it seems very prescriptivist. Benwing2 (talk) 04:38, 4 September 2023 (UTC)[reply]

I guess that is fine. I also found Category:Moroccan Amazigh language, which makes me think I have misunderstood something in previous discussions? The naming made it difficult to find, and it links to an empty Wikipedia page. I am still not sure if the language can be considered a natural language, or even one in the same way that MSA or modern Hebrew is, but I guess it's fine to keep it, provided we label it "Unified Kichwa" and keep an eye on new editors adding terms in the other Kichwa varieties. Thadh (talk) 13:02, 4 September 2023 (UTC)[reply]

@Thadh The link to Wikipedia is misspelled; the article is at Standard Moroccan Amazigh. Benwing2 (talk) 20:39, 4 September 2023 (UTC)[reply]

Fixed the link. Benwing2 (talk) 20:49, 4 September 2023 (UTC)[reply]

Support ~ Blansheflur ｡･:*:･ﾟ❀,｡ 22:42, 3 September 2023 (UTC)[reply]

Wiktionary:About Icelandic finally up

I've just got around to publishing the draft for Wiktionary:About Icelandic, which had been in the request pile. It'd be good to get the feedback of any regular contributors to Icelandic, so if there's anything missing or misrepresented feel free to add it or let me know - I haven't as yet contributed much to the language on here and don't have much familiarity with the language-specific editing norms or templates. In particular, I couldn't find anything at all in the discussion pages about the cut off date we use for when Old Norse ends and modern Icelandic begins or whether in practice it's not much of an issue. Helrasincke (talk) 17:28, 1 September 2023 (UTC)[reply]

@Helrasincke Hi, I'm not an Icelandic editor, but about the content of that page: there's this here sentence, "Following is a simplified entry for the German word orðabók (“dictionary”). It shows the fundamental elements of an Icelandic entry:", but should it say "Icelandic" instead of "German"? Did you perhaps adapt this from About German? Anyway, nice work! Kiril kovachev (talk・contribs) 00:38, 2 September 2023 (UTC)[reply]

@Kiril kovachev Whoops, well spotted! Helrasincke (talk) 15:54, 3 September 2023 (UTC)[reply]

@Helrasincke No problem! And also, I apologize for splitting hairs, but you may also want to check the "Spelling" section: there looks to be a sentence starting "Letters such as"... that goes unfinished. I guess there was meant to be a passage about symbols that changed in their usage in some way after that. Otherwise, looks good :) Kiril kovachev (talk・contribs) 18:43, 3 September 2023 (UTC)[reply]

Dingal language add request

Should the language be added? कालमैत्री (talk) 20:18, 1 September 2023 (UTC)[reply]

Or should it be treated under the Rajasthani language already on Wiktionary? कालमैत्री (talk) 20:27, 1 September 2023 (UTC)[reply]

I know nothing about Dingal, but Wikipedia suggests it's ancestral to both Rajasthani and Gujarati, which as far as I'm concerned is reason enough for it to be a separate language with a code of its own. —Mahāgaja · talk 20:45, 1 September 2023 (UTC)[reply]

How does one add the language to Wiktionary? कालमैत्री (talk) 13:06, 2 September 2023 (UTC)[reply]

The Wikipedia article shows signs of having been puffed up by editors who may or may not know what they're talking about, an issue many articles on Indian topics suffer from, so I would feel better if I could find information about the language in other sources. So far I haven't been able to find much. Glottolog doesn't seem to have it. Searching Google Books turns up little. There is a mention in an essay in Language Versus Dialect: Linguistic and Literary Essays on Hindi, Tamil, and Sarnami (ed. by Mariola Offredi, 1990), page 68, which says "The Caran were numerous above all in Marvar, whose regional language (Marvari) was later known as Dingal.6 The Dingal language entered the court thanks to the Caran and became the standard literary language in the vast Marvar region, ..." where the footnote 6 (on page 88) is "There is much discussion about the meaning of the word 'dingal'. This has been used since the nineteenth century, with reference to the literature in western Rājasthānī, known also as Marubhāṣā and Mārvārī. For the various interpretations, see MOTILAL MENARIYA 1949, 15-24. It is, however, futile to go into this discussion, since scholars have not yet come to any definite conclusions." Rajendra Kumar Dave, Society and Culture of Marwar (1992), page 103, says "Dingal—The literary form of Marwari was called Dingal. The word 'Dingal' was used for the first time by Kushallabh in his work Pingal Siromani composed in V.S. 1607-18. The word has been defined in various ways by scholars. Tessitori calls it a language of rustics and a language without grammar." (That seems harsh, considering other works call it the language of poetry.) I can't find anything on how intelligible or not it is with modern Marwari or other forms of Rajasthani. - -sche (discuss) 18:34, 3 September 2023 (UTC)[reply]

Effect of Prakrit on Dingal literature and Dingal literature, both in Hindi, might be of importance. Other such works exist, but all are in Hindi. There are also Dingal words in a Hindi dictionary here. कालमैत्री (talk) 11:02, 4 September 2023 (UTC)[reply]

McGregor says "an archaising form of early Mārvāṛī language, as used in Rājasthānī bardic poetry". कालमैत्री (talk) 11:06, 4 September 2023 (UTC)[reply]

Inclusion policy regarding given names?

I noticed we have no concrete policies regarding names (not individuals but given, middle and surnames), such as how many people must bear a name in order for it to qualify for inclusion and such. I am asking because I discovered we have very few Afrikaans-language names and so I wanted to add some. However, some of the names I had in mind—those I know from my family tree—appear to be fairly rare, several yielding less than 10,000 or in extreme cases less than 1,000 or even 100 results on FamilySearch's vital records (including duplicate records for the same person). I am still a fairly new editor here and so I obviously do not wish to accidentally create numerous unnecessary or non-notable entries as it could be annoying to clean up. I recently created Sarel which has 2,555 results on said genealogical website for South African (1600–present) records, but others like the surname Heystek / Heijstek give me less than 900 search results and I wonder if there should be a limit to what can be added. I notice we have hundreds of entries like Odajyan that just say "According to data collected by Forebears in 2014, Odajyan is the 1988048th most common surname in the United States, belonging to 1 individuals", but IMO this is not good practice. I would love to hear your opinions. Kindest regards, LunaEatsTuna (talk) 22:39, 1 September 2023 (UTC)[reply]

@LunaEatsTuna This is a good discussion we need to have. Really rare names shouldn't be present, e.g. Odajyan should maybe be given as an Armenian last name but that's all. Otherwise we'll get a Cambrian explosion of useless entries. Benwing2 (talk) 22:49, 1 September 2023 (UTC)[reply]

My thoughts exactly. Wiktionary is not a database of surnames, after all. LunaEatsTuna (talk) 23:15, 1 September 2023 (UTC)[reply]

I see it has the boilerplate statistic line with "Odajyan is the 1988048th most common surname in the United States, belonging to 1 individuals" (sic). The ordinal seems practically meaningless if only one person has the surname. —Al-Muqanna المقنع (talk) 23:35, 1 September 2023 (UTC)[reply]

I also find that it puts an utterly disproportionate focus on a statistic that’s essentially random noise at that point, too. I really don’t know why we need it. Theknightwho (talk) 21:35, 2 September 2023 (UTC)[reply]

This has been discussed before- even the person who added them doesn't care for them much, but their reasoning can be summarized as "it's better than nothing". Chuck Entz (talk) 22:42, 2 September 2023 (UTC)[reply]

We've had a couple of RFVs for surnames lately, including Klingon (which surprised me by passing), Nazndah (which failed), and Mozela (which didn't really complete). The standard for both given names and surnames that we followed in those RFVs is to treat them like ordinary words, meaning that they need three cites. In this case, a document with a list of names can be a cite, but it must be in the language we're looking for.

If Odajyan fails RFV, could we continue to list it as a transcription of the Armenian, or do we only write Armenian words in the Armenian alphabet? Thanks, —Soap— 22:50, 1 September 2023 (UTC)[reply]

About this name specifically ... I see now that it's a spelling variant of Odadjian, which would be trivially easy to cite, as it is the surname of a musician. The -ian spelling of Armenian names in general is the traditional one, at least in the United States, having been overtaken in recent years by -yan perhaps because it's more true to the Armenian pronunciation. For this name, and perhaps others, we have the -yan spelling as the standard and -ian as a variant. If we can cite Odadjian, does that mean Odajyan also passes? I'm guessing not, because even though the Armenian name is the same, we're in some sense creating a new name by Romanizing it. —Soap— 23:01, 1 September 2023 (UTC)[reply]

I think surnames being treated as regular words is a good idea. Also, would I be correct in assuming that vital records like birth certificates would not count as ordinary citations towards the inclusion of names? I do not recall reading whether such citations are even allowed on Wikt, and I would agree if they do not count for inclusion, but I am just asking for clarification. (If they were allowed, this would essentially make the de facto policy for inclusion that at minimum three people bear a name.) LunaEatsTuna (talk) 23:06, 1 September 2023 (UTC)[reply]

As far as Armenian surnames are concerned, this is what is going on. According to Armenian law, all passports record the owner's first name and surname in Armenian and in an English transcription. There is a strict scheme for automatically replacing each Armenian letter with an English letter or digraph, without regard to the actual pronunciation of the Armenian surname or the resultant English transcription. Օդաջյան (Ōdaǰyan) becomes Odajyan, Քարտաշյան (Kʻartašyan) becomes Kartashyan, Սարգսյան (Sargsyan) becomes Sargsyan, Պետրոսյան (Petrosyan) becomes Petrosyan; no variants are possible. If a person from the current iteration of Armenia (AD 1991–) emigrates to England or an English colony or becomes famous in English-language media, he will be recorded under the legally transcribed English name. This is what Forebears did for the one recent US citizen Odajyan.

The situation with the older diaspora is different. They are not bound or influenced by the Republic's transcription rules. They usually adapt their Armenian name to the local language according to their taste and with more regard to pronunciation and euphony. I can sympathize as foreigners distort my Petrosyan to things like /petroʃan/, /petroʒan/, /petrozian/. It is better to adapt it as Petrossian in English and French lands to approximate the correct /petrosjan/. Odadjian and Odajian are adapted versions of Odajyan. Many adaptation variants are possible, look at the forms of Hakobyan. Sometimes adaptation goes so far that I can't even figure out the native form: compare Bilzerian.

Since the local passport transcription system is predictable and fixed, I had chosen its form as the main one (Odajyan) and listed the old diasporan adaptations as variant forms (Odadjian, Odajian). Admittedly, the old diasporan spellings are easier to attest in English because their bearers had a better chance to be recorded in English.

Because Wiktionary's policy on names is undeveloped, I do not create foreign entries for Armenian surnames anymore. Instead, I list the passport transcription and all diasporan spellings I can find in the Descendants section of the Armenian entry as in Համամչյան (Hamamčʻyan). Vahag (talk) 10:33, 3 September 2023 (UTC)[reply]
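To illustrate why the passport scheme described above yields exactly one form per name, here is a minimal Python sketch of a letter-by-letter replacement. The table is an assumption: it contains only the letter values that can be read off the four examples quoted above (Odajyan, Kartashyan, Sargsyan, Petrosyan) and is not the official table, which is not reproduced in this thread. The function name `passport_transcribe` is hypothetical.

```python
# Partial table inferred ONLY from the four examples in the comment above.
# Keys are lowercase Armenian letters, written as escapes to avoid look-alikes.
PASSPORT_TABLE = {
    "\u0561": "a",   # ա
    "\u0563": "g",   # գ
    "\u0564": "d",   # դ
    "\u0565": "e",   # ե
    "\u0575": "y",   # յ
    "\u0576": "n",   # ն
    "\u0577": "sh",  # շ (a digraph in the Latin output)
    "\u0578": "o",   # ո
    "\u057A": "p",   # պ
    "\u057B": "j",   # ջ
    "\u057D": "s",   # ս
    "\u057F": "t",   # տ
    "\u0580": "r",   # ր
    "\u0584": "k",   # ք
    "\u0585": "o",   # օ
}


def passport_transcribe(name: str) -> str:
    """Replace each Armenian letter with its fixed Latin value.

    Pronunciation is deliberately ignored, so each input has one output.
    Letters missing from this partial table raise KeyError instead of
    being guessed.
    """
    pieces = []
    for ch in name.lower():
        if ch not in PASSPORT_TABLE:
            raise KeyError(f"letter {ch!r} is not in this partial table")
        pieces.append(PASSPORT_TABLE[ch])
    # Surnames are written with an initial capital.
    return "".join(pieces).capitalize()
```

With this table, `passport_transcribe("Օդաջյան")` returns `"Odajyan"` and `passport_transcribe("Քարտաշյան")` returns `"Kartashyan"`. Diasporan spellings such as Odadjian cannot be produced by any table of this kind, because they were adapted by ear rather than by rule.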

Ban cross-family comparisons from EDAL

Self-explanatory. Any cross-family comparison sourceable only to EDAL or affiliated sources should be banned or at the very least worded in such a way that makes it clear that the "Altaic" family is a pseudolinguistic fringe theory. — SURJECTION ^{/ T / C / L /} 15:21, 2 September 2023 (UTC)[reply]

Support. We need to remove macro-level Altaic comparisons. I wouldn't be surprised if some lower-level connections are established but that's far outside the standard linguistic view of things and we don't need to be hosting fringe. Fringe is cringe bro. Vininn126 (talk) 15:25, 2 September 2023 (UTC)[reply]

I wouldn't be opposed to this, either. It often feels like many of our active Proto-Turkic editors are sneakily adding in Altaic comparisons and references (including lists of comparanda of supposed regular sound correspondences!) whenever they think they can get away with it. — SURJECTION ^{/ T / C / L /} 15:45, 2 September 2023 (UTC)[reply]

Let me clarify what I mean: perhaps in the future smaller connections between languages in this area will become more established within the linguistic mainstream, but until that time we shouldn't host it. Vininn126 (talk) 15:49, 2 September 2023 (UTC)[reply]

Support Over at the Proto-Turkic page we've already been slowly phasing out Altaic reconstructions and comparisons are made with Mongolic if they cannot be explained through conventional borrowing. Yorınçga573 (talk) 15:41, 2 September 2023 (UTC)[reply]

Support. AG202 (talk) 05:46, 3 September 2023 (UTC)[reply]

Support BurakD53 (talk) 13:01, 3 September 2023 (UTC)[reply]

Support ~ Blansheflur ｡･:*:･ﾟ❀,｡ 20:53, 4 September 2023 (UTC)[reply]

Provisional Support -- @Surjection, could you expand on what "EDAL" is? From context and other threads, I think this is Starling, but I'm not sure. ‑‑ Eiríkr Útlendi │^{Tala við mig} 04:23, 5 September 2023 (UTC)[reply]

The Etymological Dictionary of the Altaic Languages which Starling proudly and prominently features. — SURJECTION ^{/ T / C / L /} 04:30, 5 September 2023 (UTC)[reply]

Gotcha, thank you! In that case, most definitely Support. I did a brief survey of Dolgopolsky's work, on which Starling's Japonic entries appear to be based, and found an alarmingly bad failure rate. Discussed some in this old thread: Thread:User_talk:Rua/Unexplained_deletions:_continuing_what_appears_to_be_a_common_theme/reply_(5).

Yes, please rip out EDAL by the roots. If and when something more rigorous replaces it, perhaps we can use that future work as reference, but for now, please deep six anything relying on EDAL. ‑‑ Eiríkr Útlendi │^{Tala við mig} 05:12, 5 September 2023 (UTC)[reply]

Oppose 'Altaic' is apparently wrong rather than a pseudolinguistic fringe theory. --RichardW57m (talk) 14:15, 5 September 2023 (UTC)[reply]

How does being wrong vs being pseudolinguistic change anything about whether we should include it? CitationsFreak (talk) 05:54, 6 September 2023 (UTC)[reply]

@CitationsFreak: The proposal is that any use of the EDAL for inter-family comparisons, or where it is the only traceable publication of an idea or suggestion, shall be accompanied by a denigration of Altaic as 'pseudolinguistic'. Not even a mere statement of disbelief in Altaic will suffice. --RichardW57 (talk) 13:35, 9 September 2023 (UTC)[reply]

Bulgarian name dictionary reference template

Hello to all Bulgarian editors, I don't know if this resource has been used before, but today I found a dictionary that documents Bulgarian personal names, which can be viewed online, just like the etymological dictionary. I've written a reference template at {{R:bg:LIFUB}}; the syntax is {{R:bg:LIFUB|page_number|entry_name}}, where the entry name can be omitted nine times out of ten, namely whenever it's the same as the page title. @SimonWikt @Chernorizets @Bezimenen. This is a good help in adding accents to names with unclear stress, as well as expanding small entries: check out Апостолов, for example. Hope this helps! Kiril kovachev (talk・contribs) 18:17, 2 September 2023 (UTC)[reply]

In general, you don't need a BP thread for this. Perhaps alerting other editors on a talk page for the template or something similar will suffice. Vininn126 (talk) 18:20, 2 September 2023 (UTC)[reply]

@Kiril kovachev: This is great. I think we need pages documenting these resources, although Category:Bulgarian reference templates is a good start. As for where to post this, maybe WT:About Bulgarian and pinging the relevant users? (Although I wouldn't have seen this as you didn't ping me.) Benwing2 (talk) 20:11, 2 September 2023 (UTC)[reply]

@Vininn126 Yes, that's fair enough; my apologies. (I partly posted it here because I don't know for sure who else might be a Bulgarian editor, or be interested in the template anyway.) @Benwing2 Sorry for not @ing you, I wasn't sure whether you still edit Bulgarian these days and I didn't want to spam you with it in that case. So, fortunate that you check this place often enough to see. :)

I may well update WT:About Bulgarian with some information about our templates for this.

Thanks, Kiril kovachev (talk・contribs) 20:23, 2 September 2023 (UTC)[reply]

Don't apologize! I'm just informing. Vininn126 (talk) 20:24, 2 September 2023 (UTC)[reply]

Thanks for the heads-up! Kiril kovachev (talk・contribs) 22:29, 2 September 2023 (UTC)[reply]

@Kiril kovachev very cool! Chernorizets (talk) 21:41, 2 September 2023 (UTC)[reply]

redoing Template:rootsee and Template:PIE root see

User:Dragonoid76 requested equivalents of {{PIE root see}} for Proto-Indo-Iranian, Proto-Indo-Aryan and Sanskrit. I realize that there isn't a proper template currently for this. It should be {{rootsee}} but (a) that template doesn't quite do it, (b) it is a total mess. I am going to redo {{rootsee}} to work similarly to {{root}}:

|1=: Destination language of category Category:Destination terms derived from the Source root *root-. If left out or set to the value +, you get the umbrella category Category:Terms derived from the Source root *root-.

|2=: Source language of category Category:Destination terms derived from the Source root *root-. If left out or set to the value +, or equal to the destination language, you instead get Category:Destination terms belonging to the root *root-. However, if both source and destination language are left out or set to +, and the current page is in the Reconstruction namespace, the source language is inferred from the pagename and you get Category:Terms derived from the Source root *root- (otherwise you get an error). If the destination language is a family code and not a valid language code, the family code is converted to the corresponding proto-language. This means you can write ine for Proto-Indo-European, iir for Proto-Indo-Iranian, inc for Proto-Indo-Aryan, etc.

|3=: Root. If left out or set to the value +, it is taken from the subpage name (i.e. after a slash in the case of Reconstruction namespace items). If the source language is reconstruction-only, you can leave out the initial *. In addition, a hyphen may be added according to the following algorithm:

If there is a space or hyphen in the root already, no hyphen is added.
If the root is in a non-Latin script, no hyphen is added.
Otherwise, a hyphen is added at the beginning if the source language is Navajo, and at the end for every other source language.

|id=: Sense ID of the root; needed especially for Navajo.

This means, for example, that you can write {{rootsee}} by itself on a reconstructed root page and get Category:Terms derived from the Source root *root- automatically. This should make {{PIE root see}} totally unnecessary. Current uses of {{rootsee}} that default to PIE will have to be changed to add ine as the second argument, so that e.g. {{rootsee|en|*gʷem}}, which currently gets you Category:English terms derived from the Proto-Indo-European root *gʷem-, will change to {{rootsee|en|ine|*gʷem}}. Benwing2 (talk) 23:16, 2 September 2023 (UTC)[reply]
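The hyphen rule for |3= described above can be sketched in a few lines. This is only an illustration in Python, not the actual module (which lives on-wiki at Module:User:Benwing2/rootsee); the function names, the use of `nv` as the Navajo code, and the Unicode-name heuristic for deciding whether a root is in Latin script are all assumptions made for the sketch.

```python
import unicodedata

NAVAJO_CODE = "nv"  # assumed language code for Navajo


def is_latin_script(root: str) -> bool:
    """Rough heuristic: every letter must be Latin or a modifier letter.

    Modifier letters such as the superscript w in PIE roots are accepted,
    since they are routinely used inside Latin-script reconstructions.
    """
    for ch in root:
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if "LATIN" not in name and not name.startswith("MODIFIER LETTER"):
                return False
    return True


def add_root_hyphen(root: str, source_lang: str) -> str:
    """Apply the three hyphenation rules in the order given above."""
    # Rule 1: a space or hyphen already present means nothing is added.
    if " " in root or "-" in root:
        return root
    # Rule 2: roots in non-Latin scripts are left alone.
    if not is_latin_script(root):
        return root
    # Rule 3: Navajo gets a leading hyphen, everything else a trailing one.
    if source_lang == NAVAJO_CODE:
        return "-" + root
    return root + "-"
```

Under these assumptions, `add_root_hyphen("*gʷem", "ine")` gives `"*gʷem-"`, matching the category name in the example above, while a Devanagari root such as `"भृ"` is returned unchanged.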

I have written the module underlying this, see Module:User:Benwing2/rootsee and User:Benwing2/test-rootsee, as well as the bot script to convert existing uses of {{rootsee}} and {{PIE root see}}. If no one objects, I will do the conversion in the next couple of days. Benwing2 (talk) 02:45, 3 September 2023 (UTC)[reply]

Thanks, this looks good. It was always a bit odd seeing a template with a generic name like rootsee being specifically bound to descendants of PIE. —Soap— 15:38, 4 September 2023 (UTC)[reply]

Looks good! Thanks! Dragonoid76 (talk) 21:56, 4 September 2023 (UTC)[reply]

Just for ease, could you make sure {{User:Benwing2/rootsee|pagename=भृ}} works like {{User:Benwing2/rootsee|+|sa|भृ}}? Right now, I'm getting "Unable to infer source from pagename 'भृ' as it isn't a Reconstruction or Appendix page", since it's not a reconstruction page. Dragonoid76 (talk) 22:13, 4 September 2023 (UTC)[reply]

We just got rid of the assumption that any root without a language code is Proto-Indo-European, now we're adding back special cases. How is the module supposed to know that the page in question is Sanskrit? What knowledge do we have to give it to be able to tell? It would have to be something that would hold true for the foreseeable future, no matter what happens to the type of page in question. Can we guarantee that no Sanskrit root will ever share a page with a root for any other language? I'm not trying to shoot down your idea- I just want to make sure someone thinks about this kind of thing. Chuck Entz (talk) 22:43, 4 September 2023 (UTC)[reply]

@Chuck Entz Good point. In the case of words written in the Devanagari script, I can't think of a case where the root wouldn't be Sanskrit—but it's probably just better design to use the template like {{rootsee|+|sa|भृ}}. Dragonoid76 (talk) 22:52, 4 September 2023 (UTC)[reply]

lemmas

Hi, how can I find which languages on Wiktionary have the most lemmas? 31.7.113.40 16:57, 3 September 2023 (UTC)[reply]

There is a sortable list at Wiktionary:Statistics. Einstein2 (talk) 23:22, 3 September 2023 (UTC)[reply]

Remember the phrase 'lies, damned lies and statistics'. In languages whose users command deep, morphologically marked derivations, synchronically derived terms may be marked as lemmas. --RichardW57m (talk) 08:26, 4 September 2023 (UTC)[reply]

Listing taxonomical names in Derived terms sections

I'd like to find out if there is a policy about how much detail should go into listing taxonomical names in Derived terms sections. See rigó. Currently the Hungarian name is followed by the English translation and the taxonomical name. I wonder if this is all considered useful by other editors. Should I just list the Hungarian names? Panda10 (talk) 17:38, 3 September 2023 (UTC)[reply]

I don't think there is a policy. I think they are useful to indicate which taxon is indicated by the vernacular name and whether multiple taxa are indicated. Taking English as an example, many really common one-word English vernacular names cover multiple species and multiple higher-level taxa, sometimes even kingdoms. It doesn't do a user much good to have to go on a merry chase through other references to disambiguate the term. OTOH, it can be time-consuming for a contributor to do so. If we save two users from such a merry chase at the cost of one of us doing it once, there is a net social gain. DCDuring (talk) 19:03, 3 September 2023 (UTC)[reply]

I agree. --RichardW57m (talk) 11:52, 5 September 2023 (UTC)[reply]

What DCDuring says. I think an overview like the one at جَوْز (jawz) is more useful than the entry without the derived terms plus separate pages for each derived term; the derived terms are also necessary to show that, within them, the word has a broader meaning corresponding to “nut” in English, rather than just meaning “walnut”. It can of course become ridiculous if you have such lists at tree or اوت (ot); for such general terms I believe nobody wants to read lists of taxonomic names. Fay Freak (talk) 11:58, 5 September 2023 (UTC)[reply]

Probably not. A few examples, as at tree now, might be OK. I don't know what is best at terms like common, vulgar, or bastard or the terms for colors and body parts: Complete or illustrative listings of derived terms? DCDuring (talk) 15:05, 5 September 2023 (UTC)[reply]

May I place a wikilink within a usex or quote when I believe it is helpful to the reader?

On the house entry, we have an extremely large collapsible listing all of the derived terms in alphabetical order. There are two phrases, on the house and the house always wins, that are specifically bound to sense 6, subsense 2, and make no sense in any other context. If there existed an inline version of the derived terms template, I would want to use that underneath s6:2 so that readers would know that it is specifically tied to this narrow definition. But so far as I know there is no such template, and if there were one, we would probably discourage widespread use, since it would take up space and merely duplicate terms that already appear below. We have collocations, but my understanding is that they are typically used for phrases which do not have their own entries and therefore cannot be links either. So perhaps the best way to help the reader is to use wikilinks within the use-examples we currently give. This is currently forbidden by our Manual of Style page, which specifically says that use-examples must

not contain wikilinks (the words should be easy enough to understand without additional lookup).

However, as I read it, this is intended to discourage introducing difficult, unrelated words into use-examples which would need wikilinks in order to be understood. That is, if my word were house, I would not do well to add a use-example such as

Next to the galamander was a small grey house.

where the unrelated word galamander both distracts the reader and tells them nothing about houses. By contrast, linking to the expressions on the house and the house always wins underneath the one specific sense of house that they are bound to seems like the best solution for this rare situation.

Ideally, if we can agree this exception to the policy is valid, I would like to see a small change to the Manual of Style to reflect this, rather than just tolerating a few exceptions here and there, so that this won't be a source of conflict in the future.

Best regards, —Soap— 12:54, 4 September 2023 (UTC)[reply]

Good luck with the wording. DCDuring (talk) 16:47, 4 September 2023 (UTC)[reply]

Where does it say this? I can't find the word 'wikilink' in WT:STYLE.

Does it actually ban such links in quotations, or does it just ban them from usage examples? Banning them from usage examples makes sense. --RichardW57m (talk) 09:05, 5 September 2023 (UTC)[reply]

It isn't in WT:STYLE, but in Wiktionary:Example_sentences#Official_policy. I understand why wikilinks would be banned from examples of English usage, as the English-language Wiktionary addresses English speakers, who may be expected to have familiarity with all the words in the example other than the target/feature word. But I have been adding wikilinks quite liberally to non-English usage examples (in complete ignorance of the above prohibition, which I have only just learned of, and with nobody raising any objection), because it strikes me as helpful to language learners and not detrimentally distracting. The example's accompanying translation tells the reader what the sentence means as a whole, but may not be sufficient to clarify how the sentence means what it means, even when it is quite short and rudimentary in its composition. The reader may also want to go on a voyage of discovery in an unfamiliar language via wikilinks. This is where I believe it to be helpful to provide links to some of the constituent words, taking advantage of one of the key benefits of a wiki, namely the links. Voltaigne (talk) 11:35, 5 September 2023 (UTC)[reply]

@Soap: Ah, you're quoting WT:EL#Example sentences or its expansion page, which don't apply to quotations.

I think what you should do is to elaborate the headword line from {{en-PP}} to

{{en-PP|head=on the {{l|en|house|id=s6v2}}}}

on the house

where s6v2 is the {{senseid}} for the sense you want to link to. Obviously you should choose a better name for the senseid.

@Theknightwho: Can you please advise on how to remove the use of {{l}}? Or is the weak requirement in {{head}} not to use {{link}} to be ignored in this case? --RichardW57m (talk) 11:43, 5 September 2023 (UTC)[reply]

{{en-PP|head=on the [[house#English:_s6v2|house]]}}

on the house

Same result without {{l}}.

Voltaigne (talk) 12:46, 5 September 2023 (UTC)[reply]

Thank you both, but I don't think that links to subsenses are meant to be put in the header, as they would appear the same as normal links, and I see no reason a user would think to click the link in just this one particular case since they are presumably already familiar with the generic sense of the word house. We could add etymology sections and mention the derivation there instead. But that doesn't address what I came here asking about.

I want to put links on the house page, in the use-examples, where a link containing the entry word would stand out and alert the reader to the existence of the common set phrases on the house and the house always wins. If we cannot do this, the only indication of these phrases is in the derived terms section, which is (by prior consensus) sorted alphabetically rather than by sense, therefore mixing the syntactically bound terms in with hundreds of others.

Again we could consider this a proposal for an exception to the rule about not putting links in use-examples, but the way I see it, as above, the rule exists to prevent users from writing sentences with irrelevant and unhelpful words, distracting the reader, because a good use-example will focus on the entry word. I think a good way to do this would be to work the common expressions on the house and the house always wins into the use-examples under sense 6, subsense 2 of house, and that this is the most likely place the user will be looking for them. —Soap— 12:58, 5 September 2023 (UTC)[reply]

The value of a sense link in the inflection line is that a user thinking that the link will require a search of the entire English house L2 will be delighted to be taken to a specific, relevant sense. If we do this often enough users will become hopeful that they can be led to the correct sense and therefore may click through more often. DCDuring (talk) 13:38, 5 September 2023 (UTC)[reply]

I don't have a problem with modifying the inflection line to point to a particular sense of a constituent word, but I still say, as above, that few users are likely to click on it, as there is no visual cue suggesting that anything unusual is there, except perhaps the lack of links for the other words in the header. We rejected tooltips for the same reason.

Whether we link headers or not still has nothing to do with my original proposal. So far, nobody has agreed with me, so I'll just point out that this rule we have on the Manual of Style seems at least not to be enforced all too strictly, as we have a link to badass in a use-example on the marshmallow page, and on the gay page we have a linked collocation, gay marriage. I think the link to gay marriage is a good thing because it's specifically bound to this one sense. Perhaps we could reword the use-example on marshmallow to use a more familiar word than badass (again keeping in mind that at least some of our readers are just learning the language), but it at least has the benefit that it's not totally irrelevant, like my galamander example.

Again, the wording of the rule in the Manual of Style suggests to me that its purpose is to keep the use-examples close to the meaning of the word being defined. If we should decide to interpret that rule strictly but allow linking of collocations, as per the gay marriage example, I would be okay with that, but I'd also say that that effectively transforms collocations into an inline version of the derived terms template. If we go that route, why not make it official and create an actual inline derived terms template? Best regards, —Soap— 10:02, 7 September 2023 (UTC)[reply]

Do we need an inflection line when we have a conjugation box?

In a Beer Parlour discussion, it was pointed out that English doesn't use conjugation boxes that much, instead using the inflection line. However, we do use conjugation boxes for certain verbs, mostly those with archaic endings, like run. In these cases, do we really need the inflection line? The conjugation box provides so much more information in this case, and the same type as the inflection line. Why repeat ourselves? CitationsFreak (talk) 17:34, 4 September 2023 (UTC)[reply]

I think we should keep the inflection line both for consistency's sake (people will be looking there) and because it's much more convenient. The run page is a particularly long one, with the conjugation box only at the very end, which on smaller devices might be ten screens from the top of the page. But even if we were to move the conjugation box up top, I still think the inflections should stay in the header, because it's where people are more likely to look for them based on the patterns set by other entries. —Soap— 18:02, 4 September 2023 (UTC)[reply]

I was thinking of having the inflection line read "see conjugation box" with a link to it, for convenience's sake. CitationsFreak (talk) 18:09, 4 September 2023 (UTC)[reply]

@Soap Made a little mockup of how I think it should look at User:CitationsFreak/conjugate. Lemme know what you think. — This unsigned comment was added by CitationsFreak (talk • contribs) at 18:32, 4 September 2023.

It's standard in languages that have both principal parts (or equivalent) and extended conjugations still to list the principal parts in the headword line: compare the formats for Latin amo, Spanish amar, Korean 없다 (eopda) and so forth. I believe that's the most user-friendly procedure in general and for English as well, so I would oppose removing all inflections from the head line. —Al-Muqanna المقنع (talk) 19:15, 4 September 2023 (UTC)[reply]

@CitationsFreak I agree with User:Al-Muqanna; we should keep the headword information. In general I really think we don't need conjugation tables for most English verbs; they just aren't that complex. Also the Wikicode of {{en-conj}} is an absolute disaster. Benwing2 (talk) 20:52, 4 September 2023 (UTC)[reply]

I also agree with Al-Muqanna, in general. But I think conjugation tables can be useful, for showing all the forms which are too archaic or dialectal to list on the headword line. For example, if the headword line lists not just the one usual past participle, but several rare obsolete ones, those I would be inclined to move out of the prominent headword line (and a conjugation table, with appropriate qualifiers, is a logical place to put them). - -sche (discuss) 20:57, 4 September 2023 (UTC)[reply]

Yeah that makes sense. Benwing2 (talk) 00:10, 5 September 2023 (UTC)[reply]

I agree. As has been pointed out, it's common to have both. The inflection line should give the most useful conjugations and the conjugation box should give a complete conjugation. It's useful to have both, even if there's a certain level of redundancy, especially since this is already our standard practice in several languages (see Portuguese or Spanish amar, for instance). Andrew Sheedy (talk) 14:37, 5 September 2023 (UTC)[reply]

Etymology and descendants of letters/scripts?

Currently most letter pages don't have an etymology or descendants section, though a few pages do, like most of Latin, Brahmi 𑀅, Aramaic 𐡀 etc. Shouldn't it be standardized and added to all letter pages? AleksiB 1945 (talk) 13:16, 5 September 2023 (UTC)[reply]

It appeals, but there are issues with the depths of some of these scripts, and it verges on the encyclopaedic. For example, for Tai Tham and Lao I want to reference the Fakkham script, which is currently unencoded. There are also issues with the development of the Thai script - how many stages are there between the Khmer script (if that truly be the ancestor) and the current script? There's nothing encoded yet, but we have concepts like the Sukhothai script and the King Lithai script. --RichardW57m (talk) 14:48, 5 September 2023 (UTC)[reply]

What should we do for the notion of inheritance? If a script is borrowed and ultimately transformed, I would say the characters thereby transferred were inherited, but we hit the technical problem that {{inh}} is set up for languages rather than sets of characters. On the other hand, for the Vietnamese letter 'a', we currently have

Borrowed from {{bor-lite|vi|fr|a}}

Borrowed from French a

which I suppose is only mostly wrong. (Portuguese, Italian and Latin would be the starting points for the system; I am assuming that 'French' just reflects ignorance.) --RichardW57m (talk) 08:18, 6 September 2023 (UTC)[reply]

Aramaic and Brahmic pages use photos of letters which aren't encoded in their descendants sections; that's too much, but either way we could show only the major ancestral scripts instead of all of them. A major script which isn't encoded is Pallava Grantha, though. AleksiB 1945 (talk) 09:40, 6 September 2023 (UTC)[reply]

US Census statistics as a template

Hello, partly in reference to the above discussion on the inclusion of names, would anyone support converting the surname Statistics (e.g. on Johnson) into a template? I would use a syntax like {{S:US Census|1=rank|2=number of bearers|race=|percentage=|race2=|percentage2=|...|alt=(alternative name other than the page title, unlikely to require use)}}. Whilst at it, we can also make a note of those that are highly underused, like Odajyan above, which has 1 holder in the US and no mentions on Google books. Indeed, it might just be better overall to remove statistics in cases where the name is literally in last place. What does everyone think? Kiril kovachev (talk・contribs) 20:18, 5 September 2023 (UTC)[reply]

@Kiril kovachev This sounds good to me. Benwing2 (talk) 20:41, 5 September 2023 (UTC)[reply]

+1. I agree the statistics serve little function for very small numbers, I would exclude them from names pertaining to single-digit numbers of people at least (and maybe just replace them with the (rare) label). —Al-Muqanna المقنع (talk) 11:09, 6 September 2023 (UTC)[reply]

Agreed. I use something similar for {{pl-freq 1990}} but for words, and I know Surjection has made a similar template for Finnish surnames. Furthermore, I think we might want to set "Statistics" as an official header at some point. Vininn126 (talk) 11:10, 6 September 2023 (UTC)[reply]

@Kiril kovachev Wait, I think we already have just such a template: {{surnames-us-census}}. It's unused, but I don't know if that's just because editors subst'ed it. — excarnateSojourner (talk · contrib) 18:42, 15 September 2023 (UTC)[reply]

@ExcarnateSojourner Oh that's interesting, it might be good to build off of that then. I think this is slightly different to the current text we have lying around, e.g. the template uses "US", whereas the text on Johnson uses "United States"; the template doesn't specify 2010 as the census year, but the entries currently do, etc., so I'd think this template may have been innovated later but not come round to being used, but I don't know.

I think it could maybe use some refinements, such as converting the "rank" into an ordinal by default, rather than requiring the caller to input "3rd" every time, and also changes to bring it in line with the currently-widespread text (bridging the differences stated above basically). But nice find!

Fortunately I haven't done anything towards this conversion yet, so thanks for catching this before I did. Kiril kovachev (talk・contribs) 18:50, 15 September 2023 (UTC)[reply]
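The refinement suggested above, turning a numeric rank into an ordinal so that the caller does not have to type "3rd" by hand, is small enough to sketch. The following Python is illustrative only: a real template would be written in wikitext or a Lua module, and the function names `ordinal` and `bearers_phrase` are hypothetical. The second helper addresses the "1 individuals" wording flagged as (sic) earlier on this page.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 3 -> '3rd'."""
    if n < 1:
        raise ValueError("rank must be a positive integer")
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def bearers_phrase(count: int) -> str:
    """Return '1 individual' or 'N individuals' as appropriate."""
    noun = "individual" if count == 1 else "individuals"
    return f"{count} {noun}"
```

For example, `ordinal(3)` returns `"3rd"`, `ordinal(1988048)` returns `"1988048th"`, and `bearers_phrase(1)` returns `"1 individual"`.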

{{pa-Arab-translit}} does not comply with UR TR

~~I don't know what transliteration module Urdu is using now,~~ but it's not compliant with Wiktionary:Urdu transliteration nor the way Urdu entries have been transliterated up until now. This means that all the transliterations that have been manually entered into entry headers up till now are different than the automatic transliterations.

I have changed Module:ur-translit (which btw is not the module Urdu is using) to match the traditional transliteration of Urdu entries and I think we should switch Urdu to that module. Especially because there has never been a discussion on changing Urdu transliteration. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 05:32, 6 September 2023 (UTC)[reply]

@Sameerhameedy There was no Urdu translit module set until a week ago (Aug 29), when I changed it to use the Panjabi translit module based on a request from User:نعم البدل. I have no issues with switching it to use Module:ur-translit, but maybe we should wait for that user to comment as to why they think it should use the Panjabi module. Benwing2 (talk) 18:49, 6 September 2023 (UTC)[reply]

@Benwing2 If نعم البدل starts a discussion about changing Urdu's transliteration policy and adopting Panjabi's transliteration policy, then I would have no issue. However, currently Urdu and Punjabi have very different transliteration policies. Urdu policy treats hamza as a zero-consonant, Punjabi doesn't. Urdu policy exclusively uses dots under letters for retroflexes, Panjabi has no such restriction. That is not to mention the different character mappings because of the fact that Urdu's policy only transliterates letters that have (official) pronunciations, whereas Punjabi transliterates all of them. @نعم البدل I don't care if Urdu adopts Punjabi's transliteration policy but please start a discussion with other editors about changing the policy first. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 19:05, 6 September 2023 (UTC)[reply]

@Sameerhameedy, Benwing2, Module:ur-translit is garbage and not based on any standard. Module:pa-Arab-translit is based on the ALA-LC Transliteration standard for Urdu, and subsequently for Punjabi Shahmukhi. And it's better to use one module for the multiple languages of Pakistan, since there's practically no difference in the transliteration standards for those languages. Module:pa-Arab-translit currently suffices for Urdu, Punjabi (Shahmukhi), Saraiki, Pothwari etc. I'm all ears for opinions, but there's not even enough users to comment on the matter, or at least not many people seem to have enough of an opinion on the matter. نعم البدل (talk) 19:08, 6 September 2023 (UTC)[reply]

Yes but every Urdu entry up till now has used the previous standard. I don't really care what transliteration policy Urdu uses but please start a discussion about changing it if you don't like the current one. Changing the transliteration policy would mean fixing thousands of Urdu entries, which is a lot of work to put on other Urdu editors without asking them. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 19:10, 6 September 2023 (UTC)[reply]

@Sameerhameedy My mistake, you've been updating Module:ur-translit. Perhaps that calls for a discussion, apologies. نعم البدل (talk) 19:11, 6 September 2023 (UTC)[reply]

@Sameerhameedy Also WT:UR TR is based on the ALA-LC standard anyways, no? نعم البدل (talk) 19:12, 6 September 2023 (UTC)[reply]

No it's not, I'm getting confused with the Transliteration module. نعم البدل (talk) 19:13, 6 September 2023 (UTC)[reply]

@نعم البدل I don't know why you are starting an argument. I changed the module to match the Urdu transliteration policy, as Urdu has not had automatic transliteration until this week. Again, I don't care what transliteration Urdu uses, just please start a discussion about changing the policy if you don't like the current one. That's all I'm asking. If you start a discussion you can use whatever transliteration the other Urdu editors agree to, and I will fully support whatever transliteration that is. Just discuss it first. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 19:16, 6 September 2023 (UTC)[reply]

@Sameerhameedy I'm not starting an argument. If I'm coming across as passive aggressive, then apologies, but that wasn't my intention. I can continue this at Module talk:ur-translit and explain why I requested Module:pa-Arab-translit be used over Module:ur-translit. نعم البدل (talk) 19:24, 6 September 2023 (UTC)[reply]

I support transliteration of Urdu in the main space but I think some main ground rules still need to be established - what can and more importantly, what CANNOT be transliterated. Otherwise, transliterations will look like a bunch of consonants. Since vocalisation, especially full vocalisation is not a common practice yet, https://rekhtadictionary.com/ dictionary is inconsistent and is full of errors, we need to work out the rules ourselves.

Common digraphs, which are represented by one letter in Hindi and the ways to spell them out. E.g. in بھائی (bhāī) بھ (bh) is an acceptable cluster, no vocalisation is required BUT it needs to be followed by a long or a short vowel, otherwise, the transliteration should fail. E.g. at گَھر (ghar) and چِڑْیا (ciṛyā) are good, گھر is also good because the module sees the ambiguity but what about چڑیا (cṛyā)? (incorrect vocalisation)
Use of sukun to kill off vowels to avoid ambiguity in the middle of words.

Anatoli T. ^{(обсудить}/^вклад) 02:10, 8 September 2023 (UTC)[reply]

Standard for Urdu romanization

In light of recent discussions, I invite the following users (as well as any other users who may have an opinion on this):

Urdu contributors @Rodrigo5260, AleksiB 1945, ‎SAA2002, Notevenkidding
And additionally @Sameerhameedy, Fenakhay, Benwing2

Hi all, apologies for pinging you all. The topic of the transliteration policy for Urdu has come up, and your opinions on this are much appreciated.

Recently, I requested Benwing2 to set Module:pa-Arab-translit as the transliteration module for the Urdu language (previously only Punjabi, Saraiki, Pothwari). It's based on the ALA-LC Romanisation standard, which represents, specifically, Urdu letters. The module is not perfect and needs to be fixed, but in my opinion it serves the Urdu language (and other Pakistani languages) the best. User:Sameerhameedy recently fixed, or sorted out, Module:ur-translit, which is based on the old Urdu/Hindi transliteration policy, which makes it easier to understand both Hindi and Urdu transliterations.

How should we go about this :) نعم البدل (talk) 19:50, 6 September 2023 (UTC)[reply]

just to add, if we change the transliteration policy I will change Module:ur-translit to the new policy. Since currently it is less buggy than the punjabi module (which I will also look into fixing in the future). سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 19:53, 6 September 2023 (UTC)[reply]

And are we in accord with the opinion that it's better to have one module for all of the languages, or would we want separate transliteration modules/policies? نعم البدل (talk) 19:58, 6 September 2023 (UTC)[reply]

@نعم البدل: I know I wasn't pung since I'm not active around here much, but IMO we should maintain consistency in transliteration firstly between Urdu and Hindi, and secondarily all of the Indo-Aryan languages. We have a lot of locations where both Hindi and Urdu equivalents get linked to and inconsistent transliteration will look ugly. I also don't particularly like the ALA-LC treatment of nasalisation and in general its overuse of underlines and digraphs when a single diacriticless letter works fine (e.g. kh instead of x). However, it seems like pa-Arab-translit doesn't exactly follow ALA-LC since I see x? Regardless, we should try to keep consistency with Hindi.

One point on which divergence from Hindi might make sense is the various Arabic script letters that get merged into one sound in speech (e.g. all the z's). But current practice is to not distinguish these in the transliteration. —Aryaman^A ^{(मुझसे बात करें • योगदान)} 20:03, 6 September 2023 (UTC)[reply]

@AryamanA To add, I personally like how Urdu policy reserves underdots for retroflexes (e.g. ḍ ṭ ṛ). Even if we change the transliteration, I think the dot should still be reserved for retroflexes only. The fact that ص and ض transliterate as "ṣ" and "ẓ" is very confusing IMO. Also I prefer that words like کَئی transliterate as "kaī" not "ka'ī" since the apostrophe is also used for a glottal stop, and Hamza here is not acting as a glottal stop but as a Zero consonant. And I think مَیں transliterating as ma͠i is more understandable than maiṉ, since the tilde is such a universal way of showing nasal vowels. But if Urdu editors prefer that... then whatever. Besides that stuff, I don't feel strongly about any other changes. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 20:27, 6 September 2023 (UTC)[reply]

I have no issues with reserving the underdots for retroflexes. How would we go about transliterating the Hamza, or does it need to be transliterated, and if so the difference between the Ain and Hamza? نعم البدل (talk) 20:21, 6 September 2023 (UTC)[reply]

Well, current practice is to not include the hamza since it's only written to ensure vowels are paired to a consonant. Presumably, an Urdu reader would know that kaī is کئی since the ī needs to be paired to a consonant according to the rules of the Urdu alphabet. But perhaps there are benefits to including the hamza in transliterations that i'm not aware of. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 20:31, 6 September 2023 (UTC)[reply]

The only benefit I can think of is the fact that sometimes the Hamza is left out between diphthongs, and would technically become a misspelling. The transliteration may remind the user that the Humza is necessary? نعم البدل (talk) 20:36, 6 September 2023 (UTC)[reply]

Hi @AryamanA: Thanks for being part of the discussion, feel free to invite others to this discussion as well. I don't agree with the ALA-LC fully either, I don't like the fact that it represents خ as kh either, neither with how it represents ش / ژ / غ, since the ALA-LC standard is technically a romanisation standard, not a transliteration and wouldn't be opposed to create a policy that diverges from the ALA-LC. Although, I'm not too opposed with how it represents the nasal vowel, among other things, and I do think it's important to set a transliteration policy which represents how words are perceived and written by Urdu speakers, because technically, for instance, nasalisation in Urdu isn't the same as how it's perceived in Hindi, despite the pronunciation being no different. نعم البدل (talk) 20:19, 6 September 2023 (UTC)[reply]

@Sameerhameedy @نعم البدل I think we should make a decision based on what makes the most sense for Urdu, not either (a) how difficult it is to convert existing transliterations (which can be converted by bot) or (b) what the current state is. I also think transliterating Urdu and Hindi similarly is more important than transliterating Urdu and Punjabi similarly, since Urdu and Hindi are essentially the same language. Benwing2 (talk) 03:45, 7 September 2023 (UTC)[reply]

@Benwing2: It depends on what you mean by similar. Module:ur-translit at the moment doesn't produce the same results as Module:hi-translit, when it comes to nasalisation, for instance. The transliteration policy that we pick for Urdu would technically become the one for Shahmukhi Punjabi and other Pakistani languages anyways, and it would be pointless to have different TR policies for Urdu and Punjabi, when they are essentially the same when it comes to the spelling, grammar, alphabet etc. It's why I think it's better to make a policy that can work for basically all of the Pakistani languages, rather than create separate transliteration policies, like how Module:hi-translit isn't merely limited to Hindi. نعم البدل (talk) 15:23, 7 September 2023 (UTC)[reply]

@نعم البدل I don't understand why you think the translit policy we pick for Urdu would have to be the one for Panjabi in either spelling. I understand that Urdu and Panjabi use similar spelling principles but their phonologies differ, e.g. Panjabi has tone whereas Urdu does not. OTOH Urdu and Hindi are essentially the same language so clearly the translits should be as similar as possible. Benwing2 (talk) 06:01, 8 September 2023 (UTC)[reply]

@Benwing2: Because the Punjabi and the Urdu alphabets are exactly the same, and just so we're clear, you do understand I'm talking about the Shahmukhi script for Punjabi, not Gurmukhi, right? Yes, Hindi and Urdu are essentially the same language, but we're talking about transliteration, right? And transliteration is supposed to represent the spelling, not necessarily the pronunciation; Hindi and Urdu spelling/characters can't be mapped one to one, like they can be with Punjabi and Urdu (because the alphabet is exactly the same), so it would make sense that Punjabi Shahmukhi script would adopt the same TR policy. If we continue to use different TR policies then we'd have the word افطاری, for instance, transliterated as اِفْطَاری (iftārī) in Urdu and اِفْطَاری (īft̤ārī) in Punjabi. Btw, tones aren't marked in Punjabi. نعم البدل (talk) 08:37, 8 September 2023 (UTC)[reply]

@نعم البدل Yes, transliteration in the linguistic sense usually represents spelling but here at Wiktionary when we say "transliteration" we really refer to more like transcription, or a mix of traditional transliteration and transcription. "Romanization" would probably be a better term since the choice of how closely to hew to spelling or pronunciation depends on the language. Yes I'm aware that we're talking about Arabic-script writing of Panjabi (aka "Shahmukhi") not writing in the Gurmukhi script. Benwing2 (talk) 08:51, 8 September 2023 (UTC)[reply]

@نعم البدل As another data point, see the discussion just below on Modern Greek translit. Benwing2 (talk) 08:53, 8 September 2023 (UTC)[reply]

@Benwing2: or a mix of traditional transliteration and transcription – which is exactly what I'm hoping to achieve. The similarities between Hindi and Urdu should be maintained (which is why I think characters like ش should be transliterated as ش (ś) (Urdu) and not ش (š)) (Arabic/Persian), no quarrels with that, but to generalise the differences, or things which can't be translated directly like the various 'z' and 's' letters seems a bit inconsiderate, and it's not even to say that the Hindi and Urdu pronunciations will always be exactly the same – they can differ. نعم البدل (talk) 09:14, 8 September 2023 (UTC)[reply]

@Benwing2: To add to that, ष (ṣa) is transliterated as "ṣ", even though it becomes ش (ś) ś in Urdu. Same with ण (ṇa) and ن (n), not to mention the 'r' diacritics and other Devanagari letters. Surely these would also be just generalised to ś, n, r etc like we're choosing to do with Urdu letters, since native Urdu speakers would have no understanding of these letters/diacritics? نعم البدل (talk) 09:27, 8 September 2023 (UTC)[reply]

@نعم البدل I understand your concern about ś vs. ṣ and n vs. ṇ, which are pronounced the same in Urdu (and usually in Hindi as well). Urdu and Hindi don't have to be transcribed identically but should be as similar as possible considering the fact that they are a shared language, and Urdu-Hindi harmonization comes first in priority over Urdu-Panjabi harmonization IMO (ideally of course all three are harmonized). It isn't consistent cross-linguistically in Wiktionary whether to transliterate two letters with the same pronunciation the same or differently (see again the discussion below for Modern Greek, which for example has 5 or 6 distinct ways of writing the sound /i/, some of which are transliterated /i/ but others in other ways). As for the 'r' vs. 'ṛ', AFAIK both Hindi and Urdu speakers make this distinction; at least, this is what Hindustani phonology says. Even if Urdu spelling is not capable of making this distinction, the fact that the distinction is made in speech means it should be in the translit. Urdu spelling has a large number of extra Arabic-derived letters that don't correspond to differences in Urdu pronunciation, and it could be argued both ways in terms of whether we should distinguish them in the translit. For reference, Persian translit does not distinguish them: س ص ث are all transliterated 's'. Hebrew translit is the same way for modern Hebrew: כּ and ק are both transliterated 'k' while ת and ט are both transliterated 't', א and ע are both transliterated as apostrophe or left out depending on position, etc. (Essentially, modern Hebrew lost the emphatic and pharyngeal sounds but preserves them in spelling.) Biblical Hebrew is sometimes transliterated differently in a way that represents Late Biblical (specifically Tiberian) pronunciation. Benwing2 (talk) 10:02, 8 September 2023 (UTC)[reply]

@Benwing2:

As for the 'r' vs. 'ṛ' – Should clarify, I mean ऋ (ŕ) and ृ (ŕ), not ڑ (ṛ) / ड़ (ṛa).
Arabic-derived letters that don't correspond to differences in Urdu pronunciation – Generally, that is the case, but what about words like جَماعَت (jamā‘at) (जमात (jamāt)) and حَضْرَت (ḥaẓrat) (हज़रत (hazrat)), where Urdu speakers [and not Hindi speakers] (attempt to) retain the original pronunciation and as a result have two pronunciations – a 'standard' or 'formal' pronunciation and the common/informal pronunciation (and by no means the only examples)?
Urdu spelling has a large number of extra Arabic-derived letters that don't correspond to differences in Urdu pronunciation, and it could be argued both ways in terms of whether we should distinguish them in the translit. The argument at hand.

I do understand that Persian and Hebrew follow a similar policy (and especially for the latter, I disagree there as well, but since I'm not a speaker of Hebrew, I have never really considered myself qualified enough to voice an opinion on the Hebrew TR policy. Not to change the discussion, but is the Hebrew TR policy formalised? Last I checked, Module:he-translit isn't actually being put to use).

Also, it's not just about the letters per se. Urdu, for instance, uses a nasal vowel – a letter, to represent nasalisation, and Urdu speakers, when it came to Roman Urdu, either used a variant of n, or m (where applicable) to illustrate it. As I've said that nasalisation works differently in Urdu than in Hindi. A tilde, a diacritic makes sense for a Hindi TR, but not really for Urdu. Words like بُلَن٘د (bulãd) should just be transliterated as "buland", not "bulãd".

However, since it's clear that there's no easy solution to this, I've another suggestion which is that the source formatting should be represented in the translit and the headword/vocalised Urdu already automatically generates a Roman TR and a Devanagari TR, while a separate TR policy for the pronunciation should be utilised, which is common to both Hindi and Urdu (under the pronunciation section). That way, a reader can understand both settings. I've been trying something similar with Punjabi tonal words, and of course a module would make it much easier to handle all this, گَھر (ghar vs kàră) is a decent example. نعم البدل (talk) 10:41, 8 September 2023 (UTC)[reply]

Something similar to the new Template:fa-IPA. نعم البدل (talk) 10:46, 8 September 2023 (UTC)[reply]

Hi, none of the sources on the page حضرت indicate the pronunciation "hadrat". Is that really a thing? Current policy simply ignores letters that in official pronunciation are synonymous with another character. If you can prove that ض has a distinct official pronunciation it could be changed. However, on the page a dictionary managed by Pakistan's Ministry of Education indicates the official/standard pronunciation is "hazrat". So it seems that, at least officially, ض and ز have the same pronunciation. (Of course this is about current policy; your proposed changes would be different.) This is not unique to Urdu: if you check the romanizations for ط, most Arabic script languages ignore characters without official pronunciations.

About the tilde, I can change that. But shouldn't a sukoon be used here?? Noon is already a dental, I don't see why noon would need a ghunna diacritic before another dental since it's not assimilating. And بُلَنْد (buland) (which is how the corresponding page vocalizes it) does not create a tilde. noon + sukoon is just an ordinary consonant, only noon + ghunna causes special nasalization.

Lastly if you are aware of any consistent patterns for Punjabi tones I would gladly help you make an IPA module for Punjabi in the future :) (I'm working on a lot right now so I will not be able to get to it anytime soon). While including tones in Punjabi transliteration would be very cool, I can't implement something like that unless Punjabi changed their transliteration policy, as it currently ignores tones. I can't really change Punjabi transliteration without consensus but if punjabi editors wanted tones in translit I could try to do something like that in the future. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 19:33, 8 September 2023 (UTC)[reply]

@Sameerhameedy:

Hi, none of the sources on the page حضرت indicate the pronunciation "hadrat" – Yes it is a formal/religious thing, and I wouldn't be surprised that dictionaries or references don't mention it, since the rule of the thumb is that ض, is pronounced the same as ز but in certain words it is pronounced as /d/, which is why I oppose the 'z letters' all being transliterated as just 'z', since, as I said, it generalises all pronunciations, merely to match it with Hindi. I'd say it's akin to Hindi ष and ण, yet as I mentioned earlier, we transliterate Hindi/Devanagari letters definitely.
if you check the romanizations for ط most Arabic script languages ignore characters without official pronunciations – Apart from Persian and not including Pakistani languages, which language that uses the Arabic script had a detailed discussion on how Arabic loan-letters should be transliterated?
About the tilde, I can change that. But shouldn't a sukoon be used here?? – It doesn't matter whether a sukoon should be used or not. My point was it does not make sense to transliterate the noon ghunna diacritic, which goes on top of the noon, and is implied, be represented with a tilde and not just as the letter 'n' unlike Hindi's anusvara.

What I don't get is that if we do want to keep Hindi and Urdu TR policies alike, well then previously, Urdu transliterations were generated by Module:hi-translit, and it produced the exact same Hindi transliteration, then what's the need to formalise a TR policy for Urdu, or why bother with Module:ur-translit, since Module:hi-translit did the job previously?

Lastly if you are aware of any consistent patterns for Punjabi tones – I've noted a couple down that we can hopefully discuss, it may be difficult to create a module that can correctly generate the IPA for Punjabi lemmas, but it's worth a try, and if not the transliteration could always serve as a backup. @عُثمان could likely also point out some patterns in another discussion.
Punjabi changed their transliteration policy, as it currently ignores tones. – because tones aren't marked in Punjabi, and currently there is no defined way of marking tones in Punjabi in either scripts, hence not needed to mention in the TR policy? نعم البدل (talk) 21:47, 8 September 2023 (UTC)[reply]

Also, just a note, tones were never included in Punjabi's TR policy as far as I know? نعم البدل (talk) 22:08, 8 September 2023 (UTC)[reply]

@نعم البدل "Alike" != "Similar". I want them to be as similar as possible, not necessarily exactly the same. E.g. things like different representations of nasalization seem gratuitously different when the phonology is identical. Benwing2 (talk) 21:51, 8 September 2023 (UTC)[reply]

@Benwing2:

I want them to be as similar as possible, not necessarily exactly the same. – Even though we're talking about completely different scripts and letters can't be directly mapped to each other...

You mentioned that transliteration is more like romanisation and not strictly a transliteration, right? So let's talk about Roman Urdu. Roman Urdu is like a pseudo transliteration/Roman script for Urdu. I don't think I've ever seen nasalisation in Roman Urdu being represented with a tilde alone. In fact the only times I've seen a tilde being used in Roman Urdu was to romanise the nasal vowel in Urdu, and it was done with the letter ñ – similar to ALA-LC's standard, except a different diacritic was used – but the point being that the letter 'n' is always included.

A tilde, and that too alone, to represent nasalisation in Urdu is unconventional. نعم البدل (talk) 22:01, 8 September 2023 (UTC)[reply]

@نعم البدل @Sameerhameedy

Yes, I could share some pointers on IPA transcription for Punjabi, preferably in a new thread if there is interest since it can get quite complicated. I agree that transliterations of Punjabi do not need to be concerned with tone, because Punjabi phonotactics allow the tone to be inferred from the spelling, and are governed by rules as if the original aspirate/breathy voiced consonants are still there. (For example, Punjabi words cannot have two different aspirated stops in them. A word like Bengali “abhidhān” is still not possible even though both bh and dh would lose their aspiration in most Punjabi dialects.) It is actually relatively easy to determine the tonal value for a given word; the challenging factor is syllabification and identifying the stressed syllable. In the word ਭੰਡਾਰ بھنڈار the tone is on the second syllable but in the word ਭਾਰੀ بھاری it is on the first. Meanwhile the word ਭਰਾ بھرا is monosyllabic.

Re: Urdu transliteration, I recommend following the transcription conventions used in Perso-Arabic Loanwords in Hindustani exactly (there are PDFs of this on various sites online). It makes a well informed balance between representing the words in a way which is both true to the phonology of the language and makes clear how these words are spelled in the Perso-Arabic script. This dictionary does not treat Hindi and Urdu separately and relies on multiple sources which predate the Hindi-Urdu controversy. With this in mind, the widespread use of Devanagari is recent and many Devanagari spellings are unetymological. There is no need to transliterate Devanagari spellings at all and Hindi entries should simply use transliterations generated from their Urdu spellings.

The distinction between where to indicate a nasalized vowel as opposed to a cluster with a nasal consonant should be based on the value of the preceding vowel and whether or not it occurs in the final syllable. After a long vowel in the final syllable is the only position in which tilde should be used. That is, ہوند is hõd and likewise ہُوں is hū̃ but ہوندا should be hondā. ہِند and ہِندی would both use n as in hind and hindī respectively. There is one difference between Hindi/Urdu and Punjabi in this regard: in syllables with ā in Punjabi ending in a consonant other than “v,” the nasal is still realized as a consonant. Hence Hindi/Urdu ā̃kh vs. Punjabi bhāng. Turner’s comparative dictionary follows these conventions in transcriptions. This may seem pedantic, but this is what is actually occurring in pronunciation. عُثمان (talk) 23:15, 8 September 2023 (UTC)[reply]
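The positional rule described above can be sketched in code. This is an illustrative Python sketch only, not an actual Wiktionary module: it assumes the word has already been split into syllables given as hypothetical (onset, vowel, nasal, coda) tuples in romanised form, it treats ā, ī, ū, e and o as the long vowels (diphthongs are left out), and it does not implement the Punjabi exception for ā mentioned above.

```python
import unicodedata

LONG_VOWELS = {"\u0101", "\u012b", "\u016b", "e", "o"}  # ā, ī, ū, e, o
TILDE = "\u0303"  # combining tilde

def render_nasal(syllables):
    """Join (onset, vowel, nasal, coda) tuples into a romanisation.

    A nasal is written as a tilde on the vowel only after a long vowel
    in the final syllable; everywhere else it is written as 'n'.
    """
    pieces = []
    last = len(syllables) - 1
    for i, (onset, vowel, nasal, coda) in enumerate(syllables):
        if nasal and i == last and vowel in LONG_VOWELS:
            pieces.append(onset + vowel + TILDE + coda)
        elif nasal:
            pieces.append(onset + vowel + "n" + coda)
        else:
            pieces.append(onset + vowel + coda)
    # NFC so that e.g. o + combining tilde becomes the single character õ.
    return unicodedata.normalize("NFC", "".join(pieces))
```

Under these assumptions the examples from the comment come out as hõd, hū̃, hondā, hind and hindī; the hard part in practice, as noted, is the syllabification that this sketch takes as given.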

@عُثمان Okay, look forward to discussing punjabi with you once i'm available. I do have a question about Urdu though from what you mentioned.

noon + ghunna + gaaf should always equal "ṅg", correct? And noon + sukoon + gaaf should be "ng", correct? It seems those are the only two types of noon before gaaf. Or is noon always "ṅ" before gaaf, regardless?

But it seems that, unlike gaaf, Urdu allows 3 (4 including meem) nasals before kaaf: a nasal vowel, a velar nasal, and/or a dental nasal!! And I'm super confused how to distinguish them from each other in the Urdu alphabet.

should it be:

n + ghunna + kaaf = ā̃k

n + sukoon + kaaf = ank

n + kaaf = aṅk

??? If that's the case I have to give the module an exception for an "n + k" combination, as that's currently not allowed without a diacritic between them. n+k is the only diacritic-less exception needed, correct??

Thank you! سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 23:52, 8 September 2023 (UTC)[reply]

@Sameerhameedy To be completely honest with you here--and others may feel free to disagree with this--I think sukoon/jazm and the gunna marker are misleading in Urdu and there is no reason to use them. Urdu writers often have a preference for a "zer-o-zabar" style of writing as Pakistanis might call it that uses diacritics solely for decorative purposes. Modern Punjabi and Sindhi dictionaries published in Pakistan never use sukoon or the gunna marker.

The distinction between a nasal vowel and consonant depends on the vowel it comes after. So آنک = ā̃kh while انک = ank (aṅk). However, if there is another syllable after as in آنکا then that may be written as ānkā (āṅkā). You are correct to say that before k and g the nasal consonant is specifically the velar ṅ rather than alveolar n, but because this is always true distinguishing this detail is optional. If we say the word "sink" in English, the "n" is not actually in the same place as in "bend." There is only one possible pronunciation of "sink" though, we cannot use a different "n" sound, so no different letter is needed. Sanskrit used to represent ṅ separately in writing but this practice was ended in the modern languages which use Devanagari because there was no need to represent nasal clusters differently from one another. Whether you think ṅ is useful to show is your choice.

vowel + meem + k/g almost never occurs in Urdu in the middle of a word, but meem is always simply m. m is never allowed in clusters at the end of words in Urdu, so a vowel must be inserted if the meem + k/g is the end of the word. عُثمان (talk) 00:30, 9 September 2023 (UTC)[reply]

@عُثمان Since we are not a monolingual dictionary intended for native speakers who already know the vowels, we will use sukuun and gunna. Also I don't understand your statement "There is no need to transliterate Devanagari spellings at all and Hindi entries should simply use transliterations generated from their Urdu spellings"; this is definitely not going to happen. Benwing2 (talk) 00:35, 9 September 2023 (UTC)[reply]

> Since we are not a monolingual dictionary intended for native speakers who already know the vowels, we will use sukuun and gunna.

The transliteration should suffice to aid non-native speakers, and the rules are consistent enough that sukuun and gunna are not necessary to produce accurate transliterations.

> "There is no need to transliterate Devanagari spellings at all and Hindi entries should simply use transliterations generated from their Urdu spellings"; this is definitely not going to happen.

This is like insisting Chinese entries should not use pinyin. Modern Hindi orthography is partly ideographical in a way that Urdu orthography just is not. The word ऋषि is pronounced the exact same as رِشِی while the word आदि is pronounced آد despite the fact that both of these words are written as if they end in the same vowel. Trying to transliterate these exactly is ignoring the fact that Hindi is intentionally written in a way that does not follow consistent patterns. عُثمان (talk) 00:46, 9 September 2023 (UTC)[reply]

If short vowels were never nasalized and long vowels never occurred before a regular noon, then it would be possible to transliterate Urdu without those diacritics. However, since neither of those is true, the module cannot tell whether a noon is noon ghunna or a regular noon without diacritics. Also, to prevent transliterations of blank words, the module will go blank if there aren't enough vowels. So a sukoon is needed for consonant clusters, regardless. Based on Hindi translit, "aṅg" and "ang" are possible but "ãg" is not possible, and is removed. So noon ghunna will represent "ṅ" before gaaf and a nasal vowel elsewhere. Since Hindi translit allows "aṅk", "ank" and "ãk", I will have to count n+k as one consonant, since there's no other way I can think of to show that three-way distinction. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 00:46, 9 September 2023 (UTC)[reply]
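The decision table described in this comment can be written out as a small sketch. This is illustrative Python, not the actual code of Module:ur-translit (which is Lua); the function name and the "NASAL_VOWEL" return value are hypothetical, and the rule set is only what is stated above: sukoon gives plain n, ghunna gives ṅ before gaaf and a nasal vowel elsewhere, and bare noon + kaaf is treated as the single unit ṅk.

```python
GAAF = "\u06AF"         # گ ARABIC LETTER GAF
KAAF = "\u06A9"         # ک ARABIC LETTER KEHEH (Urdu kaaf)
SUKOON = "\u0652"       # ARABIC SUKUN
NOON_GHUNNA = "\u0658"  # ARABIC MARK NOON GHUNNA

def noon_value(mark, next_letter):
    """Classify a noon from its diacritic (or None) and the following letter.

    Returns "n", "\u1e45" (ṅ), "NASAL_VOWEL", or None when the spelling
    gives the module nothing to decide on.
    """
    if mark == SUKOON:
        return "n"  # ordinary dental consonant
    if mark == NOON_GHUNNA:
        # No nasal vowel before gaaf, so ghunna there means velar ṅ.
        return "\u1e45" if next_letter == GAAF else "NASAL_VOWEL"
    if mark is None and next_letter == KAAF:
        return "\u1e45"  # bare n + k counted as one consonant, ṅk
    return None  # ambiguous without a diacritic
```

For example, noon + ghunna + kaaf yields a nasal vowel (ā̃k), noon + sukoon + kaaf yields n (ank), and bare noon + kaaf yields ṅ (aṅk), matching the three-way distinction asked about earlier in the thread.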

think sukoon/jazm and the gunna marker are misleading in Urdu and there is no reason to use them. – The issue is that without the use of the sukoon or the gunna marker, things can get pretty ambiguous. It would be difficult to differentiate Urdu words like حَیرَانْگی (hairāngī) (alveolar n) and آن٘کھ (āṅkh) (velar n). نعم البدل (talk) 00:48, 9 September 2023 (UTC)[reply]
@نعم البدل I would put zabar instead on حیرانَگی since a vocalic release (IPA /ᵊ/ micro-schwa) is necessary to intervene in the cluster as /ng/ does not form true clusters. Even if that were not the case though, words ending in نگی are the only exception I am aware of and we can simply say all words with this ending use alveolar /n/. It is specifically inherited from Persian words following this exact pattern. عُثمان (talk) 00:58, 9 September 2023 (UTC)[reply]
I've never really felt a micro-schwa here, and I couldn't hear it in UDB's recording either. It may only be in Persian-derived words ending in نگی, but we would still need to be able to differentiate between words such as حَیرَانْگی (hairāngī) and جَان٘گی (jāṅgī), and I don't see how a program can differentiate them based on the lemma alone, and without the use of sukoon/ghunna markers. نعم البدل (talk) 01:08, 9 September 2023 (UTC)[reply]
Hmm when I compare those to نارنگی I see what you are saying. There is an affected mute pause between the sounds in the UDB recording which I would expect to hear a voiced vowel during, but this may be a genuine difference between Muhajir Urdu and Urdu as used by Punjabis. If we were to mark sukoon on حیرانگی and words like it though, I think it would be safe to assume نگ is velar in its absence. (Part of my concern here is simply that many Urdu fonts are no longer legible when these diacritics are added. The diacritics are both covered up by گ on my device in this case.) عُثمان (talk) 01:21, 9 September 2023 (UTC)[reply]
@عُثمان:
but this may be a genuine difference between Muhajir Urdu and Urdu as used by Punjabis – That was my first thought as well.

think it would be safe to assume نگ is velar in its absence. – Yeah, seems fair, but we should also include the ghunna, just in case some user does add it in, so that it doesn't just produce a nil error. نعم البدل (talk) 01:28, 9 September 2023 (UTC)[reply]
@نعم البدل Hi just letting you know, while it's still a work in progress and could change. I am probably gonna follow the recommendation of Pakistans Ministry of Education (who regularly the dictionary "Urdu Lughat"), which solely uses ghunna for nasal vowels. However, since Nasal vowels cannot appear before gaaf, gaaf will probably be an exception.
But for the sequences /ŋk/ and /ŋɡ/ urdu lughat uses no diacritics. So the module will probably treat نک and نگ (no diacritics) as a single letter. Unless @Benwing2 can think of a better solution. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 01:16, 9 September 2023 (UTC)[reply]
@Sameerhameedy: I was so confused initially, because I could've sworn UDB did use ghunna markers, and they do use them, for both /ŋk/ and /ŋɡ/. It's when you actually click on the specific word that the ghunna marker comes up, not when retrieving results. See آن٘کھ‎ and جن٘گ‎. In any case, at the minute I'm leaving Module:ur-translit in your hands and only passing my feedback. نعم البدل (talk) 01:22, 9 September 2023 (UTC)[reply]
@Sameerhameedy Ideally there should always be a diacritic between consonant clusters so we can fail the translit if the diacritic is missing. It sounds like per User:نعم البدل the diacritic in /ŋk/ and /ŋɡ/ is ghunna, which seems a good solution. Benwing2 (talk) 01:28, 9 September 2023 (UTC)[reply]
@Benwing2 Well, we can use the ghunna mark before gaaf; it seems like the MOE does that as well. But my main concern was that, unlike before gaaf, where noon only has two pronunciations, for kaaf there is a three-way distinction between ãk, aṅk, and ank. I know for sure that nasal vowel + gaaf is impossible in both Hindi and Urdu. However, in Hindi, nasal vowel + ka is possible and is distinguished from velar + ka. And unlike other phonemes, the nasal vowel + kaaf distinction in Hindi was not borrowed from Sanskrit, and is present in many Hindustani words. Additionally, @عُثمان seemed to indicate that the sequence existed in Urdu as well. Based on what @عُثمان told me though, nasal vowel + kaaf only happens with long vowels and short vowels are always(?) velar + kaaf. If that's the case I can probably use ghunna for both the nasal vowel and velar nasal consonant. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 02:10, 9 September 2023 (UTC)[reply]
@Sameerhameedy:
where noon only has two pronunciations, for kaaf there is a three way distinction between ãk, aṅk, and ank. – Once again putting forward my suggestion that noon + ghunna mark being transliterated as simply "ṉ", while alveolar n (noon + sukoon) as simply "n" could solve all this, while Template:ur-IPA should actually be used to clear up any ambiguity in the pronunciation. نعم البدل (talk) 02:26, 9 September 2023 (UTC)[reply]

That would be no different than only transcribing the sequence as "aṅk" or "ãk" سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 03:09, 9 September 2023 (UTC)[reply]
@Sameerhameedy Can you give some examples of the three-way contrast? Maybe that will help resolve whether we need another (ad-hoc?) symbol. Benwing2 (talk) 03:21, 9 September 2023 (UTC)[reply]

┌────────────────────────────────────────────────────────────────────────────────────────────────────┘

That would be no different than only transcribing the sequence as "aṅk" or "ãk" – Yeah but, instead of having 3 ways of transliterating the noon, you'd have two: 1. when it's nasal, 2. when it's not. And it would be closer in representing the actual script. نعم البدل (talk) 01:12, 10 September 2023 (UTC)[reply]

@Benwing2 (not indenting because it's getting too hard to read.) So I've been reading various papers and found nothing; however, I saw this on Wikipedia:
The palatal and velar nasals [ɲ, ŋ] occur only in consonant clusters, where each nasal is followed by a homorganic stop, as an allophone of a nasal vowel followed by a stop, and in Sanskrit loanwords. However /n/ + velar clusters also occur, eg. /ʊn.kaː/ making /ŋ/ phonemic. Could not find any other information on this unfortunately. It seems as though the nasal consonant "n" is pretty stable in front of velar consonants, but the nasal vowel is not. (Which would explain why hi-translit does not allow ãg). Though in Hindi entries आँख (ā̃kh) and बैंक (ba͠ik) are transliterated differently, I have no idea why that is. In lieu of any information confirming that stable nasal vowels exist before velars, we can go with the assumption that nasal vowels are always unstable before velars and assume that the difference in Hindi is merely orthographic and not distinctive. The only question is then, should n + ghunna + kaaf = aṅk or ãk? سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 03:43, 9 September 2023 (UTC)[reply]

> If short vowels were never nasalized

Short vowels are in fact never nasalized in Urdu/Hindi to my knowledge. Consider the following vocalizations, with velar ṅ just to demonstrate:

کَنک = kaṅk
کنَک = kanak (wheat)

Zabar should be used to indicate when نک does not form a cluster since that is the less common situation. Hindi/Urdu pronunciation does not allow consonant clusters word initially, so the first consonant should always be followed by a vowel if unmarked. عُثمان (talk) 00:51, 9 September 2023 (UTC)[reply]

@Sameerhameedy Er, I should say short vowels are never nasalized in Urdu/Hindi unless followed by /h/ as in مُنہ. I forgot about that as Punjabi always lengthens the vowel in such cases. (Hence مونہہ or مینہہ etc). عُثمان (talk) 01:00, 9 September 2023 (UTC)[reply]

I believe the current Urdu transliteration works quite well and think an overhaul would be such an extensive effort for checking every Urdu page that just isn't necessary as the current system doesn't have a significant number of ambiguities, if at all. SAA2002 (talk) 01:41, 8 September 2023 (UTC)[reply]

@SAA2002: - Hi, sorry. Could you clarify, do you mean the one that differentiates س from ص and ث etc. (i.e. Module:pa-Arab-translit), or the one that's similar to Hindi and just transliterates them all as 's' (Module:ur-translit)? نعم البدل (talk) 08:39, 8 September 2023 (UTC)[reply]

In their most recent edit (before this discussion) they transliterated اُطفاً as "lutfan", so presumably the current transliteration, as that's what they have been using. سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 19:44, 8 September 2023 (UTC)[reply]

@Sameerhameedy @SAA2002 By "current" do you mean the one documented in WT:UR TR and used in most manual entries, or the one since a week ago based on the Panjabi translit module? I'm pretty sure the former but we need to clarify. Benwing2 (talk) 20:49, 8 September 2023 (UTC)[reply]

The one that is documented in UR TR سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 20:51, 8 September 2023 (UTC)[reply]

The one that transliterates them all as "s". Urdu pronunciation is closer to Persian (particularly to Dari Persian) than to Arabic, so it makes sense, since neither language makes a distinction between the sounds in speech. SAA2002 (talk) 15:11, 12 September 2023 (UTC)[reply]

Thanks for clarifying.

so therefore it makes sense since both languages don't make a distinction between the sounds in speech – but we have a pronunciation section for that, to explain the phonology of the lemma. TR should be used as a learning tool, you know, for a user to understand that it's not a native س (s) being employed here. Like I say, we make the distinctions for ण (ṇa), ष (ṣa), ऋ (ŕ) already. نعم البدل (talk) 21:31, 12 September 2023 (UTC)[reply]

@Sameerhameedy You can use {{outdent}} followed by the appropriate number of colons to draw a line indicating the outdenting. I seem to remember going back and forth on whether to transliterate the chandrabindu differently from the anusvara; this has probably been changed at least once. Wikipedia's article on Anusvara has a detailed section on the pronunciation of these symbols and whether they represent a nasal vowel or a homorganic nasal consonant; it seems it depends on the phonological environment, with some lexical exceptions, but it's implied that chandrabindu = nasal vowel and anusvara = homorganic nasal consonant is more theoretical than reality, with a lot of spelling alternations. Note that the existence of lexical exceptions suggests that there is in fact a phonemic difference between nasal vowel and velar nasal consonant before /k/ and /kh/, but there may not be any minimal pairs. Benwing2 (talk) 04:39, 9 September 2023 (UTC)[reply]

@Benwing2 Okay, so for right now I'll make ghunna always be a nasal vowel except before gaaf and qaaf. Urdu Lughat uses noon + ghunna for words that are transliterated as ā̃k and aṅk in Hindi, so I'm under the assumption Urdu has no diacritics to distinguish those. Since ghunna + gaaf and ghunna + qaaf are (seemingly) always ŋg or ɴq, but ghunna + kaaf does not seem to always be ŋk, ghunna + kaaf will not collapse into a consonant (at least right now). سَمِیر | Sameer (^{مشارکت‌ها} • ^{کتی من گپ بزن}) 08:48, 9 September 2023 (UTC)[reply]
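For readers following along, the interim rule stated above (noon + ghunna mark is a nasal vowel, except before gaaf and qaaf where it is a nasal consonant) can be sketched roughly as follows. This is only an illustration in Python, not the code of Module:ur-translit; the function name `classify_ghunna` and the returned labels are made up for the example.

```python
# Hypothetical sketch of the interim ghunna rule; not the actual module code.
NOON = "\u0646"         # ن
GHUNNA_MARK = "\u0658"  # noon ghunna diacritic written above noon
GAAF = "\u06AF"         # گ
QAAF = "\u0642"         # ق


def classify_ghunna(word: str) -> list:
    """Label every noon + ghunna mark in `word`.

    Before gaaf or qaaf the sequence is treated as a nasal consonant;
    anywhere else (including before kaaf) it is treated as a nasal vowel.
    """
    results = []
    for i in range(len(word) - 1):
        if word[i] == NOON and word[i + 1] == GHUNNA_MARK:
            following = word[i + 2] if i + 2 < len(word) else ""
            if following in (GAAF, QAAF):
                results.append("nasal consonant")
            else:
                results.append("nasal vowel")
    return results


# jeem + noon + ghunna + gaaf: consonantal reading before gaaf
print(classify_ghunna("\u062C\u0646\u0658\u06AF"))
# alif madda + noon + ghunna + kaaf + do-chashmi he: stays a nasal vowel
print(classify_ghunna("\u0622\u0646\u0658\u06A9\u06BE"))
```

A noon without the ghunna mark is deliberately ignored here, since the rule above only concerns the marked sequence.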

About the transliteration of Greek initial ντ-

Module:el-translit currently transliterates word-initial ντ-, pronounced /d/, as d-. This clashes with word-initial δ-, pronounced /ð/, also transliterated as d-. This leads to δε (de) and ντε (nte) being transliterated the same even though they're pronounced differently, which means the flaw of the transliteration system is not its irreversibility, but rather the fact that the transliteration is doing a worse job at dictating pronunciation than the original orthography, while what should generally happen is the opposite.

The solution that comes to mind is simply to transliterate the two as nt- and d- respectively. Checking w:Romanization of Greek#Modern Greek I see most systems indeed seem to take this approach. The two systems that don't either spell δ- as dh- or ντ- as ḏ-. I personally don't like either of these two last approaches, though it's important to note we're the only ones merging the two sounds under the same letter, together maybe with things like passport transliteration, which isn't exactly a good starting point for a dictionary.

The analogous μπ- and γκ- don't seem to cause problems as they stand. γκ- is already transliterated as gk-, while μπ- being b- isn't an issue since β- is v-. I personally would transliterate μπ- as mp- for the sake of consistency with nt-, but the systems listed at Wikipedia don't seem to worry about the inconsistency, so neither must we.

I see this had surfaced on the module's talk already, but no solution was reached. @Erutuon, Saltmarsh. Catonif (talk) 20:26, 7 September 2023 (UTC)[reply]
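The clash described in the post above can be shown with a toy transliterator. This is a minimal sketch in Python, not the logic of Module:el-translit; the four-letter table and the `initial_nt` parameter are assumptions made for the illustration.

```python
# Toy letter table covering only the two example words; an assumption for the demo.
LETTERS = {"δ": "d", "ε": "e", "ν": "n", "τ": "t"}


def translit(word: str, initial_nt: str) -> str:
    """Transliterate `word`, rendering a word-initial ντ as `initial_nt`."""
    out = []
    rest = word
    if word.startswith("ντ"):
        out.append(initial_nt)
        rest = word[2:]
    out.extend(LETTERS[ch] for ch in rest)
    return "".join(out)


# Behaviour complained about: initial ντ as "d" collides with δ.
print(translit("δε", "d"), translit("ντε", "d"))
# Proposed behaviour: initial ντ as "nt" keeps the two words apart.
print(translit("δε", "nt"), translit("ντε", "nt"))
```

With `initial_nt="d"` both words come out identical; with `initial_nt="nt"` they differ, which is the whole of the proposal.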

@Catonif Also pinging @Sarri.greek. The issue seems to be whether we translit more like the spelling or more like the pronunciation. IMO we should choose one principle and use it consistently. Benwing2 (talk) 06:19, 8 September 2023 (UTC)[reply]

I like the BGN/PCGN system in this respect, more phonetic but it uses "dh" for most of "δ" readings. Perhaps it should be "ð" or "dh"? Anatoli T. ^{(обсудить}/^вклад) 06:32, 8 September 2023 (UTC)[reply]

@Atitarev I am inclined to agree with you, we should prefer a more phonetic rather than spelling-based representation. I think either ð or dh would work, maybe ð is better since it is a single character. Benwing2 (talk) 06:49, 8 September 2023 (UTC)[reply]

@Benwing2, @Sarri.greek: We do use "th" for "θ", which could use "θ"

δ (d) and θ (th) are a voiced/voiceless pair. They could be both either "ð" and "θ" or both "dh" and "th". Anatoli T. ^{(обсудить}/^вклад) 06:58, 8 September 2023 (UTC)[reply]

I am not a regular Greek editor, but my preference would be to shift the romanizations of modern Greek δ and γ to dh and gh respectively, of word-initial γκ to g, and of medial γκ to nk (the romanization as gk in words like άγκυρα is inconsistent with the romanization of γγ as ng in words like αγγλικός).--Urszag (talk) 08:29, 8 September 2023 (UTC)[reply]

@Urszag I am broadly in agreement with you. My only potential difference would be to use IPA symbols for the fricatives, hence ɣ ð in place of gh dh, but I won't insist on this. Benwing2 (talk) 08:32, 8 September 2023 (UTC)[reply]

Speaking of inconsistencies between letter-based and sound-based transliterations, ει, η, ι, οι, υ are all spelled differently and pronounced the same, but our transliteration is a mixed bag, as η and ι are transliterated the same (i), while ει, οι, υ are each transliterated differently (ei, oi, y). αι, ε are also pronounced the same but transliterated differently (ai, e); but ο, ω are transliterated the same (o). If we go for a letter-by-letter transliteration, then for consistency η and ι should be transliterated differently, as should ο and ω. If we go for a sound-by-sound transliteration, then everything pronounced /i/ should be transliterated the same, as should everything pronounced /e/ and everything pronounced /o/. —Mahāgaja · talk 09:02, 8 September 2023 (UTC)[reply]
@Mahagaja Agreed. Benwing2 (talk) 09:01, 8 September 2023 (UTC)[reply]
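The mixed treatment of the vowels listed above can be tabulated mechanically. The following Python sketch takes only the sounds and transliterations quoted in that post and reports which same-sounding spellings are merged and which are kept apart; the table layout and function names are assumptions for the illustration.

```python
from itertools import combinations

# spelling -> (sound, transliteration), exactly as quoted in the post above
SPELLINGS = {
    "ει": ("i", "ei"),
    "η": ("i", "i"),
    "ι": ("i", "i"),
    "οι": ("i", "oi"),
    "υ": ("i", "y"),
    "αι": ("e", "ai"),
    "ε": ("e", "e"),
    "ο": ("o", "o"),
    "ω": ("o", "o"),
}


def same_sound_pairs():
    """Yield pairs of distinct spellings that share a pronunciation."""
    for a, b in combinations(SPELLINGS, 2):
        if SPELLINGS[a][0] == SPELLINGS[b][0]:
            yield a, b


def merged_pairs() -> list:
    """Same sound AND same transliteration (sound-based behaviour)."""
    return [(a, b) for a, b in same_sound_pairs()
            if SPELLINGS[a][1] == SPELLINGS[b][1]]


def split_pairs() -> list:
    """Same sound but different transliteration (letter-based behaviour)."""
    return [(a, b) for a, b in same_sound_pairs()
            if SPELLINGS[a][1] != SPELLINGS[b][1]]


print("merged:", merged_pairs())
print("split: ", split_pairs())
```

A purely letter-based scheme would leave the merged list empty, and a purely sound-based scheme would leave the split list empty; the current table does neither.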

[from wikt:el:User:Sarri.greek, notifying @Atitarev, Benwing2, Catonif] Transliterations-policy. From latin to arabic, cyrillic, greek, han>pinyin, ... scripts. From arabic to cyrillic, greek, han>pinyin, latin... From cyrillic to arabic, greek, han>pinyin, latin,... and so on. I wish there were ISO translit-lists and standalone modules at Commons for all wiktionaries to use, as it is difficult to find all ISOs for all from.to combinations, plus monitor possible official changes.
At the moment, for greek-to-latin:

1) compulsory for Modern Greek (el): according to ISO (the corresponding greek institution ELOT), as in passports, products etc. Not a matter of preference.
2) a second transliteration: the popular one, which usually is more phonetic (note, that in greek passports we are allowed to use both 1 +of.our.own.choice, e.g. I am E.Sarri (Ekaterini Sarri), K.Sarri (Katerina Sarri) and A.Sarri (Aikaterini Sarri). Cashiers believe I am a crook using 3 identities.)

If your policy is 1, at el.wiktionary it is wikt:el:Βικιλεξικό:Μεταγραφές/ελληνικά#Πίνακας column ΕΛΟΤ = w:en:ISO_843. For earlier Greek from ancient up to 1982, the columns 'grc'. At en.wikt, there are interventions to ISO as in the column 'άλλα' (other), but this is in fact the following:
If your policy allows also 2 it is the last column 'άλλα (other)' which gives the 'popular' ones, as initial Nt=D, initial Mp=B (b latin), Δ=D becomes Dh. (Tricky: also middle mp can be 'b', nt=d, etc if the word is non-greek. e.g. μπαμπάς = babás, not the ridiculous bampás or mpampás. These can be done only manually.)
For transliterations of greek surnames (example at el.wikt) we are allowed to give ISO+popular+earlier (manually anyway, not having the luxury of modules).
It seems, that at the moment en.wikt uses a combination of 1&2, which unfortunately may vary and change at any moment. Since all wiktionaries copypaste from en.wikt, it would be nice if you decide on fixed policies. Would be nice to allow a second manual translit. Thank you. ‑‑Sarri.greek ^♫ I 09:52, 8 September 2023 (UTC)[reply]

I think there is a larger discussion to be had on what to do in general in cases where the spelling makes extra distinctions not found in speech. See the parallel discussion above about Urdu. Current Persian translit transliterates multiple Arabic letters the same, reflecting Persian pronunciation. The same is done for Modern Hebrew translit, Ottoman Turkish translit and for Urdu translit, although User:نعم البدل is trying to change this for Urdu. This suggests we should do the same for Modern Greek, meaning all symbols pronounced as /i/ should be reflected as i, same as e and o. OTOH there is some benefit to maintaining the source distinctions in the translit, so maybe we need a system that reflects pronunciation whenever possible but without merging differently-spelled letters (to the extent possible). But maybe it's better just to require people who want to know the source spelling to just look at the source spelling directly. Benwing2 (talk) 10:10, 8 September 2023 (UTC)[reply]

┌────────────────────────────────────────────────────────────────────────────────────────────────────┘Created before reading @Sarri.greek

Referring to Wiktionary:Greek transliteration we should probably note the point made there:

The transliteration of Greek letters into Roman characters is not intended to provide a phonetic representation of a word. The correct place for that is under the Pronunciation heading.

And it is impossible for transliteration of Modern Greek to work both ways — so we need to confirm that, although it does give a simple guide to pronunciation that is not what we are about. That being the case, and if ELOT 743's guidance is to be followed the letters ν & τ can be transliterated separately as n & t wherever they occur.

Mea culpa: the error crept in in March 2013 when Module:el-translit was being created. I was contacted; it was a few years since I had created the table. I referred to the Library of Congress guidance, which indeed does recommend that initial ντ transliterates as d. I think we should ignore this and follow ELOT 743 and the Greek govt's advice, and transliterate them separately wherever they occur.

And Benwing's comment re η & ι >> i is correct. — Salt marsh ^🢃 10:16, 8 September 2023 (UTC)[reply]

I agree that closeness to IPA is not the scope of a transliteration. Using IPA characters in Greek transliteration is at least for me completely out of the question. I also agree on fully endorsing the transliteration (not the transcription) of ELOT 743 = ISO 843, i.e. the third column of the table at en.wiki. Sarri makes a good point mentioning μπαμπάς: if the module can't recognise the second μπ as b, to keep things fair we should keep the first digraph as mp- as well, since having the same digraph pronounced the same but transliterated differently in the same word is indeed wild, making a transliteration like bampás outright misleading. The only divergence I'd instinctively suggest is γ as n before all velars, even in initial γκ-, for consistency with mp- and nt-, but given absolutely no transliteration system seems to do this I guess I'm the weird one, and overall respecting to the letter a strong standard from ELOT/ISO is definitely a step forward from following our own preferences. Catonif (talk) 12:11, 8 September 2023 (UTC)[reply]

I hadn't realized that many of the inconsistencies (as it appears to me) in Wiktionary's romanization system are found in existing standards. It does seem simplest to follow them, although I'm now somewhat questioning the value of including this kind of romanization in the first place--we don't have space limitations, but it's not like the Greek alphabet is difficult to learn, and if the romanization isn't meant to provide an accurate guide to either pronunciation or spelling, what use is it? We don't need to create passports with an English language field for our words, after all. (For comparison, I just checked our Ukrainian entries and see that they do not seem to use Ukrainian-passport-style romanization--e.g. Зеленський is romanized on Wiktionary as Zelénsʹkyj.) Can we compare what other Greek-English dictionaries do? It looks like WordReference doesn't bother with giving a romanization.--Urszag (talk) 12:48, 8 September 2023 (UTC)[reply]

You bring up a reasonable point, most linguistic resources don't seem to transliterate Greek. Many etymological dictionaries in Latin-script languages leave Greek as it is, while they give, e.g., Cyrillic languages only as transliterated. But it must be noted though that these are all linguistical contexts where such knowledge is rightly taken for granted, while we on the other hand stand in the middle of the Internet, and I'm sure there are many people that find the transliteration very useful. It doesn't hurt anyways, when done well, aside from taking some more space. Catonif (talk) 15:29, 8 September 2023 (UTC)[reply]

I agree that Wiktionary is not the appropriate place for leaving Greek untransliterated. I think using ISO 843 is a good idea. —Mahāgaja · talk 16:00, 8 September 2023 (UTC)[reply]

I think it's good for us to have a house style. We don't need to follow anyone else's standards. I think our current system is the best one except for the issue raised up above where there are two d's ... if this evolves into a formal vote the only change I would recommend is rewriting δ as dh. —Soap— 16:09, 8 September 2023 (UTC)[reply]

@Saltmarsh Most transliteration systems at Wiktionary are phonetic when possible, e.g. in Russian we transliterate г as v when it's pronounced as such, and we transliterate е as ɛ to indicate that the preceding consonant is unpalatalized; but at the same time we don't apply akanje, i.e. we don't merge unstressed vowels in translit even though they are merged in pronunciation. I think it's extremely misleading to transliterate ντ as nt when that is not at all how it's pronounced. I agree with User:Soap we don't need to follow someone else's standards; other languages don't insist on following some particular standard, for example. If we follow the Russian example, we should maintain all vowel distinctions but transliterate ντ as d word-initially and nd word-medially. If there are words where ντ word-medially is pronounced as d, those require manual translit. Benwing2 (talk) 21:15, 8 September 2023 (UTC)[reply]
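The semi-phonetic handling proposed in the post above (ντ as d word-initially, nd word-medially, with a manual override for exceptions) could look roughly like this in Python. It is a sketch only: the tiny letter table, the function name and the `manual` parameter are hypothetical, and treating medial ντ in πέντε as nd is an assumption made for the example.

```python
from typing import Optional

# Tiny illustrative letter table (an assumption; a real table covers the alphabet).
LETTERS = {
    "π": "p",
    "\u03AD": "\u00E9",  # έ -> é
    "ε": "e",
    "ν": "n",
    "τ": "t",
}


def translit_semi_phonetic(word: str, manual: Optional[str] = None) -> str:
    """Render ντ as d word-initially and nd elsewhere, unless overridden."""
    if manual is not None:
        # Lexical exceptions (e.g. medial ντ pronounced as plain d) are supplied by hand.
        return manual
    out = []
    i = 0
    while i < len(word):
        if word.startswith("ντ", i):
            out.append("d" if i == 0 else "nd")
            i += 2
        else:
            out.append(LETTERS[word[i]])
            i += 1
    return "".join(out)


print(translit_semi_phonetic("ντε"))             # initial digraph
print(translit_semi_phonetic("π\u03ADντε"))      # medial digraph
print(translit_semi_phonetic("μπαμπάς", manual="babás"))  # manual override
```

The override path is what makes the approach costly: every word whose medial digraph is not nd has to be found and corrected by an editor.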

@Benwing2 How is nt "misleading when that is not at all how it's pronounced"? It's an orthographical rule, all languages have orthographical rules. I find it misleading that Irish baothchaifeach is spelled like that even though it's "not at all how it's pronounced", yet it's not like we add |tr= to Irish, we just leave that matter to the pronunciation section. Saying that something "is not written the way it's pronounced" makes no sense. I, as an Italian, could equally say that English is not written the way it's pronounced because Anglophones don't pronounce night as */ˈniɡt/, and they could say the same of my language because we don't say */pɪsˈtætʃiəʊ/, even though both spellings make perfect sense when in the context of their respective languages. Nothing is "not the way it's pronounced", that just means it's following a different system, neither better nor worse than the system you may be used to. Who tells us that nt should stand for /nt/? Because IPA? Because English? Greek transliteration is Greek transliteration, and if Greek transliteration says that nt- is /d-/, there is nothing that can contradict it.

It also seems like we need to clarify, transliteration is not transcription, as Saltmarsh noted. I would like people to embrace the distinction: transliteration is orthographic, transcription is phonetic. Adding manual "transliterations" to differentiate between medial /b/ and /mb/ while still calling them "transliterations" would be an outright lie, as it would imply the two are somehow differentiated by the orthography. As emerged from this talk's branch with Urszag and Mahagaja, transliteration is meant to aid people who don't know the script, not people who don't know Greek orthography rules. Again, that's a job for the pronunciation section, as with Irish, Italian and English, as with any other language.

I'm not knowledgeable about Russian, so I'm not sure how to best address the examples you provided, but I'll try, correct me for any of my inaccuracies. If ⟨г⟩ as v can be more or less equated to how Greek ⟨γ⟩ stands for n before velars, and may be transliterated as such, then I can see why it would make sense, but as you mentioned we don't transliterate akanje, nor do we transliterate voicing assimilation (e.g. votka) and with good reasons; I can only imagine how well the Russian editing community would take these suggestions. I can equate ⟨г⟩ as v to ⟨γ⟩ as n, but I believe keeping ⟨ντ⟩ as nt is no different from keeping ⟨д⟩ as d when it stands for /t/ or ⟨о⟩ as o when it stands for /ɐ/. (I'm addressing this only for completeness, I hope the discussion doesn't fully branch out into Russian. The important parts of this reply are the first two paragraphs.) Catonif (talk) 22:20, 11 September 2023 (UTC)[reply]

Oops, I'd missed that the standards you and Saltmarsh wanted to follow were the strict transliterations (and so, you want to change the current behavior of the module). I was confused because ELOT 743 refers to (at least?) two standards, and the transliteration standard is not the one used on passports (which had been mentioned by Sarri.greek as a context where ELOT 743 was "compulsory for Modern Greek"). I definitely find it more sensible to include the ELOT 743 transliteration rather than the ELOT 743 transcription. That would mean getting rid of b.--Urszag (talk) 22:52, 11 September 2023 (UTC)[reply]

@Catonif: I support a more phonetic but not entirely phonetic transliteration. We transliterate г (g) as "v", ч (č) as š when they are (unexpectedly) pronounced as [v] and [ʂ], even though there is a grammatical reason for that.

With letters e.g. д (d) or о (o), we don't transliterate them differently, since consonant devoicing/voicing, vowel reduction is standard and predictable and can be even applied to the transliterated reading when stress is provided (and it is).

I'd like to handle μπ and ντ, etc. the same way. I treat them as digraphs.

The Greek alphabet is not hard to learn, having it transliterated more phonetically to the user is more beneficial than trying to render each letter verbatim.

Our current handling of Korean, Arabic, Persian, Thai and Khmer is much closer to phonetic pronunciation than to spellings.

@Urszag: Regarding Зеле́нський (Zelénsʹkyj), passport offices in Ukraine, Russia, many ex-USSR countries (sorry for grouping them, it's not a political statement) nowadays Anglicise surnames, so "Zelenskyy" is the spelling in Zelenskyy's passport, although it could also have been "Zelensky" (with one y). Anatoli T. ^{(обсудить}/^вклад) 06:04, 12 September 2023 (UTC)[reply]

@Catonif I agree very much with User:Atitarev here. I think transliterating the Greek alphabet letter-for-letter is more or less useless as Greek isn't hard to learn to read and just rendering the letters doesn't contribute so much information, esp. compared with a semi-phonetic non-lossy rendering (which is what I advocate). BTW if we can't agree on anything I would suggest as a first pass that we just fix the issue of having d mean two different things by rendering dhelta (δ) as 'dh'. Benwing2 (talk) 07:08, 12 September 2023 (UTC)[reply]

I used to like the idea of having Ancient Greek romanizations match the pronunciation, but from what I've read, a module can't determine the correct pronunciation of all nasal-stop or nasal-fricative letter clusters based solely on the spelling, so I'm reluctantly in favor of a more strict transliteration that doesn't try to match the pronunciation. I made a weird script that modifies Modern Greek transliterations to be even more phonemic, but it fails because of δε (de) and ντε (nte) having the same transcription, which is fixable, and because μπ and ντ and γκ each could have two (voiced stop or nasal and voiced stop) or maybe even three pronunciations (nasal and voiceless stop in a loanword??), which can't be determined automatically. Another ambiguity is that the second γ in γγ can be a stop or fricative: fricative in συγγραφέας (syngraféas), stop in συγγενής (syngenís). Not sure if the same ambiguity applies to μβ and νδ. To solve these ambiguities, someone would have to find the words that the module fails to transcribe phonemically and correct them. In theory that could be done, but it's work that doesn't really help readers of the dictionary that much. — Eru·tuon 01:13, 12 September 2023 (UTC)[reply]

I am sorry to have to write TL;DR, well not all of it anyway. I must leave it to others to decide. — Salt marsh ^🢃 11:19, 12 September 2023 (UTC)[reply]

(Building on Catonif's point) In the past, I've been sceptical of the idea of having two transliteration + transcription parameters (or at least outputting two values for tr=), but I've increasingly come around to the idea, because it's clear that both "transliterate the letters into Latin letters" and "provide an enPR-esque pronunciation as the transliteration" are things that people want. And I can see why each 'side' thinks the other idea is silly ("if you want to reproduce the distinctions present in the original letters, why not just learn the original script? why have all the distinct letters in two places, once in the original script and then again in the transliteration?" / "well, if you want to provide a pronunciation, why not just provide a pronunciation, in IPA or another system? why have the pronunciation in two places, once in IPA / in the pronunciation section, and once in whatever idiosyncratic respelling we come up with?"), but in fact maybe we should just have both...!
(Not strictly on topic, but I have occasionally had cause to mention a word's pronunciation in e.g. some other word's etymology section, when the precise pronunciation is relevant and isn't intelligibly reflected in the transliteration/transcription. So I do see the utility to providing IPA pronunciation information in places other than just the word's own pronunciation section!) - -sche (discuss) 03:27, 13 September 2023 (UTC)[reply]
@-sche Yes, this might successfully "split the difference". We are running into similar issues with Urdu (and Panjabi, etc.) in terms of whether to transliterate all the different Arabic letters differently even though several of them have the same pronunciations. This issue has also come up in Hebrew (Biblical Hebrew vs. Modern Hebrew, which use the same entries; current practice is to use the Modern Hebrew pronunciation but this is unsatisfactory for terms that are either primarily Biblical or equally Biblical and Modern), Persian (Iranian Persian vs. Dari/Classical), Japanese (do we present both hiragana and Latin transliterations?), etc. Given all the tweaking currently being done to core language modules it seems like we could definitely implement this. We'd need a syntax to allow for inputting multiple manual translits and specifying the system of each, but this should not be too hard to create. It could be argued that we should use the ts= field for transcriptions and the tr= field for transliterations but this doesn't seem a general solution esp. since some of the cases of multiple translit cannot reasonably be subdivided into transliteration vs. transcription (like the different Persian translits). Benwing2 (talk) 03:42, 13 September 2023 (UTC)[reply]
@-sche, @Benwing2: The idea to transliterate different letters with the same sounds differently keeps coming up but the support for that idea is never strong. E.g. Persian, Urdu letter ط (t) has the same reading as ت (t), even though they have different values in Arabic. There are too many such cases. It's more problematic when for both ντ (nt) and δ (d) we use "d" (different sounds) but using "nt" for ντ (nt) when it's pronounced /d/ seems silly.

Look, hiragana character は (ha) ("ha", not "wa") is transliterated as "wa" when it is a grammatical particle to match the pronunciation (this is standard in most Japanese transliterations) アテネはギリシャの首都(しゅと)である (Atene wa Girisha no shuto dearu, “Athens is the capital of Greece”)

(I use it not necessarily to convince you but others who still doubt that a phonetic transliteration is common and good.) Anatoli T. (обсудить/вклад) 04:55, 13 September 2023 (UTC)[reply]

About the Tupi-Guarani family

The family tree used in Wiktionary puts Nheengatu directly under Proto-Tupi-Guarani, thus making it a sister language of Old Tupi, which is not the case. Old Tupi evolved into the Amazonian General Language and this gave rise to modern Nheengatu. Neither the Amazonian nor the Southern General Language has an ISO code, so we would have to make some up if we want to add them. These languages are also needed for the Etymology section of some Portuguese words and for the inh+ template to work in Nheengatu.

Also, recent studies have added an intermediary proto-language between Proto-Tupi and Proto-Tupi-Guarani, so the chain is now Proto-Tupi > Proto-Mawé-Guaraní > Proto-Tupi-Guarani; Sateré-Mawé and Awetí are considered cognates with Proto-Tupi-Guarani under this. If it were added, I suggest its code be mav-gua-pro to follow the standard.1 2

Trooper57 (talk) 06:50, 8 September 2023 (UTC)[reply]

@-sche Any ideas here? I don't have the background to comment. Benwing2 (talk) 21:51, 8 September 2023 (UTC)[reply]

Reclassifying Nheengatu as a descendant of Old Tupi is easy enough to do, independent of changing or adding any other codes (I have changed it now). Regarding the General Languages, how different are those from each other and from "tpw" Old Tupi and from "tpn"? This affects whether they need to be completely separate languages with their own ==Headers==, or whether they just need "etymology-only" codes for 'stages' of Old Tupi and/or tpn, so etymologies can specify that they were borrowed from that stage. With Proto-languages, are they so different that we need to be reconstructing them all; are there works reconstructing lots of words in each stage? Or can we get by with saying various Tupi words derive from Proto-Tupi? Our codes are based on ISO codes, which "mav" doesn't seem to be, so the codes should probably start with "tup", the Tupian family code. - -sche (discuss) 23:02, 8 September 2023 (UTC)[reply]

"tpn" is labelled as "Tupinambá", so I guess it refers to the dialect of Old Tupi spoken by the Tupinambá people, rather than General Language. If so, "tpw" would be Old Tupi in general, like the difference between "pt" and "pt-BR".

The General Languages originated from two different dialects of Old Tupi: Amazonian (or Northern) from Tupinambá and Southern from the dialect of the São Vicente captaincy (nowadays São Paulo). As for the Old Tupi dialects, the syntax was the same; the main differences were the pronunciation and some different nouns (like sea urchin, which was "pindá" in tpw and "pinda'yba" in tpn), which gave rise to slightly different lexicons in the two GL. Compared to Old Tupi, the syntax of the GL was reshaped and became closer to Portuguese, with verbs coming right after the pronoun rather than at the end of the sentence, and sounds were simplified, with loss of ɨ and ʔ and addition of vowels to consonant-final words. Not to mention the borrowings from Portuguese. Maybe we could put both under a "Língua Geral" or "Brazilian General Language" header and label words from the Southern dialect with {lb}, since it became extinct fairly quickly and is much less documented than Northern even nowadays; it also didn't evolve into any new language, unlike the other.

As of now, I've only found Nikulin's work regarding Proto-Mawé-Guaraní reconstructions, with about 107 words [Appendix G, page 566]. It's not a big deal really, we could stay with PT and PTG or "ultimately from PT".

Summarizing:

Old Tupi (tpw): language spoken by Brazilian indians in 16th c. and before; had many dialects, with two known for certain; extinct;
Tupinambá (tpn): a dialect of Old Tupi spoken in most of Brazil; extinct > evolved into Northern GL in the 17th c.;
São Vicente's Tupi (no code): a dialect of Old Tupi spoken in present-day São Paulo; extinct > evolved into Southern GL around the 17th c.;
Amazonian/Northern General Language (no code): evolution of Tupinambá with Portuguese influence; extinct > evolved into Nheengatu in the 19th c.;
Paulista/Southern General Language (no code): evolution of São Vicente's Tupi with Portuguese influence; extinct in the 20th c. with no descendants.
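
For anyone checking how the inh+ chain would look under this summary, here is a minimal sketch of the proposed descent as a parent map. The names are plain labels rather than real language codes (most of these have no code yet), the functions are hypothetical, and this is not how the Wiktionary language modules store the data. The optional Proto-Mawé-Guaraní stage is left out.

```python
# Parent relations as summarized above. Keys and values are descriptive
# labels (assumption), not Wiktionary or ISO language codes.
PARENT = {
    "Nheengatu": "Northern General Language",
    "Northern General Language": "Tupinambá",
    "Southern General Language": "São Vicente's Tupi",
    "Tupinambá": "Old Tupi",
    "São Vicente's Tupi": "Old Tupi",
    "Old Tupi": "Proto-Tupi-Guarani",
    "Proto-Tupi-Guarani": "Proto-Tupi",
}


def ancestors(lang):
    """Return the chain of ancestors of lang, nearest first."""
    chain = []
    while lang in PARENT:
        lang = PARENT[lang]
        chain.append(lang)
    return chain


def can_inherit_from(lang, source):
    """True if source is an ancestor of lang, i.e. inheritance is plausible."""
    return source in ancestors(lang)
```

With this map, Nheengatu can inherit from Old Tupi (via Tupinambá and the Northern General Language) but not from the Southern General Language, which matches the point that the Southern variety left no descendants.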

[1][2][3][4] (in Portuguese, it's somewhat difficult to find references about this in other languages)

[5] (in English)

Also, there's a recent effort to expand Língua tupi and related pages in pt.wikipedia, namely by @Bageense. There are also others like @NoKiAthami in English Wikimedia interested in the topic.

Trooper57 (talk) 04:33, 9 September 2023 (UTC)[reply]

@Trooper57 @-sche I would prefer if we create a language code for Língua Geral that it be named using the Portuguese term rather than English "General Language", which seems not to be used except as a gloss of the Portuguese term. Benwing2 (talk) 04:47, 9 September 2023 (UTC)[reply]

Yeah I wasn't sure of the name in English. Then "Língua Geral Setentrional" for Northern and "Língua Geral Meridional" for Southern if we want to be more specific. Trooper57 (talk) 04:53, 9 September 2023 (UTC)[reply]

鸭绿 Template interaction between zh-see and place leads to unsightly result

The template interaction between zh-see and place leads to an unsightly result: "(“[[Yalu”)." appears on 鸭绿. --Geographyinitiative (talk) 11:16, 8 September 2023 (UTC)[reply]

You posted about this already at Wiktionary:Grease_pit#Problem_in_zh-see. It'd be nice to see more response, yes, but it'd also be nice to keep the discussion in one place. I replied to you there to hopefully get more discussion going. —Soap— 14:20, 8 September 2023 (UTC)[reply]

"rare" and "uncommon" once again

These are not currently defined in a particularly helpful way ("rare" at the glossary just says not used commonly, and "uncommon" says not common but more common than "rare"). There doesn't seem to be much rhyme or reason to how they're applied in practice: on a few occasions I've even seen them apparently being used to smuggle in prescriptive opinions, i.e. slapping "rare" labels on senses someone doesn't like or isn't personally familiar with.

I doubt a specific quantitative threshold will be useful cross-linguistically, but it would be good to work out a definition that can be applied more consistently by editors than the current ones. My personal rule of thumb has been that a word is "rare" if it takes more than minimal effort to find adequate attestation, and "uncommon" if it's not that difficult to find but a reader (even a specialist) is very unlikely to encounter it organically. This issue has been discussed before, most recently I think here in Jan. 2022, but it hasn't come to a conclusion. —Al-Muqanna المقنع (talk) 18:46, 8 September 2023 (UTC)[reply]

@Al-Muqanna Thanks for bringing this up. Last time this came up I think I advocated for merging the two on the basis that they weren't and couldn't be distinguished consistently. I think your distinction makes sense although it definitely requires human judgment, and I have my doubts about whether people will actually follow it. Benwing2 (talk) 21:42, 8 September 2023 (UTC)[reply]

I thought some linguists have ordinal scales for how humans interpret such words and phrases. I'm pretty sure that we can help users more by maintaining some kind of distinction. "Rare" certainly means less common than "uncommon". Neither is normally applied to words that are not considered principally to be used in some particular register or usage context. I expect that most normal users would use our definition to decode whatever passage they found the term in and forget it whether the label read "rare" or "uncommon". Some users (writers, mostly, in my vision, like Thomas Pynchon) might differentiate and decide some uncommon words they happened to like were worth deploying. Or are these tags just useful for us? Maybe they are just our way of expressing displeasure at a word that required more effort on our part than seemed worthwhile? I certainly need some similar motivation to bother with that kind of label. DCDuring (talk) 00:11, 9 September 2023 (UTC)[reply]

@DCDuring For foreign languages, rare and uncommon are very useful in indicating senses that don't occur much; otherwise, the more common senses get overwhelmed by the multitude of uncommon ones. For English specifically, with the target being native English speakers, the target users already have some sense of whether a definition is rare/uncommon because in that case they won't know it; but even then I think the tag can help people (both native English speakers and L2 speakers) trying to write good English, to know whether they have to be careful with using a particular sense. Think of ludicrous machine-translated Chinglish where the machine used obsolete or rare senses of English words, or Kim Jong-un trying to insult Trump and coming up with "dotard"; in both cases the dictionaries clearly didn't do a good job of identifying the terms or senses as rare/uncommon. (For that matter, we don't label "dotard" at all; the only hint that it is dated or rare is that all cites, except the one from Kim Jong-un, are <= 1867.) Benwing2 (talk) 00:31, 9 September 2023 (UTC)[reply]

Yes, I am mostly interested in English and don't expect to impose what we should IMHO do for English on any other languages.

I'm certainly not objecting to having them both, but I doubt that there is any point in quantifying. We usually stop at three to five cites per definition. The tag "rare" sometimes results from "It took a lot of searches and corpora for me to get even three good cites". "Uncommon" often has an element of 'low frequency relative to synonyms'. If no learner dictionaries and some other 'unabridged' dictionaries don't have it, that's also support for some kind of tag. In the absence of something better, that can be 'uncommon' or, if only the OED has it (or not even the OED), 'rare'. DCDuring (talk) 01:33, 9 September 2023 (UTC)[reply]

@DCDuring: I think your description here is basically comparable to my personal understanding of the terms, i.e. that "rare" means hard to find even when you set out to find it and "uncommon" just means notably infrequent but not necessarily hard to find. —Al-Muqanna المقنع (talk) 12:27, 9 September 2023 (UTC)[reply]

How about we decide on an arbitrary small number, like 7, and say that anything with less than 7 hits is classed as rare. It would be an awesome vote to decide what the threshold would be, the 7-brigaders battling against the 6-brigaders, I can't wait! Jewle V (talk) 00:19, 9 September 2023 (UTC)[reply]
Lol, this is what I mean about trying to set fixed thresholds probably not being very useful. —Al-Muqanna المقنع (talk) 11:22, 9 September 2023 (UTC)[reply]
These labels stack with others. If a term is slang and rare, it will be extra-hard to find and unlikely to be “organically” encountered, whatever that means – guess it means doomscrolling arguments on the internet again which you previously decided not to rebrowse. Then the rule of thumb is applied in a hypothetical, extrapolated fashion—typical Belizean English is what I would not find with a certain likelihood if living in Belize or consuming Belizean sources which aren’t necessarily preferred by search engines to open us the windows into other worlds. There is reason but no rhyme. Fay Freak (talk) 01:06, 9 September 2023 (UTC)[reply]
"Organically" means encountered in the ordinary course of one's linguistic experience rather than searching for the term so you can find attestations for Wiktionary. My qualification on "even a specialist" (perhaps better put as even a specialist or member of the relevant community) was intended to encompass slang and jargon, i.e. an Internet slang term will certainly be uncommon for people who don't use the Internet much but should only be marked "uncommon" if someone who does is notably unlikely to ever encounter it. —Al-Muqanna المقنع (talk)

I would urge studying the theory of how other dictionaries use "rare/uncommon" and then comparing words considered rare/uncommon elsewhere to those Wiktionary has or has not labeled rare/uncommon. --Geographyinitiative (talk) 11:45, 9 September 2023 (UTC)[reply]

English dictionaries don't generally use an "uncommon" label, afaik, they just have "rare" if they bother at all (Merriam-Webster doesn't as far as I can tell). The OED does have technical frequency bands based on occurrences per million words (see here) but they don't appear to apply the label "rare" based on these bands, e.g. they list gurhofite as an example of the least common band 1 but it is not marked rare in the entry, perhaps because it's mineralogy jargon and so expected to be low-frequency; grithbreach, another example they give, is not marked rare either, just obsolete or historical. abaxile, on the other hand, is indeed marked "now rare", despite the relevant sense being botany jargon. —Al-Muqanna المقنع (talk) 12:16, 9 September 2023 (UTC)[reply]

So what I take from this discussion is that we do it right, better than anyone else, and you just want to redefine the Appendix:Glossary definitions, to match the actual usage, and a definition at all … fine. Also fine if you weren’t sure at first what you want and everyone had to make a point on it. Fay Freak (talk) 12:30, 9 September 2023 (UTC)[reply]

@Fay Freak: The point is to settle on more helpful glossary definitions which can be applied consistently, yeah. I don't think we necessarily do it right since it's applied inconsistently in practice, but if there's already an understanding shared by most established editors then we should update the glossary and it'll be easier for readers to understand and other contributors to follow it consistently. —Al-Muqanna المقنع (talk) 12:35, 9 September 2023 (UTC)[reply]

I would perhaps suggest that the use of "rare/uncommon" may be a kind of defect if the other dictionaries do not use "rare/uncommon", and that some better method of quantifying relative usage may be in order, perhaps piggybacking on OED or Google NGrams. I come from the perspective of seeing Wiktionary as in a "primitive" stage of development. --Geographyinitiative (talk) 13:17, 9 September 2023 (UTC)[reply]

@Geographyinitiative I disagree, rather I honestly think Wiktionary often does a much better job in deploying specific judgements about the nature of the words it covers. If other dictionaries don't mark a word as "rare", when in fact it really is rare, then I think we are legitimately being more precise about it. I don't know how this translates into editors' practice, but principally distinguishing rare and uncommon entries from the general vocabulary is helpful in my opinion, and stops you from sounding weird if you try to use a word somewhere in the 100,000s place in the frequency list. As long as our internal definitions make sense, and the glossary helps readers to understand them, then surely all we need to do is realize that in practice. Kiril kovachev (talk・contribs) 13:32, 9 September 2023 (UTC)[reply]

I would like to know: what do Wiktionary's "rare/uncommon" terms correspond to in OED's technical frequency bands? If there's a correlation between Wiktionary's existing designations and OED's technical frequency bands, that could be something worthwhile to add to an updated definition of "rare/uncommon". Or, perhaps "rare/uncommon" is not connected to actual frequency and is merely a judgment compared to similar words. If the designation is subjective rather than "corpus frequency", that might be something to mention in a revised definition. The current definitions of "rare/uncommon" could mean literal "corpus frequency"; a revised definition might try to say "relative to related terms" or something like that. But there has got to be a reason why the other dictionaries may not delve into using these words. What is that reason? Wiktionary has the capability to be superior, but is it actually superior? -Geographyinitiative (talk) 13:39, 9 September 2023 (UTC) (modified)[reply]

We don't define them quantitatively so they don't correspond to anything. Similarly the OED's "rare" label also doesn't in practice correspond to any of its frequency bands. To be useful, they need to be somewhat subjective: any kind of jargon is going to be uncommon in absolute terms compared to non-jargon, but if you add a label like "(sociology, uncommon)" then you're implying it's uncommon even within sociology. Adding "uncommon" simply because its absolute frequency in the entire language is low is probably not helpful to readers, and we don't in general do that anyway. —Al-Muqanna المقنع (talk) 14:12, 9 September 2023 (UTC)[reply]

This is beginning to sound both coherent and realistic and approaching consensus! DCDuring (talk) 14:47, 9 September 2023 (UTC)[reply]

I don't create a lot of entries, but I also label words "rare" if it is difficult to find cites (e.g. clattawa, which includes all the citations I was able to find online, with a certain amount of effort, over a period of several years). I haven't used "uncommon" much, if at all, but I have understood it in the way described by other editors here: unlikely to be encountered, but not hard to cite. FWIW, I have also used "very rare" for words or spellings that are just barely citeable. I'm not sure if that's something we want to be doing or not. Andrew Sheedy (talk) 16:27, 9 September 2023 (UTC)[reply]

That's more or less my take. I've only used "very rare" myself at abstringe, where it seems possible that the 3 citations (if accepted) are literally the only instances it's ever been used independently. —Al-Muqanna المقنع (talk) 17:27, 9 September 2023 (UTC)[reply]

I probably would label as rare a French word that may look like 'the' translation of a common English word, but isn't actually used nearly as much as its English counterpart. Don't have any example at hand right now though.

Another case would perhaps be a word that is formed according to regular derivational processes but isn't used as much as might be expected. For example, déconnade and déconnement both seem to me to be perfectly regular and "reasonable" derivations from déconner (a very common verb), but for some reason only the first one sounds normal to me. I've absolutely never heard or read déconnement before: I was looking in my head for an example of what I meant and I thought of this one. It happens to be attested, but it still feels like a weird/rare word.

In any case, I agree this is all very fuzzy and the label is almost meaningless at the moment. If we want it to be useful, we should get into the habit of providing meaningful comparanda (i.e. "rare compared to what?"). P U C – 16:38, 9 September 2023 (UTC)[reply]

Example at hand: predator, predatory, sexual predator and so on, which don’t exist as set terms in other languages. But the translations get specific glosses. Fay Freak (talk) 17:18, 9 September 2023 (UTC)[reply]

Granularity of reading types in `{{ja-kanjitab}}`

How precise do we want to indicate the readings used in {{ja-kanjitab}}? For example, in this here edit on 女体 (にょたい), I indicated the reading of the kanji as being go-on, which I would say is an improvement in this case, because the whole compound is made of the go-on readings. However, what about the other kanji tab on this page for じょたい? Do we want to specifically notate this as kan'on + go-on? At this time, it's being shown just as "on'yomi", which is indeed helpful, but I'd like to know whether we should do our best to convert these to be as precise as possible. In my opinion, this would be constructive for readers as it would show the specific types of kanji readings that are used throughout the Japanese vocabulary, but that's just my opinion. Kiril kovachev (talk・contribs) 13:24, 9 September 2023 (UTC)[reply]

In general, I try to be as specific as possible in {{ja-kanjitab}} reading types.

I confess that it has bothered me for some time that we have no way of clearly specifying when a kanji reading is both kan'on and goon; we are left with just on for generic on'yomi, which is also what gets used when an editor hasn't taken the time to look up reading types (more common for older entries), or for when the resources we have to hand don't specify. I dislike the ambiguity.

Back to your specifics, yes, for the じょたい (jotai) reading on the 女体 page, I would specify {{ja-kanjitab|じょ|たい|yomi=kanon,goon}}. ‑‑ Eiríkr Útlendi │Tala við mig 17:07, 11 September 2023 (UTC)[reply]

@Eirikr I see, then I think our ideas are aligned in this regard. I also would like to accurately reflect the reading types where possible, but like you say I don't know what to put when the kan/go on readings are the same. At the end of the day if it just says "on", then this could most likely only refer to kanon or goon anyway, since any other on reading type is very rare; but this isn't awfully verbose or clear to readers of the kanji tab at a glance, and of less educational value too. I wonder if there isn't a solution to this, such as e.g. a "compound" reading type specifying kanon/goon? But this would require changes to the kanjitab template.

Thanks for your feedback, anyway – I now updated that particular page, and I shall keep on doing this for any entries I create in the future. Kiril kovachev (talk・contribs) 19:17, 11 September 2023 (UTC)[reply]

Too much concealment of quotations and synonyms in short English L2s for multiword expressions

At an entry like where it's at, we have hardly any content visible to a user. (There was less before the See also was added.) I think this makes our entry look pathetic for no good reason, as there is content. Do others agree?

If so, can our talented technology mavens work their magic to make sure that collapsible quotations and synonyms don't make such an L2 shorter than, say, 20 or 25 lines when there is concealed content? DCDuring (talk) 23:27, 9 September 2023 (UTC)[reply]

really long single-kanji readings

@Kiril kovachev Your bot changes to remove redundant translits have resulted in thousands of new categories getting created, often with long readings for single kanji. For example, 顣 lists a kun reading まゆをひそめる = mayuwohisomeru, which duly results in Category:Japanese kanji with kun reading まゆをひそめる and its parent Category:Japanese kanji read as まゆをひそめる. Is this a real reading? The Hiragana entry for まゆをひそめる soft-redirects to 眉を顰める; this entire phrase is pronounced mayu o hisomeru, and the Kanji 顣 that is claimed to have the reading mayuwohisomeru is listed here with reading hiso, which seems to make more sense, although I don't know a lot about Japanese, so maybe this extremely long reading is kosher. Can someone who speaks and reads Japanese tell me whether this reading and the others listed in Special:WantedCategories from position 752 through 1829 or so (which include katakana readings towards the end) are real or spurious? Benwing2 (talk) 08:12, 10 September 2023 (UTC)[reply]

@Benwing2 Hello, thanks for bringing this up, and in truth I was going to post an extended post about this change and how to clean up kanji entries in general, but let's address this first. I apologize very much if this has lacked some degree of foresight, as personally I didn't know this would flood WantedCategories with all this drivel; but ultimately we already had these readings sitting there latent, and the only reason these categories haven't surfaced till now is the syntax previously suppressing the category generation. May I ask, is the creation of categories an automatic process? I was aware that many kanji had had a category generated as a result of this, but I didn't know category pages are also being created in such amounts because of that.

As for the readings, I am extremely dubious that most of these kun readings are real whatsoever, in fact they strike me as simply an explanation of the character's meaning as opposed to a "reading": no one is likely going to see this kanji 顣 and read it as an entire clause. There are loads of these instances in our coverage, which I would suggest are mostly mistaken. The general limit on common kanji readings' length is 5 kana, with the most notable readings of this kind being 承(うけたまわ)る (uketamawaru) and 志(こころざし) (kokorozashi). As a rule of thumb, therefore, there should be exceedingly few valid kun readings over 5 long. Those are probably just phrases.

Additionally about the kanji in that list:

I believe all the ones that end in "さま" are just meanings, not readings, so those are not legit. Same with most ending in "する".
The ones that contain a を are entire phrases, so also probably not valid. Alternatively, they're spellings in the old orthography, so still wrong, but maybe fixable.
This was a consideration for kanji reading cleanup as a whole: a great number of kanji readings in that list are for verbs (so they end in one of the kanas ending in "u", うくすむぬつふる), but they have no okurigana suffix to show which part of the verb the kanji corresponds to (the rest being written in kana), e.g. "盈" supposedly has reading 盈(みたす) (mitasu), but the usual spelling of this verb is 満たす, suggesting that 盈's reading should actually be み-たす. Similarly, many of the readings end in い, of which a lot are i-adjectives. For those that are, they should never contain the い as part of the whole reading, and the end of the reading should fall somewhere before the い, e.g. 脃(もろい) (moroi) should be もろ-い.
Because some dictionaries don't show the okurigana placement, this can make it difficult to tell exactly which parts the kanji are meant to represent. Sometimes, it's not the same portion of the underlying word as the primary spelling, either. This means to me that we may have to manually decide (a) whether some of these alternative readings (for the valid, short-ish words) are well-attested to begin with, and (b) what parts are used for the kanji reading and what parts are written in kana.
The katakana ones are dubious to me as well, but this ventures into the territory of archaic and esoteric, which I'm not familiar with: it's possible that some of these were used in the past, e.g. 銥 may possibly have been in use for イリジウム (irijiumu), but I've never seen it myself. However, 粁(キロメートル) (kiromētoru) might be okay, along with related terms, because 米(メートル) (mētoru) is occasionally used.
There are some old-orthography readings that have slipped through the cracks as well, e.g. 塴(はうむる) (haumuru), which should be ほうむる<はうむる (with a - somewhere, requiring some research); some patterns can be filtered, e.g. を, ゑ, ゐ, はう, かう, まう, etc., but again in need of investigation.
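
The okurigana point above could be turned into a rough first-pass check along these lines. This is only a sketch: the function name is made up, the verb endings are the ones listed in the bullet above (so ぐ and ぶ are not covered), and anything it flags would still need manual checking, since plenty of nouns also end in these kana.

```python
# Endings taken from the list in the discussion above (assumption: a fuller
# check would probably add ぐ and ぶ, which are also dictionary-form endings).
VERB_ENDINGS = set("うくすむぬつふる")
ADJECTIVE_ENDING = "い"


def needs_okurigana_review(reading):
    """Flag a kun reading that looks like a verb or i-adjective but has no
    hyphen marking where the okurigana starts."""
    if not reading or "-" in reading:
        # Empty, or already split into kanji part and okurigana, e.g. み-たす
        return False
    last = reading[-1]
    return last in VERB_ENDINGS or last == ADJECTIVE_ENDING
```

So みたす and もろい would be flagged for review, while み-たす, もろ-い and ひと would pass. It cannot say where the hyphen belongs; that still has to be decided per word.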

If this is too much to handle at once, perhaps I can undo the syntax removal? That would temporarily remove these categories and allow us time to fix the readings in general, if you would prefer that. Otherwise, I do feel this will still require lots of manual checking, even after we're able to prune the obviously dubious readings. I would of course be happy to take responsibility and begin to check those in the list in order.

Hope this was of some help, Kiril kovachev (talk・contribs) 11:09, 10 September 2023 (UTC)[reply]

Also, I suggest a good litmus test for whether something is really a (common) kun'yomi: check on https://kanji.jitenon.jp/. It doesn't always have all the readings we do, but it does specifically delineate between 意味 (meaning) and 訓読み (kun'yomi), so if some supposed kun'yomi is listed on that site as a "meaning", as opposed to a kun'yomi, then it's probably invalid. E.g. 人 is given the kun'yomi ひと, whereas 塴 simply 'means' ほうむる. Kiril kovachev (talk・contribs) 12:27, 10 September 2023 (UTC)[reply]

@Kiril kovachev The creation of categories in Special:WantedCategories is something I manually run a script to do, usually every 3 days, since that's the frequency at which Special:WantedCategories is refreshed. I usually check the categories when I run the script, which is how I caught these cases. If I catch the issue before the script runs, I can tell it to filter out the bad categories but in this case I didn't notice until the script had almost finished the Japanese categories, so I let it run. Once we've fixed everything up, the bad categories will end up in Category:Empty categories and we can delete them. I understand that the problem was already there before you ran your script; I wouldn't recommend undoing what you've done because the categories won't disappear (only become empty), and the problem will still be there. I don't know that much about Japanese so you'll have to help delete the bad readings, but I can help as much as possible, e.g. I can make a list of the potentially bad readings and you can manually filter out the good ones, leaving the bad ones to be removed by bot. Benwing2 (talk) 19:18, 10 September 2023 (UTC)[reply]

@Benwing2 Okay, that's fine. I'll engage myself with fixing the readings as much as possible until the categories can be cleared. The problem with some of the readings, such as the okurigana placement I mentioned above, are not so clear-cut, but we can get rid of all the obviously shoddy ones. Could I enlist your help to filter out some of those? Could you please create a listing of all the ones that:

Contain を, ゑ, ゐ, or any of [かがはばぱさざただなまやら] followed by う (the first three indicate archaic characters, or an inappropriate reading in the case of を; the last is one of a few common old spellings); or
Are longer than 5 characters. There are some 6-long out there, but the majority should be false readings; or
End in さま?

The katakana ones might well be valid so I'll check all of those manually, but I believe these three points make for unlikely readings. Also, if you're quite busy, please let me know and I'll generate them myself, so you don't need to waste your time. Thanks for your help, Kiril kovachev (talk・contribs) 20:38, 10 September 2023 (UTC)[reply]
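
For reference, the three criteria above could be expressed roughly as follows. This is a sketch under assumptions, not the script that was actually run: the function and constant names are hypothetical, hyphens are ignored when counting length, and the length limit is a parameter because 5 is only a rule of thumb.

```python
import re

# を, ゑ, ゐ anywhere, or one of the listed kana followed by う
# (old-orthography spellings such as はう for modern ほう).
OLD_ORTHOGRAPHY = re.compile(r"[をゑゐ]|[かがはばぱさざただなまやら]う")


def suspicious_reasons(reading, max_len=5):
    """Return the list of criteria a kun reading trips; empty if none."""
    kana = reading.replace("-", "")  # the okurigana hyphen is not a kana
    reasons = []
    if OLD_ORTHOGRAPHY.search(kana):
        reasons.append("old-orthography pattern")
    if len(kana) > max_len:
        reasons.append("too long")
    if kana.endswith("さま"):
        reasons.append("ends in さま")
    return reasons
```

Under these rules まゆをひそめる is flagged twice (contains を, and 7 kana), はうむる is flagged for はう, and あお-ぐ and こころざし pass. A valid 6-kana reading such as うけたまわ-る is also flagged as too long, which is the kind of false positive that has to be filtered out by hand.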

@Kiril kovachev Sure, will do. Benwing2 (talk) 20:47, 10 September 2023 (UTC)[reply]

@Kiril kovachev: I notice a bunch of categories like Category:Japanese kanji with kun reading あお-ぐ that have a hyphen in them (this category has 5 chars in it). Should they exist or should we be putting these characters in Category:Japanese kanji with kun reading あお, chopping things off at the hyphen? Benwing2 (talk) 22:29, 10 September 2023 (UTC)[reply]

@Benwing2 I'm pretty sure these are correct. This is the standard practice as far as I can tell. If anything, the ones with no - where there should be one should be changed, which I'll be looking to do as part of my sweep. I think you should ignore the - as part of the character count anyway. Kiril kovachev (talk・contribs) 22:35, 10 September 2023 (UTC)[reply]

@Kiril kovachev: I *think* this is the full list meeting the above criteria (it's 382 categories): User:Benwing2/bad-japanese-reading-cats Benwing2 (talk) 00:48, 11 September 2023 (UTC)[reply]

@Benwing2 Thanks! I'm now going to start checking, I'll let you know my findings. Kiril kovachev (talk・contribs) 10:13, 11 September 2023 (UTC)[reply]

@Kiril kovachev I also ran it while checking for categories with 5+ chars. The result is here: User:Benwing2/bad-japanese-reading-cats-5-or-more. There are 833 lines here. Benwing2 (talk) 10:31, 11 September 2023 (UTC)[reply]

@Benwing2 Thanks for this again. Kiril kovachev (talk・contribs) 12:05, 11 September 2023 (UTC)[reply]

Thanks for generating this. Since we had issues last month(s) with the monthly subpages not showing up on the main WT:BP page because they were too large to transclude, might I suggest moving this list to a sandbox/userspace page and just linking to it, lest we (now at 229k bytes with two-thirds of the month left to go) get too large for WT:BP again?😅 - -sche (discuss) 03:18, 11 September 2023 (UTC)[reply]

@-sche Done. Benwing2 (talk) 03:53, 11 September 2023 (UTC)[reply]

Pinging the most-recently-active native Japanese speakers who also list English in their Babel boxes, User:MathXplore and Lugria. The question is whether (e.g.) 顣 by itself has a reading "まゆをひそめる", or whether it only has a reading of "hiso". (You might or might not also have an opinion about where to mention vs not mention ruby, discussed at Wiktionary:Beer_parlour/2023/September#Automatic_transliteration_of_katakana_and_hiragana.) Additional ping to User:Eirikr. - -sche (discuss) 16:14, 10 September 2023 (UTC)[reply]

@-sche I very highly doubt that まゆをひそめる is even possible as a reading, because I've never before seen a reading that doesn't spell out を as its own particle. As a whole を is spelled in kana 100% of the time from what I've personally seen, so seeing it hidden behind a kanji doesn't seem right at all to me. But I'll let those who know better weigh in as well. Kiril kovachev (talk・contribs) 20:40, 10 September 2023 (UTC)[reply]

I have the same doubt. MathXplore (talk) 07:45, 11 September 2023 (UTC)[reply]

According to this online kanji dictionary ([6]), the only given reading is "shuku". I also confirmed this at page 1564 of 新漢語林第二版 ([7]) from w:ja:大修館書店. No other readings were found. On the other hand, the online kanji dictionary that I used above explains the definition as "ひそめる", so this is likely related but cannot find "hiso" as its reading. I hope this can help you. MathXplore (talk) 07:54, 11 September 2023 (UTC)[reply]

@Benwing2, @MathXplore, @-sche:

In relation to this very partial investigation that I began just a few hours ago, something has already become clear to me, which is that many of our kanji readings pages were generated automatically a long time ago (2003) by User:NanshuBot, and then later reformatted to their current form. Anyway, the important thing to note is, they are all heaved over from KANJIDIC. What's more, these same reading sets that were present in the KANJIDIC data set back in 2003 have by now proliferated all over the internet, making for countless sites that appear to corroborate the original, apocryphal reading. These characters are so obscure, they don't fit into the usual kanji dictionary that I search (学研漢和大字典), and finding a legitimate, in-the-wild usage is virtually impossible. The pathetic 7 or so readings I was able to scrawl through were the product of over an hour of trawling over data and deciding whether it's a KANJIDIC derivative, and then in turn trying to figure out where that original KANJIDIC reading even came from, and whether it should be kept in the first place. Indeed, KANJIDIC for these rare readings also doesn't bother to place the okurigana location, which for our purposes means that it doesn't help in whittling down which categories should be kept or not.

We could reject readings that appear to be present only in KANJIDIC as a rule, but it could also be possible for some of them to be valid after all, so I don't know. I think I would like to email the KANJIDIC maintainers to ask about the sources of readings, if they are gathering them in some systematic way. Maybe I need to also consult Dai Kanwa Jiten for some of the most obscure readings as well. But checking through these all in this way will still be very cumbersome.

What do you all think we should do? Kiril kovachev (talk・contribs) 12:26, 11 September 2023 (UTC)[reply]

For starters, I think any kanji whose kun'yomi seems at all dubious and which is sourced to KANJIDIC should be added to some kind of maintenance category, flagged for further review, as it were. For instance, the 顣 character mentioned above appears in the Weblio aggregator site here, and it seems to indicate that this only appears in KANJIDIC -- no other resources or entries are listed.

I too have noticed this over time -- many obscure kanji characters are included in KANJIDIC and the Unihan database with Japanese readings, but digging further reveals that:

The kun'yomi given in both resources is often a gloss, and not a valid Japonic reading.
The character in question isn't even used in Japanese text at all, or if at all, only exceedingly rarely, and often in cases where the Japanese text is worded like "this is the character used in Chinese to spell the word iridium..."

I am not sure if KANJIDIC cribbed from Unihan, or the other way around. At any rate, both resources strike me as extremely dubious for rare Japanese kanji. Any of our entries sourced to these will need vetting. ‑‑ Eiríkr Útlendi │^{Tala við mig} 17:19, 11 September 2023 (UTC)[reply]

@Kiril kovachev Thanks for the investigation! I would agree with User:Eirikr and add that anything dubious like this should be placed in some sort of "check" template that essentially quarantines the dubious reading, similar to {{t-check}}. Such template should not add the reading to any categories. Probably this can be done in an automated or semi-automated fashion. Benwing2 (talk) 18:51, 11 September 2023 (UTC)[reply]

@Eirikr Thanks very much for this input, I'm glad to know these aren't really used in Japanese, and we can take a slightly less difficult approach and flag things directly that we doubt.

@Benwing2 About the check template, do we want it to continue to display the reading to users, or just effectively comment it out, whilst still allowing it to be tracked? If we want to display it, I just wonder how we can do this while preserving the format of the readings template, since any text we try to input appears to mess with the reading itself.

If we don't want to show the reading, I guess just an empty template with no output would do? Just wrapping the reading in e.g. {{ja-reading-check}} would make it vanish but still track it.

Additionally, about the automation, I believe a good bit of work can be saved by scraping the various reading aggregator sites, ideally independent ones such as kanji.jitenon.jp (that aren't based on kanjidic nor on Unihan) to check whether the relevant readings are provided or not. Out of the many entries in the pile to check, it's quite likely that many can be flagged like this. Kiril kovachev (talk・contribs) 19:09, 11 September 2023 (UTC)[reply]

@Kiril kovachev If you look at {{t-check}}, it continues to display the translation but with a displayed indication that it needs checking, and it also adds the translation to a request-for-check category. This would be ideal, if it's not too much work to implement. As for the automation, if you can give me more detail on how to do the scraping and how to check the relevant readings, I can probably implement it. Benwing2 (talk) 20:03, 11 September 2023 (UTC)[reply]

BTW maybe instead of wrapping it in a separate template, we can add some flag to the readings template (whichever one that is) to indicate that a reading needs checking. Benwing2 (talk) 20:05, 11 September 2023 (UTC)[reply]

This could be as simple as an asterisk or other symbol before the disputed reading or whatever. Benwing2 (talk) 20:05, 11 September 2023 (UTC)[reply]
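As a rough illustration of the asterisk idea, a marker in front of a reading could set a "needs checking" flag that keeps the reading out of the reading categories while still displaying it. The sketch below is in Python for brevity and is not the actual {{ja-readings}} code; the function names and the comma-separated input format are assumptions made for the example.

```python
def parse_readings(value: str) -> list[dict]:
    """Split a comma-separated readings string; a leading '*' flags a reading for checking."""
    entries = []
    for raw in value.split(","):
        item = raw.strip()
        if not item:
            continue
        needs_check = item.startswith("*")
        # The marker is stripped so it never ends up in the link or category name.
        entries.append({"reading": item.lstrip("*"), "needs_check": needs_check})
    return entries

def categories_for(entries: list[dict],
                   prefix: str = "Japanese kanji with kun reading ") -> list[str]:
    """Only readings that are not flagged contribute a category (hypothetical rule)."""
    return [prefix + e["reading"] for e in entries if not e["needs_check"]]
```

With this shape, a flagged reading stays visible to readers but is quarantined from categorization, which is close to how {{t-check}} was described above.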

@Benwing2 That's a good idea, because I'm not sure how we would do it inline: each reading generates a link to itself, right? Therefore if you just insert a template in there, it will interact with the link; e.g. if you try to input tags, this triggers the < syntax of the template, which specifies older forms of the reading. If you input ordinary text, that in turn is considered part of the reading, and gets linked to, polluting the desired link. Maybe there's a more advanced solution, as I'm not all that experienced with templates, but actually your suggestion is great and I'd rather just put a little indicator to suggest some of its readings need checking, if that's possible. I guess it'd just need a small check on behalf of the readings template: I can try to add the asterisk option if you agree with that.

Re:scraping, I can try making that script myself, but I can cover the details if you'd like too. We need to get the data, probably from HTML since there aren't any good APIs*, from a few sites: (*that I know of, maybe there are?)

kanji.jitenon.jp: get the content page via the URL https://kanji.jitenon.jp/cat/search.php?getdata=4eba&search=match&how=%E6%BC%A2%E5%AD%97 where in this case 4eba is the hexadecimal value of the Unicode code point (for 人). Then we need to access the kun'yomi (we should focus our attention on checking kun'yomi, as these are the vast majority of the dubious readings) from the readings table on that page, e.g. ひと. This looks slightly complicated, so I'm not sure on the exact procedure yet, since the table doesn't have distinct IDs or classes for its elements, so some amount of decoding the thing might be required.
weblio.jp: this site is one of the ones that is sometimes corrupted with the dubious readings. We can check whether these are present by scanning for text like ［訓］ひと (the ひと could be any hiragana or katakana) at https://www.weblio.jp/content/<kanji here>. If this kun reading matches what we have in our coverage, but the same content is not reflected on kanji.jitenon.jp, it is likely that we can flag the value as dubious. Its URLs are just https://www.weblio.jp/content/<kanji here>, so getting the content is quite easy. It's then a matter of searching with the desired pattern, which should hopefully work broadly over all possible kanji. But bear in mind that the readings are indicated in tags, which in turn are nested inside tags that contain the ［訓］ label in them.
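The URL construction and the ［訓］ pattern described in the two items above might look something like this. It is a minimal sketch that makes no network requests; the function names are made up, and the pattern assumes the page text has already had its HTML tags stripped, which the real pages would need first.

```python
import re
from urllib.parse import quote, urlencode

def jitenon_url(kanji: str) -> str:
    """Build the kanji.jitenon.jp search URL from the character's code point in hex."""
    params = {"getdata": format(ord(kanji), "x"), "search": "match", "how": "漢字"}
    return "https://kanji.jitenon.jp/cat/search.php?" + urlencode(params)

def weblio_url(kanji: str) -> str:
    """Build the Weblio content URL, percent-encoding the kanji."""
    return "https://www.weblio.jp/content/" + quote(kanji)

# ［訓］ followed by a run of hiragana or katakana (including the long-vowel mark).
KUN_LABEL = re.compile(r"［訓］\s*([\u3041-\u3096\u30A1-\u30FA\u30FC]+)")

def kun_after_label(text: str) -> list[str]:
    """Return the kana runs that directly follow a ［訓］ label in tag-free text."""
    return KUN_LABEL.findall(text)
```

For 人 this reproduces the example URL given above. Extracting the readings table from kanji.jitenon.jp itself is left out, since its structure is not described here.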

We could check more sites than this, for instance the KANJIDIC database itself (via an API at http://nihongo.monash.edu/cgi-bin/wwwjdic?1B): if you do

r = requests.post("http://nihongo.monash.edu/cgi-bin/wwwjdic?1D", data={'kanjsel': 'X', 'ksrchkey': '<kanji goes here>', 'strcnt': ''})

, that gets the HTML contents of the lookup page, and then the kun readings, which are in tags immediately following the [訓] label (itself in tags), would need to be read. But perhaps just one of this or Weblio would do.

The check at KANJIDIC would check that our obscure reading is probably from there, and the check at independent Jitenon would confirm whether that reading is actually widespread or not.
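The comparison described here could be expressed as a small decision function. This is a sketch under assumptions: the two reading sets are taken as already scraped, the labels "dubious", "corroborated" and "unknown" are invented for the example, and okurigana separators ("-" or ".") are simply dropped before comparing.

```python
def _normalize(reading: str) -> str:
    # Assumption: okurigana separators are ignored when comparing sources.
    return reading.replace("-", "").replace(".", "")

def classify_reading(reading: str, kanjidic_readings: set[str],
                     jitenon_readings: set[str]) -> str:
    """Compare one of our readings against a KANJIDIC-derived and an independent source."""
    target = _normalize(reading)
    in_kanjidic = target in {_normalize(r) for r in kanjidic_readings}
    in_jitenon = target in {_normalize(r) for r in jitenon_readings}
    if in_jitenon:
        return "corroborated"   # the independent source lists it
    if in_kanjidic:
        return "dubious"        # only the KANJIDIC-derived data lists it
    return "unknown"            # neither source lists it; needs a manual look
```

A "dubious" result would be a candidate for flagging rather than for automatic removal.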

Maybe there's an easier way to access the content from Jitenon and Weblio, but idk.

As you can see it is quite a lengthy process, so if you want I can try to handle it. But anyway let me know if you want to do it yourself. Thanks for offering to help! Kiril kovachev (talk・contribs) 20:54, 11 September 2023 (UTC)[reply]

@Kiril kovachev Yes, maybe you should see if you can implement it since I'm not very experienced in scraping web sites and you probably have a better idea of what you're looking for. I can implement the changes needed for {{ja-readings}}. Benwing2 (talk) 21:36, 11 September 2023 (UTC)[reply]

@Benwing2 Alrighty, thanks for handling it. I'll figure out that business tomorrow if I can. Kiril kovachev (talk・contribs) 21:39, 11 September 2023 (UTC)[reply]

(Explanatory note for those who don't know Weblio) I would like to note that Weblio itself is not a dictionary but a search engine that finds things from many dictionaries and encyclopedias, which include (but are not limited to) Japanese Wikipedia/Wiktionary etc. In other words, this can be understood as an online dictionary mirroring website. If there is something wrong on Weblio, then we may need to check where such errors come from. MathXplore (talk) 06:46, 12 September 2023 (UTC)[reply]

I noticed that kanji.jitenon.jp is based on existing paper-based kanji dictionaries from major Japanese publishers (such as KADOKAWA etc.). Their references that they used are listed at [8]. I wasn't able to find the list of editors from the company's website, but I think kanji.jitenon.jp is reasonable to be used for our upcoming checks. MathXplore (talk) 06:57, 12 September 2023 (UTC)[reply]

@MathXplore Right, I see, thanks for the correction, I think I'll just stick to directly tapping into KANJIDIC2 for the comparison, then. Also good to know about kanji.jitenon.jp — I noticed they were clearly different from the other net-based aggregators, so thanks for checking out the source. Kiril kovachev (talk・contribs) 09:22, 12 September 2023 (UTC)[reply]

@Kiril kovachev @Eirikr On a somewhat related note, the Unicode Consortium are currently undertaking a massive (read: years long) review of the Japanese data in the Unihan database, which is probably worth keeping an eye on. They’re well-aware that there are major inaccuracies in a lot of the data, but they also do want to clear them up - and they have some pretty knowledgeable people contributing to it. Theknightwho (talk) 01:15, 12 September 2023 (UTC)[reply]

@Theknightwho Wow, thanks for letting us know. That's certainly interesting — when did they start, do you know? We could benefit a lot, if we aren't able to handle everything before then ourselves, if we check out what changes they make when everything's finished. Kiril kovachev (talk・contribs) 09:19, 12 September 2023 (UTC)[reply]

I don't know well about the KANJIDIC and the related bot mentioned as above, but I think it's a good idea to ask about their sources. Then we can easily declare if they are valid or not. MathXplore (talk) 06:33, 12 September 2023 (UTC)[reply]

Sounds good. I'll send them a mail later today ^^ Kiril kovachev (talk・contribs) 09:22, 12 September 2023 (UTC)[reply]

@MathXplore, @Eirikr As you were interested, I asked Mr. Breen about where the kanjidic readings originally come from, + highlighting what anomalies we've found, and he responded:

Hi Kiril,

Thanks for making contact on this issue. This is just a brief initial response; I don't have a lot of time to address the matters you raised, and I'm heading off on some travel until late October.

When the kanjidic data was first put together over 30 years ago it was a matter of scraping together whatever was available. No attempt was made to look at sources other than basic references such as Nelson. That applied particularly to the rare kanji in the JIS X 0212 "supplementary" standard. For most of these the details were simply copied from the Unihan data. I see that most of the kanji you mentioned in the email are from JIS X 0212 and that the doubtful readings are from Unihan.

When I get a chance I'll go through them and check against some kanwa sources. I suspect that many of those readings can be dropped. As they are not common kanji this sort of checking has never been a high priority.

I hope this helps, and I'll get back to you.

Cheers

What I believe we should draw from this is, we have good license to question the odd readings, and should use some established kanwa sources to verify or reject specific readings. Kiril kovachev (talk・contribs) 16:14, 13 September 2023 (UTC)[reply]

@Kiril kovachev Thank you for doing this! This answers the question of where the info came from (Unihan) and I agree that we should drop or quarantine the doubtful readings. Benwing2 (talk) 19:32, 13 September 2023 (UTC)[reply]

@Benwing2 Agreed! And status update on the automated processing, I've been ever so slightly busy these two days, and so I've yet to finish it, but one of these coming days I hope I can get it done. Either way, this is good news imo for our goals :) Kiril kovachev (talk・contribs) 21:47, 13 September 2023 (UTC)[reply]

Thank you for the contact. Since we learned KANJIDIC is "scraping together whatever was available", I'm afraid that it may have collected exceptional readings (such as readings used only for human names, readings used when converting ancient Chinese texts to Japanese etc.). I agree that doubtful readings must be checked, but I also thought that we may need to give labels to rarely used readings to distinguish them from frequently used readings. MathXplore (talk) 12:35, 14 September 2023 (UTC)[reply]

Labels in inflections of labeled terms

Hi. Currently I'm working on Fala entries, and I've just added WT:ACCEL support to the verb-conjugation template. This language has three varieties, one for each town where it is spoken. In order to not give preference to one variety over the other two, both the bibliography and the Wiktionary entries have labels (see enxagual).

My question is whether we should label non-lemmas if the parent entry is already labeled, either with {{lb}} or {{tlb}}. This doesn't only apply to Fala or verbs, but all Wiktionary languages, so I feel I should ask first for consensus. Cheers, sware • 🗣 • 🏲 16:30, 10 September 2023 (UTC)[reply]

No, that kind of information should be centralised in a single lemma, unless it relates specifically to the inflected form. Alternative forms, where they're technically lemmas, are more of a grey area though it's usually still better to centralise the information at one entry, again unless it's specific to the form itself. —Al-Muqanna المقنع (talk) 16:47, 10 September 2023 (UTC)[reply]

+1, exactly. Like newmade is not a particularly "obsolete" past tense of newmake, it's just the past tense of newmake, and newmake as a whole is obsolete. Whereas, low#Etymology_2 is a {{lb|en|obsolete|nocat=1}} {{inflection of|en|laugh||sim|past}} (...actually the "nocat" is questionable, although I understand why it was added...) because it really is a specifically obsolete past tense form of laugh, which has the non-obsolete past tense laughed. - -sche (discuss) 17:18, 10 September 2023 (UTC)[reply]

Understood, thanks. The low case does appear in Fala, take the example of enxagual: the second-person plural imperfect indicative form enxaguabis is found only in Mañegu. No (simple) way I could encode that into the acceleration script, right? Anyways, that'd be more of a GP question. Thank you both. sware • 🗣 • 🏲 17:34, 10 September 2023 (UTC)[reply]

It's been a while since I tinkered with any ACCEL stuff, but if the Mañegu-specific forms are predictable, e.g. if every verb generates two second-person plural imperfect indicative forms and the first one is always Mañegu-specific (or they both are, etc), then it should be possible to have the template/module code wrap that form's link in a different class so that ACCEL clicking on it can generate the label. It might make even more sense to just have a bot generate the forms. Presumably the table on enxagual should also have some indicator visible on that entry that that form is Mañegu-specific, e.g. a (qualifier) or a footnote! - -sche (discuss) 17:46, 10 September 2023 (UTC)[reply]

@Swaare User:-sche is right, you can encode the fact that the form enxaguabis is Mañegu-specific in the accelerator form (= class), and have the accelerator code generate the appropriate label. Benwing2 (talk) 19:22, 10 September 2023 (UTC)[reply]

Etymology-only Sardinian

Hi, everybody. A little while ago, I added the Sassarese entry abbaiddà, which is a borrowing from Sardinian abbaidare. There are some variants of the latter, but this specific form only appears in Logudorese Sardinian.
Now, my question is whether the Etymology subsection for abbaiddà should look like this

Borrowed from Sardinian abbaidare

or like this

Borrowed from Logudorese abbaidare

I mean, I personally can't see a reason to not use the more specific nomenclature, but I thought I'd ask. —— GianWiki (talk) 11:20, 11 September 2023 (UTC)[reply]

Logudorese is already a recognized etymology-only language, so it can be the second. Just write "Borrowed from {{bor|sdc|sc-src|abbaidare}}." —Mahāgaja · talk 11:29, 11 September 2023 (UTC)[reply]

Thank you very much for your answer. —— GianWiki (talk) 11:36, 11 September 2023 (UTC)[reply]

spelling of medi(a)eval in etymology sections

Should we stabilize the spelling of this word? e.g. on φασκιώνω we spell it Mediaeval and on παλαμάρι without the a. The shorter spelling is the one we use as the main entry (Medieval Greek), as does Wikipedia, but I didn't check for other languages such as Medieval French. This isn't a huge problem, but stabilizing it would make searches somewhat easier. Thanks, —Soap— 10:26, 12 September 2023 (UTC)[reply]

There aren't any languages whose canonical Wiktionary names include the word medi(a)eval, and only four etymology-only languages (CAT:Medieval Hebrew, CAT:Medieval Latin, CAT:Early Medieval Latin, and CAT:Medieval Sinhalese), all of which use the spelling medieval. As for Medieval Greek, etymology sections ought to be using the code gkm and call it Byzantine Greek. —Mahāgaja · talk 10:50, 12 September 2023 (UTC)[reply]

There is however a canonical label "mediaeval folklore" and "mediaeval" also appears as a qualifier in translations sections, derived terms etc. Search turns up 747 results. "Mediaeval" is a rather dated spelling even in the UK so I'd recommend getting rid of it and standardising to "medieval". It seems the etymology sections where it appears for Greek already are using gkm, they just tautologically specify "mediaeval Byzantine Greek". —Al-Muqanna المقنع (talk) 11:02, 12 September 2023 (UTC)[reply]

I just noticed that too. Certainly "Medi(a)eval Byzantine Greek" should be changed to simply "Byzantine Greek". —Mahāgaja · talk 11:11, 12 September 2023 (UTC)[reply]

Multiple quotations, same author, different works

I seem to remember once reading somewhere on Wiktionary about it being not recommended to provide multiple quotations by the same author for the same entry. Is there actually anything against this? —— GianWiki (talk) 10:27, 12 September 2023 (UTC)[reply]

It's not recommended because some authors have idiosyncrasies and we want to show full lexicalization, not just individualized lexicalization. If there are other quotes from other authors supporting a definition, it seems fine to include a quote from the same author if the quotes from said author particularly demonstrate usage well; however, for a term to pass WT:CFI they cannot all be from the same author. Vininn126 (talk) 10:34, 12 September 2023 (UTC)[reply]

Sorry to bother you, @Geographyinitiative, @Vininn126, but I have another question.

If there is one work by an author, and another work which is an anthology of texts of various origins (i.e. not composed by the author of the first work), but edited by the author of the first work, do both works count for the purposes of attestation?

Thanks in advance —— GianWiki (talk) 08:44, 13 September 2023 (UTC)[reply]

@GianWiki What do you mean edited? Vininn126 (talk) 09:49, 13 September 2023 (UTC)[reply]

@Vininn126, I mainly meant to say collected. It's the case of a mid-19th-century anthology of Sassarese popular songs, collected by Giovanni Spano (author of several Sassarese translations from the Bible), who I believe also provided the orthography in which these songs have been published (which is the same one used in his Bible-related works).

Also, even though it's a bit off-topic—but since I'm here, I thought I'd ask—how does one deal with an anthology where most of the texts don't have a title or a named author (sometimes neither), but some do?

I'll use examples taken from the entry a.

This text has no title or author:

19th century, unknown author, [untitled song]; republished in chapter XV, in Giovanni Spano, editor, Canti popolari in dialetto sassarese [9], volume 2, Cagliari, 1873, page 87:
Dunca lu megliu è
Tu pensa a la to’ pazi, ed eju a me.
So the best [thing] is: you think about your own peace, and I [think] about myself.

whereas this text has no title, but has a mentioned author:

19th century, Gavino Serra, [untitled song]; republished in chapter XLII, in Giovanni Spano, editor, Canti popolari in dialetto sassarese [10], volume 2, Cagliari, 1873, page 129:
Di tanti cantendi, e tanti
Mancuna incantesi a me,
Ma da ch’aggiu intesu a te
Tu sei l’unica ch’incanti.
Of so, so many singers, not one enchanted me; yet, since I've heard you, you're the only one who enchants.

Is there a more correct way of dealing with this, perhaps? ——— GianWiki (talk) 10:25, 13 September 2023 (UTC)[reply]

I'm not 100% sure but I'd be inclined to say that it might count. Vininn126 (talk) 10:28, 13 September 2023 (UTC)[reply]

Since Spano isn't the author of the works but merely the editor of the collection, I'd say these count as having different authors. Also, since Sassarese is an LDL, a single mention (let alone a use) is sufficient to pass CFI. —Mahāgaja · talk 10:37, 13 September 2023 (UTC)[reply]

@GianWiki: If these songs were not previously published in an edition then it'd be more correct, and probably simpler, to use |year_published= and the format in the edition. In this case I would simply write

{{quote-book|sdc|year=c. 19th century|author=anonymous|chapter=[untitled song]|editor=w:Giovanni Spano|title=Canti popolari in dialetto sassarese|worklang=it,sdc|year_published=1873|section=song 15|page=87|pageurl=https://books.google.it/books?id=TWlcAAAAcAAJ&pg=PA129|...}}

:

c. 19th century, anonymous author, “[untitled song]”, in Giovanni Spano, editor, Canti popolari in dialetto sassarese (overall work in Italian and Sassarese), published 1873, song 15, page 87:

And the same format for the Serra song. Note that anonymous is now explicitly accepted as a parameter for authors, editors, etc. In general though I don't think it's any different from works in any other edited collection. On the CFI point, we'd need to use common sense about whether the editor or the context of the collection would've influenced the actual composition of the works themselves, thus potentially making them non-independent. In this case I would say they count. —Al-Muqanna المقنع (talk) 10:41, 13 September 2023 (UTC)[reply]

Thank you very much for your answer. —— GianWiki (talk) 10:50, 13 September 2023 (UTC)[reply]

Things for you to consider GianWiki (some overlap with above): At this stage, we technically do not know who wrote the page 87 excerpt. In a pinch, I don't think we know that page 87 and page 129 have two different authors: (1) could Gavino Serra have written the page 87 excerpt without Giovanni Spano realizing it? I've never seen an RFV get that strict, but it may be warranted. Also, I would question: (2) is Gavino Serra the same person as Giovanni Spano (some kind of pseudonym) and Spano is writing both of these passages himself? Or perhaps, (3) in collecting/editing the page 87 and page 129 passages, did Giovanni Spano change the text or modify the wording such that he is really becoming the author? (4) A more remote possibility: could the page 87 and page 129 passages have been written within a year of each other? (See Wiktionary:Criteria for inclusion#Spanning at least a year) These are all things I would consider. Hence I personally would not rely on page 87 and page 129 as two of three cites to meet WT:ATTEST; I personally would find a fourth cite that was clearly independent, etc. That step is probably not necessary. But to me, adding cites is not about reaching a golden three cites for some dumb rule, it's about educating future readers about the actual usage of a word and incidentally hitting WT:ATTEST in the process. --Geographyinitiative (talk) 10:54, 13 September 2023 (UTC)[reply]

@Geographyinitiative, GianWiki: I think in this case, they're independent for the lemma and dependent for the spelling of the lemma. Does LDL-ness get grandfathered, or could entries be expunged simply because a language became well-documented? (This might become relevant for Lao-script Pali, where the spellings are 20th or 21st century but the composition often occurred over two millennia ago.) --RichardW57m (talk) 17:39, 13 September 2023 (UTC)[reply]

Here's an example of something that I guess may fail RFV for having only one author giving uses and therefore not passing Wiktionary:Criteria for inclusion#Independent: Citations:Tomosteng. --Geographyinitiative (talk) 10:47, 12 September 2023 (UTC) (Modified)[reply]

Thank you, @Vininn126 and @Geographyinitiative, for your answers. —— GianWiki (talk) 11:03, 12 September 2023 (UTC)[reply]

Could kennings be moved to the etymology section as a param under the compound template?

Just as we have Category:Ancient Greek bahuvrihi compounds auto-generated by the params of {{compound}} and its relatives, so too could we have Category:Old English kennings populated by a parameter of those same etymology templates. Since right now Category:Old English kennings has just two words, there isn't a lot of cleanup work that needs to be done to prepare for the new system. Likewise Category:Old Norse kennings has only eight. This seems like a good idea to me. Thoughts? —Soap— 10:48, 12 September 2023 (UTC)[reply]

Just a note that the required change to the template is language-indifferent, so it won't need to be manually hard-coded for OE, Old Norse, and whichever other languages use kennings. —Soap— 10:49, 12 September 2023 (UTC)[reply]

@Soap: I went ahead and made the change (|type=kenning or |type=ken), and also fixed the categories; feel free to clean up the entries. Benwing2 (talk) 05:23, 13 September 2023 (UTC)[reply]

Thank you so much. This works just like I'd hoped it would. There are a few words, such as hjǫrleikr, that suggest we might want to keep the old method open as a fallback for if only one of two senses of a word is considered a kenning, and two others that I haven't yet converted to the new system because the etymology sections are more wordy than the rest, but I changed all of the others right away and I think they present the information more neatly than they did before. —Soap— 09:50, 13 September 2023 (UTC)[reply]

Inherited derivation

If term X is derived from term Y in language B (mnemonic Before), and language A (mnemonic After) inherits, borrows or deduces all of X, Y and the derivation rule from B, may one still list 'Y' under 'Derived Terms' of X for language A? This arises in the context of listing terms derived from a root; some of the derivations we are recording may be very ancient. The wording of WT:EL is not crystal clear, and I have an unreliable recollection of it being argued that in such circumstances as this, X would not be derived from Y in language A. However, applying such an interpretation would be deleterious pedantry when applied to derivatives of roots. --RichardW57m (talk) 14:11, 12 September 2023 (UTC)[reply]

I'm having trouble following all the algebraic constants. Can you give a concrete example of what you're talking about? —Mahāgaja · talk 14:18, 12 September 2023 (UTC)[reply]

With the complexity that X and Y have changed, Sanskrit मति (máti, matí, noun) from Sanskrit मन् (man, root), although this example can be traced back via Indo-Iranian noun *matíš from root *man-, which in turn comes from Proto-Indo-European *méntis from *men-. WT:EL prohibits English data as a derivative of English datum because both come from Latin, though as a mere plural of datum, data would not be added at datum because it is already in the inflection line, so it isn't clear as an example of over-ancient derivation. --RichardW57m (talk) 14:54, 12 September 2023 (UTC)[reply]

In this case, I would say that Sanskrit मति (mati) is inherited from Proto-Indo-Iranian *matíš, which is either derived or inherited from Proto-Indo-European *méntis (depending on whether you think it can be called inherited despite the accent shift and concomitant switch from e-grade to zero grade in the first syllable), but I would also add {{surf|sa|मन्|-ति}} to show that it hasn't lost its synchronic association with the verbal root. —Mahāgaja · talk 15:13, 12 September 2023 (UTC)[reply]

@Mahagaja: So should one remove मति (mati) from the 'Derived terms' section of मन् (man)? --RichardW57m (talk) 17:05, 13 September 2023 (UTC)[reply]

Not in my opinion, no; I believe the Derived terms section should include surface analyses even when the affixation first happened at an earlier stage. —Mahāgaja · talk 17:09, 13 September 2023 (UTC)[reply]

@Mahagaja: Good. I think that's more useful for the reader. --RichardW57m (talk) 17:23, 13 September 2023 (UTC)[reply]

should country-specific dialects count in Category:Languages of X?

I am converting {{dialectboiler}} invocations to use the new {{auto cat}} support. Some existing country-specific dialect pages are manually classified into Category:Languages of X. E.g. Category:Belizean English is manually categorized into Category:Languages of Belize and similarly for Category:Nigerian English and Category:Languages of Nigeria; but Category:Singapore English is not manually categorized into Category:Languages of Singapore. Meanwhile Category:English language itself is categorized into all of the above country-specific categories by virtue of the long list of countries specified in the call to {{auto cat}} in Category:English language. My instinct is to remove all the dialect categories from the "Languages of X" categories and only list languages; but it could be argued the other way. If we are to include such categories, IMO it should be done semi-automatically, e.g. the fact that the call to {{auto cat}} for Category:Nigerian English will have |1=Nigeria given and Category:Languages of Nigeria exists should be enough to auto-categorize Category:Nigerian English into Category:Languages of Nigeria. Benwing2 (talk) 07:04, 13 September 2023 (UTC)[reply]
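A rough sketch of the semi-automatic rule described in the comment above, written in Python purely for illustration (the real logic would live in the Lua category modules). The function name and the `existing` set of category titles are hypothetical; only the rule itself, "country given and matching category exists", comes from the comment.

```python
from typing import Optional, Set


def languages_of_category(country: Optional[str], existing: Set[str]) -> Optional[str]:
    """Return the 'Languages of X' parent for a dialect category, if any.

    `country` stands in for the |1= value passed to {{auto cat}};
    `existing` is a hypothetical set of category titles known to exist.
    """
    if not country:
        return None  # no country given, so nothing to infer
    candidate = f"Category:Languages of {country}"
    # Only categorize when the target category actually exists.
    return candidate if candidate in existing else None


existing = {"Category:Languages of Nigeria", "Category:Languages of Belize"}
print(languages_of_category("Nigeria", existing))
print(languages_of_category("Atlantis", existing))
```

Country names that take an article (as in Category:Languages of the United States) would need extra handling that this sketch leaves out.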

Definitely. Completely sensible, especially since "Languages of [place]" include dead languages or very small minority ones. Including dialects that are widely spoken gives a much better impression of the actual languages spoken in that place. —Justin (koavf)❤T☮C☺M☯ 07:43, 13 September 2023 (UTC)[reply]

@Koavf Did you read the part where I mention that the languages themselves are already listed in these country-specific categories? Including them causes duplication between language and dialect. Benwing2 (talk) 08:04, 13 September 2023 (UTC)[reply]

I think both should be in the country category. I'm fine with both CAT:English language and CAT:American English being in CAT:Languages of the United States. It doesn't feel redundant, it feels thorough, especially since CAT:American English only lists words that are uniquely American (or North American) but not everyday words that are found in all dialects of English. —Mahāgaja · talk 08:14, 13 September 2023 (UTC)[reply]

Yes, I'm not sure how my response was itself unclear. I agree with the semi-automated inclusion via {{auto cat}} or {{dialectboiler}} or some other systematic way of inclusion. The solution to this double-counting problem could just as easily be 1.) it's not a problem, just leave them or 2.) only include the country of origin in the language itself and have dialects at the countries. Either one could be reasonably argued: I'm just trying to answer the question you asked without introducing noise. —Justin (koavf)❤T☮C☺M☯ 08:24, 13 September 2023 (UTC)[reply]

@Koavf: That's worrying. I couldn't work out which proposition 'definitely' was meant to agree with. --RichardW57m (talk) 17:21, 13 September 2023 (UTC)[reply]

The thread has a yes or no question in its title, so I answered it with a "(yes) definitely". When someone asks an explicit question, I try to actually answer the question asked. —Justin (koavf)❤T☮C☺M☯ 18:21, 13 September 2023 (UTC)[reply]

Seems reasonable. Vininn126 (talk) 17:15, 13 September 2023 (UTC)[reply]

Shavian-alphabet English entries

An IP has been making entries for English words in the Shavian alphabet, a constructed alphabet for English with negligible currency outside of some hobbyists, e.g. 𐑮𐑦𐑕𐑐𐑧𐑒𐑑 (respect). Do we want these? Do they need to be RFV'd with three cites? Or is a constructed script like a conlang, only deserving entries if it has stable native/professional use? We don't have any English entries in the Deseret alphabet. Shavian and Deseret have come up a few times before in various contexts (e.g. Template talk:deseret) but this might be the first time anyone's bothered systematically creating entries. —Al-Muqanna المقنع (talk) 00:37, 14 September 2023 (UTC)[reply]

If a word in the Shavian alphabet can pass CFI, then it stays. That's the way it should work, in my book. Helps us have the most-used Shavian-script words, without the flood of nonce Shavians/Deserets. CitationsFreak (talk) 01:23, 14 September 2023 (UTC)[reply]

I agree. Shavian and Deseret spellings need three cites showing usage (not merely mentions) from three independent sources over more than a year. Otherwise, they get deleted. —Mahāgaja · talk 06:03, 14 September 2023 (UTC)[reply]
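For illustration only, the test as worded in the comment above (three uses, three independent sources, spread over more than a year) could be sketched in Python like this. The `Cite` record and its fields are hypothetical, and what counts as an "independent" source is left to whoever fills in the `source` field; this is not a statement of the actual CFI wording.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Cite:
    source: str      # hypothetical label for one independent source
    published: date  # publication date of the quotation
    is_use: bool     # True for a use, False for a mere mention


def passes_three_cite_test(cites: Iterable[Cite]) -> bool:
    """Check: at least three independent sources using the term, over more than a year."""
    uses = [c for c in cites if c.is_use]  # mentions do not count
    if len({c.source for c in uses}) < 3:
        return False
    dates = [c.published for c in uses]
    return (max(dates) - min(dates)).days > 365
```

Whether three articles in one Shavian-script journal count as three sources is exactly the judgment call the sketch cannot make; it only counts distinct `source` labels.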

If we are to have any entries in these scripts at all, we need to have a concept of "ConScript" (see also Constructed writing system on Wikipedia), have script codes for these scripts and CSS entries so they are displayed correctly, and ensure that they are correctly detected at the language level. This is significant and non-trivial technical work, and for this reason I think it's better to ban them entirely unless we're willing to bite the bullet and do the implementation work. Otherwise we just end up with bad-looking characters that are broken on many people's browsers. Benwing2 (talk) 06:08, 14 September 2023 (UTC)[reply]

I think most scripts whose characters get encoded into [non-PUA] Unicode (like Shavian and Deseret) also get ISO script codes (in this case, Shaw and Dsrt, which are in our module). I don't think we can or should have entries in any scripts that aren't encoded into [non-PUA] Unicode. - -sche (discuss) 06:53, 14 September 2023 (UTC)[reply]

This is an interesting question. It's tempting to think "include any script (e.g. Shavian) and any spelling in that script (e.g. 𐑮𐑦𐑕𐑐𐑧𐑒𐑑) if three authors have used it", but if three authors publish books (or webpages) written entirely λικε ←that or with леттерс ←like those, would we add Greek- and Cyrillic-script forms of all the English words the books have in common? If Greek bloggers tweet in Latin script, do we include that? Or, as you ask, should we require some evidence that there is a real community of native, natural users, like for e.g. historical Cyrillic Romanian? (Is the standard even higher than that? There are some naturally-used script-forms we—IMO startlingly and inconveniently—don't include, like Latin-script Yiddish.) Does the fact that this script was specifically invented for English make it more acceptable, or does the fact that it was specifically invented (within living memory) like a conlang make it less includable? If it weren't a separate script but just a new (Latin-script) orthography, and a similar-size group of hobbyists published writings all juzing thair nu speling, I feel like we'd include that with the relevant label(s). (So I'm not sure what my answer to the question is, yet.) - -sche (discuss) 07:06, 14 September 2023 (UTC)[reply]

Such Latin-script reform experiments have happened—"American spellings" were the result of a successful one—and it might be possible to find 3 distinct sources written in something like SR1. What does "independent" (per the CFI) mean in this context, though? Apparently a journal was published in the Shavian alphabet which went through a number of issues, would three different articles there count as independent even though they're all edited by the same group? @Mahagaja —Al-Muqanna المقنع (talk) 07:46, 14 September 2023 (UTC)[reply]

If we were talking about spellings (juzing for using in The Journal Of Spelling "Juzing" Like That) or words (say competitive eaters writing in Competitive Eating Quarterly all use gurgitize to mean "eat", but no-one else does), I'd say if they're written and edited by three different people, they're includable, and the people being in one small group is something to note in a label or usage note. (We have to be careful not to present rare [or even fringe, loaded] terms as widespread, and we have to figure out how to handle the fact that e.g. SR1 spellings are all better-attested, since before the advent of SR1, as simple eye dialect, but I don't see any reason or way to exclude them.) If they were written by different people but edited by the same person, I'd consider context: maybe an American science journal editor would enforce a house style and silently change both a London doctor's foetusise and an Oxford scholar's foetusize to fetusize, so we might not consider those independent if they were edited by the same person, but if we're talking about The Journal Of SR1, or Shavian Monthly, I think we can assume that anyone who'd contribute work to there would've written in SR1 / Shavian to begin with. The only reason I see to treat Shavian-script forms different from SR1 or juzing is that it's an entirely separate script, and a "conscript" at that... but IDK, maybe we should just subject that to normal ATTEST requirements... - -sche (discuss) 15:43, 14 September 2023 (UTC)[reply]

Leaning exclude. The Shavian alphabet was designed to be phonetic, so IMO it's equivalent to me creating an entry at ɹɪˈspɛkt for respect. Even if it's citable, is it even English? — excarnateSojourner (talk · contrib) 18:25, 15 September 2023 (UTC)[reply]

Translation of a translation (of a translation etc.)

Hello everyone. How do you deal—in terms of quotes—with a translation of a translation?
I'm using the Sassarese translations of several books of the Bible as sources for quotes. These translations are based on an 18th-century Italian translation of the Latin 1592 Clementine Vulgate, which is based on Vetus Latina texts (which, in turn, are based on the Ancient Greek Septuagint, translated from Hebrew texts). This is an example of how I'm currently handling it:

1863 [1770s], Antonio Martini, chapter I, in Giovanni Spano, transl., Lu càntiggu de li càntigghi di Salamoni [Solomon's canticle of canticles], London, translation of Il cantico de' cantici (in Italian), verse 14, page 6:
Tu sei veramenti bedda, o amigga meja, veramenti bedda: l’ occi toi sò di culombi.
[original: Bella veramente ſei tu, o mia diletta: bella veramente ſe’ tu, gli occhi tuoi ſon di colomba.]
[Bella veramente sei tu, o mia diletta: bella veramente se’ tu, gli occhi tuoi son di colomba.]
You are very beautiful, o lover of mine, very beautiful. Your eyes are those of doves.

My question is: should one be concerned with... I don't know, doing something in order to mention that a translated work is based on another translated work?
I realize this is probably an extreme example, but I'm curious to know if there's any point in even considering this.
Thanks in advance. —— GianWiki (talk) 10:39, 14 September 2023 (UTC)[reply]

For the Bible I definitely don't think you should bother putting original dates and languages and so forth. See {{RQ:King James Version}} for example. You've already indicated it's a translation from an Italian version, which is enough. —Al-Muqanna المقنع (talk) 10:57, 14 September 2023 (UTC)[reply]

I see. Thank you very much for your answer.
GianWiki (talk) 12:49, 14 September 2023 (UTC)[reply]

Yeah, in the case of the Bible, I think people know it wasn't originally written in Italian, or King James' English, etc (so they should know the Italian edition is itself translating another language). If there were some exceptional situation where it was necessary to go 'another step backwards' to show something relevant about the word being quoted — e.g. the Sassarese edition of something uses the word under discussion in one place but then uses a different word later in the quote, whereas the Italian version uses the same word twice, but the Latin/Greek/etc does use two different words, which the Sassarese edition is reflecting — I think we could resort to some exceptional solution like tagging each of the Italian words with {{transterm}}. - -sche (discuss) 15:03, 14 September 2023 (UTC)[reply]

Using |author= or |origXYZ= is improper since the 1770s Italian translation isn't an "original" of anything. Just ignore all of the intermediate steps.

1863, Giovanni Spano, transl., Lu càntiggu de li càntigghi di Salamoni [Solomon's Song of Songs], London: Impensis Ludovici Lugiani Bonaparte, chapter 1, verse 14, page 6:
Tu sei veramenti bedda, o amigga meja, veramenti bedda: l’ occi toi sò di culombi.
You are very beautiful, o lover of mine, very beautiful. Your eyes are those of doves.

Ioaxxere (talk) 17:00, 14 September 2023 (UTC)[reply]

When I cite the Bible in Gothic entries, I usually use the King James Version in the translation, since both the Gothic and the KJV are fairly literal translations of the Greek, which means they match up pretty well with each other (but I do deviate from KJV when it really doesn't match the Gothic). In this case, you could use Douay-Rheims, which is also a translation of the Vulgate and thus probably matches up pretty well with the Sassarese Bible. And it's out of copyright, so you don't have to worry about using too much of it. —Mahāgaja · talk 20:35, 14 September 2023 (UTC)[reply]

@GianWiki FWIW, in the documentation of {{quote-book}} there's an example of a translation of a translation. There's also the |origtext= param for indicating the original text in circumstances like these. Benwing2 (talk) 06:36, 15 September 2023 (UTC)[reply]

How to quote a book containing translations of various poems from various authors?

Hi. How do you quote a book where each chapter is a translation of a poem (with works from 5 different authors translated throughout the entire book)? I don't think using |deriv=translation and |original= cuts it, because that refers to the whole quoted book as a translation of a single work, and not single chapters. —— GianWiki (talk) 16:34, 14 September 2023 (UTC)[reply]

@GianWiki Are you somehow quoting the "whole book" or a single chapter? If the latter case, there are standard ways of doing this, e.g. |chapter_tlr= for a chapter translator and |author= for the chapter original author. Benwing2 (talk) 06:34, 15 September 2023 (UTC)[reply]

I also thought about doing something like this

xxxx [yyyy], Poem translator, transl., “Translated poem”, in Anthology, translation of Original poem by Original poem's author:
Quoted text
[original: Original quoted text]
English translation of quoted text

but it indicates Anthology—instead of “Translated poem”—as a translation of Original poem (maybe there is just no better way of doing it, since you can't specify when a single chapter is a translation on its own). —— GianWiki (talk) 07:07, 15 September 2023 (UTC)[reply]

Should Wiktionary be the "best free & open dictionary" or the "best dictionary"?

I ask this because the current header for the Grease Pit reads "the best free and open online dictionary". I believe that this should read "the best dictionary", since we should aspire to be the best dictionary of all time, and not merely the best free one. (Yes, I know that removing that means that some schmuck may think that means shoving it behind a paywall and putting in ads is what we want, but that is not making it the best dictionary anyway.) CitationsFreak (talk) 01:13, 15 September 2023 (UTC)[reply]

I think "free and open" is absolutely central to the philosophy and ethics of this project. Equinox ◑ 01:22, 15 September 2023 (UTC)[reply]

Absolutely. I think it is an important thing that we remain free and open. However, it just sounds like there are dictionaries that are not free/open which are better than us, and we're fine with that. Which we shouldn't be. CitationsFreak (talk) 01:27, 15 September 2023 (UTC)[reply]

If it's just at the Grease Pit, I think it's kind of inside baseball. How many readers who are unfamiliar with the ethic of the project are ever going to arrive at that page? bd2412 T 02:40, 15 September 2023 (UTC)[reply]

"It [the Grease Pit] is also a place to think in non-technical ways about how to make the best free and open online dictionary of “all words in all languages”." seems wrong. I would have thought that BP is the place for such policy matters, just as GP, not a user page, is the right place to discuss larger technical matters that impinge on and implement policies. I thought the basic principle would be to have discussions in forums that are inclusive ones, with audiences as large as practical of those affected by the matters discussed. DCDuring (talk) 12:11, 15 September 2023 (UTC)[reply]

Essential to me to mention these aspects. --Geographyinitiative (talk) 12:33, 15 September 2023 (UTC)[reply]

Thai Mon vs. Mon

@-sche In Oct 2022, User:Octahedron80 created a new language called "Thai Mon". We also have the Mon language. There is no mention of a Thai Mon language in Wikipedia; the closest is a paragraph in the Mon language entry that says "Thai Mon has some differences from the Burmese dialects of Mon, but they are mutually intelligible. The Thai varieties of Mon are considered "severely endangered."". We also have a Category:Thai Mon category, which is supposed to represent the Thai dialect of Mon and is added by the 'Thailand' label. I'm extremely skeptical that we need a separate Thai Mon language. I'm not sure if there was any discussion that led to this split or if it was just a "be bold" moment. I suggest we delete Thai Mon and move the entries to Mon, with the 'Thailand' label. Benwing2 (talk) 06:32, 15 September 2023 (UTC)[reply]

It is hard to say. At first, I had the same thought: to use Myanmar/Thailand tags to specify the dialect, adding them to pronunciations and definitions. That was until a (Myanmar) Mon person, Intubesa, joined the game. He did not know what we were doing and he was overconfident; he claimed that Thailandish Mon (Thai Mon) words were spelled wrongly, and he said that Wiktionary must use only Myanmar forms, because he relied on the ancient texts he worked on. I knew dialects could be read/spelled differently depending on location, which is the nature of language, so I had to get some Thailandish references and collect alternative forms for a term. But he did not accept them and argued that the references were wrong too! (even though they were written by true Mons or experts). Then the edit war happened: if I wrote a Thai form, it would be renamed to the corresponding Myanmar form (and the Thai form was neglected). The argument also spread to other innocent users. Nevertheless, before he got banned, he suggested splitting Thai Mon out of Myanmar Mon so they would not get mixed up. I agreed with this idea because there are a lot of words that are read/spelled differently enough to justify a split. The two Mons also have different grammar, e.g. Thai Mon uses noun+modifier whereas Myanmar Mon uses modifier+noun. The paper "The Mon language: recipient and donor between Burmese and Thai" explains this situation. I also use this concept at Thai Wiktionary and he has never gotten in the way again. --Octahedron80 (talk) 14:47, 15 September 2023 (UTC)[reply]

If Thai Mon cannot be a "particular language", then the category "Thai Mon" could make use of dialectboiler to make a local portal for the dialect, like English, Spanish, Portuguese, etc. Unfortunately, the term "Thailandish" is not accepted here; just "Thai" might then be ambiguous with the Thai language in automatic categorization. --Octahedron80 (talk) 14:57, 15 September 2023 (UTC)[reply]

@Octahedron80 Thanks for the response. We can create language-specific labels so there's no ambiguity with a label like 'Thailand' applied to Mon entries. If you don't mind I will merge them. In general, when you have issues with someone like this, the correct thing is to get them banned rather than splitting a language. Benwing2 (talk) 18:28, 15 September 2023 (UTC)[reply]

-less, -lessly, and -lessness

Could we have a bot or a script do the following things:

Add a derived terms section to any page ending in -less to the corresponding -lessly and -lessness pages, if they exist;
Tag the -lessly and -lessness pages as {{lb|en|rare}} if they do not have cites or at least a usex (e.g. embryolessness is not a common word, since we most commonly would say "lack of embryos" or the like).

I consider this a low-priority task and won't be upset if we decide it's not worth writing the code and running the script. Some of these words might not even pass RFV, but I don't want to see 1,200 RFV's or even a mass-RFV/D that would delete all the hard work we've put in unless someone cracks the books looking for three cites for every single word. —Soap— 14:57, 15 September 2023 (UTC)[reply]
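As a minimal sketch of the first task only (finding which -lessly and -lessness pages exist for each -less page), something like the following Python could serve as the planning step. It assumes a hypothetical set of existing page titles is already available and does no wiki editing; the function name is made up for illustration.

```python
from typing import Dict, List, Set


def plan_derived_terms(titles: Set[str]) -> Dict[str, List[str]]:
    """Map each -less page to its existing -lessly / -lessness pages."""
    plan: Dict[str, List[str]] = {}
    for title in sorted(titles):
        # Skip the bare word "less" and anything not ending in -less.
        if not title.endswith("less") or len(title) <= len("less"):
            continue
        derived = [title + suffix for suffix in ("ly", "ness") if title + suffix in titles]
        if derived:
            plan[title] = derived
    return plan


titles = {"homeless", "homelessly", "homelessness", "embryoless", "embryolessness", "bless"}
print(plan_derived_terms(titles))
```

A purely string-based match also picks up words like bless that merely end in the letters "less", so a real run would want to check the etymology section before editing. The "rare" tagging is deliberately not sketched here.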

I don't think it's very safe to assume that such a form is rare if it doesn't have uxes or cites. For example, a very common word homelessness has neither. It's often just that they don't get a lot of attention because there is usually not much to talk about there as they just often say "the state of being X". lattermint (talk) 15:04, 15 September 2023 (UTC)[reply]

Would definitely oppose auto-tagging "rare" as well, though the first request seems reasonable enough. I can see a good number of attestations of embryolessness in technical literature as well so at most it would be "uncommon". —Al-Muqanna المقنع (talk) 15:07, 15 September 2023 (UTC)[reply]

Appendix Part of Speech Templates

Currently, this is what one must do on an appendix page to get it to function correctly: ((en-noun|((apdx-l|plural)))) (replace parentheses with curly brackets), which is more complicated than it needs to be. Templates should automatically link to an appendix page if within an appendix. Otherwise, they will function as normal in the main namespace. Netizen3102 (talk) 19:42, 15 September 2023 (UTC)[reply]
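To make the proposal concrete, here is a small Python sketch of the namespace check being asked for. The real templates are implemented in wikitext and Lua, and the choice of link target inside the Appendix namespace (`Appendix:` plus the form) is an assumption for illustration, not how {{apdx-l}} is known to behave.

```python
def link_for(form: str, page_title: str) -> str:
    """Build a wikilink for an inflected form, depending on the namespace."""
    namespace, sep, _ = page_title.partition(":")
    if sep and namespace == "Appendix":
        # Assumed target: an appendix page named after the form itself.
        return f"[[Appendix:{form}|{form}]]"
    # Main namespace (or any other): link as usual.
    return f"[[{form}]]"


print(link_for("cats", "cat"))
print(link_for("plurals", "Appendix:Example/plural"))
```

The point of the design is that the editor writes the same template call everywhere and the namespace decides the target, instead of wrapping each form by hand.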

Does this really need to be a formal vote? Has this already been brought up in Grease Pit? AG202 (talk) 20:09, 15 September 2023 (UTC)[reply]

Yeah, it has been brought up, 13 years ago. Some appendix words may benefit from having their own pages. Netizen3102 (talk) 21:30, 15 September 2023 (UTC)[reply]

mismatch between Proto-Foo-Romance and Romance families

@Nicodene, -sche We have a serious mismatch between the Romance classification as found in the families in Module:families/data and the proto-languages found as subcategories of Category:Vulgar Latin. Trying to match them up we have:

Category:Proto-Balkan-Romance approximately lines up with Category:Eastern Romance languages but the naming should be harmonized.
Category:Proto-Italo-Western-Romance: No match. This includes part of Category:Italo-Dalmatian languages, plus Category:Gallo-Italic languages, Category:Occitano-Romance languages, Category:Oïl languages, Category:Rhaeto-Romance languages, Category:West Iberian languages and two languages hanging out by themselves: Category:Franco-Provençal language and Category:Venetian language.
Category:Proto-Italo-Romance: The closest corresponding family is Category:Italo-Dalmatian languages, but they don't line up.
Category:Proto-Western-Romance: No match. This includes Category:Occitano-Romance languages, Category:Oïl languages, Category:Rhaeto-Romance languages, Category:West Iberian languages, the unattached Category:Franco-Provençal language and possibly some or all of Category:Gallo-Italic languages.
Category:Proto-Gallo-Romance: No match. This presumably includes Category:Gallo-Italic languages and Category:Oïl languages.
Category:Proto-Ibero-Romance approximately lines up with Category:West Iberian languages. ("West Iberian" is confusing as it corresponds to all of Iberia except Catalan, and in no way, shape or form corresponds to western Iberia, which would approximately be Galician and Portuguese. I'm not sure if this naming is intended to exclude Catalan, or to disambiguate the family from the other Iberia in the country of Georgia, or to disambiguate the family from the ancient Iberian language.)

I would assume we also need a Proto-Rhaeto-Romance category. Benwing2 (talk) 01:30, 16 September 2023 (UTC)[reply]

Can hyponyms be included in Derived Terms?

"Postman" is both a derived term of "man" (it's post + man) and a hyponym of "man" (because a postman is a type of man, i.e. one who delivers the mail). Please look at this: [12]. @Whoop whoop pull up says that we mustn't include derived terms if they are also hyponyms. I believe this is wrong. Can someone show me policy? Thanks. (-sche will be delighted to see me arguing to keep trans woman as a hyponym of woman. I only care about the words... mostly.) Equinox ◑ 01:53, 16 September 2023 (UTC)[reply]

My comment was based on the bit in the derterms section of woman that explicitly says that that section's specifically for derterms that aren't also hyponyms. Whoop whoop pull up ^{Bitching Betty ⚧️ Averted crashes} 02:34, 16 September 2023 (UTC)[reply]

Policy, shmolicy. There is some lunacy here. Would occupational hyponyms appear once for the adult-male-only definition and again for the adult-human definition? Should any role, occupational or other taken on by an adult male appear under Hyponyms. Why not each given name, surname, nickname, ethnonym? Hyponyms should be useful, not complete, provided we are still trying to help humans. DCDuring (talk) 02:36, 16 September 2023 (UTC)[reply]
