Wiktionary:Beer parlour/2021/July

Regional Variations in Pali Inflection Tables

We need to discuss how we are going to handle regional variations in Pali spelling as it affects inflection tables. This affects templates {{pi-decl-noun}} and {{pi-conj-special}}.

One problem is that transliteration cannot easily be tuned on a form-by-form basis. For variations in the stem, I have already proposed a 'subst' parameter. Please discuss that approach there. --RichardW57 (talk) 05:25, 2 July 2021 (UTC)[reply]

This problem is actually solved by the now-implemented |subst= in the main Pali inflection templates, provided that all forms of a word spelt the same are to be transliterated the same, and the transliteration corresponds to some Pali spelling. --RichardW57 (talk) 11:22, 4 July 2021 (UTC)

There are already some parameters to address the generation of the tables - use of implicit vowels and its concomitants (|impl=), the shape of the ā vowel (|round=), the choice of consonant for -y- (|y=) and the form of the Lao script instrumental/ablative plural in -bh- (|liap=). Parameters |impl= and |y= are passed down into a special transliteration interface (trwo() instead of tr()) in the transliteration module.

We now have some additional features to worry about. Apparently some Mon use SIGN II for -iṃ, and Shan Pali (assuming it meets CFI) works like a different script to the Burmese script. Shan may have two writing systems, but quite a few writing systems seem to have been proposed for Pali in the Shan script. Two Shan writing systems are displayed in the lists of alternative forms. There is also a Lao writing scheme whereby consonant clusters are (mostly) written using the rules for Lao. What parameters would users find usable for controlling these features in the inflections? What mixtures of writing systems in the inflection tables should users find acceptable?

I ask, especially of @Octahedron80, that implementation of these new variations be suspended until we have discussed the matter. --RichardW57 (talk) 05:25, 2 July 2021 (UTC)[reply]

@Octahedron80: I have pulled together my thoughts on the 10 scripts, and they are as follows: --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

General

In principle, homorganic nasals might be written using niggahita instead of the explicit nasal consonant, but I have not yet encountered systematic use. This would affect verbal inflection. Normative spelling generally rejects it, though it does turn up as the main spelling occasionally, but affecting stems rather than affixes. I think a general parameter used across scripts would be appropriate for this. I have ignored this parameter in the discussion below. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)
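The niggahita alternative can be illustrated on romanised forms. The following is only a minimal sketch, assuming IAST-style input with ṃ for niggahita; the function name `nasal_to_niggahita` and the consonant classes are illustrative and are not part of any existing template or module.

```python
import re
import unicodedata


def _nfc(text: str) -> str:
    """Normalise to NFC so that letters such as ṅ are single code points."""
    return unicodedata.normalize("NFC", text)


# Hypothetical table: each nasal mapped to the unaspirated stops of its own
# class. Aspirates need no separate entry because 'kh' begins with 'k', etc.
HOMORGANIC = {
    _nfc("ṅ"): _nfc("kg"),
    _nfc("ñ"): _nfc("cj"),
    _nfc("ṇ"): _nfc("ṭḍ"),
    _nfc("n"): _nfc("td"),
    _nfc("m"): _nfc("pb"),
}

NIGGAHITA = _nfc("ṃ")


def nasal_to_niggahita(word: str) -> str:
    """Rewrite a nasal standing before a stop of its own class as ṃ."""
    word = _nfc(word)
    for nasal, stops in HOMORGANIC.items():
        # The lookahead leaves the following stop untouched.
        word = re.sub(f"{nasal}(?=[{stops}])", NIGGAHITA, word)
    return word
```

Under these assumptions a verbal ending such as -anti would come out as -aṃti, while geminate nasals (as in paññā) are left alone because the second nasal is not a stop.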

I think the way forward should be to gather and share data, rather than to leap into coding inflections for ill-documented systems. A simple chart of consonants and vowels is not sufficient - consider the complexities of the Sinhalese script and the Burmese style of the Burmese script. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Latn

The dominant writing system for the Roman script is IAST and its variants. The variations affecting the inflections are:

  1. ṃ v. ṁ v. ŋ v. n
  2. Mark for vowel length - macron, acute, circumflex or grave.

At present, so far as I am aware, we only use ṃ and macron. I had thought that there was a ukase to that effect, but I can't find it.

I would want to mix forms in a single inflection table, though listing them as a footnote would also be appropriate. Until people start creating entries using these alternative writing systems, I see no reason to add support for them in the inflection tables. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]
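For reference, the two Roman-script variations listed above are mechanical enough to sketch. This is a hedged illustration only: it assumes the input uses ṃ and the macron, and the names `respell`, `LENGTH_MARKS` and `NIGGAHITA_VARIANTS` are invented for the example rather than taken from any module.

```python
import unicodedata

MACRON = "\u0304"  # combining macron, the assumed input length mark

# Hypothetical option names for the four length marks mentioned above.
LENGTH_MARKS = {
    "macron": "\u0304",
    "acute": "\u0301",
    "circumflex": "\u0302",
    "grave": "\u0300",
}

NIGGAHITA_DEFAULT = "\u1e43"  # ṃ
NIGGAHITA_VARIANTS = ("\u1e43", "\u1e41", "\u014b", "n")  # ṃ, ṁ, ŋ, n


def respell(word, length_mark="macron", niggahita=NIGGAHITA_DEFAULT):
    """Respell a form written with ṃ and macrons in another Roman variant."""
    if niggahita not in NIGGAHITA_VARIANTS:
        raise ValueError(f"unsupported niggahita spelling: {niggahita!r}")
    mark = LENGTH_MARKS[length_mark]  # KeyError for an unknown mark name
    # Swap the niggahita letter while it is still a single code point.
    word = unicodedata.normalize("NFC", word).replace(NIGGAHITA_DEFAULT, niggahita)
    # Decompose so that the macron becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", word).replace(MACRON, mark)
    return unicodedata.normalize("NFC", decomposed)
```

Note that choosing plain n for niggahita is lossy: the result cannot be converted back to ṃ automatically.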

Thai

There are two major writing systems for the Thai script - with and without implicit vowels. For the former, there are older variations involving the use of yamakkan and wanchakan, but these do not need to be addressed for now.

Although some inflection tables for feminine forms mix the two systems, I now think that it is better to give separate tables for nominals. (There are currently technical problems with transliteration in mixing them for masculine and neuter forms.) --RichardW57 (talk) 10:51, 30 August 2021 (UTC)

Deva

There is no indication of any variation that affects the inflection. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Beng

The only likely variation lies in the representation of <v> in inflection. We may have a choice between Bengali BA, RA WITH MIDDLE DIAGONAL and RA WITH LOWER DIAGONAL. If this happens, I would favour putting them in the same table for verbs. On balance, I also favour mixing them for nominals. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)

Sinh

We currently have decent attestation for touching letters (with some conjunct exceptions) only. We might need at some point to support visible AL LAKUNA.

I would favour mixing them in the same inflection table. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Brah

As we do not classify Ashokan Prakrit as Pali, there do not seem to be any writing system variations to be concerned with. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Khmr

There is as yet no indication of any variation that affects the inflection. However, in the long term Khom texts should be investigated, as the script as a whole distinguishes an unencoded vowel like โ from the dependent vowel OO of the Khmer script. Paleographers currently ignore the distinction in Old Khmer, but it is significant in Tai vernaculars. If both be used in Pali, I would favour mixing them in tables. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Lana

Affecting inflection, there is a distinction between round AA and long AA, which is currently handled by the parameter |aa= to the inflection templates. The occurrence of <SA, SIGN SA> is so far rare enough to be included only where attested. So far, it does not seem worth splitting inflection tables by writing system. There is potentially an issue with aorist 3rd plural in -iṃsu, but for now more examples are needed. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)

Mymr

There are at least six writing systems. There is a possibility of conflict between round AA and long AA, but the mechanism for the Lana script will also handle that. The writing systems are Burmese, Mon, 'old' Shan, 'new' Shan, Khamti Shan (Unicode proposal L2/08-276) and Tai Laing (Unicode proposal L2/12-012). Thai Mon should be reviewed for differences - off the top of my head, there may be an issue with -ss-.

A preliminary character chart is here. While combining some of Shan, Khamti Shan and Tai Laing will sometimes be possible, it looks infrequent enough not to expend significant effort enabling it. They will rarely mix with other variants.

Title Burmese Mon Shan Khamti Shan Tai Laing
k က က က က
kh
g
gh
c
ch
j
jh
ñ or
ṭh
ḍh
t
th
d
dh
n
p
ph
b
bh
m
y
r
l
v
s or ?
h
Independent vowels
a
ā အာ အာ ဢႃ ဢႃ ဢႃ
ā (closed) N/A N/A ဢၢ ဢၢ N/A
i ဢိ ဢိ ဢိ
ī ဣဳ ဢီ ဢီ ဢီ
u ဢု ဢု ဢု
ū ဥု ဢူ ဢူ ဢူ
e ဢေ ဢေ ဢေ
o ဢေႃ ဢေႃ ဢေႃ
cjct stack stack stack / nostack nostack nostack
Dependent vowels
kaṃ ကံ ကံ ကံ ကံ ကံ
ka က က က က က
kā ကာ ကာ ကႃ ကႃ ကႃ
gā ဂါ ဂါ or ? N/A N/A N/A
kā (closed) N/A N/A ကၢ ကၢ N/A
ki ကိ ကိ ကိ ကိ ကိ
kī ကီ ကဳ ကီ ကီ ကီ
ku ကု ကု ကု ကု ကု
kū ကူ ကူ ကူ ကူ ကူ
ke ကေ ကေ ကေ ကေ ကေ
ko ကော ကော ကေႃ ကေႃ ကေႃ
go ဂေါ ဂေါ or ? N/A N/A N/A
Geminates
ññ or ည္ည ၺ္ၺ or ၺ်ၺ
ss or သ္သ unenc or သ်သ

Two vowels in inflections show variation: -ā and -ī. For the former we can simply extend the use of |aa=, taking values 'round', 'tall', 'both' and 'shan'. For the latter, we can do something similar with |ii=. Thinking ahead, we can footnote the alternatives as Burmese and as Mon when we select |ii=both.
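A rough sketch of how such a pair of parameters might select the vowel signs follows. It is an illustration under stated assumptions, not the templates' actual logic: the |aa= values are those proposed above, but the |ii= values 'burmese' and 'mon', the function names and the footnote wording are all hypothetical.

```python
import unicodedata

# Values proposed for |aa=, mapped to Myanmar-block vowel signs.
AA_SIGNS = {
    "round": "\u102c",  # MYANMAR VOWEL SIGN AA
    "tall": "\u102b",   # MYANMAR VOWEL SIGN TALL AA
    "shan": "\u1083",   # MYANMAR VOWEL SIGN SHAN AA
}

# Assumed values for |ii=; only 'both' is named in the proposal.
II_SIGNS = {
    "burmese": "\u102e",  # MYANMAR VOWEL SIGN II
    "mon": "\u1033",      # MYANMAR VOWEL SIGN MON II
}


def aa_forms(stem, aa="round"):
    """Return the spellings of stem + ā selected by the aa value."""
    if aa == "both":
        keys = ("round", "tall")
    elif aa in AA_SIGNS:
        keys = (aa,)
    else:
        raise ValueError(f"unknown aa value: {aa!r}")
    return [stem + AA_SIGNS[key] for key in keys]


def ii_forms(stem, ii="burmese"):
    """Return (spelling, footnote) pairs for stem + ī.

    A footnote label is attached only when both alternatives are shown.
    """
    if ii == "both":
        return [(stem + II_SIGNS[key], key.capitalize()) for key in ("burmese", "mon")]
    if ii in II_SIGNS:
        return [(stem + II_SIGNS[ii], None)]
    raise ValueError(f"unknown ii value: {ii!r}")
```

For example, `aa_forms("\u1000", "both")` yields the round and tall spellings of kā in that order, and `ii_forms("\u1000", "both")` yields the two kī spellings labelled Burmese and Mon.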

However, the consonant differences are different enough to overwhelm this system when we include the other systems. The letters r, s, n and h vary between the writing systems, and I propose that we use a writing system parameter (possibly |ws=, possibly even |sc=) to distinguish between the Shan sensu lato writing systems. I'm inclined to identify them by the codes for the vernaculars - shn, kht, and tjl. Shan has exhibited two writing systems for Pali, one stacked and another non-stacked. The wording of descriptions gives the impression that Khamti and Tai Laing Pali do not stack consonants, but in association with Khamti I have seen the misspelt ꩬံပုက္တႃꩬး (for saṃbuddhassa - L2/20-162 p9 l3), so there may be similar variations in the others. I therefore propose to have a parameter such as |stack= to control this dimension.
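The two proposed dimensions can be sketched independently of the consonant differences. The example below is a simplified illustration: it assumes that a stacked cluster is joined with VIRAMA and a non-stacked one with ASAT, as the geminate spellings in the chart suggest, and the function name `join_cluster` and its signature are hypothetical. It deliberately ignores the per-system letter shapes for r, s, n and h.

```python
import unicodedata

VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA: invisible, stacks the next consonant
ASAT = "\u103a"    # MYANMAR SIGN ASAT: visible killer, no stacking

# Vernacular codes suggested above for the writing system parameter.
WRITING_SYSTEMS = ("shn", "kht", "tjl")


def join_cluster(first, second, ws, stack):
    """Join two consonants as a stacked or non-stacked cluster.

    ws is checked against the suggested codes; stack is a plain boolean
    standing in for whatever values |stack= would eventually accept.
    """
    if ws not in WRITING_SYSTEMS:
        raise ValueError(f"unknown writing system: {ws!r}")
    joiner = VIRAMA if stack else ASAT
    return first + joiner + second
```

Keeping `ws` and `stack` as separate arguments mirrors the observation that stacking may vary within a single writing system, so one cannot be derived from the other.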

I think we need to gather, record and share examples (with quotations!) before we leap into generating tables for the Shan, Khamti and Tai Laing writing systems. There's too much scope for making mistakes for my liking. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Laoo

There seem to be many writing systems. The existence of the following is demonstrated or asserted:

a) Buddhist Institute alphabet with implicit vowels (an abugida), b) Buddhist Institute alphabet without implicit vowels (technically an alphabet). c) Lao repertoire with following options:

      Use ຣ or not for <r>?
      Use ຢ or not for <y>
      Use ດບ for clusters?
      Use cancellation mark for 'un-Lao' clusters? (May need resolution sv v.tv)

d) System using nuktas.

As with Thai, inflection tables for the abugidic spelling are best kept separate. At present I am merging the others for nominals where the stems are the same. For verbs, I am splitting the tables on the basis of the writing of <y> - NYO v. YO.

To accommodate the systems that use ດບ for clusters, I want parameters to control the spelling of 'ss' and 'sm' in the masculine and neuter nouns. It may be one parameter, or it may be two. Some at least of these systems write 'sm' with a cancellation mark; I am not sure if it is all of them. --RichardW57 (talk) 10:51, 30 August 2021 (UTC)[reply]

Slavey split

@Mahagaja We have forgotten to conclude this discussion, so I'll summarise the proposals:

  1. Change Slavey (den) into a family (Slavey languages, den); put South Slavey (xsl) and North Slavey (see below) as members.
  2. Split North Slavey (scs) into three languages: Bearlake Slavey (den-blk?), Hare Slavey (den-hre?) and Mountain Slavey (den-mnt?)

The code names can be debated about. If nobody's opposing these, I would like them to be implemented. Thadh (talk) 12:39, 2 July 2021 (UTC)[reply]

Someone might want to redirect the entries listed there to Proto-West Germanic entries. (@Rua). ·~ dictátor·mundꟾ 20:04, 2 July 2021 (UTC)[reply]

@Benwing2, you might do a bot-operation for this. ·~ dictátor·mundꟾ 21:58, 4 July 2021 (UTC)[reply]

Admin help required

I recently came across Wiktionary:Beer_parlour/2020/July#Update_CFI_to_reflect_decision_on_treatment_of_attributive_forms, which I had forgotten about. I believe that the wording should be uncontroversial, as it merely reflects the result of the vote at Wiktionary:Votes/2019-05/Excluding_self-evident_"attributive_form_of"_definitions_for_hyphenated_compounds, in which case I would be grateful if an administrator would now insert the wording as suggested. Alternatively, if there is something further that I need to do to make this happen, please let me know what it is. Thanks. Mihia (talk) 15:57, 3 July 2021 (UTC)[reply]

@Mihia: Done, sorry for the repeated delay. I'll see about deleting any affected pages. Ultimateria (talk) 01:12, 8 July 2021 (UTC)
@Ultimateria: Thanks very much for doing that. Would it also be possible to make the italicisation of terms in the CFI text the same as in Wiktionary:Beer_parlour/2020/July#Update_CFI_to_reflect_decision_on_treatment_of_attributive_forms? Or the part in quotes could instead be "attributive form of periodic table" if you prefer that. Mihia (talk) 17:58, 8 July 2021 (UTC)[reply]

The 2021 Board of Trustees election opens 4 August 2021. Candidates from the community were asked to submit their candidacy. After a three week long call for candidates, there are 20 candidates for the 2021 election.

The Wikimedia movement has the opportunity to vote for the selection of community-and-affiliate trustees. The Board is expected to select the four most voted candidates to serve as trustees. Voting closes 17 August 2021.

The Wikimedia Foundation Board of Trustees oversees the Wikimedia Foundation's operations. The Board wants to improve their competences and diversity as a team. They have shared the areas of expertise that they are currently missing and hope to cover with new trustees.

How can you get involved? Learn more about candidates. Organize campaign activities. Vote.

Read the full announcement.

Best,

The Elections Committee

Announcement posted by User:Xeno (WMF) at 23:26, 3 July 2021 (UTC)[reply]

Removing Vulgar Latin Pronunciations from Mainspace

(Notifying Fay Freak, Brutal Russian, JohnC5, Benwing2, Lambiam, Mnemosientje): I notice that @The Nicodene has been systematically removing |vul= from {{la-IPA}} from mainspace Latin entries with the edit summary "Removed Proto-Romance because the word is not reconstructed". This strikes me as fundamentally wrong, for at least a few reasons.

  1. To start with, there's a difference between reconstructing a whole term and merely reconstructing a pronunciation. All historical pronunciations that aren't covered by detailed phonetic descriptions are reconstructed. Following the logic of the rationale given to its conclusion would mean removing pronunciations from all dead languages in mainspace.
  2. I'm not really sure exactly what Vulgar Latin is as we cover it on Wiktionary, but I don't see consensus for redefining it to be exactly synonymous with Proto-Romance. Sure, there's overlap, but the methodology and criteria seem different to me.
  3. Which brings up my main concern: a project to systematically change a basic feature of our Latin entries should not be undertaken without consensus of the Latin community. Yes, this has been touched on in the ad nauseam "debates" between The Nicodene and @Brutal Russian, but it got lost in the sea of ad hominems, nitpicking and arguments about who said what when. Besides which, the matter of removing Vulgar Latin pronunciations was never directly addressed. Chuck Entz (talk) 05:26, 4 July 2021 (UTC)[reply]
@Chuck Entz I am glad to see that you are interested in this. Pinging also @Ser be etre shi, who has also expressed an interest.
Vulgar Latin is, as we all know, an unfortunately nebulous term. Before either I or BR ever touched the 'vulgar' pronunciation module, it was being used haphazardly for not only reconstructed Proto-Romance entries, but also for random Latin words attested from Plautus (e.g. eccille) all the way to the fifth century C.E. (e.g. Justinianus) and even well beyond that (e.g. Iraquia, which is Neo-Latin for Iraq).
BR attempted one solution, namely narrowing 'Vulgar Latin' down to a sort of reconstructed Pompeiian pronunciation. I am not opposed to the general idea here, but no sources have yet been provided that support the assigned features. (To date I still do not understand where the idea of a word-initial /b/-/w/ merger came from, for instance.) A properly-sourced version of that would be appropriate for Latin words attested by the late first century, perhaps with a grace period of an additional 50-100 years after that.
As for reconstructed Proto-Romance words, the reconstructed Proto-Romance pronunciation seems most appropriate. Sources for the reconstructed features can be reviewed here and here.
But what to do with attested Latin words from, say, circa 200–600?
I previously proposed using the Proto-Romance pronunciations as the basis for a pronunciation of Late Latin, and indeed most of the features assigned to it are supported by inscriptional evidence (which the scholars reconstructing Proto-Romance of course had in mind as they did their work). The members of the DÉRom project themselves see their reconstruction as a means of discerning what Late Latin was like via the comparative method, rather than some sort of separate and purely hypothetical language.
BR, however, made his contempt for that proposal so clear that I decided, in the end, to withdraw it in an attempt to avert further controversy on that matter. That is also why I began to remove the reconstructed pronunciation from attested Late Latin entries. (I would have removed it anyway from Plautine or Neo-Latin entries, of course.)
Unfortunately I do not think there are any sources out there that attempt to lay out a synchronic phonology of Late Latin at a particular point in time without working from comparative Romance data. (I would be delighted to learn otherwise.)
If anyone has thoughts on ways to fill the 200–600 C.E. gap, please share. The Nicodene (talk) 06:43, 4 July 2021 (UTC)[reply]

The few previous discussions about Vulgar Latin that I'm aware of are collected here. It's clear to me that people are interested in having a non-standard, more phonetically advanced pronunciation in addition to the conservative Classical. A consensus was emerging in these discussions that the Latin of Pompeii and the wider Campania is the best way to give it to people - when Classical-age Vulgar Latin isn't used as synonymous with the speech of Plautus, or in the sense of a diglossic "language that the plebs spoke and the patricians didn't", it's used in this sense. Accordingly, I had replaced the former indeterminable and unrealistic Vulgar transcription with "Campanian" by introducing the most striking features attested in the region to what is otherwise Classical Latin. The name is after Adams (eg. 2007 & 2013): he sees it as having been characterised by at least one "long-standing feature of the speech of this Campanian region, distinguishing Campanian Latin still at this date from the speech of the city" (talking about the monophthongisation of /ae/). This variety basically anticipates many of the Late Latin developments and can be used as a generic, even urban post-Silver Latin pronunciation with minimal modifications. Most of its features have their ultimate origins in the Middle to Late Republican "rustic Latin" of Latium, part of the well-known Roman dichotomy with the other part being "urban"; but that is best represented as Praenestine and/or Faliscan.

I believe all the features I included can be found in any discussion of Pompeiian inscriptions, from Väänänen's Le latin vulgaire des inscriptions pompéiennes and Introduction au latin vulgaire to Wallace's Introduction to Wall Inscriptions from Pompeii and Herculaneum. I'm also referencing Rohlfs' '66 Grammatica storica... The word-initial /b/-/w/ merger came from a combination of inscriptions (first confused in that position in Pompeii) and from the fact that this merger in favour of /v/ is characteristic of modern Southern Italian (Neapolitan), where /b/ only occurs after consonants or when double. A parallel thing happened to its /d-/ (now /ð/ and widely /r/, as in Neapolitan riece (ten), Sicilian reci), and with a lot of variation to /g-/. It might or might not have characterised Campanian Latin at that stage (2nd century AD), and I tend to think there was free variation, so I included it for variety. Apart from that, the transcription is highly conservative, and so I was intending for it to be given by default alongside Classical for all the lemmas because it simply represents a variant pronunciation of the same language that people are interested in seeing. The above user's complaints about no sources are disingenuous and simply serve to retroactively justify unilaterally removing my transcription without any prior discussion or request for sources.

This user above has unilaterally replaced this conservative and discussed transcription of Latin with the controversial transcription of the purely comparatively reconstructed proto-Romance whose status in relation to Latin and whose need to be reconstructed at all are still contentious issues. They have pointedly done so with no prior discussion and in violation of the principles of this website. Moreover, they're "requesting administrative help" to stop me from stopping them from appropriating the module in this manner. They have their fringe ideas and they're here to implement them, and they have an article with a pile of references to gaslight you in case you disagree. It's your ideas that are fringe, you see. In particular, postulating a pan-Romance, supra-regional phonology not situated either in any time or space on the basis of "why couldn't it have existed?" is indefensible. DÉRom uses no attested Latin evidence (what they dub 'le code écrit') whatsoever in its reconstructions as a basic premise (a basic fact that the user above repeatedly denies), and its goal of reconciling proto-Romance and Late Latin is nothing but a goal. Currently their reconciliation consists in the 'Le corrélat du latin écrit..' section, some footnotes and dictionary references at the bottom of most but not all articles.

What I think should be done:

  1. restore the pronunciation of an unambiguously attested, known variety of Latin that I term Campanian that every work on Pompeiian Latin describes;
  2. stop user The Nicodene from unilaterally overwriting the pronunciation module and shamelessly telling people to re-add what they've overwritten in a separate sub-module. Stop them from trying to appropriate parts or the whole of the module by proposing "deals" where other people don't edit their appropriated part and vice versa.
  3. make the Campanian pronunciation default alongside Classical as has been done with Ecclesiastical, and figure out how to add at least a third one (Plautine) in order to complete the range of pronunciation that coexisted in the Late Republic, to reflect speech variation much better than currently, to illustrate the language's development and to give people an ability to choose. People are on the right track when they think that perhaps Plautus, Cicero and Zosimus from Pompeii didn't pronounce things the same. I requested assistance with this here, but so far nobody has responded. This would make the removals of vul=1 irrelevant.
  4. stop user The Nicodene from removing default pronunciations based on date of attestation. It's a basic insight of historical linguistics that date of attestation in the written record is not the date of appearance in speech, and neither is date of last appearance the date of disappearance from speech. This user makes no consideration of language as a complex sociolinguistic system that we have only a very rough outline of in written attestations. They demonstrate this in the fōrmāticus debacle, insisting we base everything on the date of attestation, from pronunciation to morphosyntax (ellipsis and gender) to even how to call the geographic region, freely disregarding the evidence for a much earlier date of the word's appearance.
  5. stop user The Nicodene from trying to conflate Late Latin with proto-Romance. If a separate proto-Romance transcription is to be made, it should be phonemic like with all other reconstructed languages on the website, and in accordance with common sense. It should basically mirror what DÉRom gives. It seems reasonable that it be limited to the reconstructed namespace, as people will already have the "Vulgar" that they want. Ideally pagenames should look like that too, and not as current Latinisations, but this will probably entail making proto-Romance into a whole separate language, with the ensuing problems of having to derive Romance languages both from Latin and proto-Romance simultaneously. This is currently the only way I see to reconcile proto-Romance and Latin.
  6. if and when we introduce an actual Late Latin phonetic transcription for a well-documented time and place, such as 6th century Ravenna, it will necessarily borrow from the fruits of DÉRom's labour. The problem with this is that the language of this particular time and place differed from Classical Latin in numerous other ways and might deserve its own morphology as well, for instance. Which again leads into the above issue. Brutal Russian (talk) 12:22, 4 July 2021 (UTC)[reply]
I want to participate in this discussion but I can't if this is going to degenerate into another ad hominem quarrel between the two of you... I found your medicalization of The Nicodene particularly horrid. However, I do have some questions:
For @Brutal Russian: first, regarding point 4, in terms of showing this Campanian Latin by default, do you see nothing wrong with showing it for words significantly post-dating the 2nd century AD like Iraquia (mentioned above) or tēlephōnum? It would strike me as very odd. (It'd be great if people other than us three (Brutal Russian, The Nicodene, me) could also chime in on this...)
Second (and secondarily), do you think you could manage to cite the exact page(s) in either Väänänen book or Wallace/Rohlfs where initial v- is said to be confused with b-? I just want to confirm The Nicodene is wrong about them not stating such a thing (I can't remember this detail myself).
For @The Nicodene: while I'm aware a few linguists have been doing God's work trying to reconcile attested written Late Latin with pronunciation reconstructions off Romance data, via statistical methods involving misspellings (e.g. Politzer, Adams, Leppänen), plus interpretations of grammarians of the time, is Brutal Russian really wrong about DÉRom largely omitting this in favour of pure reconstructive work?
Besides, I have the feeling that as reasonable as the Proto-Romance reconstruction in your head might be (or not), there are philosophical issues in pushing this reconstruction via Wikipedia articles you have largely written yourself... Call it an unfortunate consequence of too few people seriously reading all that material in question and participating on Wikipedia/Wiktionary, if you will. It would be good to pay particular attention to the scholars' consensus (well, more like schelling point), or to simply stick to the exact representation the DÉRom people prefer (which seems to be solely phonemic as BR said, phonetic allophony being more like territory under current exploration).
I'm well aware that scholars have been working out PRom allophony (just yesterday I was reading Loporcaro's Vowel Length from Latin to Romance (2015), whose "PRom OSL" (the author's term, open-syllable lengthening) is obviously allophonic), and while I understand the desire to represent such things, phonetically in square brackets, I have to admit PRom allophony often leaves me wondering... E.g. I recall one time talking to a PhD student of Romance linguistics and receiving pushback over OSL, partly because it's only some more geographically central languages that show its effects (Old French, Friulian, Dalmatian, ?Tuscan...), being absent from Portuguese, Spanish, Romanian and Sardinian. I couldn't help but agree that it smelled like a later wave sound change... You could say all more peripheral Romance simply happened to drop OSL, but why not say the more central Romance developed it and spread it instead? (Yes, I'm aware Augustine in De Musica alludes to canō having a long ā... add his African Latin dialect to the list if you want.)
The allophony of /ll/ being more retroflex (which you alluded to elsewhere) seems to be on better ground, being more widespread, but here it's you and me making the call, not the scholars at large (in the WP article you cited a paper by Xavier Gouvert after all... it's this kind of thing why I call it territory under exploration).--Ser be être 是talk/stalk 14:17, 4 July 2021 (UTC)[reply]
Agree with Brutal Russian. The gist is point 4. From it all “The Nicodene’s” paralogisms hail. It could be easy: If man believes it was present in vulgar speech then it gets a Vulgar Latin pronunciation. This has nothing to do with whether a term is in the reconstructed or main namespace. There are no “reconstructed Proto-Romance entries” insofar as the entries are imagined Latin. Likely pronunciations get added. But they are for Latin and not some third “Proto-Romance”. You see, Proto-Romance didn’t exist, for our purposes: Whenever it is attested it is Latin, and whenever a reconstruction is situated in reality it is Latin. Fay Freak (talk) 14:27, 4 July 2021 (UTC)[reply]
@Ser be etre shi: Yes, it would strike me as odd, but people have added vul=1 to stranger things showing that there's an interest in learning how to pronounce any word the way "the Roman people" might have done. If for instance a person adopts Campanian as their default pronunciation (and what's to stop them? Look at Luke with his quantitative Ecclesiastical :-), why should we assume they won't try to pronounce modern vocabulary the same way? Reciting Praenestine inscriptions using, say, German Ecclesiastical would strike me as more peculiar, which is why I tend to remove Ecclesiastical from dialectal/variant forms, but in principle even that could be left in. I doubt there's a big audience for this and those who do that probably use that pronunciation for lack of worry about pronunciation and don't need pronunciation tips anyway.
On DÉRom's methodology, specifically as envisioned by Buchi, one can consult this and this in French (sacrebleu) and this in English. Admittedly I haven't read through any of the introductory articles of any of the three dictionary issues so far (I'm afraid I'd overload Google Translate), but I think I know when I see a foundational principle of doing Romance etymology differently, and I've used the dictionary itself and found that it follows the principle as stated, only adducing written Latin correlates at the end to "confirm" their findings. One admittedly questions that it's possible to completely unsee the Latin data even when professing to do so. Awareness of the bias doesn't automatically exclude it.
In addition, nothing of what I've seen tells me they're as blind to sociolinguistics as The Nicodene wants to portray them. In fact shedding better and modern light on Latin sociolinguistics seems to be a fundamental aim of their approach (Conception du projet 5.2.4, first link): "la variation est omniprésente", listing the /phonological/, semantic, morphosyntactic and lexical levels. The [phonetic] level that The Nicodene is trying to make pan-Romance is conspicuously absent, because quite clearly the dictionary doesn't even attempt to reconstruct that level.
On OSL, Spanish and Rumanian diphthongisation make postulating it necessary for these languages as well. Portuguese still has OSL, in Portugal with strong reduction of everything unlengthened. It's currently present in Sardinian, though definitely not Neapolitan/Molise/Lombard-level present (perhaps that's why no diphthongs in Sardinia). But the diphthongs alone will probably necessitate as many transcriptions as there are Romance sub-branches, with vaguely understood waves of developments staggered by vast time spans, close to two millennia in the case of Sardinian. That is even if any agreement at all is possible (I have ideas).
Allophones of /l/ is one case where not just geographic, but even sociolinguistic (Greeks) differences in allophony are explicitly attested, and no sweeping pan-Romance generalisation is possible.
On b/v confusions, it's attested eg. in Pompeian baccvleivs, baliat, bervs, and Väänänen 1981: 61 mentions Romance continuations such as Romanian bătrân, Portuguese bodo, Tuscan boce, not to mention the situation in Sardinia where only a couple of villages maintain the distinction. Also Väänänen 1966: 50. It seems to me the South of Italy was the hotspot for this particular phenomenon, but I can easily drop it, and I did before getting reverted again. Brutal Russian (talk) 16:28, 4 July 2021 (UTC)[reply]
While it is true that the DÉRom’s lexical entries give only phonemic transcriptions–which is true of perhaps most dictionaries–that is simply because the focus of these entries is lexical, not phonological.
The DÉRom does in fact cover allophonic phenomena in great detail in volumes I and II, and that is what is cited on the Wiki page, with supporting citations from other sources like Zampaulo or Ferguson, who also do the same thing.
Moreover the reconstructed (allophonic) features are supported by Late Latin inscriptional evidence, as examined extensively by e.g. Grandgent.
The citation pointing to Xavier Gouvert, regarding the realization of /ll/, is not from a separate paper: that is simply the name of a section of the DÉRom, volume I. Incidentally I did not implement a retroflex realization for /ll/ on the module because the other sources that I consulted about this, e.g. Zampaulo, do not seem to agree with Xavier Gouvert's attribution of the retroflex realization to Proto-Romance, suggesting that it is not widely accepted. I have made a point of refraining from adding features that multiple reliable sources do not agree upon.
Variation certainly is present in any sufficiently widespread language, and of course the DÉRom acknowledges that, and many of its lexical entries or 'mini-articles' do, in fact, provide more than one form for this very reason. That does not prevent modern scholarship from reconstructing a general Proto-Romance phonology, complete with allophonic features, as they have done.
The matter of diphthongization in Romance is certainly tricky, and Ferguson (1976: §7), at least, posits a sort of metaphonically-conditioned proto-diphthongization of stressed (and lengthened) lax e and o, for Proto-Romance. He sees the resulting *[eɛ] and *[oɔ] as underlying all future Romance developments, even including Sardinian (thereby explaining metaphonic raising to [e] and [o] in the tonic vowels of words like tempus or oru).
––––––––––––––
Looking through Väänänen (1966), I do see that the source supports at least some of the features that you assigned. Still, I would like you to document citations (including page and section numbers) for all of the features that you have assigned. This is not too much to ask for a major module.
I have reservations about making such a 'Pompeiian' pronunciation apply to all Latin entries, since that would include words attested 1,000+ years after Pompeii was buried by ash. The fact that some modern speaker could, conceivably, use such a pronunciation privately does not mean it is well-established enough to make it default on Wiktionary.
P.S.: Re chronology.
Just as there is no a priori reason to assign, on Wiktionary, an Ecclesiastical pronunciation to unattested forms reconstructed from Romance data, there is also no reason to assign a Classical one either.
An exception can be made if at least one scholar in the field has stated that the form in question existed, or at least likely existed, in Classical times but was simply never written. The Nicodene (talk) 06:30, 5 July 2021 (UTC)[reply]
@Fay Freak I have never thought that Proto-Romance reflects a language distinct from Latin. I have rather argued–both on this page and elsewhere–that it does not. (The supposed distinction is upheld by the same person that you say you agree with.) When I say “reconstructed Proto-Romance entries”, understand that as shorthand for “entries for unattested Latin words that have been reconstructed from comparative Romance data”.
The problem here is that there is no single "Vulgar Latin pronunciation” that Wiktionary could possibly give: the term is wildly imprecise, referring to anything non-literary in approximately the period 200 B.C.E. to 600 C.E., with some applying the term to even later centuries than that. Needless to say, sound changes were operative in Latin–as in any living language–throughout the period. The Nicodene (talk) 06:33, 8 July 2021 (UTC)[reply]

@Brutal Russian: I have taken the time to work out a fair compromise. Please consider the following proposal:

There should be two sub-modules, one labelled ‘Proto-Romance’, the other labelled ‘Pompeiian’ or ‘Campanian, late first century C.E.’ or similar. Neither should be labelled as simply 'Vulgar Latin', an imprecise term with an infamously wide range of possible meanings.

  • The Proto-Romance one, complete with well-cited allophonic features, is to be used for unattested words reconstructed from Romance data.
    • I have already agreed, per your request, not to apply it to attested Late Latin words.
  • The Pompeiian one is to be used for Latin words that are either attested by 300 C.E. or claimed by at least one source to have probably existed, unwritten, by about the time of Pompeii.
    • That provides a ‘grace period’ of more than two hundred years after Pompeii. There needs to be some cut-off, and this seems reasonable. (I do not think anyone wants to see Wiktionary apply a Pompeiian pronunciation to a word attested from 1180 to 1246 in the southwestern corner of Poland.)

If we can agree on that, it is now simply a matter of asking someone knowledgeable about Lua to make two such sub-modules. Neither needs to even use the code ‘vul’: they can simply be ‘pr’ and ‘pom’ or similar.

A compromise will allow everyone to move on with their work of improving Wiktionary. The Nicodene (talk) 21:27, 5 July 2021 (UTC)[reply]
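The selection rule in the proposal above can be sketched roughly as follows. This is a minimal Python illustration, not module code (an actual Wiktionary module would be written in Lua): the function name, its parameters and the convention of giving years as integers C.E. are assumptions made for the example. Only the codes 'pr' and 'pom' and the 300 C.E. cut-off come from the proposal itself.

```python
from typing import List, Optional


def applicable_submodules(
    first_attested: Optional[int] = None,
    reconstructed_from_romance: bool = False,
    claimed_by_pompeii_era: bool = False,
    cutoff_year: int = 300,
) -> List[str]:
    """Return the sub-module codes that the proposal would allow for a word.

    first_attested: year of first attestation (C.E.), or None if unattested.
    All parameter names are hypothetical; they are not existing template
    or module parameters.
    """
    codes = []
    # Proto-Romance: only for unattested words reconstructed from Romance data.
    if first_attested is None and reconstructed_from_romance:
        codes.append("pr")
    # Pompeiian: attested by the cut-off, or claimed by at least one source
    # to have existed (unwritten) by about the time of Pompeii.
    attested_early = first_attested is not None and first_attested <= cutoff_year
    if attested_early or claimed_by_pompeii_era:
        codes.append("pom")
    return codes


print(applicable_submodules(first_attested=79))
print(applicable_submodules(reconstructed_from_romance=True))
print(applicable_submodules(first_attested=1180))
```

A word first attested in 1180 gets neither code, which is the point of the cut-off. The proposal does not say what should happen when both conditions hold (an unattested reconstruction that a source also dates to the first century), so the sketch simply returns both codes.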

@The Nicodene: One obvious reason to give Classical for reconstructed entries is to illustrate phonological evolution. Another is that Classical is the standard pronunciation for all lemmas. A third reason is that there's no way to disprove that this pronunciation was ever used for these items, since the date of appearance is unknown and so are the dates at which individual phonological features disappeared entirely. Given the starting point and the ending point, a person can make an educated guess at the intermediate points, and the vast majority of people cannot triangulate it without explicit transcriptions.
DÉRom has some discussions of allophony, but their goal is to arrive at phonemic reconstruction. The question is: given the starting point (proto-Italic) and a range of Romance outcomes, is there any continuity and what would be the optimal phonemic representation? No claim is made as to what combinations of these allophones existed in what speech varieties. The closest thing to such a discussion that I've seen is found at the end of Gouvert 2014, p. 48. A belief in the necessity to settle on a universal proto-Romance narrow phonetic transcription to me is inexplicable even from the point of view of the philosophy of language and historical linguistics, and doubly so in that it repeats the mistakes of Vulgar Latin ("that's how all the plebs in the Empire spoke as opposed to all the patricians"). No valid reason has so far been presented in its favour.
Here's a word from 14th century Poland where a Classical pronunciation can be argued to be inappropriate: granicia. It can also be argued to be appropriate as a guide for people who want to incorporate the word in their speech anyway - I've participated in more than one discussion on "how would the Latins adapt /t͡ʃ/ in words like chilēnsis or Czechia." The only possible Classical-age native adaptation seems to have been /s/, and later perhaps /t͡s/ judging by spellings with zo- for t(h)eo-, probably as a Greek-borrowed (bilingual) phoneme.
Campanian Latin didn't disappear with the destruction of Pompeii; it's continued by the modern dialects of Campania and southern Lazio. Given that it anticipates many Late Latin developments, it will be an appropriate transcription at least to the end of antiquity, certainly where Classical is appropriate. The same considerations about tracing phonetic evolution apply. I don't know of a language on this website where such fuss exists over pronunciations. I haven't observed anything of the sort with Ancient Greek. I don't see what the practical point is. We don't even have consistent attestation dates/authors, a much more pressing problem. If there already exists a similar practice with some other ancient language, we can see about borrowing from it. Otherwise it's an exercise in whose arbitrary position wins.
There are people willing to engage with my thoughts, know what I know, how I know it and how I interpret it, to offer criticisms for both to consider. For these people I'm ready to write pages of text and provide the links and references they need. References typically require interpretation, and I will happily discuss the interpretation of these references with those able to discuss it. What I'm not ready to do is to engage with those who, for being unable to engage with my thoughts, instead engage in reference warring; who, due to their own lack of information, project onto me their ignorance of what knowledge and interpretations I based an edit on, and assume that I based it on no knowledge, no references and/or on an inability to correctly interpret them; and that gives them justification to discard my edits as trash and edit war me. I will not enable the practice of taking a module hostage by paying ransom in references.
Those interested will consult the references provided and discuss what features they think deserve to be incorporated in the transcription whose aim is to portray sociolinguistic variation by stereotypising a local variety and contrasting it to the standard. In short, Classical and Campanian are both reconstructions; in addition to being individually valid, the pairing can be used to represent two poles of a continuum that likely was both diatopic (Campania-Rome) and diaphasic. It's true that Classical has a special status of also supposedly being taught, giving it a status beyond just "reconstructed", but what's being taught in reality is often widely different. I would argue that Campanian has the same validity as a classical-age reconstructed pronunciation as what we currently have and if it was taught precisely, it would have been miles closer to the target. In short I don't believe there's any significant epistemological difference between the two. Brutal Russian (talk) 21:16, 7 July 2021 (UTC)[reply]
@Brutal Russian: While it is true that one cannot prove a negative, there does need to be evidence (or at the very least one scholar's speculation) that shows that the form in question existed in Classical Latin for there to be a chronological or historical justification for assigning it such a pronunciation on Wiktionary. The starting point for such words is Proto-Romance, not Classical Latin, unless demonstrated otherwise. Moreover, while providing a dubious Classical pronunciation could be useful for illustrating diachronic sound changes, the same reasoning would justify assigning Proto-Romance pronunciations for any Classical or Late Latin word that survived into Romance, which you are clearly against. (Although, if you change your mind, that could actually be a fruitful avenue of discussion.)
Nowhere in the DÉRom does it say that it is necessary to limit oneself to the phonemic level. The very first volume of the work contains, in part two, an extensive section titled ‘Reconstruction Phonologique’, which lays out a reconstructed phonology of Proto-Romance, complete with allophony. (The last part is particularly interesting.) If you would like more information on why they undertook this project, you are free to read it in their own words. The fact remains, however, that they did, and other sources as well reconstruct allophonic Proto-Romance features.
We are not discussing whether a Classical pronunciation is appropriate, on Wiktionary, for a fourteenth-century Polish word, but rather whether a reconstructed Pompeiian one is. A revived Classical pronunciation is in widespread use among modern Latinists; a revived Pompeiian one is not.
It seems that you agree with the principle of a cut-off, but not where exactly it should be. That, however, can be worked out in time. Wiktionary does not, by the way, assign reconstructed ancient Greek pronunciations to non-ancient words, as far as I can tell.
Per Wikipedia:BURDEN it is your responsibility to provide citations for Pompeiian Latin, no matter who asks you to.
Should a revived Pompeiian pronunciation come into widespread use among Latinists one day, we can revisit the topic of making it a default pronunciation for all periods. The Nicodene (talk) 22:30, 7 July 2021 (UTC)[reply]

@Brutal Russian: I have now read through the sources that you mentioned.

It occurs to me that it would be a good idea to lay out here, in detail, the issues I have with the Pompeiian module. They can be summarized as a series of questions:

1) Why is there a complete merger of the phonemes /b/ and /w/ in every environment?

Väänänen’s own conclusion (1966: 128) is that, in Pompeii, “the cases of b-u confusion are very few and doubtful”. He does mention the ‘sandhi theory’ on page 52: “At the beginning of a word, b […] would have remained an occlusive when preceded by a consonant […] at the beginning of a word, the confusion of b and [w] was doubtlessly decisive when a vowel preceded”.
Parodi, cited elsewhere by Väänänen, mentions that as well on page 195: “the general tendency was to consistently continue the route that word-internal b had followed, between vowels, and to reduce it to v even when it was initial, if a vowel preceded it.”
Lloyd (1987: 239) mentions that as well in From Latin to Spanish: “If we turn back to the situation in Late Latin when variation was still a living process, we can see that the initial /b-/ would have had two realizations: [b-] after pause or consonants, and [β] after words ending in a vowel [...]"
Hualde (2011: 2228–2229), in The Blackwell Companion to Phonology, vol. 1, provides a detailed summary: “Again, the merger would be expected to affect word-initial B- and V- when intervocalic; that is, ILLA BUCCA, for instance, should have undergone lenition to [ilːaβukːa] […] At the last stage represented in (11), the phonemes /b/ and /v/ are in contrast in word-initial position only if not intervocalic (e.g. after pause or consonant) […] The frequent cases of confusion between initial b- and v- […] provide quite strong evidence for the hypothesis that the phonemic contrast was indeed analogically re-established in word-initial postvocalic position, after a period where the two phonemes were contextually neutralized, as proposed by Weinrich (1958).”
In other words, word-initial /b/ tended to become a fricative when preceded by a vowel, but this contextual (i.e. limited to a specific environment) merger was later reversed, in many regions, by substituting the occlusive [b], which would have survived all along in initial position when not preceded by a vowel.
I would, for the purposes of the Pompeiian module, limit word-initial [b] > [β] to cases where there is a preceding word ending in a vowel. Even so, note that there is not a single example of ⟨u⟩ for initial [b] in Pompeii, whatever the preceding sound.
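To make the proposed restriction concrete, here is a minimal Python sketch of the sandhi rule: word-initial b becomes [β] only when the preceding word ends in a vowel. It is an illustration under stated assumptions, not the module's code: the function name and the vowel inventory are hypothetical, input is assumed to be a list of lowercase words with macrons for long vowels, and word-internal lenition and final -m are not handled.

```python
# Vowel letters assumed for this sketch (short and long).
VOWELS = set("aeiouyāēīōūȳ")


def lenite_initial_b(words):
    """Turn word-initial b into β only after a word ending in a vowel."""
    result = []
    previous = ""
    for word in words:
        after_vowel = bool(previous) and previous[-1] in VOWELS
        if word.startswith("b") and after_vowel:
            result.append("β" + word[1:])
        else:
            # After a pause (no previous word) or a consonant, b stays [b].
            result.append(word)
        previous = word
    return result


print(lenite_initial_b(["illa", "bucca"]))
print(lenite_initial_b(["in", "bucca"]))
```

With Hualde's ILLA BUCCA as input the second word is lenited, while after a pause or a consonant-final word it is left alone.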

2) Why is short /i/ rendered as [e] in all environments?

Adams (2013: 60) shows that the phenomenon only occurs with regularity in word-final atonic syllables, generally verb endings.
Väänänen (1966: 128) says that “e for i only appears to a notable degree in the endings -is, -it > -es, -et”. Out of the only three stressed examples where the readings are not doubtful, he immediately casts doubt on two of them as explainable via assimilation to the e of a following syllable (p. 21).
Wallace (2005: xxvii) says only that: “in graffiti the short vowel e was used with considerable frequency for original short i in word-final syllables”, which he follows by providing numerous examples of verbs. Notice that the section is titled Short i in Final Syllables.
Clarkson, on p. 7 of the Oxford Guide to the Romance Languages, cautions that “there are hardly any good examples" of the phenomenon in stressed syllables and, on the next page says: “The conclusion from these spellings would appear to be that, in the first century AD, the confusion between ē and i was not yet at a stage where it permeated into speakers' writing habits (so also Eska 1987; Adams 2013:58f) […] Vowels in final syllables of polysyllabic words (which were never under the stress accent in speech) do show alternations in writing between e and i, but here the merger affects short e and short i, and need not be related to the changes which will later affect Romance vowels.”
I would, for the purposes of the module, limit the phenomenon to final unstressed syllables. Note also Clarkson's comment that this is a contextual merger of short i and short e, which your module distinguished.
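The limitation suggested here can be sketched in a few lines of Python. This is only an illustration: the function name, the crude nucleus-counting regex and the assumption that long vowels carry macrons are all mine, not taken from the module. It relies on the point in the Clarkson quotation that final syllables of polysyllabic words are never stressed, so "final syllable of a word with at least two vowel nuclei" stands in for "final unstressed syllable".

```python
import re

# One syllable nucleus: a diphthong or a single vowel (rough approximation).
NUCLEUS = re.compile(r"ae|au|oe|[aeiouyāēīōūȳ]")


def lower_final_short_i(word):
    """Change short i to e, but only in the final syllable of a polysyllable."""
    # Mask the u of "qu" so that it is not counted as a vowel; the mask keeps
    # the string length unchanged, so match positions still fit the original.
    masked = word.replace("qu", "q_")
    nuclei = list(NUCLEUS.finditer(masked))
    if len(nuclei) < 2:
        return word  # monosyllables are left alone
    last = nuclei[-1]
    if last.group() != "i":
        return word  # long ī and all other vowels are left alone
    return word[: last.start()] + "e" + word[last.end():]


for example in ["dīcit", "omnis", "quis", "fīlī"]:
    print(example, "->", lower_final_short_i(example))
```

Only the endings of the -is, -it type are affected; stressed short i, as in the first syllable of a word like bibit, is deliberately left unchanged.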

3) What source claims that intervocalic /d/ and /g/ were fricatives in Pompeii and not in Classical Latin?

4) Why is there not, for instance, simplification of geminates after long vowels, or numerous other features mentioned by Väänänen?

I would also like to ask why you claim that the formation of the glides [w] and [j] is “syncope”, but that did not affect the module, at least. The Nicodene (talk) 21:57, 8 July 2021 (UTC)[reply]


@Brutal Russian: Notice how Wiktionary deliberately avoids assigning a 5th century B.C. Attic pronunciation to these terms, which entered Greek in later eras:

δούξ, Κωνστᾰντῑνούπολῐς, Σκλάβος, βίρρος, φραγέλλιον, ἱεράρχης, πάσχα, πίτα
παροικία, Τοῦρκος, κυριακή, σάββατον, στήκω, Ἀλεξανδρέττα, σεβαστοκράτωρ

Periodization matters. The Nicodene (talk) 21:51, 9 July 2021 (UTC)[reply]

Concerning Kuching Hakka

[edit]

@RcAlex36, Justinrleung Kuching Hakka, Kuching (Hopoh) Hakka, or Kuching (Hepo) Hakka is descended from the Hakka variant as used in Hepo, Jiexi, Jieyang, Guangdong, China, and therefore is more similar to Cantonese than the Hakka variants as used in Taiwan, in terms of written characters. I believe that the Taiwanese Hakka standard should not be used to write the Kuching Hakka variant, as some words have different etymology. Wiikipedian (talk) 10:10, 4 July 2021 (UTC)[reply]

@Wiikipedian: The word is probably 到 even though it's in 上聲 instead of the expected 去聲. 《客家社會生活對話》中「到₃」、「到₅」功能的重疊及其與臺灣客語的比較 writes it as 到₃. I would like to hear Justinrleung's opinion on this. RcAlex36 (talk) 13:29, 4 July 2021 (UTC)[reply]
@Wiikipedian, RcAlex36: In general, if the Cantonese and Hakka words are etymologically the same, we should write them in the same way. There are certain cases where this is difficult, like for the word for tired, where most Hakka sources use 𤸁 but 攰/癐 is used in Cantonese. (This example is kind of irrelevant for Kuching (Hepo) Hakka since another word 𤺪 is used.) We usually follow the Taiwanese standard if the 本字 is not clear and various sources do not agree. We often also consult sources from Guangdong, Fujian and Jiangxi. The written tradition for Hepo Hakka may be similar to Cantonese, but Hepo Hakka should be generally more similar to other varieties of Hakka rather than Cantonese. Most varieties of Hakka spoken in Taiwan come from Guangdong as well, so I don't see why there would be a problem with following the Taiwanese standard. If we're talking about 著 (tó) in the Taiwanese standard specifically, I would say we should not follow it because it is not quite the etymological character and varieties that would have the /au/ vowel for this series use /au/ (like Meixian, for example). I would support writing it as 到. — justin(r)leung (t...) | c=› } 20:09, 6 July 2021 (UTC)[reply]

Hebrew transliteration - time to clear the mess.

[edit]

Hello everybody!

(@Malku H₂n̥rés, Metaknowledge, Fenakhay, Thadh, Lingo Bingo Dingo, Erutuon, Gnosandes, Pinnerup, Fay Freak, Profes.I.)

As I'm sure many of you did already, I've noticed the lack of standardisation when it comes to Hebrew transliterations. Put simply: It's a mess. All other Semitic languages seem to deal with it in a much neater way. I've generally noticed the following scenarios:

  1. Simplified/Modern Hebrew transcription only: אֵל, בַּיִת (see derived terms, too), קַרְקַע, אָחוֹת, etc.
  2. Accurate transliteration only: ראש (see transliteration of derived terms), *ḳarḳar-, etc.
  3. Both of the above in random order: עָפָר, שיבולת, *halak-, *śamš- (note how the order in which they appear is also irregular), etc.
  4. Mistakes of any sort: עֹשֶׁר (ע neither transliterated nor transcribed), *ʕaśar- (ע transliterated as /ʔ/ instead of /'/), פַּרְעוֹשׁ (missing accent), etc.

Some of us have recently been discussing this on Discord, listing different opinions, potential problems and possible solutions. One thing pretty much all of us seemed to agree on was that something needs to be done about it. I'm aware that there have been similar discussions in the past, and also that an attempt to automatise the process through a transliteration template was made (see Module:he-translit). I think it's time to resume the discussion and make some decisions. Just to kick off the discussion, here is what I would like to see on a Hebrew entry:

This seems to me the most Wiktionary-like solution, using the parameter |tr= for an actual transliteration and giving a Modern Hebrew transcription via |ts=. For the indication of the accent when not on the final syllable (that might be necessary to automatise the transliteration/transcription template, from what I understood), I would use the oleh ( ֫ ), as it is apparently already widely used in textbooks and so easily recognisable by many already. Let me know what you think, issues you might see, alternative solutions. Thank you! Sartma (talk) 14:55, 4 July 2021 (UTC)[reply]

@Sartma: Hello. I don't know why I was mentioned. I will only support that you really need to use the sign oleh ( ֫ ). For this sign is indeed found in publications on Hebrew grammar. That's all. Gnosandes (talk) 15:05, 4 July 2021 (UTC)[reply]
I only recognize the oleh from a Christian who was interested in Biblical Hebrew using it and me saying, "What is that?" and looking it up and finding out it was a cantillation mark from three books of the Tanakh that was otherwise unused, and that it was not relevant to the example of Biblical Hebrew that they were presenting, so I don't support it; I don't think it's as universally recognized as you're claiming it is, and in fact it has a very separate usage traditionally that is not compatible with its usage as a stress marker. פֿינצטערניש (Fintsternish), she/her (talk) 18:36, 5 July 2021 (UTC)[reply]
No Modern Hebrew transcription—which would also use |ts= in an unseen manner—is needed as the scholarly transcription already includes everything. That link to WT:HE TR beside each transcription at a header could just tell manaman in an additional column how the IPA values are (it seems that to write IPA there one was too inert like one was too inert to use full characters instead of this cripple-keyboard transcription); additionally I arread that automatic generation of the pronunciations as on עַזָּה (ʿazzā́) is also possible. In a few cases though I have seen Modern Israeli Hebrew being claimed to have retracted stress against the other pronunciations, as on עַזָּה (ʿazzā́). Fay Freak (talk) 15:23, 4 July 2021 (UTC)[reply]
Comment: To what extent is the Modern Hebrew transcription not deducible from the Biblical Hebrew transliteration? Because looking at the examples given above, it seems they are (with the exception of bb > b, but as I understand it, modern Hebrew doesn't have gemination): p̄ = f, ḵ = kh, ṣ = ts, ḇ = v, ṯ = t, š = sh, ʿ = ', long vowels are ignored in Modern Hebrew. If the modern variants are just respellings of the Biblical ones, why even bother giving them? Thadh (talk) 16:26, 4 July 2021 (UTC)[reply]
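The correspondences listed in this comment can be tried out with a short Python sketch. It is a rough illustration only: the function name is hypothetical, it covers just the seven consonant correspondences named above plus degemination and the dropping of vowel length, and it is not how Module:he-translit works. Stress is copied over unchanged, so cases where Modern Hebrew stresses a different syllable are not captured.

```python
import re
import unicodedata

# Consonant correspondences as listed in the comment (scholarly -> modern).
CONSONANTS = {
    "p\u0304": "f",   # p̄ (p + combining macron)
    "\u1e35": "kh",   # ḵ
    "\u1e63": "ts",   # ṣ
    "\u1e07": "v",    # ḇ
    "\u1e6f": "t",    # ṯ
    "\u0161": "sh",   # š
    "\u02bf": "'",    # ʿ
}


def modern_from_scholarly(translit):
    """Derive a simplified Modern-style respelling from a scholarly one."""
    text = unicodedata.normalize("NFC", translit)
    # Degeminate: collapse a doubled consonant letter (bb > b).
    text = re.sub(r"([^\Waeiou])\1", r"\1", text)
    for scholarly, modern in CONSONANTS.items():
        text = text.replace(scholarly, modern)
    # Ignore vowel length: decompose, drop the remaining macrons, recompose.
    text = unicodedata.normalize("NFD", text).replace("\u0304", "")
    return unicodedata.normalize("NFC", text)


print(modern_from_scholarly("\u02bf\u0101p\u0304\u0101\u0301r"))  # ʿāp̄ā́r
print(modern_from_scholarly("b\u0113\u1e6f"))                      # bēṯ
```

The consonants are replaced before the macrons are stripped so that the macron of p̄ is consumed as part of the consonant rather than mistaken for vowel length.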
One could argue that distinguishing e.g. long vowels from short vowels or w and b in transcriptions for words only used in Modern Hebrew (or other stages of Hebrew in which certain mergers took place) is awkward. Overall I strongly prefer a shift to a more scholarly system, though. ←₰-→ Lingo Bingo Dingo (talk) 16:52, 4 July 2021 (UTC)[reply]
@Ruakh, Enoshd, Mnemosientje, פֿינצטערניש I hope we can get some input from people who are competent in Modern Hebrew. ←₰-→ Lingo Bingo Dingo (talk) 16:52, 4 July 2021 (UTC)[reply]
No time rn for a long writeup but for starters one thing that is often different is the stress; Fay Freak has already noted the example of עַזָּה, which is pronounced áza in Modern Hebrew with initial stress. It's a bit of an annoying situation; MH transliteration is inadequate for readers interested in Biblical Hebrew, but scholarly transliterations will be confusing as hell to readers interested in Modern Hebrew. For example, there is no difference in pronunciation between kuf and kaf in MH, idem between khet and khaf, vet and vav, tet and tav, etc. and indeed even alef and ayin (except for Mizrahi speakers), but scholarly transliterations show differences nonetheless (e.g. q for kuf and k for kaf, which looks weird to me as a speaker of MH). In transliterations seen in daily life in Israel, q and ḥ and so forth are basically not used as a result, not to mention stuff like ṣ. While the spelling is the same, so to speak, the pronunciations of BH and MH are about as different as those of ancient and modern Greek.
Bottom line is Biblical transliteration, with geminates and long vowels and all these specific characters, will throw off most modern Hebrew speakers and learners, but modern Hebrew transliteration leaves out important info for Biblical Hebrew learners/people interested in comparative linguistics. I think a good compromise would be to include a scholarly transliteration optionally, and have the modern transliteration as a default (i.e. to have the option of two transliterations), so as to not have Biblical transliterations on MH words which would look woefully out of place. So I guess I mostly agree with Sartma's proposal. (Unfortunately the Hebrew headword line is already cluttered AF due to status constructus forms etc. - adding a second translit would certainly not help declutter that situation..) — Mnemosientje (t · c) 18:37, 4 July 2021 (UTC)[reply]
Worth noting that the Modern Hebrew translit is the de facto default atm: Wiktionary:About Hebrew#Romanizations. — Mnemosientje (t · c) 18:54, 4 July 2021 (UTC)[reply]
@Mnemosientje: I understand that if we only choose Biblical transliteration we risk alienating Modern Hebrew speakers/learners, but it's also true that native speakers do write different letters for kuf and kaf, khet and khaf, and so on, and even in Modern Hebrew if you want to vocalise a word you would use Biblical niqqūds, so part of me thinks that it's just a question of habit: when Romanising Modern Hebrew the norm is to use the "simplified" version, so a proper transliteration would look weird to a native speaker. On the other hand, the simplified version is completely unsatisfactory to whoever studies Biblical Hebrew. Another idea that was brought up on Discord was having independent entries for Modern Hebrew and Biblical Hebrew, the same way that Arabic is separated from its national "spoken" varieties. Could that be an option? Sartma (talk) 22:07, 4 July 2021 (UTC)[reply]
Splitting Hebrews would complicate things, and doing this for transcription would be out of proportion.
I do assess though that in so far as one writes different or the same letters, thus one does not confuse Modern Hebrew learners if our transcriptions reflect them. On the contrary, maybe mingling qūp̄ and kāp̄ does not do them justice and promotes spelling mistakes. But as a middle ground you can use the Latin letter ⟨k⟩ with a dot below, ⟨ḳ⟩, which we use anyway for Ethio-Semitic, and Proto-Semitic: *ḳarib- – because @Rhemmiel was a fan of it and I didn’t care one way or the other.
So it is also commendable to transcribe כ as either ⟨k⟩ or ⟨ḵ⟩ because it reflects well that it is the same letter. The same goes for all other begedkefet letters. Scholarly transcription is straightforward.
I don’t see any argument why we shouldn’t transcribe שׁ with ⟨š⟩. The háček is well-known. שׂ can be ⟨ś⟩ because it’s rare anyway and if anyone doesn’t understand it he correctly reads it without the acute just like ס with ⟨s⟩, which is also like in modern Ethio-Semitic languages (the descendants of *šurš- all read /s/ for conservatively spelt (śä)). Rare ז׳ runs as ⟨ž⟩ well of course.
צ in turn should not be transcribed as ⟨ts⟩ because it does not count as two consonants but one root consonant so let’s stick to ⟨ṣ⟩, nothing fancy like ⟨c⟩.
And so I got rid of all the digraphs already, in addition to shewing why the macron letters are actually easier. I suspect this “chat romanization” is not actually there to make it easier for readers but for editors, who should know how to type in characters with diacritics – that can be seen by the avoidance of the half rings ʾ ʿ which Modern Hebrew learners can’t take issue with as they take only little notice. Daily life in Israel is hardly the yardstick of what is desired, otherwise we would use Arabic chat alphabet because it is daily life in Egypt – very ugly.
But we can tone it down with the vowels a bit. For Šwā there are less intrusive alternatives, such as U+1D4A MODIFIER LETTER SMALL SCHWA ⟨ᵊ⟩ and U+1D4A MODIFIER LETTER SMALL SCHWA ⟨ᵊ⟩. Macra over ⟨u⟩ and ⟨o⟩ and ⟨i⟩ seem to be of little significance for Hebrew, though theoretically-comparativistically they distinguish root patterns. I have to note that the transcriptions are not made for the purpose of being reverse-engineerable to the Hebrew script: Like in Ottoman Turkish we don’t transliterate the Arabic alphabet by circumflexes because we have the original spelling right next so we employ |tr= to approximate how one actually understood the sounds. Fay Freak (talk) 02:31, 5 July 2021 (UTC)[reply]
@Fay Freak: I guess that once we have a template, it will be easy to decide what characters to use, and even change them in case preferences change. To be honest, I would prefer using ⟨q⟩ for ק, mainly because pretty much all my Hebrew textbooks (even the Assimil one for Modern Hebrew) transliterate like that, but I'm not against ⟨ḳ⟩ either if that's what the majority wants. As long as we have a consistent, standard transcription, I'll be happy! Sartma (talk) 09:31, 5 July 2021 (UTC)[reply]
I agree 100% with @Sartma — scholarly and Modern should both be included, as they are separate things that a reader might be interested in when looking up Hebrew entries. Modern Hebrew transcription is very easy to read for a student of Modern Hebrew, but the scholarly transcription is necessary for people with an interest in Ancient Hebrew and/or the historical development of Hebrew reading traditions. It makes the most sense also to put these under tr and ts parameters, but the main hangup is that a random new editor coming along to edit an entry might not know the system. פֿינצטערניש (Fintsternish), she/her (talk) 10:22, 5 July 2021 (UTC)[reply]
@פֿינצטערניש: The idea is that this will all be automatised, so the only thing an editor should know is how to spell the Hebrew correctly with niqqūds, and the transliteration + transcription would appear automatically. Sartma (talk) 11:16, 5 July 2021 (UTC)[reply]
Even better. The only thing is that there are cases where spoken Hebrew diverges from what is expected based on the Nikkud, and it should be possible to manually input when that is the case. פֿינצטערניש (Fintsternish), she/her (talk) 13:22, 5 July 2021 (UTC)[reply]
Just because it bothers me, I will repeat what I said on Discord. Neither the scholarly Biblical Hebrew transcription nor the Modern Hebrew transcription that we currently use is a strict transliteration, even ignoring stress. Neither has a one-to-one relationship between the Hebrew graphemes and the letters of the romanization, and neither is completely reversible. For instance qamats has two transcriptions, ā and o, and matres lectionis aren't transcribed. (For instance bēṯ could be בֵּית or בֵּת or בֵּאת.) Thus Module:he-translit was hard to write. (User:Wikitiki89 figured out lots of the edge cases that I had given up on.) The Biblical Hebrew transcription is closer to a transliteration given that it distinguishes more Hebrew graphemes (fricative ב and consonantal ו for instance), but it's not there. Thus strictly speaking it's a kludge to put the scholarly transcription in |tr= and the modern transcription in |ts=. I'm not opposed to the idea, because there isn't a better simple way to do it, just being picky about terminology because we were being picky on Discord. We could try to make an actual transliteration, but that would be unpleasant to read and not very useful for newbies.
There is only other option I could think of, but very difficult and it's perhaps impossible to agree on the details: adding a second transliteration parameter (called who knows what, |tr2=?) and putting the Modern Hebrew transcription in that. Perhaps this could be made to also allow a proper way to display Japanese kana and romaji (currently I think they're shoved in |tr= with a comma). But even if we could agree on doing this, it involves changes to Module:links, which is used by almost everything, so it would be painful. The |tr= and |ts= for Biblical and Modern is the easiest option that we have, even though it's not technically correct. — Eru·tuon 19:35, 5 July 2021 (UTC)[reply]
@Erutuon: We can already use |tr2= within {{head}}: עָפָר or July (ʿāp̄ā́r or 'afár). I wouldn't mind this layout either, it's honest, in a way. Sartma (talk) 00:02, 7 July 2021 (UTC)[reply]
@Sartma: Right, just ignore |tr2=; I'm not seriously proposing that as a name because it would conflict with existing parameters in commonly used templates, like {{affix}}. The real parameter name would have to be something else. — Eru·tuon 01:00, 7 July 2021 (UTC)[reply]
@Erutuon: I think בֵּית or בֵּת or בֵּאת should be transliterated as bêt̠, bēt̠ and bēʾt̠ respectively. That's how they appear in textbooks where a proper transliteration is used. Sartma (talk) 08:50, 7 July 2021 (UTC)[reply]
@Sartma: Well, the exact details of the transcription aren't relevant to my point, but perhaps see Module talk:he-translit § Vowel distinctions for why WT:HE TR and Module:he-translit don't indicate matres lectionis in this way. Wiktionary doesn't have to do exactly what textbooks do. For my part, the circumflexed vowels and silent letters tend to just confuse me about what the phonemes actually are. — Eru·tuon 21:47, 7 July 2021 (UTC)[reply]
@Sartma, Erutuon: I learnt with a system where circumflex indicates that the vowel doesn't get shortened because of shifts in accent. However, it would be good to indicate what the matres lectionis are in the transliteration, e.g. by superscript letters. --RichardW57 (talk) 22:46, 7 July 2021 (UTC)[reply]
@RichardW57: I think it's the same system. In general, vowels written with matres lectionis don't change when the accent shifts. If we are going for a "transliteration" then it must have a sign-to-letter correspondence (trans-literation literally means "turning one sign/letter into another sign/letter"). This means that somehow matres lectionis need to show in the transliteration. Sartma (talk) 13:37, 28 July 2021 (UTC)[reply]
@RichardW57, Sartma: Superscript letters sound a bit clearer to me than circumflexes. There are superscript characters for w, y, h, but I don't know of a superscript ʾ (and it might be hard to distinguish from a regular one). We could use HTML tags, but then the transliteration wouldn't copy and paste cleanly as plain text. I think my book at least sometimes uses parentheses around silent matres lectionis, which has its own problems stylistically because transliterations are usually inside parentheses to start with. — Eru·tuon 21:00, 13 July 2021 (UTC)[reply]
@Erutuon: I would still want to go for one of the many already existing transliteration formats, unless there is a clear advantage in using something different. Sartma (talk) 13:37, 28 July 2021 (UTC)[reply]
@Erutuon, Sartma: There's a whole range of letters suitable for transliterating aleph:
  • ʔ U+0294 LATIN LETTER GLOTTAL STOP (or a casing pair if you really prefer)
  • ʾ U+02BE MODIFIER LETTER RIGHT HALF RING
  • ˀ U+02C0 MODIFIER LETTER GLOTTAL STOP
  • ʼ U+02BC MODIFIER LETTER APOSTROPHE
One might even be able to bring oneself to use U+02BE for anciently sounded and U+02C0 for quiescent. --RichardW57 (talk) 18:39, 14 July 2021 (UTC)[reply]
I think ʔ U+0294 LATIN LETTER GLOTTAL STOP for non-mater-lectionis aleph and ˀ U+02C0 MODIFIER LETTER GLOTTAL STOP for mater lectionis aleph would be the clearest, but it's probably a big departure from tradition to not write aleph with the right half ring. — Eru·tuon 19:29, 14 July 2021 (UTC)[reply]
@Erutuon: See comment above. Unless there is a clear advantage in using something different from what already exists in terms of BH transliteration systems, I wouldn't want to come up with another one. Sartma (talk) 13:37, 28 July 2021 (UTC)[reply]
@Sartma: I'm not seriously proposing it. I don't mind the current transcription scheme described at WT:HE TR and generated by Module:he-translit. — Eru·tuon 17:56, 28 July 2021 (UTC)[reply]
@Erutuon: The only issue I see with the romanisation scheme described at WT:HE TR is that a "romanisation" is not a "transliteration". The model described there is a mix of transliteration and pronunciation (=romanization). I'd like to have a proper transliteration for BH, since no-one really knows how it was exactly pronounced, and the only thing we have are the signs. The current romanization in Module:he-translit doesn't show matres lectionis, which is a huge fault for me. Besides, deleting matres lectionis from the transliteration is the main reason why programming the automatic transliteration became hell. If we implemented the circumflex (^) for long vowels written with a mater lectionis, everything would become easier and straightforward. It's a no-brainer, really. Sartma (talk) 09:01, 29 July 2021 (UTC)[reply]
@Sartma: As I've said several times, including in a previous message in this thread, if the WT:HE TR transliteration is not a true transliteration, your proposed scheme isn't either, if by transliteration you mean something that is reversible, with a one-to-one relationship between Hebrew graphemes and Latin graphemes (letters and diacritics). Your scheme isn't reversible any more than the WT:HE TR one is. It would merely represent at least one more distinction, the presence and absence of a mater lectionis (not even which mater lectionis it is). — Eru·tuon 19:53, 29 July 2021 (UTC)[reply]
@Erutuon: Then shouldn't we try and find a way to transliterate BH properly instead of producing a mixture of transliteration and reconstructed arbitrary pronunciation? How many cases would the circumflex system leave out? They should be extremely few, maybe we can concentrate on finding a way to fix those? Sartma (talk) 10:35, 30 July 2021 (UTC)[reply]
@Sartma: What's the purpose of having a reversible transliteration for Biblical Hebrew? I don't remember seeing that before and I suspect it would be confusing and make it harder to read because Hebrew spelling is complicated and I guess even ambiguous. There are multiple pronunciations of individual graphemes that in many cases (not sure if all) you can figure out based on context. Looking over Module:he-translit/testcases: the vowel אָ (ā or o) or the letter and diacritic וּ (ū or ww), or the vowel and diacritic אִי (ī or i; see קִידּוּשׁ near the bottom of the Biblical Hebrew table), or of course sheva אְ (ə or nothing). Module:he-translit has to have a lot of rules to figure out these cases, but if the reversible transliteration is one-to-one Hebrew-to-Latin, the readers have to have these all in their heads. Granted a reversible transliteration could be one-to-many Hebrew-to-Latin, in which case it could disambiguate many of these cases (for instance, write וּ as ū or ww depending on context, rather than choosing a single symbol, whatever that would be, to represent it everywhere whether it was consonantal or vocalic), but there would still have to be an unambiguous representation of matres lectionis, which the reader would have to learn to ignore. I don't see the point. And we don't usually go for reversible transliterations for complicated scripts in Wiktionary. — Eru·tuon 18:55, 30 July 2021 (UTC)[reply]
Actually if וּ is considered as either a vav with a dagesh (ww) or a shuruq (ū) it isn't an exception to a one-to-one transliteration. It's only an exception Unicode-wise, in that they are both U+05D5 HEBREW LETTER VAV followed by U+05BC HEBREW POINT DAGESH OR MAPIQ, and have to be distinguished by context. — Eru·tuon 01:49, 31 July 2021 (UTC)[reply]
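The Unicode point above can be checked directly. The following is a minimal Python sketch (not part of Module:he-translit; the variable names are illustrative only) showing that the sequence וּ consists of the same two code points whether it is read as a shuruq or as a vav with dagesh, so only context can tell the two readings apart.

```python
import unicodedata

# The sequence written as vav + dagesh. It is the same two code points
# whether it is read as shuruq (ū) or as a geminated vav (ww).
VAV_DAGESH = "\u05D5\u05BC"

# Official Unicode character names for each code point in the sequence.
names = [unicodedata.name(ch) for ch in VAV_DAGESH]

for ch, name in zip(VAV_DAGESH, names):
    print(f"U+{ord(ch):04X} {name}")
```

Since the code points are identical, any distinction between ū and ww has to come from rules that look at the neighbouring letters and vowel points.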

──────────────────────────────────────────────────────────────────────────────────────────────────── @Erutuon: A trans-literation is literally a sign-by-sign transcription. Its purpose is helping the reader to understand the value of the signs of a foreign alphabet. I find it strange that you have never seen it before, since it's quite widely used at the beginning of many BH books. It's used throughout Larry Mitchel's A Student's Vocabulary for Biblical Hebrew and Aramaic. I don't think any of the examples you gave here are problematic in any way, since there is always a way to derive their value. You can always know when a אָ is ā or /ɔ/, as you can always know when וּ is ww or ū and אְ is /ə/ or nothing. I understand that a scriptio plena like קִידּוּשׁ might be an issue, but is that even BH? How many times does it appear like that in the Bible? If it's spelled קִדּוּשׁ the vast majority of the time, then it's a fake problem. Sartma (talk) 20:25, 3 August 2021 (UTC)[reply]

@Sartma: The etymology of transliteration is beside the point. Our transliterations (|tr=) vary from one-to-one correspondence to things that depart very far from native spelling and basically represent phonemes. I recall reading that Module:th-translit is one of the most unfaithful transliteration modules because Thai has a crazy writing system that is so distant from the phonemes in some cases, that transcribing graphemes is completely useless. (Compare the "orthographic" and "romanization" columns of เปาะเปี๊ยะ (bpɔ̀-bpía), a random example; maybe there are crazier ones, but I'm not very familiar with Thai.) But even Module:ar-translit doesn't generate a reversible transliteration: كَاتِبَ (kātiba, writer (m. acc.s.)) and كَاتِبَة (kātiba, writer (f. no case ending)) are both transliterated kātiba. We are free to make our transliterations as unfaithful to the native alphabet as we must to make them useful to readers.
I'm familiar with the transcription in the Student's Vocabulary. It's roughly the one my Hebrew textbook uses (though that doesn't have the slashes, whatever those are for) and I suppose the one with circumflexes that you've been talking about. It's not a one-to-one transliteration in the sense I was talking about earlier because it has lots of contextual rules. (I see it represents a silent alef as a half ring, rather than with a circumflex as I had the impression it did.) I don't know if it's reversible or not, even if it's not one-to-one; maybe it actually is except for edge cases.
I think there may be a few cases where אָ and אְ cannot be determined by context, but I am fuzzy on the details. Like see Shva#Shva Meraḥef. I wish User:Wikitiki89 were around. He seemed to have a better handle on edge cases.
I don't know if קִידּוּשׁ is Biblical Hebrew spelling; you'll have to ask somebody who's more learned in Hebrew than me. I merely assumed it was a case that had to be handled because it was in the module testcases. — Eru·tuon 23:12, 3 August 2021 (UTC)[reply]
@Erutuon: Ok, I understand that Wiktionary transliterations vary depending on the language and can be one-to-one correspondence or something else along the lines of a phonetic transcription. This means that a one-to-one transliteration for BH is not excluded a priori. My argument here would be that no-one really knows what the actual BH pronunciation was, we might have better theories for phonemes, but they are theories nonetheless, so why transliterate something that's just theoretical/reconstructed instead of sticking to the only evidence we have, which is the Tiberian niqqud? People not interested in BH won't care anyway and will have their MH normalisation. I have the feeling that we are trying to make BH closer to MH for I'm not sure what purpose. You correctly say that our transliterations should be useful to readers. If most BH students learn in books that mostly use the circumflex for matres lectionis and always transliterate alephs, why choose something different? As for the Arabic examples you give, they are false issues: كَاتِبَ (kātiba, writer (m. acc.s.)) would never appear like that anywhere without the article. It has to be الكَاتِبَ so it's an unreal example. I don't know all the details, but from my experience of the Arabic transliterations on Wiktionary, they're pretty much always reversible.
@Erutuon: The slashes are used as syllable boundary. Alephs should always be transliterated, since we have no certainty of how they were pronounced in BH and, again, alephs are not matres lectionis so they shouldn't be transliterated with a circumflex. I don't think that system is 100% inconsistency free (compare הָיָה (hā/yấ) vs מִקְנֶה (miq/néh)), but as far as I know it's pretty much always reversible. Sartma (talk) 08:48, 4 August 2021 (UTC)[reply]
@Sartma: Sorry, كَاتِبَ (kātiba) is ungrammatical Arabic except in a phrase. Take الْكَاتِبَ (al-kātiba) and الْكَاتِبَة (al-kātiba) instead. ة is an exceedingly common letter because it is the ending of most feminine nouns and feminine forms of adjectives, and it is not distinguished from a final a vowel, which is common in past-tense verbs in most places, and in noun case endings in quotations, and when it has a vowel on it (in quotations since case endings are omitted in headwords), it is not distinguished from the letter ت. And then there is no distinction between final ا and ى, both transliterated ā, which are somewhat less common. And there is no distinction between nunated vowels (which occur in quotations, not on lemma lines, for the most part) and an un-nunated vowel and ن. There are enough non-reversibilities that I imagine you'd hit one in a non-trivial quotation and would fairly often hit one in a randomly selected headword transliteration. This is beside the point, but I feel it necessary to defend my basic Arabic ability...
@Erutuon: Now you're mixing apples and oranges, though. If you use ḥarakāt, then you have to use them consistently for both words: الْكَاتِبَ (al-kātiba) and الْكَاتِبَةَ (al-kātibata). If you want to use informal Arabic without ḥarakāt, then you have الْكَاتِب (al-kātib) and الْكَاتِبَة (al-kātiba). In either case, those two words would not be transliterated the same. Sartma (talk) 10:30, 5 August 2021 (UTC)[reply]
I'm led to understand by Module:he-translit that it's clear from the Tiberian pointing which alefs were pronounced in Tiberian Hebrew. I'm a little unclear on all the details but I think it's that they don't have a vowel directly after them and that a vowel diacritic immediately precedes them. Maybe not too different from the conditions under which the other potentially silent letters are recognized as not pronounced. Module:he-translit is mostly able to distinguish them if you look at the testcases. The circumflex-transcription distinguishes some of the silent letters from the non-silent versions of themselves (by not writing them or writing them as a circumflex) but not all of them since it writes the final dagesh mappiq-less he and writes all alefs. That seems worse to me than the Module:he-translit practice, which is to omit them entirely. At least total omission is consistent!
@Erutuon: I see two issues: 1) Module:he-translit is trying to give a "pronunciation" more than a transliteration, and it's not even consistent in doing so, because Tiberian pronunciation for אָ doesn't distinguish between /ā/ and /ɔ/. So, the system is already inconsistent. 2) Alephs are phonemic in BH. In Tiberian Hebrew they were not pronounced after a long vowel, but in all other cases, they were. We should give a transliteration that differentiates verbs like ברא (bārāʾ), that have inflected forms like בָֽרְאוּ (bārəʾû) with pronounced aleph and verbs like בָּנָה (bānâ), which inflect differently (בָּנוּ (bānû)). In Module:he-translit they would both end in /ā/, making it impossible to distinguish their different conjugations. In the end, in Tiberian Hebrew all alephs following a long vowel are by rule quiescent, so knowing the rule would be enough. The correct pronunciation of the word is anyway always given in the Pronunciation section, so there is no need for a transliteration to mirror the actual pronunciation. It's more important that all relevant phonemes are clearly indicated, so the reader has all key information about that word. Sartma (talk) 10:30, 5 August 2021 (UTC)[reply]
I'll admit the circumflex-transliteration probably has familiarity in its favor. — Eru·tuon 19:22, 4 August 2021 (UTC)[reply]
Oh, I think what puzzled me about the slashes was that shvas don't count as syllable nuclei so I was seeing some two-vowel syllables... — Eru·tuon 19:28, 4 August 2021 (UTC)[reply]
Oops, just realized Module:he-translit doesn't consistently omit the silent letters in the vowel pairs אִ i and אִי ī, אֻ u and אוּ ū, where the macron basically marks the silent letter, except when the macron is omitted because the vowel is in a closed syllable (קִידּוּשׁ). My appreciation for the strange-looking superscript-using transliteration grows. — Eru·tuon 03:26, 5 August 2021 (UTC)[reply]

@Wikitiki89, Qehath, who might be interested in this. PUC 11:10, 5 July 2021 (UTC)[reply]

Let's make a synthesis of everything that has been said on Discord and here. I spent my day doing it so don't ignore it. Here is not my personal opinion unless there's "I", but the arguments of everyone. It's a big message, though I tried to make it as synthetic as possible, but complete, comprehensive, exhaustive.

Introduction

Hebrew romanization is currently messy because several systems are used simultaneously across Wiktionary, each user deciding which one to use. A unified romanization throughout Wiktionary's coverage of Hebrew, ie. a standardization, is necessary. The best way to do so is to use a module, which ensures the standard will be respected. This makes things easier for the contributor, who no longer needs to write the romanization because it is automated, and for the reader, who knows there's no mistake and that the given romanization is consistent because it is standardized. Module:he-translit already exists thanks to Erutuon, but it's not finished and we need to agree on a solution in order to finish it. The question isn't "should we?" but "how will we?".

I-Unification: Modern and Biblical Hebrew

Unified also means that the given romanization should be convenient for any stage (chronolect) of Hebrew: Biblical Hebrew (BH), Modern (Israeli) Hebrew (MH) as well as other medieval dialects. In a nutshell, MH has plenty of loanwords and neologisms and can freely pick up vocabulary from BH and later stages, which inherit part of their lexicon from BH, whose lemmata are limited. Therefore, any BH lemma is automatically a MH one too, and the best solution is to label (with {{lb}}, {{tlb}}, {{defdate}} or another) the period since which a term is attested, insofar as it can be used in any later stage, leaving MH loanwords and neologisms without a label. Splitting Hebrew would mean massive duplication, which would be useless in this context. On Discord there was consensus about keeping a single Hebrew header (actually the discussion started with this), which means the romanization will be common to MH and BH, and the automated standard should end the use of two kinds of romanizations, one chiefly for MH, somewhat tentative (use of digraphs, several romanizations for one consonant and several consonants romanized the same way), and a scholarly one initially for BH but which can also work for MH thanks to its precision. Working here means that it's a coherent system, not that MH speakers will be totally familiar with it. Indeed, MH transcription is fully deducible from the BH one, simply by ignoring features such as gemination and vowel length. On the other hand, it would be stupid to use BH romanization for MH loanwords and neologisms... MH being the living (thus growing) Hebrew language, MH romanization has to be a default one.

II-Romanization: transliteration and transcription

Romanization means the use of Latin script for a transliteration |tr= (stricto sensu, conversion of the graphemes and diacritics, should be reversible) and/or a transcription |ts= (phonological, therefore proper to each dialect, appears between "//", not necessarily reversible). The module shall serve for romanization, on the head for the entry and when a Hebrew term is mentioned, as well as for a future Hebrew pronunciation module Module:he-IPA, whose starting point to generate pronunciation for several dialects will be the romanization. Remember that the aim of the romanization is to help readers: if we were all Hebrew speakers we wouldn't need this; likewise, it's not meant to be a pronunciation since there's a section for this.

  1. In a pure transliteration, no data is lost, ie. alef (ʾ<') is distinguished from ayin (ʿ<'), as well as bet (ḇ<v) and vav (v, or w<v), chet (ḥ<kh) and khaf (ḵ<kh), khaf (k) and qof (ḳ(or q)<k), tet (ṭ<t) and tav (ṯ<t), samekh (s) and sin (ś<s). It's precise, it's the scholarly romanization. Stress is not marked since graphemes do not say where it falls, unless using oleh which would resolve everything as I understand it. Glottal stops are always written, and the value of one letter is independent from the others, ie. a vav has the same transliteration whether it's /v/, /w/ or /o/. Vowel length can be marked. Lastly, a transliteration isn't IPA-like so it's better to use <ḇ, ḵ, p̄> than <v, kh, f> since <ḡ, ḏ, ṯ> are also used, and not *<ɣ, ð, θ>. This is due to MH phonology, whereas BH is more parsimonious for a proper transliteration. <ṣ, š> are preferable to digraphs <ts, sh>.
  2. A pure transcription is relative to MH or BH (Sartma wants it for MH) and therefore it merges the consonants above and loses vowel length according to MH phonology. However stress is necessarily marked, and some adaptations can be made, for instance due to matres lectionis (such as vav, which can be transcribed as /o/ when it stands for a vowel), for which "it would be customary to use the ^" according to Sartma. Apparently, qamats can be unpredictably /ā/ or /o/; a pure transcription would require the distinction, which is problematic if unpredictable (unless it's /ā/ in open syllable and /o/ in a closed one, with more or less exceptions). Though it appears between slashes //, IPA symbols are not mandatory.
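To illustrate the claim that an MH transcription can be derived from the scholarly romanization by merging consonants and dropping gemination and vowel length, here is a rough Python sketch. The merger table, the function name `to_modern` and the handling of circumflexes are assumptions made for illustration from the pairs listed above; this is not what Module:he-translit does, and it ignores shva, qamats qatan and stress placement.

```python
import re
import unicodedata

# Hypothetical merger table, written in NFD (base letter + combining mark).
# Based on the consonant pairs listed above; not an agreed standard.
MERGERS = {
    "\u02be": "'",    # ʾ alef
    "\u02bf": "'",    # ʿ ayin
    "b\u0331": "v",   # ḇ
    "w": "v",         # consonantal vav
    "h\u0323": "kh",  # ḥ
    "k\u0331": "kh",  # ḵ
    "k\u0323": "k",   # ḳ
    "q": "k",         # qof
    "t\u0323": "t",   # ṭ
    "t\u0331": "t",   # ṯ
    "s\u0301": "s",   # ś
    "s\u0323": "ts",  # ṣ
    "s\u030c": "sh",  # š
    "p\u0304": "f",   # p̄
    "g\u0304": "g",   # ḡ
    "d\u0331": "d",   # ḏ
}

# Longest keys first so that "p" + macron is matched before anything shorter.
_MERGE_RE = re.compile(
    "|".join(re.escape(k) for k in sorted(MERGERS, key=len, reverse=True))
)
# A consonant (with its combining marks) immediately repeated, i.e. gemination.
_GEMINATE_RE = re.compile(r"([b-df-hj-np-tv-z][\u0300-\u036f]*)\1")


def to_modern(scholarly: str) -> str:
    """Derive a rough Modern-style transcription from a scholarly one."""
    s = unicodedata.normalize("NFD", scholarly)
    s = _GEMINATE_RE.sub(r"\1", s)                      # drop gemination
    s = _MERGE_RE.sub(lambda m: MERGERS[m.group(0)], s)  # merge consonants
    # Remaining macrons and circumflexes sit on vowels: drop vowel length.
    s = s.replace("\u0304", "").replace("\u0302", "")
    return unicodedata.normalize("NFC", s)


print(to_modern("\u02bf\u0101p\u0304\u0101\u0301r"))  # scholarly form of עָפָר
```

The conversion only works in this direction: the merged output cannot be turned back into the scholarly form, which is the asymmetry the two points above describe.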

The question is whether to have both, or only one that has the advantages of both, with flexibility towards the definitions but still uniform (because automated). Note that both are used simultaneously for languages using cuneiform, or pure abjads, in which what's written (given in transliteration) is quite different from the phonemic transcription (given in transcription). It doesn't make a lot of sense IMO to use the former for vowel length and scholarly romanization and at the same time the latter for stress and phonemic transcription. I don't see the point of using both, particularly when a proper transcription would be specific to MH or BH, rather than one romanization, using |tr= and displaying the distinction between each letter, stress, vowel length, all the consonants even when not pronounced, and matres lectionis, ie. romanizing vav as <o> when it's <o>. As a result, everybody would be happy, since all the information is in this merged romanization; however, it may be overloaded and therefore bother people interested in MH.

III-Stress: position and implication

The main problem is the stress position:

  1. It is unpredictable. We don't know if it's absolutely unpredictable for BH, or just hardly, and if it would require to know the part of speech (POS) of the term. But even if it were predictable, there are the MH neologisms and loanwords whose stress position is truly unpredictable.
  2. The best option, because it is by far the easiest for the user and the shortest for the module, which makes it very elegant, is to indicate the stress position by means of a diacritic called oleh (Alt+6 on Windows) which is consistently and widely used in dictionaries and scholarly works, see Sartma's message above, or with another diacritic (e.g. meteg). Perhaps there should be a default stressed syllable, when unmarked, say the final one, which would make it unnecessary to mark it all the time.
  3. It should not be indicated on a transliteration, but only in a transcription, since the graphemes do not say where stress falls. Unless using (or rather showing) oleh. It was suggested to write oleh for this purpose but to remove it when displayed. I think this would be overcomplicated, all the more so if it is removed.
  4. Given that diacritics are already added, it won't bother to add one more. For a transliteration, either it's exclusively for consonants, using the pagename, but nobody would be satisfied with that, or it encompasses the added diacritics, including the one for stress. The romanization will anyway be generated from what's put in the template (with diacritics), not from the pagename (without any), so there's no technical problem to add stress with oleh.
  5. It may change through time, but seldom, I don't know how frequent it is.
  6. Apparently we can't do without the stress position, as stress would change things for vowels like length or position which also modifies transcription, not only pronunciation. Stress is a phonemic feature, compare /'ál/ and /el/. According to Metaknowledge, "the sticking issue has always been stress. If we mark that in the Hebrew, everything else can be overcome or specified." Thus there are implications of stress for transcription that are totally relevant for the pronunciation section, separating the dialects, but ignored for a transliteration. Those have not been explicitly said, I know not why the stress is so important beyond itself.
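The proposal in points 2 to 4 (mark stress with oleh in the template input, and fall back to a default final stress when it is absent) could be sketched as follows in Python. The function names are hypothetical and the sketch only locates the letter that carries the mark; it does not attempt syllabification, which a real module would need.

```python
import unicodedata

OLE = "\u05AB"  # HEBREW ACCENT OLE, the stress mark proposed above


def stressed_letter_index(pointed: str):
    """Index (counting base letters only) of the letter carrying ole, or None."""
    letter_index = -1
    for ch in pointed:
        if unicodedata.category(ch) == "Lo":  # Hebrew base letters
            letter_index += 1
        elif ch == OLE:
            return letter_index
    return None


def describe_stress(pointed: str) -> str:
    """Apply the proposed default: final stress unless ole marks another place."""
    idx = stressed_letter_index(pointed)
    if idx is None:
        return "final (default)"
    return f"marked on letter {idx}"


# samekh + tsere + ole, pe + segol, resh: a segolate with marked stress
marked = "\u05E1\u05B5\u05AB\u05E4\u05B6\u05E8"
# the same letters and vowels without any stress mark
unmarked = "\u05E1\u05B5\u05E4\u05B6\u05E8"

print(describe_stress(marked))
print(describe_stress(unmarked))
```

Removing the mark before display, as suggested in point 3, would then be a single string replacement on the displayed form, while the romanization is generated from the marked input.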

Conclusion

To conclude, there are an observation (Hebrew romanization is messy), a necessity (agreeing on a standard) and a solution (a module), which remains to be clarified (its exact working). I kept things coherent where I could, so that differences of opinion appear clearly to you all. I'll let you discuss below what still needs to be discussed. Then we will create a vote for the main proposal. Malku H₂n̥rés (talk) 17:32, 5 July 2021 (UTC)[reply]

Glad to see some initiative here! Firstly, we should be clear that this is not fundamentally an issue of transliteration, but rather an issue of how a language at two distinct stages in its history might be treated as a single unified object. Suffice it to say, the handling of Hebrew is unlike that of other languages on Wiktionary. I agree with @Fay Freak that, as long as Biblical and Modern Hebrew are treated here as a single language, a Biblical transcription scheme should be standardized, as these are the forms from which those of Modern Hebrew and the various reading traditions ultimately derive. Romanization, of course, will be of little interest to Modern Hebrew speakers anyways. Now, we must also be clear that when we speak of “Biblical Hebrew transcription,” we are not referring to a strict transliteration of Tiberian Biblical Hebrew (ie. the Hebrew text and vowel diacritics used in Wiktionary headers). Tiberian ◌ָ ambiguously indicates /ɔ/ or /ɔː/, since vowel length is left unwritten. A transcription which differentiates between o and ā for Tiberian ◌ָ is incorporating information from other reading traditions in which these vowels do not merge in quality. I’ll be able to expand on such issues and their implications later, but for now I’d suggest that anyone interested in familiarizing themselves with the Hebrew vowel system read Benjamin Suchard’s The Development of the Hebrew Vowels. I believe it contains all the information necessary to solve the problem of BH transcription, at least. Rhemmiel (talk) 11:06, 7 July 2021 (UTC)[reply]
@Rhemmiel: I didn't read all the 300 pages of that essay, but I think we might want to stick to something more mainstream and in use already in textbooks/dictionaries. I would want Wiktionary to remain "friendly", while still being exact. People who are interested in essays like that certainly won't be using Wiktionary. Sartma (talk) 19:41, 7 July 2021 (UTC)[reply]
@Sartma: I'm not sure if you'll be among those working on the module, but those who will be might find the material in this paper helpful when they run into problems determining the right output for a given vowel sign. I suggested it because it contains information relevant to module design, not for anything on the user-end of things 👍 Rhemmiel (talk) 22:55, 7 July 2021 (UTC)[reply]
Support final syllable as default stressed syllable and indicating the accent on the Hebrew (with oleh) in all other cases (segolates and unpredictable MH words). Sartma (talk) 08:43, 7 July 2021 (UTC)[reply]

Update - New proposal

[edit]

It looks like the majority of people who commented here agrees on the need to have both transliterations (Modern and Biblical) and on the desirability of an automated template. At this point, we just need to decide a format and start working on the template. My preference would be for something like this:

I would use a full scholarly transliteration for Masoretic BH as found in most textbooks (with ^ for matres lectionis, ā/o for qamets/qamets hatuph, etc.), since it's the most widespread and the only one that allows us not to care about pronunciation issues too much (but I won't oppose alternative proposals if they make sense). What do you think? Sartma (talk) 13:11, 13 July 2021 (UTC)[reply]

I think you have essentially two proposals here: showing two Hebrew transliterations in link templates and changing the Biblical Hebrew transcription system to show matres lectionis with circumflexes. Circumflexes currently aren't recommended in WT:HE TR, and Module:he-translit has a complete or near complete implementation without circumflexes.
Showing two transliterations might have the most support, but it requires coming up with a plan and modifying Module:links, which is a big job because the module is so widely used. And it might also require changes to Module:he-translit somehow, though perhaps fewer changes than adding circumflexes would require. Based on my reading of Module talk:he-translit, maybe the module has stalled because we haven't chosen a stress symbol, and because we don't have separate representations in the Hebrew script for the two pronunciations of qamets (אָ) and the pronounced and unpronounced shva (אְ), which are not completely predictable from context? User:Wikitiki89 probably can correct me if he's around. Not sure how easy those problems are to fix, and even if they are fixed, I'm not sure exactly how to proceed in Module:links to enable a second transliteration.
About circumflexes and matres lectionis, I haven't seen clear agreement. I think User:Wikitiki89 is against circumflexes because they're non-phonemic (he did most of the work to figure out the odd cases in Module:he-translit), I haven't been convinced that they're useful because they confuse me, and User:RichardW57 proposed representing matres lectionis with superscripts rather than circumflex (which seems clearer to me if it's possible). — Eru·tuon 19:22, 14 July 2021 (UTC)[reply]
@Erutuon: I keep talking about circumflexes mainly because they are pretty much in all my BH books, so it sort of makes sense to me to have on Wiktionary something that people might be familiar with, but as I said above, I'm not against a different approach, as long as it makes sense. For example, for me it would be a big no not to transliterate alephs (like it seems to be the case at the moment). Another system would be the one that uses single vowels like å for ā, ɛ for e, e for ē, etc. If we go for this, I guess we can even ignore matres lectionis, since it would be more of a normalization than a transliteration. Sartma (talk) 20:56, 31 July 2021 (UTC)[reply]
@Sartma: Module:he-translit doesn't transliterate alefs when they are matres lectionis, unpronounced in Tiberian (I guess everywhere except before a vowel). Why is this more unacceptable than with the other two matres lectionis? I guess alef might be always etymological, unlike he and yodh, which are sometimes written in words that never had consonantal /h/ and /j/. I'm not sure if that ever matters in etymologies (i.e. if any languages have borrowed an unpronounced-in-Tiberian alef as an actual consonant). Even if we always represented alef even when unpronounced in Tiberian because it is etymological, we can't distinguish between etymological and unetymological he or yodh, so there would be an inconsistency there.
I've seen a transliteration with å a few times online. Using å it looks like it's transcribing Tiberian phonemes. Module:he-translit distinguishes ā and o for אָ, which were not distinguished in Tiberian (both being pronounced /ɔ/), but are distinguished as a and o in Sephardi pronunciation and Modern Hebrew; this one perhaps would write both as å. The advantage of å graphically is it has both an a and an o in it, so it suggests both of the modern pronunciations. — Eru·tuon 17:04, 3 August 2021 (UTC)[reply]
@Erutuon: Is aleph really a mater lectionis in BH? Weingreen (A Practical Grammar for Classical Hebrew), Seow (A grammar for Biblical Hebrew) and even Kahn (The Routledge Introductory Course in Biblical Hebrew) only give ה י ו as matres lectionis, not א. I never learned anywhere that aleph is a mater lectionis. Maybe it is in MH, but from what I know it's not in BH, so I would expect to see it transliterated in a BH transliteration. I still haven't formed an opinion about which system of vocalisation would be better, but I'm not sure why there would be anything wrong with representing the Tiberian pronunciation, considering that all niqquds are there to represent a Tiberian pronunciation, while they've never been there to show a Sephardic or even less modern one. Sartma (talk) 19:40, 3 August 2021 (UTC)[reply]
@Sartma: I think I have confused mater lectionis with "not pronounced". w:Mater lectionis#Usage in Hebrew says that alef is occasionally a mater lectionis, but apparently the cases where it stopped being pronounced by Tiberian Hebrew times, which I was thinking of, as in רֹאשׁ or בָּרָא, don't count; it counts only where it is written but would never have been pronounced even in the earliest Hebrew.
A Simple, Practical System for Transcribing Tiberian Hebrew Vowels looks like it may be the system that uses å that you're talking about. It is reversible and does clearly indicate Tiberian phonemes, using one vowel letter per Tiberian vowel (i, e, ɛ or æ, a, ɔ or å, o, u) and indicating silent letters (including matres lectionis) with superscripts as suggested by User:RichardW57. It would be easier to implement than the current transcription generated by Module:he-translit because it doesn't have to do acrobatics to try to infer the pronunciations in other varieties of Hebrew than the one that Tiberian vocalization represents. The example given on the last page looks crazy to me, being familiar with Wiktionary's transliteration and something like yours with the circumflexes! But it is a very nice system theory-wise. — Eru·tuon 23:38, 3 August 2021 (UTC)[reply]
How do you propose to handle Modern borrowings, coinages, etc.? Especially, how do you propose to handle the huge numbers of them that don't conform to the inherited phonology? Would we really try to romanize ג׳ורג׳ (jórj, George) or מיקרוסקופ (mikroskóp, microscope) or קיבוצניק (kibútsnik, kibbutznik) using some sort of romanization scheme intended for the Masoretic Text?
Similarly, how do you propose to handle the case where the usual Modern spelling introduces a mater lectionis that wasn't there in Biblical Hebrew, e.g. because in Biblical Hebrew there was a short vowel there? (This is extremely common; for example, Modern Hebrew adds a yúd in pi'él verbs such as בישל \ בִּשֵּׁל (bishél, to cook).) Would we transliterate them using a Masoretic Text romanization scheme but as if there had been a long vowel there?
RuakhTALK 02:25, 31 July 2021 (UTC)[reply]
@Ruakh: If the word is not attested in BH, its BH transliteration doesn't need to appear; only the modern romanization would be given, like this: ג׳ורג׳ (jórj, George), מיקרוסקופ (mikroskóp, microscope), etc. When the BH form is present, its transliteration will be given too: בישל \ בִּשֵּׁל (Ⓜ bishél Ⓑ biššḗl, to cook). I don't see any issue with your examples; they can be dealt with quite straightforwardly. Sartma (talk) 20:43, 31 July 2021 (UTC)[reply]
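The display rule described here can be sketched in a few lines of Python. This is only an illustration: the function name format_mention and its parameters are hypothetical, and nothing like it exists in Module:links as far as this thread shows.

```python
def format_mention(term, modern, gloss, biblical=None):
    """Render 'term (transliteration, gloss)'.

    If a Biblical Hebrew transliteration is supplied, both romanizations
    are shown, tagged with the circled M and B markers used in the
    comment above; otherwise only the modern romanization appears.
    """
    if biblical is None:
        translit = modern
    else:
        translit = "\u24c2 " + modern + " \u24b7 " + biblical
    return f"{term} ({translit}, {gloss})"
```

A term without a BH attestation would be called with three arguments; passing the fourth switches on the two-transliteration display.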
I thought there should be no transliteration but transcription, so they would use i instead of ī, but with biblical transcription – I don’t see why we have the digraphs here again. Why would we transliterate if the Hebrew spelling is right next? Same thing with Ottoman: One often gives circumflexes in transcriptions, in scholarly literature, so one can reconstruct the Arabic-script spelling but this is not needed as we give the spelling, plus Ottoman was also written in Armenian script, so we had better give the transcription of how we think the norm phonological shape was, with the modern Turkish spelling rules; anything else would be confusing in its signification.
@Fay Freak: Apart from the fact that I still have no clear idea of what a "transcription" would or should be (it's one of the biggest mysteries I'm yet to solve since I started editing here on Wiktionary), my understanding is that we transliterate because this is English Wiktionary and it should be friendly to people who don't know the script. It's the same reason why we transliterate Chinese characters and Japanese sentences. "The spelling is right next" cannot be an argument in a dictionary for English native speakers. Sartma (talk) 19:54, 3 August 2021 (UTC)[reply]
@Sartma: The mystery is that only four years ago there was only |tr= and not |ts=. Then it was split for languages where two parameters make sense. I did not focus on the name of the parameters thus: For most languages (e.g. Syriac) |tr= contains transcriptions in spite of its being named (or rather reinterpreted) “transliteration” now.
You understand the reason why we give transcriptions. I did not say we should omit it by reason of the script being right next, but rather it does not need to slavishly mimic the native script as a “romanization”, a transcription depending on the native script so to speak. That is a transliteration. In general this is given in literature as a replacement of native script (which may be hard to print). We rather have to adapt the transcription to the circumstance that the script is right next: It aids in reading it but it does not need to give every peculiarity of the script. Therefore you are not wrong to conclude that we use private language and at Wiktionary transcription and transliteration mean something peculiar in opposition to the rest of the world, or the core strain of meaning the word has outside. Unavoidable since we have a coverage that is unique to the outside world.
Of course it means that one cannot just take Wiktionary’s transcriptions and paste them into a scholarly work without the native script and expect that it is a faithful replacement of the native script. That is not what the transcription is for. Fay Freak (talk) 02:02, 4 August 2021 (UTC)[reply]
The thing @334a did at various places recently, adding/restoring the circumflex in Hebrew |tr= for something about matres lectionis, I could not understand. Why explicitly signify there is a mater lectionis if a transcription showing either long or short pronunciation would already make it clear?
Concerning George, this problem only arises from trying to transliterate instead of transcribing. Also it shows that it is better to use ǧ like for the Ethiopic script. Using j for /dʒ/ is an English Sonderlesart that isolates you from Europe. Fay Freak (talk) 19:34, 3 August 2021 (UTC)[reply]
Hi @Fay Freak, I haven't read this entire thread (yet) but regarding the Hebrew transliteration which you pinged me for: I'm using the transliteration scheme as outlined in the grammar published by Zondervan and essentially the same (apart from the schwa) as the one found in the "Academic" column of w:Romanization_of_Hebrew#Table. Basically, any long vowel is shown with a macron, any vowel with a mater lectionis is shown with a circumflex, and any short vowel by itself (i.e. without a mater lectionis) is just shown as is, without a macron or circumflex. The circumflex is not there to tell you the length of the vowel; not all vowels with a circumflex/mater lectionis are long (e.g. שָׂדֶה (śāḏê)). Your assertion that matres lectionis need not be shown because you can just show the length of the vowel would leave out important information. Etymologically, it's very necessary to point these matres lectionis out and show the differences between defective vs. full spellings as different time periods have different writing practices (e.g. בּוֹקֶר (bôqer) vs. בֹּקֶר (bōqer)), not to mention that showing two different native spellings with the same transliteration just seems confusing and a little silly.
Of course, this transliteration scheme is not perfect and I have my gripes with it (for example, there's no way to tell if the vowel in ê is a sere or a segol or if the mater lectionis is a he or a yodh), but I guess that's the point of this discussion. :) --334a (talk) 16:29, 6 August 2021 (UTC)[reply]
No, I don’t reckon it confusing nor silly to have two native spellings with the same transcription scheme, nor necessary to point out matres lectionis by mirroring them in transliterations. I reckon it confusing to have different transliteration schemes if they signify the same phonological shape. Sere has always been put ē by me and segol e or perhaps ä (so I can put a macron on it to signify long segol, or is a ה after a segol just a mater lectionis?), no circumflex ever on either. And, @334a, I don’t see at all what you are trying to tell me with the distinction בּוֹקֶר (bôqer) vs. בֹּקֶר (bōqer)! You were thinking you show something but weren’t showing anything! That is, for most readers. Is it short or is it long, or is it different between the two? Fay Freak (talk) 04:02, 7 August 2021 (UTC)[reply]
@Fay Freak It's important to indicate matres lectionis because their presence might mean that a vowel will not be reduced if given a chance. Compare אָסִיר ʾɑsīr (prisoner) with its plural ʾăsīrīm (prisoners) vs כּוֹכָב kōḵɑḇ (star) with its plural kōḵɑḇīm (stars). Unless it is clear from the transliteration whether there is a mater lectionis or not, there is no way to know whether a vowel will be reduced to a shwa or not. Sartma (talk) 19:43, 10 August 2021 (UTC)[reply]

Hybrid Vowel transliteration proposal for BH

After ruminating on the best way to transliterate BH vowels, I came up with this hybrid system (a combination of the 7 vowel system and the system using the circumflex for matres lectionis). I think that if we use the 7 vowel system + circumflex to indicate mater lectionis ה and macron to indicate any other mater lectionis, we should be able to cover all cases in an elegant and simple way (= the least amount of diacritic signs). Here is my proposal:
  • Basic system: 7 vowel system (a, ɑ (ɔ, å), i, e, ɛ (æ), o, u)
  • Vowel + mater lectionis ה = circumflex
  • Vowel + mater lectionis except ה = macron
בַ = ba
בָ = bɑ | בָי = bɑ̄ | בָה = bɑ̂
בִ = bi | בִי = bī
בֵ = be | בֵי = bē | בֵה = bê
בֶ = bɛ | בֶי = bɛ̄ | בֶה = bɛ̂
בֹ = bo | בוֹ = bō | בֹה = bô
בֻ = bu | בוּ = bū
The "shwas" would be: ə, ă, ɛ̆, ɑ̆.
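The mapping above can be written down as a small Python sketch, for illustration only. The function name hybrid_vowel, the vowel-sign names used as dictionary keys, and the idea of passing the mater lectionis as a separate argument are all assumptions made for this sketch; it is not how Module:he-translit works.

```python
import unicodedata

# Base letters of the seven-vowel system, keyed by vowel-sign name.
BASE = {
    "patah": "a",
    "qamats": "\u0251",  # ɑ
    "hiriq": "i",
    "tsere": "e",
    "segol": "\u025b",   # ɛ
    "holam": "o",
    "qubuts": "u",
    "shuruq": "u",       # same quality as qubuts, always written with waw
}
CIRCUMFLEX = "\u0302"    # combining circumflex: mater lectionis he
MACRON = "\u0304"        # combining macron: any other mater lectionis


def hybrid_vowel(vowel, mater=None):
    """Return the proposed transliteration of one vowel.

    mater is None, "he", "yod" or "waw". Alef is rejected because the
    proposal always writes it as a consonant.
    """
    base = BASE[vowel]
    if mater is None:
        mark = ""
    elif mater == "he":
        mark = CIRCUMFLEX
    elif mater in ("yod", "waw"):
        mark = MACRON
    else:
        raise ValueError("not a mater lectionis in this proposal: " + mater)
    # Compose to a single code point where Unicode has one (e.g. i + macron).
    return unicodedata.normalize("NFC", base + mark)
```

Called as hybrid_vowel("tsere", "he") it yields the circumflexed vowel of the third column, and hybrid_vowel("hiriq", "yod") the macron form of the second.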
What do you think? Sartma (talk) 12:40, 10 August 2021 (UTC)[reply]

Usage examples

I wouldn't use "a" and "ɑ" contrastively, since {{m}} puts the transliteration in italics, and the italicized versions of those two characters look identical or at least very similar in many fonts: בַּ (ba) vs. בָּ (bɑ). As for how to represent the matres, I was taught to write the letter of any silent consonant as a superscript, e.g. בִּי (biʸ), בֶּי (bɛʸ), בֶּה (bɛʰ), בּוֹ (boʷ), בֹּה (boʰ), etc. This would then even work for cases like יִשָּׂשכָר (yiśśɔśḵɔr), where there is a silent consonant letter that is not a usual mater lectionis. —Mahāgaja · talk 17:14, 10 August 2021 (UTC)[reply]
@Mahagaja: That's a good point and I did think about it. Would there be a way to stop {{m}} from italicising a transliteration if it is BH? I wouldn't want to give it up because of formatting issues. The ɑ (alpha) works great here precisely because it is similar to an a, and its IPA value is also spot on. I personally really dislike superscript letters in a transcription, especially if they're not necessary (I don't think they are for BH). If there regularly were a lot of words like יִשָּׂשכָר (yiśśɔśḵɔr) I would agree, but I don't think it's helpful to consider exceptions when trying to develop a regular system. Sartma (talk) 17:55, 10 August 2021 (UTC)[reply]
There might be a way to stop {{m}} from italicizing the transliteration when the language is hbo, but why not just use ɔ? It's parallel to ɛ for segol, and explains better why the short patakh is /o/ in Ashkenazi and Sephardic and Modern Israeli. And superscripts make it easier to understand which letter is being used as the mater. Your proposal above includes בָי = bɑ̄, but is בָּי ever /bɔː/ as opposed to /bɔj/? Or did you mean to write בָא = bɑ̄? If yod is ever silent after patakh, superscripts would allow us to distinguish between y and ʔ. —Mahāgaja · talk 18:09, 10 August 2021 (UTC)[reply]
@Mahagaja: Yes, of course we can use ɔ instead of ɑ. I see ɑ as a friendlier version, since the vast majority of people who study BH are used to seeing an <a> there, and might be confused by the ɔ. In the end there's no certainty on the actual pronunciation of that sound, so it's just a matter of choosing a sign for that phoneme. If ɑ makes the transliteration look less alien to people who already know BH, why not? Your argument that ɔ would "better explain why the short patakh is /o/ in Ashkenazi, Sephardic and Modern Israeli" is not pertinent to what we are trying to do here, which is finding a transliteration system that works for BH.
I took the above table of vowel combinations from Lambdin's Introduction to Biblical Hebrew. בָי = bɑ̄ is given as rare there, but I kept it anyway (I've never seen it before, but I guess it's a possibility). In the system I'm proposing, א (alephs) are always transliterated as <ʾ>. Alephs are not matres lectionis (they can be quiescent, but that's a different story), so the need to distinguish between y and ʔ won't arise. In the system I propose it's always possible to know what the mater lectionis is, since the correspondence is one-to-one. There are no inconsistencies. Sartma (talk) 18:58, 10 August 2021 (UTC)[reply]
@Mahagaja What do you think about the use of macrons and circumflexes? I think it solves a lot of the consistency issues present in existing transliteration systems and it's somehow etymological too: many final ה were originally <t>'s that stopped being pronounced, so the use of the circumflex would be in line with the function this sign has in other languages (like French, for example, where it indicates "fallen" consonants). The macron, on the other hand, just indicates the full presence/lengthening of a vowel through a mater lectionis. Plus, this system gives really neat, clean and elegant transliterations with just the right amount of diacritics. Compare אֱלֹהִים ʾĕlōhî́m vs ʾɛ̆lohīm, אָנֹכִי ʾānōk̠î́ vs ʾɑnok̠ī, מֹשֶׁה Mōšéh vs Mošɛ̂, or the inconsistency in מָה\מֶה\מַה mâ, meh, mah, vs the throughout consistent mɑ̂/mê/mâ. Sartma (talk) 18:41, 10 August 2021 (UTC)[reply]
I definitely prefer the seven-vowel system with fewer diacritics to the five-vowel system with more, but I'd still prefer ʔɛ̆lohiʸm, ʔɔnoḵiʸ, Mošɛʰ, and maʰ/mɛʰ/mɔʰ, which is also consistent. —Mahāgaja · talk 18:53, 10 August 2021 (UTC)[reply]
@Mahagaja I understand that superscript characters are your preference. But you agree that there is no need to use them, right? I would argue that a more economic transliteration system is superior to one that makes use of a greater number of characters than is needed. Sartma (talk) 19:11, 10 August 2021 (UTC)[reply]
I would not agree that a system that distinguishes ɛ̂ and ɛ̄ is more economical than one that distinguishes ɛʰ and ɛʸ, as the diacritic mark contains just as much information for the reader to process as the superscript letter does. The superscript system allows a consistent treatment of all quiescent consonants, regardless of whether they're considered matres lectionis or not. —Mahāgaja · talk 19:34, 10 August 2021 (UTC)[reply]
Two letters are more economical than four. And how would the superscript system allow a consistent treatment of all quiescent consonants? Can you give me some examples of this? Also: alephs are not matres lectionis. There is no doubt about that, it's not a matter of opinions. No BH book lists alephs as matres lectionis. I don't understand why so many people are convinced they are. Sartma (talk) 20:09, 10 August 2021 (UTC)[reply]
Two letters with diacritics aren't more economical than four without; another way of looking at it is to say that superscript letters are diacritic marks that are next to the letter they modify instead of on top of (or underneath) it. The consistency is that all silent consonants are written as superscripts, whether they're matres or not. Thus quiescent aleph and the quiescent sin of Issachar mentioned above are treated in the exact same way as matres heh, yod, and waw. —Mahāgaja · talk 20:20, 10 August 2021 (UTC)[reply]
I think we can at least agree that they're definitely more concise. And to look at superscript letters as diacritics, you'd have to change the meaning of "diacritic". Superscript letters as you proposed them don't "modify" the letter next to them. They are just there to show Hebrew orthography, having zero influence on adjacent letters. Moreover, superscript letters are often used in phonetic transcriptions to indicate glides, another reason for me to avoid them. You also conflate quiescent (silent) letters with matres lectionis; I'm still not sure why. Matres lectionis are arguably the most vocal of all Hebrew letters. They are never silent. On the other hand, I understand you might not want to write a quiescent aleph in full (since they can in fact be either voiced or silent depending on their position in a word or inflected form), and if we want to use superscript for quiescent letters, that's ok with me, but we should leave matres lectionis in peace. Always writing uʷ instead of <ū> or <û> (or even just "u"), or iʸ instead of <ī>, <î> (or even just "i"), is neither economical nor reader friendly. It's a lot of visual noise and not intuitive. Sartma (talk) 08:38, 11 August 2021 (UTC)[reply]
@Mahagaja: See my experiment with a superscript-based system at Module:User:Erutuon/he-translit-superscript/testcases. It's not completely satisfactory at the moment because it just leaves out all shva, as proposed in the paper that I based it on, and I had to add an apostrophe to separate two identical consonants separated by shva. I would like to improve it so that it can see some use eventually, perhaps in a Hebrew pronunciation template, even if it doesn't end up as an official transcription. Would you happen to be able to point me to any books that use a more fleshed out superscript-based system? (Actually I think I had a book like that, but I'm not sure where it's gotten to.)
I've come to be more comfortable with the superscript-based system as well. It can clearly mark all the silent (or at least non-consonantal) letters in the Tiberian pronunciation, including those that are considered matres lectionis. It feels random to distinguish vowel points accompanied by particular letters with particular diacritics, when it's my impression (from a paper I ran into, A History of Hebrew Plene Spelling, from Antiquity to Haskalah) that through the history of Hebrew, plene vowel spelling in particular words or sets of words has varied a lot. And when I started trying to learn Biblical Hebrew years back, the use of circumflexes and macrons confused me because the two diacritics didn't indicate different pronunciations. However, Sartma's system is an improvement over the system in my textbook, in that the seven vowel qualities of the Tiberian pronunciation each have their own letter, so the macron and circumflex can be ignored when considering what vowel quality to pronounce.
And the circumflex systems typically only distinguish particular silent letters with diacritics, yod or waw or he after particular vowel points, so the reader probably has to look back at the Hebrew spelling to determine whether other letters in the transliteration are silent, whereas the superscript system can transcribe any silent alef, he, waw, yod with superscripts, and can transcribe other silent letters the same way if they are in Unicode, or if some other way of forming superscripts is available, as with יִשָּׂשכָר (yiśśɔśḵɔr). Module:he-translit/testcases has several odder examples that I copied into my testcases, like חַטֹּאות (ḥaṭṭoˀʷṯ) and יְראוּ (yərˀuʷ) and הֱוֵוה (hĕweʷʰ). I suppose they would be represented in Lambdin's transcription as ḥaṭṭō(ʾ)wṯ, yərʾû, hĕwewh (he sometimes marks unpronounced alefs with parentheses) or in Sartma's proposed transcription as ḥaṭṭoʾwṯ, yərʾû, hɛ̆wɛwh. These transcriptions don't inform readers without special knowledge that any of the letters are silent, whereas assuming I've got the rules right, it's clear from the Hebrew script.
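To make the superscript idea concrete, here is a rough Python sketch. It assumes the input has already been segmented into (letter, is_silent) pairs, which is the hard part and is not attempted here; the name mark_silent and the parenthesis fallback for letters with no Unicode superscript are assumptions of this sketch, not features of the testcases module mentioned above.

```python
# Unicode modifier letters that can stand in for superscript consonants.
SUPERSCRIPT = {
    "\u02be": "\u02c0",  # ʾ (alef) -> ˀ
    "\u0294": "\u02c0",  # ʔ (alef) -> ˀ
    "h": "\u02b0",       # ʰ
    "w": "\u02b7",       # ʷ
    "y": "\u02b8",       # ʸ
}


def mark_silent(segments):
    """Join (letter, is_silent) pairs, writing silent letters as superscripts.

    Letters without a Unicode superscript form are wrapped in parentheses;
    this fallback is an assumption made for the sketch.
    """
    out = []
    for letter, silent in segments:
        if not silent:
            out.append(letter)
        elif letter in SUPERSCRIPT:
            out.append(SUPERSCRIPT[letter])
        else:
            out.append("(" + letter + ")")
    return "".join(out)
```

Feeding it the segments y, ə, r, silent ʾ, u, silent w gives the same shape as the yərˀuʷ example, while a silent ś as in Issachar falls through to the parenthesized form.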
The best argument for circumflexes that I have seen is that î, ê, ô, û include most of the cases of the unchangeable vowels unaffected by vowel reduction in inflected forms. But not all words or in all Hebrew writings apparently, because for instance the pa'al active participle כּוֹתֵב (koʷṯeḇ) has an unchangeable first syllable, but is written as כֹּתֵב (koṯeḇ) in Lambdin's book, yet its plural is כֹּתְבִים (koṯəḇiʸm), not *כְתֵבִים (*kəṯeḇiʸm). Also the circumflex doesn't only indicate unchangeable vowels, because â is used in both Lambdin's system and the Student's Vocabulary system, but represents a different combination in each, qamats-yod and qamats-he respectively, neither of which apparently is unchangeable. And given the variability of plene spelling of vowels, if Wiktionary ever includes quotations from works with a lower percentage of plene spelling, which must be transliterated, there might be quite a few more unchangeable vowels written without circumflexes than there might be in more standard Hebrew spelling. So the circumflex-is-unchangeable rule might end up being full of exceptions in quotations. — Eru·tuon 22:55, 10 August 2021 (UTC)[reply]

Hybrid transliteration proposal 2.0


Hi all, here I am again. After further ruminating, I realised that if all BH words ending in a simple vowel are always followed by the mater lectionis ה, then in transliterating a word, a final simple vowel is more than enough to predict its correct spelling. This means that there is no need for the circumflexes of my previous proposal. This would simplify things even more. I would therefore update my proposal to the following:

  • Basic system: 7 vowel system (a, ɔ, i, e, ɛ, o, u)
  • Vowel + mater lectionis (except ה) = macron
בַ = ba
בָ = bɔ | בָי = bɔ̄ | בָה = bɔ
בִ = bi | בִי = bī
בֵ = be | בֵי = bē | בֵה = be
בֶ = bɛ | בֶי = bɛ̄ | בֶה = bɛ
בֹ = bo | בוֹ = bō | בֹה = bo
בֻ = bu | בוּ = bū
The "shwas" would be: ə, ă, ɛ̆, ɔ̆.
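The prediction this version relies on can be sketched in Python as follows. It takes the premise stated above at face value (a word-final simple vowel is always spelled with ה) and only checks the shape of the transliteration; the function name predicts_final_he is hypothetical.

```python
import unicodedata

# The seven plain vowel letters of the proposal.
PLAIN_VOWELS = set("a\u0254ie\u025bou")  # a ɔ i e ɛ o u


def predicts_final_he(translit):
    """True if the transliteration ends in a plain vowel letter.

    Under the premise of the proposal, such a word is spelled with a
    final he. A final vowel carrying a macron (or any other combining
    mark) points to a different mater lectionis, so it returns False,
    as does a final consonant.
    """
    decomposed = unicodedata.normalize("NFD", translit.strip())
    end = len(decomposed)
    marks = 0
    # Walk back over combining marks attached to the last base letter.
    while end > 0 and unicodedata.combining(decomposed[end - 1]):
        marks += 1
        end -= 1
    if end == 0:
        return False
    return decomposed[end - 1] in PLAIN_VOWELS and marks == 0
```

So a form like Mošɛ is predicted to end in ה, while a form ending in ī or in a consonant is not.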

Usage examples

What do you think? Sartma (talk) 00:24, 23 August 2021 (UTC)[reply]

Sardinian varieties


Hello. Some time ago, I added a few entries and templates (mainly related to verb conjugations) related to the Sardinian language. In the interest of representing the major varieties, these entries were about Logudorese, Campidanese, Nuorese, Sassarese and Gallurese. Looking at the page for Regional Sardinian, it seems that only the first two (Logudorese and Campidanese) are actually taken into consideration here on Wiktionary. Since there isn't a Wiktionary:About Sardinian page, I was wondering if this was official Wiktionary policy. I'm not sure if there is someone in particular I should be pinging, but the three users related to Sardinian seem to not have been active for quite some time. — GianWiki (talk) 17:50, 4 July 2021 (UTC)[reply]

@GianWiki: I don't know much about Sardinian, but Gallurese is a Corsican dialect treated as a separate language on Wiktionary (sdn). Sassarese is equally treated separately (sdc). Thadh (talk) 18:14, 4 July 2021 (UTC)[reply]
@Thadh: Thank you very much for your input. I was not aware of that. – GianWiki (talk) 18:18, 4 July 2021 (UTC)[reply]
Wiktionary:Language treatment documents what's currently treated as a language vs a dialect, with links to the discussions which led to the treatment — which in this case were both rather short. Gallurese and Sassarese are dialects of Corsican, aren't they, and so are treated separately from Sardinian. Whereas, as best I could tell, Campidanese and Logudorese differed only slightly (and overlapped with 'standard' Sardinian, all in a manner that seemed similar to the dialects of Irish) and so I agreed with the 2014 proposal to merge them under ==Sardinian==. You may know more about it than anyone involved in the earlier discussions did; are you looking to keep Campidanese and Logudorese merged under Sardinian (with labels where necessary) or to split them into separate languages? - -sche (discuss) 18:32, 4 July 2021 (UTC)[reply]
As an aside, I wonder if we should link to WT:LT from the Module:languages/data3/a etc pages, to make it more findable. Like "Please check Wiktionary:Language treatment before adding a code to see if it has been intentionally subsumed into something else" or something. - -sche (discuss) 18:36, 4 July 2021 (UTC)[reply]
@-sche: Only admins can add new codes, so I'm not sure that wording is appropriate/needed, maybe a simple {{also}} template at Module:languages will do. Thadh (talk) 18:40, 4 July 2021 (UTC)[reply]
FYI Nuorese is normally considered a subdialect of Logudorese, notable for its exceptional conservatism. Benwing2 (talk) 20:49, 4 July 2021 (UTC)[reply]

Translations by language


Hello, I think this might be a common question, but I couldn't find any answer.

In the same way the "t-needed" template adds the English headword in "Request for translation into XXX" categories, are there categories that contain all English terms for which translations into a target language were already provided (via templates "t", "t+",...), something like "English terms with translations into XXX"?

Sitaron (talk) 19:50, 4 July 2021 (UTC)[reply]

@Sitaron: No, else one could see them at the bottom of the page, which I don’t even with “hidden categories” on; it would also use too much Lua memory, so I know a priori it can’t exist. What may exist is that some users create lists for entries containing translations in certain languages in their userspaces by bot. You may be interested in categories of the format Category:Requests for review of Khmer translations as containing translations needing attention. Fay Freak (talk) 22:38, 4 July 2021 (UTC)[reply]

Oriya vs. Odia


For reasons unclear to me, it's become hip in India to pass laws renaming cities and states. In this vein, "Oriya" got renamed to "Odia" by law in 2011. It's not obvious to me that "Odia" does any better at representing the native pronunciation [oˈɽia] than "Oriya", but given that this change was made at Wikipedia and that even Wiktionary's definition of Oriya labels it as "historical", should we rename the categories here? The placename categories e.g. Category:Odisha already use the spelling "Odisha" in place of "Orissa". Benwing2 (talk) 20:45, 4 July 2021 (UTC)[reply]

  • I prefer the spellings Oriya and Orissa, as they are well-established English spellings. The government may use ‘d’ for ṛ, ‘aa’ etc. for long vowels, and ‘sh’, ‘ch’ etc. for ś, c, but on Wiktionary we follow the good transliteration albeit sans diacritics: therefore r, a, s, c. (For older languages such as the Prakrit varieties, I would prefer using diacritics, though.) The government’s passing laws to change toponyms should ill encourage us to follow suit, in our categories. ·~ dictátor·mundꟾ 21:43, 4 July 2021 (UTC)[reply]
  • “Odia” is ominous for it seems odious, literally the plural of odium (hate). The insight didn’t reach the political will of India but here in the West we must ward them from losing their faces. More, I am skeptical of any language name that is shorter than five characters—there often pop up homonyms, so it is better to avoid them. Similar are those programming languages and environments which you cannot well search about because of their names hailing from common tools or animals. So here users are better served with “Oriya” when they search since “Odia” has known homonyms even if they aren’t language names. Good reason? Fay Freak (talk) 02:43, 5 July 2021 (UTC)[reply]
Wow, here in the West we must protect the Indians from naming their cities as they like? Just, wow.
If the largest English speaking nation in the world uses a name for something in their purview, it's probably the best English name for that thing.--Prosfilaes (talk) 19:55, 5 July 2021 (UTC)[reply]
‘largest English speaking nation’: What? It’s not an Anglophone country, but it’s just that English is an official language. Other than the tiny Anglo-Indian community, no one speaks English as a native language: so whatever steps the policymakers of such a country take, well-established spellings must not be changed by us, because we should use spellings that are already in use by native English speakers. ·~ dictátor·mundꟾ 21:09, 5 July 2021 (UTC)[reply]
It's the largest nation where English is an official language, and it has more English speakers than any nation besides the US. At a certain point, there are second-language communities that are deserving of note, that create works among themselves and for themselves, like the last millennium of Latin usage, and India has more than hit that point.--Prosfilaes (talk) 23:00, 5 July 2021 (UTC)[reply]
Yes indeed! Students from India and other South Asian countries, despite these being formerly an integral part of the British Empire, have to sit English language tests to shew their English qualification before being allowed admission at a college abroad. You are seeing the sheer strength of the population being somehow taught the language at school, but in effect very few have a good knowledge of English, and hardly any with good accent. (And in this particular case of the language name, the policymakers favoured a bad transliteration‡ by rejecting the English spelling, this is not a renaming— there are other instances of renaming, though.) [‡ Our own transliteration: ଓଡ଼ିଆ (oṛiā)] ·~ dictátor·mundꟾ 23:08, 6 July 2021 (UTC)[reply]
While not super well-versed in the matter personally, I agree that for consistency's sake, it should probably be renamed, especially with the cited references already making the switch. Additionally, Glottolog, SIL, Oxford Dictionaries Online, Collins Dictionary, Wikidata, Wikimedia, and the first instance of "Odia" or "Odiya" on the language's own Wiktionary seen at the top here all have made the change to and use "Odia" as the (primary) spelling as well. Also, see this discussion on the English Wikipedia about the matter, with the eventual decision being to move to "Odia" based on the rapidly increasing usage at the time 6 years ago (which is bound to have increased by now). Having the English Wiktionary be the odd one out, especially when the region itself has made the official change to "Odia" seems a bit strange to me, when even linguistic regulators like SIL & Glottolog that are often cited here have made the change themselves. I'd Support a renaming. AG202 (talk) 22:30, 7 July 2021 (UTC)[reply]
More sources: Cambridge Dictionary, the Oxford English-Odia Dictionary (ISBN: 9780199474554), Microsoft, Google (another Google source), the Concise Oxford Dictionary of Linguistics, and multiple Wikimedia blog posts made by Odia natives which label Odia wiki projects using "Odia". AG202 (talk) 05:57, 11 July 2021 (UTC)[reply]
Support renaming per AG202.--Tibidibi (talk) 23:26, 7 July 2021 (UTC) Changed my mind.--Tibidibi (talk) 00:20, 9 July 2021 (UTC)[reply]
No, never, per my earlier comments. I oppose a renaming as someone who works in the language. ·~ dictátor·mundꟾ 17:10, 8 July 2021 (UTC)[reply]
I'd take a never comment more seriously if you oppoſed it as ſomeone who works in þe language, instead of opposing one random change no matter how frequently or universally it is used in English. Tilt against windmills, not hand fans.--Prosfilaes (talk) 22:11, 10 July 2021 (UTC)[reply]
I don’t see SIL & Glottolog often cited here. And they often create false senses or impressions of what things are viewed as. They contain a lot of ghost languages – so that Wikipedia even contains entries for “languages” whose names even are hardly attestable, recently under WT:RFM. That is if you took a closer look, which for specialized topics are few people, who are then not found on Wikipedia, and could not effect anything there anyway because Wikipedians prefer trashy generalist references by the slant that they are accessible and similar to them in scope, the same way people try to add shoddy Indo-European reconstructions on Wiktionary because they can reference it with the American Heritage dictionary.
Those other dictionaries must be suspect. Like Collins now putting it as the lemma form of “British English” though the laws of India do not apply to Britain. We see that those dictionaries take political decisions before being usage-based. Gestures of obedience to vague overlords.
It always was to astonish how a whole country could obsequiously follow the confusing German orthography reform of 1996 and its later versions, which was nowhere required by law except for government workers themselves. Very similar to when 1933 everyone switched from guten Tag to Heil Hitler, one went with the bellwethers, instead of minding the maxim of Bakunin: “To revolt is a natural tendency of life. Even a worm turns against the foot that crushes it. In general, the vitality and relative dignity of an animal can be measured by the intensity of its instinct to revolt.” (Yes, the conservative rises.)
So the interesting question is: Was the spelling change in India grass roots or grass tops like latest summer’s fashion? (For the references do not indicate without doubt that they are on the former side and not the latter.) Fay Freak (talk) 23:16, 10 July 2021 (UTC)[reply]
To be completely honest, the language in your reply was a bit confusing to read and there's a lot that I feel isn't really relevant, but I'll still reply to the relevant points. SIL sets the standard for the ISO 639-3 language codes that are used here, with some changes made here and there, but overall it is cited for new language proposals, deletions, and renamings. Glottolog is also a well-respected linguistic resource and was actually recently cited for the Koreanic family name change that will be happening soon. With Collins, I don't know why it marked it as British English, but if you look at other words like "test," there are specifically British English entries there as well. Oxford is one of the most respected dictionaries and is quoted and cited a ton here throughout many entries. Even the official email for the language for its wikis (see here at the bottom), the titles of its wikis in English, and multiple Wikimedia blog posts made by Odia natives use "Odia". And then to add on, Microsoft & Google (another Google source), Cambridge Dictionary, the Oxford English-Odia Dictionary (ISBN: 9780199474554), and the Concise Oxford Dictionary of Linguistics all use the "Odia" spelling. Thus, if it's an argument of which one is used and established more in English, the answer should be clear. Additionally, there are more sources in the Wikipedia talk page that I sent that show the comparisons of "Odia" vs "Oriya" up to that time, illustrating the trend of "Odia" having more usage in English over time, along with opinions from native speakers. AG202 (talk) 05:40, 11 July 2021 (UTC)[reply]
I know exactly what SIL and Glottolog are. The ISO 639-3 language codes we of course reuse here so one doesn’t have to learn twice. I lay emphasis though on Wiktionary as a secondary source. Hence the acceptance of the moves of those dictionaries and encyclopediae was not automatic.
The “major English newspapers in India”, counted in usage at the Wikipedia move discussion, though just with Google, may be more of an argument. But against this one must caution because journalists are a very special kind of people, in the habit of copying from each other. It is a fact that their languages contain elements not found or not found in the same frequency in other types of texts. So we see The Guardian using the striking spelling bandoe for bando, as if they have just learned the word—but if one of these guys starts to spell it that way the others follow suit, no matter how striking it is (because they don’t know or don’t care, the important thing is that they are the spearheads), and there is nothing organical about that.
A table of uses in social media could be more convincing, weren’t it that these were full of bots and ad-men and other paid shills close to government.
I have yet to see expansion on the claim that this is a naturally used spelling and not an orchestrated one. If someone wants a spelling-change and diffuses it top-down it should perhaps have the opposite effect. Fay Freak (talk) 12:06, 11 July 2021 (UTC)[reply]
Maybe they are revolting, against some people from the West who think they can dictate names of Indian languages based on Latin. You've yet to see any evidence that this is a naturally used spelling because you've excluded all sources of recent quotes. I don't honestly know what sources would interest you; I suspect none that didn't support you. Journalists are distinctive, partially in that they publish more text than just about anyone else; the Washington Post alone offers 300,000 words a day, not including wire articles. There is nothing "organical" about using words like "organical" in modern English texts, but languages exist, evolve and grow because humans adapt their way of speaking and writing to better communicate with those around them.
As a note of interest, Amazon has a bunch of books labeling themselves as in Odia that Amazon labels as being Oryia.--Prosfilaes (talk) 22:43, 11 July 2021 (UTC)[reply]

Once more on hyphenation for Korean suffixes and particles

It's been four months since the latest discussion on this, and I've become once more convinced of the advantages of hyphenation of 1) all suffixes and particles in the verbal paradigm, 2) all postpositions/nominal particles, and 3) native affixes without unbound counterpart. There are the following benefits:

  • This 2011 discussion led to the abolition of hyphens in Korean, but Korean was not in fact discussed; it was all about Japanese. Japanese is kind of a red herring in this discussion because of a fundamental difference in orthography: Japanese does not use orthographic spaces, whereas modern Korean does. Hyphenation makes sense in such an orthographic context.
  • Better readability. More-or-less "complete" entries like 이 (i) end up being very cluttered due to the mix of bound morphemes and full-fledged words, with nearly two dozen independent etymologies. Such long pages would be difficult for learners to navigate.
  • Consistency with other languages. With the exception of Japanese and languages written in the Arabic script, all languages with significant morphology use hyphens to mark bound morphemes, e.g. Sanskrit (Category:Sanskrit suffixes).
  • Consistency with actual practice. Monolingual dictionaries and linguistic works lemmatize verbal suffixes with hyphens, e.g. 더라 in 표준국어대사전. While case markers and the like are not hyphenated in dictionaries, it is not difficult to find various academic sources that do so, e.g. 주격조사 ‘가’의 발달.
  • When the functionality of hyphenation in Korean romanization is restored (this is one of the stated parameters of Template:ko-usex but has apparently been defunct for several years), hyphenization of particles might be useful in usex formatting. For example:
    {{ux|ko|-이 오길래 -을 감고 얼굴-을 -에 묻었다.}}
If possible, we could tell Module:ko to keep the link to the hyphenated forms but remove the hyphens in the actual display, while the hyphens are retained in the romanization. This would have two benefits. First, it would produce
Nun-eul kkwak gamgo eolgur-eul du son-e mudeotda.
Instead of the current suboptimal
Nuneul kkwak gamgo eolgureul du sone mudeotda.
Second, the hyphenation would allow linking directly to the particles, e.g. -이 (-i) instead of 이 (i) in general. Currently, you need to use {{anchor}} or {{senseid}} to do this, and it is annoying to 1) type out the IDs and 2) remember/look up the ID the particle is assigned.
I might be way off on the technical feasibility of this, though. @Suzukaze-c

On the other hand, there are problems with the proposal:

  • The treatment of certain affixes:
    • In my opinion, we should keep all Sino-Korean affixes where they are, at non-hyphenated forms.
    • Native affixes with unbound etymons like 수 (su, male), 개 (gae, fucking (vulgar)), etc.: I am ambivalent about these but I think they can be kept where they are, at non-hyphenated forms.
    • 하다 (hada) and 되다 (doeda) should obviously not be hyphenated because these are not actual suffixes, even though the 표준국어대사전 categorizes them as such.

So only native morphemes that exist only in bound form would be affected.

  • The technical feasibility of a move:
    • There are currently only 214 entries in Category:Korean suffixes, of which about two-thirds on a cursory glance would qualify for hyphenation. There are 74 entries in Category:Korean particles. These are few enough to be moved in a single day.
    • A much bigger issue is links in usexes and quotations, which would have to be manually retargeted. If the proposal passes, I would consider a hard redirect for words like 를 (reul) or 습니까 (seumnikka) just to preserve the functionality of usexes and quotations. There are 342 words with quotations (Category:Korean terms with quotations), all of which would be affected, and 2,486 in Category:Korean terms with usage examples, only some of which would be affected since many of the usexes are poorly formatted and lacking links. The retargeting can probably be done gradually, like the phasing out of {{etyl}}.
    • {{af}} has not been used by editors of Korean until last year, so Category:Korean words by suffix is rather sparse. There seem to be less than 200 pages which would currently be affected by hyphenation of verbal suffixes.

I would note that the technical issues will only grow as the quality of Korean entries improves, so if hyphenization is ever to be done, it should be as soon as possible.--Tibidibi (talk) 08:23, 5 July 2021 (UTC)[reply]

Pinging participants in the previous discussion: @Solarkoid, Eirikr, Omgtw15, Atitarev.--Tibidibi (talk) 08:24, 5 July 2021 (UTC)[reply]
Support. — Omgtw15 (talk) 10:02, 5 July 2021 (UTC)[reply]
I support the hyphenation on entries, but I would never use them in usex, so I oppose that. Hyphens don't belong in a Korean text. Sartma (talk) 10:35, 5 July 2021 (UTC)[reply]
@Sartma I think you misunderstand; the hyphens would be stripped by code, leaving their trace only in the links and the Romanization (but not visible in the display of Korean text). I'm not sure if this is technically feasible, however. This used to be possible in {{ko-usex}} until around 2017, but then the relevant module was rewritten and it became defunct.--Tibidibi (talk) 10:44, 5 July 2021 (UTC)[reply]
@Tibidibi: Argh! Yes, I did misunderstand! I'm terribly sorry for that, I didn't read your text properly. I just had a moment of freak-out/panic when I saw the hyphens in the Korean. (^m^;;;; Sartma (talk) 11:00, 5 July 2021 (UTC)[reply]
@Tibidibi I'm not following everything above but I'm pretty sure the technical issues are solvable and I can help with them. I reimplemented Module:compound a couple of years ago (which handles {{affix}} among other things) and cleaned up the support for hyphens; changing the handling of hyphens on a language-specific basis is pretty easy. I'm sure Module:usex can be fixed to support whatever hyphen-related functionality you want (for that matter, anything in {{ko-usex}} should be foldable into Module:usex so we don't need a special Korean-specific version). Benwing2 (talk) 19:45, 5 July 2021 (UTC)[reply]
@Benwing2 What would need to be done is that {{usex|ko|[[사람]][[-이]]}} should produce:
사람이
Saram-i
That is, the hyphens are preserved in the wikilink and romanization but stripped in the display of the text. Would this be possible?
If getting them to show up in the romanization is too hard (I believe hyphens are currently automatically suppressed by the Korean transliteration module), that can be delayed to some later time.--Tibidibi (talk) 01:00, 6 July 2021 (UTC)[reply]
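A minimal sketch of the behaviour requested above, written in Python only for illustration (the real change would live in the Lua modules). The function name `render_usex` and the decision to strip every ASCII hyphen from the displayed text are assumptions, not how Module:usex currently works; romanization itself is left out, and the sketch only shows which string would be handed to the transliteration step.

```python
import re

# Matches [[target]] and [[target|label]] wikilinks.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def render_usex(wikitext):
    """Return (display_text, link_targets, translit_input) for a usex string.

    Hypothetical helper: hyphens survive in the link targets and in the text
    passed on for romanization, but are removed from the displayed Korean.
    """
    # Link targets keep their hyphens, so [[-이]] still points at the entry "-이".
    targets = [m.group(1) for m in LINK.finditer(wikitext)]
    # Replace each link by its label (or its target if there is no label).
    plain = LINK.sub(lambda m: m.group(2) or m.group(1), wikitext)
    display = plain.replace("-", "")   # shown to the reader: no hyphens
    translit_input = plain             # given to the romanizer: hyphens kept
    return display, targets, translit_input
```

With the example from the comment above, `render_usex("[[사람]][[-이]]")` gives the display text 사람이, the link targets 사람 and -이, and 사람-이 as the input for romanization.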
@Tibidibi This can definitely be done, although it would be easier and cleaner to implement if hyphens weren't suppressed by the translit. Is there a reason they are stripped? If they are stripped in most places in the translit but not in usexes, it effectively means we have two different transliteration schemes. Benwing2 (talk) 01:17, 6 July 2021 (UTC)[reply]
@Benwing2 Until a few years ago, hyphens were not suppressed by the translit and were in fact used to separate particles from the noun, as I'm suggesting we return to (you can still see hyphenated usexes in some of the oldest usage examples). This was removed after a redesign of Module:ko-translit, apparently because the hyphens interfered with the transliteration. {{m|ko|얼굴-에}} is supposed to be transliterated as eolgur-e, because /l/ is realized as [ɾ] in intervocalic position within a word, but the hyphen made it think these were two separate words and transliterated eolgul-e. The hyphens were then suppressed in the transliteration to prevent this.
Do you think you could fix this?--Tibidibi (talk) 01:36, 6 July 2021 (UTC)[reply]
@Benwing2 OTOH this account might not be totally correct since I wasn't around for it, and there isn't any official statement I could find about why the hyphen functionality was removed from the translit. I think the user who changed it was Wyang, who is now gone.--Tibidibi (talk) 02:05, 6 July 2021 (UTC)[reply]
@Tibidibi This would be easy to fix as long as there aren't genuine cases where hyphens are used to separate two words and the written l needs to be transliterated as l in those cases. Benwing2 (talk) 02:10, 6 July 2021 (UTC)[reply]
@Benwing2 There are no such cases.--Tibidibi (talk) 02:14, 6 July 2021 (UTC)[reply]
@Tibidibi OK, I should be able to get to this within a day or so. I see where in the code it's removing the hyphens and it's just a case of figuring out where it converts l to r before a vowel and make it ignore an in-between hyphen. In the meantime can you construct some test examples for me with hyphens in them that might be tricky for the module to get right, along with what you expect to be generated? 얼굴에 (eolgur-e) is one such example. Benwing2 (talk) 02:23, 6 July 2021 (UTC)[reply]
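For anyone drafting test cases, here is a rough Python sketch of the rule being discussed: a syllable-final ㄹ counts as intervocalic (r) when the next syllable starts with the silent ㅇ, even if a hyphen sits in between. This is not the code of Module:ko-translit; the function name is made up, and the sketch ignores everything else the real module has to handle (ㄹㄹ sequences, complex finals, and so on). It relies only on the standard arithmetic layout of the Unicode Hangul syllable block.

```python
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3
FINAL_RIEUL = 8      # index of ㄹ among the 28 final-consonant slots
INITIAL_IEUNG = 11   # index of the silent initial ㅇ among the 19 initials

def is_hangul_syllable(ch):
    return HANGUL_BASE <= ord(ch) <= HANGUL_LAST

def final_index(ch):
    return (ord(ch) - HANGUL_BASE) % 28

def initial_index(ch):
    return (ord(ch) - HANGUL_BASE) // 588

def rieul_realizations(text):
    """Return 'r' or 'l' for each syllable-final ㄹ, looking past hyphens."""
    out = []
    for i, ch in enumerate(text):
        if not (is_hangul_syllable(ch) and final_index(ch) == FINAL_RIEUL):
            continue
        # Skip any hyphens so that 얼굴-에 is treated like 얼굴에.
        j = i + 1
        while j < len(text) and text[j] == "-":
            j += 1
        nxt = text[j] if j < len(text) else ""
        vowel_follows = (
            nxt != ""
            and is_hangul_syllable(nxt)
            and initial_index(nxt) == INITIAL_IEUNG
        )
        out.append("r" if vowel_follows else "l")
    return out
```

In this sketch a space still counts as a word boundary, so only the hyphen is transparent to the rule.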
@Benwing2 Supporting bold formatting would also be nice. Module:ko-pron/testcases. —Suzukaze-c (talk) 06:43, 7 July 2021 (UTC)[reply]
@Benwing2 Sorry to bother you, but would it be possible for you to get to this by the end of this weekend?--Tibidibi (talk) 00:21, 9 July 2021 (UTC)[reply]
@Tibidibi Yes. My apologies, I will try to get to this tomorrow (Sunday). Benwing2 (talk) 05:59, 11 July 2021 (UTC)[reply]
@Benwing2 Sorry for the late notification, but is it possible that this could be expanded to Jeju (ISO: jje) as well? Just for consistency's sake. AG202 (talk) 15:27, 17 August 2021 (UTC)[reply]
Support. The page for is getting very unwieldy, and I agree that the hyphenated entries would lead to less confusion and would be easier to find for the everyday user. I would also suggest that Jeju and its entries be added to this proposal as well, just for consistency's sake. AG202 (talk) 22:02, 5 July 2021 (UTC)[reply]
Support. The various points made above all make sense to me, and I have no concerns with this proposal moving forward. ‑‑ Eiríkr Útlendi │Tala við mig 00:15, 7 July 2021 (UTC)[reply]
Support. kwami (talk) 23:16, 10 July 2021 (UTC)[reply]

FYI, this is what McCune–Reischauer says with regard to this very issue.

"The nouns, likewise, should be written together with their postpositions, including those called case endings, not separately as in Japanese, because phonetically the two are so merged that it would often be difficult and misleading to attempt to divide them."

For example, should 낮에 be romanized as "naj-e" or "nat-e" or "na-je"? All of these are unsatisfactory and misleading. This is why McCune–Reischauer says that it should be simply "naje". --2607:FB90:5AEA:E837:B427:4261:38B5:2C21 04:00, 12 July 2021 (UTC)[reply]

낮에 is /nat͡ɕe/ phonemically and 낮에 morphologically, so naj-e should be the preferred hyphenation and transliteration. Transliterating 낮에 as naje does not respect the morphophonemic nature of contemporary Hangul orthography, so if anything the lack of hyphenation is what is unsatisfactory. There is a clear orthographic distinction between 낮에 and 나제; how else would you mark this in the transliteration?
The only issue (which is honestly more an issue with Revised Romanization than with the principle of hyphenation in itself) is the clusters involving /h/ that surface as single aspirate consonants, but I don't see it as that big of a problem.--Tibidibi (talk) 04:18, 12 July 2021 (UTC)[reply]
Well, this is what McCune–Reischauer (and I) mean.
"naj-e" is misleading because it suggests that 낮 is pronounced /nat͡ɕ/ in isolation.
"nat-e" is also misleading because 낮 is not pronounced /nat/ when followed by a particle (postposition).
"na-je" is also misleading because it doesn't match morpheme boundary.
So it should be simply "naje".
낮에 and 나제 don't have to be distinguished in romanization. Rather, due to hangul's syllabary-like feature (모아쓰기), those two were "made" to be distinguished in Korean spelling. If Korean didn't use a writing system with syllabary-like feature, then the orthographic distinction between those two would not exist from the beginning (i.e., there would only be ㄴㅏㅈㅔ from the beginning). --2607:FB90:5AEA:E837:B427:4261:38B5:2C21 05:18, 12 July 2021 (UTC)[reply]

Splitting WT:RFVN

I split off the CJK-related stuff in WT:RFVN into WT:RFVCJK. I did this by checking for either the language codes 'zh', 'ja' or 'ko' in the entry title or an East Asian character in the entry title. As a result, a couple of entries not in Chinese/Japanese/Korean got moved: I noticed one in Vietnamese and one in Ainu. Not sure this is desired; if not, I or someone else can move those two entries back to WT:RFVN. I fixed {{rfv}} and {{rfv-sense}} so they will automatically add to WT:RFVCJK instead of WT:RFVN if the language is Chinese, Japanese or Korean. The result of this is that WT:RFVN is now about 2/3 its previous size, which should help somewhat, although IMO it's still unwieldy. I'm thinking of also splitting off the Romance languages, as discussed prior; this will require a bit more work as there are more Romance languages than CJK languages and you can't identify pages to split off by character set. Benwing2 (talk) 21:08, 5 July 2021 (UTC)[reply]
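The selection rule described above can be sketched roughly as follows. This is an illustration in Python, not the script that was actually run; the function name and the particular Unicode ranges treated as "East Asian" are assumptions. It also shows why a Vietnamese entry written in Han characters or an Ainu entry in katakana ends up on the CJK page.

```python
CJK_LANG_CODES = {"zh", "ja", "ko"}

# Assumed set of "East Asian" ranges; the real check may have used others.
EAST_ASIAN_RANGES = [
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7A3),    # Hangul Syllables
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

def has_east_asian_char(title):
    return any(
        low <= ord(ch) <= high
        for ch in title
        for (low, high) in EAST_ASIAN_RANGES
    )

def belongs_on_cjk_page(lang_code, title):
    """True if a request would be moved by either of the two criteria."""
    return lang_code in CJK_LANG_CODES or has_east_asian_char(title)
```

Restricting the move to the language code alone would leave the Vietnamese and Ainu entries on WT:RFVN, at the cost of missing requests whose language code is absent or wrong.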

I think Ainu is OK, ain should be in Template:rfv. For Vietnamese, four contributors: Ekirahardian, PhanAnh123, Bula Hailan, ColePeltier93 can all write some Chinese Characters. EdwardAlexanderCrowley (talk) 04:04, 7 July 2021 (UTC)[reply]

Deleting Italian reflexive participles

(Notifying GianWiki, Metaknowledge, SemperBlotto, Ultimateria, Jberkel, Imetsia, Sartma): User:SemperBlotto bot-created thousands of Italian reflexive participles some time back. Examples:

AFAIK, all of these forms are obsolete and the vast majority are both unattested and unattestable because they are not part of the modern language and no one is composing texts in obsolete Italian any more (unlike e.g. Latin). I can't actually find a single one of these terms that has any non-bot edits.

Yes, "all words in all languages" but Wiktionary also has an attestation criteria (WT:CFI) and the vast majority of these terms would fail that. I propose to delete all of them and require at least one actual attestation before adding any of them back (which means no adding by bot). Benwing2 (talk) 22:11, 5 July 2021 (UTC)[reply]

@Benwing2: I thought that the attestation criteria was about a word or phrase per se, not all its inflected forms (Italian is not a dead language). If "abbottonarsi" exists (as it does), that should be enough to prove the existence of all its regularly inflected forms, as any native speaker could use them, even if you can't find them on Google. Sartma (talk) 08:22, 29 July 2021 (UTC)[reply]
Formally, I think the attestation criteria is about an entry, not including any inflected forms. There's generally more leniency shown to inflected forms, since it's rarely worth the time to cite them, but there's also frustration when large numbers of uncitable entries are created by bot. They also can be problematic, creating words that are non-idiomatic, that aren't citable because that's not how a native speaker would say that, or that are wrong because some orthographic rule is ignored by the bot. Especially in a well-documented language, if a lot of them are unattestable, they really need to be citable instead of automatically created.--Prosfilaes (talk) 12:18, 29 July 2021 (UTC)[reply]
@Prosfilaes: I understand not wanting bot-created wrong forms, but there's nothing wrong, strange or unusual with the ones listed above, with exception of the present participles, maybe. The present participle is not really productive in Italian anymore. All the others are just common standard Italian. Sartma (talk) 10:17, 30 July 2021 (UTC)[reply]
I searched for a few at random and they didn't even give Google hits. I support deleting them and putting the burden of proof on anyone who wants to add them by hand to try to cite them. Ultimateria (talk) 01:56, 9 July 2021 (UTC)[reply]
Those are all perfectly formed Italian words. They are not "obsolete". Some might be "rare", but only because the scenario they describe doesn't happen often, but that doesn't make them either wrong or obsolete. Some verbs are less used than others, but I find it strange that you couldn't find "regalatami", "speditagli", "abbattutasi" etc. They all are very natural and normal Italian words. They are inflected forms, so I don't know to what degree you want all inflected forms to be entries. But, like, they can all be created by knowing the infinitive. If you know "abbattersi", you can easily form "abbattutosi". Sartma (talk) 08:13, 29 July 2021 (UTC)[reply]

Anagrams

Some languages (e.g. English, Italian) have anagram sections at the bottom. These haven't been updated in over 10 years; User:Conrad.Bot was doing it but is no longer active. Should we (a) leave them alone, (b) delete them, (c) update them (somehow)? Benwing2 (talk) 05:25, 7 July 2021 (UTC)[reply]

Isn't User:NadandoBot doing this already? I would support updating them in any case though. Thadh (talk) 09:19, 7 July 2021 (UTC)[reply]
Yes, in the recent past I updated them (for English, Finnish and Danish only). I mainly find it impractical due to the large number of pages that may have to be edited (of course the program does it, but still). Maybe there is a solution that doesn't require editing each individual page, and crucially does not have a large memory footprint. DTLHS (talk) 02:00, 8 July 2021 (UTC)[reply]
@DTLHS Given the current way of handling anagrams, the only way I see of updating them is to go through the dump file, construct a map from alphagrams to pages containing those alphagrams, check each page in the dump file to see if its anagrams need updating, and edit those pages needing updating. I assume your code does essentially this. This will take a good amount of memory esp. for English. If you don't have enough memory to hold the whole map, it should be possible to hack up a map-reduce type of solution to process the dump in chunks, using the disk for intermediate storage. I have done that in the past to do things like sort the Wikipedia dump file on a 16G memory Macbook Pro.
BTW I assume you used the {{anagrams}} template and inserted an alphagram at the beginning? I found an example from late 2018 where your bot formatted things this way (angriest). I can't find any Danish examples though. The Italian anagrams use a random combination of formatting with {{anagrams}}, with multiple calls to {{l}} on a single line, and with each anagram on its own line formatted with {{l}}. Benwing2 (talk) 05:00, 11 July 2021 (UTC)[reply]
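As a rough illustration of the chunked approach mentioned above, here is a Python sketch that routes each title to one of several on-disk buckets keyed by its alphagram and then groups one bucket at a time, so only a fraction of the map is in memory at once. It is not DTLHS's code; the function names, the bucket count, and the English-style alphagram (lower-cased letters only) are assumptions, and parsing the dump is left out.

```python
import os
import tempfile
import zlib
from collections import defaultdict

def alphagram(title):
    # Assumed English-style key: lower-case letters only, sorted.
    return "".join(sorted(c for c in title.lower() if c.isalpha()))

def anagram_groups_on_disk(titles, buckets=4):
    """Group titles by alphagram while holding only one bucket in memory."""
    result = {}
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket{i}.tsv") for i in range(buckets)]
        files = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # "Map" step: equal alphagrams always land in the same bucket.
            for title in titles:
                key = alphagram(title)
                if len(key) < 2:
                    continue
                index = zlib.crc32(key.encode("utf-8")) % buckets
                files[index].write(f"{key}\t{title}\n")
        finally:
            for f in files:
                f.close()
        # "Reduce" step: each bucket can be grouped independently.
        for path in paths:
            groups = defaultdict(list)
            with open(path, encoding="utf-8") as f:
                for line in f:
                    key, title = line.rstrip("\n").split("\t", 1)
                    groups[key].append(title)
            for key, members in groups.items():
                if len(members) > 1:
                    result[key] = sorted(members)
    return result
```

In a real run the per-bucket results would be compared with each page's existing anagram section and written out as edits rather than collected in one dictionary.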
I should have said Swedish. And that was my method. The memory wasn't a problem, it was more of a problem keeping up with English which has lots of new pages added every month. DTLHS (talk) 00:34, 12 July 2021 (UTC)[reply]
Update them if you can. Otherwise leave them alone. It's not your business to delete good content. Jesus I've had enough of you people hating on anagrams. Equinox 09:29, 7 July 2021 (UTC)[reply]
@Equinox Really, "you people"? Have you seen the usage note on that term? Benwing2 (talk) 01:02, 8 July 2021 (UTC)[reply]
>Telling me to learn English
I had no idea about that and I speak British English. It's very rude of you to expect that your American rules apply across the world (but typical). Equinox 05:52, 10 July 2021 (UTC)[reply]
You must have a stick up your ass or something; I really don't know what your issue is but you seem to relish insulting people and picking fights. Benwing2 (talk) 19:05, 10 July 2021 (UTC)[reply]
But leave them alone or update them. I see no reason for discarding them. Like nobody would come about and censure us as a sketchy website by reason that we haven’t even updated anagrams. Fay Freak (talk) 01:37, 8 July 2021 (UTC)[reply]
We could do anagrams the way we historically did rhymes, i.e. have a page like "Anagram:aet" that lists ate, eat, ETA, etc, and then all of those pages just link to Anagram:aet. Then, whenever a new anagram is added, only the page for the new anagram and the central Anagram: page have to be updated, not every single other word spelled with those letters. Or we could do this in the way that it's more recently been proposed to do rhymes, i.e. with categories, which could be automatic and perhaps even easier to maintain. - -sche (discuss) 18:02, 8 July 2021 (UTC)[reply]
@-sche Can you explain the more recent proposal for rhymes, and how it would apply to anagrams? Benwing2 (talk) 04:38, 11 July 2021 (UTC)[reply]
The way I imagine anagram categories working would be that {{head}} would add a category with the sorted characters as an alphagram. This seems dubious to me as there would be a very large number of alphagram categories most of which would have only a single entry. DTLHS (talk) 00:26, 12 July 2021 (UTC)[reply]
We could do it the hard way and require an entry to provide an anagram (not itself) for it to be included in the category. However, what do we do about head's, which I presume is an anagram of shade, but is not eligible for an entry? What languages do we have definitions of anagrams for? We currently record English face and café as anagrams of one another, but we can't treat a and ä as the same for Danish. For Danish, do we get unusual rules such as aa and å matching? For Swedish, I presume ö is a different letter to o. What about Welsh anagrams? @Octahedron80 pushed the rule that two words are anagrams of one another if they have equal bags of Unicode letters (I'm guessing in form NFD), but that gives what seem to be some odd results in Thai, such as spacing vowel symbols counting but not non-spacing vowel symbols. --RichardW57 (talk) 17:15, 12 July 2021 (UTC)[reply]
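To make the language dependence concrete, here is a small Python sketch of a per-language alphagram key. The rule table is an assumption for illustration only: English folds diacritics away (so face and café match), while Swedish and Danish keep å, ä, ö and the like as distinct letters. It does not settle the open questions above, such as Danish aa versus å, Welsh digraphs, or Thai vowel signs.

```python
import unicodedata

# Assumed rule table: languages in which diacritics make a different letter.
KEEP_DIACRITICS = {"sv", "da"}

def alphagram(word, lang):
    """Sorted-letter key for `word` under the (assumed) rules for `lang`."""
    lowered = word.lower()
    if lang in KEEP_DIACRITICS:
        # Compose first so that å is one letter rather than a + ring.
        text = unicodedata.normalize("NFC", lowered)
    else:
        # Decompose and drop combining marks, so é is treated as e.
        decomposed = unicodedata.normalize("NFD", lowered)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(sorted(c for c in text if c.isalpha()))
```

Under these assumptions `alphagram("café", "en")` equals `alphagram("face", "en")`, whereas får and far get different keys when the language is Swedish.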
Ignoring non-spacing symbols follows the same thinking as Thai Scrabble "คำคม" or general crosswords, which put ะ า ำ เ แ โ ใ ไ into separate cells. Not ignoring the symbols would lead to far fewer results. The rules of anagrams differ per language. (I made {{th-anagram}} for Thai.)--Octahedron80 (talk) 23:30, 12 July 2021 (UTC)[reply]
Whereas in Thai crosswords some cells have non-square shapes to accommodate marks above and below. I would have expected the elements for natural anagrams to correspond to primary sort keys for collation; you're saying technological limitations have prevailed. (The elements you list are also the elements for intra-word spacing.) Do Thais see any sort of significance in words or phrases being anagrams of one another? I know spoonerisms are significant. --RichardW57m (talk) 08:35, 13 July 2021 (UTC)[reply]
I'm not sure what you mean. Not ignoring non-spacing marks was possible, but I chose not to do so. Thai has too many letters to match; it is not like Latin with just A-Z. Compare Module:th-anagram/processed data, resulting in 1000+ sets, and User:Octahedron80/sandbox, resulting in only 200+ sets. (Intra-word spaces are ignored in the same way as punctuation marks.) Do you really want the latter? Please confirm and I will replace it. About spoonerisms, most of them are SOPs or meaningless; you won't find many such words in Wiktionary. --Octahedron80 (talk) 08:53, 13 July 2021 (UTC)[reply]
Do Thais lack a naturalised concept of 'anagram'? If they have one, that is what we would want. If the only Thai concept is one of rearrangements of letters on tiles for a board game, then I am not sure we want the Thai concept of anagrams. The 1000+ sets strike me as wrong, but that reflects my (English) culture and attempt to understand others. It's possible that we might want an English-speaker's notion of anagrams of Thai words, though Los Angeles and Delhi might not have a common concept. What do others think? --RichardW57m (talk) 12:34, 13 July 2021 (UTC)[reply]
Anagrams were really invented for puzzles and board games, anyway. No comments from others for 7 days. I will change it to match all letters then. --Octahedron80 (talk) 00:05, 21 July 2021 (UTC)[reply]
At least option a, if somebody is up for implementing option c or -sche's proposal, that's fine. ←₰-→ Lingo Bingo Dingo (talk) 09:21, 10 July 2021 (UTC)[reply]
Per - -sche, the way it's heading, we absolutely should just go ahead and create an Anagrams: namespace comparable to what we do with the Rhymes: namespace. I would prefer namespace pages to categories, as we still have no system (so far as I am aware) of watchlisting category contents to be alerted when a new categorization is made. bd2412 T 05:12, 11 July 2021 (UTC)[reply]

Some tips here if somebody wants to make an anagram module. (I hope you can understand.)

How to find anagrams

  1. Get a list of every word in a language. Both lemmas and non-lemmas are expected.
  2. Make an index for every entry in the list by binary-sorting its letters, with or without diacritics depending on the language. (See alphagram)
  3. Collect entries whose indices match each other, skipping blank and 1-character indices. A blank index will certainly occur.
  4. Drop unmatched, isolated entries. What remains are the anagrams.
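The four steps above can be sketched in Python as follows. The function names are made up, and the default key (code-point sort of lower-cased letters) is an assumption; a language-specific key, with or without diacritics, can be passed in instead, as step 2 says.

```python
from collections import defaultdict

def default_key(word):
    # Step 2 (assumed default): sort the letters by code point.
    return "".join(sorted(c for c in word.lower() if c.isalpha()))

def find_anagrams(words, key=default_key):
    """Return {index: sorted list of words} for every real anagram set."""
    groups = defaultdict(set)
    for word in words:               # Step 1: every lemma and non-lemma
        index = key(word)            # Step 2: build the index
        if len(index) < 2:           # Step 3: skip blank and 1-character indices
            continue
        groups[index].add(word)      # Step 3: collect matching entries
    # Step 4: drop isolated entries; only sets with two or more words remain.
    return {i: sorted(ws) for i, ws in groups.items() if len(ws) > 1}
```

An entry made up only of punctuation, such as "!!", produces the blank index mentioned in step 3 and is skipped.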

--Octahedron80 (talk) 02:14, 13 July 2021 (UTC)[reply]

PS - Japanese anagrams would be a catastrophe since they must extract all kana from kanji.

@Benwing2 Sorry I missed your question earlier, but yes, like DTLHS says, the way to do anagrams via categories would be to have categories for the "alphagrams" ("Category:English anagrams of aet", for example, listing "ate", "eat", etc). These could either be added automatically by {{head}} iff that wouldn't use too much memory (but it would result in many categories with only one entry, as DTLHS warns), or added manually or by bot (using less memory, and allowing addition to be restricted to cases where the category would have multiple entries). Even manual or bot categorization would be somewhat less tedious than the current setup, because when an entry is deleted, no other pages would have to be updated, unlike at present. (With an "Anagrams:" or "Anagram:" namespace, only the entry and the one central anagram page would have to be updated whenever an entry was created or deleted, also reducing how many pages have to be updated.) - -sche (discuss) 01:13, 15 July 2021 (UTC)[reply]

Taxonomic names


'The question of inclusion of taxonomic names is a matter for, first, the Beer Parlor, then a vote' (DCDuring). My opinion is to include both the generic epithet and the full binomial name (if I got my terminology right) as entries: really I just want their etymologies in Wiktionary. I'm not so convinced about including specific epithets by themselves; in particular, I don't think having a specific epithet entry would be enough reason not to include the binomial entry, since there could be more than one way in which the attribute described by the specific epithet is used in binomial names. Kritixilithos (talk) 14:15, 7 July 2021 (UTC)[reply]

We should not be including binomials (Pan troglodytes, Peromyscus maniculatus). As analogies, consider the rules for full names of people and chemical formulae. But more important to me, the internet is already overloaded with automatically created lists of species names. Most of these are harmful because they hide the few legitimate sources of information. ceb.wikipedia is a notorious offender and should be nuked from orbit. Wikispecies is the place for these names. We should also not be including specific epithets (nevadensis, slossonae) that exist only in taxonomy but I feel less strongly about this. Hit it with a bunker buster instead of a nuke. Vox Sciurorum (talk) 18:47, 18 July 2021 (UTC)[reply]

References vs. Further reading


Currently, the Italian entries randomly put links to the same term in additional dictionaries either under "References" or "Further reading". I would like to clean this up. I assumed that these links ought to go under "Further reading" on the assumption that References is reserved for footnotes, but WT:ELE says this under "Further reading":

The “Further reading” section contains simple recommendations of further places to look.
This section may be used to link to external dictionaries and encyclopedias, (for example, Wikipedia, or 1911 Encyclopædia Britannica) which may be available online or in print.
This section is not meant to prove the validity of what is being stated on the Wiktionary entries (the “References” section serves that purpose).

After reading this, especially the final sentence, I'm thoroughly confused. These links do consist of pointers to external dictionaries (hence the second-to-last sentence makes sense), but the links are not provided just for random edification but usually to help verify the correctness of the Wiktionary entry. For example, when I insert a link to {{R:it:DiPI}} (a dictionary of Italian pronunciation), it's often because there's something in the pronunciation that isn't obvious (e.g. the presence of secondary stress or a hiatus), and the link is what I used to source this pronunciation. Many existing entries contain links to {{R:it:Treccani}}, which was clearly the source for the definition(s) in the entry, since this is the most authoritative Italian dictionary. So ... do these links go under "Further reading" or "References"? It should not come down to the purpose of the links (whether they are for verification or "simple recommendations of further places to look"); that is entirely subjective and an impossible standard to make use of. Benwing2 (talk) 03:24, 8 July 2021 (UTC)[reply]

One could make it more objective by adding a comment (perhaps an HTML comment) to the references to say why the reference is being cited as a reference. I'm not happy with the rule that we don't use in-line references, but I appreciate that they can look ugly and sometimes be confusing. An HTML comment should protect against an editor excusably moving the reference to 'further reading'. --RichardW57 (talk) 05:05, 8 July 2021 (UTC)[reply]
I feel like the distinction between these which was intended when they were separated is clearly so commonly not maintained in practice that it (together with the questions above about whether a distinction is even maintainable) calls into question whether we should just combine the headers (again?). Even I often just put all the references I add under References, or under whichever header is already present. - -sche (discuss) 18:09, 8 July 2021 (UTC)[reply]
I have recently added "References" sections to entries without specifying what I was referencing, as a kind of general "reference". I did so because there was very little to read on those dictionary entries beyond "this is a variant form of another word x". If I need to change over to "Further Reading", let me know. --Geographyinitiative (talk) 18:26, 8 July 2021 (UTC)[reply]
My thinking is that "Further reading" = "works consulted" and "References" = "works cited". So at corretja, I got a date on the term from one source, but the definitions are based on a combination of these sources, so I treated them differently. Definitions themselves are technically only sourced through citations—except in LDLs, so I'm not sure how that affects things. (Presumably a reference on an LDL page sources all the info on it.) I'd be fine with merging the headers mainly to avoid inconsistency. Ultimateria (talk) 21:59, 8 July 2021 (UTC)[reply]
For me, ===References=== holds <references/> and nothing else, since inline citations are what connects a specific fact to the reference that confirms it. ====Further reading==== holds anything else relevant to a specific term, that is not tied to any facts stated in the entry. In other words, the former verifies information, the latter does not. —Rua (mew) 08:08, 12 July 2021 (UTC)[reply]
When I was new here, I was reproved for using in-line references - 'We don't do in-line references here.' --RichardW57 (talk) 17:38, 12 July 2021 (UTC)[reply]
"anything else relevant to a specific term, that is not tied to any facts stated in the entry." What about the fact of the existence of the word? --Geographyinitiative (talk) 21:37, 12 July 2021 (UTC)[reply]
All of this has arisen from the banning of the heading "External links" and the placement of such links under the heading "See also".
I find "Further reading" to be an unsatisfactory heading in many of its applications at Wiktionary. For a normal user, interested in definitions, usage examples, and translations, "further reading" seems time-consuming and irrelevant. "References" is more suggestive of terse information and confirmation of the information Wiktionary presents, or alternatives to the information Wiktionary presents or the way it is presented. Images linked under such headings are clearly not "reading" in the most common sense. Links to databases that contain semantic information not presented as sentences do not reward the reader with material to be "read" in the normal sense. "Further reading" makes me think of multi-page articles and scholarly notes about the etymology, usage, semantics, and syntax of words, not a verbal definition or translation, database entry, or picture. "Reference" seems to be much more inclusive.
Appropriating the term "Reference" solely for footnoted references would be fine if we had a single suitable term to characterize non-footnote type references that was not misleading as "Further reading" is. As it is, we once more are acting as if we are preparing a dictionary for only scholarly users of the linguistics variety, rather than a general population of users, whether scholars of any discipline or more ordinary users. DCDuring (talk) 20:13, 12 July 2021 (UTC)[reply]
I question the need for references to not be inline/footnotes. What use is a reference if you have no idea what information is referenced from it? —Rua (mew) 18:07, 13 July 2021 (UTC)[reply]
It just refers the reader to something, without further specification. The reader brings his own idea of what is being referenced, so he gets referenced what he wants referenced. What gets referenced, how precise the pointers are, and whether that is sufficient is really somewhat individual: it depends on the term (the disputability of its definitions, forms and etymologies), the expected readers, and the working habits of editors. A usual way is just adding everything one knows and mentioning the resources that one had open.
I agree with DCDuring’s reasonings about “References” being more inclusive. I mostly use “Further Reading” to show that there is relatively a lot that one can read, if one wants to read more.
The distinction is basically not needed. It is sometimes useful, and sometimes abused, for layout reasons: direct references with <references/> go under the References header and the synthesized rest under the Further reading header, because it is ugly when <references/> directly follows unspecific references, or perhaps the other way round. Otherwise, I see it only as a remnant from a time when there were many more headers. Fay Freak (talk) 19:22, 13 July 2021 (UTC)[reply]
Yeah, my instinct is simply to merge footnotes and non-footnote references under the ==References== header for Italian, as currently there's no consistency at all as to what goes in ==References== vs. ==Further reading==. The only way this loses info is if a consistent distinction is made in non-footnote references between those that go in ==References== vs. ==Further reading==, and from everything I have seen, there's no such consistent distinction. If we later decide to follow User:Rua's suggestion of putting footnotes under ==References== and everything else under ==Further reading==, this can be done by bot. Benwing2 (talk) 06:50, 14 July 2021 (UTC)[reply]
I think, going farther back, it might be the fault of MediaWiki choosing <references /> as the name of the footnotes tag. It should have been <footnotes />. Because of the tag, the header for footnotes is References. I recalled a vote saying References should only include footnotes, and this is probably it: Wiktionary:Votes/2016-12/"References" and "External sources". — Eru·tuon 03:48, 15 July 2021 (UTC)[reply]
Exactly, this is what I've been going by ever since. —Rua (mew) 19:16, 15 July 2021 (UTC)[reply]
I have just read Wiktionary:Entry_layout#References. To my formerly untrained eye, I recently seem to have been doing something right on the line between "verify[ing] the information available on our entries" and "simple recommendations of further places to look". However, I'm now leaning toward viewing my external dictionary links as more akin to 'Further reading'. Please forgive my confusion: I have never added dictionary links until recently and hadn't seen this part of entry layout. --Geographyinitiative (talk) 19:31, 15 July 2021 (UTC)[reply]
I usually use references. I add a separate further reading section when some references use <ref> and others do not. This is to make the page look nicer. I use further reading if I am linking to something like a Language Log post. This may not match anybody else's style or the officially prescribed style. Vox Sciurorum (talk) 18:57, 18 July 2021 (UTC)[reply]
FYI some Wikipedia pages have subsections called "Citations" and "Sources" under ==References==; see Newcastle upon Tyne for an example. Benwing2 (talk) 03:32, 21 July 2021 (UTC)[reply]

Earning the right to trial by RfV


@Inqilābī has proposed that entries with neither attestation nor Google hits may be subjected to speedy deletion even though there is reason to believe that they exist. As googling for a word in a heavily inflected language may easily fail if one just searches for the citation form, it seems that for words that 'obviously exist', one may need to provide Google hits in the entry as evidence of its right to trial by RfV. In the use case I have in mind, these would exhibit inflected forms. How should one present these Google hits? Would it be appropriate to exhibit them as 'usage examples', albeit perhaps lacking translations? --RichardW57m (talk) 13:18, 9 July 2021 (UTC)[reply]

(Actually the wording implied that both attestation and Google hits were required, but I'm assuming that this was just sloppy wording.)

Synonym Chart for Danish, Norwegian Bokmal, Norwegian Nynorsk, and Swedish


Hi! Owing to the generally acknowledged fact that the three North Germanic languages (Danish, Norwegian, and Swedish) have high mutual intelligibility, I was wondering if we can make a synonyms chart similar to what we have for Persian (check دوچرخه where it shows the word for bicycle in Iranian Persian, Dari Persian, and Tajik) and similar to what we have for Chinese (check 自行車 where it shows the word for bicycle in the various Chinese languages). A synonyms chart like this, which would theoretically show four entries (since we show one for both Bokmal and Nynorsk), would be helpful for language learners and language enthusiasts in comparing the word usage for these North Germanic languages. For example, a Scandinavian languages synonym chart for "frog" would have frø for Danish, frosk for both Bokmal and Nynorsk, and groda in Swedish, and a Scandinavian languages synonym chart for "breakfast" would have morgenmad for Danish, frokost for Norwegian Bokmal, frukost for both Norwegian Nynorsk and Swedish. Theoretically, it could also be expanded to the various Norwegian dialects if applicable. What do you guys think? --Mar vin kaiser (talk) 13:29, 11 July 2021 (UTC)[reply]

Tagging some Scandinavian language editors I've found, @Enkyklios, @Donnanz, and @Gamren. --Mar vin kaiser (talk) 13:38, 11 July 2021 (UTC)[reply]
Suzukaze made {{dialect synonyms}} which can be used. If we are going ahead with this, I'd be in favour of including all their dialects too and not just the 'standard' versions. Kritixilithos (talk) 14:50, 11 July 2021 (UTC)[reply]
@Mar vin kaiser I'm in favor of this in general. The Kurdish languages have a similar template {{ku-regional}} that I expanded; see moz for an example. BTW I also think we should consider figuring out a way to merge Norwegian Bokmål and Nynorsk. It seems silly to have so many entries like Abkhasia and abkhasisk that have both Bokmål and Nynorsk entries. Benwing2 (talk) 18:15, 11 July 2021 (UTC)[reply]
@Benwing2: I also think we should merge them, but there was a vote about it, and it failed, unfortunately. PUC 22:12, 11 July 2021 (UTC)[reply]
@Benwing2 @PUC Bokmål and Nynorsk are different languages with entirely different origins. Bokmål originates in Danish (regardless of Wiktionary saying it is descended from Middle Norwegian), which has gradually had spelling reforms made to reflect the local (mis)pronunciation of the Norwegian urbanites. Nynorsk on the other hand originates in Ivar Aasen's studies on Norwegian dialects. The two have borrowed a lot of words from each other, but they are still separate languages and should not be merged. Mårtensås (talk) 12:29, 13 July 2021 (UTC)[reply]
@Benwing2: Can you help me make a template draft for this? At least for the four I identified, to get started. I'm not familiar with the code, you see. --Mar vin kaiser (talk) 11:23, 12 July 2021 (UTC)[reply]
@Kritixilithos, Suzukaze-c The problem with {{dialect synonyms}} is it requires that separate data pages be created for each lemma. This makes sense when you want to cover a zillion dialects but if there are just a few, it can be painful to have to create all those pages. In addition there's the issue of what to name the data pages. The solution of having separate data pages was tried for descendants (the former {{etymtree}}) and didn't work well; the solution actually adopted in {{desctree}} was to fetch the data directly from one of the pages, which could be done here as well. Benwing2 (talk) 18:22, 11 July 2021 (UTC)[reply]
@Benwing2 It is a design innovated by {{zh-dial-syn}} ({{dialect synonyms}} is a language-agnostic remake) and has worked so far. —Suzukaze-c (talk) 19:27, 13 July 2021 (UTC)[reply]
You're describing a Swadesh list. We already have lists for Danish, Swedish, Faroese, Icelandic and both Norwegians that you can merge. WP has another one just with the Scandi languages.__Gamren (talk) 19:05, 11 July 2021 (UTC)[reply]
@Gamren: But Swadesh lists only cover basic words, and WP only covers a sample of words in the language. This proposal would cover all words of the language in all aspects. For example, what would happen is when you open the entry for frø, for example, you would see a table there for the terms in the other relatively mutually intelligible Scandinavian languages. --Mar vin kaiser (talk) 21:55, 11 July 2021 (UTC)[reply]
What? I don't want that. Cognates go in etymology section. Non-cognates go nowhere.__Gamren (talk) 23:02, 11 July 2021 (UTC)[reply]
@Gamren: The same thing is being done in Wiktionary across other languages and dialect continuums, like Kurdish and Persian. For example, check دوچرخه where you can see Iranian, Dari, and Tajik use different terms for "bicycle", and moz where you can see the word for "hornet" across Kurmanji, Sorani, and Southern Kurdish. This format is very useful for language learners and users to find out word differences and spelling differences across super-related languages, especially those that have high mutual intelligibility, which the Scandinavian languages have. So it would be convenient for learners to know in the Danish entry frø, that it's a totally different word in Norwegian and Swedish, frosk for both Bokmal and Nynorsk, and groda in Swedish. --Mar vin kaiser (talk) 07:25, 12 July 2021 (UTC)[reply]
Okay. I think it's a stupid idea in and of itself. Besides, how do you deal with synonymy? Do you link ei with ej or ikke?__Gamren (talk) 08:03, 12 July 2021 (UTC)[reply]
@Gamren: Well, based on my experience with other languages, more than one entry can appear within one language or dialect. If you look at the dialectal synonyms chart in 自行車 in Chinese, one dialect can have several words/synonyms for "bicycle", and sometimes if one word is already obsolete in that dialect, or very literary in that dialect, there's an option to label that word for that dialect as obsolete, literary, or any other label. --Mar vin kaiser (talk) 10:26, 12 July 2021 (UTC)[reply]

Merging categories for 'informal' and 'colloquial' terms


See WT:RFM. I have redirected the 'colloquial' label to point to 'LANG informal terms' e.g. Category:English informal terms (instead of e.g. Category:English colloquialisms). For the moment the two labels still display differently. BTW there are a few weird edge cases, e.g. the colloquial-um and colloquial-un labels, used only for Persian (Category:Persian colloquialisms containing sequence um and Category:Persian colloquialisms containing sequence un). I don't speak Persian so I have no idea what's going on here but having special-purpose labels like this sitting in the general-purpose code strikes me as wrong. Benwing2 (talk) 04:24, 12 July 2021 (UTC)[reply]

 @Benwing2: Well you see, in transcriptions as of بادام (bādām) and words like سلام (salām) you fast find pronounced, the Classical Persian vowel ā is pronounced in Modern Iran more rounded and raised than the transcription suggests. In substandard language of some cities this goes even, in words auslauting ān and ām, so far as [uː], and thence is written, as “eye-dialect”, with و. But apparently not all words and so editors find it necessary to collect these forms. That’s all that’s behind it. It’s also mentioned at w:Persian_phonology#Colloquial Iranian Persian. Fay Freak (talk) 05:14, 12 July 2021 (UTC)[reply]

@Benwing2 I've noticed some other problems with Category:English terms by usage, but the only one I'll address now is the sparsely populated Category:English familiar terms. Should this be folded into Category:English informal terms? The distinction is very subtle, and it's certainly not being used the way that the category description intends in the rest of Category:Familiar terms by language. I can tell you that in the Romance languages at least, "familiar" is just the term many dictionaries used instead of "colloquial" or "informal". Ultimateria (talk) 17:02, 16 July 2021 (UTC)[reply]

@Ultimateria I'd be fine with combining these categories with 'informal terms'. We already have e.g. Category:English endearing terms, and it looks like the 6 terms in Category:English familiar terms can go either in Category:English endearing terms or Category:English informal terms. Benwing2 (talk) 03:10, 17 July 2021 (UTC)[reply]
Anyone object to merging 'familiar' with 'informal'? Most languages that have any such terms categorized as 'familiar' have only 1 or 2. I checked some of the languages with more such terms; many are Romance languages and it's clear in this case that it's merely used for informal terms just as User:Ultimateria suggested, maybe sometimes with relatively strong informality, but that's it. The use in Japanese and English appears to represent mostly informal terms referring to family or friends, which is probably the original intention of the category, but we also have the better-populated and more well-defined Category:Endearing terms by language and Category:Derogatory terms by language. Both endearing and derogatory terms are almost always informal as well, and I'm almost positive these two categories along with Category:Informal terms by language will suffice for all use cases. Benwing2 (talk) 16:52, 18 July 2021 (UTC)[reply]
Isn’t “endearing” the antonym of “pejorative” or “derogatory”, or what is it? (May also be the antonym of “pejorative” and comprise the idea of familiarity at the same time.) I know “meliorative”, but this is rarely used, and more often in German (→ de:meliorativ, w:de:Pejorativum about “Melioration”). Fay Freak (talk) 23:29, 23 July 2021 (UTC)[reply]
@Fay Freak Yes. My point is that these terms are informal, and are generally either endearing or pejorative, so the combination of these should be enough. Benwing2 (talk) 01:20, 27 July 2021 (UTC)[reply]

This category would be used for terms that are named, or perceived to be named, after something that they aren't actually related to. For example, turkey isn't actually from Turkey, Dutch filet americain isn't from America, Russisch ei isn't from Russia, and so on. —Rua (mew) 08:20, 12 July 2021 (UTC)[reply]

Does Dutch baby fit in there? And what about bastard operator from hell? There are credible rumours these miscreants aren't actually from hell. Regardless of the potential merits of such a category, I have a problem with the proposed name, which IMO is too generic. The label “misnomer” can be applied to anything not appropriately named; one can call Oktoberfest a misnomer for a festival that most of the time takes place for most of its duration in the month of September, and centipede a misnomer for a critter that never has exactly 100 legs. --Lambiam 23:02, 12 July 2021 (UTC)[reply]
Same. I reckon it useful to have terms with misapplied origin qualifiers collected, but what else would fall under this name? We might start with a category restricted to demonymic terms – there are a lot in the plant lexicon, Armenian cucumber, granoturco, مصر بوغدایی (mısır buğdayı) (but also mısır?) – which later can be a subcategory of a more general concept. Fay Freak (talk) 23:38, 12 July 2021 (UTC)[reply]
Yes, good idea. Another example: cochon d’Inde.
Also, what about montagnes russes vs. америка́нские го́рки (amerikánskije górki)? Or take French leave vs. filer à l’anglaise? PUC 23:55, 12 July 2021 (UTC)[reply]
Most (almost all?) noun phrase idioms are well-attested misnomers, aren't they? Aren't metaphors misnomers sensu stricto? DCDuring (talk) 01:39, 13 July 2021 (UTC)[reply]
Alternative name proposals are welcome of course. —Rua (mew) 11:33, 13 July 2021 (UTC)[reply]
Maybe the category could be introduced as "Category for terms named after an incorrect country of origin". This would rule out words like "centipede" and metaphors and leave only what I believe was meant by the proposal. Thadh (talk) 11:51, 13 July 2021 (UTC)[reply]
I think that that which is (mis)named is a concept, not a term (in this context synonymous with name).  --Lambiam 15:57, 13 July 2021 (UTC)[reply]
Also, it doesn't have to be countries. —Rua (mew) 17:42, 13 July 2021 (UTC)[reply]
In that case, what are the requirements, if I may ask? Because I would definitely support the country-of-origin misnomers, since that's very specific, but I'm afraid a more abstract definition would quickly become unmanageable, like the others said above: any metaphor could be called a misnomer if a sufficiently unspecific definition is used. Thadh (talk) 22:13, 13 July 2021 (UTC)[reply]
Anything else that's a place name? —Rua (mew) 08:15, 14 July 2021 (UTC)[reply]
Oh, okay, makes sense. In that case "Category for concepts named after an incorrect location of origin"? Thadh (talk) 08:36, 14 July 2021 (UTC)[reply]
I agree this would be useful to categorize if we can think of what to call it. What is the general term for "thing from place" words which aren't misnomers, i.e. what are terms like Italian sausage, Belgian chocolate (which, pace our entry, seems to almost always mean chocolate from Belgium) or American cheese called? (And do we want to group those?) I'm wondering because then turkey and cochon d'Inde would just be "misnamed [whatever the term for the general thing is]s". - -sche (discuss) 01:31, 15 July 2021 (UTC)[reply]
I don't know what such terms are called, but I agree with this idea. —Rua (mew) 19:14, 15 July 2021 (UTC)[reply]
@-sche: The second sense at toponym reads as follows: "(less common) A word derived from the name of a place." But I don't know if it quite fits, and even if it does, using it might lead to confusion, given it's not the most common meaning of the word... PUC 09:47, 16 July 2021 (UTC)[reply]
What about Category:X terms misleading as to the place of origin of the things they designate? Super wordy though... PUC 09:52, 16 July 2021 (UTC)[reply]
To shorten it: "the things they designate" are referents. Equinox 10:05, 16 July 2021 (UTC)[reply]
Ah yes, thanks, that's already a bit better. While we wait for someone to come up with a good category name, I've created User:PUC/Terms misleading as to the place of origin of their referents, so that we can start gathering such terms now. PUC 10:12, 16 July 2021 (UTC)[reply]
I was thinking about this, because I don't want it to go undone just for lack of a name. I pondered "terms indicating place of origin" or "terms suggesting a place of origin" (for correct ones like American cheese) and "terms incorrectly indicating place of origin" or "terms suggesting an incorrect place of origin" (for misnomers). On their face(s), those might lead people to also add "American" and "Pennsylvania Dutch" (respectively) to the categories, though; would we be OK with that? If we decide to only categorize the incorrect ones, the inclusion of a small number of things like "Pennsylvania Dutch" (for people originally from Germany and not the Netherlands, and possibly more immediately from somewhere near but not in Pennsylvania) would probably be fine, even desirable... it's the "correct names" category getting swamped with bare demonyms ("Argentinian", "Idahoan", "Berliner", etc) which would be undesirable. - -sche (discuss) 17:22, 4 August 2021 (UTC)[reply]

Appropriation of Module:la-pronunc


Reducing the size of Module:languages/data* by moving varieties/aliases/other names (and possibly Wikidata links) elsewhere


It's known that the size of the language modules is a significant cause of memory errors. Some months ago (see Wiktionary:Grease pit/2020/November#splitting language data) I proposed splitting the data modules to be indexed by the first two letters instead of the first letter. Some people had concerns that this approach would be unwieldy to work with, and so it was proposed to have the modules still in the current structure but have a postprocessing step after saving a module that would copy them to the more split-up structure. This requires some JavaScript hacking, which I'm not very familiar with; User:Erutuon made some good suggestions and did some investigations, which weren't totally promising. I'm now suggesting a less radical but probably effective solution of moving the varieties, aliases, other names and maybe Wikidata link fields to different files. I think User:Rua also suggested this at one point. I'm almost positive the varieties, aliases and other names aren't used on most pages, and for some languages they can take up a lot of space (cf. French, which has one alias and 27 varieties listed). Moving the Wikidata links may be less effective; they are used at least when generating Wikipedia links in etymology templates. (On the other hand, they occur in pretty much *every* language, so the following solution might be very effective: (1) move the Wikidata links into a separate file; (2) create a list of high-memory pages where Wikipedia links are not to be generated by etymology templates, similar to what's in Module:links/data, but more centralized, more up-to-date and more effective; (3) implement this in Module:etymology.) If we do this, I would argue we should create a parallel set of modules under something like Module:languages/extradata* that has the same structure as Module:languages/data* but holds the secondary data not typically needed. Benwing2 (talk) 19:36, 18 July 2021 (UTC)[reply]

Sounds all good. I wasn’t even aware that varieties, aliases and other names could use memory while not being used. So the whole pages are loaded and therefore indexing by the first two letters instead of the first one helps, I understand. (I thought the other names are for easing the identification of languages, as e.g. it is annoying if you want to cite a cognate but do not find the language code and name which the language has on Wiktionary (thou probably hast not done such remote etymologies and reconstructions). It is bizarre of course that for that to work Lua memory is used …) Fay Freak (talk) 20:27, 19 July 2021 (UTC)[reply]
Thinking out loud about the effect of removing the Wikidata items from the main language data modules. The Wikidata items are under the key 2 in the tables and the family codes are under key 3. Key 1 (the canonical name) is always filled, and keys 2 and 3 are almost always filled (7270 languages have Wikidata item and family, 516 only Wikidata item, 356 only family, 24 neither). Because these three keys use the Lua array syntax (for instance, { "English", "Q1860", "gmw", ... }) they are placed in the array part of the Lua table when all three are present, and because Lua array sizes are in powers of 2, there will be 4 array slots and the fourth slot is unused. To reduce the size of the array to 2 slots, we would have to get rid of either the Wikidata items or the family, and move whichever one was left to key 2. Apparently array slots themselves take 16 bytes, so that would be at least (7,270 + 516) array slots × (16 × 2) bytes per array slot = 249,152 bytes out of a limit of 52,428,800 bytes saved on pages that do not load Wikidata items. The total size of the Wikidata item strings themselves seems to be 59,000 bytes, and much of that might be saved as well. Based on that napkin math, removing Wikidata items would have relatively insignificant savings, but because napkin math is not completely accurate, maybe it's worth trying if I make a script to do it automatically, roughly as I did for splitting language data into more modules.
Removing otherNames, aliases, and varieties would probably have a greater impact, but it's a bit harder to calculate the likely savings because nested tables are involved. As you say, they're almost never used in entries — probably never. I can't think of a way to verify this without editing Module:languages to transclude a tracking template inside getOtherNames, getAliases, and getVarieties and then wait for the server to update all the entries and count the transclusions. — Eru·tuon 21:33, 20 July 2021 (UTC)[reply]
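
The array-slot napkin math above can be reproduced with a few lines. This is only a sketch in Python using the figures quoted in the comment; the 16 bytes per slot and the drop from 4 slots to 2 are the comment's own assumptions, not measured values, and the constant and function names are made up for illustration.

```python
# Figures quoted in the comment above (assumptions, not measurements).
LANGS_WITH_ITEM_AND_FAMILY = 7270
LANGS_WITH_ITEM_ONLY = 516
BYTES_PER_ARRAY_SLOT = 16
SLOTS_SAVED_PER_LANGUAGE = 2      # array part shrinks from 4 slots to 2
LUA_MEMORY_LIMIT = 52_428_800     # bytes


def array_slot_savings():
    """Bytes saved if every language with a Wikidata item loses two slots."""
    languages = LANGS_WITH_ITEM_AND_FAMILY + LANGS_WITH_ITEM_ONLY
    return languages * SLOTS_SAVED_PER_LANGUAGE * BYTES_PER_ARRAY_SLOT


def share_of_limit(saved_bytes):
    """Savings as a fraction of the per-page memory limit."""
    return saved_bytes / LUA_MEMORY_LIMIT
```

With these inputs `array_slot_savings()` gives the 249,152 bytes quoted above, which is a little under half a percent of the limit, consistent with the "relatively insignificant" conclusion.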
If the slots themselves take up so much space an option could be to cram more data into one slot? e.g. comma or colon separated. "Aranadan", "Q3507928:dra" – Jberkel 22:02, 20 July 2021 (UTC)[reply]
@Erutuon Thanks for your analysis. I suspect the burden/savings may be greater because of the weird stuff that MediaWiki does when mw.loadData() is called. I'm not quite sure how that works but I think it adds an extra memory burden. Also, when you calculated the 59,000 bytes of the object strings, it sounds like you didn't take into account the memory used for the actual string object. The link to wowpedia.fandom.com says the average memory consumption is around 24 + string_length. The average length of the Wikidata items looks to be around 8 bytes, meaning (7270 + 516) * 32 = around 249,152 bytes for the actual strings. I suspect MediaWiki wraps each object loaded via mw.loadData() in a table, which means at least another (7270 + 516) * 40 = 311,440 bytes for the Wikidata strings, probably significantly more. I suspect we're talking at least 1 MB all told for the Wikidata strings given that the total size of the language data appears to be > 12 MB (based on a statement from User:Rua a while ago). Benwing2 (talk) 03:29, 21 July 2021 (UTC)[reply]
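The string-overhead estimate in this comment works out as below. This is a sketch of the arithmetic only, in Python; the 24-byte header, the average length of 8 and the 40-byte wrapper are the comment's own assumptions, and the wrapper term in particular is a guess rather than an established cost.

```python
LANGUAGES_WITH_ITEM = 7270 + 516
STRING_HEADER_BYTES = 24   # assumed per-string overhead (24 + string_length)
AVERAGE_ITEM_LENGTH = 8    # rough average length of an ID such as "Q3507928"
WRAPPER_TABLE_BYTES = 40   # assumed cost if each value were wrapped in a table

bytes_per_string = STRING_HEADER_BYTES + AVERAGE_ITEM_LENGTH
string_total = LANGUAGES_WITH_ITEM * bytes_per_string
wrapper_total = LANGUAGES_WITH_ITEM * WRAPPER_TABLE_BYTES

print(bytes_per_string)              # 32
print(string_total)                  # 249152
print(wrapper_total)                 # 311440
print(string_total + wrapper_total)  # 560592
```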
@Benwing2: Yeah, I was just counting the byte length of the Wikidata items and had neglected the 24 bytes, so your calculations are about right. The padding takes up more than the actual string data!
I think mw.loadData doesn't affect the memory usage of string fields. It wraps tables (not strings or booleans or numbers or nil) and only when you access each of them for the first time. Access happens by calling mw.loadData (for the top-level table in the module) and by indexing the table or any of its subtables, directly or indirectly by iterating over them with pairs or ipairs, when indexing yields a not-yet-accessed table.
For example, local data2 = mw.loadData("Module:languages/data2") wraps one table, indexing local english_data = data2["en"] wraps another, indexing local aliases = data2["en"]["aliases"] wraps a third. The previously wrapped data2["en"] was cached and returned the second time it was indexed. local wikidata_item = english_data[2] doesn't wrap anything because the yielded value is a string, not a table. for k, v in pairs(english_data) do end will then wrap the 5 remaining subtables in english_data. When local data2 = mw.loadData("Module:languages/data2") is called again, the process repeats, because the return value of mw.loadData is not cached. (The internal table that is accessed through the wrapper is cached for an entire page, however.)
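The wrapping behaviour described here can be modelled roughly as below. This is a toy model in Python, not Scribunto's implementation: the class `LazyWrapper`, its counter, and the sample data (including the placeholder alias) are all hypothetical, and only the lazy, cached wrapping of subtables is imitated.

```python
class LazyWrapper:
    """Read-only proxy that wraps nested dicts only when first indexed."""

    wrap_count = 0  # how many wrappers have been created so far

    def __init__(self, raw):
        self._raw = raw
        self._cache = {}  # subtable wrappers already handed out
        LazyWrapper.wrap_count += 1

    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._raw[key]
        if isinstance(value, dict):      # only tables get wrapped
            value = LazyWrapper(value)
            self._cache[key] = value
        return value                     # strings etc. pass straight through


# Hypothetical stand-in for a language data module.
RAW = {"en": {1: "English", 2: "Q1860", 3: "gmw",
              "aliases": {1: "placeholder alias"}}}

data2 = LazyWrapper(RAW)         # wraps the top-level table
english = data2["en"]            # wraps the "en" subtable
aliases = english["aliases"]     # wraps the aliases subtable
print(LazyWrapper.wrap_count)    # 3

same = data2["en"]               # served from the cache, nothing new wrapped
item = english[2]                # a string, so nothing is wrapped
print(same is english, item, LazyWrapper.wrap_count)   # True Q1860 3

again = LazyWrapper(RAW)         # a fresh "load" starts wrapping again
print(again["en"] is english, LazyWrapper.wrap_count)  # False 5
```

In this model the number of wrappers depends on how many tables are touched and how often the data is loaded again, not on how many string fields each table holds.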
It looks like the number of fields affects how long it takes for mw.loadData to iterate over the whole table recursively when it validates the data module (once per page that loads a data module with mw.loadData), but I can't see a way for it to affect the memory used in wrapping it. I hadn't considered the validation part, and it could contribute to memory usage unpredictably, because it uses a temporary seen table to track the tables that have already been validated, with as many hash fields as there are tables in the return value of the data module, and that table will be garbage-collected at some time in the future. — Eru·tuon 06:03, 21 July 2021 (UTC)[reply]
@Erutuon Thanks again for the analysis. I think we should assume that no garbage collection happens when generating a page. Benwing2 (talk) 02:47, 22 July 2021 (UTC)[reply]
@Benwing2: It looks like that's true, because inserting an object with a garbage collection metamethod (setmetatable({}, { __gc = function() mw.log("Garbage is collected!") end })) into Module:links or Module:languages and previewing water doesn't cause anything to be printed to the Lua log. I'd never tried this before, but just assumed garbage collection would have to run sometime. Making a loop that increases memory usage until garbage collection happens doesn't manage to force garbage collection: the first template invocation to use the module just runs out of memory or time. So garbage collection seems to be disabled entirely, or maybe there's no way to trigger it under normal conditions. Then the unpredictability of memory usage is because of something other than garbage collection. — Eru·tuon 03:07, 22 July 2021 (UTC)[reply]
Okay, so my code didn't force garbage collection because Lua 5.1 only lets userdata have garbage collection metamethods! So garbage collection probably ran, but the __gc function was just ignored. (The code works in Lua 5.2 and later versions where tables do get garbage collection metamethods.) I noticed this when running similar code in a modded Lua 5.1 that has a function that gets the total allocated memory, minus deallocated memory. Memory usage went down but the garbage collection metamethod hadn't run. I should have tried this long ago! (If garbage collection metamethods on tables were possible, Scribunto might want to disable them anyway, because they let you run code at an unknown time in the future, which can kind of break the independence of module invocations.) — Eru·tuon 09:24, 10 August 2021 (UTC)[reply]
"The padding takes up more than the actual string data!" Encoding wikidata ids numerically would be another option to save space. They seem to use 64-bit floating point in the version of Lua used. – Jberkel 12:16, 22 July 2021 (UTC)[reply]
Wikidata IDs are integers. With 64-bit floating point, that means we can represent up to 2^53 (9007199254740992) with no issues. I'm not sure if Wikidata IDs have any technical upper limit that is guaranteed, but I'd imagine that number wouldn't be reached any time soon. — surjection ⟨??⟩ 12:24, 25 July 2021 (UTC)[reply]
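The 2^53 bound can be checked directly. The sketch below uses Python, whose float type is the same IEEE 754 double that the thread describes for Lua numbers; it illustrates the precision limit and is not code from any module.

```python
LIMIT = 2 ** 53  # 9007199254740992

# Spot checks: integers up to 2**53 survive a round trip through a double.
assert int(float(LIMIT)) == LIMIT
assert int(float(LIMIT - 1)) == LIMIT - 1
assert int(float(3507928)) == 3507928   # numeric part of "Q3507928"

# The next integer does not: it rounds back down to 2**53.
print(int(float(LIMIT + 1)))              # 9007199254740992
print(float(LIMIT + 1) == float(LIMIT))   # True
```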
Why not just regular integers then (unless somehow they are 32 bit in Lua)? However, I imagine that non-small integers as well as all floating-point numbers are implemented as objects in Lua, as they are in Python, so it's not clear we'd get much savings from doing this. Benwing2 (talk) 01:18, 27 July 2021 (UTC)[reply]
From what I understand, all numbers in Lua are internally stored as 64-bit floating point, at least in that version of Lua. So the saving would be mainly the 24-byte per-string overhead, and they would all have a constant 8-byte size. – Jberkel 01:33, 27 July 2021 (UTC)[reply]
It sounds like a good idea to me, and not likely to cause problems, so I've edited Module:languages and Module:data consistency check to accept either a string or a positive integer and converted Module:languages/data2. — Eru·tuon 03:13, 27 July 2021 (UTC)[reply]
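Accepting "either a string or a positive integer" could look roughly like the sketch below. It is written in Python for illustration and is not the code in Module:languages; the function name `normalize_wikidata_item` and its error messages are hypothetical.

```python
def normalize_wikidata_item(value):
    """Return an ID such as 'Q1860' from either 'Q1860' or 1860."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("booleans are not valid Wikidata item IDs")
    if isinstance(value, int):
        if value <= 0:
            raise ValueError("numeric Wikidata item IDs must be positive")
        return f"Q{value}"
    if isinstance(value, str):
        digits = value[1:]
        if not (value.startswith("Q") and digits.isascii() and digits.isdigit()):
            raise ValueError(f"malformed Wikidata item ID: {value!r}")
        return value
    raise TypeError(f"expected str or int, got {type(value).__name__}")


print(normalize_wikidata_item(1860))     # Q1860
print(normalize_wikidata_item("Q1860"))  # Q1860
```

Keeping the string form as the single output means callers do not need to know which representation a given data module uses.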
Converted all language and language family data modules to render Wikidata item IDs as numbers. Strings are still supported at the moment but no longer used. I haven't tested the actual effects on memory usage in big pages, but that could be done by creating another set of language data modules and using Special:TemplateSandbox, as I did to test the effect of splitting up language data modules. — Eru·tuon 04:08, 27 July 2021 (UTC)[reply]
It seems to have made a difference. Yesterday there were 18 out-of-memory errors in CAT:E. As of now, there are 14. One of them, papa, I cleared with a null edit- which indicates a very recent improvement. Chuck Entz (talk) 06:29, 27 July 2021 (UTC)[reply]
I proposed this somewhere else, but if making something a numbered rather than a named parameter (like canonicalName is no longer spelled out as such but is just the first thing after the language code) saves memory, script info could be made a numbered thing. I've added it to a majority of languages; a concerted effort could add it to more. (Would it cause issues that script is often more than one value?) - -sche (discuss) 19:33, 27 July 2021 (UTC)[reply]
@-sche: Yeah, that is a good idea because according to User:Erutuon/language stuff § Data item census, scripts is the next most common field after the family code. Many language data tables that have scripts also have a Wikidata item ID and a family code, so there will already be a slot in Lua memory for field 4 that is not filled with a value and memory usage may just go down. Not sure how best to make this mesh with the proposal to move Wikidata item IDs out though; if that were done, perhaps field 3 could be used for scripts and field 2 for the family code. — Eru·tuon 19:54, 27 July 2021 (UTC)[reply]
IMO varieties, other names, etc. could easily be moved to an extradata module for a potentially significant chunk of memory saved. They aren't needed that often from what I can tell. — surjection ⟨??⟩ 00:30, 8 August 2021 (UTC)[reply]
I've started looking into this. Module:User:Surjection/languages would be merged into Module:languages, but first these manual uses have to be dealt with. I've already changed Module:language-like to support a future _extraData (diff). — surjection ⟨??⟩ 13:17, 8 August 2021 (UTC)[reply]
Extradata templates have been created. — surjection ⟨??⟩ 14:59, 8 August 2021 (UTC)[reply]
All of the templates listed except for Module:data consistency check and Module:list_of_languages use Module:languages/alldata (or are false positives or unused modules), so not that much work is needed to keep those working. — surjection ⟨??⟩ 15:06, 8 August 2021 (UTC)[reply]
OK, those and Module:languages have been updated. The old varieties/aliases/otherNames in the main data modules should no longer be used for anything, so unless something breaks with the existing changes, that's the next step to take. — surjection ⟨??⟩ 15:37, 8 August 2021 (UTC)[reply]

Change to WT:WDL?

[edit]

The criteria for inclusion (CFI) require three usages for well documented languages (WDLs).

Hapax legomena (entry, appendix, category) are terms which are attested only once.

So there can be no sub-category of Category:Hapax legomena by language for WDLs.

Arabic and Chinese hapax legomena can be there as WT:WDL only lists "Modern Standard Arabic" and "Chinese (Standard Written Chinese)" (= written vernacular Chinese). This doesn't include Classical Arabic, Old Chinese, Classical Chinese.

However, in the list it's only "Hebrew" and not Ivrit or Neo-Hebraic. Thus all terms in Category:Hebrew hapax legomena, often hapax legomena only found in the Bible and thus Biblical Hebrew and one term being Paleo-Hebrew, would have to be deleted.

To keep Palaeo-Hebrew and Biblical Hebrew hapax legomena I propose the following change to WT:WDL:

old: 2. Armenian, Azerbaijani, Georgian, Hebrew and Turkish;
new: 2. Armenian, Azerbaijani, Georgian, Hebrew (Ivrit or Neo-Hebraic) and Turkish;

Furthermore to align the wording and style, I propose to change:

old: 3. Modern Standard Arabic
new: 3. Arabic (Modern Standard Arabic)

Then it's first the language name as it's used in entries (Arabic, Chinese, Hebrew) followed by a qualification stating which variety is considered a WDL. -Macopre (talk) 14:17, 19 July 2021 (UTC)[reply]

I always imagined that given the wording "languages well documented on the Internet", this always referred specifically to the forms that are actually represented online, unlike e.g. Biblical Hebrew. To insist on the three-use criteria for all forms of Korean would be quite damaging to Wiktionary's coverage of the dialects.--Tibidibi (talk) 14:49, 19 July 2021 (UTC)[reply]
Well, on the one hand, Biblical Hebrew is represented online (e.g. digitalised BHS, and there surely are versions on Google Books). And on the other, the list doesn't cover what is well documented, but what Wiktionary treats as well documented. Scots for example isn't well documented (BP, AS).
As for Korean, there are for example: Old Korean, Middle Korean, (New) Korean, Jeju. That is, Jeju, Old and Middle Korean are other languages than (New) Korean, and only the latter is a WDL. For Hebrew it's different as Hebrew contains Paleo-Hebrew, Biblical Hebrew, Medieval Hebrew and Ivrit. --Macopre (talk) 15:28, 19 July 2021 (UTC)[reply]
@Macopre The ISO 639 code "ko" or "kor" for Korean (and hence Wiktionary Korean entries) includes everything between 1600 and 1900, which was still a very different language from Contemporary Korean. Many of the most conservative, and hence most linguistically interesting, dialectal words might also fail because of the rapidity of ongoing dialect leveling. I don't think it makes sense to insist that an interesting Korean hapax is grandfathered in because the text is from 1599, only to be deleted when a new analysis discovers that the text was actually from 1601; or to accept an interesting word collected by Ogura Shinpei on Jeju Island but to reject a similar word when collected in rural North Korea (whose dialects are worse attested than that of Jeju).
The same general ideas would apply to many other languages on the list. Follow the spirit of the law rather than the letter, so to speak.--Tibidibi (talk) 15:51, 19 July 2021 (UTC)[reply]
Well, that's a general and different topic regarding WT:CFI and all WDLs. Here are some cases where English terms didn't have 3 usages and hence were removed:
--Macopre (talk) 19:30, 19 July 2021 (UTC)[reply]
Firstly it would not be a “change to WT:WDL” because as already said by systematic interpretation we get the desired results (“follow the spirit of the law rather than letter”, that is the intent of the statute maker); there is only the danger that you botch the text change and we only have a new statute maker intent to interpret, because obviously you do not think before editing and assume that everyone is just a machine that executes statutes instead of finding purpose.
So I don’t even see what the intended change with “Modern Standard Arabic” → “Arabic (Modern Standard Arabic)” is.
Then your wording is poor. “Neo-Hebraic” is ugly, it should be Neo-Hebrew, and Ivrit can just be understood as any Hebrew, because it is just the Hebrew form of “Hebrew”, irrespectively of whether English speakers pretend it to mean just Modern Hebrew. It is as poor as the claim that Persian or only Iranian Persian should be called Farsi.
Then the strict criteria for Armenian, Azerbaijani, and Georgian should start from a similar time as Modern Hebrew. Probably with their Sovietization, which is when Azerbaijani switched from the Arabic alphabet (before that not even 10% could read and write in Azerbaijan) and about when the dissolution of Armenian communities in Turkey happened: Surely we should have Western Armenian dialect terms which are mentioned for specific places. The corpora are also bad for these three languages. Can’t attest normal words for Azerbaijani even though it is in Latin script (or before, various Cyrillic scripts) that should scan well.
As the examples of dialect words show, we need to implement the idea that a word needs to be attested only as much as it is specifically claimed to be used: if a word was allegedly used in regional dialects of Western Armenian or Northern England at one time, and we expect it to be lacking in the written sources, then so be it. The important thing is that it should not look like a sham. Other attestation-based dictionaries also include that and it is all fine as they are explicit about how it is attested. And if a word is used on the internet then it may be attested from the internet. An extreme example is emojis; man, the CFI even precedes them. It was seen that if a word appears often non-durably then it does not matter any more that it is not durable because non-durable occurrences get replaced by others: I formulated that a word should be “consistently appearing on the internet”. And a dialect word should be attested in the way its kind of word is usually attested. It is different for literary inventions again as we want to exclude protologisms. Fay Freak (talk) 20:18, 19 July 2021 (UTC)[reply]

Oirat language code

[edit]

@Metaknowledge I would like to revive the discussion at Wiktionary:Beer_parlour/2021/April#Oirat_language_code.

@Victar, what do you mean by "Kalmyk should probably be a etymology-only code Oirat xal"?

@LibCae, in your opinion, should the code xal be used for Oirat instead of Kalmyk? Can we create a separate code for Kalmyk (e.g. something like xal-kal), which is a variety of Oirat? RcAlex36 (talk) 16:52, 19 July 2021 (UTC)[reply]

Xinjiang Oirat has developed pretty differently from Kalmyk, both phonetically and lexically. I’d like to keep xal for Russian Kalmyk. An independent subcode for Modern Xinjiang Oirat may be suggested, placed under xwo or xal after discussion. LibCae (talk) 08:59, 21 July 2021 (UTC)[reply]

Vietnamese Han character entries without headword-line templates

[edit]

Wiktionary:missing headword-line templates has a lot of instances of Vietnamese Han character entries. They can be found by clicking the "language section" header once (or clicking until the entries sort alphabetically) and then scrolling down near the bottom. It would be nice to clear these out because they make up about 170 of the 600-something missing-headword pages.

I posted about this on Discord, but it looks like the most knowledgeable Vietnamese editors might not be active on there. Pinging KevinUp as suggested by Suzukaze-c.

Would it be a good idea to clean these up by just putting all the readings into {{vi-readings}}, or do they need more information, like radical-stroke sortkey, categories of reading, and references? The top few search results here seem to all have the sortkey: Special:Search/insource:"vietnamese han character". We have Module:zh-sortkey if that would usually give the right answer. — Eru·tuon 20:21, 19 July 2021 (UTC)[reply]

(and @Kevinup --Suzukaze-c (talk) 20:25, 19 July 2021 (UTC))[reply]
Started on this before the weekend, think I'm making pretty good progress so far. MSG17 (talk) 19:46, 9 August 2021 (UTC)[reply]

Proto-Norse romanisations

[edit]

Good day fellow Wiktionarians! I have recently been adding romanisations of Proto-Norse entries in runes, in order to make them easier to find and search for; this is consistent with another ancient Germanic language only attested in an extinct alphabet, Gothic. In order to complete this process I want to enable the instant linking of transliterations, like in Gothic. To demonstrate:

As we can see here, the Gothic transliteration becomes a blue link, while the Proto-Norse stays as unclickable text. In order to change this, the line `link_tr = true` should be added to Proto-Norse (gmq-pro) in Module:languages/datax. This requires an admin to do, so I'm posting here to make sure there is some level of agreement on doing this. ᛙᛆᚱᛐᛁᚿᛌᛆᛌ Wiktionary's most active Proto-Norse editor Ask me anything 22:26, 20 July 2021 (UTC)[reply]

Actually, I would rather we delete the link to Gothic romanizations, because I don't see the point. The romanization entries don't say anything but "Romanization of" the native script; why bother linking to that? As far as I can tell, of all the Indo-European languages we have romanization entries for – Gothic, Hittite, Konkani, Mozarabic, Oscan, Pictish, Primitive Irish, Proto-Norse, Sauraseni Prakrit, South Picene, Umbrian – Gothic is the only one where the romanization is linked to automatically. I'd rather bring Gothic into line with all the rest, rather than bring Proto-Norse into line with Gothic. —Mahāgaja · talk 15:45, 22 July 2021 (UTC)[reply]
I agree with Mahagaja. IIRC the romanization entries are to assist in getting users to the Gothic-script entries; if they contain no content other than a pointer to the Gothic-script entry, linking to them from the Gothic-script entry seems non-helpful. (Then again, we do link to e.g. plurals from plural even though it is nothing but a pointer to the lemma form, so, shrug. Yes, I know, plurals is supposed to contain a pronunciation section, but how often does that happen?) - -sche (discuss) 06:48, 23 July 2021 (UTC)[reply]
@-sche: It happens whenever I notice a Pronunciation section is missing. —Mahāgaja · talk 21:21, 23 July 2021 (UTC)[reply]
You know what, I actually agree with you. ᛙᛆᚱᛐᛁᚿᛌᛆᛌ Wiktionary's most active Proto-Norse editor Ask me anything 21:08, 23 July 2021 (UTC)[reply]
@Mnemosientje, any opposition to removing the link for Gothic? —Μετάknowledge discuss/deeds 21:59, 23 July 2021 (UTC)[reply]

@Metaknowledge: How will we quickly check that the Romanisation entry exists for Gothic and hasn't been deleted in error? Do we rely on some bot-runner retaining an eternal interest in Wiktionary? --RichardW57m (talk) 10:52, 30 July 2021 (UTC)[reply]

@RichardW57m: As a general principle, reader-facing content should never be for editors' benefit alone. In this particular case, every word in the Gothic Bible already has its romanisation present on the site. —Μετάknowledge discuss/deeds 17:40, 30 July 2021 (UTC)[reply]
@RichardW57m: We aren't talking about deleting the romanization entries themselves; all we're doing is removing the link to the romanization, so that instead of seeing "𐍃𐍄𐌰𐌹𐌽𐍃 (stains)" with the romanization linked, we see "𐍃𐍄𐌰𐌹𐌽𐍃 (stains)" with the romanization as plain text. —Mahāgaja · talk 18:30, 30 July 2021 (UTC)[reply]
@Metaknowledge: Every individual word-form yes. But that doesn't mean all the lemma forms are. When someone creates a lemma it can be helpful to see if it's missing a romanization. —Rua (mew) 15:23, 31 July 2021 (UTC)[reply]
@Metaknowledge: Am of two minds - on the one hand I agree with Mahagaja, the inconsistency should be resolved and the link to the romanization should probably be removed. On the other hand, it would be very useful if they could just stay around until Category:Gothic romanizations without a main entry is empty, which should happen sometime in the coming 6-12 months, to help with identifying missing romanizations (as editors often fail to create a corresponding romanization entry when they create a lemma entry for a word of which only an inflected form is attested + imported into said category). So while I don't strongly oppose removing them right now, I wouldn't really mind if that happened a year from now, either, and I don't see the harm of keeping them around just a bit longer. — Mnemosientje (t · c) 19:03, 31 July 2021 (UTC)[reply]

Universal Code of Conduct News – Issue 2

[edit]

Universal Code of Conduct News
Issue 2, July 2021


Welcome to the second issue of Universal Code of Conduct News! This newsletter will help Wikimedians stay involved with the development of the new code and will distribute relevant news, research, and upcoming events related to the UCoC.

If you haven’t already, please remember to subscribe here if you would like to be notified about future editions of the newsletter, and also leave your username here if you’d like to be contacted to help with translations in the future.

  • Enforcement Draft Guidelines Review - Initial meetings of the drafting committee have helped to connect and align key topics on enforcement, while highlighting prior research around existing processes and gaps within our movement.
  • Targets of Harassment Research - To support the drafting committee, the Wikimedia Foundation has conducted a research project focused on experiences of harassment on Wikimedia projects.
  • Functionaries’ Consultation - Since June, Functionaries from across the various wikis have been meeting to discuss what the future will look like in a global context with the UCoC.
  • Roundtable Discussions - The UCoC facilitation team once again hosted a roundtable discussion, this time for Korean-speaking community members and participants of other ESEAP projects to discuss the enforcement of the UCoC.
  • Early Adoption of UCoC by Communities - Since its ratification by the Board in February 2021, situations whereby UCoC is being adopted and applied within the Wikimedia community have grown.
  • New Timeline for the Interim Trust & Safety Case Review Committee - The CRC was originally expected to conclude by July 1. However, with the UCoC now expected to be in development until December, the timeline for the CRC has also changed.
  • Wikimania - The UCoC team is planning to hold a moderated discussion featuring representatives across the movement during Wikimania 2021. It also plans to have a presence at the conference’s Community Village.
  • Diff blogs - Check out the most recent publications about the UCoC on Wikimedia Diff blog.

Thanks for reading - we welcome feedback about this newsletter. Xeno (WMF) (talk) 02:52, 21 July 2021 (UTC)[reply]

Anyone forking yet? Equinox 03:12, 21 July 2021 (UTC)[reply]

Vocalisation and diacritics: contrast in Persian and Arabic entries

[edit]

Hi Wiktionary folks, I have a question to ask out of curiosity: I noticed that Arabic entries in Wiktionary provide a vocalised spelling of the headword (as well as an IPA transcription), but Persian entries contain no such version with vocalisation or other diacritics like tašdid (although they do contain an IPA transcription). Why is this so? I am just starting out learning Persian so I'm not familiar with dictionary formats and spelling conventions. Thanks! — This unsigned comment was added by DDR44818 (talk • contribs) at 09:22, 21 July 2021 (UTC).[reply]

@DDR44818 This is I think due to a combination of conventions of the languages in question and choices of the editors involved in working on those languages. Vocalization of Arabic script was invented specifically for Arabic and I think (not sure though) that it's much more common to see it used for Arabic than for Persian. The choice the Persian editors seem to have made is to use the transliteration to indicate the vowels (not always consistently, I must say). Benwing2 (talk) 02:52, 22 July 2021 (UTC)[reply]

A new vote to do exactly what it says on the tin: delete an outdated, unneeded namespace from the wiki. Input is welcome. —Μετάknowledge discuss/deeds 23:47, 21 July 2021 (UTC)[reply]

Nuke nuke nuke. Possibly some of the Chinese input method stuff is savable. Benwing2 (talk) 05:00, 22 July 2021 (UTC)[reply]
Support in principle, but I think it would be good to keep the red links somewhere. PUC – 11:45, 22 July 2021 (UTC)[reply]
I didn't know there was an index. Support then, I guess. But interesting to see the data presented that way. – Jberkel 12:04, 22 July 2021 (UTC)[reply]
Support. Even if this fails, at the very least the index links on the front page should be changed to point to our lemma categories, which I suspect many of our casual users never manage to find. — Vorziblix (talk · contribs) 18:56, 23 July 2021 (UTC)[reply]

Continued abuse of Module:la-pronunc

[edit]

Edit warring on fōrmāticus

[edit]

Just another example that the user in question thinks that edit summaries are battlefields, and after ending my genuine attempt to discuss by abusing and publicly humiliating me they are again trying to taunt me into another squabble. The website cannot function like this. It doesn't even bear mentioning how absurdly trifling their objections are - because nothing is not worth fighting over, and nothing cannot be used to portray me as an ignorant beginner who is wrong. Literally the whole purpose of this user's presence here is to edit war you by reverting within 10 seconds. Please, if you're inclined to ascribe it to some personal squabble that doesn't concern you, I urge you to consider what the website will turn into if this continues unabated; and if you're inclined to blame me as much as that person, I welcome you to have a quick look through my previous BP and TR (and other community) discussions to see what attitude to people and editing I've had all along. This didn't start with me - my beliefs and practices are entirely the opposite of this; it will not end until the person responsible for starting it is dealt with. Brutal Russian (talk) 09:53, 22 July 2021 (UTC)[reply]

You typed all-caps edit summaries in which you screamed at me instead of actually addressing the points that I calmly brought up. It is surprising to me that you insist on challenging the use of an asterisk to mark unattested words or phrases, since that is standard usage in Historical Linguistics. The Nicodene (talk) 10:00, 22 July 2021 (UTC)[reply]
@Brutal Russian: Yes, I remember you as a pleasant - and very knowledgeable - editor. But diagnosing The Nicodene was a very bad move on your part; I fear you've turned off many people and damaged your credibility by doing that. I read your piece a few days ago. I've got no idea whether you're right - you might be - but it certainly didn't leave a good impression on me.
Also the volume of the exchanges has made it very difficult to assess what is what. And it has induced reader's fatigue in me; I am at a point where whenever I see something related to your disagreement, I'm thinking "oh no, not again; let's get away from here".
I wonder if it wouldn't be best for the both of you to give the issue a complete rest for a few weeks. It would give you the time to cool down, and others the time to recharge their batteries. They might be more willing to listen to you and reread some of the stuff then. PUC – 11:24, 22 July 2021 (UTC)[reply]
My sentiments broadly match PUC's; I've been more familiar with Brutal Russian (as a good and knowledgeable editor) than with The Nicodene, and I think The Nicodene is wrong on at least some things (I disagree with the removal of the pronunciation from formaticus and would like to restore that unless there is actually consensus to remove pronunciation information from such entries), but by now the repetitive diatribes between the two have made discussions tiring to wade through and TL;DR. I am aware that in a situation where one person is acting in bad-faith, and provokes a reaction, it'd be bad to act like both are being bad, but my impression is that in this situation both editors hold their positions in good faith / sincerity, and it's just that the positions are irreconcilable. This leads me to think it might be helpful if editors who edit Latin, other than these two editors, could decide what to do about formaticus and about the pronunciation module. - -sche (discuss) 08:03, 23 July 2021 (UTC)[reply]
Hi @-sche.
Note that Wiktionary avoids assigning a 5th century B.C. Attic pronunciation to the following words, which entered Greek in later eras:
δούξ, Κωνστᾰντῑνούπολῐς, Σκλάβος, βίρρος, φραγέλλιον, ἱεράρχης, πάσχα, πίτα
παροικία, Τοῦρκος, κυριακή, σάββατον, στήκω, Ἀλεξανδρέττα, σεβαστοκράτωρ
Formaticus post-dates the Classical pronunciation we have on Wiktionary (which is from about the first century B.C.) by the better part of a millennium.
The term is not used in Modern Latin, nor was it in Renaissance Latin. It occurs a few times around the ninth century in a limited region, clearly a local 'vulgarism'. It seems to me that it would be most appropriate to assign the word a pronunciation from that time and place, or perhaps a couple of centuries earlier. For that purpose one could use Mario Pei's analysis of the pronunciation of Latin in Northern France in the eighth century. The Nicodene (talk) 18:17, 23 July 2021 (UTC)[reply]
A pronunciation is not just how the word was spoken back in the day; it's also how the word is spoken now. A Shakespearean word may have dropped out of the language, but actors still have to pronounce it in the same accent as the rest of their speech. If you're reading a text out loud, or talking in Latin, you shouldn't drop from Classical pronunciation to Eighth Century Northern French pronunciation for one word. --Prosfilaes (talk) 04:12, 25 July 2021 (UTC)[reply]
@Prosfilaes Shakespearian words are still 'alive' in the sense that they are found in some of the greatest works of literature of the English language, ones that still carry a great deal of prestige and are regularly read and quoted from. Formaticus is used a handful of times, as a mistake for caseus, in obscure texts from France circa the ninth century and (as far as I am aware) never again after that, except in modern etymological dictionaries. The Nicodene (talk) 04:38, 25 July 2021 (UTC)[reply]
Which is not the type of fine distinction worth drawing, much less edit warring over. Words from Shakespeare are still alive; what about Dryden or Chaucer? Otway? Chapman? Gosson?--Prosfilaes (talk) 11:07, 25 July 2021 (UTC)[reply]
It doesn't sound like a "fine distinction" to me. I tend to agree with The Nicodene here: I don't think it makes sense to display an anachronistic pronunciation for a word that was essentially a flash in the pan (though it lives on in a few daughter languages) and is unlikely to be used in living Latin circles. PUC – 12:30, 25 July 2021 (UTC)[reply]
Which runs right into the sorites paradox; when is a word a flash in the pan? How unlikely is too unlikely? Fine distinctions are inherent in the problem.
Not to mention that, why do most dictionaries have pronunciation sections? To show how a word was spoken a thousand years ago? No, it's for their readers, should they need to pronounce the word. If you're reading one of those obscure texts from France, do we provide pronunciations for all those words in the way they would have been spoken? No. If you're speaking it out loud to someone else, they would expect it in the normal pronunciation for Latin (whatever is normal for the speakers and listeners, American, classical, Ecclesiastical), not in an ancient French pronunciation, and definitely not just one word in the ancient French pronunciation. --Prosfilaes (talk) 07:22, 26 July 2021 (UTC)[reply]
@Prosfilaes: I'm convinced that edit-warring and eventual self-affirmation through winning is this user's end, and creating more possible points of contention enables them to reach that end. Brutal Russian (talk) 13:09, 26 July 2021 (UTC)[reply]
Of course it suits you to think that I have no possible reason to actually disagree with you on anything. The Nicodene (talk) 18:32, 26 July 2021 (UTC)[reply]
@Prosfilaes It can be solved by simply applying a condition like the one mentioned by Hazarasp. If the word is used by modern writers (say, after 1500) then it counts as modern and is treated accordingly with regard to pronunciation. As it happens, 1500 is roughly the point at which Neo-/Modern Latin comes into being, per Wikipedia at least.
As mentioned, Wiktionary avoids giving anachronistic pronunciations for Byzantine Greek terms; it does not give modern English pronunciations for early medieval English terms either. If one is reading a work from those periods, one is free to assign a modern pronunciation, but I do not think that Wiktionary should 'enshrine' such pronunciations by making them official. The Nicodene (talk) 18:40, 26 July 2021 (UTC)[reply]
The condition mentioned by Hazarasp is that words listed as being part of the language "Middle English" should not have pronunciations as if they were part of the language "English". I understand; I shouldn't have included Chaucer in that list. On the flip side, every word used by Chaucer and similar authors should be listed under "Middle English" and have an appropriate pronunciation. They're specifically listed as part of a different language. There are, by definition, no English terms that weren't used past 1500, but we do provide modern English pronunciations of all English words.
Wiktionary does not make anything official or enshrine anything. Part of our goal as a descriptive and useful dictionary is to describe things like pronunciation and hyphenation as they would be generally accepted in modern use.
If you were to provide pronunciations for all the words in the texts that use this word in the same dialect, I'm not sure I'd agree, but it'd be less of an issue for me. We give Byzantine Greek pronunciations for words used in Byzantine Greek, so it's not like we're providing one pronunciation for one word and pronunciations from a completely different dialect for the surrounding words in Byzantine Greek texts.--Prosfilaes (talk) 16:25, 28 July 2021 (UTC)[reply]
I agree that only having the pronunciation for one word would be anomalous. We could have a sub-module that automatically generates such pronunciations, assuming that Mario Pei provides enough detail in his aforementioned work on the subject. Presumably it would be put in a drop-down menu, as has been done with various Greek pronunciations. The Nicodene (talk) 18:04, 28 July 2021 (UTC)[reply]
As an aside, words "from Chaucer" (it's a bad habit to think of words as being "from" a specific writer in the ME era) shouldn't get Modern English pronunciations unless they appear in post-1500 writers. Hazarasp (parlement · werkis) 12:49, 25 July 2021 (UTC)[reply]
This seems like a reasonable approach to me. If a word is actually used in the modern period, with some reasonable cut-off, then it gets a modern pronunciation. The Nicodene (talk) 17:51, 25 July 2021 (UTC)[reply]

@Thadh, PUC, -sche: Sorry for persisting, but I'll explain why shortly. It appears that some fellow editors view my situation as some sort of back-and-forth shoot-out for which both parties are to blame. In particular, my descriptions of that user's personality have been portrayed as synchronic with and contributing to that user's abusive behaviour. This is emphatically not true - their aggression and abusive intent were evident from the start, even before any discussion started, and what I say in their screenshots is not part of the events that I call attention to. I have been forced to synthesise that user's abusive tendencies into a story with a psychological explanation precisely to counteract the understandable tendency to dismiss individual instances of abuse as ordinary Internet squabbles. Please trust me when I say there is nothing ordinary or isolated about this for me. I'm a person who highly values civility and is endowed with massive patience to handle any disagreement as long as the other party maintains good faith, to which end I constantly employ various conversational cues. I have no trouble discussing things with reasonable people, and I've been doing it successfully here and elsewhere. I'm dismayed that of all places, this is happening to me on Wiktionary.

My psychological explanations are not gratuitous personal attacks - I wrote them specifically for the admins to read, I believe everything I write, and I supply it with evidence at every turn. I also object to the notion that these are dehumanising - I believe that finding an explanation for antisocial behaviour is the reverse of dehumanising and requires understanding the abuser as a human. You may disagree with my reasoning or conclusions, but please do not dismiss what I write as retaliatory abuse. Where I repeatedly call them a narcissist can be an instance of name-calling to the same extent as shouting "wolf" when being assaulted by a wolf. I was seeking help and refuge from the wolf from other villagers, and instead of getting help the wolf continued harassing me with blatant intellectual dishonesty and outright lies (denying direct native-speaker testimony by distorting a source). I wasn't even taking their bait, but replying to another user. I was outraged - how do I escape this?! - and reacted by bluntly and repeatedly stating what I believe is the reality.

Giving it a rest and cooling off is precisely what I cannot afford to do because I am a victim of sustained one-sided aggression with a history of several years, but that had never reached this level before because I was at liberty to act prudently and avoid it. This website has unfortunately offered my abuser an opportunity to corner me and realise their long-term desire. I'm not imagining it, and I emphasise this: on at least one occasion when I protested their appalling attitude against me, they retorted with the same (07:58) "I'm giving you the taste of your own medicine" - their abuse is conscious and purposeful. I made the mistake of walking away from that as well, which has spectacularly backfired. I would need to be pathologically incapable of learning, and would be actively inviting further abuse if I let this slide now. This is why I'm being forced to write these treatises, which I'm not enjoying in the least. But if I don't handle this once and for all, I'm under constant threat of further abuse.

In the edit history of capus is proof that there was unprovoked aggression and animosity from the start: "it is mildly absurd to claim", "beyond ridiculous", "Keep reverting if you like, you will not change anything". I protested this by calling attention to this from fellow editors; instead I immediately received slanderous personal attacks, saying that I deserved the aggression and projecting their own toxicity on me, making their abusive intent undeniable. I ignored them and attempted to extend good faith regardless; in the end I was rewarded with outright personal abuse and direct accusations of ignorance. The goal was to humiliate me and hopefully get a violent reaction in return (which is what that user had been attempting for years, always to no avail), thus bringing the discussion down to the level which that user is capable of handling; or at least to demoralise me and win the argument=war as the last man standing. This personal attack was all the more offensive because it was itself rooted in ignorance (p.460-1). None of this I had provoked or deserved in any way. I invite you to read the entire May BP discussion and see for yourself - I only reply there 4 times and it shouldn't be difficult to follow.

Then came the pronunciation module appropriation, where I also gave up discussion because I had already experienced what would happen if I continued. In doing this, and in refusing to participate in an edit war, and in not raising hell with other editors about this, I ended up enabling them to get their way. Now this continues at fōrmāticus.

To summarise, I have been continually punished by that person for holding to the principles of Wiktionary. They professed an utter disregard for the words of another user, and I want to specifically call everyone's attention to the fact that they've demonstrated plainly a belief that edit warring is a valid means of participation on Wiktionary, and that discussion ought to be conducted via edit warring - they accuse me of refusing to do it right here! Their principle is that the last edit standing is the right edit - and in this they have unfortunately been validated. They have abused my good faith and sabotaged civil discourse at every step. It is my contention that this cannot be allowed to perpetuate. This is an exceptional case, a combination of warlike arrogance, personal hatred, and ignorance of both Wiktionary's mode of operation and the subject matter, compounded by an inability to reason correctly. I'm firmly convinced that the only way to prevent this behaviour from proliferating is to stop it at its root.

This is a person who edit-wars in the most ruthless way, who is the least receptive to other people's words in the rudest possible way, who believes in settling disagreements through edit wars and who sees disagreement as an attack on their person, which they retaliate against with personal abuse. You're damned if you engage with them and damned if you ignore them. Is this the type of editor we want to see on Wiktionary? Is this what we want to see as the website's principle of operation? Whatever reservations you may have about my own behaviour, that has been restricted to BP and user pages. I firmly believe that my attitude on the website proper has been consistently accommodating and always productive. I believe I've demonstrated with this post that the behaviour of my abuser has been destructive and injurious in the abstract, that I have not provoked it and only encouraged it by participating in a discussion. I continue asking that measures be taken to stop them from either further harassing me or wreaking havoc on the website. I'm perplexed to see that my grievances aren't even being acknowledged, and I don't understand what else I can do to be heard. That user has not been deterred from continuing in any way, no one has explained to me how to defend myself in the future, and all I seem to have recourse to is writing these essays in the hope that someone takes me seriously and doesn't dismiss my words. Brutal Russian (talk) 12:44, 26 July 2021 (UTC)[reply]

Considering that you were rude to me in the very beginning–without provocation–and that you have since called me a dog, not to mention numerous other transgressions and personal insults, for which you have even been blocked, I find it difficult to believe your claims of innocent victimhood.
I have already explained why I was not lying about Adams. I will save space by not recapitulating the points here.
Speaking of "intellectual dishonesty", please explain why you claimed that Madeline 2016 describes and exemplifies "the modern English scholarly usage of these terms [Gaul and France]" and presumably supports your claims that Gaul is preferred, when the source does no such thing.
Your own actions constitute edit warring as well. I would much rather have a discussion on talk pages, yet you refused to do so and even screamed at me. Perhaps here, at least, you could finally explain why you are so against using an asterisk to denote unattested words or phrases–which is standard practice–that you removed it eight times.
Ledgeway in no way supports your belief regarding the spoken Latin of the sixth century C.E., which is that "the current thinking in Romanistics is that only animate nouns had [a subject case]." What he points to is a tendency, not absolute, of using the accusative for all nouns, animate ("ipsos lios sedeant") or inanimate, as an unmarked case in later Latin, contrary to your assertion that the phenomenon was complete or only affected inanimates. He also states that in "modern Romance" most traces of the nominative are found in animate nouns, which is true, and in no way incompatible with the observation, confirmable by reading any grammar on the subject, that Old French and Old Occitan still had a nominative case, including for inanimate masculine nouns. This can be seen by perusing entries like vendenge or fer or simply glancing through an early text like the Chanson de Roland or La Vie de Saint Alexis (the latter of which contains an example in its very first line: "bons fut li siecles"). You have misread Ledgeway and come to conclusions contrary to his, as also with Sen (2015). The Nicodene (talk) 18:24, 26 July 2021 (UTC)[reply]
@Metaknowledge @Erutuon (sorry for the ping)
The user has written another ~7500 character wall of text in which they continue with their personal attacks and diagnoses. Specific examples include: "antisocial behaviour", "abuser" [x3], "narcissist", "blatant intellectual dishonesty and outright lies", "rooted in ignorance", and "warlike arrogance, personal hatred, and ignorance...compounded by an inability to reason correctly". The Nicodene (talk) 19:09, 26 July 2021 (UTC)[reply]

Category for "descriptive" terms

[edit]

In Finnish etymology and grammar, there is a category of so-called "descriptive" words. The category consists mostly of expressive words which are not necessarily directly onomatopoeic, but which have onomatopoeic (ideophonic?) elements. Some examples:

  • hölmö (silly, foolish), hömelö (silly person), höperö (senile)
  • löyhkä (stink); there are many dialectal variants, such as löhkä and läyhkä. Possibly also comparable to leyhyä, löyhyä (to waft), likewise seen as "descriptive" by most etymological dictionaries
  • surrata, surista (to buzz (relatively low buzz)), surahtaa (to let out a brief (relatively low) buzz) (these have a common onomatopoeic root sur-)

At times it seems the term is almost used as a bit of a lazy catch-all for words with (as of yet) no other etymology, but I think at least some of the uses are justified. Even the arguably authoritative Finnish grammar has an article for these words: [1].

I think there should be a category for these terms, but I can't think of a good name.

  • "descriptive term/word" already appears to have some other meaning in English, being used of adjectives (in materials for English learners?)
  • There is CAT:Ideophones by language, but that is considered its own part-of-speech (that is most akin to adverbs). The words I'm talking about are nouns, verbs, etc., words just like any other, and the "descriptive"-ness is just an etymological feature. The category I'm looking for or to create would be under CAT:Terms by etymology by language.
  • Apparently the wider phenomenon is called sound symbolism. So "sound-symbolic terms"? It comes across as a bit awkward.

If there already is some existing category that I've missed, please point it out to me. If not, suggestions for the name are welcome. — surjection ?? 10:53, 25 July 2021 (UTC)[reply]

@Surjection: Do you think this (or a similar, less wordy) category could be added to the regular category tree? I would think at least all other Finnic languages (Karelian, Ingrian) have similar structures, even following the same 'roots', so it would be handy to be able to list such terms for those languages as well. Thadh (talk) 19:52, 26 July 2021 (UTC)[reply]
I envisioned Category:Finnish descriptive terms (which I just added) as a stop-gap until there's some proper name that we can add to the category tree. "expressive terms"? That seems to be used more universally, including when talking about English words, but might still be ambiguous. "terms of expressive origin"? — surjection ?? 20:31, 26 July 2021 (UTC)[reply]
What about "List of [LANG] terms by phonestheme"? (wiki article). This way, we can create something like {{phonesthemic|f-o-o}} which produces a text like, say, "Descriptive formation from the phonestheme f-o-o", f-o-o being a phonestheme, like I imagine hö- (?) for hölmö, hömelö and höperö. This way the terms will also be sorted out by root, rather than all together. It's just a suggestion though, I can see a lot of (dis)advantages either way. Thadh (talk) 21:00, 26 July 2021 (UTC)[reply]
Sort of possible, but I'm not sure I'd want to be the one divvying up the entries for a categorization like that for any language. It sounds like a bit of a minefield. — surjection ?? 21:06, 26 July 2021 (UTC)[reply]
@Surjection I would have zero idea upon visiting Category:Finnish descriptive terms what it refers to, so I am not in favor of this terminology. I would prefer either Category:Finnish sound-symbolic terms or Category:Finnish phonosemantic terms or similar; at least you can Google "sound-symbolic" or "phonosemantic" to find out what it refers to. However, maybe the bigger issue is the choice to make "ideophone" a part of speech. It seems strange to me to do this, esp. since according to Wikipedia only some languages with ideophones have them as separate grammatical categories. Can they really not be categorized as adverbs or interjections? Benwing2 (talk) 01:15, 27 July 2021 (UTC)[reply]
That's why the category has a description in it. If either of the ones you linked (or something else entirely) gets added into the category tree, I'll happily merge the category I created into that. — surjection ?? 10:09, 27 July 2021 (UTC)[reply]
@Surjection See Category:Finnish sound-symbolic terms. Benwing2 (talk) 03:37, 29 July 2021 (UTC)[reply]
@Benwing2, Surjection: Do you think we can maybe also create a template like {{sound symbolic}}, following {{onomatopoeic}} (and add 'sound symbolism' to the glossary)? I think that may be handy, rather than adding the category manually and just giving 'of descriptive origin'. Thadh (talk) 11:00, 29 July 2021 (UTC)[reply]
@Thadh Done. I'm not sure whether the text generated by {{sound symbolic}} should have a hyphen in it; I made it have a hyphen but maybe it shouldn't. Benwing2 (talk) 21:27, 31 July 2021 (UTC)[reply]
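For illustration only, here is a minimal Python sketch of what a template like {{sound symbolic}} might do: emit the etymology text and add the category in one step, so that editors do not have to categorise manually. The function name, the optional phonestheme argument and the per-phonestheme category name are assumptions made for this sketch; only the category name "Finnish sound-symbolic terms" comes from the discussion above, and the real template may work quite differently.

```python
def sound_symbolic(lang_name, phonestheme=None):
    """Return (etymology_text, categories) for a sound-symbolic term.

    Hypothetical sketch: not the actual {{sound symbolic}} implementation.
    """
    text = "Sound-symbolic"
    # Category name as in the discussion, e.g. "Finnish sound-symbolic terms".
    categories = [f"{lang_name} sound-symbolic terms"]
    if phonestheme:
        # Assumed extension along the lines of the phonestheme suggestion:
        # also sort the term under its root.
        text += f", formed from the phonestheme {phonestheme}"
        categories.append(f"{lang_name} terms with the phonestheme {phonestheme}")
    return text + ".", categories


text, cats = sound_symbolic("Finnish", "hö-")
print(text)
print(cats)
```

Leaving the phonestheme optional would let editors use the plain category where dividing entries up by root is too uncertain.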

Proposal to Change Wiktionary Logo (slightly) to use 維

[edit]

I would like to propose that the Wiktionary logo be modified slightly to use the character 維 instead of 维 (ie, change the logo to use the traditional form of the character rather than the simplified form).

My reasonings (in no particular order) are:

  • English Wiktionary has voted in the past to use traditional Chinese characters for all Chinese entries, for reasons such as greater clarity and less ambiguity (eg, ). Although the current character is not ambiguous, it would be nice for the logo to be more consistent with the entries in Wiktionary itself; indeed, the page for 维 is essentially just a redirect to 維.
  • The character 維 is the one used on the w:Wikipedia logo.
  • Using 維 has the added benefit of being the same form of the character used in other languages such as Japanese and Korean, which did not simplify 維. The character is actually common in Japanese in addition to Chinese, for example in the terms 維持 and 明治維新. (In fact, the 維 used on the Wikipedia logo appears to be set in the Japanese/Korean font style.)
  • There is no loss of legibility associated with making the change, as everyone who can read 维 can also read 維.

(As an aside, I am not sure if this is the right place to make such a suggestion. I've looked at pages on Meta-Wiki [2] [3] [4] [5] but I'm not sure where to begin, and I figure a discussion is better than a vote here. I notice that User:Miborovsky had the same suggestion back when the logo was first proposed by @Smurrayinchester at [6], but nobody seems to have acknowledged the comment.)

I'd love to hear comments and feedback, or if I should be bringing this proposal up somewhere else. Thanks, ChromeGames923 (talk) 22:17, 25 July 2021 (UTC)[reply]

I like this idea; if this change isn't accepted, then I think an even more fun change would be to change the character to the Small Seal script form (pictured here). --Geographyinitiative (talk) 22:36, 25 July 2021 (UTC)[reply]
I like this as well. You'll get a lot farther with this if you can make or get someone else to make a version of the logo with the Traditional character. This is such a minor change that it's possible we could do it without a vote if there is unanimous support. —Μετάknowledgediscuss/deeds 04:23, 26 July 2021 (UTC)[reply]
I'd support this as well. To make it more inclusive, the radical on the left should be 糸 (following the Kangxi shape), like in the Wikipedia logo. — justin(r)leung (t...) | c=› } 04:38, 26 July 2021 (UTC)[reply]
Yes, this. To hopefully clarify things a bit, the radical on the left has been written and printed like what is shown in the upper row in this table: external link. That is, the lower three strokes are from the left to right 1) descending stroke from upper-right to lower-left, 2) vertical stroke, and 3) a somewhat symmetric upper-left to lower-right descent. In contrast, many fonts generate by default what looks like the central image in the lower row in the linked page, i.e. the "new" glyph shape (subject to a fair amount of controversy). --Frigoris (talk) 17:53, 27 July 2021 (UTC)[reply]
I think this is a really good suggestion, to have it look like instead of ChromeGames923 (talk) 18:45, 27 July 2021 (UTC)[reply]
I think File:WiktionaryEn - DP Derivative.svg is the file that is currently used as the English Wiktionary logo (although not directly but in PNG form), so that's the one to change. — surjection ?? 15:09, 26 July 2021 (UTC)[reply]
Does anyone know the font used for the CJK characters in this logo? @Smurrayinchester perhaps? One of the description pages says it uses Lucida Bright, but that font doesn't seem to have CJK characters. — surjection ?? 11:18, 27 July 2021 (UTC)[reply]
Seems like a commonsense proposal. Andrew Sheedy (talk) 18:08, 26 July 2021 (UTC)[reply]
I am in support of this (and not sure why some people are making light of it). Benwing2 (talk) 01:03, 27 July 2021 (UTC)[reply]
The font in the current logo is not the free Noto Serif SC typeface, but the character in the semi-bold (600 weight) font looks at least as acceptable to me.  --Lambiam 12:18, 27 July 2021 (UTC)[reply]
@Lambiam: If the licensing is a problem, perhaps this alternative font may be considered: I.Ming License at GitHub (scroll down for English). The license places no restriction on "Digital Content" (i.e. shapes, images, prints generated by the font). This font may not be the most refined in terms of aesthetics, but it sort of works if you want to show the glyph-level "traditional" traits in the style of the Ming typeface. --Frigoris (talk) 17:53, 27 July 2021 (UTC)[reply]
@Lambian, Frigoris: Noto Serif KR should also work if we want to take my suggestion of following a more traditional glyph. — justin(r)leung (t...) | c=› } 23:23, 27 July 2021 (UTC)[reply]
I think the Source Han/Noto CJK fonts are exceedingly boring in terms of design. I like the suggestion of I.Ming. —Suzukaze-c (talk) 00:31, 28 July 2021 (UTC)[reply]
@Lambiam, Suzukaze: I just noticed that the current glyph on the logo seems to be a bold variant. Before using that, it's perhaps best to check if I.Ming has bold glyphs at all (possibly not). --Frigoris (talk) 07:03, 28 July 2021 (UTC)[reply]
@Lambiam, Frigoris, Justinrleung, Suzukaze: I really like I.Ming but I did not find it to have a bold version. If that is a concern, I am aware of these two Ming style fonts that do come in bold: [7] and [8], though it's not abundantly clear to me what the difference between them is. They are also based on Source Han Serif, so I'm not sure if you would call them boring. That being said, since we are just looking for the character, these fonts seem sufficient, and I think the licenses would be usable. However, I tried to open the svg and see what changing the character would look like, but it seems to be more complicated than I thought; when I clicked on the character it brought up many small points and not just a text character to replace. Any suggestions on how to do that? ChromeGames923 (talk) 22:13, 30 July 2021 (UTC)[reply]
There are free online image vectorizers that turn a png image into svg code for a collection of filled polygons, each represented as a path (a sequence of points). The output code will have to be embedded in the logo code with appropriate scaling, translation and rotation. 07:26, 31 July 2021 (UTC) — This unsigned comment was added by Lambiam (talk • contribs).
@ChromeGames923: I may be able to help, although I haven't done SVG stuff in a while. Do you have the link to the file from which I may proceed? --Frigoris (talk) 09:15, 31 July 2021 (UTC)[reply]
@Lambiam, Frigoris: There are several versions of the logo, such as with/without text and with/without background tiles. The most relevant file, c:File:WiktionaryEn - DP Derivative.svg, is already in svg format. Another one is c:File:Wiktionary-logo-en-v2.svg, which omits the caption for when the words would be too small to read. ChromeGames923 (talk) 18:11, 31 July 2021 (UTC)[reply]
@ChromeGames923, I'll take a look but I can't make any commitments. Meanwhile if anyone feels like drafting the logo, please help, thanks! --Frigoris (talk) 18:04, 1 August 2021 (UTC)[reply]
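As a rough illustration of the embedding step mentioned above (vectorized paths placed into the logo with scaling, translation and rotation), here is a small Python sketch that wraps path data in an SVG group with a transform attribute. The path data, coordinates, angle and scale factor are placeholder values, not measurements taken from the actual logo file, and the function name is invented for this sketch.

```python
def embed_glyph(path_data, x, y, scale, rotate_deg=0):
    """Wrap vectorized glyph paths in an SVG <g> element.

    SVG applies the transform list right to left: the glyph is first
    scaled, then rotated, then moved to (x, y).
    """
    transform = f"translate({x} {y}) rotate({rotate_deg}) scale({scale})"
    paths = "\n".join(f'  <path d="{d}"/>' for d in path_data)
    return f'<g transform="{transform}">\n{paths}\n</g>'


# Placeholder path standing in for vectorizer output.
glyph_paths = ["M0 0 L10 0 L10 10 Z"]
print(embed_glyph(glyph_paths, x=120, y=80, scale=0.25, rotate_deg=-15))
```

The resulting group could then be pasted into the SVG in place of the existing outline of the character, with the numbers adjusted by eye until the new glyph sits on the tile.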

spelling reform hints allowed, appropriate?

[edit]

Would it be allowed and appropriate to list alternate spellings that came out of spelling reforms, such as those described at https://en.wikipedia.org/wiki/English-language_spelling_reform, like frend and hed? --ThurnerRupert (talk) 22:15, 27 July 2021 (UTC)[reply]

Only if they can be attested as having been used – and not merely proposed – in accordance with our criteria for inclusion.  --Lambiam 06:25, 28 July 2021 (UTC)[reply]

Simultaneous suffixes and gendered variants

[edit]

138.23.68.27 (talk) is making strange edits to Latin entries, not immediately obviously bad, but resting in my opinion on erratic evaluation of conflicting interests. Why, for example, is he changing {{af|la|term|-olus|alt2=-ola}} to {{af|la|term|-ola}} and delinking the first suffix in a term having been derived by multiple suffixes simultaneously, claiming the resulting term should not appear in the category of the first? Surely it should appear there, and the links are now wrong due to not linking the Latin entries specifically? And we shan’t categorize the same suffix in different gendered variants? Similarly it would be very bad to have the same Turkic suffix in different vowel forms that depend on the vowel of the preceding word. @Metaknowledge, Brutal Russian. Fay Freak (talk) 23:49, 29 July 2021 (UTC)[reply]

I've seen conflicting treatments of this sort of thing. For Irish, I try to keep all the various allomorphs of a suffix together in one category, but for English, we seem to have separate categories for different spellings of the same suffix (e.g. CAT:English words suffixed with -ey vs. CAT:English words suffixed with -y). I agree it's better to have one category for one suffix regardless of spelling and gender and the like, but I'm not going to attempt to do that for English, or any other language where splitting them is well established practice. For Latin, if the status quo is to have one category for all genders of a particular suffix, then the IP shouldn't try changing that without first getting consensus. And there's absolutely no excuse for bare links or no links in "''[[citrus]]'' + ''-ulus'' {{suffix|la||lus}}". —Mahāgaja · talk 07:30, 30 July 2021 (UTC)[reply]
What reasoning is there to treat -olus and -ola as different inflections of the same suffix? This is a noun, and nouns only have one gender, unlike adjectives, which do have gendered forms. —Rua (mew) 15:30, 31 July 2021 (UTC)[reply]
@Rua: The same way adjective suffixes do not have gender. This is seen as not having inherent gender: the gender comes from the noun it is added to. Fittingly someone sorted -ulus under Category:Latin adjective-forming suffixes instead of Category:Latin noun-forming suffixes. Anyway it is the general idea of allomorphs. (We have defined allomorph and complementary distribution too narrowly by phonological criteria: a mistake German Wikipedia and Wiktionary do not make.) Fay Freak (talk) 18:32, 1 August 2021 (UTC)[reply]
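To make the difference between the two template calls concrete, here is a simplified Python model of the behaviour being argued over: the suffix parameter decides the category, while an alt parameter changes only what is displayed. This is a sketch under assumptions, not the code behind {{af}}; the function name is invented, the language is passed by name rather than by code, and the category pattern is modelled on the "words suffixed with" categories named above.

```python
def affix(lang_name, parts, alts=None):
    """Return (display_text, categories) for an etymology line.

    parts: the pieces as given to the template, e.g. ["term", "-olus"]
    alts:  optional display overrides keyed by 1-based position,
           e.g. {2: "-ola"} to mimic alt2=-ola
    """
    alts = alts or {}
    shown, categories = [], []
    for index, part in enumerate(parts, start=1):
        # Display the override if one exists, otherwise the part itself.
        shown.append(alts.get(index, part))
        # Categorise by the part as given, never by the display override.
        if part.startswith("-"):
            categories.append(f"{lang_name} words suffixed with {part}")
    return " + ".join(shown), categories


print(affix("Latin", ["term", "-olus"], {2: "-ola"}))
print(affix("Latin", ["term", "-ola"]))
```

In this model both calls display "term + -ola", but the first files the entry under -olus together with the other gendered variants, while the second creates a separate -ola category.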

Strange transliteration formatting

[edit]

I noticed that on ᚱᚢᚾᛋᛉ (runsʀ) and various other Proto-Norse entries, people have been including two different transliteration schemes, putting one of them in bold and the other in italic. This is quite confusing to the user, and it's not how it's formatted for any other language, so it's inconsistent with the rest of Wiktionary. A single transliteration is plenty, and we already have Module:Runr-translit, so autotransliteration is totally fine for these entries just as it has always been. In fact, in this entry, it appears that it has been spelled with the wrong letter, using another letter with a similar shape, therefore breaking the autotransliteration. —Rua (mew) 17:41, 31 July 2021 (UTC)[reply]

This is because the first scheme is the actual transliteration, while the second is the normalised form. Because these two often diverge (runes do not mark vowel length, often include svarabhakti vowels and have no word separation), it is necessary to include both for greater readability. This is done by real academics, as can be seen on this page listing interpretations of a runic inscription, which I linked in an edit summary. One could of course use the ts= parameter for this; however, it has slashes (/.../) around it, which makes it seem like an IPA pronunciation, which it is not. The real solution to this would be a new parameter nr= for normalisation, which @Sartma has also proposed, but as long as that doesn't exist I'll continue using this method.
The bold script comes from the academic standard of transliterating runes in bold font, something that is done almost universally. I would recommend reading this short introduction to runology, originally written by Klaus Düwel and translated into English, which goes into more detail about all of this. But for a short quote: "The conversion of a runic character into a Latin letter is called transliteration, and such transliterations are printed in bold type."
Further, you say that the "autotransliteration is totally fine for these entries", but this is only true for the inscriptions of the earliest period in the classic elder futhark (roughly 200-450). In the later Proto-Norse period, known as the traditional period, the old j-rune begins to represent the a-sound, usually in the shape ᚼ, but in the Istaby inscription as ᛋ. Since these do not exist separately in unicode, I have used the letters from the younger futhark which are identical in shape. Similar problems exist in cuneiform languages, and you will see that they have solved it in basically the same way as I. 𒀀𒀀𒀸𒊭 𒋫𒀊𒉺𒀸𒊭
To conclude, my goal here is to make Wiktionary more consistent with actual academic writing, and to make it as useful as possible. I am not sure what exactly yours is. ᛙᛆᚱᛐᛁᚿᛌᛆᛌ Wiktionary's most active Proto-Norse editor Ask me anything 18:05, 31 July 2021 (UTC)[reply]
I don't think Wiktionary should have bold transliterations for this one language just because that happens to be the practice in other works. Wiktionary is internally consistent when it comes to styling, and we have common templates and style sheets to make sure of this. Deviations from that should be done with Wiktionary consensus, which I don't think you bothered to form.
Regarding the use of the S-rune for A, I think the normal practice in Unicode is to encode the meaning, or basically the intended letter, and leave graphical variations to fonts. For the same reason we have adopted left-to-right Italic script even when some inscriptions are right-to-left, and we use the regular Greek sigma even where a lunate sigma was used (which has a separate codepoint for some reason).
Rua (mew) 11:43, 1 August 2021 (UTC)[reply]
How could I have "formed a consensus" when I was the only person contributing to this language for almost a year? If you want to revert 100s of hours of work for reasons of petty bureaucracy, go ahead; you're the admin.
Regarding the use of the s-rune, I think you have a point. However, if we are to move the pages to ᚱᚢᚾᚼᛉ and others, there should be a note of some kind describing the unusual shape of the rune. ᛙᛆᚱᛐᛁᚿᛌᛆᛌ Wiktionary's most active Proto-Norse editor Ask me anything 12:43, 1 August 2021 (UTC)[reply]
Whether the editor questioning this departure from the usual transliteration practice (i.e., usual for other scripts) is an administrator or not is completely irrelevant; introducing this in the discussion is uncalled for.  --Lambiam 22:41, 1 August 2021 (UTC)[reply]

Pashto etymologies being removed

Category:Empty categories contains a lot of 'Pashto terms derived from X' categories.

This means someone has been actively removing etymology links from Pashto pages. I see a lot of recent edits by User:SAb54iudwe1, all of whose edits are on July 27 but who is clearly not a new user given their familiarity with Wiktionary syntax and categories. However, I'm not sure if all these emptied categories are due to this user, or if the user's edits are legitimate. Can someone help me out here? Benwing2 (talk) 20:58, 31 July 2021 (UTC)[reply]

I inspected a few (~30) of the user's edits, and I see they sometimes expand etymologies in legitimate ways, but often contract etymologies when chains of ancestor terms are involved (example 1 removing French/Frankish/Proto-Germanic terms, example 2 removing a Latin term). It might be a good idea to explain the value of such chained ancestors to them...--Ser be être 是talk/stalk 02:18, 1 August 2021 (UTC)[reply]
These long etymology chains were directly copied from the borrowed terms' entries that may have been edited since then. I mean, let's review the old version of "کوټ", where you will find the etymology copied from an old version of "coat", while the current version of coat has its etymology rewritten. I think it's enough to link one source term with some cognates in neighbouring languages. But besides that, I don't have anything against those chains and may keep them if you would like.
For borrowings from Hindustani, I use {{bor|ps|inc-hnd}} and then give both Hindi and Urdu spellings. The Pashto terms were previously labelled as borrowings from Hindi or, sometimes, Urdu. If you still insist on using a specific language, I think the Urdu terms should be given as sources rather than the Hindi terms, purely because of the geographical proximity. Unfortunately, the Hindi entries in Wiktionary are usually more developed than their Urdu counterparts, and for some Hindustani terms there's only a Hindi entry created. So I think my approach is more reasonable.
But I admit I definitely went a bit overboard when I replaced Template:inherited with Template:derived in terms that could be traced back to proto-languages, like in لاس. SAb54iudwe1 (talk) 14:41, 2 August 2021 (UTC)[reply]
This gets at an ongoing problem: it's informative to have categories for e.g. Pashto terms derived (however indirectly) from Tupi, but duplicating chains on all pages causes problems; as SAb54 points out, when the etymology is changed at the "main" (e.g. English) entry, changes may not be made on other (e.g. Pashto) entries. Even the recent development of a template that supplies the categories without requiring the spelling and meaning of the Tupi word to be duplicated everywhere doesn't help when the change is to a language code, e.g., it turns out English got it from another language, not Tupi, so the English entry gets updated, but Pashto (which got it from English) still claims to come from Tupi. I can think of ways we could get around this, but they're inconvenient (e.g., put each step of each "long" etymology in a template) and/or Lua-memory-intensive. IMO the ideal might be a system where Pashto can just point to English foo, which can just point to Spanish fu, which can just point to Tupi, but Pashto would automatically get the "derived from Tupi" category and this would change if the English or Spanish entries were changed... but my understanding is that we can't currently do this in a way that handles an English entry with two etymology sections and doesn't require a ton of Lua memory. - -sche (discuss) 04:03, 4 August 2021 (UTC)[reply]