Wiktionary:Beer parlour/2020/September: difference between revisions

Revision as of 07:56, 26 September 2020

Is it non-controversial to run bot-tasks to apply the conventions at WT:NORM?

As of the last XML dump, there are 88,299 entries that violate WT:NORM in ways that Special:AbuseFilter/103 detects. (Of these, 74.6% violate the "One blank line before all headings, including between two headings, except for before the first language heading" rule, and 44.7% violate rules besides that one. There's overlap, obviously.)
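The size of that overlap follows from the two percentages. Every one of the 88,299 entries violates at least one rule, so the two groups together cover 100% and inclusion-exclusion gives the share in both. A minimal sketch of the arithmetic, using only the figures quoted above (the percentages are rounded, so the counts are approximate; the variable and function names are illustrative):

```python
# Figures quoted in the post above. Percentages are stored in tenths of a
# percent (per mille) so the arithmetic stays in exact integers.
TOTAL_ENTRIES = 88_299
BLANK_LINE_RULE = 746  # 74.6% violate the blank-line-before-headings rule
OTHER_RULES = 447      # 44.7% violate at least one other rule


def overlap_permille(a: int, b: int) -> int:
    """Inclusion-exclusion when the union is everything: |A and B| = |A| + |B| - 1000."""
    return a + b - 1000


both = overlap_permille(BLANK_LINE_RULE, OTHER_RULES)
only_blank_line = BLANK_LINE_RULE - both
only_other = OTHER_RULES - both

for label, share in [
    ("both kinds", both),
    ("blank-line rule only", only_blank_line),
    ("other rules only", only_other),
]:
    print(f"{label}: {share / 10:.1f}% (~{round(TOTAL_ENTRIES * share / 1000):,} entries)")
```

On these figures roughly 19.3% of the flagged entries break both the blank-line rule and some other rule, and a little over half break only the blank-line rule.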

There are also probably many entries that violate WT:NORM in ways that Special:AbuseFilter/103 does not detect; I haven't checked.

Is it non-controversial to run bot-tasks that address violations of WT:NORM? Or do we need individual discussions for different violations and how to bot-address them?

Are there any best practices I should follow for such tasks, or pitfalls I should know about?

—Ruakh_TALK 06:14, 1 September 2020 (UTC)[reply]

To answer your first question: non-controversial, go for it. —Justin (koavf)❤T☮C☺M☯ 06:45, 1 September 2020 (UTC)[reply]

Agreed! We could really use more people like you volunteering for boring bot jobs! (There are some funky Chinese and Japanese entries using {{zh-see}} and {{ja-see}}, but I don't think they technically break any NORMs; they're just worth being aware of.) —Μετάknowledge^{discuss/deeds} 06:49, 1 September 2020 (UTC)[reply]

OK, sounds good; thanks! —Ruakh_TALK 21:50, 1 September 2020 (UTC)[reply]

@Ruakh: User_talk:Erutuon#ToilBot_"Normalizing"_Vandalism (@Erutuon) —Suzukaze-c (talk) 03:24, 3 September 2020 (UTC)[reply]

Thanks for the heads-up! It sounds like Erutuon's bot was specifically targeting recently-edited pages, hence that problem, and that he fixed it by changing it to instead target pages that received edits between one and thirty days ago. (Please correct me if I'm wrong.) If so, then my bot already wouldn't cause that problem, because I use the twice-monthly XML dumps to find the pages to edit, so there's a delay of much more than a day between when the NORM-violating entry was captured in the XML dump and when the bot retrieves and edits it. (Of course, it could still happen by random chance that it edits a page that was recently vandalized, but then, the same is true of Erutuon's updated bot as I understand it. And for that matter, the same is true of any other bot; my {{t}}/{{t+}} updater could similarly edit a recently-vandalized page. So I'm not too worried about this. But it shouldn't be too much work to change the bot to skip pages with recent last-edited timestamps, so, sure.) —Ruakh_TALK 06:15, 3 September 2020 (UTC)[reply]

For what it's worth, the current version of my bot script is here. It won't edit pages where the latest revision is more recent than one day ago. This feature is provided by the Recent Changes API (see rctoponly). That won't be useful for your bot, though, since it's pulling from the dump. — Eru·tuon 06:44, 3 September 2020 (UTC)[reply]

Thanks for the link. My bot takes a different approach, obviously; it just retrieves the page, and if it sees that it was edited less than 24 hours ago, it skips it without editing. —Ruakh_TALK 08:40, 7 September 2020 (UTC)[reply]
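The check described in that last comment (retrieve the page, then skip it if its latest edit is under 24 hours old) can be sketched as below. This is only an illustration, not Ruakh's actual code: the function name is hypothetical, and how the last-edited timestamp is obtained from the wiki is left out, since the thread does not say.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the comment above: skip pages edited less than 24 hours ago.
MIN_AGE = timedelta(hours=24)


def should_skip(last_edited: datetime, now: datetime, min_age: timedelta = MIN_AGE) -> bool:
    """Return True if the page's latest revision is younger than min_age.

    Both timestamps must be timezone-aware (or both naive) so they can be
    subtracted; wiki timestamps are normally UTC.
    """
    return now - last_edited < min_age


# Example: a page last edited 3 hours ago is skipped, one edited 2 days ago is not.
now = datetime(2020, 9, 7, 8, 40, tzinfo=timezone.utc)
print(should_skip(now - timedelta(hours=3), now))
print(should_skip(now - timedelta(days=2), now))
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the check easy to test against fixed timestamps.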

The "WT-NORM" alert is a perpetual annoyance, all the more so because the message does not actually say what the problem is, so it is impossible for anyone with normal patience to fix it, when nothing visibly appears wrong. Any automated process to eliminate this useless irritation would be welcome. Mihia (talk) 22:03, 11 September 2020 (UTC)[reply]

Sorry ... I think I do now remember someone saying that "WT-NORM" was useful to identify crap random edits. If so then I stand corrected, but for me personally it is just a stupid irritation because it does not actually tell me what I have done wrong. Mihia (talk) 22:26, 11 September 2020 (UTC)[reply]

"WT-NORM" should be broken down to different tags detailing the actual problem. 恨国党非蠢即坏 (talk) 06:21, 15 September 2020 (UTC)[reply]

I agree, though according to a previous explanation, I seem to remember also that many WT:NORM "problems" are totally anal from the user perspective, such as might be silently auto-corrected, if for some reason they have a system importance. Mihia (talk) 22:04, 18 September 2020 (UTC)[reply]

I'd say they're pretty much all anal. Personally, as the creator of the filter, I've been in favor of leaving the filter but removing the tag, but for some reason haven't done it yet. That way would be totally invisible to most users but users who know how to could find edits that matched the filter. But perhaps the filter should be gotten rid of and we should only be looking at the dump to identify WT:NORM violations. — Eru·tuon 23:39, 18 September 2020 (UTC)[reply]

I would definitely support that -- that is, make "WT:NORM" invisible to ordinary users but accessible to editors who care. Mihia (talk) 10:39, 19 September 2020 (UTC)[reply]

Format for thesaurus pages

On Thesaurus pages, lists of synonyms are currently wrapped in {{ws beginlist}} and {{ws endlist}}, with items given with {{ws}}. {{ws}} links to the WS page for the argument, if such a page exists. However, this was clearly designed with a monolingual thesaurus in mind. On Thesaurus:da:nonsense, you can see that it links to a Polish page. I think it would be better to have a single template {{ws list}} similar to {{col3}} that takes a language code, and then as many terms as needed -- of course, the current format for auto-linking only works if Thesaurus entries are entered under a native synonym like Thesaurus:da:fuld or Thesaurus:god (with or without the language code). Knowing the language might also allow us to do some other things, although I can't currently think of any.
Additionally, most Thesaurus pages are not currently in a subcat of Category:Thesaurus entries by language. Most of these are English, but far from all. I added a lang parameter to {{ws header}} some time ago that categorizes. Would someone get a bot to do this?
@Dan Polansky I assume you probably have opinions about this.__Gamren (talk) 23:37, 2 September 2020 (UTC)[reply]

Using User:AutoSkull for automated surname edits

Having had a decent handful of experience with Python coding at this point, I just started messing with pywikibot, with which I am building a potential Wiktionary bot that automates edits to surname entries, and also would automate their creation. I've already been using it on my main account (see some of my recent contributions) for slower semiautomated edits, and just today I had the idea to move the testing and operations of this code to my new AutoSkull account. There are definitely still some tweaks and problems I'm working out, but in the state it's currently in, it could deal with most surname entries pretty well...but obviously most isn't quite good enough.

The tasks it will be able to perform when it is finished are currently listed on the bot account's user page. Basically, though, it will pull from lists of verified surnames and search them on Wiktionary to see if they have English entries here yet. If there is no entry, the bot will just create it. If there is an entry, the bot will decide what to do from there.

It's worth noting that among its many surname-related tasks my bot will be editing currently existing surname pages to make them a bit more complete. It will be adding plural forms to the template {{en-proper noun}} according to the consensus on how surnames should be inflected in English, with a few exceptions (see the last bullet point on User:AutoSkull#English surnames). Entries for plural inflections of surnames will also be added in large numbers. It will also add relevant Wikipedia disambiguation page links for all our surname entries when such a page exists.

I won't share my code yet as it's not in a finished state, but when it is I will. I will also at that time share a large series of edits made perhaps in AutoSkull's userspace subpages, that emulate various different wild circumstances the bot may encounter when unsupervised, to prove it won't just be wreaking havoc here. But even so, I wanted to go ahead and let the community know about the fact that I'm coding and testing with this, as I suppose that's a predecessor to a bot status vote, which I'll start in the near future. I'm really hoping with this project I can help get Wiktionary's coverage of surnames to be pretty lengthy. Let me know of any suggestions or comments. PseudoSkull (talk) 03:16, 3 September 2020 (UTC)[reply]

@PseudoSkull This sounds fine to me. If you need specific help, let me know ... I've written over 400 scripts by now to do all sorts of things on Wiktionary. These all use pywikibot and (usually) mwparserfromhell, which has proven to be a great combination. For example, one of my most productivity-enhancing scripts has turned out to be a script I wrote called find_regex.py, which outputs a text file consisting of subsets of pages (either the entire page or one language section) matching a given regex, based off of a category, references to a given page, a fixed list of pages, or a Wiktionary dump. I can then edit the text file, either by hand or using a purpose-written script, and push the resulting changes back to Wiktionary using another script push_find_regex_changes.py. This makes it possible to quickly do all sorts of manual and semi-automated changes. Benwing2 (talk) 06:47, 13 September 2020 (UTC)[reply]

Old Korean lemmas with direct attestation are in the reconstruction namespace

The two egregious examples are the genitive 叱 and the topic-marking 隱, both of which are omnipresent in the surviving Old Korean corpus. In the case of 叱, for example, the interpretive gugyeol data makes it undeniable that 叱 (or abbreviated forms) acts as a genitive:

天人^叱供 is used to gloss a Chinese phrase in the Avatamsaka Sutra that means "provisions of the heavenly ones"
佛^叱國土 is used to gloss a Chinese phrase in the Humane King Sutra meaning "territory of the Buddha's country"

And so forth. These forms are thus attested, there being universal scholarly consensus about their semantic value, and do not belong in the reconstruction namespace per WT:RECONS. What is reconstructed about them is their phonetic value, but this can be marked with an asterisk while the terms themselves (in the hanzi-based orthography) are moved to the normal entry namespace.--Karaeng Matoaya (talk) 08:28, 4 September 2020 (UTC)[reply]

Support, agree with all points. —Suzukaze-c (talk) 03:24, 5 September 2020 (UTC)[reply]

@Quadmix77, who created these entries. —Μετάknowledge^{discuss/deeds} 05:22, 5 September 2020 (UTC)[reply]

Support per above. --沈澄心 ✉ 11:26, 5 September 2020 (UTC)[reply]

In the absence of further input, I'm making mainspace entries for 叱 and other attested OK grammatical particles.--Karaeng Matoaya (talk) 00:47, 7 September 2020 (UTC)[reply]

@Karaeng Matoaya: Please mark the duplicate entries with {{d}} and an explanation (or just a link to this discussion) once you've made the entries and fixed all incoming links. —Μετάknowledge^{discuss/deeds} 02:04, 7 September 2020 (UTC)[reply]

@Metaknowledge: Done.--Karaeng Matoaya (talk) 13:04, 7 September 2020 (UTC)[reply]

Draft proposal for pre-c. 1910 Korean forms (Old, Middle, Early Modern)

Hi everyone,

After some talks with @Suzukaze-c, I've drafted a brief sketch of how to deal with pre-contemporary Korean forms at User:Karaeng Matoaya/Draft.

This will probably be moved to Wiktionary:About Korean/Historical forms if people don't hate it too much. The main features include:

The use of the new periodization for Old Korean, in which texts up to c. 1300 are considered examples of OK. This is the growing consensus in South Korean academia and has a number of advantages compared to the traditional periodization still used in many Western sources, which wasn't really evidence-based in the first place.
Only forms attested in actual Old Korean texts are considered valid entries, which won't affect anything except 波珍, which should be deleted as a proper noun-based reconstruction. Also added some preliminary standards for disputed OK entries.
The use of the three-way periodization of Korean given by ISO 639-3: OKO for Old Korean, OKM for Middle Korean, and KOR for Early Modern and Modern Korean. This means that Korean forms attested between 1600 and 1900 share the KO language code together with contemporary forms, and are modified with obsoleteness templates instead. (Previously the very few EMK entries that existed seemed to be grouped together with Middle Korean forms, but this is problematic given academic consensus that MK ends in c. 1600; if we want to separate EMK from Contemporary Korean, the best way to do that is to create a new language code specifically for EMK.) Some examples of new EMK entries are at 뉴#Etymology 3 and ᄯᅡᆼ.

Thoughts?--Karaeng Matoaya (talk) 13:22, 7 September 2020 (UTC)[reply]

Looking over your draft, I have a few questions / comments.

In the Chinese wordlists section, you state, "references to these wordlists are strongly recommended in the Phonology sections of Old Korean entries, and in the Etymology sections of Middle and Modern Korean entries." I'm not quite clear on how you mean this. Presumably this recommendation is only for those terms that have alternative forms that appear in the Chinese word lists?
In the Proper noun reconstructions section, you state, "references to such reconstructions are strongly recommended in the Phonology sections of Old Korean entries, and in the Etymology sections of Middle and Modern Korean entries." Similar to above.

Albeit from something of an outsider's perspective -- my Korean ability is quite basic -- your proposal looks good to me.

Really appreciating the deeper dive you're giving for Korean entries. Thank you. ‑‑ Eiríkr Útlendi │^{Tala við mig} 18:57, 9 September 2020 (UTC)[reply]

@Eirikr Thanks for the comments, and also for the encouragement—they mean a lot. I've fixed both to "strongly recommended in the Phonology sections of otherwise attested Old Korean entries, and in the Etymology sections of likely Middle and Modern Korean reflexes" and also added three examples of how Chinese or proper noun data can be integrated within attested entries: 有叱 (*Is-), 無叱 (*EPs-), and 거칠다 (geochilda).--Karaeng Matoaya (talk) 12:30, 10 September 2020 (UTC)[reply]

The changes look good to me. Thank you again for taking this on! ‑‑ Eiríkr Útlendi │^{Tala við mig} 18:25, 14 September 2020 (UTC)[reply]

"Pronunciation spelling" label

Is everyone happy that the usage of the "pronunciation spelling" label has by implication been determined by the outcome of the recent "eye dialect" vote? That vote established that the "eye dialect" label is to be applied only to words such as sed for said or lissen for listen that represent standard pronunciations but imply that the speaker generally uses a nonstandard dialect. It has been said that "eye dialect" is a subset of "pronunciation spelling", on which basis such words could in theory be labelled both "eye dialect" and "pronunciation spelling", but I imagine that this would be viewed as unnecessary.

This leaves words such as borrowin' for borrowing and fink for think, that represent non-standard pronunciations, as well as simplified phonetic spellings such as lite, as eligible for the "pronunciation spelling" label. Is it uncontentious that all these should be labelled "pronunciation spelling"? Are there any other types of words that are "pronunciation spelling" candidates? Mihia (talk) 08:29, 10 September 2020 (UTC)[reply]

I wouldn't use the "pronunciation spelling" label for borrowin’ and fink; I'd simply call those nonstandard forms. Things like lite, tonite, and donut, on the other hand, are definitely pronunciation spellings that are not (I think) eye dialect (at least not usually). —Mahāgaja · talk 11:44, 10 September 2020 (UTC)[reply]

I second that. I suspect that some common misspellings arose as pronunciation spellings, or, as in the case of artic and nitch, even as mispronunciation spellings. I’d apply the term only, though, to intentional nonstandard spellings that do not imply the use of nonstandard speech but merely aim to convey how kool and with it the author is. — This unsigned comment was added by Lambiam (talk • contribs) at 13:55, 10 September 2020 (UTC).

In the case of artic, I wouldn't say that it's a mispronunciation spelling; rather, I'd say that /ˈɑɹktɪk/ is a spelling pronunciation, since 300 or so years ago artic was the normal spelling and /ˈɑɹtɪk/ was the normal pronunciation. —Mahāgaja · talk 15:52, 10 September 2020 (UTC)[reply]

Invitation to participate in the conversation

Hello. Apologies for cross-posting, and that you may not be reading this message in your native language: translations of the following announcement may be available on Meta. Please help translate to your language. Thank you!

We are excited to share a draft of the Universal Code of Conduct, which the Wikimedia Foundation Board of Trustees called for earlier this year, for your review and feedback. The discussion will be open until October 6, 2020.

The UCoC Drafting Committee wants to learn which parts of the draft would present challenges for you or your work. What is missing from this draft? What do you like, and what could be improved?

Please join the conversation and share this invitation with others who may be interested to join, too.

To reduce language barriers during the process, you are welcomed to translate this message and the Universal Code of Conduct/Draft review. You and your community may choose to provide your opinions/feedback using your local languages.

To learn more about the UCoC project, see the Universal Code of Conduct page, and the FAQ, on Meta.

Thanks in advance for your attention and contributions, The Trust and Safety team at Wikimedia Foundation, 17:55, 10 September 2020 (UTC)

Yay, rules! We'd better start crafting templates to issue various degrees of admonishment, warning, and scolding before escalating to interaction bans and topic bans. We could use help from a graphic artist to produce good icons. Vox Sciurorum (talk) 18:13, 11 September 2020 (UTC)[reply]

I wanted to add "right-wingers are humans too" but I got banned instantly, lol. Equinox ◑ 22:24, 11 September 2020 (UTC)[reply]

You should have read the FAQ: "UCoC may not fit into all cultural contexts." Vox Sciurorum (talk) 22:47, 11 September 2020 (UTC)[reply]

... and the footnote at the bottom: "not actually universal"... On a more serious note (but still highly sarcastic), I'm loving the name "Trust and Safety Team" - it fills me with calm and respect, and can be made into a nice acronym too, which has been a must for any initiative since the Patriot Act. --Java Beauty (talk) 23:14, 13 September 2020 (UTC)[reply]

Careful there! One of the proposed rules is to ban sarcasm. (I'm not kidding, go look at the draft.) —Μετάknowledge^{discuss/deeds} 05:36, 14 September 2020 (UTC)[reply]

pics or it didn't happen

And it sure would be great if more right-wingers recognized others as human too :^) —Suzukaze-c (talk) 05:14, 14 September 2020 (UTC)[reply]

Archaic forms and spellings should not be lemmas

Most English archaic forms and spellings are lemmas. However, archaic forms are just like declined/conjugated/inflected forms, in that they don't add any information on meaning of the root word. --Numberguy6 (talk) 20:56, 12 September 2020 (UTC)[reply]

Archaic terms may have been the predominant form at times in the past. We are attempting to be a historical dictionary among other things. DCDuring (talk) 21:39, 12 September 2020 (UTC)[reply]

I disagree: whom is just as much a lemma now as it has ever been, and so is thee. —Justin (koavf)❤T☮C☺M☯ 02:01, 13 September 2020 (UTC)[reply]

I also strongly disagree with the proposition that "Archaic forms and spellings should not be lemmas". Mihia (talk) 22:35, 13 September 2020 (UTC)[reply]

Phrase ellipsis, three regular dots or two ellipsis characters (six dots)?

Hi all,

First, sorry for cross-posting. I was advised that I'd be better served posting here. Here are my original questions:

Concern A: I came across how do you say...in English and I'm ... year(s) old. The former has been moved to how do you say …… in English. After reading the page history, there seemed to be a rational explanation as to why two ellipsis characters (six dots) were used. Given that Wiktionary:Phrasebook provides an example with three regular dots (three separate characters), I'm confused about what the naming convention should be. Please advise.

Concern B: Most people cannot type the ellipsis character (…) without copying and pasting from somewhere else. Doesn't this limit the usefulness of Wiktionary as a tool for looking up words? What if a phrase starts with the ellipsis characters and the user wanted to look that up? It would likely only be found with great difficulty.

-- Dentonius (talk) 00:11, 13 September 2020 (UTC)[reply]

I thought redirects worked and still work in Wiktionary as usual, don't they? In Wiktionary, there are many languages and in them lots of characters that are difficult to type for outsiders, however, redirects (and the {{also}} template) do an excellent job. I don't see why we should make an exception at this particular point when we don't do otherwise. The succession of three dots is just a clumsy substitute for an ellipsis character. There are several terms even in English that would be hard to type (e.g. 1,450 terms with æ or 1,213 terms with é) if it weren't for the convenient lookup and redirect features that we have here. Adam78 (talk) 01:06, 13 September 2020 (UTC)[reply]

Here is what I wrote at WT:GP:

This is maybe more of a beer parlo(u)r issue, and you might get more traction posting it there. However, I agree with you that six dots seems a bit strange. The explanation "and two of them to mark the width of an average word, separated by spaces as usual" by User:Adam78 makes a certain amount of sense but was clearly a unilateral decision. The issue with an ellipsis character vs. three dots seems less of an issue than you might think; at least for me, if I type "I'm ..." with three dots, it autocompletes to the variant with an ellipsis character. Same thing happens if you start typing "..."; it autocompletes to the ellipsis character entry. Even using a single ellipsis character isn't completely standard; for example, there's what does XX mean and Appendix:X is a beautiful language. In addition, all the entries under Appendix:Snowclones use X, Y, Z, N, etc. For snowclones maybe this makes sense as it makes possible things like Appendix:Snowclones/I'm here to X A and Y B, and I'm all out of A. I think at least all the non-snowclone entries should use a single ellipsis character.

Benwing2 (talk) 05:18, 13 September 2020 (UTC)[reply]

I disagree with ever using 6 dots to create space. Normally, when I'm just trying to create space I will use two m-dashes (——) or any number of underlines (___). But for what was being attempted on this site I don't know if any of that would be preferred. I would assume a single ellipsis would be sufficient. -Mike (talk) 22:34, 13 September 2020 (UTC)[reply]

A single ellipsis looks to me like a great compromise. I'm sorry for the one-sided change. Adam78 (talk) 15:45, 15 September 2020 (UTC)[reply]

Thanks, guys. I appreciate it. ;-) - Dentonius (talk) 17:13, 15 September 2020 (UTC)[reply]

Canadian English

Hello all, I raised a question at Category talk:Canadian English upon which I'd like to hear your input. -Montrealais (talk) 15:43, 13 September 2020 (UTC)[reply]

As far as the purpose of having a category is concerned, yes, I agree with what you say at that talk page. "Canadian English" should be for words used only (or primarily) in Canada, else what is the point. The actual name of the category could be open to discussion, though. Would a person expect a category called "Canadian English" to contain every word used in Canadian English? That is, including all North American or even "universal" English words too? I'm not sure. Mihia (talk) 22:24, 13 September 2020 (UTC)[reply]

As you imply, it doesn't make much more sense to put all North American words under "Canadian English" than it would to put universal English words under "Canadian English" on the grounds that they're used in Canada. I feel that if the category is to be useful, it should be for words that are, or at least mostly are, peculiar to Canada. There's a difference between a dictionary of English used in Canada (e.g. the Canadian Oxford Dictionary) and a list of Canadian words, which I believe most people would expect the category to be. - Montrealais (talk) 23:16, 13 September 2020 (UTC)[reply]

I think that one's perception may vary depending on whether the region in question is one's own or not. For example, as a BrE speaker, I would probably expect a list of "Canadian English words" to include words that are used only (or primarily) in Canada, whereas I might expect a list of "British English words" to include all words that are used in BrE. Opinions may vary. Despite this, we might adopt the convention that "X English" words include words used only (or primarily) in region X, and expect/require people to understand this. Otherwise the labelling may get clumsy. Mihia (talk) 00:33, 14 September 2020 (UTC)[reply]

@Mihia I definitely don't think having "British English words" consist of all the words used in British English (as opposed to the ones specific to this variety) would be workable. The category would be enormous and wouldn't be of much value, since over 99% of English words are common to all varieties. Benwing2 (talk) 03:05, 15 September 2020 (UTC)[reply]

No, I absolutely agree. I think perhaps I did not explain my point clearly enough. I was talking about what a person might expect a category named "British English words" to contain, if he or she did not already know how Wiktionary defined this. I was speculating that a person might think that it would contain all words used in British English, and therefore musing whether the category name should somehow indicate that it didn't (e.g. "Words specific to British English"). However, in conclusion I mentioned that this may be too clumsy. Mihia (talk) 09:25, 15 September 2020 (UTC)[reply]

@Mihia I see, makes sense. Benwing2 (talk) 03:31, 16 September 2020 (UTC)[reply]

I agree with the above, that I would expect this category to contain words that are specific to Canada, and I think the category should reflect that. Andrew Sheedy (talk) 22:51, 16 September 2020 (UTC)[reply]

We seem to have a consensus here, more or less - how should we proceed? -Montrealais (talk) 05:01, 18 September 2020 (UTC)[reply]

Lexicography films

Apparently there's only one film ever made about dictionaries, called The Professor and the Madman. Anyone seen it? --Java Beauty (talk) 21:12, 13 September 2020 (UTC)[reply]

(Malmoi is apparently another one. —Suzukaze-c (talk) 22:40, 13 September 2020 (UTC))[reply]

Oof, Korean film set in 1940, no thanks... --Java Beauty (talk) 23:18, 13 September 2020 (UTC)[reply]

I saw The Professor and the Madman. Watchable, though it takes various liberties with the facts in the name of drama. Out of curiosity I Googled and found these other ones:

The Forgotten Founding Father: Noah Webster's Obsession and the Creation of an American Culture (not a film but an hour-long lecture organized by the Library of Congress).
Samuel Johnson: The Dictionary Man (BBC Four "drama documentary"; doesn't look viewable online where I live, sadly).

— SGconlaw (talk) 12:51, 14 September 2020 (UTC)[reply]

Vulgar Latin

Do we really need Classical & Ecclesiastical pronunciations for unattested Vulgar Latin terms? I think, we ne need. If our community unanimously decide in favour of showing only Vulgar Latin pronunciations, then we can begin using |classical= & |ecclesiastical= (the latter does not work tho' & I know not why) to hide those twain. Or better, some user might even want to make some adjustment with Latin templates to this effect. — inqilābī ^{[ inqilāb zindabād ]} 12:44, 14 September 2020 (UTC)[reply]

@Erutuon, what d'you think? — inqilābī ^{[ inqilāb zindabād ]} 18:43, 15 September 2020 (UTC)[reply]

Wiktionary sitelinks dashboard: URL update

Hello all, and sorry for writing in English. Feel free to translate this message below.

The Wiktionary Cognate Dashboard presents interesting data about the extension powering your sitelinks. I just wanted to let you know that the URL of this tool changed: it is now accessible at https://wiktionary-analytics.wmcloud.org/Wiktionary_CognateDashboard/ . The former URLs, https://wmdeanalytics.wmflabs.org/Wiktionary_CognateDashboard/ and https://wdcm.wmflabs.org/Wiktionary_CognateDashboard/ , will be disabled on September 25th. Don't forget to update your documentation pages accordingly.

If you have questions about the tool or the URL switch, feel free to ping me. Cheers, Lea Lacroix (WMDE) 11:46, 14 September 2020 (UTC)[reply]

If anyone wants an Indian English translation, let me know 🙃 —Aryaman^A ^{(मुझसे बात करें • योगदान)} 22:33, 14 September 2020 (UTC)[reply]

I'm... kind of curious to see what that would entail. —Μετάknowledge^{discuss/deeds} 22:42, 14 September 2020 (UTC)[reply]

Category:Indian_English? Weird but gotta be respected. Equinox ◑ 00:05, 16 September 2020 (UTC)[reply]

We could all have a go at translating into Scots... XD - -sche (discuss) 20:53, 17 September 2020 (UTC)[reply]

Propose making Template:en-noun pluralization algorithm smarter

@Equinox, DCDuring Pinging a couple of random people who I think work on English lemmas a lot. I propose to make the {{en-noun}} pluralization algorithm smarter. Not sure if this has been discussed before. Basically, I want to implement the following default rules (which are mostly already implemented in the pluralize() function in Module:string utilities):

If the noun ends in -s, -x, -z, -sh or -ch, add '-es'.
If the noun ends in consonant + y, and does not begin with a capital letter, change '-y' to '-ies'. Hence cherry -> cherries, but Kennedy -> Kennedys (begins with a capital letter; cf. Rolling Stones "who killed the Kennedys?"), boy -> boys (ends in vowel + y).
Otherwise, add '-s'.
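The three rules above can be sketched in Python as follows. This is only an illustration of the rules as stated in this post, not the Lua code in Module:string utilities; the function name and the exact set of letters treated as consonants are assumptions.

```python
import re

# Rule 1: endings that take '-es'.
SIBILANT_ENDING = re.compile(r"(?:s|x|z|sh|ch)$")
# Rule 2: consonant + y. The consonant set is an assumption made for this sketch.
CONSONANT_Y_ENDING = re.compile(r"[bcdfghjklmnpqrstvwxz]y$")


def default_plural(noun: str) -> str:
    """Return the default plural of an English noun under the proposed rules."""
    if SIBILANT_ENDING.search(noun):
        return noun + "es"
    # Capitalized nouns keep the plain '-s' plural (Kennedy -> Kennedys).
    if CONSONANT_Y_ENDING.search(noun) and not noun[0].isupper():
        return noun[:-1] + "ies"
    return noun + "s"


for word in ["cherry", "Kennedy", "boy", "box", "church", "cat"]:
    print(word, "->", default_plural(word))
```

Irregular plurals and words that go against the rules would still need an explicit plural in the template.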

The values s and es would force an '-s' or '-es' plural, as before. The special symbols -, ~, ! and ? work as before. A new symbol + means "produce the default plural"; this is used e.g. on the page accessibility in {{en-noun|-|+}}, which currently has to be written {{en-noun|-|accessibilities}} (the - in conjunction with a plural means "usually uncountable"; without a plural specified, it means simply "uncountable"). This would be implemented as follows:

Implement the new behavior, but only if |new=1 is given in the template.
Use a bot to find the places where arguments would change between the old and new behavior; change the arguments to the new behavior and add |new=1.
As soon as all such places are changed, make the new behavior the default and remove the dead code supporting the old behavior.
Go through and remove the |new=1 parameter.
Remove the dead code supporting the |new= parameter.

That way, there would be no disruption while making the change. The only possible issue is someone changing the plural of an existing noun or adding a new noun while step 2 is in progress. I may be able to work around this by checking esp. for new entries in the Category:English nouns category, as I think adding a new noun would be a lot more likely than changing an existing noun. Thoughts? Benwing2 (talk) 08:34, 15 September 2020 (UTC)[reply]
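For step 2, a bot would need to decide, entry by entry, which plural argument keeps the displayed plural unchanged once the new defaults apply. The sketch below shows one way that decision could look. It rests on assumptions not stated in this post: that the old default simply appends '-s', that only a single plural argument is considered, and that an argument which matches the new default may be dropped. All function names are hypothetical.

```python
import re
from typing import Optional

SIBILANT_ENDING = re.compile(r"(?:s|x|z|sh|ch)$")
CONSONANT_Y_ENDING = re.compile(r"[bcdfghjklmnpqrstvwxz]y$")


def new_default(noun: str) -> str:
    """Default plural under the proposed rules."""
    if SIBILANT_ENDING.search(noun):
        return noun + "es"
    if CONSONANT_Y_ENDING.search(noun) and not noun[0].isupper():
        return noun[:-1] + "ies"
    return noun + "s"


def old_default(noun: str) -> str:
    """ASSUMPTION: the old default just appends 's'. The thread does not confirm this."""
    return noun + "s"


def migrated_arg(noun: str, arg: Optional[str]) -> Optional[str]:
    """Return the plural argument to use under the new behavior.

    `arg` is the current first parameter: None (absent), 's', 'es', or an
    explicit plural. None in the result means the argument can be omitted.
    """
    if arg is None:
        intended = old_default(noun)
    elif arg in ("s", "es"):
        intended = noun + arg
    else:
        intended = arg

    if intended == new_default(noun):
        return None  # the new default already produces this plural
    if intended == noun + "s":
        return "s"
    if intended == noun + "es":
        return "es"
    return intended


print(migrated_arg("stomach", None))   # needs an explicit 's' to stay 'stomachs'
print(migrated_arg("ally", "allies"))  # explicit plural matches the new default
```

Under these assumptions, an entry such as driveby with no argument would gain an explicit `s` so that its plural stays drivebys, while entries whose explicit plural matches the new default could have the argument removed.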

Sounds promising. I appreciate your concern about the transition process. I will try to think on possible problems etc. How long do you think step 2 would take? Could an input filter be used to prevent changes to the plurals where "new=1" was present? DCDuring (talk) 14:52, 15 September 2020 (UTC)[reply]

Currently the noun ally has |1=allies. Under the proposed smarter rules this could be omitted, but I see no step that would perform this simplification. Did I miss something? I am not sure what the old "dumb" rules are. What is an example of a case that would be flagged in Step 2? --Lambiam

Oppose unless you can articulate what benefit this would have. I have not encountered mistakes in English plurals in any significant amount. DTLHS (talk) 22:26, 15 September 2020 (UTC)[reply]

“Smarter” means it takes off the work of thinking about the code – as of typing it, since with the suggested changes one would have less to specify manually –, which means creating English entries would be faster, and less erroneous in every respect because of more human attention left – unless syntax changes requiring accustomization work towards the opposite, but there aren’t “changes” in that sense, only things becoming unnecessary. Fay Freak (talk) 22:49, 15 September 2020 (UTC)[reply]

Hi. Interesting idea. Thanks for pinging my pimply arse. You know what, I mostly see templates as something that gets in my way, that I have to work around, and I know that's really sad and wrong, because a lot of templates do very useful things. See current discussion on my talk page about why I find it hard to use the proper citations template with lots and lots of parameters (year, author, etc.). I also very much appreciate the fact that you are proposing a new=1 parameter and a phase-in rather than sort of just throwing it in there and hoping it works (ahem...). Could you please tell me: (i) what is the basis of your proposed rules (did they come from a certain grammar book, or a corpus study, etc.?) -- not just your head, right?; (ii) I think I just didn't totally follow your explanation, but suppose we have got a "perfect exception" that works with old en-noun but not with yours, such as drivebys: is there any risk of breaking these while implementing your new proposal? Thanks. Equinox ◑ 00:14, 16 September 2020 (UTC)[reply]

@Equinox In response to your questions: (i) These are the rules I was taught as a kid. Can't cite a specific grammar book but I bet any standard English grammar contains these rules. (ii) Exceptional cases like "drivebys" and "nudibranchs" would just need to be specified as {{en-noun|s}} instead of the current {{en-noun}}. My bot will change them automatically. Benwing2 (talk) 03:11, 16 September 2020 (UTC)[reply]

OK. Then probably in favour of this. We always have the Preview screen, after all. (You might also enjoy seeing the hot mess of User:Equinox/code/FindMissingNounPlurals.) Equinox ◑ 03:19, 16 September 2020 (UTC)[reply]

@Equinox Your code isn't so bad :) ... and it's liberally commented, which is something near and dear to my heart. Benwing2 (talk) 02:41, 18 September 2020 (UTC)[reply]

@Lambiam, DTLHS User:Fay Freak articulated the reason well, in my view. What the current module does by default is to always add -s to the noun, regardless of the form of the noun. So, head -> heads, house -> houses, boy -> boys, but also batch -> batchs, cherry -> cherrys, box -> boxs, etc. What I'm proposing to do is make the default rule smarter, so that there are fewer exceptional cases, and so that the cases that do require the plural to be enumerated explicitly correspond with English speakers' intuitions of what are exceptional. For example, currently nudibranch is specified as just {{en-noun}}, relying on the default -s plural, hence nudibranchs. This happens to be correct for this noun because the final -ch is pronounced as /k/, but any native speaker will tell you this is an exception, and that the "normal" plural would be nudibranches. Someone who doesn't know this word and comes across it in Wiktionary might think the bare template call {{en-noun}} is a mistake by some other editor who forgot to specify the explicit plural (which is required for 99% of nouns ending in -ch), and try to "correct" it to nudibranches. When the template is changed as I propose, so its default rules accord with normal English plural rules, all the exceptional cases will be specifically indicated as such and this problem won't occur. Benwing2 (talk) 03:23, 16 September 2020 (UTC)[reply]

@Benwing2 — You did not answer my first question, about the bot applying possible simplifications, like for example for ally replacing {{en-noun|allies}} by {{en-noun}}. --Lambiam 04:00, 16 September 2020 (UTC)[reply]

@Lambiam Apologies. Yes, the bot would apply all possible simplifications in step 2. That would include e.g.:

(for ally) {{en-noun|allies}} -> {{en-noun}}
(for match) {{en-noun|es}} -> {{en-noun}}
(for box) {{en-noun|boxes|boxen}} -> {{en-noun|+|boxen}}
(for thesaurus) {{en-noun|thesauri|thesauruses}} -> {{en-noun|thesauri|+}}
(for lexicography) {{en-noun|~|lexicographies}} -> {{en-noun|~}}
(for accessibility) {{en-noun|-|accessibilities}} -> {{en-noun|-|+}}

The way I would probably implement it is to first replace all arguments with + where possible (which means "use the default algorithm"), then eliminate + where possible. (Specifically, {{en-noun|+}} -> {{en-noun}}, and {{en-noun|~|+}} -> {{en-noun|~}}.) Benwing2 (talk) 05:26, 16 September 2020 (UTC)[reply]
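The two-pass approach described above might look roughly like the following Python sketch. It is not the actual bot code: `simplify_args` and `default_plural` are hypothetical names, only positional arguments are handled (named parameters such as |new=1 are ignored), and the handling of a bare {{en-noun}} follows the nudibranch example given earlier in this thread.

```python
import re

SPECIAL = {"-", "~", "!", "?"}  # symbols that keep their old meaning


def default_plural(noun: str) -> str:
    """Proposed default plural (hypothetical helper, see the rules above)."""
    if re.search(r"(s|x|z|sh|ch)$", noun):
        return noun + "es"
    if re.search(r"[^aeiou]y$", noun) and not noun[:1].isupper():
        return noun[:-1] + "ies"
    return noun + "s"


def simplify_args(pagename: str, args: list[str]) -> list[str]:
    """Rewrite the positional arguments of {{en-noun}} for the new behavior."""
    new_default = default_plural(pagename)
    if not args:
        # A bare call relied on the old "always add -s" default; keep that
        # plural explicitly if the new default would differ.
        return [] if new_default == pagename + "s" else ["s"]
    out = []
    for arg in args:
        if arg in SPECIAL:
            out.append(arg)
            continue
        # Pass 1: expand the shorthands 's' and 'es', then replace any
        # plural equal to the new default with '+'.
        full = pagename + arg if arg in ("s", "es") else arg
        out.append("+" if full == new_default else arg)
    # Pass 2: drop '+' where it is redundant.
    if out == ["+"]:
        return []
    if out == ["~", "+"]:
        return ["~"]
    return out
```

For example, `simplify_args("box", ["boxes", "boxen"])` yields `["+", "boxen"]`, and `simplify_args("nudibranch", [])` yields `["s"]`. This sketch uses `+` everywhere it applies; the variants that leave `s` or `es` untouched would skip the replacement for those values.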

@DCDuring I'm not exactly sure how long step 2 would take, but I can write a script to find out. It usually takes 1-2 seconds to save a page, meaning the bot can do maybe 3000 pages an hour. I think an edit filter would work and be a pretty simple solution; it could even be made to allow changes that add |new=1, but not otherwise, and display a message indicating that this needs to be temporarily done. Benwing2 (talk) 03:30, 16 September 2020 (UTC)[reply]

So, if everything worked right and it ran 24 hours a day with you faithfully overseeing it, at least on standby, it would take at least 4 days. If it was overseen 40 hours a week and was only run when overseen, it would take at least 2 1/2 weeks. I don't know that there is anyone besides you who could properly oversee it and lead a prompt recovery from any unforeseen problem. Thus the time step 2 would take would be totally subject to your availability for oversight, at least on a standby basis. DCDuring (talk) 11:41, 16 September 2020 (UTC)[reply]

@DCDuring This isn't actually the case. I think you're basing your calculations on the total number of English nouns (about 348,000), but the time would be determined by only those that need to be changed. I think that would be maybe 10,000-30,000, or about 3-10 hours, but I don't know for sure. This could be sped up by about 5x by running multiple processes at once. I also realized that I can use the tracking mechanism (Template:tracking) to track any pages needing updating that get changed during the primary updating process, so there's no need for an edit filter. Benwing2 (talk) 00:52, 17 September 2020 (UTC)[reply]
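For reference, the running-time estimates in this exchange follow from simple arithmetic, sketched below. The function name `hours_needed` is hypothetical; the default rate of 3000 pages an hour and the 5x speed-up are the rough figures quoted above, not measurements.

```python
def hours_needed(pages: int, pages_per_hour: int = 3000, processes: int = 1) -> float:
    """Estimated bot running time in hours, assuming a constant save rate."""
    return pages / (pages_per_hour * processes)


# All English nouns (about 348,000) at one process: roughly 4.8 days.
days_for_all_nouns = hours_needed(348_000) / 24

# Only the 10,000-30,000 pages expected to change: roughly 3 to 10 hours.
low, high = hours_needed(10_000), hours_needed(30_000)

# The same upper bound with five parallel processes: about 2 hours.
high_parallel = hours_needed(30_000, processes=5)
```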

Indeed. I was basing my estimate on all occurrences of {{en-noun}}. Can you identify all of the English nouns that need to be changed before beginning the changes? DCDuring (talk) 01:11, 17 September 2020 (UTC)[reply]

@DCDuring Yes, I can write a script to do this. Benwing2 (talk) 02:33, 17 September 2020 (UTC)[reply]

Isn't that necessary to reduce the time you would need to supervise the process, from my estimate to yours? Or am I missing something blindingly obvious? DCDuring (talk) 03:13, 17 September 2020 (UTC)[reply]

@DCDuring No, you're not missing anything, I do have to write the script eventually. I went ahead and wrote it; here are the stats:

Stat	Count
Total pages with `{{en-noun}}`	340,887
Number of pages touched if all possible changes/simplifications made and `+` used everywhere	32,739
Number of pages touched if all possible changes/simplifications made and `+` used everywhere except when replacing `s`	23,752
Number of pages touched if all possible changes/simplifications made and `+` used everywhere except when replacing `s` or `es`	22,024
Number of pages that will differ between old and new algorithm if no changes made	352

There are three numbers I give above, depending on how the changes are made. I recommend one of the latter two (where I leave alone s, or maybe both s and es, instead of replacing them with + where it's appropriate to do so). Both solutions would take around 7-8 hours to implement in step 2. Note that there are only 352 pages that actually *have* to be changed with the new algorithm, because they will have different results. These are pages like nudibranch that rely on the default -s ending without specifying it explicitly, and where the new algorithm would add something else (e.g. -es in this case). This means it's unlikely there will be many, if any, pages of this sort that will be added in the time it takes to run step 2 above. However, I will set up tracking so that any changes made in those 7-8 hours get reviewed and fixed up as needed.

Please note, the list of those 352 entries is here: User:Benwing2/convert-en-noun.warnings. Some of these are in fact wrong and need to have an explicit plural (e.g. windgrass, film library). I made a list of all those that may be wrong, here: User:Benwing2/convert-en-noun.warnings.likely-wrong. This list has 110 entries. Some are clearly wrong, some may or may not be wrong. Could you take a look and fix the ones that are wrong? Thanks! Benwing2 (talk) 01:16, 18 September 2020 (UTC)[reply]

Also, note that the above stats are derived from the Sep 1 dump, so there may be a few pages not included. Benwing2 (talk) 01:18, 18 September 2020 (UTC)[reply]

I just did 15, marked with {{done}}, with an explanation. Tedious. There are many common nouns that are derived from proper nouns which require looking at cites. There are a whole bunch of nouns ending in "oxy", which could probably be resolved most quickly by SB. I'll take another look when I can. DCDuring (talk) 03:47, 18 September 2020 (UTC)[reply]

@DCDuring Thanks. If you can just list what needs to be done for the others (if anything), I can make the changes fairly quickly. I flagged these because they need further investigation, e.g. all the -oxy ones seemed wrong to me but I don't know for sure. Calling User:SemperBlotto, who created many of them. Benwing2 (talk) 03:56, 18 September 2020 (UTC)[reply]

There are also some that should use {{en-proper noun}}, not {{en-noun}}, and others that are uncountable or, IIRC, plural-invariant. In some cases where there are both common- and proper-noun L2s, I wonder why we have both. DCDuring (talk) 11:52, 18 September 2020 (UTC)[reply]

@DCDuring Thanks. I implemented the necessary changes in Module:en-headword and I'm ready to proceed using the Sep 20 dump, when it comes out. Are you OK with this? Benwing2 (talk) 19:11, 19 September 2020 (UTC)[reply]

I fixed up some of the remaining cases in User:Benwing2/convert-en-noun.warnings.likely-wrong and removed the ones you or others had already done. There are about 55 left; almost all end in consonant + -y. Benwing2 (talk) 20:42, 19 September 2020 (UTC)[reply]

What I'm hoping will come out of this is a greater willingness to review the inflection line for English nouns so as to address various questions about plurals, including countability, the various departures from the basic rules, etc. Not having to type in so many of the plural forms might reduce the tedium and wear-and-tear on keyboards and thumbs of such reviews. DCDuring (talk) 00:26, 20 September 2020 (UTC)[reply]

This is done. Benwing2 (talk) 03:40, 22 September 2020 (UTC)[reply]

meta:Small wiki audit/Malagasy Wiktionary

As many of you know, the Malagasy Wiktionary is the second-largest by article count as a result of its very low-quality bot-created entries. I have made a full report on Meta, and I'm hoping that the Wiktionary community can chime in on the talk page, and add pressure at Meta so this actually gets dealt with. —Μετάknowledge^{discuss/deeds} 02:17, 16 September 2020 (UTC)[reply]

Support, this bot is highly irritating. P U C – 20:15, 16 September 2020 (UTC)[reply]

Don't comment here, go comment at Meta! —Μετάknowledge^{discuss/deeds} 20:24, 16 September 2020 (UTC)[reply]

Everybody, please vote at meta:Talk:Small wiki audit/Malagasy Wiktionary#Poll! I'd really like a good turnout of Wiktionarians to show that we care. And @Noé, could you please post about this poll at fr.wikt (and maybe the mailing list, if anyone still reads that)? —Μετάknowledge^{discuss/deeds} 17:39, 17 September 2020 (UTC)[reply]

Northwestern Indo-Aryan

I'm trying to improve our organization of Indo-Aryan languages (it's very loose as it stands), and I have an issue that could use some discussion. The Indo-Aryan lects of the Northwestern zone (Sindhi sd, Punjabi pa, etc.) are currently classified as descendants of Sauraseni Prakrit psu. The (literary) language most closely associated with these lects is Paisaci Prakrit, which we now have a code inc-psc for.

Certainly, Sauraseni doesn't give us the appropriate intermediary forms between Sanskrit and these languages: e.g. Kholosi taɽgo (a Sindhic language) preserves the r in the consonant cluster in Sanskrit दीर्घ (dīrgha), but Sauraseni has lost it as [Sauraseni form not displayed]. Similarly, Punjabi ਭਰਾ (bharā)'s currently given etymology does not make much sense again due to preservation of r. The conservativeness of Northwest IA is well-known, e.g. {{R:inc:Masica:1993}} discusses it. Sauraseni, on the other hand, is distinctively a Central IA language that is obsessed with cluster assimilation.

However, Paisaci is very very scantily attested and so I'm uncertain whether it actually is a good candidate for intermediary language between Northwest IA and Sanskrit. Glottolog does include it in Northwest IA, but Glottolog also very stupidly puts Dardic languages in there. What's the best way to organize Northwest IA? Create a family for it and group it under Sauraseni? Or put it under Paisaci which would require removing all of the current etymologies involving Sauraseni? Pinging @Bhagadatta, Kutchkutch, Victar, not sure who else might be knowledgeable enough to help but any input is welcome. —Aryaman^A ^{(मुझसे बात करें • योगदान)} 16:19, 16 September 2020 (UTC)[reply]

@AryamanA: I am not knowledgeable about Northwestern IA lects but I would support classifying them as descendants of Paisaci Prakrit. Paisaci Prakrit's meagre attestation should be no cause for not showing it as the intermediate between OIA and Northwest IA, because we should be more bothered about representing the IA family tree as accurately as current linguistic data allows. So we can obviously go ahead with cleaning up the etymologies and the descendants to be affected by this change. — inqilābī ^{[ inqilāb zindabād ]} 19:40, 16 September 2020 (UTC)[reply]

@AryamanA: I don't mind having Paisaci as the ancestor of NWIA but there are a couple of issues. Paisaci Prakrit has sometimes de-voiced Old Indo-Aryan voiced stops like [Paisaci form not displayed], for which Punjabi has ਦਿਉਰ (diura). There are more examples like [Paisaci form not displayed], [Paisaci form not displayed] etc for which I don't know the Punjabi equivalent.

Also, as for Kholosi taɽgo, the initial dental is de-voiced in a Paisaci-like manner but the r is also preserved which does not seem to be something Paisaci would do: Skt. घर्म (gharmá) --> [Paisaci form not displayed]; the r has been lost.

How should we handle cases like this? -- Bhagadatta (talk) 02:08, 17 September 2020 (UTC)[reply]

@Bhagadatta: Hmm, so it seems we won't be finding any perfect Prakrit match to Northwest IA, as I had suspected previously. (If anything, Shahbazgarhi/Mansehra Ashokan Prakrit/Gandhari Prakrit seem to be closer.) I suppose we can make a Northwest IA family and put it under Sauraseni to maintain the status quo. We could have Paisaci as a separate branch with no descendants. —Aryaman^A ^{(मुझसे बात करें • योगदान)} 02:24, 17 September 2020 (UTC)[reply]

@AryamanA: Well, I love the idea of Proto NWIA and Proto Central IA etc. But can we really continue calling Punjabi and Kholosi descendants of Sauraseni? As you pointed out, these languages preserve features and (remnants of) clusters that Sauraseni lost. But then again, there are a lot of features in Punjabi that appear to be from Sauraseni; one feature I can think of is geminated stops. Classifying IA languages is a really challenging task. -- Bhagadatta (talk) 03:16, 17 September 2020 (UTC)[reply]

@AryamanA, Bhagadatta, Inqilābī: It’s really unfortunate that the Indo-Aryan language family is not understood as well as it should be, and this particular issue is certainly one that needs to be discussed.

It is the overwhelming consensus in Indo-Aryan scholarship that Punjabi and Sindhi constitute a Northwest Zone distinct from the Central Zone. This modern Northwest Zone has similarities with the Ashokan Northwest Zone. However, there are several challenges in the definition of such a Northwest Zone.

First, the dividing line between the Northwest and Central Zones in modern South Asia is not well defined. Although the Western boundary with Iranian and the Northern boundary with Dardic are somewhat clear, the Eastern boundary is blurred due to contact with Hindi-Urdu and the Southern boundary is blurred due to Rajasthani and Gujarati.

Second, the academic understanding of the dialects of Punjabi and Sindhi is sparse.

For Punjabi, academic literature usually only refers to the Majhi lect or Standard Punjabi (MSP) of Amritsar and Lahore. This predominance of MSP in the academic literature distorts any general understanding of the Punjabi linguistic area as a whole. Other than the MSP lect, Saraiki is the next best understood Punjabi lect. Saraiki has the advantage of being both the variety most consistently divergent from MSP and the one with the best local claim to separate recognition. Since Saraiki is spoken in southern Punjab close to the border with Sindh, there are numerous similarities between Saraiki and Sindhi. For example, both have implosives that are otherwise absent in Indo-Aryan. Pahari-Pothwari is perhaps another important Punjabi lect due to its being the native lect of Rawalpindi, Islamabad and Mirpur.

Since most Sindhi speakers are in Pakistan and less than 1% of Indians speak Sindhi, understanding the Sindhi linguistic area from an Indian perspective is very challenging. Fortunately, the Kachchi lect is spoken natively in Kutch, which is accessible to Indians. Although Kachchi is a Sindhi lect, many Kutchi people choose not to identify as Sindhi as can be seen in their choice to use the Gujarati script. What is needed for a general understanding of the Northwest Zone as a whole is enough data for one or two Module:zh-dial-syns for Punjabi and Sindhi. Appendix:Sindhi Swadesh lists is a step in that direction.

Third, as is the case for all of Indo-Aryan, the documentation of various earlier stages does not represent a logical successive relationship from one stage to the next. There is no information regarding the transition from MIA to Early NIA for the Northwest Zone. The earliest data for Northwest NIA are two short fragments of the Adi Granth termed ‘Old Punjabi’ that have been analysed by Christopher Shackle. Although the attestation is fragmentary, comparing Old Punjabi with Modern Punjabi and Sindhi helps with diachronic analysis.

Despite nearly a thousand years of Perso-Arabic influence, Punjabi and Sindhi still show many features of Prakrit to an extent greater than Marathi, Hindi and Bengali. Markandeya claimed that Vracada Apabhramsa was spoken in Sindh and is the ancestor of modern Sindhi (sindhudeśedbhavo vrācaḍopabhraṃśaḥ). Pischel and Grierson have both supported this claim by Markandeya. Very little is known about the Vracada itself, except nine peculiarities noted by Markandeya. Here are some of those features of Vracada:

Retroflexion of MIA dental stops. For example, <त> → /ʈ/ and <द> → /ɖ/
An epenthetic <य> before <च> in Vracada may be the source of the Sindhi and Saraiki implosive /ʄ/
Sibilant merger: ṣ, s→ ś
The च-series are pure palatals. For example, <च> → /c/, <ज> → /ɟ/

Fourth, despite Bhagadatta’s valid analysis, multiple sources say Paisachi represents the Northwest Zone. Page 24 of {{R:inc:CGMIA}} says that regardless of whether the Northwest Zone is the home of Paisachi, it was also spoken in the Central Zone. Pages 30-31 say that there are at least four lects of Paisachi: Kaikeya, Saurasena, Pancala and Culika. The Paisachi lects in Pischel are perhaps Kaikeya and Culika with Culika being marked separately. I see no harm in using reconstructed Paisachi like reconstructed Ashokan Prakrit as a solution to this issue. Since Proto-Prakrit as a separate entity was rejected, attempting to create Proto-NW I-A as a separate entity is likely to be rejected on the same grounds. Like Ashokan Prakrit, the attested and reconstructed terms would represent different entities. See अक्खइ for a Paisachi quotation.

Fifth,

The area encompassed by Sauraseni Prakrit is too large. The area between Mathura and Karachi is at least twice the size of the areas encompassed by both Maharastri Prakrit and Magadhi Prakrit.

There is no Sindhi-speaking editing community (other than the inactive user User:Aursani) to obtain information from.

Old Punjabi pa-old is an etymology-only language.

What is the purpose of Category:Western Panjabi language pnb if it is intended to be merged with pa? Kutchkutch (talk) 13:15, 17 September 2020 (UTC)[reply]

@Kutchkutch: I am not an expert, but I would like to make a general remark that the biggest problem faced while classifying the IA family is the effect of dialect levelling and/or dialect mixing that happened historically, which can disrupt the otherwise regular nature of sound laws, and thus lead to common innovations in divergent lects. For example, Punjabi shares with Dardic the feature of losing voiced aspiration, and metathesis of the rhotic consonant. — inqilābī ^{[ inqilāb zindabād ]} 17:10, 17 September 2020 (UTC)[reply]

@Kutchkutch: Very well articulated. I agree that Proto-Northwestern Indo-Aryan has little to no chance of being approved as a full-fledged language on Wiktionary, complete with lemmas in its name. The best solution seems to be to use reconstructed Paisaci for this purpose. -- Bhagadatta (talk) 01:34, 18 September 2020 (UTC)[reply]

@Inqilābī: The points that you have mentioned are worth considering. Anything I say about the Northwestern Zone is from an outsider's perspective (so if something is incorrect then please feel free to correct it). User:AryamanA and the other Punjabi editors are probably in a better position to make internal judgements about the Northwestern Zone. Despite having an outsider's perspective and numerous limitations, learning about the other Zones of Indo-Aryan is still a worthwhile pursuit (If User:DerekWinters was still around, I'm sure that he would agree). Since there is an international border and a variety of religions in the Northwest Zone, discussing it in detail might involve several sensitive issues such as politics or religion.

Perhaps the effects of ʻdialect levelling and/or dialect mixingʼ, Areal features, Sprachbund#Indian_subcontinent and Dialect_continuum#Indo-Aryan_languages is one of the reasons why ʻthe documentation of various earlier stages does not represent a logical successive relationship from one stage to the nextʼ. These are some of the examples of the shortcomings of the Tree model especially for the Indo-Aryan family. The Wave model tries to fix some of those shortcomings, but understanding and applying it appears to be a challenging task. Although Module:zh-dial-syn seems to be a possible approach to addressing the shortcomings of the Tree model, data is either hard to find or non-existent.

@Bhagadatta: This discussion about the Northwest Zone raises interesting parallels with the other Zones of Indo-Aryan. The work on Maharastri Prakrit, Old Marathi, Marathi and Konkani has certainly been advancing our understanding of the Southern Zone in the public eye. It would be nice to see the same kind of collaboration (if it doesn’t exist already) on the modern and historical languages of the other Zones among native speakers and learners.

Interestingly, User:AryamanA created codes for Proto-Central Indo-Aryan inc-cen-pro, Proto-Northern Indo-Aryan inc-nor-pro and Proto-Northwestern Indo-Aryan inc-nwe-pro without anything more than:

I had not correctly categorized some of the subfamilies, so the pages themselves (except for 1 Ahirani lemma) are okay. This reorganization will take a bit.

So I'm not sure whether that means we can start reconstructing these proto-languages (finding citations for such reconstructions would be difficult). Perhaps he'll tell us more about what is happening after the reorganisation is complete. According to Wiktionary:Families, non-genetic groups of languages can also be called a ʻfamilyʼ such as CAT:Prakrit languages and now CAT:Central Indo-Aryan languages and CAT:Eastern Indo-Aryan languages. The ancestor of Ahirani is now Sauraseni Prakrit instead of Maharastri Prakrit (perhaps the similarity of से (se) with છે (che) was the reason for the change) with the result now visible on आऊत (āūt) (औत (aut) is more common than अऊत (aūt) for Marathi but mr.wikt uses अऊत (aūt)). Khandeshi continues to have Maharastri Prakrit as its ancestor. Although {{R:ahr:RSS}} exists, not all the pages of that dictionary are available.

It says on Wikipedia:

Sanskrit refers to the whole range of mutually intelligible Old Indo-Aryan dialects spoken in North-western India at the time of the composition of the Vedas.

the original speakers of what became Sanskrit arrived in the Indian subcontinent from the north-west sometime during the early second millennium BCE

So perhaps that means that attested Vedic Sanskrit is OIA in the Northwest Zone, and *पुरिष (puriṣa), *दिन्न (dinna), Reconstruction:Sanskrit/झापयति and the terms in CAT:Sanskrit reconstructed terms could either be alternate forms of OIA in the Northwest Zone or OIA in the other Zones. Kutchkutch (talk) 09:22, 18 September 2020 (UTC)[reply]

@Kutchkutch: The part about Skt. being a catch all term for all OIA dialects in North Western India was added by me when I was trying to re-word Woolner's statement. The original statement was something to the effect of, "if Sanskrit is taken to mean Vedic and all Old Indic dialects then the Prakrits are derived from Sanskrit". I think I ought to update it when I have more time on my hands as OIA was also spoken in the east. The part about Sanskrit speakers arriving in the northwest of India was already there. As for Vedic being North Western OIA, it's beautifully illustrated here that Vedic too was a mixture of several dialects. The hymns and verses that would later be included in the Rigveda were composed near the confluence of the Sutlej and Beas rivers in the Punjab region but the Rigveda was redacted in modern day Haryana where the dialect was slightly different so the original Rigvedic dialect was "filtered" through the phonetics of the Kurukshetran dialect, also called "Western Vedic" which was spoken where the redaction of the text took place. Finally Panini's Classical Sanskrit comes from the Northwestern dialect of OIA spoken in Gandhara. -- Bhagadatta (talk) 10:43, 18 September 2020 (UTC)[reply]

@Bhagadatta: Thanks for the explanation and sharing that link. I've read that before but wrongly assumed that it was an imagined story. When you have time, it would be very helpful if you could update Wikipedia with your understanding of this information. Interestingly, Chitrapur Math's Sanskrit lessons use a funny story about the difference between two Konkani dialects to explain what Panini did for Classical Sanskrit here. Apparently there is a बड्गी dialect in North Kanara and a तेङ्की dialect in South Kanara (this is reminiscent of the first discussion at Talk:हांव). The female author of the Sanskrit lessons who speaks बड्गी marries a तेङ्की speaker and encounters a few difficulties.

Good ol' Panini, God bless his soul, being extremely sensitive to people's feelings, so no group would feel left out, and wanting to see everybody live happily ever after together, decided to act Pappamma, and brought all of them together under the aegis of "Sanskrit." He toured all over Bharatvarsha, noted every word used and put it all down on paper. Then he classified the words. AND HOW!!! ( To his credit ...he studied all existing grammar works and in his own work, has very religiously and faithfully accounted other grammarians' thoughts on the subject under discussion.) Once Panini's work became known to the people, the Sanskrit Badgis and the Tenkis of the days gone by became familiar with each other's vocabulary and very soon a mixture of the two became a single , common medium of communication. Much like my kids speak today!

The connection between Dardic and Punjabi probably comes from the article Dardic languages. There also may be some Dardic influence on Konkani_language#Pre-history_and_early_development

Dardic may in turn also have left a discernible imprint on non-Dardic Indo-Aryan languages, such as Punjabi and allegedly even far beyond.

Konkani shows a good deal of Dardic ( Paisachi ) influence. Even Magadhi has got a good deal of "Dardic" influence. The other languages on which Paisachi exerted influence are Sindhi, Punjabi, Kashmiri and Nepali in the north.

The influence of Paisachi over Konkani can be proved in the findings of Dr. Taraporewala, who in his book Elements of Science of Languages (Calcutta University) ascertained that Konkani showed many Dardic features that are found in present-day Kashmiri. Thus, the archaic form of old Konkani is referred to as Paishachi by some linguists. This progenitor of Konkani (or Paishachi Apabhramsha) has preserved an older form of phonetic and grammatical development, showing a great variety of verbal forms found in Sanskrit and a large number of grammatical forms that are not found in Marathi.

The link that you provided also demonstrates that some work on the Northwest Zone has been done. Perhaps a summary is in order.

For Paisachi:

The most iconic feature of Paisachi is the apparent devoicing of intervocalic stops (Compare Sanskrit bhagavatī with Paiśācī phakkavatī). The grammatical rules at work in Paiśācī could simply be the reverse application of the voicing rules applied to produce the other Dramatic Prakrits. For von Hinüber (1981), however, the supposed devoicing in Paiśācī is actually a fiction of orthography. According to his theory, at some point in the development of Middle Indic, the character <g> no longer represents voiced velar stop /ɡ/ but rather voiced velar fricative /ɣ/ due to lenition. After this shift, the character <k> is repurposed to mark /ɡ/. For von Hinüber, the odd appearance of Paiśācī is due to the distorting lens of this orthographical shift.

For Sindhi:

The five major dialects of Sindhi are Vicholi, Lari, CAT:Lasi language, Thari and Kachchi. CAT:Memoni language, CAT:Kholosi language and CAT:Luwati language are often included as well. Thari is perhaps another name for Dhatki mki, in which case it may actually be a Rajasthani language. Some sources say that there is a Saraiki dialect of Sindhi and another Saraiki dialect of Punjabi.

Vicholi is the standard dialect in central Sindh. Lari is the dialect of southern Sindh. Lasi is spoken on the western frontier of Sindh and in Balochistan. Thari is the dialect of the Jaisalmer district of Rajasthan. There is a dialect map for Sindhi at [1]. Implosives are explained as the outcomes of geminated voiced stops in MIA (MIA /bb/ → /ɓ/). This is slightly different from the analysis of Vracada Apabhramsa above. The number of voiced implosives differs from dialect to dialect (similar to how the number of tones differs from dialect to dialect for Punjabi). All Sindhi dialects have at least one implosive, and curiously none have a dental. *[tr] → /ʈ/ in ٽي (“three”).

For Punjabi:

Fariduddin Ganjshakar (1173 - 1266) was one of the first Punjabi writers. Fariduddin Ganjshakar wrote in the Shahmukhi script. Perhaps the name Shahmukhi is only used to distinguish it from Gurmukhi. Fariduddin's literature was included in the Adi Granth. The excerpt interestingly claims that there was some contact between the writers of Old Punjabi and Old Marathi.

The dialects of Punjabi are divided into Western Punjabi and Eastern Punjabi. Here are some maps: 1 2 3

The Eastern dialects are Majhi, Doabi, Malwai and Puadhi. Doabi is spoken between the Beas and the Sutlej (perhaps the same region in which the Rigveda was composed). Malwai and Puadhi are spoken south of the Sutlej along the boundary of the Haryanvi language area. Arya Samaj's promotion of Hindi in the Punjab is often cited as the reason for the blur between Hindi and Eastern Punjabi. Lahnda is an exonym for Western Punjabi coined by William St. Clair Tisdall and this term was also used by Grierson. Two major groups within Western Punjabi are Saraiki and Hindko. Hindko is spoken in discontinuous areas of Khyber Pakhtunkhwa and is in frequent contact with Pashto. Kutchkutch (talk) 10:18, 19 September 2020 (UTC)[reply]

@Kutchkutch, Bhagadatta, Inqilābī: Thank you all for your input!! I have been a bit busy, but I have read and tried to do my own research as well.

I want to clarify, Proto-Central IA, Proto-Northwestern IA, etc. were only a temporary measure for organization in our modules. I don't expect that we will be reconstructing (most of) these and we should strive to replace them with attested languages if possible. Hence, why I brought up Paisaci as a possible substitute for Proto-NW IA. (Also, most sources I found treated Ahirani as a Central IA lect, hence the reclassification.)
The status of Paisaci seems to be difficult to ascertain. Upadhye (1939-40) in their review of Paisaci literature classifies it as an Old MIA language, on the order of Ashokan and Pali. This would seem to explain the rarity of it in written texts; probably, since no religious group promoted it (as Buddhists and Jains did with other MIA lects), there was little incentive for its recording, and so all we are left with are the statements of grammarians.
But it should also be noted that early Buddhist texts claim that Paisaci was used by one of the schools of the Vaibhāṣika Sthavira (sub-sect) of Buddhism, based in Kashmir. And going through Pischel again, I find that all of the evidence does point to Paisaci being a Northwestern IA language. The only argument put forth for it being anywhere else (namely the Vindhyas) is the presence of retroflex ḷ (suggested by Rudolf Hoernlé) but pretty much everyone that followed has found this to be insufficient evidence. It should also be noted that (some lects of) Punjabi have developed ḷ natively too, and I think some of the Dardic languages too.
Then the problem follows, what is a descendant of Paisaci? I strongly doubt that Konkani is involved, although there certainly hasn't been enough historical linguistic work on it; I would think a lot of Konkani's archaisms are rather due to re-Sanskritization, perhaps earlier than occurred for other IA languages. The Dardic languages are often tied to Paisaci, but Paisaci only preserves one sibilant while the Dardic languages have all 3 generally. Maybe we ought to keep it as only the ancestor of Sindhi and Punjabi? I also think a code for Vracada (as the ancestor of Sindhi etc. as Kutchkutch investigated) is necessary and uncontroversial. The devoicing really reminds me of Punjabi, which has done that to its voiced aspirated series and has resulted in tones, but if it's purely an orthographic difference then that is not the same process. Upadhye gives a very interesting idea about Culika dialect as being a form spoken by Sogdians who came to India, suggesting the speakers of Paisaci migrated inland later on, which would result in dialect admixture with Sauraseni. But I think it's very difficult to draw satisfactory conclusions.
Of course, there are probably discussions to be had about Dardic's placement as well. Probably, Shabazgarhi/Mansehra Ashokan Prakrit are a good proxy for the ancestor of Dardic, but it is confusing where Gandhari comes into play. We will not put it under Paisaci at this time.
Finally, I'm afraid the tree model is too simplistic for IA in general, as Inqilābī and Kutchkutch pointed out. On a purely lexical basis, as examined from the view of lexicostatistics, Kogan (2016) finds Punjabi to be closest to Hindi, grouping Sindhi closer to Gujarati, Rajasthani, Lahnda, etc. Our grouping is quite different. Hindi itself, as we know, is a highly mixed language that developed in Delhi from contact between many languages in a political centre, as reflected in its early forms such as Sadhukkadi, in which case even it is not a purely Central IA language and probably is highly influenced by other Central (Braj, Haryanvi), Northwest (Punjabi), Western (Rajasthani), Eastern (Bihari lects) and the Ardhamagadhi lects of IA.
Overall I am not opposed to the placement of Punjabi and Sindhi (and related languages) under Paisaci, and Sindhi and its relatives under Vracada Apabhramsa. Reconstructing Paisaci seems difficult however, although I am curious whether late MIA can be reconstructed at all so it may be an interesting experiment to undertake. However, I wonder if we should only rest on geographical groupings as the best we can have, since the Prakrits seem to be difficult to tie directly to NIA languages.

—Aryaman^A ^{(मुझसे बात करें • योगदान)} 21:50, 19 September 2020 (UTC)[reply]

@Bhagadatta, Kutchkutch, Inqilābī: Also check out [2] from Suniti Kumar Chatterji's work on Bengali. I think this is the best tree-based chart we'll be getting! —Aryaman^A ^{(मुझसे बात करें • योगदान)} 02:38, 20 September 2020 (UTC)[reply]

@AryamanA: Thanks for the update and the research behind the update. Although we made some progress in furthering our understanding, this is certainly a complex set of issues that are unlikely to be resolved anytime soon. The tangible result is that there is now a code for Vracada. However, the usefulness of that code for Vracada remains to be seen. There are probably shortcomings of the tree model in every Zone.

Vracada Apabhramsha's role in forming Sindhi is another reminder that Late MIA (Apabhramsha) is an important stage of I-A. Apabhramsha is often treated as a single entity rather than as discrete regional entities because there was a dialect continuum like Ashokan Prakrit. In the bazaar scene of Uddyotana’s Kuvalayamālā c. 779 the narrator quotes small bits of eighteen different languages, some of which sound remarkably similar to the spoken languages of today rather than the Prakrits. Thus, the original names given to the various Apabhramshas (Nagara, Upanagara, Vracada) are like the dots in MOD:inc-ash-dial-map. Most of the Apabhramshas like Vracada are known by name only with little information about them. Kutchkutch (talk) 09:38, 20 September 2020 (UTC)[reply]

@Kutchkutch: Thanks for sharing the link. It's really interesting and refreshing to see that the idea of Sanskrit having different dialects is being alluded to by an Indian guide on Sanskrit. Because I've seen otherwise, most people usually believe that Sanskrit, right from the Rigveda to the elementary lessons on a Class 8 Sanskrit textbook, is of a single form. Konkani being related/influenced by Paisaci/Dardic is news to me but I can already see why someone would think so. It was the language(s) brought in by the GSB people that would eventually become Konkani and the GSBs are said to have arrived from the banks of the Saraswati river. But then, Konkani is after all descended from pmh so Paisaci influence if any would be comparable to the influence exerted by a substrate which probably has been lost by now.

@AryamanA: That chart definitely explores the relations between IA languages better than how it's presently done on this site. I'd say the Romani languages are perhaps descended from Paisaci if it weren't for that chart which lists Romani as being influenced by it and Dardic. -- Bhagadatta (talk) 12:21, 20 September 2020 (UTC)[reply]

@Bhagadatta, Kutchkutch: Another non-controversial change we can make is replace Proto-Northern Indo-Aryan with Khasa Prakrit, removing the Pahari languages from the Sauraseni fold. This would need a bot run to replace current instances of inheritance from Sauraseni in those languages.

I'm still a bit uncertain about Paisaci reconstructions, but it seems we have consensus to put it as the common ancestor of Sindhi and Punjabi and Lahnda lects. I will do that. We still need an Apabhramsa for Punjabi, but it seems there were several in the Punjab area if we go by Grierson and Chatterji. The primary one of these is Ṭakka Apabhramsa, and we can classify the other ones (Kekaya, Madra, Upanāgara, etc.) as varieties of this. Like Vracada, these aren't well attested. —Aryaman^A ^{(मुझसे बात करें • योगदान)} 20:24, 20 September 2020 (UTC)[reply]

@AryamanA: The new codes and reorganisation will help with descendants trees. The most important idea conveyed by the recent changes is that MIA doesn't become NIA in one fell swoop. Replacing Proto-Northern Indo-Aryan with Khasa Prakrit would allow a replacement for the ambiguous term Pahari in descendants trees. If Takka is the same as Pischel's Dhakki, then Takka Apabhramsa could have attested entries.

The only other discussion about Apabhramsa appears to be User_talk:DerekWinters#Apabhramsa. Some interesting points from that discussion are:

This shows that there must have been a language spoken in Kalinga (historical region) that led to Oriya being separate from Bengali-Assamese.

Principle Apabhramshas are Takka Apabhramsha in Central Punjab and Vrachada Apabhramsha in Southern Punjab. By 1200 AD these Apabhramshas had few inflectional morphemes left. During Middle Ages Takka Apabhramsha developed into Lahori dialect and Vrachada Apabhramsha developed into Multani dialect. Arab and Persian travellers, specifically Al-Biruni in his book تَحْقِيق مَا لِالْهِنْد (taḥqīq mā li-l-hind), had declared that even before the advent of Islam in Sindh (711 A.D.), Vrachada was prevalent in the region.

(@Bhagadatta:) It's still unclear whether there should be reconstructions for Vrachada, Takka and Paisachi. Comparisons could be made with reconstructed Ashokan Prakrit (Proto-Prakrit) and Proto-Dardic. However, the difference is that reconstructed Proto-Prakrit and Proto-Dardic appear in the literature, and Paisaci and the Apabhramsas don't appear to have a tradition of reconstruction unless en.wikt wants to start that tradition. The sound laws that affect Paisaci and the Apabhramsas would need to be established somewhere (such as User:AryamanA/Prakrit). Without reconstruction, there would be a lot of categories of the type CAT:Vracada Apabhramsa term requests. Kutchkutch (talk) 10:26, 21 September 2020 (UTC)[reply]

Category:Hindi Tadbhava

@AryamanA, Itsmeyash31, Atitarev, Bhagadatta I am planning to remove this category unless someone comes up with a good reason to keep it. It has only 33 entries in it, is badly named, and duplicates Category:Hindi terms inherited from Sanskrit. Benwing2 (talk) 04:50, 17 September 2020 (UTC)[reply]

@Benwing2: Oh man, consensus to delete this category had been reached years ago, I can't believe it's still there... I thought it was deleted. -- Bhagadatta (talk) 05:16, 17 September 2020 (UTC)[reply]

@Benwing2: Since we do not have Category:Hindi Tatsama and Category:Hindi Ardhatatsama, then how did this one linger thus long? Obviously, delete it. — inqilābī ^{[ inqilāb zindabād ]} 13:18, 17 September 2020 (UTC)[reply]

Turkish noun inflection

According to a comment in Requested entries (Turkish) (Special:Diff/60412898) the Turkish noun inflection template is incomplete. Some possessive forms can end in -in or -ini and only the first is generated. Could somebody who knows Turkish explain what needs to change? Vox Sciurorum (talk) 13:46, 17 September 2020 (UTC)[reply]

It is not a simple matter. Turkish nouns can have an optional number suffix (marking a plural), an optional possessive suffix, and an optional case suffix, in that order. For example, kitap-lar-ım-da = “book-plural-mine-in” = “in my books”. So the generic form is noun stem + number + possessive + case. Counting the absence of a marking as the presence of a null segment denoted, for the purpose of exposition, by ∅, some possibilities are:

kitab-∅-ım-da = “in my book”
kitap-lar-∅-da = “in (the) books”
kitap-lar-ım-∅ = “my books”

Including the null segment, there are two number suffixes, seven possessive suffixes, and six case suffixes, giving 2 × 7 × 6 = 84 combinations. Some forms are shared, but are analytically and semantically distinguishable. The inflection tables are in comparison rather simplified. They are essentially two tables: a main one for noun stem + number + ∅ + case, and an optional one for noun stem + number + possessive + ∅, reducing the number of combinations to 2 × 1 × 6 + 2 × 6 × 1 = 24. The requested entry bisikletini (which is the definite accusative of both bisikletin (“your (singular) bicycle”) and bisikleti (“his/her/its bicycle”), and therefore a shared form) contains both a possessive suffix and a case suffix, so it is not included in the 24 forms provided at bisiklet#Declension. All this is peanuts compared to Turkish verb conjugations, where you’ll have a real combinatorial explosion if you try to include all possible forms. I think this is more a grammatical issue than a lexical issue, but ultimately it is a policy issue; does the maxim “all words” really mean all 10,080 possible, completely regular and predictable inflections of some base form in some agglutinative language? --Lambiam 11:25, 18 September 2020 (UTC)[reply]
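The counts above can be checked with a short sketch. It only enumerates slot combinations and does not generate actual Turkish word forms (that would need vowel harmony and consonant alternation rules). The slot labels and the names `in_table`, `table_forms` and `missing_forms` are illustrative assumptions, not part of any existing Wiktionary module.

```python
from itertools import product

# Slot labels are illustrative; "∅" marks the absence of a suffix.
NUMBERS = ["∅ (singular)", "plural"]
POSSESSIVES = ["∅", "1sg", "2sg", "3sg", "1pl", "2pl", "3pl"]
CASES = ["∅ (nominative)", "accusative", "dative",
         "locative", "ablative", "genitive"]

# Every stem + number + possessive + case combination.
all_forms = list(product(NUMBERS, POSSESSIVES, CASES))


def in_table(form):
    """True if the simplified tables cover this combination:
    either no possessive suffix (main table) or no case suffix
    (possessive table)."""
    _number, possessive, case = form
    return possessive == "∅" or case.startswith("∅")


table_forms = [f for f in all_forms if in_table(f)]
missing_forms = [f for f in all_forms if not in_table(f)]

print(len(all_forms), len(table_forms), len(missing_forms))
```

Under these assumptions, a combination such as singular + 2sg + accusative (the slot pattern of bisikletini) falls into `missing_forms`, which is why the current table does not show it.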

The Azerbaijani inflection table has the same problem, but it shows the complete inflection, so why can't we do the same with Turkish? Some Wiktionaries already have the complete one. If we add the inflection for bisikletini we can just add it as second person possessive and third person possessive in the accusative form. Compare these examples of the cases: Bisikletin nerde? = Where's your bicycle? (nominative second person possessive). Bisikletini istiyorum = I want your bicycle (accusative second person possessive). Lagrium (talk) 14:01, 18 September 2020 (UTC)[reply]

Kyiv

Perhaps Kyiv should be promoted from an “alternate spelling,” in the wake of the renaming of w:Kyiv. —Michael Z. 2020-09-17 16:49 z

The city hasn't been renamed AFAICT; as long as more English-language sources call it Kiev than Kyiv, I think {{alt sp}} is still right. Kyïv, Kyjiv, and Kyyiv may also be attestable, but none of them have an entry yet. —Mahāgaja · talk 17:27, 17 September 2020 (UTC)[reply]

Kyiv and Kiev are Romanizations of different names, in different languages, for the city. So they are, IMO, “alternative forms” rather than “alternate spellings”. --Lambiam 11:40, 18 September 2020 (UTC)[reply]

The difference in their written form is only their spelling—not capitalization, not diacritics, or anything else (not that I care much about the label). —Michael Z. 2020-09-19 16:37 z

Alternative spellings have to share an etymology. These two forms don't. Ultimateria (talk) 17:37, 19 September 2020 (UTC)[reply]

The same could be said about "castle" and "chateau". Would you say that Myanmar and Burma are alternative spellings? How about Beijing and Peking?

Interesting comments. (Who says alternative spellings have to share an etymology?) They sort of do share an etymology, because the English name was strongly influenced by both Russian and Ukrainian at a time when written Ukrainian was suppressed in the Russian empire, when Ukrainian was often referred to as “Russian” or “Little Russian,” especially by the literate class in the Russian empire, and when the old orthography could write it exactly the same in two languages: Киѣвъ (the letter yat was pronounced differently and led to different sounds and letters in modern Russian and Ukrainian). There’s a lot of historical Ukrainian influence on English that still carries that legacy.

The spoken name \key-ev\ is not two different words. It can be transcribed with either of two spellings, depending on whether you use a current style manual or still use your grade four textbook.

Peking and Beijing came about similarly, through transcription of Cantonese and Mandarin languages, respectively. Pronounced differently but written similarly, like Киѣвъ > Кіев/Київ. But English Kyiv/Kiev are spoken the same. —Michael Z. 2020-09-19 19:08 z

That's a good point. {{alt form}} is better than {{alt sp}} for this. —Mahāgaja · talk 16:28, 18 September 2020 (UTC)[reply]

Indeed the city hasn’t been renamed in Ukrainian or in Russian, but it is in the process of being renamed (re-spelled) in English, much like Peking→Beijing. More current sources are now writing Kyiv, and according to current style manuals Kiev deserves the label “dated.” Wikipedia has changed its practice as a follower, not a leader, and this only after the delay of a six-month moratorium (gag order) and three months of debate on the associated talk page. —Michael Z. 2020-09-19 16:43 z

The changes in style guides are motivated by politics. Kiev is still more common. --Vahag (talk) 18:10, 19 September 2020 (UTC)[reply]

Our dictionary shouldn’t share your political prejudices. It is based on usage, which we ascertain through references. And it shouldn’t give our readers writing advice that makes them look ignorant of current standards. —Michael Z. 2020-09-19 19:09 z

Google Ngram is too coarse. The last data point is all of 2019 averaged out. There was a tipping point in usage during October–November. If you’re interested in politics, the very conservative BBC and New York Times use Kyiv, Breitbart uses Kiev. —Michael Z. 2020-09-19 19:20 z 19:20, 19 September 2020 (UTC)[reply]

Appendix for strings in unidentified or uncertain languages?

Over in the User_talk:Karaeng_Matoaya#The_enigmatic_poem_of_Nukata_no_Ōkimi thread, a few of us were discussing a particular poem in the Man'yōshū anthology of Old Japanese poetry, completed around 759 CE. Poem number 9 has frustrated readers for centuries, as the first two stanzas may be written in a different language entirely.

That discussion gave rise to a question about whether Wiktionary might have space for collecting snippets of text like this, where the underlying language might not be known. Clearly, a mainspace entry would be inappropriate. But what about an Appendix page?

Do we perhaps already have such an Appendix area set up? ‑‑ Eiríkr Útlendi │^{Tala við mig} 19:05, 17 September 2020 (UTC)[reply]

Why is mainspace inappropriate? It's attested, and it's what we already do: see Category:Undetermined lemmas. —Μετάknowledge^{discuss/deeds} 19:14, 17 September 2020 (UTC)[reply]

The Buyla inscription has word breaks so you know what to make entries for. But these verses (there are other examples especially in ancient Chinese sources, as discussed there) are entire sentences (sometimes entire songs) of words in an unknown language where even academics can’t agree on where to split the word boundaries, and it doesn’t seem quite right to treat sentences or songs as mainspace entries.--Karaeng Matoaya (talk) 22:36, 17 September 2020 (UTC)[reply]

In that case, we could create entries for each character (as for the Phaistos Disc signs)... although I don't object to creating an appendix, which would I presume present the text and various scholarly ideas of where to break it up and what it might mean? - -sche (discuss) 18:43, 19 September 2020 (UTC)[reply]

@-sche: Unlike the case of the Phaistos Disc where the characters are literally undeciphered, the main languages that brought about this discussion are all transcribed in conventional Chinese characters that we already have CJKV entries for, so it's only a matter of reconstructing the pronunciations and orthographic practices at the time and place of the transcription and comparing the resulting sequence to known languages from those parts of Eurasia. Scholars have gone quite a far way to deciphering many of these cases, but of course each interpretation attempt only makes sense as a whole; if a certain passage is determined to be Proto-Turkic from the beginning, you're going to have entirely different results from if you decided that it was Para-Mongolic. So having separate-character entries would not be particularly productive, while discussing the passage as a whole would. This is why I think an appendix entry like Appendix:Song of the Yue Boatman would work better.--Karaeng Matoaya (talk) 01:06, 20 September 2020 (UTC)[reply]

Quote adder redux

In a recent discussion about templatising citations it became clear that some editors don't use templates for citations because it's more difficult (parameters must be remembered, brackets matched etc.).

Other communities have already addressed this problem and created the Citoid extension. It extracts metadata (author, date, title etc) from a URL/ISBN/DOI and generates template code which can be directly inserted into the page. This has been suggested before, but not much has happened since. I'd like to get the ball rolling, if there's interest we'd need to set up a vote to get the extension installed, and create mapping files to work with our templates. – Jberkel 13:58, 18 September 2020 (UTC)[reply]

Calling other users involved in that discussion (@Equinox, DTLHS, DCDuring) for their support. — inqilābī ^{[ inqilāb zindabād ]} 18:11, 19 September 2020 (UTC)[reply]

It looks like it takes a fair amount of work to set it up. I'd like to see how a version configured for Wiktionary does on older cites using URLs from Google Books and no ISBN (ie, pre-1970). How does it do on translated works? On edited works with many authors of individual works? And quotations taken not from the speaker/author's own work, but from some secondary source? What about digging out when the work was actually written rather than when the specific print edition was published (though that publication date might be of supplementary interest)? I expect that a lot of manual work will be required for these situations in which the relative ease of Citoid will lead folks to fail to perform the whole job. But the current setup does a pretty good job of that already. DCDuring (talk) 00:48, 20 September 2020 (UTC)[reply]

BTW, is anyone working on tools like those developed by ELEXIS? Does anyone have access to and opinions about these tools? Also, has anyone worked with Sketch Engine? DCDuring (talk) 01:19, 20 September 2020 (UTC)[reply]

@DCDuring: I don't think the set up will take too much time, it's a matter of adapting it to our template conventions. It's true that it probably won't work well for some of the use cases you mentioned, but it seems to be doing a good job for getting references from contemporary sources. Yes, it's biased towards English and scientific sources, but the general metadata extraction part is generic and can be extended. I think it would especially help editors who routinely add quotations from newspapers / news websites. These often have already good metadata (author, publishing date, title) embedded in the HTML which can be extracted without any custom code. – Jberkel 07:47, 24 September 2020 (UTC)[reply]

What I would ideally like (and I know it's a "big ask") is for templates in general to be able to auto-complete or auto-predict their contents, rather like Microsoft Visual Studio does when you write a function call in some programming source code. So I would type the opening braces {{ and it would immediately pop up an alphabetical list of all available templates (without impeding my typing: this would just be an optional pop-up menu that I could navigate with the arrow keys), and perhaps I would choose "en-noun", and then type the pipe | and it would pop up a useful hint, like "en-noun parameter 2: Language code, e.g. en for English". I realise this would take a lot of work both in implementing the popup stuff and in actually filling in the documentation, but I think this would be amazing, and would convince a lot more people to use the templates. TLDR: I'm not specifically focused on citations but on templates in general; I do realise however (see my talk page) that I am a major editor and some people would like me to format my large numbers of book quotations in a proper manner, which I haven't forced myself to learn yet. Equinox ◑ 08:51, 24 September 2020 (UTC)[reply]

I'm not sure that typing in an ISBN or whatever reference number would help me at all because I'm usually copying and dropping in some text out of Google Books. I don't want to look stuff up, and I usually don't have any idea what the ISBN is. I just want to put the text in there as proof that a word exists. Equinox ◑ 08:53, 24 September 2020 (UTC)[reply]

@Equinox: When you cite from Google Books, you still need to perform a few manual copy&paste operations (select author, select title, etc.) If you pop this Google Books URL into Wikipedia's Cite tool, it creates the following snippet for you:

{{Cite book|date=2005-03-23|first=James|isbn=978-0-596-00847-5|language=en|last=Avery|publisher="O'Reilly Media, Inc."|title=Visual Studio Hacks: Tips & Tools for Turbocharging the IDE|url=https://books.google.es/books?id=ux3-AgXEenoC&pg=PA137&dq=visual+studio&hl=en&sa=X&ved=2ahUKEwin4byBxYHsAhULyxoKHXD6CaUQ6AEwAXoECAMQAg#v=onepage&q=visual%20studio&f=false}}

All you have to do is add the text passage, the boring details are pre-filled. Obviously, in this example the template and parameters are Wikipedia-specific, but that's part of the mapping we'll have to customize. – Jberkel 10:21, 24 September 2020 (UTC)[reply]
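As a rough sketch of what such a mapping might do, the snippet below converts fields like those in the Cite book output above into a quote-book style call. The `FIELD_MAP` table, the function name `to_quote_book`, and the choice of target parameters are assumptions made for illustration; the real Citoid mapping files are configuration on the wiki, not Python, and the final parameter names would have to follow whatever our templates actually accept.

```python
# Hypothetical one-to-one field renames (source field -> target parameter).
FIELD_MAP = {
    "title": "title",
    "publisher": "publisher",
    "isbn": "isbn",
    "url": "url",
}


def to_quote_book(cite, lang="en", passage=""):
    """Build a quote-book style template call from Cite-book-like fields.

    `cite` is a plain dict such as {"first": ..., "last": ..., "date": ...}.
    Values are assumed not to contain the pipe character.
    """
    params = [lang]

    # Merge first/last name into a single author parameter.
    author = " ".join(p for p in (cite.get("first"), cite.get("last")) if p)
    if author:
        params.append(f"author={author}")

    # Keep only the year from an ISO-style date such as 2005-03-23.
    date = cite.get("date", "")
    if date:
        params.append(f"year={date[:4]}")

    for source_field, target_param in FIELD_MAP.items():
        if cite.get(source_field):
            params.append(f"{target_param}={cite[source_field]}")

    # The quoted passage is the one part the editor still supplies by hand.
    params.append(f"passage={passage}")
    return "{{quote-book|" + "|".join(params) + "}}"


example = {
    "first": "James",
    "last": "Avery",
    "date": "2005-03-23",
    "title": "Visual Studio Hacks",
    "isbn": "978-0-596-00847-5",
}
print(to_quote_book(example))
```

The point of the sketch is only that the metadata fields are filled mechanically and the editor adds the passage; handling of translators, chapters, or original publication dates would need extra rules.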

Ah, I see; I thought I would have to type an ISBN into a text box or something. Yeah, sounds good. Please HIT ME UP (as the kids say, or did a decade ago) if you want to do any pre-release testing of this. Equinox ◑ 10:28, 24 September 2020 (UTC)[reply]

I'd be happier if the ISBN didn't display, just as I would like to see publisher's addresses and much other material in many of our RQ templates fail to display or, better, appear only in a bibliographic entry outside of principal namespace, possibly in yet another namespace. — This unsigned comment was added by DCDuring (talk • contribs) at 14:21, 24 September 2020 (UTC).[reply]

CFI for misspellings

Hi, I've created a vote at ~~Wiktionary:Votes/2020-09/CFI for misspellings~~ to begin next week. I appreciate any feedback. Please let me know on the talk page of any relevant discussions I've missed. Ultimateria (talk) 19:30, 18 September 2020 (UTC)[reply]

I've renamed the vote to reflect its scope of rewriting the Spellings section of CFI: Wiktionary:Votes/2020-09/Misspellings and alternative spellings. Ultimateria (talk) 21:25, 18 September 2020 (UTC)[reply]

FWOTDs

For the past year, I have been selecting the Foreign Words of the Day (FWOTDs) here on Wiktionary. However, I'm at a point where I won't be able to do them any longer and would like to wash my hands of them, so I'm seeing whether anyone would be interested in handling them. (To set them, choose words from the nominations and add them to the subpages here using {{template:FWOTD}}.)

Even if you're not willing to take on the job of setting FWOTDs on a permanent basis, setting a few would be immensely helpful. There's currently a shortage of FWOTD nominees. Nominating more here would also be great; remember to follow the guidelines on that page. I'm open to the idea of potentially changing how Wiktionary does FWOTDs, as long as I'm not involved. As I said, I'd like to wash my hands of them.

If you have any questions, look at previous FWOTDs (e.g. September) or ask me. Hazarasp (parlement · werkis) 05:39, 20 September 2020 (UTC)[reply]

If my comment provokes a lot of discussion, I'll move it to its own thread so as not to hijack yours, but: I note that both WOTD and FWOTD seem to burn out their maintainers over time (I speak from experience!). I wonder if we should take a page from e.g. de.Wikt and do a "word of the week" instead. - -sche (discuss) 07:30, 20 September 2020 (UTC)[reply]

That is something I'm considering - that's why I mentioned "potentially changing how Wiktionary does FWOTDs". @Metaknowledge also suggested the idea of recycling previous FWOTDs; I think that's what I'll do (at least for now) if this doesn't attract much interest. Hazarasp (parlement · werkis) 08:32, 20 September 2020 (UTC)[reply]

How is island (FWOTD 2020-09-20) a Swedish word? Are there any more concrete requirements than that the term be "interesting"? Looking at the entries for this month, it is not apparent in what sense they are more interesting than any other random word, such as stjórnborð, တီကောင်, débarrassâmes, revisión por pares, or limitarere, generated by pressing "Random entry". --Lambiam 19:53, 20 September 2020 (UTC)[reply]

Look more closely: the word is ö. As for interestingness, that was a major concern for me, and my pursuit of that is what burned me out, but I can't speak for Hazarasp. —Μετάknowledge^{discuss/deeds} 19:55, 20 September 2020 (UTC)[reply]

Interestingness is subjective, and if you're not familiar with a particular language it might not be obvious. What I find often lacking is context, or some connection between words. Focus or theme weeks might provide that, or what about simply merging FWOTD with the WOTD? Call it "Words of the Day": perhaps a translation or equivalent word in another language, or something that is connected to the WOTD, a cognate, false friend etc. – Jberkel 21:13, 22 September 2020 (UTC)[reply]

Another thing which just occurred to me: it would also help to work on the entries which have been nominated, but which are currently stuck in "not ready". Often it's just a missing citation or translation. – Jberkel 11:27, 24 September 2020 (UTC)[reply]

@Hazarasp I'm willing to look after FWOTDs, just tell me when and with what FWOTD date I can start. I expect to end up the same as the other people who've selected FWOTDs, but ¯\_(ツ)_/¯. ~~←₰-→~~ Lingo ^Bingo _Dingo (talk) 11:33, 24 September 2020 (UTC)[reply]

The first unassigned FWOTD is October 1, though October 2 and 3 are both assigned. After that, things are clear (with a few sporadic exceptions). Hazarasp (parlement · werkis) 06:10, 25 September 2020 (UTC)[reply]

Where to lemmatize Literary Chinese terms attested only in Korean sources?

There are some Literary Chinese terms which were used chiefly or only by Koreans, back when Chinese was the country's language of writing. These words were not used in some heavily Sinicized form of what is nevertheless Korean, but in entirely Chinese compositions that would otherwise be perfectly understandable to an educated Chinese reader versed in Literary Chinese. In many cases they actually have no corresponding Korean-language term at all.

Should these be lemmatized as Chinese?--Karaeng Matoaya (talk) 08:47, 21 September 2020 (UTC)[reply]

I think that is appropriate. {{lb|ko|Korea|literary|historical}}? Or maybe {{lb|zh|Korean literary}}, with an explanation link. —Suzukaze-c (talk) 09:17, 21 September 2020 (UTC)[reply]

Or {{lb|Korean Classical Chinese}}? --沈澄心 ✉ 11:59, 21 September 2020 (UTC)[reply]

@Karaeng Matoaya, could you give us some examples? --Frigoris (talk) 12:05, 21 September 2020 (UTC)[reply]

@Frigoris, Some examples as I understand it are:

白文 (báiwén) in the sense of "unapproved document"
次知 in the sense of "someone in charge"
啓聞／启闻 (qǐwén) in the sense of "to report to the central government"
原情 (yuánqíng) in the sense of "to sue, to petition"
發明／发明 (fāmíng) in the sense of "a criminal's excuse or alibi"

But I'm not very well-read in Chinese, have you happened to encounter any of these words in these senses in non-Korean works?--Karaeng Matoaya (talk) 12:29, 21 September 2020 (UTC)[reply]

That sounds a lot like Medieval Latin, with non-native speakers communicating in the default international language of the day. If it's distinct enough, I suppose we could come up with an etymology-only code, but given the multilingual nature of Chinese even in China, it's probably better to treat it as just one more form of Classical Chinese, with nothing more than a regional label. Chuck Entz (talk) 14:19, 21 September 2020 (UTC)[reply]

@Karaeng Matoaya, thanks! Those words were indeed rarely used in those senses in Chinese. For example, 原情 (yuánqíng) since the Han dynasty could mean "to pursue/seek the truth" > "to investigate a legal/criminal case"; however in "Chinese Chinese" it was a very loose compound. The word indeed chiefly appeared in legal contexts, but the meaning was not identical to the one you listed (though related). The phrase now survives in fossil form in the Chinese chengyu 情有可原.

I presume that you do have the textual material in which these words appeared; if that material is clearly in a historical form of Chinese (as opposed to Korean), I think it's appropriate to add them under Chinese with a suitable label. --Frigoris (talk) 15:15, 21 September 2020 (UTC)[reply]

Multiple homographs for Ojibwe finals

I'm looking for advice/rules on how to edit for multiple meanings of the same form when i have no information on the etymology. The example is the Ojibwe final -i (a "final" is a type of morpheme specific to Algonquian languages). It has four different applications, which i have listed with separate POS headings. I'm wondering if that is the most appropriate way to show this phenomenon. I don't think gloss numbers 1, 2, 3, 4 work either, because the most important information - derived terms - doesn't fit under glosses. The English parallels use Etymology 1, Etymology 2, etc, but as i have no information on the different etymologies, listing that way seems inelegant. SteveGat (talk) 15:46, 21 September 2020 (UTC)[reply]

To have multiple etymology sections, you don't have to know what the etymologies are, you just have to know that the etymologies are different from each other. That said, I have an uneasy feeling that we may be dealing with something like a Swiss Army knife, that can be used to slice carrots, pull a cork, file your nails, or saw off a branch without having to swap it for something else. In my experience American Indian languages tend to twist categorization systems based on European languages into macramé. It may be better to use a morphologically-based POS like "suffix" or "particle" combined with senses and subsenses for the different functions. Pinging @-sche, who has experience with this. Chuck Entz (talk) 03:19, 22 September 2020 (UTC)[reply]

The POS isn't the concern here. Finals are a morphological category, according to the literature (see a quick explanation here). As for the Swiss army knife, i think it is possible that the multiple-etymology analysis may be flawed (or at least forced), but it is the analysis adopted by the only authority that deals with the language systematically and is accessible to non-academic readers, the Ojibwe People's Dictionary. But if i read you correctly, the best approach (if we accept that each of the finals has a different "meaning") is to enter each one under a different etymology (Etymology 1, 2, 3...) without actually providing an etymology. Is that right? SteveGat (talk) 20:53, 22 September 2020 (UTC)[reply]

@SteveGat In general, putting different meanings under different etymologies should only be done if the different meanings have etymologically diverse origins. However, if you feel this is correct, there is no particular requirement to provide an etymology for each etymology section. For example, when I generate non-lemma forms of a word and there's already a lemma on the same page for the same language, I use different etymology sections, with no etymology given for the non-lemma form. See Russian пар (par) for an example. Otherwise, if you think they are all etymologically the same, doing what you've currently done is fine, using separate POS headings. Or alternatively, put them as different definitions under the same POS heading, and if you want to separate out the derived terms, you can put the terms derived from different meanings under different boldfaced headers in the same "==Derived terms==" section (e.g. put a semicolon at the beginning of a line to boldface it). Benwing2 (talk) 02:15, 23 September 2020 (UTC)[reply]

(e/c) If we know, or reasonably suspect, that they have different etymologies (even if we don't know what those etymologies are), then yes, they can have blank Etymology 1, 2, etc headers. (If it seemed more likely that the same root diversified into several functions, then just different POS headers or different sense lines would suffice.)
I question some of the things listed as "derived" from the final final (the one defined as "occurs in adverbs, numbers, and other uninflected words"). In niswi, for example, the ending (w)i seems to have been present since Proto-Algonquian (*neʔθwi) if not earlier, and in niizhwaaswi and ingodwaaswi it seems that the ending may have been changed to the ending wi (not a final or suffix -i) at some early date when the words (which ended in ika in PA) were adapted to have the same form as other PA numerals which ended in wi. (In several other languages, the ika was simply dropped.) If -i were a meaningful morpheme marking numerals in Ojibwe (which is questionable, since other numerals lack it), it might make sense to say the words superficially have -i, but when the -i in question is just defined as a meaning-less string that "occurs in adverbs, numbers, and other uninflected words", I question whether such an analysis is useful. - -sche (discuss) 02:25, 23 September 2020 (UTC)[reply]

Thanks for all this help, @- -sche and @Benwing2. I think the morphology offered by the OPD is weak at best. The various finals -i are the result in my view of OPD's attempt to create initials that can be dissociated from the finals, when the more likely linguistic phenomenon is that the -i is simply suppressed when various other finals or suffixes are added. Compounding the problem is that -i in Ojibwe plays roughly the same role as a schwa, it's just vowel filler between consonants, so trying to force out a meaning seems illusory. Unfortunately, the OPD is the most authoritative resource on this issue, and it treats them all as separate finals. Given that the current layout is "ok", i'll leave it unless and until i find something more clear on what is going on. Thanks again. SteveGat (talk) 13:55, 23 September 2020 (UTC)[reply]

Gheg vowels

Hello,

I have to bring up here the issue of Gheg vowels according to the article given. I never heard the vowel [ɒ], designated with "ä" in the vowels grid. I didn't gather examples in particular yet but my overall impression is that nowadays Gheg speakers can change with rather less restraint between [ɛ] and [e]. All terms in Tosk or Standard Albanian with the vowel [e] or [o] due to a schwa at the end have got the same vowel in Gheg as well. I wonder whether the denotation of the vowels [o] and [ɔ] extends any further in Gheg than if one would orthographically distinguish the Standard Albanian or Tosk pronunciation difference on account of the schwa. Due to Tosk influence or whichever preference of the speaker, it almost does not make sense to separate [a] and [ɑ]. Both occur in Gheg words that would have their own entries so, as a basic approach, two Gheg entries would be needed for a single Gheg variant with this vowel. For example the term mas can surely have mâs too despite the overview given in the table. Presumably, combinations can occur freely in one word. Lastly, nobody uses all of these circumflexes in Gheg writing, especially not all together like a thoroughly adopted orthography, excluding also large parts of Albanian linguistics along with Vladimir Orel. They were more frequent at the end of the 19th century and probably the beginning of the 20th century. Perhaps it would be better to abolish one or more of the circumflex vowels and denote different pronunciations in the IPA table. HeliosX (talk) 21:44, 22 September 2020 (UTC)[reply]

Now I gathered examples for my doubts about the complete accuracy by transcribing the first two stanzas of the Gheg song "Gabim" by Dhurata Ahmetaj into the current Wiktionary orthography with all these circumflexes and I had to listen very closely, sometimes to each word. I have added all word-final schwas of Tosk, unfollowed by consonants, but they do not change the pronunciation in the distinctiveness of the orthography.

Gabim:

Sa ndrrovê

S’jê i njejti mo, s’jê i njejti mo, vetën ê harrovê

Çka menovê?

Sê ê jotja kôm m’u kônë

Dêri n’fund u mashtrovê

Ti prêmtovê qaq shumë rrêna saqê êdhê vêtes i besovê

Ê tregovê ftyrën tônê n’atë môment kur me tjera m’krahasovê

Êj, nuk um mêtë sên mâ

Tash nuk kena ça me bâllë

Sê prij mejê, zêmër, s’ditê ça pô dôn

Ti pô dôn mu mê m’pasë n’kôntrôll

Shumë pô dôn, amâ nuk ka mê ndôdhë, jô

Bêjbi, bêjbi, pak kôntrôll

S’ka mê ndôdhë, môs u lôdh mâ

The term "bêjbi" can also be written as "baby" if the speaker prefers that spelling. The preposition "prij" is usually "prej" also in Gheg. The transcription shows that Gheg may have most of the vowels in the table of the article linked but it also becomes apparent that they are interchanged easily as was my overall impression beforehand. The interchangeability occurs in the infinitive particles "me" and "mê", the adverbs "mo" and "mâ" and the reflexive pronouns "vetën" and "vêtes". In another Gheg song of the same year by another artist, there were multiple times "pâsë" and "mênu" instead of "pasë" and "menu" as one would think based on this. HeliosX (talk) 18:00, 23 September 2020 (UTC)[reply]

Entries and contributions being rejected -- CFI, idiomaticity and sum of parts (SoP) woes

Hi all,

First, let me say that I'm happy that you all use your valuable time to help grow and maintain this wonderful tool called Wiktionary. You've helped me learn so much over the years. Atitarev recommended that I start a topic here. He's right. It will probably lead to meaningful dialogue about an issue which probably affects all of us here. The issue, in the form of a question, really is:

As a criterion for inclusion, are some of us being too rigid about idiomaticity?

Here's the problem in question using contrived examples for simplicity.

SITUATION 1 (collocations, compound words, lexical chunks): Imagine that Wiktionary lacked the following word in some foreign language: central processing unit. To my mind, this is clearly a valuable entry for the dictionary regardless of idiomaticity. For me, it's simple: it can be found in published works and it deserves to be here. However, not everyone shares my opinion. Some might argue that it has "low linguistic value" and, perhaps, we should only include CPU. They would say that we should define one of the words central, or processing, or unit and that we should:

include examples for central processing unit in the central, processing, and unit pages.
include a translation on some other page like CPU. The translation must, however, be presented as separate words to avoid a red link (non-existent entry), presumably.

Through a combination of searching and encountering examples and translations, the Wiktionary user will know what a central processing unit is without ever having a dedicated page for it. Meanwhile, at someothersite.com, we can just type central processing unit and voilà. This leads to someothersite.com appearing above us in the search engine results by virtue of having a page with the correct title and heading which the user is interested in. (I'm assuming that being a popular dictionary matters to us too and that we want to lead the pack).

SITUATION 2 (prepositional phrases): We have the word regard in some foreign language. A Wiktionarian would now like to contribute the following:

The contributions are shot down and the Wiktionarian is told that sum of parts contributions are not welcome here. They are told: we already have a regard page. You can simply put all those entries as examples on that page. Problems:

The page for regard becomes a huge mess as it now has to explain all its derived terms.
The next random editor can remove or change the example. (A dedicated page for those entries would not be subject to that arbitrariness).
The derived terms listed above probably have different rules about their usage depending on the language. So, for example, one term might require the genitive case, another might require the instrumental; another might require only the singular number, etc.

But why are we doing this to ourselves? Why are we rejecting entries which are clearly helpful? What are we afraid of? This is not a paper dictionary. We will neither saturate the server nor lose the ability to find anything. The search function seems pretty awesome and I haven't heard about any Wiki projects complaining about hard disk space.

We have unlimited storage space which we should use to make this dictionary comprehensive.
We are not trying to document infinite word combinations, we are documenting frequently used lexical chunks.
By having specific entries, we are doing ourselves a favour in respect of search engine optimisation.
We can reduce clutter and increase the utility of pages by having specific entries for derived and related terms.
We can include more details about the usage of terms and expressions when they have their own pages and aren't just a footnote or example somewhere else.
We can link more translations to actual pages when those pages actually exist so that translation hubs aren't a blur of red.
We make our dictionary more attractive when users immediately see what they are looking for in the autocomplete.

Can we agree that the policy should be: if a term or expression appears in a published dictionary, it is good enough for inclusion in Wiktionary?

All the best. -- Dentonius (talk) 04:41, 24 September 2020 (UTC)[reply]

This has been discussed a billion times, but in response to your concerns about storage, I will just note that NOTPAPER in itself is a poor argument because that could also support having entries for "green leaf" and "large green leaf" and "cute furry kitten". Equinox ◑ 04:43, 24 September 2020 (UTC)[reply]

I also don't like your "published dictionary" rule because that basically makes us beholden to others instead of potentially leading the way. Also, other dictionaries can sometimes be worse than us, perhaps not often, but it can happen. (If you want to look up policy, we usually refer to copying other dictionaries' habits as following the LEMMINGS. See WT:LEMMING.) Equinox ◑ 04:45, 24 September 2020 (UTC)[reply]

if a term or expression appears in a published dictionary, it is good enough for inclusion in Wiktionary Absolutely not. Different dictionaries serve different purposes. They have different criteria for what to include. This diversity is a good thing. We are not trying to be all dictionaries, we are trying to be a particular dictionary with particular editorial practices. DTLHS (talk) 04:47, 24 September 2020 (UTC)[reply]

@Equinox, it's been discussed a billion times but maybe this time will be better. :-) I've never come across green leaf, large green leaf, or cute furry kitten in any published dictionary. Yes, some published dictionaries can be worse than us. But the reverse is true, we can be worse than them too. Their lexicographers are pretty smart people. Who are we to say that we know better about their art? @DTLHS, and what purpose are we trying to serve? I didn't sign up for your editorial practices, I signed up because I like this tool which enabled me to learn several languages. For me, it's about the utility. Why boast about an online universal dictionary which falls short of a paper dictionary? - Dentonius (talk) 05:03, 24 September 2020 (UTC)[reply]

We are lexicographers because we practice lexicography. The OED does not have a magic inaccessible spell which we lack that allows them to write dictionaries. I would much rather debate on the merits of our particular editorial policies than throw them out all together because we decided the editors of other dictionaries know better than us. DTLHS (talk) 05:12, 24 September 2020 (UTC)[reply]

Well, we have hundreds of entries that aren't in any professionally published dictionary; indeed I've seen some fairly compelling evidence that some of them (including the OED) occasionally refer to us to find new words. I am glad you signed up but maybe it's a good idea to look around and get a feel for the place, and its policies (developed by votes by hundreds of people over more than a decade) before immediately taking a sledgehammer to it all. Equinox ◑ 07:11, 24 September 2020 (UTC)[reply]

I've been using Wiktionary for years Equinox, long before I signed up. I love the concept but I always had other dictionaries I used because I knew that multi-word terms are usually a problem for Wiktionary. For example, for Spanish, I'd look up expressions and wouldn't find them here. Thankfully, there's spanishdict.com. For Russian, I'd look up certain expressions. Same problem. Thankfully, there's openrussian.org. For a bunch of other languages, I just go to wordreference.com which is pretty solid. For German, nothing even comes close to dict.cc. That's a project we should try to emulate. But, don't get me wrong. I *love* Wiktionary. There's nothing out there like it, but it still has a lot of growing to do. Now, as regards that sledgehammer, I haven't come to take away or destroy. I have come to give. I'm saying that we should stop thinking that we're better than professional lexicographers and let entries which are in published dictionaries exist here too. DTLHS, we aren't lexicographers (not unless that's your actual profession in real life). I would never try to make their profession out to be something that any lay person can do. In fact, I would sooner trust a published dictionary than anything I see here. For academic purposes, it is only a published dictionary which I can cite. For the amount of time that Wiktionary has been around, it is a crying shame that we haven't overtaken all the other dictionaries out there. - Dentonius (talk) 07:31, 24 September 2020 (UTC)[reply]

I don't think you understand. We aren't saying we are "better" than professional lexicographers. We don't necessarily compare ourselves to them at all. Forget that they exist. We are trying to build a dictionary. In my experience it is not efficient use of volunteer time or resources to create lots of "sum-of-parts" terms: look at the current ongoing discussion around "flu jab", where it's been pointed out that you can have "ANYTHING jab" for any disease that gets a vaccine (e.g. "tetanus jab", "rubella jab"). Now you may be approaching this from the point of view of a TRANSLATOR (do you work as a translator?) where it's important to translate entire phrases into other entire idiomatic phrases. Currently we aren't really a translator's dictionary; that is a different thing from a general-purpose dictionary. Equinox ◑ 08:57, 24 September 2020 (UTC)[reply]

That's a sound argument, @Equinox. I like it. You're right: I do approach it more from the point-of-view of a translator. Aside from the fact that it wouldn't scare me to add all those terms for all named diseases, I can appreciate what you're saying. Your answer was really helpful :-) -- Dentonius (talk) 09:02, 24 September 2020 (UTC)[reply]

Sometimes it takes a good example, which starts to make sense, doesn't it? We don't want to create all types of entries for all types of jabs and we don't want to make entries for all types of workshops either, like automobile repair shop, refrigerator repair shops, computer repair shops, etc. --Anatoli T. (обсудить/вклад) 09:15, 24 September 2020 (UTC)[reply]

I see what you're saying Anatoli T.. I'm just trying to be funny now: but surely there aren't any "wooden chair repair shops", "pet turtle repair shops", etc. I would still assume that the entity in question must correspond to something in real life? But, yes, it makes sense. The line has to be drawn somewhere. Thanks again for your time and patience. ;-) -- Dentonius (talk) 09:20, 24 September 2020 (UTC)[reply]

I would like somebody to explain this to me: How does it take away from our project or diminish our standards if we allow all entries to be created which can be found in real world dictionaries? We're not being lemmings. The published dictionaries are the standard! - Dentonius (talk) 07:31, 24 September 2020 (UTC)[reply]

You say they are the standard. But as I stated above, sometimes other dictionaries even copy stuff from us. So sometimes we are the standard, sometimes we "lead the pack". If we cower in the background, always waiting for a "REAL" dictionary to do something before we can do it ourselves, then we will never amount to shit. Equinox ◑ 08:59, 24 September 2020 (UTC)[reply]

I agree with you 100% here, @Equinox. We shouldn't wait for "REAL" dictionaries to do stuff. I also absolutely believe you when you say that others have copied from us too. But, I didn't mean we should limit ourselves to their entries. I just thought it would be helpful to have a CFI which explicitly states, the other criteria for inclusion don't apply to published dictionary entries; i.e. we should have no reservations about adding terms from paper dictionaries. However, I saw what you said about us not being a translator's dictionary. It makes sense, especially with what you said about volunteer time. However, it's still a little sad that so many useful terms in published dictionaries will just go to waste or be relegated to a footnote just because. I'll get over it :-) -- Dentonius (talk) 09:10, 24 September 2020 (UTC)[reply]

No, it wouldn’t be helpful. Editors already let themselves be helped by them even if they aren’t mentioned there. We always should have reservations. There are plainly wrong things in published dictionaries or other sources. But luckily we can be “better than professional lexicographers”. And for the inclusion matter it is rather that there are unfitting things in the published dictionaries: you ask “how does it take away from our project or diminish our standards if we allow all entries to be created.” The answer is that “an entry” in a paper dictionary is not like an entry at this website. Here one accesses words in a different fashion, not by alphabet browsing but by typing into the search or another search engine, and on the other hand one entry bears the danger of the creation of similar entries on its example. Whereas for a paper dictionary you cannot even make out a distinction in whether something “is a lemma” or just a usage example and on the other hand the public cannot fit new words into it. When under مِثْقَال (miṯqāl) Wehr’s dictionary “includes” مِثْقَال ذَرَّة (miṯqāl ḏarra, “the weight of an atom”, often negated to mean a trifling amount) this 1985 dictionary does not say anything about whether this word combination should be included as an own web page. And an English-Russian dictionary does not put brackets onto its translations to signify whether a Russian term is SOP (as they do not hyperlink). The “entries” here and there are incommensurable. That’s why we cannot parallel them formally, why one should “forget that they exist”, because the notion that they argue for what should be included here is a fallacy – they don’t. I never argue by other dictionaries for whether and how a term should be included – although surely I let myself be helped and guided by them. Fay Freak (talk) 12:10, 24 September 2020 (UTC)[reply]

Proto-Min & Proto-Southern Min

Where to lemmatize terms in these two reconstructed languages? --沈澄心 ✉ 10:42, 25 September 2020 (UTC)[reply]

Global ban RFC for Nrcprm2026/James Salsman

Nrcprm2026, better known as James Salsman, has an active discussion regarding a possible global ban.--GZWDer (talk) 07:56, 26 September 2020 (UTC)[reply]

@@ Line 594: / Line 594: @@
 == {{w|Proto-Min}} & [[Appendix:Proto-Southern Min reconstructions|Proto-Southern Min]] ==
 Where to lemmatize terms in these two reconstructed languages? --'''[[User:沈澄心|沈]][[User talk:沈澄心|澄]][[Special:Contributions/沈澄心|心]][[Special:EmailUser/沈澄心|✉]]''' 10:42, 25 September 2020 (UTC)
+== Global ban RFC for Nrcprm2026/James Salsman ==
+Nrcprm2026, better known as James Salsman, has an active [[m:Requests for comment/Global ban of James Salsman|discussion regarding a possible global ban]].--[[User:GZWDer|GZWDer]] ([[User talk:GZWDer|talk]]) 07:56, 26 September 2020 (UTC)


Contents

Is it non-controversial to run bot-tasks to apply the conventions at WT:NORM?

Format for thesaurus pages

Using User:AutoSkull for automated surname edits

Old Korean lemmas with direct attestation are in the reconstruction namespace

Draft proposal for pre-c. 1910 Korean forms (Old, Middle, Early Modern)

"Pronunciation spelling" label

Invitation to participate in the conversation

Archaic forms and spellings should not be lemmas

Phrase ellipsis, three regular dots or two ellipsis characters (six dots)?

Canadian English

Lexicography films

Vulgar Latin

Wiktionary sitelinks dashboard: URL update

Propose making Template:en-noun pluralization algorithm smarter

meta:Small wiki audit/Malagasy Wiktionary

Northwestern Indo-Aryan

Category:Hindi Tadbhava

Turkish noun inflection

Kyiv

Appendix for strings in unidentified or uncertain languages?

Quote adder redux

CFI for misspellings

FWOTDs

Where to lemmatize Literary Chinese terms attested only in Korean sources?

Multiple homographs for Ojibwe finals

Gheg vowels

Entries and contributions being rejected -- CFI, idiomaticity and sum of parts (SoP) woes

Proto-Min & Proto-Southern Min

Global ban RFC for Nrcprm2026/James Salsman
