Wiktionary:Beer parlour/2009/July

 This is an archive page that has been kept for historical purposes. The conversations on this page are no longer live.

July 2009

Another bot

Yeah, since it appears to be popular among admins, I decided to join the club and set up my own bot User:KassadBot. Anything you'd possibly want to know is described on its user page, and I'll deploy it as soon as it gets the bot flag. (Or are perhaps more test edits necessary?) -- Prince Kassad 00:03, 1 July 2009 (UTC)

Talk:Romeo

OK, I have a question. There's this pizza called the 'pizza Borromea'. So I've looked up Borromeo, and Wikipedia states it comes from Buonromei/Borromei. I guessed the 'romei' was meant as a plural, which led me to this 'romeo'. Wikipedia also said that romeo means 'pilgrim to Rome'. Is it correct to think this way? Does anyone know? I don't speak Italian very well (yet), but they give this link on wp: [www.borromeo.it]. The links are at the bottom of that homepage. Anyway, perhaps if I know more I can do more searching on how the name was used for a pizza. Early Google searches only try to sell me the pizza :( User:Mallerd (Zeg et es meisje) 11:53, 1 July 2009 (UTC)

See Borromeo, but I don't know anything about pizzas. Shouldn't you ask in the Tea Room? --Makaokalani 13:44, 2 July 2009 (UTC)

OK, done. :) User:Mallerd (Zeg et es meisje) 14:46, 2 July 2009 (UTC)

Robot to pick words from a file and add to Wiktionary?

I was wondering if there's a robot which can pick words from a file (on my hard-disk) and populate Wiktionary. I'm talking about the Kannada wiktionary on kn.wiktionary.org.--ಕಿರಣ 15:30, 1 July 2009 (UTC)

Try looking at User:SemperBlottoBot and User:FitBot. The code is explained for each. They are both designed to upload entries created on a local computer. You do have to set up the page format in the source file, but this can be quickly done if you have the right tools. --EncycloPetey 15:34, 1 July 2009 (UTC)
Thanks, EncycloPetey!--ಕಿರಣ 14:29, 6 July 2009 (UTC)
You can also go direct to the source and get the pywikipedia framework and modify it yourself if you follow the local bot policy and are careful and know or learn how to use the framework. - Taxman 15:45, 6 July 2009 (UTC)
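As a rough illustration of the "set up the page format in the source file" step, here is a minimal Python sketch that turns a local word list into page text. It assumes a hypothetical tab-separated input (word, part of speech, definition) and a hypothetical entry layout; neither comes from SemperBlottoBot, FitBot, or pywikipedia, and the actual upload would still have to go through an approved bot framework under the local bot policy.

```python
# Hypothetical entry layout; the real format is set by the target wiki's policy.
ENTRY_TEMPLATE = "=={language}==\n\n==={pos}===\n'''{word}'''\n\n# {definition}\n"


def parse_line(line):
    """Split one 'word<TAB>part of speech<TAB>definition' line into fields."""
    word, pos, definition = line.rstrip("\n").split("\t", 2)
    return word.strip(), pos.strip(), definition.strip()


def build_entries(lines, language):
    """Return a dict mapping page title -> wikitext for every non-blank line.

    `lines` can be any iterable of strings, including an open text file.
    """
    entries = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, pos, definition = parse_line(line)
        entries[word] = ENTRY_TEMPLATE.format(
            language=language, pos=pos, word=word, definition=definition
        )
    return entries
```

With a file on disk, something like `build_entries(open("words.tsv", encoding="utf-8"), "Kannada")` (the file name is made up) would give one page text per word, ready to be reviewed before any bot run.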

Links to items that have failed RfD

We have a substantial number of redlinks which we might not want to encourage people to use to create new entries and which do not offer users the benefit of blue links. If an entry has failed RfD, I would think we would want to remove any wikilink that referred to it. This is not a reasonable burden to place on someone closing out an RfD. Should we have a bot-generated clean-up list for this? This would be a list easier to work on than Ullman's missing list, because the nature of the problem is more specific and the resolution possibilities limited. OTOH, it cannot be reliably automated because the resolution could take any of several forms depending on the circumstances:

1. Outright total removal of the wikilink.
2. Converting a phrasal wikilink to one or more word wikilinks, preferably a section.
3. Link to an article or entry at a sister project (often wikipedia or wikispecies).
4. Linking to an Appendix, preferably a section.

I don't know whether there are other classes of deleted entries for which this would be acceptable. I also wonder whether there is any way to make it work at a more granular level, such as language, Etymology, or PoS section. It might work in cases where the link had a language, etymology, or PoS indication (though PoS indication is often not unique and etymology sometimes is not). DCDuring TALK 17:09, 1 July 2009 (UTC)
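The clean-up list itself (as opposed to the fix, which needs a human) could be generated along these lines. This is only a sketch in Python: it assumes the page texts are already available as a dictionary and that the set of titles that failed RfD is supplied by hand; both inputs and the function name are hypothetical.

```python
import re

# Matches [[target]], [[target|label]] and [[target#section|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def links_to_deleted(pages, deleted_titles):
    """Return sorted (page, target) pairs where page links to a deleted title.

    pages: dict of page title -> wikitext
    deleted_titles: set of titles that failed RfD (assumed supplied by hand)
    """
    hits = []
    for title, text in pages.items():
        for match in WIKILINK.finditer(text):
            target = match.group(1).strip()
            if target in deleted_titles:
                hits.append((title, target))
    return sorted(set(hits))
```

The output is just a worklist; choosing among removal, splitting into word links, a sister-project link, or an Appendix link is left to the editor, as described above.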

Changed search results failure page

What happened to the old search results failure page that contained a link allowing you to add the word to the list of requested pages? Now all you can do is create the page. What's going on? Can someone explain? -- dougher 03:44, 2 July 2009 (UTC)

And on the create-new-page page, the "use the preload templates" link is broken! -- dougher 04:26, 2 July 2009 (UTC)

The developers changed the search results page globally, without asking or even announcing it in advance. That's why I asked politely on IRC whether it would be possible to switch it back to the previous state locally. Unfortunately, my objective and friendly questions received only snotty comments like "Consensus isn't needed to change it". Further requests by a steward and me did not meet with any understanding either.
Finally, I wrote a bug report. If you think local communities should decide for themselves which search results page should appear, you may vote for this bug: bugzilla. Additional comments are welcome, too. Best regards from the German wikis, Stepro 02:40, 3 July 2009 (UTC)

And just now, when I searched for LGBTQA, I got a page that offers me no option other than to click on a link to something (LGBT) it thought I might mean. No option even to create a new page. It's totally useless.

Search results
Definition from Wiktionary, a free dictionary
Did you mean: LGBT
Content pages
Multimedia
Help and Project pages
Everything

-- dougher 17:45, 3 July 2009 (UTC)

I have restored the preload buttons - the major "breaking" change. Mediawiki:Search-nonefound is the new name for Mediawiki:Noexactmatch. It's a pity the new choose-namespace box is so huge and ugly, but maybe we can fiddle with Special:PrefixIndex/Mediawiki:Search to restore some niceness. Conrad.Irwin 19:35, 4 July 2009 (UTC)

There's a problem with that new page; it looks like one of the variables isn't being resolved: "Wikipedia may have an article on $1." -- dougher 19:56, 4 July 2009 (UTC)

Does it work like that? Try substituting the MW message rather than transcluding it. It would seem that the variables are ignored when transcluding. If that doesn't work, then it looks like the new version doesn't use the variable... An alternative would be to use MediaWiki:Searchmenu-new, as it has it, but it would be positioned above any search results... -- 6Sixx 05:33, 5 July 2009 (UTC)

Entries mislabelled as interjections

Can we please avoid using "interjection" as a catch-all part of speech for phrases?

From our own entry, an interjection is "An exclamation or filled pause; a word or phrase with no particular grammatical relation to a sentence, often an expression of emotion."

A good few of the entries in Category:English_interjections are not interjections but phrases (eg, "does a bear shit in the woods" and "Kilroy was here") or pro-sentences (phrases that replace a complete sentence). Examples of pro-sentences currently under this category are "Happy Thanksgiving" (for the sentence "I/we wish you a happy Thanksgiving") and "as you wish" (for the sentence "Have it as you wish"). These do not fit our definition of interjections and so should not be marked as such.

I propose that we use "Phrase" (or, at a pinch, "Pro-sentence") as the part of speech for these terms. We reserve "Interjection" as the part of speech of words and phrases that truly are interjections. — Paul G 15:02, 2 July 2009 (UTC)

IMO the underlying problem is that "phrase" is unsatisfactory except to mark phrasebook entries, which purpose is better served by a category. "Phrase" (as defined at phrase) means:
1. A short written or spoken expression. OR
2. (grammar) A word or group of words that functions as a single unit in the syntax of a sentence, usually consisting of a head, or central word, and elaborating words.
The first definition is unhelpful. The second suggests that it has a relationship to the meaning of a sentence, apparently, one of which it is a part. If it has such a relationship, then we ought to be able to assign it one of the traditional parts of speech (with all appropriate caveats). IOW, "phrase" seems to me more like an indication that clean-up is needed.
The items that may be improperly called interjections don't have any particular grammatical relationship to a sentence in which they appear, or they appear as if they were a sentence.
Unfortunately, besides not being one of the sanctioned PoS headers, "pro-sentence" is about as suitable a header as "meronym" from the point of view of a normal user, not appearing once among the 385MM words of COCA.
The other possibilities are "proverb" and "idiom". "Proverb" may accommodate but a few. "Idiom" is a different kind of throw-up-your-hands category. If a multi-word term is in Wiktionary, it must be idiomatic, so, if an expression doesn't have an obvious grammatical category from its head, a contributor could call it an idiom.
Other dictionaries, when they include such phrases, usually do not label them with a part of speech. They may appear in a grammatically undifferentiated mass of idioms.
In other words, I suggest that "idiom" may be a better catch-all than "interjection" or "phrase". DCDuring TALK 16:45, 2 July 2009 (UTC)
Re "'A short written or spoken expression'... is unhelpful": Actually, that's exactly what I think of when I see the word phrase usually, including in our POS headers. I like it as a last-resort POS header (meaning, where no real POS fits the bill).​—msh210 17:18, 2 July 2009 (UTC)
Looking at the category, I find many that are obvious proverbs, nouns, prepositions, adverbs, adjectives, phrasal verbs, adj/advs, and idioms that could easily be assigned a traditional part of speech. Some look like they would take a little bit of time to analyze. Of the 500-600 in the category, I'd be surprised if there were more than one hundred refractory cases that would require, say, a TR consult. Those probably will include some that have some similarity to the non-interjections that triggered this discussion. DCDuring TALK 17:47, 2 July 2009 (UTC)
I think phrase (in the first sense) is just fine. Idiom, on the other hand, most commonly means a more-or-less fixed set of words whose meaning is not easily derivable from the meanings of its members. So, some of your phrases will be idioms, but not all. Linguistics also uses lexical chunk, multi-word expression or N-gram (where N represents the number of words).--Brett 01:58, 3 July 2009 (UTC)
One interesting thing is the number of items in Category:English phrases that can easily be assigned to more specific PoS categories. There are in addition items that have a "phrase" header but are not in the category, as there are items with the idiom header that are not in category:English idioms. These areas seem like the Augean stables. DCDuring TALK 02:43, 3 July 2009 (UTC)
"Phrase" makes perfect sense to me. Usually the parts of speech are sufficient to cover how a term can be used, and those terms that aren't covered are the ones with a subject and a verb.
It may seem kinda shady since all three ("phrase", "adverb", and "interjection") are catch-alls, but they have very different application. An adverb modifies another word (a word other than a noun) so it's a catch-all modifier when there's a strong connection to the sentence. While interjections and many phrases can stand on their own, an interjection is a catch-all for terms that don't play by the rules, when there's a weak or no grammatical connection to the surrounding words, and at the same time insufficient grammatical structure internally.
A phrase is all or the most important parts of a sentence, not used simply to modify or link other words grammatically, but for which other words are used to link it semantically. 70.112.31.249 05:57, 6 July 2009 (UTC)
Do we actually have a community preference for the "true" part-of-speech headers over headers like "idiom" and "phrase"? Idioms can usually be put into a true PoS category and header, but are tagged at the sense level, because some have non-idiomatic senses which we have for contrast. This puts them in the idiom category. The same is true of phrases.
Do the phrase categories have value?
Is it worthwhile to have a true PoS header if one can be provided? DCDuring TALK 03:45, 3 July 2009 (UTC)
I would say "yes" and "yes", but would qualify that first response. I do see value in a "Phrase" header, especially when a collocation does not quite fit a single function, or when it is a phrasebook entry. I can't offhand recall an example of the first issue, although I have come across them from time to time. The latter is generally accepted (or at least has been in the past). I do prefer a "true" PoS header, but there are some situations where it's really tough to decide. The most recent headache I've personally had in this regard is Latin nōn. Most books I've seen call this an "adverb", but it's used to negate nouns, phrases, clauses, and whole sentences, so the tag of "adverb" only works if you consider it to mean "word that modifies another word, but isn't one of the other parts of speech", which isn't much of a definition. I've set it as "Particle" for the moment, but that doesn't really seem to fit altogether either. I think what I'm trying to say in all this is "Language is messy," and to indicate that a label of Phrase allows us to deal with a few difficult cases on the fly. It is preferable to use some other PoS when possible, but sometimes that isn't possible. --EncycloPetey 18:59, 4 July 2009 (UTC)
• Note that there are some words which are exceedingly hard to classify. In particular, linguists disagree about, and have even created specialized specific classifications for, the parts of speech of "yes" and "no". There's discussion of this, which should explain some of the complexities, at w:yes and no#English grammatical classification. I also recommend reading w:sentence words.

As an aside, note that w:yes and no has lots of interwiki links to us for various words including translations of "yes" and "no", some of which we are missing and really should have. Uncle G 15:13, 10 July 2009 (UTC)

As lexicographers, we are blessed in that we do not have to cover every possible PoS role that a word might assume. Although almost any word can be used to form a question ("Nuts?", "Dated?", "And?"), we get to ignore that possibility. EP has resisted meaningless proliferation of PoS headers for proper nouns. We delete adjective PoSs for participles that don't meet some criteria that demonstrate clear adjectivity. We insist that English -ing forms show use as plurals (or as mass nouns) before being treated as nouns.
• Can we similarly dismiss imperative, hortative, and precative uses of verbs (phrasal and other) (bugger off)? The sense is (almost ?) always in the verb. Why do any of these need to be separately shown as "interjection", "idiom", "phrase", or "sentence". If some are, which ones and why? Are we supposed to discriminate based on frequency in a corpus? Do we have access to any good corpora of colloquial speech? Should we rely on other dictionaries?
I have removed some proverbs from Interjections and have begun categorizing entries there (and elsewhere) as Category:English sentences (hidden) and have placed Category:English rhetorical questions and Category:English proverbs in that category for purposes of tallying how many such entries we have. (It appears to be hundreds.)
Appropriate depopulation of the Interjection PoS header and Category:English interjections under existing rules would enable us to better determine what hard problems remain. Similar steps would be appropriate for the Phrase and Idiom headers, many of which could (and should ?) be replaced with traditional PoS headers.
It seems to me that the range of headers we should use should be limited to those that are immediately intelligible to most anons. Many terms of use to linguists would be excluded by that standard, though they would certainly have use in categorization (hidden or not). DCDuring TALK 17:23, 10 July 2009 (UTC)

template:und

See [[pelfa]]: it uses `{{etyl|und|ca}}` to display "Undetermined" (a redlink incidentally) in its etymology section. Do we want to do this? I don't see the problem with it, but perhaps this should be at template:etyl:und instead (even though it's an ISO 639 code) since we (presumably) don't want to use it for all the things we use language codes for. Thoughts?​—msh210 20:01, 2 July 2009 (UTC)

I don’t think we do; note that we already have `{{unk.}}`.  (u):Raifʻhār (t):Doremítzwr﴿ 14:42, 3 July 2009 (UTC)
Deprecating `{{unk.}}` in favor of the ISO 639 code und has been discussed in the past. Personally, I'm in favor of such a move, especially since we should want `{{etyl|und}}` to do what it would be expected to. — Carolina wren discussió 17:11, 3 July 2009 (UTC)
Actually, yes; having just read the preamble of w:List of ISO 639-2 codes, I agree with such a change, too.  (u):Raifʻhār (t):Doremítzwr﴿ 00:31, 4 July 2009 (UTC)
Yeah, I think `{{etyl:und}}` is the way to go. —RuakhTALK 18:35, 4 July 2009 (UTC)
Why the colon, though? Why not treat it as an ordinary ISO code?  (u):Raifʻhār (t):Doremítzwr﴿ 18:38, 4 July 2009 (UTC)
Because then you would be able to create entries in the Undetermined language and have Undetermined translations in translation sections. -- Prince Kassad 18:40, 4 July 2009 (UTC)
Oh, OK. Fair enough.  (u):Raifʻhār (t):Doremítzwr﴿ 18:42, 4 July 2009 (UTC)
[edit conflict, x2] I suspect the mechanics of recoding all the templates that can make use of ISO templates is at issue. If we code it like a language template, then we'll end up with red-linked categories for "Undetermined" words in odd locations, including topical categories (which do make use of the ISO templates). It would therefore be preferable to limit its use to etymologies by explicitly renaming the template. --EncycloPetey 18:42, 4 July 2009 (UTC)
`{{unk.}}` categorizes in [[category:Unknown etymology]] (and its xx: counterparts), though, whereas `{{etyl}}` categorizes in [[category:Whatever derivations]], here Undetermined derivations, (and its xx: counterparts). If we're to get rid of `{{unk.}}` and use `{{etyl:und}}`, we'll need to manually recategorize anything left after that switch in the unknown-etymology category (i.e., anything hand-categorized therein). Is that all right with all and sundry?​—msh210 17:18, 7 July 2009 (UTC)
Oh, interesting. Given that, I guess `{{unk.}}` doesn't mean "from unknown language", but rather "etymology unknown". It could apply just as well to a word that arose in English. `{{etyl|und}}`, by contrast, implies that the word came to us from another language, we just don't know which. I don't think we can deprecate `{{unk.}}` in favor of `{{etyl|und}}`. (Which isn't to say that we shouldn't delete `{{und}}` and create a more-appropriate `{{etyl:und}}`, just that we can't expect it to cover all cases.) —RuakhTALK 18:22, 11 July 2009 (UTC)
I like the wording "undetermined" better than "unknown" because it suggests that the question is open and yet to be solved. Qorilla 18:09, 10 July 2009 (UTC)

Cryptic codes

Why use cryptic codes when displaying pages? Why not generate "masculine" from the template m, etc.? Lmaltier 16:32, 4 July 2009 (UTC)

masculine would be way too long. Besides, if you hover over the m a tooltip pops up showing the meaning. -- Prince Kassad 16:34, 4 July 2009 (UTC)
In a paper dictionary, it would be too long. But here? Lmaltier 16:48, 4 July 2009 (UTC)
If it were the only thing on the line, it wouldn't be too long, but many entries must have multiple translations. Writing out "masculine" / "feminine" every time in the translations tables would distract from their primary purpose of displaying the translations. The genders are provided as a courtesy, not as a key point of translation. --EncycloPetey 16:50, 4 July 2009 (UTC)
You are perfectly right for translations. What I had in mind was the mention of the gender just before the definition line. It's an unfortunate example of what happens when a template is used for several different purposes... Lmaltier 17:14, 4 July 2009 (UTC)
I have a rather large monitor, and in some cases (like at ראשון) the full words 'masculine', 'feminine', 'singular', 'plural', etc, tend to make the head two rows tall, which... I personally don't find particularly attractive. — [ R·I·C ] opiaterein — 18:12, 4 July 2009 (UTC)
The use of abbreviated forms in the inflection templates varies from language to language. As Opiaterein has pointed out, use of the full word would make some templates wrap to two lines, which is probably less desirable on the whole than avoiding abbreviations. --EncycloPetey 18:39, 4 July 2009 (UTC)

Give me a better title than Template:mention-Latn

There's really no reason for the span class of `{{Latn}}` to automatically be a mention, so in order to not mess up `{{term}}`, I'm going to move the code at `{{Latn}}` to a new mention-only template, while changing the span class of the former to simple 'Latn' instead of 'mention-Latn'. (Thus the new template will be italic (if that's preferred) and users can set their own preferences for the rest of the Roman alphabet based on the old template.) But, can we think of a better name for the template than the one I gave above, or is that alright with everyone? — [ R·I·C ] opiaterein — 18:09, 4 July 2009 (UTC)

Do we have any other similarly named templates? The other option I can think of is `{{Latn-mention}}`, but that doesn't make the name any more elegant, just better at being listed alphabetically. --EncycloPetey 18:36, 4 July 2009 (UTC)
I don't think we do, but I suppose we could. `{{Latn-term}}` I suppose could work, I think I've seen another script template use 'term' in place of 'mention'. — [ R·I·C ] opiaterein — 18:39, 4 July 2009 (UTC)
(Sorry for not speaking up earlier — I didn't notice this until now.) There's no need for a separate template; we already have the face= parameter in many script templates, which can take the values of head, bold, ital, term, or anything-else. (For example, `{{infl}}` calls its script template with face=head, which for example `{{Latinx}}` and `{{Cyrl}}` translate into <b>, and `{{Hebr}}` translates into <big>.) We just need to start making more consistent use of that parameter — `{{Latn}}` needs to use class="mention-Latn" only if face=term (and presumably <i> if face=ital, though TBH I'm not 100% clear on the face=ital use-case), and templates like `{{term}}`, `{{prefix}}`, etc. need to specify face=term. —RuakhTALK 02:26, 10 July 2009 (UTC)
Yes, I'm going to go fix this; you can then remove the references to {mention-Latn} and delete it. The face=ital case is mostly for completeness, where italics are desired depending on script, but the semantics of "term" are not appropriate. Whether this case exists I don't know ... (I also want to add in the Xyzzy script selection magic, but not right now; I can't make that big a change without a longer work session, like 8-10 hours, to check the results ;-) off to flood the job queue ... done. Robert Ullmann 13:00, 10 July 2009 (UTC)
seems to have worked okay (the HTML tidy saved me from a horrid mistake, fixed ;-). References to "mention-Latn" can be changed back to "Latn". And {term} passes the language parameter to the script templates properly now ... Robert Ullmann 14:41, 10 July 2009 (UTC)
It was actually my mistake that HTML tidy saved us from. Sorry. (It seems to have been the Law of Prescriptive Retaliation, activated by my snarky comment at the bottom of Template talk:Latn.) —RuakhTALK 14:59, 10 July 2009 (UTC)
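To make the face= dispatch discussed above concrete, here is a small Python model of it. It is only an illustration of the logic described in this thread, not the template code: the function name is invented, and the exact markup for the head, bold and ital cases (including putting the class on the `<b>` and `<i>` elements) is an assumption.

```python
from html import escape


def render_latn(text, face=None):
    """Model of a Latin-script template choosing markup from face=.

    Assumed mapping, loosely following the thread:
      face=term       -> span with the mention-specific class
      face=ital       -> italics
      face=head/bold  -> bold
      anything else   -> plain span with class "Latn"
    """
    body = escape(text)
    if face == "term":
        # Only mentions get the mention-specific class.
        return f'<span class="mention-Latn">{body}</span>'
    if face == "ital":
        return f'<i class="Latn">{body}</i>'
    if face in ("head", "bold"):
        return f'<b class="Latn">{body}</b>'
    return f'<span class="Latn">{body}</span>'
```

The point of the design is that callers such as `{{term}}` state their intent (face=term) and the script template alone decides the presentation, so no separate mention-only template is needed.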

Database dump

I read about a mysterious database dump at various places. Where can one access it or is it not public? Qorilla 10:04, 7 July 2009 (UTC)

It's totally public and a new one just became available here: http://download.wikimedia.org/enwiktionary/20090707/
hippietrail 11:46, 7 July 2009 (UTC)
Thanks! Qorilla 16:14, 7 July 2009 (UTC)
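For anyone wanting to read the dump once it is downloaded, a minimal Python sketch follows. It assumes a bz2-compressed MediaWiki XML export in which each `<page>` element contains a `<title>` and a revision `<text>`; the file name in the usage note is hypothetical, so check the directory listing for the actual names.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element read from an XML stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # free memory; dumps are large


def iter_dump(path):
    """Open a bz2-compressed dump file and yield its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Usage would look like `for title, text in iter_dump("pages-articles.xml.bz2"): ...`, streaming one page at a time rather than loading the whole file.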

What is a common misspelling?

I have no doubt that this has been brought up before, but do we have a WT:CFI definition for this? In French it's pretty common on the Internet just to get rid of accents just for simplicity. Plus some of them are difficult to type. It would be really easy to attest up to a million (yes, really) misspellings that are common by just getting the accents wrong, or not bothering at all. In Spanish we have things like adios too. Again, it's more like laziness than really "misspelling" the word. Mglovesfun (talk) 10:13, 7 July 2009 (UTC)

We haven't resolved any aspect of this, AFAICT. In English, there is no official arbiter of correctness. No dictionary is complete so relying on them would be too restrictive. We don't have a good general way to distinguish between a common misspelling and an alternative spelling.
It may be that we could get agreement on circumstances in which omitted diacritical marks constituted an error worth showing as a separate entry. I certainly don't think we need a million misspelling entries and their attestation. The use of `{{also}}` is supposed to assist users in finding entries with diacritical marks they find hard to type. DCDuring TALK 15:55, 7 July 2009 (UTC)
Removing diacriticals is not a "misspelling", in my opinion. For years, library databases have routinely omitted accents, tildes, and other diacriticals because the software could not support them. So, the words were deliberately spelled that way because of the restrictions of the medium. --EncycloPetey 17:02, 7 July 2009 (UTC)
You probably have English in mind. For French, I think you are partially right, but only for capital letters (e.g. NOEL instead of NOËL). Lmaltier 17:09, 7 July 2009 (UTC)
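The scale of the problem is easy to see in code: dropping diacritics is a purely mechanical transformation, so an unaccented variant can be derived for any accented word. A minimal Python sketch using the standard `unicodedata` module (the function names are invented for illustration):

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. an accented vowel becomes the bare vowel."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def differs_only_by_diacritics(a, b):
    """True if two distinct spellings are equal once diacritics are removed."""
    return a != b and strip_diacritics(a) == strip_diacritics(b)
```

A check like `differs_only_by_diacritics("adios", "adiós")` could be used to separate diacritic-only variants from other candidate misspellings. Note that letters which have no Unicode decomposition, such as ø, pass through unchanged.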

Unified Serbo-Croatian

Vote has been started: Wiktionary:Votes/pl-2009-06/Unified Serbo-Croatian

I announce this in BP as the last discussion appears to have generated lots of controversy. Please express your opinion by casting a vote. -- 17:26, 7 July 2009 (UTC)

Project pages for languages?

Is there anything similar here like Wikipedia's Projects or Portals that could give here a start page for entries of specific languages, inform editors about what needs to be done, what templates are there, etc? Qorilla 20:24, 7 July 2009 (UTC)

Wiktionary:About Hungarian (e.g.). Usually the shortcut is WT:AHU (e.g.).​—msh210 20:30, 7 July 2009 (UTC)
Yes, I know that page. I hoped there would be some nice, colorful, boxes-floating-around style Portal, like w:Portal:Languages. The About Hungarian page is merely an extension to the Entry layout explained page that describes what Hungarian entries should look like. Alright, I just wanted to know. If no, then no. Qorilla 20:39, 7 July 2009 (UTC)
I don't know about colors and boxes (though in general we are far less into them than Wikipedia is), but the About Languagename pages are where language-specific information often is, and the associated talkpage is where much discussion takes place. (At least for some languages.) E.g., WT:AHE (the About Languagename page I'm personally most familiar with) lists templates, categories, and several community decisions regarding Hebrew, and its talkpage has discussions on a number of issues. Are the WP portals meant only for editors or also for readers? The About Languagename pages (and all Wiktionary namespace pages AFAIK) are meant only for editors.​—msh210 20:51, 7 July 2009 (UTC)
I see. I guess Wiktionary just needs time and most of all more editors. The project is kind of unknown... But I think in the end it would be desirable to have reader-portals too (with a part that describes how and in what one may collaborate). In WP the Portals are pages where you can start out to see articles about that topic, or just browse the featured articles etc. In WT this could be a place that listed links to various categories about that language (parts of speech, or any more exotic categories), the work to be done, the users that edit regularly in that language's entries, a did you know box, you know all that kind of stuff that is in WP Portals. But I think we have to wait for that a bit. Qorilla 21:04, 7 July 2009 (UTC)
I'm still unclear on whether you mean this to be aimed at readers or at (potential) editors. Stuff for readers is at category:Hungarian language: it has all you described: it's a page where you can start out to see categories (parts of speech or more exotic categories). Stuff for editors — like work to be done and even a list of active editors — can easily be put on WT:AHU. I'm not sure what place a "did you know" box would have in a dictionary, nor what other "kind of stuff" WP's portals have. Note that not everything on WP is translatable to here.​—msh210 21:13, 7 July 2009 (UTC)
I think all readers should be considered potential editors. That is why the edit links appear on all pages. That is the philosophy behind wikis. Note that this did not start as 'I want this and this' but I was looking if there is such a thing. I just felt there might be a main "portal"-like page that includes everything we have e.g. on Hungarian language. Thus a short introduction, the above mentioned Categories link, a link to the WT:AHU, one to the Index:Hungarian, one to the Appendices, the frequency list, a pronunciation guide, some statistics, the random-entry link, links to other wikimedia projects, and who knows what else. A top-level master-page from where every Hungarian-related page is easily reachable, in a user-friendly portal-like layout. Qorilla 22:45, 7 July 2009 (UTC)
Would you like to create a portal for Hungarian? It could be Appendix:Hungarian portal. The question is how will our users find this portal? --Panda10 22:53, 7 July 2009 (UTC)
Yes, I agree with Panda10: make one if you think it's missing! (Although, again, I think that you can put all that at WT:AHU.) For "portal-like layout" I'm picturing something like... this?​—msh210 16:31, 8 July 2009 (UTC)
There are such pages on the french wikt: fr:Portail:Anglais, fr:Portail:Allemand etc... Beru7 19:44, 8 July 2009 (UTC)
I might do it later but I just got too many other plans for now. Qorilla 22:07, 8 July 2009 (UTC)

List of common terms used in Wiktionaries

To help someone make simple edits in a foreign-language Wiktionary, it would be great if that person's native Wiktionary provided material (translations and entries) on standard dictionary terms in other languages. Does anyone know of such a list of terms? It would be more specialized than [[Appendix:English words all Wiktionaries should have]] and combine elements of WT:ELE and WT:GLOSS. Examples would be noun, language names, plural, neuter, etymology, translations, sense (wikimedia interface terms like edit and history are already mostly translated and available to users).

For some of these words we have good coverage in the popular languages, but for most we don't. Ideally the list could be used as a base for something like [[Wiktionary:Project Multilingual Translations]] (possibly breathing some life into that project). I thought about Word of the Day, but these terms aren't usually exotic enough for that list. If there's good progress here, via some sort of Multilingual coordination other Wiktionaries could work on the effort as well. --Bequw¢τ 00:15, 8 July 2009 (UTC)

Are you thinking of Appendix:Glossary (or perhaps WT:GL)?​—msh210 17:05, 8 July 2009 (UTC)
That's also a good source. I made a start at such a set at Wiktionary:Project Multilingual Translations. Ideally people that edit other wiktionaries will know which other terms are commonly needed for simple editing tasks. --Bequw¢τ 08:27, 10 July 2009 (UTC)

Invented etymologies

User:KYPark is again adding etymologies invented by himself on the basis of skimpy information from dubious sources. For example, he seems to have taken this information:

onduler ... [lat. undulare < unda (=onde)]

from a Korean publication about French and crafted this edit, which includes cognates such as gondola. There is in fact no evidence of a Latin verb *undulare, and I pointed this out to KYPark. His response suggests to me that he will continue fabricating etymologies as he has done in the past, since he has told me (in effect) that I cannot criticize his work unless I am God. --EncycloPetey 04:43, 8 July 2009 (UTC)

My suggestion is simply a long (permanent?) block. This issue has come up so many times, each time with resounding community consensus. KYPark refuses to debate in an intelligible manner, and refuses to comply on this. -Atelaes λάλει ἐμοί 06:40, 8 July 2009 (UTC)
Fairly rude "debating" style. Quite stubborn. Even if his theories should by some odd chance prove out, they have so little support at present as to be at best irrelevant here. We seem to be a soapbox for him. Is he so productive of good material that we could tolerate the risk of time-wasting conflicts over the etymologies? DCDuring TALK 07:09, 8 July 2009 (UTC)
It has already been abundantly explained to KYPark what the word cognate means (in fact, the entry on it was rewritten because of his deliberate misuse of the term), and that borrowings in different languages cannot be "cognates". Today it is a Latin borrowing in English that is a "cognate" to an Italian word (which is in turn a borrowing from Greek), tomorrow it would be some Korean dialectalism cognate to some German word (cf. his userpage). KYPark has the potential to become an excellent editor (and has in fact created a great many excellent entries), but it appears to me that he is deliberately misusing the "community credit" gained with such entries to push some obscure and dubious etymologies. IMHO he should be simply forbidden to edit ==Etymology== at all, under the penalty of indefinite block. -- 10:07, 8 July 2009 (UTC)
"Fairly rude" indeed. He has gone so far as to curse the English language this time around. I've taken a look at w:Cognate and have noticed that it's missing some key points on the subject of cognates, such as cognates typically being the same part of speech, and typically not including words which have changed PoS or meaning as a result of word modification in one language. The article could benefit from some good non-examples with explanations. --EncycloPetey 12:46, 8 July 2009 (UTC)
Based on community response, both explicitly here and implicit in edit reversions by others who did not comment, and based on further bizarre arguments rooted in theology and anti-logic (for lack of a better word), along with his creation of Citations:undulate as a repository of dictionary sourced etymologies for *undulare (rather than any actual citations for undulate), I have blocked KYPark for two weeks. I suggest looking at his talk page responses to see just how bizarre the comments really are. I wouldn't have believed it myself if I hadn't read them. --EncycloPetey 15:08, 9 July 2009 (UTC)
I suggest that after this block expires KYPark be permanently forbidden to edit ==Etymology== sections of any Wiktionary entry, under the penalty of indefinite block. He could be allowed to discuss or propose etymologies at the entry talkpages, and then bring them to community focus in WT:ES or somewhere to receive "blessing". I think that such a sanction would bring out the true nature of his contributions. --Ivan Štambuk 15:24, 9 July 2009 (UTC)

Terms for gay

Please contribute to the discussion, relating to transwikifying a pretty poor list here, at w:Wikipedia:Deletion review/Log/2009 July 7#List of terms for gay in different languages. I also draw your attention to Wiktionary:Transwiki log/Long articles for review. Uncle G 15:46, 8 July 2009 (UTC)

Verb form categories

This is related to the recent discussion Wiktionary:Grease pit#Miscategorisation of præserves and to my edits in part of speech categories in the last few months. I hereby make the following seven proposals about categorization of verb forms in all languages.

1. When it's desirable to keep track of characteristics of verb forms such as tense, number and mood through categorization, each of these characteristics should be separated into individual categories.
• Example: The possible category German verb third-person singular present indicative forms (and, by extension, any other highly-detailed verb form category) should be deprecated in favor of German verb third-person forms, German verb singular forms, German verb present forms and German verb indicative forms.
• Note: This proposal is already in effect for the Spanish language. See Category:Spanish verb forms.
2. All verb form categories should use the following naming system: (language) verb (characteristic) forms.
• Examples: Dutch verb singular forms, Spanish verb imperative forms, German verb third-person forms, English verb simple past forms, Portuguese verb obsolete forms (and not "Dutch singular verb forms", neither simply "Dutch singular forms" nor "Dutch singulars").
3. Gerunds and participles are also verb forms, so despite having characteristics from other parts of speech, they should be named as explained in the 2nd proposal.
• Examples: English verb past participle forms, Latin verb future participle forms (and not "English past participles").
4. If a gerund or participle has exactly all the characteristics of another part of speech, its entry should contain the respective part of speech header.
5. Language-specific exceptions from the 1st proposal may be established to avoid redundant or duplicate categorization. (i.e., two or more characteristics may coexist in single categories, such as, possibly, Spanish verb informal second-person forms, because all Spanish informal conjugations regularly marked differently from the rest are also second-person conjugations.)
6. All language-specific exceptions should use the following naming system: (language) verb (two or more characteristics) forms.
7. These should be the categories used for the few possible English verb forms, as the first language-specific exception:
• English verb forms
• English verb archaic second-person singular present indicative forms
• English verb third-person singular present indicative forms
• English verb archaic third-person singular present indicative forms
• English verb simple past forms
• English verb irregular simple past forms
• English verb archaic second-person singular simple past forms
• English verb past participle forms
• English verb irregular past participle forms
• English verb present participle forms
• English verb archaic forms
• English verb archaic second-person singular simple past forms
• English verb archaic second-person singular present indicative forms
• English verb archaic third-person singular present indicative forms
• English verb irregular forms
• English verb irregular simple past forms
• English verb irregular past participle forms

--Daniel. 07:50, 9 July 2009 (UTC)
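The naming system in proposals 1, 2 and 6 above can be sketched in a few lines of Python. This is only an illustration of the proposed pattern "(language) verb (characteristic) forms"; the function names `verb_form_category` and `split_categories` are hypothetical and do not correspond to any existing Wiktionary template or bot.

```python
def verb_form_category(language, characteristics):
    """Build one category name of the form '(language) verb (characteristics) forms'.

    `characteristics` may be a single string (proposal 2) or a list of
    strings joined into one name (the language-specific exceptions of proposal 6).
    """
    if isinstance(characteristics, str):
        characteristics = [characteristics]
    return f"{language} verb {' '.join(characteristics)} forms"


def split_categories(language, characteristics):
    """Proposal 1: give each characteristic its own category."""
    return [verb_form_category(language, c) for c in characteristics]


if __name__ == "__main__":
    traits = ["third-person", "singular", "present", "indicative"]
    # One category per characteristic, as proposed for German.
    for name in split_categories("German", traits):
        print(name)
    # A single combined category, as proposed for the English exception.
    print(verb_form_category("English", traits))
```

Under this sketch the same list of characteristics yields either four separate German categories or one combined English category, which is the distinction the proposals draw.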

Who uses categories? Who might use them? How might they use them? Is there any reason to believe that these categories are used by anyone other than regular contributors? If not, is there any reason for them not to be hidden categories, so that the concerns of a larger population need not be a factor in the decision? What is the benefit of forcing this framework onto languages with minimal inflection?

In particular, what is the benefit of this to monolingual English users of the English wiktionary? DCDuring TALK 09:35, 9 July 2009 (UTC)

I don't see much use for proposed categories but, generally speaking, categories are useful, e.g. when you want to consult a word, but cannot enter required characters on your keyboard. Topical categories are helpful when you want to find a word you know, but cannot remember, etc. Lmaltier 15:24, 9 July 2009 (UTC)
I agree with Lmaltier's comments about categories in general; additionally, anyone who understands that categories are dynamic lists of related words and feels comfortable with them might use them. Other possibilities would be searching through the existing appendices, entry translations in case of foreign languages, Wikisaurus pages or related words in the same language. --Daniel. 13:56, 12 July 2009 (UTC)
To my North American ear, titling the categories (lang) verb (characteristic) forms sounds awkward. "English transitive verb forms" sounds better than "English verb transitive forms". Is there a technical reason you choose that ordering? --Bequw¢τ 23:14, 9 July 2009 (UTC)
• Please make these hidden categories, especially in English.
The names are distinctly not idiomatic English, except in propeller-head dialect. However, since this categorization has no known use to normal end users, the awkwardness is easily rendered harmless by making them hidden categories. It can be another of the many benefits available only to registered users. DCDuring TALK 23:02, 11 July 2009 (UTC)
The categorization of part of speech characteristics or subclasses is in use in many languages (the most common being probably (language) proper nouns conventionally as a separate POS, other commons ones include (language) adjective comparative forms and (language) plurals). The main objective of the proposals above is name consistency, rather than forcing an unknown system into these languages. --Daniel. 13:56, 12 July 2009 (UTC)
I do not see a reason to include here a category to keep track of verb forms that are transitive, because it would only reflect context which can be found at the main form. On the other hand, a category Spanish verb imperative forms would be acceptable because there are Spanish verbs conjugated to be imperative. (Similarly, English adjective comparative forms contains adjectives inflected to display comparative degree.) --Daniel. 13:56, 12 July 2009 (UTC)

Just by way of observation, fallen is only an English verb, not an adjective or a noun. The "adjective" is simply the participle functioning as a modifier (as opposed to a true adjective like interested). The "noun" is a fused head construction similar to that recently described here.--Brett 01:02, 10 July 2009 (UTC)

The discriminators I have been relying on for showing an adjective PoS for a past participle is that it be used as a predicate and that it be comparable/gradable. Some senses of fallen seem to fit the bill, serving as predicates: A soldier who has fallen can also be said to be fallen. Many dictionaries present it as an adjective: CompactOED, WNW, RHU, Wordsmyth, Longmans DCE, Websters 1828 and 1913 (but not online). The available Cambridge Dictionaries, AHD, and MWOnline don't. I assume that Collins does not either. The senses they show for the adjective are often just the "fallen woman" and "fallen soldiers" senses. Encarta shows those as idioms. Fallen is used as a plural noun in reference to the dead, especially in memorial services and similar places where an archaic usage might be expected.
• Perhaps we should show the adjectival and noun senses as archaic or dated and show fallen woman as an idiom (dated?). That would allow us to serve readers of older literature as well as those trying to learn contemporary English. DCDuring TALK 02:16, 10 July 2009 (UTC)
I don't want to sidetrack the thread, but simply being able to be used both attributively and predicatively is insufficient evidence for adjective status. I find no examples of very fallen, so fallen, more fallen, or most fallen, and become fallen doesn't seem to work either. Again, in reference to the dead, it is not a plural noun, but a participle verb functioning as a fused head in a headless noun phrase, that is an NP with no actual noun in it. That's why you talk about the fallen, but two fallen soldiers, not two fallen.--Brett 10:30, 10 July 2009 (UTC)
Copying to Talk:fallen DCDuring TALK 13:51, 10 July 2009 (UTC)
The word interested is not more adjectival than fallen, as they may both be considered modifiers from the verbs interest and fall. I prefer to consider English past participles as simply verbs conjugated to indicate events happened before unknown time references (which become known by compound tenses, such as has eaten and was eaten). And adjectives as words that give attributes to nouns, whether they are comparable or not. These attributes may be borrowed from verbs, as fallen may be said of someone who has dropped by gravity, or lost chastity. --Daniel. 13:56, 12 July 2009 (UTC)
Modifiers and adjectives are not the same thing. Adjective is a category where modifier is a function. The attributive modifiers in noun phrases can be adjectives, nouns, verbs, and prepositions. Adjectives can be distinguished from verbs in that adjectives (but not verbs) can be graded and modified by very, too, and so. Adjectives, further, can function as complement of become where participles can't. Thus, they became very interested in the changes, but not *they became very fallen. Daniel also conflates past participles, which have a variety of uses, with perfect aspect. There is no inherent "beforeness" in a past participle.--Brett 01:03, 13 July 2009 (UTC)

Wiktionary preferences (custom.js / WT:PREFS) update

I've added support for "subpreferences". When you turn certain prefs on or off this will enable or disable their subprefs. The affected prefs are:

1. clock
3. enhanced patrolling (sysops only)

Feedback appreciated as always. — hippietrail 07:52, 10 July 2009 (UTC)

As an expression of my sincere gratitude for your efforts and this announcement, I've got a complaint/question for you. I formerly had access to special characters under the search box. In the course of changes I have lost access to them, perhaps for a month or more now. The pull-down is accessible, but ineffective. I have selected the checkbox in WT:PREFS, refreshed, cleared the cache, deleted cookies and started again. No joy.
Is there something I might be missing, some interaction among choices in WT:PREFS or something in my monobook css or js that might be interfering? Any advice on diagnostics or remedies would be appreciated. The problem existed both before and after my week-old installation of FF 3.5 over FF 3.?.
Serious thanks for your efforts. I hope the refactoring makes all manner of things easier for you and those who labor in making the working environment for wiktionary contributors productive. DCDuring TALK 13:46, 10 July 2009 (UTC)
Thanks DCDuring. I know another one of our JavaScript developers (I forget which) had been doing a lot of work on EditTools to use a lot less bandwidth.
The preference that adds special characters under the search box uses EditTools so my guess is that the changes to the format there have affected the search box code.
Which is not to say it's definitely not my WT:PREFS changes at fault but I can't think of any way it could have caused this problem. — hippietrail 23:43, 10 July 2009 (UTC)
That was me, sorry. I was unaware of that PREF. It's fixed now (albeit hackishly and fragilely; at some point we'll probably want to refactor it). —RuakhTALK 02:07, 11 July 2009 (UTC)
Thanks for fixing it Ruakh and thanks for the work on EditTools - it's been sorely in need of TLC for years. And don't worry - the search box EditTools was always hackish and fragile anyway. — hippietrail 02:18, 11 July 2009 (UTC)

"use the preload templates" still broken

The "use the preload templates" link on a page such as this http://en.wiktionary.org/w/index.php?title=flapdoodles&action=edit&redlink=1 is still broken. Can someone please fix it. -- dougher 03:36, 11 July 2009 (UTC)

Done. Conrad.Irwin 01:37, 12 July 2009 (UTC)

Thank you! -- dougher 22:20, 13 July 2009 (UTC)

Request:Accessibility

Can disabled users pl also be accommodated here?? - F.i. me, out of pain in my arms, I cannot type but words, c?... :( A distraught 史凡/ʂɚ˨˩fan˩˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 09:08, 11 July 2009 (UTC)

Realistically, probably not. What do you find difficult, maybe we can try to make that easier? Conrad.Irwin 09:53, 11 July 2009 (UTC)

like,fi,canwe use v-chat pl?--史凡/ʂɚ˨˩fan˩˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 10:08, 11 July 2009 (UTC)

Others:

• tolerance ingeneral'n4just words/abbr.ns ontalkpp.
• use/innovation ofv[oice]messages4talk pp. [do we need dogmatic thinking 'bout that??]
Using voice messages is not very practical as it requires everyone to have a microphone, using Skype might help, but then you don't get the record of what was said, which can be very important.
• voluntary emblems4disabilities n'understanding4/of what theymean
I don't know what you mean.
• more assisted editing
If you have specific ideas please suggest them - or even write the code for them, I have plans to improve the translation adder, but I have very little time at the moment. Everyone here is a volunteer, so we can't just "get things done" in the same way a real company would be able to.
• " support w/comp technology[soft/hardware]n'wikicode
W/comp == wearable computing? If so, I don't think it's realistic to expect websites to integrate support for this, though no doubt work should be done by browsers. If you are looking for an easier way to type, I believe that Dasher (download) is pretty good and can be used by people with very little motive ability.
• easier go on newcomers
See Help:Interacting with humans
• focus on dict.edits,not talkpp.form/style
Format is essential to maintaining a dictionary that can be used by everyone, it is not negotiable, see Wiktionary:Entry layout explained. We do concentrate on the dictionary more than on forums.
There are tutorials on the internet, though I agree some more local information would be nice.
• more cooperation
Heh, easy to say - we have pages for each language where interested contributors can discuss issues, but there is very little "realtime" collaboration - we tried a few times using IRC, but I found it less productive than working alone.
• judge onmerit,notappearances
Good general principle, is this relevant to anything?
• more accessibility[whether layout,rules,wording,procedures,guidelines,help pp
This is something we worry about a lot, but no-one has managed to come up with a convincing "better" layout - maybe you could design us something? I certainly feel that the definitions should be in an easier place.
In terms of words we have Wiktionary:Criteria for inclusion, in terms of editors, anyone who can contribute productively is encouraged - though those who are unable to contribute alone (i.e. need constant minding and looking after) are generally not well received as they take up more useful time than they give back.
• bot-ting of automatable(=repetitive!!) tasks,as fi certain transliteration/phonemics and hanzi variant additions
This is already done - if you have ideas for more, point them out - better still, write the code.
• no hound/block/poking fun ofusers becos'of theirdisability/ies,or en bloc rejection of their criticisms ,esp.when notyet"established users"[isthatnot ad personam?how rthey gonna prov'emselvs when given no chance??]
See Help:Interacting with humans.
Where it is productive to discuss, we discuss, where it is not leading anywhere, we ignore.
I don't think we do. We don't make life harder for any specific person, and we don't make it easier for any specific person - seems fair if a bit nasty.
• inthenear future:also provide speech recognition at ur end
No, you should be able to find support for speech recognition in your operating system - if not try Dasher. It is not practical to implement speech recognition in a website.

can no compromise4such be found pl??[myhands rmega-sore bynow,isdad sohard2understand?!-'d any editor here dare say2a walk-disabled person2walk up10flights ofstairs,evenmore so justas aroutine, every time?!then why askthe equivalent ofsuch of me?!?:(:( ]

It is very difficult to understand what you are saying. I don't think people here will discriminate against you, but they will also not go out of their way to help you.

ifmy integrity is[stil?]put inquestion,bynow ivmade sumcontributions,notwithstandingthe up2harassmentsufferd cosevmy intractableRSI;any1,ofcours,maycheck'em,incl.checkuser me,butpl, letmeknow any inconsistencies found,as irest assured therewontbe many.--史凡/ʂɚ˨˩fan˩˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 11:32, 11 July 2009 (UTC)

Checkuser is for very different reasons.

ps andyes,imbitterabout this,here'n'in general :( -史凡/ʂɚ˨˩fan˩˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 11:51, 11 July 2009 (UTC) but inthe hope ofabetter future,iremain sincerely-史凡/ʂɚ˨˩fan˩˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 16:58, 11 July 2009 (UTC)

Bitterness solves nothing, if you find things difficult work out ways to make them easier, the chances are that if they are easier for you they will make things easier for everyone. Everyone edits Wiktionary because they enjoy doing it, while it is unfair to stop someone like you from editing, it is also unfair to force everyone to spend their time looking after you when they'd prefer to be editing; so neither is likely to happen. I don't think we have a big problem with prejudice, anyone who edits productively is welcome, anyone who refuses to is not welcome - that is "correct" discrimination. Conrad.Irwin 22:39, 11 July 2009 (UTC)
Maybe transfer to user talk page - no apparent relevance here. Mglovesfun (talk) 17:13, 12 July 2009 (UTC)

Firmly disagree!

1. is'boutpolicy
2. " fundamentalissue2wikimedia projects[as isWT],notjust concernsme

irwin-tx4reply,voice exchange possible pl?-史凡/ʂɚ˨˩fan˧˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 18:51, 12 July 2009 (UTC)

Policy means nothing on a wiki unless there are people available and willing to do the work that the policy requires. A policy with no backup is (effectively) not policy. I have to agree with some of the points raised above, in particular that your concerns are worded in a very broad and vague way. For example, "more assisted editing" is not specific enough to do anything about. A code developer needs a specific goal or design parameters to take any action. --EncycloPetey 20:53, 14 July 2009 (UTC)

ican provide anydetail required perskyp/voice medium[ihav never usedIRC,but,comp(uter)permiting,im willin2learn/try!]--posible pl?--史凡/ʂɚ˨˩fan˧˥/shi3fan2 (歡迎光臨/Welcome! 請也用/Please also use skype: sven0921為我/since I suffer RSI!) 23:36, 14 July 2009 (UTC)

Script and language magic

I've been carefully testing and will be phasing in some magic within templates such as `{{t}}` and `{{term}}` to improve the handling of script and language tags. Most of the gory details belong in a GP discussion if needed, but I'll explain here what you'll see.

1. For a small set (20) of common languages that have non-Latin scripts, and have only one script or a useful default, you will no longer have to use the `sc=` parameter in most cases. So you won't have to write things like `lang=hy|sc=Armn`, the language parameter will be sufficient.
2. The language specific font selection in your browser (e.g. Firefox) will become more effective. You may immediately see different fonts for some because it has enabled the default(s) for languages.

See `{{Xyzy}}`.
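As a rough illustration of the behaviour described in point 1, here is a minimal Python sketch of a language-to-default-script lookup. It is not the actual template logic: the function name `resolve_script` and the table `DEFAULT_SCRIPT` are hypothetical, only the `hy`/`Armn` pair comes from the discussion, the Hebrew entry is an assumed example, and the fallback to `Latn` is an assumption based on the templates previously defaulting to `sc=Latn`.

```python
# Hypothetical table of languages with a single usable default script.
# Only hy -> Armn is taken from the text; he -> Hebr is an assumed example.
DEFAULT_SCRIPT = {
    "hy": "Armn",
    "he": "Hebr",
}


def resolve_script(lang, sc=None):
    """Return the script code to use for a term in `lang`.

    An explicit `sc` always wins; otherwise the per-language default is used,
    falling back to Latin script (an assumption) when no default is known.
    """
    if sc:
        return sc
    return DEFAULT_SCRIPT.get(lang, "Latn")


if __name__ == "__main__":
    print(resolve_script("hy"))           # default found in the table
    print(resolve_script("xcl", "Armn"))  # explicit sc= still honoured
    print(resolve_script("en"))           # no default, falls back to Latn
```

Languages that can be written in more than one script would simply be left out of such a table, so editors would still supply `sc=` for them.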

Not all the effects will be positive at first, there will almost certainly be at least a few font selection oddities, simply because of the large complexity (hundreds of languages, dozens of scripts, many hundreds of font sets). Please tell me, add something here, or, probably better, in WT:GP.

I was planning on doing this now, but was slowed a lot by the server problems earlier today, so later tonight perhaps. (Going to watch WI-BAN Test now ;-)

If something grotesque happens and I am not on top of it, you can disable it by replacing Template:Xyzy with a redirect to Template:Latn. I don't expect this will happen. Robert Ullmann 17:29, 11 July 2009 (UTC)

Nice! This works. I don't really understand how it's functioning but why does the number of languages have to be small? Can you add the pairs `lang=xcl|sc=Armn` and `lang=ka|sc=Geor` to the list? Also, such functionality will be very useful in `{{l}}` and `{{infl}}` as well. --Vahagn Petrosyan 18:48, 11 July 2009 (UTC)
It didn't work when I tried it with a Hebrew entry earlier. Maybe I pushed the button wrong. — [ R·I·C ] opiaterein — 22:23, 11 July 2009 (UTC)
That's because he hasn't actually changed `{{t}}` and `{{term}}` (and so on) yet; they still default to sc=Latn. What it will look like, once the default is sc=Xyzy, is עִבְרִית(ivrít, Hebrew). Which works perfectly. :-)   —RuakhTALK 23:16, 11 July 2009 (UTC)
As I noted, it hasn't been put in place yet (watching the test match ;-), will try to set that now and see.
It turns out that the number of languages to be effective is not that large. Think of it like this: start with 7000+ languages, about the target. Take out the ones that are Latin script; then the ones (e.g. Serbian) where you can't assume script, then a number that are "small" enough not to be worth adding code to lots of entries (Cherokee, albeit otherwise worthy :-) and you end with a small number; a usefully small number ... which are the ones implemented and doco'd. There probably will be a couple of additions. Robert Ullmann 23:24, 11 July 2009 (UTC)
Is in place for t, term, infl. I'll watch it for a while. If there is a problem, hack as above, or send SMS +254 722 929 463 ... thanks. Robert Ullmann 00:04, 12 July 2009 (UTC)
There are some more to be defaulted at the currently unused template `{{lang2sc}}`, which was unfinished but was meant to do exactly the same thing. Also, it would be good if the language-script switch could be made more user-friendly, and alphabetized as {lang2sc} does, and the complete list of defaulted scripts listed at the documentation talkpage so that the users don't have to look it up in wikicode, to see what is supported and where. And also, as Vahagn mentioned, added to {l} and {infl}. --Ivan Štambuk 22:29, 11 July 2009 (UTC)
[redacted] Robert Ullmann 00:04, 12 July 2009 (UTC)

It seems to be working quite well; only one or two glitches found so far. I've added it to `{{l}}`. Adding languages must be considered very carefully, as the tradeoff is a fairly extreme ratio. I'd suggest that ideas be added to the talk page `{{Xyzy}}` and we'll look at them after a while, when we're sure it is all stable. Robert Ullmann 11:09, 13 July 2009 (UTC)

Just a big thanks to Robert for this – it’s a huge help for some languages like Japanese (without this or crufty typing, entries are typeset in Chinese fonts, which are markedly different)!
—Nils von Barth (nbarth) (talk) 16:28, 31 July 2009 (UTC)

Category:In London

Does this meet CFI? It doesn't seem to have any "lexical content", only geographical. Mglovesfun (talk) 13:49, 12 July 2009 (UTC)

It would be better to use just Category:London for these. However, the associated "context" template may not be appropriate to keep. --EncycloPetey 13:59, 12 July 2009 (UTC)
Is there any geographic location that wouldn't be eligible for a category? Are we not going to need to establish some criteria for inclusion in categories, especially visible ones? DCDuring TALK 16:34, 12 July 2009 (UTC)
I was gonna say that, why not Category:Paris or even Category:ja:Paris (etc). The slope is a bit slippery for me, I say get rid of them. Mglovesfun (talk) 17:06, 12 July 2009 (UTC)
If there is a large enough number of legitimate terms meeting WT:CFI, then why would a category not be appropriate for grouping them? The words in the category have lexical merit. --EncycloPetey 17:10, 12 July 2009 (UTC)
I dread category clutter creep. We have categories that have to do with language, dialect, register, grammar, context, and similar and our maintenance categories most of which are vital or at least useful for the functioning of a dictionary. Subject matter categories that do not have that kind of justification are the ones that concern me.
I suppose that my concern is that categories are as hard to correct as headers, but much easier to create. If categories had some kind of description of the criteria by which items were assigned to the category I might feel a bit better. I have a feeling that tools to facilitate assigning and removing items from categories might be useful. Also, we do not at present have any active system for challenging the validity of membership in categories. DCDuring TALK 19:45, 12 July 2009 (UTC)

Where is the Wiktionary going?

Just a general chat about where we are going. The Wiktionary is pretty young yet also pretty huge. I'm often amazed at some simple entries that are missing - for example caring isn't listed as an adjective, but we have uncaring! I suppose the Wiktionary has amazing potential, I sometimes worry there isn't enough multilingual coordination. It took me ages to learn to edit here, as opposed to fr.wikt, and I've tried doing some editing in Spanish and that's different from the other two! I think people who do have serious competences in more than one language are penalised - or at least not helped by the often very different formats that the Wiktionaries use. Mglovesfun (talk) 10:43, 13 July 2009 (UTC)

Well you can turn most of the English present and past participles into adjectives, but that's a mountain of work still left (every originally verbal sense has to be turned into an adjectival one, and lots of folks would want to evade that and simply redirect to the main verb..). As for the experiences on foreign Wiktionaries, mine were exactly the opposite: the approach of hierarchical sections prescribed by the WT:ELE seems to me much, much more intuitive and easier to memorize than the approach of e.g. de/fr/ru wiktionary where you have a mountain of weird unreadable templates that generate sections. --Ivan Štambuk 13:41, 13 July 2009 (UTC)
I CANNOT stand the stupid header templates at fr:! If it weren't for them I'd be editing there much more often! Circeus 14:18, 14 July 2009 (UTC)
I agree, this Wikt is a lot simpler. Mglovesfun (talk) 19:07, 14 July 2009 (UTC)
Both are much too complex, in my opinion. Lmaltier 19:13, 14 July 2009 (UTC)
Agreed. MediaWiki software is not suited for writing a dictionary. At every step I see things that can be done automatically that we now do by hand. This is so annoying I sometimes think of quitting Wiktionary. --Vahagn Petrosyan 19:21, 14 July 2009 (UTC)
Templates are also difficult for newbies; I think more users would make "good edits" if things were a lot simpler here, and for the other Wiktionaries too. Mglovesfun (talk) 19:25, 14 July 2009 (UTC)
Which brings us to our greatest problem: the confusing layout and the difficulty of finding the definitions themselves in the sea of Alternative Spellings, Synonyms, Derived Terms, and Wikipedia link boxes drive newbies and possible contributors away. As a result there are too few editors, and we're all extremely thinly spread. I wish we had 10-20 times more active contributors. --Vahagn Petrosyan 19:48, 14 July 2009 (UTC)
One possible way of driving more quality contributors (and regular users) to Wiktionary would be to go to Wikipedia and add a wikilink to Wiktionary (via [[wikt:<entry>#<language>|<entry>]]) to all of the mentioned words in the articles, esp. those discussed within the articles on language grammars, or in sections pertaining to etymologies. Esp. the onomastics, as most wikipedia articles on toponyms, names of deities, religious or philosophical concepts and similar stuff also have in their first line the term listed in several languages (in parentheses). If the users clicked on those and found out that these wikilinks link to some awesomely worked out entries, they'd certainly stay to poke around for some more. According to Alexa [1], 17% of Wiktionary (of all the wiktionaries, but that percentage is unlikely to be significantly different for en.wiktionary which is ~50% of all wiktionary.org traffic) users come from Wikipedia, so IMHO that is one very important aspect that is unfortunately largely neglected, and that figure has the potential to grow at least twice in number.
The apparently low ratio of active quality contributors to the number of users is conditioned by the inherent complexity of editing - the knowledge bar for "quality edits" is much higher here than for Wikipedia, plus contributing here is significantly less fun and sociable, so unless people are highly motivated they're likely to get bored very soon. Hence, Wiktionary's primary driving force is a small number of very dedicated individuals with some perverted love for words (or whatever), who discover the project basically randomly and decide to spend many a boring hour on it.
As for the layout - we simply must stick with what we have for now. Insisting on the strict format of entries (of ELE, and enforced via AutoFormat and other scripts) opens a possibility of relocating the project to some non-MW-powered website in the future, one that would provide some metalanguage that doesn't have utterly disgusting syntax and a brain-damaged model of evaluation (templates are first completely expanded and then processed), and that is able to provide stone-age features such as "locate or extract substring", which would reduce the number of templates we currently use by at least an order of magnitude. And the possibility e.g. of search within individual languages (and not entry names), search of definition lines, or e.g. simultaneously adding a translation to an English meaning and generating a target-language stub entry... But anyway, I don't think that either a migration, or adding any kind of important dictionary-related ability to MediaWiki is likely to come anytime soon, so we're stuck with the least bad of all possible approaches (I get chills at the mere thought of creating entries starting with {{-sh-}} {{-noun-}}..). --Ivan Štambuk 20:41, 14 July 2009 (UTC)
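The Wikipedia-to-Wiktionary link form suggested above, `[[wikt:<entry>#<language>|<entry>]]`, can be illustrated with a minimal sketch. The function name `wikt_link` is hypothetical and not part of any existing bot or tool; it only assembles the wikitext string and does not check that the entry or language section exists.

```python
def wikt_link(entry: str, language: str) -> str:
    """Return wikitext that links a mentioned word to a language section
    of its Wiktionary entry, displaying just the word itself.

    Hypothetical helper for illustration only.
    """
    return f"[[wikt:{entry}#{language}|{entry}]]"


if __name__ == "__main__":
    print(wikt_link("piramida", "Serbo-Croatian"))
```

Called with the entry piramida and the section name Serbo-Croatian, this yields `[[wikt:piramida#Serbo-Croatian|piramida]]`.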
The problem with wiktionaries (all of them) is that data and presentation are mixed. So having different views of the data, or changing just the data or just the presentation is difficult, or even impossible. And it makes editing quite hard, tedious even. I think the way of the future is omegawiki (or something like it). But it's not ready yet. Beru7 20:48, 14 July 2009 (UTC)

editor.js, which currently has two modules - one to add translations, and one to balance them - can be easily extended with more functionality. I have lost much of the time I was hoping to dedicate to it to the real world, but anyone interested in adding features is welcome to (wouldn't it be nice to partially accelerate the creation of the foreign language entry from the translation/edit the translations that are there/add new translation tables/add synonyms/merge synonyms on all synonym pages in one go/pop up a java window to record words/add example sentences/<insert whatever you like here>?). I see this as an interim solution, and have not given up on trying to work out ways to embed Wiktionary into some kind of structure - my plan is to try and do it phase by phase, initially just store the sections of the page as separate blobs, then split out more and more fine-grained information while presenting it to the user as something like wikitext (for old-timers) or something like editor.js (for the future). This is again something we are never likely to accomplish, though I've had talks with Hippietrail, Leftmostcat and others on IRC about how it might be possible to make a start. The major problem is the transition: if we can push all the data across while maintaining its editability, we're sorted. One thing I certainly do not wish for at the moment is more contributors: the more people we have, the less control we can maintain over the structure. We need to get this sorted as soon as possible; it gets harder all the time. Conrad.Irwin 00:32, 15 July 2009 (UTC)

From SC vote discussion

(see Wiktionary talk:Votes/pl-2009-06/Unified Serbo-Croatian, no comment) Robert Ullmann 12:53, 13 July 2009 (UTC)

Robert Ullmann, in my very very personal opinion, is an insufferable, whiny twat whose continued existence is, in his words, a crime against humanity. He should stick to his bots that create stubs and his ELE-mania. — [ R·I·C ] opiaterein — 22:33, 12 July 2009 (UTC)
After this vote ends, it's my firm hope that we can pretend it never happened. There have been a lot of offensive, needlessly personal comments on both sides. (This is a big part of why polls are evil.) That one is a particularly odd example, though, in that it doesn't even seem to relate to the topic of discussion! RuakhTALK 13:05, 13 July 2009 (UTC)
Yeah, me too. Please folks, all be maximally civil from now on. There's no point in promoting further antagonism by excerpting individual offensive comments and/or personal attacks to the main discussion boards. As for Opiaterein's comment above, I think that it's intended to be a (well, very low type of) cynicism on some of Robert's arguments and the style of discussion, and not a real argument on the topic per se. I don't know whether that is all that obvious to everyone, but people have been reportedly increasingly losing their sense of humor here, so literal explication such as this might be required. We should seriously consider adding some CSS classes that would enable our Sheldons to customize the coloring of such text in their monobook.css --Ivan Štambuk 13:30, 13 July 2009 (UTC)
• Bad behavior is not trivial. Abusiveness and personal insults are beyond the pale. I have also seen no effort to actually attempt to unring the bell.
• Not to mention actions taken assuming that a vote will pass while the vote is in process. DCDuring TALK 14:12, 13 July 2009 (UTC)
DCDURING, I've stopped the merger-and-expansion (what you called "mass deletion") process of the separate B/C/S entries at Robert's request. Until the vote ends, I'll simply be creating the missing SC entries and focus on other stuff. Plus, you have to admit that the chance that this vote will ultimately fail is pretty low now... (plus Bogorom and Doremitzwr still haven't voted, as they seem to be absent, and both of them are 100% pro as we all very well know). --Ivan Štambuk 14:18, 13 July 2009 (UTC)
Why stop now? You've gotten tacit approval from the majority. Rules obviously are not terribly important when Truth is on your side. Eventually UN, ISO, et al will look upon our reform, learn from it, burn the dictionaries for Serbian, Croatian, Bosnian et al, and lead the benighted countries and peoples of the region into a world of peace, prosperity, and linguistic unity. And linguistic scholars will have played a vital role. How could I have been so selfish as to deny en.wikt its role in this drama? DCDuring TALK 14:45, 13 July 2009 (UTC)
Please cut the crap DCDURING. I stopped my "mass deletions" as a sign of good faith, which is apparently unappreciated. If this vote fails, I hereby authorize you to copy/paste the missing ==Serbo-Croatian== sections I create from now on into the 3 identical ==Bosnian==, ==Croatian== and ==Serbian== sections. I have no doubt that you'll enjoy it. --Ivan Štambuk 14:55, 13 July 2009 (UTC)

On standard languages vs "Serbo-Croatian"

Besides all of the "discussion" on the "Serbo-Croatian" vote, there is a serious problem: it won't work as proposed.

We'd like to think that we should make decisions on some kind of linguistic basis, rather than "outside" standards (DAVilla noted this in his vote IIRC), but we are part of the world, not a separate playground (;-). What we are doing exists to be re-usable in the rest of the world, correct?

The international standards community, ISO, SIL, various national standards bodies and libraries, decided in 2000 to use the standard codes for the 3 (to be 4, with Montenegrin) languages, and explicitly delete "sh" from the code list. The uniform decision is to use the standard languages, and reject "Serbo-Croatian". The reasons for this were varied (political and practical), but the conclusions of the parties were the same.

The objection here is that that is duplicative in our structure. "Fixing" it by using a non-standard, explicitly rejected, language is not a reasonable solution. It saves some duplication at the expense of being incompatible with everything else (including other parts of the WMF projects).

In case it was not clear: the proposed "solution" is about explicitly and entirely prohibiting the standard languages and requiring use of a non-standard language.

Here's just one example of how dropping the standard languages and using a non-standard language gets us into operational trouble, since we have to interact with a standards compliant world: (from the entry at July, in part)

which expands to: (sorry, this scrolls way off the right, so I'm doing a hide trick so the rest of the page isn't wide, just show it)

See all those `lang="bs"` and such? That's us telling the browsers which standard language the bits of text are in.

Now if we use the non-standard "sh", there is no magical way that templates (or any layer of software) can divine which of the valid codes and languages to use. So we are now not standards compliant. Which is not acceptable. And this is just one example.

Re-users of our content will run into exactly the same class of intractable problems, with no solution available.

And note that the only thing the proposal "solves" is the duplication of content. It isn't some vastly new useful feature.

I would suggest that anyone who has voted to support this proposal re-consider and change their vote to oppose.

We can then drop it and see if we can come up with something workable. Possibly just some simple things to avoid as much duplication as possible. Otherwise we are in more endless pain, until the entries all get done over again, with the languages separated (which, note, will be a laborious manual process, as all the various differences must be carefully transferred to the correct language sections, having been lost in the merger). The extreme pressure to force this through without careful consideration and thought is a prima facie indicator that it is a mistake. Robert Ullmann 14:51, 13 July 2009 (UTC)

Thanks. I understand better what you were meaning. Let's add that these lang parameters are very important in some cases: e.g. voice browsers and spell checkers must be able to know the language of the word. Lmaltier 15:03, 13 July 2009 (UTC)
Yeah I'm sure your browser would self-destruct if it encountered `lang="sh"`. --Ivan Štambuk 15:10, 13 July 2009 (UTC)
Also, spell checkers are not an issue in reading the translation tables, and heuristic voice generation that would work for either of the "languages" would also work for any of the other 3 "languages" (see below). --Ivan Štambuk 16:35, 13 July 2009 (UTC)
Robert Ullmann continues to spread the same inflammatory FUD in the wrong places at the wrong time. Why exactly didn't Robert Ullmann respond to several comments on the possible technical difficulties pertinent to the proposal that were put to him more than 4 months ago? Why is Robert Ullmann mentioning this here and not in the places where he was asked, if not to scare uninvolved contributors into voting against the proposal now, when it is by all means certain that it will ultimately succeed? Why do these "problems" not appear with other non-ISO-corresponding codes Wiktionary has been using for various L2s for a long time? Finally, why is Robert Ullmann acting as if he is the Director of the Universe, that he can say what is negotiable or not, or what is an intractable problem and what a mere "trivial duplication"? --Ivan Štambuk 15:04, 13 July 2009 (UTC)
• Oh please can we not turn this into another argument, people. Let's say the following though: we shouldn't set policy according to technical constraints. If the community wants to do it this way we will have to find a way to make it work with the technical skills available to us. Ƿidsiþ 15:21, 13 July 2009 (UTC)
• I agree: policy comes first, templates follow: we can use `* Serbo-Croatian: [[foo]]` instead of `{{t}}`. But more: Ivan's right: the XHTML can use `lang="sh"` (AFAICT).​—msh210 15:33, 13 July 2009 (UTC)
Not sure it's related, but Firefox, in its "preferred languages" menu, proposes Bosnian, Croatian, Serbian, not Serbo-Croatian. Internet Explorer proposes 2 Bosnian options, 2 Croatian options, 4 Serbian options, but no Serbo-Croatian. The Internet standard requires the use of 2-letter ISO-639 codes when possible, or 3-letter codes if no 2-letter code exists for the language. Lmaltier 15:55, 13 July 2009 (UTC)
• I agree with msh210. By my reading of BCP 47 (which defines how language tags work) and the IANA language subtag registry, sh is in fact a valid (albeit deprecated) language subtag of the "language" type. Note that IANA started the subtag registry five years after the code was deprecated, so it seems like they're fairly explicitly allowing (albeit deprecating) it. —RuakhTALK 16:13, 13 July 2009 (UTC)
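The distinction drawn above between an invalid subtag and a registered but deprecated one can be sketched as a lookup. The table below is a hand-written excerpt that only imitates the shape of the IANA language subtag registry; the `deprecated` flags simply restate the reading given in this thread and are assumptions to be checked against the live registry. The names `REGISTRY_EXCERPT` and `subtag_status` are hypothetical.

```python
# Hand-written excerpt, NOT the real registry file. The flags restate the
# claims made in this discussion and should be verified before reuse.
REGISTRY_EXCERPT = {
    "bs": {"description": "Bosnian", "deprecated": False},
    "hr": {"description": "Croatian", "deprecated": False},
    "sr": {"description": "Serbian", "deprecated": False},
    "sh": {"description": "Serbo-Croatian", "deprecated": True},
}


def subtag_status(subtag: str) -> str:
    """Classify a primary language subtag against the excerpt above."""
    record = REGISTRY_EXCERPT.get(subtag.lower())
    if record is None:
        # Absent from this small excerpt; says nothing about the real registry.
        return "not in excerpt"
    return "deprecated" if record["deprecated"] else "current"


if __name__ == "__main__":
    for code in ("bs", "hr", "sr", "sh"):
        print(code, subtag_status(code))
```

Under these assumptions, `sh` comes back as "deprecated" rather than "not in excerpt", which is the difference between "legal but discouraged" and "illegal" that the thread is arguing over.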
If the policy does not conform to standards, it is not possible for the templates, etc, to "fix" it. (You can't, in general, apply technical fixes to political problems ;-). And `lang="sh"` is illegal, as the code is not valid. (sure, the browser doesn't "break" when you send it junk; that isn't the point; the point is to be compliant so that it works correctly) Robert Ullmann 16:15, 13 July 2009 (UTC)
And what exactly wouldn't work correctly, excepting some marginal cases such as voice-generation (spellcheckers are not an issue here) ? How about generating simply sr as a workaround, as Serbian is written in both scripts (Cyrillic/Roman) and standardized in both of the varieties (Ijekavian/Ekavian), so it would be a very close match, applicable in most of the scenarios? --Ivan Štambuk 16:23, 13 July 2009 (UTC)
Also for voice-synthesizers: standard Bosnian, Croatian and Serbian have identical phonology (exactly the same phonemic inventory, prosody and accentuation the same in 99% of words) - and if this hypothetical voice-generation software would not pronounce words on the basis of tens of thousands of pre-recorded sound files (which would be absurd as Serbo-Croatian is very inflected), but instead use some heuristic algorithms (Serbo-Croatian has phonological orthography so the pronunciation is completely predictable from the spelling), it would also be absolutely no problem at all to use the sr code instead. --Ivan Štambuk 16:31, 13 July 2009 (UTC)
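The workaround floated above (keep the section called Serbo-Croatian but emit `sr` in the markup) could look roughly like the following sketch. It does not reproduce how any existing Wiktionary template works; `FALLBACK_LANG` and `tagged_span` are hypothetical names, and the single `sh` to `sr` mapping is just the suggestion from this thread.

```python
from html import escape

# Hypothetical mapping: codes whose markup value is swapped for another code.
FALLBACK_LANG = {"sh": "sr"}


def tagged_span(text: str, code: str) -> str:
    """Wrap text in a span whose lang attribute uses the fallback code, if any."""
    lang = FALLBACK_LANG.get(code, code)
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


if __name__ == "__main__":
    print(tagged_span("piramida", "sh"))
    print(tagged_span("piramida", "hr"))
```

A word filed under `sh` would then reach the browser as `<span lang="sr">piramida</span>`, while words filed under other codes keep their own code.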
Robert, please provide evidence for your claim that this is illegal — and please explain why it supersedes the evidence that I've provided that it is legal, merely deprecated. (Like how the <u> element is legal, but deprecated, in XHTML 1.0 Transitional.) —RuakhTALK 16:50, 13 July 2009 (UTC)
It's not evidence, but I assume that each browser uses a single list of languages, and what I tried (see above) suggests that modern browsers do not recognize this deprecated code any more (the code was probably kept for a transitory use only). But I have not tried all browsers, try it on your own browser. Lmaltier 17:08, 13 July 2009 (UTC)
Did you notice any kind of difference when switching among Bosnian/Croatian/Serbian (or any of their suboptions) in the translation tables or the entry headwords? I mean, what particular difference does this template-generated language markup make at all to most (99%) of our users? We should really simply stick to some of the less problematic codes like sr and get this drama over with.. --Ivan Štambuk 17:14, 13 July 2009 (UTC)
To be honest, I'm not sure exactly of the significance of what you tried (I believe that has to do with Accept-Language headers and such); but it's true that Firefox 3 (at least my version of it) recognizes lang="sr" but not lang="sh" (as I determined experimentally, by creating a file
```
<html lang="sr"></html>
```
, opening it in Firefox, right-clicking, and choosing "Properties" — I get "Text language: Serbian" — whereas when I change sr to sh, I get "Text language: sh"). But this seems rather abstract; if the specs deem it valid (which I believe they do), and it doesn't cause any actual problems (which it may, but I'd like to know that for sure), then I really don't care. Note, BTW, that this whole line of reasoning does not directly suggest that we should allow ==Bosnian== etc.; rather, it suggests that we should ban ==Serbo-Croatian==, from which we can infer that we must use ==Bosnian== etc. instead. Given what you've said about NPOV, I somehow don't think you'd really support that? —RuakhTALK 17:38, 13 July 2009 (UTC)
No, I don't propose to ban Serbo-Croatian. One reason is that there are Wikimedia Serbo-Croatian projects. Another reason is that it's better to keep contributors than to lose them (note that both reasons also apply to separate languages). A third reason is that it might be a good place to get an overall view of what is common and what is different between Croatian, Bosnian and Serbian (which is not yet the case: currently, users have to guess that the same word is used in all 3 varieties, because separate sections have been removed, e.g. in piramida). Better something which is not standard than nothing. But introducing technical problems by prohibiting codes recognized by browsers and allowing only a code they don't recognize is not a good idea. Lmaltier 18:19, 13 July 2009 (UTC)
As far as the 3rd reason is concerned: Lmaltier, I've been exclusively merging words so far that are shared among the varieties, and piramida means "pyramid" in "Bosnian", "Croatian" and "Serbian" (also will in "Montenegrin" when they invent it this fall ;). Ijekavian/Ekavian varieties are self-explanatory, and those few that are different but were merged were elaborated on in the ====Usage notes==== sections, or by means of context labels - all in the manner outlined in the merger proposal. For example, confer the usage notes section of the Serbo-Croatian entries for nȍgomēt and fȕdbāl. Really nice, wouldn't you think? :) --Ivan Štambuk 19:01, 13 July 2009 (UTC)
You know that. But a user is not supposed to know it: does the absence of a note about differences mean that the word is common to all 3 varieties, or that this note was forgotten, or that the entry was created by someone not aware of the differences? The user cannot know for sure. Before the removal of sections, he could have known for sure. Lmaltier 21:06, 13 July 2009 (UTC)
Yeah, as a native speaker, of course I'd know that. But now, had you actually paid attention to the proposed WT:ASH policy, you'd know that the logic of the merger would be very simple: it's common (shared) unless otherwise specified. I and the other Serbo-Croatian contributors that have been merging the entries for the past 4 months have striven to strictly abide by the proposed policy. Of course I cannot 100% guarantee that there are no usage notes missing somewhere (in fact, I'm pretty sure that there are!), but that's something that would have to be fixed along the way. After all, this is only a volunteer effort on a wiki basis. When this vote closes and the policy eventually becomes formal, we'll probably have another Serbo-Croatian contributor, but of Serbian origin, who will make sure that in those rare cases where something is Croatian- or Bosnian-only, but not appropriately marked as such by me, the ====Usage notes==== section explains that in literary Serbian some other equivalent term is used. Similarly to how we already have the differences between British and American English solved in the usage notes - very often by the "discovery" method of English speakers familiar only with one of the aforementioned Englishes. --Ivan Štambuk 21:19, 13 July 2009 (UTC)
Also, aren't the settings that Lmaltier described something pertaining to the initiation of the HTTP request, when the server decides what "version" of the page to send to the client? Here on English Wiktionary they'd always be getting the English version of the page (which is the only one), and the example which Robert listed for the translation table pertains to some local spans every Wiktionary user would get. Now, what interests me is what use this type of tagging is, where you could get several hundred words one after another, each tagged as belonging to a different language.. what difference does it make with respect to the browser if some are e.g. not tagged at all (suppose we drop tagging with sh completely), or tagged "wrongly" (i.e. with sr as I propose)? Can someone explain it to me in plain English please? :) --Ivan Štambuk 18:06, 13 July 2009 (UTC)
Users can have client-side CSS (or even monobook.css) which specifies a font, or anything else, for stuff with a certain `lang` attribute. Also, perhaps tagged wrongly is not a problem for this, but untagged will read wrong in screen readers, as someone's mentioned already.​—msh210 18:12, 13 July 2009 (UTC)
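The point about user stylesheets keyed on the `lang` attribute can be made concrete with a simplified model of how a CSS language selector is matched: a rule written for `sr` applies to text tagged `sr` or `sr-…`, and so does nothing for text tagged `sh`. This is a sketch of the prefix-matching idea only, not a CSS engine; `USER_RULES`, `lang_matches` and `applied_rules` are hypothetical names, and the style values are made up.

```python
# Hypothetical user stylesheet: one rule per language range.
USER_RULES = {
    "bs": "font-family: serif",
    "hr": "font-family: serif",
    "sr": "font-family: serif",
}


def lang_matches(selector_range: str, element_lang: str) -> bool:
    """Simplified language matching: equal, or a prefix followed by a hyphen."""
    wanted = selector_range.lower()
    actual = element_lang.lower()
    return actual == wanted or actual.startswith(wanted + "-")


def applied_rules(element_lang: str) -> list[str]:
    """Return the language ranges of the user rules that apply to an element."""
    return [rng for rng in USER_RULES if lang_matches(rng, element_lang)]


if __name__ == "__main__":
    print(applied_rules("sr"))  # the user's sr rule applies
    print(applied_rules("sh"))  # no bs/hr/sr rule applies
```

In this model a reader who has styled only bs/hr/sr gets no custom styling for `sh`-tagged text until they add a rule for `sh`, which is the scope of the breakage being debated here.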
So let me get this straight: this is relevant only to the users that use CSS customization in some obscure browser configuration files of a specific language, specifically in this case for those that use it for bs/hr/sr, where it wouldn't work when browsers would be fed with sh ? --Ivan Štambuk 19:06, 13 July 2009 (UTC)
Since in your edit summary you indicated that that's a question for me, I'll answer. I never said "only": I gave an example. I don't know. But you, Ivan, of all people, who are railing against referring to a word as Serbian, should not want to tag it as such (viz, by means of `lang="sr"`)! Nor do I (though certainly I feel less strongly about it than I'd think you would): assuming we call the word Serbo-Croatian, we should tag it as such if possible.​—msh210 19:43, 13 July 2009 (UTC)
Well, if there are some other categories of users that this change would affect, I would sure hope to hear about them. If it only affects a relatively insignificant minority that engages in CSS hacking of ISO codes for bs/hr/sr, then this really is not that much of an issue altogether, or at least not as much as was announced by the OP.
Also: these are two very different things, referring to the actual language (one that it is) and a workaround for a technical glitch. That "Serbian" code would be invisible to > 99% of Wiktionary users (those who don't hack CSS to customize it for bs/hr/sr), and when we're discussing the technical minority that would engage in such perverse acts: I'm sure that they'd have sympathy for our effort :) Perhaps we could even rotate the generation of bs/hr/sr codes on a monthly basis to be completely "NPOV".. Also, WT:ASH should contain a big note to CSS hackers that they have to pay attention to such workaround.
Also, another thing: suppose that there exists such a Wiktionary user that would customize via CSS the display of bs/hr/sr words: I seriously doubt that he would be 1) using different options for each one of them 2) customizing only for one of them. It's more likely that he'd customize for all of them together, the same fonts and layout.. So defaulting to one of them wouldn't be much of a loss. sr would be chosen simply because it's the most "generic" (both scripts, and both jat reflexes) --Ivan Štambuk 19:56, 13 July 2009 (UTC)
What Ruakh tried is more relevant, yes. A browser may use the language for several purposes (spelling warning, voice (this will probably develop), choice of the preferred font...). Of course, if it does not recognize the language, the word will be displayed, but possibly not as the user would like it. Lmaltier 18:19, 13 July 2009 (UTC)
As I said, spellcheckers are inapplicable since this would be for display only, and not editing. A voice synthesizer for "Serbian" would work exactly the same way for "Croatian" and "Bosnian" - or at least as well as e.g. a voice synthesizer for British English would work for American English: synthesizers are not good enough to reproduce such trivial differences in pronunciation. Serbo-Croatian is usually displayed in standard Roman and Cyrillic fonts which are properly installed on basically 99.999% of Web-aware computers, so I don't think that there is plenty of potential for experimentation with that. Also, our own Cyrillic-script templates `{{Cyrl}}` and `{{Cyrs}}` are already preconfigured to display stuff such as combining diacritics on Cyrillic letters as properly as possible. With all the benefits for the end-user that the acceptance of this proposal would bring, I think that the loss of language-code customization for bs/hr/sr is hardly something of such paramount importance as you guys are trying to depict it. --Ivan Štambuk 20:07, 13 July 2009 (UTC)
Spellcheckers are relevant (my browser checks spelling, and all editors can enter something wrong inadvertently, even in the translation table). And you should not forget blind users. They will be able to use Internet more and more (this is one of the objectives of Internet standards). Lmaltier 20:18, 13 July 2009 (UTC)
Wait, your browser checks the spelling of Web pages you visit? Mine only checks spelling in text fields, text-areas, rich-text editors, etc. — that is, where I can actually do something about it. Yours must be a source of great frustration. :-P   —RuakhTALK 20:28, 13 July 2009 (UTC)
Can you please point me to such a spellchecker that checks the spelling of the entire webpage (i.e. not only edit fields, as most browser-integrated spellcheckers do), and is moreover customized by the lang= parameter in HTML (normally spellcheckers are configured to only one language, and not mixed type-of, as the Wiktionary translation tables are) ? I'd really be surprised that such a beast existed :)
As for the blind users: yeah I agree. But that's why I'm advocating that we should use sr if sh is so much of a trouble - I am pretty much sure that a voice synthesizer for standard Serbian would work just as well for standard Bosnian and Croatian. All three, after all, have (in some 99% of cases) the same phonology and pronunciation rules. --Ivan Štambuk 20:40, 13 July 2009 (UTC)
My browser checks only what I enter. You are right, it's not very relevant in my case. Nonetheless, spellchecking is one of the objectives of the lang parameter (e.g. if a browser wants to provide a button to check the spelling in the page displayed). Lmaltier 20:51, 13 July 2009 (UTC)
Yeah but spellcheckers are generally customized on the basis of one global preference for the language the user is interested in, and not on the basis of what the lang= parameter in the HTML is.. I mean.. after all, who'd want to spellcheck a dictionary entry ?! Esp. the ones created by such an exceptional connoisseur as myself? :D (and I regularly create entries validated against a dictionary.) Sorry but what you are describing is completely inapplicable to Wiktionary. Perhaps to Wikisource, Wikipedia or something where prose is being written, but not here. --Ivan Štambuk 21:05, 13 July 2009 (UTC)
Connoisseur-ջան, you need to add also Serbo-Croatian translations in English articles besides creating your exceptional entries, otherwise we can't find them! I know it's the least fun part, but it has to be done. The same goes for other languages: please link your FL entries in translations (especially you, Persian contributors). --Vahagn Petrosyan 21:25, 13 July 2009 (UTC)
I think I'll simply wait 17 more days before doing just that.. I don't want to be accused again and again of causing "severe damage" :) Especially the Serbo-Croatian translations on entries such as July, which Ullmann cites above, and which are simply plainly wrong (juli is also Croatian, and srpanj is not valid Serbian). --Ivan Štambuk 21:35, 13 July 2009 (UTC)

unindent If this is to be argued in terms of standards, we should follow the ISO recommendations. Not following standards is totally fine for "now", it won't make anyone's browsing experience worse, it won't make it harder or easier to add entries/translations. The problem with not following standards is always at a much later stage. For example, see Internet Explorer 6, it didn't need to stick to standards because it was clearly the best browser of its day, yet when several years later a few different browsers emerged that could all render the same content in the same way, and IE6 couldn't - because it wasn't following the standards - everyone began to hate it. I doubt anything so severe will happen to Wiktionary, but the principle still holds. If the ISO standard is wrong, it is the standard that should be fixed, absolutely nothing is gained by ignoring it, and a considerable amount of difficulty is added to the job of future Wiktionarians when they have to clean up the mess we make. I don't think it really matters in any way whether we treat the languages as one or three, sure it's a bit less convenient or more smelly, or outrageously insulting - there is no technical issue that limits either proposal, the sole difference is that one is standardised, and therefore "correct", and the other is wrong (until we patch the standard that is). Conrad.Irwin 22:21, 13 July 2009 (UTC)

Actually AFAICS this has absolutely nothing to do with standards (that word is BTW seriously misused, 99% of "web standards" are no real international standards at all, but mere "recommendations", "technical reports" etc. serving as guidelines for browser implementers), but with the IANA tag assignment status of Serbo-Croatian (sh), which is ATM at the status of "deprecated", and how it would affect the completely negligible minority of Wiktionary users that would customize their CSS for bs/hr/sr, and whose browsers would be fed with sh, so that their hacks wouldn't work. Of course, the simplest way would be to say in WT:ASH that Wiktionary users doing such CSS hacking in the first place should set their preferences to the lang= code of sh, and not bs/hr/sr. Now that I think more about it - why should we care about IANA at all? The sh code is reserved and won't be reallocated to another language ever (no more 2-letter code assignments by ISO!), so we should simply use sh and Wiktionary users would have to change those 2 letters in their CSS scripts from xx to sh. Changing 2 letters on their part is much easier than wasting thousands of hours maintaining trivially differing separate sections for SC varieties on our part. --Ivan Štambuk 22:44, 13 July 2009 (UTC)
Actually I've already elaborated quite extensively in several places why it would be, from the Wiktionary end-user's perspective, completely wrong to have 3-4 duplicates of otherwise identical information. The basic argument is: there is no English-speaking learner of only "Croatian language" or "Bosnian language": basically all the uni-level courses, professional-level grammars (i.e. not the quick-read type for tourists and the like), and the most comprehensive available sh-en-sh dictionary (the Morton-Benson) use the collective "generic" approach. If we were to separate the sections the users would have to manually identify where there are slight differences (in some <5% of entries) in the definition lines or usage notes. This merger proposal has absolutely nothing to do with the ease of creating entries, but with the ease of using and maintaining them ! Why waste time triplicating or quadruplicating all the inflections, definition lines...every single time you expand one of them in some place (e.g. add an example sentence, synonym...) you need to do exactly the same thing in 3-4 other places! Furthermore the separation would disable the generic type of treatment I illustrated with nogomet / fudbal, where really different words are preferred in different standards. You can read further explication on this in the rationale and the relevant discussions on the WT:ASH's and the vote's talkpage. --Ivan Štambuk 23:32, 13 July 2009 (UTC)
I'm really not bothered about this specific issue, but for what it's worth, the note on nogomet / fudbal is utterly useless to anyone/anything that doesn't speak fluent English - there is no way for a computer to work out what it means unless everyone uses that exact wording. There are already ways to make different parts of wiktionary agree with each other, template transclusion and labelled section transclusion allow you to easily copy parts of an entry that should be copied (though it's interesting to note that so far most uses proposed for it have been found to actually require subtly different wording in each place). If you do have more than one section, that does not prohibit you from highlighting the differences between them, though as you continually point out, it would require you to either highlight the difference four times, or include the difference on four pages. I have already read more than I can remember about the rationale for these changes, and the rationale for not making them, and the compromises - I have no personal preference either way this is decided. Conrad.Irwin 07:44, 14 July 2009 (UTC)
there is no way for a computer to work out.. - Our primary audience is actually human beings, and not computers. Once you guys accept that, only then can we talk about the alleged trivial/intractable technical deficiencies of the proposal. Wiktionary is for people to use, not for programs to parse. This is not an exercise in computational linguistics, but the project of compiling the greatest dictionary ever! The type of approach taken on the usage notes of nogomet / fudbal is taken by all those BCS courses I mentioned on the vote page. If you look up football in the Morton-Benson dictionary you see fudbal; nogomet - and we moreover actually say which word is confined to which area. That is something that we want.
Yeah we could possibly "clone" duplicate content with template transclusions and labeled section magic - but why bother with all that sh*** when there is a much more elegant way to solve all this. Furthermore, transcluding sections in the mainspace is something we strictly try to evade as much as possible, as it would extremely complicate machine-processing. These labeled section transclusions are particularly problematic, as they use obscure wiki-code which is impossible to put in templates (I tried it - it doesn't work!), so you must write it manually every time, which really looks ugly (like we're doing it now for WT:ES). In any case, this is all speculation on how it could possibly work - in the last 4 months that this has been discussed, no one bothered to propose another way of handling all this, and at this point (in the middle of the vote, ~1500 entries merged) it's really futile to do so.
I have no personal preference either way this is decided. - I am most sad that you voted "against" in the end, and not "abstain" as it would be reasonable to assume your vote would be judging from this sentence. Please Conrad reconsider it, this is all made in good faith (though gone way too far unfortunately, but hopefully the noise of personal attacks, politics, threats and general extra-linguistic arguments and "arguments" doesn't cloud the numerous logically undeniable benefits of the proposal for both the Wiktionary end-users and editors). --Ivan Štambuk 09:04, 14 July 2009 (UTC)
The reason that computers must be able to read Wiktionary is that it is the only way we can ever hope to break out of our current position as an atrociously-laid-out, impossible-to-edit, challenging-to-read monstrosity of a website. It is not that a computer is going to come along and say, hmm I wonder what the French for "dog" is, it is that a developer may come along and say, hmm wouldn't it be nice if my iPhone knew the French for "dog" so that I can use it when I'm in France without having to connect to the internet. If he is successful, suddenly Wiktionary is being used by more people (Everyone with an iPhone who wants to know the French for "dog"), and (hopefully) more people will come to help Wiktionary. If he is unsuccessful, we just sit here on the internet getting fuller and uglier until no-one can comprehend what is here, despite it being amazing, so no-one bothers to visit, editing numbers decline and we get switched off. (I'm exaggerating a lot again). To me personally, this change makes no difference; to Wiktionary, I believe that it is a step in the wrong direction - sure it makes life a little easier for experienced Wiktionary users, for most others (and I happily admit that this is more of a feeling than four months of rational thought) it makes life harder. We should be striving to make things easier for everyone who isn't experienced rather than implementing yet more complicated shortcuts. I do share your feeling that transclusion is atrocious, I meant for there to be more emphasis on the bracketed statement about how all the previous proposed uses for it have been found inadequate as the wording was not identical enough. Conrad.Irwin 00:17, 15 July 2009 (UTC)
All that you describe applies to some hypothetical scenarios in some distant future on some imaginary users, so excuse me but I cannot at all take all that seriously. The difference between fudbal and nogomet is in no way different than with that of chips and crisps, and can (and is furthermore suggested by the proposal) be trivially handled with context labels, as ATM British/American lexical pairs are. It's just that I personally prefer the prosaic human-friendly explication to such metadata, so I've put it to ====Usage notes====. Furthermore, if you'd look up the translation table for football for Serbo-Croatian per proposal, you'd read something like (Bosnian, Serbian) fudbal, (Croatian) nogomet, which is also trivially parsable. That type of labeling is abundantly being done for similar scenarios. E.g. for Spanish there are some words in translation with almost a dozen context labels denoting regional usage. Again, this alternative format is trivially machine-processable so your objection cannot really apply. Also, you again mention "make things easier for everyone" - but the separation into 3 sections that would have 99% of duplication or triplication is hardly a step in that direction. --Ivan Štambuk 00:33, 15 July 2009 (UTC)
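The claim that a labelled translation line is machine-readable can be illustrated with a minimal Python sketch. It assumes the plain-text form quoted above, "(Bosnian, Serbian) fudbal, (Croatian) nogomet"; the function name is hypothetical and the real wikitext of a translation table would look different.

```python
import re

def parse_labelled_translations(line):
    """Map each variety label to the words it qualifies.

    Assumes groups of the form "(Label, Label) word" separated by commas,
    as in the example quoted in the discussion (not real wikitext).
    """
    result = {}
    for labels, word in re.findall(r"\(([^)]*)\)\s*([^,(]+)", line):
        for label in labels.split(","):
            result.setdefault(label.strip(), []).append(word.strip())
    return result

parsed = parse_labelled_translations("(Bosnian, Serbian) fudbal, (Croatian) nogomet")
print(parsed)
```

This only works as long as every editor sticks to the same label format, which is the condition both sides of the argument keep returning to.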
Just as you don't follow my line of reasoning, I don't follow your counter arguments. I have nothing further to contribute to this debate, sorry for wasting your time. Conrad.Irwin 00:39, 15 July 2009 (UTC)
In the first comment of yours at the beginning of this thread you express your concern for Wiktionary readers (human beings): however I would maintain that as most people who interact with Wiktionary are readers rather than editors, and people will read the same information more than the once (or three times).. and later you emphasize the importance of support of non-humans (computer instructions), and these two are mutually incompatible. Some compromise must be made. If you simply feel a need to vote against without a real justification just say so. --Ivan Štambuk 00:46, 15 July 2009 (UTC)
As I've tried to explain, the two are exactly the same - I'll try and make it clear why with another example: WT:EDIT has support for adding translations in most languages. It can't deal with Serbian, because of the nested scripts, it can't deal with Norwegian very well, because of the nested dialects, and it struggles with Chinese because they want to use different templates/language codes. Yes, I hope that one day it will support all these things, but I or someone else needs to manually write the code to handle each type of exception. Adding more exceptions creates more, similar, problems; not just for tools that try and make editing easier, but for tools that try and read what we do. It takes up a lot of time that could well be spent adding more interesting/useful features or doing productive work. Of course, it's not a technical `problem`, there is no reason that the code cannot be written (which is why such excuses shouldn't be used to prevent correct changes), it's merely a huge discouragement to anyone trying to do anything to make readers'/editors' lives easier. This specific split should be handle-able by the code that will handle Norwegian (if we stick blindly to WT:EDIT's current functionality) as it is the same sort of problem; but every introduction of non-standard behaviour causes this kind of issue - not just to WT:EDIT but to every single piece of code that tries to make life easier. Conrad.Irwin 07:54, 15 July 2009 (UTC)
Conrad, there is absolutely no difference in the way Serbian used to be treated before (2 scripts + 2 varieties) and Serbo-Croatian is now. Everything that used to "not work" or required specific handling with Serbian now does so with Serbo-Croatian. After this vote passes I'll personally add the necessary javascript code to handle the support for Serbo-Croatian, it's no rocket science. --Ivan Štambuk 10:06, 15 July 2009 (UTC)
I never said there was, I was merely presenting an example of why exceptions to the norms are irritating. Thank you for the offer of help - if you need a hand with the code, just let me know - it'd be great to have more people working on it. Conrad.Irwin 00:51, 16 July 2009 (UTC)
There is a difference; and it is not trivial at all. Under the proposal it becomes (as Conrad has pointed out) impossible to extract one of the standard languages from the wikt. The proposal is about removing some duplication at the expense of making the data entirely intractable to any automation. (And note that it is all automation: even when it is a "user reading the wikt directly" the data has to be read and interpreted by browsers!) Keep in mind that the primary mission of Wikimedia is to produce re-usable content. The direct web access to the (e.g.) en.wikt is secondary. Our purpose is to provide data that all kinds of applications, known and un-expected, can use. The proposal makes the standard languages entirely inaccessible to those uses. Robert Ullmann 02:35, 16 July 2009 (UTC)
Robert, you are greatly exaggerating. What I replied to Conrad applies to you too: the example with ====Usage notes==== on the suppletive-stem lexical pairs such as nȍgomēt and fȕdbāl is machine-processable by means of context labels. Now, I've used ====Usage notes==== because I firmly believe that the prosaic explanation makes it more accessible to the end user than mere context labels. If you look at the ====Synonyms==== section, you'll see on either of those two the other one listed with context labels. So it would be trivially deducible that nogomet is the preferred standard Croatian word, and fudbal is the preferred standard Bosnian and Montenegrin. We could even note that in the definition lines themselves (but I haven't, because I didn't think it was important that much, but we can do it and moreover require it in WT:ASH). Yesterday I created similar suppletive-stem entries fàbrika and tvórnica - in those however I also marked in the definition lines which variety prefers it. It should be noted that all of those 4 words are known to all the Serbo-Croatian speakers, it's just that some of them are preferred over another one in certain regions. These are for words with suppletive stems; words which trivially differ in morphology are handled at the ===Alternative forms=== section, also with context labels. So all in all, I think that the accusation of the impossibility of programmable extraction of one particular standard language is unfounded, as long as the context labels are used to mark them, and that is what the proposed WT:ASH policy asks us to do.
As for the claim that the primary mission of this project is to "produce reusable content" - I think that it is absurd. We're here primarily to provide the content that can be read by MediaWiki and displayed by the browsers for users to study and learn. The possibility of having external projects reading the data we collect here and displaying them in another way (or using it for whatever purpose) is of course worth having in mind, but that is not the primary goal of this project. I don't think that people of the Foundation would be too happy when reading that claim of yours, that the browser access is something secondary, and that our contributions here are secondary to the people who read Wiktionary database via browser (at this moment probably 99.9% of users). --Ivan Štambuk 16:08, 16 July 2009 (UTC)
Why don't you ask them? You are in for a very serious shock. (;-) There's a reason why all the careful use of CC and GFDL licenses and so on. Otherwise all of that would be ignored in favour of "all rights reserved", like any other websites. The entire point is to make the information available for anyone to use. Not some fancy website, the website is a tool for creating and adding to the content. Robert Ullmann 07:14, 17 July 2009 (UTC)
It was just a figure of speech, I wouldn't know who to ask. I agree that the entire point is to make free information available for everyone - but at this moment we're doing it primarily for humans who read the Wiktionary via browser interface. You simply cannot claim that browser access is secondary, otherwise we wouldn't be editing the content in MediaWiki in the first place. Also, by far the most popular external Wiktionary tool used by the general public - WikiLook, doesn't seem to have problems at all with reading ELE-compliant format of entries. So we can without problems "sacrifice" various programming-related interface things for the benefit of user presentation. --Ivan Štambuk 10:36, 17 July 2009 (UTC)
The fact that we try to represent words in languages with no ISO code should remind us that we can not be bound to their listing of languages. The IANA registry is even more restrictive than ISO 639 so we certainly can't be bound by them either (would we disallow entries in languages that didn't have an IANA tag?). If a fully valid ISO 639 code is needed, we can use hbs (the macrolanguage for serbocroatian). We use these macrocodes for many entries here (see Wiktionary:Language_codes#Language_separation) so that shouldn't be a problem either. We should stick to the standards when linguistically appropriate, but we must be more flexible than them in certain areas. I see no deal-breakers here in terms of technical arguments. --Bequw¢τ 00:09, 14 July 2009 (UTC)
From an ideological point of view, surely the correct answer is to send an update request to the standards bodies, though I suspect that would take a while to come to fruition. Using hbs would seem to me more appropriate, but then the difference between using sh and hbs is negligible. Conrad.Irwin 07:44, 14 July 2009 (UTC)
If the proposal was to add "Serbo-Croatian" to the wikt, we could certainly use "hbs". And would not need to propose anything to SIL/ISO. But if that were the case, there would be no reason for a proposal, or a vote, or any particular controversy. The issue here is that the proponents are not interested in adding "Serbo-Croatian", which they could do without the slightest debate: they are (as should now be glaringly clear to everyone) demanding that the standard languages be suppressed. Which we simply can't do. Robert Ullmann 02:35, 16 July 2009 (UTC)
We are not "suppressing" anything. Every standard or non-standard Bosnian, Croatian and Serbian word is welcome to be added. It's just that it is to be formatted under ==Serbo-Croatian==, and if it is not shared with the other 2 standard varieties it should be tagged with context labels. The primary reason for the proposal is to reduce the immense duplication/triplication/quadriplication of content the separate treatment of B/C/S/M would require, thus making the job very difficult for both the contributors and the users. It makes much more sense to treat the common core collectively, and tag the differences. All the 3 standard languages are based on the identical dialect (Neoštokavian), have identical phonology and accentuation, share ~95% of the basic lexis, have 99% identical grammar (same inflection, minor differences only in derivational morphology as illustrated in the rationale), and are completely mutually intelligible. To pretend that they are something completely separate would simply be an insult to common sense! --Ivan Štambuk 16:16, 16 July 2009 (UTC)
Re: "If the proposal was to add "Serbo-Croatian" to the wikt, [] there would be no reason for [] any particular controversy.": Per Wiktionary:Assume good faith, I will pretend to believe you. So, I'll say only I think sh is preferable to hbs, because macrolanguage codes don't seem to be in the IANA registry; that is, whereas sh is merely deprecated, but hbs is actually invalid for this purpose. (But to be honest, I'm not sure exactly what the macrolanguage codes are for. I may well be missing something. And confusingly, the other ISO 639-1 codes for now-macrolanguages don't seem to be marked as "deprecated" in the IANA registry. I assume politics must have a great deal to do with this.) —RuakhTALK 18:33, 16 July 2009 (UTC)
Yes, the macrolanguage code isn't very sensible, but at least it is a valid 639 code. (And extension codes might start with such, although all the ones I've seen start with group codes.) Keep in mind that other uses of 639 have uses for "macrolanguages", groups etc, even things like -2 "B" codes. And IANA just hasn't caught up yet; it took 5+ years for ISO to catch up with itself (they listed "sh" in a version of -1 released in 2005, after deciding to remove it in 2000 (!!), since fixed ... -;) Robert Ullmann 07:14, 17 July 2009 (UTC)
Correct, pure politics. Also, if you look at the list of macrolanguages, you'll see that Serbo-Croatian is the only one missing ISO 639-2 code. SIL/ISO's classification for any language should only serve as a guideline for our cause, and not something we should inescapably be confined with. --Ivan Štambuk 18:48, 16 July 2009 (UTC)

That's good. So there is no intent to suppress the standard languages? Very good. Most excellent.

Except: that is what the proposal does. The standard languages disappear under the proposal.

(I would like to apologize for mis-understanding your motives, and severely criticizing what they appeared to be; I assumed that what the proposal accomplishes, suppressing the standard languages, was therefore what you intended. I did not realize that you didn't understand what, in actual effect, you were doing.)

It is like this:

For words in a language to appear as part of that language in the Wiktionary, they must have an L2 header with that language name. For them to appear in Translations, they must have a line with that language name.

It isn't optional. "Formatting" them differently suppresses the languages: the entries become invisible and intractable to automation.

(You may only care about the browser interface, but that is only a small fraction of use, and that is broken as well, as observed above in one small example of many. E.g. in "Our primary audience are actually human beings, and not computers." Yes, that is the eventual audience for the data, whatever the application, but first it must be read by computers! What you are editing when you edit entries in the wikt is computer input data. It isn't the text of, say, a print dictionary.)

Okay, this may seem odd. Why should it make such a big difference? Well, first, things have to be in a definable and defined format for computers to deal with. (Like above where Conrad points out that while rigorous use of context tags may be tractable, making notes about the languages in prose is useless.) This may not be intuitive, but is critical.

Part of the problem here is that what we do mixes "schema" (database structure) and data. (As has been observed a number of times over the years, including somewhere in this discussion, IIRC). That is, some of the "formatting" is structure, and some of it is just fairly arbitrary (conventions in prose, in, say, Usage notes). And the data is intermingled. This is a consequence of using 'pedia software for the Wiktionaries. (And turns out to be fortunate, look at how the rigidity of the "omega" project impedes progress.) But it also means that some of the "formatting" is required. It is not optional. In particular, languages must appear at L2 to be present.

In order to make the Wiktionary structure work in the presence of "Serbo-Croatian" entries, we will have to duplicate those entries into properly structured L2 sections for Bosnian, Croatian, and Serbian.

Which, of course, we could do more-or-less automatically, with some process set up to check the words that should only appear in one or two of the standard languages. Of course, we then have FOUR language sections for each word, the three standard languages and the deprecated "Serbo-Croatian". (By more-or-less, I mean with code handling a lot of hairy details; like applying some serious heuristics to determine when a prose comment means that the entry should be flagged, changing templates, and so forth.) Likewise in Translations sections, to make the languages work.

(Do you want to have 4 language sections instead of 3? That's where the proposal inevitably ends up. I've designed a recovery process to restore the proper functioning of the Wiktionary if the proposal were somehow to be adopted; I'm sure the proponents of "removing duplication" will choke, as it makes the duplication necessarily worse ;-)

Somewhere in here there ought to be a reference to the Law of Unintended Consequences Robert Ullmann 07:14, 17 July 2009 (UTC)
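The point that a language is present in an entry only if it has an L2 header can be sketched in Python as follows. This is an illustration under stated assumptions, not the code any actual bot uses: the function name and the sample entry text are made up, and an L2 header is taken to be a line of the form ==Language==.

```python
import re

# A level-2 header: exactly two "=" on each side, e.g. "==Croatian==".
# "===Noun===" and deeper headers are deliberately not matched.
L2_HEADER = re.compile(r"^==\s*([^=].*?)\s*==\s*$", re.MULTILINE)

def languages_in_entry(wikitext):
    """Return the language names a tool would see, judged only by L2 headers."""
    return [m.group(1) for m in L2_HEADER.finditer(wikitext)]

separate = "==Croatian==\n===Noun===\nnogomet\n\n==Serbian==\n===Noun===\nnogomet\n"
merged = "==Serbo-Croatian==\n===Noun===\nnogomet\n# (Croatian) football\n"

print(languages_in_entry(separate))
print(languages_in_entry(merged))
```

A tool written this way reports only Serbo-Croatian for the merged entry; anything more specific has to come from labels inside the section.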

• The standard languages disappear under the proposal. - Standard languages do not disappear, headers for standard languages disappear. The difference is subtle but of most importance. Users still can see in the SC entries themselves the status of the listed SC word in standard B/C/S, and one can still machine-extract words that are e.g. Croatian only (all those that are tagged with the (Croatian) label, plus those that are not tagged at all, being the "shared core"). We could easily (I could, e.g.) write a program that would generate such B/C/S-specific list from the dump - we could even generate, beside the all-encompassing Index:Serbo-Croatian also the subindexes for B/C/S on the basis of tags.
• Another issue about these "standard languages" that I haven't mentioned before (as I thought it wouldn't matter) is this: they're completely ill-defined. Dejan mentioned on the vote page how Matica srpska's (i.e. the highest Serbian cultural institution) dictionary of "Serbo-Croatian language", simply became reprinted a few years later as a dictionary of "Serbian language". And furthermore it contains hundreds of words that are markedly Croatian, although they still have some marginal usage by the Serbs (and basically all Serbs would understand them without any problem). At the Croatian side, the situation is even worse: since the inception of "Croatian language" in 1991, they still haven't managed to produce a normative dictionary of it. Several big single-volume dictionaries have been published, but all of them descriptive corpus-based, and none of them is prescriptive in character and officially blessed by the relevant institutions (that would be the Ministry of Education, Matica hrvatska and the Institute of Croatian Language and Linguistics which operates the Council for Standard Croatian Language Norm). The biggest and the most professional of them, Hrvatski enciklopedijski rječnik ("Croatian encyclopedic dictionary") contains hundreds of words that any decent Croatian language purist would throw away in utter disgust (the ability to experience a psychological and emotional response upon reading/hearing the "wrong" word might sound a bit bizarre to the non-residents of Balkans, but to these people it's something as tangible as knife under your throat). The most authoritative and the only normative Croatian language grammar (this one) published by the Institute is riddled with ridiculous claims. According to it, the Croatian reflex of jat spelled as <ije> is phonologically a diphthong /ie/ - a claim which is mocked by lots of phoneticians, who call it an imaginary phoneme. 
In their overview of accentuation paradigms for nouns they list hundreds of patterns nobody uses today and which can be easily obsoleted. At this time at the Institute there is however a project to publish the first such normative dictionary of "standard Croatian" [2], for which that webpage claims it will be over in the middle of the 2011, but that's the most optimistic estimate (at least 6 more months until it gets printed and distributed, and a few more years until it gets influential).
• So what is my point here? My point is that the phrases such as standard Croatian are completely arbitrary, as there is no single dictionary of standard Croatian published yet, and the standard Croatian language grammar contains some very dubious claims! It's more of a phrase used to refer to the part of Serbo-Croatian lexis that is primarily used by the Croats, some of it still having marginal usage by the Serbs and/or Bosniaks and/or Montenegrins, beside being almost completely intelligible to them. For example, the word kàšika is still used in various parts of Croatia (from Dalmatia to Slavonia), as opposed to the more "proper" equivalent žlȉca. The first one is Ottoman Turkish borrowing, the second one is native Slavic, so understandably the purists insist on the latter as being "more Croatian". Of course, the first word is not going to fall out of usage anytime soon, and it would be plain stupid to simply not list it as Croatian in the Wiktionary (citations for its usage satisfying the CFI can be easily provided) just because some lone academician with a reality distortion shield around him that spells "Croatian nationalist" didn't find it convenient to list it in a dictionary of "standard Croatian". If you're writing a dictionary of used language (and that's what we're doing, not only standard but also non-standard words merit inclusion per WT:CFI), the overlap between Bosnian/Croatian/Serbian becomes much bigger. The only reasonable option that is left is to treat them collectively as Serbo-Croatian, and to provide a combination of usage notes and context labels that would explain the actual usage of words, and their "standard status".
• As I said, having 4/5 language headers, i.e. the simultaneously BCS and the ==Serbo-Croatian== is not an option. It would completely confuse the users (every single entry would then have a duplicate, even the words which are Serbian-only or Croatian-only), plus it would be enormous waste of time on the contributor's side (you'd have to think of both creating separate and "unified" entries) - and the proposal was made just to eliminate the defect of such redundancy upon those two groups. Trust me Robert, I would love that we have some kind of easy technical solution to all this. E.g. some MediaWiki-supported solution and also some javascript magic that would enable users by means of e.g. WT:PREFS to switch the display between "Bosnian", "Croatian" and "Serbian"; or simply stick to the default "Serbo-Croatian". Actually, that kind of solution would even be workable: Serbian version would simply ignore Croatian- and Bosnian-only words (that are tagged as such), and would support both scripts and varieties (Ijekavian/Ekavian), Croatian would strip links to Cyrillic-script spellings, as well as links to words (synonyms, alternative forms..) that are tagged as Bosnian- or Serbian-only, and would only be listing Ijekavian forms. One day when the database of Serbo-Croatian words on Wiktionary reaches the real dictionary-quality in number, e.g. 100 000 entries, compiling a list of words that are not "proper standard Croatian" would be simply a few hours of manual work (tagging some 4-5 000 words in the index, and lots of it could be automated). So all in all methinks that as far as external Wiktionary utilities are concerned that would be interested in Bosnian-only, Croatian-only and Serbian-only interface, that could be eventually worked out. 
The only problem would be usage notes as you already mention, but I think that these must stay as I've explained some "standard Serbian" and "non-standard Croatian" words like kašika are still abundantly used by the Croats, and the Croatian language standard does not reflect the actual usage of language but imaginary would-be "pure" language as conceived by the academicians.
• Do you want to have 4 language sections instead of 3? That's where the proposal inevitably ends up - No, per proposal everything must be formatted under L2 ==Serbo-Croatian== and the unification of ==Bosnian==, ==Croatian== and ==Serbian== sections to ==Serbo-Croatian== is irreversible. I really wouldn't want that some of your bots suddenly starts generating separate B/C/S sections out of unified one, as it would violate the (currently proposed) policy, and also pretty much nullify the hard-working manual efforts done to unify them by me and the others. Note that the generation of separate B/C/S sections out of properly unified ones is machine-doable, while the opposite is not the case. We can provide some day B/C/S interface to Wiktionary SC entry database, but that shouldn't be done by duplication of content and the simultaneous presentation (i.e. the collective SC and the separate B/C/S interfaces should be mutually exclusive). --Ivan Štambuk 11:46, 17 July 2009 (UTC)
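The extraction rule described in the first bullet above (a word belongs to a standard if it carries that standard's label, or no label at all) can be sketched in Python. The data layout, a list of word and label-set pairs, is an assumption made for illustration; it is not how the dump is actually structured, and the function name is hypothetical.

```python
def words_for_standard(entries, standard):
    """Select words tagged with `standard`, plus untagged "shared core" words.

    `entries` is an assumed, simplified layout: (word, set_of_labels) pairs
    already pulled out of the unified Serbo-Croatian sections.
    """
    return [word for word, labels in entries if not labels or standard in labels]

entries = [
    ("nogomet", {"Croatian"}),
    ("fudbal", {"Bosnian", "Serbian"}),
    ("voda", set()),  # untagged, so treated as shared by all standards
]

print(words_for_standard(entries, "Croatian"))
print(words_for_standard(entries, "Serbian"))
```

The rule depends entirely on the labels being applied consistently; a difference recorded only in prose usage notes would not be picked up.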
You don't get it do you? The Wiktionary structure requires that languages that exist appear at level 2. (Saying "it is just that the headers don't exist" is nonsense, in the structure the existence of the language in an entry is defined by the presence of the header.) So the proposal (as you interpret it) purges Bosnian, Croatian, and Serbian from the wikt.
With the proposal adopted, we will have to (more or less automatically) duplicate "Serbo-Croatian" entries into the standard language sections to comply with the defined structure. Fortunately, the proposal text doesn't prohibit the use of the standard L2 sections etc (you were asked to make your intended prohibition clear, and refused, instead weasel-wording the text ;-). And it only makes the "about" page guidelines, instead of policy (which is good, because it is not written clearly enough to be policy). (And the code will be changed to hbs at some point of course.)
However, I don't think at this point we have too much to worry about. The vote atm is 16/8, which hardly represents a consensus. While Mr Štambuk will jump up and down and scream "passes, passes", it clearly does not against the firm opposition (never mind the one spurious vote), and he will not be deciding which votes are valid or closing the vote.
As I said at the top, the proposal does not work, unless the intent is to completely suppress the standard languages. Since Mr Štambuk keeps saying that is not the intent, then the proposal is entirely unworkable. You can vote to declare the Earth to be Flat, and pass it unanimously, but you'll have trouble with the implementation.
I'm sorry that Mr. Štambuk does not understand that the headers are structure, and not optional, and in particular, not a "formatting" issue, and not about website markup. As I pointed out supra, this mixture of schema and data is not a great design, and certainly creates incomprehension among editors about "what means what". It is what we have. Yes, it is hard to understand.
I'm working out the process to repair the damage to the wikt caused by trying to implement the unworkable proposal (restoring sections, duplicating into the standard sections, changing templates and so on). Not terribly difficult, but hairy in places, we'll end up with things tagged with `{{attention}}`. (Which will also create/restore the proper places for usage examples, citations and other things ;-) It would have been/will be much easier if we didn't/don't try to break it in the first place, but c'est la vie. Robert Ullmann 05:51, 19 July 2009 (UTC)
The Wiktionary structure requires that languages that exist appear at level 2. (Saying "it is just that the headers don't exist" is nonsense, in the structure the existence of the language in an entry is defined by the presence of the header.) So the proposal (as you interpret it) purges Bosnian, Croatian, and Serbian from the wikt. - You just don't get it do you? Wiktionary structure requires structure, it does not prescribe what kind of content will abide by that structure. The proposed treatment of Serbo-Croatian is perfectly in accordance with the WT:ELE-prescribed layout of entries, is working just fine, and I've found zero deviant anomalies so far (having created/merged almost 2000 words so far). The proposal doesn't "purge" B/C/S "languages", it purges their headers and merges them into one out of numerous reasons abundantly explained in various places but which you somehow simply manage to ignore, and continue to rant on "prohibiting languages" etc, which saddens me very much. WE ARE NOT PROHIBITING B/C/S WORDS, BUT ARE SIMPLY TREATING THEM COLLECTIVELY, AS LINGUISTICALLY THEY ARE ALL ONE BLOODY LANGUAGE WITH >95% SHARED LEXIS AND 99% IDENTICAL GRAMMAR! How hard is it to understand that?!?!
With the proposal adopted, we will have to (more or less automatically) duplicate "Serbo-Croatian" entries into the standard language sections to comply with the defined structure. Fortunately, the proposal text doesn't prohibit the use of the standard L2 sections etc - Please, Robert, don't play dumb, we're not kids in the park trying to outsmart one another with silly verbal puns. Quoting from WT:ASH: All the other L2 headers for Serbo-Croatian varieties (==Bosnian==, ==Croatian==, ==Serbian== and ==Montenegrin==) are obsoleted by L2 ==Serbo-Croatian==. (Here obsoleted means "allowed until manual conversion occurs"). Quoting from Wiktionary:Votes/pl-2009-06/Unified Serbo-Croatian: Voting on: treating Bosnian (bs), Serbian (sr), Croatian (hr), and Montenegrin (no ISO code yet) all as one language, Serbo-Croatian (sh). This affects L2 headers [...]. So duplication is per proposal and vote explication forbidden. I've noticed the recent "duplication" activities by User:UllmannBot. Funny, they work quite wrong (But that's expected, given that you're using it to generate entries in a language you have no clue about!). Look Ullmann, if you choose to ignore the WT:ASH when it becomes community-binding after the (let's face it - expected) acceptance, and continue to make unauthorized edits of generating B/C/S sections by UllmannBot and others, I'll have to initiate a vote for your desysop and community sanction, report you to meta and WM mailing list incidence boards and also to Jimbo Wales himself, as I have strong reasons to believe that your pattern of behaviour surrounding this vote is exceptionally disruptive to the community. 
You appear to be more intent to write code that will "repair the damage" by the proposal (which you still call "unworkable", and which works JUST FINE) rather than fixing several technical issues left with the proposal that are not that uber-complex at all (like the format for `{{t}}` used for Serbo-Croatian to generate 4 interwiki links in superscript) and that were noted to you many months ago.
However, I don't think at this point we have too much to worry about. the vote atm is 16/8, which hardly represents a consensus. - Perhaps we should scale the voters by a coefficient representing the number of languages they are proficient in? Just kidding. I see only 2-3 reasonable votes against. You have no right to ignore the voted policy on the grounds that you personally believe it does not "represent consensus", as consensus is already established in the knowledgeable userbase (moreover among all of the SC contributors themselves). I'll ask SemperBlotto, who appears to be the most neutral of the regulars, to close the vote.
As I said at the top, the proposal does not work, unless the intent is to completely suppress the standard languages. - What exactly is this sentence supposed to mean? Can you please reword it in clear logically parsable English prose? If you have complaints on why exactly the proposal doesn't work, please provide an elaborate explication of why you think so. Everything else would be FUD-spreading. The intent of the proposal is not suppressing anything, and that has been explained many times. See e.g. here for a recent explanation to Lmaltier.
I've ignored the rest of your ad hominems and petty attempts to "ridicule" my intelligence. Please provide substance as arguments and not empty verbiage if you wish to be responded to. --Ivan Štambuk 11:58, 19 July 2009 (UTC)
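The structural point argued above, that a language's presence in an entry is signalled by its level-2 header, can be illustrated with a minimal Python sketch. It assumes plain wikitext with `==Language==` headers; the function names and the sample entry are hypothetical, and the set of obsoleted headers is taken from the WT:ASH text quoted above. It is not the code of any bot mentioned in this discussion.

```python
import re

# Matches a level-2 header such as "==Serbo-Croatian==" but not "===Noun===".
L2_HEADER = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)

# Headers that the WT:ASH draft quoted above describes as obsoleted.
OBSOLETED = {"Bosnian", "Croatian", "Serbian", "Montenegrin"}


def language_sections(wikitext):
    """Return the level-2 (language) header names in order of appearance."""
    return [m.group(1) for m in L2_HEADER.finditer(wikitext)]


def obsoleted_sections(wikitext):
    """Return the language headers that the draft policy would merge."""
    return [name for name in language_sections(wikitext) if name in OBSOLETED]


# Hypothetical entry text, for illustration only.
SAMPLE = """==Croatian==
===Noun===
(content)

==Serbo-Croatian==
===Noun===
(content)
"""

if __name__ == "__main__":
    print(language_sections(SAMPLE))
    print(obsoleted_sections(SAMPLE))
```

Under this reading, whether an entry "has" Croatian is simply a question of whether `language_sections` returns that name, which is why both sides treat the header as more than formatting.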

Just for the record, the edits referred to were 3 clearly identified tests, marked to be immediately reverted, and so reverted. I am working out a recovery procedure; parts are a bit buggy (;-). I apologize to the community for triggering yet another round of insults and abuse from Mr Štambuk. Robert Ullmann 09:09, 20 July 2009 (UTC)

After three weeks of your continuous trolling, FUD-ing and ignoring questions and discussions (and proposing selective "intractable technical difficulties" that affect one out of billion surfers), I couldn't care less for your apologies. Once again I remind you: you are not authorized to do such edits by your bots, and need a vote for that kind of activity. Call it "recovery protocol" or whatever - you're simply planning to undo hundreds of hours of manual labour by creating 10000+ full-blown L2 sections in language you have absolutely no clue about. Bots need a vote for even the dumbest thing like creating inflected forms in a different language, and yours are no exception. 2 out of 3 of those triplicated B/C/S sections had an error inside. --Ivan Štambuk 13:41, 20 July 2009 (UTC)

People should also note that in spite of being asked not to, and agreeing not to, Mr Štambuk continues to plow ahead removing valid language sections and replacing them with "Serbo-Croatian" while the vote is pending. (One might observe that Mr Štambuk is absolutely certain that he knows everything, and that no one else can possibly know anything at all. I noted on the vote talk page the history of his block on hr.wp for this sort of abusive behavior; it has apparently not dawned on him to wonder how I read dozens of pages of discussion there, since he insists I have "absolutely no clue about" Croatian. One might make such observations, but I will not, to be charitable.) In any case, the code was a very preliminary test, as I said, somewhat buggy. His continued damage done while the vote is pending is not really a problem, as I now have the code to recover all the deleted content; it will simply have a bit more work to do.

Oh, and may I apologize again, as this will undoubtedly provoke yet more abuse from him. Sorry. I would like others to have information about progress in cleaning up the mess, so they can review it; I should have something solid in a day or two. Robert Ullmann 16:28, 20 July 2009 (UTC)

The merger has been going on for the last four months, and not just by me but also by several other contributors. You only noticed it when some of your stupid bots generated a ==Croatian== language section for the merged entry (despite being notified of the merger several months prior to that!). Your first reverting of "severe damage" was of Krun's edit, remember?
So please stop portraying me as some kind of unilateral fundamentalist, I'm sick of it!
As for the stopping of the merger - I did stop it, only generating new SC entries (which isn't much of a problem, just another type of activity), but continued to do so after this remark of petty sarcasm by DCDuring (who also was bothered by the merger ongoing while the vote is pending). After that I figured out that none of you has any desire at all to look at the bigger picture, and the benefits of the proposal in general, but only to simply oppose, oppose, oppose, generate more and more FUD and simply refuse to switch on the part of the brain that would say "Hey, this is not such a bad idea after all".
One might observe that Mr Štambuk is absolutely certain that he knows everything, - Yes, I'm by far the greatest expert here on Serbo-Croatian, and what's wrong with openly saying that? Lots of the things I provide in answers to you and others one simply cannot find by googling, because it simply does not exist in the English language, and you need a knowledgeable native speaker to elaborate on the shortcomings of every alternative. But the problem is that you simply ignore any type of linguistic discussion. Your vote still has the "genocidal, Greater Serbia blahblah" rhetoric in it. You write kilobytes and kilobytes of text elaborating why (bs, hr, sr, sh) wouldn't work and when I ask you whether (bs)(hr)(sr) would work you simply ignore it, or on that lang= HTML code wouldn't be recognized by browsers, when in fact it would be ignored in general unless the browser actually uses CSS hacks to customize their SC text presentation (which no one really does, and those one in ten million that do can be asked to customize it to sh or whatever in their CSS scripts) - this latter one you even used as an invitation to everyone who voted to support the merger to vote against it! I mean, you understand that I simply cannot trust you to be acting in good faith anymore.
This "recovery protocol" is another example of Robert Ullmann twisting and ignoring rules and acting like a demi-god. The proposed policy and the vote itself obsolete anything but ==Serbo-Croatian==, and yet you suddenly claim that they're still allowed and that the vote itself is still not important, and moreover plan to write a script that would generate more than ten thousand full-blown L2 sections, without a vote and without supervision by a native speaker. Of course that could be done, but the whole thing about unification is to reduce needless redundancy by a factor of 3 or 4. But since you still believe that "languages" that are 100% mutually intelligible, have 99% identical grammar, 100% identical phonology and accentuation, 95% shared lexis..are "completely different", and have on several occasions even ridiculed my elaboration on how the differences between variants can be trivially handled by a unified approach..there's really not that much left to talk to you about.
Now you self-victimize yourself, as if I'm the "bad guy"...puh-lease.
If you want your content-duplication bot running start a vote, bot policy requires you to do so and you cannot ignore it. --Ivan Štambuk 16:57, 20 July 2009 (UTC)
Being the leading wiki-expert on Serbo-Croatian regrettably does not entitle one to unilaterally make decisions that affect matters others deem (however wrongheadedly) to be beyond one's ken. A wiki that had fewer non Serbo-Croatian considerations would, I'd expect, be more easily able to accommodate the views of a Serbo-Croatian expert. Indeed, I would expect such a wiki to welcome such contributions. Here, sadly, a good "wiki-citizen" must accommodate the needs, interests, and views of the wiki as a whole. Tragically, those needs, interests, and views are imperfectly reflected in the views and votes of woefully and regrettably ignorant active contributors, which inevitably put unreasonable limits on genius. This is a problem I experience myself every day. DCDuring TALK 17:31, 20 July 2009 (UTC)
DCDuring, can you please write in simple English to me, I sometimes have a hard time understanding what you're trying to say. I'm not sure whether you're trying to be humorous or dead serious :D
does not entitle one to unilaterally make decisions - Once again, this was not a unilateral decision, but a consensus among all of the active Serbo-Croatian language contributors: at the moment agreeing that were Dijan, Bogorm and me, who are responsible for 99.9% of some 10,000+ B/C/S language sections on Wiktionary. Krun and others joined only later, and were also completely supportive of the proposal. Immediately after that agreement I drafted the WT:ASH page (which is still drafted, like all the other language about-pages), and notified the community of it in the Beer Parlour, and Robert on the Tbot's userpage on the possible issue with the generation of superscript interwiki links in the translation tables.
Exactly nobody expressed concern for such actions back then, except for Carolina Wren who at the WT:ASH's talkpage was provided with a more thorough account on the differences and "differences" among Serbo-Croatian varieties. Polansky expressed concerns with the parallel of Czech and Slovak - who are also mutually intelligible but treated as different languages here - but I dismissed that kind of argumentation as the differences between Czech and Slovak (literary form) are far greater than that among B/C/S varieties. Standard Czech and Slovak are based on different dialects, standard B/C/S on the same one (Neoštokavian). Linguistically they are one language by every single Western Slavist out there. End of discussion.
Here, sadly, a good "wiki-citizen" must accommodate the needs, interests, and views of the wiki as a whole. - The unification effort has numerous benefits for both the users and contributors of Serbo-Croatian entries. Users have three or four times less almost completely identical information to lose their time with, and as for contributors - the benefits are pretty much obvious. I only create the Latin-script entry manually, copy it to the clipboard and a computer program I wrote converts it to the corresponding Cyrillic-script entry on the fly - all I need to do is paste it with CTRL-V! Works like a charm. How the outcome of this vote would affect the users that don't use Wiktionary for looking up Serbo-Croatian words, and the contributors that don't contribute Serbo-Croatian entries - I couldn't care less.
Everything proposed in the unification policy was done in best possible faith. And yet I'm still continuously accused as if it's all about "suppressing languages", "promoting Greater Serbia" yadayada... It's becoming really annoying. Do I need to express my political views on a user subpage simply to get rid of you guys? ^_^ Of all the possible things to do on the Wiktionary, you are bothered with this tiny little thing you didn't care at all a month ago.. --Ivan Štambuk 19:11, 20 July 2009 (UTC)
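The Latin-to-Cyrillic conversion described above is largely mechanical, since the two alphabets map letter for letter once the digraphs lj, nj and dž are treated as single letters. A minimal Python sketch follows. It is not the program mentioned in the comment; the function name is hypothetical, and it assumes every lj/nj/dž sequence is a digraph, which does not hold for every word (for example across a prefix boundary), so the output would still need review by a speaker.

```python
# Hypothetical sketch of Serbo-Croatian Latin -> Cyrillic transliteration.
# Assumption: every "lj", "nj", "dž" sequence is a single letter (digraph).

DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def latin_to_cyrillic(text):
    """Transliterate text, leaving unknown characters (markup, digits) as-is."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            # Capitalise when the first Latin letter of the digraph is upper case.
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("kašika", "ljubav", "Njegoš", "džep"):
        print(word, latin_to_cyrillic(word))
```

Checking digraphs before single letters is the one design choice that matters here; without it, "lj" would come out as two Cyrillic letters instead of one.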

Overview of Discussion

I haven't doubted your good faith. To me the issue is one of a conflict between two groupings of considerations. To drastically summarize and simplify, one set seems to be that:
1. there is "really" one language.
2. language learners should learn it that way.
3. all potential contributors should contribute in that way.
4. any technical considerations should accommodate the one-language view.
5. it is easier for the main contributors to the language to maintain one language section than three.
Perhaps an additional consideration is that it would have a good effect on the real world for this linguistic unity to be recognized.
The other set of considerations is:
1. Wiktionary (and all WMF projects) depends on the use of standards for ideological, technical, and organizational/political reasons.
2. The standards and our complete dependence on Wikimedia software force numerous compromises on us while making the whole enterprise possible.
3. Wiktionary is intended to be open to contributions from all comers including those who may have nationalistic views on language.
4. It is not easy to predict the future evolution of language where it is complicated by political considerations.
5. Wiktionary has not had enough success to compel WMF to provide us with technical support for problems not shared by Wikipedia and Wikicommons.
In the end, the technical capability to unwind a change may make it easier for folks to agree to make a change. At the very least, I would think it would be worthwhile to make sure that:
1. Ullmann has bots that can unwind things and
2. any changes in the structure of entries do not make content unrecoverable for the individual ISO codes to be subsumed by SC.
Anyway, that is my simplistic view. DCDuring TALK 19:56, 20 July 2009 (UTC)
Good summary. However, all the opposite views have been successfully dismissed:
1. We don't depend at all on SIL/ISO. Wikimedia foundation actually uses sh code for Serbo-Croatian Wikiprojects, e.g. the Serbo-Croatian Wikipedia. We already have and use a bunch of languages with no ISO code. ISO/SIL codes are a good guideline on how to structure wiktionary L2s, and how to use a set of codes to provide consistency in topical categories, translations and etymologies (wherever the codes get used in the lang= parameter), but they are not any kind of showstopper.
2. B/C/S are still in the process of a complete codification that will not end for at least a decade, so it's pretty much pointless to insist on concepts such as "standard Croatian", as I elaborated above. We write "all words in all languages", and 98% of what are today perceived as "Serbianisms" by Croatian language planners were in fact used by Croatian writers for centuries and can be trivially corroborated with citation evidence on e.g. Croatian Wikisource. So the amount of shared information would be even more than one would think.
3. Person who'd not be contributing to Wiktionary just because he'd be using ==Serbo-Croatian== instead of ==Serbian==, ==Croatian==, ==Bosnian== or ==Montenegrin== needs a visit to psychiatrist. We should not be concerned by Pepsi lite and his ilk. IMHO.
4. None of them changed in the last 150 years except some minor details in orthography and new terminology. Grammars published 150 years ago are still applicable. What is still called "the best Croatian grammar" was published more than a century ago. With the advent of standardology, mass media and educational system - language "evolution" cannot happen even in theory. It's like the evolution of Homo sapiens that ended 300 years ago, now that 99% of all born children reach reproduction age, and no competition of the genes can happen. I clearly remember reading a report published at the session of the Council for Standard Croatian Language Norm that says that any kind of reform in the basic orthography, phonology or grammar is out of the question (in a response to some of the Croatian linguists that wanted to eradicate the distinction between /č/ and /ć/, and /dž/ and /đ/ phonemes). So methinks we're pretty safe now. Furthermore, given the immense and ever-growing amount of cross-cultural exchange between Croatia, Bosnia and Serbia that has been ongoing since the fall of separatist fascistoid government of the 1990s, there is little potential for divergence.
5. Bot can indeed "recover" (actually a real recovery is impossible since I've been greatly expanding every processed Serbo-Croatian entry). However, that would be really stupid and pointless thing to do, that would negate the proposal itself. Also it would be forbidden when the proposal passes. It wouldn't be hard to write some javascript that would generate on-the-fly B/C/S sections to a user that selects to see them in their monobook customizations script, on some supposed WT:PREFS option. Simply triplicating and quadruplicating information again is unacceptable. --Ivan Štambuk 20:20, 20 July 2009 (UTC)

(left margin again; and since I'm sure to provoke yet another petulant tirade, I'll apologize again to the others here, but Mr. Stambuk's point 5 bears careful notice: apparently it is difficult-to-impossible for a bot to repair the database, but some on-the-fly javascript can do it for a browser user? Huh? Of course, it isn't about browser presentation, it is about the database. <snark>did he really say that people who disagreed with him should "visit a psychiatrist"? apparently so, it is right there on the screen. I kinda liked "Dictator of the Universe" though. I might start using that!</snark>)

Thank you sir, for now making it crystal clear that your motive and intent is to purge the standard languages from the wikt database. Replacing them with a "language" that has been soundly and firmly rejected by the international standards community, the relevant national standards bodies, SIL, LoC, (the RAs for 639), etc. So firmly rejected that they have taken the previously un-heard of step of deleting the SC code(s) to make it clear that "Serbo-Croatian" is unacceptable. The damage you are proposing (and doing, in spite of the vote being in progress) to the wikt database cannot be accepted and must be repaired. Robert Ullmann 09:07, 21 July 2009 (UTC)

apparently it is difficult-to-impossible for a bot to repair the database, but some on-the-fly javascript can do it for a browser user? Huh? - Can you make any kind of counter-argument without ridiculing or insulting the interlocutor, Robert? I know it's stronger than you, trying to pump your ego all the time, but can you please at least try to be civil? What I was saying is that this "repairing of damage" (has it ever occurred to you that this brain-damaged term you continue to utilize to describe what the Serbo-Croatian contributing community has been doing for the last four months is kind of...insulting?) is not something that can be "undone" as I've been greatly expanding the merged/created entries with inflections, pronunciations, etymologies etc. So you cannot simply "revert" to prior versions, only triplicate/quadruplicate the newer one.
did he really say that people who disagreed with him should "visit a psychiatrist"? apparently so, it is right there on the screen. I kinda liked "Dictator of the Universe" though. I might start using that! You just can't help it can you? For the casual reader: that statement was referring to nationalists who'd be unwilling not to create their mother-tongue words in L2 ==Serbian==, ==Bosnian== etc. instead of ==Serbo-Croatian==, and not "everyone who disagrees with me". On the Balkans such a form of mind disease is deeply-rooted into collective consciousness and cannot easily be treated without professional eye-opening help.
Thank you sir, for now making it crystal clear that your motive and intent is to purge the standard languages from the wikt database. - No it's not: read here again. Also re-read what I've said above on the standardological aspects of B/C/S and example cases such as kašika - Wiktionary has "all the words in all languages", so we wouldn't confine ourselves only to "standardly proper" words, and the amount of lexical sharing between B/C/S would be even more than the usual 95% - more like 99%.
As for the "rejection" of Serbo-Croatian I've provided at least a dozen citations from the most illustrious living and dead Slavists who used/are using the term without problems. We can call it "Serbo-Croatian" or "BCS" or whatever - it's the same language regardless of the name. SIL/ISO still uses the term "Serbo-Croatian" as a macrolanguage, which proves it's not at all controversial. The sh code is not deleted, only deprecated, and we can reuse it for our purpose (we already have a bunch of languages without ISO codes, and SIL/ISO won't assign 2-letter codes anymore, so there's no reason for us not to use sh.)
The damage you are proposing (and doing, in spite of the vote being in progress) to the wikt database cannot be accepted and must be repaired. - LoL. Look Robert, from the inception of the proposal you've done everything to stop it. Every single one of your comments was to obstruct the proposal, first to portray its advocates as some genocidal Greater Serbia propagandists, later to dismiss it on technical grounds claiming that it's impossible (when it turns out it perfectly is), reiterating your threats and ridicules over and over again.. You have not made a single positive claim, offering help how to fix the proposal, only why it wouldn't work, or can't work!!! First you claimed that this is "non-negotiable", and now when the vote is likely to pass you unilaterally write a bot that will undo hundreds of hours of manual labour done by me and the others, claiming that it is "undoing the damage the wikt database cannot accept". I mean - LoL. Are you the wikt database official representative? ^_^ I remind you once again: you are not permitted to do such an activity without a vote and the Wiktionary bot policy requires you to start such a vote. Such an activity will be outlawed by the current SC unification vote (but that doesn't really seem to bother you..), so it's moreover highly controversial and disruptive should you unilaterally put it into action. Robert, rest assured I'll do anything I can to stop you misusing the rights bestowed upon you by the community trust, if you choose to do so. We have rules and you must follow them, we're all equal under the law ^_^ --Ivan Štambuk 11:43, 21 July 2009 (UTC)

On scripts

The larger problem here is that the attempt to violently ram through "Serbo-Croatian" as a supposed solution to a perceived problem with information being duplicated has prevented looking at a reasonable way of proportionately addressing the issue without creating many, larger, issues.

While writing notes and code to recover from what has already been done, I made some observations about the scripts in use in these languages. Anyone in a constructive frame of mind might like to look at User:Robert Ullmann/SC recovery#On scripts. Robert Ullmann 10:20, 21 July 2009 (UTC)

It's astonishing to see how much time and energy you are willing to waste on something like this. The same disgusting rhetoric on "un-damaged sections", "standard languages" (and you don't even know what that phrase means!). You really need help Robert. --Ivan Štambuk 11:50, 21 July 2009 (UTC)
Plus your algo is buggy on step 5: you'd first be cloning all the entries and only then "tag with attention template if heuristics show differences in content given" - and who'd be doing that manual checking step, you? ^_^ Plus, there is no handling at all of variant jat forms, which cannot be done 100% programmatically as there are some Ikavian words in Serbian, some Ijekavian that are only Croatian.. Furthermore, I see no jat-handling code at all. You'd simply be generating lots of garbage that I can guarantee you no one here will be willing to clean up. Why must you be so evil? --Ivan Štambuk 12:02, 21 July 2009 (UTC)
Robert, re: your first paragraph: Don't be ridiculous. You've played as big a role as anyone in preventing the community from looking into better solutions, by launching into the Beer parlour with ridiculously offensive flames (I mean, really? accusing a Croat of participating in Serbian war crimes? how could that possibly have seemed like a good idea?), and then by failing at every turn to engage other editors in productive discussion. Your failure to apologize, and your continued lambasting of the SC editors for their good-faith implementation of what they thought was a good idea, does not make you an exemplar of "reasonable" and "proportionately". —RuakhTALK 14:23, 21 July 2009 (UTC)
O.K., I now notice that you did end one comment with:
(I would like to apologize for mis-understanding your motives, and severely criticizing what they appeared to be; I assumed that what the proposal accomplishes—suppressing the standard languages—was therefore what you intended. I did not realize that you didn't understand what—in actual effect—you were doing.)
[sic parenthesis] which, whatever else it might be, is certainly an apology. I'm sorry; I should have paid more attention before writing so harshly.
RuakhTALK 01:01, 22 July 2009 (UTC)
It is not possible to engage in productive discussion with Mr. Štambuk; any objection or suggestion at all is met with relentless personal abuse and name calling. Note when I do work on something constructive (this sub-section), he immediately attacks it, saying I am wasting my time and energy. There is no possibility of working out anything that is technically possible, standards compliant, and reasonably non-duplicative with any participation from him; anything that does not promote "Serbo-Croatian" will result in yet another abusive tirade. As to apologies, there is not a single reply he has ever written that does not require an apology. None will ever be forthcoming of course. I did mis-interpret his initial motive, as I've seen too much of "Serbo-Croatian" before from the same (groups of) people who are on trial at present in the Hague; I am sincerely sorry for that. (for attributing it to him; the association of SC with the war crimes is and was valid; it is one reason why SC has been firmly rejected by all of the authorities) What his motive is for promoting SC I do not know, but it seems to be clear that it isn't about "reducing duplication", as anything other than SC will result in attacks. (Also note that with that exception, I have made no personal comments at all, no ad hominem references, all of what I have said is about the proposal and its damaging effects. While at the same time, his statements are almost entirely personal abuse. A case in point immediately follows:) Robert Ullmann 08:12, 22 July 2009 (UTC)
Yeah it's all "my fault". It was I that started calling everyone Serbian nationalist for weeks, that still has a "war crimes" type of comment on his vote, that has tried not to constructively discuss the shortcomings of the proposal but instead proactively find them, offering no workarounds or real solutions, only "how it's bad", inviting anyone to change their support vote to oppose (!). Now you're doing what exactly: writing a bot that will "undo the damage" (that's the term you continue to use to describe the hard-working activity that has been going on for the last 4 months) in a "recovery protocol" algorithm that will create thousands of entries, each and every single one of them requiring manual clean-up, negating literally hundreds of manual hours committed by other contributors in unifying them. Moreover you're doing this completely unilaterally, don't even mention a bot vote (which the bot policy requires you to do), and despite the fact that the proposal is still being voted on, which would actually forbid such type of activity if it passes. So excuse me RU for misinterpreting your alleged noble intentions - to me it all appears as backstabbing, and I can assure you I'm not the only one who thinks that.
the association of SC with the war crimes is and was valid; it is one reason why SC has been firmly rejected by all of the authorities - that's complete rubbish. War criminals at the ICTY never use the term Serbo-Croatian (at least I never saw one of them using the term, and I watched hundreds of trial videos). I recommend that you finally read Greenberg's book [3]. I can send it to you as a PDF by e-mail if you want. The Serbo-Croatian literary language (and the term itself!) was created some 150 years ago, long before the 1990s war crimes, 1945+ communist regimes etc. took place - by whom exactly? Croatian and Serbian linguistic visionaries, who wanted to unify the divergent dialect-based literatures among their brotherly nations. See e.g. the Vienna Literary Agreement (but Greenberg's book has the best take on it). So claiming that SC has something to do with "war crimes" is simply a lie - in fact it has to do with exactly the opposite thing.
What his motive is for promoting SC I do not know, but it seems to be clear that it isn't about "reducing duplication", - I've been for ~2 years on this project, have created 6-7 thousand separate entries (mostly for ==Croatian==, but also for others), provided countless etymologies, inflections etc. for all of them separately - how dare you say that I have some "hidden agenda" in all this? It's primarily to reduce duplication, which would immensely facilitate both editing and browsing the entries. I don't want to add a ==Derived terms==, fix an etymology or e.g. provide an example sentence in one place, and then be forced to propagate it to 5 identical sections in different places. It would be a complete nightmare. If you simply can't see the immense amount of convenience that the unified approach would bring us, and see it all through some political prism - well, I'm sorry for that. I just hope that the majority does. --Ivan Štambuk 11:10, 22 July 2009 (UTC)
Don't be fooled Ruakh, Ullmann's vote on the vote page still has the "war crimes blahblah" propaganda on it, and I reminded him of that and he didn't remove it nevertheless. He is a master of demagogy par excellence. Absolutely everything he wrote from the start was to dismiss the proposal on various grounds, either by ad hominems against the SC proposal initiators (which I ignored at the start, as one can see), overblown alleged technical "deficiencies" of implementing the proposal, or continuous belittling of the intelligence and competency of doubtless experts such as myself (I'm a native speaker of Serbo-Croatian and he can't speak any Slavic language at all!). At first he claimed that the merger was not negotiable at all (!!!) and now before the vote has even ended he writes a bot for a "recovery protocol" that will in its current implementation generate thousands of garbage entries that would have to be manually fixed (I wonder by whom). Everything I see in his writings is camouflaged pure malice and hatred and I simply cannot reason where all this is emanating from. If RU utilized his intellect in another direction we'd already have `{{t}}` supporting 4-interwiki superscript. That's sad indeed. --Ivan Štambuk 01:27, 22 July 2009 (UTC)

Sigh. I am working on this process to try to save the work Mr. Štambuk has done and produce an acceptable result. The simpler method would be just to undo all his NS:0 edits. And, as DCDuring points out, having a recovery procedure in hand makes the proposal more acceptable, as we can then correct it at some later date when desired. (you'd think Mr. Štambuk would appreciate this, but instead all we get is more of the continuing torrent of abuse, oh well, my apologies to others for his behaviour: I can't impose maturity, and we do need to work out a solution) I've re-edited User:Robert Ullmann/SC recovery. Robert Ullmann 11:38, 22 July 2009 (UTC)

The simpler method would be just to undo all his NS:0 edits - Oh I'm sure you'd enjoy that one ;) And I really appreciate that you've finally dropped the "severe damage" type of rhetoric, now it's simply "undo" :) Also, could you please stop calling me "Mr. Štambuk"? I'm in my early 20s and that sounds really awkward, almost like some Victorian-style mockery.
you'd think Mr. Štambuk would appreciate this - sorry Robert, nowhere did you mention any such "benevolent" plans - you've simply put a giant notice here in the BP, "You know guys, I wrote a "recovery protocol" bot that would "repair the Wiktionary database" by erasing all the greatly expanded ==Serbo-Croatian== sections, generating hundreds of invalid sections, and every single generated entry would require manual cleanup." Now you turn it around and claim it was done with good intentions.. :) Sorry, but somehow I cannot believe you, as so far you've done absolutely nothing to make the proposal technically workable on the points that are still left to be discussed (like the `{{t}}`, where you can provide valuable technical assistance), only exactly the opposite: finding where it doesn't work and providing "fixes" such as this one. I have no doubt that you are indeed acting in the best possible faith from your own personal perspective, but it's just that some of those actions would ultimately prove very detrimental to what I strongly believe is the best viable option for the treatment of Serbo-Croatian. --Ivan Štambuk 12:15, 22 July 2009 (UTC)

As Štambuk continues to plough ahead making changes while the vote is pending, having a recovery procedure is critical. And I am working out the problems with it, as I said above to preserve as much as possible and not create problems. It certainly would not be run as is; many things to be done and evaluated. (And the proposal cannot be made technically workable. Parts of it could be patched, but the result would still not work satisfactorily; and would be a headache to maintain; on the other wikts that copy us, it would be impossible to maintain. I have spent considerable time looking at the technical problems, in {t} etc, pointing out just one of many as an example, but have not done anything because (1) it can't be done, and (2) I, for myself, certainly would not be making changes while a vote is pending!) Robert Ullmann 14:56, 22 July 2009 (UTC)

And the proposal cannot be made technically workable. Parts of it could be patched, but the result would still not work satisfactorily; and would be a headache to maintain; - What exactly could not work? Please don't spread FUD Robert. So far you've mentioned exactly one thing: the generated language code lang="sh" would not work for CSS customizations for users who had them set up for bs/hr/sr - but such users make up a negligible minority anyway (one out of x million surfers), and could simply redefine their bs/hr/sr rules for sh or whatever (or we can simply default it to sh). It would be 5 times more maintainable for the contributors, as there is 5 times less effort to be duplicated. That was one of the primary reasons for unification! I cannot believe that you're writing that it would be "impossible to maintain".
I trust your claim on not making changes while the vote is pending, I was too paranoid at start, sorry :D I highly doubt that the vote on this bot would pass, as restoring the previous versions from history is a very bad move IMHO, as I'll elaborate more on later. --Ivan Štambuk 15:51, 22 July 2009 (UTC)
As a newcomer to Wiktionary I find it disappointing to see two willing contributors at loggerheads over this matter. It must be clear by now that the two of you are never going to agree, and the effort expended in argument could have been used more productively. However, this debate and the one above about Chinese do highlight a characteristic of Wiktionary that is glossed over by the "All words in all languages" motto. It seems to me that Wiktionary is actually about "All interpretations of all character sequences". Thus languages and translations are, fundamentally, beside the point. Without elaborating too much, we can see how a single spelling can be used for more than one word, even within the same language (ooze in English, for example), and these words are segregated by etymology or sense. And we can handle the special case of a spelling being common across all or most languages and having the same etymology and sense(s). But a spelling common to several or many languages causes duplication, mostly uncontrolled. Perhaps it is too early to conclude that language at L2 is just plain wrong? ARAJ 09:02, 25 July 2009 (UTC)
Yes, it often happens that the same definition is repeated for several languages in the same page (e.g. Nepal). This is not a problem. Even when the definition is the same, some information is very likely to differ (e.g. pronunciation, derived terms, anagrams, paronyms, gender, declension, usage notes, etc.) Lmaltier 09:28, 25 July 2009 (UTC)
Well, I note that you are a francophone, which is handy, as French is my second language. I'll create a separate section to continue this discussion. I've just been looking at table, as an example, and important senses are omitted from the French section, which are present in the Wiktionnaire (albeit in French). If such a common term is in such a mess, it bodes ill for the rest of Wiktionary. ARAJ 14:30, 25 July 2009 (UTC) signed retrospectively ARAJ 18:57, 26 July 2009 (UTC)

Titles of glossaries

I think that titles of glossaries should be of the form "Appendix:Glossary of A", featuring the term "Glossary of", and not having the term " terms", while Daniel. thinks the term "Glossary of" should be dropped from the title.

Also, I find it superfluous for the titles to contain the term "English", as I have found only one glossary that is not English: Appendix:Portuguese internet slang, recently created.

Examples: "Appendix:Glossary of cricket" fits my proposal while "Appendix:Glossary of cricket terms" does not, neither does "Appendix:Cricket terms" nor "Appendix:English cricket terms".

The term "Glossary of" is common on the internet to identify a page as being a glossary, which is apparent from the following searches:

What I am proposing is the current practice, as apparent from Category:Glossaries. The current practice has been in part shaped by my renaming of some glossaries, yet many glossaries already had the title in the proposed form, which probably stems from their original naming at Wikipedia.

--Dan Polansky 09:07, 14 July 2009 (UTC)

I like the proposal. I also think we might construct glossary names using ISO codes in the same way we do for topical categories. That is, "Appendix:Glossary of religion" for English or "Appendix:pt:Glossary of religion" for Portuguese. There are some cases where "terms" will simply have to be in the title because I can't see any other way to make it work, such as "Appendix:Glossary of nautical terms" since there isn't a good noun substitute for nautical, and since we want nautical in the title to match the context tag of the same name. --EncycloPetey 20:45, 14 July 2009 (UTC)
I think language names (rather than codes) should be in the titles to increase googlability.​—msh210 20:49, 14 July 2009 (UTC)

There are two main issues:

1. The distinction between Glossary of and List of is particularly confusing. Dan Polansky moved Appendix:Glossary of toilet slang to Appendix:List of toilet slang (from what I can see by analyzing other glossaries and the description of Category:Glossaries, this move probably occurred because there were no definitions given for each term). I do not like the idea of having to search for both names, such as "Appendix:List of cricket" and "Appendix:Glossary of cricket". This system reduces the chances of finding information. I'd prefer a more intuitive way, such as "Appendix:English internet slang" and "Appendix:Portuguese internet slang".
2. There are appendices meant to list words in all languages. For example, currently the Appendix:Colors seems supposed to be English-only, while Appendix:Days of the Week contains words from various languages. I'd like to recognize these two possibilities simply by looking at their titles. In my opinion, "Appendix:Internet slang" would be a good solution for this issue, if it might contain internet slang from all languages, or links to wide lists such as "Appendix:English internet slang" and "Appendix:Portuguese internet slang".

--Daniel. 14:30, 16 July 2009 (UTC)

Question for you then: How would you label a general glossary of geography, and a French language glossary of geography (general) such that it would not look like a glossary of the geography of France? --EncycloPetey 15:08, 16 July 2009 (UTC)
According to the issues and possible solutions I presented, I would name them "Appendix:Geographical terms" and "Appendix:French geographical terms". --Daniel. 20:50, 16 July 2009 (UTC)
I would prefer that lists and glossaries not be split apart, as I also don't like looking in two places for one subject matter. If we split them apart, what about the ones that are half-and-half defined? One editor may come along and turn lists into glossaries... so they're all glossaries to me, some are just more finished and complete than others. Goldenrowley 23:42, 17 July 2009 (UTC)
Speaking from my experience, I have a use of glossaries--lists of terms with definitions--but not much use of mere lists of terms without definitions. One could think that a glossary is a superfluous page, as all its definitions are already in the Wiktionary namespace. Yet, a Wiktionary glossary is the result of a report on the mainspace database, showing only definitions from the given domain (say, logic), omitting definitions that do not pertain to the domain, and omitting all the other information including, say, etymologies. A glossary speeds up learning of a new vocabulary, as the user only needs to open a single page and scroll over it, undistracted by all the other information and hyperlinks shown in all the pages from which the glossary is sourced. A mere list without definitions, by contrast, does not serve this role. A mere list could appear to be redundant to categories, yet Appendix:Subatomic particles shows that a mere-list page can offer classification while categories show the terms sorted alphabetically. For the described use of a glossary, whether a page is a glossary or a mere list without definitions makes all the difference. It is because of this difference that the title of the page should indicate whether the page contains definitions, which it does by containing "Glossary of".
I cannot confirm that mere lists are just glossaries in the making. I have seen no mere list that would have been turned into a glossary; correct me if I'm wrong, by posting links to glossaries that were originally mere lists. Glossaries in the making do exist, having not yet all the definitions, yet these are clearly formatted as glossaries and not as mere lists, and they have at least part of the definitions already in place.
For certain topics, there exist both a glossary and a list: Appendix:Glossary of legal terms and Appendix:Legal terms, recently renamed by Daniel. to Appendix:English legal terms. The glossary and the list cannot have the same name "Appendix:Legal terms".
Whether a glossary is named "Glossary of logic" or "Glossary of logic terms" is much less important than whether the name starts with "Glossary of". The name "Glossary of nautical terms" fits my proposal, although my proposal does not make it explicitly clear. The idea of the proposal was that the suffix " terms" can be removed from the title when the removal does not make the title ungrammatical like "Glossary of nautical".
--Dan Polansky 09:02, 19 July 2009 (UTC)

Here is what readers should find in a glossary:

• Introduction: Explanation about all terms as a whole. It should include usage notes and etymologies when applicable. Additional introductions may exist at subtitles.
• List of terms: Each of the terms in a specific language or in all languages, properly linked to the main namespace. Context tags should be placed when necessary. Definitions should be provided for each term. However, if the explanation about all terms as a whole at the introduction can provide all the definitions easily (such as explaining the numerical system at the beginning of "Appendix:Numbers"), individual definitions are no longer necessary.
• Organization: Usually, sorting by the alphabetical order. Other possibilities are inherent to the topic in question.

Any glossary that doesn't completely fit one or more of the items above may be improved. The introduction notifies the reader about the contents of the appendix and when to use them. For example, the possible fact that a whole list of terms is profanity, and therefore likely to be offensive, should certainly be noted. The organization should be inherent to the topic when it's useful information, such as sorting by particle structure in "Appendix:Subatomic particles" or by card order in "Appendix:Playing cards". And the links provide access to related pages. --Daniel. 17:24, 19 July 2009 (UTC)

Pronunciations of Appendix:Irish given names

Why did Connel MacKenzieBot remove the pronunciations here? User Alasdair has restored them and we are having an argument on his talk page. --Makaokalani 12:33, 15 July 2009 (UTC)

Firefox 3.5 and Wiktionary

I've just suggested an optimization for Help talk:How to edit a page#Firefox 3.5 and Wiktionary. JackPotte 20:34, 15 July 2009 (UTC)

After some changes I made recently to this Dutch headword template, I realised that it's rather pointless to have to specify both the inflected form and the comparative. It dawned on me that in many cases, the inflected form is all that's needed to construct all 6 forms. After all, the comparative is, in the great majority of cases, nothing more than the inflected form +r, and the superlative is simply the headword +st. Only a very small number of adjectives have a special comparative, usually formed by adding -d- in between. Currently, the template supports 5 parameters: the comparative and superlative can be specified using either a named or a numbered parameter. However, the inflected form, which as it appears is very useful as a basis to automatically construct the comparative, can only be specified using the infl= named parameter, which is more cumbersome than a simple numbered param.

It therefore makes sense to me, in an effort to make the template simpler to use, to change the meaning of the parameters so that param 1 specifies not the comparative but the inflected form. The comparative and superlative, currently specified by params 1 and 2, would then be 'moved up' a spot to 2 and 3. The obvious catch here however is that it breaks any page that uses the template with numbered arguments, which unfortunately is in the majority of cases where parameters are used at all. It also means a big change for the users of this template, who will need to learn that param 1 is now the inflected form rather than the comparative, and that the infl= param is no longer needed. So I'm bringing this up here to see if anyone has any ideas, suggestions or other info on how to go about this. --CodeCat 21:40, 15 July 2009 (UTC)
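To make the proposed defaulting concrete, here is a minimal Python sketch (not the template's actual wikitext). The function name `nl_adj_forms` and its argument names are hypothetical; the only rules it encodes are the two stated above, namely comparative = inflected form + r and superlative = headword + st, with explicit overrides for the irregular cases. The remaining inflected forms are left out.

```python
def nl_adj_forms(head, infl, comp=None, sup=None):
    """Derive the comparative and superlative from the headword and inflected form.

    comp and sup are optional overrides for adjectives that do not follow
    the default rules (for example those that insert -d- in the comparative).
    """
    return {
        "positive": head,
        "inflected": infl,
        # Default rule from the discussion: inflected form + "r".
        "comparative": comp if comp is not None else infl + "r",
        # Default rule from the discussion: headword + "st".
        "superlative": sup if sup is not None else head + "st",
    }


# Regular case: only the inflected form has to be supplied.
regular = nl_adj_forms("groot", "grote")

# Irregular comparative with -d-, supplied explicitly.
irregular = nl_adj_forms("duur", "dure", comp="duurder")
```

With this arrangement the common case needs one argument beyond the headword, and the override parameters are only touched for the small set of irregular adjectives.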

Perhaps write a new template:nl-adjective or something, change all existing inflection lines to use it, advertise it, monitor use of nl-adj (replacing it when added and informing the editor of the new one), and eventually delete nl-adj or replace it with a redirect to nl-adjective.​—msh210 21:54, 15 July 2009 (UTC)
I get the feeling that creates more problems than it solves, as you'd then get people who are confused about which template is the 'right' one. Having exactly one template for every job is rather important, IMO. Not to mention it's rather silly to say 'ok let's use that template now' and then later 'ok let's use the old one again'. The name 'nl-adj' is nice and concise, I'd like to keep it in use. --CodeCat 22:02, 15 July 2009 (UTC)
Well, the way I described it, or at least the way I meant it :-), you would be switching once, to nl-adjective, and there should be no confusion about switching back and forth. This answers all your points except the last (that you like the shorter name). And you can reinstate nl-adj as a redirect, which should cause no confusion (since no one will be wondering "which should I use?" since nl-adjective will be the one to use), and you can use nl-adj if you prefer. (There are many commonly used templates that are redirects, including many inflection templates such as en-adjective.) Hey, it's just an idea; you don't like it, don't use it. --msh210 22:15, 15 July 2009 (UTC)

Daily archives of English Wiktionary JavaScript and CSS

In case anyone is interested I have set up a script which creates an archive of all CSS and JavaScript pages in the Wiktionary every day.

It finds all pages ending with `.js` or `.css` in the `User:`, `Wiktionary:`, and `MediaWiki:` namespaces and archives them into a tarball.

If the new daily tarball differs from the previous one it is published on the Toolserver at http://toolserver.org/~hippietrail/enwiktionary-code.tar.gz

My primary goal is to make it searchable with Google Code Search but it may be of interest to anyone who wants to make sure their user scripts are going to be compatible with other user codes.

It has not yet been indexed by Google but it will be very handy if and when it is. — hippietrail 03:37, 17 July 2009 (UTC)
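For readers curious what such a job might look like, here is a rough Python sketch. It is not hippietrail's script: the helper names (`is_code_page`, `content_digest`, `build_tarball`) are made up for illustration, the pages are assumed to be already fetched into a dict of title to text, and the "has anything changed?" check is done on a digest of the page contents rather than on the tarball bytes, because a gzip stream embeds a timestamp.

```python
import hashlib
import io
import tarfile

NAMESPACES = ("User:", "Wiktionary:", "MediaWiki:")
EXTENSIONS = (".js", ".css")


def is_code_page(title):
    """True for titles ending in .js or .css inside the three namespaces."""
    return title.startswith(NAMESPACES) and title.endswith(EXTENSIONS)


def content_digest(pages):
    """Stable digest over (title, text) pairs, independent of archive metadata."""
    h = hashlib.sha256()
    for title in sorted(pages):
        h.update(title.encode("utf-8") + b"\0")
        h.update(pages[title].encode("utf-8") + b"\0")
    return h.hexdigest()


def build_tarball(pages):
    """Pack the selected pages into a gzip-compressed tarball held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for title in sorted(pages):
            data = pages[title].encode("utf-8")
            # Hypothetical layout: the namespace becomes the top-level directory.
            info = tarfile.TarInfo(name=title.replace(":", "/", 1))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Toy input standing in for pages fetched from the wiki.
fetched = {
    "MediaWiki:Common.css": "body { margin: 0; }",
    "User:Example/monobook.js": "// user script",
    "Talk:foo.js": "not in an archived namespace",
}
selected = {t: text for t, text in fetched.items() if is_code_page(t)}
archive = build_tarball(selected)
today = content_digest(selected)
```

A daily run would compare `today` against the digest stored from the previous run and publish `archive` only when the two differ.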

Automatic transliteration

I have written a MediaWiki extension that allows automatic transliteration (for languages where this is possible) - more detail can be found at User_talk:Conrad.Irwin/Transliterator.php alongside the code. Are people interested in adding this functionality to Wiktionary? I do not have a complete list of languages that this would work for, but I believe there are several for which an automatic guess could be made correctly: Armenian, Greek, Korean, Serbian/Serbo-Croatian (Cyrillic-Latin). If there are more, or you think I'm wrong about these, please let me know. My aim would be to embed it into most of our generic templates so that transliteration can happen automatically. It might also be possible to overload this to provide (slightly) more efficient sc= parameter handling than can currently be done with templating, and other similar places where we need a map of keys to values. If there is interest in installing this, we can start a VOTE, and if the technocrats agree, we should be having outrageous fun in a few months. Any comments? Conrad.Irwin 00:13, 20 July 2009 (UTC)

Sounds very good. It will work with these languages: (all Cyrillic-based languages, not just Serbian) Russian, Ukrainian, Belarusian, Bulgarian, Macedonian; also Georgian; Hindi. If I am not mistaken, Gujarati and some other Indian languages. There are exceptions in some cases to the readings, but transliterations will still be very useful, since the transliterator may show the spelling, not the actual pronunciation. The tool won't help with accents (word stresses) either, which can be added manually. There are a lot of existing Greek, Hindi and Korean translations without transliterations; I wonder if they could be converted as well. For Korean, I need to provide you with a formula, since the Hangeul characters need to be decomposed into "Jamo" components before you can get the readings. Please remind me if you start working with the Korean transliterator. A while ago I wrote a C# program to do that, but I lost interest in Korean too early and forgot how to read it before mastering it; I reckon I can still help with this, though. Anatoli 00:32, 20 July 2009 (UTC)
Korean formula is here: [4] and [5]. If you can follow it. It's simpler than it's described there. I think you'd be better off working on one at a time. In some languages, the preceding and following letters may affect pronunciation/romanisation. Anatoli 00:46, 20 July 2009 (UTC)
Heh, funny you should pick on Korean, I've been hitting my head against it for several hours. I've got as far as the decomposition into Jamo (which is done by treating the word in Unicode NFD form) and transliterating each of the three parts separately, but there are apparently more rules for when different syllables are next to each other - as in the table at [6]. I found a similar table for the 2000 revised romanization method - which I believe is the "official" system to use, but I could not decipher it easily. The code I found at [7] seems to agree with the various online sources I found, but I can't work out what it is doing for the exceptions when syllables are next to each other (it seems to be looking for characters that can only occur in the lead where I would expect to find only characters in the tail). I have put the code for it on User_talk:Conrad.Irwin/Transliterator.php - do you know enough to add the missing rules? You can use the syntax ^ to match the "start of a word" and \$ for "the end of a word" to help with specifying which rule to use, and it is possible to create rules that depend on up to ten characters. Conrad.Irwin 01:00, 20 July 2009 (UTC)
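The NFD step mentioned above can be sketched in a few lines of Python (this is an illustration, not the code of the extension). `unicodedata.normalize` is the standard-library call; the `RULES` table is a deliberately tiny, hypothetical excerpt covering only the jamo needed for one word, and none of the rules for adjacent syllables are handled.

```python
import unicodedata

# Partial, illustrative table: conjoining jamo -> Latin letters.
RULES = {
    "\u1107": "b",  # leading consonant (choseong) bieup
    "\u1109": "s",  # leading consonant (choseong) siot
    "\u116e": "u",  # vowel (jungseong) u
    "\u1161": "a",  # vowel (jungseong) a
    "\u11ab": "n",  # trailing consonant (jongseong) nieun
}


def decompose(word):
    """Split precomposed Hangul syllables into conjoining jamo via NFD."""
    return unicodedata.normalize("NFD", word)


def romanize(word):
    """Map each jamo through RULES; anything without a rule passes through."""
    return "".join(RULES.get(ch, ch) for ch in decompose(word))


busan = "\ubd80\uc0b0"  # the two syllables of the city name (부산)
jamo = decompose(busan)  # five code points: lead+vowel, then lead+vowel+tail
latin = romanize(busan)
```

Because the lead and tail consonants are separate code points after NFD, a rule table can give the same letter different values by position, which is what a complete rule set would need.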
You can relax a bit about the Korean rules. The current South Korean transliteration - [8] is mainly based on the spelling, not on the phonetic changes, so, eg. b/p (/)are transliterated based on the spelling not on the actual pronunciation. So, Pusan (부산) is actually Busan in the standard transliteration. The only Jamo letter that will cause a problem is (l/r) and dropping of , is silent at the beginning of a hangeul. Besides, wiktionary can relax the rules even further (subject to approval of course). BTW, it's not always 3 parts, can be 2 and 4. I'll have a look at the rules. Anatoli 01:23, 20 July 2009 (UTC)
Serbian/Serbo-Croatian doesn't use sc= because it's dual-scripted and entries in both scripts always come in pairs in the translation tables, listings etc. Sanskrit (in Devanagari or any other script) wouldn't work because it uses transliteration to note accents and members of compounds. The biggest problem would be that for a number of languages there are several transliteration systems in use by various contributors under various arguments. I'd be much happier if someone wrote a bot that would generate ====Transliteration==== sections with all major transliteration schemes for languages that have several of them (e.g. Russian, Arabic, Hebrew, Greek, Persian, Armenian, all Devanagari-based etc.) --Ivan Štambuk 01:12, 20 July 2009 (UTC)
I know it is dual scripted, and I believe that it is possible to transfer automatically between the two (as on the), which is why I brought it up (using this system you could just say `Template:t-serb` and it could generate both Latin and Cyrillic entries in one go). Using this system it would be possible to create as many different transcription systems as you like, making the creation of a transliteration section as simple as putting '==Transliterations== {{transliterations|ruleset1|ruleset2|ruleset3}}'. Hopefully this appeals to your sense of removing duplication ;). Conrad.Irwin 01:22, 20 July 2009 (UTC)
Just in case, unvocalised Arabic and Hebrew should not be automatically transliterated. Since it's normal to write in unvocalised Arabic and Hebrew (like writing in Chinese without pinyin), there is no point in producing a tool, which will romanise them without vowels and thus mislead both editors and users. Anatoli 01:33, 20 July 2009 (UTC)
Even vocalized Hebrew should not be machine-transliterated. (This contradicts incorrect information I gave Conrad in another forum.) For example, chochma ("wisdom") and chach'ma ("she was wise") are both חָכְמָה. The only way to machine-transliterate Hebrew is if it has both vowels and cantillation marks, and we'd never have that on enwikt.​—msh210 16:44, 20 July 2009 (UTC)
I think that, in the absence of kamatz and sh'va, Hebrew can be transliterated perfectly, just with no stress indication. And if there's some way for the transliterobot to indicate doubt (and add a category for it), then I think we can assume kamatz gadol and sh'va na (kamatz gadol because it's many times more common, sh'va na because anyone can see how to remove an apostrophe to make it nakh, whereas it's not necessarily obvious that you insert <'>, rather than (say) <’> or <e>, to make it na). BTW, stupid question: how would the cantillation marks distinguish between /ħɑˑχˈmɑː/ and /ħɑː.χəˈmɑː/?RuakhTALK 20:21, 23 July 2009 (UTC)
In absence of kamatz and sh'va, Hebrew can be transliterated perfectly, though with no stress indication: I agree. But I'd change your "if there's some way... to indicate doubt (and add a category for it), then I think we can assume kamatz gadol and sh'va na" to "iff...". Chach'ma would have a meteg (or a m'tiga or the like) attached to the chet whereas chochma would not.​—msh210 20:50, 23 July 2009 (UTC)
Personally I think that the current format of dual-scripted languages in the translation tables in separate rows is completely wrong. Now that I think about it, it would be great to have this for SC, where there is a 1:1 mapping between Cyrillic and Roman script, in order to simultaneously link to both script entries at once. E.g. writing something like {{t-sh|ми̑сао}} and it would generate something like: {{t|sh|sc=Cyrl|мисао|alt=ми̑сао|tr={{l|sh|misao|mȋsao}}}}. The input script would always have to be Cyrillic, as you can always generate Roman out of it properly but not the other way around (in some 0.1% of cases). This would require 3 transliteration schemes (removing diacritics, conversion to Latin with and without diacritics), and if it could work it would be awesome. --Ivan Štambuk 01:36, 20 July 2009 (UTC)
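The three schemes named above (removing diacritics, and conversion to Latin with and without diacritics) could look roughly like this in Python. This is only a sketch under stated assumptions: the function names are hypothetical, the table covers lowercase letters only, and the set of accent marks to strip (grave, acute, macron, double grave, inverted breve) is an assumption about which combining marks are used for accentuation.

```python
import unicodedata

# Combining marks treated as accentuation (assumed set).
ACCENTS = {"\u0300", "\u0301", "\u0304", "\u030f", "\u0311"}

# Lowercase Serbian Cyrillic -> Latin.
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def strip_accents(word):
    """Scheme 1: same script, accent marks removed (for the link target)."""
    nfd = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if c not in ACCENTS))


def to_latin(word, keep_accents=True):
    """Schemes 2 and 3: Cyrillic to Latin, with or without accent marks."""
    nfd = unicodedata.normalize("NFD", word)
    if not keep_accents:
        nfd = "".join(c for c in nfd if c not in ACCENTS)
    latin = "".join(CYR2LAT.get(c, c) for c in nfd)
    # NFC recombines a base letter and its accent into one character where possible.
    return unicodedata.normalize("NFC", latin)


accented = "\u043c\u0438\u0311\u0441\u0430\u043e"  # the accented example word from the post
link_target = strip_accents(accented)
display_latin = to_latin(accented)
link_latin = to_latin(accented, keep_accents=False)
```

The three results correspond to the pieces of the generated template call in the post: the unaccented Cyrillic link, the accented Latin display form, and the unaccented Latin link.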
(About your above comments:) Look at it this way: this is an opportunity to finally force-unify all transliteration schemes used by various contributors. We absolutely have to choose only one of those: Wiktionary is confusing as it is. As for ====Transliteration==== section, we already have too much distracting L4 header stuff going on, obscuring and hiding the most important part of our articles - the definitions themselves (just read WT:FEED). Adding another one will make it even worse. Besides, the transliteration is used not only for the headwords, but also in inflection templates, synonyms, derived terms and a host of other places. Do you want alternatives there too? --Vahagn Petrosyan 01:41, 20 July 2009 (UTC)
No, only on the main entry itself, c.f. the ====Romanization==== section on горілка(horilka). One transliteration scheme would have to be chosen as the primary one, to be used in the headword, translation tables etc. As for choosing such primary transliteration scheme, I hope this will prove to be a useful accelerator, as too much time has been wasted on types of discussion whether we should use <sh> or <š>. --Ivan Štambuk 13:54, 20 July 2009 (UTC)

Mega-awesome! Hope to see this implemented as soon as possible (including in WT:EDIT). Will save us all huge amount of time. --Vahagn Petrosyan 01:41, 20 July 2009 (UTC)

Yep, sure like the sound of this. This will make many things easier. – Krun 18:38, 20 July 2009 (UTC)

Is the coding sophisticated enough for transliteration of Ancient Greek? There are cases where transcription is dependent on situational context. For example, a single gamma transliterates as "g" but two gammas together transliterate as "ng". There are a number of other digraphs that undergo the same kind of shift. Ancient Greek also has many fiddly diacriticals, which do not always transcribe. There is a full table (excluding breathing marks) at Wiktionary:Ancient Greek Romanization and Pronunciation. --EncycloPetey 18:41, 22 July 2009 (UTC)

It could work for such cases because the extension would match the longest strings first, i.e. the transliteration rules in case of gamma would be sequenced as:
γγ => ng
γ => g
The cases where there are diacritics denoting tones would be ignored (if it's the desired behavior), all transliterating to the same Roman letter. Now that I think about it, what I'd like to see in this extension is support for multiple transliteration schemes, from which the user can choose. We could default to some user-friendly ones which use e.g. <sh>, <ch> instead of <š>, <č>, but it would be great if users could choose some scientific transliteration schemes.. (WT:PREFS, monobook, or however that works). --Ivan Štambuk 18:54, 22 July 2009 (UTC)
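The longest-match-first ordering described above can be illustrated with a short Python sketch. It is not the extension's code: the function name `transliterate` is hypothetical, and passing unmatched characters through unchanged is an assumption made for the example.

```python
def transliterate(text, rules):
    """Apply rules left to right, always trying the longest source string first."""
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


# The two gamma rules from the example above.
gamma_rules = {"γγ": "ng", "γ": "g"}

double = transliterate("γγ", gamma_rules)
single = transliterate("γ", gamma_rules)
triple = transliterate("γγγ", gamma_rules)
```

Because the two-letter rule is tried before the one-letter rule, a doubled gamma never falls through to two separate "g" outputs; three gammas in a row come out as "ng" followed by "g".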
Yes, that would work - I can pretty much copy and paste those tables into the rules. I have the rules for modern greek at User_talk:Conrad.Irwin/Transliterator.php which are only slightly less dependent on the context. Breathing marks might end up increasing the size of the table a bit, but should be easily codable in the same way. Transliteration done on the server (as this is) would not be user-configurable, but it also has the advantage that it is google-able and (when/if they fix the search engine to index the output not the input) it should show up on mediawiki search too. Conrad.Irwin 19:11, 22 July 2009 (UTC)
As long as there can be some looking ahead, I think this should work out. Wiktionary standard transliteration is slightly different than more orthodox schemes (most importantly with υ = u and χ = kh). Also, accents are not transcribed at all, but there have been some folks who have had qualms with that, and so the issue might merit some discussion. If you need any help with it, let me know. -Atelaes λάλει ἐμοί 20:53, 22 July 2009 (UTC)
I suppose we could in theory set up multiple transliteration schemes by utilizing several different language pseudo-codes (e.g. grc-translitscheme1, grc-translitscheme2 etc.), but their integration into the current structure of complex templates such as `{{t}}` which would then have to select the appropriate (pseudo-)language parameter for `{{transliteration:}}` on the basis of user-defined preference would prove to be too much of an implementation/performance problem.. Or wouldn't? ^_^ --Ivan Štambuk

So basically I'd like to know:

1. Would combining diacritics pose any kind of a problem?
2. Could it support multiple transliteration schemes, by using language pseudocodes such as sh-lat2cyr and sh-cyr2lat ? This way we could for dual-scripted languages provide only one parameter (accented form in one script) and generate (deaccented) wikilink in that script, and also wikilink and an alternative display in another script.
3. Would it be feasible to implement a selection of a transliteration scheme (if multiple ones were supported) by a user-defined preference, inside `{{t}}`, `{{infl}}`, `{{l}}` and elsewhere where `sc=` is used?
4. Could this extension be used to detect whether the word is written in selected script (i.e. is any of the letters in that word written in that script) or not? E.g. by returning some kind of an error code when no rules were matched, or by comparing the input and the output (whatever is better performance-wise). This could be really handy in some situations. --Ivan Štambuk 11:52, 23 July 2009 (UTC)
No, Yes, No, Sort-of. It has been programmed to deal with diacritics and so won't attempt to transliterate a letter without them (even if they are decomposed). [On the other hand you can set it into decomposed mode, and then it will be possible to transliterate the letters and leave the accents]. Yes, I can't see there being a limit to the number of transliteration schemes available. No, a user defined preference is not possible, pages are rendered (converted to HTML) when they are saved, not when they are viewed, so the preference of the most recent editor would always be used. At the moment you can define the "error character" to be used when no rule matches (by using a blank left-hand-side), it would be possible to implement an error message to be used if the transliteration fails, or to maintain the current default behaviour of passing through characters unchanged. Conrad.Irwin 07:42, 24 July 2009 (UTC)

I have started a WT:VOTE, please indicate there whether you would like this extension installed so that we can provide evidence of consensus to the developers. Conrad.Irwin 07:49, 24 July 2009 (UTC)

• What if several languages use the same transliteration scheme - should they be defined in separate set of identical mappings, or can the script support multiple language codes (and how do I mark it if it can)? --Ivan Štambuk 01:26, 25 July 2009 (UTC)
The extension looks for language mappings by just finding [[Mediawiki:Transliterator:{{{name of map}}}]] and using that, so no it would not be easily possible to share the same mapping for different languages. Something we should discuss is a naming convention for these maps, I had thought that using just the iso language code would suffice for the default mapping for each language, and then any variations could be at "iso language code"-"variant name abbreviated". Conrad.Irwin 00:00, 26 July 2009 (UTC)
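The proposed naming convention could be sketched as follows. This is a hedged illustration in Python, not the extension's code: the helper names `map_title` and `find_map`, and the idea of passing in a set of existing page titles, are assumptions for the example. It shows that each map name resolves to exactly one page, with no fallback from one language code to another.

```python
PREFIX = "Mediawiki:Transliterator:"  # page prefix as written in the comment above


def map_title(lang_code, variant=None):
    """Build the page title for a default map or for a named variant."""
    name = lang_code if variant is None else f"{lang_code}-{variant}"
    return PREFIX + name


def find_map(existing_titles, lang_code, variant=None):
    """Return the map page title if that exact page exists, else None.

    There is deliberately no fallback to another language's map, so two
    languages with identical rules would each need their own page.
    """
    title = map_title(lang_code, variant)
    return title if title in existing_titles else None
```

Under this sketch, `map_title("sh", "lat2cyr")` yields the title ending in sh-lat2cyr, and a lookup for a language code without its own page simply returns nothing.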

For Yiddish, can we use YIVO's scheme without intellectual-property issues? It's pretty accepted from what I understand.​—msh210 06:14, 27 July 2009 (UTC)

I have an idea about a new L4 header: Grammar notes. One huge example would be Hungarian suffixes that have many features that are important to know. Like which other variants exist (according to the vowels of the word), what it does to the stem (lengthen the last vowel, shorten the stem by skipping a sound, shorten a vowel, etc), how it can be further suffixed (according to which vowel-harmony type) etc... It could also apply to certain words like kell "have to", where it could be explained what mood the verb needs to be, what the different moods indicate etc. Currently this kind of information is scattered between the definition line and the Usage notes section. In my opinion the Usage notes section is not meant for this kind of information. Rather it is a section to inform about how to actually use the verb in life when you go around. Not for such fixed and firm questions as how it acts in grammar, but stylistic information, or when to say it, whom to say it, what reactions one can expect after saying it etc. What do you think about this? Qorilla 13:09, 22 July 2009 (UTC)

Seems OK, but if it would grow too large it'd better be put into appendix. We already have a bunch of appendices on grammar issues of certain languages (usually on inflections, but suffixes and other types of derivational morphology would also be OK). --Ivan Štambuk 15:43, 22 July 2009 (UTC)
There are ways to accomplish this within our current structure. One way is to create an Appendix:Hungarian Suffixes (or similar) with whatever generic content there could be on the subject and a link from Usage notes or See also to the Appendix or a specific heading in that Appendix. The allocation of content among entry-specific (or linked term) Usage notes, one or more Wiktionary appendices, and Wikipedia articles is in your discretion. DCDuring TALK 15:46, 22 July 2009 (UTC)
These things would belong to the respective entries and I think Usage notes does not describe it well. It would not be long and would link to Appendices. I think it would be best not to explain much in this section just give the information, which can then be understood by reading the Appendix. The Appendix should be general and what applies of them to each word should be in the respective entry.
Basically, the entry would say: "this word/suffix is like this and this (has this and this property)", and an Appendix would explain in detail what that property means and how is it significant. Qorilla 16:26, 22 July 2009 (UTC)
It may not describe it well, but all of our headings are imperfect. And a vote takes a while. Initially, it might be good to start an appendix and insert the entry-specific content in an existing heading so that folks can see what you mean more concretely. Changing en.wikt-wide WT:ELE with a vote might not even be successful. Is there some agreement that can be reached among Hungarian contributors? DCDuring TALK 17:38, 22 July 2009 (UTC)
I understand. Writing more Appendices is on the list, but there are many other things to do about Hungarian. Seeing that making this change would be an entire "project" in itself I think we better shift it to a bit later. Qorilla 17:54, 22 July 2009 (UTC)

On a constructive proposition for Bosnian, Croatian, Montenegrin and Serbian

In April 1999, I found myself in a cell. It ended up being only for one night, and my captors didn't mistreat me further.
The jailer who checked on me during the night was irritated, as I spent the entire night sitting on the bunk, contentedly looking at the wall. I was warm, and safe—confident they weren't going to shoot me—and would probably not be there for long. My thoughts were with the Kosovars, out in the snow. They had left their families behind and gone off into the woods, so the Serbians would come after them there, and their families would be unharmed. Instead, the Serbs slaughtered the families.
Over a million people would flee, on all sides, Serbs as well, while Kosovars also committed war crimes. A few days later, Yugoslavia would file a complaint in the Hague, charging NATO with illegally interfering in Kosovo. It was not the beginning or the end, and I had seen and would see too much.

There is no reason why we cannot reduce duplication, follow the standards, avoid taking an official political position, preserve NPOV, and get there from here. And have some sensitivity to those for whom these things are all too real, as the Tribunal does in using "BCS" as an entirely neutral term. Likewise, we have a very simple safe harbor in following the standards. Taking a political position has the potential to put the Foundation in a very difficult position: we must not be doing that.

Now to set that aside.

Why "set this aside" ??? Robert, first you open a topic that starts with completely irrelevant CNN/Hollywood type of bullshit propaganda against the Serbs, and then you ask "please stop and think for a while before posting anything here not entirely constructive" ????!!!! Once again, I have to rub my eyes to convince myself that I am not dreaming. I can imagine that for a bunch of Westerners the very notion of bombing Serbia by NATO (the "defensive, peace-keeping military alliance") in 1999 was something "deserved", and that the destruction of thousands of civilian objects and more than 1000 unarmed innocent people that died during the bombings by NATO terrorists were something that the Serb people "brought upon themselves", but I'm sorry to disappoint you - your "independent media" lied, and you all succumbed to the NWO propaganda. Dear Germans, the first engagement of Luftwaffe after WW2 was a war crime, and your beloved lying minister was modern-day Goebbels - [9] (There are entire libraries written on this, but this short movie would suffice for the interested). --Ivan Štambuk 14:36, 23 July 2009 (UTC)

On duplication, first note that for Serbian, we necessarily have two sections, as one is at the Cyrillic title, and one at the Latin (Roman) script title. However, the Latin entry is often identical to Croatian. Bosnian and Montenegrin are either the "same" as the Serbian/Cyrillic or Croatian/Latin, unless unique to either or both, in which case that is a separate entry anyway.

This presents a simple solution: make the Cyrillic entries the "primary" entries for Serbian, the Latin entries for Croatian, and use other language sections as and where needed or desired to explain language differences. We do variations of this in other areas, such as making the kanji entries the primary entries for Japanese, while the hiragana and rōmaji entries are expected (but not required) to be minimal. Likewise in some languages we use a particular standard script (Devanagari for Hindi) when the language can be written in various scripts. (Swahili is sometimes written in Arabic script, but other than a possible appendix on the common transliteration(s), we won't be going about adding Swahili to Arabic script entries.)

This leads to a proposal that preserves the standard languages for all users, and all applications, is fully standards compliant, and is easy for editors to understand; if they want to work in one language (as is usual) that is fine. Please look here. Note that while it is written in formal prescriptive language, it is of course only a draft.

Related note: Montenegro has released the first version of the Montenegrin language standard, just a few days ago. I haven't read it yet, but look forward to it. Apparently they have added 3 letters, in each of Cyrillic and Latin, to the alphabet. (! :-) Robert Ullmann 13:10, 23 July 2009 (UTC)

1. Serbian is written in both scripts, and both as equal. Today, Latin script has for Serbian, according to most estimates, more usage than the Cyrillic script, especially on the Internet. It's pointless to differentiate among languages on the basis of scripts.
2. Believe it or not, Croats (and Bosniaks - though to a much lesser extent) have some 800 years of tradition of using Cyrillic script, of a particular style called bosančica. Since we include "all words in all languages" I could also be adding "Croatian words" written in Cyrillic script, but this proposal would force me to mark them as ==Serbian==. Bad, bad thing. For proud Croats, seeing words written in bosančica in uber-important historical document such as Vatican Croatian Prayer Book marked on Wiktionary as ==Serbian== would make them wanna puke their eyes out through their nostrils. Trust me.
3. Interestingly, all of the other language multiple-scripts parallels that you mention are treated as one language here on Wiktionary. Following that logic we should also treat B/C/S words as one.
4. You'd be explicitly forbidding non-stub entries on Latin-script entries other than ==Croatian==. For 15-20 million non-Croats that write them exactly the same way, that would be discriminatory and offensive. The treatment of what are today separate standard languages must be equal - either all of them are treated commonly under one header such as ==Serbo-Croatian==, or all are given an option to have full-blown separate entries. We cannot and must not (NPOV!) discriminate among them.
5. As for this "Montenegrin" - all it got was a proposed orthography with 2 additional graphemes denoting allophones, and which is not used in Montenegrin media or schoolbooks at all (plus the "old" spellings in conformance with phonological Serbo-Croatian orthography are also valid, thus both e.g. sjekira and śekira). Montenegrin language still doesn't have dictionaries, grammar books and no literature at all that uses the proposed orthography, no ISO code and its standardization is by no means finished, and will take at least 3-4 more years. The "Montenegrin language" which is today taught in the Republic of Montenegro is just another codified variety of Ijekavian Neoštokavian (i.e. the Serbo-Croatian), which was up until 5 years ago called "Ijekavian Serbian" in the Montenegrin constitution. They were teaching srpski jezik for a century, and now have switched to crnogorski jezik - simply by a name change. --Ivan Štambuk 14:36, 23 July 2009 (UTC)
There is no "forcing" and no "forbidding non-stub entries". Full sections are always allowed. But we can routinely start with entries as described; people doing wiki work don't generally waste their time doing things like duplicating sections needlessly. They will work on what they want and are comfortable with, that they see as useful. So in practice we will see little duplication. Lots of stuff to do. And yes, Montenegrin is very much wait and see. No hurry. Robert Ullmann 14:58, 23 July 2009 (UTC)
• Could you two go and do this somewhere else? Ƿidsiþ 15:05, 23 July 2009 (UTC)

Just to try to make it one simple statement: If we always add Latin when adding Croatian, and always add Cyrillic when adding Serbian, then in practice there will be little duplication resulting, as the entries and useful content exist. People can then work on whatever they like, they aren't restricted to stubs or whatever. The rest is all details to be standards compliant. Robert Ullmann 15:27, 23 July 2009 (UTC)

It could not be easily botted because for every one of them you'd have to manually check, so it's easier to do it by hand in the first place the first time you create an entry. Plus it wouldn't solve the redundancy for the users - B/C/S are basically always taught "in package" as a FL (I dare you to find me a Western uni that teaches only e.g. "Croatian language" or "Bosnian Language" :P), and those who'd expand entry on one B/C/S sections would have to manually propagate it to others.. This is a collaborative wiki, and that's the basic "problem" that the duplication approach faces, where basically what is one word could be found on 4 or 5 different pages - e.g. rijeka, ријека, reka, река.. Almost as messy as Mandarin which has entries in 4 scripts: simple/traditional, pinyin with and without tone marks. --Ivan Štambuk 16:03, 23 July 2009 (UTC)
Robert, your assumptions such as "people doing wiki work don't generally waste their time doing things like duplicating sections needlessly" - are unfounded - I've personally created thousands of ==Croatian== entries that Dijan simply copy/pasted to ==Bosnian== and ==Serbian== (with conversion to Cyrillic in the case of former, and minimal changes when they were required such as switching lang= codes etc.). This is exactly why we're proposing the unified approach. I think that every nation has a right to call its own language however it wishes, but also think that it has no right to exclusively misappropriate to itself something that is shared with other nations. Your proposal would work if people were not likely to be offended, frustrated or simply feel "bad" to see a full-blown ==Croatian== entry with pronunciation, etymology, inflection, example sentences....but all perfectly applicable to standard Bosnian and Serbian (and Montenegrin, when/if it becomes standardized, with the bulk of Montenegrins identifying themselves as Serbs - likely to end up in the same cul-de-sac as "Moldovan language"). And only to find out that having ==Bosnian== and ==Serbian== is "not recommended" to reduce the duplication! You are simply putting too much faith into average Balkaner. I personally don't feel like any kind of a radical nationalist (though I very much worship Croatian cultural heritage, but in form of "good nationalism"), but would nevertheless feel very bad if I had to create only ==Croatian== words as a form of a workaround to reduce the content duplication. No, I don't want to do that. All standards must be treated equally - either all in a unified section or all in full-blown separate sections without any kind of "preferences". --Ivan Štambuk 16:03, 23 July 2009 (UTC)

All of this? All of this? All of the disruption, damage to entries, breaking of standards compliance, demands for s/w changes to deal with a special case, a huge amount of work for others, all of the abuse poured on anyone who dares to disagree? All of that is because you thought or think—incorrectly—that you had to waste your time duplicating things instead of just working on, e.g. the Croatian section(s)? All of it?

We can do any sort of minor technical magic we like; "LST" is not very good, but we can do something similar with a bit of code. Or do 'bot work to auto-sync sections, something that would be useful in many places. We don't have to do something enormously disproportionate to what is at most a minor problem. If in the meantime, you think something is a "waste of time": don't do it! Robert Ullmann 12:36, 27 July 2009 (UTC)

I must kindly ask mr. Ullmann to cease at once with empty rhetoric wordplays, drop the ad hominems, quasi-cynical overtones and dangerously misleading abuse of words (such as "disruption", "damage", "breaking of standards" etc.), rewording his arguments in a way that is logically parsable to a civilized human brain. --Ivan Štambuk 13:07, 27 July 2009 (UTC)

Attributive use of nouns and translations thereof

Despite efforts to remove them, we continue to have many adjective sections in entries for English nouns that are not comparable and which could not be verified as being used gradably or as predicates. That is, most language professionals would not call them adjectives. Even if we took the view that we should cater to the needs of users who don't know any better, we would have to duplicate the senses of the noun. Contributors continue to add them. Many of them also attract translations.

1. Should we have some kind of bare-bones L2 section that referred users to the noun for senses?
2. If we have such a section, should we "forbid" translations on the grounds that it is a kind of same-page redirect and the space taken by the attributive adjective translations reduces the main value of the entry?
3. If a language uses different words for attributive adjective use of the English noun than it does for true nominal use of the noun, shouldn't they both appear in the noun translations?
It would be a great help if we could resolve the presentation issues involved. Some standard templates for the adjective section could reduce the number of additions that are simply deleted under current practice. DCDuring TALK 18:09, 23 July 2009 (UTC)
What does "many" mean? I haven't seen such a section added in a long time, despite much patrolling. Generating a whole new system of dealing with erroneous entries is a large investment of time and discussion effort. I don't see that it's worthwhile to craft a special means of dealing with this issue, since it doesn't seem that common a problem to me. --EncycloPetey 18:16, 23 July 2009 (UTC)
It would be nice if we had actual statistics on entry characteristics, patrol rollbacks. I just keep on coming across them. Today it was forest#Adjective with ten or so translations and a silly definition. If it isn't much of a problem, then the level of grammar knowledge must be higher than I thought or we are very good at discouraging contributions from typical users.
Do contributors adding translations to nouns add adjectival forms if their language requires a separate form for attributive use? Should they? DCDuring TALK 18:38, 23 July 2009 (UTC)
But it certainly is possible to look at the edit history to determine when the Adjective section was added. In the case of forest, the adjective section was added around the time the article was created on 14 Dec 2003 [10], so this is a long-standing cleanup issue, and not a new problem.
As for FL contributors adding non-noun translations to noun entries, I would say "no" (in general). In Latin, a "different form" is used, but it's the genitive of the noun. Where it is a different word, the adjectival use translation certainly should be included as a Related or Derived term on the FL entry, but that should be enough. I would only include the non-noun translation in those cases where the language of translation doesn't have a noun with that meaning, as sometimes happens with translation of language names for example. Latin identifies language using adverbs and Slovene uses adjectives, with neither language using a noun to name a language. The only other exception I can think of are English gerunds and participles, but then again they aren't quite the same as other nouns. Some languages use a different part of the verb to fulfill the same function. --EncycloPetey 18:52, 23 July 2009 (UTC)
I agree with EP - these should be listed at ====Derived terms==== of the FL noun. Usually these are formed by using a small set of well-known suffixes that add the meaning "of or pertaining to the X". For example, if you look at the Slavic languages translation at the disputed attributive use of forest (Bulgarian, Czech, Macedonian, Polish, Russian, Slovak), you can see that these are usually -ski and -ni (in various variants). Ideally these should be listed at the translation table of a real English adjective with a meaning corresponding to this attributive usage (if there is one, something like *silvic, *sylvic...apparently there are forestal/forestial ?? ), and not here.
Also, we still haven't solved the case of where there is obviously attributive use, but with a clearly discriminated set of meanings other than the simply attributive usage pattern of a noun would suggest. The usage of (noun adjunct) tag such as the one as on satellite seems to me the best approach, unless someone has the problem of English noun having mostly foreign-language adjectival translations.. Arguably the same can be said for forest itself, e.g. the quite separate attributive meanings beside the basic one 1) "of or pertaining to the forest" such as 2) "living or growing in forest" (~ animals, plants..) 3) that has something to do with growing and utilizing forests (~ worker, economy..) and similar. If such characteristic discriminated attributive usages could be identified, it would be good to have them listed.. --Ivan Štambuk 19:55, 23 July 2009 (UTC)
One problem with the [[satellite]] case is that the first instance of noun adjunct use seemed to suggest that there was an attributive use of the noun without there being an underlying noun sense. The word "satellite" has the sense of a subordinate entity as a noun though it is no longer as commonly used in this sense as it was in the mid 20th century.
Using separate senses under the noun PoS heading to show attributive use would require duplication of almost all senses. That has been one of the reasons that we don't want a full adjective PoS.
Also, I don't think the word "adjunct" is sufficiently unambiguous in linguistic use to be a desirable term in Wiktionary context tags. DCDuring TALK 12:07, 3 September 2009 (UTC)
Neither "forestal" and "forestial" (0 occurrences in COCA or OneLook) nor "silvic", "woodsy", "sylvan", or "forested" convey the meaning of most attributive uses of "forest". It is more fruitful to find synonyms among a range of prepositional phrases: "of forests", "in forests", "from forests", "over forests", "for forests", "about forests", the preposition being appropriate to the combination, sometimes depending further on the context. And the sense of the preposition varies a bit depending on the noun. (This is why we find no shortage of users who want to add many of the combinations.) I suspect that an attempt at a complete lexical approach at forest#Adjective offers no value to a native speaker and little value to anyone else.
That is why the only alternative to the present approach that seems at all attractive to me would be to have a one-line `{{non-gloss definition}}` at forest#Adjective that referred folks to Appendix:Attributive use of nouns in the English language (or a Wikipedia equivalent if any) and forest#Noun.
As to the other problem you raise, "noun adjunct" is no solution at all for normal users, except in the sense that it offers a warning that the dictionary using it may not be for them. "Noun adjunct" appears 0 times in COCA and only in RHU among the other OneLook dictionaries. DCDuring TALK 20:50, 23 July 2009 (UTC)
I was not saying that any of them do, (can someone please define forestal/forestial? :), but was only suggesting that in lots of cases there would be a "proper" English adjective that denotes some of the frequent attributive usages of the noun in inspection, and if there weren't it would still be of value to the end-users (especially ESL users) to list some commonly used attributive applications with obviously discriminated set of meanings in the definition lines, and provide translations for them. We don't have to do it, or use weird constructs such as "noun adjunct" (Thisis0's idea IIRC) - but the absence of such definitions elsewhere (namely other paper dictionaries) shouldn't really bother us. We could be exploring new lexicographic frontiers, boldly go where no dictionary has gone before..all under the reasonable justification of providing more useful educational content and more thorough description of language use. --Ivan Štambuk 21:19, 23 July 2009 (UTC)
Re "can someone please define forestal/forestial": Done. Also forestlike and foresty.​—msh210 23:18, 23 July 2009 (UTC)
I don't think that our much greater available capacity helps with problems of search and with the limitations on human working memory and the amount of information that we can scan visually. We already hear frequent complaints about the complexity and length of our English entries. We would have to pay much more attention to usability than we have to actually convert any of our fantasies into something close to reality. The challenge is to deliver all and only what a user wants and needs based solely in the information provided by his keystrokes-and-clicks-of-the-moment (for anons, adding preferences for registered users).
That challenge is far beyond the scope of the topic at hand.DCDuring TALK 21:51, 23 July 2009 (UTC)
The challenge is to deliver all and only what a user wants and needs - Uhm sorry but I don't think it's acceptable to confine ourselves with the user vulgaris, which is by definition semi-literate imbecile :) Our target audience are primarily reasonably intelligent people who'd be using Wiktionary as an educational resource, and are willing to spend something like max 5 minutes learning how to effectively use the structure of the entries, and language-specific policy pages. I.e. not the type of folks who come by Google searches and leave comments such as "I can't find the definition" [11] ^_^ If there is a need for idiot-proof interface (and there probably is, tho idiot-proof would be too harsh of a word, what I meant to say default interface displayed to such users-"paratroopers", with e.g. no etymologies, translation tables etc., but with others (registered ones?) switching to more "complex" interface as it is now), it must not be at the expense of professionality of the overall approach. We should take the best from all the dictionaries and make it better! --Ivan Štambuk 22:49, 23 July 2009 (UTC)
I always so enjoy such frank displays of contempt for others. I don't see much humor in them. I'm grateful to the language learners who at least leave some information, as opposed to the folks who look at our entries once and find that they prefer Merriam Webster Online. I'm sure that the 0.1% of internet users who hit Wiktionary include all the most elite of the worldwide intelligentsia. Of course, perhaps some of those users are actually the user vulgaris. We already have seen the contempt you display for your fellow Wiktionarians. I suspect that the circle of those you truly find adequate does not extend far beyond your own reach. DCDuring TALK 23:31, 23 July 2009 (UTC)
There's no need for such petty sarcasm DCDuring, and I asked you more than once to spare me of it. OK? Frankly I think that the user who can't find the definition lines at manifesto isn't particularly intelligent and that we shouldn't really bother ourselves whether his mind gets "hurt" when stumbling upon sections with all the fun stuff containing difficult words such as etymology or IPA/SAMPA. There is no point in saying that we shouldn't add valuable content to Wiktionary (obviously valuable and useful, as attributive nouns/adjectives that you dispute have been created countless times since the inception of Wiktionary and users were adding FL translations etc.) simply because it would make it "more difficult to scan for definition lines". Hopefully one day when and if we migrate this project out of this horrid software into something much more appropriate, we could account for them too, providing big fonts for definition lines, putting etymologies in smaller fonts at the end, similarly like dictionary.com and MWO have been doing. But in the meantime, there is simply no point in sacrificing the content (!) at the expense of such userbase. I can sympathize with your efforts to draw more of such casual surfers, for Wiktionary to be competition to free/commercial online services such as Merriam Webster Online, but that's simply the issue of content presentation and not primary goals of this project, and we'll get there sooner or later. As I said, our primary audience should be people that are willing to spend maximally 5 minutes getting to know how to use Wiktionary effectively as an educational resource, and not those who give up after <5 seconds of skimming. --Ivan Štambuk 00:13, 24 July 2009 (UTC)

We are beyond petty sarcasm. I find your expressed contempt for others simply outrageous. DCDuring TALK 00:28, 24 July 2009 (UTC)

I don't understand, what "contempt" are you speaking of? Are you saying that a person who's unable to find the definition lines at manifesto is a representative Wiktionary user? I think not. I don't think that we should sacrifice valuable content to keep such folks more happy. For every such "this is oh so complex" complainer there are countless other users who are simply using Wiktionary silently every day. If we were to make the first group of users our primary target audience, as you suggest we should, we might as well get rid of etymologies and pronunciation sections too - the user himself explicitly said it: "I don't want it, I don't want to scroll down - I just want the definition lines!". If he is lazy, that's his problem, not ours.. --Ivan Štambuk 00:38, 24 July 2009 (UTC)
Re: "For every such 'this is oh so complex' complainer there are countless other users who are simply using Wiktionary silently every day": I would be so, so much happier if we knew this for sure. Part of me worries that for every such 'this is oh so complex' complainer there are countless other users who don't even notice the link to leave feedback. (TBH, I don't even 100% get how a user can find that link with no problems, but not see the definition. You speak with such seeming confidence, while I find I have nothing but questions.) —RuakhTALK 00:56, 24 July 2009 (UTC)
Statistically the feedback would always be in greater volume by those who have something to complain, than by those who are just happy to use another free Web page. I don't know if there is some kind of "law" on this, but I suspect that there is. This would esp. be valid on wiki-type collaborative effort where most of such complainers subconsciously think that the contributors should "work" for their cause. In any case, my point still remains: there's no point in making up for the presentational deficiencies of the MW software by degenerating the (possibly) valuable content. The problem of such users should be solved on the presentational not data layer. Selectable "irrelevant" data can be made hidden either by default or by a user-set preference, but non-existing data cannot be displayed at all to the potentially interested. --Ivan Štambuk 01:31, 24 July 2009 (UTC)

acesibility[a.of info]=1.p.=def+ipa[4nonnativs--史凡 - Pl also use MSN/skype as I suffer RSI and so cannot type very well! 06:14, 24 July 2009 (UTC)

(Unindent) I find it unsurprising when a newcomer has difficulties finding a definition in the overflow of section headings and other structural elements of the page (including the wiki buttons at the left and at the top), especially given that, at the moment the new user visits Wiktionary, he may have his working memory full of his own context, such as a text he is reading that contains a word for which he wants to find a definition. As far as I am able to estimate, a user's inability to find the definitions section the first time around here is a sign of his information load rather than a sign of low intelligence. It may also be a sign of attention deficit or any other condition that can be found in the broad populace, a deficit which, per se, is no sign of low intelligence.
Adding a heading "Definitions" could be an option. Or not. The point is that usability concerns have nothing to do with the intelligence of Wiktionary users. --Dan Polansky 07:25, 24 July 2009 (UTC)

No Wiktionary admin should address offending words such as "imbecile" to the world's populace at large and get away with it. I do not know how the case of such repeated abuse should be handled. I would like someone more experienced than I am to initiate a disciplinary action. The evidence of abuse, while not yet collected, is plentifully scattered throughout the recent discussions. Options include: (a) Ivan Štambuk should be desysoped; (b) Ivan Štambuk should here publicly apologize for blankly attacking people at large; (c) Ivan Štambuk should get a warning block. Wiktionary community should clearly indicate that this kind of behavior is unacceptable in an Anglo-American civilization, or correct me if I'm wrong. --Dan Polansky 07:26, 24 July 2009 (UTC)

No. Firstly he can't be the only one, secondly, calling names does no particular damage, thirdly, what right does any of us have to enforce our world-view on anyone else? The Internet cannot be censored, many attempts to try it have failed, if you don't like what people say here, ignore them or go away. It would of course be different if we had evidence of this name-calling being destructive to building the dictionary, but I can't see that it is. Conrad.Irwin 07:37, 24 July 2009 (UTC)
This is a particular wiki community, not internet at large. I have seen no one else here make such blank offending remarks, to people who did nothing to him but provide feedback upon our request, one which AFAICT was useful. I don't know what kind of evidence there should be that vulgar name-calling harms building the dictionary; a statistical study?
While desysoping or blocking may be too harsh, a clear statement of disapproval could be in order.
For a comparison, Connel MacKenzie earned a vote of reprimand[12] for something[13] much less abusive than the behavior of Ivan Štambuk, AFAICT. --Dan Polansky 07:44, 24 July 2009 (UTC)
What's "Anglo-American civilization" supposed to mean? --Vahagn Petrosyan 07:54, 24 July 2009 (UTC)
I am afraid I have to admit, after I have written that, that the use of the term "Anglo-American civilization" is but an expression of some kind of bias of mine, resting on my incomplete experience, and I am now sorry that I have used it. --Dan Polansky 08:04, 24 July 2009 (UTC)
It's just that I've seen people express the notion that somehow the content on en.wikt is intended for native English-speakers — something I firmly reject. --Vahagn Petrosyan 08:15, 24 July 2009 (UTC)
That is not what I meant at all: my understanding is that English Wiktionary is meant for anyone sufficiently proficient in English to peruse its content, including intermediate learners of English as a second language. I expect, say, Czechs to use the Czech entries that I create here. --Dan Polansky 08:36, 24 July 2009 (UTC)
ignore them or go away: it's well possible that some editors have actually chosen the second option (just a feeling, it would be interesting to study statistics). And I know that a number of editors get tired. Lmaltier 07:56, 24 July 2009 (UTC)
I would like to see the data. Certainly, Ivan has gotten rather personally involved in the current vote regarding Serbo-Croatian (as well as various related discussions), and has lost his temper more than once. I believe he has already apologized for this, but it might not hurt for him to do so again here. Ultimately, I think the context is of the utmost importance. Rudeness to a new contributor who has made an honest mistake, but is acting in good faith is very problematic, and should perhaps bear consequences. Rudeness to an editor acting in bad faith, or simply not picking up our standards quickly enough to be worth the trouble is, in my opinion, good practice. Rudeness to a veteran contributor varies a great deal, especially with regards to whom the editor is. Now, I think that all of us should have somewhat thick skin; this is Wiktionary after all. Additionally, some of us simply discuss in a rather curt manner. Robert has certainly received some abuse from Ivan, but I doubt any more than he has given in turn. At the same time, Robert is enough of a grown-up to not hold a grudge and get on with life (speaking from personal experience here :-)). I disagree with Conrad......I wonder if that's ever happened before? As an anarchic community, we have every right to react in whatever way we wish. I think that the primary (if not sole) consideration is the process of building a multilingual dictionary. If we view someone's attitude or comments (or personal philosophy, or favorite breakfast cereal for that matter) to be in opposition to that goal, we have every right to impose whatever violence we deem most in keeping with our goal. Considering the fact that Lmaltier is right in that rudeness can potentially drive away editors, Ivan's rudeness is something worth examining. However, we should also bear in mind that Ivan is a remarkably valuable contributor, and driving him away would be a detriment to the project.
In any case, it is my hope (and my impression), that Ivan has the depth of character to admit mistakes, when made, and attempt to rectify them. -Atelaes λάλει ἐμοί 08:24, 24 July 2009 (UTC)
Just to clarify: I am not referring only to the snippets "HELLO DCDURING!" or "Can you understand something as simple as that?" addressed to Lmaltier. These should not be there, yet are still within bounds. What particularly struck me as completely inappropriate is this:
• "Uhm sorry but I don't think it's acceptable to confine ourselves with the user vulgaris, which is by definition semi-literate imbecile :)"
Not that I know what google:"user vulgaris" is supposed to mean exactly other than the broad mass of less educated people, anyway. --Dan Polansky 08:46, 24 July 2009 (UTC)
user vulgaris was not meant to mean "less educated population" (and I don't see how the education has anything to do with surfing ability or the basic intellectual competence to utilize an online dictionary), but the type of surfer that expects that everything on the Internet should be served to him in a manner that he shouldn't bother to waste more than 5 seconds to pick it up. The user who made the comment "I can't find the definition lines" on the entry on manifesto was a prime example of that - what bothered him was the ToC (which shouldn't've been left-aligned granted its size) that he had to scroll down by one mouse click, which "frustrated" him and made him (imagine that) write several sentences at the feedback page on how he's not interested in anything else but the definition lines. Internet generations are getting more and more dumb-lazy, and instead of paying due respect to the immense amount of effort donated by contributors in their free time to provide them the type of quality content that you cannot find anywhere else, they're doing what exactly? - Bitching on how they can't skim the definition lines immediately, and must lose precious 5 seconds they could've otherwise spent on WoW, Facebook, Twitter, type 3 words in IM application and similar types of intellectually very demanding activities. Also Polansky, note that I haven't personally addressed that user as an imbecile - it was merely a qualification of the mental state of the casual surfer that DCDuring was advocating as being the only (!) relevant point of reference for the content that we provide. Which is pointless, as I said, because if he'd be confused by (noun adjunct) in definition lines which would illustrate some very common phrasal attributive usage of the noun in senses that cannot be easily deduced from the main definition lines (which should esp. be true for ESL learners), he would certainly also be confused by stuff such as etymology and IPA/SAMPA transcription which we abundantly provide and have no intention of removing.
Re recent SC business: Lmaltier and DCDuring were doing everything they could to obstruct the proposal, every little "difficulty" that came up was greatly hyperbolized and supported by them (to the point it became tragicomic, like using a spellchecker to parse FL lang="xx" tags upon viewing (!) web pages). When I saw Lmaltier canvassing opposing votes at French Wiktionary and meta discussion boards, bringing folks who voted for oppose and who'd otherwise 100% not vote (as they don't care) or even notice the vote, it makes me wanna break things so EXCUSE ME Daniel Polansky if I overreact from time to time, as NOBODY of you guys stood in defense when there was a mountain of personal attacks directed at me, the same fallacious "arguments" reiterated over and over again, or people simply willfully trolling.
As for the desysop - I'll desysop myself when the vote ends (but no sooner because I have reasons to believe that I'll be abusively blocked in the meantime), but note that this has nothing to do with your comments on this topic, the "hive mind" opposing clique, and the decision was made much earlier. --Ivan Štambuk 11:43, 24 July 2009 (UTC)
Let me full-quote the allegedly imbecilic semi-literate user:
• "I cant find the definition. I just want the definition. Keep the other things too, like the origins and stuff, but make the definition clear or like easier to find, see or get to.Im saying people dont want to look through all that stuff when all they really want is a definition. My suggestion is to make the definition a bigger font size and place it at the top, meaning you dont have to scroll down and prefferably at the top half of the screen."WT:FEED#manifesto.
Judge for yourself. --Dan Polansky 12:51, 24 July 2009 (UTC)
He apparently managed to find the tiny WT:FEED link but not the definition lines at manifesto? Either an imbecile or simply lying (which is more probable). --Ivan Štambuk 13:00, 24 July 2009 (UTC)
Self-documenting. --Dan Polansky 13:42, 24 July 2009 (UTC)
Can you please be more explicative? I do not find such laconic replies particularly understandable or illustrative of a point I might be missing. I am firmly convinced that pretty much anyone reasonable must agree with my remark: It's almost unbelievable that the user managed to find the über-tiny WT:FEED link and not see the definition lines at manifesto. The only reasonable conclusions are that he is simply lying, or is indeed an imbecile par excellence. The possibility of suffering from some reading condition must be ruled out, as well as the possibility of being too lazy to scroll down to see the entry itself (with the effort he took to post the comment on WT:FEED). My conclusion is based on simple and intuitive logic. It would be much more desirable if you argued on the basis of possible fallacies in the reasoning that led to such conclusions, and not the conclusions themselves, as in the latter case you don't really invalidate them but only demonstrate the inability to cope with a very sad fact: The Web is also surfed by idiots and there's nothing we can do about it. --Ivan Štambuk 14:30, 24 July 2009 (UTC)
You want some explanation. I'll try to explain to you what I can guess. The link to manifesto you provide is term|manifesto|lang=en but, obviously, this is not what the user entered. I suggest that the user simply entered manifesto, then clicked on Go. He got a page with a table of contents. The beginning is: 1 English 1.1 Etymology 1.2 Pronunciation 1.3 Noun 1.3.1 Translations 1.4 Verb. He tried to find what link he had to click to display the definition. It's not Etymology, nor Pronunciation, nor Verb. It might be Noun but, unfortunately, the Noun section (1.3) is composed of a single subsection Translations (1.3.1). So, where is the definition? He probably found it quickly after that, but not at once, because the table of contents is somewhat misleading. This is what I guess, I may be wrong.
He wrote explicitly I cant find the definition. - so he didn't find it. He also wrote Keep the other things too, like the origins and stuff - so he obviously did browse down to the etymological section of the ==English== entry, by clicking whatever got him there, and the definition lines for verbal and nominal senses are exactly below, so it's unbelievable that he could've missed it. But it could have also been what you write, that he indeed found it afterward, and that the first sentence of the feedback was a rhetorical play meaning "I wasn't able to find the definition lines immediately" not "I wasn't able to find the definition at all".. But I don't want to hypothesize on that. --Ivan Štambuk 15:33, 24 July 2009 (UTC)
I reckon also, maybe the new user misunderstood, and Mr Polansky is unnecessarily stirring the shit. --Rising Sun 13:19, 24 July 2009 (UTC)
And the following has been posted here by Ivan Štambuk and subsequently deleted, yet should be struck out using <del> instead of removed:
• "Now that I think about it, why I am all this writing to you, as you were the ones who advocated that DIREKTOR's vote be struck, and Pepsi Lite's kept and counted.."
Ivan Štambuk is as accurate as always. So, Ivan, if you want to sincerely apologize instead of shouting at me with "EXCUSE ME", you have the option. --Dan Polansky 13:32, 24 July 2009 (UTC)
I confused you with Prince Kassad's comments. I deeply apologize for making such a cardinal mistake, and the exceptional emotional grievance that it might've possibly caused.
As for this: "Ivan Štambuk is as accurate as always." - Isn't it ironic that the initiator of this "abusive report" is acting abusively himself? I find it comical that you, DCDuring and RU always suddenly become cynical when you have zero arguments left in the discussion. --Ivan Štambuk 13:43, 24 July 2009 (UTC)
• In my personal opinion, although I have been very critical of rudeness among admins in the past, I think this is a situation born of sheer frustration. I always found Ivan a polite and considerate admin and I have seen his patience gradually disappear over the past couple of weeks. It's a terrible shame things have degenerated this far, and I'm sure an apology from him and others would help, but I think starting discussions like this is, to put it mildly, not likely to help restore good relations. Ƿidsiþ 14:23, 24 July 2009 (UTC)
I am happy that you have had a good experience with Ivan. Have you ever had any occasion to disagree with him on a matter he cared about? DCDuring TALK 15:10, 24 July 2009 (UTC)

"Uhm sorry but I don't think it's acceptable to confine ourselves with the user vulgaris, which is by definition semi-literate imbecile :)"

Maybe I just don't understand anymore. As far as I can see this thread is just another "let's waste time by annoying Ivan" thread, of which there are already several. What will this thread achieve? We have had, as has been pointed out, situations similar to this, the end result of which looks (to me) as though Connel MacKenzie was hounded off Wiktionary by "us". I sincerely hope that we don't lose another trained, productive, (if with possibly slightly different opinions) editor this way. Was there any damage caused by the statement that this thread is purported to be about? No. Is it conceivable that attacking people for no reason will cause them to go away? Yes. Can we reasonably enforce any sanctions? No. (look at Wonderfool). Will forcing someone to apologise for something they said make them think it is false? No. The whole Serbo-Croatian fiasco has caused many people to lose their tempers, and I'd much rather see a happy resolution to that discussion than an apology to those who haven't even read what he said. I cannot see how forcing only Ivan to apologise is rational, productive or fair - is there something I'm missing? Conrad.Irwin 23:02, 24 July 2009 (UTC)

I agree with most of the above (Conrad's).​—msh210 18:17, 27 July 2009 (UTC)
Support --Bequw¢τ 00:23, 25 July 2009 (UTC)
Oppose but only in case this turns into an actual vote of some kind. People should say what they think; we don't need a lot of tedious behaviour constraints and sanctions like Wikipedia. If somebody is here only to be malicious and obnoxious and not to contribute, they'll end up getting blocked anyway. There is no need to stifle legitimate arguments between bona-fide contributors, even if they get hot-tempered. Equinox 00:28, 25 July 2009 (UTC)

Can an orthographic word be too SOP for inclusion?

I'm sorry if this has been discussed on numerous occasions already, but I'm wondering whether it's possible for an orthographic word to be considered too sum-of-parts for inclusion. For example, in Hebrew, the definite article and certain prepositions are written as prefixes to the word they modify; thus "the house", "in the house", "with the house", "like the house", etc., are all orthographically single words. Should such entities have separate entries at Wiktionary, or would they be considered unidiomatic sums of parts? If they should have entries, what should their L3 heading be? ===Phrase===? Angr 15:04, 24 July 2009 (UTC)

Their presence may be useful, especially when you want to use modern techniques allowing to get the definition of any word from any website (add-ons). These tools cannot analyse the words, of course. Their presence may be useful to beginners, too.
But it's a difficult problem (e.g. what about German words such as Seestrasse (Lake Street))? My feeling is that they should be allowed, but only with at least one citation, in order to mention only those likely to be searched. Lmaltier 15:37, 24 July 2009 (UTC)
Well, Seestraße is different, first because it's a compound and second because it's most likely to be a proper name (though it could be used to mean "a lake road"). For compounds it's fairly straightforward to include only the attestable ones (and their part of speech is unambiguous), but for cases like the Hebrew, it would mean potentially adding four more entries for practically every word listed in Category:Hebrew nouns: one for "the X", one for "in the X", one for "with the X", and one for "like the X". And again, would we call these phrases? Angr 15:50, 24 July 2009 (UTC)
I'd suggest to do the same, and to include only attestable ones, with a citation. It's also somewhat similar to cases such as no in Portuguese or au in French. But I have no idea about how they should be called. Lmaltier 16:07, 24 July 2009 (UTC)
Currently, we call no and au contractions, but I don't think that would be defensible for the Hebrew examples. Angr 16:12, 24 July 2009 (UTC)
Re: "potentially adding four more entries": Not four, but rather hundreds. In the discussion that Ivan links to below, I gave the example of וְשֶׁכְּשֶׁמֵּהַפֶּה(v'shek'shemeihapé, and that when from the mouth). I really don't have a good suggestion for how a Hebrew learner would know to look up פֶּה(pe, mouth), but I don't think we can feasibly add all these word-strings, just because they're orthographically and phonologically indivisible. They're not even grammatical units; the one-letter words just attach orthographically and phonologically to whatever word follows, even if syntactically and semantically they don't, as with מִבְּעֵרֶך(mib'érekh, from approximately). (Admittedly, I don't think English orthographic words are always grammatical units, either — for example, the -ed in "blue eyed" seems to attach to "blue eye(s)" — but in English that's not so common, and I think it's easier to handle. I just don't think we can even attempt something like this for Hebrew.) —RuakhTALK 23:43, 24 July 2009 (UTC)
Not hundreds attested, Ruakh.​—msh210 18:31, 27 July 2009 (UTC)
Eh. Maybe, maybe not. I guess it depends; some words are more frequent than others, so some word-sequences are more frequent than others, and therefore more likely to reach our attestation threshold; for example, the five-word sequence "וּכְשֶׁלָאִישׁ" ("and when to the man") gets six Google hits, while the identical "וּכְשֶׁלָרוֹפֵא" ("and when to the doctor") gets none — but with any word we're talking about more than four (except with words that are so rare that they themselves are hardly attested). But the attestation issue makes things worse, not better, because that means it's not even bottable. And anyway I think it's a misguided notion; it doesn't really make sense to describe וְשֶׁכְּשֶׁמֵּהַפֶּה as "unattested", any more than it would to describe its English translation as such. It's not some hypothetical form, but the normal string of words with that meaning-fragment. (BTW — not that it affects my opinion either way — but if a single well-known work happened to include the word-sequence וְשֶׁכְּשֶׁמֵּהַפֶּה, would that count as attestation for e.g. שֶׁכְּשֶׁמֵּהַפֶּה? Or would we say that word-sequences are only "attested" if they're attested with nothing attached to them?) —RuakhTALK 19:50, 27 July 2009 (UTC)
(I've just come across this discussion, so am adding my belated comments.) As I've mentioned at other times about Hebrew prefixes (or proclitics) and German compound nouns, I think that English Wiktionary should have anything (attested) that an English reader is going to consider one word, which includes everything from one space (or punctuation mark) to the next. (I don't know anything about CJK languages or languages that don't use spaces, so my comments may not apply.)​—msh210 18:31, 27 July 2009 (UTC)
• Related discussion here, until some of the Hebrew folks answer this specifically. Needless to say, if we allow it it would open a Pandora's box of possible stub entries in a number of languages, which wouldn't present much valuable content except for extreme beginners. It would be much better to have these listed in some inflection tables at the corresponding lemma entry, so that they at least come up in search results. --Ivan Štambuk 16:27, 24 July 2009 (UTC)

This is an interesting discussion, because it's important for German. Compounds in German are simply joined together, but I don't think we should include every single one of them (though I failed to get rid of Wetterlage). -- Prince Kassad 17:53, 24 July 2009 (UTC)

Hungarian, my native language, uses many morphemes that are written onto a word without leaving a space (i.e. prefixes and suffixes). Some are also words made on the spot, like "with the X", "to the X", but there are not just four, like you say with Hebrew, but many (there are about 18 which have Latin names and which are called 'cases'). The situation is similar for any agglutinative language, like Finnish. The entries for these words are so-called "form-of" entries. See Hungarian márciusokkal (with Marches). These are extremely short entries and are included because they are single words, so one might encounter them somewhere and may want to search for them without knowing the stem. I think this will work (or already works) with Hebrew as well. Qorilla 18:21, 24 July 2009 (UTC)
I don't know about Hebrew, but doing this for Arabic would be insane. There would be thousands of entries for every single word of the language. Contrary to what Lmaltier thinks, such tools actually can, and should, be able to remove prefixes/suffixes/articles/prepositions from words (even if it is not 100% precise, but neither will we ever be). Beru7 18:42, 24 July 2009 (UTC)
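(Editorial illustration of the kind of tool described in the comment above.) A minimal sketch of prefix stripping, assuming unpointed Hebrew text and a hypothetical in-memory lexicon of lemma entries; `PROCLITICS`, `candidate_lemmas` and the sample lexicon are names made up for this example, not part of any existing look-up add-on. It peels one-letter proclitics off the front of an orthographic word and reports every remainder that has an entry. As noted above, this will not be 100% precise, since a stripped letter may really belong to the stem.

```python
# One-letter Hebrew proclitics (the traditional mnemonic "משה וכלב").
PROCLITICS = set("משהוכלב")


def candidate_lemmas(word, lexicon):
    """Return every entry in `lexicon` reachable by stripping leading proclitics.

    `lexicon` is a hypothetical set of unpointed lemma spellings.
    """
    matches = []
    rest = word
    while True:
        if rest in lexicon:
            matches.append(rest)
        # Stop once the next letter cannot be a proclitic,
        # or once only a single letter would remain.
        if len(rest) > 1 and rest[0] in PROCLITICS:
            rest = rest[1:]
        else:
            break
    return matches


if __name__ == "__main__":
    sample_lexicon = {"בית"}  # "house"
    print(candidate_lemmas("ובבית", sample_lexicon))  # "and in the house"
```

The lexicon check does the disambiguation here: the function over-generates remainders and keeps only those that exist as entries, which is why it needs lemma entries but not one entry per word-string.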
Thanks for the information. But it's impossible for tools to know rules to be applied for all languages. The objective of the project is all words, all languages. If something can be considered as a language, it should be allowed. If something can be considered as a word, it should be allowed. Yes, this objective can be considered insane (and, very obviously, it cannot be achieved). Nonetheless, it seems to me a good objective. Lmaltier 19:06, 24 July 2009 (UTC)
The problem is, Lmaltier, how do you define the word word: grammatically or orthographically? Word is not everything between whitespace! In highly-agglutinative languages, and languages with complex obligatory word sandhi (e.g. Sanskrit), that definition of word makes no sense as it would include entire sentences of arbitrary length ^_^ This should all be discussed on a language-specific basis, on the appropriate policy pages, in this case Wiktionary talk:About Hebrew. --Ivan Štambuk 19:41, 24 July 2009 (UTC)
Indeed. For example possessive forms in English have been excluded from having their own entries. In fact there isn't one exact, definitive definition of what a "word" is so it has to be discussed language by language. Beru7 20:02, 24 July 2009 (UTC)
I would say that, if the search feature is improved, entries should be defined for grammatically determined words, but that those entries include tables of the related orthographic words. That way we're still including all the words (though some "second-class citizens") in a logical manner, and in a way that the orthographic (but non-grammatical) words can be found via search. But of course, there are language specific considerations. --Bequw¢τ 00:34, 25 July 2009 (UTC)

Here word is considered in its broad sense as something not broken by a space character. For languages not using the space character, I am totally incompetent, but for those that simply use inflections and suffixing, the entries should be made (maybe with the aid of bots), but they should be very brief, just like márciusokkal. Qorilla 19:30, 24 July 2009 (UTC)

I agree with you. When it cannot be done, well, it cannot be done. We don't include all possible sentences, only (!) all words. Lmaltier 19:56, 24 July 2009 (UTC)
I don't really see how inflections of nouns (to a house etc. in languages like Finnish) are any worse than inflections of verbs (was from be, French inflections like détruisons). Equinox 00:33, 25 July 2009 (UTC)
It is indeed the prevailing consensus — one I agree with — that we should include non-lemma noun forms, just as we include non-lemma verb forms; but I don't see what that has to do with the stated topic of discussion, which is languages where words aren't always separated by spaces. (Discussion drift is O.K., obviously, but it would suck if someone in the future were to come across this discussion and think we've reached a consensus about languages like Hebrew, when in fact we've simply restated the existing consensus about languages like Finnish. The two situations are similar enough to be confused, but not similar enough for such consensi to be transferable. So if we're going to drift, I'd like us to be explicit about it.) —RuakhTALK 01:15, 25 July 2009 (UTC)
We should include all words. Therefore, in my opinion, the only questions are is it possible to consider this as a word? and does it really exist in the language? If the answer is yes (to both questions), it should be allowed, however insane this might seem. If it cannot be considered as a word, it should not be given a page (but a redirect might be an option in some cases?) It's clear that long German words built by catenating nouns are considered as words, and they should be accepted (only when they are actually used). For Hebrew, I don't know. Lmaltier 06:11, 25 July 2009 (UTC)
Compare with house's and houses' in English, or kill myself as a conjugated form of kill oneself. In Spanish, llámame is not a "conjugated" or "declined" form of llamar, it's just verb + pronoun (call me). I'd like to see this sort of entry treated like call me in English, i.e. call + me. Mglovesfun (talk) 12:24, 25 July 2009 (UTC)

Our choice in this matter seems to vary somewhat by language. We do include forms of Romanian nouns with the definite article suffix, so I don't see that doing something similar for Hebrew would necessarily be a problem. However, there are cases (as Mglovesfun has hinted above) where the community has decided not to include the items, such as possessives of English nouns or certain prefixed French words. --EncycloPetey 14:08, 25 July 2009 (UTC)

Re: Romanian definite article suffix: I do think we should include definite forms of Hebrew nouns and adjectives. In colloquial Modern Hebrew the definite article seems to have become somewhat cliticized, but in most forms of Hebrew it's been considered a simple prefix, and I see Hebrew state (definite/indefinite/construct) as closely analogous to Latin case. —RuakhTALK 15:43, 25 July 2009 (UTC)

our convention of transcribing non-rhotic "/(ɹ)/" is factually incorrect

I know it's conventional to transcribe words like bar as /ˈbɑ(ɹ)/ for RP. But it's factually incorrect: in traditional RP, bar is phonemically /ˈbɑɹ/, just as it is in GA. It's only the phonetic realization that differs.

If we assume that RP includes linking r but not intrusive r, then the optionality of the [ɹ] is allophonic, not phonemic. That is, a phonetic transcription of [ˈbɑ(ɹ)] would be correct, but */ˈbɑ(ɹ)/ is not, because there is no underlying change: /ˈbɑɹ/ → [ˈbɑɹ]/_V, → [ˈbɑ]/_{C,#}.
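(Editorial illustration.) The rule above can be sketched in code, assuming a deliberately tiny vowel inventory and handling only word-final /ɹ/; `VOWELS` and `realize` are hypothetical names for this example, not an established transcription tool. The point it shows is the one made above: the phonemic form stays /ˈbɑɹ/, and only the surface realization depends on what follows.

```python
from typing import Optional

# Assumed, incomplete set of vowel symbols; enough for the examples only.
VOWELS = set("ɑɪiəæɛɔʊuʌɒeoa")


def realize(phonemic: str, next_segment: Optional[str]) -> str:
    """Surface form of a word under the linking-r rule.

    Word-final /ɹ/ is kept before a vowel (_V) and dropped before a
    consonant or a pause (_{C,#}). `next_segment` is the first segment
    of the following word, or None for a pause.
    """
    if phonemic.endswith("ɹ"):
        if next_segment is not None and next_segment in VOWELS:
            return phonemic       # /ˈbɑɹ/ -> [ˈbɑɹ] / _V
        return phonemic[:-1]      # /ˈbɑɹ/ -> [ˈbɑ]  / _{C,#}
    return phonemic


if __name__ == "__main__":
    for following in ("ɪ", "t", None):
        print(following, realize("ˈbɑɹ", following))
```

Intrusive r is not modelled: a word without underlying /ɹ/ is returned unchanged, which matches the assumption stated above that RP here includes linking r but not intrusive r.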

If by /(ɹ)/ we're accommodating dialects which lack linking [ɹ], then AFAIK we're no longer talking about RP, which is the label we're using. If we're going to do that, we might as well do the same with /(j)/ in new and accommodate both RP and GA in a single transcription.

If we go with Wells and accept intrusive r as part of modern RP, then we need to transcribe idea as /aɪˈdiːəɹ/ as well. We can't leave /ɹ/ out of the phonemic description, since not all vowel-final words take an intrusive [ɹ]. But I doubt that's what many of us would want.

If we're going to continue with the convention of writing /(ɹ)/, I think we should at least make it clear to our readers that it is not a phonemic transcription, as the slashes imply. I doubt many would care, but it bothers me that we're mistranscribing so many words on Wiktionary without at least acknowledging it. kwami 07:45, 25 July 2009 (UTC)

That is not the current convention at all. That transcription format was abandoned some time ago. For RP, we do not represent the /ɹ/ because standard RP does not pronounce that letter.
I notice that you have already begun sweeping changes to pronunciations of words and rhyme page names, even though this thread has just been started. --EncycloPetey 14:03, 25 July 2009 (UTC)
Note the timing: I stopped making changes, then posted here.
How is it not the "current convention"? As usual, you appear to have private conventions which do not appear to be shared with the rest of the community, with the facile claim that thousands of entries are wrong because they aren't what you want. Pray tell, where are these conventions explained, and where are they discussed? I don't mean archives, but style guidelines etc.
If we don't use "(r)", the problem remains: bar is no more /bɑː/ in RP than it is /bɑ(ɹ)/. Neither is phonemic; the phonemic form is /ˈbɑɹ/ (note that to be complete it must have stress as well as the /r/). When you say that "standard RP does not pronounce that letter", I assume you mean "phoneme" rather than "letter", since these are not transliterations, and that in certain environments it is not pronounced: that is, allophonic variation which a truly phonemic transcription would not reflect. Since you have in the past insisted that anything between slashes be phonemic, I wonder if it truly is the consensus that Wiktionary IPA between slashes not be phonemic. If it is, fine, but we should still make it clear to our readers that our transcriptions are in-house conventions and not accurate. kwami 17:48, 25 July 2009 (UTC)
It is not a matter of private conversations; there are no private conversations on a wiki. You are asking to see documents which have not been written, and this is a known problem on Wiktionary: we do not have written style guides for a significant fraction of what we do. We have current conventions and we have outdated conventions, but in many cases those conventions have never been summarized nor codified in written form. Wikipedia does such things regularly because they have many editors who specialize in preparing policy documents, and because they run on policy. Wiktionary editors are primarily contributors to content, and seldom bother with policy writing until some crisis necessitates it.
Please also keep in mind that some practices and conventions are regularly debated. The issue of phonemic/phonetic/etc. transcription on Wiktionary is regularly argued, debated, beaten to death, then left to rot in a quiet corner until the next cycle of arguing begins. I have been through that debate many times, with no community resolution. If you have specific, clear, well-reasoned (and implementable) ideas, then my suggestion is to write all of them down in an essay to be discussed by the community as a distinct entity, and which may then result in some policy decision on the matter. As I say, though, some of the issues involved have never been satisfactorily settled, and there are many people here with polarized opinions on the issues involved. --EncycloPetey 19:23, 25 July 2009 (UTC)
Fair enough. I am more concerned with making false statements than with expounding a specific policy, or choosing between phonetic and phonemic. At the very least, in this case I propose that we transcribe bar as either phonetic ['ba(r)] or as phonemic /'bar/, but that erroneous */ba(r)/ and */ba:/ be avoided, unless in the keys we provide our readers we explicitly state that we follow a conventional dictionary usage that is factually incorrect. kwami 19:43, 25 July 2009 (UTC)
Note: I wouldn't agree with any of those transcriptions. I wouldn't use /a/ or /r/ for any English word, barring some Australian words and Scottish dialects. We had a vote on using /ɹ/, so that is "policy". For similar reasons, use of /a/ would be misleading to non-English speakers. As you see, there is more than one issue tangled up here, and several more besides that haven't been explicitly noted. Any significant change will require significant work, and the truth is that most of us here lack the energy to tackle the whole problem, even in pieces. --EncycloPetey 20:07, 25 July 2009 (UTC)
EP, I'm simply using the letters available on my keyboard. I'm not concerned here with which letters we use. The question, as above, is whether we omit final /r/ between brackets, which is a phonetic transcription with a false claim to being phonemic. kwami 07:28, 26 July 2009 (UTC)
• Of the three unabridged dictionaries I have by me, a Larousse (made in Europe or Mexico) uses "[bɑː(r)]", a Collins (made in England) uses "bɑːr", and an Oxford (made in England) uses "bɑː(r)".
Real dictionaries are concerned with useful pronunciations, not nit-picking over what is technically phonemic or phonetic. I was surprised that all three dictionaries had a system for indicating "optional final r". None of the dictionaries states anywhere that their pronunciation schemes are supposed to be strictly phonemic or strictly phonetic - they merely describe how to use their pronunciation schemes. They are free to design a system as best fits the job. They are not "incorrect". — hippietrail 08:31, 26 July 2009 (UTC)
No, they are not incorrect. They do not make false claims: [bɑː(r)] is perfectly acceptable, as I pointed out above. "bɑːr" would also be okay, though I suppose we might get into an argument over whether that's really IPA or not. But we do make a false claim: /bɑː(r)/ is not a phonemic transcription, and nowhere do we acknowledge that. Yes, I agree that practicality trumps theoretical purity. But we have an obligation to explain that's what we're doing. kwami 08:46, 26 July 2009 (UTC)
The fault is not with our pronunciation scheme. The fault is with the documentation of the scheme. And I'm not picking on the authors of that documentation because it's very tricky documentation to write. It's tricky because there are so many misunderstandings and few places where people can go to read how to design an effective dictionary pronunciation scheme. Even if you go off and read about IPA or phonetics you don't get a thorough understanding of how to apply that to dictionary pronunciation guides, which have different aims to linguistics texts.
"whether that's really IPA or not" is such a misunderstanding. The "A" in "IPA" stands for "association". Anything else is informal including how specifically to use the IPA symbols. The association has defined an alphabet which anybody can use as best fits their needs. The symbols and diacritics have specific phonetic values but can be and are used in more relaxed fashions in areas other than phonetics. It works just as well for phonetics, phonemics, and lexicography, but will not be used identically across all fields. Neither will various people in the same field use it the same. There is not "one true way to use IPA for dictionary pronunciation guides". What we do need is define one system well and strive to use it consistently. — hippietrail 09:05, 26 July 2009 (UTC)
• Hippietrail's point is right. The fact is that the transcription we use between slashes, while fairly broad, is not totally phonemic. Because that would be really unhelpful. The OED in its new revisions gives such pronunciations as /bɑː/, which is exactly how I would analyse the word. I actually see no need to treat liaison-R as a phoneme when transcribing a single word, so I wouldn't necessarily agree that this is a "false claim" to being phonemic anyway. Ƿidsiþ 09:03, 26 July 2009 (UTC)
• I find it ironic that we chose to use the symbol <ɹ> instead of <r> in IPA transcriptions, because strictly it would be "more precise", despite the fact that 95% of real-world en-FL and FL-en dictionaries use <r>, which should be much more familiar to the general user, but when it turns out that we're not exactly using a real phonemic transcription (there can't be "optional phonemes", that's why they're called _phonemes_ ^_^), IPA is suddenly not that important at all and we're simply accommodating to the "real-world" lexicographical conventions.. --Ivan Štambuk 11:47, 26 July 2009 (UTC)
The choice between /r/ and /ɹ/ was purely convention, and it was made because there was a feeling that in a multilingual dictionary it was better to choose the more accurate symbol. It WAS about making things clear for the user, because it was thought that it would be confusing if English and Italian, say, both used /r/ when there is obviously such a difference in pronunciation. When I learnt phonetics I was always taught that the division between phonemic and phonetic notation is not binary but a sliding scale. We are free to adapt it in whatever way is most useful for our users. On this issue, in my opinion it would be extremely unhelpful to portray bar as /bɑɹ/ in UK English because no one says the word that way on its own. (For that matter, I do not see why it is OK to say that /bɑr/ gets realised as [bɑː] in certain environments, but it is not OK to say that /bɑː/ gets realised as [bɑɹ] in some environments....am I missing something?) Ƿidsiþ 14:43, 26 July 2009 (UTC)
If you speak a dialect where [r] is always added at the end of a word between vowels, then the two transcriptions /ba:r/ and /ba:/ (using ASCII for convenience) would be equivalent—in your dialect. However, that is not the case with RP: While I can safely predict that "/ba:r/" will be pronounced without an [r] at the end of certain prosodic units, I cannot predict whether "/ba:/" would take an [r] between vowels, except by looking at the orthography ("bar", presumably yes, "baa" presumably no)—but the whole point of adding the IPA is that English orthography is not very predictive. kwami 01:40, 28 July 2009 (UTC)
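The asymmetry described above can be illustrated with a toy sketch. It assumes one deliberately simplified rule (a word-final /ɹ/ is pronounced only when a vowel follows); the function name `realise` and the rule itself are illustrative assumptions, not a description of any real phonological model or of Wiktionary practice.

```python
def realise(phonemic: str, before_vowel: bool) -> str:
    """Toy rule: drop a word-final /ɹ/ unless the next sound is a vowel."""
    if phonemic.endswith("ɹ") and not before_vowel:
        return phonemic[:-1]
    return phonemic


bar = "bɑːɹ"  # underlying form that keeps the final /ɹ/
baa = "bɑː"   # underlying form with no final /ɹ/

# Going from the r-ful form to the surface form is predictable in both contexts.
print(realise(bar, before_vowel=False))
print(realise(bar, before_vowel=True))

# In isolation the two words collapse to the same surface form, so a
# transcription that omits the /ɹ/ cannot tell us which word gains an [ɹ]
# before a vowel.
print(realise(bar, before_vowel=False) == realise(baa, before_vowel=False))
```

In this sketch the r-ful transcription determines both surface forms, while the r-less one loses the information needed to recover the linking [ɹ].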
• I agree. There just isn't the appetite here (yet) to record phonetic variation. The pronunciations given are essentially phonemic, with some allophonic preferences being identified. Although (British) English is my native tongue, my accent is rhotic, as a result of which some vowel sounds are diphthongized before whichever allophone of phonemic /r/ I happen to be using. I note that in my English edition of Collins English Dictionary, /r/ is used in the pronunciation provided for "bearing" (pronunciations of inflected forms are not normally given, but "bearing" is a separate headword), whereas for "bear", like "bar", no form of /r/, optional or otherwise, is given. Phonetically, it makes no sense to add an optional /r/, since the /r/, when pronounced, generally alters the preceding vowel sound (or, if you prefer, not pronouncing the phonemic /r/ alters the vowel). On a related point, the "tt" of "battle" is currently shown as /t/ in British English and /d/ in American, which is reasonable, but both pronunciations show a following consonantal /l/. I feel that the standard British pronunciation is a voiceless lateral release, whereas the American is voiced, implying an intervening schwa. Any thoughts? ARAJ 13:56, 26 July 2009 (UTC) signed retrospectivelyARAJ 19:14, 26 July 2009 (UTC)

iwish4standardisationof ipa4dict-ries,as is=major pain2 2ndlanguage learners[1has2study aech bl dict's system,wota waste oftime'n'energy2the user!:(:(--史凡/Sven - Pl also use MSN/skype as I suffer RSI and so cannot type very well! 15:44, 26 July 2009 (UTC)

Where is Wiktionary going at all?

All administrators are cordially invited to read carefully User talk:KYPark#undulate and hopefully the relevant pages, and make comment here, not there. Thanks in advance. -- User:KYPark aka --nemo 08:45, 25 July 2009 (UTC)

My comment is that, if you are very careful when wording etymology sections, if you don't select one option among several possible hypotheses when the etymology is disputed, and if you mention your sources, there should not be any problem. Lmaltier 09:34, 25 July 2009 (UTC)
Are you asking all editors all over the world to be "very careful" esp. among possible hypotheses or conjectures? As far as I know, there is no place for such things in Wiktionary, about which I'm so unhappy. Otherwise, Latin pulmo "lung" could be freely compared with Korean 풀무 (pulmu) "bellows" also meaning "lungs" without prejudice. -- User:KYPark aka --nemo 11:20, 25 July 2009 (UTC)
Yes, this is the "neutral point of view" principle, we should not take position, it's important to be careful. And, except if the fact is well-known, you should not suggest that words are cognates without citing your source (this is the "no original research" principle). Lmaltier 11:37, 25 July 2009 (UTC)
Never forget that I'm ultimately asking YOU to talk whether EncycloPetey and his kind are WRONG or not. No more! -- KYPark aka --nemo 11:31, 25 July 2009 (UTC)
Lmaltier should admit that he was trapped unconsciously by himself. Oh dear, why should I explain their folly, oh no!
I think there's no place for this sort of behaviour on the Wiktionary. Whether you are right or wrong, KYPark, this behaviour is intimidating and harassing and I'd support a lengthy block if another admin wants to do it. Mglovesfun (talk) 12:06, 25 July 2009 (UTC)
• KYPark has been noted to behave himself and not to push for another wiki-drama. I sincerely hope that he'll continue his Wiktionary editing enterprise in a more fruitful direction. --Ivan Štambuk 12:17, 25 July 2009 (UTC)
Again and again I do ask you the so-called "community" to answer THE WHOLE WORLD, not me, if I'm so unforgivably wrong as to be blocked forever for what reason exactly. Sincerely yours, User:KYPark aka --nemo 12:48, 25 July 2009 (UTC)
If you'd be blocked now it would be for attacking behavior and wasting other people's time. Please KYPark find something else to do here. --Ivan Štambuk 13:21, 25 July 2009 (UTC)
It's a real pity not to be understood. Oh yeah, you're free to understand anything. So-called "community!" be aware! You could be silent. But be quite aware you are not free to be silent at all! Should something be wrong, you should bother it at last. -- KYPark aka --nemo 14:11, 25 July 2009 (UTC)

Despite the inappropriate tone of this discussion, EP appears to be correct that there is no attested Latin verb undulāre, but is unjustly accusing KYP of making stuff up. According to the OED, 2nd ed, KYP is more or less correct: it has:

ad. L. type *undulāt-, ppl. stem of *undulāre, f. unda. Cf. Sp. and Pg. undular, It. ondulare, F. onduler.

Now, perhaps researchers have since decided that this etymology is incorrect, but it's hardly "making stuff up".

As for 'cognate', a restricted usage is supported by ELL2, under "etymology", where it says,

In the case of a word that is taken to form part of the inherited wordstock of a language, [the etymology] may be a listing of the word's immediate or remoter cognates (the similarly descended forms in related languages), ... In the case of a loanword, this may amount to no more than the immediate donor form, ...

suggesting that the word 'cognate' does not apply to loanwords—excepting, of course, when a cognate is borrowed, as with skirt and egg ("albeit in the last instance displacing a native cognate word"). However, it is wrong to assume that cognates are the same part of speech: take live and life in English, which AFAIK are cognate by anyone's definition (cognate object 'live life', etc; also cognate word (= hanzi) families in Chinese). True, within historical linguistics, at least when identifying or reconstructing language families, the word 'cognate' generally excludes loans. For example, in ELL2 I read, "Words common to Turkic and Mongolic [...] are regarded by Altaicists as true cognates and by non-Altaicists as Turkic loans in Mongolic." However, I wonder if that strict definition is restricted to that field; the fact that the word had to be clarified with "true" suggests that 'cognate' might also be used for loans. Very often, historical linguists speak of two languages having a high or low percentage of cognates, when it has not been determined whether or not they are loans. (Sometimes this will be clarified as "apparent cognates", suggesting that this is sloppy usage.) But the Oxford Companion to the English Language (1992) says that "Pairs of cognates in a single language, such as English regal and royal (both ultimately from Latin) are called doublets." And in ESL, 'cognate' often includes Latin words in English that a Spanish speaker will recognize: "the following cognates for Spanish speakers: calorie/caloria, carbohydrate/carbohidrato, nutrient/nutriente, and vitamin/vitamina" (in What Research Has to Say about Vocabulary Instruction, but similar statements in much of the rest of the ESL literature). 
This even appears to be true in historical linguistics regarding language contact: Jeremy Smith, Essentials of early English, states that "Most of these words were adopted from Norman French, sometimes demonstrated by the distinctive form of the adopted word in [Present-Day English] compared with its Present-Day French cognate," and gives war-guerre, carpenter-charpentier as examples of cognates. Later he says, "cognate forms in Norse and English have developed distinct meanings, e.g. skirt, shirt." I.D. Melamed, Empirical methods for exploiting parallel texts, speaks of English-Japanese cognates, meaning English loanwords in Japanese written in katakana. And elsewhere in ELL2 we get "The English word define and its European cognates are derived from Latin definire." I think that either we need a good reference that such usage is incorrect, or we need to agree that here at Wiktionary, "cognate" is to be restricted to its specialized, historical language-family reconstructive definition, and not used with its generic linguistic definition. kwami 19:34, 25 July 2009 (UTC)

Please see. Does this meaning of "cognate" agree with your experience? --EncycloPetey 20:01, 25 July 2009 (UTC)
We use cognate in etymologies strictly under the sense "inherited from the same ancestor language" (either attested or reconstructed). Words which are borrowed from the same source, and which are worth (by some arbitrary but common-sense criterion) mentioning in the etymologies, are treated as "related". This latter should generally exclude stuff such as Graeco-Roman coinages that were borrowed/adapted into numerous unrelated languages (i.e. internationalisms). Our old definition on the article [[cognate]] was something like "words that come from the same source", which was abundantly abused by KYPark to promote "cognates" between Korean (language isolate) and modern-day German or Finnish, promoting fringe theories such as Uralo-Altaic. Once we discovered the magnitude of his work, the article on cognate got rewritten and he was forbidden to use the word cognate in any other sense than "genetically related" in etymologies, as well as to arbitrarily link between languages which are not generally accepted to be genetically related, which he also did very often (just see his userpage).
In the ==Descendants== section of the ancestor language it is however allowed to mix borrowed and inherited terms, because it is obvious what is which (by the existence of a genetic relationship or not; the only exception is between pairs such as Spanish palabra / parábola which should then have it clearly marked). As for Altaic as such, that is not about the definition and usage of the word cognate itself, but about the theory of whether an ancestor language existed or not (if not, those are then obviously borrowings). If the reconstructed ancestor language is not too fringy, it might be mentioned. At worst, we can put it in an appendix. See also Wiktionary:Etymology for some of the written guidelines. --Ivan Štambuk 20:12, 25 July 2009 (UTC)

Marking reverted vandalism as patrolled

If vandalism has been rolled back, should the vandalising edits be marked as patrolled? Equinox 19:17, 25 July 2009 (UTC)

If you've set your preferences to do that, then those edits will be marked automatically when you revert. --EncycloPetey 19:24, 25 July 2009 (UTC)
And, yes, they should be imo.​—msh210 05:53, 27 July 2009 (UTC)

Ancient Greek orthography question

I've recently come to the rather disheartening realization that a great many (seriously, a lot, really) of our Ancient Greek entries are misspelled......sort of. If you go here, you'll see that in Unicode, there are two main regions containing Greek characters, "Greek and Coptic [0370-03FF]" and "Greek Extended [1F00-1FFF]." "Greek and Coptic" contains the requisite characters to encode modern Greek (i.e. Demotic), and "Greek Extended" contains additional characters (primarily vowels) which are needed for the rather more complex orthography of Ancient Greek. If you look at character 03CC in "Greek and Coptic", you'll see an omicron with a little apostrophe-like thing on top of it. This is a tonos, the basic unit of Modern Greek diacritic use, which signifies a stress accent. Now, if you look at character 1F79 in "Greek Extended," you'll see another omicron, with another, very similar little mark on top of it. This is an acute or oxia. To fully understand its significance, we'd have to get into the incompletely understood realms of Ancient Greek phonology and phonological evolution, but suffice it to say, it is different from the tonos, because, if for no other reason, it contrasts with the grave and circumflex, where the tonos is the only stress accent in Modern Greek. So, as it turns out, all (or at least most) of our Ancient Greek entries which should have an acute actually have a tonos. Try putting λόγος into this website. Hippietrail tried to warn me of this a few years back, but I lacked the requisite tools to verify it, and the warning went unheeded [14]. It appears that most Ancient Greek keyboard layouts utilize the tonos, even though they probably shouldn't. So, we've got this problem, what do we do with it? The simple fact is that the words are, in fact, misspelled. At the same time, most of our users who utilize Ancient Greek keyboard layouts to find words (such as myself), are probably unknowingly entering a tonos when they mean an acute.
If we can get the software to recognize this, and redirect them appropriately, then it is probably best to move all the words to their correct spellings. If not, then we should probably live with this mistake for the time-being. Now, this is certainly not an urgent issue (I rather doubt that anyone realizes the error exists). However, it has rather serious consequences for a project I'm currently working on, and I would like to see a plan of action made within the near future (the actual execution of this plan is not quite so important). Thoughts? -Atelaes λάλει ἐμοί 22:11, 25 July 2009 (UTC)
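For anyone wanting to check which characters in a given title are affected, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `find_singletons` is an illustrative assumption; it simply reports Greek Extended characters whose canonical decomposition is a single other code point, which are the ones that normalization replaces outright.

```python
import unicodedata


def find_singletons(text: str) -> list[tuple[int, str, str]]:
    """Report (index, original, replacement) for Greek Extended characters
    whose canonical decomposition is a single other code point."""
    hits = []
    for index, ch in enumerate(text):
        if not 0x1F00 <= ord(ch) <= 0x1FFF:
            continue
        parts = unicodedata.decomposition(ch).split()
        # A canonical singleton has exactly one code point and no <tag> prefix.
        if len(parts) == 1 and not parts[0].startswith("<"):
            hits.append((index, f"U+{ord(ch):04X}", f"U+{parts[0]}"))
    return hits


with_oxia = "λ\u1f79γος"   # omicron with oxia (Greek Extended)
with_tonos = "λ\u03ccγος"  # omicron with tonos (Greek and Coptic)

print(find_singletons(with_oxia))
print(find_singletons(with_tonos))
```

Run on the two spellings of λόγος, only the oxia version is flagged, and the reported replacement is the tonos code point.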

Is it just me, or MediaWiki seems to be automatically converting oxia to tonos? --Ivan Štambuk 22:39, 25 July 2009 (UTC)
After writing this thread, I noticed that Hippietrail mentioned some MW normalization in our previous conversation (the linked one). This may be what he was talking about. This will certainly have to be addressed in any decision we come up with. -Atelaes λάλει ἐμοί 23:01, 25 July 2009 (UTC)
Yes, MediaWiki has implemented Unicode normalization form C since version 1.4. What this means is that various ways of making the same string are merged into one "normal" way; otherwise searching and sorting would not work. Now, Unicode specifies that under normalization the Greek oxia and tonos are merged. This would have been discussed and debated by Greek language experts and the Unicode Consortium. I'm at work now, but a quick Google search found this document, which looks like it might be the one covering this decision: http://omega.enstb.org/yannis/pdf/amendments2.pdf.
Hope this helps! — hippietrail 23:28, 25 July 2009 (UTC)
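The merging described above can be reproduced outside MediaWiki with Python's standard `unicodedata` module. This is only a sketch of normalization form C itself, not of MediaWiki's own code; the variable names are illustrative.

```python
import unicodedata

oxia = "\u1f79"   # GREEK SMALL LETTER OMICRON WITH OXIA
tonos = "\u03cc"  # GREEK SMALL LETTER OMICRON WITH TONOS

print(unicodedata.name(oxia))
print(unicodedata.name(tonos))

# Under NFC the oxia code point is replaced by the tonos one,
# so the two spellings become the same string.
nfc_oxia = unicodedata.normalize("NFC", oxia)
print(f"U+{ord(nfc_oxia):04X}")
print(nfc_oxia == tonos)

# Fully decomposed (NFD), both are a plain omicron plus a combining acute.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", oxia)])
```

Any software that stores text in form C will therefore turn a typed oxia into a tonos, which matches the behaviour noticed earlier in this thread.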
According to that document (section 1.1), vowels with tonos are "exactly the same letter" as vowels with oxia, and the section 2.1.4 Amendment 1b states that in earlier editions of the standard tonos was "mistakenly shown as a vertical line over the vowel", when in fact it should have been acute (oxia). This was, quoting, "probably because people with little (or no?) knowledge of Greek have imagined that an accent which derives from both the acute and the grave accent should be symmetric" ^_^.
Well, in any case, to force vowels with oxia you can still use HTML codes, e.g. for the abovementioned omicron: `&#x1F79;`. For pagetitles this wouldn't work of course.. --Ivan Štambuk 23:47, 25 July 2009 (UTC)
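A small sketch of why a numeric character reference can get past normalization, using only Python's standard `html` and `unicodedata` modules. It shows the general principle (the reference is plain ASCII until it is decoded) and makes no claim about how MediaWiki processes such references internally.

```python
import html
import unicodedata

reference = "&#x1F79;"

# The reference itself is ASCII, so NFC normalization leaves it untouched.
print(unicodedata.normalize("NFC", reference) == reference)

# Only when decoded for display does it become the oxia code point.
decoded = html.unescape(reference)
print(f"U+{ord(decoded):04X}")

# If the decoded text were normalized again, the oxia would be lost.
renormalized = unicodedata.normalize("NFC", decoded)
print(f"U+{ord(renormalized):04X}")
```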
Here's another document that seems to mention it. It seems there is only one accent in reality. It has had a couple of names. Unicode messed up and made it two different characters. Various font makers messed up and made the two characters Unicode messed up, and they made them look different. Unicode tried to clean up the mess by combining the characters back together. Some old fonts are still messed up and will show the wrong style for the character.
I suggest doing the best we can with font templates by choosing fonts known not to reflect the old messy situation. — hippietrail 00:33, 26 July 2009 (UTC)
(I don't see another document's link?). It appears to me that they've also managed to screw up when "fixing" the problem: had they chosen to coalesce the code points defaulting to the ones with oxia, there wouldn't be this font problem left.. --Ivan Štambuk 11:31, 26 July 2009 (UTC)
Oops sorry I must've been busy at work when I posted that. I'll see if I can find it again. Unicode does not remove codepoints, the way they "coalesce" them is to deprecate the faulty one and normalize it to the correct one. It's most often seen in CJKV characters. I'm certain you'll find more useful reasons by searching the Unicode mailings than any of us complaining or speculating here. — hippietrail 11:47, 26 July 2009 (UTC)
You're saying that, to accommodate a few broken fonts that used a totally wrong glyph for the tonos, Unicode should have deprecated the widely-used Modern Greek accent in the Greek-and-Coptic range (two bytes in UTF-8), in favor of a rare Ancient Greek one in the Extended-Greek range (three)? I can think of an estimated 11,216,708 people who might have objected to that. —RuakhTALK 13:40, 26 July 2009 (UTC)
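The byte counts mentioned above are easy to confirm; this sketch just encodes the two omicron code points as UTF-8 with Python's built-in `str.encode`.

```python
tonos = "\u03cc"  # omicron with tonos, Greek and Coptic range
oxia = "\u1f79"   # omicron with oxia, Greek Extended range

tonos_bytes = tonos.encode("utf-8")
oxia_bytes = oxia.encode("utf-8")

print(len(tonos_bytes), tonos_bytes.hex())
print(len(oxia_bytes), oxia_bytes.hex())
```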
Those fonts were not broken at that time, but simply followed the broken standard. The standard was then changed to accommodate the broken behavior in practice (which largely ignored the standard), not to correct it properly: fonts that provided the Unicode-compliant glyphs for tonos (with vertical line) and also for oxia (with acute) would not work properly after that change. But I was wrong to say that it would solve the fonts mess, if it's true what you say that the fonts providing Extended-Greek range are rare (I thought that fonts providing Greek support in most cases provided glyphs for both of the ranges). --Ivan Štambuk 14:09, 26 July 2009 (UTC)
I believe you're mistaken. AFAIK Unicode does not define glyphs, and I find it hard to imagine what a "Unicode-compliant glyph" would or would not be. Before the change, a properly designed, Unicode-compliant font would have used similar or identical glyphs for characters with tonoi as for those with oxiae (are those the right plurals?). The sample glyphs in the character charts were wrong, but if you're basing your glyphs entirely on the ones in the charts, then you're stealing the intellectual property of the owners of those renderings. Unicode was broken, yes, in that it distinguished the tonos from the oxia, and that it implied (in a few different ways) that tonoi are straight vertical; but I think merging the oxia into the tonos, and not vice versa, was the right decision. —RuakhTALK 20:16, 26 July 2009 (UTC)
They don't exactly "define glyphs" but they do provide reference charts of them for font implementers to base their glyphs on.
"Before the change, a properly designed, Unicode-compliant font would have used similar or identical glyphs for characters with tonoi as for those with oxiae" - untrue, Unicode explicitly separated those two in code points, assigned them different names, and provided reference glyph for tonos with a vertical line with the abovequoted rationale. When I open a webpage Atelaes linked above, I do see a difference between those two (I'm not sure which font the browser has been loading from my computer), tonos with vertical line and oxia with acute. Only later Unicode folks realized a mistake, acquired some real experts and solved it the way they did... It was a "right thing to do" with respect to the prevalence of Unicode non-compliant fonts which displayed tonos as with acute, but not strictly with full backwards compatibility in mind. More of a workaround with the minimal possible bad real-world impact... --Ivan Štambuk 20:36, 26 July 2009 (UTC)

Unique Senses and Definitions

I suggested above (On Scripts) that I have difficulty with the structure of Wiktionary and, in particular, having languages as Level 2 headings. I don't imagine for one moment that this will be changed, so this section outlines the difficulties I face when considering whether to commit a significant amount of time and effort to improving Wiktionary. What interests me most about words is not their spelling, their pronunciation or their etymology but their meanings. It seems obvious to me that translations, synonyms, antonyms, hypernyms, hyponyms, holonyms, meronyms and other related "words" are all relationships between senses of words, not words themselves. In theory, these "senses" are enumerated in the definition of each part of speech (per etymology, but that implies separate "words" to me). As far as I can see, this does not happen. And I can't see how it could happen, as things stand, because there is no obvious way to refer to the senses of a word, other than by ad-hoc text. Can someone please explain how things are not as bad as they look? --ARAJ 19:55, 26 July 2009 (UTC)

Nope, they're as bad as they look. There's a trade-off between being able to glance through all senses very quickly, and being able to see all information about a sense grouped together, and we've mostly opted for the former (the exceptions being example sentences and quotations: example sentences always go right under the sense line, and quotations frequently do; but everything else goes in subsections with `{{sense}}` tags). In theory, this problem is resolvable in software; in practice, that's not easy within the MediaWiki framework, especially if we want it to work for all users (not just those with modern, visual browsers, and JavaScript enabled). And the current approach, annotating senses using `{{sense}}`, is not tractable to in-software parsing; if we ever do create a technical solution to the problem, we'll need to edit all existing entries to make use of that solution. —RuakhTALK 20:22, 26 July 2009 (UTC)
Perhaps within the MediaWiki framework the only practicable way forward is one page for each sense. Deep breath! Okay, maybe that's not quite as bad as it sounds, as a lot of words do have only one sense. Anyway, let's consider that a proposal! What does it imply? We want to keep the different senses listed together on the word's page, and some techie will probably object to generating this list each time the page is displayed, so we have a bit of a challenge there. But I think I can live with the risk of the definition of a sense being altered so fundamentally that, in fact, it becomes a definition of a different sense. All this proposal requires is a single link to the sense page for each sense of a word. The translations, synonyms and other sense-related content (including examples and quotations, perhaps) all moves to the sense page but could be displayed on demand from the word page (or could it? And who cares, really?) --ARAJ 22:21, 26 July 2009 (UTC)
• I hope that eventually we will be able to have a hidden, collapsible box system like we have for translations, which will be under each sense of a word, and which can be expanded to show translations, quotations, synonyms etc etc. (Etymologies still have to outrank meanings though, in my opinion.) Ƿidsiþ 22:46, 26 July 2009 (UTC)
• Agree with Widsith. I fail to see why we can't use similar javascript. Yes, it will make for somewhat messy pages for those with JS disabled, but I wonder if they will be that much messier than the current situation. And for those with JS enabled (the great majority of our users, IMO), the situation will be vastly improved. -Atelaes λάλει ἐμοί 23:09, 26 July 2009 (UTC)
I wrote some javascript to do this, but it is not of high-enough quality to enable for everyone (and it seems recently to have stopped working entirely) - it does not solve the main problem however, which is that we can't link to a definition. As writing definitions is a hard job, creating a page for each is slightly flawed - there would be just as much page moving as page editing, which would break all of the links all of the time. I would prefer (to the current format as well as to the per-page idea) to have pages as we do now, with Level 2 language headings and Level 3 definition headings (with the part of speech at the start of the line). Actually achieving this would be very hard - but I don't think it would be impossible, (as the prototype javascript showed, most of the transformation can be done automatically). There are still some issues to consider, such as how to keep links up to date - would it in fact be better to do a wolfram-alpha like single-word or short phrase "definition identifier" at the start of each line and link to that, with a full definition underneath? How do we deal with information like spelling and etymology that may be shared in complicated fashion amongst many definitions (my suggestion would be to put them at the bottom and link to them per-sense). It might be possible to hack in some kind of linking system (I think there are some prototype templates for that somewhere), but implementing it across all pages will take some effort and determination that is not likely to persuade everyone to join in until we have a proposal that will definitely work, and definitely improve things for most words without degrading things for many, or any. Conrad.Irwin 23:32, 26 July 2009 (UTC)
I'm not sure that I understand your problem with definitions. The difficulty of writing definitions seems beside the point. If you actually mean discriminating senses, then fair enough. Surely, though, the usual case is an omitted sense, perhaps discovered by a difficulty in translation or when exploring synonyms etc because you can't find a suitable sense to attach something to? Or you want to attach something to only part of a "sense", which implies that there is more than one sense covered by the same definition? Although the first scenario looks easier to handle than the second, neither seems especially problematic.
Part of speech per sense seems just fine; a definition would almost always be written that way, and translations into some languages must make the distinction. As for etymologies, these are generally at word level. That is why they are at a higher level than parts of speech or senses. This does not mean that the etymology itself should appear before the definitions, of course. My preference would be for one-line etymologies to precede the definitions, with a reference out to a more complex account. Often the complexity relates to the relatively ancient history of the word, and this is shared (by definition) with its cognates, so there is an opportunity to avoid duplication and exploit synergies here (for example, where an English word derives from Old French and so does a modern French word). Not sure about spelling variations. I guess that one of the variants would be considered primary, and for this word the alternate spellings are probably of most interest to someone who has been re-directed from the variant, so I'd be inclined to put it right at the top.--ARAJ 17:55, 27 July 2009 (UTC)
The linking system experimental template is `{{jump}}`. It's interesting because it offers a way to uniquely identify every sense of every word. Even if the result with the superscript links is not that great IMO, having all definitions, synonyms, translations etc. tagged that way would open many possibilities for automatic restructuration of the articles. Beru7 15:08, 27 July 2009 (UTC)
Unique identification of senses would be a giant leap forward. Great!--ARAJ 17:55, 27 July 2009 (UTC)

Conrad: "Level 3 definition headings (with the part of speech at the start of the line)" < I like! :p --史凡/Sven - Pl also let me use voice-MSN/skype as I suffer RSI and so cannot type very well! 06:55, 27 July 2009 (UTC)

Moving hypernyms, hyponyms, meronyms... to Wikisaurus might be a small improvement: they are more about the thing than about the word (and they are shared by all synonyms). But I would keep translations and a few synonyms (not too many) in the page. Lmaltier 05:46, 27 July 2009 (UTC)
What are you saying here? That hypernyms etc are more about the referent than synonyms and translations? That we should avoid duplicating the hypernymy etc but happily duplicate the typically much larger number of translations (or, worse, fail to duplicate them)? I'm sorry... translations are, quite rightly, at sense level, so that is the logical place for them to be entered and viewed. It also seems ludicrous that there is no correlation of senses across wiktionaries, so we will find a French word on Wiktionary with a different number of senses from the same word on Wiktionnaire! Okay, I'm all in favour of autonomy, but we are talking about the same word in the same language. Of course it is unreasonable to expect an editor who adds a sense in one wiktionary to troll round dozens of other wiktionaries in languages they are wholly unfamiliar with to add the sense to all those, in the hope that someone will provide a suitable definition.--ARAJ 16:49, 27 July 2009 (UTC)
The problem with translations is that many users interested in the word are interested in a translation, and they should find it in the main page. Another problem is that translations should be associated with a very precise sense of the word, not with a rough sense, and the translation table might (or should) be different between synonyms in many cases (because synonyms with no subtle differences are rare).
All synonyms might be sent to Wikisaurus, but I think mentioning most common ones is useful to many readers.
About the differences between wiktionaries, it's because the project is just beginning (it is less than 7 years old). Entries will improve with time. Lmaltier 20:19, 27 July 2009 (UTC)
It is true that there are relatively few perfect synonyms in English, where two words can be used interchangeably in all of their senses. There are many more cases where words share some of their senses. Some of these might be better regarded as "translations", since they are (or were originally) used in different dialects (e.g. lad and boy). The problem of finding an appropriate synonym is exactly analogous to finding an appropriate translation. So I object in principle to synonyms and translations being presented differently from one another, although I am not convinced that either actually needs to be on the main page.
As a contributor, however, I do care that, for example, the shared senses of boy, lad, garçon, Junge etc are, in theory, provided under each synonym or translation (and then, again, in a different language, in each of the Wiktionaries). To do this manually and keep things in step afterwards is rather more of a challenge than I am inclined to accept. So, yes, as a contributor I would rather identify the concept of a "male child" and then look at how well it is defined and included in all words in all languages (and all Wiktionaries). Personally, I find the semantic relationships between such concepts both fascinating in themselves and useful in discriminating between both senses and usages.--ARAJ 09:04, 28 July 2009 (UTC)

proper names

IMO they all need inclusion already, for IPA! [And no, WP doesn't do a good job on that; they're no linguists. E.g. WP never mentions AmE/RP etc.] --史凡/Sven - Pl also let me use voice-MSN/skype as I suffer RSI and so cannot type very well! 06:25, 27 July 2009 (UTC)

We should accept all words (and all languages), not all names. I agree with you, but only when they are words (e.g. New York, Confucius, Winston, Churchill are words, but Winston Churchill is not a word). This makes the inclusion of etymology, pronunciation, etc. possible. Lmaltier 07:14, 27 July 2009 (UTC)
So it's OK to have entries for English villages like Winston but not for million-population Russian cities? :)
Our current CFI for toponomastics is really bad IMHO, relying on some arbitrarily definable phrases such as "widely understood". Also, the criterion of a widely understood phrase containing a place name would exclude all the toponyms that don't make an appearance in such phrases, which could have the effect of Wiktionary not listing some pretty relevant stuff at the expense of something that might have ended up in some widely understood phrase simply by chance. I understand the concerns about the creation of hundreds of thousands of stubbish entries containing definitions such as "city/village in X", but the criteria are really too strict. Lots of these place names can have full-blown entries, with pronunciations, interesting etymologies (like the abovementioned Winston < joy stone ^_^), inflections etc. - the type of information you could not find on Wikipedia. --Ivan Štambuk 13:44, 27 July 2009 (UTC)
I fully agree with you: all city or village names, in all languages, can be considered as words, therefore all these names should be accepted and described as words (not encyclopedically), just as all languages should be accepted. We should take the principle all words, all languages more seriously. It makes things much simpler, and it's better to keep things simple (KISS principle). Lmaltier 14:00, 27 July 2009 (UTC)
We could choose to include "all proper nouns in all languages" or not. It is not an inevitable byproduct of "all words in all languages". We have also determined that Wiktionary is not an encyclopedia. I personally find little value to modifying CFI to include more Proper nouns. I would be perfectly happy with having a bot add only-in templates pointing to WP or other sister projects for every proper name for which we have no entry and a sister project does.
Under current rules, the test of attributive use is an approximation to the idea of something having meaning beyond its literal referent. It has not been consistently applied. If one does not like some of these terms, one can RfV them.
As to etymology, how are we doing on generating interesting etymologies for surnames and given names?
There are many criteria one could apply short of "all words naming any referent". One way to start is to piggy-back on the work WP has already done. Wikipedia applies a notability criterion. We could start with all of their proper nouns one category at a time. I've been noticing that many band names are interesting collocations that often don't quite meet CFI. We could simultaneously get more entries and get some interesting collocations. We could apply an "interesting etymology" test. For example, we could see whether WP has an etymology section for the Proper noun involved.
Another approach is to go for the kinds of names not in WP.
To me improvement in quality of definitions, especially of polysemic words, usability, usage examples and usage information seems a vastly better goal than entry-count. The pursuit of proper nouns will necessarily dilute our technical, patrol, and support resources. DCDuring TALK 14:45, 27 July 2009 (UTC)
What I would like is not all proper nouns. It's all words, even when they are proper nouns. Winston Churchill is a proper noun, but cannot be considered as a word. Anyway, it would be difficult to provide linguistic information for Winston Churchill (except information related to Winston and to Churchill). But interesting information (etymology, pronunciation, anagrams and more) can be provided for all words. Lmaltier 15:05, 27 July 2009 (UTC)
To clarify: Rostov-on-Don, no; but Rostov, yes; Don, yes ? Would you retain the existing attributive-use criterion or would it be New York and New Rochelle, no; York and Rochelle, yes, thereby taking a right-pondian bias for English-language place names? Nordsee, yes; North Sea, no. New City, no; Newburgh, yes. Saint Louis, no; Springfield, yes. I suspect we would have no less explaining to do. DCDuring TALK 15:36, 27 July 2009 (UTC)
I think that Rostov-on-Don and New York are words, and should be accepted. It's the same case as deus ex machina. But Winston Churchill and Excelsior Hotel are proper nouns composed of two words, and should not be accepted. Lmaltier 15:49, 27 July 2009 (UTC)
As I have said once before, if you "really" accept all words in all languages, what about all single-word proper nouns? Ghost (the film) Blockbusters (company), Mattel (company), Big (another film). Sure words like these can have etymologies (of sorts) but our job is not to duplicate Wikipedia. Mglovesfun (talk) 16:07, 27 July 2009 (UTC)
Film names or book titles may be considered as proper nouns but, obviously, they are not words (even when they are composed of a single word). But I think that Mattel is a word, yes (but not Mattel Company or Mattel Inc). And I fully agree that our job is not to duplicate Wikipedia: only linguistic information should be provided. Lmaltier 16:14, 27 July 2009 (UTC)
"Obviously"? Your feelings on this seem as arbitrary as anyone's. I can't tell which words you'd want to include and which you'd omit. Equinox 16:31, 27 July 2009 (UTC)
New York is not includable under WT:CFI except by reason of attributive use. If we do not keep the attributive-use standard for multi-word names, it would not be included under a simple version of the standard you suggest. I don't really see why it is necessarily true that we would have all of solid-spelled compounds, or all of hyphenated names, or all of spaced names. We could draw the line where we please. I think there are a few dimensions across each of which we might draw (or not draw) lines or select criteria for inclusion or exclusion.
1. One is subject matter category (geography, personal name, organizational entity (political, commercial, other), product, brand, business name, taxon);
2. another is the spelling of the names (solid, hyphenated, spaced);
3. another might be notability (population, inclusion in WP, geographic area, existence of a single person willing to enter it); and
4. another might be relationship to the rest of language (etymology, attributive use). Attributive use might be subdivided into:
1. Use in English
2. Use in multiple languages
3. Use in its own language
I find it hard to believe that we can come up with adequate criteria for each subject category, when we barely have consensus on personal names (What is the inclusion criterion for surnames?) and taxons (Do we keep or delete two-part species and three-part subspecies names?). What is even harder for me to believe is that we could have a single slogan or set of criteria that worked for all categories. DCDuring TALK 17:09, 27 July 2009 (UTC)
Have you ever heard film names or book titles or full person names referred to as words? I don't think so (this is why I used "obviously")... Have you heard town names referred to as words? I think so. But you are right, this should be explained clearly, because there are some less obvious cases (e.g. seas). I think that common sense can help: when names for members of a class of things are usually words, all member names may be considered as words, even when they happen to be composed of two typographic words (e.g. New York). Animal or plant names are words, therefore scientific names are words too. Lmaltier 17:22, 27 July 2009 (UTC)
I don't want to bring in encyclopedic criteria like notability, or other attributes of things. That just opens the door to additional encyclopedic criteria, and could encourage biographical and gazetteer entries.
Would “all villages” also mean all lakes, rivers, islands, forests, counties, bays, town squares, streets, neighbourhoods, landmarks, buildings, alleys, etc? This doesn't sound like a good first step to me. We'd become the dumper for every made-up unverifiable “place” that Wikipedia doesn't accept, and if wikipedia doesn't list the population of Lower Bumcross, then that info must go into the Wiktionary entry! We aren't an encyclopedia. We list words, not things.
I'd really rather see some purely lexicographical criteria, qualities or attributes of words, used to include more geographical entries. It's a tough assignment, but there is clearly a consensus to add more toponyms, so it just needs someone to make a modest proposal. Michael Z. 2009-07-27 17:44 z
I fully agree with you: no notability criteria, no encyclopedic contents, only words. You are right, this must be defined through lexicographical criteria only (i.e. what does word mean?). Our entry for word does not solve this issue, but it has been studied by linguists. Lmaltier 18:01, 27 July 2009 (UTC)
Perhaps if we could find a good toponymic dictionary to serve as an example. Sorry, but I don't have much time to go to the library these days. Michael Z. 2009-07-27 18:16 z
The subject heading and initial comment were about proper names. Then we were talking about gazetteer entries. Now we are talking about toponym components as a means of preempting the pressure to make this a gazetteer. Would it perhaps be useful to have a new heading on the current subject of toponymic entries if that is the subject of interest? A specific focus like that might attract broad participation in a serious discussion that would lead to decision and even action. DCDuring TALK 19:11, 27 July 2009 (UTC)
Sorry, it's a bit lazy to just respond to the last comment and ignore the thread. Since we're squarely onto place names, I'll add a subheading. Michael Z. 2009-07-28 02:02 z

Place names

[continued from immediately preceding]

Lmaltier, I don't think we should use a criterion such as "I have heard of <towns/countries/etc> referred to as words" as a basis for inclusion ("New York" yes, but "Socrates" no?). For one, your understanding of this bit of "common knowledge" doesn't seem to equate with mine or some others', and I'd find it hard to believe a consensus could be reached with this approach. It is not a very solid foundation for our mission of all words in all languages. Whatever the approach is, it should be better defined (possibly along the lines DCDuring mentioned). I still hope that something akin to the recent vote could be reworked to garner more support. --Bequw¢τ 00:25, 28 July 2009 (UTC)

Can we define the rule by the significance of a place, rather than by how many parts its name consists of? I don't deny New York the right to be included just because there is a space in it; neither should Rostov-na-Donu/Rostov-on-Don be denied it. It's a city of a million, a large administrative centre and an important historical and cultural city in Russia, mentioned very often in Russian literature and translated into other languages. It's also a major part of East Slavic culture and of the history of the Don Cossacks, and it has a popular slang name - Rostov-papa (Ростов-папа), due to its criminal history.
There is no need to include small villages and squares, nor is anyone trying to do so. Let's talk about how big and important a place must be to be included; that would be more constructive. By any criteria discussed in previous discussions, Rostov-na-Donu and Rostov-on-Don should be restored.
I would also leave people's names out of this discussion, let's have another if anyone is interested.
If Wiktionary only allows cities originating in English speaking countries, that's arrogant and not helpful. Who wins if we remove an important place name from the dictionary? A bit more complicated names like Rostov-na-Donu/Rostov-on-Don also deserve to be included, simply because their translations are not so straightforward in most languages, it's not a simple transliteration based on the sound and I find information on how this is done in a particular language interesting and useful.
Most good dictionaries include major proper geographical names, so why should we be different? Let's just talk about what is significant enough to be included. Anatoli 01:31, 28 July 2009 (UTC)
No. As I've explained repeatedly, I'm opposed to including any term based on the encyclopedic notability of its referent. The dictionary is only about words, not about the things they represent. “Encyclopedic” dictionaries throw in gazetteer entries, often completely devoid of lexicographical information, purely for marketing reasons. Since our partner encyclopedia is free and only a link away, there is no reason to water down the dictionary with lame excerpts from it.
OED's practices look like a good start, and maybe we can derive a few guidelines from them. They define Moscow strictly as an analogy: as the government and ideology of the USSR, or of Russia; also as used in combination for Moscow centre and Moscow mule. Not defined as a city at all. Its inclusion in the OED is strictly based on its use as a word, and has nothing to do with how many people live there or what they speak. There's a lot more to be gleaned from the OED's definitions—they include Churchillian and Churchilliana, but not Churchill (we already go beyond the OED by including surnames for their own sake). They include Ford for its figurative use, London and New York for attributive use and for many combinations such as London blue and New York slice (the OED includes such entries as a home for the run-in combination words—since we have combinations as independent entries, the root entries may not be required).
Energy is better expended coming up with some lexicographical or onomastic basis for inclusion of terms, and not repeating population statistics which belong in Wikipedia. Quite frankly, a straight transliteration like Rostov-na-donu, or a direct translation like Rostov-on-Don may not warrant inclusion in any case. Michael Z. 2009-07-28 02:34 z
Ростов-папа probably belongs in the dictionary, but I'm skeptical that Rostov-papa is anything but a transliteration of the Russian.[15] Michael Z. 2009-07-28 02:40 z
I am not requesting this addition (Ростов-папа), simply stating that the city name is important in many aspects. If you don't like the encyclopaedic statistics, see how often the word Rostov-on-Don is used on the web and in the dictionaries; it is a word. That's enough warrant for me to include it - there are 325,000 English pages for "Rostov-on-Don" and 607,000 English pages for "Rostov-na-Donu" on the web. I understand your opinion very well but I don't agree, so there is no need to give more details to explain your point. I think we should vote for a change in the CFI. I wasn't the one who restarted this discussion. I am not going to propose to include the encyclopedic or statistical information, only: meaning (city and location), etymology, spelling, pronunciation, gender or other grammar info (if applicable) and translations. Anatoli 03:22, 28 July 2009 (UTC)
Is there an English dictionary which includes serious lexicographical information about Rostov, Rostov-on-Don or Rostov-na-donu? (like etymology, year of attestation, etc.) I can't find anything but encyclopedic entries, with location, date of foundation, population. Michael Z. 2009-07-28 05:19 z
You are right, language dictionaries about proper nouns are very rare. But there are some books dealing with the etymology of place names (I own a dictionary for world place name etymologies, and another book on this subject, limited to France (not a dictionary)). There are dictionaries dealing with the etymology of surnames, or first names, etc. But the fact that proper nouns dictionaries are almost always encyclopedic (unlike wiktionaries) makes the inclusion of these words here still more useful. The fact that a name is the same as the name in the original language, or just a transcription, is not a reason to omit it: it's just like interjection (from French) or kimono (from Japanese). It's important to include them, especially for pronunciation. Lmaltier 05:45, 28 July 2009 (UTC)
A few quick links from online dictionaries: [16], [17], [18], [19], [20] (need to enter in the search box), [21], [22], [23]. I skipped some which looked more encyclopaedic. Of course, I also have bilingual dictionaries, which have place names sections or contain place names in the main body. The first indication that these are word entries is the word "noun" or "n.". Anatoli 05:54, 28 July 2009 (UTC)
Proper noun dictionaries are indeed rare, but those that exist are extremely valuable. I have a copy of Ekwall's Concise Oxford Dictionary of English Place-Names, 4th ed., and it's an incredibly valuable book for dated citations, historical spelling variation, and etymology. All these things and pronunciation are well within the scope of what a dictionary should include. --This unsigned comment was added by EncycloPetey (talk, contribs) at 15:38, July 28, 2009.
Browsing on Amazon I see a bunch of dictionaries with titles such as A Dictionary of Iowa Place-Names, Colorado: Place Names, Indian Place Names in Alabama... there is even one "Toposaurus". The biggest Croatian one-volume dictionary, Hrvatski enciklopedijski rječnik, has out of 175k headwords some 45k which are onomastics, with details such as pronunciations (especially in local idiom), etymology, distribution and derived terms (demonyms and relative/possessive adjectives - often very counter-intuitive formations). There is no reason why Wiktionary cannot do the same. I for one would like to know how you properly say "citizen of Rostov-na-Donu" and non-attributively "of or pertaining to Rostov-na-Donu" in English :) --Ivan Štambuk 15:30, 28 July 2009 (UTC)
In some languages, like Chinese, treatment of foreign names is now very serious. A correct correspondence of names in different scripts, with minimal information about what the actual name is, can only be achieved with a dictionary. This is a picture of a Chinese dictionary of foreign names (世界人民翻译大辞典): [24]. I bet Rostov-on-Don is there among much smaller place names. Sometimes Wikipedia can be used to get this information but it's not designed for it, doesn't have enough info (pronunciation, grammar, etymology) and article names don't have to match. Anatoli 06:02, 28 July 2009 (UTC)
Lmaltier, you're right. My interest is in Asian languages and transliteration of foreign names is not a simple sound substitution in Chinese. You must KNOW what a city is called in Chinese. There is more flexibility with people's names but most place names are standardised and are stored in a dictionary. Anatoli 06:13, 28 July 2009 (UTC)
So I'll ask again: does a single English dictionary have an etymology of Rostov? Or any dictionary? The dictionaries linked by Anatoli are exactly the kind of gazetteer entries I'm talking about—bad examples for us. I am interested in the kind of content Lmaltier's place-name dictionary has—can you quote an entry?
Elizabeth Mountbatten, HMCS Winnipeg, and Schwartz's Deli are nouns, so by Anatoli's argument they are “words” and belong in the dictionary. We're not making much headway here.... Michael Z. 2009-07-28 13:46 z
No, we're not making much headway. That's why we should vote to allow what is de facto in Wiktionary - place name entries exist, even if they don't have attributive usage. We have expressed our points but you keep using references to house, club or shop names, although I explained clearly we are talking of significant place names, leaving the discussion about significance open. You yourself, Michael, boasted of the major Ukrainian administrative cities in Wiktionary on your user page, which is good, good effort! Why do you have to discriminate against entries not from your background? I don't understand why you have to twist my words: when I say let's include large cities, you give these funny examples again (Schwartz's Deli). If McDonald's should be included, it's irrelevant to this discussion - we are talking about city names. If you insist on calling these gazetteer entries, let it be so. I don't see any damage in having them, if they are correct. Rostov's etymology is not proven and has too many theories, so no need to add this particular etymology; an etymology only needs to be added if it's known. Suffice it to say that it came from Russian into English. Please don't give more "McLeod YMCA" type of examples. Anatoli 14:24, 28 July 2009 (UTC)
We have already had one topic change. The subject at hand is not "city names"; it is not "Rostov-na-Donu"; it is "place names".
A dictionary is not an encyclopedia and isn't necessarily a gazetteer.
If you are confident that you can draft a proposal that will win a Vote on whether Wiktionary should be a gazetteer, either:
1. do so or
2. consider that more effort might be needed to make sure there is some consensus.
Merely ratifying the status quo is not what a Vote is for. If the lack of conformity to stated policy offends you, by all means, insert RfD or RfV tags on any or all the existing entries that have not already survived a challenge. Policy and guidelines are supposed to provide a direction in which Wiktionary will be going and require some thought beyond individual entries. If you are solely interested in an entry for Rostov-on-Don, such a conversation might not be very interesting to you.
I, for one, am looking forward to a coherent and comprehensive proposal about toponymic entries. I can see that there is a great deal of enthusiasm for transcribing and translating place names as well as the expected hometown boosterism. This might be a great way to broaden participation in en.wikt worldwide. We could do worse than simply allowing every WP entry to be an en.wikt entry, but with a one-line definition, hypernyms, see also link to WP, no external links(?), and, especially, etymology and translations. DCDuring TALK 15:01, 28 July 2009 (UTC)
If the lack of conformity to stated policy offends you, by all means, insert RfD or RfV tags on any or all the existing entries that have not already survived a challenge. - this would be a destructive and unreasonable course of action from his perspective. I'm sure that none of those who think that the current CFI policy for place-names is too strict (if not simply broken) is intent on wasting time chasing entries which somehow managed to escape the RfV process, and which obviously do not satisfy the "widely-understood attributive usage" criterion, as a form of "compensation" for their own entries being deleted, or simply for being unable to add new place-name entries in the first place. --Ivan Štambuk 15:17, 28 July 2009 (UTC)
My concern was that I was hearing many aspects of the issue being framed solely in terms of their consequences for inclusion of Rostov-na-Donu, Rostov-on-the-Don, and Cyrillic equivalents, as opposed to some thought about Wiktionary as a whole. This single focus had already diverted the topic of conversation once and was threatening to do so again. DCDuring TALK 16:48, 28 July 2009 (UTC)
I reply to Michael Z. The dictionary I mention is a Robert dictionary, published in 1994: Dictionnaire des noms de lieux (Louis Deroy, Marianne Mulon). It includes an entry for Rostov-sur-le-Don and this entry discusses its etymology. Toponymy is an important branch of linguistics (see w:Toponymy for more information). Lmaltier 15:41, 28 July 2009 (UTC)
Rostov-on-Don serves as an example or stand-in for pure toponyms: place names which are not part of the English language in the same way as New York has given meaning to New York slice, New York deli, etc. The dictionary is a technical document; I won't agree to sweeping policy changes based on anything de facto, and based on the way we've been running the show lately I don't think the consensus would either. I created entries for major Ukrainian and Canadian place names before I became familiar or concerned with these encyclopedic vs lexicographical issues and did some significant reading about them. The current policy requiring attributive attestation is, if not ideal, adequately inclusive for place names from a lexicographical standpoint, and not much different from the OED's. To conform, there are many place names which could be RFV'd and RFD'd, including entries I added, and I'll be glad to help with that effort once we decide what to do (or not to do). Even if we don't remove them, we should remove all encyclopedic information, like population, date of founding, significance, etc.
Where the policy is lacking is from the point of view of toponymy and its parent discipline, onomastics, in allowing a broader overview of names to be included. I believe we specifically allow given names and surnames, so why not toponyms? We need a proposal. Does the Dictionnaire des noms de lieux have an introduction explaining what was included and why? I've also seen some Ukrainian etymological dictionaries which include place names. How do we justify the inclusion of specific place names but not other specific entities?
Remember this is about the words, not the things. I don't care how big or important a thing is, that doesn't make its name belong in the dictionary, even if some popular dictionaries do include it for that reason. It's only important how important its name is. Michael Z. 2009-07-28 20:03 z
The issue is not about Rostov-on-Don/Rostov-na-Donu; these are the entries that were deleted, and it's not my home town. Yes, I feel sorry about the time wasted on adding translations and transliterations. I have created entries for places in Russian-, Chinese-, Japanese-, Arabic- and some other-speaking countries, and added translations for many others. Toponyms are not my main activity but I think they are important for Wiktionary. I honestly believe that Wiktionary should have toponyms - so that users could look them up in native and other scripts, and find out about the pronunciation and other linguistic information, such as etymology, gender, declension, etc. I don't have full confidence that the vote will succeed, nor do I have experience in putting forward a proposal. When I find out more, I may do so. Agree that encyclopaedic information can be removed. Anatoli 20:12, 28 July 2009 (UTC)
Reply to Mzajac: There has been a selection in the dictionary I mention, of course: it's impossible to cover all place names of the world in a single book. The criteria were 1. covering the whole world (trying not to favour European toponyms) and 2. giving priority to place names most likely to be searched: place names of French-speaking countries (the dictionary is in French), most notable places, and places with strange names (readers may be interested in them), e.g. Titicaca. But we are not a paper dictionary, we are not limited by space.
We should allow specific entities when they are words. Paris, Vespasian or Confucius are words, clearly, but not Excelsior Hotel, nor Adolphe Hitler (despite the translation table in Hitler). The key question is: what's a word? Lmaltier 20:25, 28 July 2009 (UTC)
You keep using the word "clearly", but in fact, to at least one editor (me), Paris, Vespasian, and Confucius are not words. They're names. As are, presumably, Excelsior Hotel and Adolphe Hitler. I can easily accept that Paris, Vespasian, and Confucius may be worth including, while Excelsior Hotel and Adolphe Hitler are unlikely to be; but I can't accept that this is some sort of trivial consequence of "all 'words' in all languages". —RuakhTALK 20:38, 28 July 2009 (UTC)
Some names are words, some names are not words, but combinations of words. I'm surprised that somebody feels that Paris is not a word... So, what's a word? Lmaltier 20:42, 28 July 2009 (UTC)
The slogan "all words in all languages" has been put into policy as WT:CFI. "All words in all languages" are fine words, in the nature of "give me liberty or give me death" or "liberty, equality, fraternity". "Liberty" turns out to be quite circumscribed by constitution, law, regulation, and social expectations. Similarly with our slogan. When it gets down to specifics we get to decide what we include. We decide based on reasons like benefit to non-contributing users, or interest of occasional and major contributors. If we think it is a good idea, then we should devote technical resources to getting some basic starter entries bot-loaded from Wikipedia to here in some form so we can rapidly present adequate coverage of what we want to cover. DCDuring TALK 21:15, 28 July 2009 (UTC)
I don't know what a word is, I just know that for me, Paris isn't one. However, the CFI are quite clear that we aren't restricted to just words — see especially Wiktionary:Criteria for inclusion#“Terms” to be broadly interpreted — and I have no objection to including certain types of names, as long as I know which types. —RuakhTALK 18:48, 29 July 2009 (UTC)
I think that we are restricted to words (+ characters, suffixes...). But I consider that table cloth is a word, in the linguistic (not typographic) sense of word. Lmaltier 19:02, 29 July 2009 (UTC)
I recently added Kashgar / Kashi, since they were in the news (because of the riots in Xinjiang) and are likely to be searched by users. Both names are used for the same place name, the latter is often preferred by Chinese media in English and some tourist agencies. Anatoli 20:34, 28 July 2009 (UTC)
DCDuring, specifically, I suggest to include:
1. all countries (already there) in the shortest form (France, not the French Republic, Transnistria / Pridnestrovie, not "Pridnestrovian Moldavian Republic", other languages may require the prefix - in Japanese it's always "沿ドニエストル共和国" with the word Republic, not used without it)
2. country capitals in their full but shortest acceptable form - Mexico City is OK along with Mexico (city), Santiago de Chile is a word, not the same meaning as Santiago.
3. City names, administrative centres (provinces, prefectures, states, oblasts, not necessarily counties or districts, unless they meet other criteria for inclusion). Say, e.g. major cities - 0.5 mln and over population or historically/culturally/politically significant place, e.g. Ramallah may be very small but it's always in the news. It doesn't mean that all HAVE to be created but I don't see an issue in allowing this. Do we have space issues? Maintenance is not a problem, since large city names can be verified easily. 0.5 mln and over is an example only, no need to quote me on this. Please suggest other criteria if you want.
4. All based on common sense, no need to twist or exaggerate. Hong Kong's capital district doesn't count as a capital city but Taiwan's Jhongsing (village) is (中興新村 / 中兴新村 Zhōngxīng Xīncūn). Anatoli 22:44, 28 July 2009 (UTC)
Also to DCDuring: all words, all languages is not only a slogan, it's a principle. Trying to stick to this principle, in a systematic way, would not save space, of course, but would save much (very, very much) time in RfV, RfD, Beer parlour, etc. discussions, and it would be much easier to understand and to accept (by editors and by readers) than making exceptions here and there (there will always be people disagreeing with exceptions). Lmaltier 17:24, 29 July 2009 (UTC)
Since we don't have to worry about space and can choose not to worry about maintenance, I suppose that the time consideration could govern.
I guess then the slogan principle of "all words in all languages" applies not just to "words", but also to "languages". Or are both of these in fact terms circumscribed by those who assert the authority to do so?
I have added to hidden categories Category:English headwords containing toponyms and Category:English etymologies containing toponyms for purposes of collecting samples of the value to our existing entries of toponymic entries. I perhaps shouldn't have added English. DCDuring TALK 18:59, 29 July 2009 (UTC)
This kind of interwiki debate is long enough that it should be summarized at Wiktionary:Embassy. JackPotte 04:07, 30 July 2009 (UTC)
Anatoli, the criteria you list are purely encyclopedic. It's Wikipedia's job to identify capitals and large or important cities, or the sites of events in the news. Imagine if WT:CFI#Given and family names only allowed names held by a million people, or only by famous living people. Why should criteria for place names welcome the inclusion of Jhongsing or Kashgar, but not Zion, Babylon, Babel, Cathay, Waterloo, Auschwitz, Vulcan (from Star Trek), Utopia, Scotland Yard, Camelot or 24 Sussex Drive? Michael Z. 2009-07-30 04:33 z
Michael, because these rules are created by the people and for the people and we can decide the significance of the entries here as we go along or right now. I don't see any issue with having these entries. Also, as I said, let's leave people's names and street addresses out of these discussions. The toponyms from fiction, not sure about these and I think it's not so relevant to this discussion either. Of course, the final rule may stipulate whether they need to be included. I don't think the number of real cities over 0.5 million is huge and can't be handled, even if we talk about other languages and different spellings. They are not going to appear from nowhere. I already stated my point of view that the criteria are not encyclopedic, except, perhaps the location and the fact that it's a capital or an administrative centre but it belongs to the meaning of the word, e.g. Ottawa means nothing if you don't say that it's the capital of Canada. I explained the criteria for the inclusion, not the entries themselves. The entries may stay linguistic as they are now. See Belgorod and please say what's wrong with the entry. This is the city I lived in the last few years before leaving Russia (not my home town) but there's no boosterism I was accused of - only dry dictionary info and the link to Wikipedia. Anatoli 05:20, 30 July 2009 (UTC)
I agree with Mzajac. Criteria you propose are encyclopedic. In a paper dictionary (even a pure language dictionary), such criteria are required, because space is limited. Here, they are not needed. The origin of the name of a small village may be much more interesting than the etymology of a capital. Yes, very few readers will be interested. But not fewer, probably, than for some obscure obsolete nouns or verbs. Lmaltier 05:54, 30 July 2009 (UTC)
What kind of criteria would you suggest, Lmaltier? Anatoli 06:37, 30 July 2009 (UTC)
None. Accepting all toponyms when they can be understood as words, which is the case for most toponyms (but not for odonyms, which should be accepted only when they are words, e.g. Champs-Élysées or Canebière). But I think that place names which are not really considered as words (e.g. Excelsior Hotel, World Trade Center, etc.) should not be accepted, except when including their definition would be useful for good linguistic reasons (I don't like the "attributive use" criterion, because it seems to me to be specific to English, but criteria of this kind could be used in such cases). Lmaltier 07:09, 30 July 2009 (UTC)
That was my idea as well and I certainly wouldn't oppose it, but I am trying to find a middle ground or a compromise, since the number of large and well-known places, which are therefore more likely to be sought, is smaller than the number of small villages. Exceptions, like Jhongsing village (seat of Taiwan's parliament), which I mentioned above, could be made for small places that have important political, historical or cultural value, again if people agree. Michael disagrees with me for a different reason: you want to widen the criteria, while he doesn't want proper names allowed without attributive usage. Anatoli 03:09, 31 July 2009 (UTC)
That is not what I don't want. Michael Z. 2009-07-31 05:40 z
I may have missed something, but to put it simply, you don't want to allow place names to be included if they don't meet the current CFI, that is, if they have no attributive usage and are not used in English expressions, right? Or does this rule apply only to place names that are made of more than one word, or rather have spaces/dashes between them? Didn't you say you don't want Wiktionary to become a gazetteer? Or was it the information in the entries that you were worried about? If you don't mind, please explain your position on place names again in simple terms. Please restrict to real place names. Anatoli 06:19, 31 July 2009 (UTC)
I assumed that until we decide to change the CFI, we are trying to conform to the CFI. What I want is a different issue. I want to include more place names, once we've figured out an acceptable way to do that and agreed to update the CFI. Michael Z. 2009-07-31 16:10 z
“Please restrict to real place names”—sorry, but I won't. We're talking about the definitions of real words here. I assume Wikipedia already categorizes real places separately from fictional or mythical places, and it's pointless for us to go down the same road. This discussion may be more productive if you stop couching your arguments in encyclopedic terms, and insisting that others do the same. Michael Z. 2009-07-31 16:26 z
And its reality aside, significant use of a name doesn't require any measurable quality of its referent, like a million residents, viz. Jericho, Chernobyl, Greenwich, Alexandria, Olympus, Oxbridge and the already-mentioned Babylon, Auschwitz, and Waterloo. Michael Z. 2009-08-01 05:52 z
You win, I lose. I am leaving the discussion and wiktionary. If my suggestions sounded imposing, I apologise. Anatoli 04:41, 3 August 2009 (UTC)
AEL

Well, I'm coming rather late to this discussion, and must admit I haven't read the entirety of the discussion above. But would something like Wiktionary:Votes/2007-08/Brand names of products 2 (=Wiktionary:Criteria for inclusion/Brand names) work for place names? In fact, though, I like to think of two possible types of definition lines for place names. The one is "A place name". This, I think, is good for any non-SoP place name (so not, e.g., New York) and is useful for the etymology, pronunciation, and other info. The other is the one that identifies a particular place, and that's the one that we need good CFI for (and that I suggested something similar to our brand-name criteria for). Do others disagree?​—msh210 20:28, 30 July 2009 (UTC)

A town and a different town with the same name, that makes two different senses. The definitions have to be different. And the general case is that there is at least some other information specific to the sense, even when the etymology is the same, e.g. the demonym, or a translation. See fr:Beaulieu for a typical (but rather extreme) example. Lmaltier 20:37, 30 July 2009 (UTC)
Yeah, I have a problem with that. In terms of inclusion, I'm starting to think that “every village” might work, just like “every surname” hasn't presented us with any problems.
But in terms of defining, I have a problem with each list of senses becoming a geographical catalogue, with items differentiated by nothing but their location. Imagine how many towns, villages, suburbs, and developers' tracts would be listed under Lakeview, Elmwood, Bridgeport, St. Paul, or Garden City. (These entries would become needless duplicates of Wikipedia's respective disambiguation pages: w:Lakeview, w:Elmwood, w:Bridgeport (disambiguation), w:Saint Paul (disambiguation), or w:Garden City) Perhaps hundreds in some cases, adding zero lexicographical value. Michael Z. 2009-07-31 16:10 z
I have proposed before, and will again, that any name in common use as a geographic identifier (e.g. Springfield) should be defined as pretty much exactly that: "a common name" for whatever type of thing it's a common name for. And put a link to the Wikipedia disambig page in case anyone wants to see all the specific instances. bd2412 T 20:22, 31 July 2009 (UTC)
I like that. Michael Z. 2009-08-01 05:25 z

To Ruakh: From the definition in w:Word, it's clear that Paris and Confucius are words. This page also states: In English orthography, words may contain spaces if they are compounds or proper nouns such as ice cream or air raid shelter. According to this sentence, ice cream and air raid shelter are words, and words may be proper nouns including spaces (my example: w:Le Mans). This page provides criteria from different authors for defining what a word is, but also insists on the fact that the definition of word is very elusive. Nonetheless, I propose to use this page as a basis for improving CFI. Lmaltier 08:29, 31 July 2009 (UTC)

I note as well that the definition of "word" in Wikipedia also makes no reference to attestation. Clearly we have blundered. Our policy violates our slogan or principle or whatever it is. DCDuring TALK 17:40, 31 July 2009 (UTC)
You know that this is not what I mean. Only existing words should be included. Lmaltier 17:55, 31 July 2009 (UTC)
But you seem to be writing as if outside sources have some implications for our choices. We clearly make our decision on what we include within the broad definitions of "word". In the broadest sense, almost anything used in an intelligible sentence is a word. Fine.
Beyond that, I think that we, as an entity with limited resources, especially technical ones, need to couch the discussion in terms of costs and benefits to types of users for including specific classes of "words". For example: "We should have entries for all place names which are of encyclopedic import at any Wikipedia in any language (subject to some test of the adequacy of their inclusion criteria?) because they need to be translated for the benefit of those projects, their wiktionaries, and others." "We should include all proper nouns that are used to define terms in en.wikt or in etymologies so that we can give users a quick link (using popups) instead of making them wait for a WP page download." "We should have our own encyclopedic criteria because it would be a way of involving new users in educational discussion with senior Wiktionarians on why those criteria are justified in excluding their neighborhood, village, or favorite natural wonder."
I don't know whether I agree with any of the statements above, but I could imagine having a discussion about such matters that seemed relevant to Wiktionary. DCDuring TALK 19:27, 31 July 2009 (UTC)
What we are considering here goes beyond “every word” in the conventional dictionary sense. We are considering going beyond non-encyclopedic entries in normal dictionaries, and beyond the OED. We are considering moving beyond lexicography into onomastics by adding names. I don't have a problem with this, but let's please understand the slogan for what it really means without exaggerating. Michael Z. 2009-08-01 03:48 z
Toponymy belongs to lexicography: toponyms belong to the vocabulary of a language. Onomastics too, but it's normal to exclude full names such as Winston Churchill from a language dictionary, because they are considered as two words rather than a single word. They are included only in encyclopedic dictionaries, because the only interesting data to be provided are encyclopedic, not linguistic. Lmaltier 06:24, 1 August 2009 (UTC)
Practically no non-specialized dictionary has toponymic entries about place names at all, but many, especially American ones, have encyclopedic entries about places added, for marketing purposes. Although they throw in the pronunciation, they almost universally include specific references and statistics like population, history, etc., while ignoring etymology. This is not lexicography or onomastics, this is pure encyclopedic supplement.
The OED is an exception, it only includes place-name entries for three reasons, as far as I can tell. 1—if a place name has become a word on its own, e.g., Moscow, which the OED doesn't define as a city at all (although of course this is mentioned in the etymology and an etymological note). 2—if a place name is widely used attributively, with a meaning beyond designating location, e.g., New York's meaning restricted specifically as “Only in attr. use [...] Designating things originating in, characteristic of, or associated with the city [...]” with the example New York dressed (of poultry). 3—since, as in most print dictionaries, the OED includes compound words as run-in entries under a main headword, it includes place names in main headings as a place to house collocations, for example London: “the name of the capital of England, used attrib. in various special collocations:” followed by about two-dozen collocations including London broil, London fog, London paste. This is a purely lexicographical approach, omitting encyclopedic and onomastic information as much as possible.
We aren't an encyclopedic dictionary—or at least we shouldn't be diluting our work with a half-assed copy of Wikipedia material.
Sorry I'm repeating this like a broken record. The distinction of encyclopedic material seems to be unclear to some editors. There are lots of explanations in lexicography books, e.g. Oxford Guide, p 186–89, Manual of Lexicography, p 198–99. Michael Z. 2009-08-01 16:13 z
I agree on your analysis of what paper dictionaries usually do. Yes, OED, and Webster's, and almost all other paper language dictionaries include toponyms only in some specific cases, because they choose not to include proper nouns, and toponyms are proper nouns. It's easy to find dictionaries which include proper nouns but, as a rule, they are encyclopedic dictionaries: only specialized paper dictionaries provide linguistic info about toponyms. But this is a good reason to try to study all toponyms, with an exclusively linguistic point of view: if somebody wants to find the pronunciation or the etymology of the name of a small Albanian town, he won't be able to find it anywhere, but here. Lmaltier 16:29, 1 August 2009 (UTC)
I'm starting to come around to your view. I can't think of a valid reason to disqualify any village at all, if we decide to allow place names. I don't see major problems with the place name entries already present.
But how would we discourage the growth of encyclopedic entries? I'd like to limit the definitions of place names to the bare minimum information required to identify them. Certainly I'd ban statistics like population figures, areas, geographic coördinates, and so on. I'd also like to discourage statements of notability, like “biggest city in Ohio,” unless the quality contributes to the meaning or connotation of the name.
And how do we check the proliferation of lists of places? Everyone will want to add their home town.
Finally, how would we address prescriptivism? Many countries have official lists of place names and their spellings. Obviously we would include anything that is attributed by our rules, but should we also note such official status? Michael Z. 2009-08-03 02:25 z
Geographic coordinates should be included for cities and smaller population centers. They are the one aspect of a place that does not change. The name and its spelling change; the population and relative size change; even the parent country can change as national borders move. However, it is very rare for the geographic coordinates of a city, town, or village to change. The coordinates therefore become the most reliable and consistent means of identifying a particular population center. --EncycloPetey 15:36, 4 August 2009 (UTC)
I'm skeptical. Geographical centres wander, cities grow, boundaries change both organically and by decree, districts are annexed and merged. The boundaries and even meanings of various bodies of water, regions, and neighbourhoods are undefined, or variously defined. And as we know, names and their meanings change.
And in practical terms, we're talking about eventually cataloguing a (theoretically) measurable property about tens of thousands of things (not terms). Sounds encyclopedic to me, and duplicates Wikipedia. Michael Z. 2009-08-04 17:22 z
I agree. Some amount of encyclopedic information may be inevitable — I think it would be perverse to include Atlantis, for example, without mentioning that it's presumed mythical, or Xanadu without mentioning that it's historical — but geographic coordinates seems a bit extreme. (And "presumed mythical"– and "historical"-type things might best be covered by sense labels, anyway, rather than going into the definitions proper.) —RuakhTALK 17:59, 4 August 2009 (UTC)
I feel that a map showing where the place is located is clearer, and more appropriate in a language dictionary than precise geographic coordinates, which are appropriate in an encyclopedia. As for encyclopedic information, it should be limited to the definition, i.e. what is necessary to understand the word. This is true for all words, including common nouns (e.g. cat cannot be defined as Animal. or square as Polygon.). Actually, the definition is the only part which should be present both in an encyclopedia and a language dictionary (and might be common to both in most cases). Lmaltier 18:39, 4 August 2009 (UTC)
A map is still specific to a place, not to a word or name. If we start adding geolocations or maps, then someone will make it their weekend hobby to take a definition like Alexandria (2. “A number of cities bearing the same name, including Alexandria, Virginia, USA”), and turn it into an array of map graphics. This does not serve lexicography.
All we should do is link to Wikipedia, and let it be dealt with over there (which it already is, with w:Alexandria (disambiguation) listing 49 places. Forty-nine!). Michael Z. 2009-08-05 06:42 z
I don't think the size, population, or fame of a place should be the key criteria for its inclusion. We include very rare, narrowly used words and specialized words. Paper dictionaries don't have many place names in them, but I think that is partly because they have to be printed and this limits total content. We need not observe the limitations of printed dictionaries.

John Cross 07:06, 18 August 2009 (UTC)

Strategic Planning Wiki

The Wikimedia Foundation has begun a year long phase of strategic planning. During this time of planning, members of the community have the opportunity to propose ideas, ask questions, and help to chart the future of the Foundation. In order to create as centralized an area as possible for these discussions, the Strategy Wiki has been launched. This wiki will provide an overview of the strategic planning process and ways to get involved, including just a few questions that everyone can answer. All ideas are welcome, and everyone is invited to participate.

Please take a few moments to check out the strategy wiki. It is being translated into as many languages as possible now; feel free to leave your messages in your native language and we will have them translated (but, in case of any doubt, let us know what language it is, if not English!).

All proposals for the Wikimedia Foundation may be left in any language as well.

Please, take the time to join in this exciting process. The importance of your participation can not be overstated.

--Philippe

(please cross-post widely and forgive those who do) --ARAJ 13:58, 30 July 2009 (UTC)

Should be a separate topic --ARAJ 13:58, 30 July 2009 (UTC)

New top line on entries

The Unchanged/Paper dictionary/Toggle sections line is not functional and is not exactly self-explanatory.

1. When was it discussed?
2. On whom has it been tested?
3. Is it a default for anyone?
4. Can it be turned off?

Did I just not get the e-mail? DCDuring TALK 15:27, 31 July 2009 (UTC)

Please elaborate. I have no idea what you're talking about. -- Prince Kassad 16:03, 31 July 2009 (UTC)

Across the top of every entry, immediately beneath the headword, I see a line containing "Unchanged Paper Dictionary Toggle Sections", the last two blue-linked. Based on your reaction, I checked my personal monobooks, which show no change. The first I saw this was after the servers went down. If this is merely a personal technical problem, I will take it to the Grease Pit or do a bug report, though it looks more like an errant test of an early draft of something than a true bug. I wonder if someone has modified something my monobooks use but yours do not. DCDuring TALK 16:51, 31 July 2009 (UTC)

I don't see anything, and I've not modified anything in the monobook. Lmaltier 17:00, 31 July 2009 (UTC)
Do you happen to have the PREFerence "Use User:Conrad.Irwin/parser.js to provide two different views of most pages."? Because if I check that, I too get that line (but in my browser - firefox 3.5 - that's everything I see of the page...) \Mike 18:02, 31 July 2009 (UTC)
Thanks, Mike. I also use FF3.5 but by clicking options I eventually got a satisfactory view, although I've deselected it now. It seems to be gone.
Has that been changed lately? I had set it weeks (months?) ago. Can it really take weeks to see the effect of a change in preferences? I had rebooted and restarted my browser a couple of times since then, I think. Does FF's session restore nullify what I think of as a restart? DCDuring TALK 18:59, 31 July 2009 (UTC)
I fixed a bug in it yesterday, so it's just been failing to work for that long. What was your impression of it, or were you just annoyed that it was there? Conrad.Irwin 19:12, 31 July 2009 (UTC)
This is the first I have seen of it. When it first appeared, I could only see the new line underneath the headword. When I clicked on blue links it took too long for anything to happen, so I quickly lost track of which click generated the response I saw. The names for the actions are not intuitive for a first-time user of this aspect of the interface, even one who had heard of "paper dictionary" view. That the paper dictionary view seemingly gave no output whatsoever was not much of a help. For entries that only appear in one language it seems silly to make a user have to click anything. Why do L3 section headings appear in red in any view?
Finally, I think it is a truly bad idea for en-wikt to appear without a fully expanded English language section as immediately visible in any default setting. We will almost certainly rapidly kill off any use by English monolingual users, which "imbeciles" represent a noticeable fraction of users. I, for one, would find OneLook to be a superior dictionary reference gateway, and when combined with google search for usage examples and COCA for grammatical analysis vastly superior.
As you can tell, the above is a completely unrestrained reaction to the current state of a work in progress with no recognition of what this might become. To someone more experienced than I with such works in progress, the glitches I found may seem insignificant or easily overcome. I hope they are. I use FF 3.5 with many open tabs on consumer cable broadband on a dated laptop. HTH. DCDuring TALK 20:17, 31 July 2009 (UTC)
Thanks for the feedback, I'd agree with all of it. I wrote parser.js at the end of 2007 to explore how feasible it would be to rearrange pages automatically so that all the content associated with each definition sits with that definition. It is very technically orientated and not at all pretty, but the idea remains. The paper-view section should give you all the definitions concatenated into one paragraph with none of the other content visible; I just added that for fun. The prototype showed then, and still seems to show now, that rearranging our pages quite drastically would be possible automatically (maybe as much as 95% of the time if a bit more work is put into the algorithm). I know some people were keen at the time on getting it into an end-user-usable state, but as with so much other stuff, no-one has yet found the time. Conrad.Irwin 20:32, 31 July 2009 (UTC)

Example of when a non-idiomatic meaning can be good

pearl necklace came to mind. If you delete the "a necklace made of pearls" meaning, it can confuse the reader, who may think that pearl necklace refers only to the sexual act, and not to a necklace made of pearls. Another non-English (and cleaner) example is mettre en bouteille. If you delete #1, people may think it doesn't refer to putting something in a bottle. The three solutions I see are:

1. Delete any and all non-idiomatic meanings
2. Allow literal meanings with a gloss such as (literally)
3. Allow literal meanings in the etymology only (I do this on fr: quite a lot.)

--Mglovesfun (talk) 20:24, 31 July 2009 (UTC)

This would be an issue worth resolving. Once resolved with broad consensus it ought to be recorded somewhere accessible so it gets used. I would like to add:
4. Insert a usage note referring to multiple possible non-idiomatic meanings if there are such, say, more than two. DCDuring TALK 21:11, 31 July 2009 (UTC)
It would help to decide what we want to do if we could first establish a suitably diverse list of situations where we think this issue is relevant. I can remember having a similar conversation over apple pie, where we decided to keep the literal meaning as a definition. I don't like the idea of restricting the literal meaning to the etymology, because we already have users complain about not being able to find the definitions :P However, I can think of cases where the literal meaning is strictly "sum-of-parts" in a way that I wouldn't want to keep the literal definition, such as for hole in the wall, or the literal definition of orange blossom as a blossom that is orange. Hence, I think we need a large enough list to look at to determine what we want to do. --EncycloPetey 15:04, 1 August 2009 (UTC)
I agree with EncycloPetey. But it's not easy. Lmaltier 15:22, 1 August 2009 (UTC)
A couple of additional cases: no end and make for. I think these two illustrate cases where having all the SoP meanings is clearly counter-productive. In [[make for]] I inserted a usage note that illustrates an approach. [[no end]] has the additional problem that a single usage note would not necessarily be visible as the user hunts for a meaning because the meaning is spread over several parts of speech (which necessarily take up vertical screen space, especially with our current layout). DCDuring TALK 15:42, 1 August 2009 (UTC)
I think Category:English idioms, Special:WhatLinksHere/Template:figuratively, and Special:WhatLinksHere/Template:literally can provide additional test cases. I'll see if I can make a shorter list from some intersection of these. DCDuring TALK 15:54, 1 August 2009 (UTC)
I'd like to see something like this:
1. Used literally; see pearl, necklace.
which makes it clear that the primary use is literal, while not wasting readers' time and attention on a full explanation of that use.
RuakhTALK 03:17, 4 August 2009 (UTC)
That's pretty good; I like it. Mglovesfun (talk) 16:00, 7 August 2009 (UTC)
It's not a good idea because I didn't think of it first. Actually, I just wish that I had. It seems like a quite straightforward way of addressing all of these entries, no matter how few the literal meanings. If this doesn't meet with valid objections, where can it be made usefully available as a "guideline" or something (ie, no vote!)? DCDuring TALK 16:29, 7 August 2009 (UTC)
It's a fine idea, but how and where do you indicate the required sense(s) of each component term? I would argue that orange blossom is nothing more than a (mildly, theoretically) ambiguous SoP collocation. But if it warrants its entry, both definitions are equally "literal" and should simply point to different senses of orange.--ARAJ 10:27, 11 August 2009 (UTC)
We are trying to address a limited problem. At some point users have to think for themselves. I think the idea is that the user would have to determine for themselves which senses of each word might apply. The point is simply to remind users of the need to do so. Users often seem to not know or notice that each individual blue-linked word (or sub-idiom) on the inflection line gives access to useful content.
The "combinatorial explosion" of possibilities of meaning is not something that can be readily addressed by more and more "complete" lexical coverage of collocations. Users may find the entry for "take" daunting and the entry for "walk" less so, but an entry for "take a walk" that had all possible senses might turn out to be longer than the entry for "take", without offering as much benefit to the user the next time a different collocation of "take" came up.
At some point in the future it might be possible to make more guesses about what a user really wants, but for now, we are limited to basing our content on what they type in the search box or what the search engine provides us and any user preferences that might be in effect (which may be an uninformative default for users not logged in). Once they are in our system they have to be able to select links. We probably best serve users by not having too many links, bolding links of especially high value, and perhaps distinguishing multi-word links from single word links. This last item would probably require a technical solution, perhaps providing a faint underlining under each linked unit at least in inflection templates. DCDuring TALK 14:36, 11 August 2009 (UTC)
I am oblivious to your subliminal suggestion that I should "take a walk" and will not, therefore, take umbrage. Your comments about a possible "combinatorial explosion" deserve consideration. It seems to me that it is far less of a problem than you imagine. The point is that "take a walk" certainly could have any meaning implied by any of the senses of the transitive verb take and the noun walk, but it almost certainly doesn't. So it is unhelpful to point the hapless user at the words "take" and "walk" without indicating (where applicable) which etymology and/or part of speech they should be looking at. And if, as will generally be the case, a particular sense of each word is most likely to be applicable, then it would be helpful to indicate it. I'm sorry if this is not as simple as you would like it to be, but it is a difficult aspect of language to deal with.
Having said that, I'm inclined to agree that, in many cases, this additional information would not be helpful. And this is something the mooted guidelines should address. So you may wish to deem my perfectly valid objection to a perfectly reasonable proposal unworthy of further consideration.--ARAJ 01:49, 12 August 2009 (UTC)