Hello Francis.

I've noticed that you've copy/pasted a few Croatian entries of mine to Bosnian and Serbian, and Serbian ones to Bosnian and Croatian. That is generally completely OK, with the exception of a few problems I noticed in the process, of which I would like to inform you:

  • Standard Croatian is based on the Ijekavian reflex of the Common Slavic jat phoneme. Forms written as mesto occur only dialectally (mostly in Kajkavian and Northern Čakavian speeches), and as such represent sub-standard forms which should be corroborated by citations, since they are not present in e.g. normal dictionaries. Of course, they're no less "Croatian" than the officially codified subdialect, and I myself have been creating lots of those in the Ikavian reflex of jat (e.g. misto with citations of Ikavian Čakavian and Štokavian Croatian writers), and plan to do so for every other instance where jat occurs (and not only with the reflex of jat, but of *t'/*d', strong jer vocalized as *e not *a etc. - all to the ==Alternative forms== section). A small additional problem is that this written <e> as an Ekavian jat reflex is phonetically distinct from /e/ in Serbian Ekavian, and dialectologists mark this with special diacritics, but for the time being the usual notation is OK (i.e. until the Institute publishes the massive Rječnik hrvatskoga kajkavskoga književnoga jezika on the Web, which should be very soon).
  • This "Bosniak language" (which Bosniaks call bosanski "Bosnian"), newly invented by Muslims in the 1990s, exists only as a codified language (some would say, a mixture of Croatian and Serbian), so entries like mesto, which moreover don't even appear in the organic idioms spoken by large populations of Bosniaks, should not be formatted as ==Bosnian==.
  • I noticed that you've copied the Croatian inflection template {{hr-decl-dan}} to a Serbian entry. All of those special Croatian inflectional templates are obsolete and should be substituted with the general one, {{hr-decl-noun}}, which accepts all possible inflected forms one by one. Why is that? Because there is this little thing called pitch accent which can unpredictably alternate within the paradigm, and with more than 300 morphological-accentological paradigms existing for nouns only, it would be pointless to create 300+ templates to cover all the corner cases. One day I'll write a bot that will add inflection for all Croatian words (verifying it against HML) and in another pass generate the appropriate accents.

Cheers, and don't hesitate to ask me about anything you find confusing. --Ivan Štambuk 07:07, 3 December 2008 (UTC)

There are {{bs-decl-noun}} and {{sr-decl-noun}}. Templates for inflection tables vary from language to language in layout (one day they'll probably be generally customizable via CSS, but lots of general issues need to be settled before that). No one can copyright inflected forms of a lemma. It says (c) on the site, but it cannot be proved that you actually got it from there. Though in this particular case, if necessary, permission might as well be asked from Mr. Tadić, who is behind the engine. Also, before bot-adding (for which you need to start a vote expressing the rationale and demonstrating proper functioning), paradigms are usually checked by a literate native speaker or someone similarly knowledgeable. Most of the HML-generated forms can be reused for Serbian too, taking into account differences such as 'pisaću' vs. 'pisat ću' etc. --Ivan Štambuk 08:16, 3 December 2008 (UTC)

I don't know whether it would be infringement or not, but I can e.g. use their database to extract inflectional pattern of a lexeme, and generate via my own algorithm. But I'm sure that possible copyright issues will be resolved when the time comes. The engine is non-commercial, published by an academic institution, after all..
požar seems to be present in all Slavic languages, but I can't find a ref. for a Proto-Slavic reconstruction of *požarъ (everywhere I looked it's described as po+žar, which is in fact a wrong etymology, as it's an inherited word of the Common Slavic lexical stock, not a later morphological derivative). I usually do add etyms for all Slavic languages that I see have WT entries, but when I edit one section I just add it to that one. --Ivan Štambuk 09:23, 3 December 2008 (UTC)
  • [1] - Note that Čakavian is Croat-only dialect and has no place in Serbian language entry. --Ivan Štambuk 02:03, 6 December 2008 (UTC)
  • [2] - also note that per WT:ELE language sections are supposed to be separated by ---- and sorted alphabetically (this latter is enforced by AutoFormat bot). --Ivan Štambuk 02:08, 6 December 2008 (UTC)
  • I am 100% literate only in Croatian, and sincerely don't have much desire to indulge in the thorough expansion and amending of Serbian entries, esp. with the dual-script maintenance hell and the subconscious doubt of "is this literary Serbian of today?". Sometimes when I'm in the mood, I do create bs/sr entries as a part of a Slavic cognates series (cf. the contents of Category:Proto-Slavic language). Essentially, my stance is that cloning the existing Croatian entries into bs/sr subsections doesn't add any real new content to the WT, and hence is not a type of mental activity worth pursuing compared with the millions of other interesting tasks pending my intervention that this project presents. --Ivan Štambuk 15:53, 6 December 2008 (UTC)
Because Serbian is written in both Cyrillic and Latin, and that is the official policy of the standard language guaranteed by the constitution (though the exact formulation in the constitution gives mild preference to Cyrillic IIRC). Cyrillic is the traditional script used for centuries, closely associated with Serbian national pride, literary tradition, Orthodoxy and Eastern cultural provenance. The usage of Latin script is chiefly a result of various "unification" efforts with Croatian envisioned by 19th-century Romantic and naive "Illyrians", but ultimately coming to be enforced by certain totalitarian regimes whose names and practices we don't need to mention explicitly here and now. Most native Serbs are familiar with both scripts, but 90% of Serbs in the diaspora can only read Latin script, and they represent a much more numerous potential target audience for Wiktionary than domicile Serbs (suppose they want to improve their language skills). Though when it comes to that, the only 2 Serbian editors editing from IPs I've managed to detect during the last year acted in a highly disturbing manner (one of them is a well-known sci.lang troll allergic to etymologies of Ottoman Turkish LWs, who has been replacing them with his imaginative Slavic-root alternatives; the other one had a high appetite for adding words of exclusively Croatian usage as ==Serbian==, and was probably not that literate as he made numerous errors - just the other day he added a word meaning 'fist' but translated it as 'hand' - mistakes on basic words such as this usually indicate non-native proficiency). But nevertheless, the potential for "passive users" is still out there, even though it cannot be directly measured (90% of Web users just use content, and don't attempt to create it).
You can indeed generate Latin script out of Cyrillic for Serbian, but not vice versa: there is no way for a machine to know where an 'nj' sequence in writing is the single phoneme written њ in Cyrillic and where it is /n/+/j/. Admittedly, words like this are rare (инјекција, надживети..), but they should nevertheless be checked by a human.
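To make the one-way nature of the conversion concrete, here is a minimal Python sketch. The names `cyr_to_lat` and `needs_human_check` are made up for this illustration and are not an existing Wiktionary tool. It assumes the standard 30-letter Serbian alphabet, and it treats any Latin word containing 'nj', 'lj' or 'dž' as a candidate for manual checking, which is a deliberately cautious assumption.

```python
# Serbian Cyrillic -> Latin: every Cyrillic letter has exactly one Latin spelling,
# so this direction is a plain table lookup.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyr_to_lat(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in CYR_TO_LAT:
            out.append(CYR_TO_LAT[ch])
        elif ch.lower() in CYR_TO_LAT:
            # Uppercase Cyrillic letter: capitalize the Latin result (Љ -> Lj).
            out.append(CYR_TO_LAT[ch.lower()].capitalize())
        else:
            out.append(ch)
    return "".join(out)


# Latin digraphs that may be one Cyrillic letter (њ, љ, џ) or two (нј, лј, дж).
AMBIGUOUS_DIGRAPHS = ("nj", "lj", "dž")


def needs_human_check(latin_word: str) -> bool:
    """True if the Latin spelling cannot be mapped back to Cyrillic mechanically."""
    w = latin_word.lower()
    return any(d in w for d in AMBIGUOUS_DIGRAPHS)
```

Both инјекција and коњ come out with an 'nj' in Latin, which is exactly why the reverse direction cannot be decided from spelling alone; `needs_human_check` only flags such words, it does not resolve them.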
Note that per WT:ELE entries can have multiple etymologies, multiple PoS sections, and an arbitrary number of meanings for each one, each meaning being marked with context labels, or additional qualifiers such as {{pf}}/{{impf}} in the inflection line. Not to mention additional sections for inflection, related terms and so on. All of that is best done manually by means of the "copy-paste" method where applicable, for a template such as that wouldn't scale much beyond the creation of the most primitive type of entries. --Ivan Štambuk 17:14, 6 December 2008 (UTC)
Well, Cyrillic script often serves in practice as a badge of Serbdom, esp. when confronted with Roman script (the great "reformer" Karadžić's role in it, and the "najsavršenije pismo na svetu" arguments are often emphasised). From my experience, the younger population is not that bothered with the cultural prominence of Cyrillic, and some are utterly disgusted by it (like the owner of the biggest Serbian [and Balkans] IT forum [3] who made a script that converts submitted Cyrillic to Latin. A few years ago there was even a "riot" amongst some of the moderators and users (proud Serbs :) over not being able to write in Cyrillic, and some important folks left.)
I am not familiar with any online Serbian language corpus. If you have a list of extracted and properly spelled Serbian lexemes in Cyrillic that can be compared to a Roman-script list for the purpose of extracting the aforementioned exceptions, feel free to dump it somewhere on the Web :)
Try writing any kind of non-trivial code in the MediaWiki templates, one of the most horrid and degenerate mini-PLs ever conceived (the only one I know that doesn't use lazy evaluation for conditionals), and you'll see how simple things can get ugly. c/p is much easier. --Ivan Štambuk 17:57, 6 December 2008 (UTC)

Note: When a page has sections for more than one language, Wiktionary puts those languages in alphabetical order by their English name, with Translingual and English coming first (if there is such a section). So, if a page has a Croatian, Serbian, and Bosnian section, the languages should be ordered: Bosnian, Croatian, Serbian. --EncycloPetey 21:30, 4 December 2008 (UTC)

I think there's a bot that does that after every edit, isn't there? So it's no problem if I keep doing it the easier way. - Francis Tyers 22:01, 4 December 2008 (UTC)
My primary language is English. Yes, there is a bot that makes changes, but it doesn't always make them correctly (just 99.99% of the time). If you make an error in header level (which is easy to do), the page could be reformatted incorrectly by the bot. It is therefore better practice to put the items in order to begin with. --EncycloPetey 22:14, 4 December 2008 (UTC)
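For anyone scripting entry creation, the ordering rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread (Translingual, then English, then everything else alphabetically); the function name is made up, and it is not the code the AutoFormat bot actually runs.

```python
# Languages that sort ahead of everything else, in this fixed order.
PRIORITY = {"Translingual": 0, "English": 1}


def order_language_sections(names):
    """Return language section names in the order described in this thread."""
    # Sort key: priority bucket first, then the English language name.
    return sorted(names, key=lambda name: (PRIORITY.get(name, 2), name))
```

Called with `["Croatian", "Serbian", "Bosnian"]`, it returns the order given in the note: Bosnian, Croatian, Serbian.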

Hi, If you use {| class="inflection-table" then red-links will appear black without the need for #ifexist trickery. Conrad.Irwin 10:14, 26 April 2009 (UTC)


Database dump analysis

Am I correct in assuming from Vahagn's talkpage that you are able to do database dump analysis? I don't want to pressure you, but if you can, there are some (unfortunately non-Armenian) lists of entries that I could really use. Well, if you're able and interested, I'd be glad to hear it, but even if you just don't want to, that's fine too. Thanks! —Μετάknowledgediscuss/deeds 15:28, 2 December 2012 (UTC)

(I'll respond here, you can respond wherever is convenient) Thank you so much! The lists I need are of the following types:
  1. entries in a certain language with a certain character in the pagetitle
  2. entries in a certain language with a certain character in the pagetitle when that character is not next to or combined with another character
  3. entries in a certain category that lack a certain string
Are you able to do any of those? —Μετάknowledgediscuss/deeds 15:38, 2 December 2012 (UTC)
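The first two list types could be sketched roughly as below, assuming the page titles for the language have already been pulled out of the dump into a plain list. The function names are hypothetical, and "not combined with another character" is interpreted here as "not followed by a Unicode combining mark after NFD normalization", which is a guess at the intended meaning rather than something stated in the request. The "not next to another character" part is not covered.

```python
import unicodedata


def titles_with_char(titles, char):
    """Titles containing `char`, including where it is part of a precomposed letter."""
    return [t for t in titles if char in unicodedata.normalize("NFD", t)]


def has_bare_char(title, char):
    """True if `char` occurs at least once without a combining mark right after it."""
    s = unicodedata.normalize("NFD", title)
    for i, ch in enumerate(s):
        if ch != char:
            continue
        nxt = s[i + 1] if i + 1 < len(s) else ""
        # unicodedata.combining() returns 0 for non-combining characters.
        if not (nxt and unicodedata.combining(nxt)):
            return True
    return False


def titles_with_bare_char(titles, char):
    """Titles where `char` appears on its own, not carrying a diacritic or point."""
    return [t for t in titles if has_bare_char(t, char)]
```

`char` is assumed to be a single base code point. Normalizing to NFD first means that a precomposed letter and the same letter written as base plus combining mark are treated alike.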
The third one is easiest. What category and what string ? - Francis Tyers (talk) 15:45, 2 December 2012 (UTC)
Awesome! There's a few... probably the big ones are members of Category:Yiddish nouns without the string {{yi-noun| and members of Category:Yiddish adjectives without the string {{yi-adj| or {{yi-adjective|. Thanks! —Μετάknowledgediscuss/deeds 15:49, 2 December 2012 (UTC)
(If you don't want to fill up your talkpage, you can put them on User:Metaknowledge/Yiddish headword.) —Μετάknowledgediscuss/deeds 16:17, 2 December 2012 (UTC)
  • Nouns: done.
  • Adjectives: done
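For reference, the third kind of list (members of a category that lack a given string) boils down to a filter like the one below. This is a sketch and not the script that produced the lists above: it assumes the dump has already been parsed into a dict of title to wikitext, and that the category members are available as a separate list of titles, since category membership often comes from templates and cannot always be read off the raw page text. All names in the code are hypothetical.

```python
def entries_lacking(pages, category_titles, needles):
    """Return the category members whose wikitext contains none of `needles`.

    pages           -- dict mapping page title to its wikitext
    category_titles -- iterable of titles belonging to the category
    needles         -- strings to look for, e.g. ["{{yi-adj|", "{{yi-adjective|"]
    """
    missing = []
    for title in category_titles:
        text = pages.get(title)
        if text is None:
            # Title listed in the category but absent from this dump: skip it.
            continue
        if not any(needle in text for needle in needles):
            missing.append(title)
    return sorted(missing)
```

For the adjectives, passing both `{{yi-adj|` and `{{yi-adjective|` as needles means a page is listed only if it uses neither template.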

Again, thank you so much for your gracious help! —Μετάknowledgediscuss/deeds 03:37, 3 December 2012 (UTC)

That should be it, if you find any bugs, let me know and I'll try and regenerate. - Francis Tyers (talk) 14:17, 3 December 2012 (UTC)
Well, it seems suspicious that there are so few entries, especially so few adjectives, but I guess that's a good thing (less work for me to do!). Thank you again! —Μετάknowledgediscuss/deeds 04:58, 4 December 2012 (UTC)

Armenian corpus

Hi! This analyzed corpus is out of copyright, if you're interested. --Vahag (talk) 17:15, 8 December 2012 (UTC)

Wow!!!!! Nice :D It will be really useful for testing the analyser! - Francis Tyers (talk) 17:37, 8 December 2012 (UTC)