User:LA2

LA2 is the username for Lars Aronsson, Sweden. See w:user:LA2.

Wiktionary:Babel
sv Den här användaren talar svenska som modersmål.
en-3 This user is able to contribute with an advanced level of English.
de-2 Dieser Benutzer hat fortgeschrittene Deutschkenntnisse.
da-1 Denne bruger har et grundlæggende kendskab til dansk.
no-1 Denne skribenten har litt kjennskap til norsk.
ru-1 Этот участник владеет русским языком на начальном уровне.
For my cut-and-paste convenience:
==Swedish==
===Etymology===
{{compound|a|b|lang=sv}}
====Conjugation====
====Declension====
====Related terms====
====Usage notes====
===References===
* {{R:SAOL|åäö|%e5%e4%f6}}
* {{R:SAOB online|åäö}}
====Translations====
{{trans-top|}}
{{trans-mid}}
* Swedish: {{t|sv|}}
{{trans-bottom}}
===Adjective===
{{head|sv|adjective form}}

# {{sv-adj-form-abs-indef-n|}}
# {{sv-adj-form-abs-def-m|}}
# {{sv-adj-form-abs-def+pl|}}
# {{sv-adj-form-comp|}}
# {{sv-adj-form-sup-pred|}}
# {{sv-adj-form-sup-attr|}}
===Adverb===
{{head|sv|adverb}}
===Noun===
{{head|sv|noun form}}

# {{sv-noun-form-indef-gen|}}
# {{sv-noun-form-def|}}
# {{sv-noun-form-def-gen|}}
# {{sv-noun-form-indef-pl|}}
# {{sv-noun-form-indef-gen-pl|}}
# {{sv-noun-form-def-pl|}}
# {{sv-noun-form-def-gen-pl|}}
===Verb===
{{head|sv|verb form}}

# {{sv-verb-form-pre|}}
# {{sv-verb-form-past|}}
# {{sv-verb-form-sup|}}
# {{sv-verb-form-imp|}}
# {{sv-verb-form-inf-pass|}}
# {{sv-verb-form-pre-pass|}}
# {{sv-verb-form-past-pass|}}
# {{sv-verb-form-sup-pass|}}
# {{sv-verb-form-prepart|}}
# {{sv-verb-form-pastpart|}}

Diary

August 26, 2020: I start working on Appendix:Swedish corpus, based on my 2017 presentation.

June 2017: I submit a proposal for a presentation at the Wikimedia Central and East European conference in Warsaw in September. It is approved.

May 2017: I start to contribute to Ukrainian Wiktionary (my user page).

December 14, 2015: CodeCat is renaming several Swedish inflection templates for no apparent reason, leaving bewilderment and fatigue. For example, sv-noun-reg-er becomes {{sv-infl-noun-c-er}}.

October 2014: I start to contribute actively to Russian Wiktionary (my user page).

May 4, 2013: I should read these sometime:

  • Ladislav Zgusta, Manual of Lexicography (1971; foreword signed 1968) Google Books
  • C.C. Berg (professor at Leiden), Report on the Need for Publishing Dictionaries which do not to-date exist (booklet, between 1960 and 1962, published by CIPSH, Conseil International de la philosophie et des sciences humaines)

February 2013: I start to contribute actively to Danish Wiktionary (my user page).

January 24, 2013: I introduce {{sv-compound}} and category:Swedish compounds with maskin, as used for displaying Derived terms in maskin#Swedish. -- Bad idea.

November 19, 2012: Fun photo gallery: 10 Swedish words you won’t find in English: orka, harkla, hinna#Verb, blunda, mysa, vabba, duktig, jobbig, gubbe/gumma, mormor/farmor/morfar/farfar (actually 14).

August 27, 2012: I give up all hope about the Norwegian entries in en.wiktionary. Please remind me to stay away if any discussion should come up again.

April 18, 2011: To do: handgemäng, hägn, ohägn, hugnad, misshällighet

April 7, 2011: All the words from this article about common translation errors should be incorporated into Wiktionary.

April 3, 2011: I think I'm done with Swedish form entries for now. When the new XML dump arrived 20110402, Wiktionary contained 87,651 Swedish words. After parsing the XML dump I was able to generate 1521 new Swedish form entries. I have the machinery in place to fill in the missing form entries after each new dump. Now we need to expand the 20,000 Swedish gloss entries to a full Swedish vocabulary. But can that work be automated? How do we add the next 20,000 gloss entries without spending 3 minutes on each? (That would be 1000 hours, or 25 weeks of full-time work.)

March 20, 2011: When spannen#Swedish is the definite singular of spann (bucket) and definite plural of spann (set of horses), I'd like to indicate in the form entry which sense belongs to which form. Perhaps "senseid" is the way to do this. Both the form templates and the declension/conjugation templates would have to take the sense ID as an extra parameter. This would be a major change to the 80,000 existing Swedish entries.

March 18, 2011: I create Appendix:Swedish verbs.

March 10, 2011: The new XML database dump shows 80,000 Swedish entries, yet another giant leap forward. My simple script for generating missing form entries has evolved into one that reads the declension and conjugation table template calls and concludes which form entry templates should be called from where. For example {{sv-noun-reg-ar|2=and}} in ande should generate {{sv-noun-form-def|ande}} in the page anden. If this form entry template call is found, fine. If not, the wanted form entry is saved as a file, that a modified version of pagefromfile.py can read. If the page doesn't exist, it is created. If it exists, a ==Swedish== entry is appended at the bottom. If a Swedish entry already exists, because "anden" is also the definite form of and, this is logged and I have to edit the existing Swedish entry manually. At least for now, this happens a lot. In some cases, a verb form entry is also an adjective form. In some cases, the form entry exists but uses another template (form of, plural of, ...) or no template at all. Right now I have a backlog of 8,000 entries to go through, or 10 percent of the existing stock. Maybe I should automate the addition of adjective form entries to Swedish entries that don't have an adjective subheading already ... done.
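The decision steps described above can be summarised in a small sketch. This is not the actual script, which is not shown on this page; the function name `plan_form_entry` and the action labels it returns are hypothetical, and detecting an existing Swedish section by its level-2 heading is a simplifying assumption.

```python
import re

# Matches a level-2 "==Swedish==" heading on a line of its own.
SWEDISH_L2 = re.compile(r"^==\s*Swedish\s*==\s*$", re.MULTILINE)


def plan_form_entry(page_text, wanted_call):
    """Decide what to do with one wanted form entry.

    page_text   -- current wikitext of the target page, or None if missing
    wanted_call -- the form template call, e.g. "{{sv-noun-form-def|ande}}"
    """
    if page_text is None:
        return "create"   # page does not exist: create it
    if wanted_call in page_text:
        return "ok"       # the form entry is already there
    if SWEDISH_L2.search(page_text):
        return "manual"   # a Swedish entry exists: log it, edit by hand
    return "append"       # other languages only: append ==Swedish==


print(plan_form_entry(None, "{{sv-noun-form-def|ande}}"))
```

The "manual" case corresponds to pages such as anden, where a Swedish section already exists for another lemma.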

March 2, 2011: The most commonly used Norwegian templates are: {{no-noun-infl}} (733 calls), {{nn-noun-m1}} (351), {{nb-noun-m2}} (221), {{nn-noun-form}} (178), {{no-noun}} (125), {{nn-verb}} (101), {{no-noun-c}} (97), {{nb-noun-m1}} (87), {{nn-inf}} (85), {{no-verb}} (76), {{no-noun-m1}} (73), {{no-noun-n1}} (71), {{no-verb-1}} (68), {{no-verb-2}} (54), {{nn-noun-n1}} (51), {{no-noun-mu}} (48), {{no-adj-infl}} (47), {{no-noun-form}} (41), {{no-noun-irreg}} (40), {{no-adj-2}} (39), {{no-adj-1}} (33), {{nn-verb-form}} (32), {{nb-noun}} (32), {{nn-verb-1}} (30), {{no-adj}} (26), {{nn-noun-f2}} (24), {{nb-noun-n1}} (23), {{no-verb_form}} (22), {{nn-noun-irreg}} (21), {{nb-class1}} (18), {{nb-g}} (17), {{nb-noun-c}} (16), {{no-adj-3}} (15), {{no-noun-nu}} (13), {{nn-pers-pron}} (13), {{no-noun-n4}} (12), {{no-noun-n3}} (12), {{nn-noun-f1}} (12), {{nn-adj-2}} (11), {{nb-verb-1}} (11), {{no-noun-cu}} (10), {{nn-adj-table}} (10), {{nb-noun-n3}} (10), {{no-verb-4}} (9), {{nn-adj-1}} (9), {{nb-verb}} (9), {{no-noun-f}} (8), {{nn-verb-2}} (8), {{nb-adj-table}} (8), {{nn-verb-form-pre}} (7), {{nb-pers-pron}} (7), {{no-noun-f1}} (6), {{no-adv}} (6), {{no-adj-irreg}} (6), {{nn-noun-f3}} (6), {{nn-adj-3}} (6), {{nb-verb-2}} (6), {{nb-class2}} (6), {{nb-adj-2}} (6), {{no-verb-form}} (5), {{no-noun-reg-m}} (5), {{nn-g}} (5).

February 27, 2011: I don't speak French or Italian, but when I saw all these verb form entries (mostly created by Keenebot2 and SemperBlottoBot) using the primitive {{form of}}, I started to replace it with the more structured {{conjugation of}}. See Template talk:conjugation of#Stats. I have made the following translations of parameters:

  • lang=French/Italian ⇒ lang=fr/it
  • First/second/third person ⇒ 1/2/3
  • singular/plural ⇒ s/p
  • present indicative ⇒ pres|ind
  • present tense ⇒ pres|ind
  • present subjunctive ⇒ pres|sub
  • imperfect indicative ⇒ imperf|ind
  • imperfect tense ⇒ imperf|ind
  • imperfect subjunctive ⇒ imperf|sub
  • past historic ⇒ [[past historic]]
  • conditional mood/tense ⇒ cond
  • future tense ⇒ fut
  • imperative ⇒ imp
  • infinitive ⇒ inf
  • gerund ⇒ gerund
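The translation list above can be written as a lookup table. The following is only an illustrative sketch: the names `LANG_CODES`, `GRAMMAR_CODES` and `translate` are hypothetical, and the order in which the output fields are joined is an assumption, not a statement about how {{conjugation of}} expects its arguments.

```python
# Language names to codes, as in "lang=French/Italian => lang=fr/it".
LANG_CODES = {"French": "fr", "Italian": "it"}

# Descriptive phrases (lower-cased) to short parameter values.
GRAMMAR_CODES = {
    "first person": "1", "second person": "2", "third person": "3",
    "singular": "s", "plural": "p",
    "present indicative": "pres|ind", "present tense": "pres|ind",
    "present subjunctive": "pres|sub",
    "imperfect indicative": "imperf|ind", "imperfect tense": "imperf|ind",
    "imperfect subjunctive": "imperf|sub",
    "past historic": "[[past historic]]",
    "conditional mood": "cond", "conditional tense": "cond",
    "future tense": "fut",
    "imperative": "imp", "infinitive": "inf", "gerund": "gerund",
}


def translate(descriptors, language):
    """Map a list of descriptive phrases to a pipe-separated parameter string."""
    codes = [GRAMMAR_CODES[d.lower()] for d in descriptors]
    return "|".join(codes + ["lang=" + LANG_CODES[language]])


print(translate(["Third person", "singular", "present subjunctive"], "French"))
```

An unknown phrase raises a KeyError, which is convenient here: such entries are better left for manual review than converted by guesswork.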

February 25, 2011: In the XML database dump of 2011-02-05, the most common headings for Swedish entries (compare August 21, 2010) are:

 61850 Swedish
 43879 Noun
 11033 Verb
  5695 Adjective
  5524 Declension
  4499 Etymology
  4115 Related terms
  2774 Pronunciation
  1777 Conjugation
  1675 See also
  1478 Proper noun
  1006 Synonyms
   666 Adverb
   591 Derived terms
   565 References
   533 Usage notes
   333 Antonyms
   135 Pronoun
   129 Abbreviation
   125 Cardinal number
   104 Interjection
    90 Etymology 2
    90 Etymology 1
    86 Inflection
    79 Preposition
    76 Suffix
    71 Compounds
    53 Idiom
    51 Conjunction
    49 Phrase
    39 Prefix
    37 Ordinal number
    25 Proverb
    22 Descendants
    18 Etymology 3
    16 Hypernyms
    12 Hyponyms
    11 Phrases
    11 Initialism
    10 Homophones
    10 Determiner
     9 {{abbreviation|Swedish}}
     8 Acronym
     5 Article
     5 {{abbreviation|sv}}
     4 Troponyms
     4 Letter
     4 {{initialism|sv}}
     4 External links
     4 Antonym
     4 Anagrams

As a comparison, the most common headings for all languages (not counting the L2 headings for the language names themselves) are:

1235093 Verb
 811866 Noun
 272027 Etymology
 267882 Pronunciation
 254013 Adjective
 234614 Anagrams
 123356 Related terms
 119880 Declension
  91788 Synonyms
  76466 Derived terms
  66909 Translations
  66639 References
  62966 Proper noun
  58676 See also
  49495 Alternative forms
  48712 Conjugation
  36726 Adverb
  33230 Participle
  32812 Hanzi
  26224 Han character
  24396 Inflection
  17626 Usage notes
  17535 External links
  16834 Antonyms
  15105 Descendants
  13497 Readings
  13331 Kanji
  10042 Etymology 1
  10033 Etymology 2
   8953 Hanja
   7029 Pronoun
   4578 Interjection
   3809 Compounds
   3623 Phrase
   3610 Suffix
   3452 {{initialism}}
   3422 Numeral
   3341 Verb form
   3238 Symbol
   3142 Cardinal number
   2998 Preposition
   2663 Prefix
   2572 Quotations
   2491 Mutation
   2433 Letter
   2380 Idiom
   2169 Conjunction
   1901 {{abbreviation}}
   1875 Pinyin syllable
   1853 Pronunciation 2
   1852 Pronunciation 1
   1652 Abbreviation
   1569 Coordinate terms
   1507 Proverb
   1485 Etymology 3
   1447 Pinyin
   1429 Hyponyms
   1342 Gismu
   1152 Hypernyms
   1066 Syllable
    973 Statistics
    728 {{acronym}}
    709 Contraction
    677 {{abbreviation|mul}}
    669 Devanagari spelling
    660 Ordinal number
    645 Urdu spelling
    569 Particle
    542 Determiner
    505 Number
    482 Abbreviations
    446 Alternative spellings
    368 Article
    365 Derived characters
    355 Scientific names
    342 Etymology 4
    330 Postposition
    299 Initialism
    288 Homophones
    239 Roman spelling

The most common combinations and sequences for Swedish sections are:

 36998 ((Swedish(Noun)))
  8123 ((Swedish(Verb)))
  3620 ((Swedish(Adjective)))
   981 ((Swedish(Proper noun)))
   918 ((Swedish(Etymology;Noun(Declension))))
   699 ((Swedish(Noun(Declension))))
   410 ((Swedish(Etymology;Noun(Declension;Related terms))))
   372 ((Swedish(Adjective;Verb)))
   359 ((Swedish(Verb(Conjugation;Related terms))))
   340 ((Swedish(Etymology;Verb(Conjugation;Related terms))))
   339 ((Swedish(Noun(Declension;Related terms))))
   330 ((Swedish(Noun;Verb)))
   249 ((Swedish(Pronunciation;Noun)))
   220 ((Swedish(Etymology;Adjective(Declension))))
   211 ((Swedish(Etymology;Noun(Declension)References)))
   182 ((Swedish(Adverb)))
   180 ((Swedish(Noun(Related terms))))
   156 ((Swedish(Pronunciation;Noun(Declension))))
   145 ((Swedish(Etymology;Proper noun)))
   139 ((Swedish(Etymology;Noun)))
   121 ((Swedish(Pronunciation;Noun(Declension;Related terms))))
   120 ((Swedish(Etymology;Adjective(Declension;Related terms))))
   114 ((Swedish(Noun(See also))))
   109 ((Swedish(Proper noun(Related terms))))
   106 ((Swedish(Etymology;Noun(Declension;See also))))
   105 ((Swedish(Noun(Synonyms))))
   104 ((Swedish(Etymology;Verb(Conjugation))))
    99 ((Swedish(Pronunciation;Verb(Conjugation;Related terms))))
    95 ((Swedish(Noun(Declension;See also))))
    91 ((Swedish(Adjective;Adverb)))
    88 ((Swedish(Verb(Conjugation))))
    79 ((Swedish(Pronunciation;Noun(Related terms))))
    79 ((Swedish(Noun(Declension;Related terms;See also))))
    78 ((Swedish(Etymology;Pronunciation;Noun(Declension))))
    72 ((Swedish(Abbreviation)))
    71 ((Swedish(Adjective(Related terms))))
    70 ((Swedish(Cardinal number)))
    69 ((Swedish(Pronunciation;Adjective)))
    62 ((Swedish(Noun(Declension;Synonyms))))
    62 ((Swedish(Etymology;Noun(Declension;Related terms;See also))))
    61 ((Swedish(Adjective(Declension;Related terms))))
    57 ((Swedish(Adjective(Declension))))
    52 ((Swedish(Etymology;Noun(Declension;Synonyms))))
    43 ((Swedish(Etymology;Pronunciation;Noun(Declension;Related terms))))
    42 ((Swedish(Verb(Conjugation;Related terms;See also))))
    42 ((Swedish(Pronunciation;Verb)))
    42 ((Swedish(Pronunciation;Etymology;Verb(Conjugation;Related terms))))
    42 ((Swedish(Noun(Derived terms))))
    42 ((Swedish(Etymology;Pronunciation;Noun)))
    40 ((Swedish(Alternative forms;Proper noun)))
    38 ((Swedish(Pronoun)))
    38 ((Swedish(Etymology;Verb(Conjugation;Related terms;See also))))
    38 ((Swedish(Etymology;Adjective)))
    37 ((Swedish(Etymology;Adverb)))
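The parenthesised notation in the list above can be reproduced from the headings of a section. The sketch below is a reconstruction from the notation itself, not the script that produced these counts; `heading_structure` is a hypothetical name. A deeper heading opens a parenthesis, a heading at the same level is joined with a semicolon, and a shallower heading closes parentheses.

```python
import re

# A wiki heading line: the same number of "=" on both sides of the title.
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")


def heading_structure(wikitext):
    """Serialise the heading outline of a wikitext section."""
    out = []
    depth = 0
    for line in wikitext.splitlines():
        m = HEADING.match(line)
        if not m:
            continue
        level = len(m.group(1))
        if level > depth:
            out.append("(" * (level - depth))   # going deeper
        elif level == depth:
            out.append(";")                     # sibling heading
        else:
            out.append(")" * (depth - level))   # going back up
        out.append(m.group(2))
        depth = level
    out.append(")" * depth)                     # close everything at the end
    return "".join(out)


SAMPLE = """==Swedish==
===Etymology===
{{compound|a|b|lang=sv}}
===Noun===
{{head|sv|noun}}
====Declension====
===References===
"""
print(heading_structure(SAMPLE))
```

Feeding each extracted Swedish section through such a function and counting the resulting strings would give a frequency list of the kind shown here.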

February 8, 2011: English Wiktionary now contains more Swedish entries (78,985) than Swedish Wiktionary (76,119). The overlap is only 34,178 entries. Swedish Wiktionary has more gloss definitions and English Wiktionary has more form entries, many created by LA2-bot.

February 6, 2011: I should try to incorporate as much as possible of Wikipedia:Swedish Wikipedians' notice board/Terminology into Wiktionary.

February 4, 2011: I set up {{R:Rikstermbanken}} and create some entries that refer to it.

January 30, 2011: I set up {{R:Utrikes namnbok}} and create some entries that refer to it, mostly in Category:sv:Government.

January 20, 2011: How to extract a list of Swedish headwords from the Swedish Wiktionary:

wget -O - "http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php?wikilang=sv&wikifam=.wiktionary.org&basecat=Svenska&basedeep=5&templates=&mode=al&go=Search&format=csv&userlang=en" |
   awk '-F\t' '$1==0 {print $2}' |
   tr _ ' ' | LC_COLLATE=sv_SE.utf8 sort

January 10, 2011: How to extract a list of Swedish headwords:

wget -O - "http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php?wikilang=en&wikifam=.wiktionary.org&basecat=Swedish+language&basedeep=5&templates=&mode=al&go=Search&format=csv&userlang=en" |
   awk '-F\t' '$1==0 && $3!="Translation_requests_(Swedish)" && $3!="Translations_to_be_checked_(Swedish)" && $3!~/derived_from_Swedish/ {print $2}' |
   tr _ ' ' | LC_COLLATE=sv_SE.utf8 sort

November 19, 2010: I import {{R:runeberg.org}} from sv.wikipedia.

November 15, 2010: I think there are now 20,000 Swedish entries in en.wiktionary.org, which is twice as many as at the beginning of this year. This has been achieved mainly by adding form entries. Statistics here. I have added more word forms, based on word frequency lists (see corpus coverage in the August 31 entry below). I have focused less on including all definitions and all forms for every word. What I have tried to do is to create links between the entries, so compounds link to their component words. Hopefully, this will attract more users who then start to fill in the missing definitions (second usage of words) and forms. This philosophy, known as eventualism, is similar to creating stub articles in Wikipedia, hoping that later users will fill in more facts. I'm not a general subscriber to that idea, but it can be a useful approach in the early stages of a project. A useful Swedish dictionary probably needs 120,000 basic forms (and half a million form entries), which is ten times more than en.wiktionary has today and five times more than sv.wiktionary has.

September 18, 2010: There are 51,318 pages that call {{t}}, {{t+}} or {{t-}}. The page with most translations is be (607 translations), followed by you (447), set (438), love (421). Halfway down the list we find words like toner and toadstool (4 translations each). The most translated words that don't yet have any Swedish translation (or where the translations didn't use these templates in the database dump of 2010-09-12) are: judge (161), (156), heat (154), jump (153), spread (141), stroke (140), proper (137), cry (131), behind (130), desire (126), nose (125), round (123), article (122), double (121), taste (117), end (117), situation (116), shut up (116), male (116), Albanian (116), draft (112), chest (112), e-mail (110), truth (108), storm (108), squeeze (105), same (105), job (105), exit (105), (104), cheap (103), steer (102), prayer (100), entry (100), cinema (100), split (99), Gypsy (99), care (99), waste (98), sole (97), hook (97), chat (97), welcome (96), believe (96), coach (95), short (94), bend (94), herd (91), finish (91), sit (90), return (90), pickle (90), drill (90), dragon (90), cum (90), cherry (90), butt (90), British (90), masculine (88), correct (88), icon (87), gun (87), gentleman (87), freedom (87), beginning (87), separate (86), Moon (86), account (86), justice (85), I'm Jewish (85), definition (85), puzzle (84), atmosphere (84), corner (83), Macedonian (81), lime (81), lady (80), decline (80), damn (80), cardinal (79), plague (78), interest (78), dash (78), auxiliary (78), study (77), newspaper (77), hi (77), criminal (77), cement (77), bundle (77), bug (77), appropriate (77), agree (77), vacuum (76), swarm (76), reach (76), poetry (76), late (76), harmony (76), custom (76), chip (76), certainly (76), authority (76), rear (75), pumpkin (75), discharge (75), silk (74), dinner (74), crash (74), Commonwealth of Independent States (74), cheat (74), accept (74), walnut (73), transfer (73), grain (73), ceremony (73), abate (73), victim (72), 
vagina (72), type (72), prophet (72), increase (72), contact (72), constitution (72), constellation (72), budget (72), application (72), soldier (71), plot (71), painting (71), crew (71), brass (71), thunder (70), roast (70), psychology (70), communism (70), brake (70), witch (69), saddle (69), neighbour (69), vault (68), shallow (68), perfume (68), particle (68), harvest (68), electronic (68), coral (68), camp (68), amount (68), odd (67), occupation (67), how much (67), device (67), chamber (67), bust (67), association (67), airplane (67), track (66), stab (66), spice (66), pomegranate (66), crust (66), comfort (66), aeroplane (66), random (65), plough (65), no way (65), married (65), foundation (65), execution (65), channel (65), breath (65), arrest (65), studio (64), Myanmar (64), fail (64), enter (64), dish (64), actual (64), abrupt (64), wizard (63), Vladimir (63), substantial (63), splinter (63), reply (63), purple (63), paddle (63), nucleus (63), notice (63), illusion (63), how are you (63), deliver (63), dairy (63), counterfeit (63), blackmail (63), arrive (63), wardrobe (62), stuff (62), seat (62), not at all (62), deliberate (62), cylinder (62), crop (62), advertisement (62), zone (61), tower (61), source (61), sexuality (61), litter (61), gravity (61), fill (61), composition (61), business (61), bully (61), asshole (61), trial (60), sponge (60), sigh (60), resolution (60), orthography (60), mount (60), Java (60), implement (60), hood (60), half (60), habit (60), forever (60), anyway (60). Of course there can also be many definitions of be or you that don't have Swedish translations.

September 7, 2010: Some Unix/Linux shell commands:

To extract just one language (here: Swedish) from the XML database dump and remove the interlanguage links:
sed 's/<text.*>/\n/;s/<\/text>/\n==End==/' enwiktionary.xml | \
   sed '/^==\s*Swedish/,/^==[^=]/!d;/^==[^=]/d;/^\[\[[a-z][-a-z]*:/d'
To extract just the native language example sentences from the above (beware of the " and ' trick):
sed '/^#:[^:]/!d;s/^#:*\s*//;s/=.*//;s/'"'''"'//g;s/'"''"'//g;s/&[/a-z]*;//g'
To cut plain text into a list of words (I kept hyphens in words, but not digits; you might want to add »):
tr ' -&(-,.-?[]|' '\n'|sed '/^$/d'
To find the most frequent words:
sort | uniq -c | sort -nr

When all of the above are combined, I get a list of all words occurring in the Swedish example sentences, sorted by frequency. And so I can check that Wiktionary provides explanations for all or most of them. The Swedish example sentences constitute an 84 kbyte e-text, having 13,255 words of which 4819 are unique. Wiktionary has Swedish entries for 71.1 percent of the occurrences. This is rather low. Part of the explanation is that some text is in English, because the example sentences are incorrectly formatted and contain templates and URLs.

September 4, 2010: Inserting the templates l and t:

python replace.py -family:wiktionary -lang:en -xml:enwiktionary.xml -summary:"l:sv, t:sv" -regex -recursive \
 '\[\[#Swedish\|([^\]]+)\]\]' '{{l|sv|\1}}' \
 '\[\[([^#\|\]]+)#Swedish\|[^\]]*\]\]' '{{l|sv|\1}}' \
 '(\* *Swedish:.*?)\[\[([^\]]*)\]\]' '\1{{t|sv|\2}}' \
 '(\* *Swedish:.*?){{l\|sv\|' '\1{{t|sv|' \
 '(\* *Swedish:.*?{{t[^}]*)}} {{([cfmnp](\|[cfmnp])*}})' '\1|\2'
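To see what each of the five substitutions does, the same patterns can be tried on sample lines with Python's `re` module. This is only a sketch: the sample strings are made up, the braces are escaped for clarity, and the loop that repeats until nothing changes is an approximation of `-recursive`, not a description of how replace.py implements it.

```python
import re

# The five (pattern, replacement) pairs from the command above.
RULES = [
    (r'\[\[#Swedish\|([^\]]+)\]\]', r'{{l|sv|\1}}'),
    (r'\[\[([^#\|\]]+)#Swedish\|[^\]]*\]\]', r'{{l|sv|\1}}'),
    (r'(\* *Swedish:.*?)\[\[([^\]]*)\]\]', r'\1{{t|sv|\2}}'),
    (r'(\* *Swedish:.*?)\{\{l\|sv\|', r'\1{{t|sv|'),
    (r'(\* *Swedish:.*?\{\{t[^}]*)\}\} \{\{([cfmnp](\|[cfmnp])*\}\})', r'\1|\2'),
]


def apply_rules(text):
    """Apply all rules repeatedly until the text stops changing."""
    while True:
        new = text
        for pattern, repl in RULES:
            new = re.sub(pattern, repl, new)
        if new == text:
            return text
        text = new


for sample in ("See [[#Swedish|hund]].",
               "[[katt#Swedish|katten]]",
               "* Swedish: [[hund]] {{c}}"):
    print(sample, "=>", apply_rules(sample))
```

The last rule folds a separate gender template into the preceding {{t}} call, which is why it only fires after the third rule has converted the plain link.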

August 31, 2010: The Swedish Bible of 1917 contains 769,316 words of text, using a vocabulary of 26,990 words and word forms, including some capitalized words at the beginning of sentences. Of this vocabulary, 3802 words or 14 % have Swedish entries in en.wiktionary. However, since these 14 % contain many of the most common words, they make up 74 % of the text. This number (74 %) is the definition of the dictionary's coverage of this corpus of text. If you pick a random page, line and word in the Bible, there's a 74 % chance that word has a Swedish entry here. 74 % is a very low coverage for a dictionary, and a sign that we have a very long way to go.

Here's how it works on the first two verses: i begynnelsen skapade gud himmel och jord. och jorden var öde och tom, och mörker var över djupet, och guds ande svävade över vattnet. (Genesis 1:1-2) Of these 24 words, 5 are "och", 2 are "var", 2 are "över". These three words alone make up 9 of the 24 words, or 37.5% of the text.
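The coverage figure is simply the share of word occurrences that have a dictionary entry. A minimal sketch of that calculation on the two verses, assuming a toy dictionary of only three words (the function names `tokenize` and `coverage` are made up for this illustration):

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its runs of letters (no digits)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(tokens, dictionary):
    """Fraction of token occurrences that are found in the dictionary."""
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if t in dictionary)
    return covered / len(tokens)


VERSES = ("i begynnelsen skapade gud himmel och jord. "
          "och jorden var öde och tom, och mörker var över djupet, "
          "och guds ande svävade över vattnet.")

tokens = tokenize(VERSES)
counts = Counter(tokens)
toy_dictionary = {"och", "var", "över"}   # assumption: a three-word dictionary

print(len(tokens), counts.most_common(3))
print(coverage(tokens, toy_dictionary))
```

Counting occurrences rather than unique words is what lets a small dictionary of frequent words reach a high coverage.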

Corpus: Bible (1917) | Herr Arnes penningar | Swedish Wikipedia as of 2010-06-08 | Tankar i utvandringsfrågan | KB:s underlag till en nationell strategi... (2010) | Kulturutredningen (2009) | SvD Understreckare, Sept. 1–18, 2010 | Framtidens Internet by Jan Kallberg
Words in corpus 769,316 23,514 111,625,635 93,078 18,607 248,282 31,608 23,414
Unique words 26,990 3,303 3,412,039 14,516 4,017 23,050 8,815 5,086
Date of database dump | Swedish entries | Percent coverage of corpus (eight columns, same corpus order)
2010-08-12 10,987 72.6 75.4 55.6 66.1 49.2 58.3 63.9 70.1
2010-08-24 11,531 74.2 76.1 55.8 66.7 49.4 58.5 64.0 70.4
2010-09-01 14,678 84.8 84.7 59.7 73.1 55.0 65.0 69.2 76.0
2010-09-12 16,926 87.3 87.5 61.6 77.0 65.7 73.2 71.2 78.5
2010-09-23 17,836 87.5 88.1 62.9 78.4 70.3 76.4 73.9 80.2
2010-10-05 17,851 87.5 88.1 63.0 78.4 70.4 76.4 74.0 80.2
2010-10-15 17,885 87.5 88.1 63.2 78.4 70.5 76.4 74.1 80.3
2010-10-30 19,449 87.7 88.2 64.0 80.4 71.5 77.8 75.5 81.4
*2010-12-31 22,135 88.7 89.2 65.9 84.5 77.8 83.7 79.3 85.1
**2011-01-10 40,621 89.5 89.6 68.3 85.6 78.4 84.9 81.1 89.7
**2011-01-23 53,421 90.0 89.8 69.4 86.6 82.8 86.3 82.1 90.5
**2011-01-31 59,889 90.2 90.0 69.9 87.5 83.1 86.8 82.6 91.2
**2011-02-08 78,985 91.1 90.6 71.1 89.1 84.1 88.1 83.9 92.3
**2011-03-23 87,267 91.4 90.7 71.8 89.7 84.9 88.5 84.6 92.7

(The Wikipedia corpus used here contains some garbage that will never be covered by the dictionary, e.g. Wikipedia user names, occasional talk pages in English, and some remaining wiki markup, so the coverage percentage will inevitably be lower. It's still interesting to have a really large corpus to study.)

(* No database dump exists for 2010-12-31, but a preliminary dictionary was extracted.)

(** Dictionary generated by category wget. See diary entry for January 10, 2011.)

August 28, 2010: I think it would be helpful to know how common a word is. This can be determined by computing its rank in some large body of text, putting the most frequent word ("the" for English, "och" for Swedish) at position 1. This is what template {{rank}} does, for example able has rank 391, but I think a logarithmic scale would be more informative than a linear one. Color graphics could indicate how "hot" a word is, but with the cool and neutral black, white and light-blue appearance of Wiktionary, the colors must be restricted to a very small area:

rank 8
rank 64
rank 512
rank 4096
rank 32,768
rank 262,144
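The six sample ranks are successive powers of 8, so one way to read the proposed logarithmic scale is as a band number: the smallest n for which the rank is at most 8 to the power n. This is an interpretation of the sample values, not something {{rank}} does; `rank_band` is a hypothetical name, and integer arithmetic is used to avoid rounding problems with floating-point logarithms.

```python
def rank_band(rank):
    """Return the smallest n such that rank <= 8**n (rank 1 gives band 0)."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    band = 0
    limit = 1
    while limit < rank:
        limit *= 8
        band += 1
    return band


for r in (8, 64, 391, 512, 4096, 32768, 262144):
    print(r, rank_band(r))
```

With this reading, able (rank 391) falls in the same band as rank 512, and each band could be given one colour.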

August 21, 2010: Many open issues:

  • So far, only 10,000 entries in Swedish. Redefining templates is easier now than after many more entries have been created.
  • How should templates be named? Is the -reg-/-irreg- part of the name really necessary? Can we do with fewer templates and shorter names?
  • How do we create entries for all inflected forms? Can this be automated?
  • Can conjugation/declension tables handle passive verbs? Subjunctives? All adjectives?
  • Should template parameters be standardized? Now they are different everywhere: 2=, stem=, sg-def-gen=
  • Can templates support irregular verbs, so avgå, tillstå can be based on gå, stå?
  • Can templates support prefixed and suffixed words, e.g. "gå an/gick an" smarter than today?
  • Should templates for Swedish words be standardized across languages of Wiktionary?
  • Old spelling (elf/älf/älv) can be handled, but how should we handle giva/ge, hava/ha?

The most common headings in Swedish sections are:

 10969 Swedish        533 Derived terms          72 Compounds          37 Ordinal number
  6402 Noun           319 Adverb                 72 Abbreviation       31 Conjunction
  2618 Pronunciation  251 Usage notes            63 Cardinal number    25 Proverb
  1705 Verb           251 Antonyms               58 Conjugation        22 Verb form
  1520 Related terms  214 Alternative spellings  54 Idiom              22 Descendants
  1300 Adjective      100 Etymology 2            52 References         17 Etymology 3
  1247 Proper noun    100 Etymology 1            51 Preposition        16 Hypernyms
  1013 Etymology       96 Inflection             48 Phrase             14 Homophones
   995 See also        88 Interjection           41 Alternative forms  12 Hyponyms
   789 Synonyms        83 Pronoun                39 Suffix             11 Phrases

The most common heading structures are listed below. "((" means heading level 2.

  3158 ((Swedish(Noun)))                               57 ((Swedish(Pronunciation;Noun(See also))))
   831 ((Swedish(Proper noun)))                        56 ((Swedish(Pronunciation;Noun(Derived terms))))
   660 ((Swedish(Verb)))                               47 ((Swedish(Abbreviation)))
   565 ((Swedish(Pronunciation;Noun)))                 45 ((Swedish(Pronunciation;Adjective(Related terms))))
   505 ((Swedish(Adjective)))                          43 ((Swedish(Noun;Verb)))
   290 ((Swedish(Noun(Related terms))))                42 ((Swedish(Pronunciation;Noun;Verb)))
   206 ((Swedish(Etymology;Noun)))                     41 ((Swedish(Verb(See also))))
   168 ((Swedish(Noun(Synonyms))))                     37 ((Swedish(Alternative spellings;Proper noun)))
   168 ((Swedish(Noun(See also))))                     34 ((Swedish(Pronunciation;Noun(Synonyms))))
   156 ((Swedish(Pronunciation;Verb)))                 34 ((Swedish(Pronunciation;Adverb)))
   142 ((Swedish(Pronunciation;Noun(Related terms))))  34 ((Swedish(Alternative spellings;Noun(Related terms))))
   131 ((Swedish(Pronunciation;Adjective)))            33 ((Swedish(Phrase)))
   121 ((Swedish(Verb(Related terms))))                32 ((Swedish(Adjective(See also))))
   112 ((Swedish(Etymology;Proper noun)))              29 ((Swedish(Adjective;Noun)))
   101 ((Swedish(Proper noun(Related terms))))         28 ((Swedish(Pronunciation;Verb(See also))))
    81 ((Swedish(Adjective(Related terms))))           28 ((Swedish(Etymology;Noun(Related terms))))
    73 ((Swedish(Adverb)))                             27 ((Swedish(Etymology;Verb)))
    72 ((Swedish(Pronunciation;Verb(Related terms))))  27 ((Swedish(Etymology;Adjective)))
    72 ((Swedish(Etymology;Pronunciation;Noun)))       26 ((Swedish(Verb(Synonyms))))
    62 ((Swedish(Noun(Derived terms))))                26 ((Swedish(Interjection)))

Starting to introduce ====Declension==== and ====Conjugation==== on a large scale will change this pattern.

It seems I have a bot command that works:

python replace.py -family:wiktionary -lang:en -cat:'Swedish verbs' -summary:'Conjugation heading' -regex -dotall \
   '(===Verb===\s*({{infl[^\n]*}})?\s*)({{sv-verb-(irreg|reg-)[^\n]*}}\s*)(([^-=\[][^\n]*\n\s*)*)'    '\1\5====Conjugation====\n\3'    \
   '(====Verb====\s*({{infl[^\n]*}})?\s*)({{sv-verb-(irreg|reg-)[^\n]*}}\s*)(([^-=\[][^\n]*\n\s*)*)'  '\1\5=====Conjugation=====\n\3'


August 20, 2010: In the database dump of 2010-08-12, there were 6341 calls to templates named sv-. Kinds are conj = conjugation table for verbs, decl = declension table for adjectives and nouns, form = referring from an inflected form to the main entry, infl = one-liner inflection pattern.

Calls Template Kind Comment
813 {{sv-noun-reg-er}} decl Since painted blue
707 {{sv-noun-reg-ar}} decl Since painted blue
488 {{sv-verb-reg-ar}} conj Since painted blue
433 {{sv-noun-reg-or}} decl Since painted blue
418 {{sv-noun-n-zero}} decl Since painted blue
343 {{sv-noun}} decl Since painted blue and renamed {{sv-decl-noun}}. Contains table layout and colours, serving as the base for other noun templates.
257 {{sv-adj-reg}} decl
254 {{sv-noun-unc-irreg-c}} decl Since painted blue
250 {{sv-noun-irreg-c}} decl Since painted blue
218 {{sv-verb-irreg}} conj Since painted blue. Contains table layout and colours, serving as the base for other verb templates.
191 {{sv-adv}} infl
153 {{sv-noun-reg-r-c}} decl Since painted blue
137 {{sv-verb-reg}} infl
132 {{sv-noun-unc-irreg-n}} decl Since painted blue
108 {{sv-adj-abs}} decl
105 {{sv-adj-peri}} decl
102 {{sv-verb-reg-er}} conj Since painted blue
101 {{sv-noun-c-zero}} decl Since painted blue
96 {{sv-noun-unc-n}} decl A redirect to {{sv-noun-unc-irreg-n}}
92 {{sv-noun-irreg-n}} decl Since painted blue
86 {{sv-adj}} infl
84 {{sv-noun-unc-c}} decl A redirect to {{sv-noun-unc-irreg-c}}
78 {{sv-verb-form-pre}} form
72 {{sv-noun-reg-n}} decl Since painted blue
60 {{sv-noun-form-indef-pl}} form
56 {{sv-verb-form-past}} form
54 {{sv-noun-form-def}} form
33 {{sv-verb-irr}} infl
31 {{sv-verb-form-sup}} form
31 {{sv-adj-form-abs-pl}} form
30 {{sv-adj-form-abs-indef-n}} form
27 {{sv-verb-form-imp}} form
26 {{sv-adj-form-abs-def}} form
19 {{sv-noun-form-indef-gen}} form
18 {{sv-adj-pastpart}} decl
17 {{sv-noun-reg-r-n}} decl Since painted blue
16 {{sv-noun-form-def-pl}} form
15 {{sv-verb-form-pastpart}} form
14 {{sv-verb-form-prepart}} form
14 {{sv-verb-ar}} infl A redirect to {{sv-verb-reg}}
13 {{sv-noun-form-indef-gen-pl}} form
13 {{sv-noun-ar}} decl A redirect to {{sv-noun-reg-ar}}
13 {{sv-adj-form-abs-def-m}} form
11 {{sv-adj-prepart}} decl
11 {{sv-adj-form-comp}} form
10 {{sv-noun-or}} decl A redirect to {{sv-noun-reg-or}}
10 {{sv-noun-form-def-gen}} form
9 {{sv-noun-form-def-gen-pl}} form
9 {{sv-adj-form-sup-pred}} form
8 {{sv-adv-form-sup}} form
7 {{sv-noun-un}} decl A redirect to {{sv-noun-unc-irreg-c}}
7 {{sv-adj-form-sup-attr-pl}} form
6 {{sv-adj-form-sup-attr-m}} form
5 {{sv-adv-form-comp}} form
5 {{sv-adj-form-sup-attr}} form
4 {{sv-noun-n}} decl A redirect to {{sv-noun-reg-n}}
3 {{sv-verb-form-pre-pass}} form
3 {{sv-verb}} Erroneous call, since replaced.
2 {{sv-verb-form-pres-pass}} form
2 {{sv-verb-form-inf-pass}} form
2 {{sv-adj-irreg}} decl
2 {{sv-adj-form-sup-pred-pl}} form
1 {{sv-noun-reg-}} Mentioned in {{sv-new-noun}}
1 {{sv-noun-proper-def-irreg}} Listed on Wiktionary:Swedish inflection templates
1 {{sv-noun-pl-irreg}} Listed on Wiktionary:Swedish inflection templates
1 {{sv-noun-form-adj}} form
1 {{sv-adj-small}} decl Called from {{sv-adj-decl}}, which is never used.
1 {{sv-adj-form-comp-pl}} form
1 {{sv-adj-abs-irreg}} Listed on Wiktionary:Swedish inflection templates

August 19, 2010: There are currently 81 templates named sv-... (too many for my taste), having the following parts of their names:

Number of templates having this component in their name | Name component | Meaning
6 abs Absolute form of an adjective
27 adj Adjective
4 adv Adverb
4 ar -ar plural declension of noun
3 attr Superlative attribute form of an adjective
5 c Common gender of noun (= utrum, n-gender)
3 comp Comparative form of an adjective
1 custom sv-verb-custom is a base/meta template
2 decl Declension of nouns and adjectives
6 def Definite form of nouns/adjectives
2 er -er plural declension of noun
30 form Inflected forms referring to the main entry
4 gen Genitive form
1 imp Imperative form of a verb
4 indef Indefinite form of nouns/adjectives
1 inf Infinitive passive form of a verb
1 irr Irregular inflection
8 irreg Irregular inflection
2 m Masculine form of adjectives
1 mermest Redirect shorthand for "peri"
8 n Neuter gender (neutrum, t-gender)
5 new Called from "nogomatch"
1 nogomatch "You can create an entry..."
30 noun Noun
2 or -or plural declension of noun
3 pass Passive form of a verb
1 past Past tense form of a verb
2 pastpart Past participle form of a verb
2 peri Adjective comparison with mer/mest
8 pl Plural
2 pre Present tense form of a verb
2 pred Superlative predicative form of an adjective
2 prepart Present participle form of an adjective
1 pres Present passive form of a verb
2 r -r plural declension of noun
11 reg Regular inflection
1 small Smaller table layout, not used
7 sup Superlative form of an adjective
81 sv Swedish
2 un Redirect synonym for abs or unc
4 unc Uncountable noun (no plural forms)
18 verb Verb
2 zero Declension of nouns where plural = singular