Wiktionary talk:Page count
why a template?
To expand a bit on the explanation, I looked at a number of ways of doing this. There are pros and cons to each.
plain link without rendered text
Just a link to the project page:
xxx [[Wiktionary:Page count| ]] zzz
generates HTML:
<p>xxx <a href="/wiki/Wiktionary:Page_count" title="Wiktionary:Page count"></a> zzz</p>
The space in the link prevents invoking the "pipe trick". Pros: simple, adds page to what-links-here for the project page. Cons: odd syntax may confuse something reading wikitext; creates a null anchor, which isn't good form; may show up as a link on things like cell phones that still have lots of trouble with which links to display.
category
Standard link to a hidden category. Pros: no unexpected syntax, won't break anything, easy to find entries. Cons: would render Special:Uncategorizedpages useless, as entries not in POS cats etc would never show up.
HTML comment
<!-- [[ -->
That is all it takes to make the page count. (Yes, I've tested it, and read the code.) Pros: really simple, won't break anything. Cons: doesn't explain itself, editors may (will) wonder why it is there; can't find pages by any method other than scanning the XML dump (no cat or what-links-here).
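To illustrate why both the empty link and the HTML comment work, here is a minimal sketch in Python. It assumes the counting rule is nothing more than "the raw wikitext contains `[[`", which is what the comment above implies. The function name `is_counted` and the sample strings are made up for illustration, and whatever other conditions the software applies (namespace, redirects) are ignored.

```python
def is_counted(wikitext: str) -> bool:
    """Assumed rule: a page counts if its raw wikitext contains '[['.

    This is a plain substring test on the source text; nothing is parsed,
    so a '[[' hidden inside an HTML comment still satisfies it.
    """
    return "[[" in wikitext


samples = {
    "plain link": "xxx [[Wiktionary:Page count| ]] zzz",
    "HTML comment": "xxx <!-- [[ --> zzz",
    "no link at all": "xxx zzz",
}

for label, text in samples.items():
    print(f"{label}: {is_counted(text)}")
```

Under that assumption the first two samples count and the third does not, which matches the behaviour described in this section.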
template
Pros: explains itself, contains a link to the project page (although one has to manually copy the link to get here ;-), pages findable with what-links-here. Cons: will affect rendering (extra blank lines) if not immediately following the last text line (so that is what AF does), causes another template fetch in parsing (acceptable). Easy for AF to find and remove when no longer needed.
Also: the MW software will hopefully be "fixed" sometime. I say "fixed" because there is a high probability (from what I've noted in the past) that it will just be broken differently. The template gives us a shot at doing something different if needed, without changing the pages. (It could, for example, create an actual page link, or invoke a COUNTPAGE magic word if such was provided.)
So that's how I ended up with the template. Robert Ullmann 06:00, 24 December 2008 (UTC)
Inflection templates
Does this mean that articles which only include an inflection line and no link (in the page's source) are excluded from the count? If so, should entries such as clepta change templates, i.e.
- # {{inflection of|clepta|clepta|gen|s|lang=la}}
be changed to the style of amat:
- # {{conjugation of|[[amo#Latin|amō]]||3|s|pres|act|ind|lang=la}}
only in a noun form of it? -- 124.171.169.189 16:18, 24 March 2010 (UTC)
- No, it means exactly the reverse; in the first example, AF will add "count page" so the entry will count; you do not have to (and should not) contrive the template to the second example (in that case done because of the bogus macron).
- Don't worry about it. The right thing gets done. Robert Ullmann 15:00, 13 July 2010 (UTC)
This is pretty ridiculous
It really is. Adding invisible links to every page to get around some ridiculous software limitation on counting pages that could probably be fixed pretty easily... There are a bunch of bugs in bugzilla about the page count link requirement. I don't suppose anyone knows any way the issue might be hurried along? --Yair rand (talk) 12:38, 18 February 2011 (UTC)
RFD discussion
The following information has failed Wiktionary's deletion process.
It should not be re-entered without careful consideration.
I was considering updating this to reflect the current situation. But given the number of incoming links (not very many) I think we should just delete it. Also, how would the updated version read? Presumably something like 'this page no longer has any relevance'. Mglovesfun (talk) 19:05, 11 June 2012 (UTC)
- Keep and add {{inactive}} at the top. --Μετάknowledgediscuss/deeds 19:11, 11 June 2012 (UTC)
- Has anyone tried to count lemmas? By my very crude estimate we only have about 400,000 in English. That is the information I would like to have. I don't doubt that others would like it for English and for other languages.
- I don't see what good this particular page does. DCDuring TALK 19:34, 11 June 2012 (UTC)
- It goes beyond active, it can never be used again in a constructive way. Mglovesfun (talk) 19:36, 11 June 2012 (UTC)
- If you want to know how many entries in English we have, use Wiktionary:Statistics. This information here is OBE, and I don't see how it could be useful to anyone, so delete -- Liliana • 04:15, 12 June 2012 (UTC)
- If I subtract from the number of English gloss definitions (389K) the difference between the number of English definitions (578K) and the number of English entries (446K), I get an estimated number of English entries with lemmas of 257K. Making a generous allowance for multiple PoS and Etymologies on those pages, perhaps there are 300K English lemmas. Any way to get a better estimate than that? DCDuring TALK 04:59, 12 June 2012 (UTC)
- Define "English lemma". Are you looking for the total number of English POS sections that contain a non–"form of" definition? —RuakhTALK 20:43, 12 June 2012 (UTC)
- That would be good enough for me, assuming that "form of" includes all inflected forms, misspellings and similar, but not {{non-gloss definition}}, though the last might be used only 1000 or so times in English. Any reasonable approximation is fine. This would be nice to know from time to time. DCDuring TALK 21:32, 12 June 2012 (UTC)
- What about "alternative form of", "obsolete form of", and so on? —RuakhTALK 21:35, 12 June 2012 (UTC)
- I would be fine with excluding them and also any "translation only entries", few though they may be. OTOH I would really like to include translingual terms that are understood in English. How many of the ones in unaccented Latin script would not be understood in English? If there were some generally accepted standard among English or monolingual dictionaries or indeed any accepted standard among any group, I would be happy to accept that. DCDuring TALK 22:53, 12 June 2012 (UTC)
- I'd also like to exclude Phrasebook entries, though I think some of those entries are misclassified. DCDuring TALK 22:55, 12 June 2012 (UTC)
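For anyone checking the arithmetic in DCDuring's estimate above, here is a quick sketch in Python. The variable names are mine, and the figures are the rounded ones quoted in the discussion, not fresh numbers from a dump.

```python
# Rounded figures quoted in the discussion (June 2012).
gloss_definitions = 389_000   # English gloss definitions
all_definitions = 578_000     # all English definitions
entries = 446_000             # English entries

# Definitions beyond the first one on each entry.
extra_definitions = all_definitions - entries

# Rough estimate of entries that carry at least one lemma.
lemma_entries = gloss_definitions - extra_definitions

print(extra_definitions)  # 132000
print(lemma_entries)      # 257000
```

The "perhaps 300K" figure is then a judgment-call allowance on top of this for pages with more than one PoS or Etymology; it does not follow from the subtraction itself.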
- I've just scanned the latest dump (from less than a week ago) for each set of English definitions (split by ety and POS), and used two different approaches to count the sets:
- Approach 1: A set of definitions counts as a lemma if (1) any definition has any wikitext, other than whitespace or periods, that is not inside any template other than perhaps {{w|...}} or {{l|en|...}}; or (2) any definition has any of the templates {{non-gloss definition}}, {{acronym of}}, {{initialism of}}, {{n-g}}, {{given name}}, {{surname}}, {{abbreviation of}}, {{short for}}.
- Approach 2: A set of definitions counts as a lemma if any definition has any wikitext, other than whitespace or periods, that is not inside any argument-less template, nor inside any of a few dozen form-of templates or a few dozen context templates. (This involved a bunch of special-casing.)
- Approach 2 is, barring truly bizarre wikitext, a strict superset of Approach 1; the idea is that Approach 1 should give a lower bound, and Approach 2 should give an upper bound, so that I could add special cases to each approach, letting them approach each other asymptotically, until they gave values that I considered "close enough". (Of course, these are lower and upper bounds on an idiosyncratic value; I counted various things as non-lemma "form of"-s, and various other things as yes-lemma definitions, that under a different phase of the moon I might have treated the opposite way.)
- So, that explained . . . Approach 1 gave 298,322; Approach 2 gave 299,516. So, for one arbitrary and haphazard definition of "English lemma", based roughly on DCDuring's definition but cutting out the harder-to-implement parts (phrasebook, translingual) and filling in sketchinesses not covered above, we're within 1% of having 300,000 English lemmata.
- —RuakhTALK 02:25, 13 June 2012 (UTC)
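A rough sketch of how Approach 1 could be coded, in Python. This is not Ruakh's actual script, which isn't shown here. It assumes each definition is a single `#` line, that templates can be stripped innermost-first with a regular expression, and that the visible text of {{w|...}} and {{l|en|...}} is their last argument. All function and constant names are hypothetical.

```python
import re

# Innermost template: "{{...}}" with no braces inside.
INNERMOST = re.compile(r"\{\{([^{}]*)\}\}")

# Templates whose contents still count as ordinary definition text.
TRANSPARENT = {"w", "l"}

# Templates that make a definition a lemma by themselves (condition 2).
LEMMA_TEMPLATES = {
    "non-gloss definition", "acronym of", "initialism of", "n-g",
    "given name", "surname", "abbreviation of", "short for",
}


def definition_is_lemma(definition: str) -> bool:
    """Apply the two Approach 1 conditions to one definition line."""
    text = definition
    while True:
        match = INNERMOST.search(text)
        if match is None:
            break
        parts = match.group(1).split("|")
        name = parts[0].strip()
        if name in LEMMA_TEMPLATES:
            return True  # condition (2)
        if name in TRANSPARENT and len(parts) > 1:
            replacement = parts[-1]  # keep the linked text
        else:
            replacement = ""  # text inside other templates is ignored
        text = text[:match.start()] + replacement + text[match.end():]
    # Condition (1): anything left besides the "#" marker,
    # whitespace and periods?
    leftover = re.sub(r"[\s.]", "", text.strip().lstrip("#"))
    return bool(leftover)


def is_lemma(definitions: list[str]) -> bool:
    """A set of definitions is a lemma if any one definition qualifies."""
    return any(definition_is_lemma(d) for d in definitions)


print(is_lemma(["# {{plural of|cat|lang=en}}"]))          # False
print(is_lemma(["# A domesticated {{l|en|feline}}."]))    # True
print(is_lemma(["# {{surname|lang=en}}"]))                # True
```

Approach 2 would differ mainly in which templates get blanked out: only argument-less templates and a fixed list of form-of and context templates, with everything else treated as definition text.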
- Thank you. The Translingual entries matter because a competing monolingual dictionary would almost always have entries for symbols, numbers, and taxons and count them as lemmata. And many (most?) of our phrasebook entries are not typically in a monolingual dictionary. Some of our lemmas may involve some double counting, such as where we have a sense for a verb that has a notation like "usually with to" and a full entry for just that sense at [[VERB to]]. There is no point in trying to achieve further precision when we are still not quite getting to the "right" target.
- I guess my ultimate objective is to be able to compare en.wikt as a monolingual dictionary with other monolingual dictionaries, principally MWOnline, MW 3rd, and the OED, print and online. I am quite sure that we have fewer senses for the most polysemic English words than these dictionaries. It would be nice if we were within striking distance for lemmas. I will check my references for any counts that have been prepared both for the numbers and for the operationalized definition of "lemma" used. DCDuring TALK 02:58, 13 June 2012 (UTC)
- Landau has a good methodology for when it is justified to include a dictionary's bold-faced terms as valid headwords/lemmas for counting "size". In some ways our format and methodology make things more clear cut as we do not include our nearest equivalent to run-ins: derived terms. Also we exclude abbreviations that only appear on the page of the abbreviendum and terms only in lists in our appendices.
- MW3 claimed some 450,000 headwords, the OED 425,000 or so. MW3's total includes many terms in what it calls International Scientific Vocabulary, some of which we include as Translingual. OED's total includes many terms that we would call Middle English. Neither includes "encyclopedic" entries, in contrast to many other dictionaries.
- We could also take a sample of headwords from MW3 (<1000 words) and determine whether we have the term covered and conversely. That would enable us to estimate our size relative to MW3. The sampling procedure for MW3 would be a random or quasi-random sample of pages, then of columns, and lastly a distance from the first line to get a list of lemmas, then again weighting inversely by the number of lines for the entry (to avoid overrepresenting polysemous terms). The same could be done for any print dictionary. I don't know how to sample MWOnline or other online dictionaries. I assume someone here could generate a random or quasi-random sample of English headwords. DCDuring TALK 00:38, 14 June 2012 (UTC)
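The sampling idea above could be sketched like this in Python. Everything here is hypothetical: the toy "dictionary" is invented, each printed line is represented by the headword of the entry it belongs to, and pages and columns are assumed to be of equal length so that picking page, then column, then line is uniform over lines.

```python
import random
from collections import Counter


def sample_headwords(pages, n, rng):
    """Pick n lines at random (page, then column, then line offset).

    Returns (headword, weight) pairs, where weight = 1 / number of lines
    in that headword's entry, so long entries are not overrepresented.
    """
    entry_lengths = Counter(
        line for page in pages for column in page for line in column
    )
    draws = []
    for _ in range(n):
        column = rng.choice(rng.choice(pages))
        headword = rng.choice(column)
        draws.append((headword, 1.0 / entry_lengths[headword]))
    return draws


def estimate_coverage(draws, ours):
    """Weighted share of sampled headwords that we also have."""
    total = sum(weight for _, weight in draws)
    covered = sum(weight for headword, weight in draws if headword in ours)
    return covered / total


# Toy data: 2 pages x 2 columns x 4 lines, one headword per line.
pages = [
    [["set", "set", "set", "sett"], ["settee", "setter", "setter", "settle"]],
    [["seven", "seven", "sever", "several"], ["severe", "sew", "sew", "sewer"]],
]
our_headwords = {"set", "settee", "setter", "seven", "several", "sew"}

draws = sample_headwords(pages, 200, random.Random(0))
print(round(estimate_coverage(draws, our_headwords), 2))
```

Running it the other way round (sampling our own headwords and looking them up in the print dictionary) would need a list of English headwords from a dump; that part is not sketched here.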
- What about splitting by just etymology and not POS? I think that's more like what other dictionaries mean when they say "n entries".—msh210℠ (talk) 22:42, 14 June 2012 (UTC)
- That knocks us down to just 133,470 lemmata. (I liked it better the other way!) —RuakhTALK 23:34, 14 June 2012 (UTC)
- Landau is pretty explicit that he would count each Etymology-PoS combination. I would find it hard to believe that any commercial dictionary would use a method that would reduce their total count of headwords. A few of our distinct etymologies might be questioned, such as those that separate nouns from verbs because there were distinct Old English words for the noun and verb, both from a common root or certain back-formations from derived terms that end with the same spelling and related meaning, but they are defensible and Landau would defend them, I think.
- Landau, Dictionaries: The Art and Craft of Lexicography (2001), pages 109-114, "Entry Counting" lays out what he would count. It is oriented toward justifying inclusion of some items for which print dictionaries, even big ones with small type, save space by eliminating a separate entry or definition for the bolded "headword". One point he makes is that mere lists of, say, words prefixed with un- count as long as the normal meaning of the word is totally predictable from the meaning of the two morphemes. If we are diligent in including valid redlinks as entries, we need not concern ourselves with such rationalizations. DCDuring TALK 00:09, 15 June 2012 (UTC)
- Any chance of staying on topic, chaps? Mglovesfun (talk) 18:25, 14 June 2012 (UTC)
- Nope, none at all. Personally, I found this to be just as unfit for RFDO as it was fascinating. --Μετάknowledgediscuss/deeds 20:54, 14 June 2012 (UTC)
- Well, at least it helped show the extreme distance between what we have (the page in question and the better pages with data that still don't address the issue very well) and what we could use: data to allow headword-to-headword comparison with other dictionaries. DCDuring TALK 21:07, 14 June 2012 (UTC)
- Maybe move the contents to Template talk:count page? I think we should have an explanation somewhere, for anyone who wonders why there was a template with a link inside it on every linkless page on Wiktionary for a long while... --Yair rand (talk) 19:46, 19 August 2012 (UTC)
- I thought I'd commented in this discussion already; it seems I haven't: I favour keeping the page, marking it as historical. If the contents are moved to Template talk:count page, I favour leaving a redirect from WT:Page count. - -sche (discuss) 20:02, 19 August 2012 (UTC)