Lemma List

I have noticed that not all lemma entries are automatically added to the lemma list for their respective language. Is this intentional, or is something broken? For example, the Dutch lemma list seems to have only around 700 terms, although many more Dutch lemmas have been entered overall. I also noticed that some Serbo-Croatian nouns, such as ogrlica, have no link to the lemma list at the bottom of the page, even though the Serbo-Croatian lemma list has quite a few entries, almost 20,000.

Martin123xyz (talk) 19:22, 16 July 2014

Right now, only {{head}}, and by extension any template that uses it, has been modified to add the lemma category. But there are still several headword templates that use something else instead, and some that use a special module rather than going through {{head}}. I do intend to fix those eventually, but it will take a while. Right now I'm trying to make sure that all pages that use {{head}} also specify a part of speech. There are still 3000 to fix...

CodeCat 19:33, 16 July 2014

I'm looking forward to the English lemma list (also very empty at the moment) being fully populated. It's something I would use relatively often (as a reader rather than an editor), so thanks! Maybe we should destroy or redirect the out-of-date Index:English at that future stage.

Equinox 19:30, 23 July 2014

If you want to help out, something like a list of all headword templates that use neither {{head}} nor (possibly indirectly) Module:headword would be very useful. Those are the ones that would be the hardest to track down and update.

CodeCat 19:32, 23 July 2014

Not really sure how I'd approach that, as I haven't learned any Lua (maybe some day, hrmm). However, I was scanning the lemma list again today and it's looking much better, actually better than the old Index:English. I really get a kick out of browsing the weird words in the J, Q, X and Z lists for some reason.

Equinox 22:51, 30 July 2014

That's because it's not just {{head}} that adds the category, but Module:headword itself now. So it also works for all the modules that use it, like Module:en-headword which all the English ones use. But there may still be some that don't use the module or the template yet, so we need to find them. Not easy, but still...

CodeCat 22:55, 30 July 2014

There are also several thousand entries that have no headword template at all, if either of you are looking for something tedious to do.

DTLHS (talk) 22:55, 30 July 2014

Do you think you could find them in a dump? Something like finding all entries with three apostrophes (''', i.e. bold markup) or a semicolon at the beginning of a line should find most of them. A bot might then be able to thin the list out enough that doing the rest by hand becomes feasible.

CodeCat 22:57, 30 July 2014
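
A minimal Python sketch of the heuristic suggested above, assuming the dump has already been parsed into (title, wikitext) pairs. The function names are hypothetical and reading the dump itself is not shown; this is an illustration of the idea, not the script anyone in this thread actually ran.

```python
import re

# Heuristic from the post above: a line that starts with ''' (bold markup)
# or ; (definition-list term) often stands in for a missing headword template.
BARE_HEADWORD = re.compile(r"^(?:'''|;)", re.MULTILINE)


def has_bare_headword_line(wikitext: str) -> bool:
    """Return True if any line of the page starts with ''' or ;."""
    return BARE_HEADWORD.search(wikitext) is not None


def candidate_titles(pages):
    """pages: iterable of (title, wikitext) pairs (assumed input shape).

    Returns the sorted, de-duplicated titles that match the heuristic.
    """
    return sorted({title for title, text in pages if has_bare_headword_line(text)})
```

The heuristic is deliberately loose: it will also flag pages that use bold or a semicolon at the start of a line for other reasons, so the result is a list to thin out rather than a final answer.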

Actually, I should just be able to use the pagelinks dump to find what links to Module:headword, and then subtract that from the set of page titles.

DTLHS (talk) 23:41, 30 July 2014
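
The subtraction described above could look roughly like this in Python. It assumes both lists of titles have already been extracted from the dumps into plain iterables of strings; the function name is hypothetical and the dump parsing is left out.

```python
def titles_without_link(all_titles, linking_titles):
    """Return titles that are in all_titles but not in linking_titles.

    all_titles:     every entry title (assumed already extracted from a dump)
    linking_titles: titles of pages recorded as linking to the module
    Works at whole-page granularity: a page counts as "linking" if any
    part of it does.
    """
    return sorted(set(all_titles) - set(linking_titles))
```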

That won't find entries where some sections use a template but not others.

CodeCat 23:58, 30 July 2014

That's true, I might do that one later. In the meantime, there are 367,479 entries that probably have no headword template (most of them are Italian; ~64,000 are not). Here's the list: https://dl.dropboxusercontent.com/u/28940500/no%20head.txt.gz

DTLHS (talk) 02:25, 31 July 2014

Thank you for that list. Can I ask how it was generated?

CodeCat 11:01, 31 July 2014

I've started working on the list, but it will take a while to work through it. Right now the bot is replacing all instances of a bolded page title on a line by itself with {{head|xx}}. That seems to be catching almost all of them now.

I would appreciate it if you could give regular updates as the dumps are released, so that I know what still needs to be done. I'd also like to request that the list contain only the page names (not the languages), sorted and with duplicates removed. Would that be possible for the next dump?

CodeCat 17:29, 31 July 2014
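
For readers curious what such a replacement might look like, here is a hedged Python sketch. It is not the bot's actual code; the function name and parameters are hypothetical, and it only handles the simple case described above, where the bolded page title is the only thing on its line.

```python
import re


def replace_bare_headword(wikitext: str, title: str, lang_code: str) -> str:
    """Replace a line consisting only of '''title''' with {{head|lang_code}}.

    Bold occurrences of the title inside other lines (e.g. usage examples)
    are left untouched, because the pattern is anchored to the whole line.
    """
    pattern = re.compile(
        r"^'''" + re.escape(title) + r"'''[ \t]*$", re.MULTILINE
    )
    replacement = "{{head|" + lang_code + "}}"
    # A function is used as the replacement so the text is inserted literally.
    return pattern.sub(lambda match: replacement, wikitext)
```

Calling `replace_bare_headword(text, "ogrlica", "xx")` would rewrite only the stand-alone headword line. The result still lacks a part of speech, so pages changed this way would need a later pass to add one.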

Of course. And I just split pages by language section, then looked for any lines containing '''PAGENAME'''.

DTLHS (talk) 19:28, 31 July 2014
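
The method described above can be sketched in Python as follows. This is a reconstruction under assumptions, not DTLHS's script: it treats every level-2 heading (`==Language==`) as the start of a language section, and the function names are hypothetical.

```python
import re

# Level-2 headings such as ==English==; ===Noun=== and deeper are excluded
# because the character after the opening == must not be another "=".
LANGUAGE_HEADING = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)


def split_language_sections(wikitext: str) -> dict:
    """Map each language name to the text of its section."""
    matches = list(LANGUAGE_HEADING.finditer(wikitext))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1).strip()] = wikitext[match.end():end]
    return sections


def sections_with_bold_title(title: str, wikitext: str) -> list:
    """Return the languages whose section has a line containing '''title'''."""
    needle = "'''" + title + "'''"
    return [
        language
        for language, body in split_language_sections(wikitext).items()
        if any(needle in line for line in body.splitlines())
    ]
```

Because the check is per language section, it can flag a section that lacks a headword template even when another section on the same page has one.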

That won't have caught them all, but it's a good start. Thank you.

CodeCat 19:38, 31 July 2014