User talk:JeffDoozan/lists/approved taxons

From Wiktionary, the free dictionary
Latest comment: 2 months ago by JeffDoozan

The double-bracketed items are excellent candidates for {{taxlink}}. Those with two-part names are species; those with three- or four-part names are subspecies or varieties. Others need more discriminating tests. Can you extract them?
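The word-count rule above can be sketched in a few lines of Python. This is only an illustration of the heuristic as described, not the bot's code; the function names (`bracketed_items`, `guess_rank`, `classify`) are made up for the example, and the rule cannot tell a subspecies from a variety.

```python
import re
from typing import List, Optional, Tuple

# Capture the link target of [[target]], [[target|label]] or [[target#anchor]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def bracketed_items(text: str) -> List[str]:
    """Return the targets of all double-bracketed items in the text."""
    return [target.strip() for target in LINK_RE.findall(text)]


def guess_rank(name: str) -> Optional[str]:
    """Guess a rank from the number of space-separated parts in a name."""
    parts = name.split()
    if len(parts) == 2:
        return "species"
    if len(parts) in (3, 4):
        return "subspecies or variety"
    return None  # one-part or longer names need a more discriminating test


def classify(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pair every double-bracketed item with its guessed rank."""
    return [(name, guess_rank(name)) for name in bracketed_items(text)]
```

Items that come back with `None` are the ones that would still need a manual look or a better test.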

The second group has a much lower percentage yield. Getting rid of all-capitalized terms and items with more than 4 words would help. Extracting terms with a probable species layout (in regex: /[A-Z][a-z]+ [a-z]+/) would give a large percentage of the possibly good names with a manageable number of false positives.
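A rough Python sketch of the two filters suggested above. The regex is the one quoted in the comment; anchoring it to the whole term with `fullmatch` is an assumption made here to keep longer phrases out, and the function names are hypothetical.

```python
import re
from typing import List

# Species layout from the comment above: capitalized genus, lowercase epithet.
SPECIES_LAYOUT = re.compile(r"[A-Z][a-z]+ [a-z]+")


def drop_noise(terms: List[str]) -> List[str]:
    """Remove all-capitalized terms and terms with more than 4 words."""
    return [t for t in terms if not t.isupper() and len(t.split()) <= 4]


def probable_species(terms: List[str]) -> List[str]:
    """Keep only terms that match the species layout as a whole."""
    return [t for t in terms if SPECIES_LAYOUT.fullmatch(t)]
```

Note that the quoted pattern would miss hyphenated epithets (for example a name ending in "uva-ursi"), so some good names would still need to be picked up from the noise-filtered list.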

Is there a relatively easy way to find and extract redlinks in taxonomic Translingual L2s? That would have a high yield of correct names. In addition the rank is usually available.

Is there a relatively easy way to find taxonomic Translingual L2s with redlinks? (This unsigned comment was added by DCDuring (talk, contribs).)
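One possible approach to the redlink question, sketched in Python under some assumptions: the page wikitext is already available as a string, and `existing_titles` is a hypothetical set of page titles that exist on the wiki (for example, loaded from a dump). Neither the function names nor that set come from the discussion.

```python
import re
from typing import List, Set

# A level-2 header is "==Name==" on its own line (not "===Name===").
L2_HEADER = re.compile(r"^==([^=\n][^\n]*?)==[ \t]*$", re.M)
# Capture the link target of [[target]], [[target|label]] or [[target#anchor]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def translingual_section(wikitext: str) -> str:
    """Return the body of the Translingual L2, or '' if there is none."""
    headers = list(L2_HEADER.finditer(wikitext))
    for i, header in enumerate(headers):
        if header.group(1).strip() == "Translingual":
            end = headers[i + 1].start() if i + 1 < len(headers) else len(wikitext)
            return wikitext[header.end():end]
    return ""


def redlinks(wikitext: str, existing_titles: Set[str]) -> List[str]:
    """List link targets in the Translingual L2 that have no page yet."""
    missing: List[str] = []
    for target in LINK_RE.findall(translingual_section(wikitext)):
        target = target.strip()
        if target not in existing_titles and target not in missing:
            missing.append(target)
    return missing
```

A page is then a "Translingual L2 with redlinks" whenever `redlinks(...)` returns a non-empty list. File and category links would need to be excluded in practice.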

@DCDuring: I extracted the image links you flagged to User:JeffDoozan/lists/approved taxons and labelled those with two-part names as species. When I can run the bot again, it will update User:JeffDoozan/lists/possible taxons with your suggested changes and split italics/redlinks into three groups: translingual with taxon, translingual, and everything else. It looks like the "translingual with taxon" list is pretty useful and the others can probably be ignored or skimmed for easy matches. The list of taxonomic Translingual L2s with redlinks will also be created by the bot. JeffDoozan (talk) 12:59, 8 March 2024 (UTC)
I admit that I am losing track of the workflow in this overall process. I don't really know what to expect from a run of the bot on a given taxonomic Translingual page (henceforth "tTp"), nor how many passes it might take, nor how I can best help the bot by feeding it taxonomic names (henceforth, "taxon/taxa"). I am also a bit disappointed that so many items on a tTp need manual attention after the bot makes a pass. It is further convincing me that we need to use Catalog of Life data as a spell-checker and a taxon finder. (It doesn't cover many ranks, like infraorder, tribe, subtribe, clade [not really a "rank", but so called for us].) For those other ranks WP is best in content, but WikiSpecies is much more convenient for data extraction. DCDuring (talk) 14:00, 8 March 2024 (UTC)
Don't worry about the number of passes; that's just me being very cautious that I don't break anything and making it easier for me to fix in case I do. Right now, the bot gathers taxon names and ranks from {{taxon}} and uses them to find wikitext that should use {{taxfmt}}. It also extracts the taxon name and rank from existing {{taxlink}} templates and uses them to identify wikitext that should use {{taxlink}}. If it's not replacing some items on a tTp, it's because they're not referenced by an existing {{taxon}} or {{taxlink}}, and if you want it to fix them, we'll need another source of "taxon name" and "rank". User:JeffDoozan/lists/possible_taxons is a potential source of names that you can add ranks to, with the bonus of knowing that they already exist on Wiktionary and will be replaced. If there's a way to extract name and rank from Catalog of Life or Wikispecies, that would probably yield a much bigger list of possible substitutions. JeffDoozan (talk) 15:43, 8 March 2024 (UTC)
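For readers trying to follow the workflow, here is a simplified Python sketch of the two-step idea described above (gather name/rank pairs, then rewrite matching links). It is not the bot's actual code. It assumes that the first parameter of {{taxon}} is the rank of the page's own title, that {{taxlink}} and {{taxfmt}} take the name and then the rank, and that only plain or italicized `[[Name]]` links are rewritten; all function names are hypothetical.

```python
import re
from typing import Dict, Tuple

# Assumed layouts: {{taxon|RANK|...}} and {{taxlink|NAME|RANK|...}}.
TAXON_RE = re.compile(r"\{\{taxon\|([^|{}]+)")
TAXLINK_RE = re.compile(r"\{\{taxlink\|([^|{}]+)\|([^|{}]+)")


def collect(pages: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Gather {name: rank} maps from a {title: wikitext} mapping.

    The first map comes from {{taxon}} (pages that exist, so {{taxfmt}} fits);
    the second comes from existing {{taxlink}} uses.
    """
    taxfmt_names: Dict[str, str] = {}
    taxlink_names: Dict[str, str] = {}
    for title, text in pages.items():
        found = TAXON_RE.search(text)
        if found:
            taxfmt_names[title] = found.group(1).strip()
        for name, rank in TAXLINK_RE.findall(text):
            taxlink_names.setdefault(name.strip(), rank.strip())
    return taxfmt_names, taxlink_names


def templatize(text: str, names: Dict[str, str], template: str) -> str:
    """Replace [[Name]] or ''[[Name]]'' with {{template|Name|rank}}."""
    for name, rank in names.items():
        escaped = re.escape(name)
        pattern = re.compile(r"''\[\[{0}\]\]''|\[\[{0}\]\]".format(escaped))
        replacement = "{{%s|%s|%s}}" % (template, name, rank)
        # A function is used so the replacement is inserted literally.
        text = pattern.sub(lambda _match: replacement, text)
    return text
```

The sketch also shows why an item can be skipped: a link is only rewritten if its name appears in one of the two maps, so anything not already covered by a {{taxon}} or {{taxlink}} somewhere is left alone.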
The Catalogue of Life (henceforth "CoL") Download page explains the data formats and fields that they have available. Clearly, we could use it to fabricate large numbers (2+MM) of tTps, which is not what I have in mind. It would make more sense to use it to make vernacular-name (all languages!?!?!) entries and use the vernacular-name entries to trigger the creation of tTps. But just extracting taxa (both accepted names and synonyms) and ranks for purposes of spell-checking (1) tTps, (2) {{taxlink}}s, and (3) {{taxfmt}}s would be great.
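A minimal Python sketch of the spell-checking idea, assuming the CoL export has been reduced to a tab-separated table. The column names used here (`scientificName`, `rank`, `status`) and the sample rows are illustrative assumptions, not a description of the actual download; check the CoL Download page for the real field names. Fuzzy matching uses the standard library's `difflib`.

```python
import csv
import difflib
import io
from typing import Dict, List

# Illustrative rows only; the column names are assumptions about the export.
SAMPLE_TSV = "\n".join([
    "scientificName\trank\tstatus",
    "Felis catus\tspecies\taccepted",
    "Felis domesticus\tspecies\tsynonym",
    "Felis chaus\tspecies\taccepted",
    "Felidae\tfamily\taccepted",
])


def load_ranks(tsv_text: str) -> Dict[str, str]:
    """Build {name: rank}, keeping accepted names and synonyms alike."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["scientificName"]: row["rank"] for row in reader}


def suggest(name: str, ranks: Dict[str, str], cutoff: float = 0.8) -> List[str]:
    """Return close matches (best first) for a name missing from the table."""
    if name in ranks:
        return [name]
    return difflib.get_close_matches(name, list(ranks), n=3, cutoff=cutoff)
```

A name found in the table verbatim passes the spell-check and brings its rank with it; a name that is absent but has a close match is a likely misspelling worth flagging for manual review.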
Mining WikiSpecies (800+K potential tTps) would be a next step, concentrating on the ranks that CoL does not bother with.
Mining WP is too hard, because too many of the article names are vernacular names, which would mean that some (sometimes lots of) parsing of the full text would be required to get rank info and all the taxa embedded in the articles.
What I could most use are lists of:
  1. tTps that also have L2s in other languages
     User:JeffDoozan/lists/ttp_with_other_l2
  2. tTps that have multiple {{taxon}}s
     User:JeffDoozan/lists/local taxons/errors
     1. tTps that have more than one rank
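The three checks in the list above could be approximated per page with a short Python function. This is a sketch under assumptions: the page wikitext is available as a string, the first parameter of {{taxon}} is the rank, and {{taxon}} is counted anywhere on the page rather than only inside the Translingual L2. The name `audit` is hypothetical.

```python
import re
from typing import Dict

# A level-2 header is "==Name==" on its own line (not "===Name===").
L2_HEADER = re.compile(r"^==([^=\n][^\n]*?)==[ \t]*$", re.M)
# Assumed layout: {{taxon|RANK|...}}.
TAXON_RE = re.compile(r"\{\{taxon\|([^|{}]+)")


def audit(wikitext: str) -> Dict[str, bool]:
    """Flag a page for each of the three requested lists."""
    languages = [name.strip() for name in L2_HEADER.findall(wikitext)]
    ranks = [rank.strip() for rank in TAXON_RE.findall(wikitext)]
    return {
        "other_l2": "Translingual" in languages and len(languages) > 1,
        "multiple_taxon": len(ranks) > 1,
        "multiple_ranks": len(set(ranks)) > 1,
    }
```

Running this over a dump and collecting the titles for which a flag is true would give complete counts without relying on regex searches in the search box.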
I think I need to deal manually with the pages that link to each of these, possibly using regex searches, at least until I can figure out a (semi-)automated alternative. I will try to get a count of the pages of each kind using regex in the search box, but I doubt that I can get complete counts, either because of timeouts or the inadequacy of my regex skills. DCDuring (talk) 16:49, 8 March 2024 (UTC)
I added links to the requested lists. I'll have to think about what's reasonable as far as mining another source for names/ranks goes. JeffDoozan (talk) 17:20, 8 March 2024 (UTC)