User:Tbot

Owner/runner: User:Robert Ullmann as of 4 September 2007

Tasks

The bot's primary task is to update the instances of template {{t}} in translations sections:

  • If the FL (foreign language) term exists in the FL wikt, the template is changed to {{t+}}.
  • If the FL term does not exist in the FL wikt, it is changed to {{t-}}.
  • If the status is not known, the template is not changed.
  • When the language code is not in the list of codes handled by {{t-sect}}, it adds an xs= parameter to provide the section link; the language code template is then not used to parse the entry.
  • When the xs= parameter is present, but not needed, it is removed.
  • The template is also changed when the FL wikt does not exist.
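
The status rules above can be sketched in a few lines of Python. This is only an illustration, not Tbot's actual code: `update_template`, `render` and `T_SECT_CODES` are hypothetical names, the code set is a small sample, and the parameter layout `{{t|code|word|...}}` is assumed.

```python
from typing import List, Optional

# Assumption: a small sample of the codes that {{t-sect}} handles.
T_SECT_CODES = {"ar", "de", "es", "fr", "ru"}


def update_template(params: List[str], exists: Optional[bool],
                    language_name: str) -> List[str]:
    """Return updated parameters for a {{t}}-family template.

    params is the template split on "|", e.g. ["t", "fr", "mot", "m"].
    exists is True/False when the FL wikt was checked, None if unknown.
    """
    name, code, rest = params[0], params[1], params[2:]

    # Status known: switch to t+ or t-. Status unknown: keep the name.
    if exists is True:
        name = "t+"
    elif exists is False:
        name = "t-"

    # Drop any xs= and add it back only when the code is not optimized.
    rest = [p for p in rest if not p.startswith("xs=")]
    if code not in T_SECT_CODES:
        rest.append("xs=" + language_name)

    return [name, code] + rest


def render(params: List[str]) -> str:
    """Join the parameters back into template wikitext."""
    return "{{" + "|".join(params) + "}}"
```

For example, `render(update_template(["t", "fr", "mot", "m"], True, "French"))` gives `{{t+|fr|mot|m}}`, while a Swahili term with unknown status keeps `t` and gains `xs=Swahili`.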

Adding templates

The bot presently tries to add templates when possible. There are a lot of variations in format, mostly "standard", but that still allows for considerable variation.

  • A translation line is only modified if it is completely parsed.
  • Languages without wikts are not done.
  • (at present) Words without a local entry are skipped.
  • Most simple combinations of #section references, gender and number are handled.
  • (to do) Replace explicit links to FL wikt.
  • (to do) Match script templates.
  • (partly done, knows some) Recognize a large set of simple transliterations of the FL word. In general, text found in parentheses prevents conversion.
  • Very simple cases of an unlinked word are parsed.

At present the bot only converts to t templates in entries that it is looking at; i.e. entries that already have at least one that may need to be updated.
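
The "completely parsed" rule can be illustrated with a regular expression that must match the whole line. This is a minimal sketch under stated assumptions, not the bot's parser: the line format `* Language: [[word]] {{g}}`, the `LANGUAGE_CODES` table and the function `convert_line` are hypothetical, and only one linked word with an optional gender template is handled.

```python
import re
from typing import Optional

# Assumption: a tiny name-to-code table; the real set is much larger.
LANGUAGE_CODES = {"French": "fr", "German": "de", "Spanish": "es"}

# One linked word, optionally followed by a gender template such as {{m}}.
LINE = re.compile(
    r"\*\s*(?P<lang>[^:]+):\s*"
    r"\[\[(?P<word>[^\]|#]+)\]\]"
    r"(?:\s*\{\{(?P<gender>[mfnc])\}\})?\s*"
)


def convert_line(line: str) -> Optional[str]:
    """Convert a simple translation line to use {{t}}, or return None."""
    match = LINE.fullmatch(line)
    if match is None:
        return None  # not completely parsed: leave the line untouched
    language = match.group("lang").strip()
    code = LANGUAGE_CODES.get(language)
    if code is None:
        return None  # no known wikt for this language: not done
    parts = ["t", code, match.group("word")]
    if match.group("gender"):
        parts.append(match.group("gender"))
    return "* %s: {{%s}}" % (language, "|".join(parts))
```

Because `fullmatch` is used, any leftover text (for instance a parenthesized note after the link) makes the function return `None`, so the line is left as it was.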

Creating entries

Tbot creates new entries from the translations tables it is updating.

Note that the concept of "translation" as commonly understood is rarely correct. The naïve question "What is the word in X for Y?" is unanswerable in the general case; languages don't work that way. English speak, talk, chat, gossip "translate" to Swahili kunena, kusema, ongea, but not one-to-one. Almost all combinations of those are possible. (All but gossip to kunena, which would never be correct in Tanzania but is heard in Kenya!) Then there is announce, proclaim, declaim, and further enunciate, and so on.

Even technical terms that one would think are one-to-one are often not. (English high voltage corresponds to French haute tension, but English also has high tension.)

What we call "Translations" is, and can only be, "words that correspond in some way in the other language."

That said, it is usually useful to create an entry for the FL word, "defining" it as the English word, with the translations gloss.

Languages

The languages that Tbot will create entries for are controlled by the existence of the language wiktionary. The word must exist in the FL wiktionary, with sufficient information for Tbot to verify it. (A null template, like a large number of entries in the ru.wikt, will not allow creation of the entry.)

Each entry is put in the corresponding language category, for example Category:Tbot entries (French). If the category doesn't exist, it will appear directly in Category:Tbot entries.

Each entry is also in a monthly category, e.g. Category:Tbot entries December 2007. The categories are added by {{tbot entry}}.
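
The category naming can be sketched as follows. This is only an illustration of the naming scheme described above; the function `entry_categories` and the `existing` set of category names are hypothetical, and on the wiki the work is done by the template rather than by Python code.

```python
import datetime
from typing import List, Set

# Month names spelled out so the result does not depend on the locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def entry_categories(language_name: str, when: datetime.date,
                     existing: Set[str]) -> List[str]:
    """Return the language category and the monthly category for an entry."""
    language_cat = "Category:Tbot entries (%s)" % language_name
    if language_cat not in existing:
        # Fall back to the parent category when the language one is missing.
        language_cat = "Category:Tbot entries"
    monthly_cat = "Category:Tbot entries %s %d" % (
        MONTHS[when.month - 1], when.year)
    return [language_cat, monthly_cat]
```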

Process

For each translation where the local entry [TBD: language section] does not exist, and the FL wikt exists, and the word is in the FL wikt, Tbot reads the FL entry. If the entry refers to the English language word, it concludes that the translation is valid, if not "exact" (which, as observed above, is generally unattainable anyway).

It then carries over the picture and audio if found, checking each on Commons (making sure they weren't uploaded locally on the FL wikt, or are at least duplicated on Commons), and isolates the IPA if possible. It also recognizes the local equivalent of {{wikipedia}} as far as its table goes, or when it has the same name. (It does not check the existence of the FL 'pedia article.)

Tbot creates the local entry, adding {{tbot entry}}, and adds (or updates to) the {{t+}} template, since the FL entry is known to exist.
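
The decision to create an entry can be summarized as a short check. This is a sketch under assumptions, not the bot's code: `Candidate` and `should_create` are hypothetical names, and "refers to the English word" is approximated here by looking for a wikilink to it in the FL entry text, which is almost certainly cruder than what Tbot does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    english: str                  # the English headword
    word: str                     # the FL term from the translations table
    local_exists: bool            # is there already a local entry?
    fl_wikt_exists: bool          # does a wikt exist for this language?
    fl_entry_text: Optional[str]  # text of the FL entry, None if absent


def should_create(c: Candidate) -> bool:
    """Decide whether a local entry may be created for the FL word."""
    if c.local_exists:
        return False  # existing entries are not touched
    if not c.fl_wikt_exists or c.fl_entry_text is None:
        return False  # nothing to verify the word against
    # Assumption: a link back to the English word validates the translation.
    return ("[[%s]]" % c.english) in c.fl_entry_text
```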

Scripts

Tbot recognizes words in various scripts, and adds them to {{t}} and {{infl}} when creating an entry. The scripts it knows now are Greek, Cyrillic, Armenian, Hebrew, Syriac, Arabic, Devanagari, Bengali, Georgian, and all the CJKV scripts and variants (including Han Extension B on plane 2). For Arabic, it uses fa-Arab, ur-Arab, and pa-Arab as appropriate, also Hayeren for Armenian, and polytonic for Ancient Greek.
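
Script recognition of this kind is commonly done by checking code points against Unicode block ranges. The sketch below shows the idea for a subset of the scripts listed; `detect_script` and `script_template` are hypothetical names, the labels in `RANGES` are plain script names rather than template names, and the use of the code `grc` for Ancient Greek is an assumption. Only the template names given in the text (fa-Arab, ur-Arab, pa-Arab, Hayeren, polytonic) are returned.

```python
from typing import Optional

# (first code point, last code point, script label); Unicode block ranges.
RANGES = [
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0700, 0x074F, "Syriac"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x10A0, 0x10FF, "Georgian"),
    (0x4E00, 0x9FFF, "Han"),
    (0x20000, 0x2A6DF, "Han"),  # Han Extension B, on plane 2
]


def detect_script(word: str) -> Optional[str]:
    """Return the script label of the first recognized character."""
    for char in word:
        point = ord(char)
        for first, last, label in RANGES:
            if first <= point <= last:
                return label
    return None  # Latin or an unlisted script


def script_template(code: str, script: Optional[str]) -> Optional[str]:
    """Return a script template name where the text above names one."""
    if script == "Arabic" and code in ("fa", "ur", "pa"):
        return code + "-Arab"
    if script == "Armenian":
        return "Hayeren"
    if script == "Greek" and code == "grc":  # assumption: grc = Ancient Greek
        return "polytonic"
    return None  # no template name is given in the text for other cases
```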

Other attributes

Tbot uses link alternation (the displayed text of a piped link) to produce alt= in {{t}} and head= in {{infl}}, and does the same with the parameter of {{he-translation}}. It recognizes and uses the transliteration as tr= in {{infl}} and the various genders and numbers in both {{t}} and {{infl}}.
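
As an illustration of link alternation, a piped link such as `[[amo|amō]]` can be split into the page name and the displayed form, with the latter becoming alt=. The function `t_from_link` and the order of the parameters are assumptions made for this sketch.

```python
import re
from typing import Optional

# A piped wikilink: [[target|displayed text]]
PIPED_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]|]+)\]\]")


def t_from_link(code: str, link: str, gender: Optional[str] = None) -> str:
    """Build a {{t}} template from a plain or piped wikilink."""
    match = PIPED_LINK.fullmatch(link)
    if match:
        target, shown = match.groups()
    else:
        target, shown = link.strip("[]"), None
    parts = ["t", code, target]
    if gender:
        parts.append(gender)
    # Only add alt= when the displayed form differs from the page name.
    if shown and shown != target:
        parts.append("alt=" + shown)
    return "{{" + "|".join(parts) + "}}"
```

Here `t_from_link("la", "[[amo|amō]]")` yields `{{t|la|amo|alt=amō}}`, and a plain link such as `[[mot]]` produces no alt= at all.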

Issues, current status

(a number of the restrictions are temporary, as a starting point)

  • Tbot only updates entries that already have {{t}} templates, or have section references or explicit FL.wikt references that would be good to fix.
  • It only converts a line to use {{t}} if the (local) word exists, or if it can create it.
  • It doesn't add the template or create an entry if the language is not in the set of 170 (as of 2007) that have wikts.
  • It doesn't add language sections to existing entries; this makes it easier to remove bad entries.
  • It only recognizes transliterations for Cyrillic, Arabic, a few in Hebrew, etc. It doesn't do this yet for things like Kanji and Hanzi (although it does for kana). If it can't recognize the transliteration, it won't modify the line, because it can't know whether the text in parentheses is the transliteration or a qualifier.

Also see User:Tbot/tbot entry.

Technical notes

Optimized codes

Template {{t-sect}}, used by t/t-/t+, improves performance by supplying the language name directly for a number of common languages. It uses two sub-templates, {{t-lang}} and {{t-lan2}}. The first set is the languages with the most entries and translations; the second set is the other languages with more than 2000 translations as of September 2007.

t-lang languages (16)
  • ar Arabic
  • da Danish
  • de German
  • es Spanish
  • fi Finnish
  • fr French
  • el Greek
  • he Hebrew
  • it Italian
  • ja Japanese
  • ko Korean
  • nl Dutch
  • no Norwegian
  • pt Portuguese
  • sv Swedish
  • ru Russian

t-lan2 languages (17)
  • bg Bulgarian
  • bs Bosnian
  • ca Catalan
  • cs Czech
  • et Estonian
  • hr Croatian
  • hu Hungarian
  • is Icelandic
  • ku Kurdish
  • la Latin
  • pl Polish
  • ro Romanian
  • sr Serbian
  • sk Slovak
  • sl Slovene
  • te Telugu
  • tr Turkish

Note that the Chinese languages have not yet been addressed. Also, the constructed languages Esperanto and Interlingua have more than 2000 translations but have not been included in the optimizations. (Ido has just over 1000.)
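
The two-level lookup can be pictured as two tables consulted in order, with xs= as the fallback. The sketch below only mirrors the lists above; the names `T_LANG`, `T_LAN2` and `section_name` are hypothetical, and the real templates are written in wikitext, not Python.

```python
from typing import Optional

# First set: languages with the most entries and translations.
T_LANG = {
    "ar": "Arabic", "da": "Danish", "de": "German", "es": "Spanish",
    "fi": "Finnish", "fr": "French", "el": "Greek", "he": "Hebrew",
    "it": "Italian", "ja": "Japanese", "ko": "Korean", "nl": "Dutch",
    "no": "Norwegian", "pt": "Portuguese", "sv": "Swedish", "ru": "Russian",
}

# Second set: other languages with more than 2000 translations.
T_LAN2 = {
    "bg": "Bulgarian", "bs": "Bosnian", "ca": "Catalan", "cs": "Czech",
    "et": "Estonian", "hr": "Croatian", "hu": "Hungarian", "is": "Icelandic",
    "ku": "Kurdish", "la": "Latin", "pl": "Polish", "ro": "Romanian",
    "sr": "Serbian", "sk": "Slovak", "sl": "Slovene", "te": "Telugu",
    "tr": "Turkish",
}


def section_name(code: str, xs: Optional[str] = None) -> Optional[str]:
    """Return the language section name for a code.

    The optimized tables are tried first; otherwise the xs= value is used.
    """
    for table in (T_LANG, T_LAN2):
        if code in table:
            return table[code]
    return xs
```

A code such as `fr` resolves from the first table, `la` from the second, and a code outside both (for example `eo`) resolves only when xs= is supplied.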