This is the page for Polyglot's bot.
I have been considering creating a bot for several months now.
I used the pyWikipediaBot framework to finally get around to it.
The primary purpose of the bot is to transfer entries from one language Wiktionary to another. Of course, once it can do that, it will also be able to clean up articles within a single Wiktionary.
At first it won't be able to do this, but eventually the intention is to start with an article and check whether counterparts for it exist on other Wiktionaries (if they do, interwiktionary links can be created). The bot would then follow the links to the translations on their native Wiktionaries and check whether the translations given there correspond.
If they don't, it asks the operator whether they may be added (I think it will always be an interactive bot, at least for the purpose I have in mind for it).
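As a rough illustration of that last step (not the bot's actual code), the comparison and the question to the operator could look like the sketch below. The function names `missing_translations` and `confirm_additions` are hypothetical, and translations are simplified to plain strings.

```python
from typing import Callable, Iterable, List


def missing_translations(here: Iterable[str], there: Iterable[str]) -> List[str]:
    """Return the translations found on the other Wiktionary but not on this one."""
    known = set(here)
    return sorted(t for t in set(there) if t not in known)


def confirm_additions(candidates: Iterable[str],
                      ask: Callable[[str], str] = input) -> List[str]:
    """Ask the operator about each candidate and keep only the approved ones.

    `ask` defaults to the built-in input(), so the bot stays interactive;
    passing another callable makes the function easy to try out without a terminal.
    """
    accepted = []
    for term in candidates:
        answer = ask("Add translation '%s'? [y/N] " % term)
        if answer.strip().lower() == "y":
            accepted.append(term)
    return accepted
```

Only an explicit "y" adds a translation here; any other answer leaves the entry untouched, which seems the safer default for a bot that edits live pages.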
To avoid overcomplicating things, I'm setting a slightly less ambitious goal. The purpose now is to parse an entry, store it in an internal representation and reproduce the entry, all cleaned up. The second part of this already works. This is of course the simpler part, since it's not as open-ended as trying to parse entries that follow wildly diverging editing standards. I'm about to start on the parsing part now though. It should be feasible to parse at least the entries that mostly adhere to standards, and of course it's always possible to involve the bot operator when it's too hard.
How does it work?
It starts by representing a Wiktionary article as a series of objects:
- WiktionaryPage -> one page for a particular spelling
- Entry -> one language block on a page, description of the lemma in a particular language
- Meaning -> one meaning for this lemma
- Term -> Terms/Lemmas/Expression in the same or another language that can be referred to (that can become links)
So a WiktionaryPage can contain many Entry objects. An Entry can contain many Meaning objects. A Meaning object uses Term objects for synonyms, translations, related words, etc.
A Term object has subclasses for the different parts of speech.
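To make the hierarchy concrete, here is a minimal sketch of how these objects could nest. It is only an illustration under my own assumptions: the class names follow the list above, but the attributes (`lang`, `text`, `definition`, `translations`), the `add_translation` method and the `Noun` subclass are hypothetical and need not match what is in wiktionary.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Term:
    """A word or expression in some language that can become a link."""
    lang: str  # language code, e.g. "nl"
    text: str  # the spelling of the term


class Noun(Term):
    """Example part-of-speech subclass; a fuller one might add gender or plural."""


@dataclass
class Meaning:
    """One meaning of a lemma, with the terms attached to it."""
    term: Term
    definition: str = ""
    synonyms: List[Term] = field(default_factory=list)
    translations: Dict[str, List[Term]] = field(default_factory=dict)

    def add_translation(self, term: Term) -> None:
        # Group translations by the language code of the translated term.
        self.translations.setdefault(term.lang, []).append(term)


@dataclass
class Entry:
    """One language block on a page."""
    lang: str
    meanings: List[Meaning] = field(default_factory=list)


@dataclass
class WiktionaryPage:
    """One page for a particular spelling."""
    title: str
    entries: List[Entry] = field(default_factory=list)


# Build a tiny page: the English entry "country" with one Dutch translation.
page = WiktionaryPage("country")
entry = Entry("en")
meaning = Meaning(Noun("en", "country"), definition="a nation state")
meaning.add_translation(Noun("nl", "land"))
entry.meanings.append(meaning)
page.entries.append(entry)
```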
What can it do?
To start, I'm working on a way to output these objects according to the proper standards on the different Wiktionaries. (I'll concentrate on Dutch and English at first, but I'll make sure it is easily extensible.)
At a later stage I will try to parse actual Wiktionary entries, either directly online or from an SQL dump.
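One way to keep the output extensible is to put everything that differs per Wiktionary into lookup tables, so that adding a language means adding data rather than code. The sketch below shows this for a translations section only. The table contents, the header markup and the function name `format_translations` are assumptions for illustration and may not match the conventions actually in force on either Wiktionary.

```python
from typing import Dict, List

# Hypothetical per-Wiktionary data: language names as written on each wiki.
LANGUAGE_NAMES = {
    "en": {"en": "English", "nl": "Dutch", "fr": "French"},
    "nl": {"en": "Engels", "nl": "Nederlands", "fr": "Frans"},
}

# Hypothetical section headers; the real markup on each wiki may differ.
TRANSLATION_HEADERS = {
    "en": "====Translations====",
    "nl": "{{-trans-}}",
}


def format_translations(wiki: str, translations: Dict[str, List[str]]) -> str:
    """Render a translations section in the style of the given Wiktionary.

    `translations` maps a language code to the translated terms.
    Language codes missing from LANGUAGE_NAMES raise a KeyError.
    """
    names = LANGUAGE_NAMES[wiki]
    lines = [TRANSLATION_HEADERS[wiki]]
    # Sort by the language name as it appears on the target wiki.
    for code in sorted(translations, key=lambda c: names[c]):
        links = ", ".join("[[%s]]" % term for term in translations[code])
        lines.append("*%s: %s" % (names[code], links))
    return "\n".join(lines)


print(format_translations("en", {"nl": ["land"], "fr": ["pays"]}))
```

Supporting a third Wiktionary would then mean adding one entry to each table.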
The first thing I want to do with the bot is to add the names of all the countries, with their translations in six languages, to the Dutch and the English Wiktionaries. I have those in a database of my own right now.
Before I start doing anything, I will have the bot create one or two pages. Then I'll wait for comments and use them to fine-tune the bot and what it submits.
Later I want to try to convert an existing entry into the objects described above and then output them again (cleaned up and possibly extended with info from other Wiktionaries).
When do I want to start?
It may take me a few days or weeks before I can get started. (OK, it became months...)
Can we see the code?
Everybody can have a look at the code I have so far over here: http://cvs.sourceforge.net/viewcvs.py/pywikipediabot/pywikipedia/wiktionary.py?rev=1.10&view=log
I have been doing something similar for the names of the elements, only there I used a PHP script. It makes more sense, though, to work together with some bright people on this and to build upon and extend an existing framework. Python is also a great language to be doing this with.