Automated Translations Harvesting (semi-automated)
- Disassembling entries on various Wiktionaries and storing their contents in classes/objects
- Fetching more possible translations from Reta Vortaro and OmegaWiki, and by harvesting Wikipedia interwiki links (with caution)
- Storing the results in a text file for scrutiny by a human editor
- Reassembling everything that was found into the proper formats for the various Wiktionaries and possibly OmegaWiki (but then we'd have to grok their API first; not urgent)
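The text-file step could look something like the sketch below. The tab-separated layout, the column names and the "?"/"ok" status convention are all made up for illustration; nothing on this page fixes a format yet.

```python
import csv
import io

# Hypothetical column layout for the review file (an assumption, not a decided format).
FIELDS = ["status", "source_word", "lang", "translation", "sources"]

def write_review_file(candidates, handle):
    """Write candidate translations as tab-separated rows for a human editor."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for cand in candidates:
        # Every candidate starts as "?" until an editor has looked at it.
        row = dict(cand, status="?", sources=",".join(sorted(cand["sources"])))
        writer.writerow(row)

def read_approved(handle):
    """Return only the rows the editor has changed to status 'ok'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["status"] == "ok"]
```

The editor would open the file, change "?" to "ok" on the lines that are right, and only those lines would be handed to the reassembling step.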
How to get there
- That's an ambitious project, so we'll have to start small. Probably 1-3 Wiktionaries and the harvesting of Wikipedia interwiki links will be implemented first.
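As a rough idea of what "with caution" might mean for the interwiki links, here is a minimal sketch. It assumes the links arrive as (language code, article title) pairs, which is what mwclient's `page.langlinks()` yields as far as I know; the function name, the filtering rules and the `had_qualifier` flag are hypothetical.

```python
import re

# Matches a trailing disambiguator such as " (animal)" at the end of a title.
DISAMBIGUATOR = re.compile(r"\s*\([^)]*\)$")

def candidates_from_langlinks(langlinks, wanted_langs=None):
    """Turn (language code, article title) pairs into cautious candidates."""
    candidates = {}
    for lang, title in langlinks:
        if wanted_langs is not None and lang not in wanted_langs:
            continue
        if ":" in title:
            # Looks like a namespace prefix (category, portal, ...), so skip it.
            continue
        had_qualifier = bool(DISAMBIGUATOR.search(title))
        term = DISAMBIGUATOR.sub("", title)
        # A stripped qualifier is a hint that the article may cover another sense.
        candidates[lang] = {"term": term, "had_qualifier": had_qualifier}
    return candidates
```

With mwclient this could be fed from something like `mwclient.Site('en.wikipedia.org').pages['Badger'].langlinks()`. Note that Wikipedia titles are capitalised, so deciding the correct case for a Wiktionary entry is still left to the editor.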
People who read this page will probably say: translations are tricky, adding them shouldn't be automated. My answer: this certainly won't work for many words, but there are terms that can be translated unambiguously. The project is meant for those terms, and also for connecting the contributors of the different Wiktionary projects and bringing the fruits of their efforts together with as little effort as feasible. Don't worry, it's not a fully automated bot.
We've chosen to use mwclient to talk to the MediaWiki servers.
This is to retrieve a page with all the templates expanded:
For expanding only one template:
mwclient.Site('en.wiktionary.org').expandtemplates('ATH (plural Polyglot/ATHs)','badger')
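Both kinds of retrieval could be wrapped in small helpers, roughly as below. This is a sketch that assumes a reasonably recent mwclient, where `Page.text()` accepts `expandtemplates=True` and `Site.expandtemplates(text, title)` exists; the helper names are my own.

```python
def connect(host="en.wiktionary.org"):
    """Open a connection to a wiki. mwclient is imported here so the
    helpers below can also be exercised with a stand-in site object."""
    import mwclient  # third-party package: pip install mwclient
    return mwclient.Site(host)

def expanded_page_text(site, title):
    """Wikitext of a whole page with all of its templates expanded."""
    return site.pages[title].text(expandtemplates=True)

def expand_snippet(site, wikitext, title):
    """Expand one piece of wikitext as if it appeared on the page `title`."""
    return site.expandtemplates(wikitext, title)
```

Usage would be along the lines of `expand_snippet(connect(), '{{en-noun}}', 'badger')`, where `{{en-noun}}` is only an assumed example of a template call. The second argument matters because many Wiktionary templates use the page name in their output.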
How to store the confidence about a certain translation?
I'm still racking my brain over where the sources for the translations should go. Consider the following scenario:
We have a translation, say chambre (for room). When we parse en.wikt, the gender is not yet present.
It gets stored as a Noun object without a gender and added to the list of translations, keyed under fr.
The list also contains another object, for the French noun 'pièce'.
Now we encounter the translation chambre in another source (maybe another Wiktionary).
This time we do know the gender is f.
So we have to go through the list of translations to see whether it already contains a Term object for 'chambre'.
We find it, notice it doesn't have a gender, and add the gender. We also have to record somehow where this information came from.
In fact the information about this particular string being a translation for room came from two sources. The information about the gender came from one place.
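One way this could be stored is sketched below: every Term keeps a set of sources for "this is a translation at all", plus a separate set of sources per attribute value. The class layout and the `record` helper are hypothetical, and the sketch keys the translations by language and then by lemma instead of walking a list.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    lemma: str
    lang: str
    # Sources claiming that this term is a translation in the first place.
    sources: set = field(default_factory=set)
    # attribute name -> {value: set of sources asserting that value}
    attributes: dict = field(default_factory=dict)

    def assert_attribute(self, name, value, source):
        """Remember that `source` says attribute `name` has `value`."""
        self.attributes.setdefault(name, {}).setdefault(value, set()).add(source)

def record(translations, lang, lemma, source, **attrs):
    """Add or update a translation, keeping track of where each fact came from."""
    terms = translations.setdefault(lang, {})
    term = terms.setdefault(lemma, Term(lemma, lang))
    term.sources.add(source)
    for name, value in attrs.items():
        term.assert_attribute(name, value, source)
    return term

translations = {}
record(translations, "fr", "chambre", "en.wikt")
record(translations, "fr", "pièce", "en.wikt")
record(translations, "fr", "chambre", "fr.wikt", gender="f")
chambre = translations["fr"]["chambre"]
print(sorted(chambre.sources))  # ['en.wikt', 'fr.wikt']
print(chambre.attributes)       # {'gender': {'f': {'fr.wikt'}}}
```

Because values are stored per source, two sources that disagree about a gender simply end up as two entries under `gender`, which an editor can then sort out.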
Another issue is that information coming from fr.wikt about a French word should carry more weight, but I suppose just storing the source should suffice for now.
The same goes for information about plural forms, diminutives, genitive, accusative, dative forms, superlative and comparative forms, etc.
When does this get checked? When adding a translation to a Meaning? Or when/before creating the Term object?
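One possible answer, sketched under my own assumptions: let the Meaning do the check at the moment a translation is added, so that creating a Term stays cheap and the merging logic lives in one place. The simplified Term (gender only) and the choice to raise on a conflict are illustrative, not a decision.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Term:
    lemma: str
    gender: Optional[str] = None
    sources: set = field(default_factory=set)

class Meaning:
    def __init__(self, gloss):
        self.gloss = gloss
        self.translations = {}  # lang -> {lemma: Term}

    def add_translation(self, lang, incoming):
        """Merge `incoming` into a known Term with the same lemma, or store it."""
        known = self.translations.setdefault(lang, {})
        existing = known.get(incoming.lemma)
        if existing is None:
            known[incoming.lemma] = incoming
            return incoming
        # Check for a clash before changing anything on the existing Term.
        if None not in (existing.gender, incoming.gender) \
                and existing.gender != incoming.gender:
            raise ValueError("conflicting gender for %r" % incoming.lemma)
        existing.sources |= incoming.sources
        if existing.gender is None:
            existing.gender = incoming.gender
        return existing
```

For example, adding `Term("chambre", sources={"en.wikt"})` and later `Term("chambre", gender="f", sources={"fr.wikt"})` to the same Meaning under "fr" leaves a single Term with gender "f" and both sources.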
I'm struggling to see how to incorporate the prototype that harvests the Wikipedias into the code I already had. When I got started 3.5 years ago, I never considered that keeping track of the sources would be a requirement.