Translations from an XML dump

I'm looking for a way to parse translations out of an XML dump. A while ago I made myself an offline Finnish-English-Finnish dictionary for my Blackberry phone in the form of an e-book. It uses definitions from the English and Finnish Wiktionaries. However, English Wiktionary has more English-to-Finnish translations, so I want to use those instead of the definitions from the Finnish Wiktionary. Conrad.Irwin has a related script, but it is for making indices. Matthias Buchmeier has one, but it's in Awk. That's pretty helpful, but I think it would be great if we all had a more or less standard PyWikipedia tool for this. I asked Conrad, but he has been busy outside Wiktionary for weeks. You have been so helpful to me in the past. Do you have any time to adapt Conrad's script?

heyzeuss08:43, 23 March 2011
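
For readers who land on this thread with the same question, here is a minimal Python sketch of the kind of extraction asked about above. It is not Conrad.Irwin's or Matthias Buchmeier's script. It assumes that translations in English Wiktionary wikitext look like `{{t|fi|sana}}`, `{{t+|fi|sana}}` or `{{t-|fi|sana}}`, and that the dump uses the usual `page` / `title` / `text` elements. The function name `translations` and the `SAMPLE` data are made up for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed wikitext shape: {{t|fi|sana}}, {{t+|fi|sana}}, {{t-|fi|sana|...}}
TRANSLATION = re.compile(r"\{\{t[+-]?\|([^|}]+)\|([^|}]+)")


def local(tag):
    """Drop the XML namespace prefix, e.g. '{http://...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def translations(source, lang="fi"):
    """Yield (page_title, translation) pairs for one language code.

    `source` is a file name or a binary file object holding the dump.
    The dump is streamed, so it is never loaded into memory as a whole.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            for code, word in TRANSLATION.findall(elem.text or ""):
                if code == lang:
                    yield title, word
        elif name == "page":
            elem.clear()  # release the text of pages already processed


# Tiny hand-written stand-in for a real dump (illustrative only).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
<page><title>dog</title><revision><text>====Translations====
* Finnish: {{t+|fi|koira}}
* German: {{t+|de|Hund|m}}
</text></revision></page>
<page><title>cat</title><revision><text>* Finnish: {{t|fi|kissa}}</text></revision></page>
</mediawiki>"""

pairs = list(translations(io.BytesIO(SAMPLE.encode("utf-8"))))
```

On a real dump you would pass the path of the decompressed `pages-articles` XML file instead of `SAMPLE`. Entries whose translations are written without these templates would be missed by this pattern.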

Sorry, I don't know Python, so I can't be of help.

Yair rand (talk)17:55, 23 March 2011
Edited by another user.
Last edit: 18:20, 23 March 2011

Hi, we are still developing the extraction of RDF from the Wiktionaries. We have made some progress, but it is a difficult process. Our idea is to use templates (like simplified regexes) for scraping data out of the Wiktionaries. The templates can be configured for each language according to Wiktionary:Entry_layout_explained. We will need one more month (or two) to finish, and after that some more testing. But it will be able to cover most languages, not just English... After that we can start working on the sense ids again. SebastianHellmann 18:20, 23 March 2011 (UTC)

(Is this supposed to be under a different heading?) I brought up the issue of senseids in the Beer Parlour a short while ago at WT:BP#Tabbed Languages, Definition side boxes, and Sense IDs, though no one has responded yet to the senseids part of that discussion. I don't understand what you mean by using templates to scrape data out.

Yair rand (talk)19:01, 23 March 2011

On the one hand, I was replying to heyzeuss that we are working on something like what he is looking for, but that it will need some more time. On the other hand, I wanted to give you a short update and piggybacked that on the message. I will have a look at the Beer Parlour discussion soon. Scraping data from Wiktionary is difficult since each language version is different. So instead of writing fixed code in Python or Java to scrape the data, we are implementing a configurable system. Every interested Wiktionary user like heyzeuss can then alter the configuration (which is like a simplified regex) to say what he wants to have parsed out. SebastianHellmann 19:20, 23 March 2011 (UTC)

SebastianHellmann19:20, 23 March 2011

Hm, I'm still not sure I understand what you mean. Do you mean adding code to inflection/headword-line templates to allow data to be pulled out?

Yair rand (talk)19:34, 27 March 2011

No, I want to scrape data from Wiktionary, covering all language editions. Since all language editions are different, the scraper needs to be configured differently for each language. It is not so easy to write code that can be arbitrarily configured to scrape data. It will work similarly to a template, but the other way round. I think the opposite of a template is a pattern, such as a regex. It is probably more complicated to explain than to show with an example. I will prepare one and send it to you. SebastianHellmann 20:16, 27 March 2011 (UTC)

SebastianHellmann20:16, 27 March 2011
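
As a rough illustration of "a template, but the other way round", the sketch below keeps one pattern per language edition in a configuration table and runs the same scraping code against whichever pattern is selected. This is not the system SebastianHellmann's group is building. The names `CONFIG` and `scrape` are hypothetical, the `"en"` pattern assumes translations written as `{{t|fi|sana}}`, and the `"xx"` entry is an invented layout, not the real format of any edition.

```python
import re

# Hypothetical per-edition configuration. Each pattern must define the named
# groups "lang" and "word"; the scraping code below knows nothing else.
CONFIG = {
    # Assumed English Wiktionary style: {{t|fi|koira}}, {{t+|fi|koira}}
    "en": re.compile(r"\{\{t[+-]?\|(?P<lang>[^|}]+)\|(?P<word>[^|}]+)"),
    # Invented layout for illustration: "* englanti: [[dog]]"
    "xx": re.compile(
        r"^\*\s*(?P<lang>[^:]+):\s*\[\[(?P<word>[^\]|]+)", re.MULTILINE
    ),
}


def scrape(edition, wikitext):
    """Return (language, word) pairs found by the pattern for one edition."""
    pattern = CONFIG[edition]
    return [(m.group("lang"), m.group("word")) for m in pattern.finditer(wikitext)]


english_style = scrape("en", "* Finnish: {{t+|fi|koira}}")
invented_style = scrape("xx", "* englanti: [[dog]]\n* saksa: [[Hund]]")
```

Supporting another edition then means adding one entry to `CONFIG` rather than changing `scrape`, which is the point of making the scraper configurable.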

You might want to ask User:Robert Ullmann for guidance. He runs a bot called User:AutoFormat that fixes misshapen entries. User:Prince Kassad does the same thing with User:KassadBot.

heyzeuss07:41, 28 March 2011

Robert Ullmann has been out. Best check with Prince Kassad.

heyzeuss08:06, 28 March 2011

I installed Gawk and ran User:Matthias Buchmeier's script. It was quite painless.

heyzeuss07:48, 28 March 2011