Wiktionary:Parsing

Introduction

I am starting this area with just some brief sketches and notes. Hopefully the community will flesh it out.

Several parties already parse Wiktionary data; in fact, Wiktionary data has been parsed since close to the project's birth. Most such projects are fairly isolated, with little sharing of ideas or discussion of common problems.

Wiktionary is designed to be readable by humans. Though it has a much more rigorous format than Wikipedia, it still turns out to be very unfriendly to computer parsing.

Wiktionary is full of useful data, all under the GFDL, so it is free for anyone to use. But this data is stored as plain text with only wiki markup. Wiki markup is great for specifying the appearance of a page with little effort, but it was never intended for specifying the structure, attributes, and semantics of data.

Anybody who has wanted to use the free data has had to roll their own data mining software. There are a couple of online dictionary, translation, and language sites that mine Wiktionary data. There are tools that analyse Wiktionary data to display statistics, bots that autoformat live articles, and alternative Wiktionary page viewers that analyse the structure of articles to present them in a new style.

We need to join forces to make these jobs easier and our Wiktionary data freer. Ideally, some day we will have action=data or api.php support that makes our data accessible to everyone with no special effort. That day is not yet close.

Data formats

There are three main ways to access raw Wiktionary data:

Raw wikitext

Pros

  • Wiktionary is regularly made available as an XML dump (downloading it is shown in the example below). This file wraps the raw articles in XML metadata. The articles are the same as you see them when you edit an article online.
  • Any individual article can be accessed live online by action=raw.
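
For example, the latest XML dump can be downloaded from the Wikimedia dump server and a single article's wikitext fetched with action=raw. This is only a sketch: the dump URL pattern and the example title "dictionary" are illustrative and may need adjusting.

 # download the latest full English Wiktionary dump (URL pattern is illustrative and may change)
 wget "https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2"
 # fetch the raw wikitext of a single article ("dictionary" is just an example title)
 wget "https://en.wiktionary.org/w/index.php?title=dictionary&action=raw" -O dictionary.wiki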

Cons

  • Templates are difficult to handle.
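
One possible workaround is to let MediaWiki expand template markup for you via the api.php expandtemplates action. The sketch below uses curl for its convenient URL encoding; the sample template text is illustrative and the exact parameters may vary between MediaWiki versions.

 # ask MediaWiki to expand template markup (the sample text is illustrative)
 curl -s -G "https://en.wiktionary.org/w/api.php" \
      --data-urlencode "action=expandtemplates" \
      --data-urlencode "format=json" \
      --data-urlencode "prop=wikitext" \
      --data-urlencode "text={{plural of|dictionary}}"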

HTML

In a Linux environment, some simple parsing can be done as follows: first obtain a list of words for the language you are interested in, then run the Bash loop below over every word in the list (the list is assumed to be stored one word per line in a file such as wordlist.txt):

 wget "https://en.wiktionary.org/w/index.php?title=word&printable=yes" 2>/dev/null -O - | w3m -dump -no-graph -T text/html |\
    gawk '($1=="English"){flag=1} ($1=="Retrieved"&&$2=="from"||$1~/-----/){flag=0} {if (flag==1) print $0} ' >> raw.data

In this pipeline, wget downloads the desired article, w3m converts it to plain text, and the gawk script keeps only the section for the chosen language (English in this example). The resulting text can be used for further parsing without having to deal with HTML or XML tags. Note that this is just a simple example and is not foolproof.

Pros

  • A static dump of all articles is made alongside the XML dump.
  • Any individual article can be accessed live online via action=render (see the example below).
  • Templates are already expanded.
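
For example, the parsed HTML body of a single article can be fetched with action=render; the title "dictionary" is again only an example:

 # fetch the rendered HTML body of one article ("dictionary" is just an example title)
 wget "https://en.wiktionary.org/w/index.php?title=dictionary&action=render" -O dictionary.html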

Cons

  • HTML varies depending on the skin used.

DOM

Pros

  • Available to JavaScript.
  • Templates are already expanded.

Cons

  • The DOM varies depending on the skin used.
  • Not available outside the browser.

Sections