Definition from Wiktionary, the free dictionary


I am starting this area with just some brief sketches and notes. Hopefully the community will flesh it out.

Several parties already parse Wiktionary data; in fact, Wiktionary data has been parsed almost since the project's birth. Most such projects have worked in isolation, with little sharing of ideas or discussion of common problems.

Wiktionary is designed to be readable by humans. Though it has a much more rigorous format than Wikipedia, it still turns out to be very unfriendly for computer parsing.

Wiktionary is full of useful data, all available under free licenses (CC BY-SA and the GFDL), so it's free for anyone to use. But this data is stored as plain text bearing only wiki markup. Wiki markup is great for specifying the appearance of a page with little effort; it was never intended for specifying the structure, attributes, and semantics of data.
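To make that concrete: section headings are essentially the only structure wiki markup offers a parser. The sketch below runs on a fabricated entry (not a real dump excerpt) and pulls the headings out with sed:

```shell
# Fabricated sample in the shape of a Wiktionary entry (illustrative only)
cat > sample.wiki <<'EOF'
==English==
===Etymology===
From Middle English.
===Noun===
# A reference work listing words.
==French==
===Noun===
# dictionary
EOF

# Headings like ==English== and ===Noun=== are the only machine-readable
# structure; extract their titles with sed
sed -n 's/^=\{2,\}\([^=]\{1,\}\)=\{2,\}$/\1/p' sample.wiki
```

Everything else on the page — definitions, etymologies, translations — is free-form text under those headings, which is exactly what makes deeper parsing hard.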

Anybody who has wanted to use the free data has had to roll their own data-mining software. A number of online dictionary, translation, and language sites mine Wiktionary data. There are tools that analyse Wiktionary data to display statistics, bots that autoformat live articles, and alternate Wiktionary page viewers that analyse the structure of articles to present them in a new style.

We need to join forces to make these jobs easier and our Wiktionary data freer. Ideally, some day we will have action=data or api.php support that makes our data accessible to everyone with no special effort. That day is not yet close.

Data formats[edit]

There are three main ways to access raw Wiktionary data:

Raw wikitext[edit]


Pros:

  • Wiktionary is regularly made available as XML and SQL dumps at dumps.wikimedia.org. The articles are the same as you see them when you edit an article online.
  • Any individual article can be fetched live with action=raw.
  • Wikimedia Toolforge (tools.wmflabs.org) provides access to replicas of the Wikimedia databases in a Linux environment for community developers.

Cons:

  • Templates are difficult to handle, since the raw wikitext contains unexpanded template calls.
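Simple facts can still be pulled from a dump with nothing more than grep and sed. The fragment below is hand-made in the shape of a pages-articles XML dump (not real dump content); note the unexpanded wikitext inside the text elements:

```shell
# Hand-made fragment shaped like a pages-articles XML dump (illustrative only)
cat > fragment.xml <<'EOF'
<mediawiki>
  <page>
    <title>dictionary</title>
    <revision><text>==English==</text></revision>
  </page>
  <page>
    <title>free</title>
    <revision><text>==English==</text></revision>
  </page>
</mediawiki>
EOF

# List the article titles, one per line
grep -o '<title>[^<]*</title>' fragment.xml | sed 's/<[^>]*>//g'
```

Real dumps are far too large to slurp into memory, so a streaming approach like this (or a SAX-style XML parser) is the usual choice.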


In a Linux environment some simple parsing can be done like this: first obtain a list of words for a certain language, then execute a Bash command for every word in the list (here assumed to be in wordlist.txt, one word per line):

 while read -r word; do
     wget "https://en.wiktionary.org/w/index.php?title=${word}&printable=yes" -qO - |
         w3m -dump -no-graph -T text/html |
         gawk '($1=="English"){flag=1} ($1=="Retrieved"&&$2=="from"||$1~/-----/){flag=0} {if (flag==1) print $0}' >> raw.data
 done < wordlist.txt

In this pipeline, wget downloads the desired article, w3m converts it to plain text, and the gawk script keeps only the section for the language of interest. The resulting text can be used for further parsing, and this way you avoid dealing with HTML or XML tags. Note: this is just a simple example and is not foolproof.
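The filtering stage can be tried offline without any network access. The sample below is fabricated text in the shape of w3m's plain-text rendering of an entry (plain awk is used here; the script is POSIX awk, so gawk behaves identically):

```shell
# Fabricated text shaped like w3m's plain-text rendering of an entry
cat > page.txt <<'EOF'
navigation
English
noun: a reference work
Etymology uncertain
-----
French
nom
Retrieved from "https://en.wiktionary.org/..."
EOF

# Keep lines from the "English" heading until a separator line or the footer
awk '($1=="English"){flag=1} ($1=="Retrieved"&&$2=="from"||$1~/-----/){flag=0} {if (flag==1) print $0}' page.txt
```

Only the English section survives; the `-----` separator switches the flag off before the French section starts.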


Rendered HTML[edit]

Pros:

  • A static HTML dump of all articles is made alongside the XML dump.
  • Any individual article can be fetched live with action=render.
  • Templates are already expanded.

Cons:

  • The HTML varies depending on the skin used.
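To make the skin dependence concrete: scraping rendered HTML usually means anchoring on markup details such as the mw-headline class, which can change between skins. A throwaway sketch on fabricated markup (real output differs by skin and MediaWiki version):

```shell
# Fabricated snippet shaped like rendered entry HTML (real markup varies by skin)
cat > entry.html <<'EOF'
<div class="mw-parser-output">
<h2><span class="mw-headline" id="English">English</span></h2>
<p>A <b>dictionary</b> is a reference work.</p>
</div>
EOF

# Pull the section headline out by its class attribute; this breaks
# as soon as the skin stops emitting mw-headline spans
grep -o 'class="mw-headline"[^>]*>[^<]*' entry.html | sed 's/.*>//'
```

A real HTML parser is more robust than grep here, but it still has to encode the same skin-specific assumptions.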



HTML DOM[edit]

Pros:

  • Available to JavaScript running in the browser.
  • Templates are already expanded.

Cons:

  • The DOM varies depending on the skin used.
  • Not available outside the browser.

Wikimedia Toolforge[edit]

Wikimedia Toolforge is a hosting environment for community developers working on tools and bots that help users maintain and use Wikimedia wikis. Toolforge provides access to replicas of the Wikimedia databases and other services that let developers easily compute analytics, run bots, and create tools that help editors and other volunteers in their work.


Pros:

  • Access replicas of the Wiktionary database from a Linux environment.
  • No need to download the entire database locally.

Cons:

  • Requires account setup and some familiarity with a Linux and SQL environment.



  • wiktionary_dump_to_xml – an attempt to parse some of the wikitext constructs of the German Wiktionary in order to convert them to XML. Written in Java, licensed under the GNU AGPL 3 or any later version.