Wiktionary:Parsing

See also: Data dumps

Introduction

I am starting this area with just some brief sketches and notes. Hopefully the community will flesh it out.

There are currently several parties who already parse Wiktionary data. In fact Wiktionary data has been parsed since close to its birth. Most such projects are fairly isolated and lack sharing of ideas and discussion of common problems.

Wiktionary is designed to be readable by humans. Though it has a much more rigorous format than Wikipedia, it still turns out to be very unfriendly to computer parsing.

Wiktionary is full of useful data all under a Creative Commons license so it's free for anyone to use. But this data is stored as plain text with only wiki markup. Wiki markup is great for specifying the appearance of a page with little effort. It was never intended for specifying the structure, attributes, and semantics of data.

Anybody who has wanted to use the free data has had to roll their own data mining software. There exist a couple of online dictionary, translation, and language sites that mine Wiktionary data. There are tools which analyse Wiktionary data to display statistics. There are bots which autoformat live articles. There are alternate Wiktionary page viewers which analyse the structure of articles to present them in a new style.

We need to join forces and make these jobs easier and our Wiktionary data freer. Ideally some day we will have an action=data or Api.php support to make our data accessible to everyone with no special effort. That day is not yet close.

Data formats

See meta:Data_dumps for comprehensive and up-to-date guidance on the approaches and techniques.

There are four main ways to access Wiktionary data:

Raw wikitext

Pros

  • Wiktionary is regularly made available in XML and SQL dumps at dumps.wikimedia.org. The articles are the same as you will see them when you edit an article online.
  • Any individual article can be accessed live online by adding action=raw to the query string of its URL.
  • Wikimedia Toolforge (tools.wmflabs.org) provides access to replicas of Wikimedia databases in a Linux environment for community developers.
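
As a minimal sketch of the live-access route, the shell functions below build the action=raw URL for an entry and fetch its wikitext with wget. The function names raw_url and fetch_raw are illustrative only, and the title is assumed to be URL-safe already (no spaces or non-ASCII characters); a real tool would percent-encode it first.

```shell
# Print the action=raw URL for an entry title.
# Assumption: the title needs no percent-encoding.
raw_url() {
    printf 'https://en.wiktionary.org/w/index.php?title=%s&action=raw\n' "$1"
}

# Fetch the wikitext of one entry and write it to standard output.
fetch_raw() {
    wget -q -O - "$(raw_url "$1")"
}

# Example (needs network access):
#   fetch_raw word > word.wiki
```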

Cons

  • Templates are difficult to handle.
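
To gauge how much template handling an entry needs, it can help to list the templates it calls. The sketch below is a rough heuristic, not a real parser: it treats whatever follows {{ up to the first |, { or } as a template name, so parser functions and template parameters would be reported too. The name list_templates is made up for illustration.

```shell
# List the distinct template names called in wikitext read from standard input.
list_templates() {
    grep -o '{{[^|{}][^|{}]*' |   # "{{" plus the name that follows it
        sed 's/^{{//' |           # drop the opening braces
        LC_ALL=C sort -u          # de-duplicate in a locale-independent order
}

# Example:
#   list_templates < word.wiki
```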

HTML

On Linux, some simple parsing can be done as follows: first obtain a list of words for a given language, then run a Bash command like the one below for every word in the list.

 # Replace "word" in the URL with the entry to fetch; the result is appended to raw.data.
 wget -q -O - "https://en.wiktionary.org/w/index.php?title=word&printable=yes" |
     w3m -dump -no-graph -T text/html |
     gawk '$1=="English"{flag=1} ($1=="Retrieved"&&$2=="from")||$1~/-----/{flag=0} flag' >> raw.data

In this pipeline wget downloads the desired article, w3m converts it to plain text, and the gawk script keeps only the section for one language (English here). The resulting text can be used for further parsing, which avoids dealing with HTML or XML tags. Note that this is only a simple example and is not failproof.
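
The gawk filter above is hard-wired to English. A possible generalisation, sketched here with hypothetical function names, passes the language heading as a parameter and loops over a word list. It shares the limitations of the original: only the first word of each line is compared, so a multi-word language name such as Old English would need a different test, and the layout of the w3m output is assumed rather than guaranteed.

```shell
# Keep only the lines of one language section from a plain-text page dump.
language_section() {
    awk -v lang="$1" '
        $1 == lang { flag = 1 }    # the language heading opens the section
        ($1 == "Retrieved" && $2 == "from") || $1 ~ /-----/ { flag = 0 }    # footer or rule closes it
        flag                       # print every line while inside the section
    '
}

# Fetch every word listed in a file (one per line) and append its section to raw.data.
process_word_list() {
    while IFS= read -r word; do
        wget -q -O - "https://en.wiktionary.org/w/index.php?title=${word}&printable=yes" |
            w3m -dump -no-graph -T text/html |
            language_section "$2"
    done < "$1" >> raw.data
}

# Example (needs network access, wget and w3m):
#   process_word_list words.txt English
```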

Pros

  • A static dump of all articles is made alongside the XML dump.
  • Any individual article can be accessed live online by adding action=render to the query string of its URL.
  • Templates are already expanded.

Cons

  • HTML varies depending on the skin used.
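
Because templates are already expanded in the HTML, links to other entries can be pulled out with ordinary text tools. The sketch below assumes that entry links appear as href attributes whose path contains /wiki/ followed by the title, and it skips titles containing a colon (namespaces such as Category:). That markup assumption, like the function name entry_links, is illustrative and should be checked against real output. Titles come out as they appear in the URL, so underscores and percent-encoding are not undone.

```shell
# Print the distinct entry titles linked from HTML read on standard input.
entry_links() {
    grep -o 'href="[^"]*/wiki/[^"#:?]*["#]' |       # links whose title has no colon
        sed 's|^href="[^"]*/wiki/||; s|["#]$||' |   # keep only the title part
        LC_ALL=C sort -u
}

# Example (needs network access):
#   wget -q -O - "https://en.wiktionary.org/w/index.php?title=word&action=render" | entry_links
```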

DOM

Pros

  • Available to JavaScript.
  • Templates are already expanded.

Cons

  • The DOM varies depending on the skin used.
  • Not available outside the browser.

Wikimedia Toolforge

Wikimedia Toolforge is a hosting environment for community developers working on tools and bots that help users maintain and use Wikimedia wikis. It provides access to replicas of Wikimedia databases and other services that allow developers to easily compute analytics, do bot work, and create tools to help editors and other volunteers in their work.

Pros

  • Access to a replica of the Wiktionary database from a Linux environment.
  • No need to download the entire database locally.

Cons

  • Requires setup and expert knowledge.

Sections (2008 edition)

This content is long obsolete and is preserved for historical purposes.

Alternative spellings

The alternative spellings section is just a list of links to other words, though each may be followed by an annotation.

Recently some contributors have begun using an alternative forms section for words they feel do not quite fit the concept of alternative spelling. The difference between the two sections seems fuzzy however and I expect people to differ on the criteria. In any case the newer section seems to be more of a grab bag than the traditional section.
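
Since the alternative spellings section is a list of links with optional trailing annotations, each bullet can be split into a link target and an annotation. The sketch below uses an illustrative function name and an arbitrary tab-separated output format. It handles only the first link on each bullet line and makes no attempt to interpret templates.

```shell
# For each bullet line, print "<link target><TAB><annotation>".
parse_link_list() {
    awk '
        /^\*/ && match($0, /\[\[[^]|]+/) {
            target = substr($0, RSTART + 2, RLENGTH - 2)   # text after "[["
            rest = substr($0, RSTART + RLENGTH)            # what follows the target
            sub(/^[^]]*\]\]/, "", rest)                    # drop "|label]]" or "]]"
            gsub(/^[ \t]+|[ \t]+$/, "", rest)              # trim the annotation
            print target "\t" rest
        }
    '
}

# Example:
#   printf "* [[colour]] (UK)\n" | parse_link_list
```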

Pronunciation

The pronunciation section has always been the one to suffer most from a lack of structured data. It has the most variation in all its parts: IPA vs non-IPA, many flavours of IPA, how to organize US, UK, other; IPA, other, audio, etc.

Any attempt to gather useful data from this section would need to use many heuristics and pattern matching. Traditional parsing techniques are unlikely to be useful.
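
As an illustration of the kind of pattern matching involved, the sketch below covers just one convention: transcriptions passed as arguments to a template named IPA. It keeps the arguments that start with / or [ and drops the rest, such as language parameters. Entries that write transcriptions any other way would be missed, and the function name ipa_candidates is hypothetical.

```shell
# Print arguments of {{IPA|...}} calls that look like transcriptions.
ipa_candidates() {
    grep -o '{{IPA|[^}]*}}' |             # whole template calls
        sed 's/^{{IPA|//; s/}}$//' |      # keep only the argument list
        tr '|' '\n' |                     # one argument per line
        grep -e '^/' -e '^\['             # phonemic /.../ or phonetic [...]
}

# Example:
#   ipa_candidates < word.wiki
```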

Etymology

The data here could be presented schematically and in many articles it is. In many other articles it is presented as prose text which is not readily parseable. Perhaps the majority of articles lie somewhere in between.

From a datamining standpoint a schematic structure is best but there are contributors who object to this and insist on a prose format.

It ought to be possible to design a set of templates which encapsulate the data structure and then render it as prose text. The templates would wrap the prose in HTML spans with appropriate CSS classes which would be machine readable. This would result in a microformat or something close to it.
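
If such templates existed, a consumer could recover the data from the rendered prose by class name alone. The sketch below assumes hypothetical class names (etym-lang, etym-term) and spans that contain plain text with no nested tags; neither is an existing Wiktionary convention.

```shell
# Print the text of every <span> carrying exactly the given class.
spans_with_class() {
    grep -o "<span class=\"$1\">[^<]*</span>" |   # spans with no nested markup
        sed 's/<[^>]*>//g'                        # strip the tags, keep the text
}

# Example:
#   spans_with_class etym-lang < rendered.html
```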

Part of speech

The name of this section has long been a bugbear on Wiktionary.

Pragmatically we need to state what kind of word we are defining. Nouns, verbs, and adjectives seem clear cut. These can be contrasted with the many sources of confusion and debate: abbreviation, adfix, clitic, determiner, idiom, initialism, intransitive verb, noun phrase, particle, phrasal verb, phrase, transitive verb, etc.
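
One practical way to see how much these headings vary is to tally every heading of level 3 or deeper in a batch of wikitext. This is a sketch with an illustrative function name; it assumes each heading sits on its own line in the usual ===Heading=== form, and it counts all such headings, not only part-of-speech ones.

```shell
# Print "<heading><TAB><count>" for every level-3 or deeper heading on standard input.
heading_counts() {
    awk '
        /^===+[^=]+===+[ \t]*$/ {
            name = $0
            gsub(/^=+[ \t]*|[ \t]*=+[ \t]*$/, "", name)   # strip the = signs and padding
            count[name]++
        }
        END { for (name in count) print name "\t" count[name] }
    ' | LC_ALL=C sort
}

# Example:
#   cat *.wiki | heading_counts
```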

The term part of speech goes back to ancient grammarians of western languages such as Latin. It does not fit well with non-Indo-European languages and it does not fit well with the modern linguistics concept of word class even for English. Yet the dictionaries of typical European languages still stick to the basic subset of part-of-speech terms.

This is not an issue for print dictionaries because they do not name each part of an entry for their words and they do not have separate entries for phrases. Instead these relationships are indicated by standardized entry formats utilizing bold, italics, small capitals, ordered lists, and subentries.

Synonyms

The synonyms section is just a list of links to other words, though each may be preceded or followed by an annotation.
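
Because annotations may come before or after the links, a simple approach is to locate the section by its heading and collect every link target in it, ignoring the surrounding text. The sketch below takes the heading name as a parameter; the function name section_links is illustrative, and link targets that are written through templates would not be found.

```shell
# Print every [[link]] target found under the heading named by the first argument.
section_links() {
    awk -v name="$1" '
        /^==+[^=]+==+[ \t]*$/ {                            # any heading line
            title = $0
            gsub(/^=+[ \t]*|[ \t]*=+[ \t]*$/, "", title)
            inside = (title == name)                       # are we in the wanted section?
            next
        }
        inside {
            line = $0
            while (match(line, /\[\[[^]|]+/)) {            # each "[[target"
                print substr(line, RSTART + 2, RLENGTH - 2)
                line = substr(line, RSTART + RLENGTH)
            }
        }
    '
}

# Example:
#   section_links Synonyms < word.wiki
```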

Antonyms

The antonyms section is just a list of links to other words, though each may be preceded or followed by an annotation.

Derived terms

The derived terms section is just a list of links to other words, though each may be followed by a part of speech and/or annotation.

There has long been confusion between derived terms vs related terms sections, and between these two vs see also.

Related terms

The related terms section is just a list of links to other words, though each may be followed by a part of speech and/or annotation.

There has long been confusion between related terms vs derived terms sections, and between these two vs see also.

Translations

See also

The see also section is just a list of links to other words, though each may be followed by a part of speech and/or annotation.

There has long been confusion between see also vs the pair of sections related terms and derived terms.

Software/Projects

The known parsing libraries, including Wiktionary-specific tools, are listed at mw:Alternative_parsers and meta:Data_dumps/Other_tools.