User:Amgine/Wiktionary data & API


Wiktionary's data is not machine-readable, nor is it available in a standard dictionary data format from the Wiktionary project. There is no data-only API.

This is simply a statement of the case. It does not indicate any opposition to making Wiktionary's content more readily available for consumption, only that we do not have it available that way right now.

Automated access to Wiktionary content

There are three basic ways to access the content of Wiktionary articles: the browser-based interface, database dumps, and the API.

The browser-based interface is the most flexible, but for automated access it is the most brittle and we do not recommend it for software data retrieval.

Wiktionary database dumps

Exactly as you might expect, a Wiktionary database dump can include all the current public article content and history. This kind of data is available for all Wikimedia projects and languages; however, the sheer variety of dump files maintained means it takes some research to find exactly the data you are looking for. As of this writing (Aug 2016) the various tables for Wiktionary are available as SQL dumps, and the latest content of articles is in the page dump, enwiktionary-latest-page.sql.gz. The most recent version of the English Wiktionary's main namespace articles can also be found in XML format at enwiktionary-latest-pages-articles.xml.bz2. The format of the data files is somewhat explained here.
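As a rough illustration of reading the XML dump without loading it all into memory, here is a minimal sketch in Python using only the standard library. The function names (`iter_pages`, `iter_dump`) and the tiny `SAMPLE` document are made up for this example, and the `export-0.10` namespace in the sample is an assumption about the schema version; the code strips whatever namespace it finds, so the exact version should not matter. Note that the text it yields is raw wikitext, with template calls left unexpanded.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in an export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""  # an empty <text/> has no text node
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the children of the finished page


def iter_dump(path):
    """Stream pages from a .xml.bz2 dump file (path is supplied by the caller)."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>free</title>
    <revision><text>==English== {{en-adj|er}}</text></revision>
  </page>
  <page>
    <title>libre</title>
    <revision><text /></revision>
  </page>
</mediawiki>"""

for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, len(wikitext))
```

For a real dump, `iter_dump("enwiktionary-latest-pages-articles.xml.bz2")` would be the entry point, assuming the file has already been downloaded to the working directory.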

Be aware that much of the structure of Wiktionary is provided by extensive use of templates, which require parsing, and are not included in either of the above database dumps. The Wikimedia Foundation does not provide an index parser for wikitext or data; it is assumed that the MediaWiki software is itself the index parser.

Or, more bluntly: do not use dumps, since you will just have to re-import them into an instance of MediaWiki, or write your own MediaWiki parser. (There are many high-compatibility third-party parsers in many languages, with the caveat that none is considered successful at round-tripping at the moment.)

Wiktionary API

There are two different APIs available for full editor and reader access to Wiktionary. The ?action= query-string-based API is necessarily complex; it is well documented on the MediaWiki wiki and is also, more cryptically, self-documenting. The REST v1 API is in beta and is becoming as feature-rich as the original; it complies with standards and will be familiar to many developers. REST v1 is self-documenting (RESTBase), with documentation written and maintained by MediaWiki developers. There are actually a third and a fourth API available: the Wikidata query service, which is based on SPARQL (although Wikidata has not implemented a project for Wiktionary content), and Get Page Definition, a custom endpoint of REST v1.
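To make the difference between the two styles concrete, the sketch below builds request URLs for each. It only constructs strings and performs no network access. The helper names are invented for this example, and the REST paths (`/api/rest_v1/page/html/` and `/api/rest_v1/page/definition/`) are assumptions based on how the REST v1 API has commonly been documented; check them against the current documentation before relying on them.

```python
from urllib.parse import quote, urlencode

# English Wiktionary; other language editions use their own subdomain.
HOST = "https://en.wiktionary.org"


def action_api_url(**params):
    """URL for the ?action= query-string API, defaulting to JSON output."""
    params.setdefault("format", "json")
    return f"{HOST}/w/api.php?{urlencode(params)}"


def rest_html_url(title):
    """Assumed REST v1 path for the rendered HTML of a page."""
    return f"{HOST}/api/rest_v1/page/html/{quote(title, safe='')}"


def rest_definition_url(term):
    """Assumed REST v1 path for the Get Page Definition endpoint."""
    return f"{HOST}/api/rest_v1/page/definition/{quote(term, safe='')}"


print(action_api_url(action="query", titles="free"))
print(rest_html_url("free"))
print(rest_definition_url("free"))
```

In the query-string style everything is a parameter, while in the REST style the page title is part of the path, which is why it is percent-encoded with no characters treated as safe.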

Using the API to extract a given page is usually a multi-step affair: request the list of revisions for an article title, then request the content of the title at a specific revision. The contents are usually delivered as an HTML blob in a JSON structure, which will need to be traversed to extract the specific data you are looking for. XSLT may be useful where articles are properly formatted, but keep in mind that Wiktionary articles are manually edited bodies of text, not strictly formatted data files. It is possible to extract specific sections via the API, but in most cases that involves at least one more connection to the API and so may in practice be slower than performing the parsing locally.
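The multi-step flow described above might look something like the following sketch against the ?action= API: one request for the revision list, one for the rendered HTML of a chosen revision, then a local pass over the HTML. All function and class names are invented for this example. The response shapes assume `formatversion=2`, the User-Agent string is a placeholder, and treating `h2` headings as section titles is an assumption about the rendered markup rather than a guarantee.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

API = "https://en.wiktionary.org/w/api.php"


def revisions_url(title, limit=1):
    """Step 1: ask for the most recent revision ids of a title."""
    params = {"action": "query", "prop": "revisions", "titles": title,
              "rvprop": "ids|timestamp", "rvlimit": limit,
              "format": "json", "formatversion": 2}
    return API + "?" + urllib.parse.urlencode(params)


def html_url(revid):
    """Step 2: ask for the rendered HTML of one specific revision."""
    params = {"action": "parse", "oldid": revid, "prop": "text",
              "format": "json", "formatversion": 2}
    return API + "?" + urllib.parse.urlencode(params)


def latest_revid(response):
    """Pull the newest revision id out of a step 1 response."""
    return response["query"]["pages"][0]["revisions"][0]["revid"]


def rendered_html(response):
    """Pull the HTML blob out of a step 2 response."""
    return response["parse"]["text"]


class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buf is not None:
            self.headings.append("".join(self._buf).strip())
            self._buf = None


def section_headings(html):
    """Step 3: traverse the HTML locally instead of calling the API again."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def fetch_json(url):
    """Perform one GET request; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-sketch/0.1 (you@example.org)"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A full run would chain these as `section_headings(rendered_html(fetch_json(html_url(latest_revid(fetch_json(revisions_url("free")))))))`, which costs two connections before any local parsing starts.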

As of this writing there is no obvious method of requesting a CSRF editing token via the REST v1 API. You will need to do your own research if your project involves editing or updating Wiktionary entries.

Proposals

It has been regularly proposed that Wiktionary be made available in any of several standard data formats. We, as a community, strongly support this notion in general. We quibble about the details.

Current known efforts