Wiktionary:Wikitext style/Code snippets


Introduction

This is a set of "snippets" of code useful for parsing a wikitext page from the main namespace of the English Wiktionary.

The examples are written in Python, which is fairly easy to read even if you don't know it, so they may be helpful even if you are writing in another language.

The presentation order is fairly random at present.

Reading a text page

If you are using the "pywikipedia framework" to access Wiktionary, you can get the text of a page with:

   page = wikipedia.Page(site, title)
   text = page.get()

then the text can be parsed as a whole, or line-by-line:

   for line in text.splitlines():

The examples assume that "text", "line", etc. are as above.

Headers

Headers can be recognized one line at a time. This is probably the best approach, since you will want to look at the content lines that follow one at a time as well.

Some simple code:

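   # Test the longest prefix first; checking single positions would
   # mistake the closing '=' of a short header such as "==a==" for a level.
   if line.startswith('======'): level = 6
   elif line.startswith('====='): level = 5
   elif line.startswith('===='): level = 4
   elif line.startswith('==='): level = 3
   elif line.startswith('=='): level = 2
   elif line.startswith('='): level = 1
   else: level = 0
   if level > 0:
       header = line[level:-level]
       header = header.strip()
   else: header = ''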

The [ ] syntax on "line" says to take the characters starting at position "level" and ending "level" characters before the end of the line. The strip() function then removes leading and trailing spaces, in case someone writes "=== Noun ===".

At this point "level" is the header level (1 to 6, although Wiktionary normally uses only 2 to 5), and "header" is the header text itself.
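As a rough sketch, the steps above can be wrapped in a small function. The name parse_header is made up for this illustration, and it makes two assumptions the snippet above does not: trailing whitespace is stripped from the line first, and a line only counts as a header if it also ends with as many '=' characters as it starts with.

```python
def parse_header(line):
    """Return (level, header); level is 0 when the line is not a header."""
    line = line.rstrip()
    # Number of leading '=' characters.
    level = len(line) - len(line.lstrip('='))
    # Require matching '=' at the end and at least one character in between.
    if level == 0 or len(line) <= 2 * level or not line.endswith('=' * level):
        return 0, ''
    # Slice off the markers, then drop any padding spaces.
    return level, line[level:-level].strip()
```

For example, parse_header("=== Noun ===") gives level 3 and header "Noun", while an ordinary content line gives level 0 and an empty header.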

The same thing can be done with a regular expression. At the top of the program:

   import re
   reheader = re.compile(r'(={2,6})\s*(.+?)={2,6}(.*)')

then for each line:

   mo = reheader.match(line)
   if mo:
       level = len(mo.group(1))
       header = mo.group(2).rstrip()
   else:
       level = 0
       header = ''

Note that this is not identical to the above: given a level 1 header, the pattern does not match, so level is left equal to 0. However, you shouldn't find level 1 headers in entries.
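To see the regular expression at work on a whole page, here is a minimal sketch that collects every header it finds. The function name list_headers and the sample text are invented for the example; the pattern is the one given above.

```python
import re

reheader = re.compile(r'(={2,6})\s*(.+?)={2,6}(.*)')

def list_headers(text):
    """Return a list of (level, header) pairs for the header lines in text."""
    found = []
    for line in text.splitlines():
        mo = reheader.match(line)
        if mo:
            found.append((len(mo.group(1)), mo.group(2).rstrip()))
    return found

# Made-up sample text; the last line is a level 1 header.
sample = "==English==\n\n===Noun===\n# A definition.\n\n==== Translations ====\n=Stray="
headers = list_headers(sample)
```

Here "headers" holds the level 2, 3 and 4 headers in page order. The level 1 line at the end is skipped, as described above.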