User:AutoFormat/code

Definition from Wiktionary, the free dictionary

notice

This comes with several very important caveats:

  • I am a professional software engineer; this is what I do. However, this code was written for my own use; it is not warranted, and carries no implication of merchantability or fitness for use.
  • Like everything else in the Wiktionary, this is under the GFDL. The GFDL is not compatible with the GPL, so this document is not licensed under the GPL as software. (!)
  • At any given moment, this code may not represent what is being run; I have no intention of updating this page every time I make a change.

technical notes

I don't have my ego attached to code I write; I routinely dump code that has gotten too complex, and re-write it. On the other hand, even if something is sloppy, if it is tested and works, I leave it alone.

  • Some of the comments may be snarky.
  • The comments are often (usually) written to remind me of something, not to explicate the code.
  • Since I modify this regularly, there is code that is not reached or otherwise redundant.
  • The pre-parsing should go deeper; a fairly major restructuring would be helpful at some point soon.
  • There is a small number of known (to me ;-) bugs that I have not yet fixed; I handle them by monitoring the edits made. (Like the handling of multi-line comments.)
  • The wikipedia.py module AF uses is heavily modified from the distro; however the interface is the same. In the presence of network problems/failures/outages AF may abort when the modified version would have recovered. The exceptions thrown are the same, but under differing conditions.
  • On Linux, the clock timing works, but will display ugly large values.
  • The code to handle headers is largely hacked to implement the "Connel" flag ....
  • The code that handles Etymology headers is based on the current WT:ELE; there is no problem changing it when we figure out how Etymology and Pronunciation are supposed to play nicely together in the general case.
  • The bot must also have a sysop account, to read patrolled flags in RC; "enhanced" RC mode must be turned off.

outline

prescreen

Reads the XML dump, uses simple regexes to find entries that may need attention, and builds a randomly ordered index.
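
The pooling trick prescreen uses can be reduced to a short sketch: tagged entries are keyed by a random float, and the pool minimum is yielded once per so-many hits, so output order is effectively random. The names and the pool_size parameter here are my own simplifications; the real routine also consults the shelve cache, per-reason counters, and the header/regex tables.

```python
import re
from random import random

def prescreen_sketch(entries, patterns, pool_size=10):
    """Scan (title, text) pairs against screening regexes; pool hits
    under random float keys and yield the pool minimum once per
    pool_size hits. A simplified reduction of the real prescreen."""
    srx = {name: re.compile(rx, re.M) for name, rx in patterns.items()}
    pool = {}
    tags = 0
    for title, text in entries:
        reason = next((n for n, rx in srx.items() if rx.search(text)), None)
        if reason is None:
            continue
        pool[random()] = (title, reason)
        tags += 1
        # some (1/pool_size) of the time return an entry, else just pool
        if tags % pool_size:
            continue
        m = min(pool)
        yield pool.pop(m) + (m,)
    # end of input: drain the rest in key order
    for m in sorted(pool):
        yield pool[m] + (m,)
```

With pool_size=1 every hit is yielded immediately; larger values trade latency for better shuffling.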

rcpages

Generator called by the main routine. Calls prescreen, then cycles through reading Recent Changes, looking at the request category, and yielding pages found.
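
The interleaving can be sketched as a generator that yields from the prescreen stream and, on a timer, drains a few pages from the request category. This is a hypothetical stand-in: the real rcpages also polls Recent Changes, holds titles for a scattered delay, and sleeps when idle.

```python
import time

def rcpages_sketch(prescreened, poll_category, poll_interval=740):
    """Yield titles from the prescreen stream; every poll_interval
    seconds also drain a few unseen titles from the request category
    (poll_category is a callable returning a list of titles)."""
    seen = set()
    next_poll = time.monotonic() - 1
    for title in prescreened:
        seen.add(title)
        yield title
        if time.monotonic() > next_poll:
            for n, cat_title in enumerate(poll_category()):
                if n > 7:          # just munch the cat, not too hungry ;-)
                    break
                if cat_title in seen:
                    continue
                seen.add(cat_title)
                yield cat_title
            next_poll = time.monotonic() + poll_interval
```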

main

Reads the configuration pages and builds the tables to be used. Loops on the rcpages generator; for each entry:

  • runs regex on the entire text
  • breaks entry into language sections, plus prolog (above first section), and iwikis
  • in each language section:
    • looks for and fixes Etymology headers
    • herds cats
    • fixes bad headers
    • fixes linking in trans tables
    • fixes top to trans-top
    • subst's (replaces) language code template
    • etc
  • then reassembles the entry, removing multiple blank lines, adding ---- rules, and so on
  • checks the actions performed
  • if any resulting action, rewrites the page
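
The split-and-reassemble step rests on breaking the entry at L2 headers. A rough sketch, with my own names, is below; the real code also handles comments, categories, bad headers, and the prolog sort.

```python
import re

reL2 = re.compile(r'^==([^=]+)==\s*$', re.M)      # canonical L2 header
reiwiki = re.compile(r'^\[\[[-a-z]{2,11}:[^\]]*\]\]$', re.M)  # interwiki link line

def split_entry(text):
    """Break a wikitext entry into the prolog (above the first L2
    header), a dict of language sections, and the interwiki links."""
    iwikis = reiwiki.findall(text)
    text = reiwiki.sub('', text)
    pieces = reL2.split(text)       # [prolog, lang1, body1, lang2, body2, ...]
    prolog = pieces[0]
    sections = {}
    for lang, body in zip(pieces[1::2], pieces[2::2]):
        sections[lang.strip()] = body.strip('\n')
    return prolog, sections, iwikis
```

Reassembly is then the reverse: prolog, sections sorted by lkey with ---- rules between them, and the iwikis at the end.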

code

#!/usr/bin/python
# -*- coding: utf-8  -*-
 
"""
This bot looks for entries tagged for autoformatting, does a number of tasks
 
No command line arguments.
 
"""
 
import wikipedia
import catlib
import sys
import re
import pickle
import time
import xmlreader
import socket
from mwapi import getwikitext, getedit
 
def safe(s):
    return pickle.dumps(s)[1:-5]
 
def lkey(l):
    n = l.strip('[]')
    if not n: return n
 
    if n == 'Translingual': return '0' + n
    if n == 'English': return '1' + n
 
    # bad L2 headers
    if n.lower() == 'cyrillic alphabet': return '0' + n
    if n.lower() == 'arabic alphabet': return '0' + n
    if n.lower() == 'see also': return '3' + n
    if n.lower() == 'references': return '4' + n
 
    # handle names like !Kung and 'Auhelawa: move non-alpha to the end of key
 
    if not n[0].isalpha(): n = n[1:] + n[0]
 
    return '2' + n
 
Scripts = { 'ARchar' : 'Arab',
            'Cuneiform' : 'Xsux',
            'ELchar' : 'Grek',
            'FAchar' : 'fa-Arab',
            'HEchar' : 'Hebr',
            'JAchar' : 'Jpan',
            'KMchar' : 'Khmr',
            'LOchar' : 'Laoo',
            'RUchar' : 'Cyrl',
            'THchar' : 'Thai',
            'URchar' : 'ur-Arab',
            'ZHchar' : 'Hani',
            'ZHsim'  : 'Hans',
            'ZHtra'  : 'Hant'   }
 
PSK = { }
from random import random
from math import log as ln
AH = set()
#newpages = set()
Regex = { }
Prex = {}
 
# work cache, record time last looked at entry
# each record is key: lc:word, pickled with safe(), value is integer time()
 
import shelve
cache = shelve.open("af-cache")
 
def prescreen():
 
  while True: # indef repeat
 
    cis = 0
 
    # get XML dump
    dump = xmlreader.XmlDump("../hancheck/en-wikt.xml")
 
    srx = { }
 
    srx['lcode header'] = re.compile(r'^== *\{\{.{2,3}\}\} *==', re.M)
    srx['lcode trans'] = re.compile(r'^\* *\{\{.{2,3}\}\} *:', re.M)
    srx['top template w/param'] = re.compile(r'\{\{top\|')
    srx['top template w/semi gloss'] = re.compile(r'^;.*\n\{\{top\}', re.M)
    srx['top template w/qbold gloss'] = re.compile(r"'''\n\{\{top\}")
    # srx['gender'] = re.compile(r"^\*.*:.*''[fmcn]''", re.M)
 
    """ block covered by regex
    srx['Wikipedia'] = re.compile(r"\{\{Wikipedia")
    srx['Unicode'] = re.compile(r"\{\{Unicode")
    srx['Acronym'] = re.compile(r"\{\{Acronym")
    srx['Initialism'] = re.compile(r"\{\{Initialism")
    srx['Abbreviation'] = re.compile(r"\{\{Abbreviation")
    srx['cattag'] = re.compile(r"\{\{cattag")
    srx['trad'] = re.compile(r"\{\{trad-?\|")
    """
 
    srx['rest after header'] = re.compile('^=+[^=\n]+=+[^=\n]+$', re.M)
    srx['Pronounciation'] = re.compile('Pronounciation')
    srx['categorized'] = re.compile('[Cc]ategori[sz]ed')
 
    # srx['etymology with parens'] = re.compile('Etymology ?\(')
    # srx['etymology at L4'] = re.compile('^==== ?Etymology', re.M)
 
    # srx['also see'] = re.compile('= ?Also see')
    # srx['indented see'] = re.compile(r'^:\{\{see\|', re.M)
    # srx['indented Cyrillic'] = re.compile(r'^:Cyrillic', re.M)
    # srx['indented Roman'] = re.compile(r'^:Roman', re.M)
    srx['Han ref w/o *'] = re.compile(r'^\{\{Han ref\|', re.M)
    # srx['AHD'] = re.compile(r'\{\{AHD')
 
    # next really needs something re lang sects, try it for now, sorta works
    # srx['maybe un-numbered ety'] = re.compile(r'^=== ?Etymology ?===.*Etymology', re.M|re.S)
 
    # srx['PAGENAME'] = re.compile('\{PAGENAME')
    srx['-----'] = re.compile('-----')
 
    # header case problems ...
    srx['lc header'] = re.compile(r'^={1,6} ?[a-z][-a-zA-Z ]*=+$', re.M)
    srx['non sc header'] = re.compile(r'^={3,6} ?[A-Z][-a-z ]*[A-Z][-a-zA-Z ]*=+$', re.M)
 
    # contexts
    srx['context tag'] = re.compile(r"^# *\(''.+?''\)", re.M)
    srx['context tag 2'] = re.compile(r"^# *''\(.+?\)''", re.M)
    srx['context italbrac'] = re.compile(r"^# *\{\{italbrac", re.M)
 
    # hunt down trans fixes, several cases ... ;-) regex will do
    # srx['translations to be checked'] = re.compile(r'ranslations to be')
 
    # re-work rfc level tags (maybe? try 25.2.8?)
    # srx['rfc level'] = re.compile(r'^\{\{rfc-level.*\+',re.M)
    # X phrase is pretty much gone anyway
    # srx['X phrase'] = re.compile(r'^={3,5} *[-a-zA-Z ]* phrase *=+$', re.M)
 
    # and so on
 
    # srx['gender pls'] = re.compile(r'\{\{[mfnc]\.?pl\.?}}')
    # srx['gender comb'] = re.compile(r'\{\{[mfnc]}} \{\{[ps]}}')
    # srx['gender f/m.'] = re.compile(r'\{\{[mf]\.}}')
    # srx['gender pl.'] = re.compile(r'\{\{pl\.}}')
 
    # scripts: covered by regex
    # for s in Scripts:
    #    srx[s + ' script template'] = re.compile(r'[\{=]' + s + r'[\{\|]')
 
    # srx['given name lang'] = re.compile(r':}}') # !? probably enough / now automatic
 
    # canonical cats, headers, did all these 19.8.8, put these back when we get a new dump
    srx['canonical cat'] = re.compile(r'category:')
    srx['canonical cat space'] = re.compile(r'ory: ')
    srx['canonical head'] = re.compile(r'== ')
 
    srx['multiple blanks'] = re.compile(r'\n\n\n\n') # 3 or more, not just 2
    srx['template detritus'] = re.compile(r'\{\{\{') # should never be in an entry
 
    srx['IPA star'] = re.compile(r'^\{\{IPA\|', re.M)
 
    reah = re.compile(r'^={3,6} *([-a-zA-Z ]+) *=+$', re.M)
    rel2 = re.compile(r'^==[^=]', re.M)
    rehr = re.compile(r'^----', re.M)
 
    counts = { }
 
    entries = 0
    tags = 0
    tran = 0 # tagged at random
 
    piscine = set()
    # skip a few others besides the level 3-6 headers
    AH.add('Mandarin')
    AH.add('Cantonese')
    AH.add('Min Nan')
 
    for entry in dump.parse():
        text = entry.text
        title = entry.title
        if ':' in title: continue
        if text and text.startswith('#'): continue
 
        entries += 1
        if entries % 1000 == 0: print "prescreen: %d entries, %d tagged" % (entries, tags)
 
        ckey = safe(title) # must be string for bsd dbm
        if ckey in cache:
            last = cache[ckey]
            if last > time.time() - (35 * 24 * 3600):
                # print "prescreen: %s (35 day cache)" % safe(title)
                continue
 
        # screen entries:
        tag = False
 
        for reason in srx:
            if srx[reason].search(text):
                tag = True
                break
 
        if not tag and '{{rfc' not in text:
            for mo in reah.finditer(text):
                h = mo.group(1).strip()
                if h not in AH:
                    if h not in piscine:
                        print "prescreen: header %s is tagged" % safe(h)
                        piscine.add(h)
                    reason = "unknown header"
                    tag = True
 
        if not tag and '=Pronunciation ' in text and '{{rfc-pron-n' not in text:
            reason = 'Pronunciation n header'
            tag = True
 
        if not tag and '[[' not in text and random() * 17.0 < 1.0:
            reason = 'no [[ in text'
            tag = True
 
        # some exceptions, various that may not be usefully fixed:
        if tag and reason == 'translations to be checked' and '=Translations=' not in text:
            # some cases we don't want ... should be fixed in main routine
            tag = False
 
        # overall regex prescreen, using table for format
 
        if not tag:
            for rx in Regex:
                # skip a trans case, see above
                if rx.startswith('elided Translations to be checked') and '=Translations=' not in text: continue
                if Regex[rx][0].search(text):
                    reason = 'regex ' + rx
                    # but slow this down by canonical factor, too many for now
                    if '+also' in reason and random() * 17.0 > 1.0: continue
                    tag = True
                    break
 
        # Pron section regex, may cause some false hits in other sections
        if not tag:
            for rx in Prex:
                if Prex[rx][0].search(text):
                    reason = 'pron regex ' + rx
                    tag = True
                    break
 
        # look for L2 header and horiz rules mismatch
        # L2 header match is not perfect (requires canonical form), but that just means we will
        # be looking at entries that may need looking at
        if not tag and len(rel2.findall(text)) - 1 != len(rehr.findall(text)) and '{{only in' not in text:
            reason = 'horiz rules'
            tag = True
 
        # tag at random, limiting to something proportionate to other tags
        if not tag and tran < (tags/2)+100 and random() < 0.007:
            tran += 1
            reason = 'at random'
            tag = True
 
        if not tag: continue
 
        r = random()
 
        # collisions don't matter much, but easy to fix (careful about significance! hence *):
        while r in PSK: 
            print "prescreen: collision, r bumped %f to %f" % (r, (r+0.0001)*1.0001)
            r = (r+0.0001)*1.0001
 
        # (debug) if 'accent' in reason:
        #    print "prescreen: %s, %s (%.4f)" % (safe(title), reason, r)
 
        if reason not in counts: counts[reason] = 0
        counts[reason] += 1
        PSK[r] = (title, reason)
        tags += 1
 
        # some (1/10) of the time we return an entry, else just pool
        if tags % 10 != 0: continue
 
        # yield best/minimum
        m = min(PSK.keys())
        title, reason = PSK[m]
        del PSK[m]
        # print "prescreen return: %s, %s (%.4f)" % (safe(title), reason, m)
        ckey = safe(title) # must be string for bsd dbm
        cache[ckey] = time.time() # entry has been fixed for now
        cis += 1
        if cis % 20 == 0: cache.sync()
        yield title, reason, m
 
    # end of file:
    for r in sorted(counts):
        print 'prescreen: count for %s is %d' % (r, counts[r])
 
    # return/yield the rest
    for m in sorted(PSK.keys()):
        title, reason = PSK[m]
        del PSK[m]
        # print "prescreen return: %s, %s (%.4f)" % (safe(title), reason, m)
        ckey = safe(title) # must be string for bsd dbm
        cache[ckey] = time.time() # entry has been fixed for now
        cis += 1
        if cis % 20 == 0: cache.sync()
        yield title, reason, m
 
def now(): return int(time.clock())
 
# share timer with main
naptime = 0
 
def rcpages(site):
    # generator which yields recentchanges, but not unpatrolled changes
    # also entries in category
    # in between, yields pages that satisfy the prescreen in random order
    global naptime
 
    site = wikipedia.getSite("en", "wiktionary")
    cat = catlib.Category(site, "Category:Requests for autoformat")
 
    seen = set()
 
    nextcat = now() - 1
    nextrc = now() - 1
 
    hold = { }
    rcex = re.compile(r'title="(.+?)"')
 
    for title, reason, m in prescreen():
 
        seen.add(title)
        print '(%d, from prescreen %s, %.4f)' % (now(), reason, m)
        page = wikipedia.Page(site, title)
        yield page
 
        nf = 0
        nd = 0
 
        # get our category, every 10-15 minutes or so
        if now() > nextcat:
            cat.catlist(purge = True)
            for page in cat.articles():
                nf += 1
                if nf > 7: break    # just munch the cat, not too hungry ;-)
                # if len(hold) > 100 and nf > 1: break   # try to keep up, cat can wait? needed?
                print '(%d)' % now()
                seen.add(page.title())
                if page.title() in hold: del hold[page.title()]
                yield page
            nextcat = now() + 740
 
        # recent changes
        if now() > nextrc:
            print '(%d, reading recent changes)' % now()
            try:
                rct = site.getUrl("/w/api.php?action=query&list=recentchanges&format=xml&rcprop=title" +
                     "&rclimit=5000&rcshow=patrolled|!bot&rctype=edit|new&rcnamespace=0", sysop = True)
            except wikipedia.NoPage:
                print "Can't get recentchanges from en.wikt!"
                rct = ''
                time.sleep(30)
                continue
            if '</recentchanges>' not in rct:
                print "some bad return from recentchanges, end tag not found"
                rct = ''
                time.sleep(30)
                continue
            nextrc = now() + 1400
 
            ht = 480
            for title in rcex.findall(rct):
                if ':' in title: continue # other stray stuff in NS:0
 
                if title not in seen:
                    seen.add(title)
                    hold[title] = now() + ht
                    # scatter out into future ... (numbers fairly arbitrary, but work well)
                    ht += 34
                    if ht > 21 * 3600: ht /= 7 # ? if more than most of a day
                    nf += 1
                    print "found: [%s] hold until %d" % (safe(title), hold[title])
 
        pastime = now()
        for title in sorted(hold):
             # 10 on a pass is enough
             if nd > 9: break
             if hold[title] > pastime: continue
             print '(%d, rc held to %d)' % (now(), hold[title])
             del hold[title]
             nd += 1
             page = wikipedia.Page(site, title)
             yield page
 
        if not nd and not nf and naptime > 5:
            naptime = min(naptime, 340)  # max to keep timers running
            print "(%d, sleeping %d)" % (now(), naptime)
            # also rely on put throttle
            time.sleep(naptime)
 
        print '(%d, %d held)' % (now(), len(hold))
 
        continue
 
# now have some serious recursion fun!
# fuzzy returns string match score
# r is min required, calls may have neg r, may return value < r
 
def fuzzy(a, b, r):
 
    if not a or len(a) < r: return 0
    if not b or len(b) < r: return 0
 
    if a == b: return len(a)
    if a[0] == b[0]: return 1 + fuzzy(a[1:], b[1:], r-1)
    if a[-1] == b[-1]: return 1 + fuzzy(a[:-1], b[:-1], r-1)
 
    # try with each char forward
    p = a.find(b[0])
    if p >= 0: sca = 1 + fuzzy(a[p+1:], b[1:], r-1)
    else: sca = 0
 
    p = b.find(a[0])
    if p >= 0: scb = 1 + fuzzy(b[p+1:], a[1:], r-1)
    else: scb = 0
 
    # no match either/or way, skip this char, one or both
    if not sca and not scb: sk = fuzzy(a[1:], b[1:], r)
    elif not sca: sk = fuzzy(a, b[1:], r)
    elif not scb: sk = fuzzy(a[1:], b, r)
    else: sk = 0
 
    return max(sk, sca, scb)
 
def infline(title, lang, header):
 
    pos = header.lower()
    if pos.startswith('{{'):
        pos = pos[2:-2].split('|')[0]
 
    if lang == 'en':
        if pos in ['verb', 'noun', 'adjective', 'adverb']:
            return "{{infl|en|" + pos + "}}[[Category:English "+ pos +"s that lack inflection template]]"
 
    a = ord(title[0:1])
 
    # Arabic:
    if 0x0600 <= a < 0x0780:
        return "{{infl|%s|%s|sc=Arab}}" % (lang, pos)
 
    # Han:
    # this is planes 1-2, needs closer check
    if 0x3400 <= a < 0xA000 or 0xd800 <= a < 0xdc00:
        if lang == 'ko':
            return "{{infl|%s|%s|sc=Hant}}{{ko-attention|may need inflection template}}" % (lang, pos)
        elif lang == 'ja':
            return "{{infl|%s|%s|sc=Jpan}}{{ja-attention|needs inflection template}}" % (lang, pos)
        elif lang == 'vi':
            return "{{infl|%s|%s|sc=Hant}}{{vi-attention|may need inflection template}}" % (lang, pos)
        else:
            return "{{infl|%s|%s|sc=Hani}}{{zh-attention|needs inflection template}}" % (lang, pos)
 
    if lang == 'ja':
            return "{{infl|%s|%s}}{{ja-attention|needs inflection template}}" % (lang, pos)
    if lang == 'ko':
            return "{{infl|%s|%s}}{{ko-attention|may need inflection template}}" % (lang, pos)
    if lang in ['zh', 'cmn', 'yue', 'nan']:
            return "{{infl|%s|%s}}{{zh-attention|may need inflection template}}" % (lang, pos)
 
    return "{{infl|%s|%s}}" % (lang, pos)
 
MOD = [ 'chiefly', 'coarse', 'especially', 'extremely', 'frequently', 'generally', 'mainly', 'markedly',
    'mildly', 'mostly', 'often', 'particularly', 'primarily', 'sometimes', 'usually', 'very' ]
 
reunlink = re.compile(r'\[\[(.*?)\]\]')
# match a simple context: words, but no odd punctuation etc
resimctx = re.compile(r'[-\w ]*$')
 
PRETULIP = ('of ', 'by ')
 
def cpar(cstr, ctxs):
 
    # convert context string to template name(s)
 
    tname = ''
    cstr = re.sub(r'[,;\|]+', ',', cstr) 
    for cs in cstr.split(','):
        cs = cs.strip(" '")
        if '[' in cs: cs = reunlink.sub(r'\1', cs)
        # handles n modifiers, does context? yes.
        while cs.split(' ')[0].lower() in MOD:
            mod = cs.split(' ')[0].lower()
            tname += mod + '|'
            cs = cs[len(mod):].strip()
        if cs.lower() in ctxs:
            tname += ctxs[cs.lower()] + '|'
        elif cs.startswith(PRETULIP):
            if not tname: tname = 'context|'
            tname += cs + '|'
        elif tname and resimctx.match(cs):
            tname += cs + '|'
        else: return ''
 
    tname = tname.rstrip('|')
    return tname
 
def ibsub(imo):
 
    # some prefix captured
    pref = imo.group(1)
    istr = imo.group(2)
    s = reunlink.sub(r'\1', istr)
    # not general enough, bar pipes in match for now in re precomp
    #if s != istr and '|' in s: s = s.split('|')[1]
    s = re.sub(r',\s*', '|', s)
    if imo.group(3) == ':':
        return pref + '{{i-c|' + s + '}}'
    else:
        return pref + '{{i|' + s + '}}'
 
def sdif(a, b):
 
    # returns -(a stuff) +(b stuff) when one change
 
    i = 0
    while a[i:i+1] and a[i:i+1] == b[i:i+1]: i += 1
    an = a[i:]
    bn = b[i:]
    j = 1
    while j < len(an) and an[-j:] == bn[-j:]: j += 1
    j -= 1
 
    # special case: improve on -}} {{ +| :
    if j >= 3 and an.startswith('}} {{') and bn[:-j].endswith('|'):
        an = a[i-3:]
        bn = b[i-3:]
        j -= 3
        # return '-' + a[i-3:][:11] + ' +' + b[i-3:][:7] # gaa ...
 
    if j: return '-' + an[:-j] + ' +' + bn[:-j]
    else: return '-' + an + ' +' + bn
 
    # okay, try that! not so pretty is it?
 
 
# sort language sections:
 
retransline = re.compile(r'\* \[*([^\]:\{\}]+?)\]*:')     # match an already canonicalized line
retransreq = re.compile(r'\* \{\{trreq\|([^\}]+?)\}\}')   # trans req template
retranstbc = re.compile(r'\* \{\{ttbc\|([^\}]+?)\}\}')    # trans to be checked, allow here?
redetemp = re.compile(r'\{\{\w*\|')
redechar = re.compile(r'[\{\}\|\[\]]')
redecomm = re.compile(r'<!--.*?-->')
 
def nlen(s):
 
    # simplest form:
    # return 1 + len(s)/135 # +1 for each length of line that will probably wrap (WAG)
    # this routine can be tweaked more if needed
 
    # better:
    s2 = redetemp.sub('', s)
    s2 = redechar.sub('', s2)
    s2 = redecomm.sub('', s2)
 
    # dbg:
    # if len(s2) >= 85: print "long line (%d): %s" % (1+len(s2)/85, safe(s2))
 
    return 1 + len(s2)/85
 
# reduce text to "safe" for wiki as a template parameter:
rewsafe = re.compile(r'[\{\}\[\]\|\<\>]+')
 
# match a see-only case:
reseeonly = re.compile(r"\{\{trans-top\|(.+?)\}\}\n+[ :']*[Ss]ee[ ':]*(\[\[.+?\]\])(.*)$", re.S)
 
def transort(tmo):
 
    ts = { }
    tsk = { }
 
    # take apart by language, treat header as "language" nil
 
    prob = ''
    prev = ''
    k = 0
    for tline in tmo.group(0).splitlines():
        if tline.startswith('{{trans-top'):
            if '' in ts:
                prob = "trans-top found inside section, missing trans-bottom?"
                break
            ts[''] = tline
            tsk[''] = 0
            continue
        if tline == '{{trans-mid}}': continue
        if tline == '{{trans-bottom}}': continue
        if not tline: continue
        mo = retransline.match(tline)
        if not mo: mo = retransreq.match(tline)
        if not mo: mo = retranstbc.match(tline)
        if mo:
            lang = mo.group(1)
            if lang in ts:
                prob = "duplicate language: " + lang
                break
            if lang.startswith('{{'):
                prob = "unexpected template: " + lang
                break
            ts[lang] = tline
            nl = nlen(tline)
            tsk[lang] = nl
            k += nl
            prev = lang
            continue
        if tline.startswith('* '):
            prob = "unparsed language line: " + tline
            break
        # [tbd: treat ** as a sub language, eg key is "Chinese | Mandarin"]
        if tline.startswith('*:') or tline.startswith('**'): # allow both here
            ts[prev] += '\n' + tline
            nl = nlen(tline)
            tsk[prev] += nl
            k += nl
            continue
        if tline.startswith(': ') and not prev: # e.g. : ''see'' reference
            ts[prev] += '\n' + tline
            tsk[prev] += 1
            k += 1
            continue
        if tline.startswith('<!--') and not prev:
            ts[prev] += '\n' + tline
            # no addition to counts
            continue
        prob = "unknown line format: " + tline
        break
 
    # blank section or nothing worth sorting, do nothing? um, format it default
    # if not k: return tmo.group(0)
 
    # pick up see-only case before looking at prob:
    if not prev:
        # no languages found
        mo = reseeonly.match(tmo.group(0))
        if mo:
            print "matched see in trans section"
            gloss = mo.group(1).strip()  # leaves ''s as an issue
            target = mo.group(2).strip()
            if '#' not in target and '|' not in target: target = target.strip('[]')
            rest = mo.group(3)
            # check remainder
            rest = rest.replace("{{trans-mid}}", '')
            rest = rest.replace("{{trans-bottom}}", '')
            if not rest.strip(" '\n"):
                if gloss == target: return "{{trans-see|" + target + "}}"
                else: return "{{trans-see|" + gloss + "|" + target + "}}"
 
    if prob:
        print "in trans section,", safe(prob)
        prob = rewsafe.sub(' ', prob)  # wiki-safe ;-)
        return "{{rfc-tsort|" + prob + "}}\n" + tmo.group(0)   # rfc tag + unchanged
 
    # re-assemble, balance columns
 
    m = 0
    tsnew = ''
    for lang in sorted(ts, key=lkey):
        tsnew += ts[lang] + '\n'
        m += tsk[lang]
        if k and m >= (k + 1) / 2:
            tsnew += '{{trans-mid}}\n'
            k = 0
    # if not m: tsnew += '{{trans-mid}}\n'
    if '{{trans-mid}}' not in tsnew: tsnew += '{{trans-mid}}\n' # better test? should be the same as not m
    tsnew += '{{trans-bottom}}\n'
 
    return tsnew
 
def prokey(s):
 
    # is (sorted) stable? as of Python 2.3, yes ;-)
 
    # simple prolog sort, LHS after RHS, unknown in the middle
    if s.startswith('{{was wotd'): return '0' # moved in monobook
    if s.startswith('{{wiki'): return '1'     # sister templates
    if s.startswith('{{commons'): return '1'  # sister templates
    if s.startswith('{{inter'): return '1'    # sister templates
    if s.startswith('{{zh-'): return '2'      # Chinese floatright
    if s.startswith('{{ja-'): return '2'      # Japanese floatright
    if s.startswith('[[Image'): return '3'         # images
    if s.startswith('[[image'): return '3'         # images
 
    # LHS:
    if s.startswith('{{selfref'): return '6'
    if s.startswith('{{also'): return '7'
    if s.startswith('{{xsee'): return '7'
    if s.startswith('{{xalso'): return '7'
 
    if s: print "prolog sort: no key for %s" % safe(s)
    else: return '9' # blank lines usually are at end, will be removed
 
    return '5'
 
def main():
    global naptime
 
    socket.setdefaulttimeout(30)
 
    # regex precomp, force headers to canonical:
    # first allows singleton =
    rehead1 = re.compile(r'(={2,6})(.+?)={2,6}(.*)$')
    rehead2 = re.compile(r'(={1,6})([^=<]+?)={1,6}(.*)$')
    rehead3 = re.compile(r'(={1,6})([^=<]+?)=+(.*)$')
    rehead4 = re.compile(r'(=+)([^=<]+)(.*)$')
    realleq = re.compile(r'=+$')
 
    # L2 headers
    reL2head = re.compile(r'==?\s*([^=]+)={1,6}(.*)')
    # lang= on bad headers, so allow singleton ='s:
    reheader = re.compile(r'(={3,6})\s*(.+?)={2,6}(.*)')
    reiwiki = re.compile(r'\[\[[-a-z]{2,11}:(.*)\]\]')
    recat = re.compile(r'\[\[category:.*?\]\]', re.I)
    retrans1 = re.compile(r'\* \[\[w:.+\|([^\]]+?)\]\]\s*:(.*)')
    retrans2 = re.compile(r'\* \[\[([^\]]+?)\]\]\s*:(.*)')
    retrans3 = re.compile(r'\* ([^:]+?):(.*)')
    retrans4 = re.compile(r'\* (\w+)(.*)') # missing :
    retag = re.compile(r'\{\{rfc-auto(\|.*?|)}}')
    regender = re.compile(r"''([mfcn])''")
    reglossfix = re.compile(r'(.+)\(\d+\)$')
    retopgloss = re.compile(r'\{\{top(\|.*?|)}}$')
    recontext = re.compile(r"^# *\(''(.+?)''\):? ?(.*)$", re.M)
    recontext2 = re.compile(r"^# *''\((.+?)\):?'' ?(.*)$", re.M)
    recontext3 = re.compile(r"^# *\{\{italbrac\|([^}]+?)}}:? ?(.*)$", re.M)
    repronn = re.compile(r'Pronunciation \d+')
 
    # be careful to match and remove newline in these unless they happen to be at the very end:
    rerfclevel = re.compile(r"^\{\{rfc-level\|.*\+.*\}\}\n?", re.M)
    rerfcxphrase = re.compile(r"^\{\{rfc-xphrase\|.*\}\}\n?", re.M)
    rerfcheader = re.compile(r"^\{\{rfc-header\|.*\}\}\n?", re.M)
    rerfcsubst = re.compile(r"^\{\{rfc-subst\}\}\n?", re.M)
    rerfcpronn = re.compile(r"^\{\{rfc-pron-n\|.*\}\}\n?", re.M)
 
    # italbracs not on context/defn lines, template italbrac->i replacement separate
    # limited forms ... nowilink with pipes, no templates, look for : in mo.g3
    # look for gloss, etc, * lines to start ...
    reibcomma = re.compile(r"^(\*\s*)\(''([^\)^'^\|^\{]+):?''\)(:?)")
    reibcomma2 = re.compile(r"^(\*\s*)''\(([^\)^'^\|^\{]+):?\)''(:?)")
 
    # match "stackable" format characters at start of lines, so we can have one space exactly
    restack = re.compile(r"^([:#\*]+)\s*")
 
    # regex table (dict, name = tuple of compiled object and replacement)
    Regex['subst:PAGENAME'] = (re.compile(r'\{\{PAGENAME}}'), '{{subst:PAGENAME}}')
    Regex['template -cattag +context'] = (re.compile(r'\{\{cattag\|'), '{{context|')
    Regex['template -Unicode +unicode'] = (re.compile(r'\{\{Unicode\|'), '{{unicode|')
    Regex['template -Wikipedia +wikipedia'] = (re.compile(r'\{\{Wikipedia([\|\}])'), r'{{wikipedia\1')
    Regex['template -WP +wikipedia'] = (re.compile(r'\{\{WP([\|\}])'), r'{{wikipedia\1')
    Regex['template -Acronym +acronym'] = (re.compile(r'\{\{Acronym([\|\}])'), r'{{acronym\1')
    Regex['template -Initialism +initialism'] = (re.compile(r'\{\{Initialism([\|\}])'), r'{{initialism\1')
    Regex['template -Abbreviation +abbreviation'] = (re.compile(r'\{\{Abbreviation([\|\}])'), r'{{abbreviation\1')
    Regex['template -AHD +enPR'] = (re.compile(r'\{\{AHD([\|\}])'), r'{{enPR\1')
 
    # translations
    Regex['template -trans-bot +trans-bottom'] = (re.compile(r'\{\{trans-bot\}\}'), '{{trans-bottom}}')
    Regex['template -trans-middle +trans-mid'] = (re.compile(r'\{\{trans-middle\}\}'), '{{trans-mid}}')
 
    Regex['elided Translations to be checked header'] = (re.compile(
         r'^={3,6}Translations to be checked={3,6}\n*\{\{checktrans', re.M), '{{checktrans')
    Regex['elided Translations to be checked header and comment'] = (re.compile(
         r'^={3,6}Translations to be checked={3,6}\n*<!--\s*Remove this section.*\n*\{\{checktrans', re.M),
          '{{checktrans')
    Regex['checktrans and trans-top to checktrans-top'] = (re.compile(
         r'^\{\{checktrans\}\}\n*\{\{trans-top\|\w*lations to be \w*\}\}', re.M), '{{checktrans-top}}')
 
    Regex['checktrans/top/mid/bottom to checktrans-top etc'] = (re.compile(
         r'^\{\{checktrans\}\}\n*\{\{top\}\}(.*?)^\{\{mid\}\}(.*?)^\{\{bottom\}\}', re.M|re.S),
         r'{{checktrans-top}}\1{{checktrans-mid}}\2{{checktrans-bottom}}')
 
    Regex['template -ttbc-top +checktrans-top'] = (re.compile(r'\{\{ttbc-top\}\}'), '{{checktrans-top}}')
    Regex['template -ttbc-mid +checktrans-mid'] = (re.compile(r'\{\{ttbc-mid\}\}'), '{{checktrans-mid}}')
    Regex['template -ttbc-bottom +checktrans-bottom'] = (re.compile(r'\{\{ttbc-bottom\}\}'), 
          '{{checktrans-bottom}}')
 
    Regex['template -trad +t'] = (re.compile(r'\{\{trad\|'), '{{t|')
    Regex['template -trad- +t-'] = (re.compile(r'\{\{trad-\|'), '{{t-|')
 
    Regex['un-indent {{see}} template'] = (re.compile(r'^:\{\{see\|', re.M), '{{see|')
 
    Regex['template -cpl +{{c|p}}'] = (re.compile(r'\{\{c\.?pl\.?}}'), '{{c|p}}')
    Regex['template -fpl +{{f|p}}'] = (re.compile(r'\{\{f\.?pl\.?}}'), '{{f|p}}')
    Regex['template -pl. +{{p}}'] = (re.compile(r'\{\{pl\.}}'), '{{p}}')
 
    Regex['template -m. +{{m}}'] = (re.compile(r'\{\{m\.}}'), '{{m}}')
    Regex['template -f. +{{f}}'] = (re.compile(r'\{\{f\.}}'), '{{f}}')
    Regex['template -mf +{{m|f}}'] = (re.compile(r'\{\{mf}}'), '{{m|f}}')
    Regex['template -fn +{{f|n}}'] = (re.compile(r'\{\{fn}}'), '{{f|n}}')
    Regex['template -fp +{{f|p}}'] = (re.compile(r'\{\{fp}}'), '{{f|p}}')
    Regex['template -mp +{{m|p}}'] = (re.compile(r'\{\{mp}}'), '{{m|p}}')
    Regex['template -fm +{{m|f}}'] = (re.compile(r'\{\{fm}}'), '{{m|f}}')
    Regex['template -nf +{{f|n}}'] = (re.compile(r'\{\{nf}}'), '{{f|n}}')
    Regex['template -nm +{{m|n}}'] = (re.compile(r'\{\{nm}}'), '{{m|n}}')
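    # for illustration, these shorthand gender templates become the piped forms:
    #   {{mf}} -> {{m|f}},  {{c.pl.}} -> {{c|p}},  {{pl.}} -> {{p}}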
 
    # given name, preferred syntax
 
    Regex['xx: to lang=xx in given name template'] = (
           re.compile(r'(\{\{given name[^\}]*?\|)\|?([-a-z]{2,10}):\}\}'), r'\1lang=\2}}')
    Regex['from language to from=language in given name template'] = (
           re.compile(r'(\{\{given name[^\}]*?\|)from ([-a-zA-Z ]+)\|?([\}\|])'), r'\1from=\2\3')
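    # for illustration:
    #   {{given name|female|nl:}}      -> {{given name|female|lang=nl}}
    #   {{given name|male|from Latin}} -> {{given name|male|from=Latin}}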
 
    # table format lines, row divs to one "-"
    Regex['table |--* to |-'] = (re.compile(r'^\|--+', re.M), r'|-')
 
    # stuff left from preload templates
    # careful this first one starts with 3 {'s, check previous character? not for now
    Regex['remove template subst detritus'] = (re.compile('\{\{\{[0-9a-z]+\|(.*?)\}\}\}'), r'\1')
    Regex['remove template subst detritus #if etc'] = (re.compile('\{\{#\w+:\|\|?\}\}'), r'')
    # temp for esbot leftovers:
    Regex['remove esbot:catline'] = (re.compile('\{\{esbot:catline.*\{\{ending\}{5,5}'), r'')
 
    # script code replacements, first a dict, then generate the two regex forms for each:
 
    for sc in Scripts:
         Regex['script template -'+sc+' +'+Scripts[sc]] = (re.compile(r'\{\{'+sc+r'\|'), '{{'+Scripts[sc]+'|')
         Regex['script parameter -sc='+sc+' +sc='+Scripts[sc]] = (
            re.compile(r'\|sc='+sc+r'([\}\|])'), '|sc='+Scripts[sc]+r'\1')
    # whoa(!)
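    # e.g. a (hypothetical) entry Scripts['ARchar'] = 'Arab' would generate both
    #   {{ARchar|... -> {{Arab|...   and   |sc=ARchar -> |sc=Arab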
 
    # see templates
    Regex['template -see +also'] = (re.compile(r'\{\{see\|'), r'{{also|')
    Regex['template -See +also'] = (re.compile(r'\{\{See\|'), r'{{also|')
    Regex['template -see also +also'] = (re.compile(r'\{\{see also\|'), r'{{also|')
    # fix Japanese sees, allow a line for kanjitab after header (do not use re.S)
    Regex['Japanese see/also in section to ja-see-also'] = \
                            (re.compile(r'^(==Japanese==\n*.*\n*){\{(see|also)\|', re.M), \
                             r'\1{{ja-see-also|')
 
    Regex['add language in front of {{t}}'] = (re.compile(r'^\*? *\{\{t(\+|-|)\|([a-z-]+)\|', re.M), \
                  r'* {{\2}}: {{t\1|\2|')
    # (a few more general Regex below)
 
    StarTemp = set([ 'Han ref', 'ja-readings', 'ethnologue', 'websters-online', 'pedialite',
                     'Hanja ref', 'Linguist List', 'IPA', 'SAMPA', 'enPR', 'ISO 639', 'R:1913' ])
    restartemp = re.compile(r'\{\{(.+?)[\|\}]')
 
    # trans lines gender templates regex, ordered list:
    Trex = [ ]
    # first replace ' cases with templates, look for leading space:
    Trex.append((re.compile(r" ''([mfcn])''"), r' {{\1}}'))
    Trex.append((re.compile(r" ''(pl|plural)''"), ' {{p}}'))
    Trex.append((re.compile(r" ''(sg|sing|singular)''"), ' {{s}}'))
    Trex.append((re.compile(r" ''m( and| or|,|/|) ?f''"), ' {{m|f}}'))
    # now look for combinations:
    Trex.append((re.compile(r"\{\{([mfcn])}},? \{\{([fcn])}},? \{\{([cnps])}}"), r'{{\1|\2|\3}}'))
    Trex.append((re.compile(r"\{\{([mfcn])}},? \{\{([fcnps])}}"), r'{{\1|\2}}'))
    # hmmm...
    Trex.append((re.compile(r"\{\{t([\+\-]?)\|([^\|]*?)\|([^\|]*?)\|mf}}"), r'{{t\1|\2|\3|m|f}}'))
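    # applied in order, a translation line like "apple ''m'', ''f''" goes
    #   -> "apple {{m}}, {{f}}"  ->  "apple {{m|f}}"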
 
    # match trans sections
    retransect = re.compile(r"^\{\{trans-top.*?^\{\{trans-bottom\}\}\n", re.M|re.S)
 
    # Pronunciate
    # like Regex, but applied line by line only in pronunciation sections
    # use ^ and $ as needed with re.M for prescreen
    Prex['template enPR/IPA/SAMPA'] = \
                    (re.compile(r'^\*? ?([^ \{\|\}/]+), /([^\{\|\}/]+)/, /<tt>([^\|\}/]+)</tt>/$', re.M),
                     r'* {{enPR|\1}}, {{IPA|/\2/}}, {{SAMPA|/\3/}}')
    Prex['template enPR/IPA/SAMPA (RP, UK, US)'] = \
                    (re.compile(r"^\*? ?\(''(RP|UK|US)''\):? *"
                     r'([^ \{\|\}/]+), /([^\{\|\}/]+)/, /<tt>([^\|\}/]+)</tt>/$', re.M),
                     r'* {{a|\1}} {{enPR|\2}}, {{IPA|/\3/}}, {{SAMPA|/\4/}}')
    Prex['template enPR/IPA/SAMPA with {a}'] = \
                    (re.compile(r"^\*? ?(\{\{a\|[^\}]+\}\}):? *"
                     r'([^ \{\|\}/]+), /([^\{\|\}/]+)/, /<tt>([^\|\}/]+)</tt>/$', re.M),
                     r'* \1 {{enPR|\2}}, {{IPA|/\3/}}, {{SAMPA|/\4/}}')
 
    Prex['+rhymes template'] = (re.compile("'*Rhymes:'* *\[\[[Rr]hymes:English:-(?P<s>.+?)\|-(?P=s)\]\]"),
                                r'{{rhymes|\1}}')
    # without the leading "Rhymes:" text; "Rhymes:" appears inside the link label:
    Prex['+rhymes template w/Rhymes: in link'] = \
           (re.compile("^([\*:]+) *\[\[[Rr]hymes:English:-(?P<s>.+?)\|Rhymes: -(?P=s)\]\]", re.M),
                                r'\1 {{rhymes|\2}}')
    Prex['+rhymes template (Finnish)'] = (re.compile("'*Rhymes:'* *\[\[[Rr]hymes:Finnish:-(?P<s>.+?)\|-(?P=s)\]\]"),
                                r'{{rhymes|\1|lang=fi}}')
    Prex['+rhymes template w/Rhymes: in link (Finnish)'] = \
           (re.compile("^([\*:]+) *\[\[[Rr]hymes:Finnish:-(?P<s>.+?)\|Rhymes: -(?P=s)\]\]", re.M),
                                r'\1 {{rhymes|\2|lang=fi}}')
    Prex['+rhymes template w/Rhymes: in link (French)'] = \
           (re.compile("^([\*:]+) *\[\[[Rr]hymes:French:-(?P<s>.+?)\|Rhymes: -(?P=s)\]\]", re.M),
                                r'\1 {{rhymes|\2|lang=fr}}')
    Prex['+rhymes template (Icelandic)'] = \
           (re.compile("'*Rhymes:'* *\[\[[Rr]hymes:Icelandic:-(?P<s>.+?)\|-(?P=s)\]\]"),
                                r'{{rhymes|\1|lang=is}}')
    Prex['template -Rhymes +rhymes'] = (re.compile(r'\{\{Rhymes([\|\}])'), r'{{rhymes\1')
    # multiple rhymes (assume language matches! ;-)
    Prex['add additional rhyme to template'] = \
           (re.compile(r'(\{\{rhymes\|[^\}]+)\}\} *(,|or|) *\[\[[Rr]hymes:[A-Za-z -]+:-(?P<s>.+?)\| ?-(?P=s)\]\]'),
                                r'\1|\3}}')
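    # for illustration:
    #   Rhymes: [[Rhymes:English:-at|-at]]          -> {{rhymes|at}}
    #   {{rhymes|at}}, [[Rhymes:English:-it|-it]]   -> {{rhymes|at|it}}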
 
    Prex["rm /'s from enPR template"] = (re.compile(r'\{\{enPR\|/([^ /\[\]\{\}]+?)/\}\}'), r'{{enPR|\1}}')
 
    # RP, UK, and US in a wide variety of cases
    Prex['(RP) to {{a|RP}}'] = (re.compile(r"^\*? ?[\(\[\{']+RP[\]\)\}:']+", re.M), r'* {{a|RP}}')
    Prex['(UK) to {{a|UK}}'] = (re.compile(r"^\*? ?[\(\[\{']+UK[\]\)\}:']+", re.M), r'* {{a|UK}}')
    Prex['(US) to {{a|US}}'] = (re.compile(r"^\*? ?[\(\[\{']+US[\]\)\}:']+", re.M), r'* {{a|US}}')
    Prex['(italbrac RP) to {{a|RP}}'] = (re.compile(r"^\*? ?\{\{italbrac\|\[*RP\]*\}\}:?", re.M), r'* {{a|RP}}')
    Prex['(italbrac UK) to {{a|UK}}'] = (re.compile(r"^\*? ?\{\{italbrac\|\[*UK\]*\}\}:?", re.M), r'* {{a|UK}}')
    Prex['(italbrac US) to {{a|US}}'] = (re.compile(r"^\*? ?\{\{italbrac\|\[*US\]*\}\}:?", re.M), r'* {{a|US}}')
    Prex['IPA: [[WEAE]] to {{a|WEAE}} IPA:'] = \
               (re.compile(r"^\*? ?IPA: [\(\[\{']+WEAE[\]\)\}:']+", re.M), r'* {{a|WEAE}} IPA:')
    Prex['(GenAm) to {{a|GenAm}}'] = (re.compile(r"^\*? ?\[\[w:G[^\|]+\|GenAm\]\]", re.M), r'* {{a|GenAm}}')
    Prex['(Canada) to {{a|Canada}}'] = (re.compile(r"^\*? ?[\(\[\{']+Canada[\]\)\}:']+", re.M), r'* {{a|Canada}}')
    Prex['(Australia) to {{a|Australia}}'] = \
               (re.compile(r"^\*? ?[\(\[\{']+Australia[\]\)\}:']+", re.M), r'* {{a|Australia}}')
    Prex['(Aus) to {{a|Aus}}'] = (re.compile(r"^\*? ?[\(\[\{']+Aus[\]\)\}:']+", re.M), r'* {{a|Aus}}')
    Prex['(GenAm|US) to {{a|GenAm}}'] = \
           (re.compile('^' + re.escape("* (''[[General American|US]]'')"), re.M),
            r'* {{a|GenAm}}')
    Prex['(RecPr|UK) to {{a|RP}}'] = \
           (re.compile('^' + re.escape("* (''[[Received Pronunciation|UK]]'')"), re.M),
            r'* {{a|RP}}')
 
    # untemplated SAMPA and IPA, several combinations, also for "AHD", allow an {{a}} template in front
    Prex['template IPA'] = \
         (re.compile(r"^\*? ?(\{\{a\|.+?\}\} *|)"
             r"\[*(w:IPA\||)IPA\]*:? *([/\[][^\{\|\}/\]]+?[/\]])$", re.M),
             r'* \1{{IPA|\3}}')
    Prex['template IPA -IPAchar'] = \
         (re.compile(r"^\*? ?(\{\{a\|.+?\}\} *|)"
             r"\[*(w:IPA\||)IPA\]*:? *\{\{IPAchar\|([/\[][^\{\|\}/\]]+?[/\]])\}\}$", re.M),
             r'* \1{{IPA|\3}}')
    Prex['template SAMPA'] = \
            (re.compile(r"^\*? ?(\{\{a\|.+?\}\} *|)"
             r"\[*(w:SAMPA\||)SAMPA\]*:? *([/\[])(<tt>|)([^\|\}/]+?)(</tt>|)([/\]])$", re.M),
             r'* \1{{SAMPA|\3\5\7}}')
    Prex['template enPR (was AHD)'] = \
            (re.compile(r"^\*? ?(\{\{a\|.+?\}\} *|)\[*(w:AHD\||)AHD\]*:? *([^ \{\|\}/]+?)$", re.M),
             r'* \1{{enPR|\3}}')
    Prex['template X-SAMPA'] = \
            (re.compile(r"^\*? ?(\{\{a\|.+?\}\} *|)"
             r"\[*(w:X-SAMPA\||)X-SAMPA\]*:? *([/\[])(<tt>|)([^\{\|\}/]+?)(</tt>|)([/\]])$", re.M),
             r'* \1{{X-SAMPA|\3\5\7}}')
 
    Prex['or/comma to multiple parameters in IPA template'] = \
            (re.compile(r"\{\{IPA\|([^\}]+/)(, ?| or | ''or'' )(/[^\}]+)\}\}"), r'{{IPA|\1|\3}}')
    Prex['or/comma to multiple parameters in enPR template'] = \
            (re.compile(r"\{\{enPR\|([^\}]+/)(, ?| or | ''or'' )(/[^\}]+)\}\}"), r'{{enPR|\1|\3}}')
    Prex['or/comma to multiple parameters in SAMPA template'] = \
            (re.compile(r"\{\{SAMPA\|([^\}]+/)(, ?| or | ''or'' )(/[^\}]+)\}\}"), r'{{SAMPA|\1|\3}}')
 
    # accent templates, try to cover the A-cai/Min Nan cases and others, up to 4
 
    Prex['+accent template 1'] = (re.compile(r"^\* \(''"
            r"\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r"''\):?", re.M), r'* {{a|\2}}')
    Prex['+accent template 2'] = (re.compile(r"^\* \(''"
            r"\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r", *\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r"''\):?", re.M), r'* {{a|\2|\4}}')
    Prex['+accent template 3'] = (re.compile(r"^\* \(''"
            r"\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r", *\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r", *\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r"''\):?", re.M), r'* {{a|\2|\4|\6}}')
    Prex['+accent template 4'] = (re.compile(r"^\* \(''"
            r"\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r", *\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r", *\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r", *\[*(w?:?[A-Za-z -]+\||)([A-Za-z -]+)\]*"
            r"''\):?", re.M), r'* {{a|\2|\4|\6|\8}}')
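    # for illustration:
    #   * (''US''):           -> * {{a|US}}
    #   * (''RP, US, Aus''):  -> * {{a|RP|US|Aus}}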
 
    # hyphenation ...
    Prex['+hyphenation template'] = (re.compile(r"'*Hyphenation:?'*:? *([^ \{\}]+)$"), r'{{hyphenation|\1}}')
    Prex['middot to | in hyphenation template'] = (re.compile(r'(\{\{hyphenation\|.+?)' + u'\u00B7' + '(.+?\}\})'),
            r'\1|\2')
    Prex['hyphpt to | in hyphenation template'] = (re.compile(r'(\{\{hyphenation\|.+?)' + u'\u2027' + '(.+?\}\})'),
            r'\1|\2')
    Prex['middot (HTML) to | in hyphenation template'] = (re.compile(r'(\{\{hyphenation\|.+?)&middot;(.+?\}\})'),
            r'\1|\2')
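    # for illustration (each pass converts one separator):
    #   Hyphenation: ap&middot;ple -> {{hyphenation|ap&middot;ple}} -> {{hyphenation|ap|ple}}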
 
    # "blank" IPA/SAMPA/AHD, include new-line, so put these in general regex
    Regex['replaced IPA // with {{rfp}}'] = (re.compile(r'^\* \[\[IPA\]\]:? *//\n', re.M), '{{rfp}}\n')
    Regex['removed SAMPA //'] = (re.compile(r'^\* \[\[SAMPA\]\]:? *//\n', re.M), '')
    Regex['removed AHD //'] = (re.compile(r'^\* \[\[AHD\]\]:? *//\n', re.M), '')
 
    # IPA template fix to add lang=, capture all but }} without =
    reIPAlang = re.compile(r'(\{\{IPA\|[^}=]+)\}\}')
 
    # combine to single lines, lines are canonical
    repronsing3 = re.compile(r"^\* \{\{enPR\|(.*?)\}\}\n\* \{\{IPA\|(.*?)\}\}\n\* \{\{SAMPA\|(.*?)\}\}", re.M)
    repronsing3a = re.compile(r"^\* \{\{IPA\|(.*?)\}\}\n\* \{\{SAMPA\|(.*?)\}\}\n\* \{\{enPR\|(.*?)\}\}", re.M)
    repronsing2 = re.compile(r"^\* \{\{IPA\|(.*?)\}\}\n\* \{\{SAMPA\|(.*?)\}\}", re.M)
 
    # add links to form of to make pages countable
    Forms = [ 'es-verb form', 'superlative', 'comparative', 'alternative spelling', 'alternative form',
              'past', 'archaic spelling', 'fi-participle', 'present participle',
              'feminine', 'diminutive', 'obsolete spelling', 'infinitive',
              'plural', 'fi-form', 'pt-verb form' ]
    Frex = { }
    for form in Forms:
        Frex['make page count: add link in {{' + form + ' of}}'] = \
              (re.compile(r'\{\{(' + form +r') of\|([\w-]+)(}}|\|[^=}\[]+=[^}\[]+}})'), r'{{\1 of|[[\2]]\3')
        Frex['make page count: add link in {{' + form + ' of}} 2'] = \
              (re.compile(r'\{\{(' + form +r') of([^=}\[]+=[^}\[]+)\|([\w-]+)}}'), r'{{\1 of\2|[[\3]]}}')
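    # e.g. {{plural of|cat}} -> {{plural of|[[cat]]}}, so the entry links its
    # lemma and the page counts toward the article total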
 
    # make sure we are logged in
    site = wikipedia.getSite("en", "wiktionary")
    site.forceLogin(sysop = True)
    site.forceLogin(sysop = False)
 
    # get our config pages, throw exceptions: we have to stop if we can't read these
 
    print "read languages"
    page = wikipedia.Page(site, "User:AutoFormat/Languages")
    langtab = getwikitext(page)
    print "read headers"
    page = wikipedia.Page(site, "User:AutoFormat/Headers")
    headtab = getwikitext(page)
    print "read Top40"
    page = wikipedia.Page(site, "Wiktionary:Translations/Wikification")
    top40tab = getwikitext(page)
    print "read contexts"
    page = wikipedia.Page(site, "User:AutoFormat/Contexts")
    ctxtab = getwikitext(page)
    print "read etys"
    page = wikipedia.Page(site, "User:AutoFormat/Ety temps")
    etytab = getwikitext(page)
 
    Lcodes = { }
    Ltocode = { }
    relangtab = re.compile(r'\| (.*?)\|\|(.*)')
    i = 0
    for line in langtab.splitlines():
        mo = relangtab.match(line)
        if mo:
            for code in mo.group(1).split(','):
                Lcodes[code.strip()] = mo.group(2).strip()
                i += 1
            Ltocode[mo.group(2).strip()] = mo.group(1).split(',')[0].strip()
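    # rows in the Languages table look like "| sv, swe || Swedish": several codes
    # may share one name, and the first code listed becomes the Ltocode entry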
 
    print "found %d language codes" % i
 
    # treat a couple of other codes as Mandarin etc, since they are in cats:
    Lcodes['zh-cn'] = 'Mandarin'
    Lcodes['zh-tw'] = 'Mandarin'
    Lcodes['nan-cn'] = 'Min Nan'
    Lcodes['nan-tw'] = 'Min Nan'
    Lcodes['yue-cn'] = 'Cantonese'
    Lcodes['yue-hk'] = 'Cantonese'
 
    Level = { }
    L43 = { }
    POS = { }
    EOS = [ 'See also', 'References', 'External links', 'Anagrams', 'Dictionary notes', 'Trivia']
    TOS = [ 'Pronunciation', 'Alternative spellings', 'Alternative forms', 'Production' ]
    HAN = ['Han character', 'Kanji', 'Hanzi', 'Hanja']
    HT = ( '{{abbreviation', '{{initialism', '{{acronym', '{{numeral' )
    NS = { }
    Hfix = { }
    reheadtab = re.compile(r'\| (.*?)\|\|\s*([1-5/]*)\s*\|\|(.*?)\|\|(.*?)\|\|(.*)')
    i = 0
    for line in headtab.splitlines():
        mo = reheadtab.match(line)
        if mo:
            header = mo.group(1).strip()
            if mo.group(2).strip() == '4/3':
                L43[header] = True
                Level[header] = 4
                print "header %s is 4/3" % header
            else: Level[header] = int(mo.group(2))
            if mo.group(3).strip() == 'NS': ns = NS[header] = True
            else: ns = False
            if mo.group(4).strip() == 'POS': POS[header] = True
 
            for variant in mo.group(5).split(','):
                variant = variant.lower().strip()
                if not variant: continue
                Hfix[variant] = header
                """
                if not ns:
                    if variant.endswith('s'): Hfix[variant[:-1]] = header
                    else: Hfix[variant + 's'] = header
                """
 
            Hfix[header.lower()] = header
            if not ns:
               if header.endswith('s'): Hfix[header.lower()[:-1]] = header
               else: Hfix[header.lower() + 's'] = header
            i += 1
 
    print "found %d headers" % i
 
    # lots of possible ety sects, 1 to 24
    for i in range(1, 25):
        Hfix['etymology %d'%i] = 'Etymology %d'%i
        Level['Etymology %d'%i] = 3
 
    Top40 = { }
    Classics = { }
    retop40tab = re.compile(r'\*\s*(.*)')
    i = j = 0
    inT40 = True
    for line in top40tab.splitlines():
        if line.startswith('----'): inT40 = False
        mo = retop40tab.match(line)
        if mo:
            lang = mo.group(1).strip(' []')
        else: continue
        if inT40:
            Top40[lang] = True
            i += 1
        else:
            if lang in Top40:
                print "language %s in both Top40 and Classics?" % safe(lang)
                continue
            Classics[lang] = True
            j += 1
 
 
    print "found %d Top 40 languages" % i
    print "found %d Classic languages" % j
 
    # add all other known languages not in Top40:
    i = 0
    for code in Lcodes:
        lang = Lcodes[code]
        if lang not in Top40 and lang not in Classics:
            if lang == 'English': continue
            Classics[lang] = True
            i += 1
            # print "added Classic: %s" % safe(lang)
 
    print "added %d languages to Classics" % i
 
    Contexts = { }
    rectxtab = re.compile(r"\|\s*''(.*?)''\s*\|\|(.*)")
    i = 0
    for line in ctxtab.splitlines():
        mo = rectxtab.match(line)
        if mo:
            m1 = mo.group(1).strip()
            m2 = mo.group(2).strip()
            if not m1 or not m2: continue
            # only use first, table at top over-rides auto, templates over-ride redirects
            if m1 not in Contexts: Contexts[m1] = m2
            i += 1
 
    print "found %d context templates" % i
 
    # turn on/off for now
    contextp = True
 
    # Etyl conversions
    reetytab = re.compile(r'\| ?\{\{temp\|([A-Z][A-Za-z]*)\.\}\} ?\|\| ?([a-z]{2,3}) ?\|\|')
    i = 0
    for line in etytab.splitlines():
        mo = reetytab.match(line)
        if mo:
            m1 = mo.group(1)
            m2 = mo.group(2)
            if not m1 or not m2: continue
            # add regex:
            Regex['convert %s. to etyl|%s' % (m1, m2)] = \
                  (re.compile(r'\{\{' + m1 + r'\.(\|[a-z]{2,3}|)\}\}'), r'{{etyl|' + m2 + r'\1}}')
            print "add regex to convert %s. to etyl|%s" % (m1, m2)
            i += 1
 
    print "found %d ety template conversions" % i
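    # e.g. a table row "| {{temp|L.}} || la ||" (hypothetical entry) would add a
    # regex rewriting {{L.}} -> {{etyl|la}} and {{L.|fr}} -> {{etyl|la|fr}}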
 
    entries = 0
    fixed = 0
 
    # (specific stats)
 
    # Set up set of all headers that are valid (at L3 or higher)
 
    for header in Level:
        AH.add(header)
 
    # Sigh. True means prohibited from changing 4/3 levels
    Connel = True
 
    for page in rcpages(site):
 
        naptime += 3
        days = (time.time() - 1199145600) / 86400 # days since 1 Jan 08
        if random() < days/370: Connel = False    # some of the time, as they need to be checked
        else: Connel = True
 
        title = page.title()
 
        print "page %s" % safe(title)
 
        if ':' in title:
            print "not in main namespace"
            continue
 
        if title.lower() == 'main page':
            print "skip Main page ..."
            continue
 
        entries += 1
 
        try:
            # text = page.get()
            text = getwikitext(page)
            origtext = text
        except wikipedia.NoPage:
            print "Can't get %s from en.wikt" % safe(title)
            text = ''
            continue
        except wikipedia.IsRedirectPage:
            print "Redirect page %s" % safe(title)
            text = ''
            continue
        except wikipedia.LockedPage:
            print "Locked/protected page %s" % safe(title)
            text = ''
            continue
 
        acts = set()
 
        mo = retag.search(text)
        if mo:
            if mo.group(1).strip(' |'):
                acts.add('rm tag:' + mo.group(1).strip(' |'))
            else:
                acts.add('rm tag')
            text = retag.sub('', text)
 
        # rfc level trickery
        newtext = rerfclevel.sub('', text)
        if newtext != text:
            print 'took out rfc-level'
            acts.add('rm rfc-level tag')
            text = newtext
 
        # same for xphrase
        newtext = rerfcxphrase.sub('', text)
        if newtext != text:
            print 'took out rfc-xphrase'
            acts.add('rm rfc-xphrase tag')
            text = newtext
 
        # same for header
        newtext = rerfcheader.sub('', text)
        if newtext != text:
            print 'took out rfc-header'
            acts.add('rm rfc-header tag')
            text = newtext
 
        # same for subst
        newtext = rerfcsubst.sub('', text)
        if newtext != text:
            print 'took out rfc-subst'
            acts.add('rm rfc-subst tag')
            text = newtext
 
        # same for pron-n
        newtext = rerfcpronn.sub('', text)
        if newtext != text:
            print 'took out rfc-pron-n'
            acts.add('rm rfc-pron-n tag')
            text = newtext
 
        if '{{rfc' in text: rfc = True
        #elif '{{rfc|' in text: rfc = True
        #elif '{{rfc-' in text: rfc = True
        else: rfc = False
        rfcact = ''
 
        # overall regex, using table 
 
        for rx in Regex:
            newtext = Regex[rx][0].sub(Regex[rx][1], text)
            if newtext != text:
                acts.add(rx)
                text = newtext
 
        # report multiple blank lines (force save), will be taken out by parsing
        if '\n\n\n\n' in text:
             # 3 or more, not just 2
             acts.add("remove multiple blank lines")
 
        # categories found in the entry or implied by context and perhaps inflection templates
        catseen = set()
 
        # now parse. take the entry apart into languages (ha!)
 
        curr = '*prolog'
        last = ''
        Lsect = { '*prolog':[ ], '*iwiki':[ ] }
        Lcats = { }
        waslinked = [ ]
        divs = 0
        header = ''
 
        for line in text.splitlines():
 
            # canonical headers first. some later code is redundant, but so what? it does "rest"
            if line and line.startswith('='):
                mo = rehead1.match(line)
                if not mo: mo = rehead2.match(line)
                if not mo: mo = rehead3.match(line)
                if not mo: mo = rehead4.match(line)
                # must match 4 or else what?! (all eq = is the answer to this question!)
                if not mo:
                    mo = realleq.match(line)
                    if mo: acts.add("remove line of only ='s")
                    else: acts.add('remove bogus = line')
                    continue
                oline = line
                level = len(mo.group(1))
                if not mo.group(2).strip():
                    acts.add('removed nil header') # !!!
                    line = ''
                else: line = '='*level + mo.group(2).strip() + '='*level + mo.group(3)
                if line != oline: acts.add('format headers')
 
            # L2 headers
            mo = reL2head.match(line)
            if mo:
                header = mo.group(1).strip()
                hf = reunlink.sub(r'\1', header)
                if hf != header:
                    if '|' in hf: hf = hf.split('|')[1]
                    if hf not in Top40: waslinked.append(hf)
                    elif hf not in Level: acts.add('unlink language header ' + hf)
                    header = hf
 
                # validate language [needs to be fixed for case before first lang section!]
                if header.capitalize() in Level:
                    """
                    if not rfc:
                        text = '{{rfc-level|' + header + ' as level 2 header}}\n' + text
                        rfcact = 'add rfc-level tag for L1/2 header ' + header
                        rfc = True
                    else:
                        print "(no edit, bad L2 header and rfc)"
                        rfcact = 'bad L1/2 header ' + header
                    """
                    # try fixing, move to min level for this header:
                    level = Level[header.capitalize()]
                    acts.add('L1/2 header ' + header + ' to L' + str(level))
                    # header + anything else, will get moved later
                    Lsect[curr].append('='*level + header + '='*level + mo.group(2))
                    continue # with current language section
 
                # subst code template
                if header.startswith('{{'):
                    if header[2:-2] in Lcodes:
                        hf = Lcodes[header[2:-2]]
                        acts.add('L2 header -' + header + ' +' + hf)
                        header = hf
 
                # check sort order
                if header and last and lkey(header) < lkey(last):
                    acts.add(last + '/' + header + ' sorted into order')
                last = header
 
                if header not in Lsect:
                    Lsect[header] = [ ]
                    Lcats[header] = [ ]
 
                else:
                    acts.add('merged ' + header + ' sections')
                curr = header
                if mo.group(2).strip():
                    acts.add('stuff after L2 header moved')
                    Lsect[curr].append(mo.group(2).strip())
                continue
 
            # look for iwiki
            mo = reiwiki.match(line)
            if mo and mo.group(1) == title:
                Lsect['*iwiki'].append(line)
                continue
 
            # wiki format + one space
            line = restack.sub(r'\1 ', line)
 
            # trailing spaces
            if len(line) > 2 and line.startswith('=') and line.endswith(' '): acts.add('rm spaces after header')
            line = line.rstrip()
 
            # take out dividers
            if line.startswith('----'):
                if line == '----': divs += 1
                continue
 
            # other lines
            Lsect[curr].append(line)
 
        # any language sections?
        if len(Lsect) == 2:
            # no, tag if not tagged
            if ( 'nolanguage/box' not in text and '{{wikify' not in text and
                 '{{delete' not in text and '{{only in' not in text ):
                text = '{{subst:nolanguage}}\n' + text
                rfcact = 'tagged nolanguage'
                rfc = True
            else:
                print "(no edit, tagged nolanguage, wikify or delete)"
                continue # next entry
 
        # each section
 
        for lang in Lsect:
            if lang.startswith('*'): continue
            if lang in Ltocode: lcode = Ltocode[lang]
            else: lcode = ''
 
            # find Etymologies first
 
            etys = [ ]
            etycount = 0
            fh = True
            for i, line in enumerate(Lsect[lang]):
                # look for ety headers, and Pronunciation first at L4
                mo = reheader.match(line)
                if mo:
                    level = len(mo.group(1))
                    header = mo.group(2).strip()
                    # rest = mo.group(3)
 
                    # special case pronunciation, occurs with some frequency
 
                    if fh and level != 3 and fuzzy(header.lower(), 'pronunciation', 11) >= 11 and len(header) < 15:
                        acts.add('Pronunciation changed to level 3')
                        Lsect[lang][i] = '===' + header + '==='
                        # and leave fh set:
                        continue
 
                    # just do fuzzy!
                    if fuzzy(header.lower(), 'etymology', 7) >= 7 and len(header) < 20:
                        if level != 3:
                            if fh:
                                # first header, okay to fix!
                                acts.add('Etymology changed to level 3')
                                # and leave fh set:
                                etycount += 1
                                etys.append(i)
                                continue
                            elif not rfc:
                                Lsect[lang][i] = line + '{{rfc-level|Etymology not at level 3|lang=%s}}'%lcode
                                acts.add('+{{rfc-level|Etymology not at level 3}}')
                                rfc = True
                                continue
                            else:
                                print "(ety not at L3 and already rfc)"
                                continue
                        etycount += 1
                        etys.append(i)
                    fh = False
 
            # then fix/rewrite the ety headers, use sub to handle rest, report any changes (spacing an issue):
            if etycount:
                for i in range(etycount):
                    line = Lsect[lang][etys[i]]
                    # print 'ety check replace ' + line
                    if etycount > 1: newline = reheader.sub(r'===Etymology %d===\3' % (i+1), line)
                    else: newline = reheader.sub(r'===Etymology===\3', line)
                    if newline.strip('= ') != line.strip('= '):
                        acts.add('header -' + line.strip('= ') + ' +' + newline.strip('= '))
                    Lsect[lang][etys[i]] = newline
 
            # sigh, think that's it? Sweet, if true...
 
            # general format
            newlines = [ ]
 
            inPos = inTrans = inPro = inext = defnext = False
            npos = 0
            ety = nety = 0
            levelact = ''
            rfctag = ''
            header = ''
 
            for line in Lsect[lang]:
 
                # minor spacing on stackable wiktext ...
                # already done line = restack.sub(r'\1 ', line)
 
                # move cats, may be something else on the line too, or multicats ...
                # first we need a cat-present predicate
                catp = False
                for cat in recat.findall(line):
                    ocat = cat
                    catp = True
                    catname = cat[11:-2].split('|')[0]
                    catname = re.sub('_', ' ', catname).strip()
                    cf = cat.find('|')
                    if cf > 0: cat = '[[Category:' + catname + cat[cf:]
                    else: cat = '[[Category:' + catname + ']]'
                    # we have a canonical cat! is it a novel cat?
                    if cat in catseen:
                         acts.add('rm dup cat [[:' + cat[2:])
                         continue
                    catseen.add(cat)
                    # rm bad cats from substs left around, see how this works
                    if '{{{' in cat:
                         acts.add('rm bad cat [[:' + cat[2:])
                         continue
                    if cat != ocat: acts.add('canonical cats')
                    # see if it belongs in a different sect
                    catmove = False
 
                    if ':' in catname:
                        catcode = catname.split(':')[0]
                        if catcode in Lcodes:
                            catlang = Lcodes[catcode]
                            if catlang != lang and catlang in Lcats:
                                acts.add('category ' + catname + ' moved to ' + catlang + ' section')
                                Lcats[catlang].append(cat)
                                catmove = True
                    elif not catname.lstrip(' 01').startswith(lang) and not catname.endswith('derivations'):
                        for other in Lcats:
                            if other == lang: continue
                            if catname.lstrip(' 01').startswith(other+' '):
                                acts.add('category ' + catname + ' moved to ' + other + ' section')
                                Lcats[other].append(cat)
                                catmove = True
                                break
 
                    # not moved
                    if not catmove: Lcats[lang].append(cat)
                if catp:
                    line = recat.sub('', line).strip()
                    if not line: continue
 
                # headers
                mo = reheader.match(line)
                if mo:
                    # hit header with no infl/defn line in previous section?
                    if inext:
                        acts.add('added inflection line for %s/%s' % (lang, header))
                        newlines.append(infline(title, lcode, header))
                        newlines.append('')
                        inext = False
                        defnext = True
                    if defnext and header not in HAN:
                        newlines.append('# {{defn|%s}}' % lang)
                        acts.add('no definition line for %s/%s added {defn}' % (lang, header))
 
                    level = len(mo.group(1))
                    header = mo.group(2).strip()
                    rest = mo.group(3)
 
                    # unlink header
                    hf = reunlink.sub(r'\1', header)
                    if hf != header:
                        if hf.find('|') > 0: hf = hf.split('|')[1]
                        acts.add('header -' + header + ' +' + hf)
                        header = hf
 
                    # fix header
                    if header.lower() in Hfix:
                        hf = Hfix[header.lower()]
                        if hf != header:
                            acts.add('header -' + header + ' +' + hf)
                            header = hf
 
                    # try a fuzzy!
                    if header.lower() not in Hfix and not header.startswith('{{'):
                        high = 0
                        replac = ''
                        hf = header.strip('[]{}').lower()
                        for val in sorted(Hfix):
                            # first character must match
                            if hf[0] != val[0]: continue
                            rawsc = fuzzy(hf, val, len(val) - 4)
                            print safe('fuzzy "%s" "%s" score %d' % (hf, val, rawsc))
                            if rawsc > high and rawsc > max(max(len(hf), len(val)) - 3, 5):
                                high = rawsc
                                replac = val
                        print safe('fuzzy for %s: %s score %d' % (hf, replac, high))
                        if high:
                            hf = Hfix[replac]
                            acts.add('header -' + header + ' +' + hf)
                            header = hf
 
                    # tag Transitive and Intransitive verb, and Reflexive
                    if header.lower() in ('transitive verb', 'intransitive verb', 'reflexive verb') and not rfc:
                        rfctag = '{{rfc-trverb|' + header + '}}'
                        rfc = True
                        # print "trans/intrans header: %s" % safe(header)
 
                    # tag X phrase
                    if header.endswith(' phrase') and not rfc:
                        rfctag = '{{rfc-xphrase|' + header + '}}'
                        rfc = True
                        # print "X phrase header: %s" % safe(header)
 
                    # tag Pronunciation N headers, preventing the level errors later
                    if repronn.match(header) and not rfc:
                        # not sure the header is needed in the template, but it follows the pattern (with a |)
                        rfctag = '{{rfc-pron-n|' + header + '}}'
                        rfc = True
 
                    # rfc unrecognized, ignore templates for now, use NS later
                    if header.lower() not in Hfix and not rfc and not header.startswith('{{'):
                        rfctag = '{{rfc-header|' + header + '}}'
                        rfc = True
                        # print "unknown header: %s" % safe(header)
 
                    # min level, set and comp for nested ety
                    if level == 3 and header.startswith("Etymology") and etycount > 1:
                        ety = 1
                        nety += 1
                        npos = 0
                        push = False
                    else:
                        if ety:
                            # if we are in the last ety sect, and see end of section things at L3:
                            if level < 4 and nety == etycount and header in EOS: inPos = ety = 0
                            # and ... independent of connel flag, because we always push ;-)
                            if level < 4 and nety == etycount and header in L43: inPos = ety = 0
 
                        # push POS (or level 3?) sections down in ety, push flag because of Connel fix
                        # may be a good idea anyway ... yes, but if we rfc, stop
                        if ety and not rfc:
                            if (header in POS and header not in HAN or header in TOS) and level == 3:
                                level = 4
                                acts.add('header in ety sect ' + header + ' to L' + str(level))
                                if header == 'Pronunciation':
                                    rfctag = '{{rfc-level|check placement of Pronunciation}}'
                                push = True
                            elif header in POS and header not in HAN or header in TOS:
                                # at correct level! (or too deep already)
                                push = False
                            elif push and header in Level and (level == 4 or level < Level[header] + ety):
                                level += 1
                                acts.add('header in ety sect ' + header + ' to L' + str(level))
                            elif level < 4: push = False
 
                        # code to shift header levels (general case in POS), disabled per Connel, 18.4.7
                        if inPos and header in L43:
                            if npos < 2 and level < 4 + ety:
                                if not Connel:
                                    level = 4 + ety
                                    acts.add('header ' + header + ' to L' + str(level))
                                else: levelact = ' (AutoFormat would have corrected level of ' + header +')'
                        elif inPos and header in Level:
                            if level < Level[header] + ety:
                                if not Connel:
                                    level = Level[header] + ety
                                    acts.add('header ' + header + ' to L' + str(level))
                                else: levelact = ' (AutoFormat would have corrected level of ' + header +')'
 
                        # now tag remaining problems if any, various cases
                        # should all contain "+" for the re-visit trick ...
                        if not rfc:
                           if level == 4 + ety and not inPos and header in POS and header not in NS:
                              rfctag = '{{rfc-level|' + header + ' at L4+ not in L3 Ety section' + levelact + '}}'
                           elif level == 4 + ety and not inPos and header in Level and header not in NS:
                              rfctag = '{{rfc-level|' + header + ' at L4+ not in L3 POS section' + levelact + '}}'
                           elif level == 3 + ety and header.startswith('Translation'):
                              rfctag = '{{rfc-level|' + header + ' at L3+' + levelact + '}}'
                           elif level == 5 + ety and not inTrans and header.startswith('Translations to'):
                              rfctag = '{{rfc-level|' + header + ' at L5+, not in Translations' + levelact + '}}'
 
                    # blank line
                    newlines.append('')
 
                    # header + anything else that wasn't blank
                    newlines.append('='*level + header + '='*level)
                    if rest.strip():
                        if not rest.startswith('{{rfc-'): acts.add('moved stuff after ' + header + ' header')
                        newlines.append(rest.strip())
                    # Usage notes can be anywhere (see ELE)
                    if 'rfc-level|Usage notes' in rfctag: rfctag = ''
                    # suppress the "AF would have" now, just don't tag:
                    if "AutoFormat would have" in rfctag: rfctag = ''
                    if rfctag:
                        if lcode: rfctag = rfctag[:-2] + '|lang=%s}}'%lcode
                        acts.add('+' + rfctag)
                        if 'check placement' not in rfctag: rfc = True
                        newlines.append(rfctag)
                        rfctag = ''
 
                    # set flags:
                    inext = defnext = False
                    if level < 4 + ety and (header in POS or header.startswith(HT)):
                        inext = inPos = True
                        npos += 1
                    elif level < 4 + ety: inPos = False
                    inTrans = (header == 'Translations')
                    tt = False
                    inPro = (header == 'Pronunciation')
                    continue
 
                # look for inflection line
                if inext:
                    if line.startswith('{{') and not line.startswith('{{wikipedia') or line.startswith("'''") or   \
                            fuzzy(line, title, len(title) - 1) > len(title) - 1:
                        if line == title:
                            acts.add('replace unformatted headword')
                            continue
                        inext = False
                        defnext = True
                    if line and line.startswith('#'):
                        acts.add('added inflection line for %s/%s' % (lang, header))
                        newlines.append(infline(title, lcode, header))
                        defnext = True
                        inext = False
                        # and also do next case for defnext
                    # elide blanks above inflection line
                    if not line: continue
 
                # look for definition lines
                if defnext and line.startswith('#'):
                    newlines.append('')
                    defnext = False
                # # used where it shouldn't be
                if line.startswith('#') and header not in POS:
                    if header in TOS or header in EOS or (header in Level and Level[header] == 4):
                        line = '*' + line[1:]
                        acts.add("-# +* in %s section" % header)
 
                # serious stuff ...
 
                if line.startswith('# '):
 
                    # look for context tag
                    if lang in Ltocode:
                        ctxn = 1
                        mo = recontext.match(line)
                        if not mo:
                            ctxn = 2
                            mo = recontext2.match(line)
                        if not mo:
                            ctxn = 3
                            mo = recontext3.match(line)
                        if mo:
                            print "match context tag %s" % safe(mo.group(1))
                            tname = cpar(mo.group(1), Contexts)
                        if mo and tname:
                            if lang != 'English': tname += '|lang=' + Ltocode[lang]
                            if contextp and ctxn == 1:
                                acts.add("-(''" + mo.group(1) + "'') +{{" + tname + "}}")
                                line = recontext.sub(r'# {{' + tname + r'}} \2', line)
                            elif contextp and ctxn == 2:
                                acts.add("-''(" + mo.group(1) + ")'' +{{" + tname + "}}")
                                line = recontext2.sub(r'# {{' + tname + r'}} \2', line)
                            elif contextp and ctxn == 3:
                                acts.add("-{{italbrac|" + mo.group(1) + "}} +{{" + tname + "}}")
                                line = recontext3.sub(r'# {{' + tname + r'}} \2', line)
                            else: print "would have replaced %s with %s" % (safe(mo.group(1)), safe(tname))
                            # elide cats that correspond
                            for catname in tname.split('|'):
                                if catname == 'context' or catname.startswith('lang='): continue
                                catname = catname[0].upper() + catname[1:]
                                # code is prefix ...
                                if lang != 'English': catname = Ltocode[lang] + ':' + catname
                                if contextp: 
                                    catseen.add('[[Category:' + catname + ']]')
                                    # catseen.add('[[Category:' + catname + 's]]')
                                    print "added catseen %s" % safe(catname)
 
                    # wikilinking?
                """
                # (remember to correct for spacing)
                elif not line.startswith('#') and not inTrans and "''" in line:
                    # look for italbrac cases not on defn lines
                    newl = reibcomma.sub(ibsub, line)
                    newl = reibcomma2.sub(ibsub, newl)
                    if newl != line:
                        # acts.add('-' + line + ' +' + newl)
                        # acts.add('template i')
                        # in pronunciation, use a, anywhere else, we want i-c if at start of * line
                        if inPro:
                            newl = re.sub(r'\{\{(i|i-c)\|', '{{a|', newl)
                        else:
                            newl = re.sub(r'\{\{i\|', '{{i-c|', newl)
                        acts.add(sdif(line, newl))
                        line = newl
                    # think that will work?
                 """
                # translations lines
 
                # stopgap check (should be improved; tsort knows how to handle this)
                if '{{ttbc|' in line: inTrans = False
 
                if inTrans:
                    # special indent rule, we know there is a previous line
                    if line.startswith(': ') and newlines[-1:][0].startswith('*'):
                        acts.add('-: +*: in trans')
                        line = '*' + line
                    # similar rule for :*, we leave ** alone (is correct for grouped language)
                    # may have intended **, but this is better than leaving it :*
                    if line.startswith(':* ') and newlines[-1:][0].startswith('*'):
                        acts.add('-:* +*: in trans')
                        line = '*:' + line[2:]
                    was = False
                    mo = retrans1.match(line)
                    if not mo: mo = retrans2.match(line)
                    if mo: was = True
                    if not mo: mo = retrans3.match(line)
                    if not mo:
                         mo = retrans4.match(line)
                         if mo: # missing ':'
                             tlang = mo.group(1).strip()
                             if tlang in Top40 or tlang in Classics:
                                 acts.add("added : after %s in translations" % tlang)
                             else: mo = None
                    if mo:
                        tlang = mo.group(1).strip()
                        if was and tlang.find('|') > 0: tlang = tlang.split('|')[1]
                        trest = mo.group(2).strip()
 
                        if tlang.startswith('{{') and tlang[2:-2] in Lcodes:
                            acts.add('subst %s in trans' % tlang)
                            tlang = Lcodes[tlang[2:-2]]
                            was = False
 
                        if was and (tlang in Top40 or title == tlang):
                            acts.add('trans unlink ' + tlang)
                        elif not was and tlang in Classics and title != tlang:
                            tlang = '[[' + tlang + ']]'
                            acts.add('trans link ' + tlang)
                        elif was:
                            # leave as is (was)
                            tlang = '[[' + tlang + ']]'
 
                        # conform gender specification templates
                        # tr = regender.sub(r'{{\1}}', trest)
                        tr = trest
                        for rx in Trex:
                            tr = rx[0].sub(rx[1], tr)
                        if tr != trest:
                            #acts.add('gender -' + trest + ' +' + tr)
                            acts.add('gender ' + sdif(trest, tr))
                            trest = tr
 
                        if trest: line = '* ' + tlang + ': ' + trest
                        else: line = '* ' + tlang + ':'
 
                    # convert templates
                    # has to be a non-blank previous line, we are in trans section
 
                    if line == '{{rfc-trans}}': inTrans = False
                    if line == '{{checktrans}}': inTrans = False
                    if line == '{{checktrans-top}}': inTrans = False
                    if line == '{{ttbc-top}}': inTrans = False
 
                    mo = retopgloss.match(line)
                    if mo:
                        if mo.group(1):
                            gloss = mo.group(1)[1:]
                        else:
                            prev = newlines[-1:][0]
                            while not prev:
                               newlines = newlines[:-1]
                               prev = newlines[-1:][0]
                            if prev.startswith(';'): gloss = prev[1:]
                            elif prev.startswith("'''") and prev.endswith("'''"): gloss = prev[3:-3]
                            else: gloss = ''
                            if gloss: newlines = newlines[:-1]
                        if gloss:
                            gloss = reglossfix.sub(r'\1', gloss).strip()
                            prev = line
                            line = '{{trans-top|' + gloss + '}}'
                            # <- else: line = '{{trans-top}}'
                            acts.add('-' + prev + ' +' + line)
                            tt = True
                    if tt and line == '{{mid}}':
                        line = '{{trans-mid}}'
                    if tt and line == '{{bottom}}':
                        newlines.append('{{trans-bottom}}')
                        # add blank line
                        line = ''
                        tt = False
 
                    # end of trans
 
                # templates that should have * outside them
                mo = restartemp.match(line)
                if mo and mo.group(1) in StarTemp:
                    line = '* ' + line
                    acts.add('* before ' + mo.group(1))
 
                # pronunciation specific
                if inPro:
                    refire = True
                    while refire:
                        refire = False
                        for rx in Prex:
                             # enPR-specific rules apply only to English entries
                            if "enPR" in rx and lcode != "en": continue
                            line, k = Prex[rx][0].subn(Prex[rx][1], line)
                            if k:
                                acts.add(rx)
                                refire = True # fire ruleset again
 
                    if 'IPA' in line and lcode and lcode != 'en' and '|lang=' not in line:
                        line, k = reIPAlang.subn(r'\1|lang=' + lcode + '}}', line)
                        if k: acts.add('added lang=' + lcode + ' to IPA')
 
                    if line == '{{rfp}}' and lcode and lcode != 'en':
                        line = '{{rfp|lang=' + lcode + '}}'
                        acts.add('added lang=' + lcode + ' to rfp')
 
                # move {{also}} to prolog, we are in a language section
                if line.startswith("{{also|"):
                    Lsect['*prolog'].append(line)
                    acts.add("moved {{also}} to prolog")
                    continue
 
                # all else
                newlines.append(line)
 
            # at end with no infl / defn line in previous section?
            if inext:
                acts.add('added inflection line for %s/%s' % (lang, header))
                newlines.append(infline(title, lcode, header))
                newlines.append('')
                inext = False
                defnext = True
            if defnext and (header not in HAN or npos == 1):
                newlines.append('# {{defn|%s}}' % lang)
                acts.add('no definition line for %s/%s added {defn}' % (lang, header))
 
            # done with sect
            Lsect[lang] = newlines
 
        # reassemble ...
 
        newtext = ''
        prior = False
 
        # sort prolog, and add to newtext
        if len(Lsect) > 2:
            pcopy = sorted(Lsect['*prolog'], key=prokey)  # shallow copy, sorted
            if pcopy != Lsect['*prolog']: acts.add('sorted prolog')
        else: pcopy = Lsect['*prolog']  # no language sections, leave "prolog" alone
        for line in pcopy:
            # no blank lines
            if line: newtext += line + '\n'
            if line.startswith('=') and not rfc:
                newtext += '{{rfc-level|header line in prolog, before first L2 header}}\n'
                acts.add('tagged header before first L2 header')
        del Lsect['*prolog']
        blank = True # not really, this is to suppress blank before 1st L2 header
 
        for lang in sorted(Lsect, key=lkey):
            if lang == '*iwiki': continue
            if prior:
                if not blank: newtext += '\n'
                newtext += '----\n\n'
                divs -= 1
            prior = True
            if lang not in waslinked: newtext += '==' + lang + '==\n'
            else: newtext += '==[[' + lang + ']]==\n'
            blank = False
            for line in Lsect[lang]:
                # no dup blank lines
                if line or not blank: newtext += line + '\n'
                if line: blank = False
                else: blank = True
            if Lcats[lang]:
                if not blank: newtext += '\n'
                # (note lkey is a different function, but does strip brackets, so works ...)
                for cat in sorted(Lcats[lang], key=lkey): newtext += cat + '\n'
                blank = False
            del Lsect[lang]
 
        # residual tag(s):
        if ('{{{' in newtext and '}}}' in newtext) or '{{#' in newtext:
            acts.add('+{{rfc-subst}} syntax tag')
            newtext += '{{rfc-subst}}\n\n' # force newline even if at end
            blank = True
 
        # add the iwikis
        if not blank: newtext += '\n'
        for line in Lsect['*iwiki']:
            # no blank lines
            if line: newtext += line + '\n'
 
        if divs != 0: acts.add("fixed ----'s")
 
        # rfc-level, etc trickery: if this run both removed a tag of some
        # type and added one back, drop the matched pair of actions
        for rfname in ('level', 'xphrase', 'header', 'subst', 'pron-n'):
            if 'rm rfc-' + rfname + ' tag' in acts:
                for ac in sorted(acts):
                    if ac.startswith('+{{rfc-' + rfname):
                        acts.remove('rm rfc-' + rfname + ' tag')
                        acts.remove(ac)
                        print 'elided -' + rfname + ' +' + rfname
                        break
 
        # sort translations if any, if not tagged already:
        if "{{trans-top" in newtext and "{{rfc-tsort" not in newtext:
            new2 = retransect.sub(transort, newtext)
            if new2 != newtext:
                if "{{trans-see" in new2 and "{{trans-see" not in newtext: acts.add("+trans-see template")
                if "{{rfc-tsort" not in new2: acts.add("sorted/rebalanced translations")
                else: acts.add("tagged translations table problem")
                newtext = new2
 
        # do some combining of pron lines, now that we've done the rulesets:
        newtext, k = repronsing3.subn(r"* {{enPR|\1}}, {{IPA|\2}}, {{SAMPA|\3}}", newtext)
        if k: acts.add("combined enPR, IPA, SAMPA on one line")
        # variant order
        newtext, k = repronsing3a.subn(r"* {{enPR|\3}}, {{IPA|\1}}, {{SAMPA|\2}}", newtext)
        if k: acts.add("combined enPR, IPA, SAMPA on one line")
        newtext, k = repronsing2.subn(r"* {{IPA|\1}}, {{SAMPA|\2}}", newtext)
        if k: acts.add("combined IPA and SAMPA on one line")
 
        # if page isn't "countable", see if we can add a link in a form-of template
        if '[[' not in newtext:
            for rx in Frex:
                newtext, k = Frex[rx][0].subn(Frex[rx][1], newtext)
                if k:
                    acts.add(rx)
                    break # only need one
            if '[[' not in newtext: print "page still not counted in stats"
 
        # do minor spacing 1% of the time that there is nothing else to do
        if not acts and random() < 0.01 and newtext.rstrip(' \n') != text.rstrip(' \n'):
            acts.add('minor spacing')
 
        # if we added a major rfc, just do that, dump the rest of the work!!
        if rfcact:
            acts = set()
            acts.add(rfcact)
            newtext = text
 
        act = ', '.join(sorted(acts))
 
        # some change, write it (even just rm tag)
        if act:
 
            fixed += 1
            naptime /= 2
 
            print "format %s: %s" % (safe(title), safe(act))
 
            saved = False
            retries = 5
 
            while not saved and retries:
                # try to fix the entry
                try:
                    currtext = getedit(page)
                    if currtext.strip('\n ') != origtext.strip('\n '):
                        print "page changed while doing format, not saved"
                        break
                    wikipedia.setAction(act)
                    page.put(newtext)
                    saved = True
                except wikipedia.PageNotSaved:
                    print "failed to save page"
                    # other action?
                except socket.timeout:
                    print "socket timeout, maybe not saving page"
                except socket.error:
                    print "socket error, maybe not saving page"
                except Exception, e:
                    print "some other error saving page, no retry"
                    print str(e)
                    break
                # put throttle will do: if not saved: time.sleep(30)
                retries -= 1
 
        # end loop
 
    print "entries fixed %d" % fixed
 
    # done
 
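# --- editor's addition: illustrative sketch, not part of AF -----------
# The fuzzy header-correction pass in the main loop accepts a candidate
# header only when the first characters match and the similarity score
# beats max(len - 3, 5) for the longer of the two strings.  lcs_len()
# below is a hypothetical stand-in for AF's fuzzy() scorer, which is
# defined elsewhere in this file; guess_header() mirrors the selection
# loop over a dict shaped like Hfix.
def lcs_len(a, b):
    # longest-common-subsequence length by the classic row-at-a-time DP
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def guess_header(header, hfix):
    # hfix maps lowercased canonical headers to preferred forms, like Hfix
    high, replac = 0, ''
    hf = header.strip('[]{}').lower()
    for val in sorted(hfix):
        if hf[:1] != val[:1]: continue   # first character must match
        rawsc = lcs_len(hf, val)
        if rawsc > high and rawsc > max(max(len(hf), len(val)) - 3, 5):
            high, replac = rawsc, val
    return hfix[replac] if high else header
# e.g. guess_header('Pronounciation', {'pronunciation': 'Pronunciation'})
# repairs the misspelling; strings with no close match come back as-is.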
if __name__ == "__main__":
    try:
        main()
    finally:
        wikipedia.stopme()
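# --- editor's addition: illustrative sketch, not part of AF -----------
# The reassembly loop above suppresses duplicate blank lines with a
# one-line-lookback flag; here is the same rule in isolation.  Starting
# with blank=True also swallows leading blanks, as AF does before the
# first L2 header.
def squeeze_blanks(lines, blank=True):
    out = []
    for line in lines:
        # keep a line unless it is blank and the last kept line was too
        if line or not blank:
            out.append(line)
        blank = not line
    return out
# e.g. squeeze_blanks(['', 'a', '', '', 'b']) gives ['a', '', 'b'].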