User:Interwicket/code
notice
This comes with several very important caveats:
- I am a professional software engineer, and this is what I do; however, this code was written for my own use. It is not warranted and does not carry any implication of merchantability or fitness for use.
- Like everything else in the Wiktionary, this is under the GFDL. The GFDL is not compatible with the GPL, so this document is not licensed under the GPL as software. (!)
- At any given moment, this code may not represent what is being run; I have no intention of updating this page every time I make a change.
technical notes
This code runs on a highly modified version of the Python Wikipedia framework; the standard version is terrible at handling network faults. It is provided here for interest and reference; don't try to run it as is! If you would like to use it, either steal (;-) all you like on your own responsibility, or ask me; I'll be very glad to help.
I will be trying to provide a more directly useful version!
Specific issues:
- The "allpages" page generator in wikipedia.py does not handle conditions like socket.timeout. The caller cannot handle the condition either, because it can't just re-call the generator. So the code is modified to catch these conditions and retry the HTTP request. This is important because the program takes about 20 minutes to initialize, and restarting it because of some network transient is annoying.
- There is a bug in the page instance initializer. It tries to "clean up" a title by doing
t = re.sub('[ _]+', ' ', t).strip()
which is no good: a title can start or end with things like U+3000 IDEOGRAPHIC SPACE or U+00A0 NO-BREAK SPACE, and the strip method on a unicode string removes all whitespace by default, including those characters. It should be:
t = re.sub('[ _]+', ' ', t).strip(' ')
which removes only U+0020 SPACE.
- The code in pagegenerators.py is modified. This is fairly trivial; it can probably be fixed to use the standard version, or the class can simply be copied into iwikt.py. (I intend to provide a new allpages module using the MW API and more optimizations.)
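The first two issues can be illustrated with a short sketch. It is not the modified framework code: `fetch_batch` is a hypothetical stand-in for one HTTP request of the allpages listing, and `retrying_pages`, `clean_title_buggy` and `clean_title_fixed` are names made up for the example. The point of the first function is that a generator is finished once an exception escapes it, so the retry has to live inside the generator. The other two show what the default strip does to a title that consists only of U+3000.

```python
import re
import socket
import time


def retrying_pages(fetch_batch, start, retries=5, delay=1.0):
    """Yield items batch by batch, retrying a failed request in place.

    fetch_batch(start) is assumed (hypothetically) to return a pair
    (items, next_start), with next_start set to None after the last batch.
    """
    while start is not None:
        for attempt in range(retries):
            try:
                items, next_start = fetch_batch(start)
                break
            except (socket.timeout, socket.error):
                # network transient: wait a little and repeat the same request
                time.sleep(delay)
        else:
            # every attempt failed; only now does the caller see an error
            raise IOError("giving up at %r after %d attempts" % (start, retries))
        for item in items:
            yield item
        start = next_start


def clean_title_buggy(t):
    # default strip() removes every whitespace character, not just U+0020
    return re.sub('[ _]+', ' ', t).strip()


def clean_title_fixed(t):
    # strip(' ') removes only U+0020 SPACE
    return re.sub('[ _]+', ' ', t).strip(' ')
```

With these definitions, `clean_title_buggy(u'\u3000')` returns an empty string, while `clean_title_fixed(u'\u3000')` returns the title unchanged.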
outline
- tbw
code
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
This bot updates iwiki links between wiktionaries
runs on modified pywikipedia framework at this time, will be fixing that
Start point is command line argument -start: (character set an issue; doesn't do escapes yet)
With a current XML, use -xml -update
With a stale XML, don't use it, or use -xml by itself
With -rec[iprocal]: will operate in en.wikt Interwicket bot mode for updates to FL.wikts
"""
import wikipedia
import xmlreader
import sys
import socket
import re
import pickle
import pagegenerators
import time
from config import usernames
from mwapi import getwikitext, getedit
from reciprocal import addrci
def safe(s):
    return pickle.dumps(s)[1:-5]
# Iwiki cache:
import shelve
Iwikis = None
def iwopen(home):
    global Iwikis
    Iwikis = shelve.open(home + "-iwiki-cache")
cis = 0
def iwadd(title, iws, upd = True):
    global cis
    if safe(title) in Iwikis and not upd: return
    if not iws or not len(iws): return
    # print "iwikis cache %s: %s" % (safe(title), safe(u' '.join(iws)))
    Iwikis[safe(title)] = iws
    cis += 1
    # sync to disk every 100 additions
    if cis % 100 == 0: Iwikis.sync()
    return
Redirs = set()
Sorts = set()
Lcodes = { }
Exists = set()
Skips = { }
Sites = { }
# merge to union index:
# read the indexes twice (although disjoint sets), once for non-redirects, once for redirects
# return dict of all pages, and dict of redirect pages
# for home wikt we don't need redirs, we won't look at those titles at all
def merge(home, start = '!'):
    pages = { }
    current = { }
    curkey = { }
    # get site encoding, which will be UTF-8 ...
    myenc = wikipedia.getSite(home, 'wiktionary').encoding()
    for code in sorted(Exists):
        # do this twice:
        pages[code] = pagegenerators.AllpagesPageGenerator(site=wikipedia.getSite(code, 'wiktionary'),
                          start=start, namespace = 0, redirects = False).__iter__()
        if code == home: continue
        coder = code + '*'
        pages[coder] = pagegenerators.AllpagesPageGenerator(site=wikipedia.getSite(code, 'wiktionary'),
                          start=start, namespace = 0, redirects = True).__iter__()
    for code in sorted(pages):
        if "*" not in code: print "setting up %s" % code
        else: print "setting up %s (redirects)" % code[:-1]
        try:
            current[code] = pages[code].next()
            curkey[code] = current[code].title(underscore = True).encode(myenc)
            print "first page is " + safe(current[code].title())
        except StopIteration:
            print "no pages from start in %s" % code
        except wikipedia.ServerError:
            print "bad format in %s wikt, not updating iwikis" % code
            Skips[code] = True
    # now, while the home wikt has pages (else there is nothing to do):
    while home in current:
        lps = { }
        reds = { }
        # find the lowest collating title to return
        # needs UTF-8 keys to make this work correctly for planes > 0
        # page.page_title in the SQL DB has spaces replaced with underscores, so that's the order we get
        lowest = curkey[home]
        lowtitle = current[home].title()
        for code in current:
            if curkey[code] < lowest:
                lowest = curkey[code]
                lowtitle = current[code].title()
        # now collect all of them, use keys() so we can del from the dict and not the iterator
        for code in current.keys():
            if curkey[code] == lowest:
                if "*" not in code:
                    lps[code] = current[code]
                else:
                    lps[code[:-1]] = current[code]
                    reds[code[:-1]] = current[code]
                try:
                    current[code] = pages[code].next()
                    curkey[code] = current[code].title(underscore = True).encode(myenc)
                except StopIteration:
                    print "no pages left in %s" % code
                    del current[code]
                    # this may have been a bad page read, we don't know, don't remove iwikis for this code
                    Skips[code] = True
                except wikipedia.ServerError:
                    # this may have been a bad page read, we don't know, don't remove iwikis for this code
                    # as of 24.5.8, 'kk' has some serious issue that throws this (no links on page)
                    print "bad format in %s wikt, not updating iwikis from this point" % code
                    Skips[code] = True
        # we have a list of links at the lowest title
        yield lowtitle, lps, reds
    # no more English (or home wikt), done.
def main():
    socket.setdefaulttimeout(40)
    # an iwiki on a line by itself as it should be
    # if more than one we end up putting it in Sorts
    # if we miss one then we will try to add it, and all will be well
    reiwiki = re.compile(r'^\[\[([-a-z]{2,10}):(.*?)\]\]$', re.M)
    home = 'en'
    # start = u'\u04e9'
    # note that starting above the start of the "UTF-16" range will not work correctly (> D800)
    # it will run correctly started below this point, e.g. D000 in the Hangeul block
    start = '!'
    stop = ''
    newonly = False
    sort = False
    xml = False
    update = False
    recp = False
    for arg in sys.argv[1:]:
        if arg.startswith('-start:'):
            start = arg[7:]
            print "starting at %s" % start
        elif arg.startswith('-stop:'):
            stop = arg[6:]
            print "stopping at %s" % stop
        elif arg.startswith('-new'):
            newonly = True
            print "new entries only"
        elif arg.startswith('-sort'):
            sort = True
            print "do edits for sort"
        elif arg.startswith('-xml'):
            xml = True
            print "read XML file"
        elif arg.startswith('-update'):
            update = True
            print "update cache from XML (XML is current!)"
        elif arg.startswith('-rec'):
            recp = True
            print "will create reciprocal links"
        else: print "unknown command line argument %s" % arg
    mysite = wikipedia.getSite(home, 'wiktionary')
    # make sure we are logged in
    mysite.forceLogin()
    meta = wikipedia.getSite(code = "meta", fam = "meta")
    # get active wikt list
    # minus crap. Tokipona? what are they thinking? Klingon? ;-)
    Lstops = ['tokipona', 'tlh']
    page = wikipedia.Page(meta, "List of Wiktionaries/Table")
    existtab = getwikitext(page)
    # reextab = re.compile(r'^\[\[:([a-z-]+):')
    reextab = re.compile(r'\| \[http://([a-z-]+)\.wiktionary\.org')
    i = 0
    for line in existtab.splitlines():
        i += 1
        mo = reextab.match(line)
        if mo:
            if mo.group(1) in Lstops: continue
            lc = mo.group(1)
            Exists.add(lc)
            # see if we have a login in user config, else pretend we do
            # has to be done before any call, or login status gets confused!
            if lc not in usernames['wiktionary']:
                usernames['wiktionary'][lc] = "Interwicket"
            Sites[lc] = wikipedia.getSite(lc, 'wiktionary')
    print "%d lines" % i
    print "found %d active wikts" % len(Exists)
    if len(Exists) < 150: return
    # sort order
    pf = mysite.interwiki_putfirst()
    Idx = { }
    i = 1
    for code in pf:
        Idx[code] = i
        i += 1
    for code in sorted(Exists):
        if code not in Idx:
            Idx[code] = i
            i += 1
            print "code %s exists, not in putfirst list, sorts at end" % code
    imax = i
    # naps ... ;-)
    naptime = 0
    maxnap = 340
    # Iwikis cache
    iwopen(home)
    # build table of iwikis from xml:
    iws = { } # in memory cache
    if xml:
        # get XML dump
        dump = xmlreader.XmlDump("../hancheck/en-wikt.xml")
        ti = 0
        entries = 0
        reds = 0
        iws = { } # in memory cache
        for entry in dump.parse():
            text = entry.text
            title = entry.title
            if title.find(':') >= 0: continue
            if title < start or (stop and title > stop): continue
            if text.startswith('#'):
                Redirs.add(title)
                reds += 1
                continue
            entries += 1
            if entries % 5000 == 0: print "prescan %d entries, %d iwikis, %d redirects" % (entries, ti, reds)
            i = 0
            first = None
            iw = [ ]
            for mo in reiwiki.finditer(text):
                code = mo.group(1)
                if mo.group(2) == title:
                    iw.append(code)
                    if not first: first = mo.group(0)
                    ti += 1
                    if code in Idx:
                        if Idx[code] <= i and sort:
                            Sorts.add(title)
                            print "added %s to set needing sort" % safe(title)
                        i = Idx[code]
                    else:
                        print "added %s to sort, code %s not in putfirst list" % (safe(title), safe(code))
                        Sorts.add(title)
                        i = imax
                elif code in Idx:
                    # apparent iwiki to different title, use sorts set
                    Sorts.add(title)
                    print "added %s to sort, iwiki is %s" % (safe(title), safe(mo.group(0)))
            if first and sort:
                # is there other text after first iwiki?
                str = text[text.find(first):]
                str = reiwiki.sub(' ', str).strip()
                if str:
                    print "text after first iwiki in %s, added to sort" % safe(title)
                    Sorts.add(title)
            # keep for the time present if not in the cache or if we want to update all
            if iw and (update or safe(title) not in Iwikis):
                iws[title] = iw
                # if not update: print "precache %s: %s" % (safe(title), safe(u' '.join(iw)))
            continue
        print "totals %d entries, %d iwikis, %d redirects" % (entries, ti, reds)
    # now look for iwikis needed
    union = 0
    entries = 0
    probs = 0
    fixed = 0
    for title, links, redirs in merge(home = home, start = start):
        #if title.find(':') >= 0: continue
        if title.lower() == 'main page': continue
        if stop and title > stop:
            print 'stopping at/after %s' % safe(stop)
            break
        # if we cached iws internally, add to "real" cache, only if not there previously (modulo update flag):
        if title in iws:
            iwadd(title, iws[title], upd = update)
            del iws[title]
        # and then continue as before
        # temp, report so we know where we are:
        """
        iw = ''
        for code in links: iw += ' ' + code
        print 'page %s:%s' % (safe(title), safe(iw))
        """
        union += 1
        if union % 200 == 0:
            print "%d in union index, %d entries, %d possible, %d updated" % (union, entries, probs, fixed)
        # English/home wikt in set?
        if home in links:
            page = links[home]
            del links[home]
        else: continue
        entries += 1
        # screen entries:
        tag = False
        # do we have the entry?
        if safe(title) in Iwikis and not newonly:
            for code in Iwikis[safe(title)]:
                if code in Skips: continue
                if code not in links: tag = True
            for code in links:
                if code in Skips: continue
                if code not in Iwikis[safe(title)]: tag = True
        # if not, and not a redirect, and there are other links pick it up as new:
        elif safe(title) not in Iwikis and title not in Redirs and links: tag = True
        # n.b.: the case where it is new, but has a bad link that should be removed will be caught
        # the next time when it is in the XML
        # iwikis out of order, or ref to different title (ignoring "newonly" at the present ;-)
        if title in Sorts: tag = True
        # now see if it is something that should be tagged/replaced:
        if tag:
            probs += 1
            naptime += 1
            # ... pick up current version from en.wikt
            print '%s: %s (%s)' % (safe(title), safe(u','.join(links.keys())),
                                   safe(u','.join(redirs.keys())))
            try:
                # text = page.get()
                text = getwikitext(page)
                oldtext = text
            except wikipedia.NoPage:
                print "Can't get %s from en.wikt!" % safe(page.aslink())
                text = ''
            except wikipedia.IsRedirectPage:
                print "Redirect page"
                text = ''
            except KeyError:
                # annoying local error, from crappy framework code
                print "KeyError"
                time.sleep(200)
                continue
            if not text: continue
            # now parse the current entry, see if we need to add or remove iwikis
            oldlinksites = wikipedia.getLanguageLinks(text)
            oldlinks = { }
            # now unwrap insane stupidity, why would we POSSIBLY want the site structures?!
            for s in oldlinksites:
                oldlinks[s.language()] = oldlinksites[s]
            # handle wikts we can't read:
            for code in Skips:
                if code in oldlinks: links[code] = oldlinks[code]
            # bad titles?
            act = ' '
            for code in sorted(oldlinks):
                if oldlinks[code].title() != title:
                    act += '-[[:%s:%s]], ' % (code, oldlinks[code].title())
                    del oldlinks[code]
            act = act.rstrip(', ')
            # additions? removals?
            act += ' +'
            for code in sorted(links):
                if code not in oldlinks:
                    act += code + ', '
                    # try adding reciprocal and/or new links, other pages probably need this:
                    if recp and code not in redirs:
                        addrci(wikipedia.Page(Sites[code], title), mysite,
                               links = links, redirs = redirs, skips = Skips, remove = True)
            act = act.rstrip(', +') + ' -'
            for code in sorted(oldlinks):
                if code not in links: act += code + ', '
            act = act.rstrip(', -')
            if act: act = "iwiki" + act
            if not act and title in Sorts: act = "sort iwikis"
            # update cache with links read:
            if not act: iwadd(title, oldlinks.keys())
        else: continue
        # some change, write it
        if act:
            # more insanity, now we have to generate a list with sites!
            # we really should do this ourselves, using putfirst ...
            linksites = { }
            for code in links:
                linksites[wikipedia.getSite(code, 'wiktionary')] = links[code]
            newtext = wikipedia.replaceLanguageLinks(text, linksites, site = mysite)
            fixed += 1
            naptime /= 2
            print "Updating %s: %s" % (safe(title), safe(act))
            # try to fix the entry
            try:
                utext = getedit(page)
                # utext = page.get()
                if utext != oldtext:
                    print "page changed during attempted update"
                    continue
                wikipedia.setAction(act)
                page.put(newtext)
                iwadd(title, links.keys())
            except wikipedia.EditConflict:
                print "Edit conflict?"
                continue
            except wikipedia.PageNotSaved:
                print "failed to save page"
                # other action?
                continue
            except wikipedia.NoPage:
                print "Can't get %s from en.wikt?" % safe(page.aslink())
                continue
            except wikipedia.IsRedirectPage:
                print "Redirect page now?"
                continue
            except socket.timeout:
                print "socket timeout, maybe not saving page"
                continue
            except socket.error:
                print "socket error, maybe not saving page"
                continue
            except KeyError:
                # annoying local error, from crappy framework code
                print "KeyError"
                time.sleep(200)
                continue
        # limit number of fixes for testing
        # if fixed > 7: break
        # pace
        if naptime > maxnap: naptime = maxnap
        if naptime > 4:
            print "sleeping %d seconds" % naptime
            time.sleep(naptime)
        continue
    print "%d in union index, %d entries, %d possible, %d updated" % (union, entries, probs, fixed)
    # done
# done
if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        if Iwikis:
            print
            print "(sync cache)"
            Iwikis.sync()
            Iwikis.close()
    finally:
        wikipedia.stopme()
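
The `merge` generator above is the heart of the bot: it walks the sorted page lists of all wikts in step and, for each title, reports which wikts have it, stopping when the home wikt runs out of pages. A minimal self-contained sketch of that union-index idea follows. It uses plain sorted lists in place of the framework's page generators, compares titles directly rather than as UTF-8 keys with underscores, and leaves out redirects and error handling; `union_index` and the sample data are made up for the illustration.

```python
def union_index(sources, home):
    """Merge several ascending title lists into one union index.

    sources maps a wikt code to an iterable of titles in ascending order.
    Yields (title, codes) for each distinct title, lowest first, for as
    long as the home wikt still has titles left.
    """
    iters = dict((code, iter(titles)) for code, titles in sources.items())
    current = {}
    for code in iters:
        try:
            current[code] = next(iters[code])
        except StopIteration:
            pass  # this wikt has no pages at all
    while home in current:
        lowest = min(current.values())
        codes = []
        # sorted() copies the keys, so deleting from current here is safe
        for code in sorted(current):
            if current[code] == lowest:
                codes.append(code)
                try:
                    current[code] = next(iters[code])
                except StopIteration:
                    del current[code]
        yield lowest, codes


sample = {
    "en": ["a", "b", "d"],
    "fr": ["b", "c", "d", "e"],
    "de": ["a", "d"],
}
for title, codes in union_index(sample, "en"):
    print("%s: %s" % (title, ",".join(codes)))
```

Each list is read exactly once and only one title per wikt is held at a time, which is what lets the real code work through every wikt without loading whole indexes into memory.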