User:OrenBochman/BetterSearch

Brainstorm Some Search Problems

Problem: Lucene search indexes the MediaWiki source text (wikitext), not the rendered HTML output.

(Also) index the rendered HTML output?

Problem: The HTML output also contains CSS, markup tags, scripts, and comments

Solution:
Either index these too, or run a filter to remove them. Some strategies are:
1. Discard all markup.
  1. A markup_filter/tokenizer could be used to discard markup.
  2. Apache Tika (originally a Lucene subproject) can do this.
  3. Other ready-made solutions.
2. Keep all markup.
  1. Write a markup-analyzer that compresses the page to reduce storage requirements
    (interesting if one also wants to compress output for storage in the DB or cache).
3. Selective processing.
  1. A table_template_map extension could be used to identify structured information for deeper indexing.
  2. This is the most promising strategy, since it can detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
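
As an illustration of strategy 1 (discard all markup), here is a minimal sketch that uses only the Python standard library. It is not the Lucene or Tika implementation; the class name VisibleTextExtractor and the function strip_markup are hypothetical names chosen for this example. It drops the contents of script and style elements, ignores comments, and keeps the remaining text.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping <script>/<style> contents and comments."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    # handle_comment is not overridden: the base class discards comments.

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining chunks with a space keeps words in adjacent block elements apart, at the cost of splitting a word that is interrupted by an inline tag. A selective filter (strategy 3) would extend SKIPPED_TAGS or inspect attributes instead of discarding everything.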

Problem: Indexing offline and online

Real-time "only": slowly build the index in the background.
Offline "only": use a dedicated machine/cloud to dump and index offline.
Dual: each time the linguistic component becomes significantly better (or there is a bug fix), it would be desirable to upgrade search. How this would be done depends largely on the architecture of the analyzer. One possible approach would be:
1. Production of new linguistic/entity data or a new software milestone.
2. Offline analysis from a dump (XML or HTML).
3. Online processing of updates, newest to oldest, with a Poisson wait-time prediction model.
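
The notes do not define the Poisson wait-time model, so the following is only one possible reading, sketched under stated assumptions: edits to a page are assumed to arrive as a Poisson process, the rate is estimated from the page's edit timestamps, and the expected wait until the next edit is the reciprocal of that rate. All function names are hypothetical.

```python
import math


def edit_rate(edit_timestamps):
    """Estimate edits per second as (number of intervals) / (elapsed time).

    Assumes the timestamps are sorted in ascending order.
    """
    if len(edit_timestamps) < 2:
        return 0.0
    span = edit_timestamps[-1] - edit_timestamps[0]
    if span <= 0:
        return 0.0
    return (len(edit_timestamps) - 1) / span


def expected_wait(rate):
    """Expected time until the next edit under the Poisson assumption."""
    return math.inf if rate == 0 else 1.0 / rate


def prob_edit_within(rate, window):
    """Probability of at least one edit within `window` seconds."""
    return 1.0 - math.exp(-rate * window)


def reindex_order(pages):
    """Order page titles newest to oldest by their last edit.

    pages: dict mapping title -> non-empty sorted list of edit timestamps.
    """
    return sorted(pages, key=lambda title: pages[title][-1], reverse=True)
```

One way to use this would be to walk the pages in reindex_order and defer any page whose prob_edit_within a chosen window is high, since it is likely to need indexing again soon.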

Problem: Lucene's best analyzers are language-specific

The N-gram analyzer is language-independent.
A new multilingual analyzer with a language detector can be produced by
extracting features from the query and checking them against a model prepared offline.
The model would contain lexical features such as:
1. Alphabet.
2. Bigram/trigram distribution.
3. Stop lists: collections of common word/POS/language sets (or lemma/language).
4. Normalized frequency statistics based on sampling full text from different languages.
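
A minimal sketch of the trigram-distribution feature, assuming character trigrams and cosine similarity as the comparison (the notes do not specify either). The function names are hypothetical, and a real model would be built offline from large text samples per language and would add the alphabet and stop-list features as well.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Normalized character-trigram frequencies of a text."""
    padded = f"  {text.lower()} "  # padding marks word boundaries at the edges
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def detect_language(query, models):
    """Return the language whose offline profile is closest to the query.

    models: dict mapping language code -> profile from trigram_profile().
    """
    profile = trigram_profile(query)
    return max(models, key=lambda lang: cosine(profile, models[lang]))
```

Usage would look like building models = {"en": trigram_profile(english_sample), "de": trigram_profile(german_sample)} once, then calling detect_language(query, models) per query. Very short queries carry few trigrams, which is where the alphabet and stop-list features would help.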

Problem: Search is not aware of morphological language variation

In languages with rich morphology (e.g. Hebrew, Arabic) this reduces the effectiveness of search.
Index Wiktionary to produce data for a "lemma analyzer":
1. Dumb lemma (a bag of forms with a representative).
2. Smart lemma (a list of forms ordered by frequency).
3. Quantum lemma (organized by morphological state and frequency).
Lemma-based indexing.
Run a semantic disambiguation algorithm (tagging) to disambiguate.
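
The three lemma variants can be read as three views over the same data. The sketch below is one hypothetical way to model them, assuming each observation is a (surface form, morphological state, frequency) triple extracted from Wiktionary; the class and function names are illustrative only.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Lemma:
    representative: str
    # (surface form, morphological state) -> observed frequency
    counts: Counter = field(default_factory=Counter)

    def add(self, form, state, freq=1):
        self.counts[(form, state)] += freq

    def dumb(self):
        """Unordered bag of forms, including the representative."""
        return {form for form, _ in self.counts} | {self.representative}

    def smart(self):
        """Forms ordered by descending total frequency."""
        totals = Counter()
        for (form, _), freq in self.counts.items():
            totals[form] += freq
        return [form for form, _ in totals.most_common()]

    def quantum(self):
        """Morphological state -> forms ordered by descending frequency."""
        by_state = defaultdict(Counter)
        for (form, state), freq in self.counts.items():
            by_state[state][form] += freq
        return {state: [form for form, _ in forms.most_common()]
                for state, forms in by_state.items()}


def build_form_index(lemmas):
    """Map each surface form to the representatives of its lemmas."""
    index = defaultdict(set)
    for lemma in lemmas:
        for form in lemma.dumb():
            index[form].add(lemma.representative)
    return index
```

For lemma-based indexing, an analyzer could replace each token with its representative from build_form_index. A form that maps to more than one representative is ambiguous, which is where the disambiguation step would come in.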

other benefits:

Lemma-based compression (arithmetic coding based on the smart lemma).
1. Indexing all lemmas.
Smart resolution of disambiguation pages.
An algorithm to translate English to Simple English.
Excellent language detection for search.

metrics:

Extract the amount of information contributed by a user:
1. Since inception.
2. In the final version.
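
A rough sketch of both metrics, assuming "amount of information" is approximated by whitespace-separated token counts (the notes do not define the measure). Tokens are attributed to the user whose revision introduced them, using the standard-library difflib; the function name attribute_tokens is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def attribute_tokens(revisions):
    """Attribute tokens to users across a page history.

    revisions: list of (user, text) pairs, oldest first.
    Returns (added, surviving):
      added     - tokens each user introduced over the whole history
      surviving - tokens in the final version, by the user who introduced them
    """
    added = Counter()
    prev_tokens, prev_authors = [], []
    for user, text in revisions:
        tokens = text.split()
        authors = [user] * len(tokens)  # default: the current editor
        matcher = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged tokens keep their original author.
                authors[j1:j2] = prev_authors[i1:i2]
            elif tag in ("insert", "replace"):
                added[user] += j2 - j1
        prev_tokens, prev_authors = tokens, authors
    surviving = Counter(prev_authors)
    return added, surviving
```

Token counts ignore how informative each token is and treat moved text as deleted and re-added, so this is only a first approximation of contribution.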

Plan

Resources

Developer/Admin Information

Search Options

Search Extensions

highlights:

Notes

A quick review of the above is summarized as follows:

MediaWiki does not appear to have native search capabilities. It can be searched via external components (which index the content and then search it) through three extensions:

Sphinx Search - for small sites (updated 2010)
Lucene Search - Lucene search for large sites
EzMwLucene - Easy Lucene search - an unadapted package from

MWSearch does not perform searches; rather, it provides integration with Lucene-search.

Potential Contact People

Commit-capable developers; IRC: #mediawiki

Screened

Brion Vibber - lead dev
Multichill
Andrew Garrett - active paid developer
Roan Kattouw - usability initiative, previously lead developer and maintainer of the MediaWiki API.
Siebrand Mazeland

Unscreened

Seb35
Aryeh Gregor
David Richfield
Niklas Laxström - experienced MediaWiki developer

Misc

ZIM offline format