User:OrenBochman/BetterSearch

From Wiktionary, the free dictionary

Brainstorm Some Search Problems[edit]

Problem: Lucene search processes the MediaWiki source wikitext, not the rendered HTML output[edit]

  1. (Also) index the output HTML?

Problem: HTML output also contains CSS, scripts, and comments[edit]

  1. Solution:
    Either index these too, or run a filter to remove them. Some strategies are:
    1. Discard all markup.
      1. A markup_filter/tokenizer could be used to discard markup.
      2. Lucene Tika project can do this.
      3. Other ready made solutions.
    2. Keep all markup
      1. Write a markup analyzer that would be used to compress the page to reduce storage requirements.
        (Interesting if one also wants to compress output for integration into a DB or cache.)
    3. Selective processing
      1. A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
      2. This is the most promising: it can detect/filter out unapproved markup (JavaScript, CSS, broken XHTML).
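The "discard all markup" strategy above can be sketched with a small filter built on Python's standard-library HTML parser (a sketch only, not the Tika approach; the class and function names here are mine):

```python
from html.parser import HTMLParser

class MarkupFilter(HTMLParser):
    """Strips tags, <script>/<style> bodies, and comments; keeps visible text."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so only script/style need skipping.
        if not self.skip_depth:
            self.parts.append(data)

def strip_markup(html: str) -> str:
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    f = MarkupFilter()
    f.feed(html)
    return " ".join(" ".join(f.parts).split())
```

For production-scale dumps, a ready-made extractor (Tika, as mentioned above) would be more robust than this sketch, e.g. against malformed markup.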

Problem: Indexing offline and online[edit]

  1. real-time "only" - slowly build the index in the background
  2. offline "only" - use a dedicated machine/cloud to dump and index offline.
  3. dual - each time the linguistic component becomes significantly better (or there is a bug fix) it would be desirable to upgrade the search index. How this is done depends largely on the analyzer's architecture. One possible approach:
    1. production of linguistic/entity data or a new software milestone.
    2. offline analysis from a dump (XML or HTML)
    3. online processing of updates, newest to oldest, with a Poisson wait-time prediction model
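The Poisson wait-time idea in step 3 can be sketched as a re-indexing priority: assuming each page's edits follow a Poisson process, the maximum-likelihood edit rate is edits divided by the observation window, and the expected wait until the next edit is 1/rate, so the hottest pages are re-analyzed first (function names and data shape are my assumptions):

```python
def edit_rate(edit_timestamps, now):
    """Poisson MLE for a page's edit rate: events / observation window."""
    if not edit_timestamps:
        return 0.0
    window = now - min(edit_timestamps)
    return len(edit_timestamps) / window if window > 0 else 0.0

def reindex_order(pages, now):
    """Order page titles for re-analysis, highest edit rate first
    (i.e. shortest expected wait 1/rate until the next edit)."""
    return sorted(pages, key=lambda title: -edit_rate(pages[title], now))
```

A real scheduler would also decay old edits and cap starvation of cold pages; this only illustrates the prioritization.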

Problem: Lucene Best Analyzers are Language specific[edit]

  1. An N-gram analyzer is language independent.
  2. A new multilingual analyzer with a language detector can be produced by:
  3. extracting features from the query and checking them against a model prepared offline.
  4. The model would contain lexical features such as:
    1. alphabet
    2. bi/trigram distribution.
    3. stop lists; a collection of common word/POS/language sets (or lemma/language)
    4. normalized frequency statistics based on sampling full text from different languages.

Problem: Search is not aware of morphological language variation[edit]

  1. In languages with rich morphology (e.g. Hebrew, Arabic) this reduces the effectiveness of search.
  2. index Wiktionary so as to produce data for a "lemma analyzer".
    1. dumb lemma (bag with a representative)
    2. smart lemma (list ordered by frequency)
    3. quantum lemma (organized by morphological state and frequency)
  3. lemma based indexing.
  4. run a semantic disambiguation algorithm (sense tagging) to disambiguate.
  • other benefits:
  1. lemma based compression. (arithmetic coding based on smart lemma)
    1. indexing all lemmas
  2. smart resolution of disambiguation pages.
  3. an algorithm to translate English to Simple English.
  4. excellent language detection for search.
  • metrics:
  1. extract amount of information contributed by a user
    1. since inception.
    2. in final version.
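The "dumb lemma" variant above (a bag of surface forms with one representative) can be sketched as an index-time token mapper; the Wiktionary-derived lemma groups shown here are hypothetical input:

```python
def build_lemma_map(lemma_groups):
    """Invert {lemma: [surface forms]} into a surface -> lemma lookup
    (the 'dumb lemma': every form collapses to one representative)."""
    return {form: lemma
            for lemma, forms in lemma_groups.items()
            for form in forms}

def lemma_tokenize(text, lemma_map):
    """Index-time analyzer: lowercase, whitespace-split, map to lemmas.
    Unknown tokens pass through unchanged."""
    return [lemma_map.get(tok, tok) for tok in text.lower().split()]
```

The "smart" and "quantum" variants would replace the flat lookup with frequency-ordered or morphology-aware structures, but the indexing hook stays the same.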

Plan[edit]

Resources[edit]

Developer/Admin Information[edit]

Search Options[edit]

highlights:

Notes[edit]

A quick review of the above is summarized as follows:

MediaWiki does not appear to have native search capabilities. It can be searched via external components (the wiki is indexed and then searched) using three extensions:

  1. Sphinx Search - for small sites (updated 2010)
  2. Lucene Search - Lucene search for large sites
  3. EzMwLucene - Easy Lucene search - an unadapted package from

MWSearch does not perform searches itself; rather, it provides integration with Lucene-search.


Potential Contact People[edit]

commit-capable developers, irc:#mediawiki

Screened[edit]

Unscreened[edit]

Misc[edit]