User:Jakerylandwilliams

I am an information scientist and natural language complexity researcher at the University of Vermont, leading a project that predicts "missing" phrase-entries from a dictionary. The very first batch of potential missing entries is on my page, Potential missing entries; feel free to write me with comments regarding them (talk).

This development only applies to dictionaries that include larger-than-word lexical objects (such as the Wiktionary). For example, I am able to generate shortlists of four-word phrases that are similar to those defined in the Wiktionary, and which may in fact be missing as defined or referenced entries:

  • benefit of a doubt
  • keep an eye to
  • roll off the presses
  • one of a million
  • one upon a time
  • made up your mind
  • what time is new
  • down in the count
  • keep an eye for
  • ...

These lists are ordered according to how likely they are to be meaningful (in need of definition), and represent real usage from real English language practitioners.

In general, these lists are constructed by parsing large quantities of text; the above example was generated from approximately 0.1% of English tweets appearing from April to July 2014. Toward the completion of this project I will be building these lists from very large sources of English text, such as >20,000 Project Gutenberg eBooks, 20 years of New York Times articles, Wikipedia articles, music lyrics, and many, many more tweets. I expect this project to generate thousands of reasonable hits.

Aside from gathering large quantities of English phrases used by modern speakers and writers, the list-generating process is built on a context model, which observes word co-occurrence and measures the similarity of phrases by the contexts they share.

For example, the phrase "roll off the presses" is a member of the following contexts, each formed by removing one of its contiguous subphrases:

  1. **** off the presses
  2. roll *** the presses
  3. roll off *** presses
  4. roll off the *******
  5. **** *** the presses
  6. roll *** *** presses
  7. roll off *** *******
  8. **** *** *** presses
  9. roll *** *** *******
  10. **** *** *** *******

Our prediction was based on the observation that the defined phrases "hot off the presses" (context 1) and "roll off the tongue" (context 4) appear in real-world text (along with the undefined phrase of interest).
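
As a rough illustration of the ten contexts listed above, here is a minimal Python sketch. It is not the project's actual code: the function names (context_keys, render, shared_contexts) and the tiny list of defined phrases are hypothetical, and it only reports which contexts a candidate shares with defined phrases. It does not attempt the likelihood ordering described earlier. Removed words are stored as None rather than as asterisks, so that "hot" and "roll" leave the same gap when contexts are compared.

```python
def context_keys(phrase):
    """Return one context per contiguous subphrase of the phrase.

    Each context is a tuple of words in which the removed subphrase is
    replaced by None. Contexts are ordered by the length of the removed
    subphrase, then by its starting position.
    """
    words = tuple(phrase.split())
    n = len(words)
    keys = []
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            keys.append(tuple(
                None if start <= i < start + length else word
                for i, word in enumerate(words)
            ))
    return keys


def render(phrase, key):
    """Display a context, masking each removed word with asterisks."""
    return " ".join(
        "*" * len(word) if kept is None else kept
        for word, kept in zip(phrase.split(), key)
    )


def shared_contexts(candidate, defined_phrases):
    """Map each defined phrase to the contexts it shares with the candidate.

    The fully masked context is skipped, since every phrase of the same
    length shares it.
    """
    candidate_keys = context_keys(candidate)
    matches = {}
    for phrase in defined_phrases:
        other_keys = set(context_keys(phrase))
        common = [
            key for key in candidate_keys
            if key in other_keys and any(word is not None for word in key)
        ]
        if common:
            matches[phrase] = [render(candidate, key) for key in common]
    return matches


if __name__ == "__main__":
    candidate = "roll off the presses"
    # Hypothetical stand-in for the set of phrases already defined.
    defined = ["hot off the presses", "roll off the tongue", "benefit of the doubt"]
    for number, key in enumerate(context_keys(candidate), start=1):
        print(number, render(candidate, key))
    print(shared_contexts(candidate, defined))
```

In this sketch, "hot off the presses" and "roll off the tongue" each share several contexts with the candidate, while "benefit of the doubt" shares none.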

While some phrases are misspellings or imperfect executions, a large number are indeed well-used phrases with no existing reference, and an even larger number are reasonable (and used) variants and tenses of phrases for which definitions exist (and hence are in need of a redirect link).

This is a purely academic and ongoing project aimed at identifying and defining the larger English lexicon of phrases, and I would like to apply it to the Wiktionary. If you have any suggestions as to how I might be able to implement this tool and flag these phrases for definition, removal, or redirect, please let me know.

-Jakerylandwilliams (talk) 04:45, 17 February 2015 (UTC)