User:Jakerylandwilliams/Potential missing entries
- 1 In general
- 2 Updates
- 3 Tables of potential missing entries
- 3.1 Brief description
- 3.2 Twitter (2009): filtration by definition likelihood
- 3.3 Twitter (2009): filtration by frequency
- 3.4 The New York Times (1987-2007): filtration by definition likelihood
- 3.5 The New York Times (1987-2007): filtration by frequency
- 3.6 Music Lyrics (1960-2007): filtration by definition likelihood
- 3.7 Music Lyrics (1960-2007): filtration by frequency
- 3.8 English Wikipedia Articles (2010): filtration by definition likelihood
- 3.9 English Wikipedia Articles (2010): filtration by frequency
- 3.10 Project Gutenberg eBooks (1500-2010): filtration by definition likelihood
- 3.11 Project Gutenberg eBooks (1500-2010): filtration by frequency
I am a student and mathematical linguistics researcher at the University of Vermont, and this is a living research project in which we track how well we predict dictionary entries that are missing from the Wiktionary. By analyzing large quantities of text from different sources, we are tracking lexical objects (in the tables below) that we refer to as phrases, none of which were referenced in the Wiktionary as of July 1, 2014. Any phrases that have been referenced since we accessed the Wiktionary's data will appear in blue, and all that have not will appear in red. In general, we choose the short lists of phrases to track by two methods: likelihood of dictionary definition (which is the result of our study) and frequency (as a baseline). For more information about this project, please see our article on the arXiv.
Update 1: The paper is live!
The paper is live, so check it out on the arXiv for more information on the methods and analysis. So far it appears that several editors have inspected the tables below and added references, and, as expected, almost all of the newly referenced phrases were proposed by the likelihood filter (only one was proposed by the frequency filter). If you have definitions or references for any of these phrases that you would like to add, you should join the Wiktionary and become an editor.
Tables of potential missing entries
The tables below contain phrases extracted from various large corpora by two methods as part of an ongoing project aimed at detecting entries missing from a phrasal dictionary. Though this page was created on Feb. 16, 2015, the original download of the Wiktionary entries list was made on July 1, 2014. As such, any entries below that are no longer listed as missing were successfully predicted. For more information about the processes by which these phrases were extracted, please see the slightly more detailed explanation on my user page, or our article on the arXiv.
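For readers who want a concrete picture of the frequency baseline and of what "successfully predicted" means here, the following is a minimal sketch, not the code used in the study. It assumes phrases are matched against entry titles by exact string, and the function names (`frequency_shortlist`, `prediction_rate`) and the toy counts are hypothetical. The definition-likelihood filter is described in the article and is not reproduced here.

```python
from collections import Counter


def frequency_shortlist(phrase_counts, snapshot_entries, size):
    """Return the `size` most frequent phrases absent from the snapshot.

    phrase_counts    -- mapping of phrase -> corpus frequency (hypothetical input)
    snapshot_entries -- set of entry titles at download time
    """
    missing = Counter(
        {phrase: count
         for phrase, count in phrase_counts.items()
         if phrase not in snapshot_entries}  # assumption: exact-string match
    )
    return [phrase for phrase, _ in missing.most_common(size)]


def prediction_rate(shortlist, current_entries):
    """Fraction of tracked phrases that have since gained an entry."""
    if not shortlist:
        return 0.0
    hits = sum(1 for phrase in shortlist if phrase in current_entries)
    return hits / len(shortlist)


# Toy data, invented purely for illustration.
counts = {
    "of the": 50,
    "in the": 40,
    "knock it out of the park": 7,
    "new kids on the block": 5,
}
snapshot = {"of the"}                              # entries at download time
current = snapshot | {"knock it out of the park"}  # entries at a later date

tracked = frequency_shortlist(counts, snapshot, size=2)
rate = prediction_rate(tracked, current)
```

The same `prediction_rate` check could be applied to a short list produced by any other filter, which is how the two columns of tables on this page can be compared.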
Twitter (2009): filtration by definition likelihood
- ¹ probably: knock it out of the park
- ² appears to be from:
°º¤ø¤º°¨¨°º¤oNEW KIDS ON THE BLOCK º°¨¨°º¤ø