Wiktionary talk:Frequency lists/PG/2005/08/1-10000


One more: in the first subsection titled 1-100 there are only 99 words. But in the other subsections we have 100 words.

Here are a few problems I can see:

  1. The apostrophe is treated as a word character even in positions where English only rarely permits it, such as the first character ('tis in 2601-2700 is a legitimate exception):
    • 'I (701 - 800)
    • 'The (2301 - 2400)
    • 'You (2801 - 2900)
    • 't (3201 - 3300)
    • 'And (3901 - 4000)
    • 'What (3901 - 4000)
    • 'It (4101 - 4200)
    • 'Oh (4201 - 4300)
  2. "." seems to be blindly included as a word character even when it's not surrounded by a letter on both sides or is accompanied by an apostrophe:
    • it.' (6101 - 6200)
    • you.' (7501 - 7600)
    • me.' (7801 - 7900)
  3. URLs are present - is it possible that Gutenberg's newsletters are being included as well as the out-of-copyright books?:
    • pobox.com (5401 - 5500)
    • gutenberg.net (6201 - 6300)
    • pglaf.org. (9801 - 9900)

Most of these are easily filtered out. In the first case I think we'd be better off with fewer false positives at the cost of a small number of false negatives.
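A minimal sketch of the kind of filtering meant here, not the script that produced these lists. The names `tokenize`, `URL_RE`, `WORD_RE` and `LEADING_APOSTROPHE_OK` are made up for illustration, and the allow-list and the short list of top-level domains are assumptions that would need tuning against the real corpus:

```python
import re

# Assumption: treat http(s) links and bare host names ending in a few
# common top-level domains as URLs and drop them before counting words.
URL_RE = re.compile(r"\bhttps?://\S+|\b(?:[\w-]+\.)+(?:com|net|org)\b",
                    re.IGNORECASE)

# A word is a run of letters with optional internal apostrophes (don't),
# optionally preceded by one apostrophe. "." is never a word character.
WORD_RE = re.compile(r"'?[A-Za-z]+(?:'[A-Za-z]+)*")

# Hypothetical allow-list of words that really do start with an apostrophe.
LEADING_APOSTROPHE_OK = {"'tis", "'twas"}


def tokenize(text):
    """Return lowercased word tokens with URLs and stray punctuation removed."""
    text = URL_RE.sub(" ", text)
    tokens = []
    for match in WORD_RE.finditer(text):
        token = match.group(0).lower()
        # Strip an opening quote mark unless the word is on the allow-list.
        if token.startswith("'") and token not in LEADING_APOSTROPHE_OK:
            token = token[1:]
        tokens.append(token)
    return tokens
```

With this approach 'The and it.' are counted as the and it, while 'tis survives. The cost is the false negatives mentioned above: any genuine apostrophe-initial word that is missing from the allow-list loses its apostrophe.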

Thanks. I'd just like to say, gah! I had punctuation characters converted to spaces in an earlier iteration; I'm not sure what I goofed up on this last round. Oh wait, these frequency lists are not the latest version, and don't correspond to the template:rank entries. Double gah! --Connel MacKenzie T C 22:09, 8 December 2005 (UTC)


I was searching for some common three-letter words, and here's what I found:

For j: joy, job, jag, joy, jos, jeg, jar and jaw...but no jug! For k: key, kun, kan, kam and kin...but no kid! For z: zoo, zou and zal...but no zip!

Hard to see how this can really be useful as any sort of approximation of the 10000 most commonly used English words, which is what I was looking for. 122.107.225.220

Why's 'Gutenberg' #243? How often does that come up?

Presumably because the scanner includes the copyright text at the start of every book (that contains Gutenberg). Conrad.Irwin 03:14, 19 March 2009 (UTC)
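If that is the cause, one way to keep the licence text out of the counts would be to cut each file down to its body before tokenizing. This is only a sketch: it assumes the file carries the "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***" and "*** END OF ..." marker lines, whose wording varies between files and years, and `strip_gutenberg_boilerplate` is a made-up name, not part of any existing tool:

```python
import re

# Assumed marker wording; older or newer files may phrase these lines
# differently, in which case the text is returned unchanged.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()
```

Files without recognisable markers pass through untouched, so this would only lower the rank of 'Gutenberg' for books that use the assumed marker format.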

Why is 'la' in the first 100 words? The entry links to http://en.wiktionary.org/wiki/la, where it has number 481 (and even that is a very low number for 'la' in its meaning as a syllable used in solfège (music)). Lcla

Crappy

This page is really crappy, to be blunt. Can we delete it? --Newfriendforyou 00:27, 23 July 2011 (UTC)