Wiktionary talk:Frequency lists/Spanish1000
I have generated a new, improved version of the list. The new list was generated from 6,527 subtitle files of TV series and movies, with a total of 27,417,111 words.
A bug in the counting script that caused words at the end of a line to be lost has been fixed. The fix increases the total word count and, in particular, the counts of adjectives and nouns, which are likely to occur at the end of a sentence. Additionally, a few files in languages other than Spanish or with encoding problems have been excluded. Matthias Buchmeier 13:24, 10 October 2008 (UTC)
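For illustration only, here is a minimal sketch of line-by-line word counting in which the last word of a line cannot be dropped. It is not the actual counting script: the names `WORD_RE` and `count_words`, the definition of a word as a run of letters, and the sample lines are all assumptions.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters (digits and underscores excluded).
WORD_RE = re.compile(r"[^\W\d_]+")

def count_words(lines):
    """Count lower-cased words over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # findall scans the whole line, so a word directly before the
        # newline (or at the very end of the file) is counted as well.
        counts.update(w.lower() for w in WORD_RE.findall(line))
    return counts

# Made-up sample lines, with and without a trailing newline.
sample = ["Creo que es muy bonito\n", "Tal vez sea bonito"]
counts = count_words(sample)
print(counts["bonito"])
```

Matching words with a pattern, rather than splitting on a fixed separator, is one way to avoid depending on what follows the last word of a line.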
Words excluded from the list
- The following names, proper nouns, words from other languages and typos have been excluded from the list:
Import to other Wiktionaries
What's the licensing for these lists? Can I import them to the French Wiktionary (which has shockingly few Spanish words, fewer than 7000, I think)? Mglovesfun 11:42, 21 May 2009 (UTC)
- The lists are released under both the GFDL and the LGPL licenses. Of course, you can import them to the French Wiktionary. Matthias Buchmeier 08:36, 22 May 2009 (UTC)
What about a frequency list of two-word combinations? I have looked up two words individually, and the combination of the two can have a meaning that is hard for me to explain, for example "tal vez" or "creo que". Anyway, my point is that there may be some interesting trends.
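A two-word (bigram) list could be built in much the same way as a single-word list. The following is only a rough sketch under assumptions: `count_bigrams` and `WORD_RE` are hypothetical names, the sample lines are invented, and pairs are only formed within a single line.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]+")

def count_bigrams(lines):
    """Count adjacent word pairs, without pairing words across lines."""
    counts = Counter()
    for line in lines:
        words = [w.lower() for w in WORD_RE.findall(line)]
        # zip(words, words[1:]) yields each word together with its successor.
        counts.update(zip(words, words[1:]))
    return counts

# Made-up sample lines for illustration.
sample = ["Tal vez no", "Creo que tal vez sí", "Creo que no"]
bigrams = count_bigrams(sample)
for pair, n in bigrams.most_common(2):
    print(" ".join(pair), n)
```

Since subtitle lines often break in the middle of a sentence, a real run might want to join the lines of each subtitle before pairing words.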
Possible repetitions and spacing errors?
I am wondering about the word counts. The corpus consists of 27.4 million words. The unformatted list contains words by number of tokens in groups of 5,000 up to 225,000 words. However, there are a few repetitions where only the accents differ. Is this simply an orthographic error, or are they different words? For example, in the 75,001-80,000 group, there are 6 tokens of decídselo '[2PL] you tell it to him'. However, in the 105,000-110,000 group, decidselo appears again without an accent on the 'i', with a total of 3 tokens. I am guessing it is the same expression with the accent missing. Also, there seem to be a number of typographical errors where the space between two words was dropped and the two were counted as one word, such as miracometimos and miraditasparece in the 175,001-180,000 group. Zaimot 18:23, 9 June 2010 (UTC)
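One way to find such accent-only repetitions would be to group the entries by their form with the acute accent removed. This is a sketch under assumptions, not something the list was actually checked with: `strip_acute` and `accent_variants` are hypothetical names, and the señor/senor counts are invented (only the decídselo counts come from the comment above).

```python
import unicodedata
from collections import defaultdict

def strip_acute(word):
    """Remove acute accents only, so that ñ and ü are left untouched."""
    decomposed = unicodedata.normalize("NFD", word)
    # U+0301 is the combining acute accent; recompose the rest with NFC.
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))

def accent_variants(counts):
    """Return groups of entries that differ only by acute accents."""
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[strip_acute(word)][word] = n
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

# decídselo/decidselo counts as reported above; the other two are invented.
counts = {"decídselo": 6, "decidselo": 3, "señor": 10, "senor": 2}
variants = accent_variants(counts)
print(variants)
```

The result is only a list of candidates for manual review, since pairs such as si/sí or el/él are genuinely different words and would be grouped as well.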