TV/movie frequency lists
This is a frequency count of words in a collection of TV and movie scripts/transcripts, primarily downloaded from the Internet.
- The total number of words counted is: 29,213,800.
- Most stage directions and other cruft were stripped out of the scripts. What's left is (mostly) the actual words you'd hear coming out of your speakers.
- "Words" were divided on any character not in [A-Z], [a-z], or the ISO-Latin-1 range [À-ÿ]. The hyphen is one of those dividing characters (since so many of the transcribers couldn't tell the difference between a hyphen and a dash), so the compound "happy-juice" would have been counted as "happy" and "juice". I may eventually get around to generating a separate list of common hyphenated compounds.
- Apostrophes were included only if they were entirely contained within word characters. So "don't" was counted as "don't", but "goin'", "'cause", and "'cool'" would have had their apostrophes stripped.
- Especially when you get to the lower-frequency words, don't expect all entries to actually make sense. Some of the not terribly useful (for Wiktionary) things you'll see are:
- "creative" spellings by transcribers
- attempts to write non-linguistic behaviour, like "mrmph!"
- partial words. When Giles says "bu-bu-bu-but", that's counted as 3 "bu"s and one "but". You probably don't want to rush out and add an English section to the article for bu. (But we're now accepting pool bets for when somebody actually does.)
- gibberish created by occasional malfunctions of the transcriber's closed-caption capture card. Here is a fairly typical line from a soap-opera transcript; you would not want to base an entry for tdodo or ógc on it:
- No, I have tdodo this, jack. I have to. ç|ógc1sss @ How long will we be inaris?
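The splitting rules described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script that actually produced the list; the names `TOKEN_RE`, `tokenize`, and `count_words` are hypothetical, and the sketch assumes that case is left untouched, since the notes above don't say whether words were case-folded.

```python
import re
from collections import Counter

# Word characters: A-Z, a-z, and the ISO-Latin-1 range À-ÿ (U+00C0-U+00FF).
# An apostrophe is kept only when a word character sits on both sides of it.
# Everything else (hyphens, digits, punctuation, spaces) divides words.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u00FF]+(?:'[A-Za-z\u00C0-\u00FF]+)*")


def tokenize(text):
    """Split one line of transcript text into words (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def count_words(lines):
    """Tally word frequencies over an iterable of lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts
```

Under these assumptions, `tokenize("don't goin' 'cause 'cool'")` gives `["don't", "goin", "cause", "cool"]`, `tokenize("happy-juice")` gives `["happy", "juice"]`, and `count_words(["bu-bu-bu-but"])` tallies three "bu"s and one "but". Because digits also divide words, the garbled "ógc1sss" in the caption line above comes out as "ógc" and "sss".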
Sincerest thanks to all the fans in the world who meticulously wrote down all this data. May they soon discover Wiktionary! Keffy 01:04, 17 February 2006 (UTC)