Wiktionary talk:Frequency lists/Portuguese wordlist


The list contains a group of incomplete words that came from splitting text at the hyphen "-". For example, "deixá" has no meaning as a word on its own; it should appear as part of a form such as "deixá-lo".
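As a rough illustration of the point above, here is a minimal Python sketch of a tokeniser that keeps hyphenated forms such as "deixá-lo" together. How the list on this page was actually generated is not known to me; the pattern and the function names `tokenise` and `count_words` are my own assumptions, not the method used for the list.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical pattern: one or more letters, optionally followed by
# "-letters" groups, so verb + clitic forms stay in one token.
# [^\W\d_] means "a word character that is not a digit or underscore".
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenise(text):
    """Split text into lowercase words, keeping internal hyphens."""
    # NFC makes accented letters single code points so the regex sees
    # "á" as one letter rather than "a" plus a combining accent.
    text = unicodedata.normalize("NFC", text).lower()
    return WORD.findall(text)


def count_words(lines):
    """Count token frequencies over an iterable of subtitle lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenise(line))
    return counts
```

A leading dialogue dash, as often found in subtitles, is not treated as part of a word here, because the pattern only accepts a hyphen between two runs of letters. Whether forms like "deixá-lo" should then be counted under the verb or as separate entries is a separate decision that this sketch does not make.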


Bond, James Bond?

I'm learning Portuguese and do find this list helpful in choosing what words to learn first. But using subtitles as the sample does leave it open to certain errors. Take for example number 2458 in the list, "bond", and the apparently even more common "james", which makes it into the top 1,000 words at 969. I'm not sure these are really on the lips of the people of Portugal that often. It's more likely sample bias - some of the subtitles evidently came from James Bond films.

I do have a copy of the excellent Routledge book from 2008, "Frequency dictionary of Portuguese". This doesn't have either "james" or "bond" in its top 5,000. It has only "bonde", a Brazilian word for streetcar. However, this doesn't mean the subtitle-based list here is useless. Compared with the massive Routledge academic undertaking it is a quick and dirty study, but still a valid one: it provides an accurate snapshot of the modern spoken language, at least as depicted in subtitled film and TV shows. By contrast the Routledge authors, based at Brigham Young University in the US, set out to cover a much broader range of text types. By their own description this took a long time to assemble, so their list may not be as up to date.

There aren't many good frequency lists around for Portuguese, so all are likely to contain useful information. Routledge used a body of 20 million words, mainly from the period 1970 to 2000, and obtained equally from sources in Brazil and Portugal. The final sample was made up of six million words from novels and short stories, six million from various news media, six million from academic sources, and finally two million spoken words from various other collections and interviews. Their method is described in detail in the book's Introduction.

The Routledge team working on Portuguese were in at the start of a larger academic collaboration that has gone on to produce frequency dictionaries for over a dozen other languages. These all use a similar computational approach. But the key thing seems to be assembling a large and representative sample of text to analyse.

"Frequency dictionary of Portuguese" by Mark Davies and Ana Maria Raposo Preto-Bay, published by Routledge, 2008.

Istobe (talk) 05:22, 7 June 2017 (UTC)

@Istobe, I have a user page with a partially cleaned-up frequency list (here). I think it is better than this page, but if you have access to a real frequency dictionary, that is even better. - Ungoliant (falai) 15:03, 7 June 2017 (UTC)
@Ungoliant MMDCCLXIV, I prefer the layout on your user page, which splits the list up into smaller chunks. This may be of more help for my purposes than the big list, so thanks! As for the differences between the subtitle-based list and the Routledge dictionary, they are surprisingly small at the top end of the lists. It seems that the most frequent words are frequent whatever type of text is in your sample, probably because they are basic structural words of the language. So of the top 10 subtitle words we find that "que", "de", "um" and "para" are also in the overall Routledge top 10, while "não" is 11 and "se" is 14. It seems that only once you get outside the top 1,000 do really big differences kick in between words used in the spoken and written languages, and between news, fiction and academic writing. For example, lots of government words show up in the Routledge news sample, and many more feeling and descriptive words in fiction. One good feature of the Routledge approach is that they store more about each word, so they can pull up interesting sub-lists. For example, the most common adjectives in their Portuguese fiction texts are "triste" - sad, "nu" - naked and "gordo" - fat. In the news sample they are "real" - royal, "federal", "municipal" and "eleitoral" - electoral. Istobe (talk) 02:39, 8 June 2017 (UTC)
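For anyone who wants to repeat this kind of comparison, here is a small Python sketch that looks up where the top words of one ranked list fall in a reference list. The function names `compare_ranks` and `missing_from_reference` are hypothetical, and it assumes each list is a plain sequence of unique words ordered from most to least frequent. It is only one simple way to do the check, not the method used by either list's authors.

```python
def compare_ranks(sample, reference, top_n):
    """For the top_n words of `sample`, return (word, sample_rank, reference_rank).

    Ranks start at 1. reference_rank is None when the word does not
    appear in `reference` at all.
    """
    ref_rank = {word: i for i, word in enumerate(reference, start=1)}
    return [
        (word, i, ref_rank.get(word))
        for i, word in enumerate(sample[:top_n], start=1)
    ]


def missing_from_reference(sample, reference, top_n):
    """Words in the top_n of `sample` that `reference` does not contain."""
    return [
        word
        for word, _, rank in compare_ranks(sample, reference, top_n)
        if rank is None
    ]


# Toy orderings for illustration only; these are NOT the real ranks
# from the subtitle list or from the Routledge dictionary.
toy_subtitles = ["que", "de", "não", "james", "bond"]
toy_reference = ["de", "que", "não", "bonde"]

for row in compare_ranks(toy_subtitles, toy_reference, top_n=5):
    print(row)
print(missing_from_reference(toy_subtitles, toy_reference, top_n=5))
```

With real data, words that rank high in the subtitle list but are missing from the reference list (as "james" and "bond" are missing from the Routledge top 5,000) would be candidates for sample bias and worth checking by hand.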