User talk:Chernorizets/bg-top-5000-bnc

From Wiktionary, the free dictionary
Latest comment: 9 months ago by Chernorizets in topic Download

Download


Hi @Chernorizets, do you have this list in CSV or another downloadable format? SimonWikt (talk) 13:44, 22 August 2023 (UTC)

@SimonWikt you can download the original frequency lists from here: https://dcl.bas.bg/bulnc/en/dostap/retchnitsi/. All I've done is some grepping to exclude words with non-Cyrillic characters and uppercase letters. I've used the GENERAL dictionary, which combines all the others. Chernorizets (talk) 20:41, 22 August 2023 (UTC)
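For anyone wanting to reproduce that filtering and get a CSV out of it, here is a minimal sketch in Python. It is not the exact grep that was used. It assumes each line of the downloaded wordlist is a word followed by whitespace and a count (the real file layout may differ), and that "Cyrillic lowercase" means the basic range а-я plus ѝ. The names `keep_word` and `filter_to_csv` are made up for illustration.

```python
import csv
import io
import re

# Assumption: lowercase Cyrillic = basic range а-я plus ѝ (used in Bulgarian).
LOWERCASE_CYRILLIC = re.compile(r"[а-яѝ]+")


def keep_word(word: str) -> bool:
    """True if the word is made up only of lowercase Cyrillic letters."""
    return LOWERCASE_CYRILLIC.fullmatch(word) is not None


def filter_to_csv(lines, out, limit=5000):
    """Write up to `limit` kept words to `out` as CSV rows: rank, word, count.

    Assumption: each input line looks like "<word> <count>", already sorted
    by descending frequency. Lines that do not fit are skipped.
    """
    writer = csv.writer(out)
    writer.writerow(["rank", "word", "count"])
    kept = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not keep_word(parts[0]):
            continue
        kept += 1
        writer.writerow([kept, parts[0], parts[1]])
        if kept >= limit:
            break
    return kept


# Tiny made-up sample; the counts are not real corpus figures.
sample = ["и 100", "София 50", "на 40", "web 30", "да 20"]
buffer = io.StringIO()
filter_to_csv(sample, buffer)
print(buffer.getvalue())
```

With a real file, `lines` could be an open file handle (`open(path, encoding="utf-8")`) and `out` a file opened with `newline=""`, as the `csv` module recommends.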
Thanks @Chernorizets
I dumped your page into a spreadsheet and went from there.
On closer inspection, it is not quite what I was hoping for. Some words that I consider everyday vocabulary aren't even on the list!
This seems to be typical of all the 'Top lists' that I can find: they all seem to be based on sources such as Wikipedia, film subtitles, the news, and books, not on normal everyday life.
Thanks anyway 😀
SimonWikt (talk) 05:53, 23 August 2023 (UTC)
@SimonWikt there might be other corpora of Bulgarian text that include a more representative sample. This particular corpus is what's available from BAS at the moment. The InformalFiction wordlist is probably denser in everyday vocabulary, based on its description, so you can take a look at that one if you'd like. Chernorizets (talk) 06:51, 23 August 2023 (UTC)