The Beijing Language and Culture University has created a balanced corpus of 15 billion characters (the BCC corpus). It is based on news (人民日报 1946-2018, 人民日报海外版 2000-2018), literature (books by 472 authors, including a significant share of non-Chinese writers), non-fiction books, blog and Weibo entries, as well as classical Chinese. Frequency lists derived from the corpus can be downloaded here: http://bcc.blcu.edu.cn/downloads/resources/BCC_LEX_Zh.zip
The ZIP file contains a global frequency list based on the whole corpus as well as frequency lists for specific categories of the corpus (e.g. news, literature). These text files can easily be turned into a Pleco user dictionary (see the sketch below).
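For illustration, here is a minimal Python sketch of such a conversion. It assumes (this is not stated in the post) that each line of a list is "word<TAB>count" and that Pleco's user dictionary import accepts tab-separated "headword<TAB>pinyin<TAB>definition" lines with the pinyin field left empty; the file names and the top_n cutoff are just examples.

```python
# Sketch: turn a BCC frequency list into a Pleco-importable text file,
# using frequency rank and raw count as the "definition".
# Assumptions: input lines are "word<TAB>count"; Pleco accepts
# "headword<TAB>pinyin<TAB>definition" with an empty pinyin field.

def convert(src="global_wordfreq.release_UTF-8.txt",
            dst="pleco_userdict_global.txt",
            top_n=100_000):
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for rank, line in enumerate(fin, start=1):
            if rank > top_n:
                break
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip malformed lines
            word, count = parts[0], parts[1]
            fout.write(f"{word}\t\trank {rank}, count {count}\n")

if __name__ == "__main__":
    convert()
```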
The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), SUBTLEX-CH (47 million characters) and the LCMC (fewer than 2 million characters), so the frequency lists derived from it may well be the most reliable ones currently available. More detailed information about the corpus can be found in this paper.
If you have problems with the original files, the UTF-8 versions attached below may work; you can also re-encode the originals yourself, as sketched below. Due to technical limitations, the global and blogs lists only include the 1,048,576 most frequent words.
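If you prefer to convert the original files yourself, something like the following may do, assuming the originals are GB18030/GBK-encoded (an assumption; the post does not say) and that the non-UTF-8 file name matches the attached one without the suffix.

```python
# Sketch: re-encode an original BCC list to UTF-8.
# Assumption: the source file uses a GB18030/GBK encoding; adjust if it does not.

def to_utf8(src="global_wordfreq.release.txt",
            dst="global_wordfreq.release_UTF-8.txt"):
    with open(src, encoding="gb18030", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)

if __name__ == "__main__":
    to_utf8()
```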
I have also created two Pleco user dictionaries that show the frequency of a word as its definition. The first is based on the 100,000 most frequent words from the literature frequency list, the second on the 100,000 most frequent words from the news frequency list. Both are attached below as ZIP files. If you are interested in a frequency user dictionary based on spoken language, see this post.
Maybe this is useful to some of you.
Attachments
- global_wordfreq.release_UTF-8.txt (16.3 MB)
- literature_wordfreq.release_UTF-8.txt (4.8 MB)
- news_wordfreq.release_UTF-8.txt (6 MB)
- technology_wordfreq.release_UTF-8.txt (7.8 MB)
- blogs_wordfreq.release_UTF-8.txt (16.7 MB)
- weibo_wordfreq.release_UTF-8.txt (4.6 MB)
- user dictionary literature frequency (100000 words).zip (19.4 MB)
- user dictionary news frequency (100000 words).zip (20.6 MB)