Sentences flashcards generator (Python script)

Pierre Biannic · Dec 16, 2021

Dear all,

I am pleased to share this Python script, which automatically generates sentence flashcards from the Tatoeba database. I wrote it based on a previous script by @Shun , who also helped with this one. The main features are:

Choice of translation language, and automatic download from tatoeba.org
Assessment of sentence HSK level (algorithm based on the number of HSK words and other 'rare' words), and export only selected levels
Copy of the Chinese sentence in the translation (as tapping the Chinese sentence will only open the character translation, instead of looking for multiple character words in the available dictionaries)
Possibility to overwrite the pinyin field with the expected Chinese words, but shuffled: to be used as a hint when translating from the foreign language to Chinese. If this option is not selected, the pronunciation is left blank so that Pleco fills it in automatically during the import process
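
The level assessment mentioned in the feature list could be sketched roughly as follows. This is not the script's actual algorithm, only a minimal illustration under assumptions: the sentence is already segmented into words, `hsk_levels` maps each HSK word to its level (1 to 6), `freq_rank` maps words to their rank in a frequency list, and the names `estimate_hsk_level`, `rare_rank` and `max_rare` are hypothetical.

```python
from typing import Dict, List


def estimate_hsk_level(
    words: List[str],
    hsk_levels: Dict[str, int],
    freq_rank: Dict[str, int],
    rare_rank: int = 20000,
    max_rare: int = 0,
) -> int:
    """Estimate a sentence level from its words (illustrative only).

    Returns the highest HSK level found among the words, or 7
    ("beyond HSK") if the sentence contains more than `max_rare`
    words that are neither HSK words nor frequent words.
    """
    level = 1
    rare_count = 0
    for word in words:
        if word in hsk_levels:
            # An HSK word raises the sentence to at least its own level.
            level = max(level, hsk_levels[word])
        elif freq_rank.get(word, rare_rank + 1) > rare_rank:
            # Not an HSK word and not frequent: count it as 'rare'.
            rare_count += 1
    return 7 if rare_count > max_rare else level
```

With such a function, exporting only selected levels amounts to keeping the sentences whose estimated level is in the chosen set. The thresholds are arbitrary here and would need tuning against real sentences.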

At this moment, it only works with simplified characters.

The script requires two additional files: hsk new.txt (the list of HSK words) and global_wordfreq.release (Hanzi only).txt (a word frequency list).
To run, the script requires the following packages to be installed: hanziconv and tatoebatools

I also attach a few examples of the exports: English (with and without shuffled words hints) and French.
As the files are too large for this forum, please use the following links:

The script and associated files --> here
The flashcards examples --> here

Any feedback, ideas for improvement, etc. are much appreciated!

Pierre

Shun · Dec 16, 2021

Hi all,

that was some great work by Pierre! Everyone who would like to try studying with sentences suited to their current level, and in their own native language, should try it out. We could also create a standalone app for this, but it would be pretty large, so you have to install Python for now.

Brief instructions: If you're on a Mac, I suggest installing it through Homebrew (easily found through Google), and on Windows, you could use the "Chocolatey" package manager, or on Linux, the package manager of your Linux distribution, all of which keep Python up to date. Then, you only need to install the hanziconv and tatoebatools packages using "pip3 install <package name>". Once that is done, you can run the script using "python3 <script name>". The two associated files above need to reside in the same directory as the script file. We should be able to help out in case of any troubles.

Enjoy,

Shun

hugovth · Feb 14, 2022

Hi!

Thanks a lot, it works perfectly! (Except that HSK 1 renders some complex sentences, but to be honest, it does not matter.)

I have a small question:

How do you generate global_wordfreq and hsk new.txt? I would like to adapt the script for HSK 3.0, but I don't really know where I can find or generate these text files.

Thanks and good job !

Shun · Feb 14, 2022

Hi hugovth,

welcome; it would be wonderful to have even more programmers. If Pierre hasn't already done so, we could soon start an open source repository on GitHub, perhaps with different forks. As it happens, just yesterday I added an even stronger word segmenter to Pierre's script (for word reordering in the sentence), which uses 350'000 expressions for segmenting and works great.

You can get Pleco's clean, built-in HSK 3 vocabulary from here:

and then export it. I have also attached the same list to this post (the "9levels" one), as well as yet another new "2020" HSK list with four levels ("hsk 3.txt") that I don't remember the origin of. Possibly it comes from @Weyland. The HSK rating should work even more reliably if we include these lists, or perhaps even all three lists at once. But it certainly takes a lot of testing and personal day-to-day usage to make sure that the HSK rating performs optimally.

"hsk new.txt" is the older 5,000 word HSK 2.0, which also comes from Pleco's built-in list. BCC is a frequency list that was obtained from a thread on these forums:

Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)

The Beijing Language and Culture University created a balanced corpus of 15 billion characters. It’s based on news (人民日报 1946-2018，人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and weibo entries as well as...

www.plecoforums.com

Feel free to come back with questions to both of us.

Have fun,

Shun

misaifan · Feb 2, 2023

Hi guys,

I don't have experience with Python and am a bit confused here. Perhaps you can point me in the right direction?

I just want to add more example sentences to one of my Pleco dictionaries. I don't care which one, or what the source is. Just as many as possible for a whole dictionary; I don't need them to be separated by level or anything like that.

Does something like that already exist without the need for Python?

Misaifan

mikelove · Feb 2, 2023

There's no way to do that at the moment - once it is possible, though, I don't know why you'd need Python for it since it will most likely just be a matter of importing a text file containing a bunch of example sentences.

Shun · Feb 2, 2023

Hi misaifan, hello Mike,

I know what Mike means, but I took a different route.

I have a 30 MB text file of the 116,000-vocabulary item CC-CEDICT dictionary along with up to fifteen or so Tatoeba example sentences for each vocabulary item that occurs in them (Python-generated two years ago). Misaifan could import it into a new user dictionary. However, it's 10 MB zipped, which is too large for it to be uploaded to the Pleco Forums. (There once were tremendous download volumes for older Tatoeba sentence files, which of course isn't the purpose of a forum.) This one is a useful dictionary, since Tatoeba example sentences are longer and tend to be quite idiomatic. Example of the dictionary entry for 中国人:

The example sentences containing 中国人 were randomly chosen, up to 15 of them.
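
The selection described above (plain substring search, then up to 15 random sentences per headword) could look roughly like this. It is only a sketch under assumptions, not the original generator: `pick_examples` is a hypothetical name, and the sentences are assumed to be available as (Chinese, translation) pairs.

```python
import random
from typing import List, Optional, Tuple

SentencePair = Tuple[str, str]  # (Chinese sentence, translation)


def pick_examples(
    headword: str,
    sentence_pairs: List[SentencePair],
    limit: int = 15,
    seed: Optional[int] = None,
) -> List[SentencePair]:
    """Return up to `limit` randomly chosen pairs whose Chinese side
    contains `headword` as a plain substring."""
    matches = [pair for pair in sentence_pairs if headword in pair[0]]
    if len(matches) <= limit:
        return matches
    # A seeded generator makes the random choice reproducible.
    return random.Random(seed).sample(matches, limit)
```

Because this is a pure substring test, a single-character headword such as 中 also matches every sentence that contains 中国人.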

@misaifan, do you use Dropbox or some other cloud service? Then you could create a "file request" link, dm me the link, and I could upload it to your Dropbox there. Would you be willing to post a public download link for it to the Pleco Forums afterwards? (There are no copyright restrictions for the Tatoeba sentences except for attribution.)

Edit: I see that the translation of the second sentence is wrong, 好客 should be "hospitable".

Another example with 电脑 "computer":

Cheers,

Shun

Pierre Biannic · Feb 3, 2023

Hi @Shun , with your example, wouldn't you get all the 中国人 sentences also copied under 中, 国 and 人? (It reminds me of a previous conversation about finding a logical way to cut sentences into "words"...)
I'm thinking of another method, and I hope I'm not saying something really stupid: can you make example sentences appear in the SENTS tab? (When adding an entry, the only fields are headword, pronunciation and definition; I don't see examples or sentences.) I'm not sure how it works, but if a word exists in a dictionary, Pleco displays all its occurrences in example sentences in this tab. So is there a way to "throw" a bunch of Tatoeba sentences into a kind of dictionary, so that Pleco would automatically display the proper examples under SENTS when you look up a word?

Shun · Feb 3, 2023

Hi Pierre!

Yeah, if you search for 国 or 中 in the dictionary, it may be that an example sentence with 中国人 in it will come up. It would definitely be better to segment the sentences first (for example, using the BCC corpus frequency list, going through the sentences backwards and starting with the longest words) instead of just searching all sentences for a certain string. Then each sentence could be an unordered list of words which would need to be identical with the CC-CEDICT words to qualify as an example sentence. Would you like to have a go at it, or shall I try it? I would program it from scratch, it shouldn't take much work thanks to Python.
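
The backward, longest-word-first segmentation described here is commonly called backward maximum matching. A minimal sketch, assuming the vocabulary is simply a set of known words (for instance taken from a frequency list); `segment_backward` and `max_len` are hypothetical names, and this is not the code of any existing script:

```python
from typing import List, Set


def segment_backward(sentence: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Segment `sentence` by scanning from the end and always taking
    the longest vocabulary word that ends at the current position.
    Unknown single characters are kept as one-character words."""
    words: List[str] = []
    end = len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from right to left
    return words


vocabulary = {"中国人", "中国", "我", "是"}
words = segment_backward("我是中国人", vocabulary)
# The sentence qualifies as an example for 中国人, but not for 中.
is_example_for_full_word = "中国人" in words
is_example_for_single_char = "中" in words
```

A sentence then counts as an example for a headword only if the headword is one of its segmented words, rather than a substring of the raw sentence.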

Indeed, there currently doesn't seem to be a possibility of tagging parts of a user dictionary definition as example sentences and translations so they come up in SENTS properly; that's Pleco-internal for version 3.2.

Cheers, Shun

misaifan · Feb 3, 2023

Shun said:
Hi misaifan, hello Mike,

I know what Mike means, but I took a different route.

I have a 30 MB text file of the 116,000-vocabulary item CC-CEDICT dictionary along with up to fifteen or so Tatoeba example sentences for each vocabulary item that occurs in them (Python-generated two years ago). Misaifan could import it into a new user dictionary. However, it's 10 MB zipped, which is too large for it to be uploaded to the Pleco Forums. (There once were tremendous download volumes for older Tatoeba sentence files, which of course isn't the purpose of a forum.) This one is a useful dictionary, since Tatoeba example sentences are longer and tend to be quite idiomatic. Example of the dictionary entry for 中国人:

Yes, that's the one I had in mind. I found that one first, and this thread later on. I reckoned that there might be a newer version of it. If that's the current one, though, I will gladly accept your offer!

What about the topic you raised below, where you would need to program from scratch? I have no experience with Python and therefore can't gauge whether I should wait for that first. I recently bought additional dictionaries in Pleco and I am not unhappy with the amount of example sentences. There is no rush for me.

BTW, does anyone know about the quality of Tatoeba? I ask because you found a mistake up there. I never worked with it long enough to assess it.

Thank you all already. It is still a delight to study Chinese after so many years. And I am grateful for Pleco and its community! <3

Shun · Feb 3, 2023

Hi misaifan,

you're welcome! Tatoeba's translation quality is usually acceptable; at least you can usually figure out the exact meaning and still benefit from the sentence structure and the additional context provided by the sentence. I believe Tatoeba uses a credit system where sentence translations made by one user are rated by other users, but it appears that inaccuracies still slip through.

One reason might be that sentences in Tatoeba can be invented (or copied from any source) in a third language, such as French, and then translated to English and Chinese by different users. So the English-Chinese pair isn't a direct translation made by a near-native speaker, but actually took a detour via another language, French in this case, and the sentences are linked together anyway.

I'd suggest that you install both user dictionaries, since they will not get in each other's way when you install the Tatoeba-CEDICT dictionary with segmentation at a later date.

Feel free to post here if you need any help.

Best, Shun

Pierre Biannic · Feb 3, 2023

Shun said:
Indeed, there currently doesn't seem to be a possibility of tagging parts of a user dictionary definition as example sentences and translations so they come up in SENTS properly; that's Pleco-internal for version 3.2.

Is this a feature planned for Pleco 4 @mikelove ? Thanks!

mikelove · Feb 4, 2023

Shun · Feb 4, 2023

Hi @Pierre Biannic and @misaifan,

I did it and put it in a new thread. See:

Segmenting CEDICT-Tatoeba Dictionary Generator for any language

Dear all, I reprogrammed the «CEDICT Tatoeba example sentences dictionary generator» with word segmenting added. A search for "中" now will not produce example sentences that contain 中国人. Only if you display the dictionary definition for 中国人 will sentences containing 中国人 appear. The script...

plecoforums.com

Now it looks like this:

Feedback is welcome.

Enjoy,

Shun
