79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

I have just updated the Tatoeba+HSK+α English flashcards to fix an issue with rogue forward-slash characters in the pinyin of a number of target words, which prevented flashcards from being generated for those words.
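For anyone who hits the same problem in their own word lists, here is a minimal sketch of the kind of clean-up involved. It is not the actual fix applied to the flashcard files: the function name `clean_pinyin` is hypothetical, and it assumes the stray slashes can simply be deleted rather than replaced by something else.

```python
import re


def clean_pinyin(pinyin: str) -> str:
    """Delete stray '/' characters and tidy up leftover whitespace.

    Assumption: a slash inside a pinyin field is never meaningful,
    so it is removed outright.
    """
    without_slashes = pinyin.replace("/", "")
    # Collapse any run of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", without_slashes).strip()


print(clean_pinyin("ni3/hao3"))
```

Running such a pass over the pinyin column before generating cards would keep a single malformed entry from blocking its flashcard.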

For details of the update, and to download the updated version of the Tatoeba+HSK+α English flashcards, please scroll to the bottom of post #42 of this thread (see link below), or simply download the updated file attached to this message.



Dear all,

I've also finished a first Python version of @leguan's sentence contextual flashcard generation program. This program does the following:
  1. It collects all the words from the BCC, Leiden, and SUBTLEX corpora that are known to Pleco's dictionaries and orders them by frequency (175,000 words).

  2. It matches all the words from the merged corpus file to the sentences in which they occur. The sentences can be from any source, but are from Tatoeba's Chinese-English list in this case.

  3. Using the indexed word list (i.e., the words with their matching sentence IDs added) and the numbered Tatoeba sentence list, it generates Pleco sentence flashcards. The headword is the Hanzi word we are looking for; the pronunciation field holds the sentence in which the headword occurs, with the headword replaced by its pinyin; and the definition holds the English translation of the sentence (multiple translations of the same Chinese sentence are folded together).

    It doesn't use the same sentence more than twice, and it doesn't generate more than seven different sentences for any one HSK word (across all HSK levels). In addition, it takes a random sample of 7,000 mostly non-HSK words from the first 40,000 most common words in the corpora that occur in the Tatoeba sentences, and generates sentences with them. These words can't be too hard, since the Tatoeba sentences aren't that hard either.

  4. It orders the finished sentence list by HSK level. The HSK level of the sentence is currently determined not by the HSK level of the headword (if there is one), but by the HSK level of the entire sentence, as calculated by the HSK rating program (see thread "Automatically Assessing the HSK Difficulty Level of Arbitrary Chinese Sentences").
It doesn't order the sentences randomly, because Pleco's Flashcards will already do that.

More languages will follow. This program does something very similar to leguan's, except that the output is additionally separated by HSK level. There are still many things I/we could improve. I hope you like the sentences; feedback is welcome.
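To make steps 2 and 3 easier to picture, here is a much simplified sketch. It is not the attached source code: all function names and the sample data are made up for illustration, words are matched by plain substring search, the seven-sentence cap is applied to every word rather than only to HSK words, and it assumes a tab-separated headword/pinyin/definition layout for the import file. The HSK ordering of step 4 and the random sample of non-HSK words are left out.

```python
from collections import defaultdict


def fold_translations(pairs):
    """Merge rows sharing the same Chinese sentence into one numbered entry."""
    folded = {}  # Chinese sentence -> list of English translations
    for zh, en in pairs:
        translations = folded.setdefault(zh, [])
        if en not in translations:
            translations.append(en)
    # Number the sentences so that words can refer to them by ID.
    return {sid: (zh, ens) for sid, (zh, ens) in enumerate(folded.items())}


def index_words(words, sentences):
    """Map each word to the IDs of the sentences that contain it."""
    index = defaultdict(list)
    for sid, (zh, _ens) in sentences.items():
        for word in words:
            if word in zh:  # plain substring match, no word segmentation
                index[word].append(sid)
    return index


def make_cards(words, pinyin, sentences, max_per_word=7, max_uses=2):
    """Build (headword, sentence with pinyin gap, definition) tuples.

    `words` is assumed to be in frequency order already.
    """
    index = index_words(words, sentences)
    uses = defaultdict(int)  # sentence ID -> how often it has been used
    cards = []
    for word in words:
        made = 0
        for sid in index.get(word, []):
            if made >= max_per_word:
                break
            if uses[sid] >= max_uses:
                continue  # this sentence has been used often enough
            zh, ens = sentences[sid]
            gap_sentence = zh.replace(word, pinyin[word])
            cards.append((word, gap_sentence, "; ".join(ens)))
            uses[sid] += 1
            made += 1
    return cards


# Made-up sample data, for illustration only.
pairs = [
    ("我喜欢猫。", "I like cats."),
    ("我喜欢猫。", "I love cats."),
    ("他喜欢狗。", "He likes dogs."),
]
words = ["喜欢", "猫"]
pinyin = {"喜欢": "xǐhuan", "猫": "māo"}

cards = make_cards(words, pinyin, fold_translations(pairs))
for card in cards:
    print("\t".join(card))
```

Keeping the folding, indexing, and card-building steps in separate functions means a different sentence source or different limits can be swapped in without touching the rest.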

I attach the current source code and the Chinese-English output.

Edit: Added German, French, Italian, Japanese, Russian, and Spanish.




Excellent work, Shun!
The tools you have developed are great not only because, if my understanding is correct, they fully automate the sentence contextual flashcard generation process, but also because their modularity allows easy adaptation to a user's preferences and use in other related projects. They indeed form a textbook on how to use Python to create sentence-based learning tools and are surely a great asset for all!
Hi leguan,

Many thanks! Python seems to be quite ideal for this type of task. Thanks also for the excellent "sentence contextual" idea of replacing one Chinese word with its pinyin. As stated previously, it trains your Hanzi writing skills, and at the same time your Chinese reading skills and general passive vocabulary knowledge.

Best, Shun