79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun

状元
Hi leguan,

that sounds excellent, I'm all for a clear thread structure. So I will use another thread to answer your post and edit in a link to the new thread here:

HSK Difficulty Thread

Best,

Shun
 

leguan

探花
I have just updated the Tatoeba+HSK+α English flashcards to fix an issue: rogue forward-slash characters in the word pinyin of a number of target words were preventing flashcard generation for those words.

For details of the update, and to download the updated version of the Tatoeba+HSK+α English flashcards, please scroll to the bottom of post #42 of this thread (see link below), or just download the updated file attached to this message.

https://plecoforums.com/threads/79-...apanese-and-spanish-sentences.5925/post-45063
 

Attachments

  • 20190106 sentence contextual writing practice (ENG) - Tatoeba+HSK+α.txt
    8.6 MB · Views: 2,589

Shun

状元
Dear all,

I've also finished a first Python version of @leguan's sentence contextual flashcard generation program. This program does the following:
  1. It collects all the words from the BCC, Leiden, and SUBTLEX corpora that are known to Pleco's dictionaries and orders them by frequency. (175,000 words)

  2. It matches all the words from the merged corpus file to the sentences in which they occur. The sentences can be from any source, but are from Tatoeba's Chinese-English list in this case.

  3. Using the indexed word list (i.e., the words with their matching sentence IDs added) and the numbered Tatoeba sentence list, it generates Pleco sentence flashcards. The headword is the Hanzi word we are looking for; the pronunciation field holds the sentence in which the headword occurs, with the headword replaced by its pinyin; and the definition holds the English translation of the sentence (multiple translations of the same Chinese sentence are folded together).

    It doesn't use the same sentence more than twice, and it doesn't generate more than seven different sentences for one HSK word (across all HSK levels). In addition, it takes a random sample of 7000 mostly non-HSK words from the 40,000 most common words in the corpora that occur in the Tatoeba sentences, and generates sentences with them. These words can't be too hard, since the Tatoeba sentences aren't that hard, either.

  4. It orders the finished sentence list by HSK level. The HSK level of the sentence is currently determined not by the HSK level of the headword (if there is one), but by the HSK level of the entire sentence, as calculated by the HSK rating program (see thread "Automatically Assessing the HSK Difficulty Level of Arbitrary Chinese Sentences").
It doesn't order the sentences randomly, because Pleco's Flashcards will already do that. More languages will follow. This program does something very similar to leguan's, now just additionally separated by HSK levels. There are still many things I/we could improve. I hope you like the sentences; feedback is welcome.
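To illustrate steps 2 and 3 above, here is a minimal sketch. It is not the attached program: the function names, the plain substring matching (no word segmentation), the " / " separator for folded translations, and the tab-separated headword/pinyin/definition line layout are all assumptions made for illustration. Only the two caps (a sentence is used at most twice, a word gets at most seven sentences) come from the description above. HSK rating and ordering (step 4) are left out.

```python
from collections import defaultdict

MAX_PER_WORD = 7      # from the post: at most seven sentences per word
MAX_PER_SENTENCE = 2  # from the post: a sentence is used at most twice


def fold_translations(pairs):
    """Merge all translations of the same Chinese sentence into one entry.

    pairs: iterable of (chinese_sentence, translation) tuples.
    The " / " separator is an assumption, not the original format.
    """
    folded = defaultdict(list)
    for chinese, translation in pairs:
        if translation not in folded[chinese]:
            folded[chinese].append(translation)
    return {zh: " / ".join(trs) for zh, trs in folded.items()}


def build_cards(words, folded):
    """Generate (headword, cloze sentence, definition) tuples.

    words: list of (hanzi, pinyin), assumed already ordered by frequency.
    folded: dict as returned by fold_translations().
    Matching is a plain substring test, which is cruder than real
    word segmentation and is used here only to keep the sketch short.
    """
    uses = defaultdict(int)  # how often each sentence has been used
    cards = []
    for hanzi, pinyin in words:
        made = 0
        for sentence, translation in folded.items():
            if made >= MAX_PER_WORD:
                break
            if hanzi not in sentence or uses[sentence] >= MAX_PER_SENTENCE:
                continue
            # Replace the first occurrence of the headword with its pinyin.
            cloze = sentence.replace(hanzi, pinyin, 1)
            cards.append((hanzi, cloze, translation))
            uses[sentence] += 1
            made += 1
    return cards


def to_pleco_line(card):
    """Format one card as a tab-separated line (assumed import layout)."""
    return "\t".join(card)
```

Calling fold_translations() on the sentence pairs, passing the result to build_cards() together with the frequency-ordered word list, and writing to_pleco_line() for each card to a text file would give a list in roughly the shape described above.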

I attach the current source code and the Chinese-English output.

Edit: Added German, French, Italian, Japanese, Russian, and Spanish.

Enjoy,

Shun
 

Attachments

  • sentence_contextual_tatoeba_cn_eng_folded by HSK rating_Shun.txt
    2.1 MB · Views: 3,197
  • sentence_contextual_tatoeba_cn_deu_folded by HSK rating - random sentence selection.txt
    468.1 KB · Views: 624
  • sentence_contextual_tatoeba_cn_fra_folded by HSK rating - random sentence selection.txt
    1.4 MB · Views: 736
  • sentence_contextual_tatoeba_cn_ita_folded by HSK rating - random sentence selection.txt
    338.1 KB · Views: 726
  • sentence_contextual_tatoeba_cn_jpn_folded by HSK rating - random sentence selection.txt
    517.1 KB · Views: 940
  • sentence_contextual_tatoeba_cn_rus_folded by HSK rating - random sentence selection.txt
    663.7 KB · Views: 1,399
  • sentence_contextual_tatoeba_cn_spa_folded by HSK rating - random sentence selection.txt
    619 KB · Views: 1,263
  • Fold, rate hsk, match to sent, generate.py.txt
    15.9 KB · Views: 527

leguan

探花
Excellent work, Shun!
The tools you have developed are great not only because they, if my understanding is correct, fully automate the sentence contextual flashcard generation process, but also because their modularity allows easy adaptation to a user's preferences and use in other related projects. They indeed form a textbook on how to use Python to create sentence-based learning tools and are surely a great asset for all!
 

Shun

状元
Hi leguan,

many thanks! Python seems to be quite ideal for this type of task. Thanks also for the excellent "sentence contextual" idea of replacing one Chinese word with its pinyin. As stated previously, it trains your Hanzi writing skills, and at the same time your Chinese reading skills and general passive vocabulary knowledge.

Best, Shun
 

Shun

状元
Hi pdwalker,

many thanks, you’re welcome! I’m also very open to any additional requests you may have after having studied some.

Best,

Shun
 

Shun

状元
Hello all,

at @Akpierce1776's request, I am happy to upload a Chinese-English Tatoeba sentence contextual flashcard list, graded by HSK from levels 2 to 6 instead of 3 to 6 as before, now with 23,866 sentences in all. There are just under 900 sentences that were determined to be HSK level 2. This list is based on the newest Tatoeba sentence data from February 24, 2019, which contains about 5-6% more sentences than the old lists did. If anyone would like me to make HSK 2-6 sentence lists from the newest data in another language, please just tell me.

For those who are already in the process of studying with the older lists, I don't think it's worth starting over just because of these 5-6%, though.

Enjoy,

Shun


PS: And, as always, thanks to @leguan for the great idea!
 

Attachments

  • sentence_contextual_tatoeba_cn_eng_folded - ordered by HSK ratings 2-6.txt
    2.2 MB · Views: 2,020

agewisdom

进士
@Shun The HSK 2-6 segregated cards are pretty fantastic work! Many thanks. I'll update my post and spread the word around soon.
Phew, it's pretty hard work reviewing these cards.

BTW - Is there any way to get PLECO to voice the entire sentence rather than just the pinyin word?
 

Shun

状元
Hi agewisdom,

thank you very much! I guess so. :)

There is a way to have it pronounce almost the entire sentence: You select the sentence and tap on the speaker button. The pinyin part will be pronounced when you haven’t selected anything.

Enjoy (and say thank you to leguan equally),

Shun
 
Shun said: "If anyone would like me to make HSK 2-6 sentence lists from the newest data in another language, please just tell me."


If you are able to easily do the Spanish version of these, that would be fantastically helpful to practice my Spanish as well.
 

Shun

状元
This is the newest Spanish-English list, which I would recommend over Chinese-Spanish if refreshing Spanish is your goal, because you can use it in both directions with Anki, unlike the sentence contextual flashcards.
 

Attachments

  • sentences_spa_eng folded.zip
    5.2 MB · Views: 639

agewisdom

进士
Shun said: "There is a way to have it pronounce almost the entire sentence: You select the sentence and tap on the speaker button. The pinyin part will be pronounced when you haven’t selected anything."

I don't know why but that doesn't work. It still only pronounces the pinyin word.

Alternatively, is there a way to change the default to speak the entire sentence instead?
 

leguan

探花
agewisdom said: "I don't know why but that doesn't work. It still only pronounces the pinyin word. Alternatively, is there a way to change the default to speak the entire sentence instead?"

Unfortunately, I believe not. At least, not now.

I guess we might be able to realise this once Pleco 4.0 has been launched :) However, even if it is achievable, it seems likely that it will require rebuilding the flashcards with an extra attribute holding the full sentence without pinyin (or possibly in full pinyin).
 

agewisdom

进士
leguan said: "Unfortunately, I believe not. At least, not now. I guess we might be able to realise this once Pleco 4.0 has been launched :) However, even if it is achievable, it seems likely that it will require rebuilding the flashcards with an extra attribute holding the full sentence without pinyin (or possibly in full pinyin)."

Thanks @leguan for letting me know.

A big THANK YOU for sharing your deck and flashcard concept. I tried it out and darn it... It's really HARD! But I'm learning a lot through it. It's just that I don't know some of the other characters, which makes it hard to GUESS :p the pinyin character accordingly. Every time I go through the flashcards, it's just like reading Chinese text. I stutter and fumble a lot, but at least it's in manageable chunks.
 

leguan

探花
Hi agewisdom,

You're very welcome! I'm very happy to hear that the flashcards are useful to you!

Have you tried using Shun's latest HSK level graded decks? With them you can reduce the overall difficulty of the sentences by choosing not to include higher HSK levels when you start a new test, which should make it easier to guess the pinyin word.
 

agewisdom

进士
Yes, I used his folded graded decks. Even HSK 2 is a bit tough. The pinyin words are HSK 2, but some of the other words in the sentences aren't. That was what prompted me to ask whether it was possible to hear the entire sentence rather than just the pinyin.

Still, it's excellent practice, except that it takes a bit more work to look up some of the unfamiliar characters in the sentences.
 