79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

#41
Dear all,

I've just replaced the files at the top of the thread with ones that use better Pleco pinyin segmentation. However, those who have already downloaded my "pure" Tatoeba Sentence Flashcards do not need to remove their flashcards with worse pinyin from their flashcard database. Instead, if you wish to keep all scores, it is much better if you just select all the Tatoeba cards you are studying in Organize Flashcards, tap on Batch (on an iPhone—Android is similar) and then scroll down a bit and tap on Remove Pinyin readings, then confirm. After that, you can temporarily change your active dictionaries under Manage Dictionaries so that only the following dictionaries have their Use in dictionary switch enabled. They are also the ones I used for the new files:

  • Das Neue Chinesisch-Deutsch
  • OCD C-E
  • Oxford C-E
  • Pleco C-E
  • KEY Chinese-English
  • ABC C-E
  • Tuttle C-E
  • Tuttle
  • CC-CEDICT

Then go back to the Organize Flashcards category, select all cards and tap on Add pinyin readings.

If you do not have these dictionaries, you can of course also delete the cards you have previously imported and import the new file. If it's done that way, however, you would lose scores from your learning sessions with the older cards.

I apologize for the hassle, but I now know something I didn't know at the start about how best to do pinyin conversion using Pleco.

@leguan Of course, you could now check whether the new cards give an even better sentence utilization ratio (once all the double-width characters have been replaced by single-width ones). But I'd absolutely understand if you need a break from these cards for now. :)

Regards,

Shun
 
#42
Hi Shun,

Thank you very much for your continued efforts to improve the pinyin segmentation and for your concern regarding my workload!

Until now, I have maintained one Excel spreadsheet containing all of the sentence lists, and have created flashcards from this spreadsheet by programmatically selecting sentences from the required lists. However, since this method has a few drawbacks, including not being able to easily prevent duplicate flashcards and non-optimal sentence utilization, particularly for flashcard sets based on a small proportion of the total sentences (e.g. Chinese-German), I have now created separate spreadsheets for each sentence list, and have created new "sentence contextual writing" flashcards sets based on these.

Yesterday, I also discovered two other related issues regarding Pleco's treatment of text in the pinyin field upon importation.

Before I go into the details, we should first be clear that these sentence contextual flashcards are based on a "hack" of Pleco's functionality. The pinyin field was surely never intended to contain mixtures of pinyin, Chinese characters, and other languages' alphanumeric text. This means that these flashcards could be rendered unusable at any time in the future by an update to Pleco's behaviour regarding the pinyin field:(! -> prays to God of Pleco...:p
EDIT: See Shun's following post for good news regarding this!:D

Here are the issues:

Issue 1: Alphabetic characters in the "Pinyin" field can also "disappear", or be converted into pinyin. For example, the English word "to" could become "tong", etc.
Issue 2: Full-width numeric characters (e.g. ０, １, ...) also seem to "disappear" on importation.

It seems that Issue 2 can be dealt with by converting the full-width numeric characters into half-width numeric characters. Issue 1, unfortunately, appears to occur regardless of whether the characters are half-width or full-width.

In my new sets of flashcards I have thus removed all of the sentences (totalling about 1000) that contain alphabetic characters in the definition field and have converted all of the full-width numeric characters to half-width.
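For anyone who wants to do the same clean-up outside Excel, here is a minimal Python sketch of the two steps described above: converting full-width digits to half-width ones and detecting sentences that contain Latin letters. The function names are made up for illustration, and which field the letter check is applied to is left to the caller.

```python
import re

# Full-width digits occupy U+FF10..U+FF19; map each one onto ASCII 0-9.
FULLWIDTH_DIGITS = {0xFF10 + i: str(i) for i in range(10)}

# Half-width and full-width Latin letters (Issue 1 affects both widths).
LATIN_LETTERS = re.compile("[A-Za-z\uff21-\uff3a\uff41-\uff5a]")


def to_halfwidth_digits(text: str) -> str:
    """Replace full-width digits with their half-width equivalents."""
    return text.translate(FULLWIDTH_DIGITS)


def has_latin_letters(text: str) -> bool:
    """Return True if the text contains any half- or full-width Latin letter."""
    return LATIN_LETTERS.search(text) is not None


# Example: keep only sentences without Latin letters, with digits normalised.
sentences = ["我有\uff13本书", "我用iPhone"]
cleaned = [to_halfwidth_digits(s) for s in sentences if not has_latin_letters(s)]
```

Here `cleaned` ends up holding only the first sentence, with its digit converted to half-width.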

The result of all this is as follows:

<Tatoeba Chinese-German>
Total number of original sentences in list: 4,538 → 4,411 (a)
Total number of flashcards: 5,740 → 8,392
Total number of unique sentences: 3,411 → 4,215 (b)
Percentage of original sentences utilized (= (b)/(a)) = 75.1% → 95.5%:D
Average number of flashcards per unique sentence: 1.68 → 1.99
Total number of unique words tested: 3,388 → 3,474
Average number of flashcards per unique word tested: 1.69 → 2.42

Total number of HSK words: 1,387 → 1,544
Total number of flashcards testing HSK words: 2,631 → 4,631
Average number of flashcards per HSK word: 1.89 → 3.00

Total number of non-HSK words: 2,001 → 1,930
Total number of flashcards testing non-HSK words: 3,110 → 3,397
Average number of flashcards per non-HSK word: 1.55 → 1.76

COMMENT: The big gain in sentence utilization comes partly from Shun's improved pinyin segmentation and partly from removing the non-Chinese-German sentences from the spreadsheet used to generate these flashcards, which ensures that 100% of the matches are relevant. Previously, more than 90% of the matches in my spreadsheet (up to a maximum of 30 per word) related to other flashcard sets, which limited the number of relevant Chinese-German sentence matches available for this set. The downside of indexing on a per-flashcard-set basis is that each sentence list needs to be indexed separately; for the larger flashcard sets this is usually an overnight job:eek:!
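The statistics in these lists can be recomputed from any list of generated flashcards. The following is only a sketch with a hypothetical function name and toy data, assuming each flashcard is reduced to a (tested word, sentence) pair; it is not the spreadsheet logic actually used.

```python
def utilization_stats(flashcards, total_sentences):
    """Summarise a flashcard set.

    flashcards: list of (word, sentence) pairs, one per flashcard.
    total_sentences: number of sentences in the original list, i.e. (a).
    """
    unique_sentences = {sentence for _, sentence in flashcards}  # (b)
    unique_words = {word for word, _ in flashcards}
    return {
        "flashcards": len(flashcards),
        "unique_sentences": len(unique_sentences),
        "utilization_pct": 100 * len(unique_sentences) / total_sentences,
        "cards_per_sentence": len(flashcards) / len(unique_sentences),
        "unique_words": len(unique_words),
        "cards_per_word": len(flashcards) / len(unique_words),
    }


# Toy data: three flashcards drawn from two of four available sentences.
toy_cards = [
    ("电脑", "我没有电脑。"),
    ("没有", "我没有电脑。"),
    ("电脑", "电脑很贵。"),
]
toy_stats = utilization_stats(toy_cards, total_sentences=4)
```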


<Tatoeba English>
Total number of original sentences in list: 41,587 → 40,792 (a)
Total number of flashcards: 39,828 → 46,603
Total number of unique sentences: 28,689 → 31,837 (b)
Percentage of original sentences utilized (= (b)/(a)) = 69.0% → 78.0%:)
Average number of flashcards per unique sentence: 1.38 → 1.62
Total number of unique words tested: 13,405 → 13,259
Average number of flashcards per unique word tested: 2.97 → 3.51

Total number of HSK words: 3,353 → 3,440
Total number of flashcards testing HSK words: 15,456 → 21,870
Average number of flashcards per HSK word: 4.60 → 6.35

Total number of non-HSK words: 10,052 → 9,821
Total number of flashcards testing non-HSK words: 24,372 → 24,732
Average number of flashcards per non-HSK word: 2.42 → 2.52

COMMENT: As well as the gains attributable to Shun's improved pinyin segmentation, some of the sentence utilization gain for this flashcard set can be attributed to increasing the maximum number of flashcards per non-HSK word from six to seven, increasing the maximum BCC corpus ranking from 100,000 to 160,000, and increasing the maximum non-BCC corpus (i.e., LWC and SUBTLEX) ranking from 60,000 to 100,000.


<HSK English>
Total number of original sentences in list: 18,261 → 18,036 (a)
Total number of flashcards: 40,054 → 57,460
Total number of unique sentences: 17,750 → 17,819 (b)
Percentage of original sentences utilized (= (b)/(a)) = 97.2% → 98.8%
Average number of flashcards per unique sentence: 2.26 → 3.24
Total number of unique words tested: 15,561 → 15,823
Average number of flashcards per unique word tested: 2.57 → 3.63

Total number of HSK words: 4,236 → 4,318
Total number of flashcards testing HSK words: 22,541 → 28,259
Average number of flashcards per HSK word: 5.32 → 6.54

Total number of non-HSK words: 11,325 → 11,505
Total number of flashcards testing non-HSK words: 17,514 → 29,199
Average number of flashcards per non-HSK word: 1.55 → 2.54

COMMENT: The sentence utilization gain here can mainly be attributed to increasing the maximum number of flashcards per HSK and non-HSK tested words.


<Tatoeba+HSK+α English>
Total number of original sentences in list: 63,556 → 62,520 (a)
Total number of flashcards: 86,721 → 88,604
Total number of unique sentences: 51,279 → 51,701 (b)
Percentage of original sentences utilized (= (b)/(a)) = 80.7% → 82.7%
Average number of flashcards per unique sentence: 1.69 → 1.71
Total number of unique words tested: 21,896 → 21,532
Average number of flashcards per unique word tested: 3.96 → 4.11

Total number of HSK words: 4,602 → 4,607
Total number of flashcards testing HSK words: 41,102 → 41,382
Average number of flashcards per HSK word: 8.93 → 8.98

Total number of non-HSK words: 17,294 → 17,157
Total number of flashcards testing non-HSK words: 45,620 → 47,222
Average number of flashcards per non-HSK word: 2.64 → 2.75

COMMENT: For this set I increased the maximum number of flashcards per non-HSK word from six to seven. This change added 2,665 new flashcards.


Once again, I hope these flashcard sets will be useful for those who want to practice writing Chinese characters from the sound (pinyin) of a word, as when listening to the sentence spoken in Chinese, within the context of a sentence, while getting reading comprehension practice at the same time.

As an aside, when I study with these flashcards, I use "Self-grading" and only give myself a "remembered perfectly" grade if I can perform all three of the following:
1. perfectly write the Chinese characters for the tested word
2. fully understand the sentence
3. read the sentence out loud with correct pronunciation/tones for all words in the sentence.

Enjoy:)
 

#43
Hi leguan,

you're welcome! Yeah, I was concerned about your workload (well, this is actually fun :)), and this looks really good, thank you! I don't quite grasp how you index the dictionaries or how the whole process works exactly, but in any case, I am very happy about the gains you were able to achieve, especially with the German version. For completeness, I will just describe what I did for my part, perhaps you'd also like to detail yours more fully.

Tatoeba.org has a "Developers" section where I could download the following:

- A list of sentences of all available languages mixed together (about 230,000 sentences), which has the following tab-delimited format:

Sentence ID number<tab>Language code of sentence<tab>Sentence in that language

- A links list of the following format, linking the sentences having the same meaning together:

Sentence ID in one language<tab>Sentence ID in the other language

This data format allows a sentence in one language to have many different translations in different languages. It did require me to write a Python script to pick out the language I needed from the Sentence list, then find all the translations in the Links list that had the other language as the target language, and save them to a list. I converted the lists to Python dictionaries to do this, as that is probably faster by several orders of magnitude than searching through the lists would have been. However, it required about 5 GB of RAM to perform the calculation, and it took about 1-2 minutes for each language pair. My machine only has 8 GB of RAM, so about 1.5 GB of memory had to be swapped to disk while Python was working on it.

Then I wrote out the output files, and that was it.
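For readers who would like to try this themselves, here is a rough sketch of the dictionary-based pairing described above. It is not the original script: the function names are invented, and the language codes in the demo ("cmn", "deu", "eng") are assumed to follow Tatoeba's three-letter convention. Both functions accept any iterable of lines, so an open file works as well as a list. Loading only the languages that are needed is a design choice of this sketch that should also keep memory use down.

```python
def load_sentences(lines, wanted_langs):
    """Build {sentence_id: (lang, text)} from 'id<tab>lang<tab>text' lines,
    keeping only the languages listed in wanted_langs."""
    sentences = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] in wanted_langs:
            sentences[parts[0]] = (parts[1], parts[2])
    return sentences


def pair_translations(sentences, link_lines, source_lang, target_lang):
    """Return (source_text, target_text) pairs from 'id<tab>id' link lines."""
    pairs = []
    for line in link_lines:
        ids = line.split()
        if len(ids) != 2:
            continue  # skip malformed lines
        src = sentences.get(ids[0])  # dictionary lookup instead of a list scan
        tgt = sentences.get(ids[1])
        if src and tgt and src[0] == source_lang and tgt[0] == target_lang:
            pairs.append((src[1], tgt[1]))
    return pairs


# Tiny in-memory demo; real input would be the downloaded export files.
demo_sentences = [
    "1\tcmn\t我没有电脑。\n",
    "2\teng\tI don't have a computer.\n",
    "3\tdeu\tIch habe keinen Computer.\n",
]
demo_links = ["1\t2\n", "2\t1\n", "1\t3\n"]
loaded = load_sentences(demo_sentences, {"cmn", "deu"})
demo_pairs = pair_translations(loaded, demo_links, "cmn", "deu")
```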

I believe that with Pleco 4.0, where there will be custom fields in Flashcards, it should definitely be possible to continue using your flashcards. It should perhaps work even better, since the full-width numeric characters or the Latin characters in the pinyin field will not disappear or change anymore.

Best regards,

Shun
 
#44
Hi Shun,

Thanks once again. Yes, I was also very happy to see that much higher sentence utilization could be achieved for the Chinese-German flashcards! I hope your students will enjoy them and find them of use to their studies.

Please find below an overview of the flashcard generation process, including a description of the "indexing" process:
Hope it is useful:)

Step 1: Create a list of words, ranked by frequency of usage, that have an entry in at least one of the Pleco dictionaries I have installed/purchased, together with their pinyin (with the tone information removed) for matching against the pinyin in the sentence lists.
(Only using words that have an entry in one of my Pleco dictionaries means that I can look up in Pleco the meanings of all the words that are tested.)

a. Obtain the corpora:

BCC (BLCU Chinese Corpus)
https://plecoforums.com/threads/wor...haracter-corpus-bcc-blcu-chinese-corpus.5859/

LWC (Leiden Weibo Corpus)
http://lwc.daanvanesch.nl/openaccess.php

SUBTLEX (Movie Subtitles)
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-ch

Next, separately for each of the above corpora:

b. Import the corpus into Pleco as flashcards with the import settings such that only words that have at least one Pleco dictionary entry create a flashcard.

c. Export the created flashcards to a text file, including the pinyin information generated by Pleco.

d. Import both the original corpus and the text file created in c. into Excel.

e. Add the ranking information taken from the original list to the list of words that contain Pleco dictionary entries.

Next, combine all of the corpora into one big list:

f. Since the BCC corpus is much bigger, and therefore more authoritative, than any of the other corpora, throw away any words in non-BCC corpora that also have an entry in the BCC corpus.

g. Merge all the corpora keeping a tag in the combined list indicating the source corpus and the ranking in that corpus.

h. Create a new "PINYIN_NO_TONES" column to serve as the search key for matching pinyin in sentences: take the pinyin supplied by Pleco and remove the tones. A space is then added before and after, so that a match only occurs if the word's pinyin in the sentence is also preceded and followed by a space character.
e.g., pinyin for 电脑 from Pleco is dian4nao3. Remove the tones and add spaces and we have the pinyin search key " diannao ".
(The reason for removing the tones is that we can match words in short sentences reasonably reliably without them. If we kept them, we would have to deal with the fact that, for words that have variants containing silent tones, the Pleco dictionary(s) used to source a word's pinyin and the sentence lists are unlikely to agree universally on the correct pinyin.)
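Steps f to h can be sketched in a few lines of Python. This is only an illustration with invented names: it assumes each corpus has already been reduced to a list of words in rank order, and it resolves overlaps between the non-BCC corpora by simply keeping whichever corpus is listed first, which is a choice the description above does not cover.

```python
import re


def pinyin_search_key(pinyin: str) -> str:
    """Strip tone numbers and flank the result with spaces,
    e.g. 'dian4nao3' becomes ' diannao '."""
    toneless = re.sub(r"[1-5]", "", pinyin).lower().strip()
    return f" {toneless} "


def merge_corpora(corpora):
    """Merge ranked word lists into {word: (corpus_name, rank)}.

    corpora: list of (name, ranked_words), highest priority first, so
    listing BCC first makes its entries win over the other corpora.
    """
    merged = {}
    for name, ranked_words in corpora:
        for rank, word in enumerate(ranked_words, start=1):
            merged.setdefault(word, (name, rank))  # keep the first entry seen
    return merged


merged_words = merge_corpora([
    ("BCC", ["的", "电脑"]),
    ("LWC", ["电脑", "微博"]),
])
```

In this toy example 电脑 keeps its BCC tag and rank, while 微博, which is absent from BCC, is tagged with LWC.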

Step 2: Create the list of Sentences

a. Import the sentences (Chinese, Pinyin, Definition) into an Excel sheet.

b. Convert any full-width punctuation characters and numerals in the sentences to equivalent half-width characters. Remove any sentences that contain alphabetic characters.

c. Create a new "PINYIN_NO_TONES" column that contains the pinyin information for the sentence with all the tones removed and a space added at the start and at the end. Also remove any residual punctuation symbols that may be mixed in with the pinyin.
e.g. If the pinyin for 我没有电脑. in the sentence list is "wo3 mei2 you3 dian4nao3", the "PINYIN_NO_TONES" column will become " wo mei you diannao ".

d. Order the sentence list by the number of Chinese characters in the sentence (so that shorter sentences are preferred over longer ones when generating the flashcards)

e. Remove any sentences longer than 35 Chinese characters. (The main goal of the flashcards is to provide enough context to know which word is being tested. If the sentence is too long, reading it takes too much time.)
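A possible Python version of Step 2 is sketched below. It makes a few assumptions that go beyond the description: Unicode NFKC normalisation stands in for the full-width to half-width conversion, "Chinese character" is approximated by the basic CJK block U+4E00..U+9FFF, and all names are hypothetical.

```python
import re
import unicodedata

MAX_HANZI = 35  # sentences with more Chinese characters than this are dropped


def count_hanzi(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def sentence_key(pinyin: str) -> str:
    """Tone-less, punctuation-free pinyin flanked by single spaces."""
    text = re.sub(r"[1-5]", "", pinyin.lower())  # remove tone numbers
    text = re.sub(r"[^a-zü ]", " ", text)        # remove residual punctuation
    return " " + " ".join(text.split()) + " "


def prepare_sentences(rows):
    """rows: iterable of (hanzi, pinyin, definition) tuples.
    Returns (hanzi, PINYIN_NO_TONES, definition) tuples, shortest first."""
    prepared = []
    for hanzi, pinyin, definition in rows:
        hanzi = unicodedata.normalize("NFKC", hanzi)  # full-width -> half-width
        if re.search(r"[A-Za-z]", hanzi):
            continue  # contains alphabetic characters
        if count_hanzi(hanzi) > MAX_HANZI:
            continue  # too long to read quickly
        prepared.append((hanzi, sentence_key(pinyin), definition))
    prepared.sort(key=lambda row: count_hanzi(row[0]))
    return prepared


demo_rows = [
    ("我用iPhone。", "wo3 yong4 iPhone", "I use an iPhone."),
    ("我没有电脑。", "wo3 mei2 you3 dian4nao3", "I don't have a computer."),
    ("你好\uff01", "ni3 hao3", "Hello!"),
]
demo_prepared = prepare_sentences(demo_rows)
```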

Step 3: For each word create a list of pointers (row numbers in the sentence list) to all the sentences that contain that word using the MATCH function in Excel.

a. Find the first sentence match with:

{=IFNA(MATCH(N2&O2,sent!B:B&sent!F:F,0),-1)}
where, for the word entry 电脑, the N column is "*电脑*" and the O column is "* diannao *";
the F column in the sentence list holds the sentence pinyin without tones, and the B column holds the sentence in Chinese characters.
This formula will find the first sentence that contains BOTH the Chinese characters and the pinyin (flanked by spaces) of a word.

b. Find the remaining sentence matches with the following formula, which is used in the 30~60 columns that follow (the formula above is in the AX column):
{=IF(AX2=-1,-1,IFNA(AX2+MATCH($N2&$O2,INDIRECT(CONCATENATE("sent!B$",AX2+1,":B99999"))&INDIRECT(CONCATENATE("sent!F$",AX2+1,":F99999")),0),-1))}
This formula looks a bit complicated, but all it does is search for the next match, starting from the row in the sentence list directly after the previously matched sentence. When there are no more matches, the result is -1.

This is the so-called "indexing" operation I mentioned in my earlier posts. Since there are over 200,000 words in my word list, it can take several hours for Excel to crunch through finding tens of matching sentences for each word.
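The same "indexing" idea could look roughly like this in Python. This is a sketch rather than a port of the Excel formulas: it first builds a lookup from each Chinese character to the sentence rows containing it (an addition of this sketch, to avoid scanning every sentence for every word) and then applies the same test as the formulas, namely that a sentence must contain BOTH the word's characters and its space-flanked pinyin key. All names are hypothetical.

```python
from collections import defaultdict


def build_char_index(sentences):
    """Map each character to the rows of the sentences containing it.
    sentences: list of (hanzi, pinyin_key, definition) tuples."""
    index = defaultdict(list)
    for row, (hanzi, _key, _definition) in enumerate(sentences):
        for ch in set(hanzi):
            index[ch].append(row)
    return index


def find_matches(word, word_key, sentences, char_index, max_matches=30):
    """Return up to max_matches row numbers, in sentence-list order, of
    sentences containing both the word and its pinyin search key."""
    matches = []
    if not word:
        return matches
    for row in char_index.get(word[0], []):
        hanzi, key, _definition = sentences[row]
        if word in hanzi and word_key in key:
            matches.append(row)
            if len(matches) == max_matches:
                break
    return matches


demo_list = [
    ("你好!", " ni hao ", "Hello!"),
    ("电脑很贵。", " diannao hen gui ", "Computers are expensive."),
    ("我没有电脑。", " wo mei you diannao ", "I don't have a computer."),
]
demo_index = build_char_index(demo_list)
demo_matches = find_matches("电脑", " diannao ", demo_list, demo_index)
```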

Step 4: Create the Flashcards using a VBA script.

Basically, the VBA script makes multiple user-defined passes through the word lists and, in each pass, creates a flashcard for each word that has at least one matching sentence (found in Step 3 above) and that also meets certain other conditions, e.g.:
a. Multiple flashcards for the same word and same sentence are prohibited.

b. The sentence match chosen for a word is the one that currently has the lowest number of flashcards (i.e., sentences that have already been used to create flashcards for other words are less preferred).

c. The word ranking is less than a certain ranking e.g., less than 160000 for a BCC corpus entry and less than 100000 for a non-BCC corpus entry (Reason for this: The BCC corpus is much larger than the other corpora used so we can allow lower ranked words from this corpus with less risk of the word getting a "lucky" high ranking due to the smaller sample sizes that smaller corpora have)

d. Allow a maximum of X (e.g., 5) flashcards for each non-HSK word and a maximum of Y (e.g., HSK Grade * 3 + 2) flashcards for each HSK word.

e. The "pinyin" field in the flashcard is generated by replacing any occurrences of the word's Chinese characters in the matched sentence with the word's pinyin (including the tonal information).

f. Create a flashcard (row in the "Dict" sheet) which contains the word as the first column, the generated "pinyin" above in the second column and the Definition (i.e. the English or other language translation of the sentence) in the third column.
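As a rough illustration of rules a to f, here is a Python sketch. It is not the VBA script: the word records are plain dictionaries with invented keys, every pass applies the same limits (the real passes are user-defined), and the thresholds are just the example values mentioned above.

```python
from collections import Counter


def max_cards(hsk_grade, non_hsk_cap=5):
    """Rule d: X cards for non-HSK words, HSK grade * 3 + 2 for HSK words."""
    return non_hsk_cap if hsk_grade is None else hsk_grade * 3 + 2


def rank_ok(corpus, rank, bcc_limit=160_000, other_limit=100_000):
    """Rule c: BCC entries may be ranked lower than non-BCC entries."""
    return rank < (bcc_limit if corpus == "BCC" else other_limit)


def generate_cards(words, sentences, passes=3):
    """words: dicts with keys hanzi, pinyin, corpus, rank, hsk, matches.
    sentences: list of (hanzi, pinyin_key, definition) tuples."""
    usage = Counter()     # flashcards created so far per sentence row
    per_word = Counter()  # flashcards created so far per word
    used_pairs = set()    # rule a: no duplicate (word, sentence) cards
    cards = []
    for _ in range(passes):
        for w in words:
            if not rank_ok(w["corpus"], w["rank"]):
                continue
            if per_word[w["hanzi"]] >= max_cards(w["hsk"]):
                continue
            candidates = [r for r in w["matches"]
                          if (w["hanzi"], r) not in used_pairs]
            if not candidates:
                continue
            row = min(candidates, key=lambda r: usage[r])  # rule b
            hanzi, _key, definition = sentences[row]
            pinyin_field = hanzi.replace(w["hanzi"], w["pinyin"])  # rule e
            cards.append((w["hanzi"], pinyin_field, definition))   # rule f
            used_pairs.add((w["hanzi"], row))
            usage[row] += 1
            per_word[w["hanzi"]] += 1
    return cards


demo_sents = [
    ("我没有电脑。", " wo meiyou diannao ", "I don't have a computer."),
    ("电脑很贵。", " diannao hen gui ", "Computers are expensive."),
]
demo_words = [
    {"hanzi": "电脑", "pinyin": "dian4nao3", "corpus": "BCC", "rank": 500,
     "hsk": 1, "matches": [0, 1]},
    {"hanzi": "没有", "pinyin": "mei2you3", "corpus": "BCC", "rank": 100,
     "hsk": 1, "matches": [0]},
    {"hanzi": "很", "pinyin": "hen3", "corpus": "LWC", "rank": 150_000,
     "hsk": None, "matches": [1]},
]
demo_cards = generate_cards(demo_words, demo_sents)
```

In this toy run the third word is skipped because its non-BCC ranking is above the limit, and the second card for 电脑 goes to the sentence that has not been used yet.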

Step 5: Create the flashcard file
a. Just copy all of the "Dict" entries created in Step 4 above into a text file which can be imported into Pleco as flashcards.
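Writing the file could be as simple as the sketch below. It assumes a UTF-8 text file with one flashcard per line and the three columns separated by tabs; please check this against your own Pleco import settings. The helper names are made up, and the demo writes to an in-memory buffer instead of a real file.

```python
import io


def card_line(headword, pinyin_field, definition):
    """Join the three columns with tabs, flattening any stray tabs or
    line breaks inside a field so that each card stays on one line."""
    fields = (headword, pinyin_field, definition)
    cleaned = [f.replace("\t", " ").replace("\n", " ") for f in fields]
    return "\t".join(cleaned) + "\n"


def write_cards(cards, fh):
    """Write (headword, pinyin_field, definition) tuples to an open text file."""
    for card in cards:
        fh.write(card_line(*card))


buffer = io.StringIO()
write_cards([("电脑", "我没有dian4nao3。", "I don't have a computer.")], buffer)
```

For a real export, `open("flashcards.txt", "w", encoding="utf-8")` (file name chosen arbitrarily) can be passed in place of the buffer.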
 
#45
Hi leguan,

incredible work, thank you! I will look through it later and then dispense more praise. You certainly know how to utilize Excel's/VBA's power. I really thought it's a good idea that we add this information, it's a bit like the References at the end of an academic paper. ;) I hope you agree.

Best,

Shun
 
#46
Hi leguan,

so I've read through your procedure and am quite impressed. In the places where you had to make choices (on sentence length, for example), I thought they were very sensible. I wonder how much faster the same process would run if it were done entirely in C++, Swift, or Python. It would probably be more work to program, but I would expect it to take maybe 10 minutes at the most to execute. Since you were working with text and matching inside lists, it would surely all be doable. I could try it as a side project once, based on your information above. It's a good way to learn to program something bigger. But for now, I'll enjoy the freshly made flashcards. :)

Thanks again and have a nice rest of the weekend,

Shun
 
#48
Hi leguan,

I believe that with Pleco 4.0, where there will be custom fields in Flashcards, it should definitely be possible to continue using your flashcards. It should perhaps work even better, since the full-width numeric characters or the Latin characters in the pinyin field will not disappear or change anymore.

Best regards,

Shun
Wow! That is great news, isn't it!
I guess we can rest assured that these flashcards will not go to waste once Pleco 4.0 arrives:D:D:D

Not only that, but we can also look forward to these flashcards no longer being just a "hack" of Pleco, but can be a fully-fledged learning solution without any of the currently imposed limitations!:)
 
#49
Wow! That is great news, isn't it!
I guess we can rest assured that these flashcards will not go to waste once Pleco 4.0 arrives:D:D:D

Not only that, but we can also look forward to these flashcards no longer being just a "hack" of Pleco, but can be a fully-fledged learning solution without any of the currently imposed limitations!:)
Oh yes, we can be almost sure of that! I'll tell you what the open-minded, unbiased students think about the sentence contextual system. (I'll have to coax them and then wait for them to have tried it out first.)

After some studying, I found I also like to do a Show Definition test of the same cards I tested before, to try to come up with the entire sentence I had just seen a few hours earlier.
 

mikelove

皇帝
Staff member
#50
Yes, 4.0 allows unlimited arbitrary custom fields. In fact we even allow arbitrary indexing of those fields - can run each field through an arbitrary sequence of regex replacements / character transliterations before it's indexed *and* even install your own custom ICU collation (=result sort) sequence; if you're a fan of Austrian phone book sort order and you want 'ä' to be sorted after 'az', you can load your own collator and get that.

(but only in user dictionaries - our own German/French/etc dictionaries will support the most common collation systems for dictionaries in those languages, rather than just treating them like English, but they won't allow you to pick your own)
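To make the two ideas concrete (running a field through a sequence of regex replacements before indexing, and sorting with a custom order), here is a plain-Python illustration. It does not use ICU or any Pleco API; the replacement list, the function names and the sort-key trick are all assumptions made only for this example.

```python
import re

# Hypothetical replacement sequence applied to a field before it is indexed.
REPLACEMENTS = [
    (r"[1-5]", ""),  # drop tone numbers
    (r"\s+", " "),   # collapse runs of whitespace
]


def index_form(value: str) -> str:
    """Apply each regex replacement in order, then trim and lowercase."""
    for pattern, repl in REPLACEMENTS:
        value = re.sub(pattern, repl, value)
    return value.strip().lower()


def phone_book_key(word: str) -> str:
    """Sort key that places 'ä' after 'az': treat it as 'a' followed by
    a character that sorts after every ordinary letter."""
    return word.lower().replace("ä", "a\uffff")


demo_sort_words = ["az", "ä", "b", "a"]
default_order = sorted(demo_sort_words)                     # by code point
custom_order = sorted(demo_sort_words, key=phone_book_key)  # 'ä' after 'az'
```

With the default code-point order, 'ä' lands after 'b'; with the custom key it lands between 'az' and 'b'.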
 
#51
Thanks, that's quite remarkable! Of course, if you have custom fields in flashcards and the flashcards are all visible in the dictionary search, then they also need to be indexed and sorted.
 
#52
Hi leguan,

I demonstrated your Chinese-English and Chinese-German contextual flashcards to some interested students today in the first part of the course, and I could tell they liked it, or at least I made them curious! :) Next week I will get even more reaction.

Best regards,

Shun
 
#53
Hi leguan,

with the second part of the course having passed, I can say the students were definitely rather excited about the contextual sentence flashcards and about the potential of Pleco in general (both in the current version 3.2 and in what's coming in version 4.0). All these new things can be overwhelming to new users, so I can understand that they can't tell me exactly what they think yet. Anyhow, it was fun, and I am very thankful for your innovative contextual sentence flashcards system. It really holds great value, and it also served to beef up my presentation! :)

I've already started trying to replicate your Excel/VBA procedure in Python, which is a good exercise for me. It will be a while, but you would surely be pleased to see a Python package doing the same thing. I will put it on GitHub when it's done.

Best regards,

Shun
 
#54
Hi Shun,
Thank you very much for your update!
I'm very pleased to hear that your students were excited about the sentence contextual flashcards and that your presentation went well overall! Best of luck with your "pythonization" efforts too!
Kind regards,
leguan
 