Extracting words for Pleco flashcards // best cutter // etc

Alca

Member
Hi everyone,

I was wondering which technique you use to extract flashcards and what works best for you. The sensitive part is the cutting (segmentation) of the words.

I am using ckiptagger (https://github.com/ckiplab/ckiptagger), but there are also jieba and NLTK for Chinese.

While reading this forum, I also found https://www.chinesetextanalyser.com/features, which I haven't tried. Is it any good? I also read that Pleco might implement this functionality; is that still planned?

Anyway, I am now extracting words from the subtitles of a Taiwanese series. The word list is a bit messy, but working on the vocab before watching definitely helps.
 

Shun

状元
Hi Alca,

since all word segmenters are somewhat flawed, I suggest you download the CC-CEDICT dictionary (it's open source and has over 100,000 vocabulary items) and write a Python script that searches the subtitle files for every word in CC-CEDICT. Then you can be assured that each word is properly segmented and reasonably common. If you are more of a beginner, you could also just use the 5,000 New HSK vocabulary items and search the subtitle text for the HSK words.
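
A minimal sketch of that search step, assuming a plain-text word list with one word per line (the file names here are hypothetical):

Python:
# Hypothetical inputs: a word list (one word per line) and the raw subtitles
with open("hsk_words.txt", encoding="utf-8") as f:
    word_list = {line.strip() for line in f if line.strip()}

with open("subtitles.txt", encoding="utf-8") as f:
    subtitles = f.read()

# Keep only the list words that actually occur in the subtitles
found = sorted(w for w in word_list if w in subtitles)
print(f"{len(found)} of {len(word_list)} words appear in the subtitles")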

Once you have done that, you can create the flashcard import text file using the Python script, perhaps even including some context in the definition field. If you like, you could post the subtitle file(s) here or send them to me as a private message, and I could give it a shot.

I think Chinese Text Analyser is a reliable tool; I suggest you install it and try its free 14-day demo mode. But you shouldn't need it for the word-harvesting strategy I proposed above.

Cheers,

Shun
 

Alca

Member
Hi Shun,

Thanks for your answer :)

My level is around HSK 5, probably a bit more but not quite HSK 6 (that's a guess, though). I am also trying to get away from the HSK lists and graded texts, since it is usually the same type of discourse. I also don't feel that studying HSK 6 vocab really helps with novels or series.

ckiptagger isn't too bad, but the results are indeed sometimes a bit funky. Their GitHub page claims 96% accuracy, which should be good enough. In the meantime, I tried Chinese Text Analyser, but I haven't compared the results yet. One advantage of ckiptagger is that it also extracts names, with the possibility of making a list of the characters (which I haven't done yet either).
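
For reference, pulling the names out with ckiptagger might look something like this; the pipeline and the sample sentence follow its README, and "./data" is the model directory you download separately:

Python:
from ckiptagger import WS, POS, NER

# Load the segmenter, POS tagger, and named-entity recognizer
ws, pos, ner = WS("./data"), POS("./data"), NER("./data")

sentences = ["傅達仁今將執行安樂死。"]   # your subtitle lines go here
words = ws(sentences)
tags = pos(words)
entities = ner(words, tags)

# Each entity is a (start, end, label, text) tuple; keep the PERSON ones
names = {text for sent in entities
         for (_, _, label, text) in sent if label == "PERSON"}
print(names)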

Do you have any example code for the CC-CEDICT method? Especially with context; I would be very curious to know how to do that.

What makes you think the CC-CEDICT method would be better than a cutter?
 

Shun

状元
Hi Alca,

you're welcome! :)

You're right, CC-CEDICT probably won't include a lot of personal names, so for guesses on those, a segmenter would be better. But it does have all the family names and marks them as such. The selection of Chinese personal names is endless anyway, because, as you know, they are so often built from real words, unlike first names in a Western language. It's clear that segmenters can't reach 100% accuracy, because quite often pragmatic information is needed to make the right segmenting choice, and only humans have that ability right now. Let me give you some pseudocode for how the Python script could work:
  1. Read in the CC-CEDICT (into a Python dictionary with all the keywords, pinyin, and accompanying definition)
  2. For every vocabulary item in CC-CEDICT, check if it is in any of the sentences in the subtitles file. If there is a match, then:
    1. Add the vocabulary item to the output vocabulary list (another Python dictionary with all the vocabulary to be studied)
    2. Add the context (i.e., the whole sentence in the subtitles file where it found the word, for example, with the word itself blacked out) to the definition field of the vocabulary item in the output
  3. Write the output vocabulary list to a text file in the Pleco flashcard import format. You could include the CC-CEDICT definition or another dictionary's definition; in the latter case, you would also have to import your vocabulary into a user dictionary, so that you could still see the context there whenever you wish. You could then use any other installed dictionary for the flashcard definition and place your user dictionary next in order, for example.
If you opt to use a segmenter instead, you would probably proceed as follows (see the sketch after the list):
  1. Run the segmenter over your subtitle files
  2. Create a Pleco flashcard import text file from the segmented words
  3. Tell Pleco only to import those words that exist in any of your dictionaries
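A minimal sketch of steps 1 and 2 with jieba, for example (the file names are hypothetical; Pleco accepts a plain one-word-per-line list and can fill in the definitions on import):

Python:
import jieba  # one of the segmenters mentioned above

with open("subtitles.txt", encoding="utf-8") as f:
    text = f.read()

# Collect the unique segments, dropping whitespace-only tokens
words = {w.strip() for w in jieba.cut(text) if w.strip()}

# One word per line is enough for a Pleco flashcard import file
with open("flashcards.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(words)))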
Besides making it easy to include the context, another advantage I see in the CC-CEDICT approach is that it can give you overlapping matches of different variants/combinations of characters, which a segmenter can't (it can only make one segmenting choice, and that's what you have to go with). For example, if you have 公共汽车, the CC-CEDICT approach will find the entire word, the two components 公共 and 汽车, and the individual characters separately; before importing into Pleco, you can always choose which combinations to keep and which to discard. Overall, I feel the CC-CEDICT approach gives me the most control over the process.
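
Here is a toy illustration of those overlapping matches, with a mini-lexicon standing in for CC-CEDICT:

Python:
# Mini-lexicon standing in for CC-CEDICT
lexicon = {"公共汽车", "公共", "汽车", "公", "共", "汽", "车"}

sentence = "我每天坐公共汽车上班"

# Every substring of the sentence that is a lexicon entry, so overlapping
# candidates like 公共汽车, 公共, and 汽车 all surface
matches = {sentence[i:j]
           for i in range(len(sentence))
           for j in range(i + 1, len(sentence) + 1)
           if sentence[i:j] in lexicon}
print(sorted(matches, key=len, reverse=True))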

Does that sound convincing? You could, of course, try both possibilities, CC-CEDICT and a segmenter. Tell me if you'd like me to write the actual Python code. If at all possible, though, I encourage you to try it yourself: it's very easy to get good results in Python, and you can google anything language-related you don't know yet.

Cheers,

Shun
 

Alca

Member
Hm, I don't understand how you get CC-CEDICT into a Python dictionary with all the entries, as Python dictionaries are limited to tuples? Or do you use JSON or XML? How do you do that?
Then, how do you add the context? With a regex based on the match? I think I need a little more help to get started with this method.


I changed the series from 俗女养成记, which has a lot of Taiyu and for which I only had simplified subtitles, to 谁是被害者, with subs in traditional. I also cleaned the subs properly before running the segmenter, and it worked much, much better. After segmenting, I just make a dictionary from the word list, remove HSK words up to level 4 (HSK 5 can be a refresher) by subtracting one Python dictionary from another, and then, with the same method, I further remove some easy words that aren't on the HSK list. The end result is a simple list of Chinese words, one per line, in a txt file. This time the result feels very close to the claimed 96%.
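
That subtraction step can be as simple as a dict comprehension or a set difference (toy data, hypothetical names):

Python:
segmented_words = {"被害者": 12, "线索": 7, "学习": 3}   # word -> frequency
hsk_1_to_4 = {"学习"}                                    # easy words to drop

remaining = {w: n for w, n in segmented_words.items() if w not in hsk_1_to_4}
# Dict key views also support set subtraction:
# segmented_words.keys() - hsk_1_to_4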

Python is fun :) I'm still amazed this is possible. And thanks!
 

Shun

状元
In the Python dictionary for CC-CEDICT, you can use the traditional or simplified Chinese headword as the key, and a tuple of the pinyin and the definition you'd like to use as the value. You could also use a list of tuples as the value, in case the same headword appears with different pinyin somewhere.
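
A minimal sketch of that loading step, assuming the standard CC-CEDICT text format (one entry per line: traditional, simplified, [pinyin], /definitions/):

Python:
import re

# Each CC-CEDICT line looks like: 傳統 传统 [chuan2 tong3] /tradition/traditional/
entry_re = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

CEDICT_dict = {}   # simplified headword -> list of (pinyin, definition) tuples
with open("cedict_ts.u8", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#"):          # skip the comment header
            continue
        m = entry_re.match(line.strip())
        if m:
            trad, simp, pinyin, defs = m.groups()
            CEDICT_dict.setdefault(simp, []).append((pinyin, defs.replace("/", "; ")))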

You could just add the context to the definition as you go through the entire subtitle text in the Python script. There is no need for a regex for that: you just have a long list of strings with your subtitles, then you iterate through that list and check:

Python:
# CEDICT_dict: headword -> list of (pinyin, definition) tuples, as above
# subtitle_list: the subtitle lines as a list of strings
definition = {}   # headword -> collected context sentences

for word in CEDICT_dict:              # every CC-CEDICT headword
    for string in subtitle_list:      # every subtitle line
        if word in string:
            # Append the sentence with the word itself blacked out
            context = string.replace(word, "***") + '\n\n'
            if word not in definition:
                definition[word] = context
            else:
                definition[word] += context

That appends each sentence, with the word itself blacked out, to the end of the definition. At the end of the loop, you would add each word, its pinyin, and the definition you want to an output list, which you finally write out to a flashcard import file. Of course, there are many ways you could design the loop and the Python dictionaries; the result would be the same.
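
The write-out step might then look like this; the tab-separated headword/pinyin/definition layout is Pleco's flashcard import format, the file name is hypothetical, and the dictionary layout follows the sketch above:

Python:
# Write a Pleco-importable file: headword <tab> pinyin <tab> definition,
# with the collected context folded into the definition field
with open("flashcards.txt", "w", encoding="utf-8") as out:
    for word, context in definition.items():
        pinyin, meaning = CEDICT_dict[word][0]        # first reading
        flat_context = context.strip().replace("\n\n", " / ")
        out.write(f"{word}\t{pinyin}\t{meaning} || {flat_context}\n")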

Good to hear you already have quite some Python experience. Subtracting vocab sounds good. I've posted quite a few other Python scripts on these forums, so you could also check those out if you need more inspiration. I agree, Python is fun.

You're welcome,

Shun
 

hoshi

Member
Although the drama itself has a good reputation, when it comes to harvesting Min dialect words it was very disappointing; there was almost no value to gain.
 

Attachments

  • making_of.txt
    31.4 KB

Alca

Member

Wow! This is awesome. Thank you. It's like once I learned how to do it, I found out everything was already there xD

Anyway, here is my take on it: https://github.com/Alqua/subtitles-flashcard-for-pleco
(I can get the vocab episode by episode with it, plus names and places.)
I sweated a lot writing that :p It was a good exercise, but I'll wait a bit before writing anything else.

@hoshi In the end, I watched the drama without studying the vocab :p The vocab is not particularly difficult, but there are a lot of common Taiwanese words mixed with a Taiwanese accent. Nothing too complicated in terms of vocabulary, and it's easy to follow even if you miss a few words. It is a good drama, though; I liked it.
 