Hi Alca,
You're welcome!
You're right, the CC-CEDICT probably won't include many personal names, so for guessing those, a segmenter would do better. It does, however, contain all the family names and marks them as such. The range of Chinese personal names is endless anyway, because, as you know, they are so often built from ordinary words, unlike first names in Western languages. Segmenters clearly can't reach 100% accuracy, because quite often pragmatic information is needed to make the right segmentation choice, and for now only humans have that ability. Here is some pseudocode for how the Python script could work (a runnable sketch follows the list):
- Read the CC-CEDICT into a Python dictionary mapping each headword to its pinyin and definition
- For every vocabulary item in CC-CEDICT, check if it is in any of the sentences in the subtitles file. If there is a match, then:
- Add the vocabulary item to the output vocabulary list (another Python dictionary with all the vocabulary to be studied)
- Add the context to the definition field of the vocabulary item in the output, i.e. the whole subtitle sentence in which the word was found, ideally with the word itself blanked out
- Write the output vocabulary list to a text file in Pleco's flashcard import format. You could include the CC-CEDICT definition or use another dictionary's; in that case, you would also import your vocabulary into a Pleco user dictionary, so that you can still see the context there whenever you wish. You could then use any other installed dictionary for the flashcard definition and place your user dictionary next in the lookup order, for example.
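Something like this minimal sketch could already do the job. I'm assuming the plain-text CC-CEDICT release (cedict_ts.u8), a subtitles file already reduced to one plain sentence per line, and tab-separated headword/pinyin/definition lines as the Pleco import layout, so please double-check that last point against Pleco's import documentation:

```python
import re

def load_cedict(path):
    """Parse CC-CEDICT into {simplified headword: (pinyin, definition)}."""
    line_re = re.compile(r'^(\S+) (\S+) \[([^\]]+)\] /(.+)/$')
    entries = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.startswith('#'):
                continue  # skip the header/license comments
            m = line_re.match(line.strip())
            if m:
                trad, simp, pinyin, defs = m.groups()
                entries[simp] = (pinyin, defs.replace('/', '; '))
    return entries

def build_vocab(cedict, subtitle_path):
    """Collect every headword that occurs in a subtitle sentence,
    keeping that sentence (word blanked out) as context."""
    with open(subtitle_path, encoding='utf-8') as f:
        sentences = [line.strip() for line in f if line.strip()]
    vocab = {}
    for word, (pinyin, definition) in cedict.items():
        for sentence in sentences:
            if word in sentence:
                context = sentence.replace(word, '____')
                vocab[word] = (pinyin, f'{definition} | Context: {context}')
                break  # the first matching sentence is enough
    return vocab

def write_pleco(vocab, out_path):
    """One tab-separated line per card: headword, pinyin, definition."""
    with open(out_path, 'w', encoding='utf-8') as f:
        for word, (pinyin, definition) in vocab.items():
            f.write(f'{word}\t{pinyin}\t{definition}\n')

cedict = load_cedict('cedict_ts.u8')
vocab = build_vocab(cedict, 'subtitles.txt')
write_pleco(vocab, 'pleco_flashcards.txt')
```

Scanning all ~120,000 CC-CEDICT entries against every sentence is brute force, but for a subtitles file it finishes quickly enough, and it is exactly what produces the overlapping matches I mention below.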
If you opt for a segmenter instead, you would probably proceed as follows (a sketch follows this list):
- Run the segmenter over your subtitle files
- Create a Pleco flashcard import text file from the segmented words
- Tell Pleco to import only those words that exist in any of your dictionaries
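For illustration, here is a sketch of that route. I'm assuming the jieba segmenter (pip install jieba), but any segmenter with a similar API would do; the file names are arbitrary:

```python
import jieba

def segment_to_pleco(subtitle_path, out_path):
    """Write each unique segmented word on its own line; Pleco can then
    look up pinyin/definitions itself and skip unknown words on import."""
    seen = set()
    with open(subtitle_path, encoding='utf-8') as f, \
         open(out_path, 'w', encoding='utf-8') as out:
        for line in f:
            for word in jieba.cut(line.strip()):
                # keep only tokens containing at least one Chinese character
                if word not in seen and any('\u4e00' <= c <= '\u9fff' for c in word):
                    seen.add(word)
                    out.write(word + '\n')

segment_to_pleco('subtitles.txt', 'pleco_flashcards.txt')
```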
Besides making it easy to include the context, another advantage I see in the CC-CEDICT approach is that it can give you overlapping definitions of different variants/combinations of characters, which a segmenter can't: it makes exactly one segmentation choice, and that's what you have to go with. For example, given 公共汽车, the CC-CEDICT approach will find the entire word, its two components 公共 and 汽车, and the individual characters as well, so before importing into Pleco you can always choose which combinations to keep and which to discard. Overall, I feel the CC-CEDICT approach gives me the most control over the process.
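To make the difference concrete, here is the overlap in miniature (the entries are a toy stand-in, not real CC-CEDICT definitions):

```python
# Toy stand-in for CC-CEDICT to illustrate overlapping matches.
toy_cedict = {'公共汽车': 'bus', '公共': 'public; communal',
              '汽车': 'car; automobile', '公': 'public', '车': 'vehicle'}

sentence = '我每天坐公共汽车上班。'
matches = [word for word in toy_cedict if word in sentence]
print(matches)  # ['公共汽车', '公共', '汽车', '公', '车']
```

A segmenter, by contrast, would hand you only 公共汽车 for this sentence and silently drop the component readings.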
Does that sound convincing? You could of course try both routes, CC-CEDICT and a segmenter. Tell me if you'd like me to write out the full Python code. If at all possible, though, I encourage you to try it yourself: it's easy to get good results in Python, and you can google anything language-related you don't know yet.
Cheers,
Shun