Segmenting CEDICT-Tatoeba Dictionary Generator for any language

Shun

Dear all,

I reprogrammed the «CEDICT Tatoeba example sentences dictionary generator» and added word segmentation. A search for "中" no longer produces example sentences that contain 中国人; those sentences appear only when you display the dictionary definition for 中国人 itself. The script scans each sentence for words in both the forward and backward directions to maximize word discovery. (Whenever a string of Chinese characters forms both the end of one CEDICT keyword and the beginning of another, the segmentation you get depends on the direction in which you scan the sentence. Both segmentations are useful, so the script checks both.)
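To make the two scan directions concrete, here is a minimal sketch of greedy longest-match segmentation run forward and backward. It is not the code from the attached script; the function names and the tiny word sets are made up for illustration, and a real run would use the full set of CEDICT headwords.

```python
def segment_forward(sentence, words, max_len):
    """Greedy longest match, scanning left to right."""
    tokens = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_backward(sentence, words, max_len):
    """Greedy longest match, scanning right to left."""
    tokens = []
    j = len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in words:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def words_in_sentence(sentence, words):
    """Dictionary words found by either scan direction."""
    max_len = max(map(len, words), default=1)
    found = set(segment_forward(sentence, words, max_len))
    found |= set(segment_backward(sentence, words, max_len))
    return found & set(words)


# Hypothetical mini-dictionary: 生 ends 研究生 and begins 生命.
demo_words = {"研究", "研究生", "生命", "起源"}
forward = segment_forward("研究生命起源", demo_words, 3)   # ['研究生', '命', '起源']
backward = segment_backward("研究生命起源", demo_words, 3)  # ['研究', '生命', '起源']
```

With a word set containing 中, 中国 and 中国人, the sentence 我是中国人 yields 中国人 but not 中 in either direction, which is the behaviour described above.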

Tatoeba languages other than English can be selected. The CEDICT definition will remain in English. The export of the dictionary from Python takes about 15 minutes, and its import into Pleco will take quite a bit longer. Here is an example dictionary entry for French from the 118,000 current CEDICT terms:

-----

Hanzi: 宁愿[寧願]
Pinyin: ning4 yuan4
Definition:

- would rather
- better

与其让我在他手下干,我宁愿辞职。
Je préférerais démissionner que travailler sous ses ordres.

我宁愿呆在这里。
Je préférerais rester ici.

我宁愿叫一杯啤酒。
Je préfère commander une bière.

我宁愿您休一天假。
Je préférerais que vous preniez un jour de congé.

[...]

我宁愿待在这里而不去。
Je préférerais rester plutôt que de m'en aller.

我宁愿吃药,也不打针。
Je préfère prendre des médicaments plutôt que d'avoir une piqûre.

我们宁愿明天吃蜗牛。
Nous préférerions manger des escargots demain.

Definitions from CC-CEDICT, example sentences from Tatoeba.org

-----
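Choosing a different Tatoeba language comes down to pairing each Chinese sentence with its translations in that language. The sketch below shows one way to do this, assuming rows shaped like Tatoeba's exports: sentences as (id, language code, text) and links as (sentence id, translation id). The function name and the sample IDs are hypothetical, and the attached script may do this differently.

```python
from collections import defaultdict


def pair_translations(sentences, links, source_lang="cmn", target_lang="fra"):
    """Map each source-language sentence to its target-language translations."""
    by_id = {sid: (lang, text) for sid, lang, text in sentences}
    pairs = defaultdict(list)
    for sid, tid in links:
        src = by_id.get(sid)
        tgt = by_id.get(tid)
        if src is None or tgt is None:
            continue  # link points at a sentence we did not load
        if src[0] == source_lang and tgt[0] == target_lang:
            pairs[src[1]].append(tgt[1])
    return dict(pairs)


# Made-up IDs; the texts are taken from the example entry above.
sample_sentences = [
    (1, "cmn", "我宁愿呆在这里。"),
    (2, "fra", "Je préférerais rester ici."),
    (3, "eng", "I would rather stay here."),
]
sample_links = [(1, 2), (2, 1), (1, 3), (3, 1)]

french = pair_translations(sample_sentences, sample_links, "cmn", "fra")
```

Switching `target_lang` to `"eng"` or `"deu"` would select English or German translations instead, while the CEDICT definitions stay in English.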

Many of the sentences are quite colloquial, but it's nice to get a bit more context and everyday usage than most dictionaries provide. CEDICT also includes a nice selection of Internet slang and modern usage.

I attach the Python script. It includes some instructions. I uploaded the English, French, and German files here (the same link as for the version from 2 years back):


One could combine this script with an HSK level detection algorithm, so that beginners see only sentences with easier vocabulary.
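A rough sketch of what such a filter might look like, assuming the sentences are already segmented and that you have a word-to-level mapping. The `hsk_levels` table below is hypothetical, with made-up levels; none of this is part of the attached script.

```python
def is_hanzi_token(token):
    """True if the token contains at least one CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def sentence_level(tokens, hsk_levels, unknown_level=7):
    """Hardest level among the words of a sentence.

    Words missing from the table count as unknown_level (an assumption:
    anything unlisted is treated as harder than every listed level).
    Punctuation and other non-Hanzi tokens are ignored.
    """
    level = 0
    for token in tokens:
        if not is_hanzi_token(token):
            continue
        level = max(level, hsk_levels.get(token, unknown_level))
    return level


def filter_by_level(segmented, hsk_levels, max_level):
    """Keep sentences whose hardest word is at or below max_level."""
    return [
        sentence
        for sentence, tokens in segmented
        if sentence_level(tokens, hsk_levels) <= max_level
    ]


# Hypothetical levels, for illustration only.
hsk_levels = {"我": 1, "是": 1, "学生": 1, "宁愿": 5, "辞职": 5}
segmented = [
    ("我是学生。", ["我", "是", "学生", "。"]),
    ("我宁愿辞职。", ["我", "宁愿", "辞职", "。"]),
]
easy = filter_by_level(segmented, hsk_levels, max_level=2)
```

Taking the maximum level is the strictest choice; averaging the levels, or allowing one harder word per sentence, would let more sentences through.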

Enjoy,

Shun
 

Attachments

  • dict_generator_cedict_tatoeba.py.txt
    11.7 KB · Views: 170