Strategic Lesson Selection for ChinesePod and others

jiacheng

榜眼
I've come up with an idea that I'm just now starting to use and am really optimistic about. It's a way to figure out which ChinesePod lesson is the best one to study next. I've written some very rough programs (bash scripts) that make an informed decision. In a nutshell, they figure out which lesson's unlearned vocabulary is the most frequently used.

To do this, the scripts require three things:

1. An exhaustive list of your vocabulary, which is not an easy thing to get unless you use Pleco, Anki, or some other program that can track your vocabulary.

2. Chinese character & bigram frequency data, freely available online.

3. ChinesePod vocabulary lists for each lesson that you want to choose between. (this was kind of a hassle to compile, and may not be perfect)

So the script reads in your vocabulary and the frequency data, and then looks at the vocab list in each lesson. For each lesson, it figures out which words or characters you have not yet learned, then looks up all of those words/characters in the frequency table. It adds those frequencies up and divides the total by the number of new words. The results can then be sorted to determine the highest-scoring lesson, which is the lesson you should study next.
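To make the idea concrete, here is a minimal sketch of the scoring step. This is my own illustration using hypothetical one-word-per-line files, not the actual script, which handles more details:
Code:
#!/bin/bash
# freq.txt:    "<word><TAB><count>" per line (frequency table)
# vocab.txt:   one known word per line (your vocabulary)
# lesson*.txt: one word per line (lesson vocab lists)
declare -A freq known                       # needs bash >= 4
while IFS=$'\t' read -r w c _; do freq[$w]=$c; done < freq.txt
while read -r w; do known[$w]=1; done < vocab.txt
for lesson in lesson*.txt; do
    total=0 new=0
    while read -r w; do
        [[ -n ${known[$w]} ]] && continue   # skip words already learned
        total=$(( total + ${freq[$w]:-0} )) # unlearned words contribute their frequency
        new=$(( new + 1 ))
    done < "$lesson"
    (( new > 0 )) && printf '%s\t%s\t%s\n' "$(( total / new ))" "$total" "$lesson"
done | sort -n                              # highest-scoring lesson ends up on the last line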

Currently, I'm using 2 separate programs to do this for single characters and bigrams. I don't have a frequency list for words of arbitrary length, but it sure would be useful if I could find one.
 

Attachments

  • lesson-select.tar.gz
    419.3 KB · Views: 1,208
  • chinesepod-vocab.tar.gz
    550.1 KB · Views: 1,272
Jiacheng,

I applaud your efforts. It looks like you are off to a really good start with this idea.

I'd like to see more documentation, since although I am a Linux user I don't have much experience hacking shell scripts. I'd like to play around with the script but I'm not sure how to start.

I can help with the frequency list for words of arbitrary length. I've been downloading every frequency list I can find for some time now, and I merged them all together into a master list. Finally, I ran the entire list through Google and did a comparison for anomalies. You can find the list attached:


The fields are as follows: Chinese word, adjusted ranking, Google frequency count (which may not reflect current Google counts, since those are always changing), and original frequency ranking.

Where Google didn't vary much from the expected value, I left the original count alone. In the other cases, I used the Google data to moderate the rankings. See the 概要 and 笔记本电脑 entries for a couple of examples. I know there are still discrepancies in the rankings, but the list should be pretty reliable as a general guide; that is, words near the top of the list are very common and words at the bottom are rarely seen.
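For anyone wanting to use the list from a script, a minimal sketch of reading it might look like this (assuming whitespace-separated fields in the order described above):
Code:
# rankfile.txt fields: word, adjusted rank, Google count, original rank
declare -A adjrank
while read -r word adj gcount orig; do
    adjrank[$word]=$adj
done < rankfile.txt
echo "${adjrank[概要]:-not in list}"    # look up one entry's adjusted ranking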

I hope this helps. I'd like to see many more applications like yours appear.

 

Attachments

  • rankfile.zip
    2.1 MB · Views: 1,474

jiacheng

榜眼
Bao:

Thanks a bunch!

I've made a slight adjustment to my script to read in this data and I'm starting to try it out. I'm attaching the altered script.
 

Attachments

  • count-best-lesson-ngrams-eff.zip
    719 bytes · Views: 1,028
Jiacheng,

I'm glad to help. I see you've added a little documentation to the file; it's helpful to see what the command-line parameters are. What about the vocab file? Is it just a flat file with one Chinese word per line? And do you use the exact same frequency list file I uploaded, or did you modify it so the script could read it?

Can you give an example of how you would use the script? E.g. count-best-lesson-ngrams-eff char-freq.txt myvocab.txt lesson1.txt lesson2.txt lesson3.txt

 

jiacheng

榜眼
Yes, the vocab file is just the same format you'd get if you do a text export from Pleco.

Each line would basically look like this:
Code:
词汇[詞彙]        ci2hui4        n. 〈lg.〉 vocabulary; lexis; lexicon
Although for the vocabulary, I'm pretty much ignoring everything beyond the first column.

Note that when you export from Pleco, you should UNCHECK the "categories" box under the Include: section.
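For reference, one possible way to boil such an export down to bare headwords is something like the following (my own suggestion; the script may handle this differently):
Code:
# keep only the first (headword) column and drop the bracketed traditional form,
# e.g. "词汇[詞彙]<TAB>ci2hui4<TAB>..." becomes "词汇"
cut -f1 pleco-export.txt | sed 's/\[.*\]//' > myvocab.txt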

For the ngrams script, you can just pass it the exact uncompressed text file that you uploaded as the first parameter, just as in your post:

count-best-lesson-ngrams-eff rankfile.txt myvocab.txt lesson1.txt lesson2.txt lesson3.txt

The output is kind of cryptic, but it will basically look like this:
Code:
[average score]        [total score]        [lesson_filename.txt]        [newword1] ...

The output is sorted by the average score, which is basically the sum of all the frequencies divided by the total number of unlearned new words in that particular lesson. So the lessons most highly recommended by the script will be on the last lines.
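So if you only want the single top recommendation, something like this should work (same invocation as above, assuming the result lines are the last thing printed to stdout):
Code:
count-best-lesson-ngrams-eff rankfile.txt myvocab.txt lesson1.txt lesson2.txt lesson3.txt | tail -n 1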

Note that there are some issues in the script that result mostly from "儿" and "……" on word entries. I'll try to clean them up at some point, but it shouldn't cause a huge issue.
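As a stopgap until that cleanup, something like the following might help (my suggestion, not part of the script; erhua 儿 entries probably need case-by-case judgment rather than blind stripping):
Code:
# drop the "……" placeholder from lesson vocab entries before scoring
sed 's/……//g' lesson1.txt > lesson1-clean.txt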
 
Jiacheng,

Thanks for the clarification. I'm looking forward to putting the script to work.

So far I've had some problems running the script. I must have a different version of bash than you use. First, it told me:
line 13: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]

I changed it to declare -a and it got past that line ok. But it gets stuck here:
line 19: 国家: syntax error: operand expected (error token is "国家")

国家 is the first item in my vocab list. Any idea what's going on?

 
ipsi,

Thanks for the link. Here's my version info:
~$ bash -version
GNU bash, version 3.2.25(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.

It looks like my version of bash doesn't support the -A option. Any idea of a workaround? (Yes, I know I need to upgrade my OS.)
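For reference, declare -A creates an associative array with string keys and only appeared in Bash 4; declare -a creates an indexed array, so a Chinese word used as a subscript gets evaluated as arithmetic, which is exactly the "operand expected" error above. A minimal illustration of the difference:
Code:
# Bash 4+: associative array, string subscripts work
declare -A freq
freq[国家]=123456
echo "${freq[国家]}"

# Bash 3.2: only indexed arrays exist, so with 'declare -a' the subscript 国家
# is evaluated arithmetically -> "syntax error: operand expected"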

 

ipsi

状元
My suggestion would be to try installing ZSH and pointing the script to that (e.g. on Ubuntu run
Code:
sudo apt-get install zsh
, then putting, I assume,
Code:
#!/bin/zsh
at the top of the script), but I can't say whether that would result in ZSH taking over from Bash as your default shell. I would try it myself, but not on my work laptop.
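Another option, if changing the default shell is a concern, might be to leave the shebang alone and invoke the script through zsh explicitly (zsh supports associative arrays via typeset -A / declare -A, though whether the rest of the script runs unmodified under zsh would need testing):
Code:
sudo apt-get install zsh
zsh ./count-best-lesson-ngrams-eff rankfile.txt myvocab.txt lesson1.txt lesson2.txt lesson3.txt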
 
Thanks!

Progress! The script seems to be working for me now. However, I'm still not understanding the output very well. Maybe something is still going wrong.

I ran the script using the menu-stealer files since there are only 3. (Start small when testing, I say)

Here are my results:

loaded frequency tables
./count-best-lesson-ngrams-eff:shift:37: shift count must be <= $#
12580000 12580000 chinesepod_TMS0002-vocab.txt 花椒:871000
24088200 24088200 chinesepod_TMS0003-vocab.txt 下酒菜:0
34010000 34010000 chinesepod_TMS0004-vocab.txt 新疆拌面:0

The "shift count" part looks like an error.

Does the fact that there are only three words listed mean that there are only three new words for me to learn? I think there are more than that missing from my word list.

 

sfrrr

状元
Strategic lesson choice for CPod. Sounds great.
But I can't figure out what you guys are talking about (except for an occasional the or a) and I don't have any idea how to use the script. Could you translate the script, your messages, or both into slightly less techy language? Thanks.
 
Sfrrr,

I'm glad to see your interest in this undertaking. User Jiacheng wrote the above script and introduced it here and on ChinesePod. That page provides a better introduction to the script than this one, so I suggest reading it and the comments posted on it. This idea is really in its infancy right now, mostly because the script is only usable by people who have access to Bash v4 (if you don't know what that is, you don't have access to it).

Unfortunately I also don't have access to the script, since my version of Bash isn't new enough. I think it would be great to see this idea ported to a more widely-available platform.

What I like about this idea is that it moves a step closer to providing a custom-tailored learning experience. The ideal learning program incorporates strategic repetition of newly learned vocabulary (e.g. Pimsleur, Rosetta Stone). The disadvantage of a packaged product is that it is very structured and unable to contain features that appeal to everyone. On the other hand, ChinesePod has lessons with more universal appeal. The script referenced in this thread is one way to bridge this gap. In theory, a person who creates a vocabulary list using Pleco can use this script to determine which ChinesePod lessons are most appropriate for learning new vocabulary and reinforcing old vocab.

 

sfrrr

状元
Bao--thanks for the explanation. I'm about to look up bash on the web--just to find out a little about what it is. And I'm on my way to the CPod page. Thanks again.
 

mihobu

秀才
A Progressive Word List

Bao Mingguang, thank you for posting your rankfile.txt. I've been looking for a way to produce a progressive word list (attached), and this helped out a lot. I limited the output to bigrams, but still don't have a good way to weed out high-frequency non-words! The list introduces characters in frequency order, then lists words (most common first) made up only of the characters encountered so far.
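My own rough reading of that procedure, sketched in shell (slow and quadratic, purely to illustrate the idea; mihobu's actual tooling is presumably different):
Code:
# charfreq.txt: one character per line, most frequent first
# wordfreq.txt: one word per line, most frequent first
declare -A knownch listed                     # needs bash >= 4 and a UTF-8 locale
while read -r ch; do
    knownch[$ch]=1
    echo "== after learning $ch =="
    while read -r word; do
        [[ -n ${listed[$word]} ]] && continue # each word appears once, when first unlocked
        ok=1
        for (( i=0; i<${#word}; i++ )); do    # character-by-character check
            [[ -n ${knownch[${word:i:1}]} ]] || { ok=0; break; }
        done
        (( ok )) && { listed[$word]=1; echo "$word"; }
    done < wordfreq.txt
done < charfreq.txt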
 

Attachments

  • progressive-wordlist.txt
    607.1 KB · Views: 27,280
mihobu,

You're welcome. :) Do you mean you can't weed out non-words, or you couldn't before downloading my list? AFAIK all the words on my frequency list are actual words. I notice there are 23 words on your list that aren't on mine. I checked a few and found most of them in the dictionary. They don't seem to be very common ones though.

I like your word list. It's a great idea that could serve as the foundation for a very effective system of introducing Chinese vocabulary.

 

mihobu

秀才
I've continued working on this idea, using Jun Da's word frequency data and CC-CEDICT for word validation. I've expanded the list to include bigrams, trigrams, and quadrigrams also. The latest results include 2,500 characters and 24,000 multicharacter entries (words, phrases, idioms). The list (and a frequency dictionary version that includes the pinyin and English definitions) can be found on my website: http://monkeywalk.com/wordlist.
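For anyone curious how such a CC-CEDICT check might look in a script, one simple approach is shown below. This is my own sketch, not mihobu's actual method; it assumes the standard CC-CEDICT release file (cedict_ts.u8), whose second whitespace-separated field is the simplified headword:
Code:
# keep only candidate n-grams that appear as headwords in CC-CEDICT
word=下酒菜
if awk -v w="$word" '$2 == w { found=1 } END { exit !found }' cedict_ts.u8; then
    echo "$word is a CC-CEDICT headword"
fi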
 
I really appreciate the thought and effort put into this thread. I was just going to reiterate what I'd posted on another thread about the importance of knowing the source of word frequency lists. Google may not be the best search tool, particularly not for Chinese.

I had similar intentions with ChinesePod but along a different line and emailed them my suggestion:

Sent: October-13-07 10:23:43 AM
To: chinesepod@gmail.com
May I suggest a series of revision lessons every 10 or 20 lessons using vocab and grammar from preceding lessons? They could be in the form of translation exercises, with an English dialogue given at the beginning along with any new vocab (if applicable). After hearing the English, a pause is provided in which the listener has to guess at how to convey the meaning in Chinese. Following the pause, the Chinese (or variations on how it could be translated into Chinese) could be provided. It would be an empowering way for students to actively revise language learned and test their communication skills. Resulting comments on such lessons might prove extremely useful in clarifying misconceptions about how to express something.

I don't recall receiving much, if anything, in the way of a reply email from them, but I use this very method now when teaching adult beginners of Chinese. I come up with dialogues or sentences based on all the language covered and give them the Chinese to translate into English. It's a great way to test/augment their retention. It's harder to do with 'someone else's students' (when you don't know what's been studied/learned, or in other words when you don't have a detailed list of active and passive vocab/grammar).
 