Vocabulary from ChinesePod

yesacas

Member
I am a subscriber to ChinesePod (http://www.chinesepod.com), as I suspect a few other Pleco users are, and would like to use the vocabulary list from that site on my Pleco. I can get there somewhat successfully, though the process is by no means a smooth one. Basically, ChinesePod allows export of a vocabulary list into either a CSV file or an XML file. With a little bit of editing I can get the list to something that resembles the Pleco requirement of character <tab> pinyin <tab> definition, but there are some embedded codes. They look something like this, from a course on badminton:

垃圾食品 lājī shípǐn junk food
球场 qi&#250;ch&#462;ng court
前场 qi&#225;nch&#462;ng front court

It looks like some codes never get converted and are embedded in the pinyin. I guess my question is what is happening, and whether anyone has suggestions for replacing the codes and correcting the file other than re-entering the pinyin by hand. Ideally, I would like a seamless flow from ChinesePod to Pleco with little more than saving a file and uploading it with some minor editing. Any ideas how to make that happen?

Stephen
 
It's also possible to get an HTML version of the transcripts if you know where to look. If you are signed up for the RSS feed (ask google if you don't know what they are or how to use them), then take a closer look at the transcript which comes with that.

If I remember correctly, following the PDF link provided in the RSS feed will actually take you to a different version of the PDF than is available on the website. At the bottom of this PDF is a link to an HTML version of the transcript. I just copy/paste the content of these HTMLs into a text document and use that directly on my PDA.

Downloading each one individually is a major pain, but you can download the HTML transcripts en masse if you're smart about it - the URLs are all something like http://www.example.com/lessons/A_123/A_123.html, which can be downloaded all at once (from A_001 through A_999, B_001 through B_999 etc.) if you get the right download manager.
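
As a rough sketch of the bulk-download idea, the URL list can be generated from the pattern and then handed to whatever download manager you use. The base URL is just the placeholder quoted above, and the level letters and number range are assumptions that would need to be checked against the real lesson codes.

```python
from itertools import product

# Placeholder pattern copied from the post above; the real host and path
# are not known here and must be substituted.
BASE = "http://www.example.com/lessons/{code}/{code}.html"


def lesson_urls(levels=("A", "B"), first=1, last=999):
    """Yield one transcript URL per lesson code, e.g. A_001 ... B_999."""
    for level, number in product(levels, range(first, last + 1)):
        code = f"{level}_{number:03d}"  # zero-padded to three digits
        yield BASE.format(code=code)


# Show the first few generated URLs.
for url in list(lesson_urls())[:3]:
    print(url)
```

Most codes in such a range will probably not exist, so the download manager should be set to skip missing pages rather than retry them.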
 

yesacas

Member
Thanks for the speedy response. You certainly gave me a few things to think about. I just want to clarify one thing that I failed to mention in my original post. The vocabulary list that I am exporting comes from ChinesePod's flashcard list. As students study each lesson they have an option to save individually specified words to this flashcard vocabulary list. The student then has the option to export this list into either the xml or the csv file. It is a rather nice feature . . . in theory.

But I'll look into the suggestions you made.

thanks again,
stephen
 

sfrrr

状元
I read the PDFs on my iPAQ with a PDF reader. But, like you, I'd like to be able to create a user dictionary with my personal CPod vocabulary. Or, maybe, after beta 5 comes out, just make flashcards out of each vocab word and assign a category (CPod) to it. Anyone with any ideas?

Sandra
 
Free Download Manager has a bulk download option that lets you set the pattern, and it will generate all of the download links automatically, etc. Works extremely well when in China, and downloading from the US. Single downloads get limited somewhere along the line to about 12.5kb to Harbin, so I very often get 10x speed-ups using this approach.
 

sfrrr

状元
Stephan--DownThemAll (for Firefox) does the same.

I tried saving my CPod vocab as both CSV and XML files. Neither one imports cleanly into Excel--or anything else I can find. Also, I can't get any Hanzi to appear, no matter how many times I change the IME and/or the font. This is nothing new for me--I have never, never successfully imported my CPod Chinese vocab.

Sandra
 

ldolse

状元
As I recall, if you choose csv, and then open the csv file in notepad you will be able to view the characters. At that point you can copy them and then paste directly into excel using the text import wizard if required. The csv not working appears to be an excel bug....
 

sfrrr

状元
Thanks, Idolse. Have you tried exporting as xml and opening in a Web browser? I get an immediate parsing error in firefox.

Sandra
 

yesacas

Member
I finally found a method that seems to work for me - not completely seamless, but not impossible.

1. Cut n paste vocab list into Word (only copy individual entries, not ctrl-A copy).
2. "Save as" encoded text and select encoded type (I choose UTF-8 Unicode since Pleco works with this, but you can choose others that work I suppose).
3. Open the txt file in Word selecting the "encoded text file" option, UTF-8.
4. Edit to strip out the spaces and insert tabs (as required by Pleco). I made a simple macro that did this automatically, since each entry has a regular pattern. I also add the header as required by Pleco.
5. Download to PDA, etc. and import into Pleco.
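
For anyone who would rather script step 4 than write a Word macro, here is a minimal sketch. It assumes each pasted line looks like the samples earlier in the thread (headword, then pinyin, then definition, separated by spaces) and uses a heuristic that is my own guess, not the macro described above: a token counts as pinyin if it contains a tone-marked vowel. Neutral-tone syllables written without a mark would end up in the definition, so the output still needs a quick look.

```python
import unicodedata

# Vowels carrying a pinyin tone mark, normalized to precomposed form.
TONE_MARKS = set(unicodedata.normalize("NFC", "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"))


def to_pleco_line(entry):
    """Turn 'headword pinyin... definition...' into a tab-separated line."""
    tokens = unicodedata.normalize("NFC", entry).split()
    headword, rest = tokens[0], tokens[1:]
    # Heuristic (an assumption): pinyin tokens come first and each one
    # contains at least one tone-marked vowel.
    i = 0
    while i < len(rest) and TONE_MARKS & set(rest[i].lower()):
        i += 1
    pinyin = " ".join(rest[:i])
    definition = " ".join(rest[i:])
    return "\t".join([headword, pinyin, definition])


print(to_pleco_line("垃圾食品 lājī shípǐn junk food"))
```

The Pleco header line mentioned in step 4 is not generated here and would still have to be added by hand.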

Seems that all of this could be made simpler . . . we are after all in the computer age. Still I suppose it beats modifying each bit by hand!

S
 

sfrrr

状元
Yesacas--If I understand you correctly, you copy each vocabulary word, one at a time. If so, I'm not sure that is much easier than entering them one by one manually. Check CPod in a day or so. I'm going to post a comment in one or another of the "conversations" and see if they can tell us how to make the export work for real. I wonder if it's an equal pain to import vocabulary words.

Sandra
 

ldolse

状元
I've contacted them about it before, and they didn't seem too interested in putting effort into integrating with other study tools... Good luck though, it would be great to see them do something about it.

I just checked the current csv format - it looks like it's more messed up than before. The pinyin syllables are encoded as ASCII decimal NCRs (numeric character references, like &#250;). That appears to be a new behaviour, not sure why they chose that; you might be able to get them to treat it as a bug.
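
If you have Python handy, decimal NCRs can be turned back into real characters with the standard library. This is only a sketch using the sample lines from the first post; `html.unescape` also converts named entities such as `&amp;`, which should be harmless for a vocabulary list.

```python
import html

# Sample lines taken from the first post in this thread.
samples = [
    "球场 qi&#250;ch&#462;ng court",
    "前场 qi&#225;nch&#462;ng front court",
]

for line in samples:
    # html.unescape replaces references such as &#250; with the character itself.
    print(html.unescape(line))
```

The decoded pinyin for the two samples comes out as qiúchǎng and qiánchǎng.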

Anyway, easiest way I see getting this into Pleco is by opening in a text editor, then pasting into Excel as I mentioned before. However you'd have to delete the content from Pinyin column as well as the MP3 pointer column. If you want Pleco definition entries then delete the Cpod definition text as well. At that point it could be saved to whatever format Pleco likes, which I forget as I'm writing this.
 

yesacas

Member
Sandra,

Actually in step one I copy one page at a time (20 entries each page) but I select only the vocab entry lines (20 lines). It's definitely faster than entering them one by one, especially with a macro.

BTW, I already contacted CPod - they seem particularly unresponsive. Maybe MICHAEL LOVE needs to contact them, if he is paying attention??? I do think web-based learning sites coupled with a portable learning tool (i.e. Pleco) are the future, and the two need to work together.

Stephen
 

yesacas

Member
Idolse,

You say "easiest way I see getting this into Pleco is by opening in a text editor, then pasting into Excel as I mentioned before." What do you open -- the CSV file is messed up as you already observed. If you have something that works maybe you could provide a few steps?

Appreciate your help,
Stephen
 

mikelove

皇帝
Staff member
We did write them and they weren't interested in working with us - shame about it, too, we could have done some really cool things combining 2.0's audio playback + flashcards + reader (= lesson transcripts) features into a sort of interactive portable ChinesePod lesson capsule.
 

ldolse

状元
It depends what you really want out of that CPod export. Getting the Chinese headword is easy, see below. If you want Headword, Pinyin, and Definition you're out of luck; there's no really easy solution except to find other tools to massage the Pinyin. The steps below would help you massage it though.

Since Pleco's import routine will happily take just a headword you can just use that portion. Here are the steps again, to be clear:

1. (edited) Open the csv file in BabelPad; refer to the next post for details.
2. Copy all with Ctrl-A, Ctrl-C
3. Open Excel, paste into a new document using Ctrl-V

At this point you should have each item in its own column, headwords being in the first column. If that's so go to Step 8. If it still looks screwy then do the following:

4. Select the entire first column by clicking 'A' at the top.
5. Go to the Data menu, choose Text to Columns.
6. Going through the wizard, choose delimited, click Next
7. Select the comma delimiter, uncheck other options. Make sure double quotes are specified as a text Qualifier, Click Finish
8. Highlight Column A by clicking 'A' at the top.
9. Copy with Ctrl-C, then paste this back into a new text file.
10. Edit as needed, adding categories if desired, Save the text file as UTF-8
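
The same headword-only extraction can be sketched in a few lines of Python instead of going through Excel. It assumes, as described above, that the headword is in the first column and that double quotes are the text qualifier; it also assumes the export is UTF-8, and it does not know whether the file has a header row, so check the first line of the output.

```python
import csv


def extract_headwords(csv_path, out_path):
    """Copy the first column of a CSV export into a one-per-line UTF-8 file."""
    # utf-8-sig reads UTF-8 whether or not the file starts with a BOM.
    with open(csv_path, newline="", encoding="utf-8-sig") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        # csv.reader treats double quotes as the text qualifier by default.
        for row in csv.reader(src):
            if row and row[0].strip():
                dst.write(row[0].strip() + "\n")


# Example call with hypothetical file names:
# extract_headwords("cpod_export.csv", "headwords.txt")
```

Categories and any other edits from step 10 would still be added to the resulting text file by hand.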
 

ldolse

状元
Here's a complete solution. Instead of using Notepad or any other text editor, download BabelPad from the following URL:
http://www.babelstone.co.uk/Software/BabelPad.html

1. Open the CSV in BabelPad
2. Select all with Ctrl-A
3. Right Click on the selection
4. Select Convert -> NCR to Unicode

Now you can paste into Excel for further massaging, save it out from Excel as a tab separated file, and you should be good for import into Pleco.
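
For those who prefer a script to BabelPad plus Excel, the whole conversion can be sketched in Python: decode the NCRs, pick the wanted columns, and write a tab-separated UTF-8 file. The column positions are a guess on my part (the export also has an MP3 pointer column, and the exact order is not stated in this thread), so adjust `columns` after looking at the real file.

```python
import csv
import html


def csv_to_pleco(csv_path, out_path, columns=(0, 1, 2)):
    """Write headword<TAB>pinyin<TAB>definition lines from a CSV export.

    `columns` holds the positions of headword, pinyin and definition.
    The default (0, 1, 2) is an assumption, not the documented layout.
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            if len(row) <= max(columns):
                continue  # skip blank or short rows
            # Decode NCRs such as &#250; and keep stray tabs out of the cells.
            cells = [html.unescape(row[i]).replace("\t", " ").strip()
                     for i in columns]
            dst.write("\t".join(cells) + "\n")


# Example call with hypothetical file names:
# csv_to_pleco("cpod_export.csv", "cpod_pleco.txt", columns=(0, 1, 2))
```

Any header line that Pleco expects is not written here and would need to be added separately.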
 
ldolse said: "As I recall, if you choose csv, and then open the csv file in notepad you will be able to view the characters. At that point you can copy them and then paste directly into excel using the text import wizard if required. The csv not working appears to be an excel bug...."

I've seen this behavior in other software. I would bet it is not an excel bug, but rather that they are not including a Unicode header block on their file. Notepad will read and correctly "guess" Unicode files with or without the header.

I tested this on someone's CSV file, and it works. However, for some reason, I still couldn't read the CSV file directly with Excel 2003. I used Notepad++ (free editor) to replace all commas with tabs, and then it imported OK with excel.

So, in all, about 30 seconds to process a file of about 270kb (1200+ lines)

I didn't fix the pinyin encoding problem, however. Their choice seems rather strange to me.

These steps:

1) Open the original CSV file in Windows Notepad first (not Notepad++). This should show all the characters. If things are OK, then do SAVE AS (so you get the save dialog), but first CHANGE the encoding specifically to UTF-8 (or whatever). It's the THIRD option, under the File Name line. While you could also use Notepad++, you then have to go to the "Format" menu and choose "Encode in UTF8".

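BTW, you can see (using Notepad++) that ChinesePod encoded the file without the BOM (Unicode prefix characters). The Unicode standard makes the BOM optional for UTF-8, so strictly speaking this is allowed, but programs that rely on the BOM to detect the encoding can misread such a file, so it may still be worth complaining to them about it. There's a commonly used Java library for writing Unicode that has this "bug"/default behavior.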

2) Open saved file in Notepad++ and replace all commas with tabs. Save again.

3) At that point, open the file in Excel.

As I mentioned, you can do it in two steps, if you just use notepad++ (or your favorite text editor) and change the formatting to include the Unicode beginning of file marker.
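
As a sketch of the "add the beginning of file marker" step without a text editor, Python can check for the BOM and re-save the file with one. This assumes the export really is UTF-8; whether a given Excel version then opens the file cleanly is something I have not verified for every release.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def add_utf8_bom(src_path, dst_path):
    """Re-save a UTF-8 file so that it starts with a BOM."""
    # utf-8-sig drops an existing BOM when reading and writes one when writing,
    # so running this twice does not produce a double BOM.
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)


# Example call with hypothetical file names:
# add_utf8_bom("cpod_export.csv", "cpod_export_bom.csv")
```

Replacing commas with tabs is deliberately left out here, since a blind search and replace would also change commas inside quoted definitions.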

*** EDIT *** Sorry I didn't notice the previous post while I was writing this one. I was on the wrong page of the forum messages. ***
 

yesacas

Member
Thanks Mike for the response . . . I'll add that to my growing list of grievances with their service.

Stephen
 