mnemosyne "20,000+ chinese sentences" - HSK Level 1

crj · Jul 25, 2009

I saw this great thread over at Chinese-Forums:
http://www.chinese-forums.com/showthread.php?t=26892

It included this link and information:
"Brian Vaughan has compiled a mnemosyne database of "20,000+ Chinese sentences with translations and pinyin",
http://www.mnemosyne-proj.org/node/115
The sentences were apparently generated by searching online dictionaries for words in the HSK vocabulary lists. The database is classified by (1) the HSK level of the target word; and (2) the maximum HSK level of the other words in the sentence. "

Well, I took this list and exported the HSK Level 1 section to the attached text file, to be used as a Pleco flashcard file.
Be warned: it created 2698 cards, all with sentences (some very short, some quite long).

Quite useful for studying purposes.

The front of each flashcard is the sentence in Chinese; the 'answer' is in pinyin and English. There are apparently a few mistakes in the pinyin, and there are some odd quotation marks (") in places, but overall it's a decent resource.

If anyone feels like cleaning it up, or dividing it into smaller chunks, please feel free to do so and repost.

Thanks.

crj · Jul 28, 2009

Just really wanted to reiterate how poorly written the English sentences are! But the Chinese ones are the ones I want, so it's okay if you put it into perspective.
The biggest issue seems to be that the file looks fine as a txt on my computer, but when uploaded into Pleco the English sentences become corrupted. I think it is due to the ridiculously large file size.
If I get a chance, and the inspiration, I will try to redo these as smaller files.
Any suggestions as to how many sentences make sense per file? 100?
(Mike, what about suggestions from a technical point of view?)

mikelove · Jul 29, 2009

The problem here is that there's no tab between the Pinyin and the sentence; hence, the sentence gets interpreted as part of the Pinyin and ends up garbled as a result. (if you reviewed the card it would show up in Pron rather than Defn) Fixing this manually with 3,000 sentences would be kind of a chore, but if you use a text editor to replace "。 " (a Chinese period followed by a space) with "。\t" (same thing followed by a tab), I think that would fix most of them - you could then quickly go through in, say, Excel and find / fix rows where there was nothing in the third column.

Also, make sure to delete the quotation marks at the beginning / end of some lines (inserted by a previous run with Excel, I assume) - they won't necessarily screw things up import-wise, but they'll look weird on the finished cards.
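
A minimal Python sketch of the two fixes described above (tab after the Chinese full stop, stray quotation marks removed), plus the check for rows with nothing in the third column. The function names and the sample line are made up for illustration, and it assumes each line is meant to end up as three tab-separated columns (characters, pinyin, English).

```python
FULL_STOP = "\u3002"  # the Chinese full stop "。"


def fix_line(line: str) -> str:
    """Strip wrapping quotation marks, then turn '。 ' into '。<tab>'."""
    line = line.rstrip("\r\n").strip('"')
    return line.replace(FULL_STOP + " ", FULL_STOP + "\t")


def needs_review(line: str) -> bool:
    """True when the third (definition) column is missing or empty."""
    cols = line.split("\t")
    return len(cols) < 3 or not cols[2].strip()


def clean_lines(lines):
    """Return the fixed lines and the 1-based numbers of rows to check by hand."""
    fixed = [fix_line(line) for line in lines]
    review = [i for i, line in enumerate(fixed, start=1) if needs_review(line)]
    return fixed, review


if __name__ == "__main__":
    # Hypothetical sample row: no tab between the pinyin and the English.
    sample = ['"我很好。\twǒ hěn hǎo。 I am fine."']
    fixed, review = clean_lines(sample)
    print(fixed[0].split("\t"))
    print("rows to review:", review)
```

When reading the real file, open it with encoding="utf-8" so the Chinese full stop is matched correctly. Note that strip('"') also removes a quotation mark that legitimately ends an English sentence, so the flagged rows are still worth a quick look.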

radioman · Jul 30, 2009

If someone goes through the cleanup exercise let me know!

Lurks · Aug 9, 2009

I took a look at the data and thought it was a bit of a goldmine... So I had a go.

Now... there was a lot of work to do here, so I began in stages with a view to doing everything, including fixing up the pinyin into numbered tones that should import correctly. I'm actually just about done but have run into a snag: the last pass needs to be a simple script to shove the numerals on the end of each pinyin word instead of in the middle, and to insert tab characters between the definitions. Unfortunately my scripting language of choice doesn't actually do Unicode and the parsing is kinda broken. I haven't given up; I just might have to learn how to use something that DOES do Unicode to do this right, or maybe try some sort of hack by pulling it into Excel, exporting the bits I can work with (the pinyin) and assembling it somehow.

I'll give it a new shot tomorrow
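
For anyone attempting the same pass, here is a rough Python 3 sketch of the "numerals on the end" step. It is not the script used in this thread. It assumes the pinyin syllables are separated by spaces, handles both tone marks (hǎo) and a digit stuck in the middle (ha3o), and leaves neutral-tone syllables without a number. The function names are invented for the example.

```python
import unicodedata

# Combining marks that carry the four tones once a syllable is NFD-decomposed.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave
}


def syllable_to_numbered(syllable: str) -> str:
    """Move the tone of one syllable to a trailing digit, e.g. hǎo -> hao3."""
    if not any(ch.isalpha() for ch in syllable):
        return syllable  # pure punctuation or a real number: leave alone
    tone = ""
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        elif ch in "12345":
            tone = ch  # tone digit already present, possibly mid-syllable
        else:
            kept.append(ch)
    # Recompose so that ü (u + diaeresis) becomes one character again.
    base = unicodedata.normalize("NFC", "".join(kept))
    # Put the digit after the last letter, before any trailing punctuation.
    end = len(base)
    while end > 0 and not base[end - 1].isalpha():
        end -= 1
    return base[:end] + tone + base[end:]


def pinyin_to_numbered(text: str) -> str:
    """Convert a space-separated pinyin string syllable by syllable."""
    return " ".join(syllable_to_numbered(s) for s in text.split(" "))


if __name__ == "__main__":
    print(pinyin_to_numbered("wǒ hěn hǎo!"))
```

The decomposition trick avoids a hand-written table of every accented vowel, but it only works per syllable, so pinyin with the spaces already removed would need to be split into syllables first.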

ldolse · Aug 10, 2009

Have you tried BabelPad on Windows? It can do automatic Pinyin conversion to tone marks. Anyway, if you want to attach or PM me your current progress, I could take a stab at it. I've done this sort of conversion before.

ldolse · Aug 10, 2009

Actually I was wrong about BabelPad. Turns out there are very few tools out there to convert from tone marks to numbers. I found this one:
http://www.chineselearnonline.com/pinyi ... rsion-tool

Everything else out there seemed to do the reverse....

Anyway, I was procrastinating on real work today, and this looks like an awesome list, so I went ahead and cleaned it up. Most of the punctuation is fixed, and tabs should be in all the right places. Lastly, I broke the sentences into categories based on the number of 字. There are about 5 different categories, but that still leaves 500-600 cards per category.

Note I haven't tried to input this into Pleco yet, let me know how it goes.
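
A rough sketch of how the split by number of 字 could be done in Python. The bucket boundaries (5, 10, 15, 20) are made up to give five groups and are not the ones used for the attached files. It also assumes the Chinese sentence is the first tab-separated column of each line.

```python
from collections import defaultdict


def count_hanzi(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def bucket_label(n: int, edges=(5, 10, 15, 20)) -> str:
    """Name the length bucket that a sentence of n characters falls into."""
    lower = 1
    for edge in edges:
        if n <= edge:
            return f"{lower}-{edge} characters"
        lower = edge + 1
    return f"{lower}+ characters"


def group_by_length(lines, edges=(5, 10, 15, 20)):
    """Group flashcard lines by the character count of their first column."""
    groups = defaultdict(list)
    for line in lines:
        sentence = line.split("\t", 1)[0]
        groups[bucket_label(count_hanzi(sentence), edges)].append(line)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample card in characters / pinyin / English order.
    cards = ["我很好。\two3 hen3 hao3\tI am fine."]
    for label, members in group_by_length(cards).items():
        print(label, len(members))
```

Punctuation is not counted, so "我很好。" is treated as three characters. Each group can then be written out to its own file or category.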

Lurks · Aug 10, 2009

That looks good but the pinyin has all the spaces removed. I think it's very useful to keep spacing where words are grouped.

Here's my effort. I've removed the Chinese full stops from the characters. I've stripped out a hell of a lot of weird characters, but ultimately my script threw away any line that it failed to parse, mostly because it wasn't possible to figure out where the pinyin ended and the English began. I also just fixed some more stuff to do with my tone numbers going wrong around punctuation, and some strange characters that were embedded in the pinyin, and translated most of the weird Unicode chars in the pinyin to regular ASCII. I've imported it to Pleco and it works well.

Hmm, need to figure out how to mass kill the flashcards and category created by importing an earlier version...

ldolse · Aug 10, 2009

I deleted the spaces because I think that's Pleco's convention (not to mention Chinese in general), I'm not sure if it matters to Pleco though, easy enough to put them back.

edit: I don't see any example where pinyin syllables were grouped without a space. If they're not grouped I can't see that there would be any value to leaving the spaces in.

Lurks · Aug 11, 2009

You know, you're absolutely right. They're not grouped at all. That's kinda annoying and you're absolutely right, the spaces shouldn't be there in this case. If they were grouped I would absolutely prefer the spaces.

I'll take a new look later, could well be your pinyin is better even after I remove the spaces. I had to add a few workarounds for things like exclamations at the end of sentences and what have you.

I'm pleased I worked this stuff out though. There's actually nothing to stop me scraping any online reference now and building flash cards from it. I always get the feeling when looking up dictionary entries occasionally that I'd like to file a word, not for the word but for an excellent example sentence in the definition. Course without access to the actual data of the dictionary definitions there's not much you can do to automate this from ABC/NWP data etc. To be fair it's content that belongs to them so it shouldn't really be sharable anyway I guess.

ldolse · Aug 11, 2009

In crj's original post, the Chinese-Forums link discusses some other online dictionaries whose example sentences may be scrapeable in that fashion, nciku being one. This file is also only level one from the original DB; apparently there are 17,000+ more examples.

The pinyin tool I linked seemed to work pretty well; I didn't see any errors on a cursory inspection. Note that the JavaScript it uses only seemed to work flawlessly in Safari; Firefox seemed to have some bugs with that much data. The only other major flaw I see at this point is that not every single character was converted to pinyin in the source file, and I didn't attempt to fix this. I did finally get around to importing my file, so I can verify that the import seems to work OK as well.

caesartg · Jan 17, 2010

Hi Mike

Perhaps there's a file permissions issue, as I just get a phpBB "Unable to deliver file." error when trying to download these.

Cheers

Ben

mikelove · Jan 17, 2010

Indeed there was - fixed it now, thanks for pointing that out.

fischer · Mar 11, 2010

Did anyone convert the other 17,000+ sentences, and would you like to share?

Jo-DV · Feb 21, 2011

Hi,

Interesting post!
I did some further clean-up, formatted correctly in UTF-8, did a spelling check on the English translations etc. That helped to further complete the hard work done earlier by various people. The resulting file is attached below, it should work fine in Pleco.

After using those sentences, however, I concluded for myself that they are of limited value, or better said, they are only of value if you have already mastered 99% of the HSK character list.
First of all, the pinyin is not at all reliable. I can only advise everybody to actively select each unknown character in the sentences and let Pleco pop up the dictionary entry for it to double-check. The other issue is that the English translations are pretty approximate, and won't help you to learn new characters or words. Those translations are only useful if you know each character/word but still struggle to understand the sentence.

I think it is much better to just use Pleco's built-in dictionary for the flashcards and switch on the extended definitions to see the example sentences. That way you have a quicker and more natural language learning experience, rather than cramming characters into your head and only afterwards starting to see how they are used in context.

Anyway, the updated sentence list is attached below.

Jo.

natetrue · Dec 12, 2014

Awesome list! So many sentences to study. I pared down the larger sentences and made a couple files of the shortest and medium-length sentences. Hope they're useful to others!

EDIT: Changed the pinyin for the short sentences to match what Google Translate thinks they should be. Multi-character words are now smooshed together and a few of the pronunciations are corrected (or changed, I'm really not that experienced with Chinese yet)

sobriaebritas · Dec 13, 2014

Good idea, natetrue. Thank you for sharing it!

Miguel · Dec 28, 2014

Great!!! Thanks.

Does anyone know how to increase the speed of the TTS when revealing these sentences? With the default speed it's quite slow for sentences... I have the bundle package, by the way.
