18,896 HSK sentences

Shun

状元
Dear all,

Here’s a file of 18,896 quite realistic-sounding HSK sentences, in ascending order of complexity. I imported them from a Tatoeba sentence list some time ago. They’re useful for reinforcing one’s feel for sentence structure and grammatical constructions. They work best with the Self-graded study mode, or even just by browsing through them in Card Info.

Caveat: The generated pinyin is sometimes off, and the English often reads as if it wasn’t translated by a native English speaker.

Hope you like it, cheers, Shun


Edit: The file in message number 24 is ordered by sentence length. I'm also attaching it here.
 

Attachments

  • 18,896 HSK sentences.txt (3.1 MB)
  • 18,896 HSK sentences ascending complexity.txt (3.1 MB)

leguan

探花
[Four screenshots attached]
 

Shun

状元
Thanks a lot for your list. :) Good English translations and realistic-sounding Chinese. However, I will try using a regex to convert them to the format

"Hanzi - pinyin - translation" instead of

"Word Hanzi - Hanzi with word in pinyin - translation".

If I manage it, I will upload the converted list here.
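The conversion could be sketched roughly as follows. This is a minimal sketch, not the actual script: it assumes each line of the exported list is tab-separated as "headword, sentence hanzi, sentence pinyin, translation" (the real file's separator and field order may differ), and it de-duplicates sentences that several headwords share.

```python
# Hypothetical converter sketch. Assumed input format per line:
#   headword<TAB>sentence hanzi<TAB>sentence pinyin<TAB>translation
# Desired output format per line:
#   sentence hanzi<TAB>sentence pinyin<TAB>translation

def convert_line(line):
    """Drop the headword column; return None for malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    _word, hanzi, pinyin, translation = fields
    return f"{hanzi}\t{pinyin}\t{translation}"

def convert_file(lines):
    """Convert all lines, de-duplicating sentences shared by several words."""
    seen = set()
    out = []
    for line in lines:
        converted = convert_line(line)
        if converted is not None and converted not in seen:
            seen.add(converted)
            out.append(converted)
    return out
```

Since the duplicate sentences carry no extra information once the headword column is gone, de-duplication also shrinks the file considerably.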
 

leguan

探花
It was converted from your list using Excel. The idea is that you want to test whether you can write the hanzi. Just listening to the pinyin of a single word gives you no context, which is unrealistic. The sentence supplies the context (you can choose not to display the sentence initially, but I prefer to read the sentence and then write the hanzi, rather than relying only on the pronunciation).
 

Shun

状元
I see! Funny, though, that it's got 23,000 lines. Your format is definitely very useful!
 

leguan

探花
If you don't know the hanzi from the Mandarin context, you can look at the English translation as a hint.
 

leguan

探花
The converter makes two passes through your list, trying to cover as many as possible of the 65,535 words that have definitions across all the Pleco dictionaries I have.
On each pass it chooses the least-used sentence so far (if one exists), to distribute usage of the sentences evenly.
 

leguan

探花
Currently, it uses the top 40,000 words by frequency that exist as Pleco entries, blended 50%/50% from the Subtitles and Leiden Weibo corpora.
Of those 40,000 top words, only 13,670 appear in the sentence list you compiled. The average is thus 23,084/13,670, i.e. approximately 1.7 sentences per word. HSK words are prioritized. There are about 500 HSK words which don't occur in the sentences, so I'm planning to add 500 sentences to your list and rebuild.
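The 50/50 blend of the two corpora could look something like this. A sketch only, with assumed normalization: each corpus's raw counts are scaled to relative frequencies first (so corpus size doesn't dominate), then averaged; the function name and toy data are made up for illustration.

```python
def blend_frequencies(corpus_a, corpus_b, weight_a=0.5):
    """Blend two word-frequency dicts (word -> raw count) into one ranking.

    Counts are normalized to relative frequencies per corpus, then
    combined with the given weight (50/50 by default).  Returns words
    sorted from most to least frequent in the blended measure.
    """
    total_a = sum(corpus_a.values()) or 1
    total_b = sum(corpus_b.values()) or 1
    words = set(corpus_a) | set(corpus_b)
    blended = {
        w: weight_a * corpus_a.get(w, 0) / total_a
           + (1 - weight_a) * corpus_b.get(w, 0) / total_b
        for w in words
    }
    return sorted(blended, key=blended.get, reverse=True)

# Toy example: two tiny "corpora" with raw counts.
top_words = blend_frequencies({"的": 50, "了": 30, "微博": 1},
                              {"的": 40, "微博": 25, "了": 10})
```

Taking the first 40,000 entries of such a blended ranking (restricted to words that exist as Pleco entries) would give the candidate word list described above.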
 

leguan

探花
Thank you for your kind words. :)
I can make alternative versions if you have different parameters you prefer. e.g. number of sentences/word, which words are included as candidates, etc.
 

Shun

状元
So if I understand correctly, you're not just picking the most "special" word in each sentence, i.e. the least common one, but doing something more complex. Anyway, the goal is to get the most useful word.
 

Shun

状元
You're welcome! I guess covering most of the HSK words is already a fine starting point. You seem to have done the optimal word selection already, as I see it. Have a nice day!
 

leguan

探花
Thank you very much too for posting the list. :)
I can now practice reading comprehension and writing in context in an enjoyable way!
 

Shun

状元
Glad you like it. I have to go check Tatoeba's free database again. :)
 

leguan

探花
You are quite correct: it doesn't look for a "special" word in each sentence. Rather, it makes multiple passes through the HSK and high-frequency words, ordered by HSK level and then frequency. On each pass it chooses, among the sentences containing the current word, the one with the lowest usage so far (i.e., counting uses by the current word and all words processed before it). I have ordered the sentences by length, from short to long, so the choice is biased toward shorter sentences. The reason is that I prefer shorter contexts to longer ones: the primary purpose of the sentence is to provide context, and reading comprehension is just a byproduct of incorporating that context.
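The pass logic described above could be sketched like this. This is my reconstruction, not leguan's actual code: it assumes `words` arrive already ordered by priority (HSK first, then frequency) and `sentences` are pre-sorted from short to long, so that among equally used sentences the shorter one wins the tie.

```python
def assign_sentences(words, sentences, passes=2):
    """Greedy sentence assignment (a sketch of the described scheme).

    `words`: iterable ordered by priority (HSK level, then frequency).
    `sentences`: list pre-sorted from short to long; since min() keeps
    the first minimal element, ties in usage go to shorter sentences.
    Each pass picks, for every word, the containing sentence with the
    lowest usage count so far, then bumps that sentence's count.
    """
    usage = {s: 0 for s in sentences}
    picks = []  # (word, sentence) pairs, one per covered word per pass
    for _ in range(passes):
        for word in words:
            candidates = [s for s in sentences if word in s]
            if not candidates:
                continue  # word has no sentence in the list
            best = min(candidates, key=lambda s: usage[s])
            usage[best] += 1
            picks.append((word, best))
    return picks
```

Because every pick increments the chosen sentence's usage, popular sentences get "used up" and later words are pushed onto less-used (but still short) alternatives, which is what spreads the sentences evenly.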
 