18,896 HSK sentences

Shun

状元
Dear all,

Here’s a file of 18,896 quite realistic-sounding HSK sentences, in ascending order of complexity. I imported them from a Tatoeba sentence list some time ago. They’re useful for reinforcing one’s feel for sentence structure and grammatical constructions. They work best with the Self-graded study mode, or even just by browsing through them in Card Info.

Caveat: The generated pinyin is sometimes off, and the English often reads as if it wasn’t translated by a native English speaker.

Hope you like it, cheers, Shun


Edit: The file in message number 24 is ordered by sentence length. I'm also attaching it here.
 

Attachments

  • 18,896 HSK sentences.txt (3.1 MB)
  • 18,896 HSK sentences ascending complexity.txt (3.1 MB)

leguan

探花
[Four screenshots attached]
 

Shun

状元
Thanks a lot for your list. :) Good English translations and realistic-sounding Chinese. However, I will try using a regex to convert them to the format

"Hanzi - pinyin - translation" instead of

"Word Hanzi - Hanzi with word in pinyin - translation".

If I manage it, I will upload the converted list here.
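The conversion could be sketched roughly as follows. This is a minimal sketch, not the actual script: it assumes each line of the exported list is tab-separated as "headword, sentence hanzi, sentence pinyin, translation" (the real file's separator and field order may differ), and it de-duplicates sentences that several headwords share.

```python
# Hypothetical converter sketch. Assumed input format per line:
#   headword<TAB>sentence hanzi<TAB>sentence pinyin<TAB>translation
# Desired output format per line:
#   sentence hanzi<TAB>sentence pinyin<TAB>translation

def convert_line(line):
    """Drop the headword column; return None for malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    _word, hanzi, pinyin, translation = fields
    return f"{hanzi}\t{pinyin}\t{translation}"

def convert_file(lines):
    """Convert all lines, de-duplicating sentences shared by several words."""
    seen = set()
    out = []
    for line in lines:
        converted = convert_line(line)
        if converted is not None and converted not in seen:
            seen.add(converted)
            out.append(converted)
    return out
```

Since the duplicate sentences carry no extra information once the headword column is gone, de-duplication also shrinks the file considerably.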
 

leguan

探花
It was converted from your list using Excel. The idea is that you want to test whether you can write the hanzi. Just listening to the pinyin of a single word gives you no context, which is unrealistic. The sentence supplies the context (you can choose not to display the sentence initially, but I prefer to read the sentence and then write the hanzi, rather than relying only on the pronunciation).
 

Shun

状元
I see! Funny, though, that it's got 23,000 lines. Your format is definitely very useful!
 

leguan

探花
If you don't know the hanzi from the Mandarin context, you can look at the English translation as a hint.
 

leguan

探花
The converter makes two passes through your list, trying to cover as many as possible of the 65,535 words that have definitions across all the Pleco dictionaries I have.
On each pass it chooses the least-used sentence so far (if one exists), to distribute usage of the sentences evenly.
 

leguan

探花
Currently, it uses the top 40,000 words by frequency that exist as Pleco entries, blended 50%/50% from the Subtitles and Leiden Weibo corpora.
Of those 40,000 top words, only 13,670 appear in the sentence list you compiled. The average is thus 23,084/13,670, i.e. approximately 1.7 sentences per word. HSK words are prioritized. There are about 500 HSK words which don't occur in the sentences, so I'm planning to add 500 sentences to your list and rebuild.
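The 50/50 blend of the two corpora could look something like this. A sketch only, with assumed normalization: each corpus's raw counts are scaled to relative frequencies first (so corpus size doesn't dominate), then averaged; the function name and toy data are made up for illustration.

```python
def blend_frequencies(corpus_a, corpus_b, weight_a=0.5):
    """Blend two word-frequency dicts (word -> raw count) into one ranking.

    Counts are normalized to relative frequencies per corpus, then
    combined with the given weight (50/50 by default).  Returns words
    sorted from most to least frequent in the blended measure.
    """
    total_a = sum(corpus_a.values()) or 1
    total_b = sum(corpus_b.values()) or 1
    words = set(corpus_a) | set(corpus_b)
    blended = {
        w: weight_a * corpus_a.get(w, 0) / total_a
           + (1 - weight_a) * corpus_b.get(w, 0) / total_b
        for w in words
    }
    return sorted(blended, key=blended.get, reverse=True)

# Toy example: two tiny "corpora" with raw counts.
top_words = blend_frequencies({"的": 50, "了": 30, "微博": 1},
                              {"的": 40, "微博": 25, "了": 10})
```

Taking the first 40,000 entries of such a blended ranking (restricted to words that exist as Pleco entries) would give the candidate word list described above.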
 

leguan

探花
Thank you for your kind words. :)
I can make alternative versions if you have different parameters you prefer. e.g. number of sentences/word, which words are included as candidates, etc.
 

Shun

状元
So if I understand correctly, you're not just picking the most "special" word in each sentence, i.e. the least common one, but doing something more complex. Anyway, the goal is to get the most useful word.
 

Shun

状元
You're welcome! I guess covering most of the HSK words is already a fine starting point. You seem to have done the optimal word selection already, as I see it. Have a nice day!
 

leguan

探花
Thank you very much too for posting the list. :)
I can now practice reading comprehension and writing in context in an enjoyable way!
 

Shun

状元
Glad you like it. I have to go check Tatoeba's free database again. :)
 

leguan

探花
You are quite correct: it doesn't look for a "special" word in each sentence. Rather, it makes multiple passes through the HSK and high-frequency words, ordered by HSK level and then frequency. On each pass it chooses, among the sentences containing the current word, the one with the lowest usage so far (i.e., counting uses by the current word and all words processed before it). I have ordered the sentences by length, from short to long, so the choice is biased toward shorter sentences. The reason is that I prefer shorter contexts to longer ones: the primary purpose of the sentence is to provide context, and reading comprehension is just a byproduct of incorporating that context.
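The pass logic described above could be sketched like this. This is my reconstruction, not leguan's actual code: it assumes `words` arrive already ordered by priority (HSK first, then frequency) and `sentences` are pre-sorted from short to long, so that among equally used sentences the shorter one wins the tie.

```python
def assign_sentences(words, sentences, passes=2):
    """Greedy sentence assignment (a sketch of the described scheme).

    `words`: iterable ordered by priority (HSK level, then frequency).
    `sentences`: list pre-sorted from short to long; since min() keeps
    the first minimal element, ties in usage go to shorter sentences.
    Each pass picks, for every word, the containing sentence with the
    lowest usage count so far, then bumps that sentence's count.
    """
    usage = {s: 0 for s in sentences}
    picks = []  # (word, sentence) pairs, one per covered word per pass
    for _ in range(passes):
        for word in words:
            candidates = [s for s in sentences if word in s]
            if not candidates:
                continue  # word has no sentence in the list
            best = min(candidates, key=lambda s: usage[s])
            usage[best] += 1
            picks.append((word, best))
    return picks
```

Because every pick increments the chosen sentence's usage, popular sentences get "used up" and later words are pushed onto less-used (but still short) alternatives, which is what spreads the sentences evenly.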
 