Randomly Ordered Sentences Game

Weyland

进士
Your feedback is welcomed. Next, I'd like to improve the HSK rating algorithm.
HSK rating algorithm, that also uses a more difficult wordlist? Is this list primarily HSK5-6 or do you mean parts of the list I PM'd you? If the latter I'm curious as to how you're going to organize them. Personally I have always thought that the HSK difficulty grading is a big mess. Like why is 既然 HSK4, but 帐篷 or like 指甲 is HSK6? Maybe it's because I started reading books like this before I even started reviewing the HSK books published by BLCUP(?).

Maybe a bit random, but like 8-9 years ago I had to review and improve my math skills before I could enrol in an Engineering degree, so I took the math courses over at Khan Academy. Back then, they used to have this knowledge map that worked as a progression map. I thought it was great. If only there was a Chinese language app/website that did something similar (there probably is, but I am willing to bet it only goes up to like HSK3-4) that guides you through the language from nothing to native.

The issue with the current HSK system is that you progress rather quickly through HSK1-3, but then it literally slows down exponentially. As such most people overestimate their skills in Chinese and underestimate the amount of effort they have to put in to achieve even a smidgen of fluency.


P.S. 昏昏欲睡 isn't part of the current HSK, let alone the newly proposed one. Also, it's like 600+ in the frequency list. How do you rate 昏昏欲睡 as HSK5 when, IIRC, HSK5 itself only has a single idiom (讨价还价)?
 

Shun

状元
Nice, congratulations on straddling the fence between STEM and Chinese.

What the current HSK rating algorithm does:
  • It checks for all 5,000 words of the current HSK (not yet the very newest one), averaging out the HSK levels of each word it finds in the sentence.
  • For all the words it finds in the sentence that aren't part of the HSK, it checks how common they are in the BCC frequency list. If any of them is below a certain threshold, the HSK rating is increased by 1.
So the example sentence is HSK 5 because the words that were found in it that were also in the HSK were averaged to around HSK 4, and then 1 was probably added. As you suggest, it is likely better to set the HSK level to the score of the most difficult/rare word in the sentence. Here, we had a rather simple sentence with one rarer vocabulary item. So the one rare vocabulary item should determine the level, in this case the maximum (6).
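A minimal Python sketch of the two rating rules described above, side by side. The function name, the dictionaries and the `rare_rank` cut-off are illustrative assumptions, not the actual code; treating a word that is missing from the frequency list as rare is also an assumption.

```python
from statistics import mean

def rate_sentence(words, hsk_level, bcc_rank, rare_rank=20000):
    """Return (average_based, max_based) HSK ratings for a segmented sentence.

    hsk_level maps HSK words to levels 1-6; bcc_rank maps words to their
    frequency rank (1 = most common). rare_rank is a hypothetical threshold.
    """
    known = [hsk_level[w] for w in words if w in hsk_level]
    unknown = [w for w in words if w not in hsk_level]
    # A non-HSK word counts as rare if it is ranked beyond the threshold
    # or is missing from the frequency list altogether (assumption).
    has_rare = any(bcc_rank.get(w, rare_rank + 1) > rare_rank for w in unknown)

    # Current rule: average the HSK words, add 1 if a rare word is present.
    avg = round(mean(known)) if known else 1
    average_based = min(6, avg + 1) if has_rare else avg
    # Suggested rule: the hardest word decides; a rare non-HSK word
    # pushes the sentence to the top level.
    max_based = 6 if has_rare else max(known, default=1)
    return average_based, max_based

# Toy data, for illustration only.
hsk = {"既然": 4, "竟然": 4}
bcc = {"昏昏欲睡": 30000}
print(rate_sentence(["既然", "竟然", "昏昏欲睡"], hsk, bcc))
```

With these toy values, the averaging rule lands one level above the average of the HSK words, while the maximum rule lets the one rare item decide.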

On your idea of a knowledge progression chart like at Khan Academy for STEM subjects, I feel that for language acquisition, the various skills required to master a language are much less dependent on each other. You could learn one particularly advanced thing in Chinese without knowing much about some basics, and listeners wouldn't really notice (at least for a while).

I also think that language is intimately tied to your thinking. (When you think, you usually produce language at the same time.) So when you're learning a new language, not just your language skills need to improve, but your thinking needs to adapt to that language, too. That just requires a lot of good input, otherwise you lack the necessary information to recognize patterns of thinking and linking them up to particular language constructions. I like to think of Chinese as so different that it could almost come from another planet. English and German may be ten times closer to one another in the way thoughts are expressed than English and Chinese. I like to think of Chinese as a kind of "lovely" language. Everything sounds kind of cute, and the same goes for its people. But I digress. :)

I agree that HSK ratings of vocabulary are somewhat arbitrary. Word frequency in corpora is probably the main determining factor. I can use either of the HSK lists. If I may, I will try your new list with its four levels. They seem more thought-out and natural to me.

Of course, I hope you're giving my dict.cn list with words a try.

Cheers,

Shun
 

Weyland

进士
You could learn one particularly advanced thing in Chinese without knowing much about some basics, and listeners wouldn't really notice (at least for a while)
What, how does that work? How can you show off "advanced Chinese" without knowing much about the basics? Like I could quote 《红楼梦》;
我就是个“多愁多病的身”, 你就是那“倾国倾城的貌”。
I’m the one “sick with longing”. And yours is the beauty which causes “cities and kingdoms to fall”.
Telling someone to neglect the foundation and focus on whatever they might fancy is the worst advice you could give to a language learner. If someone has terrible knowledge of what the correct tones are, then whoever might be listening will know right away whatever level they're at.

There is no point in learning words like 颠覆 (to overturn a regime) without knowing the conjunctions needed to explain causal relations.

We're already using the HSK as a guide to draw up divisions in fluency. Surely there can be somewhat of a straight line going from HSK1 to someone being able to take the 高考 (gaokao) or even the 公考 (Chinese civil-service exam).
 

Shun

状元
What, how does that work? How can you show off "advanced Chinese" without knowing much about the basics?
It looks like we're talking from somewhat different worlds here. I don't mean you can show off your Chinese abilities for long without knowing the basics, but it is certainly possible to learn things out of order. For example, you could learn how to use more advanced constructions before you learn more simple ones and then use them correctly in a few sentences. Perhaps that gives you inspiration to learn the easier ones. In any case, I argue that you can't do the same thing as easily in mathematics, for example.

Telling someone to neglect the foundation and focus on whatever they might fancy is the worst advice you could give to a language learner. If someone has terrible knowledge of what the correct tones are, then whoever might be listening will know right away whatever level they're at.
I wasn't trying to give advice, I was just pointing out that the path to proficiency in learning any language can be a little less systematic than in mathematics, because there are fewer clear-cut interdependencies. (Though of course, it's still possible to follow your heart when you do maths, now and then.) Being able to speak well requires mastery of tones, that's true, but when you are forming sentences, it's less clear what depends on what.

There is no point in learning words like 颠覆 (to overturn a regime) without knowing the conjunctions needed to explain causal relations.
But you could use that word more or less by itself, could you not? If you're able to use a word properly, you've already learned a piece of independent knowledge.

We're already using the HSK as a guide to draw up divisions in fluency. Surely there can be somewhat of a straight line going from HSK1 to someone being able to take the 高考 (gaokao) or even the 公考 (Chinese civil-service exam).
I'd say that only if everyone learns the same way can there be a straight line. Otherwise, it seems too idealistic to me to think that you can give everyone the same systematic course curriculum and expect experts to come out at the other end without fail. For that, people's ways of thinking, and learning habits, are too different. A course can only give you the hard facts; how exactly you process them, how you conceptualize them and study them still has to be left to the learner. One could think of this as the "conscious-subconscious" difference. The course material can usually only give you what is conscious, but the subconscious has to work in tandem with it, even though you can't tell your subconscious as directly what it should do and think. An easy example:

The course tells you that you should put a 吗 at the end of a yes-no question. That's easy enough for your conscious mind to grasp. But it's the job of your subconscious to catch any instance while you're speaking where you might still forget to use it, and to learn it in such a way that it is crystal clear to you when you should use it. When learning, you need to attach this knowledge to your previous brain. And that works a little differently for everyone, because everyone's previous brain is different. So a large part of your learning ability just depends on your knowing how you work.

I hope I could make my points a little clearer. Now I'll go back from theory into practice, I'll study a bit more with these new flashcards. I think they work rather well, if interspersed with some general reading. :)
 

leguan

探花
Dear Shun,
Sorry to hear you haven't been so well - remember your health is the most important thing! We can wait for your cards later or, if we don't want to wait, we can do the work to make them by ourselves!

Thank you very much for making new sets of flashcards including the above new arrangement and also with reordering by words rather than characters!

I'm enjoying practicing with them and playing around with the test settings.
The learning methodology you describe above is quite instructive! I'm still not 100% sure how best to use them yet in terms of cards selection and scoring etc. So I'll keep playing with them and give you more feedback later.

One thing I did notice is that if you use Show: Defn + Audio and turn off the Auto-Play on show and Auto-Play on reveal settings in the Audio settings you can always have the option to listen to the Pinyin, albeit the original ordered pinyin, if you prefer to try writing the characters using both the Pinyin and English sentence as hints. In any case, you don't seem to lose anything by using this setting.

Best regards
leguan
 

Shun

状元
Dear leguan,

thank you kindly for your words of reason! I will honor them by taking breaks.

Naturally, if you wish to re-create my Python setup on your computer with all the source files, I'd be glad to provide you with everything! Feel free to let me know by private message if so. What I did better this time was to put all the functionality into smaller functions in Python. This already makes the source code easier to handle.

It's great to hear you're already enjoying practicing with the cards and finding good methodologies. I am, too. I like to put all my practice in Notability on my iPad. Through iCloud sync, this allows me to review my studying on my iPhone when I'm on the move.

That's great, I have the Auto-Play on reveal setting on, too. I like how the TTS reads the sentence back to me after I reveal the headword.

Best regards, you're welcome,

Shun
 

Shun

状元
Dear leguan, dear all,

I'd like to share two additional lessons I learned after a few more hours of studying.

The learning methodology you describe above is quite instructive! I'm still not 100% sure how best to use them yet in terms of cards selection and scoring etc. So I'll keep playing with them and give you more feedback later.
  1. I think it's best to study only sentences that one hasn't yet done, with the Repeat cards during session setting enabled. Trying to repeat previously studied sentences exactly in Pleco a few days later usually moves one's focus to less important details about the particular sentences or their translation. Patterns of sentence structures and useful constructions repeat themselves often enough in different forms for us to be able to recognize and produce them more quickly, and to consolidate our memory of them.

  2. Speed is a prerequisite for being able to express oneself fluently and correctly. In Chinese, as we know, the ordering of sentence elements is much more important for constructing meaning unambiguously than in most Western languages that have inflections and cases. So if one isn't able to put a sentence together within a short time of seeing and understanding its parts, it is preferable to reveal the answer right away, study the correct sentence structure attentively, and mark one's answer incorrect.
Best regards, have fun,

Shun
 

leguan

探花
Hi Shun,
Thank you for sharing your thoughts on methodology. I agree very much with your first point. Point 2 is also food for thought and is definitely an important aspect to consider when considering the purpose and effectiveness of various methodologies!
Best regards
leguan
 

Shun

状元
Hi leguan,

you're welcome! Yeah, the slower way (trying to develop the sentences on paper) should be ideal if you're more of a beginner, and the fast way is ideal if you're more advanced and/or unable to write anything down.

Best regards,

Shun
 

Shun

状元
Thank you, that's a pretty huge database/parallel corpus, it seems. I didn't know it. If I search for "责任", for example, it finds 2,340 sentences, while Tatoeba only has 49 sentences containing the word. But, to be fair, Tatoeba has more self-contained sentences, whereas the Purple Culture ones were taken from longer texts, which makes them harder to use for studying with the Random Order system because one doesn't know the context. But it's still great for determining the range of usage of a particular vocabulary item. Thanks again!

Edit: @Weyland Sorry, I was wrong; most sentences are indeed self-contained there, as well.
 

Weyland

进士
- Study the categories using Test type Self-graded, Show: Definition. Under Display, turn off the option Filter head in defns if you have it enabled.
Btw, from your main post; Are you sure it's not "Show: Pronunciation"? Because when I choose "Show: Definition" I get the English translation as opposed to the scrambled sentence.
 

Shun

状元
Btw, from your main post; Are you sure it's not "Show: Pronunciation"? Because when I choose "Show: Definition" I get the English translation as opposed to the scrambled sentence.
You could do either, whichever way you prefer! Just make sure to activate Reveal parts separately so you get clues in a piecewise manner. I'm happy that you're trying it out; any feedback is appreciated.
 

Weyland

进士
Well, I did give it a download and took a look at it before, and decided it wasn't for me. (As my biggest issue is post-HSK active vocabulary).

Some of the issues that aren't accounted for:
  • Words in quotations "x" should not be separated from their quotations. Sentences that have multiple quotes in them should probably not be included.
  • Is separating the measure words, e.g. "一项", into “一” and "项" on purpose?
  • Symbols, such as exclamations, question marks and commas, should probably be featured in the scrambled-up sentences. Chinese has multiple versions of the comma, not all of them carry over to the scrambled-up sentences.
  • You should probably give idioms another pass-over. e.g. 过河拆桥's sentence is just “过河拆桥。”, which is the sentence, but goes against the spirit of this game.
  • When sentences feature multiple non-HSK words they should probably not be part of an HSK list. That, or, you could give a quick word+definition as part of the English translation.
  • Not a big fan of short sentences. Apart from the idioms, "你在撒谎." and "我胃疼." seem like a waste. Maybe have sentence length correspond to HSK level? Since I shared that Purple Culture database with you it shouldn't be too difficult to have 15+ character sentences for HSK5-6.
  • To continue the point above: if the scrambled-up sentence, as is the case with "我胃疼.", only consists of two pieces, it's probably not worth including. Heck, anything less than 5 (excluding the symbols) is probably not worth including. Maybe for HSK1 "你好吗?" would work? I guess?
  • Can we avoid sentences with transliterated names, or names at all?
  • IF the HSK example sentences are going to include idioms, can we use the frequency list we created to make sure they're part of the 500-600 most common idioms? While you can guess the meaning of 厚颜无耻(freq. 1800-2000), it doesn't make sense to include it in a list for let's say HSK5 that only includes one idiom (讨价还价).
  • One of the sentences: "A n n 是 啦啦队队长." chooses 啦啦 (which means gossiping) as a word, as opposed to 啦啦队 (which means cheerleading). How does your code sort through which word to pick? Long words should probably take priority, after HSK specific words, to avoid such problems.
  • 《Whatever is inside the brackets should probably remain inside the brackets》... Same goes for all brackets.
  • “不” seems to always accompany another character(s) if that character is a verb, sometimes it messes up. 不知道 is an entry on Pleco, and will stay together. But, 不希望 will be divided up into "不希" and "望", which doesn't make sense.
  • If the sentence has a question mark and a 吗, then let's not separate them.
That's it for now. I'm off to make myself some grub.
 

Shun

状元
Hi Weyland,

thank you for some very good points, let me address them one by one:

Well, I did give it a download and took a look at it before, and decided it wasn't for me. (As my biggest issue is post-HSK active vocabulary).
Do you really think the sentences are too easy for someone at HSK level 6? I feel they contain plenty of vocabulary that isn't in any of the six HSK levels, especially the ones from dict.cn. Even if one can often guess what a Chinese word one has never seen before means, these sentences still allow one to practice proper word usage precisely. At least from my former classmates at university, quite a few of whom have successfully taken the HSK 6, I can tell that a lot of practice in the feeling for vocabulary, and the proper use of vocabulary is still necessary. I'd say HSK 6 gives you a skeleton of hard vocabulary, but there are many more basic words related to them that need to be practiced before you are able to form sentences that are indistinguishable from a native speaker's, i.e. that are in every way adequate to a situation and sound very natural. For example, if we have the sentence:

她多次遭到同事侮慢。
She suffered many slights from colleagues.

If you're at HSK level 6, you can deduce that "suffer, incur" corresponds to 遭到, or that 侮慢 must have to do with 侮辱, but let's be honest, we probably didn't yet know that the first two words are the proper words to use in exactly this situation: Slightly abstract, slightly euphemistic, referring to something unpleasant, but not wanting to express oneself too bluntly. So that is something you can learn, that is quite useful.

It may well be that not every Chinese learner cares about such niceties, though to me at least, being able to find precisely the right word in Chinese seems pretty important.

Some of the issues that aren't accounted for:
  • Words in quotations "x" should not be separated from their quotations. Sentences that have multiple quotes in them should probably not be included.
Yes, if the content of a quotation is made up of just one word, I wouldn't separate them, either. Why would you remove a sentence with multiple quotations? It could still be a good sentence.

  • Is separating the measure words, e.g. "一项", into “一” and "项" on purpose?
Yes, I would keep it that way, so the learner can't see immediately what the measure word is, and needs to recognize it and also assign it to the right noun phrase, in case there is more than one measure word.

  • Symbols, such as exclamations, question marks and commas, should probably be featured in the scrambled-up sentences. Chinese has multiple versions of the comma, not all of them carry over to the scrambled-up sentences.
Indeed, thanks, I noticed that, too. This is definitely something I need to fix.

Edit: I verified that my lists do contain all punctuation marks the way they're supposed to, but once the cards are imported into Pleco, some of them are lost. Perhaps Pleco does some parsing in the pronunciation field. Another reason to look forward to Pleco 4, where, as we know, custom fields can be created.

  • You should probably give idioms another pass-over. e.g. 过河拆桥's sentence is just “过河拆桥。”, which is the sentence, but goes against the spirit of this game.
Oh yes, a one-idiom sentence would need to be scrambled.

  • When sentences feature multiple non-HSK words they should probably not be part of an HSK list. That, or, you could give a quick word+definition as part of the English translation.
I rated the list by HSK levels, though they aren't strictly meant for HSK learners of those levels. I just use the HSK level vocabulary to rate them. Instead of HSK, I should perhaps introduce a simple scale with "Novice", "Beginner", "Lower Intermediate", "Intermediate", "Upper Intermediate", "Advanced", "Expert", and "Native-level", or similar.

A "word+definition" line below the English translation would look nice, but it would give you the answer right from the start, which I'd rather not do. But fortunately with Pleco, the student can always tap on any word or phrase once they see the answer, and add it to their flashcards.

  • Not a big fan of short sentences. Apart from the idioms, "你在撒谎." and "我胃疼." seem like a waste. Maybe have sentence length correspond to HSK level? Since I shared that Purple Culture database with you it shouldn't be too difficult to have 15+ character sentences for HSK5-6.
I agree, you're absolutely right. I should check whether a sentence consists of a single four-character idiom, and if so, scramble that. If it doesn't, I could drop the sentence if it is five characters long or shorter.

  • To continue the point above: if the scrambled-up sentence, as is the case with "我胃疼.", only consists of two pieces, it's probably not worth including. Heck, anything less than 5 (excluding the symbols) is probably not worth including. Maybe for HSK1 "你好吗?" would work? I guess?
Exactly, yes, though something like “你好吗?” is already covered by study materials for beginners, so it's probably best if I leave out such short sentences.
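The length checks discussed in the last two points could be sketched roughly as follows. This is only an illustration: `keep_sentence`, its parameters and the idiom set are hypothetical names, and reading "anything less than 5" as "fewer than five non-punctuation pieces" is an assumption.

```python
import re

# Matches one Chinese character in the basic CJK Unified Ideographs block.
HANZI = re.compile(r"[\u4e00-\u9fff]")

def keep_sentence(sentence, pieces, idioms, min_chars=6, min_pieces=5):
    """Decide whether a sentence is worth turning into a scramble card.

    sentence: the original sentence; pieces: its segmented parts;
    idioms: a set of known idioms (hypothetical input).
    """
    chars = HANZI.findall(sentence)
    if "".join(chars) in idioms:
        # One-idiom sentence: keep it, to be scrambled character by character.
        return True
    # Ignore pieces that contain no Chinese character (punctuation, symbols).
    word_pieces = [p for p in pieces if HANZI.search(p)]
    # Drop sentences of five characters or fewer, or with too few pieces.
    return len(chars) >= min_chars and len(word_pieces) >= min_pieces

print(keep_sentence("我胃疼.", ["我", "胃疼", "."], set()))
```

The two thresholds are kept as parameters so they could be raised per level, for example to require longer sentences at HSK5-6.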

  • Can we avoid sentences with transliterated names, or names at all?
Yeah, in Tatoeba, there are just a handful of names that are repeated across all sentences. I should at least not scramble them, i.e. leave them together.

  • IF the HSK example sentences are going to include idioms, can we use the frequency list we created to make sure they're part of the 500-600 most common idioms? While you can guess the meaning of 厚颜无耻(freq. 1800-2000), it doesn't make sense to include it in a list for let's say HSK5 that only includes one idiom (讨价还价).
I agree, yeah, I will integrate your frequency-sorted list of idioms into the rating code. Then we could even separate out all sentences with idioms, so users can practice sentences with idioms specifically.

  • One of the sentences: "A n n 是 啦啦队队长." chooses 啦啦 (which means gossiping) as a word, as opposed to 啦啦队 (which means cheerleading). How does your code sort through which word to pick? Long words should probably take priority, after HSK specific words, to avoid such problems.
Yes, it starts looking for all words of four characters, then three, two, and one (using the HSK lists and the BCC list together). I saw that my BCC frequency list doesn't have 啦啦队, so that's the reason.
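For illustration, a longest-match-first scan of this kind might look like the sketch below. It is a generic forward maximum-matching routine written under assumptions (the name `segment_forward` and the toy lexicons are made up), not the actual code, but it reproduces the 啦啦队 effect when the word is missing from the word list.

```python
def segment_forward(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation, trying the longest candidate first."""
    pieces, i = [], 0
    while i < len(text):
        # Try 4 characters, then 3, 2, and finally fall back to 1.
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:
                pieces.append(cand)
                i += n
                break
    return pieces

# Toy lexicons, for illustration only.
without_word = {"啦啦", "队长"}
with_word = without_word | {"啦啦队"}
print(segment_forward("是啦啦队队长", without_word))  # 啦啦 is picked, 队 is left over
print(segment_forward("是啦啦队队长", with_word))     # 啦啦队 is found first
```

Adding a larger dictionary to the lexicon fixes this particular case without changing the scan itself.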

  • 《Whatever is inside the brackets should probably remain inside the brackets》... Same goes for all brackets.
Perhaps I'll scramble the entire sentence if the character string inside brackets is five characters long or less, or if it's longer, I will keep it inside the brackets and scramble that.

  • “不” seems to always accompany another character(s) if that character is a verb, sometimes it messes up. 不知道 is an entry on Pleco, and will stay together. But, 不希望 will be divided up into "不希" and "望", which doesn't make sense.
Yes, thanks, that isn't good. I think I could try searching for substrings from right to left instead of from left to right. Then the word 希望 would get caught before 不希.
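A rough sketch of the right-to-left idea, under the assumption that the word list contains both 不希 and 希望 (the function name and the toy lexicon are hypothetical). Scanning from the end of the sentence, 希望 is matched before 不希 can be considered:

```python
def segment_backward(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation, trying the longest candidate first."""
    pieces, end = [], len(text)
    while end > 0:
        # Try the 4 characters ending at `end`, then 3, 2, and finally 1.
        for n in range(min(max_len, end), 0, -1):
            cand = text[end - n:end]
            if n == 1 or cand in lexicon:
                pieces.append(cand)
                end -= n
                break
    pieces.reverse()  # pieces were collected from the end of the sentence
    return pieces

# Toy lexicon, for illustration only.
print(segment_backward("我不希望他来", {"不希", "希望"}))
```

Backward matching has its own failure cases, so it would be worth comparing both directions on a sample of sentences before switching.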

  • If the sentence has a question mark and a 吗, then let's not separate them.
Yes, it makes sense not to have the user do anything that's obvious.

That's it for now. I'm off to make myself some grub.
Thank you very much! Guten Appetit!

Cheers,

Shun
 

Weyland

进士
Do you really think the sentences are too easy for someone at HSK level 6?... being able to find precisely the right word in Chinese seems pretty important.

Active vocabulary. That's where the distinction is. This might be great for passive vocabulary, but you're still being a passive observer to their use. It probably has a leg up from your standard flashcard drills, but unless you're going out of your way to use these words in your own writing they won't enter your daily lexicon.


Why would you remove a sentence with multiple quotations? It could still be a good sentence.
Sometimes a sentence is made up of 2-3 small quotes. E.g. "Quoted text A", -conjunction- "quoted text B", -conjunction- "and maybe another quoted text C". Such a sentence is difficult to make work, as they're just 3 strings of text that are being held together with context. If you scramble it up and you have pieces from A, B and C interspersed, it becomes a blind-guessing game.

Oh yes, a one-idiom sentence would need to be scrambled. ... I should make a check if a sentence is composed of a four-letter idiom, then scramble that. If it isn't, I could drop the sentence if it is five characters long or shorter.
IF you just included the 112 idioms from the HSK that'd be fine. But, even natives have a difficult time parsing together idioms. Let's take this sentence for example from Purple Culture;

"这是一场没有任何征兆,突如其来,暴虐无情的灾难,把我们鲜活的生命连根拔起取而代之的是痛苦,悲伤,和万念俱灰。"
It was an unplanned, unexpected, and unforgiving catastrophe. It uprooted lives, and replaced it with anguish, sorrow, and hopelessness in those affected.

Both 突如其来 and 取而代之 are relatively common, and will be part of the later levels of the new HSK. But 万念俱灰 is gaokao level and not part of the top 2000 most used idioms. Then there is 连根拔起, which is a colloquialism, but which can be mistaken for an idiom. So you now have 16 random characters from which you have to make 4 words: two of which are part of HSK8 or HSK9, one which is gaokao level and one which is a dialectal colloquialism. Can you see how such a sentence would quickly devolve into such a complex mess that it'd defeat its own purpose?


I rated the list by HSK levels, though they aren't strictly meant for HSK learners of those levels.
Still, let's not use too much post-HSK vocab that isn't part of either HSK3.0 or the PSC wordlist for the non-advanced levels. If I'm doing HSK3 level scrambled sentences I wouldn't want to see anything above HSK4 in there. Also, if you're going the distance you'd probably have to code it to recognize sentence constructions (句型) to judge its HSK level. E.g. “不像话” is HSK6 grammar, yet all words used are HSK1-3. I don't have my HSK books anymore, but I assume there are better examples.

But fortunately with Pleco, the student can always tap on any word or phrase once they see the answer, and add it to their flashcards.
Yeah, after you've shown the results, at which time it's too late. Whenever you're taking a test it's standard that they'll provide you with additional vocabulary if it's not part of the course work.

I agree, yeah, I will integrate your frequency-sorted list of idioms into the rating code. Then we could even separate out all sentences with idioms, so users can practice sentences with idioms specifically.
I think idioms will be the most difficult to tackle. As some idioms, e.g. 入乡随俗, are part of the BLCUP published course-work vocab because they're featured in the reading section as topics. Yet, their frequency is more in the range of 高考 vocab. IF you were going to make an idiom specific section it'd make sense to have an HSK collection and a separate frequency-based collection that come in 25-40 idiom batches.

I saw that my BCC frequency list doesn't have 啦啦队, so that's the reason.
So, the only solution would be to include all of Pleco's dictionary entries?

Perhaps I'll scramble the entire sentence if the character string inside brackets is five characters long or less, or if it's longer, I will keep it inside the brackets and scramble that.
《Let's not scramble book titles... Otherwise, this will turn into the Chinese language students' "Jeopardy"》 Anything inside brackets should not be scrambled in 9/10 cases. And if the sentence starts with a 《》 segment it might be best to not include it altogether, as it's like a quote.

The same goes for names -> ・ <- this dot is used in between the first and last name of transliterated names. I wouldn't want to unscramble George Orwell's (乔治·奥威尔) name.
 

Shun

状元
Active vocabulary. That's where the distinction is. This might be great for passive vocabulary, but you're still being a passive observer to their use. It probably has a leg up from your standard flashcard drills, but unless you're going out of your way to use these words in your own writing they won't enter your daily lexicon.
Yes, these flashcards are more about word order, but while studying I've been making notes of my observations on when and why particular words are used. If you review those notes and use the words in your own writing within a short time of doing these flashcards, most of them should make their way into your active vocabulary.

If you want to speak or write Chinese well, there is no alternative to speaking and writing Chinese, without any help from flashcards. I agree with you on this.

Thanks for the link. However, if you agree with my text above, I don't yet see why you think the flashcards don't work for you. :) (in terms of the difficulty of the vocabulary and their usefulness for enhancing your active vocabulary)

Sometimes a sentence is made up of 2-3 small quotes. E.g. "Quoted text A", -conjunction- "quoted text B", -conjunction- "and maybe another quoted text C". Such a sentence is difficult to make work, as they're just 3 strings of text that are being held together with context. If you scramble it up and you have pieces from A, B and C interspersed, it becomes a blind-guessing game.
OK, such sentences seem to exist. :) Thanks!

IF you just included the 112 idioms from the HSK that'd be fine. But, even natives have a difficult time parsing together idioms. Let's take this sentence for example from Purple Culture;

"这是一场没有任何征兆,突如其来,暴虐无情的灾难,把我们鲜活的生命连根拔起取而代之的是痛苦,悲伤,和万念俱灰。"
It was an unplanned, unexpected, and unforgiving catastrophe. It uprooted lives, and replaced it with anguish, sorrow, and hopelessness in those affected.

Both 突如其来 and 取而代之 are relatively common, and will be part of the later levels of the new HSK. But 万念俱灰 is gaokao level and not part of the top 2000 most used idioms. Then there is 连根拔起, which is a colloquialism, but which can be mistaken for an idiom. So you now have 16 random characters from which you have to make 4 words: two of which are part of HSK8 or HSK9, one which is gaokao level and one which is a dialectal colloquialism. Can you see how such a sentence would quickly devolve into such a complex mess that it'd defeat its own purpose?
Yeah, in such a case, the idioms should not be scrambled. I find it interesting how the speaker shows their emotional involvement and underscores the gravity of the situation by using so many idioms in this sentence.

Still, let's not use too much post-HSK vocab that isn't part of either HSK3.0 or the PSC wordlist for the non-advanced levels. If I'm doing HSK3 level scrambled sentences I wouldn't want to see anything above HSK4 in there. Also, if you're going the distance you'd probably have to code it to recognize sentence constructions (句型) to judge its HSK level. E.g. “不像话” is HSK6 grammar, yet all words used are HSK1-3. I don't have my HSK books anymore, but I assume there are better examples.
Yeah, we can't make the algorithm absolutely perfect (Only humans could rate all sentences perfectly, of course.), but that's another aspect to consider.

Yeah, after you've shown the results, at which time it's too late. Whenever you're taking a test it's standard that they'll provide you with additional vocabulary if it's not part of the course work.
Very true. But for over 40,000 sentences, it would just be too much work to check and prepare each of them. I think you can write on your testing sheet—where you write (parts of) the sentences by hand—what you noticed about the usage of new vocabulary and later review that. (Then possibly study the words in flashcards and add them to a user dictionary with a personal definition, as well.) And let's not forget that if you saw the words' meanings from the start, you probably wouldn't learn them, either.

I think idioms will be the most difficult to tackle. As some idioms, e.g. 入乡随俗, are part of the BLCUP published course-work vocab because they're featured in the reading section as topics. Yet, their frequency is more in the range of 高考 vocab. IF you were going to make an idiom specific section it'd make sense to have an HSK collection and a separate frequency-based collection that come in 25-40 idiom batches.
Good idea. For the HSK collection, I should simply be able to check which vocab items are four characters long, or see if they're in your list of idioms. For the BCC list, I could do something similar.

So, the only solution would be to include all of Pleco's dictionary entries?
Yes, but for reasons that are very understandable, Pleco doesn't support exporting of entire dictionaries. I could, however, download the newest CC-CEDICT dictionary and add its vocabulary items to my collection of expressions. I'll definitely do that.

《Let's not scramble book titles... Otherwise, this will turn into the Chinese language students' "Jeopardy"》 Anything inside brackets should not be scrambled in 9/10 cases. And if the sentence starts with a 《》 segment it might be best to not include it altogether, as it's like a quote.

The same goes for names -> ・ <- this dot is used in between the first and last name of transliterated names. I wouldn't want to unscramble George Orwell's (乔治·奥威尔) name.
Yeah, thanks, you definitely have a point. I will look at examples myself and then probably implement it the way you suggest (keeping them together).
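One possible way to keep such spans together is a post-processing pass over the already segmented pieces, sketched below. The function name and the assumption that brackets and the name dot arrive as separate pieces are hypothetical; only 《》 and the two dot characters seen in this thread are handled.

```python
NAME_DOTS = {"·", "・"}  # middle dots used inside transliterated names

def merge_protected(pieces):
    """Glue together pieces inside 《…》 and pieces joined by a name dot."""
    merged = []
    title = None       # collects an open 《…》 span, or None
    glue_next = False  # True right after a name dot
    for p in pieces:
        if title is not None:
            title += p
            if p == "》":
                merged.append(title)
                title = None
        elif p == "《":
            title = p
        elif p in NAME_DOTS and merged:
            merged[-1] += p       # attach the dot to the given name
            glue_next = True
        elif glue_next:
            merged[-1] += p       # attach the family name after the dot
            glue_next = False
        else:
            merged.append(p)
    if title is not None:
        merged.append(title)      # unclosed bracket: keep what was collected
    return merged

# Toy segmentation, for illustration only.
pieces = ["我", "喜欢", "乔治", "·", "奥威尔", "的", "《", "动物", "农场", "》", "。"]
print(merge_protected(pieces))
```

Only the merged list would then be shuffled, so a title or a full name always moves as one piece.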

Thanks again for your valued input. I should be able to implement these changes in the coming days.

Cheers,

Shun
 