79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun · Dec 17, 2018

Hi leguan,

that sounds great! The ranking in the BCC frequency list can of course be seen as proportional to the difficulty of a word (except for names, which can be rare, but also easy). Let’s just try both and see which technique produces better results!

Best regards,

Shun

leguan · Dec 17, 2018

Hi Shun,
You make very good points!
A blend of both evaluation strategies might give the best results.

Haha!

Shun · Dec 17, 2018

Hi leguan,

thanks! Hehe, true! We could perhaps use the BCC corpus in some way for all non-HSK words. Let’s see.

pdwalker · Dec 17, 2018

Shun said:
This allows for sentence reuse across different language pairs, but it also means that you can get 1:n (one-to-many) correspondences.

Yeah, I was thinking about that.

While there were some true duplicates (words and definitions), the others were along those lines.

My first pass approach to solving that would be to fold the alternate definitions together into the single definition: e.g.
别来 bie2lai2烦我. Don't bother me / Don't bug me

As for the alternate Chinese -> English sentences, I'd leave them as they are unless a native speaker told me that one of them was obviously wrong.

pdwalker · Dec 17, 2018

Shun said:
...and then, based on the HSK level counts for a sentence, assign a HSK level using a good formula (perhaps the weighted average HSK level of all the words in the sentence plus 1), with HSK 6 still being the maximum. Or we could even introduce a HSK 7 level for what is even more advanced.

Actually, that would work rather well. Your thinking is a lot smarter than mine was: originally I was thinking of only weighting the answer words, not the entire sentence.

However, if you could give weight to all of the characters/phrases appropriately, then you could generate a weighting based primarily on the answer as a minimum weight, and the question as a maximum weight.

E.g. an HSK 1 answer with an HSK 5 question should be considered HSK 5. An HSK 3 answer with one or two higher-level HSK characters in the question... might still be considered HSK 3? 3+?

Rather than a weighted average, it might have to take into account the absolute number of HSK characters in the question.

Shun · Dec 17, 2018

Hi pdwalker,

first post: Good thought, I will perform the folding using a 10-line Python script later because it’s so easy.

Second post: Thanks! We could try just averaging the HSK values of the HSK words and then blending the result with a score from the BCC corpus for all words/all non-HSK words. Perhaps there is even some Digital Humanities paper about assessing the difficulty of Chinese sentences that we could learn a thing or two from.
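
One way such a blend could look, purely as a sketch: the logarithmic rank-to-level mapping, the `max_rank` cutoff, and the 50/50 weighting are all assumptions chosen for illustration, and the function names are hypothetical. Nothing here is taken from the BCC corpus itself.

```python
import math

def rank_to_level(rank, max_rank=100_000):
    """Map a frequency rank (1 = most frequent) onto a 1-6 scale.

    The logarithmic mapping and max_rank are illustrative assumptions.
    """
    rank = min(max(rank, 1), max_rank)
    return 1 + 5 * math.log(rank) / math.log(max_rank)

def blended_difficulty(hsk_levels, bcc_ranks, hsk_weight=0.5):
    """Blend the mean HSK level of the HSK words with a rank-derived
    level for the non-HSK words. hsk_weight is an assumed tuning knob."""
    hsk_part = sum(hsk_levels) / len(hsk_levels) if hsk_levels else None
    bcc_part = (
        sum(rank_to_level(r) for r in bcc_ranks) / len(bcc_ranks)
        if bcc_ranks
        else None
    )
    if hsk_part is None:
        return bcc_part  # may also be None if nothing was recognized
    if bcc_part is None:
        return hsk_part
    return hsk_weight * hsk_part + (1 - hsk_weight) * bcc_part
```

Names would still need special handling, since they can be rare but easy, and a rank-based score would overrate them.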

pdwalker · Dec 17, 2018

Shun said:
Perhaps there is even some Digital Humanities paper about assessing the difficulty of Chinese sentences that we could learn a thing or two from.

*laugh* Academics!

(Ribbing aside, that's a good idea)

Shun · Dec 17, 2018

Yeah, there's a lot of material. Thanks! I already found one, it's called:

"Automatic Difficulty Assessment for Chinese Texts", by "John Lee, Meichun Liu, Chun Yin Lam, Tak On Lau, Bing Li, Keying Li" from City University of Hong Kong.

I won't add any link for legal reasons, but it can easily be found through Google. They do it for entire texts rather than individual sentences, but maybe later it will give us some good ideas.

Shun · Dec 17, 2018

Hi pdwalker and leguan,

I've just folded leguan's Chinese-English Tatoeba list from message #42 (46'604 lines) into a new file with 45'002 lines, where different English translations of identical Hanzi-Pinyin combinations are grouped together, with a " / " separator between the different variants. Takes just a second to execute. I attach the simple script and the resulting folded list.

It would be even more effective if we did this folding before replacing one Chinese word with its pinyin using Leguan's program. So this is just a start!
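
The attached script is not reproduced in this thread. As a rough sketch of what such folding could look like, assuming one entry per line with the Hanzi-Pinyin text and the English translation separated by a tab (the file layout and the function name `fold_translations` are assumptions):

```python
def fold_translations(lines, sep="\t", joiner=" / "):
    """Group lines that share the same Chinese field and join their
    distinct English translations with the given joiner."""
    folded = {}  # dicts keep insertion order, so first-seen order is preserved
    for line in lines:
        line = line.rstrip("\n")
        chinese, found, english = line.partition(sep)
        if not found:
            continue  # skip blank or malformed lines without a separator
        variants = folded.setdefault(chinese, [])
        if english not in variants:  # drop true duplicates
            variants.append(english)
    return [
        f"{chinese}{sep}{joiner.join(variants)}"
        for chinese, variants in folded.items()
    ]
```

For example, two lines for 别来bie2lai2烦我. with the translations "Don't bother me." and "Don't bug me." would come out as a single line ending in "Don't bother me. / Don't bug me."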

Best,

Shun

leguan · Dec 17, 2018

Shun said:
Yeah, there's a lot of material. Thanks! I already found one, it's called:

"Automatic Difficulty Assessment for Chinese Texts", by "John Lee, Meichun Liu, Chun Yin Lam, Tak On Lau, Bing Li, Keying Li" from City University of Hong Kong.

Quite fascinating and enlightening!

leguan · Dec 17, 2018

Shun said:
Hi pdwalker and leguan,

I've just folded leguan's Chinese-English Tatoeba list from message #42 (46'604 lines) into a new file with 45'002 lines, where different English translations of identical Hanzi-Pinyin combinations are grouped together, with a " / " separator between the different variants. Takes just a second to execute. I attach the simple script and the resulting folded list.

It would be even more effective if we did this folding before replacing one Chinese word with its pinyin using Leguan's program. So this is just a start!

Best,

Shun

Nice work!

leguan · Dec 18, 2018

Shun said:
Hi pdwalker and leguan,
It would be even more effective if we did this folding before replacing one Chinese word with its pinyin using Leguan's program. So this is just a start!

I agree. But since reindexing the sentences is a rather time-consuming process, I am not currently planning to rebuild the sentence contextual flashcards to incorporate "folding". Creating graded flashcard sets, on the other hand, just requires plugging in grading data for the current sentences and some additional sentence selection logic...

Shun · Dec 18, 2018

leguan said:
I agree. But since reindexing the sentences is a rather time-consuming process, I am not currently planning to rebuild the sentence contextual flashcards to incorporate "folding". Creating graded flashcard sets, on the other hand, just requires plugging in grading data for the current sentences and some additional sentence selection logic...

Sure, that's fine; that might be the very last step. I have already made an HSK rating script.

leguan · Dec 18, 2018

That's amazing! Really looking forward to it!

Shun · Dec 18, 2018

Hello leguan, pdwalker, agewisdom,

Thank you!

So I've created an HSK rating script that calculates the average HSK level of all recognized HSK words in a sentence, averaged with the HSK level of the word with the highest HSK level in the sentence, with the scale shifted by +0.75, rounded to integers and bounded by levels 3 and 6.
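
The attached script is not shown here, so the following is only one reading of that description. It assumes the sentence has already been segmented and each recognized word looked up in an HSK list, so the input is simply a list of levels. The order of operations, the round-half-up mode, and the handling of sentences with no recognized words are assumptions, and `rate_sentence` is a hypothetical name.

```python
import math

def rate_sentence(word_levels, shift=0.75, lowest=3, highest=6):
    """Rate one sentence from the HSK levels (1-6) of its recognized words."""
    if not word_levels:
        return None  # assumption: unrated when no HSK word is recognized
    mean_level = sum(word_levels) / len(word_levels)
    # Average the mean level with the hardest word, then shift the scale.
    blended = (mean_level + max(word_levels)) / 2 + shift
    rounded = math.floor(blended + 0.5)  # round half up (assumed mode)
    return max(lowest, min(highest, rounded))  # clamp to levels 3..6
```

Because the hardest word counts for half of the score, a short sentence with a single HSK 6 word lands near the top of the scale, which would be consistent with short stray sentences being rated HSK 6.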

It's quite basic, of course, but I've had to tweak it quite a bit already, and I believe the results are 80% usable. There are some short stray sentences undeservedly getting HSK level 6, but they're not too many. I attach the Python scripts and the resulting folded and HSK rated list (the "by HSK levels" one). I also provide a list with the HSK level added as a third field (the "with HSK levels" one).

I've done the same with the more difficult HSK list, and, sure enough, the ratings went higher on average. Do you think the ratings are realistic? Enjoy!

Best regards,

Shun

agewisdom · Dec 18, 2018

Shun said:
Hello leguan, pdwalker, agewisdom,

Thank you! So I've created an HSK rating script that calculates the average HSK level of all recognized HSK words in a sentence, averaged with the HSK level of the word with the highest HSK level in the sentence, with the scale shifted by +0.75, rounded to integers and bounded by levels 3 and 6.

Best regards,

Shun

Fantastic work, Shun! Seems Christmas has come early this year.
I will update this on my newbie post on 1.1.2019 after it's baked in a little more.

Shun · Dec 18, 2018

Hi agewisdom,

that's great, many thanks!

Shun

leguan · Dec 18, 2018

Hi Shun,
Wow, I'm really impressed with how quickly you were able to make these graded flashcard sets, and think you have already gone a long way to allowing a student to choose a set of flashcards appropriate for their current level!! Hats off!

After looking through some of the "folded" sentences, I also believe this is a really valuable improvement, and that having multiple translations can really help one gain a more in-depth understanding of the corresponding Chinese phrases!

>Do you think the ratings are realistic?

This is quite a hard question to answer, especially for individual sentences!
However, the average "difficulty" of grade X+1 does certainly appear to be markedly greater than the average "difficulty" of grade X - this in itself is a big step forward!

Best regards,
leguan

Shun · Dec 18, 2018

Hi leguan,

thanks a lot!

Glad to hear it. While there would surely be some more advanced techniques, perhaps including statistics and machine learning, to determine the level—a bit like in the paper I cited above—I agree it's certainly a big step ahead already.

I agree that the difficulty of sentences relative to one another is usually represented correctly. As a next step, I can try including the BCC corpus frequency lists.

Best regards,

Shun

leguan · Dec 19, 2018

Hi Shun,
You're welcome!

Actually, I am a little concerned that those who read this thread may be under the false impression that the "by HSK levels" flashcards you have posted have been optimized for each HSK level. While I am sure that your intention was not to convey this impression, many of the subtleties may go unnoticed.

I believe your current intention is to develop "difficulty" analysis tools as a first step. Using the current sentence contextual flashcards is a good way to evaluate the analysis tools (i.e. proof of concept) but not ideal for creating HSK graded sentence contextual flashcard sets, since the sentence contextual flashcards have been specifically optimized for HSK6+ students.

That is one reason why so few lower-level HSK flashcards remain after categorizing sentences by difficulty.

Another point that should be noted by readers is that the tested words in the "by HSK levels" list are not limited to those HSK levels, e.g. "盒子 in 请看这里的这个he2zi5. Please look at this box here." is categorized as HSK 3, but it is an HSK 4 word, etc.

To avoid any possible such misunderstandings, I would like to propose that discussion of "difficulty" analysis software should be moved to a separate thread and any further discussion here should be limited to the original topic(s) i.e. the sentence lists and the sentence contextual flashcards based on them. What do you think?
