Wrong traditional characters

#1
I noticed that Pleco often shows wrong traditional variants when using any dictionary but CC-CEDICT (maybe that's because CC-CEDICT comes with its own traditional variants). E.g. Pleco doesn't distinguish between 游/遊, 周/週, 回/迴, etc. Any entry that should have 遊 still shows 游 for some reason. 游戲 is not a correct word. Nor is 周刊 valid traditional Chinese. This situation reminds me of people putting 美發 on their shops in many places on the mainland.

Actually, these exact mistakes all appear in Wenlin/the ABC dictionary. If I'm allowed to make an assumption, I would assume that all dictionary databases except for CC-CEDICT and ABC came only in simplified, and the Pleco company used the Wenlin/ABC system to make the traditional editions. Which is pretty tragic, since ABC is notorious for its often wrong traditional forms. The Wenlin/ABC editors didn't even bother consulting MoE/國語辭典, and lots of their character entries are full of guessing ("traditional form is sometimes X or Y"), and I'm not even joking! If you couldn't afford proper simplified to traditional conversion software, then at least using the CC-CEDICT database would have been a much better choice. Otherwise, it's just misleading at the moment and may even misteach people. This is called 誤人子弟.

I'm awfully sorry if I'm too frank about this issue, but it's just so unprofessional in my opinion.
 

mikelove

皇帝
Staff member
#2
Yiliya said:
I noticed that Pleco often shows wrong traditional variants when using any dictionary but CC-CEDICT (maybe that's because CC-CEDICT comes with its own traditional variants). E.g. Pleco doesn't distinguish between 游/遊, 周/週, 回/迴, etc. Any entry that should have 遊 still shows 游 for some reason. 游戲 is not a correct word. Nor is 周刊 valid traditional Chinese. This situation reminds me of people putting 美發 on their shops in many places on the mainland.

Actually, these exact mistakes all appear in Wenlin/the ABC dictionary. If I'm allowed to make an assumption, I would assume that all dictionary databases except for CC-CEDICT and ABC came only in simplified, and the Pleco company used the Wenlin/ABC system to make the traditional editions. Which is pretty tragic, since ABC is notorious for its often wrong traditional forms. The Wenlin/ABC editors didn't even bother consulting MoE/國語辭典, and lots of their character entries are full of guessing ("traditional form is sometimes X or Y"), and I'm not even joking! If you couldn't afford proper simplified to traditional conversion software, then at least using the CC-CEDICT database would have been a much better choice. Otherwise, it's just misleading at the moment and may even misteach people. This is called 誤人子弟.
Interesting... wasn't aware of this (in fact I think you're the first person in a while to make any complaint about our traditional conversions).

We actually use a combination of systems for our jianti/fanti conversion (though if you're aware of a better one I'd very much appreciate a recommendation) - checking the first error you mention at least, 游/遊 look like they're actually classified as Z-variants (same meaning but not inherently traditional/simplified) rather than traditional/simplified variants in the Unihan database. Now according to the notes a lot of the data in that database did originally come from Wenlin, but it's a publicly-defined standard, and if there's something wrong with that database then the MoE / Taiwan government are perfectly capable of submitting a correction. Anyway, the most widely-used database of Chinese character information doesn't currently consider 遊 to be the traditional variant of 游. Neither does 汉语大词典, which includes definitions for both 游戲 and 遊戲.
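To make the Z-variant versus traditional-variant distinction concrete, here is a minimal sketch of reading variant relations from lines laid out like Unihan's tab-separated variants file. The sample lines are illustrative and only mirror what this post describes; they are not copied from the real data file, and `parse_variants` and `relation` are hypothetical helper names.

```python
# Illustrative lines in a tab-separated "code point, field, value" layout.
# They mirror the situation described in the post, not the actual Unihan file.
SAMPLE = """\
# code point<TAB>field<TAB>one or more target code points
U+53D1\tkTraditionalVariant\tU+767C U+9AEE
U+6E38\tkZVariant\tU+904A
"""


def parse_variants(text):
    """Return {char: {field: [target chars]}} from tab-separated variant lines."""
    table = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        code, field, value = line.split("\t", 2)
        # Assumption: a target may carry a "<source" suffix; drop it if present.
        targets = [chr(int(tok.split("<")[0][2:], 16)) for tok in value.split()]
        table.setdefault(chr(int(code[2:], 16)), {})[field] = targets
    return table


def relation(table, a, b):
    """Name the field that links character a to character b, or None."""
    for field, targets in table.get(a, {}).items():
        if b in targets:
            return field
    return None


table = parse_variants(SAMPLE)
print(relation(table, "游", "遊"))
print(relation(table, "发", "髮"))
```

A converter that only follows the traditional-variant field would never turn 游 into 遊 with data like this, which matches the behaviour reported at the top of the thread.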

With the "traditional form is sometimes" business, I wouldn't necessarily assume that that's because of sloppy research - traditional characters are used outside of Taiwan too, and the MoE doesn't get to define what is and is not a traditional character for the entire world any more than the PRC gets to force people outside of the PRC to use simplified. Language is usage - Google lists 12 million results for "游戲"; if enough people write "美發" on shop signs then that will eventually become a valid way of writing it regardless of what any bureaucrat says.

CC-CEDICT actually has quite a few traditional conversion issues too, mainly due to the lack of an orderly / enforced system for the classification and tagging of variants; lots of incorrect traditional conversions (or at least incorrect-in-Taiwan ones) have their own CC entries and it's not always clear in the database which version is the correct one and which isn't. ABC actually has them handily beaten in that regard since their variants are clearly identified and (more-or-less) prioritized; their big problem is that they don't clearly differentiate between variant ways of writing a particular character and variant ways of writing an entire word, so it's not consistently clear whether these two alternate characters are an unrelated but similarly-pronounced set of characters that somebody decided to use to write down this particular word or whether they're historically-linked alternate ways of writing this particular character. (though in some cases I suppose there may not even be clear linguistic consensus on that)

Yiliya said:
I'm awfully sorry if I'm too frank about this issue, but it's just so unprofessional in my opinion.
Well FWIW we have licensed / will soon be introducing a couple of dictionaries where the headwords were originally written in fanti instead of jianti; the Taiwanese publishers and (in spite of their ostensible desire to promote Taiwanese Chinese around the world) the MoE have refused to do business with us, but there are a few fanti dictionaries published outside of Taiwan too. But for other dictionaries I think automated conversion, however imperfect, is much better than dropping support for fanti altogether - if Taiwanese fanti users want better electronic dictionaries then they need to start demanding that Taiwanese publishers develop / license out those dictionaries. The cost of developing their own equivalent to the ABC would be a rounding error in the MoE's budget, and if they then turned around and gave it away for free (or at least licensed it under very app-developer-friendly terms) you'd very quickly end up in a situation where jianti users were complaining about incomplete conversions from fanti rather than the other way around.
 
#3
mikelove said:
We actually use a combination of systems for our jianti/fanti conversion (though if you're aware of a better one I'd very much appreciate a recommendation) - checking the first error you mention at least, 游/遊 look like they're actually classified as Z-variants (same meaning but not inherently traditional/simplified) rather than traditional/simplified variants in the Unihan database. Now according to the notes a lot of the data in that database did originally come from Wenlin, but it's a publicly-defined standard, and if there's something wrong with that database then the MoE / Taiwan government are perfectly capable of submitting a correction. Anyway, the most widely-used database of Chinese character information doesn't currently consider 遊 to be the traditional variant of 游. Neither does 汉语大词典, which includes definitions for both 游戲 and 遊戲.
I'm sorry, but you don't really know what you're talking about. 游 and 遊 always had different meanings before the simplification process (see 康熙字典). 游's meanings are mostly related to water (游水又浮行也) and it's used in words like 游泳 and 上游/下游. 遊's original meaning is 'to travel, to roam' (遨遊也), and also 'to be carefree, to play' (無事閒暇) by extension. Generally speaking, 游 is always 遊 in TC, unless the meaning is related to water, although there are exceptions, such as 游擊. 'Week' is always 週, so 周刊 is, of course, 週刊. 週 can also mean any repeating cycle of time, so it's used in words like 週年. In short, these mistakes all come from the confusion brought about by the simplification in the PRC. Anyone who has spent a sizeable chunk of time in a region with traditional writing will tell you that 旅游 and 周末 are incorrect.
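Since the right traditional form here depends on the word rather than on the single character, one common way to handle it is to consult a phrase table before falling back to a character table. The following is only a rough sketch of that idea, not how Pleco or any particular converter works; the two tiny tables are hand-made from the examples in this thread and `to_traditional` is a hypothetical name.

```python
# Hand-made tables for illustration only; a usable converter would need far
# larger, carefully researched data.
PHRASES = {
    "游戏": "遊戲",  # play: 遊
    "旅游": "旅遊",  # travel: 遊
    "游泳": "游泳",  # water-related: stays 游
    "游击": "游擊",  # the exception mentioned above
    "周刊": "週刊",
    "周末": "週末",
    "美发": "美髮",  # hair, not 發
}
CHARS = {"戏": "戲", "击": "擊", "发": "發"}  # one-to-one fallback

MAX_PHRASE_LEN = max(len(p) for p in PHRASES)


def to_traditional(text: str) -> str:
    """Convert using the longest matching phrase first, then single characters."""
    out = []
    i = 0
    while i < len(text):
        # Try multi-character phrases, longest first.
        for n in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched here: fall back to the character table.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional("周末旅游"))
print(to_traditional("美发"), to_traditional("发展"))
```

With the phrase entry removed, 美发 would fall through to the character table and come out as 美發, which is exactly the shop-sign error from the first post.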

mikelove said:
With the "traditional form is sometimes" business, I wouldn't necessarily assume that that's because of sloppy research - traditional characters are used outside of Taiwan too, and the MoE doesn't get to define what is and is not a traditional character for the entire world any more than the PRC gets to force people outside of the PRC to use simplified.
The thing is, 'travel/play' was never written with 游 before the communists came to power. It's still 遊 in Japanese, for example. And the Japanese also use 週 for 'week'. TW/HK standards are actually a bit different, the HK one is closer to KX (TW standard uses a number of simplifications), but they both agree on these two characters, so I'm afraid you're bluffing here.

mikelove said:
if enough people write "美發" on shop signs then that will eventually become a valid way of writing it regardless of what any bureaucrat says.
Yeah, and if enough people buy your app, then the complaints can be ignored, right? Who cares, as long as the 愚民 eat it up. Why should we even care, huh? The more 錯別字, the merrier. What a silly remark, makes me regret that I even bothered to report this issue.

mikelove said:
I think automated conversion, however imperfect, is much better than dropping support for fanti altogether
I'd rather not have a feature than rely on something that's only sometimes correct. Not to mention that it prevents me from finding words when using a traditional IME.
 

mikelove

皇帝
Staff member
#4
Yiliya said:
The thing is, 'travel/play' was never written with 游 before the communists came to power. It's still 遊 in Japanese, for example. And the Japanese also use 週 for 'week'. TW/HK standards are actually a bit different, the HK one is closer to KX (TW standard uses a number of simplifications), but they both agree on these two characters, so I'm afraid you're bluffing here.
Then would you care to explain those 12 million Google results? (it's not auto-correcting, if you look at the results they're definitely the water radical version)

Yiliya said:
Yeah, and if enough people buy your app, then the complaints can be ignored, right? Who cares, as long as the 愚民 eat it up. Why should we even care, huh? The more 錯別字, the merrier. What a silly remark, makes me regret that I even bothered to report this issue.
I'm sorry you feel that way. But if you'd do a little research into lexicography you'd find that that's pretty much how dictionaries are made; no committee of linguists in Beijing got together and decided to decree that 车奴 was a word, somebody made it up and other people started using it and eventually it ended up in the dictionary. I do appreciate your feedback and I do intend to address this issue the next time we update these dictionaries (at least the ones that don't contain their own official fanti headwords), but insults aren't helping anybody here.

Yiliya said:
I'd rather not have a feature than rely on something that's only sometimes correct.
Well then, to turn off traditional character support go to Settings / General, set "Headword character set" to "One set only" and make sure "Traditional characters" is disabled. Hope that helps.
 
#5
mikelove said:
Then would you care to explain those 12 million Google results? (it's not auto-correcting, if you look at the results they're definitely the water radical version)
No idea. You can't even input 游戲 in any traditional IME. Also, if you narrow down the results to site:tw, then it's only 644,000, common for a misspelling.

mikelove said:
I'm sorry you feel that way. But if you'd do a little research into lexicography you'd find that that's pretty much how dictionaries are made; no committee of linguists in Beijing got together and decided to decree that 车奴 was a word, somebody made it up and other people started using it and eventually it ended up in the dictionary. I do appreciate your feedback and I do intend to address this issue the next time we update these dictionaries (at least the ones that don't contain their own official fanti headwords), but insults aren't helping anybody here.
But 美發 is not a new word, it's just a misspelling of 美发/美髮. The committees are supposed to standardize the spelling, or wee oll wud bee riting like thiz. But I guess it's time for me to 告辭, this discussion isn't going anywhere.
 

mikelove

皇帝
Staff member
#6
Yiliya said:
No idea. You can't even input 游戲 in any traditional IME. Also, if you narrow down the results to site:tw, then it's only 644,000, common for a misspelling.
Nevertheless, it's 12 million web pages on which a user with one of our dictionaries would probably like to be able to get valid results if they type in / tap on "游戲".
 
#7
mikelove said:
Yiliya said:
No idea. You can't even input 游戲 in any traditional IME. Also, if you narrow down the results to site:tw, then it's only 644,000, common for a misspelling.
Nevertheless, it's 12 million web pages on which a user with one of our dictionaries would probably like to be able to get valid results if they type in / tap on "游戲".
To be fair, it is 326 million for 遊戲, and that is the only way my traditional IME will let me type it. I can see the point that even if a lot of people use it, some words are just wrong - an English equivalent might be the word 'wierd' (31 million web pages on Google) where the correct spelling is 'weird' (417 million). Perhaps as a compromise words like "游戲" should be in a dictionary as a 'common misspelling (miswriting?)', with a link to 遊戲? For example, like this for 'supercede': http://oxforddictionaries.com/definitio ... =supercede
 

mikelove

皇帝
Staff member
#8
alex_hk90 said:
To be fair, it is 326 million for 遊戲, and that is the only way my traditional IME will let me type it. I can see the point that even if a lot of people use it, some words are just wrong - an English equivalent might be the word 'wierd' (31 million web pages on Google) where the correct spelling is 'weird' (417 million). Perhaps as a compromise words like "游戲" should be in a dictionary as a 'common misspelling (miswriting?)', with a link to 遊戲? For example, like this for 'supercede': http://oxforddictionaries.com/definitio ... =supercede
That's pretty much what we're working towards (and what we already do in a lot of cases); the tricky thing is identifying which variant is the "primary" one for use in Pinyin / full-text searches that aren't already specifying the characters they want; if someone types in 游戲 it's extremely obvious what they want, but if somebody types in "youxi" we need to know whether we should display the resulting entry as 游戲 or 遊戲. CC doesn't cover 游戲 at all, but there are a bunch of other cases where they cover more than one traditional variant for the same word and it's not always clear which one is correct (indeed there may be several depending on the region, or it may vary purely based on individual preference as in the case of 臺灣/台灣; even Taiwanese government websites don't appear to adopt a consistent standard on that).

游 specifically is a case of bad data - we're trying to come up with a more authoritative SC/TC list than the one in Unihan so we can re-check the characters that we're otherwise missing - but there are lots of other cases where the difference between correct / incorrect is much less clear; 彙/匯 for example overlap in meaning quite a bit.
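As a rough illustration of the "primary variant" problem described above, the sketch below keeps one primary traditional headword per entry plus a list of other written forms that should still resolve to it. Everything here is hypothetical (the `Entry` fields, the helper names, and the two sample entries), and which form counts as primary is an editorial choice that the code simply takes as given.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pinyin: str                      # toneless and unspaced, for simplicity
    simplified: str
    primary: str                     # traditional form displayed by default
    variants: tuple[str, ...] = ()   # other forms that should still be findable


# Sample data only. The choice of primary form for 臺灣/台灣 is arbitrary here.
ENTRIES = [
    Entry("youxi", "游戏", "遊戲", ("游戲",)),
    Entry("taiwan", "台湾", "臺灣", ("台灣",)),
]


def build_indexes(entries):
    """Index every written form and every Pinyin string to its entry."""
    by_chars, by_pinyin = {}, {}
    for e in entries:
        for form in (e.simplified, e.primary, *e.variants):
            by_chars[form] = e
        by_pinyin.setdefault(e.pinyin, []).append(e)
    return by_chars, by_pinyin


def lookup(query, by_chars, by_pinyin):
    """Return (headword to display, note) pairs for a character or Pinyin query."""
    if query in by_chars:
        e = by_chars[query]
        note = f"variant of {e.primary}" if query in e.variants else ""
        return [(e.primary, note)]
    return [(e.primary, "") for e in by_pinyin.get(query, [])]


by_chars, by_pinyin = build_indexes(ENTRIES)
print(lookup("游戲", by_chars, by_pinyin))
print(lookup("youxi", by_chars, by_pinyin))
```

A character query for a non-primary form still finds the entry and can be flagged as a variant, while a Pinyin query has no characters to go on and so falls back to whatever the data marks as primary.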
 
#9
I agree with the three simplified/traditional distinctions made by the original poster, but there is no need for the negativity. The fact that Pleco even supports traditional characters so well, for what is essentially an ever-dwindling market, is already great. I also consider myself lucky to be able to give feedback to a company that actually reads what you have to say and acts upon suggestions.
 
#10
I can give you a lot more examples, like e.g. 向往 should be 嚮往, 欲望 should be 慾望 etc. The modern Taiwanese standard also differentiates between 噁 (ě, as in 噁心) and 惡 (è, as in 邪惡), again only CC-CEDICT correctly shows all these traditional variants.

And if you're going to use the "we're not following the TW standard and don't care about what the MoE has to say" excuse, then let me ask you a question. Why are you using 裡 and 為? These are TW specific simplifications. If you wanted, you could follow either the HK standard or the elusive PRC traditional characters standard (yes, it DOES exist, this is what's used in 漢語大詞典, for example).

One more funny thing. My paper copy of the 規範詞典 includes traditional characters in brackets for single character entries. And guess what, it does correctly differentiate between different meanings of 游 and 遊. So it looks like your software is discrediting the original source. For shame.
 
#11
mikelove said:
alex_hk90 said:
To be fair, it is 326 million for 遊戲, and that is the only way my traditional IME will let me type it. I can see the point that even if a lot of people use it, some words are just wrong - an English equivalent might be the word 'wierd' (31 million web pages on Google) where the correct spelling is 'weird' (417 million). Perhaps as a compromise words like "游戲" should be in a dictionary as a 'common misspelling (miswriting?)', with a link to 遊戲? For example, like this for 'supercede': http://oxforddictionaries.com/definitio ... =supercede
That's pretty much what we're working towards (and what we already do in a lot of cases); the tricky thing is identifying which variant is the "primary" one for use in Pinyin / full-text searches that aren't already specifying the characters they want; if someone types in 游戲 it's extremely obvious what they want, but if somebody types in "youxi" we need to know whether we should display the resulting entry as 游戲 or 遊戲. CC doesn't cover 游戲 at all, but there are a bunch of other cases where they cover more than one traditional variant for the same word and it's not always clear which one is correct (indeed there may be several depending on the region, or it may vary purely based on individual preference as in the case of 臺灣/台灣; even Taiwanese government websites don't appear to adopt a consistent standard on that).

游 specifically is a case of bad data - we're trying to come up with a more authoritative SC/TC list than the one in Unihan so we can re-check the characters that we're otherwise missing - but there are lots of other cases where the difference between correct / incorrect is much less clear; 彙/匯 for example overlap in meaning quite a bit.
That sounds good. :)

Tezuk said:
I agree with the three simplified/traditional distinctions made by the original poster, but there is no need for the negativity. The fact that Pleco even supports traditional characters so well, for what is essentially an ever-dwindling market, is already great. I also consider myself lucky to be able to give feedback to a company that actually reads what you have to say and acts upon suggestions.
That's pretty much my point of view as well, though I'm not sure I agree that traditional is a dwindling market as TW and HK still use it and I haven't heard anything that suggests they're going to give over to simplified any time soon.

Following on with the theme of user feedback - if there aren't that many traditional character mistakes, would it be possible to just have a dedicated forum thread where people could report any mistakes and how to fix them? Of course there could be discussion in them if people don't agree but if there is a consensus then that might be easier and quicker than trying to find official sources (especially if they disagree anyway).

PS: On a related note, it might be useful if there was a general 'Pleco Dictionaries' sub-forum under 'Current Products', because these kinds of topics aren't platform-specific and don't really go in any of the other forum sections.
 

mikelove

皇帝
Staff member
#12
Tezuk said:
I agree with the three simplified/traditional distinctions made by the original poster, but there is no need for the negativity. The fact that Pleco even supports traditional characters so well, for what is essentially an ever-dwindling market, is already great. I also consider myself lucky to be able to give feedback to a company that actually reads what you have to say and acts upon suggestions.
Thanks!

Yiliya said:
I can give you a lot more examples, like e.g. 向往 should be 嚮往, 欲望 should be 慾望 etc. The modern Taiwanese standard also differentiates between 噁 (ě, as in 噁心) and 惡 (è, as in 邪惡), again only CC-CEDICT correctly shows all these traditional variants.
We'll do an automated check of all of the words that are in both ABC and CC to see where their traditional versions differ so that we can research them all further and address them in our non-ABC dictionaries (given that the issues mainly seem to come from ABC itself or from the ABC-based data in Unihan). I should probably send the ABC folks an email on this too.
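A check like the one described could look roughly like the sketch below: take the words present in both dictionaries and report those whose traditional forms differ, so a person can research them. The function name and the sample data are hypothetical; the sample entries only restate examples raised in this thread and are not taken from the actual ABC or CC-CEDICT files.

```python
def diff_traditional(abc, cc):
    """List (simplified, abc_forms, cc_forms) for shared words that disagree.

    Both arguments map a simplified headword to the set of traditional forms
    that the dictionary gives for it.
    """
    rows = []
    for simp in sorted(abc.keys() & cc.keys()):  # only words in both
        if abc[simp] != cc[simp]:
            rows.append((simp, sorted(abc[simp]), sorted(cc[simp])))
    return rows


# Illustrative sample data based on the examples reported in this thread.
ABC_SAMPLE = {
    "向往": {"向往"},
    "欲望": {"欲望"},
    "游泳": {"游泳"},
}
CC_SAMPLE = {
    "向往": {"嚮往"},
    "欲望": {"慾望"},
    "游泳": {"游泳"},
    "周刊": {"週刊"},  # not in the ABC sample, so it is skipped
}

for simp, abc_forms, cc_forms in diff_traditional(ABC_SAMPLE, CC_SAMPLE):
    print(simp, abc_forms, cc_forms)
```

The output is only a worklist: a mismatch says the two sources disagree, not which one is right.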

Yiliya said:
And if you're going to use the "we're not following the TW standard and don't care about what the MoE has to say" excuse, then let me ask you a question. Why are you using 裡 and 為? These are TW specific simplifications. If you wanted, you could follow either the HK standard or the elusive PRC traditional characters standard (yes, it DOES exist, this is what's used in 漢語大詞典, for example).
You're confusing one part of this discussion with another - my anti-MoE comment was in relation to 美發 and the question of whether new words / new "spellings" can be created by everyone or have to be approved by a committee. But we would like to cover TW variants correctly, which is why I'm quite earnestly trying to figure out the scope of this issue so that we can address it.

Yiliya said:
One more funny thing. My paper copy of the 規範詞典 includes traditional characters in brackets for single character entries. And guess what, it does correctly differentiate between different meanings of 游 and 遊. So it looks like your software is discrediting the original source. For shame.
That's a data file issue, actually - the data files for 規範詞典 were (quite frankly) in lousy shape when we got them and contained many many thousands of 造字 thanks to the editors' heavy use of partial characters / character components in dictionary definitions, and a couple of TC variants like 遊 were also implemented as 造字. We've been planning to address this (and a number of other issues in GF) when we get our hands on the second edition, but discussions on that are taking an extremely long time.

alex_hk90 said:
Following on with the theme of user feedback - if there aren't that many traditional character mistakes, would it be possible to just have a dedicated forum thread where people could report any mistakes and how to fix them? Of course there could be discussion in them if people don't agree but if there is a consensus then that might be easier and quicker than trying to find official sources (especially if they disagree anyway).
Interesting idea, but I'd rather wait until we've gotten our various dictionary updates out - a lot of data files have already been updated since their last Pleco versions and it's likely that many errors that people might report are no longer there (in which case there's no sense wasting time reporting them).

alex_hk90 said:
PS: On a related note, it might be useful if there was a general 'Pleco Dictionaries' sub-forum under 'Current Products', because these kinds of topics aren't platform-specific and don't really go in any of the other forum sections.
Good idea - also something to launch with the new / updated dictionaries.
 
#13
Here's what mistakes in ABC I've found so far:

The following traditional variants are never used at all in ABC:
(what's used/what should be used when appropriate)
游/遊
周/週
回/迴
欲/慾
向/嚮
團/糰
(these two should be used at all times)
凈/淨
羡/羨

Some more corrections:
惡/噁 (噁心 should at least be given as a variant of 惡心, this is the modern TC standard, the IMEs nowadays use 噁心, I suggest you include the word as 噁/惡心)
熏/薰 (利欲熏心 should be 利慾薰心, however ABC doesn't give any traditional variant at all, this is curious because it also gives 薰心 as a separate entry but ONLY in traditional)
注/註 (注音 is erroneously 註音 for some inexplicable reason)

There's bound to be more. I'm sorry, I'm not really using Pleco that much.
 
#14
Slightly off topic as always, since I hate creating new threads but I've actually seen a lot of mistakes in various dictionaries, from wrong character usage, wrong English spelling, punctuation to wrong pinyin. I don't know if I should start noting them down since you've said that updated versions are coming. These are usually from Pleco's dictionary, Tuttle, ABC and 21st century. I'm not going to bother with CC since it's user created.

Maybe I should make a list of errors from now, so that I can recheck to see if they have been corrected in the update? Or would it not make any difference. :\
 

mikelove

皇帝
Staff member
#15
Yiliya said:
Here's what mistakes in ABC I've found so far:
Thanks for the list! We're trying out some new options for SC/TC conversion software, incidentally - good ones aren't cheap but this is clearly an area where we need to do better.

scykei said:
Slightly off topic as always, since I hate creating new threads but I've actually seen a lot of mistakes in various dictionaries, from wrong character usage, wrong English spelling, punctuation to wrong pinyin. I don't know if I should start noting them down since you've said that updated versions are coming. These are usually from Pleco's dictionary, Tuttle, ABC and 21st century. I'm not going to bother with CC since it's user created.
PLC: we think we may be about to get our hands on the newer edition data for the dictionary this was based on, so the only feedback worth bothering with at the moment is Pinyin for example sentences, which we generated ourselves and hence would be applying to the (mostly similar) example sentences in the new dictionary.

Tuttle: definitely a "don't bother," we've got the new expanded edition of that and its E-C counterpart coming and will probably stop selling the original altogether once it's out (we'll offer a credit / discount for previous Tuttle Learner's buyers, assuming Apple lets us)

ABC: newer edition of that coming too, so another "don't bother."

21C: feedback much appreciated on this as we're trying very hard to improve it and we don't have any new data coming for it in the near future. We're already on top of the formatting / general ugliness (about the only dictionary that hasn't gotten a major formatting upgrade since the Palm/WM days, and it badly needs one) but bugs / mistakes / etc we can certainly use the help with.

scykei said:
Maybe I should make a list of errors from now, so that I can recheck to see if they have been corrected in the update? Or would it not make any difference. :\
If you'd like to, that would be great, but it's not really necessary with the soon-to-be-updated ones.
 

mikelove

皇帝
Staff member
#17
lechuan said:
How about building in "crowd-sourcing" correction features into Pleco?
Really hard, actually - the coding alone would be tough, but then you also need a really detailed set of editorial guidelines and staff on your end to enforce them; one of CC-CEDICT's biggest problems now is inconsistency. Giving people an easy way to submit feedback is one thing, but if we want to do this in a more organized way we're probably better off devoting those man-hours to doing it ourselves rather than to reviewing other people's submissions.
 
#20
One more: 托/託

ABC only has 託 in 拜託, and even then only as a "variant spelling". Plenty of words should have 託, e.g. 寄託, 假託, etc.