3.0 New Dictionary Bug Report / Feedback Thread

mikelove

皇帝
Staff member
Thought this deserved its own thread. Particularly interested in feedback on 汉语大词典 since that's the one we're doing the most active work on (improving jianti conversions of headwords and integrating all the lovely data from the new 2010 "supplement" volume), but also very interested in any feedback on the 成语词典 and 广州话方言词典 since those are both new categories of dictionary for us.
 

yoose

探花
Noticed some stuff in 广州话方言词典; not sure if these are bugs or if it's just because it's still in early access mode. Some entries have missing characters, and some have a * or a number in front of the entry. I attached some screenshots.
 

mikelove

皇帝
Staff member
Actually, those boxes are intentional - they're used for Cantonese slang words for which no written character exists. (at least according to the dictionary's front matter, which we really need to edit / post)
 

張龍

Member
In the main dictionary, the romanization for 佢 is incorrectly given as "keoi4" in JyutPing. It is also incorrect under the various Yale transcriptions. It should be "keoi5" with the low-rising tone. Are the Cantonese romanizations in the main dictionary (Pleco) complete?
 

mikelove

皇帝
Staff member
They should be, yes. 佢 looks like a software bug relating to the fact that that entry was originally 渠/佢 (渠 being correctly pronounced "keoi4") - it's pulling the Cantonese from the other half of that variant. But checking the original data file we actually have it listed as "keoi5" under that character, it's just not showing up that way in the app.
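A minimal sketch of the behaviour described above, assuming readings are stored per character and the merged headword is written with a slash. The `READINGS` table and `jyutping_for` are hypothetical names for illustration, not Pleco's actual data model; the two readings are the ones quoted in this thread.

```python
# Hypothetical per-character Jyutping table (values taken from the posts above).
READINGS = {"渠": "keoi4", "佢": "keoi5"}

def jyutping_for(headword, queried):
    """Return the reading for the variant the user actually looked up.

    `headword` is a merged variant headword such as "渠/佢".
    """
    variants = headword.split("/")
    if queried not in variants:
        raise KeyError(f"{queried} is not a variant of {headword}")
    # Use the queried variant's own reading rather than the first variant's.
    return READINGS[queried]
```

Called as `jyutping_for("渠/佢", "佢")`, this picks up the reading stored under 佢 instead of the one belonging to the other half of the variant pair.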
 

BanMai

秀才
-spaces in the Hanyu Da Cidian sample sentences (which cause pauses in the read-out)

-multiple entries in Hanyu Da Cidian that should be combined into one entry. For example, 支撑. It displays two entries for the word because there are different ways of writing it in traditional (but since I have my settings on simplified, it displays the same characters).

-also 亦作 or 见 should link to the word after
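For the duplicate-entry point, one possible approach is to group entries by their simplified headword at display time. This is only a sketch: `merge_by_simplified` and the tuple layout are assumptions, and the traditional spellings in the usage note are placeholders rather than real data.

```python
from collections import defaultdict

def merge_by_simplified(entries):
    """Group entries that share a simplified headword.

    `entries` is a list of (simplified, traditional, definition) tuples.
    Returns {simplified: {"traditional": [...], "definitions": [...]}},
    preserving first-seen order and skipping exact repeats.
    """
    merged = defaultdict(lambda: {"traditional": [], "definitions": []})
    for simp, trad, definition in entries:
        slot = merged[simp]
        if trad not in slot["traditional"]:
            slot["traditional"].append(trad)
        if definition not in slot["definitions"]:
            slot["definitions"].append(definition)
    return dict(merged)
```

For example, `merge_by_simplified([("支撑", "trad-A", "def 1"), ("支撑", "trad-B", "def 2")])` yields a single 支撑 entry listing both traditional forms and both definitions.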
 
I love having 漢語大詞典 in Pleco and I'm looking forward to the "new 2010 supplement"!

1. I noticed that sometimes definitions get merged with example sentences.

For instance:

后因以“九茎”指芝草。
and
参见“九芝”。

parts of the 九茎 entry should be outside of the example sentences.

2. Also, tapping on 九芝 (参见“九芝”。) shows a blank definition under HDC.
3. I think spaces in example sentences to indicate proper nouns are ok, but it may be an even better idea to remove spaces and underline those words instead.
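As a rough illustration of the space issue (not the underline filter itself), the spaces could be dropped before text is sent to the read-out. This sketch assumes the proper-noun spaces are plain ASCII spaces between two CJK characters and only covers the basic CJK block; `strip_cjk_spaces` is a made-up name.

```python
import re

# A run of spaces sitting directly between two CJK ideographs (basic block only).
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff]) +(?=[\u4e00-\u9fff])")

def strip_cjk_spaces(text):
    """Remove spaces between CJK characters; other spacing is left alone."""
    return _CJK_GAP.sub("", text)
```

Note that this throws away the proper-noun information the spaces carry, so it would only suit the audio path; the display text would still need the spaces (or underlines).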
 

Bug in Oxford Chinese Dictionary: look up any 膈* (ge2) compound in the Search field, tap on it and then tap on 膈. When you reach OCDict, you can choose between ge2 and ge4. If you select the latter (by tapping on the arrow), it only displays "[character in]" in the definition field.
 

bglasow

举人
Looking up 牛 in GZH shows what look like unescaped HTML/XML artifacts (see the attached image).

Also for this same char, I noticed in Tuttle it shows niu2 as a verb but all the examples point towards the usual noun.
 


mikelove

皇帝
Staff member
golden chyld - Thanks for all of that. #1/3 relate to the less-than-stellar condition the text arrived in - there was no tagging for the original underlines so we have to come up with a sophisticated filter to apply them correctly.

bglasow - thanks; not sure how those managed to escape conversion.
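For reference, a cleanup pass for escaped markup might look something like the sketch below. The actual artifacts in the 牛 entry are only visible in the screenshot, so the sample string in the usage note is invented, and `clean_markup` is a hypothetical helper.

```python
import html
import re

# Anything that looks like a tag once entities have been decoded.
_TAG = re.compile(r"<[^>]+>")

def clean_markup(text):
    """Decode HTML/XML entities, then drop any leftover tags."""
    return _TAG.sub("", html.unescape(text))
```

For instance, `clean_markup("&lt;b&gt;牛&lt;/b&gt; &amp; 馬")` decodes the entities and strips the resulting `<b>` tags. One caveat: text that legitimately contains angle brackets would be stripped as well, so a real filter would probably need a whitelist of known tags.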
 

mikeo

榜眼
May have been reported already, but I noticed that the early access CY dict misfires on some characters, popping up when in fact it contains no corresponding entry. This has happened to me on about 5 different occasions, with different characters. The CY dict itself is great though.
 


gato

状元
In HYDCD, quoted words that follow 见 or 参见 probably should be tappable links.

For example, under definition
- for 爱不释手, there should be a link to 爱不忍释; and
- for 四书, there should be a link to 四子书.
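A small sketch of how such cross-references could be detected, assuming they always appear as 见 or 参见 followed by a word in curly quotes (as in 参见“九芝”。 quoted earlier in the thread). `find_xrefs` is a hypothetical helper, and the second sample string in the usage note is made up to match the 爱不释手 case.

```python
import re

# 见 or 参见, then the target word inside “ ” quotes.
_XREF = re.compile(r"(参见|见)“([^”]+)”")

def find_xrefs(definition):
    """Return (marker, target) pairs for quoted cross-references."""
    return _XREF.findall(definition)
```

`find_xrefs("参见“九芝”。")` returns the marker together with 九芝, and `find_xrefs("见“爱不忍释”。")` does the same for 爱不忍释; each target could then be turned into a tappable link.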
 

mikelove

皇帝
Staff member
mikeo - hmm, that actually seems more like a software bug - thanks!

gato - quite right, thanks!
 

aristide

Member
Old stuff, but 悄悄话 reads:
  • qiǎo qiǎo huà in GR
  • qiāoqiāohuà in PLC, ABC
  • qiāoqiaohuà in CF, CC
... not sure if such little discrepancies need to be reported though.
 

mikelove

皇帝
Staff member
Hard to do much about that, yeah - huge number of minor tone variations like this and it's tough to even come up with a definitive ruling on which one is "right" (on top of which we wouldn't be allowed to apply that change to most of our licensed dictionaries anyway).
 
I don't know if it is me or the HYDCD

I work in traditional characters, but the HYDCD displays definitions in simplified characters.

Can I force it to display definitions in traditional characters?

Thanks
 

mikelove

皇帝
Staff member
That one's HYDCD - no way around it at the moment, it was originally printed with simplified definitions and traditional quotations and it's simply too big for us to make even a half-credible attempt at converting it to traditional. (we are trying to find some legal traditional data for it that we can use, though)
 
I love the format of HYDCD - it just feels right to preserve the original traditional characters for the cited examples (and they'd have lost far too much information and introduced too many choices trying to convert much of that ancient text to simplified). And I like preserving the original simplified definitions too - no potential for error in conversion (so if any conversion is implemented please leave at least an option to not mess with the original)

Question - currently, it seems all my dictionaries that were originally made in traditional (Longman, Taiwan MoE) remain that way, and the same is true for the many simplified ones. I really, really like this - in the old version of Pleco I seem to remember it always auto-converting everything to traditional or simplified depending on my setting, which sometimes seemed to introduce weirdness. Is this a new setting that was introduced, or some kind of a design change? It is so much more natural for those of us used to working in both character sets to just leave dictionaries in their original format, and I love having it at least as an option.

Also, a few bug reports, all for HYDCD:

First, question on the pinyin for compound words - is there some kind of potentially bugged process for inserting the pinyin for compounds? There seems to be a great deal of weirdness, like 凝重 shows up as ning2chong2 in HYDCD (actually, also in GHYCD, but with that one I take it there was no pinyin for compound words to begin with - which perhaps was also the case here?) - this doesn't seem to be a possible legitimate alternative pronunciation, as it is so obviously ning2zhong4. I've noticed similar things elsewhere, like with 2-character compounds ending in 得, where HYDCD tends to give it a dei3 pronunciation in circumstances that seem very strange.

In addition, occasional odd pinyin - for example, 裳 is listed as cheng2, when pretty much every other dictionary gives chang2 - is this potentially a digitization error, or just a weird choice by the editors?

Then there is this entry, where something odd seems to have happened with duplication within a single dictionary entry, listing the same thing for 1 and 2 (so it couldn't have been caused by multiple potential character writings or some such obvious problem):

恶贯满盈
1 作恶极多, 已到末日。

语本《书·泰誓上》
商 罪貫盈, 天命誅之。 ☕

孔 传
紂 之爲惡, 一以貫之, 恶貫已滿, 天畢其命。 ☕

元 无名氏 《硃砂担》第四折
你今日惡貫滿盈, 有何理説? ☕

《醒世恒言·卢太学诗酒傲王侯》
及至惡貫滿盈, 被拿到官, 情真罪當, 料無生理。 ☕

峻青 《海啸》第四章
沈百万 这老狗, 恶贯满盈了。 ☕
2 作恶极多, 已到末日。

语本《书·泰誓上》
商 罪貫盈, 天命誅之。 ☕

孔 传
紂 之爲惡, 一以貫之, 恶貫已滿, 天畢其命。 ☕

元 无名氏 《硃砂担》第四折
你今日惡貫滿盈, 有何理説? ☕

《醒世恒言·卢太学诗酒傲王侯》
及至惡貫滿盈, 被拿到官, 情真罪當, 料無生理。 ☕

峻青 《海啸》第四章
沈百万 这老狗, 恶贯满盈了
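A duplicated sense like the one above could in principle be caught with a simple pass over each entry. This is only a sketch assuming each sense is available as one string; `dedupe_senses` is a hypothetical name, and the normalisation (ignoring whitespace and a trailing full stop) is a guess at what would be needed.

```python
def dedupe_senses(senses):
    """Drop senses whose text repeats an earlier sense, keeping order."""
    seen = set()
    unique = []
    for sense in senses:
        # Ignore whitespace differences and a trailing full stop when comparing.
        key = "".join(sense.split()).rstrip("。")
        if key not in seen:
            seen.add(key)
            unique.append(sense)
    return unique
```

Passing the same sense text twice returns it once, while genuinely different senses are kept in their original order.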
 

mikelove

皇帝
Staff member
Thanks! There isn't an explicit change at work here in character sets, we're just slowly pushing the limits of what people will let us get away with - frankly I regret giving in to demands that we offer a traditional-converted version of GHYDCD (which was originally simplified-only), and in the future I'd prefer to avoid character set conversion of monolingual dictionaries whenever possible. We think there may be an official traditional conversion of HYDCD floating around somewhere that we could use, though - a few dictionaries (ABC and the big Oxford chief among them) have official simplified and traditional versions so we offer merged versions of those when we can. We may eventually add a "preserve original character set" option for specific dictionaries in Manage Dicts, we just haven't quite settled on an official policy for this stuff yet.

The HYDCD Pinyin is indeed a bit buggy - we actually hired some people in China to do that conversion by hand (couldn't get the rights to the data from the official Alphabetical Index - the copyright to it is actually held by HYDCD's publisher, but they didn't have an electronic version available, and the only party that did wanted an extortionate amount of money to release it to us) but they didn't quite do as perfect a job as we'd like in a few places. Pinyin for the first character in any given compound is very reliable, since that at least we get from the original dictionary (which differentiates between character senses in compounds), it's only the subsequent characters that give us trouble.

裳 has that same pronunciation in the HYDCD CD-ROM, which is where we obtained most of our data (wasn't worth re-digitizing the darn thing from scratch again) - have to double-check whether the print version agrees on that too. The 恶贯满盈 issue likewise seems to be from the CD-ROM.

The annoying thing about giant dictionaries like HYDCD and Ricci (which has a fair number of errors like this in its data too) is that even though they're dictionaries that are especially sensitive to mistakes / have a particular need to be converted accurately, they're also so large and so specialized that it's tough to do so without losing lots of money in the process.
 