The MoE dictionary is now open source

Yiliya

榜眼
The thing is, the dictionary isn't supposed to be used with a PRC font; many characters will look wrong. It's not only the punctuation (e.g. MoE also uses ○ instead of 〇, in addition to the already mentioned .) but the actual hanzi as well. Many traditional characters are wrong in PRC fonts: one of the biggest offenders is 誤 (the right part should be 吳, not 吴!), but 呈, 丰, etc. are affected too.

Thankfully, Pleco supports custom fonts (at least on Android, not sure about iOS), so all you have to do is copy a TW/HK font to your phone. If you have Windows 7, then Microsoft JhengHei Bold (msjhbd.ttf) is an excellent choice; it looks great on a smartphone.

Anyway, I would strongly advise against fiddling with the dictionary's punctuation/content in any way. If somebody wants to replace X character with Y, they can always do it privately with a search and replace command in their favorite text editor. We can't possibly accommodate everyone's tastes.
 

sandwich

举人
Yiliya said:
The thing is, the dictionary isn't supposed to be used with a PRC font; many characters will look wrong. It's not only the punctuation (e.g. MoE also uses ○ instead of 〇, in addition to the already mentioned .) but the actual hanzi as well. Many traditional characters are wrong in PRC fonts: one of the biggest offenders is 誤 (the right part should be 吳, not 吴!), but 呈, 丰, etc. are affected too.

Thankfully, Pleco supports custom fonts (at least on Android, not sure about iOS), so all you have to do is copy a TW/HK font to your phone. If you have Windows 7, then Microsoft JhengHei Bold (msjhbd.ttf) is an excellent choice; it looks great on a smartphone.
iOS handles selecting fonts for Asian languages in a _bad_ way; I couldn't see the problem you were talking about until I switched the system language to 简体中文. If you check your device, 简体中文 is probably above 繁體中文. Selecting 繁體中文 and then changing back to your preferred language should resolve the issue in 萌典. Pleco uses its own font, so 誤 appears correctly regardless.

The thing with "." is that it's not actually a font issue; the problem is that the character "." is FULLWIDTH FULL STOP, which is intended for a different purpose. There are no rules about where in the character box the full stop goes, so the fonts on both iOS and Android are correct. The online MoE dictionary also uses a different character that isn't appropriate for the job: "˙", which in Big5 is the bopomofo 5th tone.

Edit: I should add that the reason I am advocating for [the dot to appear in the middle, like the character] • to be used instead is that I am under the impression that this kind of character is pretty standard to use as a separator, whereas something that looks like a full stop is not. (It could even be "/" instead, and that would be fine.) That said, it's a minor issue and it's not like Pleco [on iOS] doesn't have other readability issues atm anyway.
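
To make it easier to talk about which dot is which, here is a minimal Python sketch that prints the code point and Unicode name of the separator candidates that come up in this thread. The characters are written as escapes so nothing depends on the font. That the MoE separator is U+FF0E and the online dictionary's dot is U+02D9 is my reading of the posts, not something checked against the data.

```python
import unicodedata

# Separator candidates discussed in the thread, as escapes so the exact
# code points are unambiguous (which ones MoE really uses is an assumption).
CANDIDATES = ["\uff0e", "\u2022", "\u30fb", "\u00b7", "\u02d9"]


def describe(ch):
    """Return (code point, Unicode name, East Asian width class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in CANDIDATES:
    print(ch, *describe(ch))
```

The name is a property of the character, so it stays the same whichever font draws it; where the dot sits inside the box is up to the font.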
 

Yiliya

榜眼
Here's a random quote from MoE with a correct font (MingLiu):
ZSdyDwq.png

And now with a PRC one (SimSun):
sHwv2n5.png

Note the butchered punctuation (all of it, not just the ".", is off center).

And no, you can't just randomly stick "/" into Chinese text.
 

sandwich

举人
Yiliya said:
Here's a random quote from MoE with a correct font (MingLiu):
ZSdyDwq.png

And now with a PRC one (SimSun):
sHwv2n5.png

Note the butchered punctuation (all of it, not just the ".", is off center).
I don't see your point; both MingLiu and SimSun are correct fonts. Just because the MingLiu font happens to put the . (FULLWIDTH FULL STOP) in the middle won't stop the Unicode character from being a full stop. If http://en.wikipedia.org/wiki/Chinese_punctuation is to be believed, then · (MIDDLE DOT) or one of its variations should be the actual character used.
 

Yiliya

榜眼
All TW fonts are like this, not just MingLiu.

Anyway, my point is that we shouldn't touch the dictionary's content unless it's absolutely necessary. A minor aesthetic issue when using a non-standard font is hardly such a case. One can simply do a search and replace if it bothers them so much.
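
For anyone who does want to do that private search and replace, a minimal sketch in Python might look like the following. The function name, the sample string and the choice of MIDDLE DOT as the replacement are all just illustrative assumptions; swap in whatever character you prefer.

```python
FULLWIDTH_FULL_STOP = "\uff0e"  # the separator as it appears in the data (assumed)
MIDDLE_DOT = "\u00b7"           # one possible replacement; purely a matter of taste


def swap_separator(text, old=FULLWIDTH_FULL_STOP, new=MIDDLE_DOT):
    """Return text with every occurrence of `old` replaced by `new`."""
    return text.replace(old, new)


# Made-up sample string, not an actual dictionary entry.
sample = "甲\uff0e乙"
print(swap_separator(sample))
```

Run it over your own copy of the exported text before importing, and the shared files stay untouched.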
 

joseph

Member
I tried to download the file for this dictionary and use it on Android Pleco but couldn't get it to work. How do I use it?
 

mikelove

皇帝
Staff member
sandwich said:
Pleco crashed last night while trying to import, so hard to play around with it yet. But gonna be awesome, thanks for working on this.

Yeah, imports of very long user dictionary entries (as some of these are) are a bit buggy on iOS at the moment, thanks to some rather dicey platform-specific threading issues - it seems like the vast majority of the people interested in doing complicated things with user dictionaries are on Android now, so there hasn't been too much pressure to fix this, but it should be corrected in the Great Big Update anyway. And in the meantime the PQB database that alex_hk90 is generating should work just fine on iOS too.

Yiliya said:
Thankfully, Pleco supports custom fonts (at least on Android, not sure about iOS), so all you have to do is copy a TW/HK font to your phone.

We don't support custom fonts on iOS at the moment, but iOS actually includes both simplified- and traditional-styled variants of its built-in STHeiti font (with the attendant character / punctuation / etc changes), and we use whichever one is appropriate for your current character set. (the new font in our big update also comes in both variants)

sandwich said:
That said, it's a minor issue and it's not like Pleco [on iOS] doesn't have other readability issues atm anyway.

Could you be a little more specific? We certainly know we have issues in this area, hence the new type design, but if there's anything specific that we're doing incorrectly (particularly with regard to monolingual dictionaries) we'd love to make sure that the new design has actually fixed it / tweak it if not.

Interestingly, the new design uses half-width Chinese punctuation rather than full-width (with densely-packed text on a smartphone screen the consensus was that this would work better), which makes the discussion of proper location of punctuation marks somewhat moot - the fonts will still support full-width characters in user dictionaries / documents / etc that use them, and if there's a lot of griping about this we might even add an option to switch back to full-width in our own dictionaries, but I believe the official Pleco type spec going forward is going to be half-width everywhere.

joseph said:
I tried to download the file for this dictionary and use it on Android Pleco but couldn't get it to work. How do I use it?

It's a user dictionary, so you'd go into Settings / Manage Dictionaries, tap "Add User," "Add Existing," and select the .pqb file you downloaded. User dictionary support is a paid feature, though (part of the flashcard add-on), so this option won't be available if you haven't bought it.

(we'll be happy to release an official version as a free add-on if MoE lets us, but we haven't secured the necessary permission yet - since this user dictionary version is being created independently of Pleco, it doesn't seem to run afoul of Taiwanese fair use law like a Pleco-created version would)
 

alex_hk90

状元
audreyt said:
Hi, mea culpa, that was an oversight in db2unicode.pl -- I carried the change from json2unicode.pl but forgot that SQLite uses single quotes, not double quotes, in its dump format.

Fixed in https://github.com/g0v/moedict-epub/com ... af4476762b (with due credit to both of you) and the code in https://github.com/g0v/moedict-epub/blo ... unicode.pl should no longer produce self-looping variants.

Cheers,
Audrey
Thanks Audrey. I'll use the updated script from now on. :)
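
For reference, the quoting behaviour Audrey mentions can be seen with a throwaway in-memory database. This is only a sketch using Python's sqlite3 module (the table name and value are made up), not anything taken from db2unicode.pl:

```python
import sqlite3

# Throwaway in-memory database with one text value containing an apostrophe.
conn = sqlite3.connect(":memory:")
conn.execute("create table entries (title text)")
conn.execute("insert into entries values (?)", ("it's",))
conn.commit()

# iterdump() yields the same kind of SQL text that a dump file contains.
dump_lines = list(conn.iterdump())
insert_line = next(line for line in dump_lines if line.startswith("INSERT"))
print(insert_line)
```

The string value comes out wrapped in single quotes, with the embedded apostrophe doubled, so a script that looks for double-quoted strings (as in JSON) will not match it.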

sandwich said:
Pleco crashed last night while trying to import, so hard to play around with it yet. But gonna be awesome, thanks for working on this.

Only comment is that the readability is not as good as 萌典, which I think is mostly a Pleco issue, but I had a couple of thoughts:
- the separator character "." (FULLWIDTH FULL STOP, U+FF0E) might work on Android, but it looks wrong with Apple fonts. "•" (BULLET, U+2022) is available on the iOS keyboard, or maybe "・" (KATAKANA MIDDLE DOT) is a safer monospaced bet. (NB: 萌典 also has this issue, so it's probably upstream.)
- the ◆ looks OK when there is a <名/助/動/形/...>, but otherwise breaks the format. (E.g. in 出來 it would read better if the numbers were in the same column.)
You're welcome, and thanks for trying it out. :) On the second point, it shouldn't be too difficult to change the formatting to only have the diamond/bullet when there are multiple parts of speech. This should work for 出來 and also more generally because it's currently set up to make a new line after the part of speech <名/助/動/形/...> where there are multiple definitions for that part.
 

sandwich

举人
mikelove said:
sandwich said:
That said, it's a minor issue and it's not like Pleco [on iOS] doesn't have other readability issues atm anyway.

Could you be a little more specific? We certainly know we have issues in this area, hence the new type design, but if there's anything specific that we're doing incorrectly (particularly with regard to monolingual dictionaries) we'd love to make sure that the new design has actually fixed it / tweak it if not.
Argh, sorry Mike. I guess I would have been better off saying, "That said, it's a minor issue and there's probably not much point worrying about readability issues too much until the new type design gets here anyway." So as not to go any more OT, I'll add anything I can think of to that thread (if I can think of anything, that is).
 
goldyn chyld

萌典 for iOS just got updated and now supports simplified input as well. Not sure how it would behave in Pleco, though.
 

mikelove

皇帝
Staff member
Re: goldyn chyld

goldyn chyld said:
萌典 for iOS just got updated and now supports simplified input as well. Not sure how it would behave in Pleco, though.

Presumably that means they've added simplified versions of headwords, which would take away a step in adding them to this custom database (which we recommend, since it'll keep the merged multi-dictionary search system from acting up).
 

alex_hk90

状元
I've got round to making some of the previously mentioned changes to the conversion:
- Used the updated db2unicode.pl script (with the fix for the self-looping characters) - thanks to audreyt for this.
- Changed the formatting to only show the diamond bullet characters if the definition has more than one part of speech (see for example 出來) - thanks to sandwich for bringing up this point.
- Moved the bracketed Western names from the Hanzi field to the definition field.
- Replaced the Chinese brackets (which didn't show up in the Pinyin field) with Western brackets (which do).

SQL (with comments) to convert from source:
Code:
-- Create new table (combined) with relevant columns for Pleco user dictionary.
 
create table combined as
select definitions.id, definitions.idx, entries.title, heteronyms.pinyin,
definitions.type, definitions.def, definitions.example, definitions.quote,
definitions.synonyms, definitions.antonyms, definitions.link
from definitions, heteronyms, entries
where definitions.heteronym_id = heteronyms.id and heteronyms.entry_id = entries.id;
 
-- Combine definition fields into single field.
 
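-- Process (Western) comma separators (replace with Pleco new line or Chinese list separators).
-- NB: the '' literals here, and the '' separators in the newdef / group_concat steps below,
-- are where the Pleco new-line character is meant to go (assumed to be the private-use
-- character U+EAB1); as shown they are empty strings.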
update combined set example = replace(example, ',', '');
update combined set quote = replace(quote, ',', '');
update combined set synonyms = replace(synonyms, ',', '、');
update combined set antonyms = replace(antonyms, ',', '、');
update combined set link = replace(link, ',', '');
 
/* Replace null entries with empty strings '' for combining/counting later
(otherwise this can cause problems with null results,
especially when using type for counting the number of definitions),
and at the same time add labels for synonyms and antonyms:*/
update combined set
type = case when coalesce(type, '') = '' then '' else type end;
update combined set
def = case when coalesce(def, '') = '' then '' else def end;
update combined set
example = case when coalesce(example, '') = '' then '' else example end;
update combined set
quote = case when coalesce(quote, '') = '' then '' else quote end;
update combined set
synonyms = case when coalesce(synonyms, '') = '' then '' else '似:'||synonyms end;
update combined set
antonyms = case when coalesce(antonyms, '') = '' then '' else '反:'||antonyms end;
update combined set
link = case when coalesce(link, '') = '' then '' else link end;
 
-- Now combine into single field newdef:
alter table combined add column newdef;
update combined set newdef = '';
update combined set newdef = case when coalesce(def, '') = ''
then newdef else def end;
update combined set newdef = (case when coalesce(example, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then example else newdef||''||example end) end);
update combined set newdef = (case when coalesce(quote, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then quote else newdef||''||quote end) end);
update combined set newdef = (case when coalesce(synonyms, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then synonyms else newdef||''||synonyms end) end);
update combined set newdef = (case when coalesce(antonyms, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then antonyms else newdef||''||antonyms end) end);
update combined set newdef = (case when coalesce(link, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then link else newdef||''||link end) end);
 
/* Add numbering to combined definitions:
order by idx then count up to current id.*/
 
create table combined2 as
select id, title, pinyin, idx, type, newdef,
(select count(*) from combined c2 where c1.title=c2.title and c1.pinyin=c2.pinyin and c1.type=c2.type and c2.id<=c1.id) as defid,
(select count(*) from combined c2 where c1.title=c2.title and c1.pinyin=c2.pinyin and c1.type=c2.type) as defcount
from combined c1;
 
-- Add numbering (only if more than one definition in unique title, pinyin, type):
update combined2 set newdef =
(case when defcount<=1 then newdef
else '(@'||defid||'@) '||newdef end);
 
-- Replace with circled numbers:
update combined2 set newdef = replace(newdef, '(@1@)','①');
update combined2 set newdef = replace(newdef, '(@2@)','②');
update combined2 set newdef = replace(newdef, '(@3@)','③');
update combined2 set newdef = replace(newdef, '(@4@)','④');
update combined2 set newdef = replace(newdef, '(@5@)','⑤');
update combined2 set newdef = replace(newdef, '(@6@)','⑥');
update combined2 set newdef = replace(newdef, '(@7@)','⑦');
update combined2 set newdef = replace(newdef, '(@8@)','⑧');
update combined2 set newdef = replace(newdef, '(@9@)','⑨');
update combined2 set newdef = replace(newdef, '(@10@)','⑩');
update combined2 set newdef = replace(newdef, '(@11@)','⑪');
update combined2 set newdef = replace(newdef, '(@12@)','⑫');
update combined2 set newdef = replace(newdef, '(@13@)','⑬');
update combined2 set newdef = replace(newdef, '(@14@)','⑭');
update combined2 set newdef = replace(newdef, '(@15@)','⑮');
update combined2 set newdef = replace(newdef, '(@16@)','⑯');
update combined2 set newdef = replace(newdef, '(@17@)','⑰');
update combined2 set newdef = replace(newdef, '(@18@)','⑱');
update combined2 set newdef = replace(newdef, '(@19@)','⑲');
update combined2 set newdef = replace(newdef, '(@20@)','⑳');
-- Note: max(defcount) suggests only need to go up to 19.
 
-- Group by part of speech (title, pinyin, type):
create table combined3 as
select title, pinyin, idx, type, defcount,
group_concat(newdef, '') as newdef3
from (select title, pinyin, idx, type, defcount, newdef
from combined2
order by title, pinyin, defid)
group by title, pinyin, type;
 
-- Add part of speech (type) count (for determining use of bullet characters):
create table combined3a as
select title, pinyin, idx, type, defcount, newdef3,
(select count(*) from combined3 c2 where c1.title=c2.title and c1.pinyin=c2.pinyin) as partcount
from combined3 c1;
 
-- Add bullet character if more than one part of speech (type):
alter table combined3a add column bullet;
update combined3a set bullet =
(case when partcount>1
then '◆ '
else '' end);
 
-- Add part of speech (type), bullets and new lines as appropriate:
update combined3a set newdef3 =
(case when coalesce(type, '') = ''
then bullet||newdef3 else
(case when defcount>1
then bullet||'<'||type||'> '||newdef3
else bullet||'<'||type||'> '||newdef3 end) end);
 
-- Group by Hanzi/Pinyin (title, pinyin):
create table combined4 as
select title, pinyin, group_concat(newdef3, '') as newdef4
from (select title, pinyin, idx, type, newdef3
from combined3a
order by title, pinyin, idx)
group by title, pinyin;
 
-- Replace Chinese brackets with Western brackets in pinyin field (to show in Pleco):
update combined4 set pinyin=
replace(pinyin, '(', ' (');
update combined4 set pinyin=
replace(pinyin, ')', ') ');
 
-- Move bracketed Western names from title (Hanzi) field to start of definition field:
update combined4 set
newdef4=substr(title,instr(title,'('))||' '||newdef4,
title=substr(title,1,instr(title,'(')-1)
where title like '%)';
 
-- Output counts for consistency checking:
select count(*) from combined2; -- (unique title, pinyin, type, def) = 213486;
select count(*) from combined3; -- (unique title, pinyin, type) = 171378;
select count(*) from combined4; -- (unique title, pinyin) = 165810.
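
The numbering step is probably the least obvious part of the SQL above: defid is a running count within each (title, pinyin, type) group, and defcount is the size of that group. A small sketch with Python's sqlite3 module and made-up rows (the table is cut down to the columns the subqueries need) shows what the two correlated subqueries return. The label() helper is my own shorthand for the '(@n@)' placeholder and replace steps, relying on ① being U+2460.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up rows: two definitions under one part of speech, one under another.
conn.executescript("""
create table combined (id integer, title text, pinyin text, type text, newdef text);
insert into combined values (1, '出來', 'chū lái', '動', 'first sense');
insert into combined values (2, '出來', 'chū lái', '動', 'second sense');
insert into combined values (3, '出來', 'chū lái', '助', 'only sense');
""")

# Same correlated subqueries as in the conversion SQL.
QUERY = """
select id, type,
  (select count(*) from combined c2
    where c1.title = c2.title and c1.pinyin = c2.pinyin
      and c1.type = c2.type and c2.id <= c1.id) as defid,
  (select count(*) from combined c2
    where c1.title = c2.title and c1.pinyin = c2.pinyin
      and c1.type = c2.type) as defcount,
  newdef
from combined c1
order by id
"""
rows = conn.execute(QUERY).fetchall()


def label(defid, defcount, newdef):
    """Prefix a circled number (valid for 1..20) only when the group has several definitions."""
    if defcount <= 1:
        return newdef
    return chr(0x2460 + defid - 1) + " " + newdef


for _id, pos, defid, defcount, newdef in rows:
    print(pos, label(defid, defcount, newdef))
```

Because the count compares ids, the numbers follow id order within each group, which is why the conversion sorts by defid before grouping.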

Pleco flashcards:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Pleco user dictionary:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
(Again it imported 164936 entries, missing 874 from the flashcards - this seems consistent with the previous import.)
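
On the step that moves bracketed Western names out of the Hanzi field: it relies on SQLite evaluating every expression in an UPDATE against the row as it was before the update, so newdef4 still sees the original title even though title is rewritten in the same statement. A minimal sketch with Python's sqlite3 module (the rows are invented, and the brackets are ASCII as in the SQL shown above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample rows; only the first has a bracketed Western name in its title.
conn.executescript("""
create table combined4 (title text, pinyin text, newdef4 text);
insert into combined4 values ('阿基米德(Archimedes)', 'ā jī mǐ dé', 'definition text');
insert into combined4 values ('出來', 'chū lái', 'other definition');
""")

# Same statement as in the conversion: both assignments read the old title.
conn.execute("""
update combined4 set
  newdef4 = substr(title, instr(title, '(')) || ' ' || newdef4,
  title   = substr(title, 1, instr(title, '(') - 1)
where title like '%)'
""")

moved = conn.execute(
    "select title, newdef4 from combined4 order by rowid"
).fetchall()
for title, newdef4 in moved:
    print(title, "|", newdef4)
```

Titles that do not end in a closing bracket are left alone by the where clause.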

audreyt said:
We (well, I) have added a very simplistic char-to-char mapping:

https://github.com/audreyt/moedict-webk ... 177#L0R504

It's just Simp-Trad pairs, two characters as a unit, so headword processing can use the "b2g" function defined in lines 505~516 to obtain simplified versions of the title entries.
How would this handle the cases mentioned by Yiliya in this post: viewtopic.php?f=20&t=3606&start=15#p29336?
 

mikelove

皇帝
Staff member
Didn't even see Yiliya's SCTC post, sorry:

Yiliya said:
mikelove,
Yes, simplified headwords would be very handy; however, the Trad -> Simp conversion is not as trivial as it seems. Aside from 乾 (don't simplify when it's pronounced qián), there are a few other tricky characters. Off the top of my head:
著 (don't simplify when it's pronounced zhù), a very common character that causes most of the confusion in Trad -> Simp conversions
徵 (don't simplify when it's pronounced zhǐ), this is a rare usage, but still, MoE has it
於 (don't simplify when it's pronounced wū), also rare

Also, the 幺/么/麼/麽 confusion. Basically, Trad 么 = Simp 幺 (yāo), Trad 麼 = Simp 么 (me) OR 麽 (mó). This way, 么麼 (yāomó) gets simplified to 幺麽.

Another thing to consider is that the MoE dictionary uses a number of archaic traditional characters throughout the whole dictionary, case in point - 祕 (instead of the nowadays commonly accepted 秘).

FWIW, here's a longer list of TC->multiple SC mappings, including a couple of Extension B/C ones which might be considered more "variants":

么 么,幺
乾 乾,干
份 份,分
俱 具,俱
卻 卻,却
夥 伙,夥
幺 么,幺
彷 彷,仿
徵 徵,征
摺 摺,折
擣 捣,U+22B4F
於 于,於
沈 沈,沉
瀰 弥,㳽
甚 什,甚
畫 画,划
瞭 瞭,了
矓 眬,胧
綵 彩,䌽
著 著,着
藉 藉,借
覆 复,覆
託 托,讬
諮 咨,谘
逕 径,迳
鉅 巨,钜
鉋 刨,铇
鍾 钟,锺
阪 坂,阪
願 愿,U+2B5B8
颺 扬,飏
餘 馀,余
餸 餸,U+2980c
騃 呆,U+2B624
鬹 鬶,鬹
鹼 碱,硷
麼 麽,么

(adapted from a combination of Unihan, Wenlin, and Sayjack's tables)

But aside from 著 (don't know how I forgot to mention that one) and 幺/么/麼/麽 these are mostly quite uncommon, and given that the main goal of adding SC here is to have the SC in multi-character headwords be the same as SC supplied by other dictionaries, it's unlikely that there will be very many multi-character words that both a) map a character in an unusual way and b) are also listed in another dictionary. (in any event, this is certainly a heck of a lot easier than mapping them in the other direction)
 

Yiliya

榜眼
alex_hk90, audreyt,

The latest conversion still has some self-looping characters. While 呈 stopped showing up as a variant of itself, 華 and 夢 still do.
 

Yiliya

榜眼
Forgot to comment on this:

mikelove said:
Interestingly, the new design uses half-width Chinese punctuation rather than full-width (with densely-packed text on a smartphone screen the consensus was that this would work better), which makes the discussion of proper location of punctuation marks somewhat moot - the fonts will still support full-width characters in user dictionaries / documents / etc that use them, and if there's a lot of griping about this we might even add an option to switch back to full-width in our own dictionaries, but I believe the official Pleco type spec going forward is going to be half-width everywhere.
Wouldn't it be smarter to simply do this conversion on the fly and make it an option that could be easily turned off? Everyone wins this way.
 

mikelove

皇帝
Staff member
Yiliya said:
Wouldn't it be smarter to simply do this conversion on the fly and make it an option that could be easily turned off? Everyone wins this way.

It's an option either way, but an option to replace half-width punctuation marks gives us the ability to always use full-width punctuation in some dictionaries if we find it works better in them, or if a nosy publisher insists that we use full-width in their particular dictionary.
 

alex_hk90

状元
mikelove said:
FWIW, here's a longer list of TC->multiple SC mappings, including a couple of Extension B/C ones which might be considered more "variants":

么 么,幺
乾 乾,干
份 份,分
俱 具,俱
卻 卻,却
夥 伙,夥
幺 么,幺
彷 彷,仿
徵 徵,征
摺 摺,折
擣 捣,U+22B4F
於 于,於
沈 沈,沉
瀰 弥,㳽
甚 什,甚
畫 画,划
瞭 瞭,了
矓 眬,胧
綵 彩,䌽
著 著,着
藉 藉,借
覆 复,覆
託 托,讬
諮 咨,谘
逕 径,迳
鉅 巨,钜
鉋 刨,铇
鍾 钟,锺
阪 坂,阪
願 愿,U+2B5B8
颺 扬,飏
餘 馀,余
餸 餸,U+2980c
騃 呆,U+2B624
鬹 鬶,鬹
鹼 碱,硷
麼 麽,么

(adapted from a combination of Unihan, Wenlin, and Sayjack's tables)

But aside from 著 (don't know how I forgot to mention that one) and 幺/么/麼/麽 these are mostly quite uncommon, and given that the main goal of adding SC here is to have the SC in multi-character headwords be the same as SC supplied by other dictionaries, it's unlikely that there will be very many multi-character words that both a) map a character in an unusual way and b) are also listed in another dictionary. (in any event, this is certainly a heck of a lot easier than mapping them in the other direction)
Thanks for the information.

While the main point isn't to be 100% accurate, it would still be nice to deal with the common exceptions (the Sayjack link you have posted mentions that there are at least 19 exceptions, so it would be good to handle these at least). The MoEDict data has the associated Pinyin as well, so that should help the conversion. It would be good to have a script for (relatively) accurately converting traditional/pinyin pairs to simplified. In fact, I would expect one already exists - I'll have a quick search online when I get the time. :)
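
As a rough idea of what such a script could look like, here is a minimal Python sketch. It is not an existing tool: the base mapping is a tiny illustrative subset, the exception list is just the four reading-dependent cases Yiliya gave, and it assumes the pinyin is space-separated with one syllable per character.

```python
import unicodedata


def nfc(s):
    """Normalise so precomposed and combining tone marks compare equal."""
    return unicodedata.normalize("NFC", s)


# Tiny illustrative subset of a Trad -> Simp table; a real one is far larger.
BASE_MAP = {"乾": "干", "著": "着", "徵": "征", "於": "于", "淨": "净"}

# (character, reading) pairs to leave unsimplified, taken from the thread.
KEEP_AS_IS = {
    (ch, nfc(py))
    for ch, py in [("乾", "qián"), ("著", "zhù"), ("徵", "zhǐ"), ("於", "wū")]
}


def simplify(title, pinyin):
    """Map a traditional headword to simplified, using readings to skip exceptions."""
    syllables = nfc(pinyin).lower().split()
    if len(syllables) != len(title):
        # Readings can't be aligned with characters: fall back to the plain mapping.
        return "".join(BASE_MAP.get(ch, ch) for ch in title)
    out = []
    for ch, syl in zip(title, syllables):
        if (ch, syl) in KEEP_AS_IS:
            out.append(ch)
        else:
            out.append(BASE_MAP.get(ch, ch))
    return "".join(out)


for title, pinyin in [("乾坤", "qián kūn"), ("乾淨", "gān jìng"),
                      ("著名", "zhù míng"), ("看著", "kàn zhe")]:
    print(title, "->", simplify(title, pinyin))
```

Cases like 幺/么/麼/麽 would need their own entries keyed the same way, and anything whose pinyin doesn't line up with the characters just gets the plain mapping.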
 
alex_hk90 said:
Pleco flashcards:
http://www.mediafire.com/?8841ku2or145wpq
Pleco user dictionary:
http://www.mediafire.com/?1jt67g3l0d2xa8r
(Again it imported 164936 entries, missing 874 from the flashcards - this seems consistent with the previous import.)

Any way you could host these on a friendlier service, such as Dropbox? You can easily create links to files on Dropbox, for example.

I've tried 9 times already, and it just cycles me back to different ads each time on the same page.
 