The MoE dictionary is now open source

goldyn chyld · Apr 5, 2013

alex_hk90, thanks for all your work! It's really appreciated. I just loaded your MoE file into my Pleco on iOS and it seems to recognize traditional input only? For example, if I enter 话, it won't recognize it...

alex_hk90 · Apr 5, 2013

stephanhodges said:
Any way you could host these on a friendlier service, such as Dropbox? You can easily create links to files on Dropbox, for example.

I've tried 9 times already, and it just cycles me back to different ads each time on the same page.

I've never had any problems with MediaFire (maybe because I'm a member?), but sure I can upload the next version to Dropbox as well.

I noticed a formatting improvement that can be made for cases such as the third part in https://www.moedict.tw/#%E6%BC%A2 so I'll try to sort that out in the next version.

goldyn chyld said:
alex_hk90, thanks for all your work! It's really appreciated. I just loaded your MoE file into my Pleco on iOS and it seems to recognize traditional input only? For example, if I enter 话, it won't recognize it...

You're welcome.

And that is correct - at present it only recognises traditional characters because the dictionary only has traditional headwords. (I think this is similar to the LMA traditional-only dictionary in Pleco?) Hence the discussion above regarding traditional-simplified conversions - all that needs to be done to let it recognise simplified input is to run a script on the traditional/pinyin pairs and generate the simplified headwords; the problem is the exceptions with multiple traditional-simplified mappings.

stephanhodges · Apr 5, 2013

Thanks for the quick reply. I'm in a location (Northern India right now) with very bad AND slow internet, which I'm sure adds to the problem. I'm not a member of MediaFire either.

I was able to download the text file, which is UTF-8 without a BOM character, in case anyone else needs to know. Compressing the file would reduce it to about half the current size with ZIP compression, and about 60+% with 7z compression. Compression would also mean faster download time.
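For anyone who wants to check the BOM and the compression figures on their own copy, here is a minimal Python sketch. It uses only the standard library: `zlib` stands in for ZIP's DEFLATE and `lzma` for the LZMA family used by 7z, so the numbers are only indicative and will not match the archivers exactly. The sample text is made up for the demonstration, and the function names are illustrative.

```python
import codecs
import lzma
import zlib


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def compressed_sizes(data: bytes) -> dict:
    """Return raw and compressed sizes in bytes for a rough comparison."""
    return {
        "raw": len(data),
        "deflate": len(zlib.compress(data, 9)),  # roughly what ZIP does
        "lzma": len(lzma.compress(data)),        # roughly what 7z does
    }


if __name__ == "__main__":
    # Made-up sample; replace with the bytes of the real dictionary file.
    sample = ("周\t「周」的異體字。\n" * 1000).encode("utf-8")
    print("BOM present:", has_utf8_bom(sample))
    for name, size in compressed_sizes(sample).items():
        print(name, size)
```

Highly repetitive sample data like this compresses far better than a real dictionary would, so run it on the actual file before drawing conclusions.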

goldyn chyld · Apr 5, 2013

alex_hk90 said:
goldyn chyld said:

alex_hk90, thanks for all your work! It's really appreciated. I just loaded your MoE file into my Pleco on iOS and it seems to recognize traditional input only? For example, if I enter 话, it won't recognize it...

You're welcome. And that is correct - at present it only recognises traditional characters because the dictionary only has traditional headwords. (I think this is similar to the LMA traditional-only dictionary in Pleco?) Hence the discussion above regarding traditional-simplified conversions - all that needs to be done to let it recognise simplified input is to run a script on the traditional/pinyin pairs and generate the simplified headwords; the problem is the exceptions with multiple traditional-simplified mappings.

Hm, but have you tried using 萌典's latest code which supports simplified input (see page 4)?

P.S.: the Mediafire dl worked just fine for me (and I'm not registered)... (Indian Internet connection is notoriously crappy though; been there before, so I'm pretty sure that's the culprit here.)

alex_hk90 · Apr 5, 2013

stephanhodges said:
Thanks for the quick reply. I'm in a location (Northern India right now) with very bad AND slow internet, which I'm sure adds to the problem. I'm not a member of MediaFire either.

I was able to download the text file, which is UTF-8 without a BOM character, in case anyone else needs to know. Compressing the file would reduce it to about half the current size with ZIP compression, and about 60+% with 7z compression. Compression would also mean faster download time.

I'll compress the next one and upload to Dropbox as well as MediaFire.

goldyn chyld said:
Hm, but have you tried using 萌典's latest code which supports simplified input (see page 4)?

I haven't yet, because as far as I can tell that code won't deal with the exceptions (where a single traditional character can map to more than one simplified character). Since the original data does not have simplified headwords, I don't want to use a conversion method to generate them that might cause inaccuracies.
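To make the concern about exceptions concrete, here is a minimal Python sketch of a conversion that keeps every candidate instead of silently picking one. The mapping table is a tiny hand-written sample and purely hypothetical (a real table would be far larger); 乾 is used as the ambiguous case because it becomes 干 in the sense of "dry" but stays 乾 in 乾坤. This is not the 萌典 code or any script used in this thread.

```python
from itertools import product

# Hypothetical sample table: traditional character -> possible simplified forms.
TRAD_TO_SIMP = {
    "話": ["话"],
    "說": ["说"],
    "乾": ["干", "乾"],  # ambiguous: 干 ("dry") or unchanged, as in 乾坤
}


def simplified_candidates(headword: str) -> list:
    """Return every simplified spelling the table allows for a headword."""
    # Characters missing from the table are assumed to be unchanged.
    options = [TRAD_TO_SIMP.get(ch, [ch]) for ch in headword]
    return ["".join(combo) for combo in product(*options)]


def is_ambiguous(headword: str) -> bool:
    """True if the headword has more than one candidate and needs review."""
    return len(simplified_candidates(headword)) > 1


if __name__ == "__main__":
    for word in ["說話", "乾坤"]:
        print(word, simplified_candidates(word), is_ambiguous(word))
```

Headwords flagged as ambiguous could then be checked by hand rather than converted automatically, which avoids introducing inaccuracies into the generated headwords.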

goldyn chyld · Apr 5, 2013

Oh, ok I see. In any case, it's awesome to have MoE in Pleco... 8)

alex_hk90 · Apr 5, 2013

Updated with a small formatting change (affecting 911 entries):
- Added newlines for bracketed number lists within sub-definitions (def), such as found in: https://www.moedict.tw/#漢
I noticed a few exceptional cases (and there may be others I missed). I have fixed those that are purely formatting issues, but not those that look content-related (I'll probably correct these later, once I've read through the two entries properly to confirm):
三關 (https://www.moedict.tw/#三關): upstream/data typo? could manually correct (from "(2)" to "(7) " in last line);
漢 (https://www.moedict.tw/#漢): corrected extra newline (formatting issue, "(2)" used in reference rather than numbering);
短 (https://www.moedict.tw/#短): upstream/data typo? could manually correct (from "(2)" to "(3)" for last line in first sub-definition);
田鼠 (https://www.moedict.tw/#田鼠): corrected obvious formatting typo (from "(2 )" to "(2) ").
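The newline change described above can be sketched roughly as follows in Python. This is an illustration under assumptions, not the script actually used for the conversion: it assumes list markers look like "(1)", "(2)" and so on, and it only treats a marker as a list item when it is the next number in sequence. That heuristic leaves out-of-sequence references alone, but it would still misfire where a reference happens to carry the expected number, which is the kind of case that needed manual handling.

```python
import re

# Matches "(2)" as well as the "(2 )" typo, plus any trailing whitespace.
MARKER = re.compile(r"\((\d+)\s*\)\s*")


def break_numbered_items(definition: str) -> str:
    """Put each sequential "(n)" marker on its own line, normalised to "(n) "."""
    expected = 1

    def replace(match):
        nonlocal expected
        if int(match.group(1)) != expected:
            # Out of sequence: probably a cross-reference, leave untouched.
            return match.group(0)
        expected += 1
        newline = "\n" if match.start() > 0 else ""
        return f"{newline}({match.group(1)}) "

    return MARKER.sub(replace, definition)


if __name__ == "__main__":
    print(break_numbered_items("(1)甲。(2 )乙。(3)丙。"))
```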

SQL with comments:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.

Pleco flashcards (165810 entries):
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Pleco user dictionary (164936 entries, missing 874 entries - same as before):
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Note: Compressed in 7z - extract using 7-zip, p7zip or similar.
EDIT: Also now locked the database (can be unlocked in Settings / Manage Dictionaries) to improve performance - thanks to mikelove for suggesting this.

One slightly odd thing I noticed: I initially tried to replace the bracketed numbers with ➊ symbols (in Unicode - Dingbats), similar to the MoEDict.tw website, but while this worked on a Samsung Galaxy SIII (and on an Android tablet), the characters showed up as boxes on an HTC Desire. :? But it doesn't matter as in actual fact I think the parenthesised symbols ⑴ are nicer anyway.
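For reference, the parenthesised symbols sit in one contiguous Unicode range (U+2474 for ⑴ up to U+2487 for ⒇), so the replacement can be computed rather than looked up. The Python sketch below is illustrative only; the function names are made up, and it assumes the source text uses ASCII markers such as "(1)".

```python
import re

FIRST_PARENTHESISED = 0x2474  # ⑴ PARENTHESIZED DIGIT ONE
MAX_PARENTHESISED = 20        # the range ends at ⒇ (U+2487)


def parenthesised_number(n: int) -> str:
    """Return the single-character parenthesised form of n (1 to 20)."""
    if not 1 <= n <= MAX_PARENTHESISED:
        raise ValueError("no parenthesised symbol for %d" % n)
    return chr(FIRST_PARENTHESISED + n - 1)


def replace_markers(text: str) -> str:
    """Replace "(n)" with its parenthesised symbol, leaving n > 20 as-is."""
    def swap(match):
        n = int(match.group(1))
        if 1 <= n <= MAX_PARENTHESISED:
            return parenthesised_number(n)
        return match.group(0)

    return re.sub(r"\((\d+)\)", swap, text)


if __name__ == "__main__":
    print(replace_markers("(1)甲(2)乙"))
```

Whether a given device can render these characters still depends on its installed fonts, as the boxes on the HTC Desire show for the Dingbats range.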

I was just wondering about what fonts Pleco uses, and whether they vary across devices/Android versions/firmwares/etc.?

Also, I noticed that the logic for the newlines/pagination for the Chinese list separator ("、") doesn't seem to be the same as for the Western comma (","):

I would expect the line break to occur at the ("、") in the above?

mikelove · Apr 5, 2013

alex_hk90 said:
One slightly odd thing I noticed: I initially tried to replace the bracketed numbers with ➊ symbols (in Unicode - Dingbats), similar to the MoEDict.tw website, but while this worked on a Samsung Galaxy SIII (and on an Android tablet), the characters showed up as boxes on an HTC Desire. But it doesn't matter as in actual fact I think the parenthesised symbols ⑴ are nicer anyway. I was just wondering about what fonts Pleco uses, and whether they vary across devices/Android versions/firmwares/etc.?

We use whichever font is built into the phone, though you can customize it in Settings / General. (we could offer an official built-in font, but our initial download is already larger than we would like it to be without that extra 10-15 MB burden added, and the vast majority of Android phones embed DroidSansFallback anyway)

alex_hk90 said:
Also, I noticed that the logic for the newlines/pagination for the Chinese list separator ("、") doesn't seem to be the same as for the Western comma (","):

That one you can blame on Google - the line wrapping code for Android text fields is an ugly hack and doesn't detect Unicode line breaking characters in a consistent or intelligent manner. (said hacky-ness was also the reason for the much-publicized Android 4.1 random text field crashing bug - somebody mis-handled a variable in generate() because they didn't understand what it meant)

goldyn chyld · Apr 5, 2013

I noticed that Pleco freezes for 4-5 seconds when looking up certain words in MoE. For example: 咬人的狗兒不露齒. (iPhone 5, iOS 6.1.3).

Does it happen on Android, too?

mikelove · Apr 6, 2013

goldyn chyld said:
I noticed that Pleco freezes for 4-5 seconds when looking up certain words in MoE. For example: 咬人的狗兒不露齒. (iPhone 5, iOS 6.1.3).

Go into Settings / Manage Dictionaries, select the MOE dictionary, and "Lock Database" - this greatly speeds up performance. (alex_hk90 might also want to consider doing this with the official version)

alex_hk90 · Apr 6, 2013

mikelove said:
We use whichever font is built into the phone, though you can customize it in Settings / General. (we could offer an official built-in font, but our initial download is already larger than we would like it to be without that extra 10-15 MB burden added, and the vast majority of Android phones embed DroidSansFallback anyway)

Thanks for the info.

mikelove said:
alex_hk90 said:

Also, I noticed that the logic for the newlines/pagination for the Chinese list separator ("、") doesn't seem to be the same as for the Western comma (","):

That one you can blame on Google - the line wrapping code for Android text fields is an ugly hack and doesn't detect Unicode line breaking characters in a consistent or intelligent manner. (said hacky-ness was also the reason for the much-publicized Android 4.1 random text field crashing bug - somebody mis-handled a variable in generate() because they didn't understand what it meant)

That code really does look ugly, but it looks like fixing this would just involve cleaning up the code around isIdeographic() and the function itself - do they accept external contributions to the source?

I just looked at the same entry as the above screenshot (zi5) on a Samsung Galaxy SIII, and it seems to handle it a bit better - it splits (1) and (2) before the last zi5, and (3) and (4) after the last Chinese list separator. I'm not sure if this is due to a higher resolution screen (more pixels for the text field), a newer version of Android, or Samsung having written a better function for this.

goldyn chyld said:
I noticed that Pleco freezes for 4-5 seconds when looking up certain words in MoE. For example: 咬人的狗兒不露齒. (iPhone 5, iOS 6.1.3).

Does it happen on Android, too?

That search term worked fine for me (HTC Desire and Samsung Galaxy SIII).

mikelove said:
Go into Settings / Manage Dictionaries, select the MOE dictionary, and "Lock Database" - this greatly speeds up performance. (alex_hk90 might also want to consider doing this with the official version)

Done this now.

stephanhodges · Apr 6, 2013

Thanks for the Dropbox links. They work very well, and in fact I see about 5x speed improvement on the download from them.

One additional item, if you're interested. You could have just shared the link to a folder in Dropbox rather than individual files. That way, when you change the files, you wouldn't need to regenerate new links, since the folder itself hadn't changed, just the contents.

If you need more Dropbox space, you could also provide a "read-only" share invitation to the folder. I already have Dropbox with 18.5 GB, but I'd still accept a folder sharing invitation, since it would mean getting the updates "automatically". I guess that makes me both a geek and a lazy one too. Unfortunately, I can't think of anything I could do to contribute to this project, so I hope I don't sound like a troll here, carping about better ways to get "me" the files!

Yiliya · Apr 6, 2013

I ran a script, and there are 30 self-looping variant characters remaining, namely:

Code:

周		 「周」的異體字。
善		 「善」的異體字。
垂		 「垂」的異體字。
夢		 「夢」的異體字。
契		 「契」的異體字。
害		 「害」的異體字。
寺		 「寺」的異體字。
差		 「差」的異體字。
廣		 「廣」的異體字。
慨		 「慨」的異體字。
旨		 「旨」的異體字。
欖		 「欖」的異體字。
比		 「比」的異體字。
氎		 「氎」的異體字。
沒		 「沒」的異體字。
獲		 「獲」的異體字。
瓜		 「瓜」的異體字。
示		 「示」的異體字。
稹		 「稹」的異體字。
米		 「米」的異體字。
縣		 「縣」的異體字。
考		 「考」的異體字。
華		 「華」的異體字。
蔻		 「蔻」的異體字。
衷		 「衷」的異體字。
衽		 「衽」的異體字。
豪		 「豪」的異體字。
釵		 「釵」的異體字。
風		 「風」的異體字。
黃		 「黃」的異體字。

I suggest we just don't use those Perl scripts until Audrey makes a proper fix.
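A check for self-looping variants like the one described can be sketched in a few lines of Python. This is not the script that produced the list above; it assumes each line holds a headword, some whitespace, and a definition of the form 「X」的異體字。, and it reports headwords whose definition points back at themselves. The sample lines in the demo are made up.

```python
import re

# Headword, whitespace, then a definition that says "variant form of 「X」".
VARIANT_OF = re.compile(r"^(\S+)\s+「(.+)」的異體字。$")


def self_looping_variants(lines) -> list:
    """Return headwords defined only as a variant of themselves."""
    loops = []
    for line in lines:
        match = VARIANT_OF.match(line.strip())
        if match and match.group(1) == match.group(2):
            loops.append(match.group(1))
    return loops


if __name__ == "__main__":
    sample = [
        "周\t\t 「周」的異體字。",   # self-looping
        "峯\t\t 「峰」的異體字。",   # points at a different character
        "山\t\t 地面上高起的部分。",  # ordinary definition
    ]
    print(self_looping_variants(sample))
```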

alex_hk90 · Apr 6, 2013

stephanhodges said:
Thanks for the Dropbox links. They work very well, and in fact I see about 5x speed improvement on the download from them.

One additional item, if you're interested. You could have just shared the link to a folder in Dropbox rather than individual files. That way, when you change the files, you wouldn't need to regenerate new links, since the folder itself hadn't changed, just the contents.

If you need more Dropbox space, you could also provide a "read-only" share invitation to the folder. I already have Dropbox with 18.5 GB, but I'd still accept a folder sharing invitation, since it would mean getting the updates "automatically". I guess that makes me both a geek and a lazy one too. Unfortunately, I can't think of anything I could do to contribute to this project, so I hope I don't sound like a troll here, carping about better ways to get "me" the files!

You're welcome.

I think this should be the link to the folder: https://www.dropbox.com/sh/do7i5f9kh5xl ... nese/Pleco
(It wasn't the easiest to find because I am using a (now deprecated I think) Dropbox Public folder.)

I'll probably have to move the folder out of Public to create the share invitation - it says "You can't create shared folders inside your Public folder." when I try to share it.

Yiliya said:

I ran a script, and there are 30 self-looping variant characters remaining, namely:

Code:

周		 「周」的異體字。
善		 「善」的異體字。
垂		 「垂」的異體字。
夢		 「夢」的異體字。
契		 「契」的異體字。
害		 「害」的異體字。
寺		 「寺」的異體字。
差		 「差」的異體字。
廣		 「廣」的異體字。
慨		 「慨」的異體字。
旨		 「旨」的異體字。
欖		 「欖」的異體字。
比		 「比」的異體字。
氎		 「氎」的異體字。
沒		 「沒」的異體字。
獲		 「獲」的異體字。
瓜		 「瓜」的異體字。
示		 「示」的異體字。
稹		 「稹」的異體字。
米		 「米」的異體字。
縣		 「縣」的異體字。
考		 「考」的異體字。
華		 「華」的異體字。
蔻		 「蔻」的異體字。
衷		 「衷」的異體字。
衽		 「衽」的異體字。
豪		 「豪」的異體字。
釵		 「釵」的異體字。
風		 「風」的異體字。
黃		 「黃」的異體字。

I suggest we just don't use those Perl scripts until Audrey makes a proper fix.

I tried not using the Perl script, but there were far too many missing characters, to the extent that it would make the dictionary difficult to use. I think better usability is worth the 30 self-looping variants. I can understand how there might be some doubts over the Perl script due to these 30 being incorrect, but they are only incorrect insofar as they are variant characters (with the same pronunciation and meaning), so using these characters instead of the ones they were meant to be shouldn't affect the main dictionary definitions too much.

mikelove said:
Go into Settings / Manage Dictionaries, select the MOE dictionary, and "Lock Database" - this greatly speeds up performance. (alex_hk90 might also want to consider doing this with the official version)

On this topic, would there be benefit in also doing "Enable Full-text Index"? The description says "Enable full-text English search (experimental)" but I guess enabling this should also allow searching within the definitions? Are there any disadvantages of enabling this?

goldyn chyld · Apr 6, 2013

mikelove said:
goldyn chyld said:

I noticed that Pleco freezes for 4-5 seconds when looking up certain words in MoE. For example: 咬人的狗兒不露齒. (iPhone 5, iOS 6.1.3).

Go into Settings / Manage Dictionaries, select the MOE dictionary, and "Lock Database" - this greatly speeds up performance. (alex_hk90 might also want to consider doing this with the official version)

Thanks, it did the trick! Now it displays the definition instantaneously.
(Also, Alex's latest version already comes with a locked database by default now.) 8)

Yiliya · Apr 6, 2013

A question for Mike.

Is it possible to make 造字 like "⿰亻壯" or "⿹乁乄" display properly using Pleco's radical system?

mikelove · Apr 6, 2013

alex_hk90 said:
That code really does look ugly, but it looks like fixing this would just involve cleaning up the code around isIdeographic() and the function itself - do they accept external contributions to the source?

Generally no, at least not in core pieces of the OS like this.

alex_hk90 said:
On this topic, would there be benefit in also doing "Enable Full-text Index"? The description says "Enable full-text English search (experimental)" but I guess enabling this should also allow searching within the definitions? Are there any disadvantages of enabling this?

Only works with English, so it wouldn't help with this. (we can / eventually plan to extend it to Chinese too, but that's a much less popular feature so we didn't make it a priority)

goldyn chyld said:
Thanks, it did the trick! Now it displays the definition instantaneously.

Great!

Yiliya said:
Is it possible to make 造字 like "⿰亻壯" or "⿹乁乄" display properly using Pleco's radical system?

In theory yes, but it doesn't support them at the moment, and we're actually moving in the direction of supporting 造字 through embedded images instead. Maybe not quite as pretty, but much more universal; almost every dictionary we've recently licensed that uses rare characters supplies them that way.

alex_hk90 · Apr 6, 2013

mikelove said:
alex_hk90 said:

That code really does look ugly, but it looks like fixing this would just involve cleaning up the code around isIdeographic() and the function itself - do they accept external contributions to the source?

Generally no, at least not in core pieces of the OS like this.

alex_hk90 said:

On this topic, would there be benefit in also doing "Enable Full-text Index"? The description says "Enable full-text English search (experimental)" but I guess enabling this should also allow searching within the definitions? Are there any disadvantages of enabling this?

Only works with English, so it wouldn't help with this. (we can / eventually plan to extend it to Chinese too, but that's a much less popular feature so we didn't make it a priority)

Thanks for the info.

audreyt · Apr 6, 2013

Hi, the last 30 self-looping characters are addressed in the mapping table at https://github.com/g0v/moedict-epub/blob/master/sym.txt (the db2unicode.pl program remains unchanged) —— thanks for the feedback!

And sorry about the slowish response; creating a new issue at https://github.com/g0v/moedict-epub/issues/new will ensure a more timely fix in the future.

alex_hk90 · Apr 7, 2013

audreyt said:
Hi, the last 30 self-looping characters are addressed in the mapping table at https://github.com/g0v/moedict-epub/blob/master/sym.txt (the db2unicode.pl program remains unchanged) —— thanks for the feedback!

And sorry about the slowish response; creating a new issue at https://github.com/g0v/moedict-epub/issues/new will ensure a more timely fix in the future.

Thanks again - I'll use the updated mapping table in future versions.

Speaking of which, I think I've done more or less everything I can on it now, so there shouldn't be many future versions. The main thing left is the addition of simplified headwords (using a reliable, accurate script or mapping table) to allow multi-character merged searching. I'll probably leave the minor issue where definitions cannot be found by their secondary Pinyin. This occurs for 995 definitions where more than one pronunciation is stored in the Pinyin field, such as 齉鼻兒: https://www.moedict.tw/#%E9%BD%89%E9%BC%BB%E5%85%92. They can still be searched by their headword or primary Pinyin (in this example, 齉鼻兒 can be found by searching for nang4bi2 but not nang4bie2).
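If someone does want to tackle the secondary Pinyin issue, one possible approach is to index every reading separately. The Python sketch below is only an illustration: how the multiple pronunciations are actually delimited in the Pinyin field is not stated in this thread, so the `separator` parameter (defaulting to "|") is a pure assumption, as are the function name and the entry layout.

```python
from collections import defaultdict


def build_pinyin_index(entries, separator="|"):
    """Map each individual reading to the set of headwords that use it.

    `entries` is an iterable of (headword, pinyin_field) pairs; the
    separator between readings is hypothetical and must be adjusted
    to match the real data.
    """
    index = defaultdict(set)
    for headword, pinyin_field in entries:
        for reading in pinyin_field.split(separator):
            reading = reading.strip()
            if reading:
                index[reading].add(headword)
    return index


if __name__ == "__main__":
    index = build_pinyin_index([("齉鼻兒", "nang4bi2|nang4bie2")])
    print(index["nang4bi2"], index["nang4bie2"])
```

Whether Pleco's user dictionary format can store more than one searchable reading per entry is a separate question that this sketch does not address.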
