The MoE dictionary is now open source

Yiliya

榜眼
EDIT: Get the Pleco user dictionary here.

The database of the huge 國語辭典 of Taiwan's Ministry of Education is now available for download. You can read more about the project here or just grab the SQL database here.

The database has a very clean format, with separate entries for each of the multiple meanings and parts of speech of any given word. However, this also makes it a bit complicated to query and convert to other formats.
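To give an idea of the shape of the data: each sense is its own row in a definitions table, which links through a heteronyms table (one row per reading) to an entries table (one row per headword), so even pulling a flat word list takes a three-way join. Roughly something like this (untested sketch, using the column names from the published schema):
Code:
-- Untested sketch: one output row per sense, so a word with several
-- senses or readings comes back several times.
select entries.title, heteronyms.pinyin, definitions.def
from definitions
join heteronyms on definitions.heteronym_id = heteronyms.id
join entries on heteronyms.entry_id = entries.id;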

There are several GitHub projects associated with the dictionary. For example, there's a webkit if you want to put it online, and also a collection of scripts to convert the missing non-Big5 characters to Unicode.

This is a huge breakthrough for learners of Chinese, as 教育部重編國語辭典修訂本 is one of the biggest Chinese dictionaries ever created. Although it's bookish and archaic (it's in fact based on the prewar dictionary of the same name), it's one of the best references if one is serious about reading Chinese literature.

Anyway, what I'm getting at is that now it can be added to Pleco! Either officially (as a free add-on, like CC-CEDICT) or unofficially (as a user dictionary). As I said, the database is very clean and there's no need to proofread or edit it in any way (you can just sort everything by title/pinyin), so it should be pretty easy to convert it to Pleco's format. I would do it myself, but sadly I currently don't have enough time to write SQL scripts and learn the nuances of Pleco's user dictionary format. Hopefully somebody else (or even Mike himself) will lend a hand!
 

mikelove

皇帝
Staff member
Thanks for posting about this.

Looks like it would be easy enough to convert - the format is actually quite similar to HanDeDict and Adsotrans, so we probably already have 90% of the code we need (not that it's that much anyway) - but I'm a little concerned about the "為非營利之教育目的" ("for non-profit educational purposes") language in the license. We don't mind releasing it as a free add-on, but since it would be released in a proprietary format in a commercial app, and our presumed goal in releasing it would be to motivate people to buy other stuff to go along with it, I'm not quite sure we'd be covered by that language.

The closest equivalent of this in widely-used open-source licenses would probably be Creative Commons by-nc, which bars use "in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation" and so would pretty clearly exclude us (since there's a clear commercial advantage to our offering this). Someone distributing it independently as a user dictionary would probably be fine - they're not gaining any commercial advantage from that act even if Pleco is, and I don't think you'd get into any trouble for distributing a non-commercial-licensed document in Microsoft Word just because doing that offers some benefit to Microsoft - but Pleco distributing it officially might be problematic.

Do you have any sense (from any other postings that the MoE might have made on the subject, say) of how this would be interpreted? Or of how "非營利" ("non-profit") would be understood under ROC law in general?
 

Yiliya

榜眼
Thanks for your response, Mike, and I can definitely see why you would want to be cautious about this. However, your analogy with MS Word is a bit off. Pleco itself is free; one doesn't have to pay a cent in order to get access to the Pleco platform. It's the premium content that must be purchased.

To draw another analogy, Pleco is not gaining any direct commercial advantage by including CC-CEDICT. One can download both Pleco and its CC-CEDICT add-on for free. It's ABC and the other premium dictionaries that power Pleco's revenue. Wouldn't the same logic apply to the MoE add-on?
 

mikelove

皇帝
Staff member
Yiliya said:
Thanks for your response, Mike, and I can definitely see why you would want to be cautious about this. However, your analogy with MS Word is a bit off. Pleco itself is free; one doesn't have to pay a cent in order to get access to the Pleco platform. It's the premium content that must be purchased.

Premium features, too - OCR, flashcards, reader, handwriting, audio, TTS, etc. all work with any dictionary, and it's quite reasonable to think (and indeed our hope/expectation would be) that somebody downloading the MoE dictionary for free might subsequently decide to buy one or more of those add-on features to use with it.

Yiliya said:
To draw another analogy, Pleco is not gaining any direct commercial advantage by including CC-CEDICT. One can download both Pleco and its CC-CEDICT add-on for free. It's ABC and the other premium dictionaries that power Pleco's revenue. Wouldn't the same logic apply to the MoE add-on?

We're a for-profit company, so it would be reasonable for a court to assume that the only reason we would ever release anything for free would be because we expect to make money off of that act in other ways. Even "customer goodwill" might be considered a form of commercial benefit. CC-CEDICT is distributed under Creative Commons by-sa, which doesn't prohibit commercial use as long as you give them credit and share any improvements that you make, but an explicitly non-commercial license like this one is more worrying.

That's not to say that we can't do it, just that we need to clear it with somebody first - either the MoE themselves (who haven't been particularly friendly to us in the past, sadly) or at the very least an IP lawyer in Taiwan.
 

alex_hk90

状元
If I have some time this week I might have a look at converting this to Pleco user dictionary format. The database looks quite straightforward but appears to have some additional information such as a table with character information (radical, stroke count, etc.) and also Bopomofo pronunciations which I'll probably (at least initially) ignore for simplicity and brevity, focussing on the main table with the definitions and just the Pinyin pronunciations.

EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
 

mikelove

皇帝
Staff member
alex_hk90 said:
If I have some time this week I might have a look at converting this to Pleco user dictionary format. The database looks quite straightforward but appears to have some additional information such as a table with character information (radical, stroke count, etc.) and also Bopomofo pronunciations which I'll probably (at least initially) ignore for simplicity and brevity, focussing on the main table with the definitions and just the Pinyin pronunciations.

Well, it shouldn't run into the "entries are too long" problem, at least. Bopomofo you don't need to worry about, since Pleco will optionally convert the Pinyin to that anyway.

We're in the middle of making a few inquiries with the MoE, so there's still hope that we might be able to distribute this officially.
 
goldyn chyld

They recently released an app called "萌典" for iOS (probably available on Android, too), which is simple but nice. You can also cross-check words within the definitions.
Now, imagine how awesome it would be to have it available on Pleco...wow! It says within the app that it includes 160,000 entries.

Keeping my fingers crossed!
 

mikelove

皇帝
Staff member
goldyn chyld said:
They recently released an app called "萌典" for iOS (probably available on Android, too), which is simple but nice. You can also cross-check words within the definitions.

Seems to be free and web-based, which is probably why it could safely be released with the non-commercial license. Anyway, we've got a good friend in Taiwan making inquiries with the MoE now so hopefully we'll know soon whether Pleco releasing it for free is kosher or not.
 
goldyn chyld

mikelove said:
goldyn chyld said:
They recently released an app called "萌典" for iOS (probably available on Android, too), which is simple but nice. You can also cross-check words within the definitions.

Seems to be free and web-based, which is probably why it could safely be released with the non-commercial license. Anyway, we've got a good friend in Taiwan making inquiries with the MoE now so hopefully we'll know soon whether Pleco releasing it for free is kosher or not.

It works in offline mode too. Hopefully your friend can bring us some good news then. :)
 

alex_hk90

状元
Yiliya said:
The database of the huge 國語辭典 of Taiwan's Ministry of Education is now available for download. You can read more about the project here or just grab the SQL database here.

The database has a very clean format, with separate entries for each of the multiple meanings and parts of speech of any given word. However, this also makes it a bit complicated to query and convert to other formats.

There are several GitHub projects associated with the dictionary. For example, there's a webkit if you want to put it online, and also a collection of scripts to convert the missing non-Big5 characters to Unicode.
I've had a look at the SQL database, and while it is quite clean, there are a couple of (relatively minor) challenges in converting it nicely for Pleco.

Firstly, even after running that Perl script to replace the missing characters, there are still quite a few that are missing (e.g. the first definition, {[8e40]}). Maybe these are just really rare characters, though I don't know for sure.

Secondly, there's the question of how to handle the fields which are often blank (quote, example, antonyms, synonyms, link, etc.). They should only be included in the formatted definition if they are not blank, but I'm not sure there's an easy way of doing this without just looping through each of the 200,000+ entries individually (which is doable, just not the most elegant solution). Possibly it could be done by adding some kind of text for each of the fields and then searching for and replacing the blank ones with nothing.
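For example, assuming the relevant tables were first merged into a single table (the "combined" and "newdef" names here are just for illustration), something like this (untested) should append a given optional field only where it's non-blank, with no per-row looping:
Code:
-- Untested: append the example field only to rows where it's non-blank;
-- the same pattern would repeat for quote, synonyms, antonyms and link.
update combined
set newdef = newdef || ' ' || example
where coalesce(example, '') <> '';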
 

mikelove

皇帝
Staff member
alex_hk90 said:
Firstly, even after running that Perl script to replace the missing characters, there are still quite a few that are missing (e.g. the first definition, {[8e40]}). Maybe these are just really rare characters, though I don't know for sure.

Hmm... the website HTML might have images, in which case we could potentially embed those if they are in fact rare characters. But if not, since I doubt we'd be allowed to charge for this in any case, we can't really afford to spend a lot of man-hours digitizing / cleaning up those characters, so we'd have to wait for the community to do it. (Then again, it's probably rather easy to recruit people for that if it's an open-source dictionary and they can use it in Pleco for free.)

alex_hk90 said:
Secondly, there's the question of how to handle the fields which are often blank (quote, example, antonyms, synonyms, link, etc.). They should only be included in the formatted definition if they are not blank, but I'm not sure there's an easy way of doing this without just looping through each of the 200,000+ entries individually (which is doable, just not the most elegant solution). Possibly it could be done by adding some kind of text for each of the fields and then searching for and replacing the blank ones with nothing.

That one's not a problem at all, actually - just about every dictionary data format we deal with has optional fields, and the internal format we use for dictionaries allows for them too.
 

sandwich

举人
alex_hk90 said:
Firstly, even after running that Perl script to replace the missing characters, there are still quite a few that are missing (e.g. the first definition, {[8e40]}). Maybe these are just really rare characters, though I don't know for sure.

8e40 is part of the code block reserved for user-defined characters, so...
... if you're lucky, that'll be because your conversion table doesn't have the right extensions (see: http://en.wikipedia.org/wiki/Big5#Big-5E).
... if you're unlucky, it'll be because the MoE's custom font is needed.
 

mikelove

皇帝
Staff member
sandwich said:
8e40 is part of the code block reserved for user-defined characters, so...
... if you're lucky, that'll be because your conversion table doesn't have the right extensions (see: http://en.wikipedia.org/wiki/Big5#Big-5E).
... if you're unlucky, it'll be because the MoE's custom font is needed.

Wikipedia says it's part of Big5-E, so it should be convertible unless it's a rare character that hasn't made it into one of the Unicode standards yet.
 

sandwich

举人
mikelove said:
sandwich said:
8e40 is part of the code block reserved for user-defined characters, so...
... if you're lucky, that'll be because your conversion table doesn't have the right extensions (see: http://en.wikipedia.org/wiki/Big5#Big-5E).
... if you're unlucky, it'll be because the MoE's custom font is needed.

Wikipedia says it's part of Big5-E, so it should be convertible unless it's a rare character that hasn't made it into one of the Unicode standards yet.

No such luck, I'm afraid: according to Big5-E, 8e40 is 亠, which does not have a total stroke count of 14 as per the first entry in the db. The definition for that entry says it's a variant (異體字) of 籮, and if you then look that up in the online version you can see an image of the character 箩 - which is obviously in Unicode, but not in Big5. I tried a few of the other encoding extensions but couldn't find anything useful, and given that "Taiwan Ministry of Education font" is listed under the "Official extensions", I would say that's what we're dealing with.

Edit: 萌典 must have partially solved the problem, since I can find entry 10 (i.e. 玍) in that dictionary, but not 箩 (or maybe they have variants turned off).
Edit 2: So now I see what's going on... or rather I don't... it's all written up on 3du.tw in Chinese, which I can't read :roll:. (Even if I could - no offense to the people working on it - it seems a complete mess: a lot of wrong-tool-for-the-job, and a lot of fragmentation.)
Anyway, as far as I can tell the conversion tables, which whatever Perl script you're using is probably based on, have been copy-pasted all over the place and are possibly hand-edited. So, um, yeah... expect missing characters and errors.
 

alex_hk90

状元
sandwich said:
No such luck, I'm afraid: according to Big5-E, 8e40 is 亠, which does not have a total stroke count of 14 as per the first entry in the db. The definition for that entry says it's a variant (異體字) of 籮, and if you then look that up in the online version you can see an image of the character 箩 - which is obviously in Unicode, but not in Big5. I tried a few of the other encoding extensions but couldn't find anything useful, and given that "Taiwan Ministry of Education font" is listed under the "Official extensions", I would say that's what we're dealing with.

Edit: 萌典 must have partially solved the problem, since I can find entry 10 (i.e. 玍) in that dictionary, but not 箩 (or maybe they have variants turned off).
Edit 2: So now I see what's going on... or rather I don't... it's all written up on 3du.tw in Chinese, which I can't read :roll:. (Even if I could - no offense to the people working on it - it seems a complete mess: a lot of wrong-tool-for-the-job, and a lot of fragmentation.)
Anyway, as far as I can tell the conversion tables, which whatever Perl script you're using is probably based on, have been copy-pasted all over the place and are possibly hand-edited. So, um, yeah... expect missing characters and errors.

Hmm, thanks for the information.

The script I ran it through (db2unicode.pl from https://github.com/g0v/moedict-epub) seems to use the conversion table sym.txt (https://github.com/g0v/moedict-epub/blob/master/sym.txt). There are some others on there but most of them seem smaller/more specific.

Anyway, putting aside the missing character issue for now (that step can always be improved later), I hope to have something in Pleco format by the weekend. :)
 

alex_hk90

状元
OK, it's very much a first attempt but I've converted the SQL database linked to in the first post to Pleco flashcard format:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
I'll upload it in Pleco user dictionary format later as it's taking ages to import on my phone (HTC Desire), going at a rate of around 20,000 entries per hour.

EDIT: Here it is in Pleco user dictionary format:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
You might notice that there are 212710 entries instead of 213486. For some reason 776 entries (within the first 10,000 entries from the flashcards) didn't import - from a quick glance through it looks like at least some of these (if not all) were variants with missing characters. I didn't have time to try to re-import these but will look into this in future versions.

I've merged the entries into a very basic format for now:
<{type}> {def}

{example}

Quote: {quote}

Synonyms: {synonyms}

Antonyms: {antonyms}

Link: {link}

For anyone wanting to reproduce this from the source, I kept the following notes/instructions:
Code:
20130327 MoEDict
 
Source: dict-revised.sqlite3.bz2 from http://kcwu.csie.org/%7Ekcwu/moedict/dict-revised.sqlite3.bz2
Information: http://3du.tw/
 
- Extract to dict-revised.sqlite3 and open with sqlite:
.schema reveals table structure:
definitions link to heteronyms link to entries (link to dicts but only 1 dict so unnecessary).
Note: Counts: definitions = 213486; heteronyms = 165825; entries = 163093; dicts = 1.
 
- Convert missing non-Big5 characters to Unicode:
use moedict-epub / db2unicode.pl and work on output (dict-revised.unicode.sqlite3).
Note: the Perl script does not convert all missing characters.
 
- Create new table (combined) with only relevant columns:
create table combined as
select definitions.id, entries.title, heteronyms.pinyin,
definitions.type, definitions.def, definitions.example, definitions.quote,
definitions.synonyms, definitions.antonyms, definitions.link 
from definitions, heteronyms, entries
where definitions.heteronym_id = heteronyms.id and heteronyms.entry_id = entries.id;
 
- Reformat definitions into single definition field and use Pleco flashcard special characters (newlines, bold):
---
<{type}> {def}
 
{example}
 
[b]Quote:[/b] {quote}
 
[b]Synonyms:[/b] {synonyms}
 
[b]Antonyms:[/b] {antonyms}
 
[b]Link:[/b] {link}
---
alter table combined add column newdef;
 
update combined
set newdef = case when coalesce(type, '') = ''
then type else '<'||type||'> ' end;
 
update combined
set newdef = case when coalesce(newdef, '') = ''
then def else newdef||def end;
Note: The case statement here works around an issue where newdef would not be updated if it were currently null (null || anything is null in SQLite).
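Note: The empty '' literals in the updates below are placeholders for Pleco's special newline character, which doesn't survive being pasted as plain text - re-insert the real character before running these.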
 
update combined
set newdef = case when coalesce(example, '') = ''
then newdef else newdef||''||example end;
 
update combined
set newdef = case when coalesce(quote, '') = ''
then newdef else newdef||''||'Quote: '||quote end;
(and repeat for synonyms, antonyms, link)
update combined set newdef = case when coalesce(synonyms, '') = ''
then newdef else newdef||''||'Synonyms: '||synonyms end;
update combined set newdef = case when coalesce(antonyms, '') = ''
then newdef else newdef||''||'Antonyms: '||antonyms end;
update combined set newdef = case when coalesce(link, '') = ''
then newdef else newdef||''||'Link: '||link end;
Note: Consider hyperlinks for link? Difficult if link has more than just the other entry/entries.
 
- Output into Pleco flashcard format:
{characters}{ TAB }{pinyin}{ TAB }{definition}
.mode tabs
.output MoEDict-cards.txt
select title, pinyin, newdef from combined;
 
- Import MoEDict-cards.txt into a Pleco user dictionary:
Settings - Manage Dictionaries - Add User - Create New
Import Entries
Also, if anyone has suggestions for optimising the SQL commands, let me know - I don't think I've done it in the most elegant or efficient fashion.
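One direction (untested - I haven't verified it against the full database) would be to fold the whole chain of updates into a single pass:
Code:
-- Untested: build newdef in one statement instead of chained updates.
-- As above, the '' literals are placeholders for Pleco's newline character.
update combined
set newdef =
  case when coalesce(type, '') = '' then '' else '<'||type||'> ' end
  || coalesce(def, '')
  || case when coalesce(example, '') = '' then '' else ''||example end
  || case when coalesce(quote, '') = '' then '' else ''||'Quote: '||quote end
  || case when coalesce(synonyms, '') = '' then '' else ''||'Synonyms: '||synonyms end
  || case when coalesce(antonyms, '') = '' then '' else ''||'Antonyms: '||antonyms end
  || case when coalesce(link, '') = '' then '' else ''||'Link: '||link end;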

Things that still need some consideration:
- Missing characters: as discussed above, a better/additional conversion table could be used to deal with the remaining missing characters (a quick query to count the affected entries is sketched below).
- Formatting: this could almost certainly be improved - I've just tried to keep it simple and more or less in line with the other Pleco dictionaries.
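
For instance, this (untested) query would give a quick count of merged definitions that still contain an unconverted {[xxxx]} placeholder:
Code:
-- Untested: count definitions still containing a raw missing-character token.
select count(*) from combined where newdef like '%{[%';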

Hope this is helpful for someone though. :)
 

audreyt

Member
Heya, main author of https://moedict.tw/ and its Android/iOS offline apps here. Thanks for the awesome work converting to the Pleco format!

The missing entries are all variant characters; they have no distinct semantics, and it's safe to discard them.

Also note that this database was crawled & cleaned from the MoE website by the g0v.tw team ( http://g0v.tw/about.html ), under Taiwan's fair-use clause, without official blessing from MoE (although we did inform them of our work). In light of that, we released the supporting code into the public domain under CC0 ( https://creativecommons.org/publicdomai ... /legalcode ):

"Affirmer disclaims responsibility for clearing rights of other persons that may apply to the Work or any use thereof, including without limitation any person's Copyright and Related Rights in the Work. Further, Affirmer disclaims responsibility for obtaining any necessary consents, permissions or other rights required for any use of the Work."

The TL;DR version is that MoE is not yet ready to release this dictionary with a CC license, and so any work on the data set is subject to Taiwan's fair-use criteria.

If you run into any issues, please feel free to write to the 3du group ( 3du-tw@googlegroups.com ) or to me ( audreyt@audreyt.org ).

Cheers,
Audrey

P.S.: MoE did release Holok and Hakka dictionaries under CC-BY-ND, and we've been converting them to machine-friendly format as well -- see e.g. https://github.com/g0v/moedict-data-twblg for the Holok data set.
 

mikelove

皇帝
Staff member
Audrey,

Thanks for chiming in here.

audreyt said:
The TL;DR version is that MoE is not yet ready to release this dictionary with a CC license, and so any work on the data set is subject to Taiwan's fair-use criteria.

Could you point me to any more information on those fair-use criteria? We spent a good deal of time searching for the relevant law but didn't find anything - we're happy to release this in Pleco for free, but it's tough for us to argue that we're not still deriving some commercial benefit from doing so. So the question really rests on how the Taiwanese government defines fair use.
 

audreyt

Member
Sure. The AIT provides a pretty succinct summary at http://www.ait.org.tw/en/ipr-copyright.html .

The exact legal code is here: http://www.financelaw.fju.edu.tw/law/do ... %B3%95.pdf

§50 Works publicly released in the name of a central or local government agency or a public juristic person may, within a reasonable scope, be reproduced, publicly broadcast, or publicly transmitted.

§65 Fair use of a work shall not constitute infringement on economic rights in the work.
In determining whether the exploitation of a work complies with the provisions of Articles 44 through 63, or other conditions of fair use, all circumstances shall be taken into account, and in particular the following facts shall be noted as the basis for determination:

1. The purposes and nature of the exploitation, including whether such exploitation is of a commercial nature or is for nonprofit educational purposes.
2. The nature of the work.
3. The amount and substantiality of the portion exploited in relation to the work as a whole.
4. Effect of the exploitation on the work's current and potential market value.

Where the copyright owner organization and the exploiter organization have formed an agreement on the scope of the fair use of a work, it may be taken as reference in the determination referred to in the preceding paragraph.
In the course of forming an agreement referred to in the preceding paragraph, advice may be sought from the specialized agency in charge of copyright matters.
 

mikelove

皇帝
Staff member
audreyt said:
Sure. The AIT provides a pretty succinct summary at http://www.ait.org.tw/en/ipr-copyright.html .

The exact legal code is here: http://www.financelaw.fju.edu.tw/law/do ... %B3%95.pdf

Thanks!

Unfortunately, though, the circumstances of this (not-authorized-by-MoE) release, plus the legal language against commercial use, would seem to prevent us from using this in Pleco, barring a separate agreement with the MoE. (There's still hope that we might manage to obtain that agreement, but if we don't, then I'm afraid we're probably out of luck until they decide to do an official CC release.)
 