The MoE dictionary is now open source

Yiliya · Mar 30, 2013

alex_hk90,

Thanks for the commitment! I took a look at your conversion, and here are the issues I'd like to bring to your attention:

- Please take a look at how 萌典 processes the 行 entry. This entry is excellent for testing since it has three readings, multiple parts of speech, and multiple definitions for each part of speech. Pleco sorts entries by hanzi/pinyin, so we need only three entries: one for háng, one for xíng and one for xìng. Parts of speech and numbered definitions should be included in the entry body itself and not as separate entries. Your current conversion has 25 (!) entries for 行, which is a bit of overkill.

- I don't think it's a good idea to include English headings like 'Quote:' or 'Link:' in a Chinese-Chinese dictionary. In fact, there's no need to include such headings at all. Both the original website and 萌典 don't have them. As for synonyms and antonyms, simply use 同義詞 and 反義詞.

Again, thank you for your attempt. I ran into the same issue as you (each definition being a separate entry), and my SQL skills aren't good enough to combine all definitions under a single title. Hopefully, these problems can be resolved!

audreyt · Mar 30, 2013

The SQL syntax in the simple command-line client https://gist.github.com/audreyt/4648550#file-moe-pl-L11 may be relevant -- " GROUP BY title, heteronyms.bopomofo " is used to produce one entry per pronunciation.
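
A minimal sketch of the grouping idea, using Python's sqlite3 module and a toy in-memory table. The table and column names here (`defs`, `title`, `bopomofo`, `def`) and the sample rows are simplified assumptions for illustration; they are not the actual MoE schema or the query in the linked gist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table defs (title text, bopomofo text, def text)")
conn.executemany(
    "insert into defs values (?, ?, ?)",
    [
        ("行", "ㄒㄧㄥˊ", "走"),
        ("行", "ㄒㄧㄥˊ", "做"),
        ("行", "ㄏㄤˊ", "行列"),
    ],
)

# One row per (title, pronunciation); the definitions are joined into one body.
# group_concat does not promise an order unless the input is ordered first.
rows = conn.execute(
    "select title, bopomofo, group_concat(def, ' / ') as body, count(*) as n "
    "from defs group by title, bopomofo order by n desc"
).fetchall()

for title, bopomofo, body, n in rows:
    print(title, bopomofo, n, body)
```

Three definition rows collapse into two entries, one per reading, which is the shape Pleco needs.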

alex_hk90 · Mar 30, 2013

Yiliya said:
alex_hk90,

Thanks for the commitment! I took a look at your conversion, and here are the issues I'd like to bring to your attention:

- Please take a look at how 萌典 processes the 行 entry. This entry is excellent for testing since it has three readings, multiple parts of speech, and multiple definitions for each part of speech. Pleco sorts entries by hanzi/pinyin, so we need only three entries: one for háng, one for xíng and one for xìng. Parts of speech and numbered definitions should be included in the entry body itself and not as separate entries. Your current conversion has 25 (!) entries for 行, which is a bit of overkill.

- I don't think it's a good idea to include English headings like 'Quote:' or 'Link:' in a Chinese-Chinese dictionary. In fact, there's no need to include such headings at all. Both the original website and 萌典 don't have them. As for synonyms and antonyms, simply use 同義詞 and 反義詞.

Again, thank you for your attempt. I ran into the same issue as you (each definition being a separate entry), and my SQL skills aren't good enough to combine all definitions under a single title. Hopefully, these problems can be resolved!

Thanks for the suggestions.

On the individual points:

- I noticed this as well. It shouldn't be overly difficult to combine them as you describe. I guess the difficulty might be in determining what are "parts of speech" as opposed to "definitions".

- I agree. Dropping the Quote and Link headings and replacing Synonyms and Antonyms with the Chinese is a very easy change.

When skimming through, I also noticed a couple of other points:

- Sometimes the title (Hanzi) includes the English in brackets (seems to be for transliterated names like "Abu Dhabi") - this should probably be moved into the start of the main body of the definition.

- Sometimes the Pinyin has a note before it, like <又音> or <讀音>; this should probably be moved into the main body of the definition as well, though it doesn't seem to affect the audio pronunciation so maybe it can stay in the Pinyin field?
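
The two points above could be handled by a small pre-processing step, sketched here in Python. The exact bracket characters used in the source data are an assumption (the patterns accept both ASCII and fullwidth forms), the helper names `split_title` and `split_pinyin` are hypothetical, and the sample strings are made up for illustration.

```python
import re

# Assumed shapes: "標題(English)" for titles and "<note>pinyin" for readings.
TITLE_RE = re.compile(r"^(?P<hanzi>[^(（]+)[(（](?P<english>[^)）]+)[)）]$")
NOTE_RE = re.compile(r"^[<〈（(](?P<note>[^>〉）)]+)[>〉）)]\s*(?P<pinyin>.+)$")


def split_title(title):
    """Return (hanzi, english); english is '' when no bracketed part exists."""
    m = TITLE_RE.match(title.strip())
    if m:
        return m.group("hanzi").strip(), m.group("english").strip()
    return title.strip(), ""


def split_pinyin(pinyin):
    """Return (note, pinyin); note is '' when there is no leading note."""
    m = NOTE_RE.match(pinyin.strip())
    if m:
        return m.group("note"), m.group("pinyin").strip()
    return "", pinyin.strip()


print(split_title("阿布達比(Abu Dhabi)"))
print(split_pinyin("<又音>mài"))
```

The English part and the note can then be prepended to the definition body, leaving clean Hanzi and Pinyin fields.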

Unfortunately I'm not going to be able to do any more on this over the Easter weekend as I'll be away from my main computer, but I'll think about it and hopefully resolve all these issues when I get back.

audreyt said:
The SQL syntax in the simple command-line client https://gist.github.com/audreyt/4648550#file-moe-pl-L11 may be relevant -- " GROUP BY title, heteronyms.bopomofo " is used to produce one entry per pronunciation.

Thanks - that looks useful. And also thank you very much for your work on the dictionary to make it available in such a useful format in the first place.

Yiliya · Mar 30, 2013

How much Chinese do you understand? The link to the 行 entry that I provided should be a good illustration as to what the correct output looks like. Anyway:

- definitions/type is the part of speech, e.g. 動 ( = 動詞), 名 ( = 名詞), 形 ( = 形容詞) etc
- definitions/idx is the definition's number, if it's greater than 0 then the definition should be displayed along the previous definition(s) in a single entry
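
As a rough sketch of how these two fields might drive the merge, the Python below groups hypothetical rows for a single reading by part of speech and numbers the definitions within each group. The row layout `(type, idx, text)`, the sample definitions, and the output format are assumptions for illustration only.

```python
from itertools import groupby

# Hypothetical rows for one heteronym: (type, idx, definition text).
rows = [
    ("動", 0, "走、走路。"),
    ("動", 1, "做、從事。"),
    ("名", 2, "道路。"),
]


def build_body(rows):
    """Join definitions into one body, one line per part of speech."""
    ordered = sorted(rows, key=lambda r: r[1])  # keep the source order by idx
    lines = []
    # groupby merges consecutive rows that share the same part of speech.
    for pos, group in groupby(ordered, key=lambda r: r[0]):
        defs = " ".join(
            f"({n}) {text}" for n, (_, _, text) in enumerate(group, 1)
        )
        lines.append(f"<{pos}> {defs}")
    return "\n".join(lines)


print(build_body(rows))
```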

alex_hk90 · Mar 30, 2013

Yiliya said:
How much Chinese do you understand? The link to the 行 entry that I provided should be a good illustration as to what the correct output looks like. Anyway:

- definitions/type is the part of speech, e.g. 動 ( = 動詞), 名 ( = 名詞), 形 ( = 形容詞) etc
- definitions/idx is the definition's number, if it's greater than 0 then the definition should be displayed along the previous definition(s) in a single entry

I can understand some Chinese (as a rough indicator I passed the HSK4 a while back so can read quite a bit more than required for that). Thanks for the clarification - I'll read through that link as a guide.

alex_hk90 · Mar 31, 2013

Quick question for the merged entries: how do you produce the circled numbers in Pleco user dictionaries? I'm thinking of using the (1), (2), (3) format used in the PLC and ABC dictionaries.

Yiliya · Mar 31, 2013

①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳

(U+2460 to U+2473 if you can't see them)
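
Because the twenty characters are contiguous in Unicode, they can also be generated from the definition number instead of being typed out. A minimal Python sketch (the function name `circled` is hypothetical):

```python
def circled(n):
    """Return the circled number for 1..20 (U+2460..U+2473)."""
    if not 1 <= n <= 20:
        raise ValueError("this Unicode run only covers 1 to 20")
    return chr(0x2460 + n - 1)


print("".join(circled(n) for n in range(1, 21)))
```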

alex_hk90 · Mar 31, 2013

Yiliya said:
①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳

(U+2460 to U+2473 if you can't see them)

Thanks - I've incorporated these into the process.

I haven't got round to sorting out the two minor points I mentioned about the extra information in the Hanzi (title) and Pinyin (pinyin) columns, but I have hopefully sorted out the grouping and cleaned up the formatting somewhat.

Taking 行 as an example, there are now 4 entries (one for each different pronunciation).

Updated/new notes on the process followed (for reproducing from source):

Code:

20130331 MoEDict
 
- Follow notes "20130327 MoEDict" up to creation of combined table, but add idx to this:
create table combined as
select definitions.id, definitions.idx, entries.title, heteronyms.pinyin,
definitions.type, definitions.def, definitions.example, definitions.quote,
definitions.synonyms, definitions.antonyms, definitions.link
from definitions, heteronyms, entries
where definitions.heteronym_id = heteronyms.id and heteronyms.entry_id = entries.id;
 
- Combine definition fields into single field.
 
Firstly replace null entries with empty strings '' for combining/counting later
(otherwise it can cause problems with null results,
especially when using type for counting number of definitions),
and at the same time add labels for synonyms and antonyms:
update combined set
type = case when coalesce(type, '') = '' then '' else type end;
update combined set
def = case when coalesce(def, '') = '' then '' else def end;
update combined set
example = case when coalesce(example, '') = '' then '' else example end;
update combined set
quote = case when coalesce(quote, '') = '' then '' else quote end;
update combined set
synonyms = case when coalesce(synonyms, '') = '' then '' else '同義詞：'||synonyms end;
update combined set
antonyms = case when coalesce(antonyms, '') = '' then '' else '反義詞：'||antonyms end;
update combined set
link = case when coalesce(link, '') = '' then '' else link end;
 
Then combine into single field newdef:
alter table combined add column newdef;
update combined set newdef = '';
update combined set newdef = case when coalesce(def, '') = ''
then newdef else def end;
update combined set newdef = (case when coalesce(example, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then example else newdef||''||example end) end);
update combined set newdef = (case when coalesce(quote, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then quote else newdef||''||quote end) end);
update combined set newdef = (case when coalesce(synonyms, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then synonyms else newdef||''||synonyms end) end);
update combined set newdef = (case when coalesce(antonyms, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then antonyms else newdef||''||antonyms end) end);
update combined set newdef = (case when coalesce(link, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then link else newdef||''||link end) end);
 
Add numbering to combined definitions:
order by idx then count up to current id.
 
Create new table with defid (definition id for unique title/Hanzi, pinyin, part of speech/type) and defcount (total number of definitions for unique title/Hanzi, pinyin, part of speech/type) columns:
create table combined2 as
select id, title, pinyin, idx, type, newdef,
(select count(*) from combined c2 where c1.title=c2.title and c1.pinyin=c2.pinyin and c1.type=c2.type and c2.id<=c1.id) as defid, (select count(*) from combined c2 where c1.title=c2.title and c1.pinyin=c2.pinyin and c1.type=c2.type) as defcount
from combined c1;
 
Add numbering (only if more than one definition in unique title, pinyin, type):
update combined2 set newdef =
(case when defcount<=1 then newdef
else '(@'||defid||'@) '||newdef end);
Replace with circled numbers:
update combined2 set newdef = replace(newdef, '(@1@)','①');
update combined2 set newdef = replace(newdef, '(@2@)','②');
update combined2 set newdef = replace(newdef, '(@3@)','③');
update combined2 set newdef = replace(newdef, '(@4@)','④');
update combined2 set newdef = replace(newdef, '(@5@)','⑤');
update combined2 set newdef = replace(newdef, '(@6@)','⑥');
update combined2 set newdef = replace(newdef, '(@7@)','⑦');
update combined2 set newdef = replace(newdef, '(@8@)','⑧');
update combined2 set newdef = replace(newdef, '(@9@)','⑨');
update combined2 set newdef = replace(newdef, '(@10@)','⑩');
update combined2 set newdef = replace(newdef, '(@11@)','⑪');
update combined2 set newdef = replace(newdef, '(@12@)','⑫');
update combined2 set newdef = replace(newdef, '(@13@)','⑬');
update combined2 set newdef = replace(newdef, '(@14@)','⑭');
update combined2 set newdef = replace(newdef, '(@15@)','⑮');
update combined2 set newdef = replace(newdef, '(@16@)','⑯');
update combined2 set newdef = replace(newdef, '(@17@)','⑰');
update combined2 set newdef = replace(newdef, '(@18@)','⑱');
update combined2 set newdef = replace(newdef, '(@19@)','⑲');
update combined2 set newdef = replace(newdef, '(@20@)','⑳');
①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳
(U+2460 to U+2473)
Note: max(defcount) suggests only need to go up to 19.
 
Group by part of speech (title, pinyin, type):
create table combined3 as
select title, pinyin, idx, type, defcount,
group_concat(newdef, '') as newdef3
from (select title, pinyin, idx, type, defcount, newdef
from combined2
order by title, pinyin, defid)
group by title, pinyin, type;
 
Add part of speech (type), bullets and new lines as appropriate:
update combined3 set newdef3 =
(case when coalesce(type, '') = ''
then '◆ '||newdef3 else
(case when defcount>1
then '◆ <'||type||'> '||newdef3
else '◆ <'||type||'> '||newdef3 end) end);
 
Group by Hanzi/Pinyin (title, pinyin):
create table combined4 as
select title, pinyin, group_concat(newdef3, '') as newdef4
from (select title, pinyin, idx, type, newdef3
from combined3
order by title, pinyin, idx)
group by title, pinyin;
 
Note: Counts: combined2 (unique title, pinyin, type, def) = 213486;
combined3 (unique title, pinyin, type) = 171378;
combined4 (unique title, pinyin) = 165810.
 
Output combined4 as Pleco flashcards:
.mode tabs
.output MoEDict-cards03.txt
select * from combined4;

Again it's not the nicest SQL (to say the least), but it seems to work (as far as I can tell) and I more or less understand it - feel free to suggest improvements to its efficiency or elegance.
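
One possible simplification, offered as a sketch rather than a tested replacement for the steps above: SQLite's built-in `char()` function can turn the running count straight into a circled number, which would avoid the placeholder and the twenty `replace` statements, and an index on the grouping columns may speed up the correlated subqueries. The example runs through Python's sqlite3 module on a toy `combined` table with made-up rows; the index name `combined_key` is hypothetical, and it assumes an SQLite build recent enough to provide `char()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table combined (id integer, title, pinyin, type, newdef)")
conn.executemany(
    "insert into combined values (?, ?, ?, ?, ?)",
    [
        (1, "行", "xíng", "動", "走"),
        (2, "行", "xíng", "動", "做"),
        (3, "行", "xíng", "名", "道路"),
    ],
)

# An index on the grouping columns helps the correlated subqueries below.
conn.execute("create index combined_key on combined (title, pinyin, type, id)")

numbered = conn.execute(
    """
    select c1.id,
           case when (select count(*) from combined c2
                      where c2.title = c1.title and c2.pinyin = c1.pinyin
                        and c2.type = c1.type) <= 1
                then c1.newdef
                -- 9312 is U+2460, so a running count of 1..20 maps to the
                -- twenty circled numbers
                else char(9311 + (select count(*) from combined c2
                                  where c2.title = c1.title
                                    and c2.pinyin = c1.pinyin
                                    and c2.type = c1.type
                                    and c2.id <= c1.id))
                     || ' ' || c1.newdef
           end
    from combined c1
    order by c1.id
    """
).fetchall()

for row in numbered:
    print(row)
```

As in the original steps, a definition that is alone in its (title, pinyin, type) group is left unnumbered.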

The resulting flashcards (which can be imported into a Pleco user dictionary) are here:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.

I'm going to try a full import of these overnight and will post a Pleco user dictionary format file here once that's complete.

Hopefully this should be a significant improvement on the previous version.

mikelove · Mar 31, 2013

One conversion tip: since it's a pretty easy mapping (only tricky common character is 乾 and you can sort that out automatically from the Pinyin), it would be worth taking the time to generate simplified character versions of all of this dictionary's multi-character headwords, as Pleco uses only simplified characters when merging multi-character entries from different dictionaries. (we had to pick one set or the other, and it's much easier for us to generate accurate simplified headwords for traditional-only dictionaries than to generate accurate traditional for simplified-only ones)

For single-character entries Pleco goes by whatever your active character set is, so it's less of a concern for those unless you want to make this accessible to simplified character users.

Yiliya · Apr 1, 2013

alex_hk90,
Most excellent, thank you! I'm very happy with this conversion. And it's especially nice of you to share the code, so that any one of us can benefit from your experience.

I still have a few very MINOR suggestions, though:
- What's the function of the black diamond character (◆) at the start of each entry?
- It's better to separate multiple examples with a newline (like 萌典 does) instead of the Western comma mark. E.g. see 行/xíng/動/2.
- As for the synonyms/antonyms. The original website uses 相似詞/相反詞, while 萌典 simply uses 似/反. It's your call, 同義詞/反義詞 seems to be perfectly fine to me (and less confusing). However, again, it would be better to use 、 than the Western comma for separating them.

mikelove,
Yes, simplified headwords would be very handy, however the Trad -> Simp conversion is not as trivial as it seems. Aside from 乾 (don't simplify when it's pronounced qián), there are a few other tricky characters. Off the top of my head:
著 (don't simplify when it's pronounced zhù), a very common character that causes most of the confusion in Trad -> Simp conversions
徵 (don't simplify when it's pronounced zhǐ), this is a rare usage, but still, MoE has it
於 (don't simplify when it's pronounced wū), also rare

Also, the 幺/么/麼/麽 confusion. Basically, Trad 么 = Simp 幺 (yāo), Trad 麼 = Simp 么 (me) OR 麽 (mó). This way, 么麼 (yāomó) gets simplified to 幺麽.

Another thing to consider is that the MoE dictionary uses a number of archaic traditional characters throughout the whole dictionary, case in point - 祕 (instead of the nowadays commonly accepted 秘).
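
The pronunciation-dependent cases listed above could be expressed as an exception table layered over a default mapping. The Python below is only a sketch: the tables contain just the characters mentioned in this post (a real conversion needs a full mapping), the function name `simplify` is hypothetical, and it assumes the pinyin is available as one syllable per character.

```python
# Tiny illustrative table; a real conversion needs a full character mapping.
DEFAULT = {"乾": "干", "著": "着", "徵": "征", "於": "于", "么": "幺", "麼": "么"}

# (character, pinyin) pairs that override the default mapping.
EXCEPTIONS = {
    ("乾", "qián"): "乾",
    ("著", "zhù"): "著",
    ("徵", "zhǐ"): "徵",
    ("於", "wū"): "於",
    ("麼", "mó"): "麽",
}


def simplify(headword, syllables):
    """Simplify a headword, consulting the reading of each character."""
    if len(headword) != len(syllables):
        raise ValueError("expected one pinyin syllable per character")
    return "".join(
        EXCEPTIONS.get((ch, py), DEFAULT.get(ch, ch))
        for ch, py in zip(headword, syllables)
    )


print(simplify("乾", ["gān"]), simplify("乾", ["qián"]))
print(simplify("么麼", ["yāo", "mó"]))
```

Characters missing from both tables pass through unchanged, so archaic forms such as 祕 would still need their own entries.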

alex_hk90 · Apr 1, 2013

mikelove said:
One conversion tip: since it's a pretty easy mapping (only tricky common character is 乾 and you can sort that out automatically from the Pinyin), it would be worth taking the time to generate simplified character versions of all of this dictionary's multi-character headwords, as Pleco uses only simplified characters when merging multi-character entries from different dictionaries. (we had to pick one set or the other, and it's much easier for us to generate accurate simplified headwords for traditional-only dictionaries than to generate accurate traditional for simplified-only ones)

For single-character entries Pleco goes by whatever your active character set is, so it's less of a concern for those unless you want to make this accessible to simplified character users.

Thanks Mike - I'll have a look into the Traditional - Simplified conversion. Initially I was thinking it might make sense if this dictionary was like the LMA one (with only traditional characters), but if it's not too difficult then it's probably worth at least making a version with simplified headwords.

Yiliya said:
alex_hk90,
Most excellent, thank you! I'm very happy with this conversion. And it's especially nice of you to share the code, so that any one of us can benefit from your experience.

I still have a few very MINOR suggestions, though:
- What's the function of the black diamond character (◆) at the start of each entry?
- It's better to separate multiple examples with a newline (like 萌典 does) instead of the Western comma mark. E.g. see 行/xíng/動/2.
- As for the synonyms/antonyms. The original website uses 相似詞/相反詞, while 萌典 simply uses 似/反. It's your call, 同義詞/反義詞 seems to be perfectly fine to me (and less confusing). However, again, it would be better to use 、 than the Western comma for separating them.

You're welcome.

About your suggestions:
- The function of the black diamond bullet point is to separate each of the parts of speech, and to be consistent with other Pleco dictionaries such as ABC and OX (both of which also use this black diamond character for that purpose).
- The Western comma mark to separate examples is not something I (intentionally) added, and it looks like it wouldn't be that straightforward to replace them because it is sometimes used within an example instead of as a separator. For instance, it is used as a comma in one of the (more famous) examples in 行/xíng/動/1 (seen in the first screenshot). I'll have a look back at the original tables to see if it's something in the SQL that replaced the newlines with commas to fix this.
EDIT: On second look it seems like they are different commas so it should be easy enough to replace the separator ones with newlines. I was looking at the example column instead of the quote column before so got confused.

- The use of the Chinese word for synonyms/antonyms is very easy to change (just replace the appropriate text strings), and I'm relatively indifferent over which phrase is used. I quite like the single-character ones to save space and be consistent with the single character used for examples. Finally, in this case it would probably be easy enough to replace the Western comma characters (because I don't think they're used for anything else but separators).

Yiliya said:
mikelove,
Yes, simplified headwords would be very handy, however the Trad -> Simp conversion is not as trivial as it seems. Aside from 乾 (don't simplify when it's pronounced qián), there are a few other tricky characters. Off the top of my head:
著 (don't simplify when it's pronounced zhù), a very common character that causes most of the confusion in Trad -> Simp conversions
徵 (don't simplify when it's pronounced zhǐ), this is a rare usage, but still, MoE has it
於 (don't simplify when it's pronounced wū), also rare

Also, the 幺/么/麼/麽 confusion. Basically, Trad 么 = Simp 幺 (yāo), Trad 麼 = Simp 么 (me) OR 麽 (mó). This way, 么麼 (yāomó) gets simplified to 幺麽.

Another thing to consider is that the MoE dictionary uses a number of archaic traditional characters throughout the whole dictionary, case in point - 祕 (instead of the nowadays commonly accepted 秘).

Hmm, do you know any good traditional to simplified conversion scripts/tables that cover all of these issues? I started out learning simplified characters and only relatively recently started learning the traditional ones, so I really wouldn't know about these exceptional cases.

alex_hk90 · Apr 1, 2013

EDIT: Fixed typo affecting entries with antonyms but not synonyms. (Thanks to Yiliya for spotting this issue.)

Made changes regarding the Western commas:

Code:

20130401 MoEDict
 
Addendum (goes just before "Combine definition fields into single field."):
- Process (Western) comma separators (replace with Pleco new line or Chinese list separators):
update combined set example = replace(example, ',', '');
update combined set quote = replace(quote, ',', '');
update combined set synonyms = replace(synonyms, ',', '、');
update combined set antonyms = replace(antonyms, ',', '、');
update combined set link = replace(link, ',', '');
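
A small check of why a plain `replace` is safe here, run through Python's sqlite3 module on made-up data. It assumes, as the earlier EDIT suggests, that the separators are ASCII commas while commas inside the quoted text are fullwidth (，); the sample strings are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table combined (synonyms text, quote text)")
conn.execute(
    "insert into combined values (?, ?)",
    ("行走,步行", "三人行，必有我師焉。"),
)

# Only the ASCII comma is replaced; the fullwidth comma is a different character.
conn.execute("update combined set synonyms = replace(synonyms, ',', '、')")
conn.execute("update combined set quote = replace(quote, ',', '、')")

synonyms, quote = conn.execute(
    "select synonyms, quote from combined"
).fetchone()
print(synonyms)
print(quote)
```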

So for just the SQL commands to reproduce from the source database (save the following as a file and use .read to run it):

Code:

create table combined as
select definitions.id, definitions.idx, entries.title, heteronyms.pinyin,
definitions.type, definitions.def, definitions.example, definitions.quote,
definitions.synonyms, definitions.antonyms, definitions.link
from definitions, heteronyms, entries
where definitions.heteronym_id = heteronyms.id and heteronyms.entry_id = entries.id;
 
update combined set example = replace(example, ',', '');
update combined set quote = replace(quote, ',', '');
update combined set synonyms = replace(synonyms, ',', '、');
update combined set antonyms = replace(antonyms, ',', '、');
update combined set link = replace(link, ',', '');
 
update combined set
type = case when coalesce(type, '') = '' then '' else type end;
update combined set
def = case when coalesce(def, '') = '' then '' else def end;
update combined set
example = case when coalesce(example, '') = '' then '' else example end;
update combined set
quote = case when coalesce(quote, '') = '' then '' else quote end;
update combined set
synonyms = case when coalesce(synonyms, '') = '' then '' else '似：'||synonyms end;
update combined set
antonyms = case when coalesce(antonyms, '') = '' then '' else '反：'||antonyms end;
update combined set
link = case when coalesce(link, '') = '' then '' else link end;
 
alter table combined add column newdef;
update combined set newdef = '';
update combined set newdef = case when coalesce(def, '') = ''
then newdef else def end;
update combined set newdef = (case when coalesce(example, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then example else newdef||''||example end) end);
update combined set newdef = (case when coalesce(quote, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then quote else newdef||''||quote end) end);
update combined set newdef = (case when coalesce(synonyms, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then synonyms else newdef||''||synonyms end) end);
update combined set newdef = (case when coalesce(antonyms, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then antonyms else newdef||''||antonyms end) end);
update combined set newdef = (case when coalesce(link, '') = ''
then newdef else (case when coalesce(newdef, '') = ''
then link else newdef||''||link end) end);
 
create table combined2 as
select id, title, pinyin, idx, type, newdef,
(select count(*) from combined c2 where c1.title=c2.title and c1.pinyin=c2.pinyin and c1.type=c2.type and c2.id<=c1.id) as defid, (select count(*) from combined c2 where c1.title=c2.title and c1.pinyin=c2.pinyin and c1.type=c2.type) as defcount
from combined c1;
 
update combined2 set newdef =
(case when defcount<=1 then newdef
else '(@'||defid||'@) '||newdef end);
 
update combined2 set newdef = replace(newdef, '(@1@)','①');
update combined2 set newdef = replace(newdef, '(@2@)','②');
update combined2 set newdef = replace(newdef, '(@3@)','③');
update combined2 set newdef = replace(newdef, '(@4@)','④');
update combined2 set newdef = replace(newdef, '(@5@)','⑤');
update combined2 set newdef = replace(newdef, '(@6@)','⑥');
update combined2 set newdef = replace(newdef, '(@7@)','⑦');
update combined2 set newdef = replace(newdef, '(@8@)','⑧');
update combined2 set newdef = replace(newdef, '(@9@)','⑨');
update combined2 set newdef = replace(newdef, '(@10@)','⑩');
update combined2 set newdef = replace(newdef, '(@11@)','⑪');
update combined2 set newdef = replace(newdef, '(@12@)','⑫');
update combined2 set newdef = replace(newdef, '(@13@)','⑬');
update combined2 set newdef = replace(newdef, '(@14@)','⑭');
update combined2 set newdef = replace(newdef, '(@15@)','⑮');
update combined2 set newdef = replace(newdef, '(@16@)','⑯');
update combined2 set newdef = replace(newdef, '(@17@)','⑰');
update combined2 set newdef = replace(newdef, '(@18@)','⑱');
update combined2 set newdef = replace(newdef, '(@19@)','⑲');
update combined2 set newdef = replace(newdef, '(@20@)','⑳');
 
create table combined3 as
select title, pinyin, idx, type, defcount,
group_concat(newdef, '') as newdef3
from (select title, pinyin, idx, type, defcount, newdef
from combined2
order by title, pinyin, defid)
group by title, pinyin, type;
 
update combined3 set newdef3 =
(case when coalesce(type, '') = ''
then '◆ '||newdef3 else
(case when defcount>1
then '◆ <'||type||'> '||newdef3
else '◆ <'||type||'> '||newdef3 end) end);
 
create table combined4 as
select title, pinyin, group_concat(newdef3, '') as newdef4
from (select title, pinyin, idx, type, newdef3
from combined3
order by title, pinyin, idx)
group by title, pinyin;
 
select count(*) from combined4;

Pleco flashcards as a result of running the above:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
And Pleco user dictionary as a result of importing the flashcards:
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Note: Again some of the early entries (I think all variant characters) didn't import. There are 874 fewer entries in the user dictionary than in the flashcards, and a quick comparison of the database files suggests that all of these are within the first 1000 entries in the flashcards. I'm not sure why these are not importing - I'll probably try isolating them and re-importing just those later.

Things left to do:

1. Consider adding simplified headwords.
Combination of title (Hanzi) and pinyin might be enough to generate a new column with the simplified headwords. Shouldn't be too difficult to run a conversion script on these columns.

2. Consider what to do when English in title column. (1492 entries)
For example: https://www.moedict.tw/#龐貝
Maybe move the English to the main body of the definition? EDIT: Done in later version.

3. Consider what to do when extra information in the pinyin column. (823 entries) EDIT: This count doesn't seem correct, I get 1619 in a later version.
For example: https://www.moedict.tw/#麥
or (more difficult): https://www.moedict.tw/#骨都
I don't think it actually affects the audio pronunciation so maybe just leave it or format it in some way (the Chinese brackets don't show up in the Pinyin field).
As for searching, in the first example it comes up when searching for "mai4". In the second example it comes up when searching for "gu1du1" but not "gu3du1", so maybe those entries (202 entries out of the 823) need some kind of duplication/splitting/re-ordering the pronunciations so they can be found by searching.

4. Very minor, but I notice that in the ABC and OX definition previews (in the search screen), the newlines do not appear (which allows you to see more text in that screen) - is there a way of doing this for user dictionaries (have the newlines only appear in the definition screen, not in the search screen)?

Yiliya · Apr 1, 2013

Hmm, it seems like some content got lost during the conversion.

Take a look at the 國語 entry. It has two antonyms on both the original website and 萌典, however your conversion is missing them. Same thing with the 終結 entry, it's supposed to have antonyms. Looks like your script has problems processing entries that have only antonyms, without any synonyms coming first.

mikelove · Apr 1, 2013

alex_hk90 said:
4. Very minor, but I notice that in the ABC and OX definition previews (in the search screen), the newlines do not appear (which allows you to see more text in that screen) - is there a way of doing this for user dictionaries (have the newlines only appear in the definition screen, not in the search screen)?

No, but there should be - we'll see about fixing that.

alex_hk90 · Apr 1, 2013

Yiliya said:
Hmm, it seems like some content got lost during the conversion.

Take a look at the 國語 entry. It has two antonyms on both the original website and 萌典, however your conversion is missing them. Same thing with the 終結 entry, it's supposed to have antonyms. Looks like your script has problems processing entries that have only antonyms, without any synonyms coming first.

Thanks for checking - you are correct, there was a typo in the null-checking SQL that referenced synonyms instead of antonyms. I've fixed it and will update the post above with the new link when it's finished processing.

mikelove said:
alex_hk90 said:

4. Very minor, but I notice that in the ABC and OX definition previews (in the search screen), the newlines do not appear (which allows you to see more text in that screen) - is there a way of doing this for user dictionaries (have the newlines only appear in the definition screen, not in the search screen)?

No, but there should be - we'll see about fixing that.

Thank you. Also, a big thanks for the sensible licensing policy (up to 3 devices) - the import runs much faster on a Samsung Galaxy SIII.

Yiliya · Apr 1, 2013

Another minor issue. Your conversion has an entry for 華 with the following def:「華」的異體字。

How could 華 be a variant character of itself? It turns out the original headword for this entry was {[99ac]}, and it looks like one of the GitHub scripts is to blame for the confusion. It's probably best not to use them anymore in order to be on the safe side.

alex_hk90 · Apr 1, 2013

Yiliya said:
Another minor issue. Your conversion has an entry for 華 with the following def:「華」的異體字。

How could 華 be a variant character of itself? It turns out the original headword for this entry was {[99ac]}, and it looks like one of the GitHub scripts is to blame for the confusion. It's probably best not to use them anymore in order to be on the safe side.

Well spotted - thanks again.

This one is very easy to avoid, just don't use that Perl script to convert the missing characters. The SQL I wrote will work just as well on the original database file (dict-revised.sqlite3). I wonder if there is a better missing character conversion script though...

audreyt · Apr 1, 2013

Hi, mea culpa, that was an oversight in db2unicode.pl -- I carried the change from json2unicode.pl but forgot that SQLite uses single quotes, not double quotes, in its dump format.

Fixed in https://github.com/g0v/moedict-epub/com ... af4476762b (with due credit to both of you) and the code in https://github.com/g0v/moedict-epub/blo ... unicode.pl should no longer produce self-looping variants.

Cheers,
Audrey

Yiliya · Apr 2, 2013

Thanks for the prompt fix.

I noticed quite a few of the "self-looping" characters, e.g. 呈, 夢 etc. Pretty confusing.

I guess it's time for us to rebuild the database one more time.

sandwich · Apr 2, 2013

Pleco crashed last night while trying to import, so hard to play around with it yet. But gonna be awesome, thanks for working on this.

Only comment is that the readability is not as good as 萌典, which I think is mostly a Pleco issue, but I had a couple of thoughts:
- The separator character "．" (FULLWIDTH FULL STOP, U+FF0E) might work on Android, but it looks wrong with Apple fonts. "•" (BULLET, U+2022) is available on the iOS keyboard, or maybe "・" (KATAKANA MIDDLE DOT) is a safer monospaced bet. (NB: 萌典 also has this issue, so it's probably upstream.)
- The ◆ looks OK when there is a <名／助／動／形／...>, but otherwise breaks the format (e.g. in 出來 it would read better if the numbers were in the same column).
