MoE Minnan and Hakka dictionaries

Abun · Aug 31, 2015

EDIT: Work has progressed significantly. Here is the link to the GitHub repository, which was kindly created by @alex_hk90: https://github.com/alexhk90/Pleco-User-Dictionaries/tree/master/MoE-Minnan

Opening a proper thread to continue the discussion about possible insertion of the online Minnan and Hakka dictionaries by the Taiwanese Ministry of Education which started here: http://plecoforums.com/threads/official-moedict-pleco-release.4915/ .

Linking the databases I found again for reference:

Abun said:
For Minnan: The first link provided by @audreyt under "collection" doesn't seem to work (it takes you to a page containing a poem). There is another one under the parsing area which does, though (https://github.com/g0v/moedict-data-twblg). The dict-twbl.json file in that directory looks usable, although the documentation seems to suggest that there are more up-to-date versions which I can't find.
For Hakka the http://www.audreyt.org/newdict/hakka.tar.gz ultimately contains a folder with individual html documents for each entry; don't know how easy that is to work with.

alex_hk90 said:
As mentioned before I don't know anything about Minnan but the JSON file you have linked to looks pretty clean - shouldn't be too difficult to use that and convert to Pleco flashcards / user dictionary format. Whether it will make any sense given the above discussion about romanisation / etc. is another question.

I guess that would depend on what qualifies as making sense to you. I personally would love to be able to use the MoE dict in Pleco.

As for the romanization issue, I think it would probably be easier to convert the diacritics to numbers than in Pinyin. The reason for this is that syllables are always linked with a hyphen, so detecting a syllable border would be as easy as searching for the character "-".
I must admit, I so far only scratched the surface of php and js and don't know anything about non-web-based programming at all. But speaking in pseudo-code, I guess such a script could look roughly like this:

Create an array which assigns an index to each dictionary entry.
For each entry:
Access the pronunciation info and store it in a string.
Explode the string into a second array, using "-" as the marker for where to separate. Thereby the syllables are separated from each other.
Search each syllable for the combining diacritics. If one is detected, it is deleted and the corresponding number is added to the end of the syllable.
Implode the resulting array back to a string, re-adding the separating "-" and return it.
Insert the changed string into the pronunciation info of the entry.

The result of this process should be that diacritics are replaced with numbers at the end of the syllable. The 1st and 4th tone are not marked with diacritics and would therefore have no number. Maybe this can be fixed by detecting final consonants (the 4th tone is an entering tone, therefore it always ends in -p, -t, -k or -h (glottal stop)), but I don't think that's absolutely necessary. @alex_hk90, do you think that is reasonable?
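The steps above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the names `number_syllable` and `number_tones` are made up, the input is a single hyphen-joined word, and the text is first normalized to NFD so that precomposed letters such as "è" are split into a base letter plus a combining mark. Unmarked syllables get 4 if they end in p, t, k or h and 1 otherwise, as suggested above.

```python
import unicodedata

# Combining diacritic -> tone number (acute, grave, circumflex, macron,
# vertical line above), as listed in the post.
TONE_MARKS = {
    "\u0301": "2",
    "\u0300": "3",
    "\u0302": "5",
    "\u0304": "7",
    "\u030d": "8",
}

def number_syllable(syllable):
    """Move the tone from a diacritic to a trailing number."""
    # NFD splits e.g. "è" into "e" + combining grave accent.
    decomposed = unicodedata.normalize("NFD", syllable)
    for mark, number in TONE_MARKS.items():
        if mark in decomposed:
            return decomposed.replace(mark, "") + number
    if not decomposed:
        # Empty piece, e.g. between the two hyphens of "--".
        return decomposed
    # No diacritic: tone 4 if the syllable ends in a stop, else tone 1.
    return decomposed + ("4" if decomposed[-1] in "ptkh" else "1")

def number_tones(word):
    """Convert one hyphen-joined word, e.g. "tāi-ha̍k"."""
    return "-".join(number_syllable(s) for s in word.split("-"))

print(number_tones("t\u0101i-ha\u030dk"))  # tai7-hak8
```

Spaces and punctuation are deliberately not handled here; they need the extra separator logic discussed further down the thread.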

alex_hk90 · Aug 31, 2015

Abun said:
Opening a proper thread to continue the discussion about possible insertion of the online Minnan and Hakka dictionaries by the Taiwanese Ministry of Education which started http://plecoforums.com/threads/official-moedict-pleco-release.4915/ .

Thanks - if I get some time today I'll have a look at the Minnan data in earnest. To start it will just be a straight conversion from JSON to Pleco flashcards / user dictionaries, then we can have a look at some of the enrichment for the pronunciation/etc.

Abun · Aug 31, 2015

Thanks, that would already be awesome!

I just remembered that I lied before though... there are cases where two syllables are not separated with "-" but with " " (word borders) or with "--" (marks the following syllable as neutral tone). That might complicate the explosion process somewhat... How much, I can't quite say. Is it possible to explode elements within an array?
For example, if I had a string "Sè-han thau bán pû" (細漢偷挽匏) and I exploded it using " " as a separator, I would get the array a=["Sè-han", "thau", "bán", "pû"]. Can I now explode a[0] using "-" as a separator and get a=["Sè", "han", "thau", "bán", "pû"]? Or would that not work because it would mean setting a[0] as an array ["Sè", "han"], which would wreak havoc with the index numbers?

alex_hk90 · Aug 31, 2015

Abun said:
Thanks, that would already be awesome!

I just remembered that I lied before though... there are cases where two syllables are not separated with "-" but with " " (word borders) or with "--" (marks the following syllable as neutral tone). That might complicate the explosion process somewhat... How much, I can't quite say. Is it possible to explode elements within an array?
For example, if I had a string "Sè-han thau bán pû" (細漢偷挽匏) and I exploded it using " " as a separator, I would get the array a=["Sè-han", "thau", "bán", "pû"]. Can I now explode a[0] using "-" as a separator and get a=["Sè", "han", "thau", "bán", "pû"]? Or would that not work because it would mean setting a[0] as an array ["Sè", "han"], which would wreak havoc with the index numbers?

It's certainly possible to split the string by multiple delimiters (" " and "-") as you have described - an easy workaround would be to replace all "-" with " " (or vice versa) and then split the string into an array of syllables/words.
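An alternative to swapping one separator for the other is to split on both at once and keep the separators in the result, so that the original spacing can be restored when joining. The sketch below uses Python's `re.split`; the helper name `split_keep` is hypothetical.

```python
import re

def split_keep(text):
    """Split on " ", "-" and "--", keeping the separators in the list."""
    # The capturing group makes re.split return the separators as well.
    # "--" is listed before "-" so the neutral-tone marker stays intact.
    return re.split(r"(--|-| )", text)

parts = split_keep("S\u00e8-han thau b\u00e1n p\u00fb")
print(parts)
# Syllables sit at even indexes and separators at odd indexes, so
# "".join(parts) reproduces the input exactly.
print("".join(parts))
```

Because the result is one flat list, there is no nested array and the index numbers stay simple.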

Abun · Aug 31, 2015

alex_hk90 said:
It's certainly possible to split the string by multiple delimiters (" " and "-") as you have described - an easy workaround would be to replace all "-" with " " (or vice versa) and then split the string into an array of syllables/words.

Ah right, I hadn't thought about that.

I played around with JavaScript a little and managed to write a script that can convert text from an input line in the way I imagine it to be. It is probably much messier than need be (I decided it's safer to declare a new variable every time something is edited, just in case, but that's probably not necessary) but at least it works.

Don't know if it's feasible to use javascript for such a purpose (since it's database work, I guess php is the language of choice?), but maybe the overall structure is still useful.

Code:

<!DOCTYPE html>
<html>
  <head>
  <title>Tâi-lô conversion script</title>
  <meta charset="utf-8" />

  <script>
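      function numfunc() {
      // store input string in variable inp; NFD normalization splits precomposed
      // letters such as "è" into base letter + combining diacritic
      var inp = document.getElementById("input").elements["tlinput"].value.normalize("NFD");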
 
      /* Replace space with "-q" and double hyphen with "-x" respectively.
      The "-" is detected when exploding in the next step, the letters are used
      to recognize the original spacing character and re-insert it later after
      re-implosion. A hyphen is also added in front of punctuation marks in order
      to separate them from the preceding syllable ("." and "?" have to be
      escaped with a backslash inside the regular expression) */
      var inputTrans  = inp.replace(/ /g, "-q");
      inputTrans  = inputTrans.replace(/--/g, "-x");
      inputTrans  = inputTrans.replace(/,/g, "-,");
      inputTrans  = inputTrans.replace(/!/g, "-!");
      inputTrans  = inputTrans.replace(/\./g, "-.");
      inputTrans  = inputTrans.replace(/\?/g, "-?");
 
      // Split into Array
      var inpArray = inputTrans.split("-");
 
      // Declare empty output array
      var outpArray = [];
 
      // For-loop goes through every element in inpArray (every syllable)
      for (var i = 0; i < inpArray.length; i++) {
          /* If statements check existence of combining diacritic in string
          (acute = 2, grave = 3, circumflex = 5, macron = 7, vertical line
          above = 8), delete it and place the corresponding number at the
          end of the string*/
          if (inpArray[i].search("́") >= 0) {
              outpArray[i] = inpArray[i].replace("́", "");
              outpArray[i] += "2";
          } else if (inpArray[i].search("̀") >= 0) {
              outpArray[i] = inpArray[i].replace("̀", "");
              outpArray[i] += "3";
          } else if (inpArray[i].search("̂") >= 0) {
              outpArray[i] = inpArray[i].replace("̂", "");
              outpArray[i] += "5";
          } else if (inpArray[i].search("̄") >= 0) {
              outpArray[i] = inpArray[i].replace("̄", "");
              outpArray[i] += "7";
          } else if (inpArray[i].search("̍") >= 0) {
              outpArray[i] = inpArray[i].replace("̍", "");
              outpArray[i] += "8";
          } else {
              /* For all elements without diacritic marks, add 4 if they have a
              入聲 coda, output them as is if they are punctuation and add 1 in all
              other cases */
              if (inpArray[i].substring(inpArray[i].length - 1) == "p" ||
                   inpArray[i].substring(inpArray[i].length - 1) == "t" ||
                   inpArray[i].substring(inpArray[i].length - 1) == "k" ||
                   inpArray[i].substring(inpArray[i].length - 1) == "h"
                  ) {
                  outpArray[i] = inpArray[i] + "4";
              } else if (inpArray[i] == "." || inpArray[i] == "," ||
                              inpArray[i] == "?" || inpArray[i] == "!" ||
                              inpArray[i] == ""  || inpArray[i] == "q" ||
                              inpArray[i] == "x"
                             ) {
                  outpArray[i] = inpArray[i];
              } else {
                  outpArray[i] = inpArray[i] + "1";
              }
          }
      }
 
      // Join output array to a string
      var output = outpArray.join("-");
   
      /* Replace "-q" and "-x" with a space and double hyphen respectively and
      delete the separating hyphen in front of punctuation */
      output = output.replace(/-q/g, " ");
      output  = output.replace(/-x/g, "--");
      output  = output.replace(/-,/g, ",");
      output  = output.replace(/-!/g, "!");
      output  = output.replace(/-\./g, ".");
      output  = output.replace(/-\?/g, "?");
    
      // Insert output in the "output" paragraph
      document.getElementById("output").innerHTML = output;
      }
  </script>

  </head>

  <body>
  <!-- Input form -->
  <form id="input" action="" onsubmit="numfunc(); return false;" method="get">
  Input Romanization here:<br />
  <input type="text" name="tlinput" /><br />
  <input type="button" value="Click to output" onclick="numfunc()" />
  </form>
 
  <!-- Output -->
  <p>Output with numbers:</p>
  <p id="output"></p>
  </body>
</html>

EDIT: Just thought of one problem: This script doesn't take punctuation into account, so if a syllable is followed by a punctuation mark, numbers are added after it (e.g. "Tsa-bóo khiā tsit pîng, tsa-poo khiā hit pîng." --> Tsa-boo2 khia7 tsit ping,5 tsa-poo khia7 hit ping.5).
EDIT2: Streamlined it a bit in terms of number of different variables. Implemented numbering for 1st and 4th tone as well. Also taught it to recognize certain punctuation marks ("," and "!" to be precise) and add the numbers in front of them instead of behind. "." and "?" continue to be a problem because js syntax prevents me from using the same method as for "," and "!".
EDIT3: Now working for "." and "?" as well. I just forgot that I have to escape those with a backslash (\), because they have a special meaning in regular expressions.

alex_hk90 · Aug 31, 2015

Abun said:
Ah right, I hadn't thought about that.

I played around with JavaScript a little and managed to write a script that can convert text from an input line in the way I imagine it to be. It is probably much messier than need be (I decided it's safer to declare a new variable every time something is edited, just in case, but that's probably not necessary) but at least it works. Don't know if it's feasible to use JavaScript for such a purpose (since it's database work, I guess php is the language of choice?), but maybe the overall structure is still useful.

Thanks.

To be honest I'm not a big fan of JavaScript, though you can almost certainly do this in JavaScript if you want. If the source data is in JSON I'll look to manipulate it either in JSON or in the Pleco flashcard / user dictionary format. If it needs more complex processing then maybe I'll import it from JSON to a relational database (probably SQLite).

For the MoEDict conversion, I just did it in SQL as the source data had already been converted to an SQLite database.

As you can see from the above links, I've started a public GitHub repo to collect together all the information on Pleco user dictionary conversions I have which is currently spread across various threads on these forums and on my local hard drive. It was taking quite a while to migrate this information (still need to do LACD and YEDict, as well as document the Traditional to Simplified conversion process better) so I'll put a stop on that for now to look at the Minnan data first.

alex_hk90 · Aug 31, 2015

@Abun: The JSON looks really clean and easy to convert, just need some help on what fields are useful to keep:

Code:

  {
    "title": "㧎",
    "radical": "手",
    "heteronyms": [
    {
      "id": "13487",
      "trs": "khê",
      "reading": "替",
      "definitions": [
      {
        "type": "動",
        "def": "卡住。",
        "example": [
          "￹我的嚨喉㧎著一支魚刺。￺Guá ê nâ-âu khê-tio̍h tsi̍t ki hî-tshì. ￻我的喉嚨卡著一根魚刺。"
        ]
      },
      {
        "type": "形",
        "def": "不通順的、不順暢的、不和睦的。",
        "example": [
          "￹這支筆寫起來㧎㧎。￺Tsit ki pit siá--khí-lâi khê-khê. ￻這支筆寫起來不順暢。"
        ]
      }
      ]
    }
    ],
    "stroke_count": 7,
    "non_radical_stroke_count": 4
  },

At the moment I'm thinking a simple mapping along the lines of:

Code:

- Parsing logic:
- - For each top-level entry:
- - - For each heteronym:
- - - - Hanzi = title;
- - - - Pinyin = @trs;
- - - - Definition = definitions (type, def, examples) + synonyms.

Is there anything else required? What does the "reading" field represent?
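As a rough illustration of the mapping above, here is a minimal Python sketch. It assumes a tab-separated headword / pronunciation / definition line as the flashcard layout, and the names `format_example` and `entry_to_lines` are invented for the example. It also splits each example on the U+FFF9, U+FFFA and U+FFFB annotation characters that appear to separate the Hanzi, romanization and Mandarin parts in the data. Synonyms are left out because the sample entry has none.

```python
import re

def format_example(example):
    """Split an example on the interlinear annotation characters."""
    # In the sample, U+FFF9 precedes the Hanzi text, U+FFFA the
    # romanization and U+FFFB the Mandarin translation.
    parts = [p.strip() for p in re.split("[\ufff9\ufffa\ufffb]", example)]
    return " / ".join(p for p in parts if p)

def entry_to_lines(entry):
    """Return one tab-separated line per heteronym of an entry."""
    lines = []
    for heteronym in entry.get("heteronyms", []):
        senses = []
        for definition in heteronym.get("definitions", []):
            sense = definition.get("def", "")
            if definition.get("type"):
                # Part of speech in angled brackets, skipped when empty.
                sense = "<" + definition["type"] + "> " + sense
            for example in definition.get("example", []):
                sense += " " + format_example(example)
            senses.append(sense)
        lines.append("\t".join(
            [entry["title"], heteronym.get("trs", ""), " ".join(senses)]))
    return lines

entry = {
    "title": "㧎",
    "heteronyms": [{
        "trs": "khê",
        "definitions": [{"type": "動", "def": "卡住。", "example": []}],
    }],
}
print(entry_to_lines(entry))
```

Emitting one line per heteronym keeps each pronunciation searchable on its own, which matches the "for each heteronym" step of the parsing logic.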

Abun · Sep 1, 2015

alex_hk90 said:
Thanks. To be honest I'm not a big fan of JavaScript, though you can almost certainly do this in JavaScript if you want. If the source data is in JSON I'll look to manipulate it either in JSON or in the Pleco flashcard / user dictionary format. If it needs more complex processing then maybe I'll import it from JSON to a relational database (probably SQLite).

Yeah, even I find php easier to read (can't really form an opinion on other languages, yet^^'). In this case I mainly chose JavaScript over php because I originally planned to just upload the html file. So I thought it would be easier to test if it doesn't require either uploading it to an external server or running software such as xampp.
As far as databases are concerned, I've so far only done a few experiments with MySQL (don't know if it's a big difference to SQLite), and those mainly included getting or adding data as well as unconditional editing. I don't know any further functions or how to implement variables or conditions, yet (if that is at all possible using the database-internal syntax).

alex_hk90 said:
Is there anything else required? What does the "reading" field represent?

I agree with your selection of essential fields.

"reading" should only exist for single character entries and gives information about the relationship between the word in spoken language and the character associated with it. As far as I remember, it should have one of four different values:
文: "literary readings", i.e. readings which would be used when reading classical literature, but can occur in everyday speech especially in vocabulary from more "learned" fields such as science. Historically these are essentially pronunciations borrowed into Min from Middle Chinese. Ex. tāi 大 as in tāi-ha̍k 大學, tāi-jîn 大人
白: "colloquial readings", the older strata of vocabulary which are more prevalent in everyday contexts. Ex. tuā 大 (the normal word for "big")
替 and 俗 both represent characters which are "etymologically not correct". Most are existing characters borrowed for their sound or their meaning (i.e. equivalent to 假借 in 六書 terms). Ex. 閣 for koh (again), 欲 for beh/bueh (to want). Others are characters specially coined for Minnan in recent times, such as (亻因) or (石匹). 俗 is supposed to stand for "popularly used characters" but I can't find a rule for distinguishing it from 替. I tested a few character readings which I would consider "popular use" (such as 人 for lâng (person)) but they were marked with 替.
Generally 文 and 白 are considered 本字 ("etymologically correct" characters). However for a lot of 白 readings there is some disagreement on which character really is the "etymologically correct" one.

In any case, I think the information in "reading" is useful but by no means essential. I suggest leaving it out for the moment.
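Should the field be wanted later, a small lookup table would be enough to turn it into a readable tag. The sketch below is only an assumption about how that could look: the English labels are my own paraphrases of the four values described above, and `reading_label` is a hypothetical helper.

```python
# Hypothetical English labels for the values of the "reading" field,
# paraphrasing the description above.
READING_LABELS = {
    "文": "literary reading",
    "白": "colloquial reading",
    "替": "substitute character",
    "俗": "popular character",
}

def reading_label(heteronym):
    """Return a label for a heteronym, or "" if it has no reading field."""
    return READING_LABELS.get(heteronym.get("reading", ""), "")

print(reading_label({"trs": "khê", "reading": "替"}))  # substitute character
```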

alex_hk90 · Sep 1, 2015

Abun said:
Yeah, even I find php easier to read (can't really form an opinion on other languages, yet^^'). In this case I mainly chose JavaScript over php because I originally planned to just upload the html file. So I thought it would be easier to test if it doesn't require either uploading it to an external server or running software such as xampp.
As far as databases are concerned, I've so far only done a few experiments with MySQL (don't know if it's a big difference to SQLite), and those mainly included getting or adding data as well as unconditional editing. I don't know any further functions or how to implement variables or conditions, yet (if that is at all possible using the database-internal syntax).

I agree with your selection of essential fields.

"reading" should only exist for single character entries and gives information about the relationship between the word in spoken language and the character associated with it. As far as I remember, it should have one of four different values:
文: "literary readings", i.e. readings which would be used when reading classical literature, but can occur in everyday speech especially in vocabulary from more "learned" fields such as science. Historically these are essentially pronunciations borrowed into Min from Middle Chinese. Ex. tāi 大 as in tāi-ha̍k 大學, tāi-jîn 大人
白: "colloquial readings", the older strata of vocabulary which are more prevalent in everyday contexts. Ex. tuā 大 (the normal word for "big")
替 and 俗 both represent characters which are "etymologically not correct". Most are existing characters borrowed for their sound or their meaning (i.e. equivalent to 假借 in 六書 terms). Ex. 閣 for koh (again), 欲 for beh/bueh (to want). Others are characters specially coined for Minnan in recent times, such as (亻因) or (石匹). 俗 is supposed to stand for "popularly used characters" but I can't find a rule for distinguishing it from 替. I tested a few character readings which I would consider "popular use" (such as 人 for lâng (person)) but they were marked with 替.
Generally 文 and 白 are considered 本字 ("etymologically correct" characters). However for a lot of 白 readings there is some disagreement on which character really is the "etymologically correct" one.

In any case, I think the information in "reading" is useful but by no means essential. I suggest leaving it out for the moment.

Thanks for the information.

I tried writing a simple shell script to loop through the JSON and process it into Pleco flashcard format but it was horribly inefficient because it involved a call to the JSON parsing application (jshon was the one I used, which seems to work well) every time I wanted to query the data. Python seems to have a native JSON library so I'll probably try that next.

EDIT: A Python script seems like the way to go - the JSON library/module seems very easy to use.
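To illustrate why this avoids the overhead of the shell approach, here is a minimal sketch that parses the file once with the standard `json` module and then works entirely in memory. The function name `count_heteronyms` and the tiny stand-in file are assumptions made so the example runs without the real data.

```python
import json
import os
import tempfile

def count_heteronyms(path):
    """Load the whole JSON file once and count heteronyms in memory."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)  # one parse, no external process
    return sum(len(entry.get("heteronyms", [])) for entry in entries)

# Stand-in for the real dictionary file.
sample = [{"title": "大", "heteronyms": [{"trs": "tuā"}, {"trs": "tāi"}]}]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle, ensure_ascii=False)
    print(count_heteronyms(path))  # 2
```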

jasonmcdowell · Sep 2, 2015

Hi, by coincidence I stumbled upon your discussion while searching for info about creating Minnan user dictionaries for Pleco. I'm glad I'm finding other people interested. I also want to get Taiwanese data into Pleco, and I'd like to help.

I was planning to reformat the Maryknoll data. Last year, I taught myself enough iPhone programming to make a simple Taiwanese Dictionary app using the Maryknoll data. My 'Taiwanese Dictionary' app is on the App Store, but it is pretty basic and I don't have time to work on it. I would really like to spend more time studying Taiwanese using Pleco.

I hadn't heard of the MoE Minnan data until today. I stopped looking after I found the Maryknoll data a few years ago. I see that it has some nice info that the Maryknoll data doesn't, but I think I'll still want to reformat the Maryknoll data too. I previously did a bunch of work to convert amongst various romanization systems, but I haven't worked on this stuff for about a year now.

alex_hk90 · Sep 2, 2015

@Abun: There's a very early version of the MoE Minnan JSON to Pleco flashcards conversion ready for testing:
- Pleco flashcards (14,005 entries): [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

I haven't had time to import to Pleco user dictionary format so it might not work at all, but it should be easy to fix/improve - the Python JSON module works really well for this.

jasonmcdowell · Sep 2, 2015

@Abun in the previous thread you mentioned you used a Minnan keyboard. What platform is this on? I've been thinking about making a Minnan keyboard for iOS, but I don't need to if someone else already has.

jasonmcdowell · Sep 2, 2015

@alex_hk90, @Abun:

I found Iûnn Ún-giân's doctoral dissertation "Processing Techniques for Written Taiwanese -- Tone Sandhi and POS Tagging" to be very helpful - especially chapter 3, which helped clarify edge cases for searching character combinations in Unicode.

http://ip194097.ntcu.edu.tw/giankiu/lunbun/Ungian/ungian.asp

Abun · Sep 2, 2015

jasonmcdowell said:
Hi, by coincidence I stumbled upon your discussion while searching for info about creating Minnan user dictionaries for Pleco. I'm glad I'm finding other people interested. I also want to get Taiwanese data into Pleco, and I'd like to help.

I was planning to reformat the Maryknoll data. Last year, I taught myself enough iPhone programming to make a simple Taiwanese Dictionary app using the Maryknoll data. My 'Taiwanese Dictionary' app is on the App Store, but it is pretty basic and I don't have time to work on it. I would really like to spend more time studying Taiwanese using Pleco.

I hadn't heard of the MoE Minnan data until today. I stopped looking after I found the Maryknoll data a few years ago. I see that it has some nice info that the Maryknoll data doesn't, but I think I'll still want to reformat the Maryknoll data too. I previously did a bunch of work to convert amongst various romanization systems, but I haven't worked on this stuff for about a year now.

Cool! I frequently consult the Maryknoll dictionary as well. Do you know whether it would be legal at all to publish a plecoified version of it? And how far have you come with doing so? Is there a database which can be converted? I know there is an .xls version of it, but I only looked at it briefly and it looked somewhat messy (example sentences get their own fields, so they look almost like entries, for example), so I usually use the pdf scans (both can be found here: http://www.taiwanesedictionary.org/).

Automated conversion between Maryknoll's version of POJ (or any version of POJ) and the MoE's Tâi-lô should be doable in theory, seeing as it mostly involves replacing a few symbols with others ("ch" to "ts", "oa" and "oe" to "ua" and "ue", the dotted o to "oo" etc., or the other way around of course).
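A minimal Python sketch of the substitutions just listed might look like this. It covers only those three rules and is not a complete converter; the function name `poj_to_tailo` is made up. It works on NFD-normalized text, assumes the dotted o is written as "o" plus U+0358, and moves a tone mark in "oa"/"oe" onto the second vowel, since POJ spellings such as "tōa" put the mark on the "o" while Tâi-lô writes "tuā".

```python
import re
import unicodedata

# Tone diacritics: acute, grave, circumflex, macron, vertical line above.
TONES = "\u0301\u0300\u0302\u0304\u030d"

def poj_to_tailo(text):
    """Apply the few POJ -> Tâi-lô substitutions named above."""
    t = unicodedata.normalize("NFD", text)
    # Dotted o (o + U+0358) becomes "oo"; a tone mark stays on the first o.
    t = re.sub("([oO])([%s]?)\u0358([%s]?)" % (TONES, TONES), r"\1\2\3o", t)
    # "oa"/"oe" become "ua"/"ue"; the tone mark ends up on the second vowel.
    t = re.sub("o([%s]?)([ae])([%s]?)" % (TONES, TONES), r"u\2\1\3", t)
    t = re.sub("O([%s]?)([ae])([%s]?)" % (TONES, TONES), r"U\2\1\3", t)
    # "ch" and "chh" become "ts" and "tsh".
    t = t.replace("ch", "ts").replace("Ch", "Ts")
    # Recompose so the output uses precomposed letters where they exist.
    return unicodedata.normalize("NFC", t)

print(poj_to_tailo("t\u014da"))  # tuā
```

Going the other way would need the reverse rules plus the reverse tone-mark placement, so it is not simply this function run backwards.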

jasonmcdowell said:
@Abun in the previous thread you mentioned you used a Minnan keyboard. What platform is this on? I've been thinking about making a Minnan keyboard for iOS, but I don't need to if someone else already has.

The keyboard I use is called TaigIME by somebody who calls himself Pierre Magistry - 阿石. It's not perfect if you ask me, but it does its job, and it's improving as well (when I originally found it, it was 註音 only (and its own version of 台語式註音 to boot), now it also supports Tai-lo, POJ and even Taithong. You can also choose whether you want to output Romanization or characters (MoE ones)). I use Android though and I don't know whether there is an iOS version by now. When friends of mine looked a few months ago, there was none.

alex_hk90 said:
@Abun: There's a very early version of the MoE Minnan JSON to Pleco flashcards conversion ready for testing:
- Pleco flashcards (14,005 entries): https://www.dropbox.com/s/5rjldeqtffee620/MoE-Minnan-flashcards-v01.txt.7z?dl=0

I haven't had time to import to Pleco user dictionary format so it might not work at all, but it should be easy to fix/improve - the Python JSON module works really well for this.

Awesome! I imported it as a dictionary as well as flashcards and tested a few things that I thought might cause problems and here's what I found (in order of descending importance):

Tone diacritics are not displayed anywhere (be it dictionary window, when opening a single entry or during flashcard tests). The sole exception is the edit window. There everything is displayed. The above-stroke for the 8th tone returns an "unknown character" square in the default font, but in TNR it's fine.
The "example" field is missing (or maybe that was intentional in this first version because there were problems?)
The "type" field returns nothing for idioms, resulting in empty angled brackets (ex. 驚驚袂著等)
There seem to be discrepancies as to whether or not the @ has to be included in the search. For example, "in" (亻因) can only be found if searching for "in" (without the @), but "@in" (with the @) returns entries as well.
Searching for characters from Unicode extensions (ex. (亻因), (敖 over 力) etc.) returns only the Unicode entry, although the entries exist in the dictionary (problem with Pleco’s search algorithm?)
The characters of a few entries cannot be displayed, ex. peh (足百). This was to be expected, though; these characters are not yet encoded in unicode, including in the extensions, so I guess there's not much that can be done there.

Another thought that came to me concerning the layout: Maybe in the final version, it would be nice to have index numbers (maybe ①, ② etc. as with other existing dictionaries) for each heteronym within an entry. That would make the structure clearer at first glance.

ACardiganAndAFrown · Sep 2, 2015

Abun said:
The characters of a few entries cannot be displayed, ex. peh (足百). This was to be expected, though; these characters are not yet encoded in unicode, including in the extensions, so I guess there's not much that can be done there.

Wrong! It's part of CJK Unified Ideographs Extension E (Unicode code point: 2C9B0); check the U+2C9Bx row in the wiki table. HanaMin can display it. Apparently Pleco uses this font too, so it should be displayable - in theory!
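For checking other doubtful characters, a small lookup over the published block ranges is enough. The sketch below lists the CJK Unified Ideographs extension blocks A to E by code point range; the helper name `cjk_extension` is hypothetical, and the lookup says nothing about whether a given font actually contains the glyph.

```python
# Code point ranges of the CJK Unified Ideographs extension blocks A to E.
CJK_EXTENSIONS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
]

def cjk_extension(char):
    """Return the extension block a character belongs to, or None."""
    code = ord(char)
    for name, first, last in CJK_EXTENSIONS:
        if first <= code <= last:
            return name
    return None

print(cjk_extension(chr(0x2C9B0)))  # Extension E
```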

Abun · Sep 2, 2015

ACardiganAndAFrown said:
Wrong! It's part of CJK Unified Ideographs Extension E (Unicode code point: 2C9B0); check the U+2C9Bx row in the wiki table. HanaMin can display it. Apparently Pleco uses this font too, so it should be displayable - in theory!

Ah, it seems the newest extension of Unicode slipped past me and my HanaMin was not up to date. Thanks for pointing it out. How do you change the font for the extension characters in Pleco though? I can't find an option for that, only for the regular Chinese font...

jasonmcdowell · Sep 2, 2015

Abun said:
Cool! I frequently consult the Maryknoll dictionary as well. Do you know whether it would at all be legal to publish a plecoified version of it?

The Maryknoll data is published with a Creative Commons Non-commercial license, so it would be fine for us to publish a user dictionary, but probably not for Pleco to include it as an Add-on. I wouldn't expect a commercial exemption for this dataset.

Last year, I processed the Maryknoll excel spreadsheet data a bit to fold multiple word definitions and example sentences into individual dictionary entries. I just moved to a new city and I'm still getting unpacked. All my dictionary project stuff is on my NAS, which I haven't unpacked yet. I think I wrote various scripts in either python, ruby or Objective-C. I'll get everything together soon.

jasonmcdowell · Sep 2, 2015

Do you all prefer using Tâi-lô or POJ? I've only ever practiced with POJ, though I included several other romanizations in my previous dictionary work.

I don't understand the scope of the MoE dataset. Is the regular MoE dataset a Chinese-Chinese dictionary, and does the MoE Minnan dataset show Hanzi characters for Minnan words written in Tâi-lô? I thought that Minnan does not have standardized Hanzi? Is this an attempt at standardization by the Ministry of Education?

My Taiwanese is much better than my Mandarin, and my Taiwanese is only at a beginning conversational level. I can probably only recognize 10 - 20 hanzi, so while I'm planning to start practicing hanzi using Pleco, the Maryknoll Taiwanese-English dataset will be most helpful to me.

I think I have read that user dictionaries in Pleco do not do full text search, which would make looking up English words or Chinese words in the Maryknoll data difficult. Maryknoll also published some English-Taiwanese PDF files, but they don't have an excel spreadsheet or any other easily parsed format for the English-Taiwanese dataset. I've considered trying to parse the PDF files, but it would be a lot of work to get everything perfect.

jasonmcdowell · Sep 2, 2015

The Maryknoll Taiwanese-English excel spreadsheet data is pretty easy to reorganize. I'll probably do that in the next week or two.

The Maryknoll English-Taiwanese dictionary PDF files are parseable as rich text if you segment by font and text boldness. I'll probably do that in the next couple of months. I just started a new job in a new city, so I have to get on with that first.

alex_hk90 · Sep 2, 2015

jasonmcdowell said:
Hi, by coincidence I stumbled upon your discussion while searching for info about creating Minnan user dictionaries for Pleco. I'm glad I'm finding other people interested. I also want to get Taiwanese data into Pleco, and I'd like to help.

I was planning to reformat the Maryknoll data. Last year, I taught myself enough iPhone programming to make a simple Taiwanese Dictionary app using the Maryknoll data. My 'Taiwanese Dictionary' app is on the App Store, but it is pretty basic and I don't have time to work on it. I would really like to spend more time studying Taiwanese using Pleco.

I hadn't heard of the MoE Minnan data until today. I stopped looking after I found the Maryknoll data a few years ago. I see that it has some nice info that the Maryknoll data doesn't, but I think I'll still want to reformat the Maryknoll data too. I previously did a bunch of work to convert amongst various romanization systems, but I haven't worked on this stuff for about a year now.

Hi, and welcome!

I saw your comment on the Dropbox file as well but wasn't sure how to respond (I didn't even know Dropbox had a comments feature).

I know nothing about Minnan, but I've done a few Pleco user dictionary conversions (still migrating the info to GitHub) and this one didn't look too difficult.

jasonmcdowell said:
@alex_hk90, @Abun:

I found Iûnn Ún-giân's doctoral dissertation "Processing Techniques for Written Taiwanese -- Tone Sandhi and POS Tagging" to be very helpful - especially chapter 3, which helped clarify edge cases for searching character combinations in Unicode.

http://ip194097.ntcu.edu.tw/giankiu/lunbun/Ungian/ungian.asp

Thanks for the info.

Abun said:
Awesome! I imported it as a dictionary as well as flashcards and tested a few things that I thought might cause problems and here's what I found (in order of descending importance):

Tone diacritics are not displayed anywhere (be it dictionary window, when opening a single entry or during flashcard tests). The sole exception is the edit window. There everything is displayed. The above-stroke for the 8th tone returns an "unknown character" square in the default font, but in TNR it's fine.

The "example" field is missing (or maybe that was intentional in this first version because there were problems?)

The "type" field returns nothing for idioms, resulting in empty angled brackets (ex. 驚驚袂著等)

There seem to be discrepancies as to whether or not the @ has to be included in the search. For example, "in" (亻因) can only be found if searching for "in" (without the @), but "@in" (with the @) returns entries as well.

Searching for characters from Unicode extensions (ex. (亻因), (敖 over 力) etc.) returns only the Unicode entry, although the entries exist in the dictionary (problem with Pleco’s search algorithm?)

The characters of a few entries cannot be displayed, ex. peh (足百). This was to be expected, though; these characters are not yet encoded in unicode, including in the extensions, so I guess there's not much that can be done there.

Another thought that came to me concerning the layout: Maybe in the final version, it would be nice to have index numbers (maybe ①, ② etc. as with other existing dictionaries) for each heteronym within an entry. That would make the structure clearer at first glance.

Thanks - I'll have a look at addressing some of these points in the next version.

jasonmcdowell said:
I don't understand the scope of the MoE dataset. Is the regular MoE dataset a Chinese-Chinese dictionary, and does the MoE Minnan dataset show Hanzi characters for Minnan words written in Tâi-lô? I thought that Minnan does not have standardized Hanzi? Is this an attempt at standardization by the Ministry of Education?

Not sure, but there are far fewer entries in this MoE Minnan dataset than in the full MoEDict (Chinese-Chinese) one.
