MoE Minnan and Hakka dictionaries

Discussion in 'Future Products' started by Abun, Aug 31, 2015.

  1. Abun

    Abun 探花

    EDIT: Work has progressed significantly; here is the link to the GitHub repository kindly created by @alex_hk90: https://github.com/alexhk90/Pleco-User-Dictionaries/tree/master/MoE-Minnan

    Opening a proper thread to continue the discussion about possibly adding the online Minnan and Hakka dictionaries by the Taiwanese Ministry of Education to Pleco, which started here: http://plecoforums.com/threads/official-moedict-pleco-release.4915/ .

    Linking the databases I found again for reference:
    I guess that would depend on what qualifies as making sense to you. I personally would love to be able to use the MoE dict in Pleco.

    As for the romanization issue, I think it would probably be easier to convert the diacritics to numbers than in Pinyin. The reason for this is that syllables are always linked with a hyphen, so detecting a syllable border would be as easy as searching for the character "-".
    I must admit, I have so far only scratched the surface of PHP and JS and don't know anything about non-web-based programming at all. But speaking in pseudo-code, I guess such a script could look roughly like this:

    Create an array which assigns an index to each dictionary entry.
    For each entry:
    Access the pronunciation info and store it in a string.
    Explode the string into a second array, using "-" as the marker for where to separate. Thereby the syllables are separated from each other.
    Search each syllable for the combining diacritics. If one is detected, it is deleted and the corresponding number is added to the end of the syllable.
    Implode the resulting array back to a string, re-adding the separating "-" and return it.
    Insert the changed string into the pronunciation info of the entry.


    The result of this process should be that diacritics are replaced with numbers at the end of the syllable. The 1st and 4th tones are not marked with diacritics and would therefore have no number. Maybe this can be fixed by detecting final consonants (the 4th tone is an entering tone, so it always ends in -p, -t, -k or -h (glottal stop)), but I don't think that's absolutely necessary. @alex_hk90, do you think that is reasonable?
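
    The pseudo-code above could be sketched in Python roughly as follows. This is only an illustration, not the script used for the actual conversion: the function names are made up for this example, it handles hyphen-linked words only, and it assumes the 1st/4th tone rule described above (entering codas -p, -t, -k, -h get 4, everything else without a diacritic gets 1).

```python
import unicodedata

# Combining diacritics used by Tâi-lô and the tone numbers they stand for.
TONE_MARKS = {
    "\u0301": "2",  # acute
    "\u0300": "3",  # grave
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}


def syllable_to_number(syllable):
    """Replace a tone diacritic in one syllable with a trailing tone number."""
    # Decompose precomposed letters such as "è" into "e" + combining grave.
    decomposed = unicodedata.normalize("NFD", syllable)
    for mark, number in TONE_MARKS.items():
        if mark in decomposed:
            return decomposed.replace(mark, "") + number
    # No diacritic: entering codas (-p, -t, -k, -h) are tone 4, the rest tone 1.
    if decomposed.endswith(("p", "t", "k", "h")):
        return decomposed + "4"
    return decomposed + "1"


def word_to_numbers(word):
    """Convert one hyphen-linked word, syllable by syllable."""
    return "-".join(syllable_to_number(s) for s in word.split("-"))
```

    For example, word_to_numbers applied to "tāi-ha̍k" gives "tai7-hak8". Spaces, "--" and punctuation are deliberately left out of this sketch.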
     
    Last edited: Oct 1, 2015
  2. alex_hk90

    alex_hk90 状元

    Thanks - if I get some time today I'll have a look at the Minnan data in earnest. To start it will just be a straight conversion from JSON to Pleco flashcards / user dictionaries, then we can have a look at some of the enrichment for the pronunciation/etc.
     
  3. Abun

    Abun 探花

    Thanks, that would already be awesome!

    I just remembered that I lied before though... there are cases where two syllables are not separated with "-" but with " " (word borders) or with "--" (which marks the following syllable as neutral tone). That might complicate the explosion process somewhat... How much, I can't quite say. Is it possible to explode elements within an array?
    For example, if I had a string "Sè-han thau bán pû" (細漢偷挽匏) and I exploded it using " " as a separator, I would get the array a=["Sè-han", "thau", "bán", "pû"]. Can I now explode a[0] using "-" as a separator and get a=["Sè", "han", "thau", "bán", "pû"]? Or would that not work because it would mean setting a[0] to an array ["Sè", "han"], which would wreak havoc with the index numbers?
     
    Last edited: Aug 31, 2015
  4. alex_hk90

    alex_hk90 状元

    It's certainly possible to split the string by multiple delimiters (" " and "-") as you have described - an easy workaround would be to replace all "-" with " " (or vice-versa) and then split the string into an array of characters/words.
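
    An alternative to the replace-then-split workaround, sketched here in Python under the assumption that the original delimiters need to be restored afterwards: a regular expression with a capture group splits on all three delimiters at once and keeps them in the result. The name split_keep_delimiters is made up for this example.

```python
import re

# "--" comes before "-" in the alternation so it is matched as one delimiter.
DELIMITERS = re.compile(r"(--|-| )")


def split_keep_delimiters(text):
    """Split on " ", "-" and "--", keeping each delimiter as its own list item."""
    return DELIMITERS.split(text)


text = "Sè-han thau bán pû"
parts = split_keep_delimiters(text)
# Even positions hold the syllables, odd positions the delimiters between them.
syllables = parts[0::2]
delimiters = parts[1::2]
# Joining the parts again restores the original string unchanged.
restored = "".join(parts)
```

    Because the delimiters stay in the list, each syllable can be processed on its own and the original spacing and hyphenation come back when the list is joined, without placeholder letters.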
     
  5. Abun

    Abun 探花

    Ah right, I hadn't thought about that.

    I played around with JavaScript a little and managed to write a script that converts text from an input line the way I imagine it. It is probably much messier than it needs to be (I decided it's safer to declare a new variable every time something is edited, just in case, but that's probably not necessary), but at least it works :D I don't know if it's feasible to use JavaScript for such a purpose (since it's database work, I guess PHP is the language of choice?), but maybe the overall structure is still useful.

    Code:
    <!DOCTYPE html>
    <html>
      <head>
      <title>Tâi-lô conversion script</title>
      <meta charset="utf-8" />
    
      <script>
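          function numfunc(inputForm) {
          /* Store the input string in variable inp. Fall back to looking the form
          up by id when no form is passed in, and decompose precomposed letters
          (e.g. "è") so the combining-diacritic checks below can match them. */
          var form = inputForm || document.getElementById("input");
          var inp = form.tlinput.value.normalize("NFD");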
     
          /* Replace spaces with "-q" and double hyphens with "-x". The "-" is
          detected when splitting in the next step; the letters mark the original
          spacing character so it can be re-inserted after joining. A hyphen is
          also added in front of punctuation marks to separate them from the
          preceding syllable ("." and "?" must be escaped in the regex). */
          var inputTrans  = inp.replace(/ /g, "-q");
          inputTrans  = inputTrans.replace(/--/g, "-x");
          inputTrans  = inputTrans.replace(/,/g, "-,");
          inputTrans  = inputTrans.replace(/!/g, "-!");
          inputTrans  = inputTrans.replace(/\./g, "-.");
          inputTrans  = inputTrans.replace(/\?/g, "-?");
     
          // Split into Array
          var inpArray = inputTrans.split("-");
     
          // Declare empty output array
          var outpArray = [];
     
          // For-loop goes through every element in inpArray (every syllable)
          for (var i = 0; i < inpArray.length; i++) {
              /* If statements check for a combining diacritic in the string
              (acute = 2, grave = 3, circumflex = 5, macron = 7, vertical line
              above = 8), delete it and place the corresponding number at the
              end of the string. The marks are written as \u escapes so they
              stay visible in the source. */
              if (inpArray[i].search("\u0301") >= 0) {
                  outpArray[i] = inpArray[i].replace("\u0301", "");
                  outpArray[i] += "2";
              } else if (inpArray[i].search("\u0300") >= 0) {
                  outpArray[i] = inpArray[i].replace("\u0300", "");
                  outpArray[i] += "3";
              } else if (inpArray[i].search("\u0302") >= 0) {
                  outpArray[i] = inpArray[i].replace("\u0302", "");
                  outpArray[i] += "5";
              } else if (inpArray[i].search("\u0304") >= 0) {
                  outpArray[i] = inpArray[i].replace("\u0304", "");
                  outpArray[i] += "7";
              } else if (inpArray[i].search("\u030D") >= 0) {
                  outpArray[i] = inpArray[i].replace("\u030D", "");
                  outpArray[i] += "8";
              } else {
                  /* For all elements without diacritic marks, add 4 if they have a
                  入聲 coda, output them as is if they are punctuation and add 1 in all
                  other cases */
                  if (inpArray[i].substring(inpArray[i].length - 1) == "p" ||
                       inpArray[i].substring(inpArray[i].length - 1) == "t" ||
                       inpArray[i].substring(inpArray[i].length - 1) == "k" ||
                       inpArray[i].substring(inpArray[i].length - 1) == "h"
                      ) {
                      outpArray[i] = inpArray[i] + "4";
                  } else if (inpArray[i] == "." || inpArray[i] == "," ||
                                  inpArray[i] == "?" || inpArray[i] == "!" ||
                                  inpArray[i] == ""  || inpArray[i] == "q" ||
                                  inpArray[i] == "x"
                                 ) {
                      outpArray[i] = inpArray[i];
                  } else {
                      outpArray[i] = inpArray[i] + "1";
                  }
              }
          }
     
          // Join output array to a string
          var output = outpArray.join("-");
       
          /* Replace "-q" and "-x" with a space and double hyphen respectively and
          delete the separating hyphen in front of punctuation */
          output = output.replace(/-q/g, " ");
          output  = output.replace(/-x/g, "--");
          output  = output.replace(/-,/g, ",");
          output  = output.replace(/-!/g, "!");
          output  = output.replace(/-\./g, ".");
          output  = output.replace(/-\?/g, "?");
        
          // Insert output in the "output" paragraph
          document.getElementById("output").textContent = output;
          }
      </script>
    
      </head>
    
      <body>
      <!-- Input form -->
      <form id="input" action="" onsubmit="numfunc(this); return false;" method="get">
      Input Romanization here:<br />
      <input type="text" name="tlinput" /><br />
      <input type="button" value="Click to output" onclick="numfunc(this.form)" />
      </form>
     
      <!-- Output -->
      <p>Output with numbers:</p>
      <p id="output"></p>
      </body>
    </html>
    
    EDIT: Just thought of one problem: This script doesn't take punctuation into account, so if a syllable is followed by a punctuation mark, the number is added after the mark (e.g. "Tsa-bóo khiā tsit pîng, tsa-poo khiā hit pîng." --> "Tsa-boo2 khia7 tsit ping,5 tsa-poo khia7 hit ping.5").
    EDIT2: Streamlined it a bit in terms of number of different variables. Implemented numbering for 1st and 4th tone as well. Also taught it to recognize certain punctuation marks ("," and "!" to be precise) and add the numbers in front of them instead of behind. "." and "?" continue to be a problem because js syntax prevents me from using the same method as for "," and "!".
    EDIT3: Now working for "." and "?" as well. I just forgot that I have to cancel those out with \ :oops:
     
    Last edited: Sep 4, 2015
  6. alex_hk90

    alex_hk90 状元

    Thanks. :) To be honest I'm not a big fan of JavaScript, though you can almost certainly do this in JavaScript if you want. If the source data is in JSON I'll look to manipulate it either in JSON or in the Pleco flashcard / user dictionary format. If it needs more complex processing then maybe I'll import it from JSON to a relational database (probably SQLite).

    For the MoEDict conversion, I just did it in SQL as the source data had already been converted to an SQLite database.
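
    If the JSON did need to go through a relational database, the import could look something like the sketch below. It is an illustration only: the table layout and the function name load_into_sqlite are made up here, and the field names ("title", "heteronyms", "trs", "definitions", "type", "def") are taken from the MoE Minnan JSON sample quoted later in this thread.

```python
import json
import sqlite3


def load_into_sqlite(entries, connection):
    """Flatten entries into one row per definition and return the row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS definitions "
        "(title TEXT, trs TEXT, type TEXT, def TEXT)"
    )
    rows = [
        (entry["title"], het.get("trs"), d.get("type"), d.get("def"))
        for entry in entries
        for het in entry.get("heteronyms", [])
        for d in het.get("definitions", [])
    ]
    connection.executemany("INSERT INTO definitions VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


# Tiny in-memory demonstration with a single entry.
entries = json.loads(
    '[{"title": "㧎", "heteronyms": [{"trs": "khê", "definitions": '
    '[{"type": "動", "def": "卡住。"}]}]}]'
)
connection = sqlite3.connect(":memory:")
inserted = load_into_sqlite(entries, connection)
```

    Once the data is in a table like this, further processing can be done with plain SQL, as was done for the MoEDict conversion.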

    As you can see from the above links, I've started a public GitHub repo to collect together all the information on Pleco user dictionary conversions I have which is currently spread across various threads on these forums and on my local hard drive. It was taking quite a while to migrate this information (still need to do LACD and YEDict, as well as document the Traditional to Simplified conversion process better) so I'll put a stop on that for now to look at the Minnan data first. :)
     
  7. alex_hk90

    alex_hk90 状元

    @Abun: The JSON looks really clean and easy to convert, just need some help on what fields are useful to keep:
    Code:
      {
        "title": "㧎",
        "radical": "手",
        "heteronyms": [
        {
          "id": "13487",
          "trs": "khê",
          "reading": "替",
          "definitions": [
          {
            "type": "動",
            "def": "卡住。",
            "example": [
              "我的嚨喉㧎著一支魚刺。Guá ê nâ-âu khê-tio̍h tsi̍t ki hî-tshì. 我的喉嚨卡著一根魚刺。"
            ]
          },
          {
            "type": "形",
            "def": "不通順的、不順暢的、不和睦的。",
            "example": [
              "這支筆寫起來㧎㧎。Tsit ki pit siá--khí-lâi khê-khê. 這支筆寫起來不順暢。"
            ]
          }
          ]
        }
        ],
        "stroke_count": 7,
        "non_radical_stroke_count": 4
      },
    
    At the moment I'm thinking a simple mapping along the lines of:
    Code:
    - Parsing logic:
    - - For each top-level entry:
    - - - For each heteronym:
    - - - - Hanzi = title;
    - - - - Pinyin = @trs;
    - - - - Definition = definitions (type, def, examples) + synonyms.
    Is there anything else required? What does the "reading" field represent?
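
    The mapping above could be sketched in Python like this. Several things here are assumptions rather than the final converter: the output is taken to be one tab-separated line (headword, pronunciation, definition) per heteronym, the word type is shown in angle brackets, and the function name entry_to_lines is made up. Synonyms are left out because the sample entry has none, and the "@" prefix is omitted.

```python
import json

# Sample entry quoted above, reduced to the fields used here.
SAMPLE = """
{"title": "㧎",
 "heteronyms": [{"trs": "khê",
   "definitions": [
     {"type": "動", "def": "卡住。",
      "example": ["我的嚨喉㧎著一支魚刺。Guá ê nâ-âu khê-tio̍h tsi̍t ki hî-tshì. 我的喉嚨卡著一根魚刺。"]},
     {"type": "形", "def": "不通順的、不順暢的、不和睦的。",
      "example": ["這支筆寫起來㧎㧎。Tsit ki pit siá--khí-lâi khê-khê. 這支筆寫起來不順暢。"]}]}]}
"""


def entry_to_lines(entry):
    """Return one tab-separated line per heteronym of a top-level entry."""
    lines = []
    for heteronym in entry.get("heteronyms", []):
        senses = []
        for definition in heteronym.get("definitions", []):
            word_type = definition.get("type", "")
            # Skip the brackets entirely when an entry has no word type.
            label = "<{}> ".format(word_type) if word_type else ""
            sense = label + definition.get("def", "")
            for example in definition.get("example", []):
                sense += " " + example
            senses.append(sense)
        fields = [entry["title"], heteronym.get("trs", ""), " ".join(senses)]
        lines.append("\t".join(fields))
    return lines


lines = entry_to_lines(json.loads(SAMPLE))
```

    Looping entry_to_lines over every top-level entry and writing the lines to a text file would give a first import file to test.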
     
  8. Abun

    Abun 探花

    Yeah, even I find PHP easier to read (can't really form an opinion on other languages yet ^^'). In this case I mainly chose JavaScript over PHP because I originally planned to just upload the HTML file, so I thought it would be easier to test if it didn't require either uploading it to an external server or running software such as XAMPP.
    As far as databases are concerned, I've so far only done a few experiments with MySQL (don't know if it's a big difference to SQLite), and those mainly included getting or adding data as well as unconditional editing. I don't know any further functions or how to implement variables or conditions, yet (if that is at all possible using the database-internal syntax).

    I agree with your selection of essential fields.

    "reading" should only exist for single character entries and gives information about the relationship between the word in spoken language and the character associated with it. As far as I remember, it should have one of four different values:
    文: "literary readings", i.e. readings which would be used when reading classical literature, but can occur in everyday speech especially in vocabulary from more "learned" fields such as science. Historically these are essentially pronunciations borrowed into Min from Middle Chinese. Ex. tāi 大 as in tāi-ha̍k 大學, tāi-jîn 大人
    白: "colloquial readings", the older strata of vocabulary which are more prevalent in everyday contexts. Ex. tuā 大 (the normal word for "big")
    替 and 俗 both represent characters which are "etymologically not correct". Most are existing characters borrowed for their sound or their meaning (i.e. equivalent to 假借 in 六書 terms). Ex. 閣 for koh (again), 欲 for beh/bueh (to want). Others are characters specially coined for Minnan in recent times, such as (亻因) or (石匹). 俗 is supposed to stand for "popularly used characters" but I can't find a rule for distinguishing it from 替. I tested a few character readings which I would consider "popular use" (such as 人 for lâng (person)) but they were marked with 替.
    Generally 文 and 白 are considered 本字 ("etymologically correct" characters). However for a lot of 白 readings there is some disagreement on which character really is the "etymologically correct" one.

    In any case, I think the information in "reading" is useful but by no means essential. I suggest leaving it out for the moment.
     
    Last edited: Sep 1, 2015
  9. alex_hk90

    alex_hk90 状元

    Thanks for the information. :)

    I tried writing a simple shell script to loop through the JSON and process it into Pleco flashcard format but it was horribly inefficient because it involved a call to the JSON parsing application (jshon was the one I used, which seems to work well) every time I wanted to query the data. Python seems to have a native JSON library so I'll probably try that next.

    EDIT: A Python script seems like the way to go - the JSON library/module seems very easy to use. :)
     
    Last edited: Sep 1, 2015
  10. jasonmcdowell

    jasonmcdowell 秀才

    Hi, by coincidence I stumbled upon your discussion while searching for info about creating Minnan user dictionaries for Pleco. I'm glad I'm finding other people interested. I also want to get Taiwanese data into Pleco, and I'd like to help.

    I was planning to reformat the Maryknoll data. Last year, I taught myself enough iPhone programming to make a simple Taiwanese Dictionary app using the Maryknoll data. My 'Taiwanese Dictionary' app is on the App Store, but it is pretty basic and I don't have time to work on it. I would really like to spend more time studying Taiwanese using Pleco.

    I hadn't heard of the MoE Minnan data until today. I stopped looking after I found the Maryknoll data a few years ago. I see that it has some nice info that the Maryknoll data doesn't, but I think I'll still want to reformat the Maryknoll data too. I previously did a bunch of work to convert amongst various romanization systems, but I haven't worked on this stuff for about a year now.
     
  11. alex_hk90

    alex_hk90 状元

    @Abun: There's a very early version of the MoE Minnan JSON to Pleco flashcards conversion ready for testing:
    - Pleco flashcards (14,005 entries): [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

    I haven't had time to import to Pleco user dictionary format so it might not work at all, but it should be easy to fix/improve - the Python JSON module works really well for this.
     
    Last edited: Sep 28, 2015
  12. jasonmcdowell

    jasonmcdowell 秀才

    @Abun in the previous thread you mentioned you used a Minnan keyboard. What platform is this on? I've been thinking about making a Minnan keyboard for iOS, but I don't need to if someone else already has.
     
  13. jasonmcdowell

    jasonmcdowell 秀才

  14. Abun

    Abun 探花

    Cool! I frequently consult the Maryknoll dictionary as well. Do you know whether it would be legal at all to publish a plecoified version of it? And how far have you come with doing so? Is there a database which can be converted? I know there is an .xls version of it, but I only looked at it briefly and it looked somewhat messy (example sentences get their own fields, so they look almost like entries, for example), so I usually use the PDF scans (both can be found here: http://www.taiwanesedictionary.org/).

    Automated conversion between Maryknoll's version of POJ (or any version of POJ) and the MoE's Tâi-lô should be doable in theory, seeing as it mostly involves replacing a few symbols with others ("ch" to "ts", "oa" and "oe" to "ua" and "ue", the dotted o to "oo", etc., or the other way around of course).
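
    As a rough Python sketch of that idea, limited to the replacements named above and therefore not a complete converter: it assumes the POJ text is written with tone numbers rather than diacritics (so the position of a tone mark inside "oa"/"oe" doesn't matter) and that the dotted o is encoded as "o" followed by U+0358 (combining dot above right). The name poj_to_tailo is made up for this example.

```python
# Replacement pairs named in the post above; a full converter needs more rules.
POJ_TO_TAILO = [
    ("o\u0358", "oo"),  # dotted o: "o" + combining dot above right
    ("O\u0358", "Oo"),
    ("oa", "ua"),
    ("Oa", "Ua"),
    ("oe", "ue"),
    ("Oe", "Ue"),
    ("ch", "ts"),  # also turns "chh" into "tsh"
    ("Ch", "Ts"),
]


def poj_to_tailo(text):
    """Apply the replacements to POJ text written with tone numbers."""
    for poj, tailo in POJ_TO_TAILO:
        text = text.replace(poj, tailo)
    return text
```

    For instance, poj_to_tailo("chhoa7") returns "tshua7". Going the other way would mean applying the pairs in reverse, with the same caveat about tone marks.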

    The keyboard I use is called TaigIME by somebody who calls himself Pierre Magistry - 阿石. It's not perfect if you ask me, but it does its job, and it's improving as well (when I originally found it, it was 註音 only (and its own version of 台語式註音 to boot), now it also supports Tai-lo, POJ and even Taithong. You can also choose whether you want to output Romanization or characters (MoE ones)). I use Android though and I don't know whether there is an iOS version by now. When friends of mine looked a few months ago, there was none.

    Awesome! I imported it as a dictionary as well as flashcards and tested a few things that I thought might cause problems and here's what I found (in order of descending importance):
    • Tone diacritics are not displayed anywhere (be it dictionary window, when opening a single entry or during flashcard tests). The sole exception is the edit window. There everything is displayed. The above-stroke for the 8th tone returns an "unknown character" square in the default font, but in TNR it's fine.
    • The "example" field is missing (or maybe that was intentional in this first version because there were problems?)
    • The "type" field returns nothing for idioms, resulting in empty angled brackets (ex. 驚驚袂著等)
    • There seem to be discrepancies as to whether or not the @ has to be included in the search. For example, "in" (亻因) can only be found if searching for "in" (without the @), but "@in" (with the @) returns entries as well.
    • Searching for characters from Unicode extensions (ex. (亻因), (敖 over 力) etc.) returns only the Unicode entry, although the entries exist in the dictionary (problem with Pleco’s search algorithm?)
    • The characters of a few entries cannot be displayed, ex. peh (足百). This was to be expected, though; these characters are not yet encoded in unicode, including in the extensions, so I guess there's not much that can be done there.
    Another thought that came to me concerning the layout: Maybe in the final version, it would be nice to have index numbers (maybe ①, ② etc. as with other existing dictionaries) for each heteronym within an entry. That would make the structure clearer at first glance.
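
    The index numbers could be generated rather than typed, since ① to ⑳ are consecutive code points starting at U+2460. A small Python sketch, with made-up function names and the assumption that entries with a single heteronym stay unnumbered:

```python
def circled(n):
    """Return the circled number for 1 <= n <= 20 (U+2460 is the circled 1)."""
    if not 1 <= n <= 20:
        raise ValueError("circled numbers in this block only cover 1 to 20")
    return chr(0x2460 + n - 1)


def number_heteronyms(texts):
    """Prefix each heteronym's text with a circled index if there are several."""
    if len(texts) < 2:
        return list(texts)
    return ["{} {}".format(circled(i), t) for i, t in enumerate(texts, start=1)]
```

    An entry with more than 20 heteronyms would raise an error here and would need a different numbering scheme.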
     
    Last edited: Sep 2, 2015
    Wrong! It's part of CJK Unified Ideographs Extension E (Unicode code point U+2C9B0); check the U+2C9Bx row of the wiki table. HanaMin can display it, and apparently Pleco uses this font too, so it should be displayable, in theory!
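
    This can be double-checked with a few lines of Python. The range below is the Extension E block (U+2B820 to U+2CEAF); whether the character name resolves depends on the Unicode data shipped with the Python version in use, so a fallback string is supplied.

```python
import unicodedata

# CJK Unified Ideographs Extension E spans U+2B820 to U+2CEAF.
EXTENSION_E = range(0x2B820, 0x2CEAF + 1)

char = chr(0x2C9B0)
in_extension_e = ord(char) in EXTENSION_E
# Falls back to a placeholder if this Python's Unicode data lacks the character.
name = unicodedata.name(char, "unknown to this Python's Unicode data")
```

    Whether the character actually renders still depends on the font, which this check cannot tell.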
     
  15. Abun

    Abun 探花

    Ah, it seems the newest extension of Unicode slipped past me and my HanaMin was not up to date. Thanks for pointing it out. How do you change the font for the extension characters in Pleco though? I can't find an option for that, only for the regular Chinese font...
     
  16. jasonmcdowell

    jasonmcdowell 秀才

    The Maryknoll data is published with a Creative Commons Non-commercial license, so it would be fine for us to publish a user dictionary, but probably not for Pleco to include it as an Add-on. I wouldn't expect a commercial exemption for this dataset.

    Last year, I processed the Maryknoll excel spreadsheet data a bit to fold multiple word definitions and example sentences into individual dictionary entries. I just moved to a new city and I'm still getting unpacked. All my dictionary project stuff is on my NAS, which I haven't unpacked yet. I think I wrote various scripts in either python, ruby or Objective-C. I'll get everything together soon.
     
  17. jasonmcdowell

    jasonmcdowell 秀才

    Do you all prefer using Tâi-lô or POJ? I've only ever practiced with POJ, though I included several other romanizations in my previous dictionary work.

    I don't understand the scope of the MoE dataset. Is the regular MoE dataset a Chinese-Chinese dictionary, and does the MoE Minnan dataset show Hanzi characters for Minnan words written in Tâi-lô? I thought that Minnan does not have standardized Hanzi? Is this an attempt at standardization by the Ministry of Education?

    My Taiwanese is much better than my Mandarin, and my Taiwanese is only at a beginning conversational level. I can probably only recognize 10 - 20 hanzi, so while I'm planning to start practicing hanzi using Pleco, the Maryknoll Taiwanese-English dataset will be most helpful to me.

    I think I have read that user dictionaries in Pleco do not do full text search, which would make looking up English words or Chinese words in the Maryknoll data difficult. Maryknoll also published some English-Taiwanese PDF files, but they don't have an excel spreadsheet or any other easily parsed format for the English-Taiwanese dataset. I've considered trying to parse the PDF files, but it would be a lot of work to get everything perfect.
     
  18. jasonmcdowell

    jasonmcdowell 秀才

    The Maryknoll Taiwanese-English excel spreadsheet data is pretty easy to reorganize. I'll probably do that in the next week or two.

    The Maryknoll English-Taiwanese dictionary PDF files are parseable as rich text if you segment by font and text boldness. I'll probably do that in the next couple of months. I just started a new job in a new city, so I have to get on with that first.
     
  19. alex_hk90

    alex_hk90 状元

    Hi, and welcome! :) I saw your comment on the Dropbox file as well but wasn't sure how to respond (I didn't even know Dropbox had a comments feature).

    I know nothing about Minnan, but I've done a few Pleco user dictionary conversions (still migrating the info to GitHub) and this one didn't look too difficult.

    Thanks for the info. :)

    Thanks - I'll have a look at addressing some of these points in the next version. :)

    Not sure, but there are far fewer entries in this MoE Minnan dataset than in the full MoEDict (Chinese-Chinese) one.
     
