Hainanese Resources for Pleco

furisas

Member
I know of an on-going effort which is attempting to create a Hainanese dictionary of some 5,500 words. It will have 4 searchable indexes:
(1) Chinese word characters, (2) Chinese pronunciation pinyin, (3) Hainanese Wenchang pronunciation phonetics, (4) English meaning.

For example:
早上,zao3 shang4, da4 jio2, morning
笨蛋,ben3 dan4, pong1 gang3, stupid

I would like to know how this can be imported into, or licensed for, Pleco.

Would Pleco be able to use TTS to voice the Hainanese Pronunciation Phonetics? If this is unsatisfactory, can voice recordings be attached to each of the words?

Can someone explain to me how this can be done? Can this be done in this current Pleco version, or does it need the capabilities in the upcoming Pleco 4.0?

If this is not doable in this version of Pleco, can it be done in Anki? The idea is that in the meantime, the work on the digital Hainanese dictionary and flashcards can first be done in Anki, and then imported from Anki into Pleco 4.0 when that comes out, when it does come out!

Although not a new Pleco user, I am not a digital native and I struggle somewhat to be any sort of 'power user'. Please explain like you are talking to a 5 year old. Thank you!
 

mikelove

皇帝
Staff member
Would Pleco be able to use TTS to voice the Hainanese Pronunciation Phonetics? If this is unsatisfactory, can voice recordings be attached to each of the words?

Sorry, what TTS would we use? You can attach voice recordings, sure, but I'm not aware of any Hainanese TTS engines we could tap into; if this is only on Android and somebody's created a third-party system Hainanese TTS engine for that, then you could hook into it even in the current version of Pleco by choosing it as your System TTS Engine.

Can someone explain to me how this can be done? Can this be done in this current Pleco version, or does it need the capabilities in the upcoming Pleco 4.0?

You could do it now if you put the Hainanese pronunciation at the start of the definition field. Would not be possible to search it (or display it separately from the definition) until 4.0.
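
As a rough sketch of that workaround (assuming the four fields from the example in the first post, and the characters<tab>pinyin<tab>definition import format quoted later in this thread; the function name and the square brackets are only illustrative choices, not anything Pleco requires):

```python
def to_pleco_line(characters, pinyin, hainanese, english):
    """Build one import line: characters<tab>pinyin<tab>definition.

    The Hainanese reading goes at the start of the definition field.
    The square brackets are an arbitrary marker chosen for this sketch.
    """
    definition = f"[{hainanese}] {english}"
    return "\t".join([characters, pinyin, definition])


# The two sample entries from the first post.
print(to_pleco_line("早上", "zao3 shang4", "da4 jio2", "morning"))
print(to_pleco_line("笨蛋", "ben3 dan4", "pong1 gang3", "stupid"))
```

Each printed line is one entry for the text file; the Hainanese reading would then show up as the first thing in the definition.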

The idea is that in the meantime, the work on the digital Hainanese dictionary and flashcards can first be done in Anki, and then imported from Anki into Pleco 4.0 when that comes out, when it does come out!

It would probably be easier to do it from a text file, honestly. Anki import would work OK, but it wouldn't really be any different for this purpose than a text import - in either case you've got a bunch of columns of semantically ambiguous text which Pleco needs to be told what to do with.
 

DavidMars

进士
Is there any preliminary data set available, with 100 or 500 or 1,000 words? Is there any link to the people doing this work? Thanks, David
 

furisas

Member
furisas said:
Would Pleco be able to use TTS to voice the Hainanese Pronunciation Phonetics? If this is unsatisfactory, can voice recordings be attached to each of the words?
Mike Love: Sorry, what TTS would we use? You can attach voice recordings, sure, but I'm not aware of any Hainanese TTS engines we could tap into; if this is only on Android and somebody's created a third-party system Hainanese TTS engine for that, then you could hook into it even in the current version of Pleco by choosing it as your System TTS Engine.
Mike,
Thanks for responding so quickly! My sincere apologies for being the one who is so slow. I have been so caught up. I really do feel your love for Pleco users. Thank you!

Please do bear with me if I do not properly comprehend the technology and techniques you use in Pleco. That is why I asked for your indulgence to explain to me like I was a 5 year old, even though I am past 60.

My aim is to learn to construct resources to use in Pleco, to help me learn Hainanese more easily, and consequently use it to teach Hainanese to others keen to learn.

I am aiming at this sort of learner: someone who already has an intermediate level of Chinese (but can learn more using Pleco) and is very weak in Hainanese. The idea is to use the Chinese characters in a conversation text as scaffolding to help the learner progress to strong conversational skills, and so to revive Hainanese speaking skills in our Hainanese community. The observation in our community is that this speaking ability is gradually being weakened, even lost, even among the younger Hainanese families on Hainan island, as Mandarin and its pinyin take stronger root.

I am thinking of benchmarking it to HSK 1-6. Please guide me if there is another better standard more suited to conversational contents. I suppose this learning eventually has to lead to reading chinese novels or newspapers in Hainanese.

Are you saying that the present Pleco 3.2.59 (both iOS ? and Android ?) is already able to help me if I have resource contents where the Chinese characters have their Hainanese pronunciation
1. in a voice file
2. in phonetics (Would such phonetics be pronounceable using some generic text to speech software? I do not know if any such software exist)
 

mikelove

皇帝
Staff member
Sorry, no, the 'can' was referring to what you can do in the upcoming 4.0. If you had a Hainanese text to speech system (a complicated program somebody would have to write) you could theoretically install that on Android and have the current version of Pleco play audio through it, but that's not at all straightforward.
 

furisas

Member
Mike,

Hainanese Voice Recordings
Are you saying that attaching Hainanese Voice Recordings to Chinese words (both simple and compound words) has to wait for the upcoming Pleco 4.0?

If so, could you advise me if it is better to first add these Hainanese Voice Recordings to trial in the ANKI system (both for the User Defined Custom Dictionary and the Flash Cards system).... I am thinking this would iron out whatever production issues there may be... so as to be ready to port a smooth Hainanese Voice Recordings system into Pleco 4.0 when it comes out.

Hainanese Text to Speech:
If there is no further assistance I can call upon, I think the Hainanese Text to Speech system would be a very difficult project. From the little I know, the Hainanese dialect has some initials and finals not present in Mandarin Pinyin and Cantonese Jyutping. Most of the words use just 5 tones. The other words use the 6th, 7th and 8th tones, but these could be neglected or ignored, as an approximation to the 5th could already be passable and good enough in the conversational domain.

In this Text-to-Speech area, I would need to read up on it. I would really appreciate if you can help point me to a few good resources that can help me get over the technical hump. Otherwise, I am not competent to discuss the matter. But I dearly love to be able to appreciate technically so that I can then look up and discuss with a technically competent person to see how to actualise the Hainanese text to speech system.

Hainanese Speech to Text:
The reverse effort is also an interesting area to me. I have voice recordings where the teacher explains in Hainanese. I want to turn this into text. How can I go about this?

I would be very grateful for your assistance to explain or to point me to somewhere to get help.

Sincerely,
Tsu Li
 

mikelove

皇帝
Staff member
Are you saying that attaching Hainanese Voice Recordings to Chinese words (both simple and compound words) has to wait for the upcoming Pleco 4.0?

Yes. You could certainly try it in Anki if you like but I don't know if that would really do anything to help with the process in Pleco 4.0.

If there is no further assistance I can call upon, I think the Hainanese Text to Speech system would be a very difficult project. From the little I know, the Hainanese dialect has some initials and finals not present in Mandarin Pinyin and Cantonese Jyutping.

Android TTS engines just send characters, actually, not romanized text, so this would basically be writing an Android TTS plugin that mapped characters to Hainanese recordings. (in theory you could actually also use that with AnkiDroid, since it'll hook into any configured Android TTS engine just like we do)
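
To give a feel for the "map characters to recordings" part only, here is a minimal sketch in Python. It is not Android code and does not use any TTS API; the lookup table, the file names and the function name are all made up for illustration, and it simply assumes a greedy longest-match lookup would be good enough:

```python
# Hypothetical table: Chinese word -> name of a Hainanese recording file.
RECORDINGS = {
    "早上": "da4_jio2.mp3",
    "笨蛋": "pong1_gang3.mp3",
}


def recordings_for(text, table=RECORDINGS):
    """Return the recording files for `text`, using greedy longest match.

    At each position the longest word found in the table wins;
    characters with no recording are skipped.
    """
    max_len = max(map(len, table), default=0)
    result = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                result.append(table[chunk])
                i += size
                break
        else:
            i += 1  # no recording covers this character; move on
    return result


print(recordings_for("早上笨蛋"))
```

A real engine would still have to play those files back through Android's TTS plumbing, which is the part that takes actual Android development.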

I would really appreciate if you can help point me to a few good resources that can help me get over the technical hump.

Honestly I haven't ever made more than a cursory investigation of how to write our own Android TTS engine (which we're mostly forbidden from doing for license reasons anyway) so I wouldn't be able to help much.

The reverse effort is also an interesting area to me. I have voice recordings where the teacher explains in Hainanese. I want to turn this into text. How can I go about this?

I don't know of any voice recognition engines that support Hainanese. Making your own would require building a model for an open-source voice recognition engine like DeepSpeech, for which you'd need an AI expert and a *ton* of training data (thousands of hours of transcribed audio of people speaking Hainanese).
 

DavidMars

进士
I've been corresponding with furisas on this and he has the early beginnings of a hopefully large 4 field Hainanese Word, English Translation, Characters, Pinyin data set. I tried to import the data as .csv into OpenOffice and into Numbers on my Mac and it doesn't parse the way I expect it to. I had OpenOffice show me the hidden characters and there don't seem to be any peculiarities that would 'misalign' the four fields in each record.

If anyone has the time to look at these 4 sample records and has any thoughts on why I can't do a .csv import into OpenOffice or Numbers, I'd appreciate hearing. I'm doing some travel so will be away from the internet for much of the next two weeks.

I attach the text file with the field description overview and 4 sample records. Basically I'd like to be able to clean/fix this small sample in the hope it can be applied to the full data set, and then arrange the columns in a spreadsheet so we can follow Mike's June 30, 2022 note: "You can create a new user dictionary in Settings / Manage Dictionaries and then use the "Import" function to add words to it from a text file. Format is:
characters<tab>pinyin<tab>definition
one entry per line."

My thinking is to match the available format to the target format. Presumably we'll need to combine 2 of these fields as only 3 fields are supported?

So presumably something like Characters<tab>Pinyin<tab>COMBINED Hainanese Word/English Translation.

Thoughts?

Thanks.
 

Attachments

  • Hainanese 4 records.txt
    253 bytes · Views: 327

Shun

状元
Hello David,

I had a quick look, and the only problem with the CSV file you uploaded was that it had an intro text at the beginning and some full-width commas instead of the regular half-width commas.

To stitch the four columns together into three columns using Numbers, you could first open the four-columns CSV in Numbers, then add more columns at the right where you reference the columns at the left (using "&" to concatenate strings):

Screenshot 2022-10-08 at 08.01.02.png


That will yield a combined field in the last column:

Screenshot 2022-10-08 at 08.00.50.png


Using regular expressions would be another possibility, but the advantage of Numbers is that it lets you see the output more clearly.

I attach the corrected CSV file as well as the final Numbers spreadsheet. Instead of a space between the Hainanese & Definition fields, you could also insert two "EAB1" Unicode characters which add newlines. (See this post: https://plecoforums.com/threads/multiple-new-lines-in-user-defined-flashcards.5916/post-44863 ) As the final step, you just copy and paste the three columns on the right into a text editor and save them as UTF-8. That will use tab delimiters, which also happens to be Pleco's import format.
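
If you would rather script it than use a spreadsheet, the same steps could look roughly like this in Python. This is only a sketch: it assumes the column order from the example in the first post (characters, pinyin, Hainanese, English), that the commas have already been cleaned up, and that no field contains a comma of its own. The function name is made up.

```python
import csv
import io

# Private-use character that Pleco displays as a line break (see the linked post).
PLECO_NEWLINE = "\ueab1"


def csv_to_pleco(csv_text, separator=PLECO_NEWLINE * 2):
    """Turn four-column CSV text into Pleco's three-column, tab-delimited format.

    The Hainanese reading and the English meaning are joined into one
    definition field, with `separator` between them.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank lines and anything without exactly four fields
        characters, pinyin, hainanese, english = (field.strip() for field in row)
        definition = hainanese + separator + english
        lines.append("\t".join([characters, pinyin, definition]))
    return "\n".join(lines) + "\n"


sample = "早上,zao3 shang4, da4 jio2, morning\n笨蛋,ben3 dan4, pong1 gang3, stupid\n"
print(csv_to_pleco(sample, separator=" "))
```

The returned text would then be saved as a UTF-8 file for import. Pass `separator=" "` for a plain space, as in the Numbers spreadsheet, or leave the default to get the two "EAB1" line breaks.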

If the whole process feels too finicky, I would understand. If it isn't too much data, I would be willing to help you.

Cheers,

Shun
 

Attachments

  • Hainanese 4 records.csv.txt
    106 bytes · Views: 307
  • Hainanese 4 records.numbers.zip
    75.2 KB · Views: 318

DavidMars

进士
Shun, thanks very much for your excellent advice. The concatenation of the two columns in Numbers is very helpful. I understand everything except the full-width comma and the half-width comma part. How do I do a Global Search & Replace to change all commas, or all full-width commas, to half-width commas? When I am in OpenOffice and show the hidden formatting, I see the spaces and line returns but I don't see anything different between one comma and the next, as per the attached screen shot. Thanks.

Hidden Formatting Shown.png
 

Shun

状元
You're welcome! The full-width comma is just a different character from the half-width comma. So you can Search and Replace like this:

Search for: «，»
Replace by: «,»

(both without quotes)
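
The same replacement as a small Python sketch, in case the full data set is easier to clean with a script (the full-width comma is the Unicode character U+FF0C; the function name is just for illustration):

```python
FULL_WIDTH_COMMA = "\uff0c"  # the wide comma used in Chinese text
HALF_WIDTH_COMMA = ","       # the regular comma that CSV import expects


def normalize_commas(text):
    """Replace every full-width comma with a regular half-width comma."""
    return text.replace(FULL_WIDTH_COMMA, HALF_WIDTH_COMMA)


sample = "早上\uff0czao3 shang4\uff0cda4 jio2\uff0cmorning"
print(sample.count(FULL_WIDTH_COMMA), "full-width commas found")
print(normalize_commas(sample))
```

Note that this replaces every full-width comma in the text, including any that appear inside a Chinese phrase, so it is worth checking the result before importing.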

Cheers, Shun
 

adl135

Member
Hello,

I was wondering if this project has had any progress in the past few months?

I'd love to start studying some vocab!
 