OCR problem with canto words?

pleco10 · Oct 8, 2012

The OCR engine seems to have trouble picking up 咁 - it detects it as 咐.

Perhaps this issue extends to other Canto slang/vernacular words?
(http://en.wikipedia.org/wiki/Written_Cantonese)

I am on Pleco v2.2.12.

mikelove · Oct 8, 2012

pleco10 said:
The OCR engine seems to have trouble picking up 咁 - it detects it as 咐.

Yes, OCR only supports about the 5000 most common traditional characters and so doesn't include much HK-specific vocabulary - every new character like 咁 that we support increases the odds of a false positive from some other character (like 咐) so we keep the character set limited for reasons of accuracy.
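
To make the trade-off concrete, here is a toy sketch of a nearest-template classifier. It is not how Pleco's OCR engine is implemented; the two-dimensional feature vectors, the `classify` helper, and the sample values are all made up for illustration. It only shows that adding a template for a similar-looking character can pull a slightly noisy sample of an existing character over to the new one.

```python
import math

def classify(sample, templates):
    """Return the label whose template vector is closest to the sample."""
    return min(templates, key=lambda label: math.dist(sample, templates[label]))

# Hypothetical feature vectors; a real engine would use far richer features.
base = {"咐": (1.0, 0.0), "吩": (0.0, 1.0)}

# Same set plus a template for 咁, which sits close to 咐 in feature space.
extended = {**base, "咁": (1.0, 0.4)}

# A slightly distorted scan of 咐 (hypothetical values).
noisy_fu = (0.9, 0.25)

result_base = classify(noisy_fu, base)          # closest template is 咐
result_extended = classify(noisy_fu, extended)  # 咁 is now closer than 咐
```

With the smaller set the distorted sample still resolves to 咐; once 咁 is added, the same sample lands nearer the new template and is misread.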

character · Oct 8, 2012

Could that be configurable? What you're doing sounds like a good default, but being able to turn on better Traditional support would be appreciated.

mikelove · Oct 8, 2012

character said:
Could that be configurable? What you're doing sounds like a good default, but being able to turn on better Traditional support would be appreciated.

Possibly in a future update, but in general we've had so few comments about improved traditional character support in OCR that we don't get the impression there's that much interest, and developing additional character templates is rather expensive relative to other enhancements we might make.

character · Oct 8, 2012

mikelove said:
Possibly in a future update, but in general we've had so few comments about improved traditional character support in OCR that we don't get the impression there's that much interest, and developing additional character templates is rather expensive relative to other enhancements we might make.

Wasn't aware it had improved. Tried it a few weeks ago and IIRC it couldn't recognize 裏 as inside.

mikelove · Oct 8, 2012

character said:
Wasn't aware it had improved. Tried it a few weeks ago and IIRC it couldn't recognize 裏 as inside.

Sorry, should have said "improving" - we haven't improved it, we were contemplating doing so but have not been given much reason to think that our users would like us to spend our time / money on that instead of on other OCR improvements (like PDF support, and the new "Crosshairs Mode" we're currently testing out on Android).

character · Oct 9, 2012

mikelove said:
Sorry, should have said "improving" - we haven't improved it, we were contemplating doing so but have not been given much reason to think that our users would like us to spend our time / money on that [...]

You had set expectations so low WRT recognizing Traditional characters I at least had given up hope on that front. If you add all the various ways people encounter Traditional characters (Traditional-using schools, advanced learners adding Traditional to their skillset, traveling to areas using Traditional, Cantonese learners, etc.) I would think there's enough interest.

scykei · Oct 9, 2012

I don't think he means no interest for traditional character sets. Pleco can detect both simplified and traditional just fine with OCR. It's just the extremely rare ones that it doesn't do. These Cantonese characters are considered rare because a lot of them were made up just for it, so they are never used in Mandarin Chinese.

I feel that it isn't much use implementing OCR for this because most of the street Cantonese you see online never uses the assigned characters. There has never been any standardisation for it, so people can choose whether or not to use the debatable "proper" characters which people use in gossip magazines or certain newspapers in Hong Kong. For example, 哥 for 嗰 and D for 啲; the radicals are usually dropped too, like 甘 for 咁/噉, 吾 for 唔, and 訓 for 瞓. Basically, if you don't know the language it wouldn't be much help anyway since the dictionary definition won't match the context. Besides, we don't even have a Cantonese dictionary yet so there's really no point scanning it for now.
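
As a rough sketch of what handling those substitutions would involve, the snippet below lists, for each character in a string, the conventional forms it might stand for. The table contains only the examples from this post, and the names `INFORMAL_TO_STANDARD` and `candidate_readings` are hypothetical. It returns candidates rather than replacing anything, because a character like 哥 is also an ordinary word in its own right.

```python
# Informal spelling -> conventional written-Cantonese character(s).
# Only the examples mentioned in the post; not an exhaustive list.
INFORMAL_TO_STANDARD = {
    "哥": ["嗰"],
    "D": ["啲"],
    "甘": ["咁", "噉"],
    "吾": ["唔"],
    "訓": ["瞓"],
}

def candidate_readings(text):
    """For each character, return the original followed by any conventional
    forms it may be standing in for."""
    return [[ch] + INFORMAL_TO_STANDARD.get(ch, []) for ch in text]

# Example: the first character has one alternative, the second has none.
readings = candidate_readings("吾好")
```

Picking the right candidate still needs context, which is the same problem a dictionary lookup runs into.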

mikelove · Oct 9, 2012

character said:
You had set expectations so low WRT recognizing Traditional characters I at least had given up hope on that front. If you add all the various ways people encounter Traditional characters (Traditional-using schools, advanced learners adding Traditional to their skillset, traveling to areas using Traditional, Cantonese learners, etc.) I would think there's enough interest.

I'm more thinking about the non-PlecoForums-following users - the random feedback emails from people relatively new to Pleco that give us a tremendous amount of tracking info about how we're doing with the people who are buying our stuff right now.

Actually, even just in general, interest in traditional appears to be fading - the bread-and-butter traditional issues like "how do I get my flashcards to display in traditional" are coming in at a noticeably lower rate than they were a year or two ago, and there isn't really any change in the UI to explain that. We get more emails now from people who accidentally switched to traditional and want to go back to simplified than we do from people who can't figure out how to switch to traditional. The simplified takeover in Chinese language classes around the world has probably had a lot to do with it - the old teachers are retiring and their replacements are teaching simplified.

scykei said:
I feel that it isn't much use implementing OCR for this because most of the street Cantonese you see online never uses the assigned characters. There has never been any standardisation for it, so people can choose whether or not to use the debatable "proper" characters which people use in gossip magazines or certain newspapers in Hong Kong. For example, 哥 for 嗰 and D for 啲; the radicals are usually dropped too, like 甘 for 咁/噉, 吾 for 唔, and 訓 for 瞓. Basically, if you don't know the language it wouldn't be much help anyway since the dictionary definition won't match the context. Besides, we don't even have a Cantonese dictionary yet so there's really no point scanning it for now.

The Cantonese dictionary is actually getting pretty close (at last) - we're already starting to put together the searching code for it - but inconsistent character usage is indeed a problem there. The fact that HK character support in our handwriting recognizer is buried in an off-by-default option, and that we nonetheless have had maybe 2 emails in the last year asking whether we support HK characters / how to turn them on, doesn't make the case that it's something people want either.

scykei · Oct 10, 2012

mikelove said:
I'm more thinking about the non-PlecoForums-following users - the random feedback emails from people relatively new to Pleco that give us a tremendous amount of tracking info about how we're doing with the people who are buying our stuff right now.

Actually, even just in general, interest in traditional appears to be fading - the bread-and-butter traditional issues like "how do I get my flashcards to display in traditional" are coming in at a noticeably lower rate than they were a year or two ago, and there isn't really any change in the UI to explain that. We get more emails now from people who accidentally switched to traditional and want to go back to simplified than we do from people who can't figure out how to switch to traditional. The simplified takeover in Chinese language classes around the world has probably had a lot to do with it - the old teachers are retiring and their replacements are teaching simplified.

Well, it's not that. Most of the people who read traditional in Taiwan, Hong Kong and Macau can also read simplified, but the reverse isn't always true. A majority of them do have a personal preference, but they don't usually complain too much unless they are extreme fanatics about the Traditional/Simplified debate. I know I prefer traditional but I won't make a big deal out of it.

However, support for traditional characters in a dictionary is absolutely necessary. It's common for someone who reads only simplified to come across a traditional variant which he or she hasn't seen before and needs to look up.

mikelove · Oct 10, 2012

scykei said:
However, support for traditional characters in a dictionary is absolutely necessary. It's common for someone who reads only simplified to come across a traditional variant which he or she hasn't seen before and needs to look up.

Oh of course, and we have no intention of dropping those - the most we'd ever consider would be only using one character set in the definitions / examples of a particular dictionary. Actually, though, one of the "early access" ones will be traditional-only everywhere initially, though it'll support simplified character searches after our big update, and we may go that route with some of the upcoming Cantonese ones too - I doubt many of the people interested in Cantonese support will object if we do.

(but don't worry, I'm not referring to the Classical one there; that supports both, and actually started its life in simplified but with simplified-to-traditional mappings / disambiguations that we've scrupulously followed)
