Plans for extension A and B support?

ldolse

状元
Hey,

I've been working on a little dictionary covering all the different radicals and their variants, as well as common character components. Once I finished the KangXi radicals and their variants, I discovered that I was often working with characters from extensions A and B, particularly with characters from mainland Chinese tables (examples deleted, since phpBB doesn't support them). These characters aren't encoded in most fonts, including the ZYSong font bundled with Pleco.

Anyway, what are the plans here? There are a couple of things I see:
  • The Unihan database doesn't cover extensions A and B, so it's no help for these characters. Are there any other dictionaries out there?
  • There doesn't seem to be a single font that includes both the standard CJK block and extensions A and B, so it seems Pleco would need the ability to display characters from primary and alternate fonts, similar to the way Windows Mobile handles CJK support today. (Perhaps this is already supported; I see Pleco has other alternate fonts for specific uses.)
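To make the primary/alternate font idea concrete, here is a minimal sketch of per-character font selection by Unicode block. The block ranges are the standard ones; the font file names and the `pick_font` helper are hypothetical and are not taken from Pleco.

```python
# Standard Unicode block ranges for the three sets discussed in this thread.
BLOCKS = [
    (0x3400, 0x4DBF, "CJK Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B"),
]

# Hypothetical font file names, chosen only for illustration.
FONT_FOR_BLOCK = {
    "CJK Unified Ideographs": "primary.ttf",
    "CJK Extension A": "ext_a.ttf",
    "CJK Extension B": "ext_b.ttf",
}


def block_of(ch):
    """Return the name of the CJK block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None


def pick_font(ch, default="primary.ttf"):
    """Pick a font file for one character, falling back to the default."""
    return FONT_FOR_BLOCK.get(block_of(ch), default)
```

A renderer built this way would call `pick_font` for each character and switch fonts whenever the result changes, which is roughly what a fallback chain does.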

Thanks!
 

mikelove

皇帝
Staff member
Extensions A and B aren't really on our radar, for handheld software at least. There are certainly useful characters in them, but as far as we know there aren't any PDA handwriting recognition engines that support characters outside the basic CJK Unified block (Hanwang's only seem to support characters in GB and Big5), which rules out easy lookup. Honestly, we also hardly ever get requests for expanded character support; for the vast majority of our customers the current character set seems to be more than adequate.

If a lot of people are interested we could certainly add an option for CJK Extension A characters, perhaps through an extra (user-supplied) alternate font file; our software does indeed support those. Adding Extension B support would require significant extra work to enable our Unicode text engine to handle characters outside the basic 0000-FFFF range. It would certainly be doable: nothing in the architecture would prevent it, and thanks to our use of private-use characters to apply text formatting, our software is already equipped to deal with double/triple-size character codes. It would still be a significant undertaking, though, and without many dictionaries, handwriting recognizers, etc. supporting those characters, I'm not sure there's enough interest to justify spending our time on that instead of on some more widely used new feature.
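To illustrate why Extension B is more work than Extension A, here is a small sketch of how a code point above FFFF is split into two 16-bit units (a UTF-16 surrogate pair). It assumes a text engine that stores text as 16-bit units, which this thread does not confirm; the function name `to_utf16_units` is made up for the example.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        # Basic range (includes CJK Unified and Extension A): one unit.
        return [code_point]
    # Outside the basic range (includes Extension B): two units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

An Extension A character such as U+3400 stays a single unit, while the first Extension B character, U+20000, becomes the pair D840 DC00. Any code that assumes one unit per character (indexing, searching, cursor movement) has to be taught about such pairs.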
 

ldolse

状元
Ah, I didn't realize extension B searches would actually break things today. I just recently began doing the research to find extension B characters I'm interested in, and I hadn't actually compiled a new version of my dictionary containing them.

Is alternate font support actually a user-accessible feature, or is it something the code only supports internally? I'm thinking of switching my default font to Sun-ExtA. It has full support for everything except Extension B anyway, so the alternate font wouldn't be particularly critical unless there were general Extension B support; Extension B seems to be handled by a dedicated font in most implementations.

As far as the rest of the support goes, all I'm looking for is the ability to display extension B safely in a dictionary and use those characters in flashcards. Search by character isn't actually critical either, though I wouldn't want to kill the search function for other entries just because extension B headwords are in the dict. I'm not so worried about handwriting recognizers, radical lookup, etc, which don't really help me much today for Unihan anyway (though they do occasionally help).

There are some other places it would be useful:
  • The component search function which I believe 2.0 will have - a significant number of standalone character components are only available in extension B.
  • Support for alternate radical tables - I'm not sure how many paper dictionaries use the 226 radicals that are in the 汉英词典, but trying to find all the components listed in this simplified radical table is what led me down this path. 5 or 6 of the radicals here are only encoded in Extension B, though I think GB has supported them for a long time.
 

mikelove

皇帝
Staff member
Alternate font support is strictly internal-use at the moment, and we're debating how much of it to expose in the UI. Right now the preferences screen in 2.0 doesn't even include an option to change the main Chinese font from ZYSong, since we're not sure how well our new font system will interact with fonts we haven't tested it with. It only checks the file name, though, and won't verify the name embedded in the font, so if you really want to use another font you can simply delete ZYSong and rename Sun-ExtA or whatever to ZYSong.ttf. (SimSun-18030 is another good choice, and is freely available from Microsoft in the GB18030 font support package.)
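For anyone doing the rename on a desktop before syncing, the swap could be scripted roughly as follows. This is only a sketch: the `swap_main_font` helper and the directory layout are hypothetical, and unlike the delete-and-rename approach described above it keeps the original font as a `.bak` copy so the change can be undone.

```python
import shutil
from pathlib import Path


def swap_main_font(font_dir, replacement, expected_name="ZYSong.ttf"):
    """Copy `replacement` into font_dir under the file name the app looks for.

    Any existing file with that name is kept as '<name>.bak' rather than
    deleted. Returns the path of the newly installed font file.
    """
    font_dir = Path(font_dir)
    target = font_dir / expected_name
    if target.exists():
        backup = target.with_name(target.name + ".bak")
        target.replace(backup)  # overwrites an older backup if present
    shutil.copyfile(str(replacement), str(target))
    return target
```

For example, `swap_main_font("fonts", "Sun-ExtA.ttf")` would leave `fonts/ZYSong.ttf` containing the Sun-ExtA data and `fonts/ZYSong.ttf.bak` containing the original.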

2.0 won't have a component search function, actually. That's been put off until at least 2.1, largely because we weren't able to license the data we needed for it and will now have to develop (or pay for the development of) that data ourselves. But we're definitely adding it at some point; we consider it essential if we ever want to support phones that don't have touchscreens. Alternate radical tables go along with that. That area in general hasn't really changed in 2.0, so in 2.1 we'll likely add a number of enhancements to it, along with some other new input systems like 4-corner and possibly WuBi/Cangjie.
 