Emoji to be encoded in Unicode

The Unicode Technical Committee is working on encoding emoji (絵文字) in the Unicode Standard and ISO/IEC 10646. This has spurred loads of discussion on the Unicode mailing list, with more than a handful of forked threads, and has raised fundamental questions such as whether emoji should be encoded at all and what constitutes a character.

Not to worry: the Unicode Consortium is a veteran when it comes to dealing with the hairy issues of creating standards that work across languages, cultures and geographic regions. They simply can’t please everyone.

To me, the motivation for this is clear: interoperability. The current state of affairs in the Japanese mobile industry leaves a lot to be desired. Across the carriers, there are different sets of supported emoji, different private-use characters, substitution mappings, and code pages (user-defined characters in Shift_JIS, really). As one can imagine, the result is chaos, and as a software engineer, I really don’t want to imagine what those poor software engineers have to do to make it “just work” when a message crosses carrier boundaries.

To illustrate my point, let’s look at what Google does when you send a message with some emoji characters from Gmail to each of DoCoMo, Softbank, and au.

The screenshot above is for DoCoMo, but I also repeated the experiment for Softbank and au. From the bounce message, one can easily tell that what is saved as the sent message and what actually gets transmitted to each carrier’s SMTP server all differ in their encodings.

What’s saved in Gmail (visible when you click on “Show original”) embeds the graphics using standard MIME techniques (multipart/related with CID URIs, a.k.a. RFC 2111), plus an extension attribute called goomoji in the HTML version, which carries part of the Unicode private-use code point assigned to the emoji. For example, the crab [カニ] is assigned U+FE1E3 by Google, so its goomoji value is 1E3.
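
As a rough sketch of that relationship in Python 3: the only data point here is the crab (U+FE1E3 giving 1E3), so the base value 0xFE000 and both function names below are my own assumptions, not anything Google documents.

```python
# Assumption: goomoji is the hex offset of the code point from U+FE000.
# This is inferred from a single example (U+FE1E3 -> "1E3") and may not
# hold for every emoji.
ASSUMED_GOOGLE_PUA_BASE = 0xFE000


def goomoji_from_codepoint(codepoint: int) -> str:
    """Return the (assumed) goomoji attribute value for a private-use code point."""
    return format(codepoint - ASSUMED_GOOGLE_PUA_BASE, "X")


def codepoint_from_goomoji(value: str) -> int:
    """Inverse of goomoji_from_codepoint, under the same assumption."""
    return ASSUMED_GOOGLE_PUA_BASE + int(value, 16)


if __name__ == "__main__":
    print(goomoji_from_codepoint(0xFE1E3))          # crab
    print(hex(codepoint_from_goomoji("1E3")))
```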

What’s sent to DoCoMo is a different story altogether:

It’s a standard multipart/alternative message with 2 parts: text/plain and text/html, both encoded in Shift_JIS. Decoding the text/plain part gives:

>>> import base64
>>> sjis = base64.b64decode("W4NKg2pd+aT56ApYT1hPIIFfKF4tXimBXgoKaHR0cDovL3hyaS5uZXQvPXdpbAo=")
>>> sjis
'[\x83J\x83j]\xf9\xa4\xf9\xe8\nXOXO \x81_(^-^)\x81^\n\nhttp://xri.net/=wil\n'
>>> print sjis.decode("shift_jis", 'ignore')
[カニ]
XOXO \(^-^)/

http://xri.net/=wil
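
The session above starts from a base64 string that I had already copied out of the message. Getting from a raw message to that payload could look roughly like this Python 3 sketch. The headers and boundary in RAW are made up for illustration (only the base64 payload is from the real message), and plain_part is a hypothetical helper name.

```python
import email

# Illustrative message only: the structure mirrors what is described above
# (multipart/alternative, Shift_JIS parts), but the boundary and the HTML
# body are placeholders, not what Google actually sent.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=Shift_JIS\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"W4NKg2pd+aT56ApYT1hPIIFfKF4tXimBXgoKaHR0cDovL3hyaS5uZXQvPXdpbAo=\r\n"
    b"--b1\r\n"
    b"Content-Type: text/html; charset=Shift_JIS\r\n"
    b"\r\n"
    b"<p>placeholder</p>\r\n"
    b"--b1--\r\n"
)


def plain_part(raw: bytes):
    """Return (declared charset, undecoded payload bytes) of the first text/plain part."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the base64 transfer encoding but leaves
            # the bytes in their original charset.
            return part.get_content_charset(), part.get_payload(decode=True)
    return None, None


if __name__ == "__main__":
    charset, payload = plain_part(RAW)
    print(charset, payload)
```

Keeping the payload as bytes is deliberate: the declared charset says Shift_JIS, but the carrier-specific bytes in it are not something a stock codec will decode.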

Since DoCoMo doesn’t have the decapod in its emoji set, it gets encoded as カニ (Japanese for crab) in square brackets. Next come the double musical notes (メロディ), which are assigned a user-defined Shift_JIS value of F9A4 in DoCoMo. That explains why I had to pass the ‘ignore’ parameter to the decode method above: Python has no way to map that sequence to Unicode and therefore barfs. The same goes for the tulip (チューリップ). Ignoring the plain text “XOXO”, the last emoticon is a hug face, \(^-^)/, which Google assigned a code point to but which none of the carriers represent with a graphic. Instead, it is mapped to a kaomoji (顔文字, “face characters”).
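
Rather than silently dropping the carrier-specific bytes with ‘ignore’, one could walk the byte string and substitute a label for each user-defined sequence. The following is a minimal Python 3 sketch of that idea, not how Google or DoCoMo actually do it. The table DOCOMO_PRIVATE and the function name are hypothetical; F9A4 is the value mentioned above, while F9E8 for the tulip is only inferred from the byte order in the message. Treating every lead byte from 0xF0 up as user-defined is also an assumption.

```python
import base64

# Hypothetical lookup table. F9A4 comes from the text above; F9E8 is
# inferred from the order of the bytes in the decoded message.
DOCOMO_PRIVATE = {
    b"\xf9\xa4": "[musical notes]",
    b"\xf9\xe8": "[tulip]",
}


def decode_docomo_sjis(data: bytes) -> str:
    """Decode Shift_JIS text, labelling user-defined two-byte sequences."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC
        if is_lead and i + 1 < len(data):
            pair = data[i:i + 2]
            if b >= 0xF0:
                # Assumed user-defined area: look up a label instead of decoding.
                label = DOCOMO_PRIVATE.get(pair, "[unknown %s]" % pair.hex().upper())
                out.append(label)
            else:
                out.append(pair.decode("shift_jis", "replace"))
            i += 2
        else:
            # ASCII, half-width katakana, or a dangling lead byte.
            out.append(data[i:i + 1].decode("shift_jis", "replace"))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    sjis = base64.b64decode(
        "W4NKg2pd+aT56ApYT1hPIIFfKF4tXimBXgoKaHR0cDovL3hyaS5uZXQvPXdpbAo="
    )
    print(decode_docomo_sjis(sjis))
```

Scanning byte by byte keeps lead and trail bytes paired correctly, which a naive search-and-replace on the raw bytes would not guarantee.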

For au (KDDI), the message was also sent as multipart/alternative with text/plain and text/html parts, but this time encoded in ISO-2022-JP. The situation is similar: the crab becomes [カニ], the musical notes and tulip use KDDI’s own version of ISO-2022-JP, and the hug becomes a kaomoji.
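
To see how different the bytes on the wire are even for the one part that every carrier agrees on, here is a small Python 3 sketch that encodes the crab fallback text with the two standard codecs. It only covers standard characters; the carrier-specific emoji extensions are not in Python’s codecs, so they are left out. The function name is mine.

```python
FALLBACK = "[カニ]"  # the crab fallback text from the messages above


def wire_bytes(text: str) -> dict:
    """Encode the same text with the two standard charsets discussed here."""
    return {
        "shift_jis": text.encode("shift_jis"),
        # ISO-2022-JP is a 7-bit encoding that switches character sets
        # with escape sequences (ESC $ B ... ESC ( B).
        "iso2022_jp": text.encode("iso2022_jp"),
    }


if __name__ == "__main__":
    for name, raw in wire_bytes(FALLBACK).items():
        print(name, raw)
```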

It’s a similar deal for Softbank: the charset specified is PDC, but it smells just like Shift_JIS with user-defined characters to me.

I hope by now you have an appreciation of the kind of fiddling that Google engineers had to do in order to get their messages to display properly on Japanese mobile phones, just because the carriers decided to go invent their own mappings and character sets. It’s no wonder that Google and Apple, with its recently announced emoji support in iPhone, are among those supporting this effort.

The ongoing work can be found here, and all the emoji are available here and here in gory detail.

If I had the time and luxury (read: paid) to participate, I would. I wish them all the best and hope to see a good set of emoji in Unicode soon.





2 Responses to “Emoji to be encoded in Unicode”

  1. dready blog v2.0 » Blog Archive » Emoji Encoding Conversion between Carriers Says:

    [...] remember reading about Apple supporting emoji on the iPhone OS 2.2. Now that I’ve upgraded, I decided to try it out but for could never [...]

  2. stepn Says:

    Aww, I want apple to approve of these cute little characters :)
    PLEASE APPLE!!
    I used to be able to send them to different carriers but not anymore.. :(
