Archive for the 'unicode' Category

Emoji Encoding Conversion between Carriers

Sunday, February 1st, 2009

I remember reading about Apple supporting emoji in iPhone OS 2.2. Now that I’ve upgraded, I decided to try it out, but I could never find it after hunting through the keyboard preferences. Googling showed that these cute little emoticons are only available to Softbank users. Thankfully, Steven Troughton-Smith has figured out that editing a file in your iPhone backup makes the “emoji” option show up under Settings -> General -> Keyboard -> International Keyboards -> Japanese!

Now that I can enter these emoji on my iPhone, I figured I’d try it out by sending an email to myself. Alas, all I get is a list of boxes. Time to look at the message content (relevant fields):

(more…)

Emoji to be encoded in Unicode

Friday, January 30th, 2009

The Unicode Technical Committee is working on encoding emoji (絵文字) in the Unicode Standard and ISO/IEC 10646. This has spurred loads of discussion on the Unicode mailing list, with more than a handful of forked threads, leading to fundamental questions such as whether emoji should be encoded at all and what constitutes a character.

Not to worry: the Unicode Consortium is a veteran when it comes to dealing with the hairy issues of creating standards that work across languages, cultures and geographic regions. It simply can’t please everyone.

To me, the motivation for this is clear: interoperability. The current state of affairs in the Japanese mobile industry leaves a lot to be desired. Across the carriers, there are different sets of supported emoji, different private-use characters, different substitution mappings, and different code pages (user-defined characters in Shift_JIS, really). As one can imagine, the result is chaos, and as a software engineer, I really don’t want to imagine what those poor software engineers have to do to make it “just work” when a message crosses carrier boundaries.

To illustrate my point, let’s look at what Google does when you send a message with some emoji characters from Gmail to each of DoCoMo, Softbank, and au.

(more…)

Python 3.0: Text vs. Data Instead Of Unicode vs. 8-bit

Monday, December 15th, 2008

Python 3.0 (Py3K) is out. I’m with Sam Ruby — this seemingly simple change of paradigm from “Unicode vs. 8-bit” to “Text vs. Data” is a breath of fresh air.

What’s inconsistent in this new version, though, is that the new bytes type still has many methods with text semantics that should only make sense on strings, e.g. capitalize() and islower(). I suspect these are provided as convenience methods, which is fine. But one would imagine that these bytes methods work by decoding the bytes using the default encoding of your locale, then performing the operation on the resulting string. As it turns out from my trials, they do no decoding at all: only ASCII letters are recognised, and every byte outside the ASCII range is treated as caseless and left untouched. (Had Latin-1 been assumed, byte 0xC2 would be the uppercase “Â” and isupper() would return True.)


>>> greek_beta = "Β" # This is the uppercase Greek letter beta, not the Latin letter B.
>>> greek_beta.isupper()
True
>>> greek_beta.islower()
False
>>> greek_beta.lower()
'β'
>>> greek_beta_bytes = greek_beta.encode('iso8859-7')
>>> greek_beta_bytes
b'\xc2'
>>> greek_beta_bytes.isupper()
False
>>> greek_beta_bytes.islower()
False
>>> greek_beta_bytes.lower()
b'\xc2'
>>> greek_beta_bytes.upper()
b'\xc2'

This is definitely a gotcha that may lead to hard-to-find bugs, so it is best to avoid using those methods on bytes objects.
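One way around the gotcha, sketched here for Python 3, is to decode the bytes with the encoding you know they were written in and call the text methods on the resulting str. The helper name lower_bytes is made up for illustration, and it assumes the caller already knows the encoding; nothing in it can detect one.

```python
def lower_bytes(data, encoding):
    """Lowercase encoded text by round-tripping through str.

    Hypothetical helper: `encoding` must be the encoding the bytes
    were actually produced with.
    """
    return data.decode(encoding).lower().encode(encoding)


# "\u0392" is the Greek capital letter beta used in the session above.
greek_beta_bytes = "\u0392".encode("iso8859-7")

# bytes.lower() only touches ASCII letters, so this byte is unchanged.
ascii_only = greek_beta_bytes.lower()

# Decoding first lets str.lower() apply the Unicode case mapping.
decoded_first = lower_bytes(greek_beta_bytes, "iso8859-7")

print(ascii_only)
print(decoded_first)
```

The round trip costs a decode and an encode, but it makes the encoding an explicit argument instead of a silent assumption.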

Otherwise, this change is definitely more “correct”: Py3K forces you to know the type of your variables earlier, at the interfaces to the outside world, so errors like these are less likely to sneak up on you. For example, you can no longer use + (the sequence concatenation operator) to mix text and data, whereas in Python 2.x you can do:


>>> name = 'Wil'
>>> greet = lambda n: u'Hello ' + n
>>> greet(name)
u'Hello Wil'
>>> name = u'François'.encode('utf-8')
>>> name
'Fran\xc3\xa7ois'
>>> greet(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

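What went wrong is that the code relies on Python’s automatic Unicode conversion, which uses the standard ASCII codec to promote your str to unicode. Defaulting to ASCII is a good thing, since Python never guesses an encoding for you, but it means the failure only shows up once the data contains a non-ASCII byte.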

In Python 3.x, you are greeted by the message TypeError: Can't convert 'bytes' object to str implicitly if you try to pass a bytes object to the function. This happens with any bytes object, not only those containing non-ASCII bytes, so the error is easier to catch.
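In Python 3, the usual fix is to decode once at the boundary where the bytes enter the program and pass only str to greet. The sketch below assumes the incoming bytes are UTF-8, as in the example above; raw_name and the other variable names are illustrative.

```python
def greet(n):
    return 'Hello ' + n


# bytes, as if they had just been read from a socket or a file
raw_name = 'Fran\u00e7ois'.encode('utf-8')

# Passing bytes fails immediately, whatever the bytes contain.
try:
    greet(raw_name)
    failed = False
except TypeError:
    failed = True

# Decode once at the boundary, then work with text only.
greeting = greet(raw_name.decode('utf-8'))

print(failed)
print(greeting)
```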

In this new version, the unicodedata module is upgraded to Unicode version 5.1.0.
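To see which Unicode version your own interpreter’s unicodedata module is built against, you can check its unidata_version attribute. The value depends on the Python release, so no particular version is assumed in this small sketch.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# Character properties are looked up in that database.
beta = "\u0392"
print(unicodedata.name(beta))      # GREEK CAPITAL LETTER BETA
print(unicodedata.category(beta))  # Lu (uppercase letter)
```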

Now, I’m not ready to run production code on Py3K yet, but it would be nice if Django (my favourite Python-based web framework) could run on it. It looks like Martin von Löwis has started the port.