Archive for the 'i18n' Category

What’s with the J in Emails?

Thursday, December 1st, 2011

This has bothered me ever since I saw it appearing in emails:

I’d love that J

WTF is that “J”? Does it stand for “joke”? “Jesus”?

After a while it became apparent that it’s somewhat equivalent to a smiley face, but I was still puzzled by it until I peeked under the hood today and found an email sent from Outlook with the following bit in the HTML part:

I'd love that <span style="font-family:Wingdings">J</span>

A-ha!

When rendered in the Wingdings font, you do indeed get a smiley face:

I'd love that ☺

The text/plain part of the email does contain the regular :), so you’d only see the “J” if your device displays the HTML version but doesn’t have the Wingdings font available.
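A minimal sketch of how a mail filter could work around this, assuming the markup looks exactly like the Outlook snippet above. The function name `replace_wingdings_smileys` and the regular expression are illustrative assumptions; real-world HTML varies, and only the letter J is taken from the email quoted here.

```python
import re

# Matches a single character wrapped in a Wingdings-styled span, as in the
# Outlook email quoted above. Real markup may differ (extra styles, other
# quoting), so this pattern is an assumption, not a complete HTML parser.
WINGDINGS_SPAN = re.compile(
    r'<span\s+style="font-family:\s*Wingdings">(.)</span>',
    re.IGNORECASE,
)

# Only "J" comes from the email above.
WINGDINGS_TO_TEXT = {"J": ":)"}


def replace_wingdings_smileys(html):
    """Swap known Wingdings-styled letters for plain-text equivalents."""
    def substitute(match):
        letter = match.group(1)
        # Leave unknown letters untouched rather than guessing.
        return WINGDINGS_TO_TEXT.get(letter, match.group(0))

    return WINGDINGS_SPAN.sub(substitute, html)


print(replace_wingdings_smileys(
    'I\'d love that <span style="font-family:Wingdings">J</span>'
))
```

Falling back to the same :) that Outlook already puts in the text/plain part keeps the result readable on devices without the font.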

Why Internationalize?

Thursday, February 19th, 2009

Seth Godin’s book Tribes: We Need You to Lead Us looks like a good read, especially for marketers, crowd-herders, and entrepreneurs. Along with the book, he also started an invitation-only triiibal network on Ning, and got its members to write an ebook called The Tribes Casebook (free download).

One essay in there, written by Dr. Saleh AlShebil, is titled When Technology Fails: A Language gets Born in an Online Tribe. Dr. AlShebil describes how an ASCII-based language (which he calls Araby) was born from the lack of Arabic input support on early instant messaging networks. It is a transliteration of Arabic into the Latin alphabet, not unlike l33t, though it grew out of different motivations.

Here’s what it looks like (source):

| Sound | Arabic letter | ASCII | Example |
|---|---|---|---|
| /ħ/ (a heavy /h/-type sound) | ح | 7 | wa7ed (one) |
| /ʕ/ (a tightening of the throat resembling a light gargle) | ع | 3 | ba3ad (after) |
| /t’/ (the emphatic version of /t/) | ط | 6 | 6arrash (he sent) |
| /s’/ (the emphatic version of /s/) | ص | 9 | a9lan (actually) |
| /ʔ/ (glottal stop) | ء | 2 | so2al (question) |

So, واحد (one) sounds roughly like “wahed”, and you’d write it as “wa7ed”.
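To make the scheme concrete, here is a small sketch that encodes the five digit substitutions from the table as a lookup. The names `ARABY_DIGITS` and `explain_digits` are made up for illustration, and the mapping covers only the rows listed above, not the whole lingo.

```python
# Digit-to-letter mapping copied from the table above (not exhaustive).
ARABY_DIGITS = {
    "7": "ح",  # /ħ/
    "3": "ع",  # /ʕ/
    "6": "ط",  # emphatic /t/
    "9": "ص",  # emphatic /s/
    "2": "ء",  # glottal stop
}


def explain_digits(word):
    """Return (digit, Arabic letter) pairs for each mapped digit in word."""
    return [(ch, ARABY_DIGITS[ch]) for ch in word if ch in ARABY_DIGITS]


for example in ("wa7ed", "ba3ad", "so2al"):
    print(example, explain_digits(example))
```

This only identifies which Arabic letter each digit stands for; turning a whole Araby word back into Arabic script would also need rules for the Latin letters and vowels, which the table does not give.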

Quoting Dr. AlShebil (emphases added):

Arabic language alphabet is comprised of 28 letters. Some of these letters do not have an equivalent “sound” in English. So what did our online tribe do? They began looking for numbers and other keystrokes that can somehow resemble what the real Arabic letter “looks” like. Let me explain…

For instance, the Arabic letter “ﻉ” is pronounced as A’aa when used in a word and it got replaced with the number “3” since “3” looks like an inverted “ﻉ”. So the word Arabic which is written “Araby” (in Arabic sounding English) and begins with “ﻉ” was then written as “3raby.”

…This new form of tribal net lingo began to spread like wildfire. It would probably be a safe assumption to say that any Arab who is online today (especially the youth) is pretty familiar with it. Using it was not limited to chat and instant messaging but has also swelled to include any form of writing in online communities and even in mobile text messaging (sms). The Arabic net lingo virus caught on to Arabic websites that even wanted their domain names to sound or “look” Arabic.

As mentioned above, this is similar to l33t-speak, and also the lesser-known ギャル文字 (Gyaru-Moji).

Now, I dig subcultures like these, but don’t you think there’s something wrong with the emergence of a new lingo that could potentially erode a language like Arabic just because technology couldn’t support it?

Is this serious enough to erode the Arabic language? Maybe I’m exaggerating but one can imagine youths forgetting how to spell correctly in Arabic script because they’re so used to using “Araby”.

This is the case for why internationalization matters for the Internet (and for technology in general). More importantly, it is the prime motivation behind Internationalized Domain Names, which in turn are a primary contributor to the need for new TLDs.

Internationalization is not a vanity or a luxury; it’s a necessity for preserving culture.

Python 3.0: Text vs. Data Instead Of Unicode vs. 8-bit

Monday, December 15th, 2008

Python 3.0 (Py3K) is out. I’m with Sam Ruby — this seemingly simple change of paradigm from “Unicode vs. 8-bit” to “Text vs. Data” is a breath of fresh air.

What’s inconsistent in this new version, though, is that the new bytes type still has many methods with text semantics that should only make sense on strings, e.g. capitalize() and islower(). I suspect these are provided as convenience methods, which is fine. But one would imagine that they work by decoding the bytes using the default encoding of your locale, then performing the operation on the resulting string. As it turns out from my trials, they do not: they appear to treat only ASCII letters as letters, so a non-ASCII byte such as 0xC2 is neither upper nor lower case and is left unchanged. (Under Latin-1, 0xC2 would be the uppercase “Â”, and isupper() would return True.)


>>> greek_beta = "Β" # This is the uppercase greek letter "beta", not regular B.
>>> greek_beta.isupper()
True
>>> greek_beta.islower()
False
>>> greek_beta.lower()
'β'
>>> greek_beta_bytes = greek_beta.encode('iso8859-7')
>>> greek_beta_bytes
b'\xc2'
>>> greek_beta_bytes.isupper()
False
>>> greek_beta_bytes.islower()
False
>>> greek_beta_bytes.lower()
b'\xc2'
>>> greek_beta_bytes.upper()
b'\xc2'

This is definitely a gotcha that may lead to hard-to-find bugs, so it is best to avoid those methods on bytes objects holding anything other than ASCII.
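A safer pattern, sketched here for Python 3, is to decode the bytes with the encoding they were actually written in, apply the str method, and encode again only if bytes are needed. The helper name `lower_encoded` is hypothetical, and the sketch assumes the caller knows the encoding.

```python
def lower_encoded(data, encoding):
    """Lowercase encoded text by round-tripping through str."""
    text = data.decode(encoding)   # bytes -> text
    return text.lower().encode(encoding)  # text operation, then back to bytes


greek_beta_bytes = "Β".encode("iso8859-7")  # uppercase Greek beta

# The bytes method leaves the non-ASCII byte untouched...
print(greek_beta_bytes.lower() == greek_beta_bytes)

# ...while decoding first gives the real lowercase letter.
lowered = lower_encoded(greek_beta_bytes, "iso8859-7")
print(lowered.decode("iso8859-7"))
```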

Otherwise, this change is definitely more “correct”: Py3K forces you to know the type of your variables earlier, at the interfaces to the outside world, so errors like these are less likely to sneak up on you. For example, you can no longer use + (the sequence concatenation operator) to mix text and data. In Python 2.x, by contrast, you can do:


>>> name = 'Wil'
>>> greet = lambda n: u'Hello ' + n
>>> greet(name)
u'Hello Wil'
>>> name = u'François'.encode('utf-8')
>>> name
'Fran\xc3\xa7ois'
>>> greet(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

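What went wrong is that the code relies on Python’s automatic Unicode conversion, which uses the standard ASCII codec (a good thing) to promote your str to unicode. The first call happens to work because 'Wil' is pure ASCII; the failure only shows up once the data contains a non-ASCII byte.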

In Python 3.x, you will be greeted by the TypeError: Can't convert 'bytes' object to str implicitly message if you try to pass a bytes object to the function. This happens for any bytes object, not just ones containing non-ASCII bytes, so the error is easier to catch.
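Here is a rough Python 3 version of the same greeting, as a sketch. The exact wording of the TypeError message differs between Python 3 releases, so the code prints whatever the interpreter reports instead of assuming a particular text. The variable name `raw_name` is illustrative.

```python
def greet(name):
    return "Hello " + name


# Bytes as they might arrive from a socket or a file.
raw_name = "François".encode("utf-8")

try:
    greet(raw_name)
except TypeError as exc:
    # Raised for any bytes object, even a pure-ASCII one.
    print("TypeError:", exc)

# Decode once at the boundary, then work with text everywhere else.
print(greet(raw_name.decode("utf-8")))
```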

In this new version, the unicodedata module is upgraded to Unicode version 5.1.0.

Now, I’m not ready to run production code on Py3K yet, but it would be nice if Django (my favourite Python-based web framework) could run on it. It looks like Martin von Löwis has started the porting.