Python 3.0: Text vs. Data Instead Of Unicode vs. 8-bit

Monday, December 15th, 2008

Python 3.0 (Py3K) is out. I’m with Sam Ruby — this seemingly simple change of paradigm from “Unicode vs. 8-bit” to “Text vs. Data” is a breath of fresh air.

What’s inconsistent in this new version, though, is that the new bytes type still carries many methods with text semantics that only make sense on strings, e.g. capitalize() and islower(). I suspect these are provided as convenience methods, which is fine. One would expect them to decode the bytes using your locale’s default encoding and then operate on the resulting string. In my trials, however, they seem to assume that the bytes are encoded in Latin-1:

>>> greek_beta = "Β" # This is the uppercase greek letter "beta", not regular B.
>>> greek_beta.isupper()
>>> greek_beta.islower()
>>> greek_beta.lower()
>>> greek_beta_bytes = greek_beta.encode('iso8859-7')
>>> greek_beta_bytes
>>> greek_beta_bytes.isupper()
>>> greek_beta_bytes.islower()
>>> greek_beta_bytes.lower()
>>> greek_beta_bytes.upper()

This is definitely a gotcha that can lead to hard-to-find bugs, so it is best to avoid calling those methods on bytes objects.
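A safer pattern is to decode with the codec you know the data uses, do the case operation on the resulting str, and encode again. The following is only a minimal sketch: the helper name `lower_encoded` is made up for illustration, and it assumes you actually know the encoding. Note also that current CPython 3 releases document the bytes case methods as touching ASCII characters only, so the exact results may differ from what the 3.0 session above showed.

```python
# Sketch: apply case operations to text, not to encoded bytes.
# `lower_encoded` is a hypothetical helper, not part of any library.

def lower_encoded(data: bytes, encoding: str) -> bytes:
    """Decode, lowercase as text, then re-encode with the same codec."""
    return data.decode(encoding).lower().encode(encoding)


# U+0392 is GREEK CAPITAL LETTER BETA; ISO 8859-7 stores it as the byte 0xC2.
greek_beta_bytes = "\u0392".encode("iso8859-7")

# With an explicit codec the result is the Greek small beta (byte 0xE2).
lowered = lower_encoded(greek_beta_bytes, "iso8859-7")
print(lowered, lowered.decode("iso8859-7"))

# Calling lower() directly on the bytes leaves non-ASCII bytes unchanged
# in current CPython 3, because only ASCII letters are considered.
print(greek_beta_bytes.lower())
```

The point of routing everything through one helper is that the codec is named exactly once, at the place where the bytes are turned into text.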

Otherwise, this change is definitely more “correct”: Py3K forces you to know the type of your variables earlier, at the interfaces to the outside world, so errors like these are less likely to sneak up on you. For example, you can no longer use + (the sequence concatenation operator) to mix text and data. In Python 2.x, by contrast, you can do:

>>> name = 'Wil'
>>> greet = lambda n: u'Hello ' + n
>>> greet(name)
u'Hello Wil'
>>> name = u'François'.encode('utf-8')
>>> name
'Fran\xc3\xa7ois'
>>> greet(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

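What went wrong is that the code relies on Python 2’s automatic Unicode conversion, which uses the standard ASCII codec to promote your str to unicode. Falling back to ASCII is a good thing, because it refuses to guess an encoding, but it also means the failure only shows up once a non-ASCII byte arrives.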

In Python 3.x, you are greeted by the message TypeError: Can't convert 'bytes' object to str implicitly if you try to pass a bytes object to the function. This happens for any bytes object, even one that contains only ASCII, so the error is much easier to catch.
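Here is a rough Python 3 version of the same experiment. It is a sketch rather than a recipe: `raw_name` stands in for bytes arriving from a file or socket, and the decode step assumes the sender really used UTF-8. The exact wording of the TypeError message differs between 3.x releases, so the code only checks the exception type.

```python
# Sketch for Python 3: mixing str and bytes fails for every bytes value,
# so the bug shows up on the first call, not on the first non-ASCII name.

def greet(n: str) -> str:
    return "Hello " + n


raw_name = "Fran\u00e7ois".encode("utf-8")  # stand-in for externally supplied bytes

results = []
for candidate in (b"Wil", raw_name):
    try:
        greet(candidate)  # deliberately wrong: bytes where text is expected
        results.append("ok")
    except TypeError:
        results.append("TypeError")

print(results)  # both the ASCII-only and the non-ASCII bytes are rejected

# Decode once at the boundary (assuming UTF-8), then work with text only.
greeting = greet(raw_name.decode("utf-8"))
print(greeting)
```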

In this new version, the unicodedata module is upgraded to Unicode version 5.1.0.
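If you want to check which Unicode database your own interpreter ships with, the module exposes it as a string. A quick sketch (the version printed depends on the Python release you run, so no particular value is assumed here):

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# Look up the official name of the character used in the earlier example.
beta_name = unicodedata.name("\u0392")
print(beta_name)
```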

Now, I’m not ready to run production code on Py3K yet, but it would be nice if Django (my favourite Python-based web framework) could run on it. It looks like Martin von Löwis has started the port.