What is inconsistent in this new version, though, is that the new bytes type still has many methods with text semantics that should only make sense on strings, e.g. islower(). I suspect these are provided as convenience methods, which is fine. But one would expect these bytes methods to work by decoding the bytes using the default encoding of your locale and then performing the operation on the resulting string. From my trials, however, they seem to assume that your bytes are encoded in Latin-1:
>>> greek_beta = "Β" # This is the uppercase greek letter "beta", not regular B.
>>> greek_beta_bytes = greek_beta.encode('iso8859-7')
This is definitely a gotcha that may lead to hard-to-find bugs, so it is best to avoid using those methods on bytes objects.
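A minimal sketch of a safer pattern, assuming you know which codec produced the bytes: decode first, then ask the text question on the resulting str. The variable names are illustrative. Note that current CPython 3 releases document the bytes methods as considering ASCII characters only, so the exact result of calling them on encoded Greek text may differ from the trial above; either way the method never consults the encoding you had in mind.

```python
# Uppercase Greek beta (U+0392), written as an escape so it cannot be
# confused with the Latin letter B.
greek_beta = "\u0392"
greek_beta_bytes = greek_beta.encode("iso8859-7")  # a single byte, 0xC2

# str carries full Unicode semantics, so this is a cased, uppercase letter.
text_result = greek_beta.isupper()

# bytes knows nothing about ISO 8859-7; on current CPython the byte 0xC2
# is not an ASCII letter at all, so the answer is False.
bytes_result = greek_beta_bytes.isupper()

# Decode explicitly with the codec that was actually used, then test.
decoded_result = greek_beta_bytes.decode("iso8859-7").isupper()

print(text_result, bytes_result, decoded_result)
```

The explicit decode keeps the choice of encoding visible at the boundary instead of leaving it to whatever the bytes method happens to assume.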
Otherwise, this change is definitely more “correct”: Py3K forces you to know the types of your variables earlier, at the interfaces to the outside world, so errors like these are less likely to sneak up on you. For example, you can no longer use the + (sequence concatenation) operator to mix text and data, whereas in Python 2.x you can do:
>>> name = 'Wil'
>>> greet = lambda n: u'Hello ' + n
>>> name = u'François'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
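What went wrong is that you were relying on Python’s automatic Unicode conversion, which uses the standard ASCII codec (which is a good thing) to promote your str to unicode. The conversion only fails once the str contains non-ASCII bytes, so the bug can stay hidden until such data shows up.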
In Python 3.x, you will be greeted by the message TypeError: Can't convert 'bytes' object to str implicitly if you try to pass a bytes object to the function. This happens for any bytes object, not just ones containing non-ASCII data, so the error is easier to catch.
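The Python 3 behaviour can be sketched as follows. The greet function mirrors the lambda from the Python 2 example; the other names are illustrative. The exact wording of the TypeError message varies between Python 3 releases, so the sketch catches the exception type rather than matching the text.

```python
def greet(name):
    # Concatenating str with str is fine; str with bytes is not.
    return 'Hello ' + name

name_bytes = 'François'.encode('utf-8')

# Passing bytes fails immediately, even if the bytes are pure ASCII.
try:
    greet(name_bytes)
    error = None
except TypeError as exc:
    error = exc

try:
    greet(b'Wil')
    ascii_error = None
except TypeError as exc:
    ascii_error = exc

# The fix is to decode at the boundary, naming the codec explicitly.
greeting = greet(name_bytes.decode('utf-8'))
print(greeting)
```

Because the ASCII-only bytes value fails in the same way, the mistake shows up on the first call rather than on the first non-ASCII input.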
In this new version, the unicodedata module is upgraded to Unicode version 5.1.0.
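If you want to check which Unicode database version your own interpreter ships with, a short sketch using the module's unidata_version attribute looks like this. Later Python releases bundle newer Unicode versions, so the value printed will depend on your installation.

```python
import unicodedata

# Version string of the Unicode database compiled into this interpreter.
version = unicodedata.unidata_version
print(version)

# Character names come from that same database.
beta_name = unicodedata.name("\u0392")
print(beta_name)
```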