@Paul: I LOVE that you brought up specific arguments — seriously. This discussion just got some actual substance.
I also like your examples. With regard to your first example, if it’s taking that long (and I could replicate the experiment), then it’s clearly counting from the beginning of the string each time, which is bound to be inefficient. But that’s to be expected because it’s UTF-8, a variable-width encoding.
With regard to your attempt to help Ruby out, it appears from reading my 1.9 Pickaxe book (which you couldn’t have done because I’m guessing you haven’t bought it) that the reason for the error is that Ruby wants you to specify little-endian or big-endian. Again, this makes sense if you think about it a bit, because Ruby’s exposing the fact that internally, it’s always storing the data as an ordered sequence of bytes. So you’ll have better luck if you do this:
ruby19 -e "'abc'.encode('UTF-16LE')"
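For anyone who wants to see what the endianness choice means at the byte level, here’s a small sketch. It’s my own illustration rather than anything from the Pickaxe book, and I’m assuming a reasonably current Ruby; the exact behavior on an early 1.9 build may differ.

```ruby
# Encode the same three characters with both byte orders.
le = 'abc'.encode('UTF-16LE')
be = 'abc'.encode('UTF-16BE')

# .to_a keeps this working whether String#bytes returns an Array or an Enumerator.
le_bytes = le.bytes.to_a
be_bytes = be.bytes.to_a

puts le.encoding.name   # the string carries its encoding with it
puts le.length          # counted in characters
puts le.bytesize        # counted in bytes
p le_bytes              # low byte first:  [97, 0, 98, 0, 99, 0]
p be_bytes              # high byte first: [0, 97, 0, 98, 0, 99]
```

Same characters and same length, but a different ordered set of bytes underneath, which is why Ruby needs to know which one you mean.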
You can see a list of all the encodings that are loaded by default by doing this:
./ruby -e 'Encoding.list.each {|enc| puts enc.name}'
When I replicate your experiment with it encoded as UTF-32LE, it runs in 50 seconds on my machine, which is clearly sub-optimal: the equivalent program using ASCII on Ruby 1.8 takes under a second on my machine. I’m not sure why there’s such a performance disparity. There’s certainly nothing preventing the implementation I described, which would make every encoding-specific operation about as expensive as a C virtual function call. Hopefully they will optimize this in the future.
So you’ve demonstrated that the Ruby 1.9 implementation of multiple encodings is not currently very efficient. I don’t think you’ve demonstrated that it’s confusing (though it still might be; I don’t have the experience to say).
But one performance-related characteristic of the Python approach is that you always pay the cost to transcode into Unicode first. So take a program that does nothing but count the number of characters in a file. Python will always run the entire file through a transcoder first, then perform the len() operation on Unicode internally. Using the Ruby model, it could conceivably perform the least amount of work possible: nothing but an algorithm optimized to do len() on the input byte data directly, for whatever encoding the input is in. The cost to read data from the outside world into a string is essentially a memcpy (unless you want to do validation up-front). With Python you always pay a transcode up-front, unless your data is already in Python’s internal format (UCS-2? UCS-4? UTF-8? Some mixture? I don’t actually know what Python does here, and would be interested in more info.)
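To make the "len() on the byte data directly" idea concrete, here’s a rough sketch of what a per-encoding character count could look like with no transcoding step. The char_count helper is hypothetical (my own name, not anything in Ruby’s API), it assumes the input bytes are already valid for the named encoding, and I’m not claiming this is how Ruby 1.9 does it internally.

```ruby
# Hypothetical helper: count characters straight from raw bytes.
# Assumes the bytes are valid for the named encoding (no validation is done).
def char_count(bytes, encoding_name)
  case encoding_name
  when 'US-ASCII', 'ISO-8859-1'
    # Single-byte encodings: one byte per character.
    bytes.bytesize
  when 'UTF-8'
    # Each character has exactly one byte that is not a continuation byte (10xxxxxx).
    bytes.each_byte.count { |b| (b & 0xC0) != 0x80 }
  when 'UTF-16LE'
    # Count 16-bit units, skipping low surrogates so a surrogate pair counts once.
    bytes.unpack('v*').count { |u| !(0xDC00..0xDFFF).cover?(u) }
  when 'UTF-32LE'
    # Fixed width: four bytes per character.
    bytes.bytesize / 4
  else
    raise ArgumentError, "no byte-level counter for #{encoding_name}"
  end
end

# Seven characters: five ASCII-range letters (one accented), a space, and one outside the BMP.
text = "h\u00e9llo \u{1F600}"

utf8_count  = char_count(text.encode('UTF-8').b, 'UTF-8')
utf16_count = char_count(text.encode('UTF-16LE').b, 'UTF-16LE')
utf32_count = char_count(text.encode('UTF-32LE').b, 'UTF-32LE')

puts utf8_count, utf16_count, utf32_count
```

None of the branches ever builds a Unicode copy of the data; each one just walks (or measures) the bytes it was handed. The encode calls at the bottom are only there to manufacture test input, standing in for bytes read from a file.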
So I think that in the end, Ruby’s approach actually has *greater* performance potential, though I can’t vouch for whether it’s currently optimized very well.