Tim Bray: Modern Character String Processing

Tim Bray: Characters vs. Bytes:
This is the first of a three-part essay on modern character string processing for computer programmers. Here I explain and illustrate the methods for storing Unicode characters in byte sequences in computers, and discuss their advantages and disadvantages. These methods have well-known names like UTF-8 and UTF-16.
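
To make the idea of storing characters in byte sequences concrete, here is a minimal sketch in Python (not taken from the essay) that encodes a few sample characters in UTF-8 and UTF-16 and compares the byte counts. The helper name `encoded_sizes` and the choice of sample characters are illustrative assumptions; the big-endian UTF-16 variant is used only so that no byte-order mark is added to the output.

```python
# Illustrative sketch: compare the byte sequences that UTF-8 and UTF-16
# produce for the same Unicode characters. Names here are hypothetical.

# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1D11E (G clef)
SAMPLES = ["\u0041", "\u00e9", "\u20ac", "\U0001D11E"]


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, so no byte-order mark
    return len(utf8), len(utf16)


for ch in SAMPLES:
    u8, u16 = encoded_sizes(ch)
    print(
        "U+%04X: UTF-8 = %s (%d bytes), UTF-16 = %s (%d bytes)"
        % (ord(ch), ch.encode("utf-8").hex(), u8,
           ch.encode("utf-16-be").hex(), u16)
    )
```

In this sketch the ASCII letter takes one byte in UTF-8 but two in UTF-16, the euro sign takes three bytes in UTF-8 but two in UTF-16, and the character outside the Basic Multilingual Plane takes four bytes in both, which hints at the kind of trade-off between the encodings that the essay goes on to discuss.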

The next essay will consider string handling in the Java, and to a lesser extent C#, computer languages and argue that it is significantly broken, both in terms of efficiency and correctness. The third essay will propose a new approach to string handling in Java.

Tim Bray: Programming Languages and Text:
Welcome to another installment in ongoing’s ongoing tour through
text-processing issues.
This one is about programming-language support, and while it makes specific
reference to Java, it tries to be generally applicable to modern software.
The conclusion is that Java is OK for some kinds of text processing, but
has real problems when the lifting gets heavy.
Last time out I said this
was going to be a three-part essay, but now I realize I’d already written
two other text-processing-centric pieces before that, one an
intro to Unicode, and the
other entitled
On Character Strings.
The present essay will recapitulate some of the material in that second note,
but no matter how you cut it, we’re already (to quote Douglas Adams) on
volume four of the trilogy.
To make it worse, I’m gestating some essays on full-text search,
so we’ll just call it a continuing series.
