Although Tim Berners-Lee richly deserves his knighthood for creating one of the most important technologies of the 20th century, in one respect the World Wide Web has failed to deliver. It may have been global from the start - potentially accessible anywhere in the world - but making it truly international - able to reflect all cultures, irrespective of their language or writing system - has been an enormous struggle for the non-Anglophone world.
The first problem to be addressed was how to create Web pages with characters other than standard ASCII. The solution seemed simple enough: the use of extended character sets, which allowed different non-ASCII characters to be employed on a per-page basis. But this solution brought its own problems, since many competing extensions existed for any given script.
Therefore, an overarching approach called Unicode was developed that defined a single, universal coding scheme embracing all scripts. Unicode may not yet include everything, but all the major families are there, and many of the less common ones will be added soon (even Egyptian hieroglyphs are being worked on).
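The core idea is easy to see in practice: Unicode assigns every character, whatever its script, a single numeric code point. A minimal Python sketch (my illustration, not part of the original article):

```python
# Every character gets exactly one Unicode code point, regardless of script.
for ch in "Aöж世":
    print(f"U+{ord(ch):04X}  {ch}")
# A  is U+0041 (Latin), ö is U+00F6 (Latin-1 Supplement),
# ж is U+0436 (Cyrillic), 世 is U+4E16 (CJK Unified Ideographs)
```

The same code point identifies the same character on every system, which is precisely what the per-page extended character sets could not guarantee.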
Unicode addresses part of the problem that international Web pages pose: how to bring in extra characters in a consistent manner. But it leaves open another question: how to represent digitally the tens of thousands of different characters that go to make up the Unicode set. In fact, online, the challenge is even greater: how to represent those characters compactly in binary while preserving backward compatibility with existing systems.
The most popular solution is UTF-8 (short for Universal Multiple-Octet Coded Character Set Transformation Format 8). It was invented in 1992 by no less a person than Ken Thompson, sketching on the proverbial place-mat; together with co-inventor Rob Pike, he later published a paper on the subject, aptly entitled "Hello World". A useful FAQ on Unicode and UTF-8 issues fills in the details.
A wide range of practical resources exists in this area: test pages, help in setting up Unicode support in browsers and other programs and in resolving display problems, and guidance on creating multilingual Web pages.
Even this is by no means the end of the story. Unicode may make the content truly international, but it does nothing to solve an equally pressing issue: how to create domain names using non-ASCII characters.
This problem has taken far longer to solve. ICANN, the main body governing Internet names, finally released guidelines on internationalised domain names last year, based on three RFCs: RFC 3490, RFC 3491 and RFC 3492. The last of these defines something called Punycode, which maps a Unicode string into the subset of ASCII characters allowed in host name labels (letters, digits and hyphens). There are some examples of internationalised domains in the .nu domain.
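The mapping can be tried out with Python's built-in codecs, which implement the IDNA 2003 scheme those RFCs describe (a sketch of mine, not from the original piece): each label containing non-ASCII characters is re-expressed in Punycode and given the "xn--" prefix.

```python
# Python's built-in "idna" codec implements RFC 3490/3492: each label with
# non-ASCII characters is Punycode-encoded and prefixed with "xn--".
name = "öko.de"
ascii_form = name.encode("idna")            # ASCII-compatible encoding
print(ascii_form)                           # b'xn--ko-eka.de'
assert ascii_form.decode("idna") == name    # round-trips back to Unicode

# The raw Punycode of a single label, without the "xn--" prefix:
print("öko".encode("punycode"))             # b'ko-eka'
```

Resolvers and registries never see the Unicode form at all - only the ASCII-compatible "xn--" version travels through the DNS, which is how the scheme stays backward compatible.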
The pent-up demand for such domain names can be judged from the fact that the registry for the German .de domain, DENIC, recently registered more than 130,000 of them in the first 48 hours of their availability. For the record, the first domain name with an umlaut was öko.de.
And yet there is a deep irony in all this. Before these latest moves, one simple standard for writing Internet addresses was in place: a subset of ASCII. The arrival of internationalised domain names means that there will now be hundreds of different character sets deployed, most of which will be meaningless to any given user. The Internet will gradually become Balkanised, splitting up into islands of comprehensibility, defined by the character sets they employ - a result rather at odds with the traditional view of its unifying influence.
Glyn Moody welcomes your comments.
Posted by Glyn Moody in Around the Net