Internationalize your software, Part 1

Learn how to develop software for the global marketplace

Software usually reflects the conventions of the region or country in which it was developed. Anyone who uses this software is expected to be familiar with these conventions. For example, a developer based in France would probably create software that displays text in French, uses the French franc for currency calculations (at least until the European Union's Euro replaces it), and formats numbers with a space character as a thousands separator and a comma (,) as a decimal point (for example, 87 356,32). An American developer would probably create software that displays text in English, uses dollars, and formats numbers with a comma (,) as a thousands separator and a period (.) as a decimal point (for example, 87,356.32).
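The French and American number formats described above can be reproduced directly with Java's java.text.NumberFormat class. Below is a minimal sketch (the exact French grouping separator is a nonbreaking space in many JDKs):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class NumberFormatDemo {
    public static void main(String[] args) {
        double value = 87356.32;

        // United States: comma as thousands separator, period as decimal point
        NumberFormat us = NumberFormat.getNumberInstance(Locale.US);
        System.out.println(us.format(value)); // 87,356.32

        // France: space as thousands separator, comma as decimal point
        NumberFormat fr = NumberFormat.getNumberInstance(Locale.FRANCE);
        System.out.println(fr.format(value)); // 87 356,32 (space is a nonbreaking space)
    }
}
```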

Software that breaks free of a single region's or country's conventions is known as international software. Although it can be costly to develop international software -- because of the time required to write the software, the cost of language translation, and so on -- the results can be rewarding. There is simply a larger market for this software.

"Internationalize your software" is a three-part series that explores the topic of developing Java-based software for an international audience, and it's divided into numerous subtopics. The following subtopics are covered in Part 1:

  • Internationalization and localization
  • Characters and character definition standards
  • Locales
  • Resource bundles

Java applets are used to illustrate Java's internationalization and localization features. These applets were compiled with the JDK 1.1.6 compiler and tested with the JDK 1.1.6 appletviewer and Netscape Navigator 4.06 programs. Netscape was running version 1.1.5 of the Java runtime environment during testing.

First, let's define internationalization and localization.

Internationalization and localization

The process of designing an application that can automatically adapt to different regions and countries without the need for recompilation is called internationalization. Because this word contains 18 letters between the first i and the last n, a shorter term, i18n, is sometimes used. A truly internationalized program contains no hard-coded region- or country-specific elements -- for example, audio clips, text (GUI labels and other messages), graphics, currency/date/number formats, and so on. Instead, these elements are stored outside of the program, meaning that the program doesn't need to be recompiled every time a new region or country requires support.
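One way to keep text out of a program in Java is the java.util.ResourceBundle mechanism. Below is a minimal, self-contained sketch; the bundle classes, the "greeting" key, and their contents are hypothetical, and in practice each bundle would live in its own source or properties file so new locales could be added without recompiling the program:

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

public class I18nSketch {
    // Default (English) elements -- hypothetical contents
    public static class Labels extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] { { "greeting", "Hello" } };
        }
    }

    // French elements -- the _fr suffix is how ResourceBundle finds this bundle
    public static class Labels_fr extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] { { "greeting", "Bonjour" } };
        }
    }

    public static void main(String[] args) {
        // The program asks for an element by key; the bundle matching the
        // requested locale supplies the region-specific value.
        ResourceBundle bundle =
            ResourceBundle.getBundle("I18nSketch$Labels", Locale.FRANCE);
        System.out.println(bundle.getString("greeting")); // Bonjour
    }
}
```

Supporting a new region then means supplying another bundle (for example, Labels_de for German) rather than changing and recompiling the program itself.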

The process of creating a set of region- or country-specific elements (including the translation of text) to support a new region or country is called localization. Because this word contains 10 letters between the first l and the last n, a shorter term, l10n, is sometimes used. The most time-consuming part of localization usually involves translating text. However, region- or country-specific elements such as currency/date/number formats also need to be verified, and this can take considerable time as well. Below is a partial checklist of elements that should be verified when localizing an internationalized program for a new region or country.

  • Icons
  • Text (GUI labels, other messages)
  • Audio clips
  • Online help
  • Currency/date/number formats
  • Calendars
  • Measurements
  • Colors
  • Graphics
  • Phone numbers
  • Addresses
  • Titles and honorifics
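Some of these elements can be inspected quickly during verification. Below is a sketch that prints the currency and date formats Java produces for a few locales, so a localizer can check them against local conventions:

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

public class FormatCheck {
    public static void main(String[] args) {
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.JAPAN };
        Date now = new Date();
        for (int i = 0; i < locales.length; i++) {
            // Currency format: symbol, separators, and grouping all vary
            NumberFormat currency = NumberFormat.getCurrencyInstance(locales[i]);
            // Date format: field order and month names vary
            DateFormat date = DateFormat.getDateInstance(DateFormat.LONG, locales[i]);
            System.out.println(locales[i] + ": " + currency.format(1234.56)
                               + "  " + date.format(now));
        }
    }
}
```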

A correctly internationalized program detects the region/country currently using the program and loads the appropriate elements prior to interacting with a user. And a correctly localized program has all of its region/country-specific elements properly verified and stored outside the program for each region/country that will use the program.
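In Java, the detection step usually amounts to querying the default locale, which the runtime initializes from the host's regional settings. A minimal sketch:

```java
import java.util.Locale;

public class DetectLocale {
    public static void main(String[] args) {
        // The Java runtime sets the default locale from the host environment
        Locale here = Locale.getDefault();
        System.out.println("Language: " + here.getLanguage()); // e.g. "en"
        System.out.println("Country:  " + here.getCountry());  // e.g. "US"
        // An internationalized program would use this locale to select its
        // externally stored elements before interacting with the user.
    }
}
```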

Characters and character definition standards

Human beings understand symbols -- letters, digits, punctuation, and so on -- while computers recognize binary numbers. Symbols must be mapped to binary numbers in order for a computer to be used effectively. Once mapped, the association between a symbol and a binary number is known as a character. The set of all mappings used on a particular computer is known as that computer's character set. Over the years, various standards have been developed for defining characters. Three of these standards have gained considerable fame: EBCDIC, ASCII, and Unicode. Following is a description of each of these character standards:
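The symbol-to-number mapping is visible directly in Java, where a char can be treated as the binary number it stands for:

```java
public class CharMapping {
    public static void main(String[] args) {
        char symbol = 'A';
        int number = symbol;            // the binary value behind the symbol
        System.out.println(number);     // 65, in both ASCII and Unicode

        // The next number in the character set maps to the next symbol
        System.out.println((char) (number + 1)); // B
    }
}
```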

EBCDIC Extended Binary Coded Decimal Interchange Code (EBCDIC) was developed by IBM as a standard for associating 8-bit binary values, ranging from 0 to 255, with symbols taken from the English language. EBCDIC is a complex and proprietary code, existing in at least six mutually incompatible versions. (Even standards can get it wrong!) EBCDIC is used mainly in the mainframe world.

ASCII The American Standard Code for Information Interchange (ASCII) was developed by the American National Standards Institute (ANSI) committee as a standard for associating 7-bit binary values, ranging from 0 to 127, with symbols taken from the English language. ASCII is used primarily by smaller (that is, nonmainframe) computers.

Tables 1 and 2 identify all of the characters defined by ASCII. The first table lists those characters that have been used in the past to control devices, such as teletypes and printers, or have a special purpose in a programming language. (For example, the ASCII null [0] character is used in the C/C++ languages to identify the end of a sequence of characters -- a string.) These characters are not displayed to the user. The second table lists characters that can be displayed to the user.

Table 1: Control characters

Table 2: Displayable characters

There is a problem with the EBCDIC and ASCII character definition standards. Both standards are based on the English language and have no room for growth. How can they represent the many thousands of written characters used by modern and ancient languages? This problem has been addressed during the last few years and has resulted in the emergence of a new character definition standard called Unicode.

Unicode The Unicode standard maps symbols to 16-bit binary numbers, which gives Unicode the ability to define 65,536 distinct characters. As of version 2.1, Unicode has defined a total of 38,887 characters. In contrast, ASCII defines a maximum of 128 characters. A link to the official Unicode Internet site is available in the Resources section.

The Java language supports Unicode. For example, Java's char (character) type has a defined size of 16 bits, allowing it to hold any one of Unicode's 65,536 characters. In contrast, the C and C++ languages have a char type with no defined size (that is, the size can vary from one platform to another); it is usually 8 bits, allowing it to hold a maximum of 256 characters. The Java language also defines a Unicode character constant notation. The '\uxxxx' notation identifies a Unicode character, where xxxx represents that character's hexadecimal code (a number ranging from 0000 to FFFF, covering the entire 65,536-character Unicode range).

ANSI/ISO C also defines the wchar_t wide-character type. This type is intended for representing characters from the ISO 10646 Universal Character Set. However (on various platforms), it can also be used to represent Unicode characters. For more information about this type, check out Wikipedia's Wide character entry and the UTF-8 and Unicode FAQ.

Below is a code fragment that defines several Unicode international character constants and prints them; the character each constant stands for is shown in the comments.

char [] characters = { '\u00e5', '\u00a5', '\u00c7' }; // a-ring (å), yen (¥), C-cedilla (Ç)
System.out.println (new String (characters));          // prints å¥Ç