When is a character not a character?

Not every value that's assigned to a char variable represents a character

Just when you thought you knew everything there is to know about the Java language, along comes something new to challenge your sense of complete mastery. For example, did you know that Java lets you declare a class within an interface, as in interface A { class B {} }? Also, were you aware that you can "add" cast operators together in an assignment statement such as byte i = (byte) + (short) + (double) 2000;? (You're not really adding cast operators.) Neither of these oddities is the subject of this post. Instead, I'm going to show you that not every value that can be assigned to a char variable denotes a character.

Characters and Unicode

A character is an abstract minimal unit of text: it has no shape of its own (a font's glyph provides the shape) and no inherent numeric value. An example is "a", which I've placed in double quotes to signify its abstractness.

A character set is a collection of characters, and a coded character set is a character set in which code points (numeric values) are associated with characters. For example, the American Standard Code for Information Interchange (ASCII) is a coded character set (e.g., hexadecimal value 41 is assigned to "A").
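As a small illustration of a coded character set, the following sketch shows Java exposing the numeric value behind "A". The class and method names (CodedCharDemo, codePointOf) are made up for this example. It relies on the fact that Unicode's first 128 code points coincide with ASCII.

```java
public class CodedCharDemo
{
   // Returns the numeric value that Java associates with the given char.
   // (Hypothetical helper, named only for this illustration.)
   static int codePointOf(char ch)
   {
      return ch; // widening char-to-int conversion exposes the number
   }

   public static void main(String[] args)
   {
      // "A" is assigned hexadecimal 41 in ASCII (and in Unicode).
      System.out.printf("A -> hexadecimal %X%n", codePointOf('A'));

      // Going the other way: the number 0x41 maps back to "A".
      System.out.println((char) 0x41);
   }
}
```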

ASCII is an old coded character set standard with an English language bias. In 1987, work began on a universal coded character set that could accommodate all of the characters of the world's living (and, eventually, dead) languages. The resulting standard became known as Unicode.

Unicode-based text is encoded for storage or transmission by using a Unicode Transformation Format (UTF) encoding. Various UTF encodings have been devised, with the variable-length UTF-8 and UTF-16 encodings being commonly used.
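To see what "variable-length" means in practice, here is a minimal sketch that encodes single code points and counts the resulting bytes. The class and method names are hypothetical; the sample code points (including one above U+FFFF, the kind discussed in the next section) are chosen only for illustration. UTF_16BE is used so that no byte-order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths
{
   // Builds a one-code-point string and returns its UTF-8 length in bytes.
   static int utf8Length(int codePoint)
   {
      String s = new String(new int[] { codePoint }, 0, 1);
      return s.getBytes(StandardCharsets.UTF_8).length;
   }

   // Same idea for UTF-16 (big-endian, without a byte-order mark).
   static int utf16Length(int codePoint)
   {
      String s = new String(new int[] { codePoint }, 0, 1);
      return s.getBytes(StandardCharsets.UTF_16BE).length;
   }

   public static void main(String[] args)
   {
      int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
      for (int cp : samples)
         System.out.printf("U+%04X: UTF-8 = %d byte(s), UTF-16 = %d byte(s)%n",
                           cp, utf8Length(cp), utf16Length(cp));
   }
}
```

For these four samples, UTF-8 needs 1, 2, 3, and 4 bytes, whereas UTF-16 needs 2, 2, 2, and 4 bytes.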

Planes and surrogates

Unicode 1.0 fixed the size of a character at 16 bits, limiting the maximum number of characters that could be represented to 65,536. To support the thousands of rarely used or obsolete characters (e.g., Egyptian Hieroglyphs) found in historic scripts, Unicode 2.0 increased its codespace to more than one million code points by introducing a new architecture based on planes and surrogates.

A plane is a group of 65,536 code points; Unicode supports 17 planes. The code points in the first plane (0 through 65535), known as the Basic Multilingual Plane (BMP), are written via syntax U+hhhh, where each "h" represents a hexadecimal digit. The remaining 16 planes are known as supplementary planes or astral planes; their code points are written via syntax U+hhhhh or syntax U+hhhhhh.
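Because each plane holds 65,536 (2 to the power 16) code points, a code point's plane number is simply the value above its low 16 bits. The following sketch, whose class and method names are hypothetical, computes the plane and formats a code point with at least four hexadecimal digits:

```java
public class PlaneDemo
{
   // Returns the plane (0 through 16) that contains the given code point.
   static int planeOf(int codePoint)
   {
      if (!Character.isValidCodePoint(codePoint))
         throw new IllegalArgumentException("not a Unicode code point");
      return codePoint >>> 16; // discard the 16-bit offset within the plane
   }

   // Formats a code point as U+hhhh, U+hhhhh, or U+hhhhhh.
   static String format(int codePoint)
   {
      return String.format("U+%04X", codePoint);
   }

   public static void main(String[] args)
   {
      int[] samples = { 0x41, 0xFFFF, 0x10000, 0x10FFFF };
      for (int cp : samples)
         System.out.printf("%s is in plane %d%n", format(cp), planeOf(cp));
   }
}
```

Plane 0 is the BMP; planes 1 through 16 are the supplementary planes.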

Code points found in the BMP are directly accessible. However, code points in the supplementary planes, which represent supplementary characters, are accessed indirectly via surrogate (substitute) pairs in UTF-16.

Code points ranging from U+D800 through U+DBFF (1,024 code points) are known as high-surrogate code points; code points ranging from U+DC00 through U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point (also known as a leading surrogate code point) followed by a low-surrogate code point (also known as a trailing surrogate code point) denotes a surrogate pair that UTF-16 uses to represent the 1,048,576 code points that exist outside of the BMP.

Surrogate pair (U+D800, U+DC00) represents the first supplementary character. Ignoring the most significant six bits of each surrogate leaves two 10-bit values that combine into a 20-bit value consisting of all 0s; adding hexadecimal 10000 to this value yields U+10000. Surrogate pair (U+DBFF, U+DFFF) represents the last supplementary character. Ignoring the most significant six bits of each surrogate results in a combined 20-bit value consisting of all 1s (hexadecimal FFFFF); adding hexadecimal 10000 to this value yields U+10FFFF. Subtracting U+10000 from U+10FFFF and adding 1 to the result yields 1,048,576. As you can see, surrogate pairs are able to represent the 1,048,576 code points for the supplementary characters.
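The bit manipulation described above can be sketched by hand. The following example is only an illustration (the class and method names are hypothetical, and Java's own library offers equivalent conversions): it subtracts hexadecimal 10000 to obtain the 20-bit value, splits that value into two 10-bit halves, and places each half beneath the fixed six-bit surrogate prefix. Decoding reverses the steps.

```java
public class SurrogateMath
{
   // Splits a supplementary code point into a { high, low } surrogate pair.
   static char[] encode(int codePoint)
   {
      if (codePoint < 0x10000 || codePoint > 0x10FFFF)
         throw new IllegalArgumentException("not a supplementary code point");
      int offset = codePoint - 0x10000;               // 20-bit value
      char high = (char) (0xD800 | (offset >>> 10));  // upper 10 bits
      char low = (char) (0xDC00 | (offset & 0x3FF));  // lower 10 bits
      return new char[] { high, low };
   }

   // Recombines a surrogate pair into the code point that it represents.
   // The caller is assumed to pass a valid high/low surrogate pair.
   static int decode(char high, char low)
   {
      int offset = ((high & 0x3FF) << 10) | (low & 0x3FF);
      return offset + 0x10000;
   }

   public static void main(String[] args)
   {
      char[] first = encode(0x10000);
      char[] last = encode(0x10FFFF);
      System.out.printf("U+10000  -> U+%X U+%X%n", (int) first[0], (int) first[1]);
      System.out.printf("U+10FFFF -> U+%X U+%X%n", (int) last[0], (int) last[1]);
      System.out.printf("Round trip: U+%X%n", decode(last[0], last[1]));
      System.out.printf("Supplementary code points: %d%n", 0x10FFFF - 0x10000 + 1);
   }
}
```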

Java and supplementary characters

The Java language was conceived between late 1990 and mid-1991. Many decisions were made regarding its various features; one of these decisions was to base Java's primitive character type on Unicode 1.0, which resulted in sizing this type to a fixed width of 16 bits.

As Unicode evolved to support supplementary characters, considerable thought went into how Java would respond. For example, would its character type be expanded to 32 bits? How would APIs handle these characters? Eventually, a tier-based solution was chosen for inclusion in Java 5:

  • The 32-bit int primitive type would be used to represent code points in low-level APIs (e.g., java.lang.Character's class methods)
  • Sequences of char values would always be interpreted as UTF-16 sequences
  • APIs to convert between various char and code point-based representations would be provided

Under this new approach, a char value represents a UTF-16 code unit, which isn't always sufficient to represent a character code point because the code unit might represent a leading surrogate code point (e.g., char c = '\ud800';) or a trailing surrogate code point (e.g., char c = '\udfff';).
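The following sketch illustrates the distinction between code units and code points. The class name and the describe() helper are hypothetical, and the sample string holds the single supplementary code point U+1F600, chosen only as an example. A lone char can hold half of a surrogate pair, and a string's length counts code units rather than characters.

```java
public class LoneSurrogateDemo
{
   // Classifies a single UTF-16 code unit.
   static String describe(char c)
   {
      if (Character.isHighSurrogate(c))
         return "leading surrogate";
      if (Character.isLowSurrogate(c))
         return "trailing surrogate";
      return "not a surrogate";
   }

   public static void main(String[] args)
   {
      char c1 = '\ud800';
      char c2 = '\udfff';
      System.out.println(describe(c1));  // leading surrogate
      System.out.println(describe(c2));  // trailing surrogate
      System.out.println(describe('A')); // not a surrogate

      // One supplementary character occupies two char values.
      String s = "\uD83D\uDE00";
      System.out.println(s.length());                      // 2 code units
      System.out.println(s.codePointCount(0, s.length())); // 1 code point
   }
}
```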

The Character class provides methods that let you map between various char and code point-based representations and more. Consider the following examples:

  • Character.toCodePoint(char high, char low) maps two UTF-16 code units to a code point.
  • Character.toChars(int codePoint) maps the given code point to one or two UTF-16 code units, wrapped into a char[] array.
  • Character.isSurrogatePair(char high, char low) returns true when the two UTF-16 code units form a valid surrogate pair (a high surrogate followed by a low surrogate); otherwise, it returns false.

These methods are fun to play with, and Listing 1 presents a small Mapper application that uses them to help you learn more about supplementary characters.

Listing 1. Mapping code points to code units and vice versa

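public class Mapper
{
   public static void main(String[] args)
   {
      if (args.length < 1 || args.length > 2)
      {
         System.err.println("usage: java Mapper cuhigh culow");
         System.err.println("       java Mapper cp");
         return;
      }
      if (args.length == 1)
      {
         // One argument: parse it as a hexadecimal code point and map it to
         // one (BMP) or two (supplementary) UTF-16 code units. Note that
         // toChars() throws IllegalArgumentException for an invalid code point.
         int value = Integer.parseInt(args[0], 16);
         char[] codeUnits = Character.toChars(value);
         if (codeUnits.length == 1)
            System.out.printf("Code units = U+%04X%n", (int) codeUnits[0]);
         else
            System.out.printf("Code units = U+%04X U+%04X%n",
                              (int) codeUnits[0], (int) codeUnits[1]);
      }
      else
      {
         // Two arguments: parse them as hexadecimal UTF-16 code units.
         // The (char) casts keep only the low 16 bits of each parsed value.
         char value1 = (char) Integer.parseInt(args[0], 16);
         char value2 = (char) Integer.parseInt(args[1], 16);
         // Reject anything other than a high surrogate followed by a low one.
         if (!Character.isSurrogatePair(value1, value2))
         {
            System.err.println("Not a surrogate pair");
            return;
         }
         System.out.printf("Code point = U+%04X%n",
                           Character.toCodePoint(value1, value2));
      }
   }
}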

You can pass one or two hexadecimal-based arguments to Mapper. A single argument identifies a code point and two arguments identify code units. For example, java Mapper 10000 outputs the following:

Code units = U+D800 U+DC00

In contrast, java Mapper D800 DC00 outputs the following:

Code point = U+10000

Be careful to specify valid surrogate values; otherwise, Mapper outputs an error message. For example, java Mapper 0 FFFF outputs the following:

Not a surrogate pair

Conclusion

Not every value that can be assigned to a char variable denotes a character. Instead, a char value represents a UTF-16 code unit in order to support supplementary characters. I introduced you to three methods that were added to Java's Character class that provide this support. However, there are additional methods. As an exercise, identify other methods that Character provides for dealing with supplementary characters. Consult this post's code archive for the answer.
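As a starting point for that exercise, here is a minimal sketch that uses two more of Character's methods, codePointAt() and charCount(), to walk a string one code point at a time instead of one char at a time. The class and method names are hypothetical, and this is just one way to perform such a traversal.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalker
{
   // Returns the code points in s, treating each surrogate pair as one unit.
   static List<Integer> codePointsOf(String s)
   {
      List<Integer> result = new ArrayList<>();
      for (int i = 0; i < s.length(); )
      {
         int cp = Character.codePointAt(s, i); // reads one or two code units
         result.add(cp);
         i += Character.charCount(cp);         // advance by 1 or 2 chars
      }
      return result;
   }

   public static void main(String[] args)
   {
      // "A", the supplementary code point U+1F600, and "B".
      String s = "A\uD83D\uDE00B";
      System.out.println(s.length() + " code units");
      for (int cp : codePointsOf(s))
         System.out.printf("U+%04X%n", cp);
   }
}
```

The sample string contains four code units but only three code points.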

Get the source code for this post's applications. Created by Jeff Friesen for JavaWorld

The following software was used to develop the post's code:

  • 64-bit JDK 8u60

The post's code was tested on the following platform(s):

  • JVM on 64-bit Windows 8.1

This story, "When is a character not a character?" was originally published by JavaWorld.

Copyright © 2016 IDG Communications, Inc.
