Just when you thought you knew everything there is to know about the Java language, along comes something new to challenge your sense of complete mastery. For example, did you know that Java lets you declare a class within an interface, as in interface A { class B {} }? Also, were you aware that you can "add" cast operators together in a declaration such as byte i = (byte) + (short) + (double) 2000; and still have it compile? (You're not really adding cast operators; each + is a unary plus applied to the operand on its right.) Neither of these oddities is the subject of this post. Instead, I'm going to show you that not every value that can be assigned to a char variable denotes a character.
Characters and Unicode
A character is an abstract minimal unit of text: it has no shape (a font's glyph provides the shape) and no associated numeric value. For example, "a" is a character; I've placed it in double quotes to signify its abstractness.
A character set is a collection of characters, and a coded character set is a character set in which code points (numeric values) are associated with characters. For example, the American Standard Code for Information Interchange (ASCII) is a coded character set (e.g., hexadecimal value 41 is assigned to "A").
ASCII is an old coded character set standard with an English language bias. In 1987, work began on a universal coded character set that could accommodate all of the characters of the world's living (and, eventually, dead) languages. The resulting standard became known as Unicode.
Unicode-based text is encoded for storage or transmission by using a Unicode Transformation Format (UTF) encoding. Various UTF encodings have been devised, with the variable-length UTF-8 and UTF-16 encodings being commonly used.
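To make "variable-length" concrete, here is a minimal sketch that counts the bytes needed to encode a few sample code points. The class name EncodingLengths and its two helper methods are hypothetical names chosen for this illustration; the sketch relies only on the standard String.getBytes(Charset) method, and it picks the UTF_16BE charset so that no byte-order mark inflates the UTF-16 counts.

```java
import java.nio.charset.StandardCharsets;

class EncodingLengths
{
   // Number of bytes needed to encode text in UTF-8.
   static int utf8Length(String text)
   {
      return text.getBytes(StandardCharsets.UTF_8).length;
   }

   // Number of bytes needed to encode text in UTF-16
   // (big-endian byte order, no byte-order mark).
   static int utf16Length(String text)
   {
      return text.getBytes(StandardCharsets.UTF_16BE).length;
   }

   public static void main(String[] args)
   {
      String[] samples =
      {
         "A",                                   // U+0041
         "\u00E9",                              // U+00E9
         "\u20AC",                              // U+20AC
         new String(Character.toChars(0x1F600)) // U+1F600, outside the BMP
      };
      for (String s : samples)
         System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                           s.codePointAt(0), utf8Length(s), utf16Length(s));
   }
}
```

For these four samples, UTF-8 needs 1, 2, 3, and 4 bytes, whereas UTF-16 needs 2, 2, 2, and 4 bytes.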
Planes and surrogates
Unicode 1.0 fixed the size of a character at 16 bits, limiting the maximum number of characters that could be represented to 65,536. To support the thousands of rarely used or obsolete characters (e.g., Egyptian Hieroglyphs) found in historic scripts, Unicode 2.0 increased its codespace to more than one million code points by introducing a new architecture based on planes and surrogates.
A plane is a group of 65,536 code points; Unicode supports 17 planes. The first plane (code points 0 through 65535), known as the Basic Multilingual Plane (BMP), represents its code points via syntax U+hhhh, where each "h" represents a hexadecimal digit. Additional planes are known as supplementary planes or astral planes; code points in these planes are represented via syntax U+hhhhh or syntax U+hhhhhh.
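Because each plane holds exactly 65,536 (hexadecimal 10000) code points, a code point's plane number is simply the code point shifted right by 16 bits. The following sketch illustrates that arithmetic; Planes and planeOf() are hypothetical names invented for this example, while Character.isValidCodePoint() is a standard method.

```java
class Planes
{
   // Each plane holds 0x10000 code points, so dividing a code point by
   // 0x10000 (shifting it right 16 bits) yields its plane number.
   static int planeOf(int codePoint)
   {
      if (!Character.isValidCodePoint(codePoint))
         throw new IllegalArgumentException("not a Unicode code point");
      return codePoint >>> 16;
   }

   public static void main(String[] args)
   {
      int[] samples = { 0x0041, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF };
      for (int cp : samples)
         System.out.printf("U+%04X is in plane %d%n", cp, planeOf(cp));
   }
}
```

The first two samples land in plane 0 (the BMP), the next two in plane 1, and U+10FFFF in plane 16, the last of the 17 planes.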
Code points found in the BMP are directly accessible. However, code points in the supplementary planes, which represent supplementary characters, are accessed indirectly via surrogate (substitute) pairs in UTF-16.
Code points ranging from U+D800 through U+DBFF (1,024 code points) are known as high-surrogate code points; code points ranging from U+DC00 through U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point (also known as a leading surrogate code point) followed by a low-surrogate code point (also known as a trailing surrogate code point) denotes a surrogate pair that UTF-16 uses to represent the 1,048,576 code points that exist outside of the BMP.
Surrogate pair (U+D800, U+DC00) represents the first supplementary character. Ignoring the most significant six bits of each surrogate leaves 10 bits apiece, which combine into a 20-bit value consisting of all 0s; adding hexadecimal 10000 to this value yields U+10000. Surrogate pair (U+DBFF, U+DFFF) represents the last supplementary character. Here, the combined 20-bit value consists of all 1s (hexadecimal FFFFF); adding hexadecimal 10000 yields U+10FFFF. Subtracting U+10000 from U+10FFFF and adding 1 to the result yields 1,048,576, which is exactly the number of supplementary code points that surrogate pairs are able to represent.
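The bit manipulation just described can be sketched by hand. In the following illustration, SurrogateMath, combine(), and split() are hypothetical names: combine() keeps the low 10 bits of each surrogate, joins them into a 20-bit value, and adds hexadecimal 10000 (the first code point beyond the BMP), and split() reverses the process. The split() method assumes that its argument is a supplementary code point and doesn't validate it.

```java
class SurrogateMath
{
   // Combines a surrogate pair by hand: keep the low 10 bits of each
   // code unit, join them into a 20-bit value, and add 0x10000.
   static int combine(char high, char low)
   {
      int high10 = high & 0x3FF; // drops the 110110 prefix
      int low10 = low & 0x3FF;   // drops the 110111 prefix
      return ((high10 << 10) | low10) + 0x10000;
   }

   // Reverses combine(). Assumes codePoint lies in the range
   // 0x10000 through 0x10FFFF; no validation is performed.
   static char[] split(int codePoint)
   {
      int offset = codePoint - 0x10000; // 20-bit value
      char high = (char) (0xD800 | (offset >>> 10));
      char low = (char) (0xDC00 | (offset & 0x3FF));
      return new char[] { high, low };
   }

   public static void main(String[] args)
   {
      System.out.printf("First = U+%X%n", combine('\uD800', '\uDC00'));
      System.out.printf("Last  = U+%X%n", combine('\uDBFF', '\uDFFF'));
      char[] pair = split(0x1F600);
      System.out.printf("U+1F600 = U+%X U+%X%n", (int) pair[0], (int) pair[1]);
   }
}
```

Running this sketch prints U+10000 and U+10FFFF for the first and last pairs, and U+D83D U+DE00 for U+1F600.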
Java and supplementary characters
The Java language was conceived between late 1990 and mid-1991. Many decisions were made regarding its various features; one of these decisions was to base Java's primitive character type on Unicode 1.0, which resulted in sizing this type to a fixed width of 16 bits.
As Unicode evolved to support supplementary characters, considerable thought went into how Java would respond. For example, would its character type be expanded to 32 bits? How would APIs handle these characters? Eventually, a tier-based solution was chosen for inclusion in Java 5:
- Primitive type 32-bit integer (represented via int) would be used to represent code points in low-level APIs (e.g., java.lang.Character's class methods)
- Sequences of char values always would be interpreted as UTF-16 sequences
- APIs to convert between various char and code point-based representations would be provided
Under this new approach, a char value represents a UTF-16 code unit, which isn't always sufficient to represent a character code point because the code unit might represent a leading surrogate code point (e.g., char c = '\ud800';) or a trailing surrogate code point (e.g., char c = '\udfff';).
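One visible consequence is that a string's length in char values can exceed its number of characters. The following sketch builds a string from "A" and the supplementary code point U+1F600, and then classifies each code unit by using the surrogate ranges given earlier (U+D800 through U+DBFF and U+DC00 through U+DFFF). CodeUnitKinds and classify() are hypothetical names chosen for this illustration; String's length() and codePointCount() methods are standard.

```java
class CodeUnitKinds
{
   // Classifies one UTF-16 code unit by testing it against the
   // high-surrogate and low-surrogate ranges.
   static String classify(char c)
   {
      if (c >= 0xD800 && c <= 0xDBFF)
         return "leading surrogate";
      if (c >= 0xDC00 && c <= 0xDFFF)
         return "trailing surrogate";
      return "not a surrogate";
   }

   public static void main(String[] args)
   {
      // "A" followed by U+1F600, a supplementary character.
      String s = "A" + new String(Character.toChars(0x1F600));

      System.out.println("code units:  " + s.length());
      System.out.println("code points: " + s.codePointCount(0, s.length()));
      for (int i = 0; i < s.length(); i++)
      {
         char c = s.charAt(i);
         System.out.printf("U+%04X is %s%n", (int) c, classify(c));
      }
   }
}
```

The string contains three code units but only two code points: U+0041 stands on its own, whereas U+D83D and U+DE00 are meaningful only as a pair.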
The Character class provides methods that let you map between various char and code point-based representations and more. Consider the following examples:
- Character.toCodePoint(char high, char low) maps two UTF-16 code units to a code point.
- Character.toChars(int codePoint) maps the given code point to one or two UTF-16 code units, wrapped into a char[] array.
- Character.isSurrogatePair(char high, char low) returns Boolean true when the UTF-16 code units represent a surrogate pair; otherwise, false is returned.
These methods are fun to play with, and Listing 1 presents a small Mapper application that uses them to help you learn more about supplementary characters.
Listing 1. Mapping code points to code units and vice versa
public class Mapper
{
   public static void main(String[] args)
   {
      if (args.length < 1 || args.length > 2)
      {
         System.err.println("usage: java Mapper cuhigh culow");
         System.err.println("       java Mapper cp");
         return;
      }
      if (args.length == 1)
      {
         // One argument: a hexadecimal code point to map to code units.
         int value = Integer.parseInt(args[0], 16);
         char[] codeUnits = Character.toChars(value);
         if (codeUnits.length == 1)
            System.out.printf("Code units = U+%X%n", (int) codeUnits[0]);
         else
            System.out.printf("Code units = U+%X U+%X%n",
                              (int) codeUnits[0], (int) codeUnits[1]);
      }
      else
      {
         // Two arguments: hexadecimal high and low code units to map
         // to a code point.
         char value1 = (char) Integer.parseInt(args[0], 16);
         char value2 = (char) Integer.parseInt(args[1], 16);
         if (!Character.isSurrogatePair(value1, value2))
         {
            System.err.println("Not a surrogate pair");
            return;
         }
         System.out.printf("Code point = U+%X%n",
                           Character.toCodePoint(value1, value2));
      }
   }
}
You can pass one or two hexadecimal-based arguments to Mapper. A single argument identifies a code point and two arguments identify code units. For example, java Mapper 10000 outputs the following:
Code units = U+D800 U+DC00
In contrast, java Mapper D800 DC00 outputs the following:
Code point = U+10000
Be careful to specify valid surrogate values; otherwise, Mapper outputs an error message. For example, java Mapper 0 FFFF outputs the following:
Not a surrogate pair
Conclusion
Not every value that can be assigned to a char variable denotes a character. Instead, a char value represents a UTF-16 code unit in order to support supplementary characters. I introduced you to three methods that were added to Java's Character class to provide this support. However, there are additional methods. As an exercise, identify other methods that Character provides for dealing with supplementary characters. Consult this post's code archive for the answer.
The following software was used to develop the post's code:
- 64-bit JDK 8u60
The post's code was tested on the following platform(s):
- JVM on 64-bit Windows 8.1
This story, "When is a character not a character?" was originally published by JavaWorld.