Base64 encoding and decoding in Java 8

Java 8's new Base64 API could be just what you need to ensure data integrity in transit


Java 8 will be remembered mainly for introducing lambdas, streams, a new date/time model, and the Nashorn JavaScript engine to Java. Some will also remember Java 8 for introducing various small but useful features such as the Base64 API. What is Base64 and how do I use this API? This post answers these questions.

What is Base64?

Base64 is a binary-to-text encoding scheme that represents binary data in a printable ASCII string format by translating it into a radix-64 representation. Each Base64 digit represents exactly 6 bits of binary data.

Base64 is used to prevent data from being modified while in transit through information systems, such as email, that might not be 8-bit clean (they might garble 8-bit values). For example, you attach an image to an email message and want the image to arrive at the other end without being garbled. Your email software Base64-encodes the image and inserts the equivalent text into the message, as illustrated below:

Content-Disposition: inline;
	filename=IMG_0006.JPG
Content-Transfer-Encoding: base64

/9j/4R/+RXhpZgAATU0AKgAAAAgACgEPAAIAAAAGAAAAhgEQAAIAAAAKAAAAjAESAAMAAAABAAYA
AAEaAAUAAAABAAAAlgEbAAUAAAABAAAAngEoAAMAAAABAAIAAAExAAIAAAAHAAAApgEyAAIAAAAU
AAAArgITAAMAAAABAAEAAIdpAAQAAAABAAAAwgAABCRBcHBsZQBpUGhvbmUgNnMAAAAASAAAAAEA
...
NOMbnDUk2bGh26x2yiJcsoBIrvtPe3muBbTRGMdeufmH+Nct4chUXpwSPk/qK9GtJRMWWVFbZ0JH
I4rf2dkZSbOjt7hhEzwcujA4I7Gust75pYVwAPpXn+kzNLOVYD7xFegWEKPkHsM/pU1F0NKbNS32
o24sSCOlaaFYLUhjky4x9PSsKL5bJsdWkAz3xirH2dZLy1DM2C44zx1FZqL2PTXY/9k=

The illustration shows that this encoded image starts with / and ends with =. The ... indicates text that I've omitted for brevity. Note that the encoding for this or any other example is about 33 percent larger than the original binary data because every three bytes of input become four characters of output.
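The size overhead can be checked with a few lines of Java. The following is a minimal sketch: the Base64Size class and its encodedLength() helper are hypothetical names of my own, and the formula assumes padded output with no line separators.

```java
import java.util.Base64;

public class Base64Size {
    // Padded Base64 length for n input bytes: each group of up to
    // three bytes becomes four output characters.
    static int encodedLength(int n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 3, 300, 1000}) {
            // Encode n zero bytes and compare with the formula.
            int actual = Base64.getEncoder().encode(new byte[n]).length;
            System.out.println(n + " bytes -> " + actual
                + " characters (formula: " + encodedLength(n) + ")");
        }
    }
}
```

An encoding that is broken into lines, as in the email example, is slightly larger still because of the line separators.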

The recipient's email software will Base64-decode the encoded textual image to restore the original binary image. For this example, the image would be shown inline with the rest of the message.

Base64 encoding and decoding

Base64 relies on simple encoding and decoding algorithms. They work with a 65-character subset of US-ASCII where each of the first 64 characters maps to an equivalent 6-bit binary sequence. Here is the alphabet:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A            17 R            34 i            51 z
    1 B            18 S            35 j            52 0
    2 C            19 T            36 k            53 1
    3 D            20 U            37 l            54 2
    4 E            21 V            38 m            55 3
    5 F            22 W            39 n            56 4
    6 G            23 X            40 o            57 5
    7 H            24 Y            41 p            58 6
    8 I            25 Z            42 q            59 7
    9 J            26 a            43 r            60 8
   10 K            27 b            44 s            61 9
   11 L            28 c            45 t            62 +
   12 M            29 d            46 u            63 /
   13 N            30 e            47 v
   14 O            31 f            48 w         (pad) =
   15 P            32 g            49 x
   16 Q            33 h            50 y

The 65th character (=) is used to pad Base64-encoded text to an integral size as explained shortly.

The encoding algorithm receives an input stream of 8-bit bytes. This stream is presumed to be ordered with the most-significant-bit first: the first bit is the high-order bit in the first byte, the eighth bit is the low-order bit in this byte, and so on.

From left to right, these bytes are organized into 24-bit groups. Each group is treated as four concatenated 6-bit groups. Each 6-bit group indexes into an array of the 64 printable characters; the resulting character is output.

When fewer than 24 bits are available at the end of the data being encoded, zero bits are added (on the right) to form an integral number of 6-bit groups. Then, one or two = pad characters may be output. There are two cases to consider:

  • One remaining byte: Four zero bits are appended to this byte to form two 6-bit groups. Each group indexes the array and a resulting character is output. Following these two characters, two = pad characters are output.
  • Two remaining bytes: Two zero bits are appended to the second byte to form three 6-bit groups. Each group indexes the array and a resulting character is output. Following these three characters, one = pad character is output.
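The steps above can be sketched in Java as follows. This is an illustration of the algorithm as described, not the code that any particular library uses; the ManualBase64 class and its encode() method are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class ManualBase64 {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static String encode(byte[] src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        // Full 24-bit groups: three bytes become four 6-bit indexes.
        for (; i + 2 < src.length; i += 3) {
            int group = (src[i] & 0xFF) << 16
                      | (src[i + 1] & 0xFF) << 8
                      | (src[i + 2] & 0xFF);
            out.append(ALPHABET.charAt(group >> 18 & 0x3F))
               .append(ALPHABET.charAt(group >> 12 & 0x3F))
               .append(ALPHABET.charAt(group >> 6 & 0x3F))
               .append(ALPHABET.charAt(group & 0x3F));
        }
        int remaining = src.length - i;
        if (remaining == 1) {
            // One byte: 8 bits plus 4 zero bits form two 6-bit groups.
            int group = (src[i] & 0xFF) << 4;
            out.append(ALPHABET.charAt(group >> 6 & 0x3F))
               .append(ALPHABET.charAt(group & 0x3F))
               .append("==");
        } else if (remaining == 2) {
            // Two bytes: 16 bits plus 2 zero bits form three 6-bit groups.
            int group = ((src[i] & 0xFF) << 8 | (src[i + 1] & 0xFF)) << 2;
            out.append(ALPHABET.charAt(group >> 12 & 0x3F))
               .append(ALPHABET.charAt(group >> 6 & 0x3F))
               .append(ALPHABET.charAt(group & 0x3F))
               .append('=');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("@!*".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Masking each byte with 0xFF matters because Java bytes are signed; without it, values above 127 would sign-extend and corrupt the group.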

Let's consider three examples to learn how the encoding algorithm works. First, suppose we wish to encode @!*:

Source ASCII bit sequences with prepended 0 bits to form 8-bit bytes:

@        !        *
01000000 00100001 00101010

Dividing this 24-bit group into four 6-bit groups yields the following:

010000 | 000010 | 000100 | 101010

These bit patterns equate to the following indexes:

16 2 4 42

Indexing into the Base64 alphabet shown earlier yields the following encoding:

QCEq

We'll continue by shortening the input sequence to @!:

Source ASCII bit sequences with prepended 0 bits to form 8-bit bytes:

@        !       
01000000 00100001

Two zero bits are appended to make three 6-bit groups:

010000 | 000010 | 000100

These bit patterns equate to the following indexes:

16 2 4

Indexing into the Base64 alphabet shown earlier yields the following encoding:

QCE

An = pad character is output, yielding the following final encoding:

QCE=

The final example shortens the input sequence to @:

Source ASCII bit sequence with prepended 0 bits to form 8-bit byte:

@       
01000000

Four zero bits are appended to make two 6-bit groups:

010000 | 000000

These bit patterns equate to the following indexes:

16 0

Indexing into the Base64 alphabet shown earlier yields the following encoding:

QA

Two = pad characters are output, yielding the following final encoding:

QA==
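These three hand-worked results can be confirmed with Java 8's java.util.Base64 class. The WorkedExamples class and its encode() helper below are hypothetical names used only for this check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class WorkedExamples {
    // Encode ASCII text with the Basic variant's encoder.
    static String encode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(encode("@!*")); // QCEq
        System.out.println(encode("@!"));  // QCE=
        System.out.println(encode("@"));   // QA==
    }
}
```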

The decoding algorithm is the inverse of the encoding algorithm. However, it's free to take appropriate action upon detection of a character not in the Base64 alphabet or an incorrect number of pad characters.
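As one example of such an action, the decoder returned by Java's Base64.getDecoder() throws IllegalArgumentException when it meets a character outside the alphabet. The sketch below shows a round trip and a rejected input; the DecodeCheck class and its helper methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DecodeCheck {
    // Decode Base64 text back to the ASCII string it represents.
    static String decode(String encoded) {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Report whether the Basic decoder refuses the given input.
    static boolean isRejected(String encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("QCEq"));     // @!*
        System.out.println(isRejected("QC*q")); // true: * is not in the alphabet
    }
}
```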

Base64 variants

Several Base64 variants have been devised. Some require that the encoded output be divided into lines that don't exceed a given length, with each line (except the last) separated from the next by a line separator (a carriage return \r followed by a linefeed \n). I describe the three variants that Java 8's Base64 API supports. Check out Wikipedia's Base64 entry for a complete list of variants.

Basic

RFC 4648 describes a Base64 variant known as Basic. This variant uses the Base64 alphabet presented in Table 1 of RFC 4648 and RFC 2045 (and shown earlier in this post) for encoding and decoding. The encoder treats the encoded output stream as one line; no line separators are output. The decoder rejects an encoding that contains characters outside the Base64 alphabet. Note that these and other stipulations can be overridden.

MIME

RFC 2045 describes a Base64 variant known as MIME. This variant uses the Base64 alphabet presented in Table 1 of RFC 2045 for encoding and decoding. The encoded output stream is organized into lines of no more than 76 characters; each line (except the last line) is separated from the next line via a line separator. All line separators or other characters not found in the Base64 alphabet are ignored during decoding.
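A short sketch makes the line structure visible. It uses Java's Base64.getMimeEncoder() and Base64.getMimeDecoder(); the MimeDemo class, its helper methods, and the 100-byte test input are my own choices for illustration.

```java
import java.util.Base64;

public class MimeDemo {
    // Encode with the MIME variant: 76-character lines joined by \r\n.
    static String encodeMime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // The MIME decoder skips line separators and other non-alphabet characters.
    static byte[] decodeMime(String encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // 100 bytes encode to 136 characters: one full line plus a remainder.
        String encoded = encodeMime(new byte[100]);
        for (String line : encoded.split("\r\n")) {
            System.out.println(line.length() + ": " + line);
        }
        System.out.println("decoded bytes: " + decodeMime(encoded).length);
    }
}
```

Passing the same multi-line text to the Basic decoder instead would fail, because \r and \n are outside the Base64 alphabet.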

URL and Filename Safe

RFC 4648 describes a Base64 variant known as URL and Filename Safe. This variant uses the Base64 alphabet presented in Table 2 of RFC 4648 for encoding and decoding. The alphabet is identical to the alphabet shown earlier except that - replaces + and _ replaces /. No line separators are output. The decoder rejects an encoding that contains characters outside the Base64 alphabet.

Base64 encoding is useful in the context of lengthy binary data and HTTP GET requests. The idea is to encode this data and then append it to the HTTP GET URL. If the Basic or MIME variant were used, any + or / characters in the encoded data would have to be URL-encoded into hexadecimal sequences (+ becomes %2B and / becomes %2F). The resulting URL string would be somewhat longer. By replacing + with - and / with _, URL and Filename Safe obviates the need for URL encoders/decoders (and their impacts on the lengths of encoded values). Also, this variant is useful when the encoded data is to be used for a filename because Unix and Windows filenames cannot contain /.
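To see the difference between the two alphabets, consider a two-byte input chosen so that its encoding uses indexes 62 and 63. The UrlSafeDemo class and its helper methods below are hypothetical names for this sketch.

```java
import java.util.Base64;

public class UrlSafeDemo {
    static String basic(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static String urlSafe(byte[] data) {
        return Base64.getUrlEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Bits 111110 111111 1111(00) map to indexes 62, 63, and 60.
        byte[] data = {(byte) 0xFB, (byte) 0xFF};
        System.out.println(basic(data));   // +/8=
        System.out.println(urlSafe(data)); // -_8=
    }
}
```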

Working with Java's Base64 API

Java 8 introduced a Base64 API consisting of the java.util.Base64 class along with its Encoder and Decoder nested static classes. Base64 presents several static methods for obtaining encoders and decoders:

  • Base64.Encoder getEncoder(): Return an encoder for the Basic variant.
  • Base64.Decoder getDecoder(): Return a decoder for the Basic variant.
  • Base64.Encoder getMimeEncoder(): Return an encoder for the MIME variant.
  • Base64.Encoder getMimeEncoder(int lineLength, byte[] lineSeparator): Return an encoder for a modified MIME variant with the given lineLength (rounded down to the nearest multiple of 4; the output isn't separated into lines when the rounded-down value isn't positive) and lineSeparator. It throws java.lang.IllegalArgumentException when lineSeparator includes any Base64 alphabet character presented in Table 1 of RFC 2045.

    RFC 2045's encoder, which is returned from the no-argument getMimeEncoder() method, is rather rigid. For example, that encoder creates encoded text with fixed line lengths (except for the last line) of 76 characters. If you want an encoder to support RFC 1421, which dictates a fixed line length of 64 characters, you need to use getMimeEncoder(int lineLength, byte[] lineSeparator).

  • Base64.Decoder getMimeDecoder(): Return a decoder for the MIME variant.
  • Base64.Encoder getUrlEncoder(): Return an encoder for the URL and Filename Safe variant.
  • Base64.Decoder getUrlDecoder(): Return a decoder for the URL and Filename Safe variant.
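As a sketch of the two-argument getMimeEncoder() method, the following code obtains an encoder that produces 64-character lines in the style of RFC 1421. The PemStyleDemo class and its helper methods are hypothetical names, and the choice of \r\n as the line separator is an assumption on my part.

```java
import java.util.Base64;

public class PemStyleDemo {
    // An encoder that breaks its output into 64-character lines.
    static Base64.Encoder pemStyleEncoder() {
        return Base64.getMimeEncoder(64, new byte[] {'\r', '\n'});
    }

    // Report whether getMimeEncoder() refuses the given line separator.
    static boolean separatorRejected(byte[] separator) {
        try {
            Base64.getMimeEncoder(64, separator);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 100 bytes encode to 136 characters: lines of 64, 64, and 8.
        String encoded = pemStyleEncoder().encodeToString(new byte[100]);
        for (String line : encoded.split("\r\n")) {
            System.out.println(line.length());
        }
        // 'A' belongs to the Base64 alphabet, so it can't be a separator.
        System.out.println(separatorRejected(new byte[] {'A'}));
    }
}
```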

Base64.Encoder presents several threadsafe instance methods for encoding byte sequences. Passing the null reference to one of the following methods results in java.lang.NullPointerException:

  • byte[] encode(byte[] src): Encode all bytes in src to a newly-allocated byte array, which this method returns.
  • int encode(byte[] src, byte[] dst): Encode all bytes in src to dst (starting at offset 0). If dst isn't big enough to hold the encoding, IllegalArgumentException is thrown. Otherwise, the number of bytes written to dst is returned.
  • ByteBuffer encode(ByteBuffer buffer): Encode all remaining bytes in buffer to a newly-allocated java.nio.ByteBuffer object. Upon return, buffer's position will be updated to its limit; its limit won't have been changed. The returned output buffer's position will be zero and its limit will be the number of resulting encoded bytes.
  • String encodeToString(byte[] src): Encode all bytes in src to a string, which is returned. Invoking this method is equivalent to executing new String(encode(src), StandardCharsets.ISO_8859_1).
  • Base64.Encoder withoutPadding(): Return an encoder that encodes equivalently to this encoder, but without adding any padding character at the end of the encoded byte data.
  • OutputStream wrap(OutputStream os): Wrap an output stream for encoding byte data. It's recommended to promptly close the returned output stream after use, during which it will flush all possible leftover bytes to the underlying output stream. Closing the returned output stream will close the underlying output stream.
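The following sketch exercises three of these methods: withoutPadding(), wrap(), and the two-array encode(). The EncoderMethodsDemo class and its helper methods are hypothetical names, and the in-memory ByteArrayOutputStream stands in for whatever stream you would really write to.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncoderMethodsDemo {
    // Encode without the trailing = characters.
    static String unpadded(byte[] src) {
        return Base64.getEncoder().withoutPadding().encodeToString(src);
    }

    // Encode by writing through a wrapped output stream.
    static String viaStream(byte[] src) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the wrapper flushes leftover bytes and any padding.
        try (OutputStream out = Base64.getEncoder().wrap(sink)) {
            out.write(src);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Encode into a caller-supplied array and return the byte count.
    static int intoArray(byte[] src, byte[] dst) {
        return Base64.getEncoder().encode(src, dst);
    }

    public static void main(String[] args) {
        System.out.println(unpadded(new byte[] {'@'}));       // QA
        System.out.println(viaStream(new byte[] {'@', '!'})); // QCE=
        System.out.println(intoArray(new byte[] {'@', '!', '*'}, new byte[4])); // 4
    }
}
```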

Base64.Decoder presents several threadsafe instance methods for decoding byte sequences. Passing the null reference to one of the following methods results in NullPointerException:
