# Learning BASE64 encoding

BASE64 encoding is so prevalent that it is worth learning how it works and how to code your own implementation.

The point of BASE64 is to communicate binary data as text, using only characters that are likely to exist on most computer platforms. These safe characters are known as the BASE64 alphabet and are the letters A to Z and a to z, the digits 0 to 9, and the characters + and /. There are other ways to represent bytes as text; for example, by converting them to hexadecimal strings made up of the characters 0 to 9 and A to F. But doing so means that for every byte of the original data, two hexadecimal characters are required, which doubles the size of the data.
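To make the hexadecimal comparison concrete, here is a minimal sketch of a hex encoder. The class name `HexSketch` and the method `toHex` are made up for this illustration; the point is only that each input byte becomes two output characters.

```java
class HexSketch {
    // Hypothetical helper alphabet: index = value of a 4-bit half-byte.
    private static final String HEX_CHARS = "0123456789ABCDEF";

    // Encodes every byte as two hexadecimal characters.
    static String toHex(byte[] data) {
        StringBuilder out = new StringBuilder(data.length * 2);
        for (byte b : data) {
            int value = b & 0xFF;                     // treat the byte as unsigned
            out.append(HEX_CHARS.charAt(value >> 4)); // high four bits
            out.append(HEX_CHARS.charAt(value & 15)); // low four bits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xCA, (byte) 0xFE, 0x00};
        System.out.println(toHex(data)); // prints CAFE00
    }
}
```

Three bytes go in and six characters come out, which is the 2:1 inflation described above.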

The BASE64 alphabet consists of 64 characters, each one associated with an integer value. For example, the character A is represented by 0, the character Z by 25, and character / by 63. This means that to cover the range of integers from 0 to 63, the BASE64 word size must be six bits (because 2^6=64). As a consequence of this, during BASE64 encoding the original data must be laid out and padded to make its size — in bits — a number divisible by six.
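One simple way to hold this mapping in code, sketched here under the assumption that a plain string is good enough as a lookup table, is to store the alphabet in order so that a character's position is its 6-bit value. The class name `Base64Alphabet` is illustrative only.

```java
class Base64Alphabet {
    // Position in this string = 6-bit value of the character.
    static final String BASE64_CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    public static void main(String[] args) {
        System.out.println(BASE64_CHARS.length());     // prints 64
        System.out.println(BASE64_CHARS.indexOf('A')); // prints 0
        System.out.println(BASE64_CHARS.indexOf('Z')); // prints 25
        System.out.println(BASE64_CHARS.indexOf('/')); // prints 63
        System.out.println(BASE64_CHARS.charAt(62));   // prints +
    }
}
```

Encoding only ever needs `charAt` (value to character); `indexOf` (character to value) is what a decoder would use.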

The smallest number of bytes (or 8-bit words) that can be re-arranged in groups of 6-bit words is three (3×8 bits = 24 bits, which is divisible by six). In other words, data must be processed in groups of 24 bits, each group being equivalent to four 6-bit words (4×6 bits = 24 bits). The BASE64 character matching the value of each 6-bit word is then output as an 8-bit ASCII character. So for every three bytes of input, four bytes of output are generated, giving an inflation factor of 4:3 (which is a better compromise than the 2:1 ratio from hexadecimal encoding).
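The 4:3 and 2:1 ratios can be checked with a little arithmetic. The sketch below uses two hypothetical helpers, `encodedLength` and `hexLength`, and assumes that an incomplete final group of bytes is rounded up to a whole group of three.

```java
class SizeSketch {
    // BASE64 characters needed for inputLength bytes:
    // four characters for every started group of three bytes.
    static int encodedLength(int inputLength) {
        return 4 * ((inputLength + 2) / 3);
    }

    // Hexadecimal characters needed for inputLength bytes: two per byte.
    static int hexLength(int inputLength) {
        return 2 * inputLength;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength(3));   // prints 4
        System.out.println(encodedLength(300)); // prints 400
        System.out.println(hexLength(300));     // prints 600
    }
}
```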

Data that cannot be split exactly into groups of 24 bits must be padded until they can. For example, data that are one byte long (i.e. 8 bits) must be padded with two zero-value bytes (i.e. 8 bits + (2×8 bits) = 24 bits), and data that are 11 bytes long (i.e. 88 bits) must be padded with one zero-value byte (i.e. 88 bits + 8 bits = 96 bits = 4×24 bits). In other words, data must be padded to reach a size in bytes that is divisible by three.
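The number of padding bytes follows directly from the remainder of the length divided by three. Here is one possible way to compute it; the name `paddingSize` for the helper is my own choice for this sketch.

```java
class PaddingSketch {
    // Number of zero-value bytes needed to reach a multiple of three bytes.
    // The outer "% 3" turns a result of 3 (already aligned) into 0.
    static int paddingSize(int inputLength) {
        return (3 - inputLength % 3) % 3;
    }

    public static void main(String[] args) {
        System.out.println(paddingSize(1));  // prints 2
        System.out.println(paddingSize(11)); // prints 1
        System.out.println(paddingSize(3));  // prints 0
    }
}
```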

With the theory out of the way, here is how BASE64 is implemented in Java, using the example ‘any carnal pleasure’.

First, encode the string as a series of bytes.

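``````
// naming the charset (java.nio.charset.StandardCharsets) avoids platform-dependent results
byte[] bytes = "any carnal pleasure".getBytes(StandardCharsets.UTF_8);
``````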

This results in an array of 19 bytes.

Next, pad the array with two zero-value bytes to make its size divisible by three.

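``````
int paddingSize = (3 - bytes.length % 3) % 3; // 2 for the 19-byte example
byte[] padded = Arrays.copyOf(bytes, bytes.length + paddingSize); // java.util.Arrays; new slots are zero
``````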

Then, convert each triplet of bytes into four 6-bit words and calculate the value of each. (Use bit shift operators.) Append the BASE64 character represented by each 6-bit value to a `StringBuilder` instance.

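``````
// the BASE64 alphabet: the position of a character is its 6-bit value
final String BASE64_CHARS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
final int byteGroupSize = 3;
StringBuilder result = new StringBuilder();

for (int byteIndex = 0; byteIndex < padded.length; byteIndex += byteGroupSize) {

    // read the value of the 24-bit word starting at the current index;
    // each byte is masked with 0xFF because Java bytes are signed
    int wordOf24Bits = ((padded[byteIndex] & 0xFF) << 16) +
            ((padded[byteIndex + 1] & 0xFF) << 8) +
            (padded[byteIndex + 2] & 0xFF);

    // read the 24-bit word as 6-bit word values
    int wordOf6Bits1 = (wordOf24Bits >> 18) & 63;
    int wordOf6Bits2 = (wordOf24Bits >> 12) & 63;
    int wordOf6Bits3 = (wordOf24Bits >>  6) & 63;
    int wordOf6Bits4 = (wordOf24Bits      ) & 63;

    result.append(BASE64_CHARS.charAt(wordOf6Bits1));
    result.append(BASE64_CHARS.charAt(wordOf6Bits2));
    result.append(BASE64_CHARS.charAt(wordOf6Bits3));
    result.append(BASE64_CHARS.charAt(wordOf6Bits4));
}
``````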

This yields the BASE64 string ‘YW55IGNhcm5hbCBwbGVhc3VyZQAA’.

Finally, replace the characters produced by the padding (“AA” in this example, resulting from the two zero-value bytes) with the same number of “=” characters. The “=” is used in the BASE64 decoding process (which is not covered in this post) to determine the amount of padding that has been applied.

``````
for (int i = result.length(); i > result.length() - paddingSize; i--) {
    result.setCharAt(i - 1, '=');
}
``````

This gives the final result ‘YW55IGNhcm5hbCBwbGVhc3VyZQ==’.
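Putting the steps together, the whole procedure might look like the sketch below. The class name `Base64Sketch` and the method name `encode` are placeholders of mine, and the input is assumed to be encoded as UTF-8. As a sanity check, the `main` method compares the output with `java.util.Base64`, which is available on Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64Sketch {
    private static final String BASE64_CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    private static final int BYTE_GROUP_SIZE = 3;

    static String encode(byte[] bytes) {
        // pad with zero-value bytes up to a multiple of three
        int paddingSize = (BYTE_GROUP_SIZE - bytes.length % BYTE_GROUP_SIZE) % BYTE_GROUP_SIZE;
        byte[] padded = Arrays.copyOf(bytes, bytes.length + paddingSize);

        StringBuilder result = new StringBuilder(padded.length / BYTE_GROUP_SIZE * 4);
        for (int byteIndex = 0; byteIndex < padded.length; byteIndex += BYTE_GROUP_SIZE) {
            // assemble three unsigned bytes into one 24-bit word
            int wordOf24Bits = ((padded[byteIndex] & 0xFF) << 16)
                    | ((padded[byteIndex + 1] & 0xFF) << 8)
                    | (padded[byteIndex + 2] & 0xFF);

            // emit the four 6-bit words, most significant first
            for (int shift = 18; shift >= 0; shift -= 6) {
                result.append(BASE64_CHARS.charAt((wordOf24Bits >> shift) & 63));
            }
        }

        // overwrite the characters that came from padding bytes with '='
        for (int i = result.length() - paddingSize; i < result.length(); i++) {
            result.setCharAt(i, '=');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        byte[] input = "any carnal pleasure".getBytes(StandardCharsets.UTF_8);
        String mine = encode(input);
        String reference = Base64.getEncoder().encodeToString(input);
        System.out.println(mine);                   // prints YW55IGNhcm5hbCBwbGVhc3VyZQ==
        System.out.println(mine.equals(reference)); // prints true
    }
}
```

The inner loop over `shift` is just a compact form of the four explicit 6-bit extractions shown earlier; either style gives the same output.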

I know that there are at least two classes in the standard Java libraries that provide BASE64 operations. One of those is undocumented and is subject to change, and the other is meant to be used by the mail library; referencing either of them in code that does not otherwise depend on the libraries where the classes reside could cause confusion (and would arguably be bad form). Since Java 8 there is also the documented `java.util.Base64` class. By writing my own implementation, I can avoid these unnecessary dependencies and, most importantly, I can do BASE64 in any language that does not have a built-in function for it.