Learning BASE64 encoding

I used to search the web whenever I needed to do BASE64 encoding in my code, but when I had to do it again today, I thought it would pay off in the long run to learn the algorithm. It turned out not to be too difficult.

The point of BASE64 is to communicate binary data as text, using only characters that are likely to exist on most computer platforms. These safe characters are known as the BASE64 alphabet and are the letters A to Z and a to z, the numerals 0 to 9, and the characters + and /. There are other ways to represent bytes as text; for example, by converting them to hexadecimal strings made up of the characters 0 to 9 and A to F. However, doing so means that every byte of the original data requires two hexadecimal characters, which doubles the size of the data.

The BASE64 alphabet consists of 64 characters, each one associated with an integer value. For example, the character A is represented by 0, the character Z by 25, and character / by 63. This means that to cover the range of integers from 0 to 63, the BASE64 word size must be six bits. As a consequence of this, during BASE64 encoding the original data must be laid out and padded to make its size in bits divisible by six.
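The mapping between characters and values can be sketched as a simple lookup string, where the position of each character is its 6-bit value. The class and method names here are hypothetical, and the ordering (with + at 62 and / at 63) follows the standard BASE64 alphabet.

```java
// Sketch: the BASE64 alphabet as a lookup string.
// The index of each character is its 6-bit value.
final class Base64Alphabet {
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          + "abcdefghijklmnopqrstuvwxyz"
          + "0123456789+/";

    // Returns the BASE64 character for a 6-bit value (0 to 63).
    static char charFor(int value) {
        return ALPHABET.charAt(value);
    }

    // Returns the 6-bit value of a BASE64 character, or -1 if it is not in the alphabet.
    static int valueOf(char c) {
        return ALPHABET.indexOf(c);
    }
}
```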

The smallest number of bytes (or 8-bit words) that can be rearranged into a whole number of 6-bit words is three (3 × 8 bits = 24 bits, which is divisible by six). This means that data must be batched in triplets of bytes, and each triplet must be converted into four 6-bit words. The BASE64 character matching the value of each 6-bit word is then output as an 8-bit ASCII character. So, for every three bytes of data, four bytes of output are generated, giving an inflation factor of 4:3 (which is a better compromise than the 2:1 ratio from hexadecimal encoding).
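As a rough illustration of the triplet conversion, the sketch below packs three bytes into one 24-bit integer and then slices it into four 6-bit values with shifts and masks. The class name TripletSplitter and the method split are hypothetical names chosen for this example.

```java
// Sketch: split one triplet of bytes into four 6-bit values.
final class TripletSplitter {
    static int[] split(byte b0, byte b1, byte b2) {
        // Java bytes are signed, so mask with 0xFF before shifting.
        int bits = ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
        return new int[] {
            (bits >> 18) & 0x3F, // top six bits
            (bits >> 12) & 0x3F,
            (bits >> 6) & 0x3F,
            bits & 0x3F          // bottom six bits
        };
    }
}
```

For the bytes of "any" (0x61, 0x6E, 0x79) this gives the values 24, 22, 57 and 57, which correspond to the BASE64 characters Y, W, 5 and 5.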

Data that cannot be split exactly into groups of three bytes must be padded until they can. For example, data that are one byte long must be padded with two zero-value bytes, and data that are 11 bytes long must be padded with one zero-value byte. In other words, data must be padded to reach a size that is divisible by three.
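The number of padding bytes follows directly from the length of the data. A minimal sketch of that arithmetic (the class and method names are hypothetical):

```java
// Sketch: how many zero-value bytes are needed to reach a multiple of three.
final class PaddingCount {
    static int paddingFor(int length) {
        // The outer % 3 turns a result of 3 into 0 when no padding is needed.
        return (3 - length % 3) % 3;
    }
}
```

With this, paddingFor(1) is 2, paddingFor(11) is 1, and paddingFor(12) is 0.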

With the theory out of the way, here is how BASE64 is implemented in Java, using the example “any carnal pleasure”.

First, encode the string as a series of bytes.
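One way this step might look, assuming the text is plain ASCII (the charset is an assumption, as are the class and method names):

```java
import java.nio.charset.StandardCharsets;

// Sketch: turn the example string into bytes.
final class EncodeStep {
    static byte[] toBytes(String text) {
        // US_ASCII is an assumption; the example string contains only ASCII characters.
        return text.getBytes(StandardCharsets.US_ASCII);
    }
}
```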

This results in an array of 19 bytes.

Next, pad the array with two zero-value bytes to make its size divisible by three.
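A possible sketch of the padding step, using Arrays.copyOf, which fills the extra positions with zeros. The names PadStep and padToTriplets are hypothetical.

```java
import java.util.Arrays;

// Sketch: extend the array with zero-value bytes up to a multiple of three.
final class PadStep {
    static byte[] padToTriplets(byte[] data) {
        int padding = (3 - data.length % 3) % 3;
        // Arrays.copyOf pads the new array with zeros when it is longer than the source.
        return Arrays.copyOf(data, data.length + padding);
    }
}
```

For the 19-byte example array this returns a 21-byte array whose last two bytes are zero.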

Then, convert each triplet of bytes into four 6-bit words and calculate the value of each. (Use bit shift operators.) Append the BASE64 character represented by each 6-bit value to a StringBuilder instance.
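The conversion loop could be sketched as follows. It assumes the input array has already been padded to a multiple of three bytes; the class name ConvertStep and the method convert are hypothetical.

```java
// Sketch: convert a padded byte array into BASE64 characters.
final class ConvertStep {
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          + "abcdefghijklmnopqrstuvwxyz"
          + "0123456789+/";

    // Expects an array whose length is already a multiple of three.
    static String convert(byte[] padded) {
        StringBuilder out = new StringBuilder(padded.length / 3 * 4);
        for (int i = 0; i < padded.length; i += 3) {
            // Pack the triplet into 24 bits; mask because Java bytes are signed.
            int bits = ((padded[i] & 0xFF) << 16)
                     | ((padded[i + 1] & 0xFF) << 8)
                     | (padded[i + 2] & 0xFF);
            // Slice the 24 bits into four 6-bit values and look each one up.
            out.append(ALPHABET.charAt((bits >> 18) & 0x3F));
            out.append(ALPHABET.charAt((bits >> 12) & 0x3F));
            out.append(ALPHABET.charAt((bits >> 6) & 0x3F));
            out.append(ALPHABET.charAt(bits & 0x3F));
        }
        return out.toString();
    }
}
```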

This yields the BASE64 string “YW55IGNhcm5hbCBwbGVhc3VyZQAA”.

Finally, replace the trailing characters that result from the padding (“AA” in this example, from the two zero-value bytes) with the same number of “=” characters. The “=” characters tell the BASE64 decoder (decoding is not covered in this post) how much padding was applied.
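A sketch of this last step, assuming the number of padding bytes is known from the earlier padding step. One trailing character is overwritten per padding byte. The names PaddingMarker and markPadding are hypothetical.

```java
// Sketch: overwrite the trailing characters produced by padding with '='.
final class PaddingMarker {
    // `padding` is the number of zero-value bytes that were appended (0, 1 or 2).
    static String markPadding(String encoded, int padding) {
        StringBuilder out = new StringBuilder(encoded);
        for (int i = encoded.length() - padding; i < encoded.length(); i++) {
            out.setCharAt(i, '=');
        }
        return out.toString();
    }
}
```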

This gives the final result “YW55IGNhcm5hbCBwbGVhc3VyZQ==”.

Now, I know that there are at least two classes in the standard Java libraries that provide BASE64 operations. One of those is undocumented and subject to change, and the other is meant to be used by the mail library. Referencing either of them in code that does not otherwise depend on the library where the class resides could cause confusion (or would be bad form?). By writing my own implementation, I can avoid these unnecessary dependencies and, most importantly, I can do BASE64 in any language that does not have a built-in function for it.
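For checking a hand-written encoder, a library result is still a handy reference. The sketch below assumes Java 8 or later, where java.util.Base64 is available; the class name LibraryCheck is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: encode with java.util.Base64 (Java 8+) to cross-check a hand-written result.
final class LibraryCheck {
    static String encodeWithLibrary(String text) {
        // US_ASCII is an assumption that holds for the example string.
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```

Calling encodeWithLibrary("any carnal pleasure") should return the same string as the manual steps above, “YW55IGNhcm5hbCBwbGVhc3VyZQ==”.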
