Learning BASE64 encoding

The purpose of BASE64 is to communicate binary data as text, using only characters that exist on most computer platforms. These safe characters form the BASE64 alphabet and are the letters A to Z and a to z, the numerals 0 to 9, and the characters / and +.

Other ways of representing bytes as text exist. For example, bytes can be converted to hexadecimal strings made up of the characters 0 to 9 and A to F. But this conversion produces two hexadecimal characters for each input byte, so the output is twice the size of the input.
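To make the 2:1 cost of hexadecimal concrete, here is a minimal sketch of a hex encoder. The class name HexSizeDemo and the method toHex are made up for this illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class HexSizeDemo {
    private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();

    // Encode each byte as two hexadecimal characters (high nibble, then low nibble).
    static String toHex(byte[] input) {
        StringBuilder hex = new StringBuilder(input.length * 2);
        for (byte b : input) {
            int value = b & 0xFF;                 // treat the signed byte as 0..255
            hex.append(HEX_DIGITS[value >> 4]);   // upper four bits
            hex.append(HEX_DIGITS[value & 0x0F]); // lower four bits
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] input = "any".getBytes(StandardCharsets.US_ASCII);
        String hex = toHex(input);
        // prints: 616E79 (3 bytes -> 6 characters)
        System.out.println(hex + " (" + input.length + " bytes -> " + hex.length() + " characters)");
    }
}
```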

Each of the 64 characters of the BASE64 alphabet is associated with an integer value. For example, the character A is represented by 0, the character Z by 25, and the character / by 63. To cover the range 0 to 63, a BASE64 word must be six bits long (2^6 = 64). Because a byte has eight bits, a single byte does not map onto a whole number of BASE64 characters. Instead, the input bytes must be grouped, and padded with extra bytes where necessary, until the total number of bits is divisible by six.
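The mapping between characters and integer values can be sketched as a plain string lookup, where the position of a character in the string is its value. The class and method names below are assumptions for illustration; the alphabet order (A to Z, a to z, 0 to 9, then + and /) is the standard one.

```java
final class Base64Alphabet {
    // Index order: A-Z (0-25), a-z (26-51), 0-9 (52-61), + (62), / (63).
    static final String BASE64_CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          + "abcdefghijklmnopqrstuvwxyz"
          + "0123456789+/";

    // Character that represents a 6-bit value (0..63).
    static char charFor(int sixBitValue) {
        return BASE64_CHARS.charAt(sixBitValue);
    }

    // Integer value of a BASE64 character, or -1 if it is not in the alphabet.
    static int valueOf(char c) {
        return BASE64_CHARS.indexOf(c);
    }

    public static void main(String[] args) {
        System.out.println(valueOf('A'));  // prints: 0
        System.out.println(valueOf('Z'));  // prints: 25
        System.out.println(valueOf('/'));  // prints: 63
        System.out.println(charFor(62));   // prints: +
    }
}
```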

The smallest number of bytes (8-bit words) whose total bit count is a multiple of six is three (3 × 8 bits = 24 bits, 24 / 6 = 4). In other words, the input of BASE64 encoding must be processed in groups of 24 bits. So for every three bytes (24 bits) of input, four characters (four bytes, or 32 bits) of output are generated, giving an inflation factor of 4:3, which is still better than the 2:1 ratio of hexadecimal encoding.
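The 4:3 ratio makes the output size easy to predict. As a rough sketch (the class and method names are invented for this example), the encoded length is four characters for every started group of three input bytes:

```java
final class Base64Size {
    // Four output characters per group of three input bytes, rounding the
    // number of groups up so that a partial final group still counts.
    static int encodedLength(int inputBytes) {
        int groups = (inputBytes + 2) / 3;
        return groups * 4;
    }

    public static void main(String[] args) {
        int inputBytes = 19; // length of "any carnal pleasure"
        System.out.println("BASE64: " + encodedLength(inputBytes)); // prints: BASE64: 28
        System.out.println("hex:    " + inputBytes * 2);            // prints: hex:    38
    }
}
```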

Input data that cannot be split exactly into groups of 24 bits must be padded until it can. For example, an input that is one byte (8 bits) long must be padded with two zero-value bytes (8 bits + 2 × 8 bits = 24 bits, 24 / 24 = 1); an input that is 11 bytes (88 bits) long must be padded with one zero-value byte (88 bits + 8 bits = 96 bits, 96 / 24 = 4); and so on. In short, input data must be padded to reach a size in bytes that is divisible by three.
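The padding rule boils down to one line of modular arithmetic. The sketch below uses a hypothetical helper, paddingBytes, to compute how many zero-value bytes are needed for a given input length:

```java
final class Base64Padding {
    // Number of zero-value bytes needed to reach a length divisible by three.
    // The outer "% 3" turns a result of 3 into 0 when no padding is needed.
    static int paddingBytes(int inputBytes) {
        return (3 - inputBytes % 3) % 3;
    }

    public static void main(String[] args) {
        System.out.println(paddingBytes(1));  // prints: 2
        System.out.println(paddingBytes(11)); // prints: 1
        System.out.println(paddingBytes(12)); // prints: 0
    }
}
```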

With the theory out of the way, here is how BASE64 is implemented in Java, using the example ‘any carnal pleasure’.

First, convert the string to an array of bytes.

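// name the charset explicitly (import java.nio.charset.StandardCharsets) instead of relying on the platform default
byte[] bytes = "any carnal pleasure".getBytes(StandardCharsets.US_ASCII);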

This results in an array of 19 bytes.

Next, pad the array with two zero-value bytes to make its size divisible by three.

// Arrays.copyOf (import java.util.Arrays) fills the two extra positions with zeros
byte[] padded = Arrays.copyOf(bytes, 21);

Next, combine each group of three bytes (24 bits) into a single integer value.

Next, break each integer result (24 bits) into four integer values, each six bits long (4 × 6 bits), using bit-shifting.

Next, append the BASE64 character represented by each 6-bit integer result to a StringBuilder instance.

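// Assumed declarations; the original listing does not show them.
String BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
int byteGroupSize = 3;
StringBuilder result = new StringBuilder();

for (int byteIndex = 0; byteIndex < padded.length; byteIndex += byteGroupSize) {

    // combine three bytes into one 24-bit word; "& 0xFF" undoes the sign
    // extension that Java applies to byte values above 127
    int wordOf24Bits = ((padded[byteIndex] & 0xFF) << 16)
         | ((padded[byteIndex + 1] & 0xFF) << 8)
         | (padded[byteIndex + 2] & 0xFF);

    // split the 24-bit word into four 6-bit values
    int wordOf6Bits1 = (wordOf24Bits >> 18) & 63;
    int wordOf6Bits2 = (wordOf24Bits >> 12) & 63;
    int wordOf6Bits3 = (wordOf24Bits >>  6) & 63;
    int wordOf6Bits4 = (wordOf24Bits      ) & 63;

    // look up the BASE64 character for each 6-bit value
    result.append(BASE64_CHARS.charAt(wordOf6Bits1));
    result.append(BASE64_CHARS.charAt(wordOf6Bits2));
    result.append(BASE64_CHARS.charAt(wordOf6Bits3));
    result.append(BASE64_CHARS.charAt(wordOf6Bits4));
}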

This yields the BASE64 string ‘YW55IGNhcm5hbCBwbGVhc3VyZQAA’.
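To check the first four characters of that string by hand, the sketch below traces a single group: the bytes of "any" (0x61, 0x6E, 0x79) are packed into one 24-bit word and then split into four 6-bit values. The class name and the helpers pack and split are made up for this illustration.

```java
final class FirstGroupTrace {
    static final String BASE64_CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Pack three byte values (each 0..255) into one 24-bit word.
    static int pack(int b0, int b1, int b2) {
        return (b0 << 16) | (b1 << 8) | b2;
    }

    // Split a 24-bit word into four 6-bit values, most significant first.
    static int[] split(int wordOf24Bits) {
        return new int[] {
            (wordOf24Bits >> 18) & 63,
            (wordOf24Bits >> 12) & 63,
            (wordOf24Bits >> 6) & 63,
            wordOf24Bits & 63
        };
    }

    public static void main(String[] args) {
        int word = pack('a', 'n', 'y');
        System.out.println(Integer.toHexString(word)); // prints: 616e79

        StringBuilder chars = new StringBuilder();
        for (int value : split(word)) {
            chars.append(BASE64_CHARS.charAt(value));
        }
        // The four 6-bit values are 24, 22, 57 and 57.
        System.out.println(chars); // prints: YW55
    }
}
```

The same arithmetic applied to the last group ('e' followed by the two zero-value bytes) gives the values 25, 16, 0 and 0, which is where the trailing "ZQAA" comes from.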

Finally, replace the padding characters (“AA” in this example, resulting from the two zero-value bytes) with the same number of “=” characters. The “=” is used in the BASE64 decoding process (which is not covered in this post) to determine how much padding was applied.

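// number of zero-value bytes added earlier (2 in this example)
int paddingSize = padded.length - bytes.length;

for (int i = result.length(); i > result.length() - paddingSize; i--) {
    result.setCharAt(i - 1, '=');
}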

This gives the final result ‘YW55IGNhcm5hbCBwbGVhc3VyZQ==’.
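Putting the steps together, one possible self-contained version looks like the sketch below. The class name SimpleBase64 and the method encode are invented for this example, and two details go beyond the walkthrough: the padding size is computed from the input length rather than hard-coded, and each byte is masked with 0xFF so that byte values above 127 (which Java treats as negative) are handled correctly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SimpleBase64 {
    private static final String BASE64_CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    private static final int BYTE_GROUP_SIZE = 3;

    static String encode(byte[] bytes) {
        // Pad with zero-value bytes up to a length divisible by three.
        int paddingSize = (BYTE_GROUP_SIZE - bytes.length % BYTE_GROUP_SIZE) % BYTE_GROUP_SIZE;
        byte[] padded = Arrays.copyOf(bytes, bytes.length + paddingSize);

        StringBuilder result = new StringBuilder(padded.length / BYTE_GROUP_SIZE * 4);
        for (int i = 0; i < padded.length; i += BYTE_GROUP_SIZE) {
            // Combine three bytes into one 24-bit word.
            int word = ((padded[i] & 0xFF) << 16)
                     | ((padded[i + 1] & 0xFF) << 8)
                     | (padded[i + 2] & 0xFF);

            // Emit one character per 6-bit slice, most significant first.
            result.append(BASE64_CHARS.charAt((word >> 18) & 63));
            result.append(BASE64_CHARS.charAt((word >> 12) & 63));
            result.append(BASE64_CHARS.charAt((word >> 6) & 63));
            result.append(BASE64_CHARS.charAt(word & 63));
        }

        // Replace the characters produced by the padding bytes with '='.
        for (int i = result.length() - paddingSize; i < result.length(); i++) {
            result.setCharAt(i, '=');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        byte[] input = "any carnal pleasure".getBytes(StandardCharsets.US_ASCII);
        System.out.println(encode(input)); // prints: YW55IGNhcm5hbCBwbGVhc3VyZQ==
    }
}
```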

There are at least two classes in the standard Java libraries that provide BASE64 functions. One is undocumented and is, therefore, subject to change; the other is included in the mail library, which can be confusing when referenced in a project that does not otherwise use mail. (Since Java 8, the documented java.util.Base64 class is also available.) If you learn how to write your own implementation of BASE64, you can avoid these dependencies and, more importantly, implement it in any language.
