LJDT: Base64 Encoding
So going back to the bit about encoding and enciphering, consider an early use of encoded data: rotation encodings, of which ROT-13 is the best-known example. Rotating letters like this was done long before computers came into being, unsurprisingly for political purposes, to hide data from untrusted parties while transferring it between two trusted parties. The name comes from the word “rotation” and the number of places characters are “rotated”. The English alphabet has twenty-six characters, so half of that is thirteen. If you “rotate” every character you want to send by thirteen places you can hide data from an unmotivated or uneducated third party fairly quickly, and that is what was done. Take the line “moveTroopsNorth”: with this simple encoding you end up with “zbirGebbcfAbegu”, which looks terrible but really isn’t that hard to break, since a computer can try all twenty-five possible rotations (assuming you stick with just alphabetic characters) in a fraction of a second. Use this in a time when encoding is brand new and computers are not around and it’s a bit more tricky. Anyway, notice that “bb” appears for the two ‘o’ characters in ‘Troops’, which also means that this type of encoding can be broken using mathematics: certain letters occur with known frequencies in regular communication, so the most common letters in an encoded string can be assumed to stand for the most common letters in the cleartext. The more data there is, the easier this type of “decoding” gets, but that’s the general idea.
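As a minimal sketch of the rotation described above, the standard ‘tr’ utility can do ROT-13 in one line. The function name rot13 is just a label chosen for this illustration, and the sketch assumes plain ASCII letters:

```shell
# rot13: rotate every ASCII letter thirteen places; other characters pass through.
rot13() {
    printf '%s\n' "$1" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

rot13 moveTroopsNorth    # prints zbirGebbcfAbegu
```

Because thirteen is half of twenty-six, running the output through the same function a second time gives back the original text.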
The point behind all of this is that encoding is not meant to prevent data from being seen, and enciphering is. Ciphers use math to make the data irretrievable unless a mathematically-relevant key is used (the same that created the enciphered data, or one that is made specifically to decrypt the enciphered data) and encoding uses math to change the presentation of the data from one form to another, though not for purposes of hiding it from a third party.
Base64-encoding is used in many other parts of computing. Anybody who has used Lightweight Directory Access Protocol (LDAP) for more than a day has likely seen blocks of text about seventy-six characters wide, though they may not have figured out how to present the data in a human-understandable form. MIME messages (e-mail attachments are sent this way) use base64-encoding so that, as I recall, raw binary data is never sent to the receiving SMTP server. Some log files, instead of presenting binary data from an application in the text file itself, will base64-encode the data and then put the encoded data into the log file. SQL utilities can be set to show BLOB columns in a couple of formats, which typically include hexadecimal (a base-16 counting system) and base64 (essentially a base-64 counting system). Those who work with certificates in PEM format already know what base64 data look like, as PEM uses base64-encoding with sixty-four-character lines (except possibly the last line) to represent the certificate data in a text format. So base64 is useful anytime you want to send any data, especially binary data, anywhere; it is especially useful when the data you send cannot be binary in transit for one reason or another (it makes log files difficult to read, munges up terminals, is against the SMTP protocol, etc.).
With all that said, base64-encoding is something that can be done by hand. A commonly asked question is why ‘64’ was used instead of some higher or lower number. A higher number is desirable because more data can be represented with the same number of characters. Consider that all data, binary or otherwise, could be represented with zeros and ones just like a computer handles it. That’s fine except that it means that data are necessarily expanded several times over just to be represented if the characters zero and one are printed to the screen. Each character takes a byte of space, which is eight bits. To print out all of the zeros and ones for any given byte of data one would need to print eight characters (each of which takes up eight bits), so showing data in text format with a base-2 system increases its size about eight times. Real mathematicians were smarter than that and had a lot more characters than just zero and one to use within the set of ASCII. Base ASCII has 128 different characters, though not all of those are printable. Taking out things like the bell character, the newline, the null character, and a few others, there are still quite a few characters available (more than sixty-four), but sixty-four was the next power of two down from 128, which means each character stands for exactly six bits. So in order to represent all data as efficiently as possible using only printable characters, sixty-four characters were chosen and they are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
The actual characters used for encoding sometimes vary slightly but these are the standard ones (see http://en.wikipedia.org/wiki/Base64 for alternatives). Because of how the encoding works it is also possible to need to pad the encoded data with null values, which are represented by an equal sign (=). The reasoning behind this will hopefully become more clear as we go along. For starters note that base64-encoded strings always have a multiple of four characters and in many cases wrap around with a newline at seventy-six characters (so you do not end up with multi-million-character long strings that mess up text editors). Many of you probably know that a standard terminal has eighty horizontal characters by default and seventy-six is the next multiple of four down from eighty so I presume that is why seventy-six was used instead of seventy-eight or some other number. Because of how the encoding works there is a necessary increase in the amount of data used to store the raw data when encoded. To get into that let’s show some examples of how the encoding works.
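Before the worked examples, the size increase can be put as a quick calculation: every three bytes of input (or a final partial group of one or two bytes) become four output characters. The following is only a sketch, with b64_len a made-up name, and it ignores any newlines added for line wrapping:

```shell
# b64_len: number of base64 characters (padding included) needed for N input bytes.
# (N + 2) / 3 rounds the number of three-byte groups up; each group yields four characters.
b64_len() {
    echo $(( ($1 + 2) / 3 * 4 ))
}

b64_len 3    # prints 4
b64_len 1    # prints 4 (two data characters plus two '=' signs)
```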
First, we’re going to start with binary data for the first example and we’ll move to ASCII text after that to show how it works with regular text strings. As mentioned above there are the sixty-four characters used to represent the binary data. A represents the binary equivalent of 0, B the binary equivalent of 1, C the binary equivalent of 2, etc. as shown below:
000000 = A
000001 = B
000010 = C
Notice that these are only covering six bits of data and not the usual eight (eight bits to a byte) so when you get to the end you have the last three characters:
111101 = 9
111110 = +
111111 = /
This is how things work at the binary level, but now consider text uses ASCII characters which have eight-bit numbers. For example:
A = 65 = 01000001
B = 66 = 01000010
C = 67 = 01000011
This is where the padding comes in. Eight times three is twenty-four, which is the same as six times four. So three bytes of data are represented with four base64-encoding characters as shown below:
010000010100001001000011 is the string of binary digits that make up ABC’s ASCII values. Break those into four six-digit chunks and you get:
010000 010100 001001 000011
Now these are numbers that range from zero to sixty-three. The first is sixteen, the second is twenty, the third is nine, and the last is three. Let’s create a quick chart of which characters go with which values:
00 A  01 B  02 C  03 D  04 E  05 F  06 G  07 H
08 I  09 J  10 K  11 L  12 M  13 N  14 O  15 P
16 Q  17 R  18 S  19 T  20 U  21 V  22 W  23 X
24 Y  25 Z  26 a  27 b  28 c  29 d  30 e  31 f
32 g  33 h  34 i  35 j  36 k  37 l  38 m  39 n
40 o  41 p  42 q  43 r  44 s  45 t  46 u  47 v
48 w  49 x  50 y  51 z  52 0  53 1  54 2  55 3
56 4  57 5  58 6  59 7  60 8  61 9  62 +  63 /
So sixteen, twenty, nine, three results in:
QUJD
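The same hand conversion can be sketched in plain shell arithmetic; for ‘ABC’ it prints QUJD. This is only an illustration of the steps, not how the real base64 command is written. The names ALPHABET and encode3 are made up here, and the sketch assumes exactly three ASCII characters of input, so no padding is handled:

```shell
# The sixty-four base64 numerals in index order (0 through 63).
ALPHABET='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

# encode3: encode exactly three ASCII characters as four base64 characters.
encode3() {
    # printf '%d' "'X" prints the numeric value of the character X.
    b1=$(printf '%d' "'$(printf '%s' "$1" | cut -c1)")
    b2=$(printf '%d' "'$(printf '%s' "$1" | cut -c2)")
    b3=$(printf '%d' "'$(printf '%s' "$1" | cut -c3)")
    # Glue the three bytes into one 24-bit number.
    n=$(( (b1 << 16) | (b2 << 8) | b3 ))
    out=''
    # Peel off four six-bit values, most significant first, and look each one up.
    for shift_bits in 18 12 6 0; do
        idx=$(( (n >> shift_bits) & 63 ))
        out="$out$(printf '%s' "$ALPHABET" | cut -c"$((idx + 1))")"
    done
    printf '%s\n' "$out"
}

encode3 ABC    # prints QUJD
```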
And in just a few minutes you have converted an entire three characters to their base64-encoded equivalent. Converting back is fairly trivial as you simply run the steps in the opposite order: write down the six-bit value of each base64 numeral, string those bits together, and then split them into sets of eight bits, which are the raw binary data, in this case representing ASCII text. From the command line this conversion is much faster, as most Linux distributions include a ‘base64’ command by default, so let’s give that a try. We just converted ‘ABC’ so we need to send those characters (and no others) to the base64 command. ‘echo’ is the obvious choice for me to do that but we need to be sure to use the ‘-n’ parameter so we do not send a newline character to base64 or else it makes things a bit more exciting:
echo -n 'ABC' | base64
The result is ‘QUJD’ as we expected. Doing the opposite to get back our decoded string we pass in the encoded string with echo and then tell base64 to decode with the ‘-d’ switch:
echo -n 'QUJD' | base64 -d
and ‘ABC’ is printed to the screen without any characters after it (so it will be squished in at the start of your prompt). So those are the basics. Going beyond ASCII data you can send binary data directly to the command as well. Let’s see what happens when sending each of the first sixty-four byte values (0 through 63) to the base64 command with a simple script that shows the decimal and binary representations of the value sent to the command as well as the base64 output:
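for i in $(seq 0 63); do
    # decimal value, then the same value as eight binary digits
    printf '%02d %08d ' "$i" "$(echo "obase=2;$i" | bc)"
    # emit the raw byte with an octal escape and pipe it straight to base64
    # (a shell variable cannot hold the null byte and would lose the newline byte)
    printf "\\$(printf '%03o' "$i")" | base64
done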
00 00000000 AA==
01 00000001 AQ==
02 00000010 Ag==
03 00000011 Aw==
04 00000100 BA==
05 00000101 BQ==
06 00000110 Bg==
07 00000111 Bw==
08 00001000 CA==
09 00001001 CQ==
10 00001010 Cg==
11 00001011 Cw==
12 00001100 DA==
13 00001101 DQ==
14 00001110 Dg==
15 00001111 Dw==
16 00010000 EA==
17 00010001 EQ==
18 00010010 Eg==
19 00010011 Ew==
20 00010100 FA==
21 00010101 FQ==
22 00010110 Fg==
23 00010111 Fw==
24 00011000 GA==
25 00011001 GQ==
26 00011010 Gg==
27 00011011 Gw==
28 00011100 HA==
29 00011101 HQ==
30 00011110 Hg==
31 00011111 Hw==
32 00100000 IA==
33 00100001 IQ==
34 00100010 Ig==
35 00100011 Iw==
36 00100100 JA==
37 00100101 JQ==
38 00100110 Jg==
39 00100111 Jw==
40 00101000 KA==
41 00101001 KQ==
42 00101010 Kg==
43 00101011 Kw==
44 00101100 LA==
45 00101101 LQ==
46 00101110 Lg==
47 00101111 Lw==
48 00110000 MA==
49 00110001 MQ==
50 00110010 Mg==
51 00110011 Mw==
52 00110100 NA==
53 00110101 NQ==
54 00110110 Ng==
55 00110111 Nw==
56 00111000 OA==
57 00111001 OQ==
58 00111010 Og==
59 00111011 Ow==
60 00111100 PA==
61 00111101 PQ==
62 00111110 Pg==
63 00111111 Pw==
The resulting output is interesting because it illustrates a couple points. First, note that the first base64-encoded character increments every four lines and also that the second character changes every time in a cycle of four characters (A, Q, g, and w). This illustrates the point that the first six bits of the byte, which change every four lines, completely control the first character, while the remaining two bits (filled out with four zero bits to make a full group of six) control the second character. Also you can see that the last twelve bits of the twenty-four-bit group are not even there, so base64 pads the output with two ‘=’ characters. Another interesting point is that since data come in eight-bit chunks but are encoded in six-bit chunks, changing the placement of those bytes will significantly change the encoding. For example:
echo -n 'asdasd' | base64
YXNkYXNk
echo -n 'sdasd' | base64
c2Rhc2Q=
Notice that by removing just a single character from the front the entire encoded string changed completely. Doing the binary math makes it very clear why this happens, and working that out is probably a good exercise to test understanding of the concepts.
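For anybody who wants to check their work on that exercise, here is a rough helper that prints the six-bit groups for a string so the two inputs can be compared side by side. The name bits6 is made up for this sketch, and it assumes the usual od, fold, and paste utilities are available:

```shell
# bits6: print the bits of a string regrouped into six-bit chunks,
# zero-padding the final chunk the same way base64 does.
bits6() {
    bits=''
    # od prints each input byte as an unsigned decimal number.
    for byte in $(printf '%s' "$1" | od -An -tu1); do
        i=7
        while [ "$i" -ge 0 ]; do
            bits="$bits$(( (byte >> i) & 1 ))"    # append bit i of this byte
            i=$((i - 1))
        done
    done
    # Pad with zero bits until the length is a multiple of six.
    while [ $(( ${#bits} % 6 )) -ne 0 ]; do
        bits="${bits}0"
    done
    printf '%s\n' "$bits" | fold -w 6 | paste -s -d ' ' -
}

bits6 asdasd
bits6 sdasd
```

Dropping the leading ‘a’ shifts every later bit by eight places, which is not a multiple of six, so every six-bit group (and therefore every output character) lands on different bits.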
Each set of four characters can hold, by my calculation (echo '2^24' | bc), 16,777,216 unique combinations of binary data which is neat. The math to encode/decode is quick for a computer so it works out well for its designed purpose.
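Going the other way, the hand decoding described earlier can be sketched as well. Again this is only an illustration: ALPHABET and decode4 are made-up names, and the sketch assumes exactly four valid base64 characters with no ‘=’ padding:

```shell
# The sixty-four base64 numerals in index order (0 through 63).
ALPHABET='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

# decode4: turn four base64 characters back into the three bytes they encode.
decode4() {
    rest=$1
    n=0
    while [ -n "$rest" ]; do
        c=$(printf '%s' "$rest" | cut -c1)    # next base64 numeral
        rest=${rest#?}                        # drop it from the remaining input
        prefix=${ALPHABET%%"$c"*}             # everything before it in the alphabet
        n=$(( (n << 6) | ${#prefix} ))        # prefix length is the six-bit value
    done
    # Split the 24-bit number back into three eight-bit bytes.
    for shift_bits in 16 8 0; do
        byte=$(( (n >> shift_bits) & 255 ))
        printf "\\$(printf '%03o' "$byte")"   # emit the byte via an octal escape
    done
    echo
}

decode4 QUJD    # prints ABC
```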
As mentioned earlier base64 encoding is used all over the place. Let’s find a few places where it may have been overlooked in the past. First, some LDAP output:
nsimForgottenAction:: PEZvcmdvdHRlblBhc3N3b3JkPjxFbmFibGVkPmZhbHNlPC9FbmFibGVk
 PjxTZXF1ZW5jZT48QXV0aGVudGljYXRpb24+PCFbQ0RBVEFbXV0+PC9BdXRoZW50aWNhdGlvbj48Q
 WN0aW9uPjwvQWN0aW9uPjwvU2VxdWVuY2U+PC9Gb3Jnb3R0ZW5QYXNzd29yZD4=
Notice the spaces at the beginning of all lines after the first one? That is how LDAP files (LDIF) indicate that the line is a continuation of the previous line (spaces are very important in LDIFs as a result). Also notice that after the attribute name there are two colons (::) instead of the one that is usually there between an attribute and its value, or the changetype and a string, etc. A double colon indicates that a base64-encoded value is coming up next, so the backend knows to store it decoded if that is the right thing to do; otherwise this just looks like a normal, ugly string. If we remove the newlines and the spaces we get one nice long value of:
PEZvcmdvdHRlblBhc3N3b3JkPjxFbmFibGVkPmZhbHNlPC9FbmFibGVkPjxTZXF1ZW5jZT48QXV0aGVudGljYXRpb24+PCFbQ0RBVEFbXV0+PC9BdXRoZW50aWNhdGlvbj48QWN0aW9uPjwvQWN0aW9uPjwvU2VxdWVuY2U+PC9Gb3Jnb3R0ZW5QYXNzd29yZD4=
Sending it through the base64 command:
echo 'PEZvcmdvdHRlblBhc3N3b3JkPjxFbmFibGVkPmZhbHNlPC9FbmFibGVkPjxTZXF1ZW5jZT48QXV0aGVudGljYXRpb24+PCFbQ0RBVEFbXV0+PC9BdXRoZW50aWNhdGlvbj48QWN0aW9uPjwvQWN0aW9uPjwvU2VxdWVuY2U+PC9Gb3Jnb3R0ZW5QYXNzd29yZD4=' | base64 -d
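gives us the following output:
<ForgottenPassword><Enabled>false</Enabled><Sequence><Authentication><![CDATA[]]></Authentication><Action></Action></Sequence></ForgottenPassword>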
So from this we retrieved a snippet of XML which basically appears to be a disabled forgotten password policy. That was simple enough. Comparing the lengths of the strings we end up with 146 characters in the decoded data and 196 characters in the encoded string, which is almost exactly 4/3 of the decoded size and is what we expected, give or take, for the encoding size change mentioned earlier. Another interesting source of base64-encoded data is a PEM file. Generate one with openssl or find one on your system and try decoding it (the output will be, mostly, junk) with something like the following:
echo -n '<pemFileContentsHereWithoutNewline>' | base64 -d | less
The following website shows how to create test certificates in two simple steps and was the first hit Google gave me (there are countless others): http://sial.org/howto/openssl/self-signed/ Be sure to have ‘less’ on the end of the command showing the decoded contents so that the binary data do not mangle up your terminal. This is possible if special characters recognized by the terminal are shown and leave you restarting the terminal which is a waste of time and ‘less’ usually protects against this nicely.
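Getting the PEM contents onto one line by hand is fiddly, so here is a small sketch that does it. A PEM file wraps its base64 payload in marker lines such as “-----BEGIN CERTIFICATE-----”, and those have to go before decoding. The function name pem_body and the file name cert.pem are placeholders invented for this example:

```shell
# pem_body: read a PEM file on stdin and print only its base64 payload
# as one line, dropping the -----BEGIN/END----- marker lines and all line breaks.
pem_body() {
    grep -v '^-----' | tr -d '\r\n'
}

# Hypothetical usage (cert.pem is a placeholder file name):
#   pem_body < cert.pem | base64 -d | less
```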
Another interesting place I found some base64-encoded data was in a Novell Sentinel log file. During the Collector Manager startup I saw a huge block of what appeared to be encoded data (several megabytes of it) so I copied it out, removed the newlines/spaces, and ran it through my `base64 -d` command sending the contents to a new file. The ‘file’ command immediately identified the resulting file as a type of zip, and opening it I recognized it as the ‘ojdbc14.jar’ file that was used to connect to Oracle databases. The previously-mysterious binary data that could not have been interpreted by a human could be easily returned to its original form and identified using simple tools like ‘base64’ and ‘file’.
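The cleanup used in both the LDAP and Sentinel examples (strip the newlines and spaces, then decode) can be sketched as a tiny filter. The function name b64_clean_decode and the file names in the usage comment are made up for illustration, and the ‘-d’ switch is the one used with the base64 command throughout this article:

```shell
# b64_clean_decode: remove spaces, tabs, and line breaks from stdin,
# then hand the remaining base64 text to the decoder.
b64_clean_decode() {
    tr -d ' \t\r\n' | base64 -d
}

# Hypothetical usage (file names are placeholders):
#   b64_clean_decode < copied_block.txt > recovered.bin
#   file recovered.bin
```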
The encoding/decoding of data is sometimes done better with the work somebody else has done (the base64 command in Linux illustrates this nicely). A tool I have found that does this nicely for those stuck on an OS that doesn’t have useful tools out of the box is this website (and many others like it, but so far this one is my favorite): http://home2.paulschou.net/tools/xlate/ The site “translates” from one form to another and handles many different forms of data.
So in the end I hope this provides a summary of what base64-encoding really is, some of its practical uses, and how you can do it on your own if needed. Chances are good that if you do much in IT you will run into this at one point or another, and even if you do not live in IT those MIME attachments on e-mails use base64 encoding. The concepts around it are, to me, the most important part of it. Data can be the same whether they are presented in binary, hex, or some other form with a high base of sixty-four. What is important to me is recognizing the data for what they really are and being able to use them and the tools and techniques presented will hopefully let you do just that in the future.