Jul 19, 2008

Base64

The term Base64 refers to a specific MIME content transfer encoding. It is also used as a generic term for any similar encoding scheme that encodes binary data by treating it numerically and translating it into a base 64 representation. The particular choice of base is due to the history of character set encoding: one can choose 64 characters that are both part of the subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through systems, such as email, which were traditionally not 8-bit clean.
The precise choice of characters is difficult. The earliest instances of this type of encoding were created for dialup communication between systems running the same OS - e.g. Uuencode for UNIX, BinHex for the TRS-80 (later adapted for the Macintosh) - and could therefore make more assumptions about what characters were safe to use. For instance, Uuencode uses uppercase letters, digits, and many punctuation characters, but no lowercase, since UNIX was sometimes used with terminals that did not support distinct letter case. Unfortunately for interoperability with non-UNIX systems, some of the punctuation characters do not exist in other traditional character sets. The MIME Base64 encoding replaces most of the punctuation characters with the lowercase letters, a reasonable requirement by the time it was designed.
MIME Base64 uses A–Z, a–z, and 0–9 for the first 62 digits. There are other similar systems, usually derived from Base64, that share this property but differ in the symbols chosen for the last two digits; an example is UTF-7.
A quote from Thomas Hobbes's Leviathan:
Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.
is encoded in MIME's base64 scheme as follows:
TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4=
In the above quote the encoded value of Man is TWFu. Encoded in ASCII, M, a, n are stored as the bytes 77, 97, 110, which are 01001101, 01100001, 01101110 in base 2. These three bytes are joined together in a 24 bit buffer producing 010011010110000101101110. Packs of 6 bits (6 bits has a maximum of 64 different binary values) are converted into 4 numbers (24 = 6x4) which are then converted to their corresponding values in Base 64.
Base64 01
As this example illustrates, Base 64 encoding converts 3 uncoded bytes (in this case, ASCII characters) into 4 encoded ASCII characters.
The example below illustrates how shortening the input changes the output padding:
Input ends with: carnal pleasure. Output ends with: c3VyZS4=
Input ends with: carnal pleasure Output ends with: c3VyZQ==
Input ends with: carnal pleasur Output ends with: c3Vy
Input ends with: carnal pleasu Output ends with: c3U=
Note that the same characters will be encoded differently depending on their position within the three-octet group which is encoded to produce the four characters. For example
The Input: leasure. Encodes to bGVhc3VyZS4=
The Input: easure. Encodes to ZWFzdXJlLg==
The Input: asure. Encodes to YXN1cmUu
The Input: sure. Encodes to c3VyZS4=

No comments: