All About Programming: Base64 encoding

Base64 encoding - PrismoSkills

Every language can be represented using a set of symbols. For example: English can be written with the 26 letters of its alphabet along with some punctuation marks and the numerical digits 0 to 9.

If a language is made up of 'N' symbols, then each symbol can be represented using log₂N bits, rounded up to the next whole number.
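As a quick check of this rule, here is a minimal sketch that computes the number of bits needed per symbol. The function name bits_per_symbol is only illustrative; it uses integer arithmetic rather than a floating-point logarithm to avoid rounding surprises.

```python
def bits_per_symbol(n_symbols: int) -> int:
    """Smallest number of bits that can distinguish n_symbols distinct symbols."""
    if n_symbols < 1:
        raise ValueError("need at least one symbol")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2; use 1 bit as a floor.
    return max(1, (n_symbols - 1).bit_length())


print(bits_per_symbol(26))     # 5  (letters of the English alphabet)
print(bits_per_symbol(64))     # 6  (the Base-64 alphabet)
print(bits_per_symbol(50000))  # 16 (a symbol set in the tens of thousands)
```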

Since each language is made up of a different number of symbols, the number of bits needed per symbol differs from language to language. For example: ASCII, used to represent English, defines 128 characters and so needs 7 bits per character, although each character is usually stored in an 8-bit byte. An encoding scheme used to represent Chinese may require a substantially larger number of bits since Chinese characters number in the tens of thousands, though most of them are only minor graphic variants encountered in historical texts.

Also, there may be non-textual data such as an image file, where some bits make up textual header information while the rest represent the pixels of the image.

The point being made here is that many different encoding schemes exist, and each may require a different number of bits.

To transmit such data over a medium which supports textual data only, there should exist a common encoding scheme whose output is purely textual.

Base-64 encoding provides a way of doing this by converting each group of 6 bits into a specific character (6 bits can take 64 different values, hence the name). The character set chosen for output varies between different implementations of Base-64, but usually they adhere to the convention that the output characters should be printable and should also be common to most character encodings.

With the above two guidelines, most Base-64 implementations choose A-Z, a-z and 0-9 as the first 62 characters. The difference is mostly in the choice of the remaining 2 characters.

Usually the last 2 characters are chosen from the following: +, -, /, ., :, ! etc.
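To make the 6-bit grouping concrete, the sketch below encodes one complete 3-byte group by hand using two common alphabets: the standard one ending in + and /, and the URL-safe one ending in - and _ (both defined in RFC 4648). The names COMMON, STANDARD, URL_SAFE and encode_group are illustrative only, and the function deliberately handles full 3-byte groups only.

```python
import base64
import string

# The first 62 symbols are shared by most variants; only the last two differ.
COMMON = string.ascii_uppercase + string.ascii_lowercase + string.digits
STANDARD = COMMON + "+/"   # standard alphabet
URL_SAFE = COMMON + "-_"   # URL- and filename-safe alphabet


def encode_group(three_bytes: bytes, alphabet: str = STANDARD) -> str:
    """Encode exactly 3 bytes (24 bits) as 4 characters of 6 bits each."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    value = int.from_bytes(three_bytes, "big")
    # Slice the 24-bit value into four 6-bit indices, most significant first.
    return "".join(alphabet[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0))


print(encode_group(b"Man"))                     # TWFu
print(encode_group(b"\xfb\xff\xbf"))            # +/+/
print(encode_group(b"\xfb\xff\xbf", URL_SAFE))  # -_-_

# The standard library agrees with the hand-rolled version.
print(base64.b64encode(b"Man").decode("ascii"))  # TWFu
```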

Note 1: Since Base-64 encodes every 6 bits of input as an 8-bit character, its output is always larger than the input (4/3 times the size of the input, plus any padding). The only advantage gained from this inefficiency is that the output is textual and thus acceptable to most software systems.
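The size overhead can be checked directly. The following is a small sketch, assuming padded output; encoded_length is a hypothetical helper name, and its result is compared against Python's standard base64 module.

```python
import base64


def encoded_length(n_bytes: int) -> int:
    """Length of padded Base-64 output for n_bytes of input."""
    # Every started group of 3 input bytes produces 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


for n in (1, 2, 3, 300):
    actual = len(base64.b64encode(bytes(n)))
    print(n, encoded_length(n), actual)
```

For 300 input bytes both columns show 400, which is exactly 4/3 of the input; for 1 or 2 bytes the output is still 4 characters because of padding.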

Note 2: There exist other binary-to-text encoding schemes too, with varying degrees of efficiency. For example, the familiar hexadecimal system can also be considered a binary-to-text encoding, but it is a poorer choice for this purpose because it is much less efficient than Base-64 (since it converts every 4 bits of input into an 8-bit output symbol, the output of hexadecimal is double the size of the input).
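A short comparison of the two encodings on the same input, as a sketch using only Python's built-in bytes.hex() and the standard base64 module; the 768-byte sample is arbitrary.

```python
import base64

data = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data

hex_text = data.hex()                             # 2 characters per byte
b64_text = base64.b64encode(data).decode("ascii")  # 4 characters per 3 bytes

print(len(data))      # 768
print(len(hex_text))  # 1536
print(len(b64_text))  # 1024
```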

Padding in Base-64

Base-64 converts in groups of 6 bits.

So for 8-bit encodings such as ASCII English text, it converts 3 input characters (24 bits) into 4 output characters. If the number of characters in the input is an exact multiple of 3, then the number of characters in the output is an exact multiple of 4. When this is not so, 2 cases arise:

1. The last group in the input has 1 character.
2. The last group in the input has 2 characters.

For #1, there are 8 input bits. The first 6 bits are processed as usual, and the remaining 2 bits are padded with four 0s to their right to form a second group of 6 bits. This gives 2 output characters. To indicate that the input did not actually contain those four trailing 0s, the output is padded with 2 '=' characters.

Similarly, for #2, there are 16 input bits, of which the first 2 groups of 6 bits are processed as normal. The remaining 4 bits are padded with two 0s to their right. This gives 3 output characters. To indicate that the input did not actually contain those two trailing 0s, the output is padded with 1 '=' character.

Thus by padding with 1 or 2 '=' characters, the last group, like every other group, always produces 4 characters in the output.
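The padding rules above can be sketched as a small hand-written encoder. This is an illustration under the assumption of the standard alphabet ending in + and /; the names ALPHABET and encode are not from the original article, and real code would normally call base64.b64encode instead.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode(data: bytes) -> str:
    """Base-64 encode data, padding the last group with '=' where needed."""
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        missing = 3 - len(group)  # 0, 1 or 2 bytes short of a full group
        # Appending zero bytes pads the leftover bits with 0s on the right.
        value = int.from_bytes(group + b"\x00" * missing, "big")
        chars = [ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters made up purely of pad bits are replaced by '='.
        out.append("".join(chars[:4 - missing]) + "=" * missing)
    return "".join(out)


print(encode(b"M"))    # TQ==  (1 leftover character, 2 pad characters)
print(encode(b"Ma"))   # TWE=  (2 leftover characters, 1 pad character)
print(encode(b"Man"))  # TWFu  (complete group, no padding)
```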

Strictly speaking, the padding '=' characters are not needed, since the number of input characters can always be found by calculating 3N_o/4 rounded down, where N_o is the number of output characters not counting any padding. But still some implementations mandate padding.
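As an illustration of why the padding is recoverable, the sketch below decodes unpadded text by first restoring the '=' characters. Python's base64.b64decode is one implementation that expects padding; decoded_length and decode_unpadded are hypothetical helper names.

```python
import base64


def decoded_length(n_chars: int) -> int:
    """Number of input bytes represented by n_chars unpadded Base-64 characters."""
    if n_chars % 4 == 1:
        raise ValueError("a single leftover character cannot hold a whole byte")
    return (3 * n_chars) // 4


def decode_unpadded(text: str) -> bytes:
    """Restore the missing '=' characters, then decode with the standard library."""
    padded = text + "=" * (-len(text) % 4)
    return base64.b64decode(padded)


print(decode_unpadded("TQ"))    # b'M'
print(decode_unpadded("TWE"))   # b'Ma'
print(decode_unpadded("TWFu"))  # b'Man'
print(decoded_length(2), decoded_length(3), decoded_length(4))  # 1 2 3
```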
