Floating point numbers and their Java representation - Part 1

Floating point numbers are represented in Java using the IEEE 754 format.

The IEEE 754 standard defines five basic formats, named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits). The binary32 and binary64 formats are the single and double formats of IEEE 754-1985.

binary32:

Also called Single-precision floating-point format

occupies 4 bytes (32 bits)

Used to represent float datatype in Java

Can be divided into 3 parts : (from leftmost to rightmost bit)

1. sign bit : 1 bit; 1 indicates negative, 0 indicates positive or zero

2. exponent : 8 bits

3. significand (mantissa) : 23 bits stored; a normalized number has an implicit leading 1, giving 24 bits of precision
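The three fields can be pulled out of a float with shifts and masks. The following is a minimal sketch; the class and method names are illustrative and not part of any library, and only Float.floatToIntBits and Integer.toBinaryString come from the JDK.

```java
class FloatFields {
    // Sign: the single leftmost bit (bit 31).
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent: the next 8 bits (bits 30..23), still in stored (biased) form.
    static int storedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Significand: the low 23 bits (bits 22..0), without the implicit leading 1.
    static int mantissa(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = -0.15625f;
        System.out.println("sign     = " + sign(f));
        System.out.println("exponent = " + storedExponent(f));
        System.out.println("mantissa = " + Integer.toBinaryString(mantissa(f)));
    }
}
```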

IEEE Short Real exponents are stored as 8-bit unsigned integers with a bias of 127. Let's use the number 1.101 x 2^5 as an example. The exponent (5) is added to 127 and the sum (132) is binary 10000100.

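The stored (biased) exponent is unsigned, and therefore cannot be negative; the bias is what allows negative actual exponents to be represented. The largest value that 8 unsigned bits can hold is 255, but the stored values 0 and 255 are reserved: 0 for zero and subnormal numbers, 255 for infinity and NaN. So the largest actual exponent of a normal number is 127 (stored as 254) and the smallest is -126 (stored as 1).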
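The bias can be checked from Java. In this small sketch, 52.0f is used because it equals 1.101 x 2^5 in binary (1.625 x 32); the class name and the BIAS constant are illustrative, while Float.MIN_EXPONENT and Float.MAX_EXPONENT are the JDK's constants for the exponent range of normal float values.

```java
class ExponentBias {
    static final int BIAS = 127;

    // Stored (biased) exponent field of a float, as an unsigned 8-bit value.
    static int storedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    public static void main(String[] args) {
        float f = 52.0f; // 1.101 (binary) x 2^5
        int stored = storedExponent(f);
        // prints "stored exponent = 132 = 10000100"
        System.out.println("stored exponent = " + stored + " = " + Integer.toBinaryString(stored));
        // prints "actual exponent = 5"
        System.out.println("actual exponent = " + (stored - BIAS));
        // prints "range: -126 to 127"
        System.out.println("range: " + Float.MIN_EXPONENT + " to " + Float.MAX_EXPONENT);
    }
}
```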

Before a floating-point binary number can be stored correctly, its mantissa must be normalized. The process is basically the same as when normalizing a floating-point decimal number. For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by moving the decimal point so that only one digit appears before the decimal point. The exponent expresses the number of positions the decimal point was moved left (positive exponent) or moved right (negative exponent).

Similarly, the floating-point binary value 1101.101 is normalized as 1.101101 x 2^3 by moving the binary point 3 positions to the left, and multiplying by 2^3
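Java stores floats already normalized, so the result of this step can be observed directly. A small sketch, assuming the value 13.625f (which is 1101.101 in binary); the class and method names are illustrative, and Math.getExponent is the JDK call that returns the unbiased exponent.

```java
class NormalizeDemo {
    // The 23 stored fraction bits as a zero-padded binary string
    // (the digits after the binary point of the normalized form 1.xxx).
    static String fractionBits(float f) {
        int m = Float.floatToIntBits(f) & 0x7FFFFF;
        return String.format("%23s", Integer.toBinaryString(m)).replace(' ', '0');
    }

    public static void main(String[] args) {
        float f = 13.625f; // 1101.101 in binary
        // Exponent of the normalized form: the binary point moved 3 places left.
        System.out.println("exponent = " + Math.getExponent(f));
        // Fraction bits: 101101 followed by zeros.
        System.out.println("fraction = " + fractionBits(f));
    }
}
```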

Say,

s -> the value of the sign bit

e -> the decimal value of the stored (biased) exponent

b[i] -> ith bit in the mantissa, counting from the left and starting at 1

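The corresponding decimal floating point number value (for a normalized number, i.e. e between 1 and 254) is

value = (-1)^s * ( 1 + b[1]*1/2 + b[2]*1/4 + ... + b[23]*1/2^23 ) * 2^(e-127)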

For example,

let's take the following binary value (sign, exponent and mantissa separated by spaces) :

0 01111100 01000000000000000000000

clearly, s = 0

e = 01111100 = 124

so, the corresponding decimal value = (-1)^0 * ( 1 + 0*1/2 + 1*1/4 + 0*1/8 + 0*1/16 + 0*1/32 + ....) * 2^(124-127)

= 1 * ( 1 + 0.25) * 2^-3

= 1.25/8

= 0.15625
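The same decoding can be sketched in Java: multiply the sign by the significand (an implicit 1 plus each mantissa bit b[i] weighted by 2^-i) and by 2^(e-127). The class and method names are illustrative, and the sketch handles normal numbers only (stored exponent 1 to 254); the result is compared with the JDK's own Float.intBitsToFloat.

```java
class DecodeFloat {
    // Applies (-1)^s * (1 + sum of b[i] * 2^-i) * 2^(e - 127) to a 32-bit pattern.
    // Only valid for normal numbers (stored exponent 1..254).
    static double decode(int bits) {
        int s = bits >>> 31;
        int e = (bits >>> 23) & 0xFF;
        double significand = 1.0; // implicit leading 1
        for (int i = 1; i <= 23; i++) {
            int b = (bits >>> (23 - i)) & 1; // b[i], counted from the left
            significand += b * Math.pow(2, -i);
        }
        double sign = (s == 0) ? 1.0 : -1.0;
        return sign * significand * Math.pow(2, e - 127);
    }

    public static void main(String[] args) {
        // 0 01111100 01000000000000000000000, written without the spaces
        int bits = Integer.parseUnsignedInt("00111110001000000000000000000000", 2);
        System.out.println(decode(bits));               // prints 0.15625
        System.out.println(Float.intBitsToFloat(bits)); // prints 0.15625
    }
}
```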

Now, let's try the other way - converting a decimal floating point number to binary

Say we want to convert 13.273 to binary

Note that not all decimal fractions can be represented exactly in binary - in such cases, the accuracy is limited by the number of bits available for the mantissa
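One way to see this is to print the exact value a float actually holds. The sketch below relies on the fact that new BigDecimal(double) converts the binary value exactly, and that widening a float to double loses nothing; the class and method names are illustrative.

```java
import java.math.BigDecimal;

class InexactDemo {
    // Exact decimal expansion of the value actually stored in the float.
    static BigDecimal storedValue(float f) {
        return new BigDecimal(f); // float widens to double without loss
    }

    public static void main(String[] args) {
        // 0.15625 = 1/8 + 1/32 is a sum of powers of two, so it is stored exactly.
        System.out.println(storedValue(0.15625f)); // prints 0.15625
        // 13.273 has no finite binary expansion, so a nearby value is stored instead.
        System.out.println(storedValue(13.273f));
    }
}
```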

Now in this case, we separate the integer part (13) from the fraction part (0.273)

Converting the integer part is easy: 13 = 1101 in binary

For converting the fraction part, we repeatedly multiply it by 2.

If the result is > 1, the corresponding bit is 1 and we continue with the fraction part after subtracting 1

If the result is < 1, the corresponding bit is 0 and we continue with the result

If the result = 1, the corresponding bit is 1 and all bits after that will be 0

Applying this rule to 0.273 gives the table below. The mantissa holds at most 23 bits, and the integer part 1101 already supplies 3 of them after the leading 1, so 20 fraction bits are enough; 22 steps are shown here:

value	x2	bit	remaining
0.273	0.546	0	0.546
0.546	1.092	1	0.092
0.092	0.184	0	0.184
0.184	0.368	0	0.368
0.368	0.736	0	0.736
0.736	1.472	1	0.472
0.472	0.944	0	0.944
0.944	1.888	1	0.888
0.888	1.776	1	0.776
0.776	1.552	1	0.552
0.552	1.104	1	0.104
0.104	0.208	0	0.208
0.208	0.416	0	0.416
0.416	0.832	0	0.832
0.832	1.664	1	0.664
0.664	1.328	1	0.328
0.328	0.656	0	0.656
0.656	1.312	1	0.312
0.312	0.624	0	0.624
0.624	1.248	1	0.248
0.248	0.496	0	0.496
0.496	0.992	0	0.992
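The table can be reproduced with a short loop. This is a minimal sketch with an illustrative class and method name; it uses BigDecimal rather than double so that the intermediate values match the decimal arithmetic in the table exactly.

```java
import java.math.BigDecimal;

class FractionToBinary {
    // Repeatedly doubles a decimal fraction (0 <= value < 1) and collects
    // one bit per step: 1 if the doubled value reaches 1, otherwise 0.
    static String fractionBits(String decimalFraction, int count) {
        BigDecimal two = BigDecimal.valueOf(2);
        BigDecimal value = new BigDecimal(decimalFraction);
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < count; i++) {
            value = value.multiply(two);
            if (value.compareTo(BigDecimal.ONE) >= 0) {
                bits.append('1');
                value = value.subtract(BigDecimal.ONE); // keep only the fraction
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(fractionBits("0.273", 22)); // prints 0100010111100011010100
    }
}
```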

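So, the binary representation of 13.273 (to 22 fraction bits) is 1101.0100010111100011010100

i.e. 1.10101000101111000110101 * 2^3 after normalizing and keeping the 23 bits that fit in the mantissa (the last two bits, 00, are dropped)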

So exponent (biased) = 127 + 3 = 130 = 10000010

Since the number is positive, sign = 0

So final bit representation = 0 10000010 10101000101111000110101
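The three fields derived by hand can be packed back into an int and compared with what Java produces. A small sketch, with an illustrative class and method name:

```java
class AssembleFloat {
    // Packs sign (1 bit), stored exponent (8 bits) and mantissa (23 bits)
    // into the 32-bit layout described above.
    static int assemble(int sign, int storedExponent, int mantissa) {
        return (sign << 31) | (storedExponent << 23) | mantissa;
    }

    public static void main(String[] args) {
        int mantissa = Integer.parseInt("10101000101111000110101", 2);
        int bits = assemble(0, 130, mantissa);
        // Compare the hand-built pattern with the JDK's encoding of 13.273f.
        System.out.println(bits == Float.floatToIntBits(13.273f)); // prints true
        // Convert the pattern back to a float.
        System.out.println(Float.intBitsToFloat(bits));
    }
}
```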

Java code for verifying this :

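            // Reinterpret the 32 bits of the float as an int
            int binary = Float.floatToIntBits(13.273f);
            // Binary string of the int (leading zeros are not printed)
            String strBinary = Integer.toBinaryString(binary);
            System.out.println(strBinary);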

output (only 31 bits, since the leading 0 sign bit is not printed) :

1000001010101000101111000110101
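To make the output line up with the sign / exponent / mantissa layout, the string can be padded to 32 bits and split into fields. This is one possible sketch; the class and method names are illustrative.

```java
class PrintFloatBits {
    // Returns the 32 bits of a float as "sign exponent mantissa",
    // left-padded with zeros so that the sign bit is always present.
    static String fields(float f) {
        String raw = Integer.toBinaryString(Float.floatToIntBits(f));
        String s = String.format("%32s", raw).replace(' ', '0');
        return s.substring(0, 1) + " " + s.substring(1, 9) + " " + s.substring(9);
    }

    public static void main(String[] args) {
        System.out.println(fields(13.273f)); // prints 0 10000010 10101000101111000110101
    }
}
```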