Numbers, bases, and representations in digital computers


(Note: this page uses UTF-8 and CSS to get subscripts and simple math characters.)

Integers

Integers are the familiar numbers ..., –1, 0, +1, +2, ....  Integer values are also sometimes called “integral”, and can be divided into positive integers (1 to infinity), negative integers (–1 to –infinity), zero (0), non-negative (zero or positive), and (rarely) non-positive (zero or negative).  The distinction between positive and non-negative is usually quite important: for instance, C typically uses non-negative subscripts for arrays, specifically including zero.

Bases

We typically write our integers (and other numbers, for that matter) using “base 10” or “decimal” arithmetic.  This is a positional notation, where each “position” has a value ten times more significant than the next one.  The last digit is the number of ones, the second-to-last is the number of 10s, and so on: hence the digit-sequence 593 means “five one-hundreds, nine tens, and three ones” or five hundred ninety three.  Mathematically, we have a base ‘b’ (typically a positive integer, in this case 10) and a sequence of ‘n’ digits aₙ₋₁, aₙ₋₂, ..., a₁, a₀.  This represents the value aₙ₋₁•bⁿ⁻¹ + aₙ₋₂•bⁿ⁻² + ... + a₁•b¹ + a₀•b⁰.  (Note that b¹ = b and b⁰ = 1; we could simplify these, but the symmetry is nice.)
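
As a small illustration, here is one way to evaluate such a digit sequence in C (the function name eval_digits is just made up for this sketch; Horner's rule avoids computing the powers of b explicitly, but the result is exactly the sum described above):
#include <stdio.h>

/* Evaluate the digit sequence a[0..n-1] (most significant digit first)
   in base b, i.e., a[0]*b^(n-1) + ... + a[n-1]*b^0, by Horner's rule. */
static unsigned long eval_digits(const int *a, int n, int b)
{
    unsigned long value = 0;
    for (int i = 0; i < n; i++)
        value = value * b + a[i];
    return value;
}

int main(void)
{
    int digits[] = { 5, 9, 3 };
    printf("%lu\n", eval_digits(digits, 3, 10));   /* prints 593 */
    return 0;
}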

Note that in this mathematical notation, additional leading zero digits are allowed, with no effect: 0042 has zero thousands, zero hundreds, four tens, and two ones, and is the same as 42.  Normally we leave off the leading zeros, of course, as they do not add anything useful.

Internally, modern computers use binary, or base 2.  Here the digit-sequence 100110 represents 1•2⁵ + 0•2⁴ + 0•2³ + 1•2² + 1•2¹ + 0•2⁰, or 32 + 4 + 2 or 38.  You may wish to become intimately familiar with base-2 numbers, and at least some of the smaller powers of two (1, 2, 4, 8, 16, 32, 64, 128, 256, and so on).

As a C programmer, you need to become familiar with two more bases as well: 8 (also called “octal”), and 16 (“hexadecimal”, although Knuth argues convincingly that it should really be called “senidenary”).  Bases above 10 have a notational problem: individual digits need to exceed nine.  The solution used in C is that the letter A (in either upper or lower case) represents 10, B represents 11, and so on through F.

To write the number forty-two in C, you can use either base 10 as usual, or base 8, or base 16.  In base 8, 42 needs five eights and two more ones, so instead of 42, we write 52.  In base 16, we need two sixteens and ten more ones, so we write 2A.  Of course, if you just write “52”, people will think you mean fifty-two, not forty-two; clearly we need some way to denote the base of the number, as well as the digit-sequence.  In mathematics, we use subscripts: 42₁₀ = 52₈ = 2A₁₆.  Some assembly languages use suffixes instead of subscripts; C uses prefixes.  To mark a number as being written in hexadecimal, we precede it with the digit zero and an X (uppercase or lowercase).  To mark it as being in octal, we precede it with the digit zero.  Hence, in C, 42, 052, and 0x2a all represent the abstract number “forty-two”.
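
Here is a tiny C program—purely illustrative—showing that all three spellings denote the same abstract number, and that printf can display it again in any of the three bases:
#include <stdio.h>

int main(void)
{
    /* The same abstract value, written three different ways. */
    int d = 42, o = 052, h = 0x2a;

    printf("%d %d %d\n", d, o, h);   /* prints: 42 42 42 */
    printf("%d %o %x\n", d, d, d);   /* prints: 42 52 2a */
    return 0;
}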

Note that the number itself does not change just because we represented it differently, using a different base.  This fact is quite crucial: the number itself is an abstraction; the various representations are concrete manifestations of that abstract.

A useful property of binary (base 2) numbers is that each digit is always either zero or one.  Zero times anything is zero, and one times any other number is the other number, so only the ones “count”, as it were, and only the position of those ones matters.  Each power of two is either included, or excluded.  Also, when counting up, we find that the least significant bit simply flips back and forth, and each more-significant bit flips whenever its next-less-significant bit goes from 1 to 0: 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, ....  Given some number of binary digits, the maximum possible value occurs when they are all ones (and the minimum is of course just zero).  Moreover, this maximum value is simply one less than the next power of two: for instance, 11₂ = 3₁₀, which is one less than 100₂ = 4₁₀.  If you have your powers of two memorized, you can tell right away how many binary digits are required to hold any given integer.  For instance, since 2¹¹ is 2048 and 2¹² is 4096, an integer greater than or equal to 2048, but less than 4096, needs 12 bits to hold it.  (Extra bits are, of course, always allowed: extra bits at the front just need to be set to zero.)
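
If you would rather let the machine count the bits, a small helper along these lines works (the name bits_needed is made up for this sketch; it simply shifts the value rightward until nothing is left):
#include <stdio.h>

/* Count how many binary digits are needed to hold a non-negative value. */
static int bits_needed(unsigned int v)
{
    int n = 0;
    do {
        n++;
        v >>= 1;
    } while (v != 0);
    return n;
}

int main(void)
{
    printf("%d\n", bits_needed(2048u));   /* prints 12 */
    printf("%d\n", bits_needed(4095u));   /* prints 12 */
    printf("%d\n", bits_needed(4096u));   /* prints 13 */
    return 0;
}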

C lacks the ability to write out binary numbers directly, but octal and hexadecimal tend to suffice.

Representations for integers

Today's computers store values in binary.  It is easy enough to represent non-negative integers this way, up to some upper limit anyway.  If a computer has 8-bit bytes, and uses four of these bytes to store an integer, it has 32 bits available.  If we go back to our positional notation above, we can see right away that this can represent values up to 2³¹ + 2³⁰ + ... + 2¹ + 2⁰, or 2³²–1, or (in decimal) 4,294,967,295.

This is all well and good for non-negative integers, but what about negative integers?  Over time, a number of schemes have been used to represent negative integers in binary computers.  The one that is perhaps simplest conceptually is the sign-and-magnitude method, also called signed magnitude.  The other two are called ones’ complement and two’s complement.  All three methods are allowed in C.  In all three cases, we pick one bit to represent the sign, and use the rest of the bits for the value.  It does not matter which bit we choose, but for notational convenience we tend to pick the “first” bit (this causes some interesting fights later on when we look at “endianness” of machines, since which bit comes “first” depends on where you stand—at the low-order end or the high-order end—when looking over at bits and bytes).

In signed magnitude, you simply take the value of all the “value bits” to get the value, and then if the sign bit is set, call the number negative.  Hence the number 1 (for the sign), 011 (for value) has the value 3₁₀, and is negative, so represents the value “negative three”.  To make the number positive we just turn off the sign bit, giving +3.  One minor drawback to this method is that there are two zeros: regular or “apparently positive” zero (with the sign bit zero as well as all the value bits), and a “negative” zero (still zero in the value bits, but the sign bit set).  A more serious drawback is that this representation makes doing arithmetic more complicated, slowing down the computers.  Before we can add two numbers, we have to see whether they have the same sign.  If not, we have to subtract the negative one, instead of just adding.  (Of course, this gives us a chance to catch overflows, but as we will see over and over again in computing, it seems to be more important to compute an answer as fast as possible, even if it is wrong.)
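
As a sketch—using a hypothetical 4-bit format with one sign bit and three value bits, and a made-up function name—decoding signed magnitude in C might look like this:
#include <stdio.h>

/* Decode a 4-bit sign-and-magnitude pattern: bit 3 is the sign,
   bits 2..0 are the magnitude.  Purely illustrative. */
static int sm_decode(unsigned bits)
{
    int magnitude = bits & 0x7;          /* the three value bits */
    return (bits & 0x8) ? -magnitude : magnitude;
}

int main(void)
{
    printf("%d\n", sm_decode(0xB));      /* 1011 -> -3 */
    printf("%d\n", sm_decode(0x3));      /* 0011 -> +3 */
    printf("%d\n", sm_decode(0x8));      /* 1000 -> "negative zero", prints 0 */
    return 0;
}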

The ones’ complement method avoids the second drawback.  Here, to represent a negative number, we again set the sign bit, but this time we also invert all the value bits (change ones to zeros, and vice versa).  Now to represent negative-three we set the sign bit, but also flip the value bits, giving 1 (for the sign), and 100 for the value.  When we go to add two numbers, we need a device called an “end-around carry” to bring any carry bit out of the sign bit back over into the value bits.  Watch how this works when we add –3 and +2, and then –3 and –1, and then –3 and +4 (the dots below represent carries that are added back in):
 1100 (-3)
+0010 (+2)
-----
 1110 (-1)

 .  .
 1100 (-3)
+1110 (-1)
-----
 1011 (-4)

 .  .
 1100 (-3)
+0100 (+4)
-----
 0001 (+1)
The sum is negative in both of the first two cases, and positive in the third.  In the first addition, there are no carries, so the sum is easy.  In the second addition, there is a carry out of the two “1” bits in the most significant value bit, which is added to the two sign bits (both originally set) so that the final sign bit is still set.  The two sign bits, when added, also produce a carry, and this carry is sent “around the end” to add back to the lowest value bits (both zero in this case) so that the final sum has the low-order bit set as well.  The last case is similar: the two most significant value bits produce a carry-out as well as a zero.  But this time, the carry is added to the (single) sign bit, giving another carry but leaving the sign of the sum set to 0.  The carry out of the sign bit is brought around the end to add back in to the low order bit.
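
For the curious, here is a small C sketch that mimics a four-bit ones’ complement adder, including the end-around carry.  (The function oc_add is made up for this illustration; real machines did this in the adder hardware itself.)  It reproduces the three sums above:
#include <stdio.h>

#define MASK 0xFu    /* four-bit words, as in the examples above */

/* Add two 4-bit ones' complement values, bringing any carry out of the
   sign bit back around into the low-order bit. */
static unsigned oc_add(unsigned x, unsigned y)
{
    unsigned sum = x + y;                 /* ordinary binary add */
    if (sum > MASK)                       /* carry out of the sign bit? */
        sum = (sum + 1) & MASK;           /* end-around carry */
    return sum;
}

int main(void)
{
    printf("%X\n", oc_add(0xC, 0x2));     /* -3 + +2 -> E (-1) */
    printf("%X\n", oc_add(0xC, 0xE));     /* -3 + -1 -> B (-4) */
    printf("%X\n", oc_add(0xC, 0x4));     /* -3 + +4 -> 1 (+1) */
    return 0;
}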

Again, there are two representations for zero: the normal, regular zero; and the one with the sign bit (and in fact every bit) set, which is a “negative zero”.  This value is slightly peculiar, and on a number of machines that used ones’ complement, an attempt to use “–0” was (sometimes optionally) caught at runtime and could thus catch attempts to use variables that had not been set to a “proper” value yet, by simply filling all memory with all-one-bits in advance.  (A similar trick is found on today's computers, in another form entirely.)

Here are a few more example sums, using ones’ complement arithmetic.  Note that if both of the low-order value bits being added are 1s, their sum is 0 (with a carry into the next value bit), but an end-around-carry will cause the low order bit to become one again:
 . ...
 11001 (-6)
+11011 (-4)
------
 10101 (-10)
(We had to go to five bits total – or four value bits – to hold the result, in this case.)  Similarly, if only one of the low-order bits is 1, their sum is 1 (with no carry), but this time an end-around carry will cause a carry out, sending ripples on up the chain. The worst cases tend to occur when adding “negative zero” to a value:
 .....
 00010 (+2)
+11111 (-0)
------
 00010 (+2)
Peculiarly, adding –0 to any number except +0 gives the original number, while adding –0 to +0 gives –0.

With five bits, the allowed value range is –15 to +15:
  ....
 10111 ( -8)
+10000 (-15)
------
 01000 ( +8)
Here we actually got the wrong answer, not due to ones’ complement so much as the fact that the result just does not fit: it should be –23.

The last method, and actually the most common by far today, for representing negative integers is two’s complement.  (Incidentally, see Knuth for the reason for the difference in apostrophe placement in ones’ and two’s complement.)  This is much like ones’ complement, except that if the sign bit is on, we flip all the other value bits and then add 1 to them, to find the value being represented.  Hence, while –3 is 1100 in a four-bit ones’ complement number, it becomes 1101 in two’s complement.  This extra step—“then add 1”—avoids the need for an end-around carry.  Now we can add any two values, simply discarding the carry out of the sign bit, to get the sum:
 . 
 11010 (-6)
+11100 (-4)
------
 10110 (-10)
This means we can use the same machine instruction to add signed integers that we use to add unsigned integers.  This is not the case with ones’ complement: look at the first example above that adds –6 and –4 to get –10.  The end-around carry produced the correct result (–10), but if the two five-bit sequences were meant as unsigned integers, their values would be 25 and 27 respectively, and the sum ought to be 52, or 110100 in binary.  Of course, this value needs six bits, not just five, but the low-order bits we got earlier were 10101, not 10100.  In the two’s complement example just above, when we added –6 and –4, their unsigned equivalents were not 25 and 27 but rather 26 and 28, and their sum should be 54, or 110110 in binary.  Now we have the correct low-order bits (and a carry, which indicates that the result is too large to fit in our unsigned, five-bit field).  This kind of arithmetic is needed to implement C’s unsigned integers.

These are all relatively minor points in most cases, because as programmers working in a sufficiently high level language, we do not have to concern ourselves with the gritty details of how the machine implements numbers internally.  At some point, however, the seams start to show—and it is at this point that it becomes crucial that we understand the representations, and their failings.  For instance, while sign-and-magnitude and ones’ complement both have a “negative zero” that we may have to deal with, two’s complement has a different problem: it can represent one more negative value than positive.

In sign-and-magnitude, all-ones is –1, but a sign bit and all-zeros is “–0”.  In ones’ complement, all-ones is “–0”.  In two’s complement, all-ones is –1 again—so what is a sign bit and all-zeros?  The answer is, it is a negative value that has no corresponding positive value.  If we flip all the bits in 1000...00, then add 1, we add 1 to 0111...11 which gets 1000...00 again.  In 16 bits, a 1 for the sign followed by all-zeros for the value represents –32768, while the largest positive value is +32767.  Negating –32768 leaves the value at –32768 (unless, as hardly ever happens, the machine catches this error for you; again, it is usually more important to get the wrong answer as fast as possible).
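
In C, the actual limits of the implementation’s integers are spelled out in <limits.h>; this little program just prints them (the particular values shown in the comments assume a 32-bit two’s complement int, which is typical but not required):
#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("INT_MIN = %d\n", INT_MIN);   /* -2147483648 with 32-bit int */
    printf("INT_MAX = %d\n", INT_MAX);   /*  2147483647 with 32-bit int */
    /* Writing -INT_MIN would overflow: in C that is undefined behavior,
       and on most two's complement hardware it silently yields INT_MIN. */
    return 0;
}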

Thus, all three representations have drawbacks: sign-and-magnitude is easy to understand, but requires a lot of extra hardware and has a “negative zero”; ones’ complement requires a little extra hardware (the end-around carry) and, as with signed magnitude, needs separate instructions for signed and unsigned and still has a “negative zero”; and two’s complement has a negative value that has no corresponding positive value.  But two’s complement needs no extra hardware, can be made very fast, and can get the wrong answer as fast as possible using a simplified instruction set, so most modern computers do that.

Note, incidentally, that I have talked about computing the positive value of a negative number in terms of bit inversions (and possibly adding 1).  There is a nice neat mathematical interpretation we can use instead, with the digit-sequence summation rules as above.  Instead of the aₙ₋₁ term being multiplied by 2ⁿ⁻¹, however, this most significant “sign” bit simply has a value of –(2ⁿ⁻¹) or –(2ⁿ⁻¹–1), for two’s complement and ones’ complement respectively.  With a 4-bit binary number, for instance, the low 3 bits represent 4s, 2s, and 1s (so that the maximum value of these three bits is 7) but the “top” bit, instead of representing 8, represents –8 or –7.  Hence the digit-sequence 1001, in two’s complement, represents –8+1 or –7, and 1000 represents –8.  In ones’ complement, these sequences represent –7+1 or –6, and –7, respectively; and all-ones, which is –8+7 or –1 in two’s complement, is now –7+7, or zero (but because of the sign bit, is –0 instead of +0).
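
This weighted-sign-bit view is easy to turn into code.  Here is an illustrative pair of decoders for 4-bit patterns (the names decode2c and decode1c are made up for this sketch); note that the outputs match the examples just given:
#include <stdio.h>

/* Decode a 4-bit pattern by giving the sign bit a negative weight:
   -8 for two's complement, -7 for ones' complement.  Illustrative only. */
static int decode2c(unsigned bits) { return (int)(bits & 7u) - ((bits & 8u) ? 8 : 0); }
static int decode1c(unsigned bits) { return (int)(bits & 7u) - ((bits & 8u) ? 7 : 0); }

int main(void)
{
    printf("%d %d\n", decode2c(0x9), decode1c(0x9));   /* 1001: -7 and -6 */
    printf("%d %d\n", decode2c(0x8), decode1c(0x8));   /* 1000: -8 and -7 */
    printf("%d %d\n", decode2c(0xF), decode1c(0xF));   /* 1111: -1 and -0 (i.e., 0) */
    return 0;
}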

Fixed-point

C does not have a built-in fixed-point type, but these are easy to build out of integer types.  With a binary fixed point, we simply designate some of the bits that represent the number as “fraction bits”.  Continuing the mathematical notation from above, the digits before the “binary point” represent 2ⁿ⁻¹, 2ⁿ⁻², etc., down to 2⁰.  The subsequent digits represent 2⁻¹, 2⁻², and so on.  Note that 2⁻¹ is 0.5, 2⁻² is 0.25, 2⁻³ is 0.125, and so on—so we can represent 4.5 and 4.25 and 4.75, but depending on how many bits we have, we may never be able to represent some values exactly.  If the last two bits are “past the binary point”, 10000 represents 4.00, 10001 represents 4.25, 10010 represents 4.50, and 10011 represents 4.75, for instance, but there is no bit pattern that can represent 4.10.  In effect, we just “pre-multiply” (or scale) by 2ᵏ, where k is the number of digits past the binary point: 4.5 times 4 is 18, which is 10010 in binary.  4.1 times 4 is 16.4, and the “point four” part is not representable.

C's unsigned integer types work well for fixed-point numbers, though of course, they are unsigned.  If you need negative fixed-point numbers, you can simply use signed arithmetic.  This will work fine for most cases, but note what happens when you multiply or divide two fixed-point numbers: they already include that scale factor of 2ᵏ, so the product of two includes a factor of 2²ᵏ, and dividing two removes the scale factor.  Depending on the use of the result, you will have to re-adjust the scale.  It is tempting to do this with a shift operation, but shift operations are only well-defined on unsigned integers.  If the underlying hardware uses ones’ complement or signed magnitude, you will usually get a wrong answer.  Even with two’s complement, shifting a negative number can produce the wrong answer sometimes.  (You can perhaps console yourself with the idea that you will get the wrong answer as fast as possible.)
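
As a sketch of how such a type might look—with made-up names, two fraction bits, and unsigned arithmetic throughout, as suggested above—consider:
#include <stdio.h>

/* A made-up unsigned fixed-point type with two fraction bits (scale 4),
   matching the 4.25/4.50/4.75 example above. */
typedef unsigned int fix_t;
#define FRAC_BITS 2u

static fix_t fix_make(unsigned whole, unsigned quarters)
{
    return (whole << FRAC_BITS) | quarters;
}

/* The raw product carries a scale factor of 2^(2k); one shift rescales it. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (a * b) >> FRAC_BITS;
}

int main(void)
{
    fix_t a = fix_make(4, 2);          /* 4.50, stored as 18 */
    fix_t b = fix_make(2, 1);          /* 2.25, stored as  9 */
    fix_t p = fix_mul(a, b);           /* 10.125 truncates to 10.00 */

    printf("%u.%02u\n", p >> FRAC_BITS, (p & 3u) * 25u);   /* prints 10.00 */
    return 0;
}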

Floating-point

Floating point comes in a bewildering variety of implementations, at least on older hardware.  C allows for almost all of them.  Addressing all the details is far beyond the scope of this web page, so I will deliberately ignore such horrors as IBM’s “hex float”, and concentrate strictly on IEEE floating point, as found on most of today’s microprocessors.

Consider a typical “scientific notation” number written in decimal, which looks like 7.2345•10¹².  For convenience, this is often written without using superscripts as 7.2345E12.  This consists of two pieces, a significand (often also called a “mantissa”, although strictly speaking, the mantissa in this case is 0.2345, while the significand is the complete 7.2345 part) and an exponent—in this case, 12.  There are several useful features of this notation, only some of which apply to computer representations.  The first feature is that in this normal form, there is exactly one digit before the decimal point.  We can use this later, to avoid storing the “point” character in the computer.  One that does not apply, or at least not directly, is the concept of “significant digits”: the number 7.2345 has five digits (we do not count the decimal point) so the number can be considered to be accurate to five digits.  If the number were accurate to six digits, we could write 7.23450E12: the extra digit here does not affect the final value.  A third feature, which does apply this time, is that multiplication and division are easier—we multiply or divide the significands, and then simply add or subtract the exponents.  Curiously, addition and subtraction become harder, because 1.4171901E7 plus 3.3E1 requires "re-scaling" one of the numbers: 3.3E1 first becomes 0.0000033E7, in what we can call a denormalized form (because the decimal point is not immediately after the first digit), and then we can add the two to get 1.4171934E7.  (It is possible to denormalize the other operand, or even both, and do the addition and then normalize the result.  The situations can get tricky with extreme exponents, however.  They also get tricky when subtracting two numbers that are very close in value, such as 1.99999E5 minus 1.99998E5, giving 0.00001E5.  This has to be re-normalized to 1.00000E0.)

Scientific notation is, in effect, a decimal form of floating point.  It is easy to fall into a trap here: computer arithmetic is usually printed out in decimal floating point form, making it easy to believe that the numbers were in that form all along.  But in fact, most computers today use a binary floating point format.

Consider the number 4.5, but in binary floating-point.  As with the fixed-point example above, 4.5 is 2² + 2⁻¹.  In binary, then, it is “100.1”.  All we have to do now is “normalize” this number, to read (in a “newly reinvented” notation using the letter B instead of E) “1.001b+2”.  In other words, we now have 1 one, no halves, no quarters, and one eighth (1.125), but multiplied by two-squared or 4.  As with the fixed-point number, we simply scale by 4—well, in this case that is.  The difference between fixed and floating point is that the scale factor can change.  To represent 9.0, we will scale by 8 (2³): now we want 1, no halves, no quarters, and one eighth (1.125) but multiplied by eight, giving 1.001b+3.

In decimal floating-point—scientific notation—it is trivial to scale by powers of ten: we just fiddle with the exponent.  Similarly, in binary floating-point, it is trivial to scale by powers of two.  Since 9 is twice 4.5, we just bump the exponent one.  To get to 18, we just bump the exponent again.  Each time, the significand stays 1.001; only the exponent changes.

Curiously enough, binary floating-point is absolutely terrible at representing numbers like one-tenth.  In decimal, this is easy: 1.0e–1.  In binary, however, one-tenth is one-sixteenth (0.0625) plus one thirty-second (0.03125) plus no sixty-fourths (0.015625; this takes us over) plus no 128ths (0.0078125; this still takes us over) plus one 1/256th plus one 1/512th plus no 1/1024ths, and so on, giving 1.100110011001100...b–4 as its binary representation.  The sequence "1100" repeats forever, in much the same way as one-third, written in decimal, comes to 3.33333...e–1, using an endless string of "3"s.  (In fact, it is easy to see that any decimal fraction that does not end with the digit 5 has this problem.)  Fortunately, integer values, at least for “sufficiently small” integers, are always exactly representable in binary floating-point.
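
You can watch this happen on any machine that uses binary floating point: printing one-tenth with plenty of digits exposes the rounding, and adding one-tenth to itself ten times usually fails to produce exactly 1.0 (the exact digits printed vary a little from one implementation to another):
#include <stdio.h>

int main(void)
{
    /* One-tenth is not exactly representable in binary; extra digits
       expose the rounding. */
    printf("%.20f\n", 0.1);            /* e.g. 0.10000000000000000555 */
    printf("%.20f\n", 0.5);            /* 0.50000000000000000000, exact */

    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;                    /* the error accumulates */
    printf("%d\n", sum == 1.0);        /* almost certainly prints 0 */
    return 0;
}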

Note that if we always write our binary floating-point number in normalized form, not only is the binary-point always the second character, but also the first character is always—always—a “1”.  That means we can not only omit storage for the binary point, but omit storage for the first digit as well.  So, inside the computer, the number 1.1011b+4 (which represents 16+8+2+1 or 27) can be represented as just the binary digit sequence “1011” and the exponent (+4).  Hence, in IEEE binary floating point, the storage is broken into two pieces: one for the mantissa (this term is now being used correctly, since the leading digit is no longer stored) and one for the exponent.  Of course, we also have pesky sign issues to deal with: both the mantissa and the exponent are signed, as numbers can take forms like –1.1011b–4 as well as +1.1011b+4.  The technique used here, in IEEE format, is a bit peculiar: the mantissa is stored as signed-magnitude, as with older computers and integers; but the exponent is stored in an “excess” format.  Here, negative numbers have a bias added to make them non-negative.

Without getting into too many additional details, what this amounts to is that floating-point numbers have three pieces (instead of four): one bit for the sign of the mantissa; some fixed number of bits for the mantissa itself; and the rest for the exponent, in excess form. The “excess” amount depends on the number of bits remaining.  For instance, IEEE single-precision is a 32-bit format, with 1 bit for mantissa-sign, 23 bits for mantissa, and 8 bits for exponent.  Since eight unsigned bits can represent numbers from 0 to 255 inclusive, the exponent should logically be stored in “excess-128” form.  For reasons I am not really prepared to explain, however, the actual excess is 127, so an exponent value of 1 represents 2¹⁻¹²⁷, i.e., 2⁻¹²⁶.  As we are about to see, an exponent value of zero is used for other purposes.

Since we use the normalized form, the 23 mantissa bits actually store 24 bits of significand value, with the implied leading “1” digit and binary point.  Our numbers are always plus-or-minus (based on mantissa sign) 1.something, b, plus-or-minus something (whatever is in the adjusted exponent).  Or, perhaps more understandably, if the actual mantissa value is m, and the actual (unadjusted) exponent value is e, the value represented is 1.m•2ᵉ⁻¹²⁷ (and then negated if the sign bit is set).
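
Assuming that float really is the 32-bit IEEE single-precision format described here (true on nearly all current hardware, though C itself does not promise it), we can pull the three fields apart with a little bit-fiddling:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 4.5f;                    /* 1.001b+2 in the notation above */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* reinterpret the 32 bits */

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFFu;     /* stored in excess-127 */
    unsigned mantissa = bits & 0x7FFFFFu;         /* the 23 stored bits */

    printf("sign=%u exponent=%u (2^%d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* for 4.5: sign=0 exponent=129 (2^2) mantissa=0x100000 */
    return 0;
}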

If you have been paying attention, or are familiar with the issue, you should have noticed a problem by now.  If the mantissa always means one-point-something, how can we represent zero?  The best we can do with a minimum normal exponent (unadjusted value 1, representing 2⁻¹²⁶) is 1.000...000b–126, which is about 1.17549435e–38.  Thus, we sneak a few values off the ends by declaring certain exponent values to be “special”.  In particular, an all-zero-bits exponent, which “ought” to represent b–127 in this case, is taken for two special cases. For a (perhaps false) symmetry, an all-ones exponent, which “ought” to represent b+128, is taken for two more.

The first two special cases are “zero” and denormalized (also known as subnormal) numbers: a value that has all-zero bits in the mantissa and a zero exponent represents 0.0 (or –0.0 if the sign bit is set), while one that has some nonzero mantissa bits is considered to be a “denorm”.  Instead of 1.m•2ᵉ⁻¹²⁷, the number is taken as 0.m•2ᵉ⁻¹²⁶ (or, if you like, move the binary point one character rightwards, and decrease the exponent to –127).  Of course e here is zero, so this is just 0.m•2⁻¹²⁶ (or m.m•2⁻¹²⁷ with the point after the first digit of m).  The minimum nonzero single-precision number is thus 0.00000000000000000000001b–126 (there are 22 zeros after the binary point, then the 1), which is 2⁻¹⁴⁹, or about 1.4e–45.
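
Again assuming IEEE single precision, the boundary between normalized and denormalized values is easy to poke at from C: FLT_MIN is the smallest normalized value, and scaling 1.0 down by further powers of two walks into the denormals and finally underflows to zero (under the default round-to-nearest mode):
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    printf("%g\n", FLT_MIN);              /* smallest normal: about 1.17549e-38 */
    printf("%g\n", ldexpf(1.0f, -149));   /* smallest denormal: about 1.4013e-45 */
    printf("%g\n", ldexpf(1.0f, -150));   /* rounds to zero under default rounding */
    return 0;
}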

The last two special cases, with all-ones exponents, are used to represent “Infinity” (if all the mantissa bits are zero) and “Not a Number”, or “NaN” (if some mantissa bits are set).  As a useful trick, if all bits are set, the exponent is by definition all ones, and the mantissa clearly has at least some nonzero bits (in fact all are nonzero), so this is a NaN.  Setting memory to all-ones thus catches some uses of uninitialized variables.

Note that infinities are signed—positive infinity is greater than all other numbers, and negative infinity is less than all others—but NaNs, while they still have a sign bit, are not supposed to be considered signed.  Not every implementation always gets this right.  As a further complication, NaNs are split into “Quiet NaNs” and “Signaling NaNs”, with QNaNs in calculations just causing the result to be another QNaN, while SNaNs are supposed to be caught at runtime.  Again, not every implementation gets this right.
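
C99’s <math.h> provides macros for manufacturing and classifying these special values; a small demonstration (with the usual caveat that not every implementation gets every detail right):
#include <stdio.h>
#include <math.h>

int main(void)
{
    double pos_inf = INFINITY;
    double nan_val = NAN;                 /* a quiet NaN */

    printf("%d %d\n", isinf(pos_inf) != 0, isinf(-pos_inf) != 0);   /* 1 1 */
    printf("%d\n", isnan(nan_val) != 0);                            /* 1 */
    printf("%d\n", nan_val == nan_val);   /* 0: a NaN never compares equal */
    printf("%d\n", pos_inf > 1e308);      /* 1: +Infinity exceeds everything */
    return 0;
}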

Calculations whose result exceeds the maximum possible representable value are supposed to convert to +Infinity: for instance, 1e38 times 1e1 would overflow, so should yield +Infinity.  Negative numbers that exceed the maximum negative number likewise overflow to –Infinity (so –1e38*1e1 = –Infinity).

With an exponent that goes only to b+127, single-precision numbers are limited to about 1e38.  (The actual number, 340282346638528859811704183484516925440, or about 3.4e+38, is what you get with 1.m•2ᵉ⁻¹²⁷ when m is all-1-bits and e=254.)  Double-precision numbers generally use 64 bits, broken into 1 sign bit, 52 mantissa bits, and 11 exponent bits (stored in excess-1023, since 2¹⁰ is 1024).  The larger exponent gives a range of about 1e308.  Extended precision, if available, uses 80 bits on some systems including IA-32, 128 bits on others.  Both the mantissa and exponent fields are expanded, with the exponent normally getting 15 bits and the mantissa getting the rest (which thus vary); the maximum number here is about 1e4932.  Intel IA-32 architecture FPUs are peculiar in that they use 80-bit representations for all numbers internally, converting to 32 or 64 bits when loaded or stored.  While there are some control bits to compensate for the extra significand precision, there are none for the extra exponent range.  This means that some computations that should overflow, in fact give answers other than Infinity.  For instance, multiplying the two double-precision numbers 1e200 and 1e200 should overflow, not produce 1e400.  Dividing the result (+Infinity) by 1e300 should leave the result at Infinity.  On many Intel chips, the only way to cause this to happen is to store calculations back to memory, which slows down the computation.  Hence this small C program:
#include <stdio.h>

int main(void)
{
    double a = 1.0e200, r;
    r = a * a;          /* should overflow to +Infinity */
    r = r / 1.0e300;    /* Infinity / 1.0e300 should stay Infinity */
    printf("%g\n", r);
    return 0;
}
should print an infinity (most C libraries render it as “inf”), but actually prints 1e+100, on many Intel machines with many C compilers.  (Again, we see it is important to get the wrong answer—at least, wrong as far as IEEE floating-point is concerned—as fast as possible.  One can, of course, argue that this is obviously the right answer.  Still, the rules for both C and IEEE arithmetic dictate that the result should be Infinity.)
