Floating Point Numbers  Page 6
 
Wide Floating Point
 
Addition
 
To add two WideFloat numbers that have the same exponent, simply add the corresponding digit packets. If the exponents differ, the number with the smaller exponent is first shifted right, by a number of digit positions equal to the difference of the exponents, into new digit packets, and the corresponding digit packets are then added together. Any digit packet that is shifted out of range is either discarded (truncated) or used for rounding.
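The alignment step can be sketched as follows. This is only an illustration under stated assumptions, not the WideFloat implementation itself: the packet radix (BASE), the storage order (most significant packet first), an exponent that counts whole packet positions, and the name align_and_add are all hypothetical. Carrying is deliberately left out, so the returned sums may exceed the radix.

```python
BASE = 1 << 16  # assumed packet radix; the real packet width is not stated here


def align_and_add(digits_a, exp_a, digits_b, exp_b):
    """Add two equal-width magnitudes held as lists of digit packets.

    Packets are stored most significant first and the exponents count
    whole packet positions (both are assumptions of this sketch).
    Returns (raw_sums, exponent, discarded); raw_sums has not been
    carried yet, so an entry may be >= BASE.
    """
    # Make `a` the operand with the larger exponent.
    if exp_a < exp_b:
        digits_a, exp_a, digits_b, exp_b = digits_b, exp_b, digits_a, exp_a
    shift = exp_a - exp_b
    width = len(digits_a)
    # Right shift by whole packets: zeros enter at the most significant end.
    shifted = [0] * shift + list(digits_b)
    kept = shifted[:width]
    discarded = shifted[width:]  # available for truncation or rounding
    sums = [x + y for x, y in zip(digits_a, kept)]
    return sums, exp_a, discarded


sums, exponent, discarded = align_and_add([1, 2, 3], 5, [4, 5, 6], 4)
```

In this example the second operand is shifted right by one packet, so its lowest packet (6) falls out of range and is returned separately for the truncation or rounding step.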
 
Carrying is then performed beginning from the least significant digit packet. If the most significant digit packet generates a non-zero carry-out, that carry-out becomes a new digit, the other digits move one digit position to the right, the exponent is incremented, and the least significant digit is discarded (truncated) or used for rounding.
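A minimal sketch of the carrying step described above, again assuming most-significant-first storage, a hypothetical radix BASE, and an exponent counted in packet positions. The function name propagate_carries is illustrative only. The packet pushed out by a carry-out is returned so the caller can truncate or round with it.

```python
BASE = 1 << 16  # assumed packet radix


def propagate_carries(sums, exponent):
    """Normalize un-carried packet sums (most significant packet first).

    Returns (digits, exponent, dropped). `dropped` is the least
    significant packet pushed out by a carry-out, or None if there
    was no carry-out.
    """
    digits = list(sums)
    carry = 0
    # Walk from the least significant packet (end of the list) upward.
    for i in range(len(digits) - 1, -1, -1):
        total = digits[i] + carry
        digits[i] = total % BASE
        carry = total // BASE
    dropped = None
    if carry:
        # The carry-out becomes a new leading digit, the other digits
        # move one position to the right, and the exponent is incremented.
        dropped = digits[-1]
        digits = [carry] + digits[:-1]
        exponent += 1
    return digits, exponent, dropped


result = propagate_carries([0xFFFF, 0x10000], 3)
```

Here both packets overflow, so the result is the digits [1, 0] with the exponent raised from 3 to 4 and a dropped packet of 0.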
 
Subtraction
 
Subtraction is the same as for WideInteger, but with borrowing in the opposite direction, and with shifting of digit packets if the exponents are different.
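Borrowing can be sketched in the same style. The storage order is an assumption here (most significant packet first, so the borrow travels from the end of the list toward the front), as are the radix BASE and the name subtract_magnitudes. The sketch expects operands that are already aligned to the same exponent, with the first at least as large as the second.

```python
BASE = 1 << 16  # assumed packet radix


def subtract_magnitudes(digits_a, digits_b):
    """Compute a - b for aligned, equal-width packet lists.

    Packets are stored most significant first (an assumption), and
    a must be >= b. The borrow moves from the least significant
    packet toward the most significant one.
    """
    result = [0] * len(digits_a)
    borrow = 0
    for i in range(len(digits_a) - 1, -1, -1):
        diff = digits_a[i] - digits_b[i] - borrow
        if diff < 0:
            diff += BASE  # borrow one unit from the next higher packet
            borrow = 1
        else:
            borrow = 0
        result[i] = diff
    if borrow:
        raise ValueError("digits_a must be >= digits_b")
    return result


difference = subtract_magnitudes([1, 0], [0, 1])
```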
 
If the operands of a subtraction have different signs, perform addition instead. Likewise, if the operands of an addition have different signs, perform subtraction. Addition and subtraction themselves are unsigned operations in this implementation.
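The sign handling can be illustrated with a small dispatcher. This is a sketch under assumptions: signs are represented as +1 or -1, plain Python integers stand in for the unsigned magnitudes, and the rule of subtracting the smaller magnitude from the larger (taking the sign of the larger) is the usual sign-magnitude convention rather than something stated in the text. The names signed_add and signed_sub are hypothetical.

```python
def signed_add(sign_a, mag_a, sign_b, mag_b):
    """Add two sign-magnitude values; signs are +1 or -1."""
    if sign_a == sign_b:
        # Same signs: unsigned addition, sign is unchanged.
        return sign_a, mag_a + mag_b
    # Different signs: unsigned subtraction, larger minus smaller
    # (assumed convention), result takes the sign of the larger.
    if mag_a >= mag_b:
        return sign_a, mag_a - mag_b
    return sign_b, mag_b - mag_a


def signed_sub(sign_a, mag_a, sign_b, mag_b):
    """Subtract by flipping the sign of b and adding: a - b == a + (-b)."""
    return signed_add(sign_a, mag_a, -sign_b, mag_b)


example = signed_sub(+1, 7, -1, 3)  # 7 - (-3): signs differ, so magnitudes are added
```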
 
Truncation and Rounding
 
Truncation can cause gradual downward drift of values. [ 1 ] This drift becomes less important with the larger significands used in these numbers. If you use truncation, you can round later when exporting to lower precision external formats.
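The drift can be made visible with a deliberately tiny precision. The sketch below is not tied to WideFloat: it uses Python's exact Fraction type, keeps only 4 fractional bits (an arbitrary choice for illustration), and compares truncating against rounding to nearest after every addition. All names are hypothetical.

```python
from fractions import Fraction
import math

FRACTION_BITS = 4  # deliberately tiny precision so the drift is visible


def quantize(value, mode):
    """Keep FRACTION_BITS fractional bits of a non-negative Fraction."""
    scaled = value * (1 << FRACTION_BITS)
    if mode == "truncate":
        kept = math.floor(scaled)  # discard the low-order bits
    else:
        kept = round(scaled)       # nearest value, ties to even
    return Fraction(kept, 1 << FRACTION_BITS)


def accumulate(terms, mode):
    """Sum the terms, quantizing the running total after each addition."""
    total = Fraction(0)
    for term in terms:
        total = quantize(total + term, mode)
    return total


terms = [Fraction(k, 7) for k in range(1, 7)] * 100  # exact sum is 300
truncated = accumulate(terms, "truncate")
rounded = accumulate(terms, "round")
```

With these particular inputs the truncated total ends at 281.25, well below the exact sum of 300, while the rounded total ends at exactly 300. The exact match for rounding is a property of this input set; in general rounding errors only tend to cancel rather than vanish.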
 
Rounding can involve increasing the width of calculations. [ 2 ] The high bit of the highest discarded digit packet becomes a round bit (guard bit), and some or all of the subsequent bits are recorded as a sticky bit, which is set if any of them is non-zero. [ 3 ] Rounding up (away from zero) is performed if the guard bit and sticky bit are both set, or if the guard bit and lowest bit of the significand are both set. [ 4 ]
 
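The second case, rounding up when the guard bit and the lowest significand bit are both set while the sticky bit is clear, is called round to even: the discarded part is exactly half a unit in the last place, and rounding up makes the lowest significand bit even. [ 5 ]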
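The rounding decision described above can be written as a small predicate. This is a sketch under assumptions: 16-bit packets (PACKET_BITS), discarded packets listed most significant first, and a sticky bit that covers all of the bits after the guard bit (the text allows using only some of them). The name should_round_up is hypothetical.

```python
PACKET_BITS = 16  # assumed packet width


def should_round_up(lowest_kept, discarded):
    """Decide whether to round the kept significand up (away from zero).

    lowest_kept is the least significant kept packet; discarded holds
    the packets shifted out, most significant first.
    """
    if not discarded:
        return False
    # Guard bit: the high bit of the highest discarded packet.
    guard = (discarded[0] >> (PACKET_BITS - 1)) & 1
    # Sticky bit: set if any bit after the guard bit is non-zero.
    rest_of_first = discarded[0] & ((1 << (PACKET_BITS - 1)) - 1)
    sticky = 1 if rest_of_first or any(discarded[1:]) else 0
    lowest_bit = lowest_kept & 1
    # Round up above the halfway point, or at exactly halfway when the
    # kept significand is odd (round to even).
    return bool(guard and (sticky or lowest_bit))


tie_even = should_round_up(2, [0x8000])    # exact tie, significand already even
tie_odd = should_round_up(3, [0x8000])     # exact tie, significand odd
above_half = should_round_up(2, [0x8000, 1])
```

In the three examples, the exact tie with an even significand is left alone, the exact tie with an odd significand rounds up, and anything above the halfway point rounds up regardless of the lowest bit.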
 
Footnotes:
 
1.   Goldberg, D., 2007, “Computer Arithmetic”, Appendix I (on disk) in Hennessy & Patterson, Computer Architecture: A Quantitative Approach 4th ed., pp. I-63 to I-64.
 
2.   Antia, H.M., 2002, Numerical Methods for Scientists and Engineers 2nd ed., p. 27.
 
3.   Patterson, D.A., and Hennessy, J.L., 2005, Computer Organization and Design: The Hardware/Software Interface 3rd ed., p. 214.
 
4.   Goldberg, 2007, p. I-20, Fig. I.11.
 
5.   Goldberg, D., “What Every Computer Scientist Should Know about Floating-Point Arithmetic”, pp. 185-186.   http://www.validlab.com/goldberg/paper.pdf
 
 
Copyright © 2008 by 3D Software. All rights reserved.
3D Software, P.O. Box 221190, Sacramento CA 95822 USA
www.3DSoftware.com
Thursday, 20-Nov-2008 14:23:39 GMT