Floating Point Arithmetic

Table of Contents
o History of Floating Point
o Defining Floating Point Arithmetic
o Floating Point Representation
o Floating Point Format
o Floating Point Precisions
o Floating Point Operation
o Special Values
o Error Analysis
o Exception Handling
IA32 Floating Point
History
o 8086: first processor family to implement IEEE floating point, through the separate 8087 FPU (floating point unit)
o 486: merged FPU and integer unit onto one chip
Summary
o Hardware to add, multiply, and divide
o Floating point data registers
o Various control & status registers
Floating Point Formats
o single precision (C float): 32 bits
o double precision (C double): 64 bits
o extended precision (C long double): 80 bits
Defining Floating Point Arithmetic
o Representable numbers
    o Scientific notation: ± d.d…d × r^exp
    o sign bit ±
    o radix r (usually 2 or 10, sometimes 16)
    o significand d.d…d (how many base-r digits d?)
    o exponent exp (range?)
    o others?
o Operations:
    o arithmetic: +, −, ×, /, ...
        o how to round result to fit in format
    o comparison (<, =, >)
    o conversion between different formats
        o short to long FP numbers, FP to integer
    o exception handling
        o what to do for 0/0, 2*largest_number, etc.
    o binary/decimal conversion
        o for I/O, when radix not 10
o Language/library support for these operations
Floating Point Representation
o It describes a system for representing real numbers which supports a wide range of values.
o A number in which the decimal point can be in any position.
o Example: a memory location set aside for a floating point number can store 0.735, 62.3, or 1200.
In comparison with:
o Radix point or radix character: the symbol used in numerical representations to separate the integer part of a number (to the left of the radix point) from its fractional part (to the right of the radix point). Radix point is a general term that applies to all number bases.
    Ex: In base 10 (decimal): 13.625 (decimal point)
        In base 2 (binary): 1101.101 (binary point)
o Fixed point: a number in which the position of the decimal point is fixed. A fixed point memory location can only accommodate a specific number of
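Binary Cases
A floating point number is written as ±S × B^(±E) and stored in three fields:
o Sign bit
o Biased exponent
o Significand or mantissa
where:
o S is the fraction, mantissa, or significand.
o E is the exponent.
o B is the base; in the binary case, B = 2.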
IEEE 754: Floating Point in Modern Computers
o The IEEE has standardized the computer representation for binary floating point numbers in IEEE 754. This standard is followed by almost all modern machines.
IEEE 754: Floating Point Format
IEEE 754 format
o Defines single and double precision formats (32 and 64 bits)
o Standardizes formats across many different platforms
o Radix 2
o Single
    o Range 10^-38 to 10^+38
    o 8-bit exponent with 127 bias
    o 23-bit mantissa
o Double
    o Range 10^-308 to 10^+308
    o 11-bit exponent with 1023 bias
    o 52-bit mantissa
IEEE 754 Format Parameters
Floating Point Precisions
IEEE 754:
o 16-bit: Half (binary16)
o 32-bit: Single (binary32), decimal32
o 64-bit: Double (binary64), decimal64
o 128-bit: Quadruple (binary128), decimal128
Other:
o Minifloat
o Extended precision
o Arbitrary precision
Floating Point Precisions
o Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
o Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
o The other basic formats are quadruple precision (128-bit) binary, as well as decimal
PRECISION CONSIDERATIONS
o Guard bits: prior to a floating point operation, the exponent and significand of each operand are loaded into ALU registers. The register contains additional bits, called guard bits, which are used to pad out the right end of the significand with 0s.
o Rounding: another detail that affects the precision of the result is the rounding policy. The result of any operation on the significands is generally stored in a longer register.
THE USE OF GUARD BITS
THE STANDARD LISTS FOUR ALTERNATIVE APPROACHES:
o Round to nearest: the result is rounded to the nearest representable number.
o Round toward +∞: the result is rounded up toward plus infinity.
o Round toward −∞: the result is rounded down toward negative infinity.
o Round toward 0: the result is rounded toward zero.
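The four rounding policies can be tried out with Python's standard decimal module. This is only an illustrative sketch: it rounds decimal digits rather than binary significands, the mapping of each policy onto a decimal rounding constant is an assumption made here (with ties-to-even standing in for round to nearest), and the names POLICIES and round_all are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

# Assumed mapping of the four policies onto decimal rounding constants.
POLICIES = {
    "round to nearest": ROUND_HALF_EVEN,   # ties go to the even digit
    "round toward +inf": ROUND_CEILING,
    "round toward -inf": ROUND_FLOOR,
    "round toward 0": ROUND_DOWN,
}

def round_all(value, quantum="0.01"):
    """Round the decimal string `value` to `quantum` under each policy."""
    q = Decimal(quantum)
    return {name: Decimal(value).quantize(q, rounding=mode)
            for name, mode in POLICIES.items()}

if __name__ == "__main__":
    for v in ("2.346", "-2.346"):
        print(v, round_all(v))
```

For a negative input the two directed modes swap roles: rounding toward +∞ moves −2.346 to −2.34, while rounding toward −∞ moves it to −2.35.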
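Internal representation
Floating point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows: single precision uses 1 sign bit, 8 exponent bits, and 23 stored significand bits (32 bits in total); double precision uses 1, 11, and 52 (64 bits in total).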
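The packing described above can be inspected from Python with the standard struct module. This is a minimal sketch; the function names decode_binary32 and decode_binary64 are hypothetical, and the exponent returned is the biased field value, not the true exponent.

```python
import struct

def decode_binary32(x):
    """Split x, rounded to single precision, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, bias 127
    fraction = bits & 0x7FFFFF         # 23 stored significand bits
    return sign, exponent, fraction

def decode_binary64(x):
    """Split a double into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)  # 52 stored significand bits
    return sign, exponent, fraction

if __name__ == "__main__":
    print(decode_binary32(-2.5))   # -2.5 = -1.25 * 2**1
    print(decode_binary64(1.0))
```

For −2.5 the single precision fields are sign 1, biased exponent 128 (true exponent 1), and a fraction with only the 0.25 bit set.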
IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC
IEEE 754 goes beyond the simple definition of a format to lay down specific practices and procedures so that floating-point arithmetic produces uniform, predictable results independent of the hardware platform.
Floating Point Operation
EXPONENT OVERFLOW
o A positive exponent exceeds the maximum possible exponent value. In some systems, this may be designated as +∞ or −∞.
EXPONENT UNDERFLOW
o A negative exponent is less than the minimum possible exponent value (e.g., −200 is less than −127). This means that the number is too small to be represented, and it may be reported as 0.
SIGNIFICAND UNDERFLOW
o In the process of aligning significands, digits may flow off the right end of the significand. Some form of rounding is required.
SIGNIFICAND OVERFLOW
o The addition of two significands of the same sign may result in a carry out of the most significant bit. This can be fixed by realignment.
FLOATING POINT: ADDITION AND SUBTRACTION (Z ← X ± Y)
PHASE 1: ZERO CHECK
o Addition and subtraction are identical except for a sign change, so the process begins by changing the sign of the subtrahend if it is a subtract operation. Next, if either operand is 0, the other is reported as the result.
PHASE 2: SIGNIFICAND ALIGNMENT
o The next phase is to manipulate the numbers so that the two exponents are equal.
PHASE 3: ADDITION
o The two significands are added together, taking into account their signs. Because the signs may differ, the result may be 0. There is also the possibility of significand overflow by 1 digit. If so, the significand of the result is shifted right and the exponent is incremented. An exponent overflow could occur as a result; this would be reported and the operation halted.
PHASE 4: NORMALIZATION
o The final phase normalizes the result. Normalization consists of shifting significand digits left until the most significant digit (bit, or 4 bits for base-16 exponent) is nonzero.
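The four phases can be sketched on a toy format. Everything here is an assumption made for illustration: a number is a pair (m, e) meaning m × 2^e with a signed integer significand of at most P = 8 bits, bits shifted out during alignment are simply dropped (no guard bits, no rounding), and exponent overflow and underflow checks are omitted. The names fp_add, _shift_right, and _normalize are hypothetical.

```python
P = 8  # significand width in bits of a hypothetical toy format

def _shift_right(m, n):
    """Shift a signed significand right by n bits, dropping the bits that fall off."""
    return (abs(m) >> n) * (1 if m >= 0 else -1)

def _normalize(m, e):
    """Shift left until the top significand bit is set (zero stays zero)."""
    if m == 0:
        return (0, 0)
    while abs(m) < 2 ** (P - 1):
        m, e = m * 2, e - 1
    return (m, e)

def fp_add(x, y, subtract=False):
    """Return x + y (or x - y) for toy floats (m, e) meaning m * 2**e, |m| < 2**P."""
    (mx, ex), (my, ey) = x, y
    # Phase 1: zero check; subtraction is addition with the subtrahend negated.
    if subtract:
        my = -my
    if mx == 0:
        return _normalize(my, ey)
    if my == 0:
        return _normalize(mx, ex)
    # Phase 2: align by shifting the operand with the smaller exponent right.
    if ex < ey:
        mx, ex, my, ey = my, ey, mx, ex
    my = _shift_right(my, ex - ey)
    # Phase 3: add the signed significands; fix a carry out of the top bit.
    m, e = mx + my, ex
    if abs(m) >= 2 ** P:
        m, e = _shift_right(m, 1), e + 1
    # Phase 4: normalize.
    return _normalize(m, e)

if __name__ == "__main__":
    print(fp_add((192, 0), (192, 0)))                 # carry out, then realignment
    print(fp_add((200, 0), (100, 0), subtract=True))  # needs a left shift to normalize
```

Because the sketch has no guard bits, a subtraction of nearly equal operands with different exponents can lose the low bit during alignment, which is the situation guard bits are meant to address.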
FLOATING POINT: MULTIPLICATION (Z ← X * Y)
FLOATING POINT: DIVISION (Z ← X / Y)
Special values
o Signed zero: In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0).
o Subnormal numbers: Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside of the underflow gap.
o Infinities: The infinities of the extended real number line can be represented in IEEE floating point data types, just like ordinary floating point values like 1, 1.5 etc. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).
o NaNs: IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating point numbers with other special values, without slowing down the computations with ordinary values. Such extensions do not seem to be common, though.
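Most of these special values can be observed directly on Python floats, which are IEEE 754 doubles on common platforms (an assumption of this sketch). Python itself raises ZeroDivisionError for 1.0/0.0 and ValueError for math.sqrt(-1) instead of returning an infinity or a NaN, so those two cases are left out.

```python
import math

pos_zero, neg_zero = 0.0, -0.0
inf, nan = math.inf, math.nan

# Signed zero: the two zeros compare equal but carry different sign bits.
zeros_equal = (pos_zero == neg_zero)
neg_zero_sign = math.copysign(1.0, neg_zero)

# Invalid operations on infinities produce a NaN.
inf_times_zero = inf * 0.0
inf_minus_inf = inf - inf

# A NaN compares unequal to everything, itself included.
nan_equals_itself = (nan == nan)

# The smallest positive subnormal double is 2**-1074, far below 2**-1022.
smallest_subnormal = math.ldexp(1.0, -1074)

if __name__ == "__main__":
    print(zeros_equal, neg_zero_sign)
    print(inf_times_zero, inf_minus_inf, nan_equals_itself)
    print(smallest_subnormal)
```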
IEEE Floating Point Arithmetic Standard 754: NaN (Not a Number)
o NaN: sign bit, nonzero significand, maximum exponent
o Invalid exception
    o occurs when exact result is not a well-defined real number
    o 0/0
    o sqrt(−1)
    o infinity − infinity, infinity/infinity, 0*infinity
    o NaN + 3
    o NaN > 3?
    o Return a NaN in all these cases
o Two kinds of NaNs
    o Quiet: propagates without raising an exception
        o good for indicating missing data
        o Ex: max(3, NaN) = 3
    o Signaling: generates an exception when touched
        o good for detecting uninitialized data
OPERATIONS THAT PRODUCE A QUIET NaN
IEEE Floating Point Arithmetic Standard 754: Normalized Numbers
o Normalized nonzero representable numbers: ±1.d…d × 2^exp
    o Macheps = machine epsilon = 2^(−#significand bits) = relative error in each operation
    o OV = overflow threshold = largest number
    o UN = underflow threshold = smallest number

Format            # bits   # significand bits   macheps              # exponent bits   exponent range
------------------------------------------------------------------------------------------------------------------
Single            32       23+1                 2^-24 (~10^-7)       8                 2^-126 to 2^127 (~10^±38)
Double            64       52+1                 2^-53 (~10^-16)      11                2^-1022 to 2^1023 (~10^±308)
Double Extended   >=80     >=64                 <=2^-64 (~10^-19)    >=15              2^-16382 to 2^16383 (~10^±4932)
(Double Extended is 80 bits on Intel machines)

o ±Zero: sign bit ±, significand and exponent all zero
    o Why bother with −0? Later
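The Double row of the table can be checked from Python's sys.float_info, assuming the interpreter's float is an IEEE 754 double. Note that sys.float_info.epsilon is 2^-52 (the gap between 1.0 and the next double), which is twice the macheps convention used in these slides; the names macheps, OV, and UN below follow the slides and are not library names.

```python
import sys

SIGNIFICAND_BITS = sys.float_info.mant_dig   # 52 stored bits + 1 implicit bit
macheps = 2.0 ** -SIGNIFICAND_BITS           # 2**-53 in the slides' convention
OV = sys.float_info.max                      # overflow threshold
UN = sys.float_info.min                      # underflow threshold (smallest normalized)

if __name__ == "__main__":
    print(SIGNIFICAND_BITS, macheps)
    print(OV, UN)
    print(1.0 + macheps == 1.0)              # adding macheps to 1.0 rounds back to 1.0
```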
IEEE Floating Point Arithmetic Standard 754: Denorms
o Denormalized numbers: ±0.d…d × 2^min_exp
    o sign bit, nonzero significand, minimum exponent
    o Fills in gap between UN and 0
o Underflow exception
    o occurs when exact nonzero result is less than underflow threshold UN
    o Ex: UN/3
    o return a denorm, or zero
o Why bother?
    o Necessary so that following code never divides by zero
    o if (a != b) then x = a/(a−b)
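The guarantee behind the a/(a−b) example can be seen with two doubles just above the underflow threshold. This is a small sketch assuming IEEE 754 doubles with gradual underflow; safe_quotient is a hypothetical helper name.

```python
import sys

UN = sys.float_info.min      # underflow threshold, 2**-1022

a = 1.5 * UN
b = UN
diff = a - b                 # exact result 0.5 * UN is a denorm, not zero

def safe_quotient(a, b):
    """Mirror the slide's code: divide only when a and b differ."""
    if a != b:
        return a / (a - b)
    return None

if __name__ == "__main__":
    print(diff, diff < UN)
    print(safe_quotient(a, b))
```

Without denorms, a − b would have to be flushed to zero here and the division would fail even though a != b.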
IEEE Floating Point Arithmetic Standard 754: ±Infinity
o ±Infinity: sign bit, zero significand, maximum exponent
o Overflow exception
    o occurs when exact finite result too large to represent accurately
    o Ex: 2*OV
    o return ±infinity
o Divide by zero exception
    o return ±infinity = 1/±0
    o sign of zero important! Example later
o Also return ±infinity for
    o 3+infinity, 2*infinity, infinity*infinity
    o Result is exact, not an exception!
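The overflow and exact-infinity cases can be reproduced with Python floats, assuming they are IEEE 754 doubles. The divide-by-zero case is not shown because Python raises ZeroDivisionError for float division by zero rather than returning ±infinity.

```python
import math
import sys

OV = sys.float_info.max

# Overflow: the exact result 2*OV is finite but too large, so +/-infinity is returned.
overflowed = 2.0 * OV
neg_overflowed = -2.0 * OV

# Arithmetic on an infinity operand gives an exact infinite result.
exact_results = [3.0 + math.inf, 2.0 * math.inf, math.inf * math.inf]

if __name__ == "__main__":
    print(overflowed, neg_overflowed)
    print(exact_results)
```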
Error Analysis
o Basic error formula
    o fl(a op b) = (a op b)*(1 + d) where
        o op is one of +, −, *, /
        o |d| <= machine epsilon = macheps
        o assuming no overflow, underflow, or divide by zero
o Example: adding 4 numbers
    fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
                    = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
                      + x3*(1+d2)*(1+d3) + x4*(1+d3)
                    = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
    where each |ei| <~ 3*macheps
    o get exact sum of slightly changed summands xi*(1+ei)
o Backward Error Analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
o Numerical stability is an algorithm design goal
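The bound from this example can be checked numerically by comparing a floating point sum against exact rational arithmetic. This is a sketch under stated assumptions: IEEE 754 doubles with round to nearest, macheps = 2^-53 as in the slides, no overflow or underflow, and a hypothetical helper name summation_error. Since each |ei| is at most (1 + macheps)^(n−1) − 1 for n summands, the absolute error is at most that factor times the sum of |xi|.

```python
from fractions import Fraction

MACHEPS = Fraction(1, 2 ** 53)   # slides' convention for IEEE double

def summation_error(xs):
    """Return (actual error, backward-analysis bound) for left-to-right summation."""
    fl_sum = 0.0
    for x in xs:
        fl_sum += x                              # the first addition, 0.0 + x1, is exact
    exact = sum(Fraction(x) for x in xs)         # exact sum of the stored inputs
    growth = (1 + MACHEPS) ** (len(xs) - 1) - 1  # bound on each |ei|, about 3*macheps for n = 4
    bound = growth * sum(abs(Fraction(x)) for x in xs)
    return abs(Fraction(fl_sum) - exact), bound

if __name__ == "__main__":
    err, bound = summation_error([0.1, 0.2, 0.3, 0.4])
    print(float(err), float(bound))
```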
Exception Handling
o What happens when the exact value is not a real number, or too small or too large to represent accurately?
o 5 exceptions:
    o Overflow: exact result > OV, too large to represent
    o Underflow: exact result nonzero and < UN, too small to represent
    o Divide-by-zero: nonzero/0
    o Invalid: 0/0, sqrt(−1), …
    o Inexact: you made a rounding error (very common!)
o Possible responses
    o Stop with error message (unfriendly, not default)
    o Keep computing (default, but how?)
Exception Handling User Interface
o Each of the 5 exceptions has the following features
    o A sticky flag, which is set as soon as an exception occurs
    o The sticky flag can be reset and read by the user
        o reset overflow_flag and invalid_flag
        o perform a computation
        o test overflow_flag and invalid_flag to see if any exception occurred
    o An exception flag, which indicates whether a trap should occur
        o Not trapping is the default
        o Instead, continue computing, returning a NaN, infinity or denorm
        o On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation
        o Trapping or precise interrupts like this are rarely implemented for performance reasons.
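The reset, compute, test pattern for sticky flags can be tried with Python's decimal module, whose contexts carry sticky flags and per-signal trap enables. This is an analogy, not the binary hardware interface: decimal arithmetic is done in software, its signal names differ slightly from the five listed above, and the helper run_with_flags with its small precision and exponent limits is a hypothetical choice for this sketch.

```python
import decimal

def run_with_flags(computation):
    """Run computation() with trapping disabled; return (result, set of raised signals)."""
    ctx = decimal.Context(prec=9, Emax=99, Emin=-99, traps=[])  # no traps: keep computing
    ctx.clear_flags()                                           # reset the sticky flags
    with decimal.localcontext(ctx) as active:
        result = computation()                                  # perform the computation
        raised = {sig for sig, was_set in active.flags.items() if was_set}
    return result, raised                                       # test the flags afterwards

if __name__ == "__main__":
    D = decimal.Decimal
    print(run_with_flags(lambda: D(1) / D(0)))
    print(run_with_flags(lambda: D(0) / D(0)))
    print(run_with_flags(lambda: D("1E99") * D("1E99")))
```

With trapping disabled, each call keeps computing and returns an infinity or a NaN, and the raised set records which exceptions occurred along the way.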
FPU Data Register Stack
o FPU register format (extended precision)
    bit 79: s (sign) | bits 78-64: exp | bits 63-0: frac
o FPU register stack
    o stack grows down; wraps around from R0 -> R7
    o FPU registers are typically referenced relative to top of stack
        o st(0) is top of stack (Top), followed by st(1), st(2), …
    o push: decrement Top, load

    absolute view    stack view
    R7               st(5)
    R6               st(4)
    R5               st(3)
    R4               st(2)
    R3               st(1)
    R2               st(0)   <- Top
    R1               st(7)
    R0               st(6)
FPU instructions
o Large number of floating point instructions and formats
    o ~50 basic instruction types
    o load, store, add, multiply
    o sin, cos, tan, arctan, and log!
o Sampling of instructions:

    Instruction   Effect                         Description
    fldz          push 0.0                       Load zero
    flds S        push S                         Load single precision real
    fmuls S       st(0) <- st(0)*S               Multiply
    faddp         st(1) <- st(0)+st(1); pop      Add and pop
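The effect column of the table can be modelled with a toy register stack. This is only a sketch: the class FpuStack and its method names are hypothetical, values are ordinary Python floats rather than 80-bit extended precision, S is passed as a value instead of a memory operand, and status bits such as stack overflow checks are ignored.

```python
class FpuStack:
    """Toy model of the eight-register x87 data stack."""

    def __init__(self):
        self.regs = [0.0] * 8   # absolute registers R0..R7
        self.top = 0            # index of the register currently acting as st(0)

    def st(self, i):
        """Read st(i), addressed relative to the top of the stack."""
        return self.regs[(self.top + i) % 8]

    def _push(self, value):
        self.top = (self.top - 1) % 8        # stack grows down, wrapping R0 -> R7
        self.regs[self.top] = value

    def _pop(self):
        self.top = (self.top + 1) % 8

    def fldz(self):                          # push 0.0
        self._push(0.0)

    def flds(self, s):                       # push S
        self._push(s)

    def fmuls(self, s):                      # st(0) <- st(0) * S
        self.regs[self.top] = self.st(0) * s

    def faddp(self):                         # st(1) <- st(0) + st(1); pop
        self.regs[(self.top + 1) % 8] = self.st(0) + self.st(1)
        self._pop()

if __name__ == "__main__":
    fpu = FpuStack()
    fpu.fldz()
    fpu.flds(3.0)
    fpu.fmuls(4.0)
    fpu.faddp()
    print(fpu.st(0))   # 0.0 + 3.0 * 4.0
```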
END
