
Floating Point Arithmetic

Table of Contents
o History of Floating Point
o Defining Floating Point Arithmetic
o Floating Point Representation
o Floating Point Format
o Floating Point Precisions
o Floating Point Operation
o Special values
o Error Analysis
o Exception Handling

IA32 Floating Point

History
o 8086: first processor to support IEEE floating point, through the separate 8087 FPU (floating point unit)
o 486: merged the FPU and the integer unit onto one chip

Summary
o Hardware to add, multiply, and divide
o Floating point data registers
o Various control & status registers

Floating Point Formats
o single precision (C float): 32 bits
o double precision (C double): 64 bits
o extended precision (C long double): 80 bits

Defining Floating Point Arithmetic
o Representable numbers
  o Scientific notation: +/- d.d...d x r^exp
  o sign bit +/-
  o radix r (usually 2 or 10, sometimes 16)
  o significand d.d...d (how many base-r digits d?)
  o exponent exp (range?)
  o others?
o Operations:
  o arithmetic: +, -, x, /, ...
  o how to round result to fit in format
  o comparison (<, =, >)
  o conversion between different formats
    o short to long FP numbers, FP to integer
  o exception handling
    o what to do for 0/0, 2*largest_number, etc.
  o binary/decimal conversion
    o for I/O, when radix not 10
o Language/library support for these operations

Floating Point Representation
o It describes a system for representing real numbers which supports a wide range of values.
o A number in which the decimal point can be in any position.
o Example:
  A memory location set aside for a floating-point number can store 0.735, 62.3, or 1200.

In comparison with:
o Radix point, or radix character, is the symbol used in numerical representations to separate the integer part of a number (to the left of the radix point) from its fractional part (to the right of the radix point). Radix point is a general term that applies to all number bases.
  Ex: In base 10 (decimal): 13.625 (decimal point)
      In base 2 (binary): 1101.101 (binary point)
o Fixed point: a number in which the position of the decimal point is fixed. A fixed-point memory location can only accommodate a specific number of


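Floating Point Representation
o Binary Cases

A number is represented as +/- S x B^(+/- E) and stored in three fields: sign bit, biased exponent, and significand (or mantissa),

where:
o S is the significand (also called the fraction or mantissa).
o E is the exponent.
o B is the base (2 in the binary case).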

IEEE 754: Floating Point in Modern Computers
o The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines.

IEEE 754: Floating Point Format

IEEE 754 format
o Defines single and double precision formats (32 and 64 bits)
o Standardizes formats across many different platforms
o Radix 2
o Single
  o Range 10^-38 to 10^+38
  o 8-bit exponent with 127 bias
  o 23-bit mantissa
o Double
  o Range 10^-308 to 10^+308
  o 11-bit exponent with 1023 bias
  o 52-bit mantissa
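The single precision layout above (1 sign bit, 8-bit exponent with bias 127, 23-bit mantissa) can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; the helper names `decode_single` and `value_of_normal` are hypothetical and exist only for this illustration, and the rebuild step handles normalized numbers only.

```python
import struct

def decode_single(x: float):
    """Split x, rounded to IEEE 754 single precision, into its three fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits, bias 127
    fraction = bits & 0x7FFFFF             # 23 bits, hidden leading 1 not stored
    return sign, biased_exponent, fraction

def value_of_normal(sign, biased_exponent, fraction):
    """Rebuild a normalized value: (-1)^sign * 1.fraction * 2^(exponent - 127)."""
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)

fields = decode_single(-6.25)   # -6.25 = -1.5625 * 2^2
print(fields)
print(value_of_normal(*fields))
```

Here the biased exponent 129 encodes the true exponent 2, and the stored fraction holds only the digits after the leading 1.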

IEEE 754 Format Parameters

Floating Point Precisions
IEEE 754:
o 16-bit: Half (binary16)
o 32-bit: Single (binary32), decimal32
o 64-bit: Double (binary64), decimal64
o 128-bit: Quadruple (binary128), decimal128
Other:
o Minifloat
o Extended precision
o Arbitrary precision

Floating Point Precisions
o Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
o Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
o The other basic formats are quadruple precision (128-bit) binary, as well as decimal

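PRECISION CONSIDERATIONS
o Guard bits: prior to a floating-point operation, the exponent and significand of each operand are loaded into ALU registers. The significand register contains additional bits, called guard bits, which are used to pad out the right end of the significand with 0s.
o Rounding: the rounding policy also affects the precision of the result. The result of any operation on the significands is generally stored in a longer register, so the extra bits must be disposed of when the result is put back into the floating-point format.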

THE USE OF GUARD BITS

THE STANDARD LISTS FOUR ALTERNATIVE APPROACHES:

o Round to nearest: The result is rounded to the nearest representable number.
o Round toward +infinity: The result is rounded up toward plus infinity.
o Round toward -infinity: The result is rounded down toward negative infinity.
o Round toward 0: The result is rounded toward zero.
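The four rounding approaches can be compared side by side. Python's binary floats do not expose the hardware rounding mode, so this sketch uses the standard `decimal` module as a stand-in: it rounds decimal digits rather than bits, but the four policies behave the same way. The mapping table `MODES` and the helper `round_all_modes` are names assumed for this illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

# Assumed mapping from the four approaches to decimal's rounding constants.
MODES = {
    "round to nearest": ROUND_HALF_EVEN,      # ties go to the even neighbour
    "round toward +infinity": ROUND_CEILING,
    "round toward -infinity": ROUND_FLOOR,
    "round toward 0": ROUND_DOWN,
}

def round_all_modes(text: str, places: str = "0.01"):
    """Round a decimal string to the grid given by `places` under each mode."""
    value = Decimal(text)
    return {name: str(value.quantize(Decimal(places), rounding=mode))
            for name, mode in MODES.items()}

for sample in ("2.345", "-2.345"):
    print(sample, round_all_modes(sample))
```

The negative sample is the interesting one: rounding toward -infinity moves it away from zero, while rounding toward +infinity and toward 0 agree.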

Internal representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows:

IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC

IEEE 754 goes beyond the simple definition of a format to lay down specific practices and procedures so that floating-point arithmetic produces uniform, predictable results independent of the hardware platform.

Floating Point Operation

EXPONENT OVERFLOW

o A positive exponent exceeds the maximum possible exponent value. In some systems, this may be designated as +infinity or -infinity.

EXPONENT UNDERFLOW

o A negative exponent is less than the minimum possible exponent value (e.g., -200 is less than -127). This means that the number is too small to be represented, and it may be reported as 0.

SIGNIFICAND UNDERFLOW

o In the process of aligning significands, digits may flow off the right end of the significand. As we shall discuss, some form of rounding is required.

SIGNIFICAND OVERFLOW

o The addition of two significands of the same sign may result in a carry out of the most significant bit. This can be fixed by realignment, as

FLOATING POINT: ADDITION AND SUBTRACTION (Z <- X +/- Y)

PHASE 1: ZERO CHECK
o Addition and subtraction are identical except for a sign change, so the process begins by changing the sign of the subtrahend if it is a subtract operation. Next, if either operand is 0, the other is reported as the result.

PHASE 2: SIGNIFICAND ALIGNMENT
o The next phase is to manipulate the numbers so that the two exponents are equal.

PHASE 3: ADDITION
o The two significands are added together, taking into account their signs. Because the signs may differ, the result may be 0. There is also the possibility of significand overflow by 1 digit. If so, the result is shifted right and the exponent is incremented. An exponent overflow could occur as a result; this would be reported and the operation halted.

PHASE 4: NORMALIZATION
o The final phase normalizes the result. Normalization consists of shifting significand digits left until the most significant digit (a bit, or 4 bits for a base-16 exponent) is nonzero.
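The four phases can be traced on a toy format. The sketch below is an illustration under stated assumptions, not a model of real hardware: a number is a pair (signed significand, exponent) with value significand * 2^exponent, the significand width `P = 8` is arbitrary, the exponent is unbounded (so exponent overflow and underflow are not detected), and bits shifted off the right end are simply dropped with no guard bits or rounding. The names `fp_add`, `shift_right`, and `value` are hypothetical.

```python
P = 8  # assumed significand width in bits for this toy format

def shift_right(sig, n):
    """Shift the magnitude right by n bits; bits that fall off are lost."""
    magnitude = abs(sig) >> n
    return magnitude if sig >= 0 else -magnitude

def fp_add(x, y, subtract=False):
    """Add or subtract toy floats given as (signed_significand, exponent).

    Nonzero inputs are assumed normalized: 2**(P-1) <= |significand| < 2**P.
    """
    sx, ex = x
    sy, ey = y
    # Phase 1: turn subtraction into addition, then check for zero operands.
    if subtract:
        sy = -sy
    if sx == 0:
        return (sy, ey)
    if sy == 0:
        return (sx, ex)
    # Phase 2: align by shifting the smaller-exponent significand right.
    if ex < ey:
        sx, ex = shift_right(sx, ey - ex), ey
    elif ey < ex:
        sy, ey = shift_right(sy, ex - ey), ex
    # Phase 3: add the signed significands; fix a one-digit significand overflow.
    total, exp = sx + sy, ex
    if total == 0:
        return (0, 0)
    if abs(total) >= 2**P:
        total, exp = shift_right(total, 1), exp + 1
    # Phase 4: normalize by shifting left until the top bit is nonzero.
    while abs(total) < 2**(P - 1):
        total, exp = total * 2, exp - 1
    return (total, exp)

def value(pair):
    """Real value of a toy float."""
    return pair[0] * 2.0 ** pair[1]

x = (0b11000000, -7)   # 1.5
y = (0b10100000, -5)   # 5.0
print(value(fp_add(x, y)), value(fp_add(y, x, subtract=True)))
```

In the subtraction, the aligned difference has a leading zero bit, so phase 4 shifts it left once and decrements the exponent.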

FLOATING POINT: ADDITION AND SUBTRACTION (Z <- X +/- Y)

FLOATING POINT: MULTIPLICATION (Z <- X * Y)

FLOATING POINT: DIVISION (Z <- X / Y)

Special values

o Signed zero
  In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (-0).

o Subnormal numbers
  Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside of the underflow gap.

o Infinities
  The infinities of the extended real number line can be represented in IEEE floating-point data types, just like ordinary floating-point values like 1, 1.5 etc. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).

o NaNs
  IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞*0, or sqrt(-1). The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error, but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating-point numbers with other special values, without slowing down the computations with ordinary values. Such extensions do not seem to be common, though.

IEEE Floating Point Arithmetic Standard 754: NAN (Not A Number)
o NAN: Sign bit, nonzero significand, maximum exponent
o Invalid Exception
  o occurs when exact result not a well-defined real number
    o 0/0
    o sqrt(-1)
    o infinity-infinity, infinity/infinity, 0*infinity
    o NAN + 3
    o NAN > 3?
  o Return a NAN in all these cases
o Two kinds of NANs
  o Quiet: propagates without raising an exception
    o good for indicating missing data
    o Ex: max(3,NAN) = 3
  o Signaling: generate an exception when touched
    o good for detecting uninitialized data
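Quiet NaN propagation can be observed from Python, whose `float` is an IEEE double on common platforms. This is a small sketch rather than a complete survey: Python itself raises `ZeroDivisionError` for `0.0/0.0` and `ValueError` for `math.sqrt(-1)` instead of returning a NaN, so only the invalid operations on infinities are shown. The helper `nan_ignoring_max` is a hypothetical name illustrating the "missing data" reading of max(3,NAN) = 3.

```python
import math

inf, nan = math.inf, math.nan

# Invalid operations that return a quiet NaN without raising in Python.
invalid_results = [inf - inf, inf / inf, 0 * inf, nan + 3]
print([math.isnan(r) for r in invalid_results])

# A NaN is unordered: every comparison with it is False, even equality with itself.
print(nan > 3, nan < 3, nan == nan)

def nan_ignoring_max(a, b):
    """max() that treats a quiet NaN as missing data."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)

print(nan_ignoring_max(3, nan))
```

Because `nan == nan` is False, `math.isnan` is the reliable way to test for a NaN.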

OPERATIONS THAT PRODUCE A QUIET NaN

IEEE Floating Point Arithmetic Standard 754: Normalized Numbers
o Normalized Nonzero Representable Numbers: +/- 1.d...d x 2^exp
  o Macheps = Machine epsilon = 2^(-#significand bits) = relative error in each operation
  o OV = overflow threshold = largest number
  o UN = underflow threshold = smallest number

  Format            # bits   # significand bits   macheps             # exponent bits   exponent range
  ----------------  -------  -------------------  ------------------  ----------------  -----------------------------------
  Single            32       23+1                 2^-24 (~10^-7)      8                 2^-126 - 2^127 (~10^(+/-38))
  Double            64       52+1                 2^-53 (~10^-16)     11                2^-1022 - 2^1023 (~10^(+/-308))
  Double Extended   >=80     >=64                 <=2^-64 (~10^-19)   >=15              2^-16382 - 2^16383 (~10^(+/-4932))
  (Double Extended is 80 bits on Intel machines)

o +/- Zero: +/-, significand and exponent all zero
  o Why bother with -0 later
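The Double row of the table can be checked against the running interpreter. This sketch assumes a Python build whose `float` is an IEEE 754 double (true for CPython on common platforms). Note one naming difference: `sys.float_info.epsilon` is the gap between 1.0 and the next double, 2^-52, which is twice the macheps of the table (2^-53).

```python
import sys

info = sys.float_info
macheps = 2.0 ** -info.mant_dig     # 2^-53 when mant_dig is 52+1 = 53
OV = info.max                       # overflow threshold
UN = info.min                       # underflow threshold (smallest normalized)

print(info.mant_dig, info.epsilon == 2 * macheps)
# Adding macheps to 1.0 is an exact tie and rounds back to 1.0 (ties to even);
# adding twice macheps reaches the next representable number.
print(1.0 + macheps == 1.0, 1.0 + 2 * macheps > 1.0)
print(OV == (2 - 2.0 ** -52) * 2.0 ** 1023, UN == 2.0 ** -1022)
```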

IEEE Floating Point Arithmetic Standard 754: Denorms
o Denormalized Numbers: +/- 0.d...d x 2^min_exp
  o sign bit, nonzero significand, minimum exponent
  o Fills in gap between UN and 0
o Underflow Exception
  o occurs when exact nonzero result is less than underflow threshold UN
  o Ex: UN/3
  o return a denorm, or zero
o Why bother?
  o Necessary so that following code never divides by zero
    o if (a != b) then x = a/(a-b)
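The "why bother" argument can be seen with two doubles just above the underflow threshold. In this sketch (assuming IEEE double `float`, as in CPython on common platforms), `a` and `b` differ, and their difference is smaller than UN. With denorms the difference is a nonzero subnormal, so the division is safe; without them it would have to be flushed to zero.

```python
import sys

UN = sys.float_info.min        # 2^-1022, smallest normalized double
a, b = 1.5 * UN, UN

diff = a - b                   # exactly 0.5 * UN, which lies below UN
print(a != b, diff != 0.0, 0.0 < diff < UN)

if a != b:
    x = a / diff               # safe: the denominator is a nonzero denorm
    print(x)

# Denorms only fill the gap down to 2^-1074; halving the smallest one gives 0.
tiny = 2.0 ** -1074
print(tiny > 0.0, tiny / 2 == 0.0)
```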

IEEE Floating Point Arithmetic Standard 754: +/- Infinity
o +/- Infinity: Sign bit, zero significand, maximum exponent
o Overflow Exception
  o occurs when exact finite result too large to represent accurately
  o Ex: 2*OV
  o return +/- infinity
o Divide by zero Exception
  o return +/- infinity = 1/+-0
  o sign of zero important! Example later
o Also return +/- infinity for
  o 3+infinity, 2*infinity, infinity*infinity
  o Result is exact, not an exception!
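Most of these rules can be tried from Python, again assuming IEEE double `float`. One caveat: Python raises `ZeroDivisionError` for `1/0.0` rather than returning an infinity, so this sketch shows the link between infinities and signed zeros in the other direction, through 1/(+/- infinity).

```python
import math
import sys

OV = sys.float_info.max

overflowed = 2 * OV                       # exact result too large: +infinity
print(overflowed)

# Exact arithmetic on infinities, no exception involved.
exact = [3 + math.inf, 2 * math.inf, math.inf * math.inf, -2 * math.inf]
print(exact)

# The sign of the infinity survives in the sign of the resulting zero.
pos_zero, neg_zero = 1 / math.inf, 1 / -math.inf
print(pos_zero == neg_zero, math.copysign(1.0, pos_zero), math.copysign(1.0, neg_zero))
```

The two zeros compare equal, so `math.copysign` is used to reveal which one was produced.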

Error Analysis
o Basic error formula
  o fl(a op b) = (a op b)*(1 + d) where
    o op is one of +, -, *, /
    o |d| <= epsilon = machine epsilon = macheps
    o assuming no overflow, underflow, or divide by zero

o Example: adding 4 numbers

  fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
                  = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
                    + x3*(1+d2)*(1+d3) + x4*(1+d3)
                  = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
  where each |ei| <~ 3*macheps

  o We get the exact sum of slightly changed summands xi*(1+ei)
  o Backward Error Analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
  o Numerical stability is an algorithm design goal
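The bound on the four-term sum can be checked numerically. The sketch below assumes IEEE double arithmetic with round to nearest, so macheps = 2^-53, and uses exact rational arithmetic from the standard `fractions` module as the reference. It relies on the derivation's consequence that each |ei| is at most (1 + macheps)^3 - 1, roughly 3*macheps, hence |computed - exact| <= ((1 + macheps)^3 - 1) * sum |xi|. The function names are hypothetical.

```python
from fractions import Fraction

def sequential_sum(values):
    """Left-to-right floating-point sum, as in fl(((x1+x2)+x3)+x4)."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total

def backward_error_bound(values, macheps=Fraction(1, 2**53)):
    """Exact bound on |fl(sum) - sum| from the (1+d) model, n-1 additions."""
    n_ops = len(values) - 1
    growth = (1 + macheps) ** n_ops - 1      # bounds every |e_i|
    return growth * sum(abs(Fraction(v)) for v in values)

xs = [0.1, 0.2, 0.3, 0.4]
computed = sequential_sum(xs)
exact = sum(Fraction(v) for v in xs)         # exact sum of the stored doubles
error = abs(Fraction(computed) - exact)
bound = backward_error_bound(xs)

print(float(error), float(bound), error <= bound)
```

`Fraction(v)` converts each double exactly, so the comparison measures only the rounding committed by the three additions.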

Exception Handling
o What happens when the "exact value" is not a real number, or is too small or too large to represent accurately?
o 5 Exceptions:
  o Overflow: exact result > OV, too large to represent
  o Underflow: exact result nonzero and < UN, too small to represent
  o Divide-by-zero: nonzero/0
  o Invalid: 0/0, sqrt(-1), ...
  o Inexact: you made a rounding error (very common!)
o Possible responses
  o Stop with error message (unfriendly, not default)
  o Keep computing (default, but how?)

Exception Handling User Interface
o Each of the 5 exceptions has the following features
  o A sticky flag, which is set as soon as an exception occurs
  o The sticky flag can be reset and read by the user
    o reset overflow_flag and invalid_flag
    o perform a computation
    o test overflow_flag and invalid_flag to see if any exception occurred
  o An exception flag, which indicates whether a trap should occur
    o Not trapping is the default
    o Instead, continue computing, returning a NAN, infinity or denorm
    o On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation
    o Trapping or "precise interrupts" like this are rarely implemented for performance reasons.
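The reset / compute / test pattern and the trap switch can be tried with Python's standard `decimal` module, whose contexts carry sticky flags and per-signal trap enables in the same spirit. This is an analogy, not the hardware interface: the arithmetic is decimal, and the context parameters below (`prec=9`, `Emax=99`, `Emin=-99`) are arbitrary values chosen for the illustration.

```python
from decimal import Context, Decimal, Overflow, InvalidOperation, DivisionByZero

# All traps disabled: exceptional operations return a value and set a flag.
ctx = Context(prec=9, Emax=99, Emin=-99, traps=[])

ctx.clear_flags()                                        # reset the flags
big = ctx.multiply(Decimal("1E+60"), Decimal("1E+60"))   # overflow -> Infinity
bad = ctx.divide(Decimal(0), Decimal(0))                 # invalid  -> NaN
fine = ctx.add(Decimal(1), Decimal(2))                   # flags stay set (sticky)
print(big, bad, fine)

# Test the flags after the whole computation.
print(ctx.flags[Overflow], ctx.flags[InvalidOperation], ctx.flags[DivisionByZero])

# Enabling a trap turns the same event into a Python exception.
ctx.traps[Overflow] = True
try:
    ctx.multiply(Decimal("1E+60"), Decimal("1E+60"))
    trapped = False
except Overflow:
    trapped = True
print(trapped)
```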

FPU Data Register Stack
o FPU register format (extended precision)

  bit:    79   78 ... 64   63 ... 0
  field:  s    exp         frac

o FPU register stack
  o stack grows down
    o wraps around from R0 -> R7
  o FPU registers are typically referenced relative to top of stack
    o st(0) is top of stack (Top)
    o followed by st(1), st(2), ...
  o push: decrement Top, load

  absolute view   stack view
  R7              st(5)
  R6              st(4)
  R5              st(3)
  R4              st(2)
  R3              st(1)
  R2              st(0)   <- Top
  R1              st(7)
  R0              st(6)

FPU instructions
o Large number of floating point instructions and formats
  o ~50 basic instruction types
  o load, store, add, multiply
  o sin, cos, tan, arctan, and log!

o Sampling of instructions:

  Instruction   Effect                       Description
  fldz          push 0.0                     Load zero
  flds S        push S                       Load single precision real
  fmuls S       st(0) <- st(0)*S             Multiply
  faddp         st(1) <- st(0)+st(1); pop    Add and pop
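The effect column can be traced with a toy model of the register stack. This is a sketch under assumptions, not an emulator: the class name `FpuStack` is hypothetical, registers hold Python floats instead of 80-bit extended values, there is no tag word or exception handling, and a push is modeled as moving Top down one register (wrapping from R0 to R7) before loading, following the "stack grows down" convention.

```python
class FpuStack:
    """Toy model of the eight-register FPU data stack."""

    def __init__(self):
        self.regs = [0.0] * 8    # absolute view: R0..R7
        self.top = 0             # index of the register that is st(0)

    def st(self, i):
        """Stack view: st(i) is register (Top + i) mod 8."""
        return self.regs[(self.top + i) % 8]

    def _push(self, value):
        self.top = (self.top - 1) % 8    # grows down, wraps R0 -> R7
        self.regs[self.top] = value

    def _pop(self):
        self.top = (self.top + 1) % 8

    def fldz(self):              # push 0.0
        self._push(0.0)

    def flds(self, s):           # push S
        self._push(s)

    def fmuls(self, s):          # st(0) <- st(0)*S
        self.regs[self.top] *= s

    def faddp(self):             # st(1) <- st(0)+st(1); pop
        self.regs[(self.top + 1) % 8] = self.st(0) + self.st(1)
        self._pop()

# Evaluate 2*3 + 4*5 with the sampled instructions.
fpu = FpuStack()
fpu.flds(2.0)
fpu.fmuls(3.0)
fpu.flds(4.0)
fpu.fmuls(5.0)
fpu.faddp()
print(fpu.st(0), fpu.top)
```

After `faddp` the sum sits in the register that used to be st(1), which the pop turns into the new st(0).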

END
