
Linear Least Squares Fitting with Microsoft Excel

by: Antony Kaplan


04/13/10


Abstract
Microsoft Excel is a popular spreadsheet program used widely in industry, academia, and education. Whether in a high school physics classroom or in the accounting department of a large Wall Street firm, people rely on Microsoft Excel to give them accurate results. One of the most used functions of Excel is least squares fitting: finding a best-fitting curve for a given set of points by minimizing the squared error. This paper explores four different ways in which a user can calculate a least squares linear fit with Excel and analyzes how the four methods perform on mostly ill-conditioned data. We compare Excel's results to those of another popular program, Matlab.

The "Trendline" Function


Suppose we have 6 datasets which are of the form:

X              Y
1 + 41/2^p     1
1 + 42/2^p     2
1 + 43/2^p     3
1 + 44/2^p     4
1 + 45/2^p     5
1 + 46/2^p     6
1 + 47/2^p     7
1 + 48/2^p     8

where p ∈ {25, 26, 27, 28, 51, 52}.
The task is seemingly simple: find a least squares linear fit for each of the 6 datasets. Let us try Excel and see how it fares with the task. Inputting the data into a spreadsheet, plotting it on X-Y scatter plots, and finally using the Excel function "Add Trendline", we get the results shown in Figures 1-6 for the least squares linear fits of the datasets.
FIGURE 1: [X-Y scatter plot of the p = 25 dataset with its linear trendline; Excel displays the fit equation y = 33 554 432 x − 33 554 472 and R² = 1.]

FIGURE 2: [X-Y scatter plot of the p = 26 dataset with its linear trendline; Excel displays a fit equation together with an R² value greater than 1.]

FIGURE 3: [X-Y scatter plot of the p = 27 dataset with its linear trendline; Excel again displays a fit equation together with an R² value greater than 1.]

FIGURE 4: [X-Y scatter plot of the p = 28 dataset; the points are plotted at distinguishable x values, but Excel displays no trendline equation and no R² value.]

FIGURE 5: [X-Y scatter plot of the p = 51 dataset; Excel again displays no trendline equation and no R² value.]

FIGURE 6: [X-Y scatter plot of the p = 52 dataset; Excel displays the fit equation y = 8.000000000000000 and R² = 0.000000000000000, but draws no trendline on the plot.]

Something strange is happening here. The only dataset for which Excel seems to perform well is p = 25; it gives a linear fit with an R² value of 1.0. R² is known in statistics as the coefficient of determination; its value, which ranges from 0 to 1, describes the "goodness of fit" of the model. More specifically, it is the proportion of the variability in the data explained by the model over the total variability of the data. So the R² that Excel shows in Figure 1 denotes that it found a perfect linear fit. Analytically, we can confirm that this should be the case: the points are evenly spaced on both the X-axis and the Y-axis, so they should all fall on one line. In fact, this is true of all 6 datasets.

For the datasets p = 26 and p = 27, Excel obtains fits which it claims have R² values greater than 1; mathematically, this is meaningless. For datasets with 28 ≤ p ≤ 51, Excel (without any warning message) refuses to give a linear fit or to display an equation or an R² value (Figures 4 and 5 show the endpoints of this interval). We can attempt to explain this behavior by saying that Excel cannot distinguish between the x values for 28 ≤ p ≤ 51; that is, if the numbers 1 + 41/2^p and 1 + 42/2^p are equal in Excel, then it refuses to fit a linear function to the data, because a vertical line is not a mathematical function. This explanation is contradicted by the fact that for p = 28, Excel can clearly distinguish between the x values (refer to the plot in Figure 4). Furthermore, for p = 52, Excel no longer has any problems displaying the equation and R² of the linear fit (although it draws no actual line on the plot), and for p > 52, Excel's behavior becomes unpredictable: for some values of p it displays the equation and R² value, while for others it does not. It seems that this behavior should be added to the long list of mysterious behaviors that plague Excel.

On another interesting note, with the Trendline function a user can choose to display up to 99 decimal places in the values of the slope, intercept, and R²! However, Excel can actually only display up to 15 significant decimal digits [2]. Although this bug may seem harmless, significant digits are of great importance in scientific computation (e.g. uncertainty analysis).


"By Hand"
Unsatisfied with the results we got from the "Trendline" function in Excel, we can take another approach to calculating
the linear least squares fit for our data: input our own formulas into the Excel spreadsheet to calculate the fit. From any introductory statistics textbook5 , we can find that the formulas for the slope and intercept of linear least squares fit are:

b2 = (n Σxy − (Σx)(Σy)) / (n Σx² − (Σx)²)

b1 = ((Σy)(Σx²) − (Σx)(Σxy)) / (n Σx² − (Σx)²)

where b2 and b1 are the slope and y-intercept respectively, n is the number of data points, and x and y are the data points.
We implement these formulas in steps as follows:
(1) Calculate the intermediate products (e.g. xy, x²)
(2) Calculate the intermediate sums (Σx, Σx², Σy, Σxy)
(3) Calculate the numerators n Σxy − (Σx)(Σy) and (Σy)(Σx²) − (Σx)(Σxy)
(4) Calculate the denominator n Σx² − (Σx)²
(5) Calculate b1 and b2 using (3) and (4)
Having calculated b2 and b1, we can calculate R² using the following formula:

R² = 1 − SSerr/SStot

where

SStot = Σ_i (y_i − ȳ)²
SSerr = Σ_i (y_i − f_i)²

Here ȳ is the mean of the observed data (all data points y), the f_i are the values predicted by the linear fit, i.e. f_i = b2·x_i + b1, and the index i runs over all data points. The formulas for R² are similarly implemented in steps.
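For clarity, here is the same computation written out in Matlab rather than in spreadsheet cells; a direct transcription of steps (1)-(5) above (variable names are our own):

    % Naive "textbook" least squares fit, mirroring the spreadsheet steps.
    n   = numel(x);
    Sx  = sum(x);       Sy  = sum(y);       % step (2): intermediate sums
    Sxy = sum(x .* y);  Sx2 = sum(x .^ 2);
    den = n*Sx2 - Sx^2;                     % step (4): prone to cancellation
    b2  = (n*Sxy - Sx*Sy) / den;            % step (5): slope
    b1  = (Sy*Sx2 - Sx*Sxy) / den;          % step (5): y-intercept
    f   = b2 .* x + b1;                     % fitted values
    R2  = 1 - sum((y - f).^2) / sum((y - mean(y)).^2);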
Here are the results for the 6 datasets:
Dataset (p =)   b2           b1            R²
25              33 554 432   −33 554 472   1
26              #DIV/0!      #DIV/0!       #DIV/0!
27              #DIV/0!      #DIV/0!       #DIV/0!
28              #DIV/0!      #DIV/0!       #DIV/0!
51              #DIV/0!      #DIV/0!       #DIV/0!
52              #DIV/0!      #DIV/0!       #DIV/0!

Our fit performs even worse than Excel's native "Trendline" function! The results for the p = 25 dataset are the same as those produced by Trendline (refer to Figure 1); however, for every other dataset we get a "division by zero" error. The division by 0 occurs when calculating the slope and intercept: the denominator n Σx² − (Σx)² is computed to be 0 for the other 5 datasets. Where is

our error? Can we fix this?
It turns out that the error is not our own but Excel's! Prof. Velvel Kahan found that for some computations in Excel, an extra set of parentheses around the entire expression (which analytically does not change the value of the expression) changes the result of the computation [2]. Can a few parentheses really change the results of our fit? After placing an extra set of parentheses around the formula in each of the steps taken above to calculate b2, b1, and R², the results were the following:
Dataset (p =)   b2                    b1                     R²
25              33 554 432.0000000    −33 554 472.0000000    1.000000000000000
26              70 464 307.2000000    −70 464 349.6000000    0.991666667198851
27              176 160 768.000000    −176 160 824.000000    0.0673363095238095
28              #DIV/0!               #DIV/0!                #DIV/0!
51              #DIV/0!               #DIV/0!                #DIV/0!
52              0.000000000000000     8.00000000000000       −2.33333333333333

Every slope and intercept in the table above matches those produced by the Trendline function! Datasets in the interval 28 ≤ p ≤ 51 still produce the #DIV/0! error, but those are exactly the datasets for which Trendline could not compute the linear fit. Notice, however, that the R² values differ for the datasets p = 26, 27, 52. This is because Excel uses another, less general formula for R², which is equivalent in the case of simple linear regression:

R² = SSreg / SStot

where SSreg = Σ_i (f_i − f̄)² and f̄ is the mean of the values predicted by the model. Implementing this formula in the spreadsheet, we reproduce exactly those values of R² that the Trendline function produced. That two equivalent forms of R² produce drastically different results hints at a large numerical instability in our algorithms and ill-conditioning in our data.
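To see the discrepancy concretely, both forms of R² can be computed side by side; a Matlab sketch (assuming x, y, b1, b2 from the fit above):

    % Two algebraically equivalent forms of R^2 for simple linear regression.
    f      = b2 .* x + b1;             % model predictions
    SStot  = sum((y - mean(y)).^2);    % total sum of squares
    SSerr  = sum((y - f).^2);          % residual sum of squares
    SSreg  = sum((f - mean(f)).^2);    % regression sum of squares
    R2_err = 1 - SSerr/SStot;          % the form we implemented first
    R2_reg = SSreg/SStot;              % the form Trendline appears to use
    % Equal in exact arithmetic; on this ill-conditioned data the two
    % floating-point results can differ drastically.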
Had it not been for Kahan's paper, such a bug would never have been considered, and hence never discovered, in our implementation of the linear fit. The problem that the linear fit fails in the interval 28 ≤ p ≤ 51 still persists in our implementation; this could be a limitation of our method in conjunction with finite-precision arithmetic.


Reformulating the Least Squares Problem


Let us reformulate the problem of a linear least squares fit in matrix notation. Consider the 8 x 2 matrix A:

A = [ 1   1 + 41/2^p
      1   1 + 42/2^p
      1   1 + 43/2^p
      1   1 + 44/2^p
      1   1 + 45/2^p
      1   1 + 46/2^p
      1   1 + 47/2^p
      1   1 + 48/2^p ]

the 2 x 1 vector x:
x = [ b1
      b2 ]

and the 8 x 1 vector b:


b = [ 1
      2
      3
      4
      5
      6
      7
      8 ]
Our task is to find a vector x that minimizes the squared Euclidean norm of the residual r = b − Ax, that is:

min over x of ||b − Ax||₂²

We can extend this to the general case of A of size m x n (m ≥ n), x of size n x 1, and b of size m x 1.
The most straightforward approach to solving the least squares problem is the method of Normal Equations. The derivation is as follows. We define the residual as a vector function of x,

r(x) = b − Ax

and we are trying to minimize the squared Euclidean norm of the residual:

E(x) = ||b − Ax||₂² = Σ_{i=1}^{m} r_i(x)²


To minimize E(x) we need to find an x such that the gradient of E(x) is zero, that is, the partial derivative with respect to each x_j is zero:

∂E(x)/∂x_j = 0 = 2 Σ_{i=1}^{m} r_i ∂r_i/∂x_j

By definition, r_i = b_i − (Ax)_i = b_i − Σ_{k=1}^{n} A_ik x_k, and so:

∂r_i/∂x_j = ∂/∂x_j (b_i − Σ_{k=1}^{n} A_ik x_k) = −A_ij

and therefore:

∂E(x)/∂x_j = −2 Σ_{i=1}^{m} A_ij (b_i − Σ_{k=1}^{n} A_ik x_k) = −2 Σ_{i=1}^{m} A_ij b_i + 2 Σ_{i=1}^{m} Σ_{k=1}^{n} A_ij A_ik x_k = 0

So we just have to solve, for each j:

Σ_{i=1}^{m} Σ_{k=1}^{n} A_ij A_ik x_k = Σ_{i=1}^{m} A_ij b_i    (1)

which in matrix notation is:

(AᵀA) x = Aᵀb    (2)

These are the normal equations. The equations for the slope (b2) and y-intercept (b1) of the linear fit that we used in the previous section are derived directly from the normal equations, by expanding equation (1) above.
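As a sketch (reusing the dataset construction from earlier; variable names are our own), forming and solving the normal equations in Matlab looks like this:

    % Method of Normal Equations (for illustration; numerically unstable).
    p = 25;
    A = [ones(8,1), 1 + (41:48)'./2^p];   % 8 x 2 design (Vandermonde) matrix
    b = (1:8)';
    xhat = (A'*A) \ (A'*b);               % solve (A'A) x = A'b
    b1 = xhat(1);  b2 = xhat(2);          % y-intercept and slope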


Conditioning and the Pseudo-Inverse


It is fairly straightforward to show that the method of Normal Equations is numerically unstable. Suppose that the matrix A (also called the Vandermonde matrix) is ill-conditioned, that is:

cond(A) >> 1

To solve the linear least squares problem using the method of Normal Equations we have to solve the system:

(AᵀA) x = Aᵀb

and we can show that:

cond(AᵀA) ≈ cond(A)²

and so:

cond(AᵀA) >> cond(A) >> 1

Clearly, there is large growth in the algorithm. The method of Normal Equations is known to be numerically unstable and to perform poorly on ill-conditioned problems. Perhaps this is the reason why the Excel implementation of the linear fit fails in the interval 28 ≤ p ≤ 51.
In the discussion above we referred to the condition number of the matrix A, cond(A). The condition number is defined as:

cond(A) = ||A|| · ||A⁻¹||

But A is an m x n matrix with m ≥ n; that is, A is in general rectangular. If A is non-square, it is non-invertible. How, then, do we calculate the condition number of A?
We introduce the concept of a generalized inverse, specifically the Moore-Penrose pseudoinverse A⁺. For a general m x n matrix A, A⁺ is the n x m matrix with the following properties [1]:

(1) A A⁺ A = A
(2) A⁺ A A⁺ = A⁺
(3) (A A⁺)* = A A⁺
(4) (A⁺ A)* = A⁺ A

where * denotes the conjugate transpose. For general m x n matrices A we define the condition number to be:

cond(A) = ||A|| · ||A⁺||

Using this definition we can find how ill-conditioned each of our 6 datasets is:
Dataset Matrix A_p    cond(A_p)
A_25                  2.928874824058564e+07
A_26                  5.857745763834818e+07
A_27                  1.171548756109411e+08
A_28                  2.343097090872832e+08
A_51                  1.925820207179830e+15
A_52                  3.404401319607318e+15

It is interesting to note that the condition numbers of the datasets follow a general trend: cond(A_{p+1}) ≈ 2 · cond(A_p), i.e. the condition number roughly doubles with each increment of p. Our Excel-implemented solution stops giving meaningful results at around p = 26 and breaks down completely at p > 27. We must remember, however, that since we are essentially using the method of Normal Equations to solve the least squares problem, our condition number is roughly squared, and so we can say that our solution breaks down at about:

cond(A_26ᵀ A_26) = 3.431318543372478e+15 ≈ cond(A_26)²
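These condition numbers, and later the rank, can be checked directly in Matlab; a small sketch:

    % Condition number and rank of the design matrix for each dataset.
    for p = [25 26 27 28 51 52]
        A = [ones(8,1), 1 + (41:48)'./2^p];
        fprintf('p = %2d: cond(A) = %.15e, rank(A) = %d\n', p, cond(A), rank(A));
    end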

QR Decomposition
Since our datasets are substantially ill-conditioned, it would be wise to use a more numerically stable algorithm to solve the linear least squares problem. One such algorithm is based on the QR decomposition (or factorization). With the QR decomposition, we can factor a general m x n matrix A into the product of an m x m orthogonal matrix Q and an m x n matrix R partitioned into two parts: an n x n upper triangular block R_n and an (m−n) x n block of zeros. We can express this mathematically as:

A = Q [ R_n
        0   ]

Using this factorization we can derive a solution to the linear least squares problem. The residual, as previously defined, is:

r = b − Ax

We can multiply this expression by Qᵀ to get:

Qᵀr = Qᵀb − QᵀA x = Qᵀb − (QᵀQ) R x = Qᵀb − R x = [ (Qᵀb)_n − R_n x ; (Qᵀb)_{m−n} ] = [ c1 ; c2 ]

and, since Q is orthogonal,

||r||₂² = rᵀ r = rᵀ Q Qᵀ r = [c1 ; c2]ᵀ [c1 ; c2] = ||c1||₂² + ||c2||₂²


Since the value of x does not affect the value of c2, to minimize ||r||₂² we must choose an x that minimizes c1, that is, makes it equal to zero. Therefore the equation governing the solution of the linear least squares problem becomes:

(Qᵀb)_n − R_n x = 0

or:

R_n x = (Qᵀb)_n

To solve the linear least squares problem, we must solve the equation above.
The QR factorization is accomplished via orthogonal transformations which, by definition, preserve Euclidean norms. We therefore expect the algorithm to be more numerically stable than the method of Normal Equations, because there is no growth in the QR algorithm, that is:

cond(A) ≈ cond(R_n)
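A sketch of this QR-based solve in Matlab (Matlab's backslash operator performs an equivalent factorization internally for rectangular systems):

    % Least squares via QR: solve R_n * x = (Q' * b)_n.
    [Q, R] = qr(A);              % A is m x n; Q is m x m orthogonal
    n = size(A, 2);
    c = Q' * b;                  % transformed right-hand side
    xhat = R(1:n, :) \ c(1:n);   % back substitution on the triangular block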

LINEST
Starting with the 2003 version, the developers of Excel implemented a least squares algorithm using QR decomposition in a function called "LINEST" (before Excel 2003, LINEST used the method of Normal Equations). Instead of the R² value used so far to evaluate the "goodness of fit", we now report the squared Euclidean norm of the residual r = b − Ax, that is, the very quantity the least squares solution is trying to minimize. Using LINEST we get the following results for our 6 datasets:
Dataset (p =)   b2                  b1                 ||r||₂²
25              0.000000000000000   4.50000000000000   42.000000000000000
26              0.000000000000000   4.50000000000000   42.000000000000000
27              0.000000000000000   4.50000000000000   42.000000000000000
28              0.000000000000000   4.50000000000000   42.000000000000000
51              0.000000000000000   4.50000000000000   42.000000000000000
52              0.000000000000000   4.50000000000000   42.000000000000000
An algorithm that is supposed to be more numerically stable and to perform better on ill-conditioned problems gives the same bad linear fit for all 6 of our datasets! What is wrong here? Is it the QR algorithm that is causing the problem, or Excel's implementation?
To compare results, we can use Matlab's implementation of QR decomposition to find the linear least squares fits for our datasets. Here are the results produced by Matlab's implementation of linear least squares with QR decomposition:

Dataset (p =)   b2                     b1                      ||r||₂²
25              3.355443200800436e7    −3.355447200800437e7    3.3307e−16
26              6.710886380847121e7    −6.710890380847107e7    1.998401444325282e−15
27              1.342177264337416e8    −1.342177664337410e8    5.595524044110790e−14
28              2.684354541191144e8    −2.684354941191140e8    7.105427357601003e−15
51              4.499999999999912      0                       41.999999999999844
52              4.499999999999957      0                       41.999999999999908

The Matlab implementation performs very well: it gives good linear fits all the way up to p = 51. Excel's implementation, on the other hand, gives meaningful fits only up to p = 23, and the fits it gives for p > 23 have a larger residual than any fit given by Matlab. So the developers of Excel implemented a more numerically stable QR algorithm in order to perform better on ill-conditioned data, but implemented it in such a way that it performs worse than the most naive algorithms (even Excel's own Trendline).
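The Matlab results above can be reproduced with the backslash operator; a sketch:

    % QR-based least squares fits for all six datasets.
    for p = [25 26 27 28 51 52]
        A = [ones(8,1), 1 + (41:48)'./2^p];
        b = (1:8)';
        xhat = A \ b;            % warns "Rank deficient, rank = 1" for p >= 51
        fprintf('p = %d: b2 = %.16g, b1 = %.16g, |r|^2 = %.16g\n', ...
            p, xhat(2), xhat(1), norm(b - A*xhat)^2);
    end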

Rank Deficiency
When using QR to solve linear least squares on the datasets with p ≥ 51, Matlab warns the user with the following message: "Warning: Rank deficient, rank = 1". An m x n matrix M is rank deficient if rank(M) < n, that is, if it has fewer than n linearly independent columns. Indeed, for p ≥ 51, Matlab considers the matrix

A = [ 1   1 + 41/2^p
      1   1 + 42/2^p
      1   1 + 43/2^p
      1   1 + 44/2^p
      1   1 + 45/2^p
      1   1 + 46/2^p
      1   1 + 47/2^p
      1   1 + 48/2^p ]

to have at most 1 linearly independent column (the operation rank() on the matrix returns 1). It is interesting to note that for square matrices the condition number is a measure of how close the matrix is to being singular, while for general rectangular matrices it is a measure of how close the matrix is to being rank deficient. According to the Matlab documentation, for rank deficient matrices M, linear least squares implemented with QR decomposition no longer returns the minimal-norm solution, as it is bound by the semantics of the QR factorization to return a solution with rank(M) non-zero values.
Meanwhile, the method of Normal Equations breaks down completely for rank deficient matrices A. The solution for x from the

Normal Equations is:

x = (AᵀA)⁻¹ Aᵀ b

If A is rank deficient, then AᵀA is singular, and (AᵀA)⁻¹ does not exist.
This is the reason why our implementations of the method of Normal Equations begin to break down at high condition numbers: the matrix A becomes numerically rank deficient. Is there any way to calculate a linear least squares solution for rank deficient matrices?

Pseudoinverse and the SVD


Yes! The way to calculate a linear least squares solution for rank deficient matrices is to use the Singular Value Decomposition (SVD). Before getting into the details of the algorithm, let us first look at the motivation.
Notice that from the normal equations (eq. 2 above) we can find an expression for the solution x:

x = (AᵀA)⁻¹ Aᵀ b

The n x m matrix (AᵀA)⁻¹ Aᵀ is actually A⁺, the pseudoinverse of A: it satisfies each of the four properties listed above. We will show that the four properties hold, assuming for now that A has full column rank so that (AᵀA)⁻¹ exists. (Note: since in our case we are working over the reals, M* = Mᵀ.)

(1) A A⁺ A = A (AᵀA)⁻¹ (AᵀA) = A.  QED

(2) A⁺ A A⁺ = ((AᵀA)⁻¹ Aᵀ) A ((AᵀA)⁻¹ Aᵀ) = ((AᵀA)⁻¹ (AᵀA)) (AᵀA)⁻¹ Aᵀ = (AᵀA)⁻¹ Aᵀ = A⁺.  QED

(3) (A A⁺)* = (A A⁺)ᵀ = (A⁺)ᵀ Aᵀ = ((AᵀA)⁻¹ Aᵀ)ᵀ Aᵀ = A ((AᵀA)ᵀ)⁻¹ Aᵀ = A (AᵀA)⁻¹ Aᵀ = A A⁺.  QED

(4) (A⁺ A)* = (A⁺ A)ᵀ = Aᵀ (A⁺)ᵀ = Aᵀ A ((AᵀA)ᵀ)⁻¹ = (AᵀA) (AᵀA)⁻¹ = I = (AᵀA)⁻¹ (AᵀA) = A⁺ A.  QED

Since (AᵀA)⁻¹ Aᵀ satisfies the four properties above, it is indeed the pseudoinverse of A. In a sense, then, the problem of linear least squares is reduced to finding the pseudoinverse of the Vandermonde matrix A.

With the Singular Value Decomposition, we can factor a matrix A as:

A = U S Vᵀ,  where S = diag(s_1, s_2, ..., s_p)

The s_i satisfy s_1 ≥ s_2 ≥ ... ≥ s_p ≥ 0 and are called the singular values of A. The columns of U are called the left singular vectors, and the columns of V are called the right singular vectors. Using the SVD, we can calculate the pseudoinverse A⁺ by the relation:

A⁺ = V S⁺ Uᵀ

where S⁺ = diag(1/s_1, 1/s_2, ..., 1/s_n) over all nonzero s_i; if s_i = 0, we set the corresponding entry of S⁺ to 0 [1].


What makes the SVD so powerful is that it allows us to manually "fiddle" with the singular values. For instance, if a matrix is rank deficient, at least one of its singular values is zero, so when calculating the pseudoinverse we set the corresponding entry of S⁺ to zero. By "manually" changing the singular value, we avoid the erroneous infinite result we would have gotten from the division by zero. The SVD is powerful for nearly rank deficient matrices as well: we can set a certain threshold value and ensure that any singular value smaller than the threshold is set to zero (this is the way the algorithm is implemented in Matlab). By setting small singular values to zero, we are essentially making the matrix less ill-conditioned (one of the definitions of the condition number is s_1/s_n, where s_1 and s_n are the largest and smallest non-zero singular values respectively).
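A sketch of the SVD-based pseudoinverse solution in Matlab; the explicit threshold below is our own illustrative choice (pinv() picks a comparable default based on eps and the matrix size):

    % Minimal-norm least squares via the (truncated) SVD pseudoinverse.
    [U, S, V] = svd(A);
    s = diag(S);                      % singular values, s(1) >= s(2) >= ... >= 0
    tol = max(size(A)) * eps(s(1));   % threshold below which s_i counts as zero
    sinv = zeros(size(s));
    sinv(s > tol) = 1 ./ s(s > tol);  % invert only the significant singular values
    Aplus = V * diag(sinv) * U(:, 1:numel(s))';   % A+ = V * S+ * U'
    xhat = Aplus * b;                 % equivalently: pinv(A) * b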
Using Matlab's pinv() routine, we find the least squares linear fit for our datasets:
Dataset Matrix A_p   b2                     b1                      ||r||₂²
A_25                 3.355443195571761e7    −3.355447195571753e7    2.164934898019056e−15
A_26                 6.710886391143523e7    −6.710890391143516e7    1.776356839400251e−15
A_27                 1.342177268800614e8    −1.342177668800610e8    5.240252676230738e−14
A_28                 2.684354499888867e8    −2.684354899888856e8    7.105427357601003e−14
A_51                 2.249999999999956      2.250000000000000       41.999999999999080
A_52                 2.249999999999977      2.249999999999999       41.999999999999571

Unfortunately, unlike any serious statistical or mathematical software, Excel does not have its own routine to calculate the SVD or the pseudoinverse. We were, however, able to find an open-source macro called Biplot [3] which is capable of calculating the SVD. Using Excel with the Biplot macro, we find roughly the same (not very good) fit for each of our 6 datasets:

Dataset Matrix A_p      b2                  b1                  ||r||₂²
A_25,26,27,28,51,52     2.249997055159946   2.250000039113575   41.999999999999834

Conclusion
In the course of this paper we have explored 4 different methods for calculating the least squares linear fit in Excel. Using Excel's native Trendline function, we found meaningful results only for datasets with p ≤ 25 (or cond(A) ≤ 2.928874824058564e+07). We then implemented commonly used statistical formulas for the least squares linear fit in the spreadsheet, and essentially reproduced the results of Excel's Trendline function. We confirmed that both of these fits use the method of Normal Equations, which we showed is numerically unstable (it has large growth) and hence does not perform well on ill-conditioned data.
We went on to discuss a more numerically stable algorithm, which calculates the least squares linear fit using QR decomposition. Using Matlab, we were able to get reasonably good fits for datasets with p < 51 (or cond(A) < 1.925820207179830e+15), at which point Matlab considered our matrix A to be rank deficient. Unfortunately, Excel's built-in LINEST function, which supposedly uses QR decomposition to compute the linear fit, does not fare as well. LINEST gives good fits for datasets with p ≤ 23; for p > 23 it returns the same (very poor) fit. LINEST thus performs even worse than the first algorithms we discussed (e.g. Trendline), which are just naive implementations of the method of Normal Equations.
Finally, we discussed how to solve the least squares linear fit for the most ill-conditioned problems (those that are, or nearly are, rank deficient) using the concept of the pseudoinverse and the Singular Value Decomposition. Using Matlab, we were able to use the SVD to find reasonable linear fits for our most ill-conditioned datasets, p = 51, 52 (cond(A) ≈ 10^15). With the SVD we find linear fits with slightly smaller (but comparable) residuals than with QR decomposition. Since Excel does not implement a native routine to calculate the SVD, we found an open-source macro that did the job. However, just as with LINEST, the macro gives meaningful fits only on datasets with p < 25; for p > 25 it returns roughly the same linear fit for all datasets, with a residual that is greater than the largest residual returned by Matlab's QR and SVD algorithms for any of the datasets.
It is hard to say why Excel performs so poorly without delving deep into the source code, to which we of course do not have access. Even in the duration of this project we discovered at least two substantial bugs in the software. One is that Excel displays up to 99 decimal digits in the Trendline function (30 digits elsewhere), while it really only has 15 significant decimal digits of precision. The other, more serious, bug is that Excel evaluates two mathematically equivalent expressions differently because of an extra set of parentheses surrounding one of them. This second bug had caused our code to fail, producing "#DIV/0!" for all datasets with p > 25; without being aware of it (the only place we found this bug mentioned is Prof. Kahan's paper [2]), the code would have been virtually undebuggable. Perhaps all of these bugs stem from the fact that Excel tries to make its arithmetic seem decimal rather than binary; one direct side effect of this is that Excel can only use 15 significant digits of precision as opposed to 17. Perhaps developers (both of Excel and otherwise) have caught on to the fact that Excel does not perform well with floating-point arithmetic (especially on very ill-conditioned data), and hence have written their algorithms to not even execute on ill-conditioned data, to avoid unpredictable results (this could be the case with LINEST and the open-source macro discussed above). In any case, one thing is certain: unless your data is very well conditioned, you should not use Excel to find a least squares linear fit.


Bibliography
(1) Burdick, Joel. "The Moore-Penrose Pseudo Inverse." Web. <http://robotics.caltech.edu/~jwb/courses/ME115/handouts/pseudo.pdf>.
(2) Kahan, William. "How Futile Are Mindless Assessments of Roundoff in Floating-Point Computation?" Web. <http://www.cs.berkeley.edu/~wkahan/Mindless.pdf>.
(3) Lipkovich, Ilya, and Eric P. Smith. "Biplot and Singular Value Decomposition Macros for Excel." Virginia Tech Department of Statistics. Web. <http://filebox.vt.edu/artsci/stats/vining/keying/biplot.doc>.
(4) Markovsky, Ivan. "Least Squares and Singular Value Decomposition." University of Southampton. Web. <http://users.ecs.soton.ac.uk/im/bari08/svd.pdf>.
(5) Taylor, John R. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. Sausalito, Calif.: University Science Books, 1997. Print.
