
The Least Squares Regression Line (and friends)
The idea behind the technique of least squares is as follows. Suppose you were
interested in two quantities, call them X and Y, which were supposed to be
related via a linear function Y = a+bX.
Suppose that a and b are unknown, so you decide to make some
measurements (X_{1},Y_{1}),(X_{2},Y_{2}),... of X and Y. Suppose, however,
that as you measure the quantities, there are some errors introduced, so
the measured values Y_{i} are related to the measured values X_{i}
via Y_{i} = a+bX_{i}+E_{i}, where E_{i} is the (unknown) error in
measurement i. How can you find a and b?
This is, in fact, a remarkably common situation. This problem occurs in
biology, psychology, physics, chemistry, in fact all the pure and applied
sciences, as well as economics, business, medicine, nutrition... the list
is endless.
The most common technique used to find approximations for a
and b is as follows. Suppose we guess two values, a and b, for the
unknown coefficients. We then work out a+bX_{i} for each
of the X_{i}, and find the difference between the measured Y_{i}
and the calculated a+bX_{i}, say r_{i} = Y_{i} - (a+bX_{i}), called the residual. It stands to reason
that if the a and b we guessed were correct, then most of the r_{i} should
be reasonably small. So to find a good approximation to a and b,
we should find the a and b that make the r_{i} collectively the smallest
in some sense.
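As a minimal sketch of this step, the residuals for a guessed pair a, b can be computed as follows. The function name `residuals` and the sample measurements are made up for illustration and are not part of the original article.

```python
def residuals(xs, ys, a, b):
    """Return r_i = Y_i - (a + b*X_i) for each measurement (X_i, Y_i)."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical measurements, invented for this example.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

# A guess close to the underlying line gives small residuals...
print(residuals(xs, ys, 1.0, 2.0))
# ...while a poor guess gives larger ones.
print(residuals(xs, ys, 0.0, 3.0))
```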
How can this be done? The most common method is to calculate the sum of the
squares of the residuals,
S = \sum_{i=1}^{n} r_{i}^{2} = \sum_{i=1}^{n} (Y_{i} - a - bX_{i})^{2},

and then find the a and b which make S as small as possible. This
technique is called least squares regression, and the line obtained,
say Y = A+BX, is called the least squares line, the line of
best fit, or the line of regression.
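To make the quantity S concrete, here is a small sketch that evaluates the sum of squared residuals for two different guesses. The function name `sum_of_squares` and the data are assumptions chosen for illustration.

```python
def sum_of_squares(xs, ys, a, b):
    """S = sum over i of (Y_i - a - b*X_i)^2 for the guessed a and b."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements, invented for this example.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

good = sum_of_squares(xs, ys, 1.0, 2.0)  # guess near the underlying line
poor = sum_of_squares(xs, ys, 0.0, 3.0)  # guess further away

# The better guess yields the smaller S.
print(good < poor)
```

Least squares regression picks, out of all possible pairs a and b, the one for which this number is smallest.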
Why choose least squares? Why not least fourth powers, or least absolute
values or something else? It's a secret, but the main reason is that...
it makes the mathematics a lot simpler. There's no deep philosophical reason to
prefer least squares regression over any other kind of regression one might
invent, unless "it's easy to use" is a deep philosophical reason.
The fact is, minimizing S is an easy calculus problem. A bright
precalculus student could perhaps do it without using any calculus at all.
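For readers who want to see the calculus, here is a sketch of the standard argument, with all sums running over i = 1, ..., n. S is a quadratic function of a and b, so its minimum occurs where both partial derivatives vanish:

```latex
\begin{align*}
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} (Y_{i} - a - bX_{i}) = 0, \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} X_{i}(Y_{i} - a - bX_{i}) = 0,
\intertext{which rearrange to two linear equations in the two unknowns $a$ and $b$:}
na + b\sum X_{i} &= \sum Y_{i}, \\
a\sum X_{i} + b\sum X_{i}^{2} &= \sum X_{i}Y_{i}.
\end{align*}
```

Solving this pair of equations for a and b gives the minimizing values, which are the A and B of the least squares line.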
The values of A and B obtained are
B = \frac{n \sum (Y_{i}X_{i}) - (\sum Y_{i})(\sum X_{i})}{n \sum (X_{i}^{2}) - (\sum X_{i})^{2}}

and A = \frac{\sum Y_{i}}{n} - B \frac{\sum X_{i}}{n}.
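The closed-form expressions for A and B translate almost directly into code. The following is a minimal sketch; the function name `least_squares_line` and the sample data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (A, B) for the least squares line Y = A + B*X.

    B = (n*sum(X*Y) - sum(Y)*sum(X)) / (n*sum(X^2) - sum(X)^2)
    A = sum(Y)/n - B*sum(X)/n
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denominator = n * sum_xx - sum_x * sum_x
    if denominator == 0:
        # Happens when there is no data or all X values are identical.
        raise ValueError("need at least two distinct X values")
    B = (n * sum_xy - sum_y * sum_x) / denominator
    A = sum_y / n - B * sum_x / n
    return A, B


# Hypothetical measurements, invented for this example.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
A, B = least_squares_line(xs, ys)
print(f"Y = {A:.2f} + {B:.2f} X")
```

Note that the denominator is zero when all the X_{i} are equal, in which case no unique line exists; the sketch raises an error rather than dividing by zero.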

The DotPlacer Applet (on this site) allows you to
place a set of points on the screen, and have it draw the least squares
line calculated from the points. You can even move the points around, and
watch the line change. Try it!
If you suspect that X and Y do not follow a linear relationship,
but instead (for example) Y = a+bX+cX^{2}, you can apply
very similar techniques to obtain the least squares
quadratic. Or, if you suspect the relationship is that of a cubic polynomial,
or an exponential curve, or whatever, there are least squares techniques
available. The DotPlacer Applet will also draw
least squares polynomials for you, up to degree 4.
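As an illustration of how the same idea extends to polynomials, the sketch below fits a polynomial of a chosen degree by setting the partial derivatives of S to zero and solving the resulting linear equations. This is one possible approach, not necessarily what the DotPlacer Applet does; the function name `least_squares_polynomial` and the data are assumptions. For high degrees or badly scaled data, more numerically stable methods are usually preferred.

```python
def least_squares_polynomial(xs, ys, degree):
    """Return [c0, c1, ..., c_degree] minimizing the sum of
    (Y_i - (c0 + c1*X_i + ... + c_degree*X_i^degree))^2."""
    m = degree + 1
    if len(xs) != len(ys) or len(xs) < m:
        raise ValueError("need matching lists with at least degree+1 points")

    # One equation per power j:
    #   sum_k c_k * sum_i X_i^(j+k) = sum_i Y_i * X_i^j
    # stored as an augmented matrix [coefficients | right-hand side].
    aug = [
        [sum(x ** (j + k) for x in xs) for k in range(m)]
        + [sum(y * x ** j for x, y in zip(xs, ys))]
        for j in range(m)
    ]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough distinct X values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (aug[r][m] - tail) / aug[r][r]
    return coeffs


# Hypothetical points lying exactly on Y = 1 + 2X + 3X^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [9.0, 2.0, 1.0, 6.0, 17.0]
print(least_squares_polynomial(xs, ys, 2))
```

With `degree=1` this reduces to the ordinary least squares line.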
For more information on least squares regression, see this
World of Mathematics
article.
