# Statistics question that no one can answer...

Discussion in 'Off Topic [BG]' started by Davidoc, Oct 10, 2005.

1. ### Davidoc

Sep 2, 2000
Northern VA and JMU
This is probably the wrong place to post this.

Why are only Y residuals used when calculating a least-squares linear regression line from a data set? This seems to be the universally accepted method for finding a line of best fit.

It seems to me, however, that this often creates lines that are more horizontal than they should be, especially when the data points depict a line with a very high or low slope. Using only Y residuals, as is accepted, the line will often turn out to be much too horizontal. Perhaps an average of X and Y residuals should be used instead? The current method makes no sense to me.

(PS: I wish I knew more about computer programming so I could write a simple program to calculate a regression line using an average of x and y residuals, and compare it to the bazillion applets out there that do it based on Y residuals only.)
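(For what it's worth, that comparison fits in a few lines of Python. The sketch below is illustrative, not from this thread: `ols_slope` is the usual vertical-residual fit, and `tls_slope` is one symmetric alternative, the orthogonal / total-least-squares slope, which penalizes perpendicular distance to the line instead of vertical distance.)

```python
import math

def ols_slope(x, y):
    """Ordinary least squares: minimizes squared *vertical* (Y) residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def tls_slope(x, y):
    """Orthogonal (total least squares) fit: minimizes squared
    *perpendicular* distances, so x and y are treated symmetrically."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# A steep line (true slope about 5) with scatter in both coordinates:
x = [1.0, 1.1, 2.0, 1.9, 3.0, 3.1]
y = [5.2, 4.8, 10.3, 9.7, 15.1, 14.9]
print(ols_slope(x, y))  # a bit below 5: vertical residuals only
print(tls_slope(x, y))  # slightly steeper than the OLS slope
```

On data like this, where x has scatter too, the orthogonal fit does come out steeper than the vertical-residual fit, which is exactly the effect the question describes.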

2. ### mike_v_s

Sorry. Against the advice of my professor, I tossed the book after I graduated. I had an A, too...you'd have thought more of it stuck.

Mike

3. ### Thor

Boy, that course was about 30 years ago.

However, conceptually, the thing that strikes me in a data set is that usually Y = f(X): Y is a function of X.

In other words, you take X as a given, and measure (and plot) Y as a function of X. Since there is only one varying quantity at each data point, not two, there is no need to fit the line to X; X does not vary. It is the Y value that varies, and the line approximates where Y would most likely be found over the given range of X.

4. ### GrooveWarrior

I would also say that you are not trying to find a correlation or relationship between X and Y when doing this type of problem. Use a Pearson product-moment correlation (r) if you want to find out what the relationship is between all of X and all of Y; then you could work on a regression of some sort. There are a lot of cobwebs to dust off, but that was the first thought that came to my mind.

5. ### canopener

Sep 15, 2003
Isle of Lucy
Do you use Minitab? It's just as easy as Excel, but a lot deeper statistically. Their website has a free trial if your school does not have it.

http://www.minitab.com/

6. ### Kelly Coyle

Nov 16, 2004
Mankato, MN
With regression, you are trying to calculate y as a function of x; that is, given a value of x, what is the likely value of y? The error is how wrong that prediction is, squared, on average. So the x residuals aren't relevant -- actually, they don't exist for the purposes of the regression. Do you see what I mean? The answer you get, R-squared, is the amount of variance in y accounted for by knowing the value of x, and not the other way around; in other words, it's an index of the predictability of y given x.

Correlation is different, and perhaps (I don't remember) does deal with error in x. But correlation is how much shared variance there is between x and y, which is the answer to a different question.

Edit: I should have read Thor's post first, ya?
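(To illustrate the asymmetry Kelly and Thor describe, here's a small sketch in plain Python with made-up data: the correlation r is symmetric in x and y, but the y-on-x and x-on-y regression slopes differ, and their product works out to r squared.)

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation; symmetric in x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def slope(x, y):
    """Slope of the y-on-x least-squares line (vertical residuals only)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
# r doesn't care which variable is which ...
print(pearson_r(x, y), pearson_r(y, x))
# ... but the two regression slopes differ, and their product is r squared:
print(slope(x, y), slope(y, x), slope(x, y) * slope(y, x))
```

So correlation answers the symmetric "shared variance" question, while each regression answers a directional "predict this from that" question.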

7. ### westland

Oct 8, 2004
Hong Kong
You are thinking about this incorrectly. A regression takes two series Y and X (either may be a vector) and explains Y as a linear function of X:

Y = bX + e

So Y - bX = e is the error series, and in ordinary least squares you assume these are all independent, and you want to choose b to minimize [e1^2 + e2^2 + e3^2 + ... + en^2].

I assume you are referring to the e as the Y-residuals ... I mean, they are the only residuals. And you need some criterion to choose b ... this is the standard OLS criterion.

BTW, it's not the only way ... if the e are not independent, then you need to make assumptions about the covariance structure, and you have a lot of variations. Additionally, computer-intensive approaches offer alternatives to squared-error criteria (which are an artifact of the hand-calculation era) ... you have robust and stepwise regression approaches, and so forth.
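(The standard criterion described here, choosing b to minimize the sum of squared e, can be sketched in a few lines of Python. Names and data are illustrative, and the model is Y = bX + e with no intercept, as written above.)

```python
def sse(b, x, y):
    """Sum of squared residuals e_i = y_i - b*x_i for the model Y = bX + e."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

def ols_b(x, y):
    """Closed-form minimizer of sse for the no-intercept model:
    b = sum(x*y) / sum(x*x)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b = ols_b(x, y)
# Nudging b in either direction can only increase the squared-error sum:
assert sse(b, x, y) <= sse(b + 0.01, x, y)
assert sse(b, x, y) <= sse(b - 0.01, x, y)
print(b)
```

The asserts are the point: the closed-form b really does sit at the bottom of the squared-error criterion, which is the sense in which it is "the" least-squares line.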

8. ### westland

Oct 8, 2004
Hong Kong
It may also help to think of this geometrically. Regression is a process that projects the Y vector onto the column space of X. You have different criteria for how you can do this, and these will yield different coefficients ('b').

Think of this as shining a flashlight (Y-vectors) on a wall (X-vector plane). As you tilt the flashlight (change the criteria) you get a different oval of light (different 'b').
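(That projection picture can be checked numerically. A minimal sketch with made-up numbers, using a one-column X so b = (X'X)^-1 X'Y reduces to a scalar: the fitted values Xb are the "shadow" of Y on the X space, so the leftover residual is orthogonal to X.)

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# One-column X, so the projection coefficient is dot(x, y) / dot(x, x):
x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
b = dot(x, y) / dot(x, x)
y_hat = [b * xi for xi in x]                   # projection of y onto span{x}
e = [yi - yh for yi, yh in zip(y, y_hat)]      # what the projection misses
print(dot(e, x))  # ~0: the residual is orthogonal to the X space
```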