As usual, we start with a basic random experiment that has a sample space and a probability measure. Suppose that X and Y are real-valued random variables for the experiment. We will denote the means, variances, and covariance as follows: µX = E(X), µY = E(Y), σX² = var(X), σY² = var(Y), σX,Y = cov(X, Y).
Finally, recall that the correlation is ρX,Y = cor(X, Y) = σX,Y / (σX σY).
Now suppose that we repeat the experiment n times to get n independent, random vectors, each with the same distribution as (X, Y). That is, we get a random sample of size n from this distribution:
(X1, Y1), (X2, Y2), ..., (Xn, Yn).
As above, we will use the subscripts to distinguish the sample mean and the sample variance for the X variables and for the Y variables. These sample statistics depend on the sample size n, of course, but for simplicity we will suppress this dependence in the notation.
In this section, we will define and study statistics that are natural estimators of the distribution covariance and correlation. These statistics will be measures of the linear relationship of the sample points in the plane. As usual, the definitions depend on what other parameters are known and unknown.
Suppose first that the means µX, µY are known. This is usually an unrealistic assumption, of course, but it is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of σX,Y in this case is
WX,Y = (1 / n) Σi = 1..n (Xi - µX)(Yi - µY).
1. Show that WX,Y is the sample mean for a random sample of size n from the distribution of (X - µX)(Y - µY).
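As a computational aside (not part of the exercises), the following is a minimal sketch of WX,Y on simulated data; the bivariate normal model, the parameter values, and the use of Python with NumPy are hypothetical choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical example: known means mu_X = 2, mu_Y = -1, and cov(X, Y) = 1.5
    mu_x, mu_y = 2.0, -1.0
    cov_matrix = [[4.0, 1.5], [1.5, 9.0]]   # var(X) = 4, var(Y) = 9, cov(X, Y) = 1.5
    n = 10_000
    x, y = rng.multivariate_normal([mu_x, mu_y], cov_matrix, size=n).T

    # W_{X,Y} = (1 / n) * sum over i of (X_i - mu_X)(Y_i - mu_Y), valid when the means are known
    w_xy = np.mean((x - mu_x) * (y - mu_y))
    print(w_xy)   # should be close to the distribution covariance 1.5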
Consider now the more realistic assumption that the means µX, µY are unknown. A natural approach in this case is to average
(Xi - MX)(Yi - MY)
over i = 1, 2, ..., n. But rather than dividing by n in our average, we should divide by whatever constant gives an unbiased estimator of σX,Y.
3. Interpret the sign of (Xi - MX)(Yi - MY) geometrically, in terms of the scatterplot of the sample points and its center (MX, MY).
4. Show that cov(MX, MY) = σX,Y / n.
5. Show that Σi = 1..n (Xi - MX)(Yi - MY) = n [WX,Y - (MX - µX)(MY - µY)].
6. Use the result of Exercise 5 and basic properties of expected value to show that E[Σi = 1..n (Xi - MX)(Yi - MY)] = (n - 1) σX,Y.
Therefore, to have an unbiased estimator of σX,Y, we should define the sample covariance to be the random variable
SX,Y = [1 / (n - 1)] Σi = 1..n (Xi - MX)(Yi - MY).
As with the sample variance, when n is large, it makes little difference whether we divide by n or n - 1.
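For readers who want to experiment numerically, here is a minimal sketch of the sample covariance; the simulated data and the use of Python with NumPy are hypothetical. NumPy's cov function uses the same n - 1 divisor by default.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=50)
    y = 0.6 * x + rng.normal(scale=0.5, size=50)   # hypothetical correlated data

    n = len(x)
    mx, my = x.mean(), y.mean()                    # sample means M_X, M_Y

    # S_{X,Y} = [1 / (n - 1)] * sum over i of (X_i - M_X)(Y_i - M_Y)
    s_xy = np.sum((x - mx) * (y - my)) / (n - 1)

    print(s_xy, np.cov(x, y)[0, 1])                # the two values agree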
The formula in the following exercise is sometimes better than the definition for computational purposes.
7. Show that SX,Y = [1 / (n - 1)] Σi = 1..n Xi Yi - [n / (n - 1)] MX MY.
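A quick numerical check of the identity in Exercise 7 (again a hypothetical sketch, assuming Python with NumPy and arbitrary simulated data):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(size=30)
    y = rng.uniform(size=30)                                  # hypothetical data
    n = len(x)
    mx, my = x.mean(), y.mean()

    s_def = np.sum((x - mx) * (y - my)) / (n - 1)             # definition of S_{X,Y}
    s_alt = np.sum(x * y) / (n - 1) - n / (n - 1) * mx * my   # computational form (Exercise 7)
    print(np.isclose(s_def, s_alt))                           # True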
8. Use the result of Exercise 5 and the strong law of large numbers to show that SX,Y → σX,Y as n → ∞ with probability 1.
The properties established in the following exercises are analogues of properties for the distribution covariance.
9. Show that SX,X = SX².
10. Show that SX,Y = SY,X.
11. Show that if a is a constant then SaX,Y = a SX,Y.
12. Suppose that we have a random sample of size n from the distribution of (X, Y, Z). Show that SX,Y+Z = SX,Y + SX,Z.
By analogy with the distribution correlation, the sample correlation is obtained by dividing the sample covariance by the product of the sample standard deviations:
RX,Y = SX,Y / (SX SY).
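The following sketch (hypothetical data; Python with NumPy assumed) computes RX,Y directly from the definition and compares it with NumPy's built-in correlation function:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)    # hypothetical data with positive correlation

    sx = x.std(ddof=1)                    # sample standard deviations (n - 1 divisor)
    sy = y.std(ddof=1)
    s_xy = np.cov(x, y)[0, 1]             # sample covariance S_{X,Y}

    r_xy = s_xy / (sx * sy)               # R_{X,Y}
    print(r_xy, np.corrcoef(x, y)[0, 1])  # agrees with NumPy's correlation coefficient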
13. Use the strong law of large numbers to show that RX,Y → ρX,Y as n → ∞ with probability 1.
14. Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: sample means 0, sample standard deviations 1, and sample correlation, in turn, 0, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9.
15. Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: X sample mean 1, Y sample mean 3, X sample standard deviation 2, Y sample standard deviation 1, and sample correlation, in turn, 0, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9.
Recall that in the section on (distribution) correlation and regression, we showed that the best linear predictor of Y based on X, in the sense of minimizing mean square error, is
aX + b where a = σX,Y / σX² and b = µY - a µX.
Moreover, the (minimum) value of the mean square error, with this choice of a and b, is
E{[Y - (aX + b)]²} = σY² (1 - ρX,Y²).
Of course, in real applications, we are unlikely to know the distribution parameters that go into the definition of a and b above. Thus, in this section, we are interested in the problem of estimating the best linear predictor of Y based on X from our sample data.
(X1, Y1), (X2, Y2), ..., (Xn, Yn).
One natural approach is to find the line
y = Ax + B
that fits the sample points best. This is a basic and important problem in many areas of mathematics, not just statistics. The term best means that we want to find the line (that is, find A and B) that minimizes the average of the squared errors between the actual y values in our data and the predicted y values:
MSE = [1 / (n - 1)] Σi = 1..n [Yi - (AXi + B)]².
Finding A and B that minimize MSE is a standard problem in calculus.
16. Show that MSE is minimized when A = SX,Y / SX² and B = MY - A MX.
17. Show that the minimum value of MSE, with A and B given in Exercise 16, is MSE = SY² [1 - RX,Y²].
18. Use the result of Exercise 17 to show that -1 ≤ RX,Y ≤ 1.
The fact that the results in Exercises 17 and 18 are the sample analogues of the corresponding distribution results is beautiful and reassuring. The line y = Ax + B, where A and B are given in Exercise 16, is known as the (sample) regression line for Y based on X. Note from the form of B in Exercise 16 that the sample regression line passes through (MX, MY), the center of the empirical distribution. Naturally, A and B can be viewed as estimators of a and b, respectively.
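As an illustrative sketch (hypothetical simulated data; Python with NumPy assumed), we can compute A and B from the formulas in Exercise 16, check the identity in Exercise 17, and compare with a generic least squares fit:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(loc=1.0, scale=2.0, size=200)
    y = 0.5 * x + 3.0 + rng.normal(size=200)    # hypothetical linear-plus-noise data
    n = len(x)

    mx, my = x.mean(), y.mean()
    sx2 = x.var(ddof=1)                         # S_X^2
    s_xy = np.cov(x, y)[0, 1]                   # S_{X,Y}
    r_xy = np.corrcoef(x, y)[0, 1]              # R_{X,Y}

    A = s_xy / sx2                              # slope of the sample regression line
    B = my - A * mx                             # intercept; the line passes through (M_X, M_Y)

    mse = np.sum((y - (A * x + B)) ** 2) / (n - 1)
    print(A, B)
    print(np.isclose(mse, y.var(ddof=1) * (1 - r_xy ** 2)))   # identity in Exercise 17
    print(np.polyfit(x, y, 1))                  # least squares fit: same slope and intercept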
19. Use the law of large numbers to show that, with probability 1, A converges to a and B converges to b as n increases to infinity.
As with the distribution regression lines, the choice of predictor and response variables is important.
20. Show that the sample regression line for Y based on X and the sample regression line for X based on Y are not the same line, except in the trivial case where the sample points all lie on a line.
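A small numerical illustration of Exercise 20 (hypothetical data; Python with NumPy assumed): the slope of the regression line of Y based on X and the slope obtained by inverting the regression line of X based on Y agree only when the points fall exactly on a line, that is, when |RX,Y| = 1.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(size=100)
    y = 0.8 * x + rng.normal(scale=0.6, size=100)      # hypothetical data, not perfectly linear

    slope_y_on_x = np.cov(x, y)[0, 1] / x.var(ddof=1)  # slope of the line for Y based on X
    slope_x_on_y = np.cov(x, y)[0, 1] / y.var(ddof=1)  # slope of the line for X based on Y
    print(slope_y_on_x, 1.0 / slope_x_on_y)            # equal only when |R_{X,Y}| = 1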
Recall that the constant B that minimizes MSE = [1 / (n - 1)] Σi = 1..n (Yi - B)² is the sample mean MY, and that the minimum value of MSE is the sample variance SY². Thus, the difference between this value of the mean square error and the one in Exercise 17, namely
SY² RX,Y², is the reduction in the variability of the Y data when the linear term in X is added to the predictor. The fractional reduction is RX,Y², and hence this statistic is called the (sample) coefficient of determination.
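The following sketch (hypothetical data; Python with NumPy assumed) verifies numerically that RX,Y² equals the fractional reduction in the sum of squared errors when the linear term in X is used:

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=150)
    y = 1.5 * x + rng.normal(size=150)        # hypothetical data

    r2 = np.corrcoef(x, y)[0, 1] ** 2         # coefficient of determination R_{X,Y}^2

    A, B = np.polyfit(x, y, 1)                # sample regression line
    sse = np.sum((y - (A * x + B)) ** 2)      # squared error with the linear predictor
    sst = np.sum((y - y.mean()) ** 2)         # squared error with the constant predictor M_Y
    print(np.isclose(r2, 1 - sse / sst))      # True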
21. Click in the interactive scatterplot, in various places, and watch how the regression line changes.
22. Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the mean of the x values is 0, the standard deviation of the x values is 1, and in which the regression line has a given slope and intercept.
23. Click in the interactive scatterplot to define 20 points with the following properties: the mean of the x values is 1, the mean of the y values is 1, and the regression line has slope 1 and intercept 2.
If you had a difficult time with Exercise 23, it's because the conditions imposed are impossible to satisfy!
24. Run the bivariate uniform experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.
25. Run the
bivariate normal experiment 2000 times, with an update frequency of 10, in each of the
following cases. Note the apparent convergence of the sample means to the distribution
means, the sample standard deviations to the distribution standard deviations, the sample
correlation to the distribution correlation, and the sample regression line to the
distribution regression line.
26. Compute the
correlation between petal length and petal width for the following cases in Fisher's iris
data. Comment on the differences.
27. Compute the correlation between each pair of color count variables in the M&M data.
28. Using all cases in Fisher's iris data,
29. Using the Setosa cases only in Fisher's iris data,