9. Sample Covariance and Correlation


The Bivariate Model

As usual, we start with a basic random experiment that has a sample space and a probability measure. Suppose that X and Y are real-valued random variables for the experiment. We will denote the means, variances, and covariance as follows:

µX = E(X), µY = E(Y); σX² = var(X), σY² = var(Y); σX,Y = cov(X, Y).

Finally, recall that the correlation is ρX,Y = cor(X, Y) = σX,Y / (σX σY).

Now suppose that we repeat the experiment n times to get n independent random vectors, each with the same distribution as (X, Y). That is, we get a random sample of size n from this distribution:

(X1, Y1), (X2, Y2), ..., (Xn, Yn).

As in previous sections, we will use subscripts to distinguish the sample means MX, MY and the sample variances SX², SY² of the X variables and the Y variables. These sample statistics depend on the sample size n, of course, but for simplicity we will suppress this dependence in the notation.

In this section, we will define and study statistics that are natural estimators of the distribution covariance and correlation. These statistics will be measures of the linear relationship of the sample points in the plane. As usual, the definitions depend on what other parameters are known and unknown.

An Estimator of the Covariance When µX, µY Are Known

Suppose first that the means µX and µY are known. This is usually an unrealistic assumption, of course, but it is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of σX,Y in this case is

WX,Y = (1 / n) Σi = 1, ..., n (Xi - µX)(Yi - µY).
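
For concreteness, here is a minimal numerical sketch of WX,Y in Python with NumPy; the library, the bivariate normal distribution, and the parameter values are assumptions made for illustration, not part of the model above.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1000

    # Simulate a sample from a bivariate normal distribution with known means
    # mu_X = mu_Y = 0 and covariance sigma_{X,Y} = 0.5 (hypothetical parameters)
    mean = [0.0, 0.0]
    cov = [[1.0, 0.5], [0.5, 1.0]]
    x, y = rng.multivariate_normal(mean, cov, size=n).T

    # W_{X,Y}: the average of (X_i - mu_X)(Y_i - mu_Y), using the known means
    w = np.mean((x - mean[0]) * (y - mean[1]))
    print(w)  # close to the true covariance, 0.5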

Mathematical Exercise 1. Show that WX,Y is the sample mean for a random sample of size n from the distribution of (X - µX)(Y - µY).

Mathematical Exercise 2. Use the result of Exercise 1 to show that

  1. E(WX,Y) = σX,Y.
  2. WX,Y converges to σX,Y as n converges to infinity with probability 1.

In particular, WX,Y is an unbiased estimator of σX,Y.

The Sample Covariance

Consider now the more realistic assumption that the means µX, µY are unknown. A natural approach in this case is to average

(Xi - MX)(Yi - MY)

over i = 1, 2, ..., n. But rather than dividing by n in our average, we should divide by whatever constant gives an unbiased estimator of σX,Y.

Mathematical Exercise 3. Interpret the sign of (Xi - MX)(Yi - MY) geometrically, in terms of the scatterplot of the points and its center (MX, MY).

Mathematical Exercise 4. Show that cov(MX, MY) = σX,Y / n.
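
The result in Exercise 4 can also be checked by simulation. The sketch below (again assuming NumPy, with hypothetical parameters) estimates cov(MX, MY) from many independent replications of the sample:

    import numpy as np

    rng = np.random.default_rng(7)
    n, reps = 25, 100000
    cov = [[1.0, 0.5], [0.5, 1.0]]  # true covariance sigma_{X,Y} = 0.5

    # Draw `reps` independent samples of size n; shape (reps, n, 2)
    samples = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
    mx = samples[..., 0].mean(axis=1)  # sample mean of the X's in each sample
    my = samples[..., 1].mean(axis=1)  # sample mean of the Y's in each sample

    print(np.cov(mx, my)[0, 1])  # approximately 0.5 / 25 = 0.02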

Mathematical Exercise 5. Show that

Σi = 1, ..., n (Xi - MX)(Yi - MY) = n [WX,Y - (MX - µX)(MY - µY)].

Mathematical Exercise 6. Use the result of Exercise 5 and basic properties of expected value to show that

E[Σi = 1, ..., n (Xi - MX)(Yi - MY)] = (n - 1) σX,Y.

Therefore, to have an unbiased estimator of σX,Y, we should define the sample covariance to be the random variable

SX,Y = [1 / (n - 1)] Σi = 1, ..., n (Xi - MX)(Yi - MY).

As with the sample variance, when n is large, it makes little difference whether we divide by n or n - 1.
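
As a sanity check, the definition of SX,Y agrees with NumPy's np.cov, whose default divisor is n - 1. A sketch with simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 20
    x = rng.normal(size=n)
    y = 0.7 * x + rng.normal(size=n)

    # Sample covariance from the definition, dividing by n - 1
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

    print(s_xy, np.cov(x, y)[0, 1])  # the two values agree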

Properties

The formula in the following exercise is sometimes better than the definition for computational purposes.

Mathematical Exercise 7. Show that

SX,Y = [1 / (n - 1)] Σi = 1, ..., n XiYi - [n / (n - 1)] MX MY.
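
A quick numerical check of this identity (a sketch with simulated data, assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 15
    x, y = rng.normal(size=(2, n))

    # The definition of S_{X,Y} and the computational shortcut above
    definition = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    shortcut = np.sum(x * y) / (n - 1) - n / (n - 1) * x.mean() * y.mean()

    print(definition, shortcut)  # equal up to floating-point rounding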

Mathematical Exercise 8. Use the result of Exercise 5 and the strong law of large numbers to show that

SX,Y converges to σX,Y as n converges to infinity with probability 1.

The properties established in the following exercises are analogues of properties for the distribution covariance.

Mathematical Exercise 9. Show that SX,X = SX².

Mathematical Exercise 10. Show that SX,Y = SY,X.

Mathematical Exercise 11. Show that if a is a constant then SaX,Y = a SX,Y.

Mathematical Exercise 12. Suppose that we have a random sample of size n from the distribution of (X, Y, Z). Show that

SX,Y+Z = SX,Y + SX,Z.

Sample Correlation

By analogy with the distribution correlation, the sample correlation is obtained by dividing the sample covariance by the product of the sample standard deviations:

RX,Y = SX,Y / (SX SY).
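
The following sketch computes RX,Y from the definition and compares it with NumPy's np.corrcoef (the data are simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    x = rng.normal(size=n)
    y = -0.6 * x + rng.normal(size=n)

    # R_{X,Y} = S_{X,Y} / (S_X S_Y); ddof=1 gives the sample standard deviation
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    r = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

    print(r, np.corrcoef(x, y)[0, 1])  # the two values agree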

Mathematical Exercise 13. Use the strong law of large numbers to show that

RX,Y converges to ρX,Y as n converges to infinity with probability 1.

Simulation Exercise 14. Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: sample means 0, sample standard deviations 1, sample correlation as follows: 0, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9.

Simulation Exercise 15. Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: X sample mean 1, Y sample mean 3, X sample standard deviation 2, Y sample standard deviation 1, sample correlation as follows: 0, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9.

The Best Linear Predictor

Recall that in the section on (distribution) correlation and regression, we showed that the best linear predictor of Y based on X, in the sense of minimizing mean square error, is

aX + b where a = σX,Y / σX² and b = µY - a µX.

Moreover, the (minimum) value of the mean square error, with this choice of a and b, is

E{[Y - (aX + b)]²} = σY² (1 - ρX,Y²).

Of course, in real applications, we are unlikely to know the distribution parameters that go into the definition of a and b above. Thus, in this section, we are interested in the problem of estimating the best linear predictor of Y based on X from our sample data.

(X1, Y1), (X2, Y2), ..., (Xn, Yn).

One natural approach is to find the line

y = Ax + B

that fits the sample points best. This is a basic and important problem in many areas of mathematics, not just statistics. The term best means that we want to find the line (that is, find A and B) that minimizes the average of the squared errors between the actual y values in our data and the predicted y values:

MSE = [1 / (n - 1)] Σi = 1, ..., n [Yi - (AXi + B)]².

Finding A and B that minimize MSE is a standard problem in calculus.

Mathematical Exercise 16. Show that MSE is minimized when

  1. A = SX,Y / SX².
  2. B = MY - AMX.

Mathematical Exercise 17. Show that the minimum value of MSE, with A and B given in Exercise 16, is

MSE = SY² (1 - RX,Y²).
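
The formulas in Exercises 16 and 17 can be verified numerically. The sketch below (simulated data, assuming NumPy) computes A, B, and the minimum mean square error both directly and via the formula above:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100
    x = rng.normal(size=n)
    y = 2.0 * x + 1.0 + rng.normal(size=n)

    # Sample covariance, variances, and correlation
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    s_x2 = np.var(x, ddof=1)   # S_X^2
    s_y2 = np.var(y, ddof=1)   # S_Y^2
    r = s_xy / np.sqrt(s_x2 * s_y2)

    # Regression coefficients from Exercise 16
    a = s_xy / s_x2
    b = y.mean() - a * x.mean()

    # Minimum MSE computed directly and via the formula in Exercise 17
    mse = np.sum((y - (a * x + b)) ** 2) / (n - 1)
    print(mse, s_y2 * (1 - r ** 2))  # the two values agree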

Mathematical Exercise 18. Use the result of Exercise 17 to show that

  1. RX,Y in [-1, 1].
  2. RX,Y = -1 if and only if the sample points lie on a line with negative slope.
  3. RX,Y = 1 if and only if the sample points lie on a line with positive slope.

Thus, the sample correlation measures the degree of linearity of the sample points. The results in Exercise 18 can also be obtained by noting that the sample correlation is simply the correlation of the empirical distribution. Of course, properties (1), (2), and (3) are known to hold for the distribution correlation.

The fact that the results in Exercises 17 and 18 are the sample analogues of the corresponding distribution results is beautiful and reassuring. The line y = Ax + B, where A and B are given in Exercise 16, is known as the (sample) regression line for Y based on X. Note from Exercise 16 (2) that the sample regression line passes through (MX, MY), the center of the empirical distribution. Naturally, A and B can be viewed as estimators of a and b, respectively.

Mathematical Exercise 19. Use the law of large numbers to show that, with probability 1, A converges to a and B converges to b as n increases to infinity.

As with the distribution regression lines, the choice of predictor and response variables is important.

Mathematical Exercise 20. Show that the sample regression line for Y based on X and the sample regression line for X based on Y are not the same line, except in the trivial case where the sample points all lie on a line.

Recall that the constant B that minimizes

MSE = [1 / (n - 1)] Σi = 1, ..., n (Yi - B)²

is the sample mean MY, and the minimum value of MSE is the sample variance SY². Thus, the difference between this value of the mean square error and the one in Exercise 17, namely

SY² RX,Y²,

is the reduction in the variability of the Y data when the linear term in X is added to the predictor. The fractional reduction is RX,Y², and hence this statistic is called the (sample) coefficient of determination.
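
The interpretation of RX,Y² as a fractional reduction in variability can also be seen numerically. In the sketch below (simulated data, assuming NumPy), np.polyfit computes the least squares slope and intercept:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    x = rng.normal(size=n)
    y = 1.5 * x + rng.normal(size=n)

    a, b = np.polyfit(x, y, deg=1)  # least squares slope and intercept

    # Variability of the Y data around its mean vs. around the regression line
    mse_constant = np.sum((y - y.mean()) ** 2) / (n - 1)   # = S_Y^2
    mse_line = np.sum((y - (a * x + b)) ** 2) / (n - 1)

    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print((mse_constant - mse_line) / mse_constant, r2)  # both equal R_{X,Y}^2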

Simulation Exercises

Simulation Exercise 21. Click in the interactive scatterplot, in various places, and watch how the regression line changes.

Simulation Exercise 22. Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the mean of the x values is 0, the standard deviation of the x values is 1, and in which the regression line has

  1. slope 1, intercept 1
  2. slope 3, intercept 0
  3. slope -2, intercept 1

Simulation Exercise 23. Click in the interactive scatterplot to define 20 points with the following properties: the mean of the x values is 1, the mean of the y values is 1, and the regression line has slope 1 and intercept 2.

If you had a difficult time with Exercise 23, it's because the conditions imposed are impossible to satisfy!

Simulation Exercise 24. Run the bivariate uniform experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.

  1. The uniform distribution on the square
  2. The uniform distribution on the triangle.
  3. The uniform distribution on the circle.

Simulation Exercise 25. Run the bivariate normal experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.

  1. sd(X) = 1, sd(Y) = 2, cor(X, Y) = 0.5
  2. sd(X) = 1.5, sd(Y) = 0.5, cor(X, Y) = -0.7

Data Analysis Exercises

Data Analysis Exercise 26. Compute the correlation between petal length and petal width for the following cases in Fisher's iris data. Comment on the differences.

  1. All cases
  2. Setosa only
  3. Virginica only
  4. Versicolor only

Data Analysis Exercise 27. Compute the correlation between each pair of color count variables in the M&M data.

Data Analysis Exercise 28. Using all cases in Fisher's iris data,

  1. Compute the least squares regression line with petal length as the predictor variable and petal width as the response variable.
  2. Draw the scatterplot and the regression line together.
  3. Predict the petal width of an iris with petal length 40.

Data Analysis Exercise 29. Using the Setosa cases only in Fisher's iris data,

  1. Compute the least squares regression line with sepal length as the predictor variable and sepal width as the response variable.
  2. Draw the scatterplot and regression line together.
  3. Predict the sepal width of an iris with sepal length 45.