9. Sample Covariance and Correlation


The Bivariate Model

As usual, we start with a basic random experiment that has a sample space and a probability measure. Suppose that X and Y are real-valued random variables for the experiment. We will denote the means, variances, and covariance as follows:

µX = E(X), µY = E(Y); σX² = var(X), σY² = var(Y); σX,Y = cov(X, Y).

Finally, recall that the correlation is ρX,Y = cor(X, Y) = σX,Y / (σX σY).

Now suppose that we repeat the experiment n times to get n independent random vectors, each with the same distribution as (X, Y). That is, we get a random sample of size n from this distribution:

(X1, Y1), (X2, Y2), ..., (Xn, Yn).

As in previous sections, we will use subscripts to distinguish the sample means MX, MY and the sample variances SX², SY² of the X variables and the Y variables. These sample statistics depend on the sample size n, of course, but for simplicity we will suppress this dependence in the notation.

In this section, we will define and study statistics that are natural estimators of the distribution covariance and correlation. These statistics will be measures of the linear relationship of the sample points in the plane. As usual, the definitions depend on what other parameters are known and unknown.

An Estimator of the Covariance When µX, µY Are Known

Suppose first that the means µX and µY are known. This is usually an unrealistic assumption, of course, but it is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of σX,Y in this case is

WX,Y = (1 / n) Σi = 1, ..., n (Xi - µX)(Yi - µY).
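
For concreteness, here is a minimal numerical sketch of WX,Y in Python with NumPy; the library, the bivariate normal distribution, and the parameter values are assumptions made for illustration, not part of the model above.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1000

    # Simulate a sample from a bivariate normal distribution with known means
    # mu_X = mu_Y = 0 and covariance sigma_{X,Y} = 0.5 (hypothetical parameters)
    mean = [0.0, 0.0]
    cov = [[1.0, 0.5], [0.5, 1.0]]
    x, y = rng.multivariate_normal(mean, cov, size=n).T

    # W_{X,Y}: the average of (X_i - mu_X)(Y_i - mu_Y), using the known means
    w = np.mean((x - mean[0]) * (y - mean[1]))
    print(w)  # close to the true covariance, 0.5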

Mathematical Exercise 1. Show that WX,Y is the sample mean for a random sample of size n from the distribution of (X - µX)(Y - µY).

Mathematical Exercise 2. Use the result of Exercise 1 to show that

  1. E(WX,Y) = σX,Y.
  2. WX,Y converges to σX,Y as n converges to infinity with probability 1.

In particular, WX,Y is an unbiased estimator of σX,Y.

The Sample Covariance

Consider now the more realistic assumption that the means µX, µY are unknown. A natural approach in this case is to average

(Xi - MX)(Yi - MY)

over i = 1, 2, ..., n. But rather than dividing by n in our average, we should divide by whatever constant gives an unbiased estimator of σX,Y.

Mathematical Exercise 3. Interpret the sign of (Xi - MX)(Yi - MY) geometrically, in terms of the scatterplot of the points and its center (MX, MY).

Mathematical Exercise 4. Show that cov(MX, MY) = σX,Y / n.
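
The result in Exercise 4 can also be checked by simulation. The sketch below (again assuming NumPy, with hypothetical parameters) estimates cov(MX, MY) from many independent replications of the sample:

    import numpy as np

    rng = np.random.default_rng(7)
    n, reps = 25, 100000
    cov = [[1.0, 0.5], [0.5, 1.0]]  # true covariance sigma_{X,Y} = 0.5

    # Draw `reps` independent samples of size n; shape (reps, n, 2)
    samples = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
    mx = samples[..., 0].mean(axis=1)  # sample mean of the X's in each sample
    my = samples[..., 1].mean(axis=1)  # sample mean of the Y's in each sample

    print(np.cov(mx, my)[0, 1])  # approximately 0.5 / 25 = 0.02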

Mathematical Exercise 5. Show that

Σi = 1, ..., n (Xi - MX)(Yi - MY) = n [WX,Y - (MX - µX)(MY - µY)].

Mathematical Exercise 6. Use the result of Exercise 5 and basic properties of expected value to show that

E[Σi = 1, ..., n (Xi - MX)(Yi - MY)] = (n - 1) σX,Y.

Therefore, to have an unbiased estimator of σX,Y, we should define the sample covariance to be the random variable

SX,Y = [1 / (n - 1)] Σi = 1, ..., n (Xi - MX)(Yi - MY).

As with the sample variance, when n is large, it makes little difference whether we divide by n or n - 1.
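
As a sanity check, the definition of SX,Y agrees with NumPy's np.cov, whose default divisor is n - 1. A sketch with simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 20
    x = rng.normal(size=n)
    y = 0.7 * x + rng.normal(size=n)

    # Sample covariance from the definition, dividing by n - 1
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

    print(s_xy, np.cov(x, y)[0, 1])  # the two values agree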

Properties

The formula in the following exercise is sometimes better than the definition for computational purposes.

Mathematical Exercise 7. Show that

SX,Y = [1 / (n - 1)] Σi = 1, ..., n XiYi - [n / (n - 1)] MX MY.
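
A quick numerical check of this identity (a sketch with simulated data, assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 15
    x, y = rng.normal(size=(2, n))

    # The definition of S_{X,Y} and the computational shortcut above
    definition = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    shortcut = np.sum(x * y) / (n - 1) - n / (n - 1) * x.mean() * y.mean()

    print(definition, shortcut)  # equal up to floating-point rounding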

Mathematical Exercise 8. Use the result of Exercise 5 and the strong law of large numbers to show that

SX,Y converges to σX,Y as n converges to infinity with probability 1.

The properties established in the following exercises are analogues of properties for the distribution covariance.

Mathematical Exercise 9. Show that SX,X = SX².

Mathematical Exercise 10. Show that SX,Y = SY,X.

Mathematical Exercise 11. Show that if a is a constant then SaX,Y = a SX,Y.

Mathematical Exercise 12. Suppose that we have a random sample of size n from the distribution of (X, Y, Z). Show that

SX,Y+Z = SX,Y + SX,Z.

Sample Correlation

By analogy with the distribution correlation, the sample correlation is obtained by dividing the sample covariance by the product of the sample standard deviations:

RX,Y = SX,Y / (SX SY).
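
The following sketch computes RX,Y from the definition and compares it with NumPy's np.corrcoef (the data are simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    x = rng.normal(size=n)
    y = -0.6 * x + rng.normal(size=n)

    # R_{X,Y} = S_{X,Y} / (S_X S_Y); ddof=1 gives the sample standard deviation
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    r = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

    print(r, np.corrcoef(x, y)[0, 1])  # the two values agree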

Mathematical Exercise 13. Use the strong law of large numbers to show that

RX,Y converges to ρX,Y as n converges to infinity with probability 1.

Simulation Exercise 14. Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: sample means 0, sample standard deviations 1, sample correlation as follows: 0, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9.

Simulation Exercise 15. Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: X sample mean 1, Y sample mean 3, X sample standard deviation 2, Y sample standard deviation 1, sample correlation as follows: 0, 0.5, -0.5, 0.7, -0.7, 0.9, -0.9.

The Best Linear Predictor

Recall that in the section on (distribution) correlation and regression, we showed that the best linear predictor of Y based on X, in the sense of minimizing mean square error, is

aX + b where a = σX,Y / σX² and b = µY - a µX.

Moreover, the (minimum) value of the mean square error, with this choice of a and b, is

E{[Y - (aX + b)]²} = σY² (1 - ρX,Y²).

Of course, in real applications, we are unlikely to know the distribution parameters that go into the definition of a and b above. Thus, in this section, we are interested in the problem of estimating the best linear predictor of Y based on X from our sample data.

(X1, Y1), (X2, Y2), ..., (Xn, Yn).

One natural approach is to find the line

y = Ax + B

that fits the sample points best. This is a basic and important problem in many areas of mathematics, not just statistics. The term best means that we want to find the line (that is, find A and B) that minimizes the average of the squared errors between the actual y values in our data and the predicted y values:

MSE = [1 / (n - 1)] Σi = 1, ..., n [Yi - (AXi + B)]².

Finding A and B that minimize MSE is a standard problem in calculus.

Mathematical Exercise 16. Show that MSE is minimized when

  1. A = SX,Y / SX².
  2. B = MY - AMX.

Mathematical Exercise 17. Show that the minimum value of MSE, with A and B given in Exercise 16, is

MSE = SY² (1 - RX,Y²).
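
The formulas in Exercises 16 and 17 can be verified numerically. The sketch below (simulated data, assuming NumPy) computes A, B, and the minimum mean square error both directly and via the formula above:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100
    x = rng.normal(size=n)
    y = 2.0 * x + 1.0 + rng.normal(size=n)

    # Sample covariance, variances, and correlation
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    s_x2 = np.var(x, ddof=1)   # S_X^2
    s_y2 = np.var(y, ddof=1)   # S_Y^2
    r = s_xy / np.sqrt(s_x2 * s_y2)

    # Regression coefficients from Exercise 16
    a = s_xy / s_x2
    b = y.mean() - a * x.mean()

    # Minimum MSE computed directly and via the formula in Exercise 17
    mse = np.sum((y - (a * x + b)) ** 2) / (n - 1)
    print(mse, s_y2 * (1 - r ** 2))  # the two values agree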

Mathematical Exercise 18. Use the result of Exercise 17 to show that

  1. RX,Y in [-1, 1].
  2. RX,Y = -1 if and only if the sample points lie on a line with negative slope.
  3. RX,Y = 1 if and only if the sample points lie on a line with positive slope.

Thus, the sample correlation measures the degree of linearity of the sample points. The results in Exercise 18 can also be obtained by noting that the sample correlation is simply the correlation of the empirical distribution. Of course, properties (1), (2), and (3) are known to hold for the distribution correlation.

The fact that the results in Exercises 17 and 18 are the sample analogues of the corresponding distribution results is beautiful and reassuring. The line y = Ax + B, where A and B are given in Exercise 16, is known as the (sample) regression line for Y based on X. Note from Exercise 16 (2) that the sample regression line passes through (MX, MY), the center of the empirical distribution. Naturally, A and B can be viewed as estimators of a and b, respectively.

Mathematical Exercise 19. Use the law of large numbers to show that, with probability 1, A converges to a and B converges to b as n increases to infinity.

As with the distribution regression lines, the choice of predictor and response variables is important.

Mathematical Exercise 20. Show that the sample regression line for Y based on X and the sample regression line for X based on Y are not the same line, except in the trivial case where the sample points all lie on a line.

Recall that the constant B that minimizes

MSE = [1 / (n - 1)] Σi = 1, ..., n (Yi - B)²

is the sample mean MY, and the minimum value of MSE is the sample variance SY². Thus, the difference between this value of the mean square error and the one in Exercise 17, namely

SY² RX,Y²,

is the reduction in the variability of the Y data when the linear term in X is added to the predictor. The fractional reduction is RX,Y², and hence this statistic is called the (sample) coefficient of determination.
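
The interpretation of RX,Y² as a fractional reduction in variability can also be seen numerically. In the sketch below (simulated data, assuming NumPy), np.polyfit computes the least squares slope and intercept:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    x = rng.normal(size=n)
    y = 1.5 * x + rng.normal(size=n)

    a, b = np.polyfit(x, y, deg=1)  # least squares slope and intercept

    # Variability of the Y data around its mean vs. around the regression line
    mse_constant = np.sum((y - y.mean()) ** 2) / (n - 1)   # = S_Y^2
    mse_line = np.sum((y - (a * x + b)) ** 2) / (n - 1)

    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print((mse_constant - mse_line) / mse_constant, r2)  # both equal R_{X,Y}^2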

Simulation Exercises

Simulation Exercise 21. Click in the interactive scatterplot, in various places, and watch how the regression line changes.

Simulation Exercise 22. Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the mean of the x values is 0, the standard deviation of the x values is 1, and in which the regression line has

  1. slope 1, intercept 1
  2. slope 3, intercept 0
  3. slope -2, intercept 1

Simulation Exercise 23. Click in the interactive scatterplot to define 20 points with the following properties: the mean of the x values is 1, the mean of the y values is 1, and the regression line has slope 1 and intercept 2.

If you had a difficult time with Exercise 23, it's because the conditions imposed are impossible to satisfy!

Simulation Exercise 24. Run the bivariate uniform experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.

  1. The uniform distribution on the square
  2. The uniform distribution on the triangle.
  3. The uniform distribution on the circle.

Simulation Exercise 25. Run the bivariate normal experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.

  1. sd(X) = 1, sd(Y) = 2, cor(X, Y) = 0.5
  2. sd(X) = 1.5, sd(Y) = 0.5, cor(X, Y) = -0.7

Data Analysis Exercises

Data Analysis Exercise 26. Compute the correlation between petal length and petal width for the following cases in Fisher's iris data. Comment on the differences.

  1. All cases
  2. Setosa only
  3. Virginica only
  4. Versicolor only

Data Analysis Exercise 27. Compute the correlation between each pair of color count variables in the M&M data.

Data Analysis Exercise 28. Using all cases in Fisher's iris data,

  1. Compute the least squares regression line with petal length as the predictor variable and petal width as the response variable.
  2. Draw the scatterplot and the regression line together.
  3. Predict the petal width of an iris with petal length 40.

Data Analysis Exercise 29. Using the Setosa cases only in Fisher's iris data,

  1. Compute the least squares regression line with sepal length as the predictor variable and sepal width as the response variable.
  2. Draw the scatterplot and regression line together.
  3. Predict the sepal width of an iris with sepal length 45.