
4. The Sample Variance


The Random Sample

As usual, we start with a basic random experiment that has a sample space and a probability measure P. Suppose that X is a real-valued random variable for the experiment with mean µ and standard deviation d. Additionally, let

d_k = E[(X - µ)^k]

denote the kth moment about the mean. In particular, note that d_0 = 1, d_1 = 0, and d_2 = d^2.

We repeat the basic experiment indefinitely to form a new, compound experiment, with a sequence of independent random variables, each with the same distribution as X:

X_1, X_2, ...

For each n, (X_1, X_2, ..., X_n) is a random sample of size n from the distribution of X. Recall that the sample mean

M_n = (1 / n) sum_{i=1}^{n} X_i

is a natural measure of the center of the data and a natural estimator of µ. In this section, we will derive statistics that are natural measures of the dispersion of the data and estimators of the variance d^2. The statistics that we will derive are different, depending on whether µ is known or unknown; for this reason, µ is referred to as a nuisance parameter for the problem of estimating d^2.
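For readers who want to experiment outside the applets, here is a minimal Python sketch (assuming NumPy is available; the normal distribution and the parameter values are arbitrary choices for illustration) that draws a random sample and computes the sample mean M_n as an estimate of µ:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, d = 3.0, 2.0                 # distribution mean and standard deviation (illustrative)
    n = 1000                         # sample size
    x = rng.normal(mu, d, size=n)    # the random sample X_1, ..., X_n

    m_n = x.mean()                   # sample mean M_n
    print(m_n)                       # should be close to mu = 3.0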

An Estimator of d^2 When µ is Known

First we will assume that µ is known, even though this is usually an unrealistic assumption in applications. In this case, our estimation problem is easy. Let

W_n^2 = (1 / n) sum_{i=1}^{n} (X_i - µ)^2.

Mathematical Exercise 1. Show that W_n^2 is the sample mean for a random sample of size n from the distribution of (X - µ)^2.

Mathematical Exercise 2. Use the result of Exercise 1 to show that

  a. E[W_n^2] = d^2.
  b. var[W_n^2] = (d_4 - d^4) / n.
  c. W_n^2 converges to d^2 as n tends to infinity, with probability 1.

In particular, 2(a) means that W_n^2 is an unbiased estimator of d^2.
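A hedged Monte Carlo check of Exercise 2, parts (a) and (b), assuming NumPy; the exponential distribution with rate 1 is chosen because its central moments are known (d^2 = 1, d_4 = 9):

    import numpy as np

    rng = np.random.default_rng(2)
    mu, d2, d4 = 1.0, 1.0, 9.0            # exponential(1): mean, variance, 4th central moment
    n, reps = 20, 200000                  # sample size and number of replications

    x = rng.exponential(1.0, size=(reps, n))
    w2 = ((x - mu) ** 2).mean(axis=1)     # W_n^2 for each replication (mu is known here)

    print(w2.mean(), d2)                  # E[W_n^2] should be close to d^2 = 1
    print(w2.var(), (d4 - d2 ** 2) / n)   # var[W_n^2] should be close to (d_4 - d^4)/n = 0.4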

Mathematical Exercise 3. Use basic properties of covariance to show that

cov(M_n, W_n^2) = d_3 / n.

It follows that M_n and W_n^2 are uncorrelated if d_3 = 0, and are asymptotically uncorrelated in any case.

Mathematical Exercise 4. Use Jensen's inequality to show that E(W_n) ≤ d.

Thus, W_n is a biased estimator that tends to underestimate d.

The Sample Variance

Consider now the more realistic case in which µ is unknown. In this case, a natural approach is to average, in some sense, (X_i - M_n)^2 over i = 1, 2, ..., n. It might seem that we should average by dividing by n. However, another approach is to divide by whatever constant would give us an unbiased estimator of d^2.

Mathematical Exercise 5. Use basic algebra to show that

sum_{i=1}^{n} (X_i - M_n)^2 = sum_{i=1}^{n} (X_i - µ)^2 - n(M_n - µ)^2.
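The identity can be verified numerically on any simulated sample; a minimal sketch, assuming NumPy and taking µ to be the known mean of the simulated distribution:

    import numpy as np

    rng = np.random.default_rng(3)
    mu, n = 5.0, 50
    x = rng.normal(mu, 2.0, size=n)
    m_n = x.mean()

    lhs = ((x - m_n) ** 2).sum()
    rhs = ((x - mu) ** 2).sum() - n * (m_n - mu) ** 2
    print(np.isclose(lhs, rhs))   # True, up to floating-point rounding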

Mathematical Exercise 6. Use the result in Exercise 5 and basic properties of expected value to show that

E[sum_{i=1}^{n} (X_i - M_n)^2] = (n - 1) d^2.

From Exercise 6, the random variable

S_n^2 = [1 / (n - 1)] sum_{i=1}^{n} (X_i - M_n)^2

is an unbiased estimator of d^2; it is called the sample variance. As a practical matter, when n is large, it makes little difference whether we divide by n or by n - 1. Returning to Exercise 5, note that

S_n^2 = [n / (n - 1)] W_n^2 - [n / (n - 1)] (M_n - µ)^2.
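A minimal sketch (assuming NumPy) that computes S_n^2 from the definition, compares it with NumPy's divisor-(n - 1) variance, and checks the identity above when µ is known:

    import numpy as np

    rng = np.random.default_rng(4)
    mu, n = 0.0, 30
    x = rng.normal(mu, 1.5, size=n)
    m_n = x.mean()

    s2_def = ((x - m_n) ** 2).sum() / (n - 1)   # sample variance S_n^2 from the definition
    s2_np = x.var(ddof=1)                        # same quantity via NumPy (divisor n - 1)

    w2 = ((x - mu) ** 2).mean()                  # W_n^2 (mu known)
    s2_id = n / (n - 1) * w2 - n / (n - 1) * (m_n - mu) ** 2

    print(s2_def, s2_np, s2_id)                  # all three agree up to rounding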

Mathematical Exercise 7. Use the (strong) law of large numbers to show that, with probability 1,

S_n^2 converges to d^2 as n tends to infinity.

Next we will show that S_n^2 is a multiple of the sum of all the pairwise squared differences. This in turn leads to formulas for the variance of S_n^2 and for the covariance between M_n and S_n^2.

The formula in the following exercise is sometimes better than the definition for computational purposes.

Mathematical Exercise 8. Show that

S_n^2 = [1 / (n - 1)] sum_{i=1}^{n} X_i^2 - [n / (n - 1)] M_n^2.
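A quick numerical check of this formula, assuming NumPy. (In floating-point arithmetic the two-pass definition is usually preferable, since this one-pass form can lose precision when the mean is large relative to the spread.)

    import numpy as np

    rng = np.random.default_rng(5)
    n = 25
    x = rng.normal(10.0, 3.0, size=n)
    m_n = x.mean()

    s2_def = ((x - m_n) ** 2).sum() / (n - 1)              # definition of S_n^2
    s2_alt = (x ** 2).sum() / (n - 1) - n / (n - 1) * m_n ** 2   # computational formula
    print(np.isclose(s2_def, s2_alt))                      # True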

The following sequence of exercises will allow us to compute the variance of S_n^2.

Mathematical Exercise 9. Show that

S_n^2 = {1 / [2n(n - 1)]} sum_{(i, j)} (X_i - X_j)^2,

where the sum is over all ordered pairs (i, j) with i, j in {1, 2, ..., n} (the terms with i = j are 0). Hint: Start with the expression on the right. In the term (X_i - X_j)^2, add and subtract M_n. Expand and take the sums term by term.
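The pairwise-difference form can also be checked numerically on a small sample (a sketch, assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 15
    x = rng.exponential(2.0, size=n)

    s2 = x.var(ddof=1)                              # sample variance S_n^2
    diffs = x[:, None] - x[None, :]                 # all ordered pairs X_i - X_j
    s2_pairs = (diffs ** 2).sum() / (2 * n * (n - 1))
    print(np.isclose(s2, s2_pairs))                 # True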

Mathematical Exercise 10. If i and j are distinct, show that

E[(X_i - X_j)^m] = sum_{k=0}^{m} C(m, k) (-1)^{m-k} d_k d_{m-k}.

Hint: In E[(Xi - Xj)m], add and subtract µ, and then use the binomial theorem and independence.
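For example, for the exponential distribution with rate 1 (central moments d_1 = 0, d_2 = 1, d_3 = 2, d_4 = 9), the formula with m = 4 gives E[(X_i - X_j)^4] = 9 + 6 + 9 = 24, which can be compared with a Monte Carlo estimate (a sketch assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(7)
    reps = 500000
    xi = rng.exponential(1.0, size=reps)   # independent copies X_i
    xj = rng.exponential(1.0, size=reps)   # independent copies X_j

    print(((xi - xj) ** 4).mean())         # should be close to 24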

Mathematical Exercise 11. Show that var(S_n^2) = (1 / n)[d_4 - (n - 3) d^4 / (n - 1)] by the following steps (a numerical check is sketched after the list):

  a. Use Exercises 8 and 9, and the fact that the variance of a sum is the sum of all the pairwise covariances.
  b. Show that cov[(X_i - X_j)^2, (X_k - X_l)^2] = 0 if i = j or k = l or if i, j, k, l are distinct.
  c. Show that cov[(X_i - X_j)^2, (X_i - X_j)^2] = 2 d_4 + 2 d^4 if i and j are distinct, and that there are 2n(n - 1) such terms in the sum of covariances in (a), counting both orders (i, j) and (j, i).
  d. Show that cov[(X_i - X_j)^2, (X_k - X_j)^2] = d_4 - d^4 if i, j, k are distinct, and that there are 4n(n - 1)(n - 2) such terms in the sum of covariances in (a).
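A hedged Monte Carlo check of the closed form (assuming NumPy), again using the exponential distribution with rate 1, for which d^2 = 1 and d_4 = 9:

    import numpy as np

    rng = np.random.default_rng(8)
    d2, d4 = 1.0, 9.0                                   # variance and 4th central moment of exponential(1)
    n, reps = 20, 200000

    x = rng.exponential(1.0, size=(reps, n))
    s2 = x.var(axis=1, ddof=1)                          # S_n^2 for each replication

    print(s2.var())                                     # simulated var(S_n^2)
    print((d4 - (n - 3) * d2 ** 2 / (n - 1)) / n)       # closed form, about 0.405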

Mathematical Exercise 12. Show that var(S_n^2) > var(W_n^2). Does this seem reasonable?

Mathematical Exercise 13. Show that var(S_n^2) decreases to 0 as n increases to infinity.

Mathematical Exercise 14. Use techniques similar to Exercise 11 to show that

cov(M_n, S_n^2) = d_3 / n.

In particular, note that cov(M_n, S_n^2) = cov(M_n, W_n^2). Again, the sample mean and the sample variance are uncorrelated if d_3 = 0, and asymptotically uncorrelated in any case.
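Similarly, the covariance formula can be checked by simulation (a sketch assuming NumPy); for the exponential distribution with rate 1, d_3 = 2, so cov(M_n, S_n^2) should be about 2 / n:

    import numpy as np

    rng = np.random.default_rng(9)
    d3 = 2.0                         # third central moment of exponential(1)
    n, reps = 20, 200000

    x = rng.exponential(1.0, size=(reps, n))
    m = x.mean(axis=1)               # M_n for each replication
    s2 = x.var(axis=1, ddof=1)       # S_n^2 for each replication

    print(np.cov(m, s2)[0, 1])       # simulated cov(M_n, S_n^2)
    print(d3 / n)                    # should be about 0.1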

The square root of the sample variance is the sample standard deviation, denoted S_n.

Mathematical Exercise 15. Use Jensen's inequality to show that E(S_n) ≤ d.

Thus, S_n is a biased estimator that tends to underestimate d.
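The downward bias of S_n is easy to see in a small simulation (a sketch assuming NumPy; the normal distribution and a small sample size are chosen to make the bias visible):

    import numpy as np

    rng = np.random.default_rng(10)
    d = 2.0                              # true standard deviation
    n, reps = 5, 200000

    x = rng.normal(0.0, d, size=(reps, n))
    s = x.std(axis=1, ddof=1)            # sample standard deviation S_n for each replication

    print(s.mean(), d)                   # E(S_n) is noticeably below d for small n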

Simulations

Many of the applets in this project are simulations of experiments with a basic random variable of interest. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the standard deviation of the distribution, d, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you run the simulation, the sample standard deviation S_n is also displayed numerically in the table and graphically as the radius of the red horizontal bar in the graph box.

Simulation Exercise 16. In the binomial coin experiment, the random variable is the number of heads. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample standard deviation to the distribution standard deviation.

Simulation Exercise 17. In the simulation of the matching experiment, the random variable is the number of matches. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample standard deviation to the distribution standard deviation.

Simulation Exercise 18. Run the simulation of the exponential experiment 1000 times with an update frequency of 10. Note the apparent convergence of the sample standard deviation to the distribution standard deviation.

Exploratory Data Analysis

The sample mean and standard deviation are often computed in exploratory data analysis, as measures of the center and spread of the data, respectively.

Data Analysis Exercise 19. Compute the sample mean and standard deviation for Michelson's velocity of light data.

Data Analysis Exercise 20. Compute the sample mean and standard deviation for Cavendish's density of the earth data.

Data Analysis Exercise 21. Compute the sample mean and standard deviation of the net weight in the M&M data.

Data Analysis Exercise 22. Compute the sample mean and standard deviation of the petal length variable for the following cases in Fisher's iris data. Compare the results.

  1. All cases
  2. Setosa only
  3. Versicolor only
  4. Virginica only

Suppose that instead of the actual data, we have a frequency distribution with classes A_1, A_2, ..., A_k, class marks x_1, x_2, ..., x_k, and frequencies n_1, n_2, ..., n_k. Thus,

n_j = #{i in {1, 2, ..., n}: X_i in A_j}.

In this case, approximate values of the sample mean and variance are

M_n ≈ (1 / n) sum_{j=1}^{k} n_j x_j,    S_n^2 ≈ [1 / (n - 1)] sum_{j=1}^{k} n_j (x_j - M_n)^2.

These approximations are based on the hope that the data values in each class are well represented by the class mark.
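A minimal sketch of these grouped-data approximations (assuming NumPy; the class marks and frequencies below are hypothetical):

    import numpy as np

    # hypothetical frequency distribution: class marks and frequencies
    marks = np.array([1.05, 1.15, 1.25, 1.35, 1.45])
    freqs = np.array([3, 8, 12, 9, 4])
    n = freqs.sum()

    mean_approx = (freqs * marks).sum() / n                          # approximate M_n
    var_approx = (freqs * (marks - mean_approx) ** 2).sum() / (n - 1)   # approximate S_n^2

    print(mean_approx, var_approx, np.sqrt(var_approx))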

Simulation Exercise 23. In the interactive histogram, select mean and standard deviation. Set the class width to 0.1 and construct a frequency distribution with at least 6 nonempty classes and at least 10 values. Compute the mean, variance, and standard deviation by hand, and verify that you get the same results as the applet.

Simulation Exercise 24. In the interactive histogram, select mean and standard deviation. Set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the position and size of the mean ± standard deviation bar.

  1. A uniform distribution.
  2. A symmetric, unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A u-shaped distribution.

Simulation Exercise 25. In the interactive histogram, construct a distribution that has the largest possible standard deviation.

Mathematical Exercise 26. Based on your answer to Exercise 25, characterize the distributions (on a fixed interval [a, b]) that have the largest possible standard deviation.