
4. The Sample Variance


The Random Sample

As usual, we start with a basic random experiment that has a sample space and a probability measure P. Suppose that X is a real-valued random variable for the experiment with mean µ and standard deviation d. Additionally, let

d_k = E[(X - µ)^k]

denote the kth moment about the mean. In particular, note that d_0 = 1, d_1 = 0, and d_2 = d^2.

We repeat the basic experiment indefinitely to form a new, compound experiment, with a sequence of independent random variables, each with the same distribution as X:

X_1, X_2, ...

For each n, (X_1, X_2, ..., X_n) is a random sample of size n from the distribution of X. Recall that the sample mean

M_n = (1 / n) sum_{i=1}^{n} X_i

is a natural measure of the center of the data and a natural estimator of µ. In this section, we will derive statistics that are natural measures of the dispersion of the data and estimators of the variance d^2. The statistics that we will derive are different, depending on whether µ is known or unknown; for this reason, µ is referred to as a nuisance parameter for the problem of estimating d^2.
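For readers who want to experiment outside the applets, here is a minimal Python sketch (assuming NumPy is available; the normal distribution and the parameter values are arbitrary choices for illustration) that draws a random sample and computes the sample mean M_n as an estimate of µ:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, d = 3.0, 2.0                 # distribution mean and standard deviation (illustrative)
    n = 1000                         # sample size
    x = rng.normal(mu, d, size=n)    # the random sample X_1, ..., X_n

    m_n = x.mean()                   # sample mean M_n
    print(m_n)                       # should be close to mu = 3.0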

An Estimator of d^2 When µ is Known

First we will assume that µ is known, even though this is usually an unrealistic assumption in applications. In this case, our estimation problem is easy. Let

W_n^2 = (1 / n) sum_{i=1}^{n} (X_i - µ)^2.

Mathematical Exercise 1. Show that W_n^2 is the sample mean for a random sample of size n from the distribution of (X - µ)^2.

Mathematical Exercise 2. Use the result of Exercise 1 to show that

  a. E[W_n^2] = d^2.
  b. var[W_n^2] = (d_4 - d^4) / n.
  c. W_n^2 converges to d^2 as n tends to infinity, with probability 1.

In particular, 2(a) means that W_n^2 is an unbiased estimator of d^2.
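A hedged Monte Carlo check of Exercise 2, parts (a) and (b), assuming NumPy; the exponential distribution with rate 1 is chosen because its central moments are known (d^2 = 1, d_4 = 9):

    import numpy as np

    rng = np.random.default_rng(2)
    mu, d2, d4 = 1.0, 1.0, 9.0            # exponential(1): mean, variance, 4th central moment
    n, reps = 20, 200000                  # sample size and number of replications

    x = rng.exponential(1.0, size=(reps, n))
    w2 = ((x - mu) ** 2).mean(axis=1)     # W_n^2 for each replication (mu is known here)

    print(w2.mean(), d2)                  # E[W_n^2] should be close to d^2 = 1
    print(w2.var(), (d4 - d2 ** 2) / n)   # var[W_n^2] should be close to (d_4 - d^4)/n = 0.4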

Mathematical Exercise 3. Use basic properties of covariance to show that

cov(M_n, W_n^2) = d_3 / n.

It follows that M_n and W_n^2 are uncorrelated if d_3 = 0, and are asymptotically uncorrelated in any case.

Mathematical Exercise 4. Use Jensen's inequality to show that E(W_n) ≤ d.

Thus, W_n is a biased estimator that tends to underestimate d.

The Sample Variance

Consider now the more realistic case in which µ is unknown. In this case, a natural approach is to average, in some sense, (X_i - M_n)^2 over i = 1, 2, ..., n. It might seem that we should average by dividing by n. However, another approach is to divide by whatever constant would give us an unbiased estimator of d^2.

Mathematical Exercise 5. Use basic algebra to show that

sum_{i=1}^{n} (X_i - M_n)^2 = sum_{i=1}^{n} (X_i - µ)^2 - n(M_n - µ)^2.
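The identity can be verified numerically on any simulated sample; a minimal sketch, assuming NumPy and taking µ to be the known mean of the simulated distribution:

    import numpy as np

    rng = np.random.default_rng(3)
    mu, n = 5.0, 50
    x = rng.normal(mu, 2.0, size=n)
    m_n = x.mean()

    lhs = ((x - m_n) ** 2).sum()
    rhs = ((x - mu) ** 2).sum() - n * (m_n - mu) ** 2
    print(np.isclose(lhs, rhs))   # True, up to floating-point rounding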

Mathematical Exercise 6. Use the result in Exercise 5 and basic properties of expected value to show that

E[sum_{i=1}^{n} (X_i - M_n)^2] = (n - 1) d^2.

From Exercise 6, the random variable

S_n^2 = [1 / (n - 1)] sum_{i=1}^{n} (X_i - M_n)^2

is an unbiased estimator of d^2; it is called the sample variance. As a practical matter, when n is large, it makes little difference whether we divide by n or by n - 1. Returning to Exercise 5, note that

S_n^2 = [n / (n - 1)] W_n^2 - [n / (n - 1)] (M_n - µ)^2.
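A minimal sketch (assuming NumPy) that computes S_n^2 from the definition, compares it with NumPy's divisor-(n - 1) variance, and checks the identity above when µ is known:

    import numpy as np

    rng = np.random.default_rng(4)
    mu, n = 0.0, 30
    x = rng.normal(mu, 1.5, size=n)
    m_n = x.mean()

    s2_def = ((x - m_n) ** 2).sum() / (n - 1)   # sample variance S_n^2 from the definition
    s2_np = x.var(ddof=1)                        # same quantity via NumPy (divisor n - 1)

    w2 = ((x - mu) ** 2).mean()                  # W_n^2 (mu known)
    s2_id = n / (n - 1) * w2 - n / (n - 1) * (m_n - mu) ** 2

    print(s2_def, s2_np, s2_id)                  # all three agree up to rounding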

Mathematical Exercise 7. Use the (strong) law of large numbers to show that, with probability 1,

S_n^2 converges to d^2 as n tends to infinity.

Next we will show that S_n^2 is a multiple of the sum of all the pairwise squared differences. This in turn leads to formulas for the variance of S_n^2 and for the covariance between M_n and S_n^2.

The formula in the following exercise is sometimes better than the definition for computational purposes.

Mathematical Exercise 8. Show that

S_n^2 = [1 / (n - 1)] sum_{i=1}^{n} X_i^2 - [n / (n - 1)] M_n^2.
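A quick numerical check of this formula, assuming NumPy. (In floating-point arithmetic the two-pass definition is usually preferable, since this one-pass form can lose precision when the mean is large relative to the spread.)

    import numpy as np

    rng = np.random.default_rng(5)
    n = 25
    x = rng.normal(10.0, 3.0, size=n)
    m_n = x.mean()

    s2_def = ((x - m_n) ** 2).sum() / (n - 1)              # definition of S_n^2
    s2_alt = (x ** 2).sum() / (n - 1) - n / (n - 1) * m_n ** 2   # computational formula
    print(np.isclose(s2_def, s2_alt))                      # True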

The following sequence of exercises will allow us to compute the variance of S_n^2.

Mathematical Exercise 9. Show that

S_n^2 = {1 / [2n(n - 1)]} sum_{(i, j)} (X_i - X_j)^2,

where the sum is over all ordered pairs (i, j) with i, j in {1, 2, ..., n} (the terms with i = j are 0). Hint: Start with the expression on the right. In the term (X_i - X_j)^2, add and subtract M_n. Expand and take the sums term by term.
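The pairwise-difference form can also be checked numerically on a small sample (a sketch, assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 15
    x = rng.exponential(2.0, size=n)

    s2 = x.var(ddof=1)                              # sample variance S_n^2
    diffs = x[:, None] - x[None, :]                 # all ordered pairs X_i - X_j
    s2_pairs = (diffs ** 2).sum() / (2 * n * (n - 1))
    print(np.isclose(s2, s2_pairs))                 # True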

Mathematical Exercise 10. If i and j are distinct, show that

E[(X_i - X_j)^m] = sum_{k=0}^{m} C(m, k) (-1)^{m-k} d_k d_{m-k}.

Hint: In E[(Xi - Xj)m], add and subtract µ, and then use the binomial theorem and independence.
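For example, for the exponential distribution with rate 1 (central moments d_1 = 0, d_2 = 1, d_3 = 2, d_4 = 9), the formula with m = 4 gives E[(X_i - X_j)^4] = 9 + 6 + 9 = 24, which can be compared with a Monte Carlo estimate (a sketch assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(7)
    reps = 500000
    xi = rng.exponential(1.0, size=reps)   # independent copies X_i
    xj = rng.exponential(1.0, size=reps)   # independent copies X_j

    print(((xi - xj) ** 4).mean())         # should be close to 24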

Mathematical Exercise 11. Show that var(S_n^2) = (1 / n)[d_4 - (n - 3) d^4 / (n - 1)] by the following steps (a numerical check is sketched after the list):

  a. Use Exercises 8 and 9, and the fact that the variance of a sum is the sum of all the pairwise covariances.
  b. Show that cov[(X_i - X_j)^2, (X_k - X_l)^2] = 0 if i = j or k = l or if i, j, k, l are distinct.
  c. Show that cov[(X_i - X_j)^2, (X_i - X_j)^2] = 2 d_4 + 2 d^4 if i and j are distinct, and that there are 2n(n - 1) such terms in the sum of covariances in (a), counting both orders (i, j) and (j, i).
  d. Show that cov[(X_i - X_j)^2, (X_k - X_j)^2] = d_4 - d^4 if i, j, k are distinct, and that there are 4n(n - 1)(n - 2) such terms in the sum of covariances in (a).
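A hedged Monte Carlo check of the closed form (assuming NumPy), again using the exponential distribution with rate 1, for which d^2 = 1 and d_4 = 9:

    import numpy as np

    rng = np.random.default_rng(8)
    d2, d4 = 1.0, 9.0                                   # variance and 4th central moment of exponential(1)
    n, reps = 20, 200000

    x = rng.exponential(1.0, size=(reps, n))
    s2 = x.var(axis=1, ddof=1)                          # S_n^2 for each replication

    print(s2.var())                                     # simulated var(S_n^2)
    print((d4 - (n - 3) * d2 ** 2 / (n - 1)) / n)       # closed form, about 0.405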

Mathematical Exercise 12. Show that var(S_n^2) > var(W_n^2). Does this seem reasonable?

Mathematical Exercise 13. Show that var(S_n^2) decreases to 0 as n increases to infinity.

Mathematical Exercise 14. Use techniques similar to Exercise 11 to show that

cov(M_n, S_n^2) = d_3 / n.

In particular, note that cov(M_n, S_n^2) = cov(M_n, W_n^2). Again, the sample mean and the sample variance are uncorrelated if d_3 = 0, and asymptotically uncorrelated in any case.
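Similarly, the covariance formula can be checked by simulation (a sketch assuming NumPy); for the exponential distribution with rate 1, d_3 = 2, so cov(M_n, S_n^2) should be about 2 / n:

    import numpy as np

    rng = np.random.default_rng(9)
    d3 = 2.0                         # third central moment of exponential(1)
    n, reps = 20, 200000

    x = rng.exponential(1.0, size=(reps, n))
    m = x.mean(axis=1)               # M_n for each replication
    s2 = x.var(axis=1, ddof=1)       # S_n^2 for each replication

    print(np.cov(m, s2)[0, 1])       # simulated cov(M_n, S_n^2)
    print(d3 / n)                    # should be about 0.1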

The square root of the sample variance is the sample standard deviation, denoted S_n.

Mathematical Exercise 15. Use Jensen's inequality to show that E(S_n) ≤ d.

Thus, S_n is a biased estimator that tends to underestimate d.
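The downward bias of S_n is easy to see in a small simulation (a sketch assuming NumPy; the normal distribution and a small sample size are chosen to make the bias visible):

    import numpy as np

    rng = np.random.default_rng(10)
    d = 2.0                              # true standard deviation
    n, reps = 5, 200000

    x = rng.normal(0.0, d, size=(reps, n))
    s = x.std(axis=1, ddof=1)            # sample standard deviation S_n for each replication

    print(s.mean(), d)                   # E(S_n) is noticeably below d for small n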

Simulations

Many of the applets in this project are simulations of experiments with a basic random variable of interest. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the standard deviation of the distribution, d, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you run the simulation, the sample standard deviation S_n is also displayed numerically in the table and graphically as the radius of the red horizontal bar in the graph box.

Simulation Exercise 16. In the binomial coin experiment, the random variable is the number of heads. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample standard deviation to the distribution standard deviation.

Simulation Exercise 17. In the simulation of the matching experiment, the random variable is the number of matches. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample standard deviation to the distribution standard deviation.

Simulation Exercise 18. Run the simulation of the exponential experiment 1000 times with an update frequency of 10. Note the apparent convergence of the sample standard deviation to the distribution standard deviation.

Exploratory Data Analysis

The sample mean and standard deviation are often computed in exploratory data analysis, as measures of the center and spread of the data, respectively.

Data Analysis Exercise 19. Compute the sample mean and standard deviation for Michelson's velocity of light data.

Data Analysis Exercise 20. Compute the sample mean and standard deviation for Cavendish's density of the earth data.

Data Analysis Exercise 21. Compute the sample mean and standard deviation of the net weight in the M&M data.

Data Analysis Exercise 22. Compute the sample mean and standard deviation of the petal length variable for the following cases in Fisher's iris data. Compare the results.

  1. All cases
  2. Setosa only
  3. Versicolor only
  4. Virginica only

Suppose that instead of the actual data, we have a frequency distribution with classes A_1, A_2, ..., A_k, class marks x_1, x_2, ..., x_k, and frequencies n_1, n_2, ..., n_k. Thus,

n_j = #{i in {1, 2, ..., n}: X_i in A_j}.

In this case, approximate values of the sample mean and variance are

M_n ≈ (1 / n) sum_{j=1}^{k} n_j x_j,    S_n^2 ≈ [1 / (n - 1)] sum_{j=1}^{k} n_j (x_j - M_n)^2.

These approximations are based on the hope that the data values in each class are well represented by the class mark.
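A minimal sketch of these grouped-data approximations (assuming NumPy; the class marks and frequencies below are hypothetical):

    import numpy as np

    # hypothetical frequency distribution: class marks and frequencies
    marks = np.array([1.05, 1.15, 1.25, 1.35, 1.45])
    freqs = np.array([3, 8, 12, 9, 4])
    n = freqs.sum()

    mean_approx = (freqs * marks).sum() / n                          # approximate M_n
    var_approx = (freqs * (marks - mean_approx) ** 2).sum() / (n - 1)   # approximate S_n^2

    print(mean_approx, var_approx, np.sqrt(var_approx))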

Simulation Exercise 23. In the interactive histogram, select mean and standard deviation. Set the class width to 0.1 and construct a frequency distribution with at least 6 nonempty classes and at least 10 values. Compute the mean, variance, and standard deviation by hand, and verify that you get the same results as the applet.

Simulation Exercise 24. In the interactive histogram, select mean and standard deviation. Set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the position and size of the mean ± standard deviation bar.

  1. A uniform distribution.
  2. A symmetric, unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A u-shaped distribution.

Simulation Exercise 25. In the interactive histogram, construct a distribution that has the largest possible standard deviation.

Mathematical Exercise 26. Based on your answer to Exercise 25, characterize the distributions (on a fixed interval [a, b]) that have the largest possible standard deviation.