As usual, we start with a basic random experiment that has a sample space and a probability measure P. Suppose that X is a real-valued random variable for the experiment with mean µ and standard deviation σ. Additionally, let

σ_k = E[(X − µ)^k]

denote the k-th moment about the mean. In particular, note that σ_0 = 1, σ_1 = 0, σ_2 = σ².
We repeat the basic experiment indefinitely to form a new, compound experiment, with a sequence of independent random variables, each with the same distribution as X:
X_1, X_2, ...
For each n, (X_1, X_2, ..., X_n) is a random sample of size n from the distribution of X. Recall that the sample mean

M_n = (1/n) ∑_{i=1}^{n} X_i

is a natural measure of the center of the data and a natural estimator of µ. In this section, we will derive statistics that are natural measures of the dispersion of the data and estimators of the variance σ². The statistics that we will derive are different, depending on whether µ is known or unknown; for this reason, µ is referred to as a nuisance parameter for the problem of estimating σ².
First we will assume that µ is known, even though this is usually an unrealistic assumption in applications. In this case, our estimation problem is easy. Let
W_n² = (1/n) ∑_{i=1}^{n} (X_i − µ)².
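As a concrete illustration, here is a minimal Python/NumPy sketch of computing W_n² when µ is known (the exponential model and all constants below are illustrative assumptions, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Exponential distribution with rate 1: known mean mu = 1, variance sigma^2 = 1.
mu = 1.0
x = rng.exponential(scale=1.0, size=10_000)

# W_n^2 is the average squared deviation from the *known* mean mu.
w2 = np.mean((x - mu) ** 2)
print(w2)  # should be close to sigma^2 = 1
```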
1. Show that W_n² is the sample mean for a random sample of size n from the distribution of (X − µ)².
2. Use the result of Exercise 1 to show that
a. E(W_n²) = σ².
b. var(W_n²) = (σ_4 − σ⁴)/n.
In particular, 2(a) means that W_n² is an unbiased estimator of σ².
3. Use basic properties of covariance to show that

cov(M_n, W_n²) = σ_3 / n.

It follows that the sample mean and W_n² are uncorrelated if σ_3 = 0, and are asymptotically uncorrelated in any case.
4. Use Jensen's inequality to show that E(W_n) ≤ σ.

Thus, W_n is a biased estimator that tends to underestimate σ.
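For the record, the one-line argument runs as follows: the square-root function is concave, so Jensen's inequality, combined with Exercise 2(a), gives

```latex
E(W_n) = E\left(\sqrt{W_n^2}\right) \le \sqrt{E\left(W_n^2\right)} = \sqrt{\sigma^2} = \sigma ,
```

with equality only when W_n² is essentially constant.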
Consider now the more realistic case in which µ is unknown. In this case, a natural approach is to average, in some sense, (X_i − M_n)² over i = 1, 2, ..., n. It might seem that we should average by dividing by n. However, another approach is to divide by whatever constant would give us an unbiased estimator of σ².
5. Use basic algebra to show that

∑_{i=1}^{n} (X_i − M_n)² = ∑_{i=1}^{n} (X_i − µ)² − n(M_n − µ)².
6. Use the result in Exercise 5 and basic properties of expected value to show that

E[∑_{i=1}^{n} (X_i − M_n)²] = (n − 1)σ².
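Spelling out the expected value computation: take expectations across the identity in Exercise 5, and use E[(X_i − µ)²] = σ² together with E[(M_n − µ)²] = var(M_n) = σ²/n:

```latex
E\left[\sum_{i=1}^{n} (X_i - M_n)^2\right]
  = \sum_{i=1}^{n} E\left[(X_i - \mu)^2\right] - n\,E\left[(M_n - \mu)^2\right]
  = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n - 1)\sigma^2 .
```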
From Exercise 6, the random variable

S_n² = [1/(n − 1)] ∑_{i=1}^{n} (X_i − M_n)²

is an unbiased estimator of σ²; it is called the sample variance. As a practical matter, when n is large, it makes little difference whether we divide by n or n − 1. Returning to Exercise 5, note that

S_n² = [n/(n − 1)] W_n² − [n/(n − 1)](M_n − µ)².
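A quick simulation makes the bias question tangible. The sketch below (Python/NumPy; the exponential model and constants are illustrative assumptions) estimates the expected value of both versions; note that NumPy's ddof parameter is exactly the choice between dividing by n and by n − 1:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n, reps = 5, 200_000

# Many independent samples of size n from an exponential(1) distribution (sigma^2 = 1).
samples = rng.exponential(scale=1.0, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)  # divide by n - 1: the sample variance S_n^2
v2 = samples.var(axis=1, ddof=0)  # divide by n

print(s2.mean())  # close to sigma^2 = 1 (unbiased)
print(v2.mean())  # close to (n - 1) / n * sigma^2 = 0.8 (biased low)
```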
7. Use the (strong) law of large numbers to show that with probability 1,

S_n² → σ² as n → ∞.
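The convergence is easy to watch numerically. A minimal sketch (Python/NumPy, again assuming an exponential(1) model so that σ² = 1):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.exponential(scale=1.0, size=1_000_000)

# S_n^2 along a single sample path, for increasing n.
for n in (10, 100, 10_000, 1_000_000):
    print(n, x[:n].var(ddof=1))  # values approach sigma^2 = 1
```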
Next we will show that S_n² can be written as a multiple of the sum of the squared differences (X_i − X_j)² over all pairs. This in turn leads to formulas for the variance of S_n² and the covariance between M_n and S_n².
The formula in the following exercise is sometimes better than the definition for computational purposes.
8. Show that

S_n² = [1/(n − 1)] ∑_{i=1}^{n} X_i² − [n/(n − 1)] M_n².
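Here is a short numerical check of the identity (Python/NumPy, arbitrary test data). One caveat worth knowing: the shortcut can lose precision in floating point when the mean is large relative to the spread, so the two-pass definition is often preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
x = rng.normal(size=100)
n, m = x.size, x.mean()

direct = ((x - m) ** 2).sum() / (n - 1)
shortcut = (x ** 2).sum() / (n - 1) - n * m ** 2 / (n - 1)
print(np.isclose(direct, shortcut))  # True
```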
The following sequence of exercises will allow us to compute the variance of S_n².
9. Show that

S_n² = {1/[2n(n − 1)]} ∑_{(i, j)} (X_i − X_j)²,

where the sum is over all ordered pairs (i, j). Hint: Start with the expression on the right. In the term (X_i − X_j)², add and subtract M_n. Expand and take the sums term by term.
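The identity is easy to test numerically; the sketch below (Python/NumPy, arbitrary test data) sums (X_i − X_j)² over all ordered pairs, including the harmless zero terms with i = j:

```python
import numpy as np

rng = np.random.default_rng(seed=5)
x = rng.normal(size=50)
n = x.size

# Sum of (X_i - X_j)^2 over all ordered pairs (i, j); terms with i = j are zero.
pairwise = ((x[:, None] - x[None, :]) ** 2).sum()

print(np.isclose(pairwise / (2 * n * (n - 1)), x.var(ddof=1)))  # True
```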
10. If i and j are distinct, show that

E[(X_i − X_j)^m] = ∑_{k=0}^{m} C(m, k) (−1)^k σ_k σ_{m−k}.

Hint: In E[(X_i − X_j)^m], add and subtract µ, and then use the binomial theorem and independence.
11. Show that

var(S_n²) = (1/n)[σ_4 − (n − 3)σ⁴/(n − 1)].

Hint: Use the representation in Exercise 9 to write var(S_n²) in terms of covariances of the terms (X_i − X_j)², grouped according to how many indices the pairs share, and evaluate the resulting moments with Exercise 10.
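The formula can be checked by Monte Carlo. For the exponential(1) distribution (an illustrative assumption), σ² = 1, σ_3 = 2, and σ_4 = 9:

```python
import numpy as np

rng = np.random.default_rng(seed=6)
n, reps = 10, 400_000
sigma2, sigma4 = 1.0, 9.0  # central moments of the exponential(1) distribution

# Empirical variance of S_n^2 over many independent samples of size n.
s2 = rng.exponential(scale=1.0, size=(reps, n)).var(axis=1, ddof=1)

theory = (sigma4 - (n - 3) * sigma2**2 / (n - 1)) / n
print(s2.var(), theory)  # the two should be close
```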
12. Show that var(S_n²) > var(W_n²). Does this seem reasonable?
13. Show that var(S_n²) decreases to 0 as n increases to infinity.
14. Use techniques similar to Exercise 11 to show that

cov(M_n, S_n²) = σ_3 / n.

In particular, note that cov(M_n, S_n²) = cov(M_n, W_n²). Again, the sample mean and variance are uncorrelated if σ_3 = 0, and asymptotically uncorrelated in any case.
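Analogously, here is a quick simulation check of the covariance formula, using the (skewed) exponential(1) distribution so that σ_3 = 2 (the model is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n, reps = 10, 400_000
sigma3 = 2.0  # third central moment of the exponential(1) distribution

x = rng.exponential(scale=1.0, size=(reps, n))
m = x.mean(axis=1)          # M_n for each sample
s2 = x.var(axis=1, ddof=1)  # S_n^2 for each sample

cov = np.mean((m - m.mean()) * (s2 - s2.mean()))
print(cov, sigma3 / n)  # both close to 0.2
```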
The square root of the sample variance is the sample standard deviation, denoted S_n.
15. Use Jensen's inequality to show that E(S_n) ≤ σ.

Thus, S_n is a biased estimator that tends to underestimate σ.
Many of the applets in this project are simulations of experiments with a basic random variable of interest. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the standard deviation σ of the distribution, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you run the simulation, the sample standard deviation S_n is also displayed numerically in the table and graphically as the radius of the red horizontal bar in the graph box.
16. In
the binomial coin experiment, the
random variable is the number of heads. Run the simulation 1000 times updating every 10
runs and note the apparent convergence of the sample standard deviation to the
distribution standard deviation.
17. In the
simulation of the matching experiment, the random variable
is the number of matches. Run the simulation 1000 times updating every 10 runs and note
the apparent convergence of the sample standard deviation to the distribution standard
deviation.
18. Run the
simulation of the exponential experiment 1000 times
with an update frequency of 10. Note the apparent convergence of the sample standard
deviation to the distribution standard deviation.
The sample mean and standard deviation are often computed in exploratory data analysis, as measures of the center and spread of the data, respectively.
19. Compute the sample mean and standard deviation for Michelson's velocity of light
data.
20. Compute the sample mean and standard deviation for Cavendish's density of the
earth data.
21. Compute the sample mean and standard deviation of the net weight in the
M&M data.
22. Compute the sample mean and standard deviation of the petal length variable for the following cases in
Fisher's iris data, and compare the results: all cases combined, and each of the three species separately.
Suppose that instead of the actual data, we have a frequency distribution with classes A_1, A_2, ..., A_k, class marks x_1, x_2, ..., x_k, and frequencies n_1, n_2, ..., n_k. Thus,

n_j = #{i ∈ {1, 2, ..., n} : X_i ∈ A_j}.
In this case, approximate values of the sample mean and variance are

M_n ≈ (1/n) ∑_{j=1}^{k} n_j x_j,   S_n² ≈ [1/(n − 1)] ∑_{j=1}^{k} n_j (x_j − M_n)².

These approximations are based on the hope that the data values in each class are well represented by the class mark.
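In code, the grouped approximations are a one-liner each. A minimal Python/NumPy sketch, with a hypothetical frequency distribution (the class marks and frequencies below are made up for illustration):

```python
import numpy as np

# Hypothetical frequency distribution: class marks x_j and frequencies n_j.
marks = np.array([1.05, 1.15, 1.25, 1.35])
freqs = np.array([4, 10, 7, 3])
n = freqs.sum()

mean_approx = (freqs * marks).sum() / n
var_approx = (freqs * (marks - mean_approx) ** 2).sum() / (n - 1)
print(mean_approx, var_approx, np.sqrt(var_approx))
```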
23. In the
interactive histogram, select mean and standard deviation. Set the class width to 0.1 and
construct a frequency distribution with at least 6 nonempty classes and at least 10
values. Compute the mean, variance, and standard deviation by hand, and verify that you
get the same results as the applet.
24. In the
interactive histogram, select mean and standard deviation. Set the class width to 0.1 and
construct a distribution with at least 30 values of each of the types indicated below.
Then increase the class width to each of the other four values. As you perform these
operations, note the position and size of the mean ± standard deviation bar.
25. In the
interactive histogram, construct
a distribution that has the largest possible standard deviation.
26. Based on your
answer to Exercise 25, characterize the distributions (on a fixed interval [a, b])
that have the largest possible standard deviation.