
2. Variance and Higher Moments


Definition

As usual, we start with a random experiment that has a sample space and a probability measure P. Suppose that X is a random variable for the experiment, taking values in a subset S of R. Recall that the expected value or mean of X gives the center of the distribution of X. The variance of X is a measure of the spread of the distribution about the mean and is defined by

var(X) = E{[X - E(X)]^2}.

Thus, the variance is the second central moment of X.

Mathematical Exercise 1. Suppose that X has a discrete distribution with density function f. Use the change of variables theorem to show that

var(X) = sum_{x in S} [x - E(X)]^2 f(x).

Mathematical Exercise 2. Suppose that X has a continuous distribution with density function f. Use the change of variables theorem to show that

var(X) = integral_S [x - E(X)]^2 f(x) dx.

The standard deviation of X is the square root of the variance:

sd(X) = [var(X)]^{1/2}.

It also measures dispersion about the mean but has the same physical units as the variable X.
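
As a concrete illustration of the definitions, the following sketch (in Python; the density f is a hypothetical example, not from the text) computes the mean, variance, and standard deviation of a discrete distribution directly from the formula in Exercise 1.

    # Mean, variance, and standard deviation of a discrete distribution,
    # computed from the definitions above. The density here is hypothetical.
    f = {1: 0.2, 2: 0.5, 3: 0.3}  # density: value -> probability

    mean = sum(x * p for x, p in f.items())               # E(X)
    var = sum((x - mean) ** 2 * p for x, p in f.items())  # second central moment
    sd = var ** 0.5                                       # sd(X)
    print(mean, var, sd)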

Properties

The following exercises give some basic properties of variance, which in turn rely on basic properties of expected value:

Mathematical Exercise 3. Show that var(X) = E(X^2) - [E(X)]^2.

Mathematical Exercise 4. Show that var(X) >= 0.

Mathematical Exercise 5. Show that var(X) = 0 if and only if P(X = c) = 1 for some constant c.

Mathematical Exercise 6. Show that if a and b are constants then var(aX + b) = a^2 var(X).

Mathematical Exercise 7. Let Z = [X - E(X)] / sd(X). Show that Z has mean 0 and variance 1.

The random variable Z in Exercise 7 is sometimes called the standard score associated with X. Since X and its mean and standard deviation all have the same physical units, the standard score Z is dimensionless. It measures the directed distance from E(X) to X in terms of standard deviations.
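
A simple numeric check of Exercise 7, as a Python sketch (assuming numpy; the exponential sample is an arbitrary illustrative choice): standardizing any sample should give an empirical mean near 0 and variance near 1.

    # Empirical standard scores: mean near 0, variance near 1.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=100_000)  # any distribution works here
    z = (x - x.mean()) / x.std()                  # empirical standard score
    print(z.mean(), z.var())                      # approximately 0 and 1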

On the other hand, when E(X) is not zero, the ratio of standard deviation to mean is called the coefficient of variation:

sd(X) / E(X)

Note that this quantity also is dimensionless, and is sometimes used to compare variability for random variables with different means.

Examples and Special Cases

Mathematical Exercise 8. Suppose that I is an indicator variable with P(I = 1) = p.

  1. Show that var(I) = p(1 - p).
  2. Sketch the graph of var(I) as a function of p.
  3. Find the value of p that maximizes var(I).

Mathematical Exercise 9. The score on a fair die is uniformly distributed on {1, 2, 3, 4, 5, 6}. Find the mean, variance, and standard deviation.

Simulation Exercise 10. In the dice experiment, select one fair die. Run the experiment 1000 times, updating every 10 runs, and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.

Mathematical Exercise 11. For an ace-six flat die, faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, 5 have probability 1/8 each. Find the mean, variance and standard deviation.

Simulation Exercise 12. In the dice experiment, select one ace-six flat die. Run the experiment 1000 times, updating every 10 runs, and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.

Mathematical Exercise 13. Suppose that X is uniformly distributed on {1, 2, ..., n}. Show that

var(X) = (n^2 - 1) / 12.
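
A numeric cross-check of this formula against the direct computation, written as a Python sketch, with n = 6 matching the fair die of Exercise 9:

    # Direct computation of var(X) for the uniform distribution on
    # {1, ..., n}, compared with the closed form (n^2 - 1) / 12.
    n = 6  # fair die, as in Exercise 9
    mean = sum(range(1, n + 1)) / n
    var_direct = sum((x - mean) ** 2 for x in range(1, n + 1)) / n
    print(var_direct, (n ** 2 - 1) / 12)  # both 35/12 = 2.9166...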

Mathematical Exercise 14. Suppose that Y has density function f(n) = p(1 - p)^{n-1} for n = 1, 2, ..., where 0 < p < 1 is a parameter. This defines the geometric distribution with parameter p. Show that

var(Y) = (1 - p) / p^2.

Mathematical Exercise 15. Suppose that N has density function f(n) = exp(-t) t^n / n! for n = 0, 1, ..., where t > 0 is a parameter. This defines the Poisson distribution with parameter t. Show that

var(N) = t.

Mathematical Exercise 16. Suppose that X is uniformly distributed on the interval (a, b) where a < b. Show that

var(X) = (b - a)^2 / 12.

Note in particular that the variance depends only on the length of the interval, which is intuitively reasonable.

Mathematical Exercise 17. Suppose that X has density function f(x) = r exp(-rx) for x > 0. This defines the exponential distribution with rate parameter r > 0. Show that

sd(X) = 1 / r.

Simulation Exercise 18. In the gamma experiment, set k = 1 to get the exponential distribution. Vary r with the scroll bar and note the size and location of the mean-standard deviation bar. Now with r = 2, run the experiment 1000 times updating every 10 runs. Note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
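
For readers working outside the applet, the following Python sketch imitates Exercise 18 (assuming numpy, whose exponential generator is parameterized by the scale 1/r rather than the rate):

    # Simulate 1000 exponential variables with rate r = 2 and compare the
    # empirical mean and standard deviation with the true values, both 1/r.
    import numpy as np

    r = 2.0
    rng = np.random.default_rng(0)
    sample = rng.exponential(scale=1 / r, size=1000)
    print(sample.mean(), sample.std(), 1 / r)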

Mathematical Exercise 19. Suppose that X has density f(x) = a / x^{a+1} for x > 1, where a > 0 is a parameter. This defines the Pareto distribution with shape parameter a. Show that

  1. var(X) = infinity if 1 < a <= 2.
  2. var(X) = a / [(a - 1)^2 (a - 2)] if a > 2.

Mathematical Exercise 20. Suppose that Z has density f(z) = exp(-z^2 / 2) / (2 pi)^{1/2} for z in R. This defines the standard normal distribution. Show that

var(Z) = 1.

Hint: In the integral for E(Z^2), integrate by parts.

Simulation Exercise 21. In the random variable experiment, select the normal distribution (the default parameter values give the standard normal distribution). Run the experiment 1000 times updating every 10 runs and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.

Mathematical Exercise 22. Suppose that X is a random variable with E(X) = 5, var(X) = 4. Find

  1. var(3X - 2)
  2. E(X^2)

Mathematical Exercise 23. Suppose that X_1 and X_2 are independent random variables with E(X_i) = µ_i, var(X_i) = d_i^2 for i = 1, 2. Show that

var(X_1 X_2) = (d_1^2 + µ_1^2)(d_2^2 + µ_2^2) - µ_1^2 µ_2^2.
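
A Monte Carlo sanity check of this identity, as a Python sketch; the normal distributions and parameter values below are arbitrary illustrative choices, since the result depends only on the means and variances.

    # Compare the sample variance of X1 * X2 with the formula of Exercise 23.
    import numpy as np

    rng = np.random.default_rng(0)
    mu1, d1, mu2, d2 = 2.0, 0.5, -1.0, 1.5  # hypothetical means and sds
    x1 = rng.normal(mu1, d1, size=1_000_000)
    x2 = rng.normal(mu2, d2, size=1_000_000)
    formula = (d1**2 + mu1**2) * (d2**2 + mu2**2) - mu1**2 * mu2**2
    print(np.var(x1 * x2), formula)  # should agree closely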

Mathematical Exercise 24. Marilyn Vos Savant has an IQ of 228. Assuming that the distribution of IQ scores has mean 100 and standard deviation 15, find Marilyn's standard score.

Chebyshev's Inequality

Chebyshev's inequality (named after Pafnuty Chebyshev) gives an upper bound on the probability that a random variable will be more than a specified distance from its mean.

Mathematical Exercise 25. Use Markov's inequality to prove Chebyshev's inequality: for t > 0,

P[|X - E(X)| >= t] <= var(X) / t^2.

Mathematical Exercise 26. Establish the following equivalent version of Chebyshev's inequality: for k > 0,

P[|X - E(X)| >= k sd(X)] <= 1 / k^2.

Mathematical Exercise 27. Suppose that Y has the geometric distribution with parameter p = 3/4. Compute the true value and the Chebyshev bound for the probability that Y is at least 2 standard deviations away from the mean.

Mathematical Exercise 28. Suppose that X has the exponential distribution with rate parameter r > 0. Compute the true value and the Chebyshev bound for the probability that X is at least k standard deviations away from the mean.
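
As a companion to Exercises 27 and 28, the Python sketch below estimates the exponential tail probability by simulation and compares it with the Chebyshev bound (the rate r = 1 is arbitrary; the comparison does not depend on r).

    # Empirical P(|X - mean| >= k sd) for the exponential distribution,
    # versus the Chebyshev bound 1 / k^2. The bound is typically very crude.
    import numpy as np

    rng = np.random.default_rng(0)
    r = 1.0
    x = rng.exponential(scale=1 / r, size=1_000_000)
    mean = sd = 1 / r  # exponential mean and sd are both 1/r
    for k in (1.5, 2.0, 3.0):
        emp = np.mean(np.abs(x - mean) >= k * sd)
        print(k, emp, 1 / k ** 2)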

Skewness and Kurtosis

Recall again that the variance of X is the second moment of X about the mean, and measures the spread of the distribution of X about the mean. The third and fourth moments of X about the mean also measure interesting features of the distribution. The third moment measures skewness, the lack of symmetry, while the fourth moment measures kurtosis, the degree to which the distribution is peaked. The actual numerical measures of these characteristics are standardized to eliminate the physical units, by dividing by an appropriate power of the standard deviation.

Thus, let µ = E(X) and d = sd(X). The skewness of X is defined to be

skew(X) = E[(X - µ)^3] / d^3.

The kurtosis of X is defined to be

kurt(X) = E[(X - µ)^4] / d^4.

Mathematical Exercise 29. Suppose that X has density f, which is symmetric with respect to µ. Show that skew(X) = 0.

Mathematical Exercise 30. Show that

skew(X) = [E(X^3) - 3µ E(X^2) + 2µ^3] / d^3.

Mathematical Exercise 31. Show that

kurt(X) = [E(X^4) - 4µ E(X^3) + 6µ^2 E(X^2) - 3µ^4] / d^4.

Mathematical Exercise 32. Graph the following density functions and compute the skewness and kurtosis of each. (These distributions are all members of the beta family).

  1. f(x) = 6x(1 - x), 0 < x < 1.
  2. f(x) = 12x^2 (1 - x), 0 < x < 1.
  3. f(x) = 12x(1 - x)^2, 0 < x < 1.
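
A numeric cross-check for Exercise 32, as a Python sketch: the moments are approximated with a midpoint-rule quadrature on (0, 1), so the printed values are approximations of the exact skewness and kurtosis.

    # Approximate skew(X) and kurt(X) by numerical integration of each density.
    import numpy as np

    def skew_kurt(f, lo=0.0, hi=1.0, n=100_000):
        w = (hi - lo) / n
        x = lo + w * (np.arange(n) + 0.5)              # midpoint rule
        mu = np.sum(x * f(x)) * w                      # mean
        d = np.sqrt(np.sum((x - mu) ** 2 * f(x)) * w)  # standard deviation
        skew = np.sum((x - mu) ** 3 * f(x)) * w / d ** 3
        kurt = np.sum((x - mu) ** 4 * f(x)) * w / d ** 4
        return skew, kurt

    print(skew_kurt(lambda x: 6 * x * (1 - x)))        # symmetric: skewness 0
    print(skew_kurt(lambda x: 12 * x ** 2 * (1 - x)))  # negative skew
    print(skew_kurt(lambda x: 12 * x * (1 - x) ** 2))  # positive skew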

Norm

The variance and higher moments are related to the concept of norm and distance in the theory of vector spaces. This connection can help unify and illuminate some of the ideas. Thus, let X be a real-valued random variable. For k >= 1, we define the k-norm by

||X||_k = [E(|X|^k)]^{1/k}.

Thus, ||X||_k is a measure of the size of X in a certain sense. For a given probability space (that is, a given random experiment), the set of random variables with finite k'th moment forms a vector space (if we identify two random variables that agree with probability 1). The following exercises show that the k-norm really is a norm on this vector space.
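
A Monte Carlo sketch of the k-norm in Python (the uniform sample is a hypothetical example, matching Exercise 38 below); note that the printed values increase with k, anticipating Lyapunov's inequality in Exercise 37.

    # Estimate ||X||_k = [E(|X|^k)]^(1/k) from a sample.
    import numpy as np

    def k_norm(sample, k):
        return np.mean(np.abs(sample) ** k) ** (1 / k)

    rng = np.random.default_rng(0)
    x = rng.uniform(size=100_000)  # X uniform on (0, 1)
    print([round(k_norm(x, k), 4) for k in (1, 2, 4, 8)])  # increasing in k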

Mathematical Exercise 33. Show that ||X||_k >= 0 for any X.

Mathematical Exercise 34. Show that ||X||_k = 0 if and only if P(X = 0) = 1.

Mathematical Exercise 35. Show that ||cX||_k = |c| ||X||_k for any constant c.

The next exercise gives Minkowski's inequality, named for Hermann Minkowski. It is also known as the triangle inequality.

Mathematical Exercise 36. Show that ||X + Y||_k <= ||X||_k + ||Y||_k for any X and Y.

  1. Show that g(x, y) = (x^{1/k} + y^{1/k})^k is concave on {(x, y) in R^2: x >= 0, y >= 0}.
  2. Use (a) and Jensen's inequality to conclude that if U and V are nonnegative random variables then E[(U^{1/k} + V^{1/k})^k] <= {[E(U)]^{1/k} + [E(V)]^{1/k}}^k.
  3. In (b), let U = |X|^k and V = |Y|^k and then do some algebra.

Our next exercise gives Lyapunov's inequality, named for Aleksandr Lyapunov. This inequality shows that the k-norm of a random variable is increasing in k.

Mathematical Exercise 37. Show that if j <= k then ||X||_j <= ||X||_k.

  1. Show that g(x) = x^{k/j} is convex on {x: x >= 0}.
  2. Use part (a) and Jensen's inequality to conclude that if U is a nonnegative random variable then [E(U)]^{k/j} <= E(U^{k/j}).
  3. In (b), let U = |X|^j and do some algebra.

Lyapunov's inequality shows that if X has a finite k'th moment, and j < k, then X has a finite j'th moment as well.

Mathematical Exercise 38. Suppose that X is uniformly distributed on the interval (0, 1).

  1. Find ||X||_k.
  2. Graph ||X||_k as a function of k.
  3. Find the limit of ||X||_k as k goes to infinity.

Mathematical Exercise 39. Suppose that X has density f(x) = a / x^{a+1} for x > 1, where a > 0 is a parameter. This defines the Pareto distribution with shape parameter a.

  1. Find ||X||_k.
  2. Graph ||X||_k as a function of k < a.
  3. Find the limit of ||X||_k as k goes up to a.

Mathematical Exercise 40. Suppose that (X, Y) has density f(x, y) = x + y for 0 < x < 1, 0 < y < 1. Verify Minkowski's inequality.

Distance

The k-norm, like any norm, can be used to measure distance; we simply compute the norm of the difference between the objects. Thus, we define the k-distance (or k-metric) between real-valued random variables X and Y to be

d_k(X, Y) = ||Y - X||_k = [E(|Y - X|^k)]^{1/k}.

The properties in the following exercises are analogues of the properties in Exercises 33-36 (and thus very little additional work should be required). These properties show that the k-distance really is a distance.

Mathematical Exercise 41. Show that d_k(X, Y) >= 0 for any X, Y.

Mathematical Exercise 42. Show that d_k(X, Y) = 0 if and only if P(Y = X) = 1.

Mathematical Exercise 43. Show that d_k(X, Y) <= d_k(X, Z) + d_k(Z, Y) for any X, Y, Z (this is known as the triangle inequality).

Thus, the standard deviation is simply the 2-distance from X to its mean:

sd(X) = d_2[X, E(X)] = {E[(X - E(X))^2]}^{1/2},

and the variance is the square of this. More generally, the k'th moment of X about a is simply the k'th power of the k-distance from X to a. The 2-distance is especially important for reasons that will become clear below and in the next section. This distance is also called the root mean square distance.

Center and Spread Revisited

Measures of center and measures of spread are best thought of together, in the context of a measure of distance. For a random variable X, we first try to find the constants t that are closest to X, as measured by the given distance; any such t is a measure of center relative to the distance. The minimum distance itself is the corresponding measure of spread.

Let us apply this procedure to the 2-distance. Thus, we define the root mean square error function by

d_2(X, t) = ||X - t||_2 = {E[(X - t)^2]}^{1/2}.

Mathematical Exercise 44. Show that d_2(X, t) is minimized when t = E(X) and that the minimum value is sd(X). Hint: The minimum value occurs at the same points as the minimum value of E[(X - t)^2]. Expand this and take expected values term by term. The resulting expression is a quadratic function of t.
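
The following Python sketch illustrates Exercise 44 empirically, using a hypothetical exponential sample: the root mean square error, scanned over a grid of values of t, is minimized near the sample mean, and the minimum value is near the sample standard deviation.

    # Scan d_2(X, t) over t and locate the minimum.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(size=10_000)  # sample mean and sd are both near 1
    ts = np.linspace(0.0, 3.0, 301)
    rmse = np.array([np.sqrt(np.mean((x - t) ** 2)) for t in ts])
    i = int(np.argmin(rmse))
    print(ts[i], x.mean())   # minimizer is near the sample mean
    print(rmse[i], x.std())  # minimum value is near the standard deviation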

Simulation Exercise 45. In the histogram applet, construct a discrete distribution of each of the types indicated below. Note the position and size of the mean ± standard deviation bar and the shape of the mean square error graph.

  1. A uniform distribution.
  2. A symmetric, unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A u-shaped distribution.

Next, let us apply our procedure to the 1-distance. Thus, we define the mean absolute error function by

d_1(X, t) = ||X - t||_1 = E[|X - t|].

Mathematical Exercise 46. Show that d_1(X, t) is minimized when t is any median of X.

The last exercise shows that mean absolute error has a basic deficiency as a measure of error: in general, there does not exist a unique minimizing value of t. Indeed, for many discrete distributions, there is an entire median interval. Thus, in terms of mean absolute error, there is no compelling reason to choose one value in this interval, as the measure of center, over any other value in the interval.
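
A companion Python sketch for Exercise 46, using the same kind of hypothetical sample as above: the mean absolute error, scanned over a grid of values of t, is minimized near the sample median.

    # Scan d_1(X, t) over t and locate the minimum.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(size=10_000)
    ts = np.linspace(0.0, 3.0, 301)
    mae = np.array([np.mean(np.abs(x - t)) for t in ts])
    print(ts[int(np.argmin(mae))], np.median(x))  # minimizer near the median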

Simulation Exercise 47. Construct a distribution of each of the types indicated below. In each case, note the position and size of the boxplot and the shape of the mean absolute error graph.

  1. A uniform distribution.
  2. A symmetric, unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A u-shaped distribution.

Mathematical Exercise 48. Let I be an indicator variable with P(I = 1) = p. Graph E[|I - t|] as a function of t in each of the cases below. In each case, find the minimum value of the mean absolute error function and the values of t where the minimum occurs.

  1. p < 1/2
  2. p = 1/2
  3. p > 1/2

Convergence

Whenever we have a measure of distance, we automatically have a criterion for convergence. Let X_n, n = 1, 2, ..., and X be real-valued random variables. We say that X_n converges to X as n converges to infinity in k'th mean if

d_k(X_n, X) converges to 0 as n converges to infinity; equivalently, E(|X_n - X|^k) converges to 0 as n converges to infinity.

When k = 1, we simply say that X_n converges to X as n converges to infinity in mean; when k = 2, we say that X_n converges to X as n converges to infinity in mean square. These are the most important special cases.

Mathematical Exercise 49. Use Lyapunov's inequality to show that if j < k then

X_n converges to X as n converges to infinity in k'th mean implies X_n converges to X as n converges to infinity in j'th mean.

Our next sequence of exercises shows that convergence in mean is stronger than convergence in probability.

Mathematical Exercise 50. Use Markov's inequality to show that

X_n converges to X as n converges to infinity in mean implies X_n converges to X as n converges to infinity in probability.

The converse is not true. Moreover, convergence with probability 1 does not imply convergence in k'th mean and convergence in k'th mean does not imply convergence with probability 1. The next two exercises give some counterexamples.

Mathematical Exercise 51. Suppose that X_1, X_2, X_3, ... is a sequence of independent random variables with

P(X_n = n^3) = 1 / n^2, P(X_n = 0) = 1 - 1 / n^2 for n = 1, 2, ...

  1. Use the first Borel-Cantelli lemma to show that X_n converges to 0 as n converges to infinity with probability 1.
  2. Show that X_n converges to 0 as n converges to infinity in probability.
  3. Show that E(X_n) converges to infinity as n converges to infinity.

Mathematical Exercise 52. Suppose that X_1, X_2, X_3, ... is a sequence of independent random variables with

P(X_n = 1) = 1 / n, P(X_n = 0) = 1 - 1 / n for n = 1, 2, ...

  1. Use the second Borel-Cantelli lemma to show that P(X_n = 0 for infinitely many n) = 1.
  2. Use the second Borel-Cantelli lemma to show that P(X_n = 1 for infinitely many n) = 1.
  3. Show that P(X_n does not converge as n converges to infinity) = 1.
  4. Show that X_n converges to 0 as n converges to infinity in k'th mean for any k >= 1.
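
A small simulation sketch (Python) of the sequence in Exercise 52; it is only suggestive, since no finite simulation can exhibit behavior at infinity, but a single sample path typically keeps returning to 1 even far out in the sequence.

    # One sample path of X_1, ..., X_10000 from Exercise 52.
    import numpy as np

    rng = np.random.default_rng(0)
    n_vals = np.arange(1, 10_001)
    path = (rng.uniform(size=n_vals.size) < 1 / n_vals).astype(int)
    print(np.flatnonzero(path) + 1)  # the indices n with X_n = 1
    # Since X_n takes only the values 0 and 1, E(|X_n|^k) = P(X_n = 1) = 1/n,
    # which goes to 0; this is the k'th mean convergence in part 4.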

To summarize, the implications go from left to right in the following table (where j < k); no other implications hold in general.

convergence with probability 1  =>  convergence in probability  =>  convergence in distribution
convergence in k'th mean  =>  convergence in j'th mean

Related Topics

For a related statistical topic, see the section on the Sample Variance in the chapter on Random Samples. The variance of a sum of random variables is best understood in terms of a related concept known as covariance, which will be studied in detail in the next section.