
5. Conditional Expected Value


As usual, we start with a random experiment that has a sample space and a probability measure P. Suppose that X is a random variable taking values in a set S and that Y is a random variable taking values in a subset T of R. In this section we will study the conditional expected value of Y given X, a concept of fundamental importance in both probability and statistics. As we will see, the conditional expected value of Y given X is the function of X that best approximates Y in the mean square sense. Note that X may be vector-valued.

A technical assumption that we will make is that all random variables occurring in expected values have finite second moment.

The Elementary Definition

Note that we can think of (X, Y) as a random variable taking values in S × T. Suppose first that (X, Y) has a continuous distribution with density function f. Recall that the marginal density g of X is given by

g(x) = ∫T f(x, y) dy for x in S,

and that the conditional density of Y given X = x is given by

h(y | x) = f(x, y) / g(x) for x in S, y in T.

Finally, the conditional expected value of Y given X = x is simply the mean computed relative to the conditional distribution:

E(Y | X = x) = ∫T y h(y | x) dy.

Of course, the conditional mean of Y depends on the given value x of X. Temporarily, let u denote the function from S into R defined by

u(x) = E(Y | X = x) for x in S.

The function u is sometimes referred to as the regression function. The random variable u(X) is called the conditional expected value of Y given X and is denoted E(Y | X).
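The elementary definition translates directly into numerical integration. The following is a minimal sketch, assuming SciPy is available, using the density f(x, y) = x + y on the unit square (the density of Exercise 17 below); the function names are ours.

    # Numerical version of the elementary definition, assuming SciPy.
    # The joint density f(x, y) = x + y on the unit square is taken
    # from Exercise 17 below.
    from scipy.integrate import quad

    def f(x, y):
        return x + y  # joint density on 0 < x < 1, 0 < y < 1

    def g(x):
        # marginal density of X: integrate out y over T = (0, 1)
        return quad(lambda y: f(x, y), 0, 1)[0]

    def regression(x):
        # E(Y | X = x): the mean of the conditional density h(y | x)
        return quad(lambda y: y * f(x, y) / g(x), 0, 1)[0]

    print(regression(0.5))  # 0.58333... = 7/12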

The General Definition

The random variable E(Y | X) satisfies a key property that characterizes it among all functions of X.

Mathematical Exercise 1. Suppose that r is a function from S into R. Use the change of variables theorem for expected value to show that

E[r(X)E(Y | X)] = E[r(X)Y].

The result in Exercise 1 also holds in the case that (X, Y) has a joint discrete distribution; the same derivation works, but with sums replacing the integrals.

In fact, the result in Exercise 1 can be used as a definition of conditional expected value, regardless of the joint distribution of (X, Y). Thus, generally we define E(Y | X) to be the random variable that satisfies the condition in Exercise 1 and is of the form E(Y | X) = u(X) for some function u from S into R. Then we define E(Y | X = x) to be u(x).
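The defining property is easy to check by simulation. Here is a Monte Carlo sketch, assuming NumPy, for a hypothetical model in which the regression function is known in advance: X standard normal and, given X = x, Y normal with mean x^2, so that u(x) = x^2.

    # Monte Carlo check of E[r(X)E(Y | X)] = E[r(X)Y], assuming NumPy.
    # Hypothetical model: X standard normal; given X, Y normal with
    # mean X^2, so E(Y | X) = X^2.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    y = rng.normal(x**2, 1.0)

    r = np.cos(x)              # an arbitrary function r(X)
    print(np.mean(r * x**2))   # estimates E[r(X)E(Y | X)]
    print(np.mean(r * y))      # estimates E[r(X)Y]; the two agree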

Properties

Our first consequence of Exercise 1 is a very compact and elegant statement of the law of total probability:

Mathematical Exercise 2. By taking r to be the constant function 1 in Exercise 1, show that

E[E(Y | X)] = E(Y).
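A quick Monte Carlo check of this identity, assuming NumPy and reusing the hypothetical model from the previous sketch (so E(Y | X) = X^2 and E(Y) = E(X^2) = 1):

    # Check of E[E(Y | X)] = E(Y) by simulation, assuming NumPy.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(1_000_000)
    y = rng.normal(x**2, 1.0)    # E(Y | X) = X^2

    print(y.mean())              # approximately 1 = E(Y)
    print((x**2).mean())         # E[E(Y | X)], also approximately 1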

Mathematical Exercise 3. Show that, in light of Exercise 2, the condition in Exercise 1 can be restated as follows: For any function r from S into R, Y - E(Y | X) and r(X) are uncorrelated.

The next exercise shows that the condition in Exercise 1 characterizes E(Y | X).

Mathematical Exercise 4. Suppose that u(X) and v(X) satisfy the condition in Exercise 1 and hence also the results in Exercises 2 and 3. Show that

  1. var[u(X) - v(X)] = 0.
  2. u(X) = v(X) (with probability 1).

Mathematical Exercise 5. Suppose that s is a function from S into R. Use the characterization in Exercise 1 to show that

E[s(X)Y | X] = s(X)E(Y | X).

The following rule generalizes Exercise 5 and is sometimes referred to as the substitution rule for conditional expected value.

Mathematical Exercise 6. Suppose that s is a function from S × T into R. Show that

E[s(X, Y) | X = x] = E[s(x, Y) | X = x].

Mathematical Exercise 7. Suppose that X and Y are independent. Use the characterization in Exercise 1 to show that

E(Y | X) = E(Y).

Use the general definition to establish the properties in the following exercises, where Y and Z are real-valued random variables. Note that these are analogues of the basic properties of ordinary expected value.

Mathematical Exercise 8. Show that E(Y + Z | X) = E(Y | X) + E(Z | X).

Mathematical Exercise 9. Show that E(cY | X) = cE(Y | X).

Mathematical Exercise 10. Show that if Y >= 0 then E(Y | X) >= 0.

Mathematical Exercise 11. Show that if Y <= Z then E(Y | X) <= E(Z | X).

Mathematical Exercise 12. Show that |E(Y | X)| <= E(|Y| | X).

Exercises

Mathematical Exercise 13. Suppose that (X, Y) is uniformly distributed on the square R = {(x, y): -6 < x < 6, -6 < y < 6}. Find E(Y | X).

Simulation Exercise 14. In the bivariate uniform experiment, select the square in the list box. Run the simulation 2000 times, updating every 10 runs. Note the relationship between the cloud of points and the graph of the regression function.
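If the applet is not at hand, the experiment is easy to replicate. Here is a minimal stand-in, assuming NumPy and Matplotlib, for the square case; the horizontal line is the regression function, which Exercise 13 asks you to find (by the symmetry of the square it is constant).

    # Stand-in for the bivariate uniform applet (square case), assuming
    # NumPy and Matplotlib: 2000 uniform points with the regression line.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x = rng.uniform(-6, 6, 2000)
    y = rng.uniform(-6, 6, 2000)

    plt.scatter(x, y, s=4, alpha=0.4)
    plt.plot([-6, 6], [0, 0], "r")  # E(Y | X = x) = 0 by symmetry
    plt.show()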

Mathematical Exercise 15. Suppose that (X, Y) is uniformly distributed on the triangle R = {(x, y): -6 < y < x < 6}. Find E(Y | X).

Simulation Exercise 16. In the bivariate uniform experiment, select the triangle in the list box. Run the simulation 2000 times, updating every 10 runs. Note the relationship between the cloud of points and the graph of the regression function.

Mathematical Exercise 17. Suppose that (X, Y) has probability density function f(x, y) = x + y for 0 < x < 1, 0 < y < 1. Find

  1. E(Y | X)
  2. E(X | Y)

Mathematical Exercise 18. Suppose that (X, Y) has probability density function f(x, y) = 2(x + y) for 0 < x < y < 1. Find

  1. E(Y | X)
  2. E(X | Y)

Mathematical Exercise 19. Suppose that (X, Y) has probability density function f(x, y) = 6x^2 y for 0 < x < 1, 0 < y < 1. Find

  1. E(Y | X)
  2. E(X | Y)

Mathematical Exercise 20. Suppose that (X, Y) has probability density function f(x, y) = 15x^2 y for 0 < x < y < 1. Find

  1. E(Y | X)
  2. E(X | Y)

Mathematical Exercise 21. A pair of fair dice are thrown, and the scores (X1, X2) recorded. Let Y = X1 + X2 denote the sum of the scores and U = min{X1, X2} the minimum score. Find each of the following:

  1. E(Y | X1)
  2. E(U | X1)
  3. E(Y | U)
  4. E(X2 | X1)

Mathematical Exercise 22. Suppose that X, Y, and Z are random variables with E(Y | X) = X^3 and E(Z | X) = 1 / (1 + X^2). Find

E[exp(X) Y - sin(X) Z | X].

Conditional Probability

The conditional probability of an event A, given random vector X, is a special case of the conditional expected value. We define

P(A | X) = E(IA | X) where IA is the indicator variable of A.

The properties above for conditional expected value, of course, have special cases for conditional probability. In particular, the following exercise gives a special version of the law of total probability:

Mathematical Exercise 23. Show that P(A) = E[P(A | X)].

Mathematical Exercise 24. A box contains 10 coins, labeled 0 to 9. The probability of heads for coin i is i / 9. A coin is chosen at random from the box and tossed. Find the probability of heads. This problem is an example of Laplace's rule of succession.
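The answer is easy to check by simulation via the identity in Exercise 23. A minimal sketch, assuming NumPy:

    # Simulation of the coin-box experiment, assuming NumPy: choose a
    # coin uniformly from {0, ..., 9}, then toss it once.
    import numpy as np

    rng = np.random.default_rng(3)
    i = rng.integers(0, 10, size=1_000_000)   # the chosen coin
    heads = rng.random(1_000_000) < i / 9     # heads with probability i/9
    print(heads.mean())  # estimate of P(heads) = E[P(heads | coin)]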

The Best Predictor

The next two exercises show that, of all functions of X, E(Y | X) is the best predictor of Y, in the sense of minimizing the mean square error. This is fundamentally important in statistical problems where the predictor vector X can be observed but not the response variable Y.

Mathematical Exercise 25. Let u(X) = E(Y | X) and let v(X) be any other function of X. By adding and subtracting u(X), expanding, and using the result of Exercise 3, show that

E{[Y - v(X)]^2} = E{[Y - u(X)]^2} + E{[u(X) - v(X)]^2}.

Mathematical Exercise 26. Use the result of the last exercise to show that if v is a function from S into R then

E{[E(Y | X) - Y]^2} <= E{[v(X) - Y]^2}

and equality holds if and only if v(X) = E(Y | X) (with probability 1).

Suppose that X is real-valued. In the section on covariance and correlation, we found that the best linear predictor of Y based on X is

Y* = aX + b where a = cov(X, Y) / var(X) and b = E(Y) - a E(X).

On the other hand, E(Y | X) is the best predictor of Y among all functions of X. It follows that if E(Y | X) happens to be a linear function of X then E(Y | X) must agree with Y*.
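This is easy to see by simulation. A sketch, assuming NumPy, for a hypothetical model in which the regression function is linear: X standard normal and, given X, Y normal with mean 2X + 1.

    # When E(Y | X) is linear, the best linear predictor recovers it.
    # Hypothetical model: E(Y | X) = 2X + 1. Assumes NumPy.
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.standard_normal(500_000)
    y = rng.normal(2 * x + 1, 1.0)

    a = np.cov(x, y)[0, 1] / np.var(x)   # a = cov(X, Y) / var(X)
    b = y.mean() - a * x.mean()          # b = E(Y) - a E(X)
    print(a, b)  # approximately 2 and 1, so Y* = 2X + 1 = E(Y | X)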

Mathematical Exercise 27. Using properties of conditional expected value, show directly that if E(Y | X) = aX + b, then a and b are as given above in the definition of Y*.

Mathematical Exercise 28. Suppose that (X, Y) has density function f(x, y) = x + y for 0 < x < 1, 0 < y < 1.

  1. Find Y*, the best linear predictor of Y based on X.
  2. Find E(Y | X)
  3. Graph Y*(x) and E(Y | X = x), as functions of x, on the same axes.

Mathematical Exercise 29. Suppose that (X, Y) has density function f(x, y) = 2(x + y) for 0 < x < y < 1.

  1. Find Y*, the best linear predictor of Y based on X.
  2. Find E(Y | X)
  3. Graph Y*(x) and E(Y | X = x), as functions of x, on the same axes.

Mathematical Exercise 30. Suppose that (X, Y) has density function f(x, y) = 6x^2 y for 0 < x < 1, 0 < y < 1.

  1. Find Y*, the best linear predictor of Y based on X.
  2. Find E(Y | X)
  3. Graph Y*(x) and E(Y | X = x), as functions of x, on the same axes.

Mathematical Exercise 31. Suppose that (X, Y) has density function f(x, y) = 15x^2 y for 0 < x < y < 1.

  1. Find Y*, the best linear predictor of Y based on X.
  2. Find E(Y | X)
  3. Graph Y*(x) and E(Y | X = x), as functions of x, on the same axes.

The mean square error of the predictor E(Y | X) will be studied next.

Conditional Variance

The conditional variance of Y given X is naturally defined as follows:

var(Y | X) = E{[Y - E(Y | X)]^2 | X}.

Mathematical Exercise 32. Show that var(Y | X) = E(Y^2 | X) - [E(Y | X)]^2.

Mathematical Exercise 33. Show that var(Y) = E[var(Y | X)] + var[E(Y | X)].
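A Monte Carlo check of the identity in Exercise 33, assuming NumPy and reusing the earlier hypothetical model (E(Y | X) = X^2, var(Y | X) = 1):

    # Check of var(Y) = E[var(Y | X)] + var[E(Y | X)], assuming NumPy.
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_normal(1_000_000)
    y = rng.normal(x**2, 1.0)       # E(Y | X) = X^2, var(Y | X) = 1

    print(np.var(y))                # approximately 3
    print(1 + np.var(x**2))         # E[var(Y | X)] + var[E(Y | X)]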

Let us return to the study of predictors of the real-valued random variable Y, and compare the three predictors we have studied in terms of mean square error. First, the best constant predictor of Y is

µ = E(Y),

with mean square error var(Y) = E[(Y - µ)^2].

Next, if X is another real-valued random variable, then as we showed in the section on covariance and correlation, the best linear predictor of Y based on X is

Y* = E(Y) + [cov(X, Y) / var(X)][X - E(X)],

with mean square error E[(Y - Y*)^2] = var(Y)[1 - cor^2(X, Y)].

Finally, if X is a general random variable, then as we have shown in this section, the best overall predictor of Y based on X is

E(Y | X)

with mean square error E[var(Y | X)] = var(Y) - var[E(Y | X)].

Mathematical Exercise 34. Suppose that (X, Y) has density function f(x, y) = x + y for 0 < x < 1, 0 < y < 1. Continue Exercise 28 by finding

  1. var(Y)
  2. var(Y)[1 - cor^2(X, Y)]
  3. var(Y) - var[E(Y | X)]

Mathematical Exercise 35. Suppose that (X, Y) has density function f(x, y) = 2(x + y) for 0 < x < y < 1. Continue Exercise 29 by finding

  1. var(Y)
  2. var(Y)[1 - cor^2(X, Y)]
  3. var(Y) - var[E(Y | X)]

Mathematical Exercise 36. Suppose that (X, Y) has density function f(x, y) = 6x^2 y for 0 < x < 1, 0 < y < 1. Continue Exercise 30 by finding

  1. var(Y)
  2. var(Y)[1 - cor^2(X, Y)]
  3. var(Y) - var[E(Y | X)]

Mathematical Exercise 37. Suppose that (X, Y) has density function f(x, y) = 15x^2 y for 0 < x < y < 1. Continue Exercise 31 by finding

  1. var(Y)
  2. var(Y)[1 - cor^2(X, Y)]
  3. var(Y) - var[E(Y | X)]

Mathematical Exercise 38. Suppose that X is uniformly distributed on (0, 1), and that given X, Y is uniformly distributed on (0, X). Find

  1. E(Y | X)
  2. var(Y | X)
  3. var(Y)

Random Sums of Random Variables

Suppose that X1, X2, ... are independent and identically distributed real-valued random variables. Denote the common mean, variance, and moment generating function of these variables as follows:

a = E(Xi), b^2 = var(Xi), M(t) = E[exp(tXi)].

Suppose also that N is a random variable taking values in {0, 1, 2, ...}, independent of X1, X2, ... Denote the mean, variance, and probability generating function of N as follows:

c = E(N), d^2 = var(N), G(t) = E(t^N).

Now define

Y = X1 + X2 + ··· + XN (where Y = 0 if N = 0)

Note that Y is a random sum of random variables: the terms in the sum are random, and the number of terms is random. This type of variable occurs in many different contexts. For example, N might represent the number of customers who enter a store in a given period of time, and Xi the amount spent by customer i.

Mathematical Exercise 39. Show that E(Y | N) = Na.

Mathematical Exercise 40. Show that E(Y) = ca.

Mathematical Exercise 41. Show that var(Y | N) = Nb^2.

Mathematical Exercise 42. Show that var(Y) = cb^2 + a^2 d^2.

Mathematical Exercise 43. Show that E[exp(tY)] = G[M(t)].
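These formulas are easy to check by simulation. A sketch, assuming NumPy, with the hypothetical choices N Poisson with mean 5 (so c = 5, d^2 = 5) and Xi exponential with mean 2 (so a = 2, b^2 = 4), giving E(Y) = ca = 10 and var(Y) = cb^2 + a^2 d^2 = 40:

    # Simulation of a random sum Y = X1 + ... + XN, assuming NumPy.
    import numpy as np

    rng = np.random.default_rng(6)
    n = rng.poisson(5, size=100_000)
    # sum k exponential terms for each run; Y = 0 when N = 0
    y = np.array([rng.exponential(2, size=k).sum() for k in n])

    print(y.mean(), y.var())   # approximately 10 and 40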

Mathematical Exercise 44. In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let N denote the die score and X the number of heads.

  1. Find the conditional distribution of X given N.
  2. Find E(X | N).
  3. Find var(X | N).
  4. Find E(X).
  5. Find var(X).

Simulation Exercise 45. Run the die-coin experiment 1000 times, updating every 10 runs. Note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
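If the applet is not available, a minimal stand-in, assuming NumPy:

    # Die-coin experiment, assuming NumPy: roll a fair die, then toss a
    # fair coin that many times and count the heads.
    import numpy as np

    rng = np.random.default_rng(7)
    n = rng.integers(1, 7, size=1000)   # die score N
    x = rng.binomial(n, 0.5)            # number of heads X given N

    print(x.mean(), x.std())  # compare with E(X) and sd(X) from Exercise 44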

Mathematical Exercise 46. The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the mean and standard deviation of the amount of money spent during the hour.

Mixtures

Suppose that X1, X2, ... are real-valued random variables and that N is a random variable taking values in {1, 2, ...}, independent of X1, X2, ... Denote the means, variances, and moment generating functions as follows:

µi = E(Xi), di^2 = var(Xi), Mi(t) = E[exp(tXi)] for each i.

Denote the density function of N by

pi = P(N = i) for i = 1, 2, ...

Now define a new random variable X by the condition

X = Xi if and only if N = i.

Recall that the distribution of X is a mixture of the distributions of X1, X2, ...

Mathematical Exercise 47. Show that E(X | N) = µN.

Mathematical Exercise 48. Show that E(X) = ∑i pi µi.

Mathematical Exercise 49. Show that var(X) = ∑i pi (di^2 + µi^2) - (∑i pi µi)^2.

Mathematical Exercise 50. Show that E[exp(tX)] = ∑i pi Mi(t).

Mathematical Exercise 51. In the coin-die experiment, a biased coin is tossed with probability of heads 1/3. If the coin lands tails, a fair die is rolled; if the coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability 1/4 each, faces 2, 3, 4, 5 have probability 1/8 each). Find the mean and standard deviation of the die score.

Simulation Exercise 52. Run the coin-die experiment 1000 times, updating every 10 runs. Note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
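Again, a minimal stand-in for the applet, assuming NumPy:

    # Coin-die experiment, assuming NumPy: with probability 1/3 roll the
    # ace-six flat die, otherwise roll a fair die.
    import numpy as np

    rng = np.random.default_rng(8)
    faces = np.arange(1, 7)
    fair = np.full(6, 1 / 6)
    flat = np.array([1 / 4, 1 / 8, 1 / 8, 1 / 8, 1 / 8, 1 / 4])

    heads = rng.random(1000) < 1 / 3
    score = np.where(heads,
                     rng.choice(faces, size=1000, p=flat),
                     rng.choice(faces, size=1000, p=fair))

    print(score.mean(), score.std())  # compare with Exercise 51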

Projection

Recall that the set of real-valued random variables on a given probability space (that is, for a given random experiment), with finite second moment, forms an inner product space, with inner product given by

<U, V> = E(UV).

In this context, suppose that Y is a real-valued random variable and X a general random variable. Then E(Y | X) is simply the projection of Y onto the subspace of real-valued random variables that can be expressed as functions of X.
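The projection property can be seen numerically: by Exercises 2 and 3, the residual Y - E(Y | X) is orthogonal to every function of X in this inner product. A sketch, assuming NumPy and the same hypothetical model as in the earlier sections:

    # Orthogonality of the residual Y - E(Y | X) to functions of X,
    # assuming NumPy. Hypothetical model: E(Y | X) = X^2.
    import numpy as np

    rng = np.random.default_rng(9)
    x = rng.standard_normal(1_000_000)
    y = rng.normal(x**2, 1.0)
    resid = y - x**2                   # Y - E(Y | X)

    print(np.mean(resid * np.sin(x)))  # approximately 0: <resid, sin(X)>
    print(np.mean(resid * x**3))       # approximately 0: <resid, X^3>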