
4. The Multivariate Hypergeometric Distribution


Suppose now that we have a multi-type population, in which each object is one of k types. For example, we could have an urn with balls of several different colors, or a population of voters, each of whom is a democrat, a republican, or an independent. Let Di denote the subset of all type i objects and let Ni denote the number of type i objects, for i = 1, 2, ..., k. Thus

D = D1 union D2 union ··· union Dk and N = N1 + N2 + ··· + Nk.

The dichotomous model considered earlier is clearly a special case, with k = 2. As in the basic sampling model, we sample n objects at random from D:

X = (X1, X2, ..., Xn), where Xi in D is the i'th object chosen.

Now let Yi denote the number of type i objects in the sample, for i = 1, 2, ..., k. Note that

Y1 + Y2 + ··· + Yk = n,

so if we know the values of k - 1 of the counting variables, we can find the value of the remaining counting variable. As with any counting variable, we can express Yi as a sum of indicator variables:

Mathematical Exercise 1. Show that Yi = Ii1 + Ii2 + ··· + Iin where Iij = 1 if Xj in Di and Iij = 0 otherwise.

We assume initially that the sampling is without replacement, since this is the realistic case in most applications.

Distributions

Basic combinatorial arguments can be used to derive the joint density of the counting variables. Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the combinations of size n chosen from D.

Mathematical Exercise 2. Show that for nonnegative integers j1, j2, ..., jk with j1 + j2 + ··· + jk = n,

P(Y1 = j1, Y2 = j2, ..., Yk = jk) = C(N1, j1)C(N2, j2) ··· C(Nk, jk) / C(N, n).

The distribution of (Y1, Y2, ..., Yk) is called the multivariate hypergeometric distribution with parameters N, N1, N2, ..., Nk, and n. We also say that (Y1, Y2, ..., Yk - 1) has this distribution (recall again that the values of any k - 1 of the variables give the value of the remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to k = 2.
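
The density function in Exercise 2 is easy to evaluate directly. Below is a minimal computational sketch in Python (standard library only, Python 3.8+ for math.comb and math.prod); the urn composition and sample size are chosen purely for illustration, and the function name mvhyper_pmf is ours.

    from math import comb, prod

    def mvhyper_pmf(js, counts, n):
        # Joint density from Exercise 2: product of C(Ni, ji), divided by C(N, n).
        N = sum(counts)
        if sum(js) != n or any(j < 0 or j > Ni for j, Ni in zip(js, counts)):
            return 0.0
        return prod(comb(Ni, j) for Ni, j in zip(counts, js)) / comb(N, n)

    # Illustrative urn: 5 red, 4 green, 3 blue balls; sample 6 without replacement.
    print(mvhyper_pmf((3, 2, 1), (5, 4, 3), 6))   # P(Y1 = 3, Y2 = 2, Y3 = 1) ~ 0.1948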

Mathematical Exercise 3. Show the following alternate form of the multivariate hypergeometric density in two ways: combinatorially, using the fact that the ordered sample is uniformly distributed over the permutations of size n chosen from D, and algebraically, starting with the result in Exercise 2.

P(Y1 = j1, Y2 = j2, ..., Yk = jk) = C(n; j1, j2, ..., jk) (N1)^(j1) (N2)^(j2) ··· (Nk)^(jk) / (N)^(n),

where C(n; j1, j2, ..., jk) is the multinomial coefficient and (a)^(j) = a(a - 1)···(a - j + 1) denotes the falling power.
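
The two forms can be compared numerically. The following sketch (same illustrative urn as above; the helper names falling and multinomial_coef are ours) evaluates the falling-power form of Exercise 3 and checks that it agrees with the binomial-coefficient form of Exercise 2.

    from math import comb, factorial, prod

    def falling(a, j):
        # Falling power a(a - 1)···(a - j + 1).
        return prod(a - i for i in range(j))

    def multinomial_coef(n, js):
        # Multinomial coefficient C(n; j1, ..., jk) = n! / (j1! ··· jk!).
        return factorial(n) // prod(factorial(j) for j in js)

    counts, js, n = (5, 4, 3), (3, 2, 1), 6
    N = sum(counts)
    form2 = prod(comb(Ni, j) for Ni, j in zip(counts, js)) / comb(N, n)
    form3 = multinomial_coef(n, js) * prod(falling(Ni, j) for Ni, j in zip(counts, js)) / falling(N, n)
    print(form2, form3)   # the two expressions give the same probability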

Mathematical Exercise 4. Show that Yi has the hypergeometric distribution with parameters N, Ni, and n:

P(Yi = j) = C(Ni, j)C(N - Ni, n - j) / C(N, n) for j = 0, 1, ..., n.
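
As a numerical check of Exercise 4, the sketch below sums the joint density of Exercise 2 over the other counts and compares the result with the ordinary hypergeometric density (SciPy users could equally use scipy.stats.hypergeom; only the standard library is assumed here, with the same illustrative urn).

    from math import comb, prod

    counts, n = (5, 4, 3), 6      # illustrative urn
    N, N1 = sum(counts), counts[0]
    j = 2                         # P(Y1 = 2), Y1 = number of type 1 objects in the sample

    # Marginal by summing the joint density over Y2 (Y3 is then determined).
    total = 0.0
    for j2 in range(n - j + 1):
        js = (j, j2, n - j - j2)
        if all(0 <= ji <= Ni for ji, Ni in zip(js, counts)):
            total += prod(comb(Ni, ji) for Ni, ji in zip(counts, js)) / comb(N, n)

    # Ordinary hypergeometric density with parameters N, N1, and n.
    direct = comb(N1, j) * comb(N - N1, n - j) / comb(N, n)
    print(total, direct)          # the two values agree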

The multivariate hypergeometric distribution is preserved when the counting variables are combined. Specifically, suppose that A1, A2, ..., Al is a partition of the index set {1, 2, ..., k} into nonempty subsets. For each j, let Wj denote the sum of Yi over i in Aj, and let Mj denote the sum of Ni over i in Aj.

Mathematical Exercise 5. Show that (W1, W2, ..., Wl) has the multivariate hypergeometric distribution with parameters N, M1, M2, ..., Ml, and n.

The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed. Specifically, suppose that A, B is a partition of the index set {1, 2, ..., k} into nonempty subsets. Suppose that we observe Yj = yj for j in B. Let z denote the sum of yj over j in B, and let M denote the sum of Ni over i in A.

Mathematical Exercise 6. Show that the conditional distribution of (Yi, i in A) given (Yj = yj, j in B) is multivariate hypergeometric with parameters M, Ni for i in A, and n - z.

Combinations of the basic results in Exercises 5 and 6 can be used to compute any marginal or conditional distributions of the counting variables.
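
For instance, Exercise 5 can be checked numerically for a small example: group types 1 and 2 of the illustrative urn used above and compare the summed joint density with a hypergeometric density, as in the sketch below (Exercise 6 can be checked the same way, by fixing one count and renormalizing).

    from math import comb, prod

    counts, n = (5, 4, 3), 6           # illustrative urn
    N = sum(counts)
    M1 = counts[0] + counts[1]         # grouped sizes: M1 = N1 + N2, M2 = N3
    w = 4                              # P(W1 = 4), where W1 = Y1 + Y2

    # Left side: sum the joint density of Exercise 2 over all (j1, j2) with j1 + j2 = w.
    lhs = 0.0
    for j1 in range(w + 1):
        js = (j1, w - j1, n - w)
        if all(0 <= ji <= Ni for ji, Ni in zip(js, counts)):
            lhs += prod(comb(Ni, ji) for Ni, ji in zip(counts, js)) / comb(N, n)

    # Right side: hypergeometric density with parameters N, M1, and n (Exercise 5 with l = 2).
    rhs = comb(M1, w) * comb(N - M1, n - w) / comb(N, n)
    print(lhs, rhs)                    # the two values agree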

Moments

We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the representation in terms of indicator variables are the main tools.

Mathematical Exercise 7. Show that

  1. E(Yi) = n Ni / N
  2. var(Yi) = n (Ni / N)(1 - Ni / N) (N - n) / (N - 1)
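
The closed forms in Exercise 7 can be checked against the marginal density of Exercise 4, as in the following sketch (same illustrative urn).

    from math import comb

    counts, n = (5, 4, 3), 6
    N, N1 = sum(counts), counts[0]

    # Moments of Y1 computed directly from the hypergeometric density of Exercise 4.
    pmf = [comb(N1, j) * comb(N - N1, n - j) / comb(N, n) for j in range(n + 1)]
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))

    # Closed forms from Exercise 7.
    print(mean, n * N1 / N)
    print(var, n * (N1 / N) * (1 - N1 / N) * (N - n) / (N - 1))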

Mathematical Exercise 8. Suppose that i and j are distinct. Show that

  1. cov(Iir, Ijr) = -NiNj / N^2 for r = 1, 2, ..., n.
  2. cov(Iir, Ijs) = NiNj / [N^2(N - 1)] for distinct r, s = 1, 2, ..., n.

Mathematical Exercise 9. Suppose that i and j are distinct. Show that

  1. cor(Iir, Ijr) = -{NiNj / [(N - Ni)(N - Nj)]}^(1/2) for r = 1, 2, ..., n.
  2. cor(Iir, Ijs) = {NiNj / [(N - Ni)(N - Nj)]}^(1/2) / (N - 1) for distinct r, s = 1, 2, ..., n.

In particular, for distinct i and j, the indicators Iir and Ijr (same draw) are negatively correlated, while Iir and Ijs are positively correlated for distinct draws r and s. Do these results seem reasonable?

Mathematical Exercise 10. Use the results of Exercises 7 and 8 to show that for distinct i and j,

  1. cov(Yi, Yj) = -(nNiNj / N^2)[(N - n) / (N - 1)]
  2. cor(Yi, Yj) = -{NiNj / [(N - Ni)(N - Nj)]}^(1/2).
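
The covariance in Exercise 10 can also be checked by simulation. The sketch below assumes NumPy is available; it draws many samples without replacement from a labeled population and compares the sample covariance of (Y1, Y2) with the closed form (the urn and the number of replications are arbitrary).

    import numpy as np

    rng = np.random.default_rng(0)
    counts, n = (5, 4, 3), 6                                  # illustrative urn
    N = sum(counts)
    population = np.repeat(np.arange(len(counts)), counts)    # type label of each object

    draws = np.array([rng.choice(population, size=n, replace=False) for _ in range(50_000)])
    Y1 = (draws == 0).sum(axis=1)
    Y2 = (draws == 1).sum(axis=1)

    estimate = np.cov(Y1, Y2)[0, 1]
    exact = -(n * counts[0] * counts[1] / N ** 2) * (N - n) / (N - 1)
    print(estimate, exact)           # the estimate should be close to the exact value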

Sampling with Replacement

Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.

Mathematical Exercise 11. Show that the types of the objects in the sample form a sequence of n multinomial trials with parameters N1 / N, N2 / N, ..., Nk / N.

The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above could also be used.

Mathematical Exercise 12. Show that (Y1, Y2, ..., Yk) has the multinomial distribution with parameters n and N1 / N, N2 / N, ..., Nk / N: for nonnegative integers j1, j2, ..., jk with j1 + j2 + ··· + jk = n,

P(Y1 = j1, Y2 = j2, ..., Yk = jk) = C(n; j1, j2, ..., jk) N1^j1 N2^j2 ··· Nk^jk / N^n.
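
For a small numerical comparison with the without-replacement case, the sketch below evaluates the multinomial density of Exercise 12 and the multivariate hypergeometric density of Exercise 2 at the same point (standard library only; SciPy users could instead call scipy.stats.multinomial).

    from math import comb, factorial, prod

    counts, n = (5, 4, 3), 6          # illustrative urn
    N = sum(counts)
    js = (3, 2, 1)

    # Multinomial density of Exercise 12 (sampling with replacement).
    coef = factorial(n) // prod(factorial(j) for j in js)
    with_repl = coef * prod(Ni ** j for Ni, j in zip(counts, js)) / N ** n

    # Multivariate hypergeometric density of Exercise 2 (without replacement), for contrast.
    without_repl = prod(comb(Ni, j) for Ni, j in zip(counts, js)) / comb(N, n)
    print(with_repl, without_repl)    # noticeably different here, since N is small relative to n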

Mathematical Exercise 13. Show that

  1. E(Yi) = n Ni / N.
  2. var(Yi) = n (Ni / N)(1 - Ni / N).
  3. cov(Yi, Yj) = -(nNiNj / N^2) for distinct i, j.
  4. cor(Yi, Yj) = -{NiNj / [(N - Ni)(N - Nj)]}^(1/2) for distinct i, j.

Convergence of the Multivariate Hypergeometric to the Multinomial

Suppose that the population size N is very large compared to the sample size n. In this case, it seems reasonable that sampling without replacement should not be very different from sampling with replacement, and hence that the multivariate hypergeometric distribution should be well approximated by the multinomial. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases we do not know the population size exactly.

Mathematical Exercise 14. Suppose that Ni depends on N and that

Ni / N converges to pi in [0, 1] as N converges to infinity for i = 1, 2, ..., k.

Show that for fixed n, the multivariate hypergeometric density function with parameters N, N1, N2, ..., Nk, and n converges to the multinomial density with parameters n and p1, p2, ..., pk. Hint: Use the representation in Exercise 3.
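
The convergence can be seen numerically: hold n and the point (j1, ..., jk) fixed, let N grow with Ni ~ pi N, and compare the two densities, as in the sketch below (the proportions and the evaluation point are chosen only for illustration).

    from math import comb, factorial, prod

    p = (0.5, 0.3, 0.2)                  # fixed proportions p1, p2, p3
    n, js = 6, (3, 2, 1)

    coef = factorial(n) // prod(factorial(j) for j in js)
    multinom = coef * prod(pi ** j for pi, j in zip(p, js))

    for N in (20, 100, 1000, 10_000):
        counts = [round(pi * N) for pi in p]           # Ni ~ pi N
        hyper = prod(comb(Ni, j) for Ni, j in zip(counts, js)) / comb(sum(counts), n)
        print(N, hyper, multinom)                      # hyper approaches multinom as N grows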

Computational Problems

Mathematical Exercise 15. Suppose that a bridge hand (13 cards) is dealt at random from a standard deck of 52 cards. Find the probability that the hand has

  1. 4 hearts.
  2. 4 hearts and 3 spades.
  3. 4 hearts, 3 spades, and 2 clubs.
  4. 7 red cards and 6 black cards.
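
One way to compute the probabilities in Exercise 15 is directly from the density of Exercise 2, grouping the unspecified suits as in Exercise 5; a sketch follows.

    from math import comb

    deck, hand = 52, 13

    # 1. 4 hearts (hearts versus the other 39 cards).
    p1 = comb(13, 4) * comb(39, 9) / comb(deck, hand)
    # 2. 4 hearts and 3 spades (the remaining 6 cards come from the other 26).
    p2 = comb(13, 4) * comb(13, 3) * comb(26, 6) / comb(deck, hand)
    # 3. 4 hearts, 3 spades, and 2 clubs (hence 4 diamonds).
    p3 = comb(13, 4) * comb(13, 3) * comb(13, 2) * comb(13, 4) / comb(deck, hand)
    # 4. 7 red cards and 6 black cards (grouping the suits by color).
    p4 = comb(26, 7) * comb(26, 6) / comb(deck, hand)
    print(p1, p2, p3, p4)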

Mathematical Exercise 16. Suppose that a bridge hand (13 cards) is dealt at random from a standard deck of 52 cards. Find the

  1. mean and variance of the number of hearts.
  2. covariance between the number of hearts and the number of spades.
  3. correlation between the number of hearts and the number of spades.

Mathematical Exercise 17. A population of 100 voters consists of 40 republicans, 35 democrats, and 25 independents. A random sample of 10 voters is chosen.

  1. Find the probability that the sample contains at least 4 republicans, at least 3 democrats, and at least 2 independents.
  2. Find the multinomial approximation to the probability in part 1.
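
For Exercise 17, one computational approach is to sum the joint density over the qualifying outcomes, and then to repeat the sum with the multinomial density for the approximation in part 2; a sketch follows.

    from math import comb, factorial, prod

    counts, n = (40, 35, 25), 10             # republicans, democrats, independents
    N = sum(counts)

    def hyper(js):
        return prod(comb(Ni, j) for Ni, j in zip(counts, js)) / comb(N, n)

    def multinom(js):
        coef = factorial(n) // prod(factorial(j) for j in js)
        return coef * prod((Ni / N) ** j for Ni, j in zip(counts, js))

    # Sum over all (j1, j2, j3) with j1 >= 4, j2 >= 3, j3 >= 2 and j1 + j2 + j3 = 10.
    exact = approx = 0.0
    for j1 in range(4, n + 1):
        for j2 in range(3, n - j1 + 1):
            j3 = n - j1 - j2
            if j3 >= 2:
                exact += hyper((j1, j2, j3))
                approx += multinom((j1, j2, j3))
    print(exact, approx)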

Mathematical Exercise 18. A bridge hand (13 cards) is dealt at random from a deck of 52 cards. Find the conditional probability that the hand has

  1. 4 hearts and 3 spades given 4 clubs.
  2. 4 hearts given 3 spades and 2 clubs.

Voids

In the card experiment, a hand that does not contain any cards of a particular suit is said to be void in that suit.

Mathematical Exercise 19. Use the inclusion-exclusion rule to show that the probability that a poker hand is void in at least one suit is

1,913,496 / 2,598,960 ~ 0.736.

Simulation Exercise 20. In the card experiment, set n = 5. Run the simulation 1000 times, updating after each run. Compute the relative frequency of the event that the hand is void in at least one suit. Compare the relative frequency with the true probability given in Exercise 19.

Mathematical Exercise 21. Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is

32,427,298,180 / 635,013,559,600 ~ 0.051.
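
Both void probabilities (Exercises 19 and 21) come from the same inclusion-exclusion computation, sketched below for a hand of m cards dealt from a standard 52-card deck (the function name prob_void is ours; math.comb returns 0 for impossible terms, so no special cases are needed).

    from math import comb

    def prob_void(m):
        # Inclusion-exclusion over the four suits: P(void in at least one suit).
        total = sum((-1) ** (r + 1) * comb(4, r) * comb(52 - 13 * r, m) for r in range(1, 5))
        return total / comb(52, m)

    print(prob_void(5))    # poker hand, Exercise 19: 1,913,496 / 2,598,960 ~ 0.736
    print(prob_void(13))   # bridge hand, Exercise 21: 32,427,298,180 / 635,013,559,600 ~ 0.051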