2. The Hypergeometric Distribution


Suppose that we have a dichotomous population D that consists of two types of objects. For example, we could have balls in an urn that are either red or green, a batch of electronic components that are either good or defective, a population of people who are either male or female, or a population of animals that are either tagged or untagged. Let N denote the number of objects in D, let D1 denote the subset of D consisting of the type 1 objects, and suppose that D1 has cardinality R. As in the basic sampling model, we sample n objects at random from D:

X = (X1, X2, ..., Xn), where Xi in D is the i'th object chosen.

In this section, we are interested in the random variable Y that gives the number of type 1 objects in the sample. Note that Y is a counting variable, and thus, like all counting variables, can be written as a sum of indicator variables.

Mathematical Exercise 1. Show that Y = I1 + I2 + ··· + In where Ii = 1 if Xi is in D1 (the i'th object is type 1) and Ii = 0 otherwise.

We will assume initially that the sampling is without replacement, which is usually the realistic setting with dichotomous populations.

The Density Function

Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the set of all combinations of size n chosen from D. This observation leads to a simple combinatorial derivation of the density of Y.

Mathematical Exercise 2. Show that for k = max{0, n - (N - R)}, ..., min{n, R},

P(Y = k) = C(R, k) C(N - R, n - k) / C(N, n).

This is known as the hypergeometric distribution with parameters N, R, and n. If we adopt the convention that C(j, i) = 0 for i > j then the formula for the density function is correct for k = 0, 1, ..., n.
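The density in Exercise 2 can be checked numerically. The following is a minimal sketch, not part of the original text: the function name hypergeom_pmf is hypothetical, and the parameter values are borrowed from Simulation Exercise 4 purely for illustration. Exact fractions are used so that the total probability can be compared with 1 without rounding error.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, R, n):
    """P(Y = k) = C(R, k) C(N - R, n - k) / C(N, n), as an exact fraction."""
    if k < 0 or k > n:
        return Fraction(0)
    # math.comb(j, i) returns 0 when i > j, which matches the convention
    # C(j, i) = 0 for i > j adopted in the text.
    return Fraction(comb(R, k) * comb(N - R, n - k), comb(N, n))


# Illustrative parameters (an assumption; any valid N, R, n would do).
N, R, n = 50, 30, 10
density = [hypergeom_pmf(k, N, R, n) for k in range(n + 1)]
total = sum(density)
print(total)  # prints 1
```

Because of the convention on C(j, i), the same function also covers values of k outside the range max{0, n - (N - R)}, ..., min{n, R}, where it returns 0.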

Mathematical Exercise 3. Show the following alternative form of the hypergeometric density in two ways: combinatorially by treating the outcome as a permutation of size n chosen from the population of N balls, and algebraically, starting from the result in Exercise 2.

P(Y = k) = C(n, k) (R)_k (N - R)_{n - k} / (N)_n for k = 0, 1, ..., n.
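The agreement between the two forms can be confirmed numerically before attempting the proofs. The sketch below assumes that the subscripted symbols are falling powers, so that (m)_j = m(m - 1)···(m - j + 1) is the number of permutations of size j chosen from m objects; the function names and parameter values are hypothetical. It illustrates the identity for one parameter set and is not a substitute for either derivation.

```python
from fractions import Fraction
from math import comb, perm


def pmf_combinations(k, N, R, n):
    """Form from Exercise 2, based on unordered samples."""
    return Fraction(comb(R, k) * comb(N - R, n - k), comb(N, n))


def pmf_permutations(k, N, R, n):
    """Form from Exercise 3, based on ordered samples."""
    # math.perm(m, j) = m (m - 1) ... (m - j + 1), and is 0 when j > m.
    return Fraction(comb(n, k) * perm(R, k) * perm(N - R, n - k), perm(N, n))


# Illustrative parameters (an assumption, not taken from Exercise 3).
N, R, n = 50, 30, 10
forms_agree = all(
    pmf_combinations(k, N, R, n) == pmf_permutations(k, N, R, n)
    for k in range(n + 1)
)
print(forms_agree)
```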

Simulation Exercise 4. In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the shape of the graph of the density function. Now let N = 50, R = 30, and n = 10 and run the experiment with an update frequency of 100. Watch the apparent convergence of the relative frequency function to the density function.

Moments

In the following exercises, we will derive the mean and variance of Y. The exchangeable property of the indicator variables, and properties of covariance and correlation, will play a key role.

Mathematical Exercise 5. Show E(Ii) = R / N for any i.

Mathematical Exercise 6. Show that E(Y) = n (R / N).

Mathematical Exercise 8. Show that var(Ii) = (R / N) (1 - R / N) for any i.

Mathematical Exercise 9. Show that for distinct i and j,

  1. cov(Ii, Ij) = -(R / N) (1 - R / N) [1 / (N - 1)]
  2. cor(Ii, Ij) = -1 / (N - 1)

Note from Exercise 9 that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if N = 2. Think about these results intuitively.

Simulation Exercise 10. In the ball and urn experiment, set N = 50, R = 20, and n = 10. Now run the experiment 500 times, updating after each run. Compute the empirical correlation of the events of a red ball on draw 3 and a red ball on draw 7. Compare with the theoretical result in the last exercise.
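For readers without access to the ball and urn applet, the experiment can be imitated in a few lines. This is a rough sketch under stated assumptions: the function name empirical_correlation is hypothetical, the objects are labeled 0, ..., N - 1 with the first R labels treated as red, and Python's random.sample stands in for the applet's sampling mechanism.

```python
import random


def empirical_correlation(N, R, n, i, j, runs, seed=0):
    """Empirical correlation of the indicators of a red ball on draws i and j (1-based)."""
    rng = random.Random(seed)
    population = range(N)  # labels 0, ..., R - 1 are treated as red (type 1)
    sum_i = sum_j = sum_ij = 0
    for _ in range(runs):
        sample = rng.sample(population, n)  # ordered sample without replacement
        a = 1 if sample[i - 1] < R else 0
        b = 1 if sample[j - 1] < R else 0
        sum_i += a
        sum_j += b
        sum_ij += a * b
    mean_i, mean_j = sum_i / runs, sum_j / runs
    cov = sum_ij / runs - mean_i * mean_j
    # For a 0/1 variable, the empirical variance is mean * (1 - mean).
    sd_i = (mean_i * (1 - mean_i)) ** 0.5
    sd_j = (mean_j * (1 - mean_j)) ** 0.5
    return cov / (sd_i * sd_j)


N, R, n = 50, 20, 10
estimate = empirical_correlation(N, R, n, i=3, j=7, runs=500)
theory = -1 / (N - 1)
print(estimate, theory)
```

With only 500 runs the estimate is quite noisy relative to the small theoretical value -1 / 49, so close agreement should not be expected; increasing runs tightens the estimate.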

Mathematical Exercise 11. Use the results of Exercises 8 and 9 to show that

var(Y) = n (R / N)(1 - R / N) (N - n) / (N - 1).

Note that var(Y) = 0 if R = 0, R = N, or n = N. Think about these results.
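The formulas for E(Y) and var(Y) can be compared with moments computed directly from the density. The following sketch is an illustration only: the function name hypergeom_moments is hypothetical, and the parameter values are chosen for convenience. Exact fractions are used so that the comparison is an equality rather than an approximation.

```python
from fractions import Fraction
from math import comb


def hypergeom_moments(N, R, n):
    """Mean and variance of Y computed directly from the density function."""
    pmf = [
        Fraction(comb(R, k) * comb(N - R, n - k), comb(N, n))
        for k in range(n + 1)
    ]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum(k * k * p for k, p in enumerate(pmf)) - mean ** 2
    return mean, var


# Illustrative parameters (an assumption).
N, R, n = 50, 30, 10
mean, var = hypergeom_moments(N, R, n)

p = Fraction(R, N)
formula_mean = n * p
formula_var = n * p * (1 - p) * Fraction(N - n, N - 1)
print(mean == formula_mean, var == formula_var)
```

Calling hypergeom_moments with R = 0, R = N, or n = N gives a variance of 0, in line with the note above: in each of these cases the value of Y is determined before any sampling takes place.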

Simulation Exercise 14. In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the size and location of the mean/standard deviation bar. Now let N = 50, R = 30, and n = 10 and run the experiment with an update frequency of 100. Watch the apparent convergence of the empirical moments to the true moments.

Mathematical Exercise 15. A batch of 100 computer chips contains 10 defective chips. Five chips are chosen at random, without replacement.

  1. Compute explicitly the density function of the number of defective chips in the sample.
  2. Compute explicitly the mean and variance of the number of defective chips in the sample.
  3. Find the probability that the sample contains at least one defective chip.

Mathematical Exercise 16. A club contains 50 members; 20 are men and 30 are women. A committee of 10 members is chosen at random.

  1. Give the mean and variance of the number of women on the committee.
  2. Give the mean and variance of the number of men on the committee.
  3. Find the probability that the committee members are all the same gender.

Sampling with Replacement

Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.

Mathematical Exercise 17. Show that I1, I2, ..., In form a sequence of n Bernoulli trials with success parameter R / N.

The following results now follow immediately from the general theory of Bernoulli trials, although modifications of the arguments above could also be used.

Mathematical Exercise 18. Show that Y has the binomial distribution with parameters n and R / N:

P(Y = k) = C(n, k) (R / N)^k (1 - R / N)^{n - k} for k = 0, 1, ..., n.

Mathematical Exercise 19. Show that

  1. E(Y) = n(R / N).
  2. var(Y) = n(R / N)(1 - R / N).

Note that for any values of the parameters, E(Y) is the same, whether the sampling is with or without replacement. On the other hand, var(Y) is smaller, by a factor of (N - n) / (N - 1), when the sampling is without replacement than with replacement. Think about these results. The factor (N - n) / (N - 1) is sometimes called the finite population correction factor.
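The effect of the finite population correction factor is easy to tabulate. The sketch below uses hypothetical function names and illustrative values of N and R; it prints the ratio of the two variances for several sample sizes.

```python
from fractions import Fraction


def variance_with_replacement(N, R, n):
    """Binomial variance n (R / N) (1 - R / N)."""
    p = Fraction(R, N)
    return n * p * (1 - p)


def variance_without_replacement(N, R, n):
    """Hypergeometric variance: binomial variance times (N - n) / (N - 1)."""
    return variance_with_replacement(N, R, n) * Fraction(N - n, N - 1)


# Illustrative parameters (an assumption).
N, R = 50, 30
for n in (1, 10, 25, 50):
    ratio = variance_without_replacement(N, R, n) / variance_with_replacement(N, R, n)
    print(n, ratio)
```

The ratio equals 1 when n = 1, since a single draw is the same under either scheme, and falls to 0 when n = N, since the sample is then the whole population.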

Convergence of the Hypergeometric Distribution to the Binomial

Suppose that the population size N is very large compared to the sample size n. In this case, it seems reasonable that sampling without replacement is not much different from sampling with replacement, and hence the hypergeometric distribution should be well approximated by the binomial. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases we do not know the population size exactly.

Mathematical Exercise 20. Suppose that R depends on N and that

R / N converges to p in [0, 1] as N converges to infinity.

Show that for fixed n, the hypergeometric density function with parameters N, R, and n converges to the binomial density with parameters n and p. Hint: Use the representation in Exercise 3.
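The convergence in Exercise 20 can be observed numerically. In the sketch below, the function names are hypothetical and R is taken to be round(p N), which is one convenient way (an assumption, not the only one) to make R / N converge to p. The code reports the largest pointwise gap between the two densities for increasing N; it is an illustration, not a proof.

```python
from math import comb


def hypergeom_pmf(k, N, R, n):
    """Hypergeometric density with parameters N, R, and n."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)


def binom_pmf(k, n, p):
    """Binomial density with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def max_gap(N, n, p):
    """Largest absolute difference between the two densities over k = 0, ..., n."""
    R = round(p * N)  # assumed choice of R so that R / N is close to p
    return max(
        abs(hypergeom_pmf(k, N, R, n) - binom_pmf(k, n, p))
        for k in range(n + 1)
    )


# Illustrative parameters (an assumption).
n, p = 10, 0.3
gaps = [max_gap(N, n, p) for N in (100, 1000, 10000)]
print(gaps)
```

For these values the gap shrinks as N grows with n held fixed, which is the behavior the exercise asks you to establish in general.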

Simulation Exercise 21. In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with replacement. Note the difference between the graphs of the hypergeometric density and the binomial density. Now set N = 100, n = 10, and R = 30. Run the simulation 1000 times, updating every 100 runs. Compare the relative frequency function, the hypergeometric density function, and the approximating binomial density function.

Mathematical Exercise 22. A small pond contains 1000 fish; 100 are tagged. Suppose that 20 fish are caught.

  1. Compute the probability that the sample contains at least 2 tagged fish.
  2. Find the binomial approximation to the probability in part 1.
  3. Compute the relative error of the approximation.

Mathematical Exercise 23. Forty percent of the registered voters in a certain district prefer candidate A. Suppose that 10 voters are chosen at random. Find the probability that at least 5 prefer candidate A.

Mathematical Exercise 24. In the setting of Exercise 20, show that the mean and variance of the hypergeometric distribution converge to the mean and variance of the binomial distribution as N converges to infinity.