Virtual Laboratories > Finite Sampling Models > 1 2 3 4 5 6 7 8 [9] 10

9. The Coupon Collector Problem


Preliminaries

Our random experiment is to sample repeatedly, with replacement, from the population D = {1, 2, ..., N}. This generates a sequence of independent random variables, each uniformly distributed on D:

X1, X2, X3, ...

We will often interpret the sampling in terms of a coupon collector: each time the collector buys a certain product (bubble gum or Cracker Jack, for example) she receives a coupon (a baseball card or a toy, for example) which is equally likely to be any one of N types. Thus, in this setting, Xi is the coupon type received on the i'th purchase.

Let VN, n denote the number of distinct values in the first n selections, the random variable studied in the last section. Our interest is in this section is the sample size needed to get k distinct sample values:

WN, k = min{n: VN, n = k}, k = 1, 2, ..., N.

In terms of the coupon collector, this random variable gives the number of products required to get k distinct coupon types. Note that the possible values of WN, k are k, k + 1, k + 2, .... We will be particularly interested in WN,N, the sample size needed to get the entire population. In terms of the coupon collector, this is the number of products required to get the entire set of coupons.

Simulation Exercise 1. In the coupon collector experiment, set N = 50 and vary k. Note the shape of and position of the density graph. With k = 20, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency distribution to the true distribution.

The Density Function

Now let's find the distribution of WN, k. The results of the previous section will be very helpful

Mathematical Exercise 2. Argue that

WN, k = n if and only if VN, n - 1 = k - 1 and VN, n = k.

Mathematical Exercise 3. Use Exercise 2 and a conditional probability argument to show that

P(WN, k = n) = P(VN, n - 1 = k - 1)(N - k + 1) / N.

Mathematical Exercise 4. Use the result of the last exercise and the distribution of VN, n - 1 from the last section to show that for n = k, k + 1, ...,

P(WN,k = n) = C(N - 1, k - 1) sumj = 0, ..., k - 1 (-1)j C(k - 1, j)[(k - 1 - j) / N]n - 1.

Simulation Exercise 5. In the coupon collector experiment, set N = 100 and vary k. Note the shape of and position of the density graph. With k = 50, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency function to the density function.

Mathematical Exercise 6. Suppose that people are sampled at random until 10 distinct birth weeks are obtained. Find the probability that no more than 12 persons are sampled.

Mathematical Exercise 7. Suppose that a fair die is rolled until all 6 scores have occurred. Find the probability that the number of rolls is at least 10.

Mathematical Exercise 8. A box of a certain brand of cereal comes with a special toy. There are 10 different toys in all. Find the probability that all 10 toys can be obtained with no more than 15 boxes of cereal.

Moments

We will now show that WN, k can be decomposed as a sum of k independent, geometrically distributed variables. This will provide some additional insight into the nature of the distribution and will make the computation of the mean and variance easy.

For i = 1, 2, ... N, let Zi denote the number of additional samples needed to go from i - 1 distinct values to i distinct values.

Mathematical Exercise 9. Argue that

  1. Z1, Z2, ..., ZN are independent.
  2. Zi has the geometric distribution with parameter pi = (N - i + 1) / N.
  3. WN, k = Z1 + Z2 + ··· + Zk.

Exercise 9 shows clearly that each time a new coupon is obtained, it becomes harder to get the next new coupon.

Simulation Exercise 10. In the coupon collector experiment, set N = 30 and vary k. Note the shape and position of the density graph. With k = 25, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the sample statistics to the distribution parameters.

Mathematical Exercise 11. Use the results of Exercise 9 to show that

  1. E(WN, k) = sumi = 1, ..., k N / (N - i + 1).
  2. var(WN, k) = sumi = 1, ..., k (i - 1)N / (N - i + 1)2.

Mathematical Exercise 12. Compute the mean and variance of the number of persons that must be sampled to obtain 10 distinct birth weeks.

Mathematical Exercise 13. Compute the mean and standard deviation of the number of times a die must be rolled to obtain all 6 scores.

Mathematical Exercise 14. A box of a certain brand of cereal comes with a special toy. There are 10 different toys in all. Find the mean and standard deviation of the number of boxes that must be purchased to obtain a complete set of toys.

Mathematical Exercise 15. Compute the mean and standard deviation of the number of persons that must be sampled to obtain all 365 birthdays.

Simulation Exercise 16. In the coupon collector experiment, set N = 10 and vary k. Note the shape of and position of the density graph. With k = 10, run the experiment in single-step mode a few times and watch the outcomes. Finally, run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the sample statistics to the distribution parameters.

Mathematical Exercise 17. Use the Result of Exercise 9 to show that the probability generating function of WN, k is

GN, k(t) = tkproducti = 1, ..., k [N - (i - 1)] / [N - (i - 1)t] for |t| < N / (k - 1).

Recurrence Relation

An alternate approach to the distribution of the sample size needed to get k distinct values is via a recursion formula.

Mathematical Exercise 18. Let cN, k(n) = P(WN, k = n) for n = k, k + 1, .... Use a conditional probability argument to show that

cN, k(n + 1) = [(k - 1) / N]cN, k(n) + [(N - k + 1) / N]cN, k - 1(n).