Virtual Laboratories > Finite Sampling Models > 1 2 3 4 5 6 7 [8] 9 10

8. The Number of Distinct Sample Values


Basic Variables

Suppose that our random experiment is to select a random sample of size n, with replacement, from the population D = {1, 2, ..., N}:

X = (X1, X2, ..., Xn).

Recall that our basic modeling assumption is that X is uniformly distributed on the sample space

S = {1, 2, ..., N}n.

In this section, we are interested in the number of population values missing from the sample, and the number of (distinct) population values in the sample. Often, we will interpret the sampling experiment as a distribution of n balls into N cells; Xi is the cell number of ball i. In this model, our interest is in the number of empty cells and the number of occupied cells.

For i in D, let Yi denote the number of times that i occurs in the sample:

Yi = #{j in.gif (839 bytes){1, 2, ..., n}: Xj = i}.

Mathematical Exercise 1. Show that Y = (Y1, Y2, ..., YN) has the multinomial distribution: for nonnegative integers k1, ..., kN with k1 + k2 + ··· + kN = n,

P(Y1 = k1, Y2 = k2, ..., YN = kN) = C(n; k1, k2, ..., kN) / Nn

We will now define the main random variables of interest: the number of population values missing in the sample:

UN, n = #{j {1, 2, ..., N}: Yj = 0},

and the number of (distinct) population values that occur in the sample:

VN, n = #{j {1, 2, ..., N}: Yj > 0}.

Clearly we must have

UN, n + VN, n = N,

so once we have the probability distribution and moments of one variable, we can easily find them for the other variable. Note also that the birthday event, that there is at least one duplication in the sample, can be written as

{VN, n < n} = {UN, n > N - n}.

Simulation Exercise 2. In the general birthday experiment, set N = 100. Vary n and note the shape of the density graph of V and its position within the range. With n = 30, run the experiment in single-step mode a few times and observe the outcomes. Now run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency distribution to the true distribution.

The Density Function

For j in D, consider the event that j does not occur in the sample:

Aj = {Yj = 0}.

Let K be a subset of D with #(K) = k. Using the multiplication rule of combinatorics, it is easy to count the number of samples that do not contain any elements of K:

Mathematical Exercise 3. Show that

#[intersectj in K Aj] = (N - k)n.

Now the inclusion-exclusion rule of combinatorics can be used to count the number samples that are missing at least one population value:

Mathematical Exercise 4. Show that

#[j = 1, ..., N Aj] = k = 1, ..., N (-1)k - 1 C(N, k) (N - k)n.

Once we have this, it's simple to count the number samples that contain all population values:

Mathematical Exercise 5. Show that

#[intersectj = 1, ..., N Ajc] = k = 1, ..., N (-1)k C(N, k) (N - k)n.

Now we can use a two-step procedure to generate all samples that exclude exactly j population values: first, choose the j values that are to be excluded. Then select a sample of size n from the remaining population values so that none are excluded. Thus, we can use the multiplication principle of combinatorics to count the number of samples that exclude j population values.

Mathematical Exercise 6. Show that

#{UN,n = j} = C(N, j) k = 0, ..., N - j (-1)k C(N - j, k) (N - j - k)n.

Finally, since the probability distribution of X on the sample space S is uniform, we can find the density function of the number of excluded values:

Mathematical Exercise 7. Show that for j = max{N - n, 0}, ..., N - 1,

P(UN,n = j) = C(N, j) k = 0, ..., N - j (-1)k C(N - j, k) [1 - (j + k) / N]n.

Also, we can easily find the density function of the number distinct values in the sample:

Mathematical Exercise 8. Show that for j = 1, 2, ..., min{N, n},

P(VN,n = j) = C(N, j) k = 0, ..., j (-1)k C(j, k) [(j - k) / N]n.

Mathematical Exercise 9. Suppose that 20 persons are chosen at random. Find the probability that there are at least 18 different birth weeks represented.

Simulation Exercise 10. In the general birthday experiment, set N = 52. Vary n and note the shape and location of the density function. With n = 20 run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency function to the density function.

Mathematical Exercise 11. Suppose that 10 fair dice are rolled. Find the probability that there will 4 or fewer distinct scores.

Simulation Exercise 12. In the general birthday experiment, set N = 6. Vary n and note the shape and location of the density function. With n = 10 run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency function to the density function.

Recurrence Relation

The distribution of the number of excluded values can also be obtained by a recursion argument.

Mathematical Exercise 13. Let aN, n(j) = P(UN, n = j) for j = max{N - n, 0}, ..., N - 1. Use probability arguments to show that

  1. aN, 1(N - 1) = 1.
  2. aN, n+1(j) = [(N - j) / N]aN, n(j) + [(j + 1) / N]aN, n(j + 1).

Mathematical Exercise 14. Suppose that 20 persons are chosen at random. Find the probability that at least 3 birth months are not represented.

Simulation Exercise 15. In the general birthday experiment, set N = 12. Vary n and note the shape and location of the density function. With n = 20 run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency function to the density function.

Mathematical Exercise 16. A fast food restaurant gives away one of 10 different types of toy with the purchase of each kid's meal. A family buys 15 kid's meals. Find the probability that at least 3 different toy types are missing.

Simulation Exercise 17. In the general birthday experiment, set N = 10. Vary n and note the shape and location of the density function. With k = 15 run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the relative frequency function to the density function.

Moments

Now we will find the means and variances. The number of excluded values and the number of distinct values are counting variables and hence can be written as sums of indicator variables. As we have seen in many other models, this representation is frequently the best for computing moments.

Let Ij = 1 if Aj occurs (j is not in the sample) and Ij = 0 if Aj does not occur (j is in the sample).

Note that the number of population values missing in the sample can be written as

UN, n = I1 + I2 + ··· + IN.

Mathematical Exercise 18. Show that

  1. E(Ij) = (1 - 1 / N)n for j = 1, 2, ..., N.
  2. E(Ii Ij) = (1 - 2 / N)n for i, j = 1, 2, ..., N with i <> j.

Mathematical Exercise 19. Use the results of the previous exercise to show that

  1. E(UN, n) = N(1 - 1 / N)n.
  2. E(VN, n) = N[1 - (1 - 1 / N)n].

Mathematical Exercise 20. Use the result of Exercise 18 to show that

  1. var(Ii) = (1 - 1 / N)n - (1 - 1 / N)2n.
  2. cov(Ii, Ij) = (1 - 2 / N)n - (1 - 1 / N)2n if i <> j.

Mathematical Exercise 19. Use the results of the previous exercise, and basic properties of variance to show that

var(UN, n) = var(VN, n) = N(N - 1)(1 - 2 / N)n + N(1 - 1 / N)n - N2(1 - 1 / N)2n.

Mathematical Exercise 20. Suppose that 100 persons are chosen at random. Find the mean and standard deviation of the number of distinct birthdays.

Mathematical Exercise 21. Suppose that 30 persons are chosen at random. Find the mean and standard deviation of the number of distinct birth weeks.

Simulation Exercise 22. In the general birthday experiment, set N = 52. Vary n and note the size and shape of the mean/standard deviation bar. With n = 30 run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation, respectively.

Mathematical Exercise 23. Suppose that 20 persons are chosen at random. Find the mean and standard deviation of the number of distinct birth months.

Simulation Exercise 24. In the general birthday experiment, set N = 12. Vary n and note the size and shape of the mean/standard deviation bar. With n = 20 run the experiment 1000 times with an update frequency of 10 and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation, respectively.

Mathematical Exercise 25. The lying students problem. Suppose that 3 students, who ride together, miss a mathematics exam. They decide to lie to the instructor by saying that the car had a flat tire. The instructor separates the students and asks each of them which tire was flat. The students, who did not anticipate this, select their answers independently and at random.

  1. Give the probability density function of the number of distinct answers.
  2. In particular, give the probability that the students get away with their deception.
  3. Give mean of the number of distinct answers.
  4. Give the standard deviation of the number of distinct answers.

Mathematical Exercise 26. The duck hunter problem. Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at random and shoots.

  1. Give the probability density function of the number of ducks that are killed.
  2. Give mean of the number of ducks that are killed.
  3. Give the standard deviation of the number of ducks that are killed..