Virtual Laboratories > Finite Sampling Models

3. Inferences in the Hypergeometric Model


Preliminaries

Suppose again that we have a dichotomous population D with R objects of type 1 and N - R of type 2. As in the introduction, we sample n objects at random from D:

X = (X1, X2, ..., Xn), where Xi in D is the ith object chosen.

In many real problems, the parameters R or N (or both) may be unknown. In this case we are interested in drawing inferences about the unknown parameters based on our observation of Y, the number of type 1 objects in the sample. We will assume initially that the sampling is without replacement, the realistic setting in most applications. In this case, recall that Y has the hypergeometric distribution with parameters n, R, and N.

Estimation of R with N Known

Suppose that the size of the population N is known but that the number of type 1 objects R is unknown. This type of problem could arise, for example, if we had a batch of N computer chips containing an unknown number R of defectives. It would be too costly and perhaps destructive to test all N chips, so we might instead select n chips at random and test those.

A simple estimator of R can be derived by hoping that the sample proportion of type 1 objects is close to the population proportion of type 1 objects. That is,

Y / n ~ R / N so R ~ N Y / n.

Mathematical Exercise 1. Show that E(N Y / n) = R.

The result in Exercise 1 means that N Y / n is an unbiased estimator of R. Hence the variance is a measure of the quality of the estimator, in the mean square sense.

Mathematical Exercise 2. Show that var(N Y / n) = R (N - R) (N - n) / [n (N - 1)].

Mathematical Exercise 3. Show that for fixed N and R, the mean square error decreases to 0 as n increases to N.

Thus, the estimator improves as the sample size increases; this property is known as consistency.
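The properties in Exercises 1-3 can also be checked by simulation. Here is a minimal Python sketch (using the parameter values from Exercise 4, but with many more runs so the averages settle down):

```python
import random

# Simulate the estimator N*Y/n under sampling without replacement and
# compare its empirical mean and variance with the theoretical values.
N, R, n = 50, 20, 10
population = [1] * R + [0] * (N - R)   # 1 = type 1 object, 0 = type 2

random.seed(1)
runs = 20000
estimates = []
for _ in range(runs):
    sample = random.sample(population, n)   # sampling without replacement
    Y = sum(sample)                         # number of type 1 objects drawn
    estimates.append(N * Y / n)

mean_est = sum(estimates) / runs
var_est = sum((e - mean_est) ** 2 for e in estimates) / runs
theoretical_var = R * (N - R) * (N - n) / (n * (N - 1))

print(mean_est)                   # should be close to R = 20 (unbiasedness)
print(var_est, theoretical_var)   # empirical vs. theoretical variance
```

Increasing n toward N shrinks the factor (N - n) in the variance formula, which is the consistency property described above.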

Simulation Exercise 4. In the ball and urn experiment, select sampling without replacement and set N = 50, R = 20, and n = 10. Run the experiment 100 times, updating after each run.

  1. On each run, compute N Y / n (the estimate of R), N Y / n - R (the error), and (N Y / n - R)^2 (the squared error).
  2. Compute the average error and the average squared error over the 100 runs.
  3. Compute the square root of the average squared error and compare this empirical value with the square root of the variance in Exercise 2.

Mathematical Exercise 5. Suppose that 10 memory chips are sampled at random and without replacement from a batch of 100 chips. The chips are tested and 2 are defective. Estimate the number of defective chips in the entire batch.

Mathematical Exercise 6. A voting district has 5000 registered voters. Suppose that 100 voters are selected at random and polled, and that 40 prefer candidate A. Estimate the number of voters in the district who prefer candidate A.
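Both exercises use the same estimator N Y / n. A quick Python sketch of the arithmetic (the helper name estimate_R is illustrative, not from the text):

```python
def estimate_R(N, n, Y):
    """Point estimate of R: scale the sample count Y up by the factor N/n."""
    return N * Y / n

print(estimate_R(100, 10, 2))     # Exercise 5: 20.0 defective chips estimated
print(estimate_R(5000, 100, 40))  # Exercise 6: 2000.0 voters estimated for A
```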

Acceptance Sampling

Sometimes we are not interested in estimating R, but only in determining whether R meets or exceeds a critical value C. In particular, this situation arises in acceptance sampling. Suppose that we have a population of items that are either good or defective. If the number of defective items R is at least C (the critical value), then we would like to reject the entire lot. However, testing the items is expensive and destructive, so instead we test a random sample of n items (drawn without replacement, of course) and base our decision to accept or reject the lot on the number of defectives in the sample. Clearly, the only reasonable approach is to choose another critical value c and reject the lot if the number of defectives in the sample is at least c. In statistical terms, we have described a hypothesis test.

In the following problems, suppose that N = 100 and C = 10. Thus we would like to reject the lot of 100 items if the number of defectives R is 10 or more. Suppose that we can only afford to sample and test n = 10 items.

We will first study the following test: Reject the lot if the number of defectives in the sample is at least 1.

Mathematical Exercise 7. For each of the following values of R (the true number of defectives), find the probability that we make the correct decision and the probability that we make the wrong decision:

  1. R = 6
  2. R = 8
  3. R = 10
  4. R = 12
  5. R = 14

Simulation Exercise 8. In the ball and urn experiment, select sampling without replacement, and set N = 100, n = 10. For each of the values of R in Exercise 7, run the experiment 1000 times, updating every 100 runs. Compute the relative frequency of rejections and compare with the true probability in Exercise 7.

Now we will study the following test: Reject the lot if the number of defectives in the sample is at least 2.

Mathematical Exercise 9. For each of the following values of R (the true number of defectives), find the probability that we make the correct decision and the probability that we make the wrong decision:

  1. R = 6
  2. R = 8
  3. R = 10
  4. R = 12
  5. R = 14

Simulation Exercise 10. In the ball and urn experiment, select sampling without replacement, and set N = 100, n = 10. For each of the values of R in Exercise 9, run the experiment 1000 times, updating every 100 runs. Compute the relative frequency of rejections and compare with the true probability in Exercise 9.

Mathematical Exercise 11. Of the two tests studied above,

  1. Which test works better when the lot should be accepted (R < 10)?
  2. Which test works better when the lot should be rejected (R >= 10)?
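The trade-off in Exercise 11 can be seen numerically by tabulating the rejection probability of each test across the values of R. A Python sketch (the function name reject_prob is illustrative):

```python
from math import comb

def reject_prob(N, n, R, c):
    """P(Y >= c), where Y is hypergeometric with parameters n, R, N."""
    accept = sum(comb(R, k) * comb(N - R, n - k) for k in range(c)) / comb(N, n)
    return 1 - accept

N, n = 100, 10
for R in (6, 8, 10, 12, 14):
    p1 = reject_prob(N, n, R, 1)   # test with c = 1
    p2 = reject_prob(N, n, R, 2)   # test with c = 2
    print(R, round(p1, 4), round(p2, 4))
# The c = 2 test rejects less often at every R, so it makes fewer errors
# when R < 10 (lot should be accepted) and more when R >= 10.
```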

Estimation of N with R Known

Suppose now that the number of type 1 objects R is known, but the population size N is unknown. As an example of this type of problem, suppose that we have a lake containing N fish where N is unknown. We capture R of the fish, tag them, and return them to the lake. Next we capture n of the fish and observe Y, the number of tagged fish in the sample. We wish to estimate N from this data. In this context, the estimation problem is sometimes called the capture-recapture problem.

Mathematical Exercise 12. Do you think that the main assumption of the ball and urn experiment, namely equally likely samples, would be satisfied for a real capture-recapture problem? Explain.

Once again, we can derive a simple estimate of N by hoping that the sample proportion of type 1 objects is close to the population proportion of type 1 objects. That is,

Y / n ~ R / N so N ~ nR / Y (if Y > 0).

Thus, our estimator of N is nR / Y if Y > 0 and is undefined if Y = 0.
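A small Python sketch of this estimator (the helper name estimate_N is illustrative; the guard reflects the fact that the estimate is undefined when Y = 0):

```python
def estimate_N(n, R, Y):
    """Capture-recapture point estimate of N; undefined when Y = 0."""
    if Y == 0:
        return None   # no tagged objects in the sample: estimate undefined
    return n * R / Y

# Illustrative call with the parameters of Exercise 13 and a sample
# containing Y = 8 tagged objects:
print(estimate_N(20, 30, 8))   # 75.0
```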

Simulation Exercise 13. In the ball and urn experiment, select sampling without replacement and set N = 80, R = 30, and n = 20. Run the experiment 100 times, updating after each run.

  1. On each run, compute nR / Y (the estimate of N), nR / Y - N (the error), and (nR / Y - N)^2 (the squared error).
  2. Compute the average error and the average squared error over the 100 runs.
  3. Compute the square root of the average squared error. This is an empirical estimate of the root mean square error of the estimator.

Mathematical Exercise 14. From a certain lake, 200 fish are caught, tagged and returned to the lake. Then 100 fish are caught and it turns out that 10 are tagged. Estimate the population of fish in the lake.

Mathematical Exercise 15. Show that if k > 0, then nR / k maximizes P(Y = k) as a function of N for fixed R and n. This means that nR / Y is the maximum likelihood estimator of N.
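The claim in Exercise 15 can be checked numerically for particular parameter values. A Python sketch with illustrative values R = 20, n = 10, k = 3 (so nR / k = 200 / 3, and the maximum should occur at N = 66):

```python
from math import comb

R, n, k = 20, 10, 3

def pmf(N):
    """P(Y = k) as a function of the population size N."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

# N must be at least R + (n - k) for the pmf to be positive.
candidates = range(R + n - k, 500)
mle = max(candidates, key=pmf)
print(mle)   # 66, the integer part of nR / k
```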

Mathematical Exercise 16. Use Jensen's inequality to show that E(nR / Y) >= N.

Thus, the estimator is biased and tends to over-estimate N. Indeed, if n <= N - R, so that P(Y = 0) > 0, E(nR / Y) is infinite.
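This bias can also be seen in simulation. A Python sketch using the parameters of Exercise 17 (runs with Y = 0 are discarded, so the average shown is conditional on Y > 0):

```python
import random

N, R, n = 100, 60, 30
population = [1] * R + [0] * (N - R)   # 1 = tagged fish

random.seed(2)
runs = 20000
estimates = []
for _ in range(runs):
    Y = sum(random.sample(population, n))   # tagged fish recaptured
    if Y > 0:                               # estimator undefined when Y = 0
        estimates.append(n * R / Y)

avg = sum(estimates) / len(estimates)
print(avg)   # tends to come out somewhat above N = 100
```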

Simulation Exercise 17. In the ball and urn experiment, select sampling without replacement and set N = 100, R = 60, and n = 30. Run the experiment 100 times, updating after each run. On each run, compute nR / Y, the estimate of N. Average the estimates and compare with N.

For another approach to estimating N, see the section on Order Statistics.

Sampling with Replacement

Suppose now that the sampling is with replacement, even though this is unrealistic in most applications. In this case, Y has the binomial distribution with parameters n and R / N.

Mathematical Exercise 18. Show that

  1. E(N Y / n) = R.
  2. var(N Y / n) = R (N - R) / n.

Thus, the estimator of R with N known is still unbiased, but has larger mean square error: the ratio of the variance in Exercise 2 to the variance here is the finite population correction factor (N - n) / (N - 1), which is at most 1. Hence sampling without replacement performs at least as well as sampling with replacement for any values of the parameters, and strictly better when n > 1.
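The comparison can be made concrete by evaluating both variance formulas side by side. A Python sketch (illustrative parameter values):

```python
def var_without(N, R, n):
    """Variance of N*Y/n under sampling without replacement (Exercise 2)."""
    return R * (N - R) * (N - n) / (n * (N - 1))

def var_with(N, R, n):
    """Variance of N*Y/n under sampling with replacement (Exercise 18)."""
    return R * (N - R) / n

N, R = 50, 20
for n in (1, 10, 25, 50):
    print(n, var_without(N, R, n), var_with(N, R, n))
# The ratio is (N - n)/(N - 1) <= 1, with equality only when n = 1,
# so sampling without replacement never does worse.
```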