Virtual Laboratories > Finite Sampling Models
Suppose that the objects in our population are numbered from 1 to N, so that D = {1, 2, ..., N}. For example, the population might consist of manufactured items, and the labels might correspond to serial numbers. We sample n objects at random, without replacement from D:
X = (X1, X2, ..., Xn), where Xi in D is the i'th object chosen.
Recall that X is uniformly distributed over the set of permutations of size n chosen from D. Recall also that W = {X1, X2, ..., Xn} is the unordered sample, which is uniformly distributed on the set of combinations of size n chosen from D.
For i = 1, 2, ..., n, let X(i) denote the i'th smallest of X1, X2, ..., Xn. The random variable X(i) is known as the i'th order statistic of the sample. Note that in particular, X(1) is the sample minimum and X(n) is the sample maximum.
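The order statistics are simple to compute for a simulated sample. The following Python sketch (with arbitrary illustrative values of N and n, not values used in the text) draws a sample at random and without replacement and sorts it:

```python
import random

N, n = 10, 4                       # illustrative values, not part of the text
population = range(1, N + 1)       # D = {1, 2, ..., N}

x = random.sample(population, n)   # the ordered sample X, drawn without replacement
u = sorted(x)                      # the order statistics U = (X(1), ..., X(n))

print("sample:", x)
print("order statistics:", u)
print("minimum X(1) =", u[0], ", maximum X(n) =", u[-1])
```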
1. Show that X(i) takes values i, i + 1, ..., N - n + i.
We will denote the vector of order statistics by
U = (X(1), X(2), ..., X(n)).
Note that U takes values in
L = {(x1, x2, ..., xn): 1 ≤ x1 < x2 < ··· < xn ≤ N}.
2. Run the order statistic experiment. Note that you can vary the population size N and the sample size n. The order statistics are recorded on each update.
3. Show that L has C(N, n) elements and that U is uniformly distributed on L. Hint: U = (x1, x2, ..., xn) if and only if W = {x1, x2, ..., xn} if and only if X is one of the n! permutations of (x1, x2, ..., xn).
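For small parameter values, Exercise 3 can be checked by brute force: enumerate L, count its elements, and estimate the distribution of U by simulation. A Python sketch, with arbitrary small values of N and n:

```python
import random
from math import comb
from itertools import combinations
from collections import Counter

N, n, runs = 6, 3, 60000            # small, arbitrary values for a quick check

L = list(combinations(range(1, N + 1), n))
print(len(L), comb(N, n))            # both should be C(6, 3) = 20

counts = Counter(tuple(sorted(random.sample(range(1, N + 1), n)))
                 for _ in range(runs))
# every point of L should appear with relative frequency near 1 / C(N, n) = 0.05
for u in L[:5]:
    print(u, round(counts[u] / runs, 3))
```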
4. Use a combinatorial argument to show that the density function of X(i) is as follows:
P(X(i) = k) = C(k - 1, i - 1)C(N - k, n - i) / C(N, n) for k = i, i + 1, ..., N - n + i.
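A Python sketch (with arbitrary illustrative parameter values) that checks the density numerically: it should sum to 1 over its values, and it should roughly match empirical frequencies from simulation.

```python
import random
from math import comb

N, n, i = 30, 10, 5                  # arbitrary illustrative values

def pdf(k):
    """P(X(i) = k) = C(k - 1, i - 1) C(N - k, n - i) / C(N, n)."""
    return comb(k - 1, i - 1) * comb(N - k, n - i) / comb(N, n)

support = range(i, N - n + i + 1)
print(sum(pdf(k) for k in support))              # should print 1.0 (up to rounding)

runs = 50000
hits = sum(sorted(random.sample(range(1, N + 1), n))[i - 1] == 10
           for _ in range(runs))
print(hits / runs, pdf(10))                      # empirical vs. exact P(X(5) = 10)
```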
5. In the order statistic experiment, vary the parameters and note the shape of the density function. Now with N = 30, n = 10 and i = 5, run the experiment 1000 times, updating every 10 runs. Note the apparent convergence of the empirical density function to the true density function.
The density function in Exercise 4 can be used to obtain an interesting identity involving the binomial coefficients. This identity, in turn, can be used to find the mean and variance of X(i).
6. Show that for each i = 1, 2, ..., n,
Σ_{k = i}^{N - n + i} C(k, i) C(N - k, n - i) = C(N + 1, n + 1).
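For small values of N and n the identity is easy to verify numerically; a Python sketch (with arbitrary values):

```python
from math import comb

N, n = 12, 5                         # arbitrary small values
for i in range(1, n + 1):
    lhs = sum(comb(k, i) * comb(N - k, n - i) for k in range(i, N - n + i + 1))
    print(i, lhs, comb(N + 1, n + 1), lhs == comb(N + 1, n + 1))
```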
7. Use the identity in Exercise 6 to show that
E(X(i)) = i (N + 1) / (n + 1).
8. Use the identity in Exercise 6 to show that
var(X(i)) = (N + 1)(N - n) i (n + 1 - i) / [(n + 1)^2 (n + 2)].
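Both moment formulas can be checked against the density function in Exercise 4. A Python sketch with arbitrary illustrative values:

```python
from math import comb

N, n, i = 30, 10, 5                  # arbitrary illustrative values

def pdf(k):
    return comb(k - 1, i - 1) * comb(N - k, n - i) / comb(N, n)

support = range(i, N - n + i + 1)
mean = sum(k * pdf(k) for k in support)
var = sum((k - mean) ** 2 * pdf(k) for k in support)

print(mean, i * (N + 1) / (n + 1))                                          # Exercise 7
print(var, (N + 1) * (N - n) * i * (n + 1 - i) / ((n + 1) ** 2 * (n + 2)))  # Exercise 8
```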
9. In the order statistic experiment, vary the parameters and note the size and location of the mean/standard deviation bar. Now with N = 30, n = 10 and i = 5, run the experiment 1000 times, updating every 10 runs. Note the apparent convergence of the empirical moments to the true moments.
10. Suppose that in a lottery, tickets numbered from 1 to 25 are placed in a bowl. Five tickets are chosen at random and without replacement. Compute the probability density function, mean, and variance of each order statistic X(i).
11. Use the result of Exercise 7 to show that for i = 1, 2, ..., n, the following statistic is an unbiased estimator of N:
Wi = [(n + 1) X(i) / i] - 1.
Since Wi is unbiased, its variance is the mean square error, a measure of the quality of the estimator.
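A Python sketch of the unbiasedness claim (arbitrary illustrative values of N and n): the average of Wi over many simulated samples should be close to N for every i.

```python
import random

N, n, runs = 50, 5, 20000            # arbitrary illustrative values

sums = [0.0] * n
for _ in range(runs):
    u = sorted(random.sample(range(1, N + 1), n))
    for i in range(1, n + 1):
        sums[i - 1] += (n + 1) * u[i - 1] / i - 1    # W_i

for i in range(1, n + 1):
    print(f"i = {i}: average of W_{i} over {runs} runs = {sums[i - 1] / runs:.2f} (N = {N})")
```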
12. Show that
var(Wi) = (N + 1)(N - n)(n + 1 - i) / [i(n + 2)].
13. Show that for fixed N and n, var(Wi) decreases as i increases.
Thus, the estimators improve as i increases; in particular, Wn is the best and W1 the worst.
14. Show that
var(Wj) / var(Wi) = i(n + 1 - j) / [j(n + 1 - i)].
This ratio is known as the relative efficiency of Wi with respect to Wj.
Usually, we hope that an estimator improves (in the sense of mean square error) as the sample size n increases (the more information we have, the better our estimate should be). This general idea is known as consistency.
15. Show that var(Wn) decreases to 0 as n increases to N.
16. Show that for fixed i, var(Wi) at first increases and then decreases to 0 as n increases from 1 to N.
The following graph, due to Christine Nickel, shows var(W1) as a function of n for N = 50, 75, and 100.
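The values behind such a graph are easy to tabulate from the formula in Exercise 12; the following Python sketch reproduces only the computation, not the original figure:

```python
def var_w1(N, n):
    # var(W1) = (N + 1)(N - n)(n + 1 - 1) / [1 (n + 2)], from Exercise 12 with i = 1
    return (N + 1) * (N - n) * n / (n + 2)

for N in (50, 75, 100):
    peak = max(range(1, N + 1), key=lambda n: var_w1(N, n))
    print(f"N = {N}: var(W1) rises to a peak at n = {peak}, then falls to {var_w1(N, N)} at n = N")
```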
The estimator Wn was used by the Allies during World War II to estimate the number of German tanks N that had been produced. German tanks had serial numbers, and captured German tanks and captured records formed the sample data. According to Richard Larsen and Morris Marx, this estimate of German tank production in 1942 was 3400, very close to the true number.
17. Suppose that in a certain war, 100 enemy tanks have been captured. The largest serial number of the captured tanks is 1423. Estimate the total number of tanks that have been produced.
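As an illustration of how the estimator of Exercise 11 would be applied here (a sketch of the computation only, not an answer stated in the text), one might compute:

```python
n, x_max = 100, 1423
w_n = (n + 1) * x_max / n - 1        # Wn = (n + 1) X(n) / n - 1
print(w_n)                           # about 1436.2, so roughly 1436 tanks
```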
18. In the order statistic experiment, set N = 100 and n = 10. Run the experiment 50 times, updating after each run. For each run, compute the estimate of N based on each order statistic. For each estimator, compute the square root of the average of the squares of the errors over the 50 runs. Based on these empirical error estimates, rank the estimators of N in terms of quality.
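Exercise 18 can also be carried out in code rather than with the applet; a Python sketch follows (the applet's exact recording protocol may differ):

```python
import random
from math import sqrt

N, n, runs = 100, 10, 50

sq_errors = [0.0] * n
for _ in range(runs):
    u = sorted(random.sample(range(1, N + 1), n))
    for i in range(1, n + 1):
        w_i = (n + 1) * u[i - 1] / i - 1             # estimate of N from the i'th order statistic
        sq_errors[i - 1] += (w_i - N) ** 2

for i in range(1, n + 1):
    print(f"i = {i:2d}: root mean square error = {sqrt(sq_errors[i - 1] / runs):.1f}")
# the errors should tend to decrease as i increases, with i = n best (Exercise 13)
```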
19. Suppose that in a certain war, 100 enemy tanks have been captured. The smallest serial number of the captured tanks is 23. Estimate the total number of tanks that have been produced.
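The analogous computation with W1 (again only a sketch of the arithmetic, using the values given in Exercise 19):

```python
n, x_min = 100, 23
w_1 = (n + 1) * x_min / 1 - 1        # W1 = (n + 1) X(1) / 1 - 1
print(w_1)                           # 2322, a far less precise estimate than Wn gives
```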
If the sampling is with replacement, then the sample variables X1, X2, ..., Xn are independent and identically distributed. The order statistics from such samples are studied in the chapter on Random Samples.