Virtual Laboratories > Random Samples > Order Statistics
As usual we start with a basic random experiment, with a sample space and a probability measure P. Suppose that X is a real-valued random variable for the experiment with distribution function F and density function f.
We perform n independent replications of the basic experiment to generate a random sample of size n from the distribution of X:
(X1, X2, ..., Xn).
Recall that these are independent random variables, each with the distribution of X.
Let X(k) denote the k'th smallest of X1, X2, ..., Xn. Note that X(k) is a function of the sample variables, and hence is a statistic, called the k'th order statistic. Often the first step in a statistical study is to order the data; thus order statistics occur naturally. Our goal in this section is to study the distribution of the order statistics in terms of the sampling distribution.
Note in particular that the extreme order statistics are the minimum and maximum values:
X(1) = min{X1, X2, ..., Xn}, X(n) = max{X1, X2, ..., Xn}.
1. In the order statistic experiment, use the default settings and run the experiment a few times, noting the values of the order statistics on each run.
Let Gk denote the distribution function of X(k). Fix a real number y and define
Ny = #{i in {1, 2, ..., n}: Xi ≤ y}.
2. Show that Ny has the binomial distribution with parameters n and F(y).
3. Show that X(k) ≤ y if and only if Ny ≥ k.
4. Conclude from Exercises 2 and 3 that for y in R,
Gk(y) = Σj = k, ..., n C(n, j) [F(y)]^j [1 - F(y)]^(n - j).
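The formula in Exercise 4 is straightforward to evaluate numerically. The following Python sketch (not part of the original exercises; the function and variable names are illustrative) computes Gk(y) from the binomial sum and compares it with a Monte Carlo estimate based on simulated uniform samples:

```python
import math
import random

def order_stat_cdf(F, y, n, k):
    """G_k(y) = P(X_(k) <= y) = sum over j = k, ..., n of C(n, j) F(y)^j (1 - F(y))^(n - j)."""
    p = F(y)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Example: standard uniform distribution, so F(y) = y on (0, 1).
n, k, y = 5, 4, 0.7
exact = order_stat_cdf(lambda t: t, y, n, k)

# Monte Carlo check: fraction of simulated samples whose k'th order statistic is <= y.
trials = 100_000
hits = sum(sorted(random.random() for _ in range(n))[k - 1] <= y for _ in range(trials))
print(exact, hits / trials)  # the two numbers should be close
```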
5. In particular, show that G1(y) = 1 - [1 - F(y)]^n for y in R.
6. In particular, show that Gn(y) = [F(y)]^n for y in R.
7. Suppose now that X has a continuous distribution. Show that X(k) has a continuous distribution with density
gk(y) = C(n; k - 1, 1, n - k) [F(y)]^(k - 1) [1 - F(y)]^(n - k) f(y)
where C(n; k - 1, 1, n - k) is the multinomial coefficient. Hint: Differentiate the expression in Exercise 4 with respect to y.
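As an informal check on the density in Exercise 7, the formula can be compared with a numerical derivative of the distribution function from Exercise 4. The Python sketch below is illustrative only and assumes the uniform distribution on (0, 1), so that F(y) = y and f(y) = 1:

```python
import math

def order_stat_cdf(F, y, n, k):
    """Distribution function of the k'th order statistic (Exercise 4)."""
    p = F(y)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def order_stat_pdf(F, f, y, n, k):
    """Density of the k'th order statistic (Exercise 7)."""
    c = math.factorial(n) // (math.factorial(k - 1) * math.factorial(n - k))  # C(n; k-1, 1, n-k)
    return c * F(y)**(k - 1) * (1 - F(y))**(n - k) * f(y)

# Uniform distribution on (0, 1): F(y) = y, f(y) = 1.
F, f = (lambda t: t), (lambda t: 1.0)
n, k, y, h = 5, 3, 0.4, 1e-6
numerical = (order_stat_cdf(F, y + h, n, k) - order_stat_cdf(F, y - h, n, k)) / (2 * h)
print(order_stat_pdf(F, f, y, n, k), numerical)  # should agree to several decimal places
```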
8. In the order statistic experiment, select the uniform distribution on (0, 1) and n = 5. Vary k from 1 to 5 and note the shape of the density function of X(k). Now with k = 4, run the simulation 1000 times with an update frequency of 10. Note the apparent convergence of the empirical density function to the true density function.
There is a simple heuristic argument for the result in Exercise 7. First, gk(y) dy is the probability that X(k) is in an infinitesimal interval dy about y. On the other hand, this event means that one of the sample variables is in the infinitesimal interval, k - 1 sample variables are less than y, and n - k sample variables are greater than y. The number of ways of choosing these variables is the multinomial coefficient
C(n; k - 1, 1, n - k).
The probability that the chosen variables are in the specified intervals is
[F(y)]^(k - 1) [1 - F(y)]^(n - k) f(y) dy.
Multiplying these two factors gives gk(y) dy.
9. Consider a random sample of size n from the exponential distribution with rate parameter r. Compute the density function of the k'th order statistic X(k). In particular, note that X(1) has the exponential distribution with rate parameter nr.
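The observation about X(1) in Exercise 9 is easy to see in simulation. The following sketch (illustrative only; the parameter values are arbitrary) compares the empirical mean of the sample minimum with 1/(nr), the mean of the exponential distribution with rate parameter nr:

```python
import random

# The minimum of n independent exponential(r) variables should behave like an
# exponential(n r) variable, whose mean is 1 / (n r).
n, r, trials = 5, 2.0, 100_000
mins = [min(random.expovariate(r) for _ in range(n)) for _ in range(trials)]
print(sum(mins) / trials, 1 / (n * r))  # the two values should be close
```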
10. In the order statistic experiment, select the exponential (1) distribution and n = 5. Vary k from 1 to 5 and note the shape of the density function of X(k). Now with k = 3, run the simulation 1000 times with an update frequency of 10. Note the apparent convergence of the empirical density function to the true density function.
11. Consider a random sample of size n from the uniform distribution on (0, 1). Compute the density function of the k'th order statistic X(k). In particular, note that X(k) has the beta distribution with parameters k and n - k + 1.
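A quick simulation check for Exercise 11 (illustrative only; the parameter values are arbitrary) compares the empirical mean of X(k) with k / (n + 1), the mean of the beta distribution with parameters k and n - k + 1:

```python
import random

# For a uniform(0, 1) sample of size n, the k'th order statistic has mean k / (n + 1).
n, k, trials = 5, 2, 100_000
values = [sorted(random.random() for _ in range(n))[k - 1] for _ in range(trials)]
print(sum(values) / trials, k / (n + 1))  # the two values should be close
```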
12. In the order statistic experiment, select the uniform distribution on (0, 1) and n = 6. Vary k from 1 to 6 and note the size and location of the mean/standard deviation bar. Now with k = 3, run the simulation 1000 times with an update frequency of 10. Note the apparent convergence of the empirical moments to the distribution moments.
13. Four fair dice are rolled. Find the (discrete) density function of each of the order statistics.
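One way to organize the computation in Exercise 13 is to apply the distribution function formula from Exercise 4 with F the distribution function of a single fair die, and then difference to get the discrete density. The Python sketch below is illustrative only (the function names are my own):

```python
import math
from fractions import Fraction

def order_stat_cdf(p, n, k):
    """P(X_(k) <= y) when a single observation is <= y with probability p = F(y)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def die_cdf(y):
    """Distribution function of a single fair die at the integer y."""
    return Fraction(y, 6)

n = 4  # four fair dice
for k in range(1, n + 1):
    pmf = {y: order_stat_cdf(die_cdf(y), n, k) - order_stat_cdf(die_cdf(y - 1), n, k)
           for y in range(1, 7)}
    print(k, pmf)  # exact rational probabilities for the k'th order statistic
```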
14. In the dice experiment, select the following order statistic and die distribution. Increase the number of dice from 1 to 20, noting the shape of the density at each stage. Now with n = 4, run the simulation 1000 times, updating every 10 runs. Note the apparent convergence of the relative frequency function to the density function.
Suppose again that X has a continuous distribution.
15. Suppose that j < k. Use a heuristic argument to show that the joint density of (X(j), X(k)) is
g(y, z) = C(n; j - 1, 1, k - j - 1, 1, n - k) [F(y)]^(j - 1) f(y) [F(z) - F(y)]^(k - j - 1) f(z) [1 - F(z)]^(n - k) for y < z.
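As a crude check on the formula in Exercise 15 (an illustrative sketch, specialized to the uniform distribution on (0, 1), so that F(y) = y and f(y) = 1), the joint density should integrate to 1 over the region y < z:

```python
import math

def joint_density(y, z, n, j, k):
    """Joint density of (X_(j), X_(k)) for a uniform(0, 1) sample, per Exercise 15."""
    c = math.factorial(n) // (math.factorial(j - 1) * math.factorial(k - j - 1) * math.factorial(n - k))
    return c * y**(j - 1) * (z - y)**(k - j - 1) * (1 - z)**(n - k)

# Midpoint-rule approximation of the double integral over 0 < y < z < 1.
n, j, k, m = 5, 2, 4, 400
h = 1.0 / m
total = sum(joint_density((a + 0.5) * h, (b + 0.5) * h, n, j, k)
            for b in range(m) for a in range(b)) * h * h
print(total)  # should be close to 1
```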
Similar arguments can be used to obtain the joint density of any number of the order statistics. Of course, we are particularly interested in the joint density of all of the order statistics; the following exercise gives this joint density, which has a remarkably simple form.
16. Show that (X(1), X(2), ..., X(n)) has joint density g given by
g(y1, y2, ..., yn) = n! f(y1)f(y2) ··· f(yn) for y1 < y2 < ··· < yn.
Hint: For each permutation i = (i1, i2, ..., in) of (1, 2, ..., n), let
Si = {x in R^n: xi1 < xi2 < ··· < xin}.
On Si, the mapping from (x1, x2, ..., xn) to (xi1, xi2, ···, xin) is one-to-one, has continuous first partial derivatives, and has Jacobian 1. The sets Si where i ranges over the n! permutations of (1, 2, ..., n) are disjoint, and the probability that (X1, X2, ..., Xn) is not in one of these sets is 0. Now use the multivariate change of variables formula.
Again, there is a simple heuristic argument for the formula in Exercise 16. For each y in R^n with y1 < y2 < ··· < yn, there are n! permutations of the coordinates of y. The density of (X1, X2, ..., Xn) at each of these points is
f(y1)f(y2) ··· f(yn).
Hence the density of (X(1), X(2), ..., X(n)) at y is n! times this product.
17. Consider a random sample of size n from the exponential distribution with parameter r. Compute the joint density function of the order statistics (X(1), X(2), ..., X(n)).
18. Consider a random sample of size n from the uniform distribution on (0, 1). Compute the joint density function of the order statistics (X(1), X(2), ..., X(n)).
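One small consequence of Exercise 18 is easy to illustrate by simulation: since the joint density of the order statistics is the constant n! on the region y1 < y2 < ··· < yn, that region must have probability 1/n! under the unordered sample; that is, a uniform sample is already in increasing order with probability 1/n!. The sketch below (illustrative only) estimates this probability:

```python
import math
import random

# Estimate the probability that a uniform(0, 1) sample of size n is already in increasing order.
n, trials = 4, 200_000
count = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    if xs == sorted(xs):
        count += 1
print(count / trials, 1 / math.factorial(n))  # both should be about 1/24
```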
19. Four fair dice are rolled. Find the (discrete) joint density function of the order statistics.
The sample range is the random variable
R = X(n) - X(1).
This statistic gives a measure of the dispersion of the sample. Note that the distribution of the sample range can be obtained from the joint distribution of (X(1), X(n)) given earlier.
20. Consider a random sample of size n from the exponential distribution with parameter r. Show that the sample range R has the same distribution as the maximum of a random sample of size n - 1 from this exponential distribution.
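The distributional identity in Exercise 20 can also be illustrated by simulation. The sketch below (illustrative only; the parameter values are arbitrary) compares a few empirical quantiles of the simulated sample range with those of the simulated maximum of a sample of size n - 1:

```python
import random

def exp_sample(n, r):
    """A random sample of size n from the exponential distribution with rate r."""
    return [random.expovariate(r) for _ in range(n)]

n, r, trials = 5, 1.0, 100_000
ranges = sorted(max(xs) - min(xs) for xs in (exp_sample(n, r) for _ in range(trials)))
maxima = sorted(max(exp_sample(n - 1, r)) for _ in range(trials))
for p in (0.25, 0.5, 0.75):
    i = int(p * trials)
    print(p, ranges[i], maxima[i])  # corresponding empirical quantiles should be close
```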
21. Consider a random sample of size n from the uniform distribution on (0, 1). Compute the density function of the sample range R.
22. Four fair dice are rolled. Find the (discrete) density function of the sample range.
If n is odd, the sample median is the middle of the ordered observations, namely
X(k) where k = (n + 1)/2.
If n is even, there is not a single middle observation, but rather two middle observations. Thus, the median interval is
[X(k), X(k+1)] where k = n/2.
In this case, the sample median is defined to be the midpoint of the median interval
[X(k) + X(k+1)] / 2.
In a sense, this definition is a bit arbitrary because there is no compelling reason to prefer one point in the median interval over another. For more on this issue, see the discussion of error functions in the section on Variance. In any event, the sample median is a natural statistic that is analogous to the median of the distribution. Moreover, the distribution of the sample median can be obtained from our results on order statistics.
We can generalize the sample median discussed above to other sample quantiles. Suppose that p is in (0, 1). If np is not an integer, we define the sample quantile of order p to be the order statistic
X(k) where k = ceil(np)
(recall that ceil(np) is the smallest integer greater than or equal to np). If np is an integer k, then we define the sample quantile of order p to be the average of the order statistics
[X(k) + X(k+1)] / 2.
Once again, the sample quantile of order p is a natural statistic that is analogous to the distribution quantile of order p. Moreover, the distribution of a sample quantile can be obtained from our results on order statistics.
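The definitions of the sample median and the sample quantile of order p translate directly into code. The following is a minimal Python sketch (the function name is illustrative, and the test for whether np is an integer may need care with floating-point values of p):

```python
import math

def sample_quantile(data, p):
    """Sample quantile of order p, following the definition above:
    the ceil(np)'th order statistic if np is not an integer, and the average of the
    k'th and (k+1)'st order statistics if np is the integer k."""
    xs = sorted(data)
    n = len(xs)
    np_ = n * p
    if np_ == int(np_):
        k = int(np_)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.ceil(np_) - 1]

data = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8]
print(sample_quantile(data, 1/2))                         # the sample median
print(sample_quantile(data, 1/4), sample_quantile(data, 3/4))  # quantiles of order 1/4 and 3/4
```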
The sample quantile of order 1/4 is known as the first sample quartile and is frequently denoted Q1. The sample quantile of order 3/4 is known as the third sample quartile and is frequently denoted Q3. Note that the sample median is the quantile of order 1/2, the second sample quartile, and thus is sometimes denoted Q2. The interquartile range is defined to be
IQR = Q3 - Q1.
The IQR is a statistic that measures the spread of the distribution about the median, but of course this number gives less information than the interval [Q1, Q3].
The five statistics
X(1), Q1, Q2, Q3, X(n)
are often referred to as the five-number summary. Together, these statistics give a great deal of information about the distribution in terms of the center, spread, and skewness. Graphically, the five numbers are often displayed as a boxplot, which consists of a line extending from the min to the max, with a rectangular box from Q1 to Q3, and tick marks at the min, median and max.
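The five-number summary can be computed with the same quantile rule. The sketch below (illustrative only; it restates the quantile rule so that it stands alone) computes the five statistics and the IQR:

```python
import math

def quantile(xs, p):
    """Sample quantile of order p for already-sorted data xs (same rule as above)."""
    k = len(xs) * p
    return (xs[int(k) - 1] + xs[int(k)]) / 2 if k == int(k) else xs[math.ceil(k) - 1]

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max), the ingredients of a boxplot."""
    xs = sorted(data)
    return xs[0], quantile(xs, 1/4), quantile(xs, 1/2), quantile(xs, 3/4), xs[-1]

data = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5]
lo, q1, q2, q3, hi = five_number_summary(data)
print(lo, q1, q2, q3, hi, "IQR =", q3 - q1)
```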
23. In the interactive histogram, select boxplot. Construct a frequency distribution with at least 6 classes and at least 10 values. Compute the statistics in the five-number summary by hand and verify that you get the same results as the applet.
24. In the interactive histogram, select boxplot. Set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the shape of the boxplot and the relative positions of the statistics in the five-number summary:
25. In the interactive histogram, select boxplot. Start with a distribution and add additional points as follows. Note the effect on the boxplot:
In the last problem, you may have noticed that when you add a point to the distribution, one or more of the five statistics may not change at all. In general, quantiles can be relatively insensitive to changes in the data.
26. Compute the five statistics and sketch the boxplot for the velocity of light variable in Michelson's data. Compare the median with the "true value" of the velocity of light.
27. Compute the five statistics and sketch the boxplot for the density of the earth variable in Cavendish's data. Compare the median with the "true value" of the density of the earth.
28. Compute the five statistics and sketch the boxplot for the net weight variable in the M&M data.
29. Compute the five statistics for the sepal length variable in Fisher's iris data, using the cases indicated below. Plot the boxplots on parallel axes so that you can compare them.