Random samples and their sample means are ubiquitous in probability and statistics. In this section, we will see how sample means can be used to estimate probabilities, density functions, and distribution functions. As usual, our starting place is a basic random experiment that has a sample space and a probability measure P.
Suppose that X is a random variable for the experiment, taking values in a space S. Note that X might be the outcome variable for the entire experiment, in which case S would be the sample space. Recall that the distribution of X is the probability measure on S given by
P(A) = P(X ∈ A) for A ⊆ S.
Suppose now that we fix A. Recall that the indicator variable IA takes the value 1 if X is in A and 0 otherwise. This indicator variable has the Bernoulli distribution with parameter P(A) above.
1. Show that the mean and variance of IA are given by E(IA) = P(A) and var(IA) = P(A)[1 − P(A)].
Now suppose that we repeat the basic experiment indefinitely to form independent random variables X1, X2, ..., each with the distribution of X. Thus, for each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. The relative frequency of A for this sample is
Pn(A) = #{i ∈ {1, 2, ..., n}: Xi ∈ A} / n for A ⊆ S.
The relative frequency of A is a statistic that gives the proportion of times that A occurred in the first n runs.
2. Show that Pn(A) is the sample mean from a random sample of size n from the distribution of IA. Thus, conclude that Pn(A) → P(A) as n → ∞ with probability 1.
This special case of the strong law of large numbers is basic to the very concept of probability.
3. Show that for a fixed sample, Pn satisfies the axioms of a probability measure.
The probability measure Pn gives the empirical distribution of X, based on the random sample. It is a discrete distribution, concentrated at the distinct values of X1, X2, ..., Xn. Indeed, it places probability mass 1/n at Xi for each i, so that if the sample values are distinct, the empirical distribution is uniform on these sample values.
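The convergence of the relative frequency to the true probability can be checked numerically. Here is a minimal sketch in Python, using a hypothetical event (not from the text): A = {X ≤ 1/4} with X uniform on [0, 1], so that P(A) = 1/4 exactly.

```python
import random

random.seed(17)
# Hypothetical event: A = {X <= 1/4} with X uniform on [0, 1],
# so that P(A) = 1/4 exactly.
n = 100_000
count = sum(1 for _ in range(n) if random.random() <= 0.25)
p_n = count / n  # relative frequency Pn(A)

# Strong law of large numbers: Pn(A) -> P(A) = 0.25 as n -> infinity
assert abs(p_n - 0.25) < 0.01
```

Rerunning with larger n (or a different seed) shows Pn(A) settling ever closer to 0.25, just as in the applet experiments below.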
Several applets in this project are simulations of random experiments with events of interest. When you run the experiment, you are performing independent replications of the experiment. In most cases, the applet displays the relative frequency of the event and its complement, both graphically in blue, and numerically in a table. When you run the experiment, the relative frequencies are shown graphically in red and also numerically.
4. In the
simulation of Buffon's coin experiment, the event of
interest is that the coin crosses a crack. Run the experiment 1000 times with an update
frequency of 10. Note the apparent convergence of the relative frequency of the event to
the true probability.
5. In the
simulation of Bertrand's experiment, the event of
interest is that a "random chord" on a circle will be longer than the
length of a side of the inscribed equilateral triangle. Run the experiment 1000 times with
an update frequency of 10. Note the apparent convergence of the relative frequency of the
event to the true probability.
The following subsections consider a number of special cases of relative frequency.
Suppose now that X is a real-valued random variable for the basic experiment. Recall that the distribution function of X is the function F given by
F(x) = P(X ≤ x) for x ∈ R.
Suppose now that we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. It is natural to define the empirical distribution function by
Fn(x) = #{i ∈ {1, 2, ..., n}: Xi ≤ x} / n for x ∈ R.
For each x, Fn(x) is a statistic that gives the relative frequency of the sample variables that are less than or equal to x.
6. Show that Fn(x) = Pn((−∞, x]) for x ∈ R, so that Fn is the distribution function of the empirical distribution.
7. Show that for each x, Fn(x) is the sample mean for a random sample of size n from the distribution of the indicator variable of the event {X ≤ x}. Thus, conclude that Fn(x) → F(x) as n → ∞ with probability 1.
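The empirical distribution function is easy to compute for a given sample. A sketch, using a hypothetical sample from the standard exponential distribution, whose true distribution function is F(x) = 1 − e^(−x):

```python
import bisect
import math
import random

def empirical_cdf(sample):
    """Return Fn, the empirical distribution function of the sample."""
    data = sorted(sample)
    n = len(data)
    def Fn(x):
        # Fn(x) = (number of sample values <= x) / n, via binary search
        return bisect.bisect_right(data, x) / n
    return Fn

random.seed(3)
# Hypothetical sample from the standard exponential distribution,
# whose true distribution function is F(x) = 1 - exp(-x)
sample = [random.expovariate(1.0) for _ in range(50_000)]
Fn = empirical_cdf(sample)

for x in (0.5, 1.0, 2.0):
    assert abs(Fn(x) - (1 - math.exp(-x))) < 0.01  # Fn(x) -> F(x)
```

Sorting once and using binary search makes each evaluation of Fn cost O(log n), which matters when Fn is evaluated at many points.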
Suppose now that X is a random variable for the basic experiment with a discrete distribution on a countable set S. Let f denote the density function of X so that
f(x) = P(X = x) for x ∈ S.
Suppose now that we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. The relative frequency function (or empirical density function) corresponding to the sample is given by
fn(x) = #{i ∈ {1, 2, ..., n}: Xi = x} / n for x ∈ S.
For each x, fn(x) is a statistic that gives the relative frequency of the sample variables that have the value x.
8. Show that the empirical density function satisfies the mathematical properties of a discrete density function: fn(x) ≥ 0 for x ∈ S, and ∑x ∈ S fn(x) = 1.
9. Show that if X is real-valued, then the sample mean of (X1, X2, ..., Xn) is the mean of the empirical density function: (X1 + X2 + ··· + Xn) / n = ∑x ∈ S x fn(x).
10. Show that for each x, fn(x) is the sample mean for a random sample of size n from the distribution of the indicator variable of the event {X = x}. Thus, conclude that fn(x) → f(x) as n → ∞ with probability 1.
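The properties in Exercises 8–10 can all be checked numerically. A sketch using a hypothetical sample of fair die rolls, so that the true density is f(x) = 1/6 for x ∈ {1, ..., 6}:

```python
import random
from collections import Counter

random.seed(11)
# Hypothetical sample: rolls of a fair six-sided die, so f(x) = 1/6
n = 60_000
rolls = [random.randint(1, 6) for _ in range(n)]

counts = Counter(rolls)
fn = {x: counts[x] / n for x in range(1, 7)}  # empirical density function

# Exercise 8: fn is a genuine density (nonnegative, sums to 1)
assert all(fn[x] >= 0 for x in fn)
assert abs(sum(fn.values()) - 1) < 1e-9

# Exercise 9: the sample mean equals the mean of the empirical density
sample_mean = sum(rolls) / n
density_mean = sum(x * fn[x] for x in fn)
assert abs(sample_mean - density_mean) < 1e-9

# Exercise 10: fn(x) -> f(x) = 1/6 for each face x
assert all(abs(fn[x] - 1/6) < 0.01 for x in fn)
```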
Many of the applets in this project are simulations of experiments which result in discrete variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the true density function numerically in a table and visually as a blue bar graph. When you run the simulation, the relative frequency function is also shown numerically in the table and visually as a red bar graph.
11. In the
poker experiment, the random variable is
the type of hand. Run the simulation 1000 times updating every 10 runs and note the
apparent convergence of the empirical density function to the true density function.
12. In the
simulation of the binomial coin experiment, the
random variable is the number of heads. Run the simulation 1000 times updating every 10
runs and note the apparent convergence of the empirical density function to the true
density function.
13. In the
simulation of the matching experiment, the random variable
is the number of matches. Run the simulation 1000 times updating every 10 runs and note
the apparent convergence of the empirical density function to the true density function.
Recall again that the standard k-dimensional measure on Rk is given by
mk(A) = A1dx
for A
Rk.
In particular m1 is the length measure on R, m2 is the area measure on R2, and m3 is the volume measure on R3.
Suppose now that X is a random variable for the basic experiment, with a continuous distribution on a subset S of Rk. Let f denote the density function of X; technically, f is the density with respect to mk. Thus, by definition,
P(X ∈ A) = ∫A f(x) dx for A ⊆ S.
Again, we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X.
Suppose now that {Aj: j ∈ J} is a partition of S into a countable number of subsets. As before, we can define the empirical probability of Aj, based on the first n sample variables, by
Pn(Aj) = #{i ∈ {1, 2, ..., n}: Xi ∈ Aj} / n.
We then define the empirical density function as follows:
fn(x) = Pn(Aj) / mk(Aj) for x ∈ Aj.
Clearly the empirical density function fn depends on the partition, as well as n, but we suppress this to keep the notation from becoming completely unwieldy. Of course, for each x, fn(x) is a random variable (in fact, a statistic), but by the very definition of density, if the partition is sufficiently fine (so that Aj is small for each j) and if n is sufficiently large, then by the law of large numbers,
fn(x) ≈ f(x) for x ∈ S.
14. Show that fn satisfies the mathematical properties of a density function: fn(x) ≥ 0 for x ∈ S, and ∫S fn(x) dx = 1.
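The partition-based empirical density can be sketched as follows, assuming a hypothetical sample from the uniform distribution on [0, 1] (true density f(x) = 1) and a partition into 20 subintervals of equal length:

```python
import random

random.seed(5)
# Hypothetical sample from the uniform distribution on [0, 1],
# whose true density is f(x) = 1 on [0, 1]
n = 100_000
sample = [random.random() for _ in range(n)]

# Partition [0, 1] into k = 20 intervals Aj of equal length 1/20
k = 20
width = 1 / k
counts = [0] * k
for x in sample:
    j = min(int(x / width), k - 1)  # index of the cell containing x
    counts[j] += 1

# Empirical density: fn(x) = Pn(Aj) / m1(Aj) for x in Aj
fn = [(c / n) / width for c in counts]

# Exercise 14: fn integrates to 1 over [0, 1] ...
assert abs(sum(f * width for f in fn) - 1) < 1e-9
# ... and approximates the true density f(x) = 1 in each cell
assert all(abs(f - 1.0) < 0.1 for f in fn)
```

Refining the partition reduces the discretization error, while increasing n reduces the random error in each cell, exactly the tradeoff described above.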
Many of the applets in this project are simulations of experiments which result in continuous variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the true density visually as a blue graph. When you run the simulation, an empirical density function is also shown visually as a red bar graph.
15. Run the
simulation of the exponential experiment 1000 times
with an update frequency of 10. Note the apparent convergence of the empirical density
function to the true density function.
16. In the
simulation of the random variable
experiment, select the normal distribution. Run the experiment 1000 times with
an update frequency of 10, and note the apparent convergence of the empirical density
function to the true density function.
Many of the concepts discussed above are frequently used in exploratory data analysis. Specifically, suppose that x is a variable for a population (generally vector valued), and that x1, x2, ..., xn are the observed data from a sample of size n, corresponding to this variable. For example, x might encode the color counts and net weight for a bag of M&Ms. Now let {Aj: j ∈ J} be a partition of the data set, where J is a finite index set. The sets Aj, j ∈ J, are known as classes. Just as above, we define the frequency and relative frequency of Aj as follows:
q(Aj) = #{i ∈ {1, 2, ..., n}: xi ∈ Aj}, p(Aj) = q(Aj) / n.
If x is a continuous variable, taking values in Rk, we also define the density of Aj as follows:
f(Aj) = p(Aj) / mk(Aj).
The mapping q that assigns frequencies to classes is known as a frequency distribution for the data set. Similarly, p and f define a relative frequency distribution and a density distribution, respectively, for the data set. When k = 1 or 2, the bar graph of any of these distributions is known as a histogram.
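The frequency, relative frequency, and density distributions can be computed directly. In this sketch, the data values and classes are hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical data: measured weights (grams) of 10 bags of M&Ms;
# the values are made up for illustration
weights = [49.8, 50.1, 50.3, 49.9, 50.6, 50.2, 50.0, 49.7, 50.4, 50.1]

# Classes: 5 intervals of equal width 0.25 covering [49.6, 50.85)
lo, width, k = 49.6, 0.25, 5
freq = Counter(min(int((w - lo) / width), k - 1) for w in weights)

for j in range(k):
    a, b = lo + j * width, lo + (j + 1) * width
    q = freq[j]           # frequency q(Aj)
    p = q / len(weights)  # relative frequency p(Aj)
    f = p / width         # density f(Aj) = p(Aj) / m1(Aj)
    print(f"[{a:.2f}, {b:.2f}): q={q}  p={p:.2f}  f={f:.2f}")
```

Printing all three columns side by side makes the relationship among the distributions plain: p rescales q by n, and f rescales p by the class width.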
The whole purpose of constructing and graphing one of these empirical distributions is to summarize and display the data in a meaningful way. Thus, there are some general guidelines for choosing the classes.
17. In the
interactive histogram, click on the x-axis at various points to generate a data
set with 20 values. Vary the class width over the five values from 0.1 to 5.0 and then
back again. For each choice of class width, switch between the frequency histogram and the
relative frequency histogram. Note how the shape of the histogram changes as you perform
these operations.
It is important to realize that frequency data is inevitable for a continuous variable. For example, suppose that our variable represents the weight of a bag of M&Ms (in grams) and that our measuring device (a scale) is accurate to 0.01 grams. If we measure the weight of a bag as 50.32, then we are really saying that the weight is in the interval [50.315, 50.325). Similarly, when two bags have the same measured weight, the apparent equality of the weights is really just an artifact of the imprecision of the measuring device; actually the two bags almost certainly do not have the exact same weight. Thus, two bags with the same measured weight really give us a frequency count of 2 for a certain interval.
Again, there is a tradeoff between the number of classes and the size of the classes; these determine the resolution of the empirical distribution. At one extreme, when the class size is smaller than the accuracy of the recorded data, each class contains a single distinct value. In this case, there is no loss of information and we can recover the original data set from the frequency distribution (except for the order in which the data values were obtained). On the other hand, it can be hard to discern the shape of the data when we have many classes of small size. At the other extreme is a frequency distribution with one class that contains all of the possible values of the data set. In this case, all information is lost, except the number of the values in the data set. Between these two extreme cases, an empirical distribution gives us partial information, but not complete information. These intermediate cases can organize the data in a useful way.
18. In the
interactive histogram, set the class width to 0.1. Click on the x-axis to
generate a data set with 10 distinct values and 20 values total.
19.
In Michelson's
data, construct a frequency distribution for the velocity of light variable. Use 10 classes of equal width. Draw the
histogram and describe the shape of the distribution.
20.
In
Cavendish's data, construct a relative frequency distribution for the density of the earth variable. Use 5 classes of equal width. Draw the
histogram and describe the shape of the distribution.
21. In the M&M data, construct a frequency distribution and histogram for the total count variable and for the net weight variable.
22. In the Cicada data, construct a density distribution and histogram for the body weight variable for the cases given below. Note any differences.
23. In the interactive histogram, set the class width to 0.1 and click on the axis to generate a distribution of the given type with 30 points. Now increase the class width to each of the other four values and describe the type of distribution.