
3. Relative Frequency and Empirical Distributions


Random samples and their sample means are ubiquitous in probability and statistics. In this section, we will see how sample means can be used to estimate probabilities, density functions, and distribution functions. As usual, our starting place is a basic random experiment that has a sample space and a probability measure P.

Relative Frequency

Suppose that X is a random variable for the experiment, taking values in a space S. Note that X might be the outcome variable for the entire experiment, in which case S would be the sample space. Recall that the distribution of X is the probability measure on S given by

P(A) = P(X in A) for A subset of S.

Suppose now that we fix A. Recall that the indicator variable IA takes the value 1 if X is in A and 0 otherwise. This indicator variable has the Bernoulli distribution with parameter P(A).

Mathematical Exercise 1. Show that the mean and variance of IA are given by

  1. E(IA) = P(A).
  2. var(IA) = P(A)[1 - P(A)].

Now suppose that we repeat the basic experiment indefinitely to form independent random variables X1, X2, ..., each with the distribution of X. Thus, for each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. The relative frequency of A for this sample is

Pn(A) = #{i in {1, 2, ..., n}: Xi in A} / n for A subset of S.

The relative frequency of A is a statistic that gives the proportion of times that A occurred in the first n runs.

Mathematical Exercise 2. Show that Pn(A) is the sample mean for a random sample of size n from the distribution of IA. Thus, conclude that

  1. E[Pn(A)] = P(A).
  2. var[Pn(A)] = P(A)[1 - P(A)] / n.
  3. Pn(A) converges to P(A) as n tends to infinity (with probability 1).

This special case of the strong law of large numbers is basic to the very concept of probability.
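As a concrete illustration of this convergence, here is a minimal Python sketch; the die-rolling experiment and the event A = {roll is even} are our illustrative choices, not part of the text:

    import random

    random.seed(42)                    # fixed seed so the runs are reproducible
    count = 0                          # number of runs in which A occurred
    for n in range(1, 10001):
        roll = random.randint(1, 6)    # one run of the basic experiment: roll a fair die
        if roll % 2 == 0:              # the event A = {roll is even}, so P(A) = 1/2
            count += 1
        if n in (10, 100, 1000, 10000):
            print(n, count / n)        # the relative frequency Pn(A) after n runs

The printed relative frequencies drift toward P(A) = 1/2, with fluctuations on the order of sqrt(P(A)[1 - P(A)] / n), by Exercise 2.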

Mathematical Exercise 3. Show that for a fixed sample, Pn satisfies the axioms of a probability measure.

The probability measure Pn gives the empirical distribution of X, based on the random sample. It is a discrete distribution, concentrated at the distinct values of X1, X2, ..., Xn. Indeed, it places probability mass 1/n at Xi for each i, so that if the sample values are distinct, the empirical distribution is uniform on these sample values.
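For example (with a small hypothetical sample), the empirical distribution can be tabulated directly:

    from collections import Counter

    sample = [2, 5, 2, 3, 5, 5]    # n = 6 observed values of X
    n = len(sample)
    # each occurrence of a value contributes mass 1/n to that value
    empirical = {x: c / n for x, c in Counter(sample).items()}
    print(empirical)               # {2: 0.333..., 5: 0.5, 3: 0.166...}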

Several applets in this project are simulations of random experiments with events of interest. When you run the experiment, you are performing independent replications of the experiment. In most cases, the applet displays the probability of the event and its complement, both graphically in blue and numerically in a table. When you run the experiment, the relative frequencies of the event and its complement are shown graphically in red and also numerically in the table.

Simulation Exercise 4. In the simulation of Buffon's coin experiment, the event of interest is that the coin crosses a crack. Run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the true probability.

Simulation Exercise 5. In the simulation of Bertrand's experiment, the event of interest is that a "random chord" on a circle will be longer than the length of a side of the inscribed equilateral triangle. Run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the true probability.

The following subsections consider a number of special cases of relative frequency.

The Empirical Distribution Function

Suppose now that X is a real-valued random variable for the basic experiment. Recall that the distribution function of X is the function F given by

F(x) = P(X <= x) for x in R.

Suppose now that we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. It is natural to define the empirical distribution function by

Fn(x) = #{i in {1, 2, ..., n}: Xi <= x} / n for x in R.

For each x, Fn(x) is a statistic that gives the relative frequency of the sample variables that are less than or equal to x.
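The definition translates directly into code. A minimal sketch in Python (the sample values are hypothetical):

    def empirical_cdf(sample, x):
        """Fn(x): the proportion of sample values less than or equal to x."""
        return sum(1 for xi in sample if xi <= x) / len(sample)

    sample = [1.4, 0.2, 2.7, 0.2, 1.9]
    print(empirical_cdf(sample, 1.0))    # 0.4, since two of the five values are <= 1.0
    print(empirical_cdf(sample, 3.0))    # 1.0, since all of the values are <= 3.0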

Mathematical Exercise 6. Show that

  1. Fn increases from 0 to 1.
  2. Fn is a step function with jumps at the distinct values of X1, X2, ..., Xn.
  3. Fn is the distribution function of the empirical distribution based on {X1, X2, ..., Xn}.

Mathematical Exercise 7. Show that for each x, Fn(x) is the sample mean for a random sample of size n from the distribution of the indicator variable I of the event {X <= x}. Thus, conclude that

  1. E[Fn(x)] = F(x).
  2. var[Fn(x)] = F(x) [1 - F(x)] / n.
  3. Fn(x) converges to F(x) as n tends to infinity (with probability 1).

Empirical Density for a Discrete Variable

Suppose now that X is a random variable for the basic experiment with a discrete distribution on a countable set S. Let f denote the density function of X so that

f(x) = P(X = x) for x in S.

Suppose now that we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. The relative frequency function (or empirical density function) corresponding to the sample is given by

fn(x) = #{i in {1, 2, ..., n}: Xi = x} / n for x in S.

For each x, fn(x) is a statistic that gives the relative frequency of the sample variables that have the value x.

Mathematical Exercise 8. Show that the empirical density function satisfies the mathematical properties of a discrete density function:

  1. fn(x) >= 0 for each x in S.
  2. sum over x in S of fn(x) = 1.
  3. fn is the density function of the empirical distribution based on {X1, X2, ..., Xn}.

Mathematical Exercise 9. Show that if X is real valued, then the sample mean of (X1, X2, ..., Xn) is the mean of the empirical density function.

Mathematical Exercise 10. Show that for each x, fn(x) is the sample mean for a random sample of size n from the distribution of the indicator variable I of the event {X = x}. Thus, conclude that

  1. E[fn(x)] = f(x).
  2. var[fn(x)] = f(x)[1 - f(x)] / n.
  3. fn(x) converges to f(x) as n tends to infinity (with probability 1).
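The convergence in Exercise 10 is easy to check by simulation. A minimal sketch, using the binomial coin experiment of Exercise 12 below with 5 tosses of a fair coin (the parameters are our choice):

    import math
    import random
    from collections import Counter

    random.seed(2)
    n = 10000
    # X = the number of heads in 5 tosses of a fair coin
    sample = [sum(random.random() < 0.5 for _ in range(5)) for _ in range(n)]
    counts = Counter(sample)
    for x in range(6):
        fn = counts[x] / n              # empirical density fn(x)
        f = math.comb(5, x) / 2 ** 5    # true binomial density f(x)
        print(x, round(fn, 4), f)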

Many of the applets in this project are simulations of experiments which result in discrete variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the true density function numerically in a table and visually as a blue bar graph. When you run the simulation, the relative frequency function is also shown numerically in the table and visually as a red bar graph.

Simulation Exercise 11. In the poker experiment, the random variable is the type of hand. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the empirical density function to the true density function.

Simulation Exercise 12. In the simulation of the binomial coin experiment, the random variable is the number of heads. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the empirical density function to the true density function.

Simulation Exercise 13. In the simulation of the matching experiment, the random variable is the number of matches. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the empirical density function to the true density function.

Empirical Density for a Continuous Variable

Recall again that the standard k-dimensional measure on Rk is given by

mk(A) = integral over A of 1 dx for A subset of Rk.

In particular m1 is the length measure on R, m2 is the area measure on R2, and m3 is the volume measure on R3.

Suppose now that X is a random variable for the basic experiment, with a continuous distribution on a subset S of Rk. Let f denote the density function of X; technically, f is the density with respect to mk. Thus, by definition,

P(X in A) = integral over A of f(x) dx for A subset of S.

Again, we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X.

Suppose now that {Aj: j in J} is a partition of S into a countable number of subsets. As before, we can define the empirical probability of Aj, based on the first n sample variables, by

Pn(Aj) = #{i in {1, 2, ..., n}: Xi in Aj} / n.

We then define the empirical density function as follows:

fn(x) = Pn(Aj) / mk(Aj) for x in Aj.

Clearly the empirical density function fn depends on the partition, as well as n, but we suppress this to keep the notation from becoming completely unwieldy. Of course, for each x, fn(x) is a random variable (in fact, a statistic), but by the very definition of density, if the partition is sufficiently fine (so that Aj is small for each j) and if n is sufficiently large, then by the law of large numbers,

fn(x) ~ f(x) for x in S.
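Here is a hand-rolled version of this construction in Python (a sketch; the exponential distribution with f(x) = exp(-x) and the partition of [0, infinity) into intervals of length 1/2 are our illustrative choices):

    import math
    import random

    random.seed(3)
    n = 10000
    sample = [random.expovariate(1.0) for _ in range(n)]   # a sample from the exponential distribution

    width = 0.5                        # the partition sets are Aj = [j*width, (j+1)*width)
    counts = {}
    for x in sample:
        j = int(x // width)            # index of the partition set containing x
        counts[j] = counts.get(j, 0) + 1

    for j in range(4):
        fn = (counts.get(j, 0) / n) / width    # fn(x) = Pn(Aj) / m1(Aj) for x in Aj
        mid = (j + 0.5) * width                # compare with f at the midpoint of Aj
        print(round(mid, 2), round(fn, 3), round(math.exp(-mid), 3))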

Mathematical Exercise 14. Show that fn satisfies the mathematical properties of a density function:

  1. fn(x) >= 0 for each x in S.
  2. integral over S of fn(x) dx = 1.
  3. fn is the density function of the distribution that spreads the empirical probability Pn(Aj) uniformly over Aj, for each j in J.

Many of the applets in this project are simulations of experiments which result in continuous variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the true density visually as a blue graph. When you run the simulation, an empirical density function is also shown visually as a red bar graph.

Simulation Exercise 15. Run the simulation of the exponential experiment 1000 times with an update frequency of 10. Note the apparent convergence of the empirical density function to the true density function.

Simulation Exercise 16. In the simulation of the random variable experiment, select the normal distribution. Run the experiment 1000 times with an update frequency of 10, and note the apparent convergence of the empirical density function to the true density function.

Exploratory Data Analysis

Many of the concepts discussed above are frequently used in exploratory data analysis. Specifically, suppose that x is a variable for a population (generally vector valued), and that

x1, x2, ..., xn

are the observed data from a sample of size n, corresponding to this variable. For example, x might encode the color counts and net weight for a bag of M&Ms. Now let {Aj: j in J} be a partition of the data set, where J is a finite index set. The sets Aj, j in J, are known as classes. Just as above, we define the frequency and relative frequency of Aj as follows:

q(Aj) = #{i in {1, 2, ..., n}: xi in Aj}, p(Aj) = q(Aj) / n.

If x is a continuous variable, taking values in Rk, we also define the density of Aj as follows:

f(Aj) = p(Aj) / mk(Aj).

The mapping q that assigns frequencies to classes is known as a frequency distribution for the data set. Similarly, p and f define a relative frequency distribution and a density distribution, respectively, for the data set. When k = 1 or 2, the bar graph of any of these distributions is known as a histogram.
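A sketch of all three distributions for a one-dimensional data set, using classes of equal width (the data values and the class width are hypothetical):

    data = [50.32, 50.28, 50.35, 50.32, 50.41, 50.29, 50.33, 50.37]   # hypothetical bag weights
    n = len(data)
    width = 0.05                       # common class width, so m1(Aj) = 0.05
    lo = min(data)

    freq = {}                          # q: maps a class index to its frequency
    for x in data:
        j = int((x - lo) // width)
        freq[j] = freq.get(j, 0) + 1

    for j, q in sorted(freq.items()):
        p = q / n                      # relative frequency of the class
        f = p / width                  # density of the class (k = 1)
        a = lo + j * width
        print(f"[{a:.2f}, {a + width:.2f}): q = {q}, p = {p:.3f}, f = {f:.1f}")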

The whole purpose of constructing and graphing one of these empirical distributions is to summarize and display the data in a meaningful way. Thus, there are some general guidelines in choosing the classes:

  1. The number of classes should be moderate.
  2. If possible, the classes should have the same size.

Simulation Exercise 17. In the interactive histogram, click on the x-axis at various points to generate a data set with 20 values. Vary the class width over the five values from 0.1 to 5.0 and then back again. For each choice of class width, switch between the frequency histogram and the relative frequency histogram. Note how the shape of the histogram changes as you perform these operations.

It is important to realize that frequency data is inevitable for a continuous variable. For example, suppose that our variable represents the weight of a bag of M&Ms (in grams) and that our measuring device (a scale) is accurate to 0.01 grams. If we measure the weight of a bag as 50.32, then we are really saying that the weight is in the interval [50.315, 50.325). Similarly, when two bags have the same measured weight, the apparent equality of the weights is really just an artifact of the imprecision of the measuring device; actually the two bags almost certainly do not have exactly the same weight. Thus, two bags with the same measured weight really give us a frequency count of 2 for a certain interval.

Again, there is a tradeoff between the number of classes and the size of the classes; these determine the resolution of the empirical distribution. At one extreme, when the class size is smaller than the accuracy of the recorded data, each class contains a single distinct value. In this case, there is no loss of information and we can recover the original data set from the frequency distribution (except for the order in which the data values were obtained). On the other hand, it can be hard to discern the shape of the data when we have many classes of small size. At the other extreme is a frequency distribution with one class that contains all of the possible values of the data set. In this case, all information is lost, except the number of the values in the data set. Between these two extreme cases, an empirical distribution gives us partial information, but not complete information. These intermediate cases can organize the data in a useful way.

Simulation Exercise 18. In the interactive histogram, set the class width to 0.1. Click on the x-axis to generate a data set with 10 distinct values and 20 values total.

  1. From the frequency distribution, explicitly write down the 20 values in the data set.
  2. Now increase the class width to 0.2, 0.5, 1.0, and 5.0. Note how the histogram loses resolution; that is, how the frequency distribution loses information about the original data set.

Data Analysis Exercise 19. In Michelson's data, construct a frequency distribution for the velocity of light variable. Use 10 classes of equal width. Draw the histogram and describe the shape of the distribution.

Data Analysis Exercise 20. In Cavendish's data, construct a relative frequency distribution for the density of the earth variable. Use 5 classes of equal width. Draw the histogram and describe the shape of the distribution.

Data Analysis Exercise 21. In the M&M data, construct a frequency distribution and histogram for the total count variable and for the net weight variable.

Data Analysis Exercise 22. In the Cicada data, construct a density distribution and histogram for the body weight variable for the cases given below. Note any differences.

  1. All cases.
  2. Each species individually.
  3. Male and female individually.

Simulation Exercise 23. In the interactive histogram, set the class width to 0.1 and click on the x-axis to generate a distribution of the given type with 30 points. Now increase the class width to each of the other four values and describe the type of distribution.

  1. A uniform distribution.
  2. A symmetric unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A u-shaped distribution.