
3. Relative Frequency and Empirical Distributions


Random samples and their sample means are ubiquitous in probability and statistics. In this section, we will see how sample means can be used to estimate probabilities, density functions, and distribution functions. As usual, our starting place is a basic random experiment that has a sample space and a probability measure P.

Relative Frequency

Suppose that X is a random variable for the experiment, taking values in a space S. Note that X might be the outcome variable for the entire experiment, in which case S would be the sample space. Recall that the distribution of X is the probability measure on S given by

P(A) = P(X in A) for A subset of S.

Suppose now that we fix A. Recall that the indicator variable IA takes the value 1 if X is in A and 0 otherwise. This indicator variable has the Bernoulli distribution with parameter P(A).

Mathematical Exercise 1. Show that the mean and variance of IA are given by

  1. E(IA) = P(A).
  2. var(IA) = P(A)[1 - P(A)].

Now suppose that we repeat the basic experiment indefinitely to form independent random variables X1, X2, ..., each with the distribution of X. Thus, for each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. The relative frequency of A for this sample is

Pn(A) = #{i in {1, 2, ..., n}: Xi in A} / n for A subset of S.

The relative frequency of A is a statistic that gives the proportion of times that A occurred in the first n runs.

Mathematical Exercise 2. Show that Pn(A) is the sample mean for a random sample of size n from the distribution of IA. Thus, conclude that

  1. E[Pn(A)] = P(A).
  2. var[Pn(A)] = P(A)[1 - P(A)] / n.
  3. Pn(A) converges to P(A) as n tends to infinity (with probability 1).

This special case of the strong law of large numbers is basic to the very concept of probability.
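As a concrete illustration of this convergence, here is a minimal Python sketch; the die-rolling experiment and the event A = {roll is even} are our illustrative choices, not part of the text:

    import random

    random.seed(42)                    # fixed seed so the runs are reproducible
    count = 0                          # number of runs in which A occurred
    for n in range(1, 10001):
        roll = random.randint(1, 6)    # one run of the basic experiment: roll a fair die
        if roll % 2 == 0:              # the event A = {roll is even}, so P(A) = 1/2
            count += 1
        if n in (10, 100, 1000, 10000):
            print(n, count / n)        # the relative frequency Pn(A) after n runs

The printed relative frequencies drift toward P(A) = 1/2, with fluctuations on the order of sqrt(P(A)[1 - P(A)] / n), by Exercise 2.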

Mathematical Exercise 3. Show that for a fixed sample, Pn satisfies the axioms of a probability measure.

The probability measure Pn gives the empirical distribution of X, based on the random sample. It is a discrete distribution, concentrated at the distinct values of X1, X2, ..., Xn. Indeed, it places probability mass 1/n at Xi for each i, so that if the sample values are distinct, the empirical distribution is uniform on these sample values.
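For example (with a small hypothetical sample), the empirical distribution can be tabulated directly:

    from collections import Counter

    sample = [2, 5, 2, 3, 5, 5]    # n = 6 observed values of X
    n = len(sample)
    # each occurrence of a value contributes mass 1/n to that value
    empirical = {x: c / n for x, c in Counter(sample).items()}
    print(empirical)               # {2: 0.333..., 5: 0.5, 3: 0.166...}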

Several applets in this project are simulations of random experiments with events of interest. When you run the experiment, you are performing independent replications of the experiment. In most cases, the applet displays the probability of the event and its complement, both graphically in blue and numerically in a table. When you run the experiment, the relative frequencies of the event and its complement are shown graphically in red and also numerically in the table.

Simulation Exercise 4. In the simulation of Buffon's coin experiment, the event of interest is that the coin crosses a crack. Run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the true probability.

Simulation Exercise 5. In the simulation of Bertrand's experiment, the event of interest is that a "random chord" on a circle will be longer than the length of a side of the inscribed equilateral triangle. Run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the true probability.

The following subsections consider a number of special cases of relative frequency.

The Empirical Distribution Function

Suppose now that X is a real-valued random variable for the basic experiment. Recall that the distribution function of X is the function F given by

F(x) = P(X <= x) for x in R.

Suppose now that we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. It is natural to define the empirical distribution function by

Fn(x) = #{i in {1, 2, ..., n}: Xi <= x} / n for x in R.

For each x, Fn(x) is a statistic that gives the relative frequency of the sample variables that are less than or equal to x.
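The definition translates directly into code. A minimal sketch in Python (the sample values are hypothetical):

    def empirical_cdf(sample, x):
        """Fn(x): the proportion of sample values less than or equal to x."""
        return sum(1 for xi in sample if xi <= x) / len(sample)

    sample = [1.4, 0.2, 2.7, 0.2, 1.9]
    print(empirical_cdf(sample, 1.0))    # 0.4, since two of the five values are <= 1.0
    print(empirical_cdf(sample, 3.0))    # 1.0, since all of the values are <= 3.0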

Mathematical Exercise 6. Show that

  1. Fn increases from 0 to 1.
  2. Fn is a step function with jumps at the distinct values of X1, X2, ..., Xn.
  3. Fn is the distribution function of the empirical distribution based on {X1, X2, ..., Xn}.

Mathematical Exercise 7. Show that for each x, Fn(x) is the sample mean for a random sample of size n from the distribution of the indicator variable I of the event {X <= x}. Thus, conclude that

  1. E[Fn(x)] = F(x).
  2. var[Fn(x)] = F(x) [1 - F(x)] / n.
  3. Fn(x) converges to F(x) as n tends to infinity (with probability 1).

Empirical Density for a Discrete Variable

Suppose now that X is a random variable for the basic experiment with a discrete distribution on a countable set S. Let f denote the density function of X so that

f(x) = P(X = x) for x in S.

Suppose now that we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X. The relative frequency function (or empirical density function) corresponding to the sample is given by

fn(x) = #{i in {1, 2, ..., n}: Xi = x} / n for x in S.

For each x, fn(x) is a statistic that gives the relative frequency of the sample variables that have the value x.

Mathematical Exercise 8. Show that the empirical density function satisfies the mathematical properties of a discrete density function:

  1. fn(x) >= 0 for each x in S.
  2. sum over x in S of fn(x) = 1.
  3. fn is the density function of the empirical distribution based on {X1, X2, ..., Xn}.

Mathematical Exercise 9. Show that if X is real valued, then the sample mean of (X1, X2, ..., Xn) is the mean of the empirical density function.

Mathematical Exercise 10. Show that for each x, fn(x) is the sample mean for a random sample of size n from the distribution of the indicator variable I of the event {X = x}. Thus, conclude that

  1. E[fn(x)] = f(x).
  2. var[fn(x)] = f(x)[1 - f(x)] / n.
  3. fn(x) converges to f(x) as n tends to infinity (with probability 1).
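The convergence in Exercise 10 is easy to check by simulation. A minimal sketch, using the binomial coin experiment of Exercise 12 below with 5 tosses of a fair coin (the parameters are our choice):

    import math
    import random
    from collections import Counter

    random.seed(2)
    n = 10000
    # X = the number of heads in 5 tosses of a fair coin
    sample = [sum(random.random() < 0.5 for _ in range(5)) for _ in range(n)]
    counts = Counter(sample)
    for x in range(6):
        fn = counts[x] / n              # empirical density fn(x)
        f = math.comb(5, x) / 2 ** 5    # true binomial density f(x)
        print(x, round(fn, 4), f)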

Many of the applets in this project are simulations of experiments which result in discrete variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the true density function numerically in a table and visually as a blue bar graph. When you run the simulation, the relative frequency function is also shown numerically in the table and visually as a red bar graph.

Simulation Exercise 11. In the poker experiment, the random variable is the type of hand. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the empirical density function to the true density function.

Simulation Exercise 12. In the simulation of the binomial coin experiment, the random variable is the number of heads. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the empirical density function to the true density function.

Simulation Exercise 13. In the simulation of the matching experiment, the random variable is the number of matches. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the empirical density function to the true density function.

Empirical Density for a Continuous Variable

Recall again that the standard k-dimensional measure on Rk is given by

mk(A) = integral over A of 1 dx for A subset of Rk.

In particular m1 is the length measure on R, m2 is the area measure on R2, and m3 is the volume measure on R3.

Suppose now that X is a random variable for the basic experiment, with a continuous distribution on a subset S of Rk. Let f denote the density function of X; technically, f is the density with respect to mk. Thus, by definition,

P(X in A) = integral over A of f(x) dx for A subset of S.

Again, we repeat the experiment to form independent random variables X1, X2, ..., each with the same distribution as X. For each n, (X1, X2, ..., Xn) is a random sample of size n from the distribution of X.

Suppose now that {Aj: j in J} is a partition of S into a countable number of subsets. As before, we can define the empirical probability of Aj, based on the first n sample variables, by

Pn(Aj) = #{i in {1, 2, ..., n}: Xi in Aj} / n.

We then define the empirical density function as follows:

fn(x) = Pn(Aj) / mk(Aj) for x in Aj.

Clearly the empirical density function fn depends on the partition, as well as n, but we suppress this to keep the notation from becoming completely unwieldy. Of course, for each x, fn(x) is a random variable (in fact, a statistic), but by the very definition of density, if the partition is sufficiently fine (so that Aj is small for each j) and if n is sufficiently large, then by the law of large numbers,

fn(x) ~ f(x) for x in S.
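Here is a hand-rolled version of this construction in Python (a sketch; the exponential distribution with f(x) = exp(-x) and the partition of [0, infinity) into intervals of length 1/2 are our illustrative choices):

    import math
    import random

    random.seed(3)
    n = 10000
    sample = [random.expovariate(1.0) for _ in range(n)]   # a sample from the exponential distribution

    width = 0.5                        # the partition sets are Aj = [j*width, (j+1)*width)
    counts = {}
    for x in sample:
        j = int(x // width)            # index of the partition set containing x
        counts[j] = counts.get(j, 0) + 1

    for j in range(4):
        fn = (counts.get(j, 0) / n) / width    # fn(x) = Pn(Aj) / m1(Aj) for x in Aj
        mid = (j + 0.5) * width                # compare with f at the midpoint of Aj
        print(round(mid, 2), round(fn, 3), round(math.exp(-mid), 3))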

Mathematical Exercise 14. Show that fn satisfies the mathematical properties of a density function:

  1. fn(x) >= 0 for each x in S.
  2. integral over S of fn(x) dx = 1.
  3. fn is the density function of the distribution that spreads the empirical probability Pn(Aj) uniformly over Aj, for each j in J.

Many of the applets in this project are simulations of experiments which result in continuous variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the true density visually as a blue graph. When you run the simulation, an empirical density function is also shown visually as a red bar graph.

Simulation Exercise 15. Run the simulation of the exponential experiment 1000 times with an update frequency of 10. Note the apparent convergence of the empirical density function to the true density function.

Simulation Exercise 16. In the simulation of the random variable experiment, select the normal distribution. Run the experiment 1000 times with an update frequency of 10, and note the apparent convergence of the empirical density function to the true density function.

Exploratory Data Analysis

Many of the concepts discussed above are frequently used in exploratory data analysis. Specifically, suppose that x is a variable for a population (generally vector valued), and that

x1, x2, ..., xn

are the observed data from a sample of size n, corresponding to this variable. For example, x might encode the color counts and net weight for a bag of M&Ms. Now let {Aj: j in J} be a partition of the data set, where J is a finite index set. The sets Aj, j in J, are known as classes. Just as above, we define the frequency and relative frequency of Aj as follows:

q(Aj) = #{i in {1, 2, ..., n}: xi in Aj}, p(Aj) = q(Aj) / n.

If x is a continuous variable, taking values in Rk, we also define the density of Aj as follows:

f(Aj) = p(Aj) / mk(Aj).

The mapping q that assigns frequencies to classes is known as a frequency distribution for the data set. Similarly, p and f define a relative frequency distribution and a density distribution, respectively, for the data set. When k = 1 or 2, the bar graph of any of these distributions is known as a histogram.
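A sketch of all three distributions for a one-dimensional data set, using classes of equal width (the data values and the class width are hypothetical):

    data = [50.32, 50.28, 50.35, 50.32, 50.41, 50.29, 50.33, 50.37]   # hypothetical bag weights
    n = len(data)
    width = 0.05                       # common class width, so m1(Aj) = 0.05
    lo = min(data)

    freq = {}                          # q: maps a class index to its frequency
    for x in data:
        j = int((x - lo) // width)
        freq[j] = freq.get(j, 0) + 1

    for j, q in sorted(freq.items()):
        p = q / n                      # relative frequency of the class
        f = p / width                  # density of the class (k = 1)
        a = lo + j * width
        print(f"[{a:.2f}, {a + width:.2f}): q = {q}, p = {p:.3f}, f = {f:.1f}")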

The whole purpose of constructing and graphing one of these empirical distributions is to summarize and display the data in a meaningful way. Thus, there are some general guidelines in choosing the classes:

  1. The number of classes should be moderate.
  2. If possible, the classes should have the same size.

Simulation Exercise 17. In the interactive histogram, click on the x-axis at various points to generate a data set with 20 values. Vary the class width over the five values from 0.1 to 5.0 and then back again. For each choice of class width, switch between the frequency histogram and the relative frequency histogram. Note how the shape of the histogram changes as you perform these operations.

It is important to realize that frequency data is inevitable for a continuous variable. For example, suppose that our variable represents the weight of a bag of M&Ms (in grams) and that our measuring device (a scale) is accurate to 0.01 grams. If we measure the weight of a bag as 50.32, then we are really saying that the weight is in the interval [50.315, 50.325). Similarly, when two bags have the same measured weight, the apparent equality of the weights is really just an artifact of the imprecision of the measuring device; actually the two bags almost certainly do not have exactly the same weight. Thus, two bags with the same measured weight really give us a frequency count of 2 for a certain interval.

Again, there is a tradeoff between the number of classes and the size of the classes; these determine the resolution of the empirical distribution. At one extreme, when the class size is smaller than the accuracy of the recorded data, each class contains a single distinct value. In this case, there is no loss of information and we can recover the original data set from the frequency distribution (except for the order in which the data values were obtained). On the other hand, it can be hard to discern the shape of the data when we have many classes of small size. At the other extreme is a frequency distribution with one class that contains all of the possible values of the data set. In this case, all information is lost, except the number of the values in the data set. Between these two extreme cases, an empirical distribution gives us partial information, but not complete information. These intermediate cases can organize the data in a useful way.

Simulation Exercise 18. In the interactive histogram, set the class width to 0.1. Click on the x-axis to generate a data set with 10 distinct values and 20 values total.

  1. From the frequency distribution, explicitly write down the 20 values in the data set.
  2. Now increase the class width to 0.2, 0.5, 1.0, and 5.0. Note how the histogram loses resolution; that is, how the frequency distribution loses information about the original data set.

Data Analysis Exercise 19. In Michelson's data, construct a frequency distribution for the velocity of light variable. Use 10 classes of equal width. Draw the histogram and describe the shape of the distribution.

Data Analysis Exercise 20. In Cavendish's data, construct a relative frequency distribution for the density of the earth variable. Use 5 classes of equal width. Draw the histogram and describe the shape of the distribution.

Data Analysis Exercise 21. In the M&M data, construct a frequency distribution and histogram for the total count variable and for the net weight variable.

Data Analysis Exercise 22. In the Cicada data, construct a density distribution and histogram for the body weight variable for the cases given below. Note any differences.

  1. All cases.
  2. Each species individually.
  3. Male and female individually.

Simulation Exercise 23. In the interactive histogram, set the class width to 0.1 and click on the x-axis to generate a distribution of the given type with 30 points. Now increase the class width to each of the other four values and describe the type of distribution.

  1. A uniform distribution.
  2. A symmetric unimodal distribution.
  3. A unimodal distribution that is skewed right.
  4. A unimodal distribution that is skewed left.
  5. A symmetric bimodal distribution.
  6. A u-shaped distribution.