1. Introduction

The Basic Statistical Model

As usual, our starting point is a random experiment with a sample space and a probability measure P. In the basic statistical model, we have an observable random variable X (which we call the data variable) taking values in a set S. In general, X can have quite a complicated structure. For example, if the experiment is to sample from a population and record various measurements of interest, then

X = (X₁, X₂, ..., X_n)

where X_i is the vector of measurements for the i'th object. Here are some specific examples.

In the M&M data set, a sample of 30 bags of M&Ms was studied. For this study, X_i records the color counts for red, green, blue, orange, yellow, and brown candies, and the net weight, for bag number i.
In Fisher's iris data set, a sample of 150 irises was studied. For this study, X_i records the type, petal length, petal width, sepal length, and sepal width for iris number i.
In the cicada data, 104 cicadas were captured. For this study, X_i records the body weight, body length, wing width, wing length, gender, and species for cicada number i.
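
As a rough sketch of this "vector of vectors" structure, the data variable can be represented as a tuple of per-object measurement records. The class name, field names, and numbers below are made up for illustration; they are not values from the actual iris data set.

```python
from typing import NamedTuple


class IrisRecord(NamedTuple):
    """One measurement vector X_i (hypothetical field names)."""
    iris_type: str
    petal_length: float
    petal_width: float
    sepal_length: float
    sepal_width: float


# Illustrative, made-up observations; a real data set would have n = 150.
x = (
    IrisRecord("A", 1.0, 0.5, 5.0, 3.0),
    IrisRecord("B", 4.0, 1.5, 6.0, 2.5),
    IrisRecord("C", 6.0, 2.0, 7.0, 3.0),
)

n = len(x)                         # sample size
petal_lengths = [r.petal_length for r in x]  # one coordinate across all objects
```

Here x plays the role of an observed value of X, and x[i - 1] is the observed measurement vector for object i.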

On the other hand, the hallmark of mathematical abstraction is the ability to gray out the features that are not relevant at any particular time, to treat a complex structure as a single object. Thus, although X may actually be a vector of vectors, the crucial fact at this point is that it is a random variable for an experiment.

There are two broad branches of statistics. The term descriptive statistics refers to methods for summarizing and displaying the observed data x. The term inferential statistics refers to methods of drawing inferences about the distribution of X from an observed value x. Thus, in a sense, inferential statistics is the dual of probability. In probability, we try to predict the value of X assuming knowledge of the distribution. In statistics, by contrast, we observe the value of X and try to infer information about the underlying distribution.

The techniques of statistics have been enormously successful; these techniques are widely used in just about every subject that deals with quantification: the natural sciences, the social sciences, law, and medicine. On the other hand, statistics has a legalistic quality and a great deal of terminology that can make the subject a bit intimidating at first. In this section, we will discuss some of the basic definitions.

Types of Variables

Recall that a real variable is continuous if the possible values form an interval of real numbers. For example, the weight variable in the M&M data set, and the length and width variables in Fisher's iris data are continuous. In contrast, a discrete variable is one whose set of possible values forms a discrete set. For example, the counting variables in the M&M data set, the type variable in Fisher's iris data, and the denomination and suit variables in the card experiment are discrete. Continuous variables represent quantities that can, in theory, be measured to any degree of accuracy. In practice, of course, measuring devices have limited accuracy so data collected from a continuous variable is necessarily discrete. That is, there is only a finite (but perhaps very large) set of possible values that can actually be measured.
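
The effect of limited measurement accuracy can be sketched with a small simulation. The normal distribution, its mean and standard deviation, and the 0.1 resolution of the measuring device are all assumptions chosen only for illustration.

```python
import random

random.seed(0)

# "True" values of a continuous variable (assumed normal for this sketch).
true_weights = [random.gauss(50.0, 1.0) for _ in range(1000)]

# A hypothetical measuring device that reads to the nearest 0.1 unit.
recorded_weights = [round(w, 1) for w in true_weights]

# The recorded data take far fewer distinct values than the true values.
distinct_true = len(set(true_weights))
distinct_recorded = len(set(recorded_weights))
print(distinct_true, distinct_recorded)
```

The recorded values fall on a finite grid, so the data behave like observations of a discrete variable even though the underlying quantity is continuous.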

A real variable is also distinguished by its level of measurement, which determines the mathematical operations that make sense for the variable. Qualitative variables simply encode types, and thus no mathematical operations make sense, even if numbers are used for the encoding. Such variables have the nominal level of measurement. For example, the type variable in Fisher's iris data is qualitative. A variable for which only order is meaningful is said to have the ordinal level of measurement; differences are not meaningful even if numbers are used for the encoding. For example, in many card games, the suits are ranked, so the suit variable has the ordinal level of measurement. A quantitative variable for which differences, but not ratios, are meaningful is said to have the interval level of measurement. Equivalently, a variable at this level has a relative zero value. Typical examples are temperature (in Fahrenheit or Centigrade) or time (clock or calendar). Finally, a quantitative variable for which ratios are meaningful is said to have the ratio level of measurement. A variable at this level has an absolute zero value. The count and weight variables in the M&M data set, and the length and width variables in Fisher's iris data are examples.
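
The distinction between the interval and ratio levels can be checked numerically with the temperature example. In the sketch below (the function and variable names are ours), the ratio of two temperatures changes when the scale changes, so it carries no intrinsic meaning, while the ratio of two temperature differences does not depend on the scale.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Standard conversion from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


temps_c = (10.0, 20.0, 30.0)
temps_f = tuple(celsius_to_fahrenheit(c) for c in temps_c)

# Ratios of values depend on the (arbitrary) zero point of the scale.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

# Ratios of differences are the same on both scales.
diff_ratio_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])
diff_ratio_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])

print(ratio_c, ratio_f)            # the two ratios disagree
print(diff_ratio_c, diff_ratio_f)  # the two difference ratios agree
```

Saying that 20 degrees is "twice as hot" as 10 degrees is therefore a statement about the scale, not about the temperatures.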

Parameters and Statistics

The term parameter refers to a non-random variable in a model that, once chosen, remains constant. Almost all probability models are actually parametric families of models; that is, they are models governed by one or more parameters that can be adjusted to fit the random process being modeled. More technically, a parameter is a characteristic of the distribution of the observable variable X. As usual, we will take the general point of view and allow parameters to be vector valued.
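
A minimal sketch of a parametric family, using the Bernoulli distribution as an illustrative choice (it is not one of the data sets above): each value of the parameter p picks out one distribution on {0, 1}, and p stays fixed once chosen.

```python
from typing import Callable


def bernoulli_pmf(p: float) -> Callable[[int], float]:
    """Return the probability mass function of the Bernoulli distribution
    with success parameter p, where 0 <= p <= 1."""
    def pmf(x: int) -> float:
        if x == 1:
            return p
        if x == 0:
            return 1 - p
        return 0.0
    return pmf


# Three members of the family, indexed by the parameter value.
family = {p: bernoulli_pmf(p) for p in (0.2, 0.5, 0.8)}

for p, pmf in family.items():
    print(p, pmf(0), pmf(1))
```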

Mathematical Exercise 1. Identify the parameters in each of the following:

A statistic is a random variable that is an observable function of the outcome variable of the experiment:

W = W(X).

The term observable means that the function should not contain any unknown parameters. After all, we need to be able to compute the value of the statistic from the observed data. The crucial point is that a statistic is a random variable and hence, like all random variables, it has a probability distribution. Ultimately, what we observe is a value of this random variable. As with X, W may have a complicated structure; typically, W is vector valued. Note that X itself is a statistic, the original observed data variable; all other statistics are derived from X.
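
As a small sketch, a statistic is just a function that can be evaluated from the data alone. The sample mean and sample range below are standard examples chosen for illustration, and the data values are made up.

```python
from statistics import mean


def sample_mean(x):
    """A real-valued statistic: computable from the data alone."""
    return mean(x)


def sample_range(x):
    """Another real-valued statistic: largest minus smallest observation."""
    return max(x) - min(x)


def w(x):
    """A vector-valued statistic W = W(X) built from the two above."""
    return (sample_mean(x), sample_range(x))


x = (2.0, 4.0, 9.0)   # made-up observed data
w_value = w(x)        # observed value of the statistic
print(w_value)
```

By contrast, a quantity such as the sample mean minus the unknown distribution mean is not a statistic, because it cannot be computed from x without knowing the parameter.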

Statistics U and V are equivalent if there exists a one-to-one function r from the range of U onto the range of V such that

V = r(U).

Equivalent statistics give equivalent information, in terms of drawing inferences.

Mathematical Exercise 2. Show that statistics U and V are equivalent if and only if the following condition holds:

U(x) = U(y) if and only if V(x) = V(y) for x, y in S.
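
When S is finite, the condition in Exercise 2 can be checked by brute force. The sketch below is only a numerical check on one small sample space, not a proof; the function names and the choice of S (binary triples) are assumptions made for illustration.

```python
from itertools import product


def are_equivalent(u, v, sample_space):
    """Check that U(x) = U(y) if and only if V(x) = V(y) for all x, y in S."""
    points = list(sample_space)
    return all(
        (u(x) == u(y)) == (v(x) == v(y))
        for x, y in product(points, repeat=2)
    )


# Hypothetical sample space: all outcomes of three binary observations.
S = list(product((0, 1), repeat=3))


def total(x):
    return sum(x)


def average(x):
    return sum(x) / len(x)


def first(x):
    return x[0]


print(are_equivalent(total, average, S))  # average is a one-to-one function of total
print(are_equivalent(total, first, S))    # first does not determine total
```

The total and the average carry the same information, since each is a one-to-one function of the other, while the first coordinate alone does not determine the total.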

Mathematical Exercise 3. Show that equivalence really is an equivalence relation on the collection of statistics:

W is equivalent to W for any statistic W (the reflexive property).
If U is equivalent to V then V is equivalent to U (the symmetric property).
If U is equivalent to V and V is equivalent to W then U is equivalent to W (the transitive property).

Random Samples

The most common and important special case of this statistical model occurs when the observation variable has the form

X = (X₁, X₂, ..., X_n)

where X₁, X₂, ..., X_n are independent and identically distributed. Again, in the standard sampling model, X_i is a vector of measurements for the i'th object in the sample, and thus, we think of X₁, X₂, ..., X_n as independent copies of an underlying measurement vector. In this case, (X₁, X₂, ..., X_n) is said to be a random sample of size n from the common distribution.
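
A random sample can be simulated by drawing n independent copies from one common distribution. In this sketch, the helper names and the particular distribution of the measurement vector (a standard normal coordinate paired with a uniform coordinate) are assumptions chosen only for illustration.

```python
import random


def draw_measurement(rng: random.Random):
    """One copy of a hypothetical two-dimensional measurement vector."""
    return (rng.gauss(0.0, 1.0), rng.random())


def random_sample(n: int, draw, seed=None):
    """Return (X_1, ..., X_n): n independent draws from the same distribution."""
    rng = random.Random(seed)
    return tuple(draw(rng) for _ in range(n))


x = random_sample(5, draw_measurement, seed=1)
for i, x_i in enumerate(x, start=1):
    print(i, x_i)
```

Each call to draw_measurement uses the same distribution and does not depend on the earlier draws, which is what the independent and identically distributed assumption requires.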

The purpose of this chapter is to study random samples, descriptive statistics, and some of the most important special statistics.