Lack of Fit. For certain designs with replicates at the levels of the predictor variables, the residual sum of squares can be further partitioned into meaningful parts which are relevant for testing hypotheses. Specifically, the residual sum of squares can be partitioned into lack-of-fit and pure-error components. This involves determining the part of the residual sum of squares that can be predicted by including additional terms for the predictor variables in the model (for example, higher-order polynomial or interaction terms), and the part of the residual sum of squares that cannot be predicted by any additional terms (i.e., the sum of squares for pure error). A test of lack of fit for the model without the additional terms can then be performed, using the mean square pure error as the error term. This provides a more sensitive test of model fit, because the effects of the additional higher-order terms are removed from the error.
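As an illustration, the following minimal sketch (using NumPy and entirely hypothetical data) partitions the residual sum of squares of a straight-line fit into pure-error and lack-of-fit components and forms the corresponding F ratio:

```python
import numpy as np

# Hypothetical data: replicate y measurements at each x level
x = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([2.1, 2.3, 3.9, 4.2, 6.2, 5.8, 8.1, 7.7])

# Fit the straight-line model and compute the residual sum of squares
b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)

# Pure error: variation of replicates around their own level means
sse_pe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in np.unique(x))
df_pe = len(y) - len(np.unique(x))

# Lack of fit: what remains of the residual SS after removing pure error
sse_lof = sse - sse_pe
df_lof = len(np.unique(x)) - 2          # number of levels minus number of model parameters

f_stat = (sse_lof / df_lof) / (sse_pe / df_pe)
print(f"F(lack of fit) = {f_stat:.3f}")
```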

See also pure error, design matrix; or the General Linear Models, General Regression Models, or Experimental Design chapters.

Lambda Prime. Lambda prime is computed as the product of the quantities (1 minus the squared canonical correlation) across the canonical roots; it is a form of Wilks' lambda. Each squared canonical correlation is an estimate of the common variance between two canonical variates, so 1 minus this value is an estimate of the unexplained variance. Lambda is used to test the significance of the canonical correlations via a statistic that is approximately distributed as Chi-square (see below).

χ² = -[N - 1 - .5(p + q + 1)] * loge(λ)

where
N    is the number of subjects
p     is the number of variables on the right
q     is the number of variables on the left
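For illustration, here is a small sketch (assuming SciPy is available, with hypothetical canonical correlations and sample sizes) that computes lambda and the Chi-square statistic above:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical canonical correlations, sample size, and variable counts
canonical_r = np.array([0.65, 0.40, 0.15])
N, p, q = 100, 3, 4

lam = np.prod(1 - canonical_r ** 2)                  # Wilks' lambda
chi_sq = -(N - 1 - 0.5 * (p + q + 1)) * np.log(lam)  # chi-square approximation
df = p * q
p_value = chi2.sf(chi_sq, df)
print(f"lambda = {lam:.4f}, chi-square = {chi_sq:.2f}, df = {df}, p = {p_value:.4f}")
```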

Laplace Distribution. The Laplace (or Double Exponential) distribution has density function:

f(x) = 1/(2b) * e^(-|x-a|/b)        -∞ < x < ∞

where
a     is the mean of the distribution
b     is the scale parameter
e     is the base of the natural logarithm, sometimes called Euler's e (2.71...)

The graphic above shows the changing shape of the Laplace distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.

Latent Semantic Indexing. In the context of text mining, the process of latent semantic indexing is concerned with the derivation of underlying dimensions of "meaning" from the words (terms) extracted from a collection of documents.

The most basic result of text mining is an initial indexing of words found in the input documents, and the computation of a frequency table with simple counts enumerating the number of times that each word occurs in each input document. Also, in practice, you can further transform those raw counts to indices that better reflect the (relative) "importance" of words and/or their semantic specificity in the context of the set of input documents (see, for example, inverse document frequencies).

Next, a common analytic tool for interpreting the "meaning" or "semantic space" described by the words that were extracted and, hence, by the documents that were analyzed, is to create a mapping of the words and documents into a common space, computed from the word frequencies or transformed word frequencies (e.g., inverse document frequencies). In general, here is how it works:

Suppose you index a collection of customer reviews of their new automobiles (e.g., for different makes and models). You may find that every time a review includes the word "gas-mileage," it also includes the term "economy." Further, when reports include the word "reliability" they also include the term "defects" (e.g., make reference to "no defects"). However, there is no consistent pattern regarding the use of the terms "economy" and "reliability," i.e., some documents include either one, both, or neither. In other words, these four words "gas-mileage" and "economy," and "reliability" and "defects," describe two independent dimensions - the first having to do with the overall operating cost of the vehicle, the other with quality and workmanship.

The idea of latent semantic indexing is to identify such underlying dimensions (of "meaning"), into which the words and documents can be mapped. As a result, you can identify the underlying (latent) themes described or discussed in the input documents, and also identify the documents that mostly deal with each dimension (e.g., economy, reliability, or both). In practice, singular value decomposition is often used to extract the underlying semantic dimensions from the matrix of (transformed) word counts across documents.
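A minimal sketch of this idea, using NumPy's singular value decomposition on a small, purely hypothetical term-by-document frequency matrix built around the example above:

```python
import numpy as np

# Hypothetical term-by-document frequency matrix (rows: terms, columns: documents)
terms = ["gas-mileage", "economy", "reliability", "defects"]
A = np.array([[3, 0, 2, 0],
              [2, 0, 3, 0],
              [0, 4, 0, 1],
              [0, 3, 0, 2]], dtype=float)

# Singular value decomposition extracts the underlying semantic dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                # keep the two strongest dimensions
term_coords = U[:, :k] * s[:k]       # terms mapped into the semantic space
doc_coords = Vt[:k, :].T * s[:k]     # documents mapped into the same space

for term, coord in zip(terms, term_coords):
    print(term, np.round(coord, 2))
```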

For more information, see Manning and Schütze (2002).

Latent Variable. A latent variable is a variable that cannot be measured directly, but is hypothesized to underlie the observed variables. An example of a latent variable is a factor in factor analysis. Latent variables in path diagrams are usually represented by a variable name enclosed in an oval or circle.

Layered Compression. When layered compression is used, the main graph plotting area is reduced in size to leave space for Margin Graphs in the upper and right side of the display (and a miniature graph in the corner). These smaller Margin Graphs represent vertically and horizontally compressed images (respectively) of the main graph.

For more information on Layered Compression (and an additional example), see Special Topics in Graphical Analytic Techniques: Layered Compression.

Learned Vector Quantization (in Neural Networks). The Learned Vector Quantization (LVQ) algorithm was invented by Teuvo Kohonen (Fausett, 1994; Kohonen, 1990), who also invented the Self-Organizing Feature Map.

Learned Vector Quantization provides a supervised version of the Kohonen training algorithm. The standard Kohonen algorithm iteratively adjusts the positions of the exemplar vectors stored in the Radial layer of the Kohonen network by considering only the positions of the existing vectors and of the training data. In essence, the algorithm attempts to move the exemplar vectors to positions that reflect the centers of clusters in the training data. However, the class labels of the training cases are not taken into account. For superior classification performance, it is desirable that the exemplar vectors be adjusted, to some extent, on a per-class basis - that is, that they reflect natural clusters in each separate class. An exemplar located on a class boundary, equally close to cases of two classes, is unlikely to be of much use in distinguishing between the classes. On the other hand, exemplars located just inside class boundaries can be extremely useful.

There are several variants of Learned Vector Quantization. The basic version, LVQ1, is very similar to the Kohonen training algorithm. The closest exemplar to a training case is selected during training and has its position updated. However, whereas the Kohonen algorithm would move this exemplar toward the training case, LVQ1 checks whether the class label of the exemplar vector is the same as that of the training case. If it is, the exemplar is moved toward the training case; if it is not, the exemplar is moved away from the case. The more sophisticated LVQ algorithms, LVQ2.1 and LVQ3, take into account more information. They locate the nearest two exemplars to the training case. If one of these is of the right class and one the wrong class, they move the right class toward the training case and the wrong one away from it. LVQ3 also moves both exemplars toward the training case if they are both of the right class. In both LVQ2.1 and LVQ3, the concept is to move exemplars where there is some danger of misclassification.

Technical Details. The basic update rule (LVQ1) is:

w = w + ht(x - w)    if the exemplar and training case have the same class,

w = w - ht(x - w)    if they do not,

where w is the exemplar vector, x is the training case, and ht is the learning rate.

In LVQ2.1, the two nearest exemplars are adjusted only if one is of the right class and one is not, and they are both "about the same" distance from the training case. The definition of "about the same" distance uses a special parameter, e, and the formulae below:

min(d1/d2, d2/d1) > 1 - e    and    max(d1/d2, d2/d1) < 1 + e

where d1 and d2 are the distances from the training case to the two nearest exemplars.

In LVQ3, an alternative formula is used to ensure that the two nearest are both "about the same distance" from the training case:

min(d1/d2, d2/d1) > (1 - e)(1 + e)

In addition, in LVQ3, if both the two nearest exemplars are of the same class as the training case, they are both moved toward the case, using a learning rate b times the standard learning rate at that epoch.
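The following is a minimal sketch of the basic LVQ1 update described above, written in Python with hypothetical data and a hypothetical learning rate; it is meant only to illustrate the toward/away update rule, not any particular implementation:

```python
import numpy as np

def lvq1_epoch(exemplars, labels, X, y, lr=0.05):
    """One pass of the basic LVQ1 update over the training data."""
    for x, cls in zip(X, y):
        # Find the exemplar (codebook vector) closest to the training case
        i = np.argmin(np.linalg.norm(exemplars - x, axis=1))
        if labels[i] == cls:
            exemplars[i] += lr * (x - exemplars[i])   # same class: move toward the case
        else:
            exemplars[i] -= lr * (x - exemplars[i])   # different class: move away
    return exemplars

# Hypothetical two-class data and two exemplars
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
exemplars = np.array([[0.3, 0.3], [0.7, 0.7]])
labels = np.array([0, 1])

for epoch in range(20):
    exemplars = lvq1_epoch(exemplars, labels, X, y)
print(exemplars)
```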

Learning Rate (in Neural Networks). A control parameter of some training algorithms, which controls the step size when weights are iteratively adjusted.

See also, the Neural Networks chapter.

Least Squares (2D graphs). A curve is fitted to the XY coordinate data according to the distance-weighted least squares smoothing procedure (the influence of individual points decreases with the horizontal distance from the respective points on the curve).

Least Squares (3D graphs). A surface is fitted to the XYZ coordinate data according to the distance-weighted least squares smoothing procedure (the influence of individual points decreases with the horizontal distance from the respective points on the surface).

Least Squares Estimator. In the most general terms, least squares estimation is aimed at minimizing the sum of squared deviations of the observed values for the dependent variable from those predicted by the model. Technically, the least squares estimator of a parameter θ is obtained by minimizing Q with respect to θ, where:

Q = Σ [Yi - fi(θ)]²

Note that fi(θ) is a known function of θ, Yi = fi(θ) + εi for i = 1 to n, and the εi are random variables, usually assumed to have expectation 0. For more information, see Mendenhall and Sincich (1984), Bain and Engelhardt (1989), and Neter, Wasserman, and Kutner (1989). See also, Basic Statistics, Multiple Regression, and Nonlinear Estimation.
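As an illustration, the sketch below minimizes Q for a hypothetical nonlinear model using SciPy's general-purpose least_squares routine (the model, data, and starting values are made up for the example):

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical data generated from an exponential model y = theta1 * exp(theta2 * x)
x = np.linspace(0, 2, 25)
rng = np.random.default_rng(0)
y = 2.0 * np.exp(0.8 * x) + rng.normal(scale=0.1, size=x.size)

def residuals(theta):
    # Q is the sum of squared residuals; least_squares minimizes it over theta
    return y - theta[0] * np.exp(theta[1] * x)

fit = least_squares(residuals, x0=[1.0, 0.5])
print("least squares estimates:", fit.x)
```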

Least Squares Means. When there are no missing cells in ANOVA designs with categorical predictor variables, the subpopulation (or marginal) means are least squares means, which are the best linear unbiased estimates of the marginal means for the design (see Milliken and Johnson, 1986). Tests of differences in least squares means have the important property that they are invariant to the choice of the coding of effects for categorical predictor variables (e.g., the use of the sigma-restricted or the overparameterized model) and to the choice of the particular generalized inverse of the design matrix used to solve the normal equations. Thus, tests of linear combinations of least squares means are, in general, said not to depend on the parameterization of the design.

See also categorical predictor variable, design matrix, sigma-restricted model, overparameterized, generalized inverse; see also the General Linear Models or General Regression Models chapters.

Left and Right Censoring. When observations are censored, a distinction can be made to reflect the "side" of the time dimension at which censoring occurs. Consider an experiment in which we start with 100 light bulbs and terminate the experiment after a certain amount of time. In this experiment the censoring always occurs on the right side (right censoring), because the researcher knows exactly when the experiment started, and the censoring always occurs on the right side of the time continuum. Alternatively, it is conceivable that the censoring occurs on the left side (left censoring). For example, in biomedical research one may know that a patient entered the hospital at a particular date, and that s/he survived for a certain amount of time thereafter; however, the researcher does not know exactly when the symptoms of the disease first occurred or were diagnosed.

Data sets with censored observations can be analyzed via Survival Analysis or via Weibull and Reliability/Failure Time Analysis. See also, Type I and II Censoring and Single and Multiple Censoring.

Levenberg-Marquardt algorithm (in Neural Networks). Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963; Bishop, 1995; Shepherd, 1997; Press et al., 1992) is an advanced non-linear optimization algorithm. It can be used to train the weights in a network just as back propagation would be. It is reputedly the fastest algorithm available for such training. However, its use is restricted as follows.

Single output networks. Levenberg-Marquardt can only be used on networks with a single output unit.

Small networks. Levenberg-Marquardt has space requirements proportional to the square of the number of weights in the network. This effectively precludes its use in networks of any great size (more than a few hundred weights).

Sum-squared error function. Levenberg-Marquardt is only defined for the sum squared error function. If you select a different error function for your network, it will be ignored during Levenberg-Marquardt training. It is usually therefore only appropriate for regression networks.

Note: Like other iterative algorithms, Levenberg-Marquardt does not train radial units. Therefore, you can use it to optimize the non-radial layers of radial basis function networks even if there are a large number of weights in the radial layer, as those are ignored by Levenberg-Marquardt. This is significant as it is typically the radial layer that is very large in such networks.

Levenberg-Marquardt works by making the assumption that the underlying function being modeled by the neural network is linear. Under this assumption, the minimum can be determined exactly in a single step. The calculated minimum is tested, and if the error there is lower, the algorithm moves the weights to the new point. This process is repeated iteratively on each epoch. Since the linear assumption is ill-founded, it can easily lead Levenberg-Marquardt to test a point that is inferior (perhaps even wildly inferior) to the current one. The clever aspect of Levenberg-Marquardt is that the determination of the new point is actually a compromise between a step in the direction of steepest descent and the above-mentioned leap. Successful steps are accepted and lead to a strengthening of the linearity assumption (which is approximately true near a minimum). Unsuccessful steps are rejected and lead to a more cautious downhill step. Thus, Levenberg-Marquardt continuously switches its approach and can make very rapid progress.

Technical Details. The Levenberg-Marquardt algorithm is designed specifically to minimize the sum-of-squares error function, using a formula that (partly) assumes that the underlying function modeled by the network is linear. Close to a minimum this assumption is approximately true, and the algorithm can make very rapid progress. Further away it may be a very poor assumption. Levenberg-Marquardt therefore compromises between the linear model and a gradient-descent approach. A move is only accepted if it improves the error, and if necessary the gradient-descent model is used with a sufficiently small step to guarantee downhill movement.

Levenberg-Marquardt uses the update formula:

Δw = -(Z'Z + λI)^(-1) Z'ε

where ε is the vector of case errors, and Z is the matrix of partial derivatives of these errors with respect to the weights:

Zij = ∂εi/∂wj

The first term in the Levenberg-Marquardt formula represents the linearized assumption; the second a gradient-descent step. The control parameter λ governs the relative influence of these two approaches. Each time Levenberg-Marquardt succeeds in lowering the error, it decreases the control parameter by a factor of 10, thus strengthening the linear assumption and attempting to jump directly to the minimum. Each time it fails to lower the error, it increases the control parameter by a factor of 10, giving more influence to the gradient-descent step, and also making the step size smaller. This is guaranteed to make downhill progress at some point.
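The following toy sketch illustrates this update rule and the factor-of-10 adjustment of the control parameter on a small, hypothetical curve-fitting problem; it is a simplified stand-in for a single-output network, not a production implementation:

```python
import numpy as np

# Minimal Levenberg-Marquardt sketch for a toy model y = w0 * (1 - exp(-w1 * x)),
# standing in for a single-output network; Z holds d(error)/d(weight).
x = np.linspace(0, 5, 30)
y = 3.0 * (1 - np.exp(-0.7 * x))

def errors(w):
    return y - w[0] * (1 - np.exp(-w[1] * x))

def jacobian(w):
    # Partial derivatives of the case errors with respect to the weights
    de_dw0 = -(1 - np.exp(-w[1] * x))
    de_dw1 = -w[0] * x * np.exp(-w[1] * x)
    return np.column_stack([de_dw0, de_dw1])

w, lam = np.array([1.0, 1.0]), 0.01
for _ in range(50):
    e, Z = errors(w), jacobian(w)
    step = -np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ e)
    if np.sum(errors(w + step) ** 2) < np.sum(e ** 2):
        w, lam = w + step, lam / 10      # success: strengthen the linear assumption
    else:
        lam *= 10                        # failure: fall back toward gradient descent
print("fitted weights:", w)
```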

Levene and Brown-Forsythe tests for homogeneity of variances (HOV). An important assumption in analysis of variance (ANOVA and the t-test for mean differences) is that the variances in the different groups are equal (homogeneous). Two powerful and commonly used tests of this assumption are the Levene test and the Brown-Forsythe modification of this test. However, it is important to realize (1) that the homogeneity of variances assumption is usually not as crucial as other assumptions for ANOVA, in particular in the case of balanced (equal n) designs (see also ANOVA Homogeneity of Variances and Covariances), and (2) that the tests described below are not necessarily very robust themselves (e.g., Glass and Hopkins, 1996, p. 436, call these tests "fatally flawed"; see also the description of these tests below). If you are concerned about a violation of the HOV assumption, it is always advisable to repeat the key analyses using nonparametric methods.

Levene's test (homogeneity of variances): For each dependent variable, an analysis of variance is performed on the absolute deviations of values from the respective group means. If the Levene test is statistically significant, then the hypothesis of homogeneous variances should be rejected.

Brown & Forsythe's test (homogeneity of variances): Recently, some authors (e.g., Glass and Hopkins, 1996) have called into question the power of the Levene test for unequal variances. Specifically, the absolute deviation (from the group means) scores can be expected to be highly skewed; thus, the normality assumption for the ANOVA of those absolute deviation scores is usually violated. This poses a particular problem when there is unequal n in the two (or more) groups that are to be compared. A more robust test that is very similar to the Levene test has been proposed by Brown and Forsythe (1974). Instead of performing the ANOVA on the deviations from the mean, one can perform the analysis on the deviations from the group medians. Olejnik and Algina (1987) have shown that this test will give quite accurate error rates even when the underlying distributions for the raw scores deviate significantly from the normal distribution. However, as Glass and Hopkins (1996, p. 436) have pointed out, both the Levene test as well as the Brown-Forsythe modification suffer from what those authors call a "fatal flaw," namely, that both tests themselves rely on the homogeneity of variances assumption (of the absolute deviations from the means or medians); and hence, it is not clear how robust these tests are themselves in the presence of significant variance heterogeneity and unequal n.
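Both tests can be run with standard software; for example, SciPy's levene function performs the ANOVA on absolute deviations from the group means (center='mean', the Levene test) or from the group medians (center='median', the Brown-Forsythe modification). The data below are hypothetical:

```python
from scipy import stats

# Hypothetical scores for two groups
group1 = [12.1, 14.3, 13.8, 15.2, 12.9, 16.4]
group2 = [11.0, 18.5, 12.2, 19.1, 10.7, 17.9]

# Levene test: ANOVA on absolute deviations from the group means
w_mean, p_mean = stats.levene(group1, group2, center='mean')

# Brown-Forsythe modification: deviations from the group medians
w_median, p_median = stats.levene(group1, group2, center='median')

print(f"Levene:         W = {w_mean:.3f}, p = {p_mean:.3f}")
print(f"Brown-Forsythe: W = {w_median:.3f}, p = {p_median:.3f}")
```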

Leverage values. In regression, this term refers to the diagonal elements of the hat matrix, X(X'X)^(-1)X'. A given diagonal element, h(ii), represents the distance between the X values for the ith observation and the means of all X values. These values indicate whether or not the X values for a given observation are outlying; the diagonal element is referred to as the leverage. A large leverage value indicates that the ith observation is distant from the center of the X observations (Neter et al., 1985).
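A minimal sketch of the computation, using NumPy and a hypothetical one-predictor design matrix:

```python
import numpy as np

# Hypothetical design matrix with an intercept column and one predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])     # note the outlying X value
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^-1 X'; its diagonal elements are the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
print(np.round(leverage, 3))    # the case with x = 10 has the largest leverage
```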

Life Table. The most straightforward way to describe the survival in a sample is to compute the Life Table. The life table technique is one of the oldest methods for analyzing survival (failure time) data (e.g., Berkson & Gage, 1950; Cutler & Ederer, 1958; Gehan, 1969; see also Lawless, 1982, Lee, 1993). This table can be thought of as an "enhanced" frequency distribution table. The distribution of survival times is divided into a certain number of intervals. For each interval one can compute the number and proportion of cases or objects that entered the respective interval "alive," the number and proportion of cases that failed in the respective interval (i.e., number of terminal events, or number of cases that "died"), and the number of cases that were lost or censored in the respective interval.

Based on those numbers and proportions, several additional statistics can be computed. Refer to the Survival Analysis chapter for additional details.
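As a rough illustration, the sketch below computes the cumulative proportion surviving from hypothetical interval counts, using the standard actuarial adjustment in which censored cases are treated as exposed for half the interval:

```python
import numpy as np

# Hypothetical life table input: per interval, the number of deaths and
# the number of cases lost (censored) in that interval, starting from 100 cases
deaths   = np.array([5, 8, 10, 6, 3])
censored = np.array([2, 1, 3, 2, 4])
entering = 100 - np.concatenate(([0], np.cumsum(deaths + censored)[:-1]))

# Actuarial adjustment: censored cases count as exposed for half the interval
exposed = entering - censored / 2.0
q = deaths / exposed                  # conditional proportion failing in the interval
p = 1 - q                             # conditional proportion surviving
cum_survival = np.cumprod(p)          # cumulative proportion surviving to interval end

for i, s in enumerate(cum_survival):
    print(f"interval {i + 1}: entering {entering[i]:3.0f}, survival {s:.3f}")
```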

Lift Chart. The lift chart provides a visual summary of the usefulness of the information provided by one or more statistical models for predicting a binomial (categorical) outcome variable (dependent variable); for multinomial (multiple-category) outcome variables, lift charts can be computed for each category. Specifically, the chart summarizes the utility that one may expect by using the respective predictive models, as compared to using baseline information only.

The lift chart is applicable to most statistical methods that compute predictions (predicted classifications) for binomial or multinomial responses. This and similar summary charts (see Gains Chart) are commonly used in Data Mining projects, when the dependent or outcome variable of interest is binomial or multinomial in nature.

Example. To illustrate how the lift chart is constructed, consider this example. Suppose you have a mailing list of previous customers of your business, and you want to offer to those customers an additional service by mailing an elaborate brochure and other materials describing the service. During previous similar mail-out campaigns, you collected useful information about your customers (e.g., demographic information, previous purchasing patterns) that you could relate to the response rate, i.e., whether the respective customers responded to your mail solicitation. Also, from similar prior mail-out campaigns, you were able to estimate the baseline response rate at approximately 7 percent, i.e., 7% of all customers who received a similar offer by mail responded (purchased the additional service).

Given this baseline response rate (7%) and the cost of the mail-out, sending the offer to all customers would result in a net loss. Hence, you want to use statistical analyses to help you identify the customers who are most likely to respond. Suppose you build such a model based on the data collected in the previous mail-out campaign. You can now select only the 10 percent of the customers from the mailing lists who, according to prediction from the model, are most likely to respond. If among those customers (selected by the model) the response rate is 14 percent (as opposed to the 7% baseline rate), then the relative gain or lift value due to using the predictive model can be computed as 14% / 7% = 2. In other words, you were able to do twice as well as you would have done using simple random selection.

Analogous lift values can be computed for each percentile of the population (customers on the mailing list). You could compute separate lift values for selecting the top 20% of customers who are predicted to be among likely responders to the mail campaign, the top 30%, etc. Hence, the lift values for different percentiles can be connected by a line that will typically descend slowly and merge with the baseline when all customers (100%) are selected.
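A minimal sketch of how such lift values can be computed from model scores and observed responses (the scores and responses below are simulated, not real campaign data):

```python
import numpy as np

# Hypothetical model scores (predicted response probabilities) and actual responses
rng = np.random.default_rng(1)
scores = rng.uniform(size=1000)
responded = rng.uniform(size=1000) < (0.02 + 0.10 * scores)   # roughly 7% overall

baseline = responded.mean()
order = np.argsort(scores)[::-1]          # customers sorted by predicted likelihood

for pct in (10, 20, 30, 50, 100):
    top = order[: int(len(order) * pct / 100)]
    lift = responded[top].mean() / baseline
    print(f"top {pct:3d}%: lift = {lift:.2f}")
```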

If more than one predictive model is used, multiple lift charts can be overlaid (as shown in the illustration above) to provide a graphical summary of the utility of different models.

Lilliefors test. In a Kolmogorov-Smirnov test for normality when the mean and standard deviation of the hypothesized normal distribution are not known (i.e., they are estimated from the sample data), the probability values tabulated by Massey (1951) are not valid. Instead, the so-called Lilliefors probabilities (Lilliefors, 1967) should be used in determining whether the KS difference statistic is significant.

Likelihood. The probability (or probability density) of the observed data, regarded as a function of the parameters of a model (see also the Loss Function entry for the definition of the likelihood function).

Line Plots, 2D. In line plots, individual data points are connected by a line.

Line plots provide a simple way to visually present a sequence of values. XY Trace-type line plots can be used to display a trace (instead of a sequence). Line plots can also be used to plot continuous functions, theoretical distributions, etc.

Line Plots, 2D - Aggregated. Aggregated line plots display a sequence of means for consecutive subsets of a selected variable.

You can select the number of consecutive observations from which each mean is calculated and, if desired, mark the range of values in each subset with whisker-type markers. Aggregated line plots are used to explore and present sequences of large numbers of values.

Line Plots, 2D (Case Profiles). Unlike regular line plots, in which the values of one variable are plotted as one line (individual data points are connected by a line), case profile line plots plot the values of the selected variables in a case (row) as one line (i.e., one line plot is generated for each of the selected cases). Case profile line plots provide a simple way to visually present the values in a case (e.g., test scores for several tests).

Line Plots, 2D - Double-Y. The Double-Y line plot can be considered to be a combination of two separately scaled multiple line plots. A separate line pattern is plotted for each of the selected variables, but the variables selected in the Left-Y list will be plotted against the left-Y axis, whereas the variables selected in the Right-Y list will be plotted against the right-Y axis (see example below). The names of all variables will be identified in the legend with the letter (R) for the variables associated with the right-Y axis and with the letter (L) for the variables associated with the left-Y axis.

The Double-Y line plot can be used to compare sequences of values of several variables by overlaying their respective line representations in a single graph. Moreover, because of the independent scaling used for the two axes, it can facilitate comparisons between otherwise "incomparable" variables (i.e., variables with values in different ranges).

Line Plots, 2D - Multiple. Unlike regular line plots in which a sequence of values of one variable is represented, the multiple line plot represents multiple sequences of values (variables). A different line pattern and color is used for each of the multiple variables and referenced in the legend.

This type of line plot is used to compare sequences of values between several variables (or several functions) by overlaying them in a single graph that uses one common set of scales (e.g., comparisons between several simultaneous experimental processes, social phenomena, stock or commodity quotes, shapes of operating characteristics curves, etc.).

Line Plots, 2D - Regular. Regular line plots are used to examine and present the sequences of values (usually when the order of the presented values is meaningful).

Another typical application for line sequence plots is to plot continuous functions, such as fitted functions or theoretical distributions. Note that an empty data cell (i.e., missing data) "breaks" the line.

Line Plots, 2D - XY Trace. In trace plots, a scatterplot of two variables is first created, then the individual data points are connected with a line (in the order in which they are read from the data file). In this sense, trace plots visualize a "trace" of a sequential process (movement, change of a phenomenon over time, etc.)

Linear (2D graphs). A linear function (e.g., Y = a + bX) is fitted to the points in the 2D scatterplot.

Linear (3D graphs). A linear function (i.e., a plane, e.g., Z = a + bX + cY) is fitted to the points in the 3D scatterplot.

Linear Activation function. A null activation function: the unit's output is identical to its activation level.

See also, the Neural Networks chapter.

Linear Modeling. Approximation of a discriminant function or regression function using a hyperplane. Can be globally optimized using "simple" techniques, but does not adequately model many real-world problems.

See also, the Neural Networks chapter.

Linear Units. A unit with a linear PSP function. The unit's activation level is the weighted sum of its inputs minus the threshold - also known as a dot product or linear combination. The characteristic unit type of multilayer perceptrons. Despite the name, a linear unit may have a non-linear activation function.

See also, the Neural Networks chapter.

Link Function and Distribution Function. The link function in generalized linear models specifies a nonlinear transformation of the predicted values so that the distribution of the dependent variable can be modeled as one of several special members of the exponential family of distributions (e.g., gamma, Poisson, binomial, etc.). The link function is therefore used to model responses when a dependent variable is assumed to be nonlinearly related to the predictors.

Various link functions (see McCullagh and Nelder, 1989) are commonly used, depending on the assumed distribution of the dependent variable (y) values:
Normal, Gamma, Inverse normal, and Poisson distributions:
   Identity link: f(z) = z
   Log link: f(z) = log(z)
   Power link: f(z) = z^a, for a given a

Binomial and Ordinal Multinomial distributions:
   Logit link: f(z) = log(z/(1-z))
   Probit link: f(z) = invnorm(z), where invnorm is the inverse of the standard normal cumulative distribution function
   Complementary log-log link: f(z) = log(-log(1-z))
   Log-log link: f(z) = -log(-log(z))

Multinomial distribution:
   Generalized logit link: f(z1|z2,...,zc) = log(z1/(1-z1-...-zc)), where the model has c+1 categories
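For illustration, the links above can be written directly as functions; the sketch below assumes SciPy is available for the inverse normal cumulative distribution function:

```python
import numpy as np
from scipy.stats import norm

# The link functions listed above, written as plain Python functions
def logit(z):
    return np.log(z / (1 - z))

def probit(z):
    return norm.ppf(z)          # inverse of the standard normal CDF

def cloglog(z):
    return np.log(-np.log(1 - z))

def loglog(z):
    return -np.log(-np.log(z))

mu = np.array([0.1, 0.5, 0.9])
for name, f in [("logit", logit), ("probit", probit),
                ("complementary log-log", cloglog), ("log-log", loglog)]:
    print(f"{name}: {np.round(f(mu), 3)}")
```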

For discussion of the role of link functions, see the Generalized Linear Models chapter.

Local Minima. Local "valleys" or minor "dents" in a loss function that, in many practical applications, will produce extremely large or small parameter estimates with very large standard errors. The Simplex method is particularly effective in avoiding such minima; therefore, this method may be particularly well suited for finding appropriate start values for complex functions.

Logarithmic Function. This fits a logarithmic function of the following form to the data:

y = q*[logn(x)] + b

Logistic Distribution. The Logistic distribution has density function:

f(x) = (1/b) * e^(-(x-a)/b) * [1 + e^(-(x-a)/b)]^(-2)

where
a     is the mean of the distribution
b     is the scale parameter
e     is the base of the natural logarithm, sometimes called Euler's e (2.71...)

[Animated Logistic Distribution]

The graphic above shows the changing shape of the Logistic distribution when the location parameter equals 0 and the scale parameter equals 1, 2, and 3.

Logistic Function. An S-shaped (sigmoid) function having values in the range (0,1). See, the Logistic Distribution.

Logit Regression and Transformation. In the logit regression model, the predicted values for the dependent or response variable will never be less than (or equal to) 0, or greater than (or equal to) 1, regardless of the values of the independent variables; it is, therefore, commonly used to analyze binary dependent or response variables (see also the binomial distribution). This is accomplished by applying the following regression equation (the term logit was first used by Berkson, 1944):

y=exp(b0 +b1*x1 + ... + bn*xn)/{1+exp(b0 +b1*x1 + ... + bn*xn)}

One can easily recognize that, regardless of the regression coefficients or the magnitude of the x values, this model will always produce predicted values (predicted y's) in the range of 0 to 1. The name logit stems from the fact that one can easily linearize this model via the logit transformation. Suppose we think of the binary dependent variable y in terms of an underlying continuous probability p, ranging from 0 to 1. We can then transform that probability p as:

p' = loge{p/(1-p)}

This transformation is referred to as the logit or logistic transformation. Note that p' can theoretically assume any value between minus and plus infinity. Since the logit transform solves the issue of the 0/1 boundaries for the original dependent variable (probability), we could use those (logit transformed) values in an ordinary linear regression equation. In fact, if we perform the logit transform on both sides of the logit regression equation stated earlier, we obtain the standard linear multiple regression model:

p' = (b0 +b1*x1 + ... + bn*xn)
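A minimal numeric illustration of the logit transformation and its inverse, using hypothetical regression coefficients:

```python
import numpy as np

def logit(p):
    """Logit transformation p' = log(p / (1 - p))."""
    return np.log(p / (1 - p))

def inverse_logit(eta):
    """Back-transform a linear predictor to a probability in (0, 1)."""
    return np.exp(eta) / (1 + np.exp(eta))

b0, b1 = -2.0, 0.8                      # hypothetical regression coefficients
x = np.array([0.0, 1.0, 2.5, 5.0])
p_hat = inverse_logit(b0 + b1 * x)      # predictions always stay between 0 and 1
print(np.round(p_hat, 3))
print(np.round(logit(p_hat), 3))        # recovers the linear predictor b0 + b1*x
```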

For additional details, see also Nonlinear Estimation or the Generalized Linear Models chapter; see also Probit Transformation and Regression and Multinomial logit and probit regression for similar transformations.

Log-Linear Analysis. Log-linear analysis provides a "sophisticated" way of looking at crosstabulation tables (to explore the data or verify specific hypotheses), and it is sometimes considered an equivalent of ANOVA for frequency data. Specifically, it allows the user to test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance (see Elementary Concepts for a discussion of statistical significance testing).

For more information, see the Log-Linear Analysis chapter.

Log-normal Distribution. The lognormal distribution (the term first used by Gaddum, 1945) has the probability density function:

f(x) = 1/[xσ(2π)^(1/2)] * exp(-[log(x)-µ]²/(2σ²))
0 < x < ∞
µ > 0
σ > 0

where
µ     is the scale parameter
σ    is the shape parameter
e     is the base of the natural logarithm, sometimes called Euler's e (2.71...)

[Animated Log-normal Distribution]

The animation above shows the Log-normal distribution with mu equal to 0 and sigma equal to .10, .30, .50, .70, and .90.

Lookahead (in Neural Networks). In neural network time series analysis, the number of time steps ahead of the last input values at which the output variable values are to be predicted.

See also, the chapter on neural networks.

Loss Function. The loss function (the term loss was first used by Wald, 1939) is the function that is minimized in the process of fitting a model, and it represents a selected measure of the discrepancy between the observed data and data "predicted" by the fitted function. For example, in many traditional general linear model techniques, the loss function (commonly known as least squares) is the sum of squared deviations from the fitted line or plane. One of the properties (sometimes considered to be a disadvantage) of that common loss function is that it is very sensitive to outliers.

A common alternative to the least squares loss function (see above) is to maximize the likelihood or log-likelihood function (or to minimize the negative log-likelihood function; the term maximum likelihood was first used by Fisher, 1922a). These functions are typically used when fitting non-linear models. In most general terms, the likelihood function is defined as:

L = F(Y, Model) = Π(i=1 to n) p[yi, Model Parameters(xi)]

In theory, we can compute the probability (now called L, the likelihood) of the specific dependent variable values occurring in our sample, given the respective regression model.

Loss Matrix (in Neural Networks). If a network is trained so that the outputs estimate probabilities, it can be adjusted to support a loss matrix (Bishop, 1995).

In simple cases, a probability estimate may be used directly: an unknown case is simply assigned to the most-probable class. Inevitably, this means that sometimes the network can be wrong (and this is unavoidable if data is noisy).

However, some mistakes can be more costly than others. For example, if diagnosing a potentially fatal illness, prescribing medication to somebody who is not actually ill may be considered a less grave error than failing to prescribe to somebody who is.

A loss matrix is a square matrix of coefficients that reflect the relative costs of various misclassifications. It is multiplied by the vector of probability estimates, resulting in a vector of cost estimates, and the case is assigned to the class with the lowest cost estimate.

Since a correct classification has zero cost, the leading diagonal of a loss matrix always contains zeros; in other positions, the coefficient in the n'th column and m'th row represents the cost of misclassifying a case that is actually in the n'th class as being in the m'th class.
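A minimal sketch of this computation with NumPy, using a hypothetical three-class loss matrix and probability vector:

```python
import numpy as np

# Hypothetical three-class loss matrix following the convention above:
# the entry in row m, column n is the cost of assigning class m to a case
# whose true class is n; the leading diagonal (correct decisions) is zero.
loss = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 1.0],
                 [10.0, 2.0, 0.0]])

probs = np.array([0.2, 0.5, 0.3])        # network's class probability estimates

expected_cost = loss @ probs             # one cost estimate per candidate class
assigned = int(np.argmin(expected_cost)) # assign the class with the lowest cost
print(expected_cost, "-> class", assigned)
```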

See also, the chapter on neural networks.

LOWESS Smoothing (Robust Locally Weighted Regression). Robust locally weighted regression is a method of smoothing 2D scatterplot data (pairs of x-y data). A local polynomial regression model is fit to each point and the points close to it. The method is also sometimes referred to as LOWESS smoothing. The smoothed data usually provide a clearer picture of the overall shape of the relationship between the x and y variables. For more information, see also Cleveland (1979, 1985).
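For example, the lowess function in the statsmodels package implements robust locally weighted regression; the data below are simulated for illustration:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical noisy scatterplot data
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Robust locally weighted regression; frac controls the span of each local fit
smoothed = lowess(y, x, frac=0.3)        # returns an array of (x, smoothed y) pairs
print(smoothed[:5])
```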



