Mlp trains a 3-layer feed-forward linear perceptron using novel methods of machine learning that help control the learning dynamics of the network. As a result, the derived minima are superior, the decision surfaces of the trained network are well-formed, the information content of confidence values is increased, and generalization is enhanced. The theory behind the machine learning techniques used in this program is discussed in the following reference:
[C. L. Wilson, J. L. Blue, O. M. Omidvar, "The Effect of Training Dynamics on Neural Network Performance," NIST Internal Report 5696, August 1995.]
Machine learning is controlled through a batch-oriented iterative process of training the MLP on a set of prototype feature vectors, and then evaluating the progress made by running the MLP (in its current state) on a separate set of testing feature vectors. Training on the first set of patterns then resumes for a predetermined number of passes through the training data, and then the MLP is tested again on the evaluation set. This process of training and then testing continues until the MLP has been determined to have satisfactorily converged.
The MLP neural network is suitable for use as a classifier or as a function-approximator. The network has an input layer, a hidden layer, and an output layer, each layer comprising a set of nodes. The input nodes are feed-forwardly connected to the hidden nodes, and the hidden nodes to the output nodes, by connections whose weights (strengths) are trainable. The activation function used for the hidden nodes can be chosen to be sinusoid, sigmoid (logistic), or linear, as can the activation function for the output nodes. Training (optimization) of the weights is done using either a Scaled Conjugate Gradient (SCG) algorithm [1], or by starting out with SCG and then switching to a Limited Memory Broyden Fletcher Goldfarb Shanno (LBFGS) algorithm [2]. Boltzmann pruning [3], i.e. dynamic removal of connections, can be performed during training if desired. Prior weights can be attached to the patterns (feature vectors) in various ways.
[1] J. L. Blue and P. J. Grother, "Training Feed Forward Networks Using Conjugate Gradients," NIST Internal Report 4776, February 1992, and in Conference on Character Recognition and Digitizer Technologies, Vol. 1661, pp. 179-190, SPIE, San Jose, February 1992.
[2] D. C. Liu and J. Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization," Mathematical Programming B, Vol. 45, pp. 503-528, 1989.
[3] O. M. Omidvar and C. L. Wilson, "Information Content in Neural Net Optimization," NIST Internal Report 4766, February 1992, and in Journal of Connection Science, 6:91-103, 1993.
Training and Testing Runs
When mlp is invoked, it performs a sequence of runs. Each run does either training, or testing:
training run: A set of patterns is used to train (optimize) the weights of the network. Each pattern consists of a feature vector, along with either a class or a target vector. A feature vector is a tuple of floating-point numbers, which typically has been extracted from some natural object such as a handwritten character. A class denotes the actual class to which the object belongs, for example the character which a handwritten mark is an instance of. The network can be trained to become a classifier: it trains using a set of feature vectors extracted from objects of known classes. Or, more generally, the network can be trained to learn, again from example input-output pairs, a function whose output is a vector of floating-point numbers, rather than a class; if this is done, the network is a sort of interpolator or function-fitter. A training run finishes by writing the final values of the network weights as a file. It also produces a summary file showing various information about the run, and optionally produces a longer file that shows the results the final (trained) network produced for each individual pattern.
testing run: A set of patterns is sent through a network, after the network weights are read from a file. The output values, i.e. the hypothetical classes (for a classifier network) or the produced output vectors (for a fitter network), are compared with target classes or vectors, and the resulting error rate is computed. The program can produce a table showing the correct classification rate as a function of the rejection rate.
-c: Only do error checking on the specfile parameters, and print any warnings or errors found in the specfile.
specfile: Specfile to be used by mlp. The default is a specfile named "spec" located in the current working directory.
This is a file produced by the user, which sets the parameters (henceforth "parms") of the run(s) that mlp is to perform. It consists of one or more blocks, each of which sets the parms for one run. Each block is separated from the next one by the word "newrun" or "NEWRUN". Parms are set using name-value pairs, with the name and value separated by non-newline white space characters (blanks or tabs). Each name-value pair is separated from the next pair by newline(s) or semicolon(s). Since each parm value is labeled by its parm name, the name-value pairs can occur in any order. Comments are allowed; they are delimited the same way as in C language programs, with /* and */. Extraneous white space characters are ignored.
When mlp is run, it first scans the entire specfile, to find and report any (fatal) errors (e.g. omitting to set a necessary parm, or using an illegal parm name or value) and also any conditions in the specfile which, although not fatally erroneous, are worthy of warnings (e.g. setting a superfluous parm). Mlp writes any applicable warning or error messages; then, if there are no errors in the specfile, it starts to perform the first run. Warnings do not prevent mlp from starting to run. The motivation for having mlp check the entire specfile before it starts to perform even the first run, is that this will prevent an mlp instance that runs a multi-run specfile from failing, perhaps many hours, or days, after it was started, because of an error in a block far into the specfile: such errors will be detected up front and presumably fixed by the user, because that is the only way to cause mlp to get past its checking phase. To cause mlp only to check the specfile without running it, use the -c option.
The following listing describes all the parms that can be set in a specfile. There are four types of parms: string (value is a filename), integer, floating-point, and switch (value must be one of a set of defined names, or may be specified as a code number). A block of the specfile, which sets the parms for one run, often can omit to set the values of several of the parms, either because the parm is unneeded (e.g., a training "stopping condition" when the run is a test run; or, temperature when boltzmann is no_prune), or because it is an architecture parm (purpose, ninps, nhids, nouts, acfunc_hids, or acfunc_outs), whose value will be read from wts_infile. The descriptions below indicate which of the parms are needed only for training runs (in particular, those described as stopping conditions). Architecture parms should be set in a specfile block only if its run is to be a training run that generates random initial network weights: a training run that reads initial weights from a file (typically, final weights produced by a previous training session), or a test run (must read the network weights from a file), does not need to set any of the architecture parms in its specfile block, because their values are stored in the weights file that it will read. (Architecture parms are ones whose values it would not make sense to change between training runs of a single network that together comprise a training "meta-run", nor between a training run for a network and a test run of the finished network.) Setting unneeded parms in a specfile block will result in warning messages when mlp is run, but not fatal errors; the unneeded values will be ignored.
If a parm-name/parm-value pair occurring in a specfile has just its value deleted, i.e. leaving just a parm name, then the name is ignored by mlp; this is a way to temporarily unset a parm while leaving its name visible for possible future use.
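The format rules above are simple enough to capture in a few lines. The following Python sketch is not part of mlp; it is a hypothetical reader (the function name parse_specfile and the sample file name run1.wts are assumptions made for illustration) that strips comments, splits the text into blocks at newrun or NEWRUN, and drops any parm name whose value has been deleted. It assumes that newrun stands alone as its own entry, and it does none of the checking of parm names and values that mlp itself performs.

```python
import re


def parse_specfile(text):
    """Return a list of {parm_name: value_string} dicts, one per run block."""
    # Remove C-style comments; each is replaced by a space so that the
    # tokens on either side of a comment stay separated.
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)
    runs = [{}]
    # Name-value pairs are separated by newline(s) or semicolon(s).
    for entry in re.split(r"[;\n]+", text):
        tokens = entry.split()  # extraneous blanks and tabs are ignored
        if not tokens:
            continue
        if len(tokens) == 1 and tokens[0] in ("newrun", "NEWRUN"):
            runs.append({})  # start the block for the next run
        elif len(tokens) == 1:
            continue  # name with its value deleted: ignored
        elif len(tokens) == 2:
            runs[-1][tokens[0]] = tokens[1]
        else:
            raise ValueError(f"malformed entry: {entry!r}")
    return runs


SAMPLE = """
/* first run: random initial weights */
purpose classifier; nhids 48
seed 12347
regfac 2.0
temperature      /* value deleted, so this name is ignored */
newrun
wts_infile run1.wts
regfac 0.5
"""

if __name__ == "__main__":
    for number, block in enumerate(parse_specfile(SAMPLE), start=1):
        print(number, block)
```

Values are kept as strings here; converting them to the integer, floating-point, or switch type of each parm is left out of the sketch.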
String Parms (Filename)
short_outfile: This file will contain summary information about the run, including a history of the training process if the run is a training run. The set of information to be written is controlled, to some extent, by the switch parms do_confuse and do_cvr.
long_outfile: This optionally produced file will have two lines of header information followed by a line for each pattern. Each line will show: the sequence number of the pattern; the correct class of the pattern (as a number in the range 1 through nouts); whether the hypothetical class the network produced for this pattern was right (R) or wrong (W); the hypothetical class (number); and the nouts output-node activations the network produced for the pattern. (See the switch parm show_acs_times_1000 below, which controls the formatting of the activations.) In a testing run, mlp produces this file for the result of running the patterns through the network whose weights are read from wts_infile; in a training run, mlp produces this file only for the final network weights resulting from the training session. This is often a large file; to save disk space by not producing it, just leave the parm unset.
patterns_infile: This file contains patterns upon which mlp is to train or test a network. A pattern is either a feature-vector and an associated class, or a feature-vector and an associated target-vector. The file must be in one of the two supported patterns-file formats, i.e. ASCII and (FORTRAN-style) binary; the switch parm patsfile_ascii_or_binary must be set to tell mlp which of these formats is being used.
wts_infile: This optional file contains a set of network weights. Mlp can read such a file at the start of a training run - e.g., final weights from a preceding training run, if one is training a network using a sequence of runs with different parameter settings (e.g., decreasing values of regfac) - or, in a testing run, it can read the final weights resulting from a training run. This parm should be left unset if random initial weights are to be generated for a training run (see the integer parm seed).
wts_outfile: This file is produced only for a training run; it contains the final network weights resulting from the run.
lcn_scn_infile: Each line of this optional file should consist of a long class-name (as shown at the top of patterns_infile) and a corresponding short class-name (1 or 2 characters), with the two names separated by white space; the lines can be in any order. This file is required only for a run that requires short class-names, i.e. only if purpose is classifier and (1) priors is class or both (these settings of priors require class-weights to be read from class_wts_infile, and that type of file can be read only if the short class-names are known) or (2) do_confuse is true (proper output of confusion matrices requires the short class-names, which are used as labels).
class_wts_infile: This optional file contains class-weights, i.e. a "prior weight" for each class. (See switch parm priors to find out how mlp can use these weights.) Each line should consist of a short class-name (as shown in lcn_scn_infile) and the weight for the class, separated by white space; the order of the lines does not matter.
pattern_wts_infile: This optional file contains pattern-weights, i.e. a "prior weight" for each pattern. (See switch parm priors to find out how mlp can use these weights.) The file should be just a sequence of floating-point numbers (ASCII) separated from each other by white space, with the numbers in the same order as the patterns they are to be associated with.
Integer Parms
npats: Number of patterns to use; the first npats patterns in patterns_infile are taken.
ninps, nhids, nouts: Specify the number of input, hidden, and output nodes in the network. If ninps is smaller than the number of components in the feature-vectors of the patterns, then the first ninps components of each feature-vector are used. If the network is a classifier (see purpose), then nouts is the number of classes, since there is one output node for each class. If the network is a fitter, then ninps and nouts are the dimensionalities of the input and output real vector spaces. These are architecture parms, so they should be left unset for a run that is to read a network weights file.
seed: Seed for the UNI random number generator, used if initial weights for a training run are to be randomly generated. Its value must be positive. Random weights are generated only if wts_infile is not set. (The seed value can be reused to generate identical initial weights in different training runs, or it can be varied in order to do several training runs using the same values for the other parameters. It is often advisable to try several seeds, since any particular seed may produce atypically bad results (training may fail). However, the effect of varying the seed is minimal if Boltzmann pruning is used.)
niter_max: A stopping condition: maximum number of iterations a training run will be allowed to use.
nfreq: At every nfreq'th iteration during a training run, the errdel and nokdel stopping conditions are checked and a pair of status lines is written to the standard error output and to short_outfile.
nokdel: A stopping condition: stop if the number of iterations used so far is at least kmin and, for each of the most recent NNOT (defined in src/lib/mlp/optchk.c) sequences of nfreq iterations, the number right and the number right minus number wrong have both failed to increase by at least nokdel during the sequence.
lbfgs_mem: This value is used for the m argument of the LBFGS optimizer (if that optimizer is used, i.e. only if there is no Boltzmann pruning). This is the number of corrections used in the BFGS update. Values less than 3 are not recommended; large values will result in excessive computing time, as well as increased memory usage. Values in the range 3 through 7 are recommended; the value must be positive.
Floating-Point Parms
regfac: Regularization factor. The error value that a training run attempts to minimize contains a term consisting of regfac times half the average of the squares of the network weights. (The use of a regularization factor often improves the generalization performance of a neural network, by keeping the size of the weights under control.) This parm must always be set, even for test runs (since they also compute the error value, which always uses regfac); however, its effect can be nullified by just setting it to 0.
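The regularization term described above can be written out directly. This is a minimal Python sketch of that one sentence, not code taken from mlp; the function names are hypothetical, and the assumption that the term is simply added to the rest of the error value is an interpretation of "contains a term".

```python
def regularization_term(weights, regfac):
    """regfac times half the average of the squares of the weights."""
    if not weights:
        raise ValueError("weights must be non-empty")
    mean_square = sum(w * w for w in weights) / len(weights)
    return regfac * 0.5 * mean_square


def total_error(data_error, weights, regfac):
    # ASSUMPTION: the regularization term is added to the data error.
    return data_error + regularization_term(weights, regfac)


if __name__ == "__main__":
    w = [1.0, -2.0, 3.0]
    print(regularization_term(w, 2.0))
    print(total_error(0.25, w, 0.0))  # regfac of 0 nullifies the term
```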
alpha: A parm required by the type_1 error function.
temperature: For Boltzmann pruning: see the switch parm boltzmann. A higher temperature causes more severe pruning.
egoal: A stopping condition: stop when error becomes less than or equal to egoal.
gwgoal: A stopping condition: stop when | g | / | w | becomes less than or equal to gwgoal, where w is the vector of network weights and g is the gradient vector of the error with respect to w.
errdel: A stopping condition: stop if the number of iterations used so far is at least kmin and the error has not decreased by at least a factor of errdel over the most recent block of nfreq iterations.
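The three error-based stopping conditions above can be sketched as small predicates. This Python sketch is illustrative only and is not taken from the mlp sources; all function names are hypothetical. Two points are assumptions: | . | is taken to be the Euclidean norm, and "decreased by at least a factor of errdel" is read as a relative decrease, (previous error - current error) >= errdel * previous error. The actual test in mlp may differ.

```python
import math


def euclidean_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def egoal_reached(error, egoal):
    """Stop when the error is less than or equal to egoal."""
    return error <= egoal


def gwgoal_reached(gradient, weights, gwgoal):
    """Stop when |g| / |w| is less than or equal to gwgoal."""
    w_norm = euclidean_norm(weights)
    if w_norm == 0.0:
        return False  # ASSUMPTION: ratio undefined, so do not stop
    return euclidean_norm(gradient) / w_norm <= gwgoal


def errdel_stalled(error_before, error_now, iteration, kmin, errdel):
    """Stop if, past kmin iterations, the error fell by less than errdel.

    ASSUMPTION: the decrease is measured relative to the error at the
    start of the most recent block of nfreq iterations.
    """
    if iteration < kmin:
        return False
    return (error_before - error_now) < errdel * error_before


if __name__ == "__main__":
    print(egoal_reached(0.02, 0.05))
    print(gwgoal_reached([3.0, 4.0], [30.0, 40.0], 0.2))
    print(errdel_stalled(1.0, 0.995, iteration=100, kmin=50, errdel=0.01))
```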
oklvl: The value of the highest network output activation produced when the network is run on a pattern (the position of this highest activation among the output nodes is the hypothetical class) can be thought of as a measure of confidence. This confidence value is compared with the threshold oklvl, in order to decide whether to classify the pattern as belonging to the hypothetical class, or to reject it, i.e. to consider its class to be unknown because of insufficient confidence that the hypothetical class is the correct class. The numbers and percentages of the patterns that mlp reports as correct, wrong, and unknown, are affected by oklvl: a high value of oklvl generally increases the number of unknowns (a bad thing) but also increases the percentage of the accepted patterns that are classified correctly (a good thing). If no rejection is desired, set oklvl to 0. (Mlp uses the single oklvl value specified for a run; but if the switch parm do_cvr is set to true, then mlp also makes a full correct vs. rejected table for the network (for the finished network if a training run). This table shows the (number correct) / (number accepted) and (number unknown) / (total number) percentages for each of several standard oklvl values.)
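The rejection mechanism above can be illustrated with a short Python sketch. It is not mlp code: the function names and the sample activations are invented for illustration, and it assumes that a pattern is accepted when its highest activation is greater than or equal to oklvl (the text does not say whether the comparison is strict).

```python
def classify_with_rejection(activations, oklvl):
    """Return (hypothetical_class, accepted); classes are numbered from 1."""
    confidence = max(activations)
    hypothetical_class = activations.index(confidence) + 1
    # ASSUMPTION: accept when confidence >= oklvl.
    return hypothetical_class, confidence >= oklvl


def tally(patterns, oklvl):
    """patterns is a list of (activations, correct_class) pairs."""
    correct = wrong = unknown = 0
    for activations, correct_class in patterns:
        hyp, accepted = classify_with_rejection(activations, oklvl)
        if not accepted:
            unknown += 1
        elif hyp == correct_class:
            correct += 1
        else:
            wrong += 1
    accepted_total = correct + wrong
    return {
        "correct": correct,
        "wrong": wrong,
        "unknown": unknown,
        # (number correct) / (number accepted), as a percentage
        "pct_correct": 100.0 * correct / accepted_total if accepted_total else None,
        # (number unknown) / (total number), as a percentage
        "pct_unknown": 100.0 * unknown / len(patterns) if patterns else None,
    }


# Made-up activations for four patterns of a 3-class network.
PATTERNS = [
    ([0.90, 0.10, 0.00], 1),
    ([0.40, 0.50, 0.10], 1),
    ([0.20, 0.30, 0.60], 3),
    ([0.35, 0.33, 0.32], 2),
]

if __name__ == "__main__":
    for level in (0.0, 0.55):
        print(level, tally(PATTERNS, level))
```

With these made-up numbers, raising the threshold from 0 to 0.55 turns the two low-confidence errors into unknowns, which is the trade-off the description refers to.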
trgoff: This number sets how mildly the target values for network output activations vary between their "low" and "high" values. If trgoff is 0 (least mild, i.e. most extreme, effect), then the low target value is 0 and the high, 1; if trgoff is 1 (most mild effect), then low and high targets are both (1 / nouts); if trgoff has an intermediate value between 0 and 1, then the low and high targets have intermediately mild values accordingly.
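The two end points given above fix the targets at trgoff = 0 and trgoff = 1, but the text does not state how intermediate values are computed. The Python sketch below assumes straight linear interpolation between those end points; that choice, and the function names, are assumptions rather than a description of what mlp does.

```python
def target_values(trgoff, nouts):
    """Return (low, high) target activations.

    ASSUMPTION: linear interpolation between the documented end points
    (0, 1) at trgoff = 0 and (1/nouts, 1/nouts) at trgoff = 1.
    """
    if not 0.0 <= trgoff <= 1.0:
        raise ValueError("trgoff must lie between 0 and 1")
    if nouts < 1:
        raise ValueError("nouts must be positive")
    low = trgoff / nouts
    high = 1.0 - trgoff + low
    return low, high


def target_vector(correct_class, nouts, trgoff):
    """Targets for one pattern; classes are numbered 1 through nouts."""
    low, high = target_values(trgoff, nouts)
    return [high if k == correct_class else low for k in range(1, nouts + 1)]


if __name__ == "__main__":
    print(target_values(0.0, 10))
    print(target_values(1.0, 10))
    print(target_vector(2, 4, 0.2))
```

Under this assumed interpolation the targets for one pattern always sum to 1, whatever the value of trgoff.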
scg_earlystop_pct: This is a percentage that controls how soon a hybrid SCG/LBFGS training run (hybrid training can be used only if there is to be no Boltzmann pruning) switches from SCG to LBFGS. The switch is done the first time a check (checking every nfreq'th iteration) of the network results finds that every class-subset of the patterns has at least scg_earlystop_pct percent of its patterns classified correctly. A suggested value for this parm is 60.0.
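The switching test described above amounts to a per-class accuracy check. The following Python sketch shows one way to express it; the function name is hypothetical, and how mlp counts rejected (unknown) patterns in this check is not stated in the text, so the sketch simply compares the hypothetical class with the correct class.

```python
from collections import defaultdict


def ready_to_switch(correct_classes, hypothetical_classes, scg_earlystop_pct):
    """True if every class has at least scg_earlystop_pct percent right."""
    total = defaultdict(int)
    right = defaultdict(int)
    for correct, hyp in zip(correct_classes, hypothetical_classes):
        total[correct] += 1
        if hyp == correct:
            right[correct] += 1
    if not total:
        return False  # no patterns, nothing to base a switch on
    return all(
        100.0 * right[c] / total[c] >= scg_earlystop_pct for c in total
    )


if __name__ == "__main__":
    correct = [1, 1, 1, 2, 2]
    hyp = [1, 1, 2, 2, 1]
    # Class 1 is about 66.7 percent right, class 2 is 50 percent right.
    print(ready_to_switch(correct, hyp, 60.0))
    print(ready_to_switch(correct, hyp, 50.0))
```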
lbfgs_gtol: This value is used for the gtol argument of the LBFGS optimizer. It controls the accuracy of the line search routine mcsrch. If the function and gradient evaluations are inexpensive with respect to the cost of the iteration (which is sometimes the case when solving very large problems) it may be advantageous to set lbfgs_gtol to a small value. A typical small value is 0.1. The value of lbfgs_gtol must be greater than 1.e-04.
Switch Parms
Each of these parms has a small set of allowed values; the value is specified as a string, or less verbosely, as a code number (shown in parentheses after string form):
train_or_test:
train: Train a network, i.e. optimize its weights in the sense of minimizing an error function, using a training set of patterns.
test: Test a network, i.e. read in its weights and other parms from a file, run it on a test set of patterns, and measure the quality of the resulting performance.
purpose: Which of two possible kinds of engine the network is to be. This is an architecture parm, so it should be left unset for a run that is to read a network weights file. The allowed values are:
classifier: The network is to be trained to map any feature vector to one of a small number of classes. It is to be trained using a set of feature vectors and their associated correct classes.
fitter: The network is to be trained to approximate an unknown function that maps any input real vector to an output real vector. It is to be trained using a set of input-vector/output-vector pairs of the function. NOTE: this is not currently supported.
errfunc: Type of error function to use (always with the addition of a regularization term, consisting of regfac times half the average of the squares of the network weights).
mse: Mean-squared-error between output activations and target values, or its equivalent computed using classes instead of target vectors. This is the recommended error function.
type_1: Type 1 error function; requires that the floating-point parm alpha be set. (Not recommended.)
pos_sum: Positive sum error function. (Not recommended.)
boltzmann: Controls whether Boltzmann pruning of network weights is to be done and, if so, the type of threshold to use:
no_prune: Do no Boltzmann pruning.
abs_prune: Do Boltzmann pruning using threshold exp(- |w| / T), where w is a network weight being considered for possible pruning and T is the Boltzmann temperature.
square_prune: Do Boltzmann pruning using threshold exp(- w^2 / T), where w and T are as above.
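The two thresholds above are easy to compute, and doing so shows why a higher temperature prunes more severely: for a fixed weight, the threshold moves toward 1 as T grows. The Python sketch below is illustrative only. The mode strings "abs" and "square" are hypothetical labels for the two formulas, and the decision rule in should_prune (remove the weight when a uniform random draw falls below the threshold) is an assumption; the text gives only the threshold formulas, not how mlp applies them.

```python
import math
import random


def prune_threshold(weight, temperature, mode):
    """Boltzmann pruning threshold for one weight.

    mode "abs" gives exp(-|w| / T); mode "square" gives exp(-w^2 / T).
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    if mode == "abs":
        return math.exp(-abs(weight) / temperature)
    if mode == "square":
        return math.exp(-(weight * weight) / temperature)
    raise ValueError(f"unknown mode: {mode!r}")


def should_prune(weight, temperature, mode, rng):
    # ASSUMPTION: prune when a uniform draw in [0, 1) is below the threshold.
    return rng.random() < prune_threshold(weight, temperature, mode)


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.01, 0.1, 1.0):
        print(t, prune_threshold(0.5, t, "abs"), prune_threshold(0.5, t, "square"))
    print(should_prune(0.0, 0.1, "abs", rng))  # threshold is exactly 1
```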
acfunc_hids, acfunc_outs: The types of activation functions to be used on the hidden nodes and on the output nodes (separately settable for each layer). These are architecture parms, so they should be left unset for a run that is to read a network weights file. The allowed values are:
sinusoid: f(x) = 0.5 * (1 + sin(0.5 * x))
sigmoid: f(x) = 1 / (1 + exp(-x)) (Also called logistic function.)
linear: f(x) = 0.25 * x
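The three formulas above translate directly into code. This Python sketch only restates them and is not taken from mlp; the sigmoid is written in a branch form that is mathematically identical to 1 / (1 + exp(-x)) but avoids floating-point overflow for large negative x.

```python
import math


def sinusoid(x):
    return 0.5 * (1.0 + math.sin(0.5 * x))


def sigmoid(x):
    # Same value as 1 / (1 + exp(-x)); the branch avoids overflow in exp.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def linear(x):
    return 0.25 * x


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(x, sinusoid(x), sigmoid(x), linear(x))
```

All three functions have slope 0.25 at x = 0, so they behave alike for small inputs; the linear function differs in that its value at 0 is 0 rather than 0.5.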
priors: What kind of prior weighting to use to set the final pattern-weights, which control the relative amounts of impact the various patterns have when doing the computations. These final pattern-weights remain fixed for the duration of a training run, but of course they can be changed between training runs.
allsame: Set each final pattern-weight to (1 / npats). (The simplest thing to do; appropriate if the set of patterns has a natural distribution.)
class: Set each final pattern-weight to the class-weight of the class of the pattern concerned, divided by npats. The class-weights are derived by dividing the given-class-weights, read from class_wts_infile, by the derived-class-weights, computed for the current data set, and then normalizing the results to sum to 1.0. (Appropriate if the frequencies of the several classes, in the set of patterns, are not approximately equal to the natural frequencies (prior probabilities), so as to compensate for that situation.)
pattern: Set the final pattern-weights to values read from pattern_wts_infile divided by npats. (Appropriate if none of the other settings of priors does satisfactory calculations (one can do whatever calculations one desires), or if one wants to dynamically change these weights between sessions of training.)
both: Set each final pattern-weight to the class-weight of the class of the pattern concerned, times the provided pattern-weight, and divided by npats; compute the class-weights as described above for class priors, and read pattern-weights from file pattern_wts_infile. (Appropriate if one wants to both adjust for unnatural frequencies, and dynamically change the pattern weights.)
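The four weighting schemes above can be compared side by side in a short Python sketch. It is a reading of the descriptions, not code from mlp. The function names are hypothetical; the string "allsame" is an assumed label for the equal-weights setting, while "class", "pattern", and "both" follow the text. The sketch also assumes that the derived-class-weight of a class is its observed frequency in the current set of patterns, which the text does not spell out.

```python
from collections import Counter


def normalized_class_weights(given_class_wts, classes):
    """Given-class-weight / derived-class-weight, normalized to sum to 1."""
    npats = len(classes)
    counts = Counter(classes)
    # ASSUMPTION: derived-class-weight = observed class frequency.
    ratios = {c: given_class_wts[c] / (counts[c] / npats) for c in counts}
    total = sum(ratios.values())
    return {c: r / total for c, r in ratios.items()}


def final_pattern_weights(priors, classes, given_class_wts=None, pattern_wts=None):
    """Final pattern-weights for one run; classes has one entry per pattern."""
    npats = len(classes)
    if priors == "allsame":
        return [1.0 / npats] * npats
    if priors == "class":
        cw = normalized_class_weights(given_class_wts, classes)
        return [cw[c] / npats for c in classes]
    if priors == "pattern":
        return [p / npats for p in pattern_wts]
    if priors == "both":
        cw = normalized_class_weights(given_class_wts, classes)
        return [cw[c] * p / npats for c, p in zip(classes, pattern_wts)]
    raise ValueError(f"unknown priors setting: {priors!r}")


if __name__ == "__main__":
    classes = ["a", "a", "a", "b"]      # class "a" is over-represented
    given = {"a": 0.5, "b": 0.5}        # made-up natural frequencies
    print(final_pattern_weights("allsame", classes))
    print(final_pattern_weights("class", classes, given_class_wts=given))
```

In the made-up example, class "a" supplies three of the four patterns although its given weight is only one half, so each "a" pattern ends up with a smaller final weight than the single "b" pattern.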
patsfile_ascii_or_binary: Tells mlp which of two supported formats to expect for the patterns file that it will read at the start of a run. (If much compute time is being spent reading ASCII patsfiles, it may be worthwhile to convert them to binary format: that causes faster reading, and the binary-format files are considerably smaller.)
ascii: patterns_infile is in ASCII format.
binary: patterns_infile is in binary (FORTRAN-style binary) format.
do_confuse:
true: Compute the confusion matrices and miscellaneous information and include them in short_outfile.
false: Do not compute the confusion matrices and miscellaneous information.
show_acs_times_1000: This parm need be set only if the run is to produce a long_outfile.
true: Before recording the network output activations in long_outfile, multiply them by 1000 and round to integers.
false: Record the activations as their original floating-point values.
do_cvr:
true: Produce a correct-vs.-rejected table and include it in short_outfile.
false: Do not produce a correct-vs.-rejected table.