4.1.3 Training the Neural Networks

This section explains how to optimize the features for the PNN and MLP classifiers. Optimization for PNN is done using the optrws (optimize the regional weights) and optosf (optimize the overall smoothing factor) commands, described in Section 4.1.3.1. MLP uses the features from Section 4.1.2 as its input but performs a series of "training" runs to optimize its set of neural network weights. Section 4.1.3.2 discusses the MLP training process that produces the set of optimized weights used by the classifier.

[Footnote 11] The mean is not needed for further processing, but it is computed because, if multiple processors are available, it may be possible to save time by running several simultaneous instances of meancov on different subsets of the oas. The resulting output files are combined using the cmbmcs command, but to combine several covariance matrices, cmbmcs needs the means as well as the covariance matrices of the subsets.

4.1.3.1 Optimizing the Probabilistic Neural Network

Several steps are needed to optimize the feature set for PNN. First, a set of regional weights is computed that places emphasis on the most significant regions of the fingerprint (typically the core area). These weights are combined with the eigenvectors to produce a transform matrix used to reduce the dimensionality of the original oa features. Finally, the overall smoothing factor (osf) for PNN is optimized.

4.1.3.1.1 Optimize the Regional Weights (optrws)

This command optimizes the regional weights. First, it finds an optimal single value to which all the weights are set. Having thus defined an initial point in weight space, the program finishes the optimization with a very simple version of gradient descent. It estimates (by the secant method) the gradient of the activation error rate, using the PNN classifier and its prototype features. Classification on the prototype features is done by excluding the print being classified from the prototypes (i.e., leave-one-out).
Then it searches along the line pointing in the anti-gradient direction from the initial point, using a very simple method to find the minimum (or at least a local minimum) of the error along this line. The program then estimates the gradient there and does another downhill search. It stops after a specified number of iterations. A reasonable number of iterations is three or four, which may take several hours to run on a typical workstation when a few thousand prints are used as the data.

If several processors are available, it may be possible to reduce optrws runtime by setting its parameters so that, in one of its phases of processing, it spawns several processes to divide the work. Consult the manual page in Appendix B, and the default parameter file mentioned in the manual page, to find out about this. If your operating system does not implement fork() and execl(), which are required by the several-processes version of optrws, then optrws can still be made to link properly (i.e., without the fork and execl calls becoming unresolved references) by adding the argument -DNO_FORK_AND_EXECL to the definition of CFLAGS in src/bin/optrws/makefile.mak. That will cause a different subset of the source code file to be compiled (conditional compilation).

In order to efficiently evaluate the error function at a point in regional-weights space, optrws produces the square matrix Ψ^t W Ψ of order NFEATS from the eigenvectors Ψ and the diagonal matrix W that is equivalent to the regional weights. It then applies this matrix to all the K-L feature vectors before computing distances. This is only an approximation to the direct use of the regional weights, because only a partial set of eigenvectors is used and these eigenvectors are not recomputed each time the weights are changed. The results seem satisfactory, and the total runtime is much smaller than for direct methods.
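
The optimization loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the optrws implementation: the weights are applied directly to the feature components instead of through Ψ^t W Ψ, the kernel is assumed to be a plain Gaussian of the squared distance, the "activation error" is assumed to be the mean of one minus the normalized activation of the true class, and all function names and the candidate start values are hypothetical.

```python
import math

def loo_activation_error(weights, feats, labels):
    """Leave-one-out error: mean of (1 - normalized activation of the true class).
    The exact error definition used by optrws is an assumption here."""
    n = len(feats)
    total = 0.0
    for i in range(n):
        acts = {}
        for j in range(n):
            if j == i:
                continue  # leave-one-out: the print being classified is not a prototype
            d2 = sum((w * (a - b)) ** 2
                     for w, a, b in zip(weights, feats[i], feats[j]))
            acts[labels[j]] = acts.get(labels[j], 0.0) + math.exp(-d2)
        s = sum(acts.values())
        p_true = acts.get(labels[i], 0.0) / s if s > 0.0 else 0.0
        total += 1.0 - p_true
    return total / n

def secant_gradient(f, w, h=1e-3):
    """Estimate the gradient of f at w with one-sided difference quotients."""
    base = f(w)
    grad = []
    for k in range(len(w)):
        stepped = list(w)
        stepped[k] += h
        grad.append((f(stepped) - base) / h)
    return grad

def line_search(f, w, direction, step=0.5, max_steps=20):
    """Very simple search: keep stepping along direction while the error drops."""
    best_w, best_err = list(w), f(w)
    for _ in range(max_steps):
        cand = [x + step * d for x, d in zip(best_w, direction)]
        err = f(cand)
        if err >= best_err:
            break
        best_w, best_err = cand, err
    return best_w, best_err

def optimize_weights(feats, labels, iterations=3, candidates=(0.5, 1.0, 2.0, 4.0)):
    """Pick the best single value for all weights, then run a few descent iterations."""
    n = len(feats[0])
    f = lambda w: loo_activation_error(w, feats, labels)
    start = min(candidates, key=lambda c: f([c] * n))
    w = [start] * n
    err = f(w)
    for _ in range(iterations):
        grad = secant_gradient(f, w)
        w, err = line_search(f, w, [-g for g in grad])  # anti-gradient direction
    return w, err
```

Because the line search only accepts steps that lower the error, the returned error is never worse than the error at the chosen starting point.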
4.1.3.1.2 Make the Transform Matrix (mktran)

This command reads the optimized regional weights made by optrws, and the eigenvectors, and makes the transform matrix Ψ^t W used in the next step.

4.1.3.1.3 Apply the Transform Matrix (lintran)

Lintran should be run on the entire set of prototype oas made earlier, using the transform matrix made by mktran. The resulting feature vectors will be the prototype feature vectors used by the finished PNN classifier. The transform matrix both applies the optimal pattern of regional weights and uses the eigenvectors to accomplish dimension reduction. When the finished classifier runs on an incoming print, it applies this same transform matrix to the oa made from the print and then sends the resulting feature vector to PNN. This approximately duplicates the effect that would have resulted if PNN had been used on the oas themselves, but with the optimized regional weights pattern applied before the distance computation.

4.1.3.1.4 Optimize the Overall Smoothing Factor (optosf)

This command optimizes an overall smoothing factor (osf) used by the PNN classifier. As noted above, the optimization of the regional weights should be done using the K-L vectors of only a subset of the prototype prints, to save time. Since the full set of prototypes will be used in the finished classifier, better accuracy is expected if the classifier uses an osf that is slightly larger than 1, which is the value used during regional weights optimization. This corresponds to Specht's observation [43] that as the number of prototypes increases, the optimal smoothing parameter σ decreases. Increasing the osf corresponds to decreasing σ. If the full prototype set was used to optimize the regional weights, then optosf should not be run and the osf should be set to 1.

Completing the above optimization process results in the finished PNN classifier data, consisting of the prototype feature vectors, a transform matrix that will be applied to the oas of incoming fingerprints, and the overall smoothing factor.
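
As a rough illustration of how the three pieces of PNN classifier data fit together, the following Python sketch builds a transform matrix from eigenvectors and regional weights, applies it to an oa, and classifies the result. It is a sketch under assumptions, not the mktran, lintran, or PNN code: the function names are hypothetical, the kernel is assumed to be Gaussian, and the osf is assumed to multiply the squared distance in the exponent (which is at least consistent with the statement that a larger osf corresponds to a smaller σ).

```python
import math

def make_transform(eigvecs, weights):
    """Build the transform matrix: each row is one eigenvector with its
    components scaled by the diagonal regional weights."""
    return [[v[k] * weights[k] for k in range(len(weights))] for v in eigvecs]

def apply_transform(transform, oa):
    """Reduce an oa to a short feature vector (one dot product per row)."""
    return [sum(t * x for t, x in zip(row, oa)) for row in transform]

def pnn_classify(feat, protos, proto_labels, osf=1.0):
    """Sum a Gaussian kernel over the prototypes of each class.
    ASSUMPTION: osf scales the squared distance inside the exponent."""
    acts = {}
    for proto, label in zip(protos, proto_labels):
        d2 = sum((a - b) ** 2 for a, b in zip(feat, proto))
        acts[label] = acts.get(label, 0.0) + math.exp(-osf * d2)
    total = sum(acts.values())
    best = max(acts, key=acts.get)
    confidence = acts[best] / total if total > 0.0 else 0.0
    return best, confidence
```

In this sketch the same transform would be applied once to every prototype oa ahead of time and then to each incoming oa at classification time, so that distances are always computed in the reduced, weighted feature space.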
The PNN classification system then consists of a combination of the PNN classifier and the pseudo-ridge tracer.

4.1.3.2 Training the Multi-layer Perceptron Neural Network (mlp)

The program mlp trains a 3-layer feed-forward linear perceptron [48] using novel methods of machine learning that help control the learning dynamics of the network. As a result, the derived minima are superior, the decision surfaces of the trained network are well formed, the information content of confidence values is increased, and generalization is enhanced. As a classifier, MLP is superior to the PNN classifier in terms of its memory requirements and classification speed. The theory behind the machine learning techniques used in this program is discussed in References [49], [50], and [51]. The main routine for this program is found in src/bin/mlp/mlp.c, and the majority of the supporting subroutines are located in the library src/lib/mlp.

Machine learning is controlled through a batch-oriented iterative process of training the MLP on a set of prototype feature vectors, then evaluating the progress made by running the MLP (in its current state) on a separate set of feature vectors. Training on the first set of patterns then resumes for a predetermined number of passes through the training data, and then the MLP is tested again on the evaluation set. This process of training and then testing continues until the MLP is judged to have converged satisfactorily. For details on the command line and spec file parameters, see the manual page in the Reference Manual in Appendix B or on the CD-ROM.

This command trains or tests an MLP neural network suitable for use as a classifier. The network has an input layer, a hidden layer, and an output layer, each layer comprising a set of nodes. The input nodes are feed-forwardly connected to the hidden nodes, and the hidden nodes to the output nodes, by connections whose weights (strengths) are trainable.
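
The layer structure just described can be illustrated with a minimal forward pass in Python. This is a sketch, not the mlp code: the function names are hypothetical, the bias terms are an assumption (the text mentions only connection weights), and the exact form of the sinusoid activation used by mlp is not given here, so a plain sine is assumed.

```python
import math

# Activation choices; the plain-sine form of "sinusoid" is an assumption.
ACTIVATIONS = {
    "sinusoid": math.sin,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "linear": lambda x: x,
}

def layer(inputs, weights, biases, act):
    """One fully connected layer: each output node sees every input node."""
    f = ACTIVATIONS[act]
    return [f(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(features, hidden_w, hidden_b, out_w, out_b,
            hidden_act="sigmoid", out_act="sigmoid"):
    """Input nodes feed the hidden nodes, which feed the output nodes."""
    hidden = layer(features, hidden_w, hidden_b, hidden_act)
    return layer(hidden, out_w, out_b, out_act)
```

Each row of hidden_w holds the weights into one hidden node, and each row of out_w holds the weights into one output node; in a classifier, the index of the largest output would be taken as the hypothesized class.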
The activation function used for the hidden nodes can be chosen to be sinusoid, sigmoid (logistic), or linear, as can the activation function for the output nodes. Training (optimization) of the weights is done either using a Scaled Conjugate Gradient (SCG) algorithm [52], or by starting out with SCG and then switching to a Limited Memory Broyden Fletcher Goldfarb Shanno (LBFGS) algorithm [53]. Boltzmann pruning [54], i.e., dynamic removal of connections, can be performed during training if desired. Prior weights can be attached to the patterns (feature vectors) in various ways.

When mlp is invoked, it performs a sequence of runs. Each run does either training or testing:

training run: A set of patterns is used to train (optimize) the weights of the network. Each pattern consists of a feature vector, along with either a class or a target vector. A feature vector is a tuple of floating-point numbers, which typically has been extracted from some natural object such as a fingerprint image. A class denotes the actual class to which the object belongs, for example "whorl." The network can be trained to become a classifier: it trains using a set of feature vectors extracted from objects of known classes. A training run finishes by writing the final values of the network weights to a file. It also produces a summary file showing information about the run, and optionally produces a longer file that shows the results the final (trained) network produced for each individual pattern.

testing run: A set of patterns is sent through a network, after the network weights are read from a file. The output values, i.e., the hypothetical classes, are compared with target classes or vectors, and the resulting error rate is computed. The program can produce a table showing the correct classification rate as a function of the rejection rate.

Output files generated from mlp training are provided in test/pcasys/execs/mlp/mlp_dir.
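
To make the idea of a "correct classification rate as a function of the rejection rate" table concrete, here is a small Python sketch. It is not the code mlp uses to build its table: it assumes each pattern has a single confidence value (for example, its highest output activation) and that a pattern is rejected when that confidence falls below a threshold. The function name and the threshold list are hypothetical.

```python
def rejection_table(confidences, correct, thresholds):
    """For each threshold, reject patterns whose confidence is below it and
    report (threshold, rejection rate, correct rate among accepted patterns)."""
    n = len(confidences)
    rows = []
    for t in thresholds:
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        reject_rate = 1.0 - len(kept) / n
        # Correct rate is undefined when every pattern has been rejected.
        correct_rate = sum(kept) / len(kept) if kept else None
        rows.append((t, reject_rate, correct_rate))
    return rows

if __name__ == "__main__":
    demo = rejection_table([0.9, 0.8, 0.6, 0.4],
                           [True, True, False, True],
                           [0.0, 0.5, 0.7])
    for threshold, rejected, right in demo:
        print(threshold, rejected, right)
```

If the confidence values are informative, raising the threshold trades a higher rejection rate for a higher correct rate among the patterns that are still accepted.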
The spec file used by mlp to train the classifier on fingerprint images is spec. This spec file requires the input files patnames, patwts, priors, the training set fv1-9mlp.kls, and the testing set sv10mlp.kls, and it invokes 5 sequential pairs of mlp training/testing sessions. Four files are generated from each training/testing session. For example, the first session creates trn1.err, trn1l.err, trn1.wts, and tst1.err. Trn1.err is a report of the progressive error rates achieved on the training set. Trn1l.err is a file containing the output activations for each fingerprint in the training set. Trn1.wts contains the weights that result from training in the session. Tst1.err is a report of the error rate achieved on the testing set using the most recent set of weights from training.

For the next training/testing session, training resumes with the MLP network initialized to the weights contained in trn1.wts. The output files from this session are trn2.err, trn2l.err, trn2.wts, and tst2.err. The weights file trn2.wts is then used as input to the next session, and so on until the final session is complete. The files trn5.err, trn5l.err, and trn5.wts contain the final results of training, and tst5.err contains the error rate achieved by using the final set of weights to classify the testing set contained in sv10mlp.kls. Appendix A gives more details about the output files of the mlp training process, including formulas and sample data from PCASYS training.

There are numerous parameters to be specified in the spec file for running the program mlp (see the manual page for details on all the parameters). A good strategy for training the MLP on a new classification problem is to first work with a single training/testing session. Try different combinations of parameter settings until a reasonable amount of training is achieved within, for example, the first 50 iterations.
This typically involves using a relatively high value for regularization (such as 2.0 for fingerprint classification), varying the number of hidden nodes in the network, and trying different levels of temperature, typically incrementing or decrementing by powers of 10. For fingerprint classification, the number of hidden nodes is typically set equal to or greater than the number of input K-L features, and a temperature of 1.0e-5 works well. Once reasonable training is achieved, these parameters should remain fixed, and successive sessions of training/testing are performed according to a schedule of decreasing regularization. For fingerprint classification it works well to specify about 50 iterations for each training session, and to use a regularization factor schedule starting at 2.0 and decreasing to 1.0, 0.5, 0.2, and 0.1 over the successive training sessions.

This process of multiple training/testing sessions initiates MLP training within a reasonable solution space. It also enables the machine learning to refine its solution so that convergence is achieved while maintaining a high level of generalization by controlling the dynamics of constructing well-behaved decision surfaces. The intermediate testing sessions allow one to evaluate the progress made on an independent testing set, so that a judgment can be made as to whether incremental gains in training have reached diminishing returns. The theory behind the control of dynamical changes within the MLP learning process is discussed in References [49], [50], and [51]. Training the MLP in this fashion generates superior decision surfaces, thus providing robust activations for use as confidence values when rejecting confusing classifications. This training process is, of course, done once off-line, and the resulting weight files are then reused by the actual recognition system.
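
The multi-session schedule can be summarized as a small table of sessions, each starting from the weights written by the previous one. The Python sketch below only generates such a plan for inspection; it does not write a spec file, and the dictionary keys and function name are hypothetical labels chosen for illustration, not spec file parameter names. The file names follow the trnN/tstN pattern of the provided example output.

```python
def session_plan(regularization=(2.0, 1.0, 0.5, 0.2, 0.1), iterations=50):
    """Describe one training/testing session per regularization value.
    Keys are illustrative labels, not actual mlp spec file parameters."""
    plan = []
    for k, reg in enumerate(regularization, start=1):
        plan.append({
            "session": k,
            "regularization": reg,
            "iterations": iterations,
            # Session 1 starts fresh; later sessions resume from the prior weights.
            "initial_weights": None if k == 1 else f"trn{k - 1}.wts",
            "train_outputs": [f"trn{k}.err", f"trn{k}l.err", f"trn{k}.wts"],
            "test_output": f"tst{k}.err",
        })
    return plan

if __name__ == "__main__":
    for session in session_plan():
        print(session)
```

Laying the sessions out this way makes the chaining explicit: the only thing carried from one session to the next is the weights file, while the regularization factor steps down according to the schedule.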
In practice, the user could use the mlp command to do a batch run over a set of test data, rather than running the PCASYS commands and processing each test image individually. The PCASYS commands are provided mainly to demonstrate the procedure used to get the final classification results and, where possible, to let the user see graphics of the progress at each step along the way.