Research Article | Open Access

Jingyi Liu, Ba Tuan Le, "Incremental Multiple Hidden Layers Regularized Extreme Learning Machine Based on Forced Positive-Definite Cholesky Factorization", *Mathematical Problems in Engineering*, vol. 2019, Article ID 6740523, 15 pages, 2019. https://doi.org/10.1155/2019/6740523

# Incremental Multiple Hidden Layers Regularized Extreme Learning Machine Based on Forced Positive-Definite Cholesky Factorization

**Academic Editor:** Alberto Olivares

#### Abstract

The theory and implementation of extreme learning machine (ELM) prove that it is a simple, efficient, and accurate machine learning method. Compared with other single hidden layer feedforward neural network algorithms, ELM is characterized by simpler parameter selection rules, faster convergence speed, and less human intervention. The multiple hidden layer regularized extreme learning machine (MRELM) inherits these advantages of ELM and has higher prediction accuracy. In the MRELM model, the number of hidden layers is fixed at initialization, and there is no iterative tuning process. However, the optimal number of hidden layers is the key factor determining the generalization ability of MRELM, so it is clearly unreasonable to determine this number by trial and error or random initialization. In this paper, an incremental MRELM training algorithm based on forced positive-definite Cholesky factorization (FC-IMRELM) is put forward to solve the network structure design problem of MRELM. First, an MRELM-based prediction model with one hidden layer is constructed, and then a new hidden layer is added to the prediction model in each training step until the generalization performance of the prediction model reaches its peak value. Thus, the optimal network structure of the prediction model is determined. In the training procedure, forced positive-definite Cholesky factorization is used to calculate the output weights of MRELM, which avoids the inverse-matrix and Moore-Penrose generalized-inverse calculations involved in the training of the hidden layer parameters. Therefore, the FC-IMRELM prediction model can effectively reduce the computational cost brought by increasing the number of hidden layers.
Experiments on classification and regression problems indicate that the algorithm can be effectively used to determine the optimal network structure of MRELM, and the prediction model trained by the algorithm achieves excellent prediction accuracy at low computational cost.

#### 1. Introduction

The neural network is a complex nonlinear system interconnected by a large number of neurons; it is a mathematical simulation of the physiological structure of the human brain, grounded in research on human information processing from modern neurobiology and cognitive science. With its strong adaptability, self-learning ability, and nonlinear mapping capability, it has been widely used in many scientific fields [1–3]. However, the above prediction models are all based on traditional neural networks, whose training process must repeatedly modify the network weights according to the training objectives and gradient information. The entire training process usually takes hundreds or even thousands of iterations to complete, which requires a large amount of calculation.

Extreme learning machine (ELM) is a novel single hidden layer feedforward neural network. It transforms the iterative adjustment process of traditional neural network parameter training into solving linear equations: according to Moore-Penrose generalized inverse matrix theory, the least squares solution with the minimum norm is obtained analytically as the network weights. The whole training process is completed in a single pass without iteration. Compared with traditional neural network training algorithms, which require many iterations to determine the network weights, the training speed of ELM is significantly improved [4, 5]. This advantage enables ELM to be successfully applied in pattern recognition [6, 7] and regression estimation [8–10]. In order to improve the generalization ability of ELM, the literature [11] draws on the principle of structural risk minimization in statistical learning theory and proposes the regularized extreme learning machine (RELM). RELM achieves better generalization by introducing a parameter to weigh structural risk against empirical risk [12–15]. For the single hidden layer RELM model with multiple inputs and a single output, the literature [16] designed a Cholesky factorization method for the regularized output weight matrix: in the learning and forgetting process of the sample sequence, the Cholesky factorization factor is computed recursively by adding and deleting samples one by one, the output weights are then adjusted, and the network structure remains fixed. However, when dealing with input data containing complex noise signals and high-dimensional information, or with many classification categories, RELM shows its own shortcomings, and the accuracy of the established model is greatly reduced.

To address these shortcomings of RELM, the literature [17, 18] starts from improving its network structure. On the basis of the traditional three-layer RELM structure, the number of hidden layers is increased to form a neural network with one input layer, multiple hidden layers, and one output layer, that is, the multiple hidden layers RELM network model (MRELM), in which the neuron nodes of the hidden layers are fully connected. MRELM inherits from RELM the idea of randomly initializing the weights matrix between the input layer and the hidden layer as well as the bias vector of the hidden layer. By forcing the actual output of each hidden layer to be as close as possible to its expected output, the weights matrices and bias vectors of the added hidden layers are calculated; thereby a neural network model with multiple hidden layers is established. The parameter training process needs to calculate the inverse matrix and the MP generalized inverse matrix, in which the first hidden layer's parameters are randomly initialized and the remaining hidden layers' parameters are obtained by minimizing the error between the actual output and the expected output of the corresponding hidden layer. Compared with the traditional RELM model, MRELM can effectively improve the prediction accuracy through layer-by-layer optimization of the network parameters between different hidden layers. Moreover, it has strong generalization ability and fast computing speed and is less likely to fall into a local optimum [19–22]. However, since the initial parameter values of MRELM are randomly initialized, although this helps the algorithm avoid local optima and overfitting, it also leads to the failure of some hidden layers, or the reduction of their effect on the neural network, during the modeling process.
As a result, there are some redundant hidden layers in the MRELM network, which calls for more reasonable methods and theories for selecting the number of hidden layers. Meanwhile, the network structure of MRELM is determined by users based on their own practical experience, but this empirical choice is not well founded, and its optimality is difficult to guarantee. In practical applications, users often need to carry out repeated experiments and choose, from a complex comparison of results, the network structure with the least time consumption and the highest accuracy as the optimal network model for training and prediction on the actual data.

In order to realize the effective design of the MRELM network structure, select the number of hidden layers reasonably, and achieve the desired accuracy requirements, an incremental MRELM training algorithm based on forced positive-definite Cholesky factorization (FC-IMRELM) [23, 24] is put forward in this paper. The algorithm can adaptively adjust the number of hidden layers in the network according to the predicted data, so as to determine the optimal network structure of FC-IMRELM. At the same time, a novel method is adopted to calculate the parameters of the newly added hidden layers, that is, the connection weight matrix and the bias vector of the hidden layers. Based on previous research, MRELM typically requires fewer hidden neurons than ELM to achieve a desirable performance level, which is a basic motivation for the multiple-hidden-layers structure considered here. Compared with other ELM variants, the ideas underlying the FC-IMRELM algorithm are simpler to realize and more stable. Experimental results for classification and regression problems show that the proposed FC-IMRELM algorithm has advantages in average accuracy over the traditional RELM model and other improved MRELM models.

The rest of this paper is organized as follows: Section 2 presents a brief review of the basic concepts and related work of multiple hidden layers RELM, Section 3 describes the proposed incremental FC-IMRELM technique, Section 4 reports and analyzes the experimental results, and, finally, Section 5 summarizes key conclusions of the present study.

#### 2. Brief Review of Multiple Hidden Layers Regularized Extreme Learning Machine

The MRELM algorithm tries to find a mapping that makes the output predicted by the ELM neural network with multiple hidden layers infinitely close to the actual given result. This mapping is embodied in the solution process for the weight and bias parameters of the hidden layers. The number of hidden layers in the MRELM neural network needs to be selected according to the predicted data. Therefore, to ensure that the final hidden layer output is close to the expected hidden layer output, the parameters of the first hidden layer are randomly initialized, and the parameter training process then optimizes the network parameters from the second hidden layer onward until all of them have been determined. Furthermore, during the establishment of the neural network, the weight matrix and bias vector of each hidden layer are acquired and recorded, so as to obtain the final predicted output of the MRELM neural network. The solution process for the network parameters is explained in detail in the following algorithm flow.

Suppose that the training dataset given to the MRELM neural network is , where is the input samples, is the input vector, is the corresponding labeled samples, is the observation vector, and is the total number of training samples. Meanwhile, it is assumed that all hidden layers in the MRELM model contain the same number of hidden nodes , and each hidden node uses the same activation function . In the modeling process of the MRELM algorithm, the multiple hidden layers in the neural network are first treated as a single hidden layer, and the hidden layer parameters of this single-hidden-layer MRELM network are randomly initialized, namely, the input weights matrix connecting the input layer and the first hidden layer, and the bias vector of the first hidden layer's nodes. Thus, the output matrix of the first hidden layer can be calculated as follows: whose scalar entries are interpreted as the output of the hidden node in the first hidden layer with respect to , and is the vector of connection weights between the input nodes and the hidden node in the first hidden layer. To better balance the empirical risk and structural risk, MRELM adjusts the proportion of the two risks by introducing the parameter , which can be expressed as the following constrained optimization problem: where is the connection weights matrix between the first hidden layer and the output layer, with vector components that denote the connection weights between the hidden node in the first hidden layer and the output nodes, denotes the training error, and is the regularization parameter.
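As a concrete illustration of this first step, the random initialization and the computation of the first hidden layer's output matrix can be sketched in Python. All variable names and sizes here are our own illustrative choices, not the paper's notation:

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid activation, as used throughout the paper
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, d, L = 100, 5, 20                 # training samples, input dim, hidden nodes

X = rng.standard_normal((N, d))      # input sample matrix

# Randomly initialized parameters of the first (and initially only) hidden layer
W1 = rng.uniform(-1, 1, size=(d, L))   # input-to-hidden connection weights
b1 = rng.uniform(-1, 1, size=L)        # hidden-node biases

H1 = sigmoid(X @ W1 + b1)            # first hidden layer output matrix, shape (N, L)
```

Each entry of `H1` is the response of one hidden node to one input sample, which is exactly the role of the first hidden layer output matrix above.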

According to the KKT theorem, the constrained optimization problem (2) can be transformed into the following dual optimization problem: where is the Lagrange multipliers vector. Using the KKT optimality conditions, the following equations can be obtained: Finally, can be obtained as follows: or In order to reduce the computational cost, if , one may prefer to apply solution (5a), and if , one may prefer to apply solution (5b).
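In code, the two equivalent closed-form solutions (5a)/(5b) amount to solving a regularized linear system. A minimal sketch, assuming a hidden layer output matrix `H`, target matrix `T`, and regularization parameter `C` (function and variable names are ours):

```python
import numpy as np

def relm_output_weights(H, T, C):
    """Regularized output weights of RELM/MRELM.

    Chooses between the two algebraically equivalent forms depending on
    whether there are more samples (N) or hidden nodes (L), so that the
    smaller of the two Gram matrices is factorized.
    """
    N, L = H.shape
    if N >= L:
        # form (5a): beta = (I/C + H^T H)^{-1} H^T T   -- an L x L system
        return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    # form (5b): beta = H^T (I/C + H H^T)^{-1} T       -- an N x N system
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
```

Both branches return the same weight matrix (by the push-through identity), so the cheaper one can always be chosen, which mirrors the size-based rule stated above.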

Now the second hidden layer is added to the MRELM neural network, the network structure with two hidden layers is restored, and the two hidden layers are fully connected, so the prediction output of the second hidden layer can be obtained as follows: where denotes the weights matrix between the first hidden layer and the second hidden layer. We suppose that the first and second hidden layers have the same number of nodes, and thus is a square matrix. The matrix represents the bias of the second hidden layer. The expected output of the second hidden layer can be calculated as where is the MP generalized inverse of the matrix , which can be calculated using the orthogonal projection method: namely, if is nonsingular, then ; otherwise, if is nonsingular. To make the predicted output of the hidden layer in the MRELM neural network infinitely close to the expected output, we may set . Subsequently, we define the augmented matrix , which can be obtained as where is the MP generalized inverse of the matrix , and 1 represents a one-column vector of size N whose elements are the scalar unit 1. The solving method of is the same as previously discussed for . The notation indicates the inverse of the activation function . For both classification and regression problems, we invoke the widely used logistic sigmoid function . The predicted output of the second hidden layer is obtained as Therefore, the connection weights matrix between the second hidden layer and the output layer is calculated as or The solving method of is chosen according to what was previously discussed for .
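The construction of the new layer's parameters described above can be sketched as follows. This is our illustrative reading of the second-layer equations, with `numpy.linalg.pinv` standing in for the MP generalized inverse and a logit function for the inverse sigmoid; all names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(h, eps=1e-7):
    # inverse of the logistic sigmoid; inputs clipped away from {0, 1}
    h = np.clip(h, eps, 1.0 - eps)
    return np.log(h / (1.0 - h))

def add_hidden_layer(H_prev, T, beta):
    """Given the previous layer's output H_prev and current output weights beta,
    compute the new layer's weights/bias and its updated predicted output."""
    # expected output of the new layer: H_E = T beta^+, so that H_E @ beta ~ T
    H_E = T @ np.linalg.pinv(beta)
    # solve g(H_prev W + b) = H_E through the augmented matrix [1 H_prev]
    H_aug = np.hstack([np.ones((H_prev.shape[0], 1)), H_prev])
    W_HE = np.linalg.pinv(H_aug) @ logit(H_E)
    b, W = W_HE[0], W_HE[1:]
    H_new = sigmoid(H_prev @ W + b)        # updated predicted layer output
    return W, b, H_new
```

The clipping in `logit` is a practical safeguard we add so that the inverse activation stays finite; the paper itself only states that the inverse sigmoid is used.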

Following the MRELM algorithm flow, the third hidden layer is added to the MRELM network, restoring the network structure with three hidden layers. Since the nodes between adjacent hidden layers are fully connected, the prediction output of the third hidden layer can be obtained as where represents the weights matrix between the second hidden layer and the third hidden layer, and the vector denotes the bias of the third hidden layer. Thus, the expected output of the third hidden layer can be obtained as where is the MP generalized inverse of the weights matrix , obtained using the approach described before. To meet the requirement that the predicted output of the third hidden layer is infinitely close to the expected output, let . Accordingly, the augmented matrix can be defined as , and we can solve it as follows: where is the MP generalized inverse of the matrix , the specific meaning of the symbol 1 is described above, and the calculation of also proceeds in the manner discussed before. Therefore, we can update the predicted output of the third hidden layer as Finally, the connection weight matrix between the third hidden layer and the output layer can be calculated as or The calculation approach of is still selected according to the principle discussed previously. The final output of the MRELM network with three hidden layers after training can be expressed as

If the number of hidden layers in the MRELM network is more than 3, an iterative scheme can be adopted to realize the calculation process. In other words, the iterative calculation of formula (6) to formulas (15a) and (15b) is performed for times until all hidden layer parameters are solved. Finally, it should be emphasized that this algorithm does not add all hidden layers to the network at one time, nor does it calculate all hidden layer parameters at one time; instead, hidden layers are added to the network one after another. Every time a new hidden layer is added, the weights matrix and the bias vector of the hidden layers are calculated immediately to prepare for the parameter calculation of the next hidden layer to be added.
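The incremental layer-by-layer procedure can be summarized in a short, self-contained sketch. It illustrates the general idea of adding one layer per step and keeping it only while a held-out error keeps improving; it is not the paper's exact FC-IMRELM (which replaces the pseudoinverse solves below with forced positive-definite Cholesky factorization), and all names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(h, eps=1e-7):
    h = np.clip(h, eps, 1.0 - eps)
    return np.log(h / (1.0 - h))

def train_incremental(X, T, Xv, Tv, C=100.0, L=20, max_layers=5, seed=0):
    """Add hidden layers one at a time, stopping when validation error stops improving."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, size=(X.shape[1], L))
    b1 = rng.uniform(-1, 1, size=L)
    H, Hv = sigmoid(X @ W1 + b1), sigmoid(Xv @ W1 + b1)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    best = np.mean((Hv @ beta - Tv) ** 2)
    n_layers = 1
    for _ in range(max_layers - 1):
        H_E = T @ np.linalg.pinv(beta)                    # expected output of new layer
        H_aug = np.hstack([np.ones((H.shape[0], 1)), H])
        W_HE = np.linalg.pinv(H_aug) @ logit(H_E)         # [bias; weights] of new layer
        b, W = W_HE[0], W_HE[1:]
        H2, Hv2 = sigmoid(H @ W + b), sigmoid(Hv @ W + b)
        beta2 = np.linalg.solve(np.eye(L) / C + H2.T @ H2, H2.T @ T)
        err = np.mean((Hv2 @ beta2 - Tv) ** 2)
        if err >= best:                                   # generalization peaked: stop
            break
        best, H, Hv, beta, n_layers = err, H2, Hv2, beta2, n_layers + 1
    return n_layers, beta, best
```

The stopping rule here corresponds to the statement above that a new hidden layer is added per training step only until generalization performance peaks.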

#### 3. Solutions of IMRELM by the Forced Positive-Definite Cholesky Factorization

For the single hidden layer feedforward neural network, the literature [15] puts forward a regularized extreme learning machine algorithm based on Cholesky factorization (CF-FORELM), introduces the Cholesky factorization of positive definite matrices into the solving process of RELM, and designs a recursive method for calculating the Cholesky factorization factor of the regularized output matrix. The advantages of the CF-FORELM algorithm prompted us to introduce the forced positive-definite Cholesky factorization method into the framework of the MRELM algorithm with multiple hidden layers, and we then propose an MRELM neural network training algorithm based on forced positive-definite Cholesky factorization (FC-IMRELM). Compared with the inverse calculation of an invertible matrix in the traditional RELM algorithm and the calculation of the MP generalized inverse of matrices in the MRELM algorithm ( denotes the number of hidden layers), the algorithm effectively reduces the computational cost and complexity brought by the matrix inversion process. Meanwhile, the numerical stability of the forced positive-definite Cholesky factorization method also greatly weakens the effect of the randomness of the ELM algorithm on the prediction results.

##### 3.1. Forced Positive-Definite Cholesky Factorization (FC)

The main difficulty of the MRELM algorithm is the calculation of the inverse matrix and the MP generalized inverse matrix involved in the training process, including the inversion of symmetric positive semidefinite matrices. In this case, an improved MRELM model based on the traditional Cholesky factorization cannot be realized, because the Cholesky factorization of a symmetric positive semidefinite matrix might not exist. Even if such a factorization exists, the calculation process is generally numerically unstable because the elements of the factorization factor may be unbounded. In order to overcome these difficulties, we put forward a modified approach for the MRELM algorithm with multiple hidden layers based on the forced positive-definite Cholesky factorization, which is numerically stable.

When the forced positive-definite strategy is adopted to improve the MRELM algorithm, the key problem is how to form the positive definite matrix from the modified Cholesky decomposition of the undetermined matrix. If the matrix is not positive definite, the Cholesky factorization method that forces the matrix to have the positive-definite property is to find, for the general symmetric matrix , a unit lower triangular matrix and a positive definite diagonal matrix , so that the matrix is positive definite and differs from the matrix by only a diagonal matrix. In fact, the Cholesky factorization of a symmetric positive definite matrix can be described as follows: where represents the element of matrix and denotes the main diagonal element of matrix . Here, the Cholesky factorization factors and are required to satisfy two requirements: one is that all elements of are strictly positive, and the other is that the elements of the factorization factor are uniformly bounded. That is, for and a positive number , formula (19) is required: where the auxiliary quantity , and is a given small positive number. A matrix satisfying the above conditions is said to be sufficiently positive definite, where is a zero matrix.

Next, we describe the steps of this factorization. Suppose the column of the forced positive-definite Cholesky factorization has been calculated, so that, for , equation (19) holds. First calculate where is taken as , and the test value is defined as where is a small positive number. In order to determine whether can be accepted as the element of , we check whether satisfies formula (19). If so, let , and obtain the column of from . Otherwise, let , select a positive number such that , and produce the column of .

If the above process is completed, we obtain the Cholesky factorization formula (17) of the positive definite matrix , where is a nonnegative diagonal matrix whose diagonal element is . For the given matrix , this nonnegative diagonal matrix depends on . If , where is the maximum norm of the nondiagonal elements of and is the maximum norm of the diagonal elements of . If , the upper bound is minimized. So, let satisfy formula (24): where represents the machine precision. We increase to prevent from being too small.

Finally, we present the forced positive-definite Cholesky factorization algorithm, where the auxiliary quantity , , . These values need not be stored separately; they can be stored in the matrix .

*Algorithm 1* (Forced Positive-Definite Cholesky Factorization (FC)).

*Step 1*. Calculate the bounds of the elements of the factorization factor. Let , where and and are the maximum norms of the diagonal and nondiagonal elements of , respectively.

*Step 2*. Initialization. Let , , .

*Step 3*. Determine the minimum index , so that , and exchange the information of rows and rows, and of columns and columns, of .

*Step 4*. Calculate the row of , and solve the maximum norm of . Let , , calculate , , and let . If , let .

*Step 5*. Calculate the diagonal element of . The diagonal element of is modified to . If , stop.

*Step 6*. Correct the diagonal elements and the column index; let , and ; jump to Step 3.
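To make Algorithm 1 concrete, here is a simplified Python sketch of a Gill-Murray-style forced positive-definite factorization A + E = L D L^T. It omits the row/column exchange of Step 3 and uses our own choice of bounds and tolerances, so it illustrates the idea rather than transcribing the paper's algorithm faithfully:

```python
import numpy as np

def modified_cholesky(A, eps=1e-8):
    """Forced positive-definite factorization: A + E = L diag(d) L^T.

    Returns a unit lower-triangular L and strictly positive pivots d,
    where E is a nonnegative diagonal correction (zero when A is
    sufficiently positive definite). Simplified: no pivoting.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    gamma = max(np.abs(np.diag(A)).max(), eps)              # max diagonal magnitude
    xi = np.abs(A - np.diag(np.diag(A))).max()              # max off-diagonal magnitude
    beta2 = max(gamma, xi / max(np.sqrt(n * n - 1.0), 1.0), eps)
    delta = eps * max(gamma + xi, 1.0)                      # floor for the pivots

    L = np.eye(n)
    d = np.zeros(n)
    C = A.copy()                                            # running column values c_ij
    for j in range(n):
        theta = np.abs(C[j + 1:, j]).max() if j + 1 < n else 0.0
        # forced positive pivot: bounds the elements of L and keeps d_j > 0
        d[j] = max(abs(C[j, j]), theta * theta / beta2, delta)
        L[j + 1:, j] = C[j + 1:, j] / d[j]
        for i in range(j + 1, n):                           # update trailing submatrix
            C[i:, i] -= d[j] * L[i, j] * L[i:, j]
    return L, d
```

By construction the off-diagonal entries of A are reproduced exactly, so the correction E = L diag(d) L^T - A is diagonal and nonnegative, which matches the requirement above that the factorized matrix differ from A by only a diagonal matrix.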

##### 3.2. Process of Matrices Decomposition for FC-IMRELM

According to the MRELM training process shown in equations (1) to (16), its essence is to solve the connection weight matrix between the hidden layer and the output layer. However, it can be seen from equations (5a), (5b), (10a), (10b), (15a), and (15b) that the solution method given in the literature [18] involves matrix inversion, and the solution process for each hidden layer's parameters ( is the number of hidden layers) involves the calculation of the MP generalized inverse of a matrix. This large amount of calculation reduces the modeling efficiency of the MRELM prediction model. In order to solve these problems effectively, we propose a solution method for the weights matrix and the hidden layer parameters based on the forced positive-definite Cholesky factorization.

First, on the basis of equations (4a), (4b), and (4c), can be obtained from equation (4a) and can be obtained from equation (4b); then, substituting and into equation (4c), we can obtain where , , and , and where , , and .

Therefore, the process of solving according to equation (5a) can be transformed into solving linear equations in the form of equation (25), and the process of solving according to equation (5b) can be transformed into solving linear equations in the form of equation (26), where is the dimension of the observation vector.

If the number of training samples is greater than the number of hidden nodes, that is, , the solution process of based on the Cholesky factorization is as follows. First, calculate the Cholesky factorization of matrix : where is a lower triangular matrix with positive diagonal elements. The nonzero element in can be calculated from the element of according to equation (28): where , . Substitute equation (27) into equation (25) and multiply both sides of the equation by : where . Solving is equivalent to solving equation (29). Because is equivalent to , the calculation formula for the elements of can be obtained by comparing the elements on both sides of the equation: where , is the element of . Finally, on the basis of obtaining and , the element of can be calculated by using the elements of and : where , , .
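The triangular-substitution formulas above correspond to the following short sketch for the N > L case. Forward and back substitution are written out elementwise to mirror the derivation; all names are ours:

```python
import numpy as np

def forward_sub(G, B):
    # solve G Y = B for a lower triangular G, row by row
    n = G.shape[0]
    Y = np.zeros_like(B, dtype=float)
    for i in range(n):
        Y[i] = (B[i] - G[i, :i] @ Y[:i]) / G[i, i]
    return Y

def back_sub(U, B):
    # solve U X = B for an upper triangular U, from the last row upward
    n = U.shape[0]
    X = np.zeros_like(B, dtype=float)
    for i in range(n - 1, -1, -1):
        X[i] = (B[i] - U[i, i + 1:] @ X[i + 1:]) / U[i, i]
    return X

rng = np.random.default_rng(0)
N, L, m, C = 50, 10, 2, 100.0          # N > L case
H = rng.standard_normal((N, L))
T = rng.standard_normal((N, m))

A = H.T @ H + np.eye(L) / C            # symmetric positive definite Gram matrix
G = np.linalg.cholesky(A)              # A = G G^T, G lower triangular
beta = back_sub(G.T, forward_sub(G, H.T @ T))   # two triangular solves, no inversion
```

As the text notes, no matrix is ever inverted: the factorization plus two substitutions replace the inverse in equation (5a).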

If the number of hidden nodes is greater than the number of training samples, that is, , the solution process of based on the Cholesky factorization is as follows. First, calculate the Cholesky factorization of : where is a lower triangular matrix with positive diagonal elements. The nonzero element in can be calculated from the element of according to equation (33): where , . Substitute equation (32) into equation (26) and multiply both sides of the equation by : where . Solving is equivalent to solving equation (34). Because is equivalent to , the calculation formula for the element of can be obtained by comparing the elements on both sides of the equation: where , is the element of . Finally, on the basis of obtaining and , the element of can be calculated by using the elements of and : where , , .

At this point, we get the connection weight matrix between the first hidden layer and the output layer. The connection weight matrix between the other hidden layer and output layer can also be calculated by the above method. Compared with the solution method of the connecting weight matrix as shown in equations (5a), (5b), (10a), (10b), (15a), and (15b), the solution of based on Cholesky factorization does not involve the inverse operation of the matrix, and it can be achieved by using simple algebraic operations.

The MRELM model contains multiple hidden layers, and the solving process of each hidden layer's parameters requires calculating the MP generalized inverse of the corresponding matrices. However, it can be concluded from equations (7), (8), (11), and (13) that the solution method for and using the orthogonal projection method [5] involves the inversion of the symmetric positive semidefinite matrices , , , and , which suffers from large computational cost and numerical instability; moreover, if the condition number of the above matrices is too large, the MP generalized inverses of matrices and often cannot be obtained. This not only affects the modeling efficiency and prediction effect of the MRELM model, but may also make the modeling process impossible to complete. However, the traditional Cholesky factorization method can only be applied to symmetric positive definite matrices. To effectively overcome these difficulties, we use the forced positive-definite Cholesky factorization to solve the MP generalized inverses of matrices and , and we then obtain the hidden layer parameters .
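As a minimal stand-in for this idea, one can force positive definiteness of a semidefinite Gram matrix with a small diagonal shift before the Cholesky solve. This is a deliberate simplification of the paper's per-column forced positive-definite correction, shown only to convey why the solve still succeeds when plain Cholesky would fail; names are ours:

```python
import numpy as np

def solve_forced_pd(A, B, eps=1e-8):
    """Solve A X = B for symmetric positive SEMIdefinite A.

    Adds the smallest diagonal shift tau*I that lets Cholesky succeed,
    then performs two triangular solves (a simple stand-in for the
    per-column forced positive-definite factorization of Algorithm 1).
    """
    n = A.shape[0]
    tau = 0.0
    while True:
        try:
            G = np.linalg.cholesky(A + tau * np.eye(n))   # fails if not PD
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, eps)                     # grow the shift and retry
    Y = np.linalg.solve(G, B)          # forward solve G Y = B
    return np.linalg.solve(G.T, Y)     # back solve G^T X = Y
```

When A is already positive definite the shift stays at zero and the result is the exact Cholesky solution; when A is singular, the tiny shift plays the role of the diagonal correction E in the forced positive-definite factorization.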

If is nonsingular, then take and substitute it into (7) and (11): where , .

If is nonsingular, then take and substitute it into (7): where , .

Therefore, the process of solving can be transformed into solving linear equations in the form of equation (37), or into solving linear equations in the form of equation (38), where is the number of hidden nodes and is the dimension of the observation vector.

If is nonsingular, the solution process of based on the forced positive-definite Cholesky factorization is as follows. First, calculate the modified Cholesky factorization result of matrix : where is the unit lower triangular matrix and is the positive definite diagonal matrix. is the nonzero element below the main diagonal in , and is the main diagonal element in ; they can be calculated according to Algorithm 1 by using the elements of , where , . If we set , then (39a) can be written as where . Substitute equation (39b) into equation (37) and multiply both sides of the equation by : where . Solving is equivalent to solving equation (40). Since is equivalent to , by comparing the elements on both sides of the equation, the calculation formula for the element of can be obtained: where is the element of , and is the element of , , . Finally, on the basis of obtaining and , the element of can be calculated by using the elements of and : where , , ; consequently, we can get .

If is nonsingular, the solution process of based on the forced positive-definite Cholesky factorization is as follows. First, calculate the modified Cholesky factorization results of matrix : where is the unit lower triangular matrix and is the positive definite diagonal matrix. is the nonzero element below the main diagonal in , and is the main diagonal element in ; they can be calculated according to Algorithm 1 by using the elements of , , . If we set , then (43a) can be written as where . Substitute equation (43b) into equation (38) and multiply both sides of the equation by : where . Solving is equivalent to solving equation (44). Since is equivalent to , by comparing the elements on both sides of the equation, the calculation formula for the element of can be obtained as where is the element of and is the element of , , . Finally, on the basis of obtaining and , the element of can be obtained by using the elements of and : where , , ; therefore, we can get .

If is nonsingular, then let and substitute it into (8) and (13): where , .

If is nonsingular, then let and substitute it into (8) and (13): where = , , , = , .

Therefore, the process of solving can be transformed into solving linear equations in the form of equation (47), or solving linear equations in the form of equation (48), where is the number of hidden nodes.

If is nonsingular, the solution process of based on the forced positive-definite Cholesky factorization is as follows. First, calculate the modified Cholesky factorization results of matrix : where is the unit lower triangular matrix and is the positive definite diagonal matrix. is the nonzero element below the main diagonal in , and is the main diagonal element in ; they can be calculated according to Algorithm 1 by using the elements of , , . If we set , then (49a) can be written as where . Substitute equation (49b) into equation (47) and multiply both sides of the equation by : where . Solving is equivalent to solving equation (50). Since is equivalent to , by comparing the elements on both sides of the equation, the calculation formula for the element of can be obtained as where is the element of and is the element of , , . Finally, on the basis of obtaining and , the element of can be calculated by using the elements of and : where , , ; accordingly, we can obtain the matrix .

If is nonsingular, the solution process of based on the forced positive-definite Cholesky factorization is as follows. First, calculate the modified Cholesky factorization results of matrix : where is the unit lower triangular matrix and is the positive definite diagonal matrix. is the nonzero element below the main diagonal in , and is the main diagonal element in ; they can be calculated according to Algorithm 1 by using the elements of , , . If we set , then (53a) can be written as where . Substitute equation (53b) into equation (48) and multiply both sides of the equation by : where . Because solving is equivalent to solving equation (54) and is equivalent to , by comparing the elements on both sides of the equation, the calculation formula for the element of can be calculated as where is the element of and is the element of , , . Finally, on the basis of obtaining and , the element of can be calculated by using the elements of and .