Convolutional Neural Networks

    One: The Back Propagation (BP) Algorithm

    Network structure

    [Figure 1: the three-layer BP network structure]
    The classic BP network has a three-layer structure: input layer X, hidden layer Y, and output layer O.
    Input vector: X = (x1, x2, ..., xn)^T
    Hidden layer output: Y = (y1, y2, ..., ym)^T, with input-to-hidden weight matrix V = (V1, V2, ..., Vm)
    Output vector: O = (o1, o2, ..., ol)^T, with hidden-to-output weight matrix W = (W1, W2, ..., Wl)
    Expected output: D = (d1, d2, ..., dl)^T

    Learning algorithm

    Input layer to hidden layer calculation process:
    y_j = f(net_j), \quad net_j = \sum_{i=1}^{n} v_{ij} x_i, \quad j = 1, 2, \ldots, m
    where f is the sigmoid activation, f(x) = \frac{1}{1 + e^{-x}}.

    The calculation process from hidden layer to output layer:
    o_k = f(net_k), \quad net_k = \sum_{j=1}^{m} w_{jk} y_j, \quad k = 1, 2, \ldots, l

    The network output layer error function is:
    E = \frac{1}{2} \|D - O\|^2 = \frac{1}{2} \sum_{k=1}^{l} (d_k - o_k)^2
    Expanding the error function to the hidden layer is:
    E = \frac{1}{2} \sum_{k=1}^{l} \left[ d_k - f\left( \sum_{j=1}^{m} w_{jk} \, f\left( \sum_{i=1}^{n} v_{ij} x_i \right) \right) \right]^2
    The training process adjusts the weights so that E becomes as small as possible. To that end, E is differentiated with respect to each weight, and each weight is moved against its gradient:
    \Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}, \quad \Delta v_{ij} = -\eta \frac{\partial E}{\partial v_{ij}}
    η is the learning rate (a proportionality coefficient). After a series of calculations, the above formulas can be transformed into:
    \Delta w_{jk} = \eta \, (d_k - o_k) \, o_k (1 - o_k) \, y_j
    \Delta v_{ij} = \eta \left[ \sum_{k=1}^{l} (d_k - o_k) \, o_k (1 - o_k) \, w_{jk} \right] y_j (1 - y_j) \, x_i
    The weight matrices are adjusted by back-propagating the error in this way, iterating until the error is minimized.
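The update rules above can be sketched as one gradient-descent step in plain Python. This is a minimal illustration, not the exact implementation of any particular library; the shapes and the single training sample are assumptions chosen for brevity.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bp_step(x, d, V, W, eta=0.5):
    """One BP step for a 3-layer net. V[i][j]: input i -> hidden j,
    W[j][k]: hidden j -> output k. Returns the error E before the update."""
    m, l = len(V[0]), len(W[0])
    # Forward pass: y_j = f(sum_i v_ij x_i), o_k = f(sum_j w_jk y_j)
    y = [sigmoid(sum(x[i] * V[i][j] for i in range(len(x)))) for j in range(m)]
    o = [sigmoid(sum(y[j] * W[j][k] for j in range(m))) for k in range(l)]
    # Deltas from the derivation: output layer, then hidden layer
    delta_o = [(d[k] - o[k]) * o[k] * (1 - o[k]) for k in range(l)]
    delta_y = [sum(delta_o[k] * W[j][k] for k in range(l)) * y[j] * (1 - y[j])
               for j in range(m)]
    # Weight updates: Delta w_jk = eta * delta_o_k * y_j, Delta v_ij = eta * delta_y_j * x_i
    for j in range(m):
        for k in range(l):
            W[j][k] += eta * delta_o[k] * y[j]
    for i in range(len(x)):
        for j in range(m):
            V[i][j] += eta * delta_y[j] * x[i]
    return 0.5 * sum((d[k] - o[k]) ** 2 for k in range(l))

random.seed(0)
V = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]  # 2 inputs, 4 hidden
W = [[random.uniform(-0.5, 0.5)] for _ in range(4)]                    # 4 hidden, 1 output
x, d = [1.0, 0.0], [1.0]
errors = [bp_step(x, d, V, W) for _ in range(200)]
```

Repeating the step drives E toward a minimum, which is exactly the iteration described above.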

    Two: Convolutional Neural Networks

    Each layer of a BP neural network is a one-dimensional array of nodes, and adjacent layers are fully connected. If the connections between layers are no longer full but local, we obtain the simplest one-dimensional convolutional network. Extending this idea to two dimensions gives the convolutional neural network seen in most references; see Figure 2:
    [Figure 2: a. fully connected network; b. locally connected network]
    Figure 2.a: fully connected network. If the L1 layer is a 1000 × 1000 pixel image and the L2 layer has 1,000,000 hidden neurons, each connected to every pixel of L1, then there are 1000 × 1000 × 1,000,000 = 10^12 connections, that is, 10^12 weight parameters.
    Figure 2.b: locally connected network. Each node of the L2 layer is connected only to a 10 × 10 window around the corresponding position in L1, so 1,000,000 hidden neurons need only 1,000,000 × 100 = 10^8 parameters. The number of weight connections is reduced by four orders of magnitude.
    Another feature of convolutional neural networks is weight sharing. In Figure 2.b, weight sharing does not mean that all the connections drawn have the same weight; rather, connections of the same color share one weight, so every node in the second layer applies the same set of parameters (the same kernel) when convolving the previous layer.
    Each hidden neuron in Figure 2.b is connected to a 10 × 10 image region, i.e., each neuron has 10 × 10 = 100 connection weights. What if these 100 parameters are the same for every neuron? In other words, every neuron convolves the image with the same kernel. Then the mapping from L1 needs only 100 parameters. But this way, only one feature of the image is extracted. To extract different features, add more convolution kernels: with 100 different kernels, we have 100 × 100 = 10,000 parameters.
    Each convolution kernel has different parameters, meaning it extracts a different feature of the input image (for example, a different edge orientation). Convolving the image with one kernel produces a projection of one feature of the image, which we call a Feature Map.
    Note that the discussion above ignores the bias of each neuron; with the bias included, each neuron (i.e., each kernel, under weight sharing) needs one additional parameter.
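The parameter counts in the paragraphs above reduce to a few lines of arithmetic; this sketch simply tabulates them for the sizes used in Figure 2 (the numbers are from the text, the variable names are illustrative).

```python
# Parameter counts for a 1000x1000 image and 10^6 hidden neurons, as in Figure 2
pixels = 1000 * 1000
hidden = 1_000_000

full = pixels * hidden                 # fully connected: every neuron sees every pixel
local = hidden * (10 * 10)             # locally connected: each neuron sees a 10x10 window
shared_one_kernel = 10 * 10            # weight sharing: one kernel for all neurons
shared_100_kernels = 100 * (10 * 10)   # 100 kernels -> 100 feature maps
with_bias = 100 * (10 * 10 + 1)        # plus one bias per kernel

print(full, local, shared_one_kernel, shared_100_kernels, with_bias)
# 10^12, 10^8, 100, 10000, 10100 respectively
```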
    The above describes only a single-layer structure. The paper "Gradient-Based Learning Applied to Document Recognition", published by Yann LeCun et al. in 1998, proposed LeNet-5, a text recognition system based on convolutional neural networks, which was subsequently used by banks to recognize handwritten digits.

    Three: LeNet-5

    [Figure 3: the LeNet-5 architecture]
    Not counting the input, LeNet-5 has 7 layers in total, each containing trainable parameters (connection weights). The input image is 32 * 32. We need to be clear: each layer has multiple feature maps, each feature map uses a convolution filter to extract one feature of the input, and each feature map contains multiple neurons.
    C1, C3, and C5 are convolutional layers; S2 and S4 are down-sampling (subsampling) layers. By exploiting local image correlation, down-sampling reduces the amount of data to process while retaining useful information.
    The C1 layer is a convolutional layer composed of 6 feature maps. Each neuron in a feature map is connected to a 5 * 5 neighborhood of the input. The feature maps are 28 * 28, which keeps the 5 * 5 windows from falling outside the 32 * 32 input boundary. C1 has 156 trainable parameters and a total of 122,304 connections.
    The number of trainable parameters is the number of kernel weights plus one bias, multiplied by the number of feature maps: parameters = (kernel size × input maps + 1) × number of maps.
    The number of connections is the trainable parameters multiplied by the size of the feature map: connections = parameters × map width × map height.
    For C1:
    (5 * 5 + 1) * 6 = 156 parameters
    156 * (28 * 28) = 122,304 connections
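The two counting rules can be wrapped in a small helper and checked against C1. This is a sketch under the rules stated above; the function name is an assumption, not part of the original paper.

```python
def conv_layer_counts(kernel, in_maps, n_maps, out_h, out_w):
    """Trainable parameters and connections for a LeNet-style conv layer.
    Each of the n_maps feature maps has one kernel per input map plus one bias."""
    params = (kernel * kernel * in_maps + 1) * n_maps
    connections = params * out_h * out_w
    return params, connections

# C1: six 28x28 maps, 5x5 kernels over the single 32x32 input image
print(conv_layer_counts(5, 1, 6, 28, 28))  # (156, 122304), matching the text
```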
    The S2 layer is a down-sampling layer with six 14 * 14 feature maps. Each cell in a feature map is connected to the 2 * 2 neighborhood of the corresponding feature map in C1. The 4 inputs of each S2 unit are added, multiplied by a trainable coefficient, and a trainable bias is added; the result is passed through the sigmoid function. The trainable coefficient and bias control the degree of non-linearity of the sigmoid. If the coefficient is small, the operation is close to linear and subsampling is equivalent to blurring the image. If the coefficient is large, subsampling can be viewed as a noisy OR or a noisy AND operation, depending on the magnitude of the bias. The 2 * 2 receptive fields of the units do not overlap, so each feature map in S2 is half the size of the corresponding C1 map in each dimension.
    For S2:
    The size of each feature map is 1/4 of the size of the feature map in C1 (1/2 for each row and column). The S2 layer has (1 + 1) * 6 = 12 trainable parameters and 14 * 14 * (4 + 1) * 6 = 5880 connections.
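The S2 unit described above (sum a 2 * 2 window, scale by one coefficient, add one bias, squash) can be sketched directly; the coefficient and bias values here are illustrative, not trained.

```python
import math

def subsample_unit(window, coeff, bias):
    """LeNet-style subsampling: sum the 2x2 receptive field, scale by one
    trainable coefficient, add one trainable bias, pass through a sigmoid."""
    s = sum(window)  # the four inputs of the 2x2 window
    return 1.0 / (1.0 + math.exp(-(coeff * s + bias)))

# Per-map cost is one coefficient plus one bias, whatever the map size:
params = (1 + 1) * 6                 # 12 trainable parameters for S2
connections = 14 * 14 * (4 + 1) * 6  # 5880 connections
print(subsample_unit([0.2, 0.4, 0.1, 0.3], coeff=0.5, bias=-0.25), params, connections)
```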
    The C3 layer is also a convolutional layer. It convolves S2 with 5 × 5 kernels, so each feature map has 10 × 10 neurons; there are 16 different sets of kernels, hence 16 feature maps. Each feature map in C3 is connected to all 6, or to several, of the feature maps in S2, meaning each map of this layer is a different combination of the features extracted by the previous layer (this is not the only possible practice).
    Why not connect each feature map in S2 to each feature map in C3? There are two reasons for this. First, the incomplete connection mechanism keeps the number of connections within a reasonable range. Second, it destroys the symmetry of the network. Since different feature maps have different inputs, they are forced to extract different features.
    The method adopted by LeCun is: the first 6 feature maps of C3 take 3 adjacent feature maps of S2 as input; the next 6 take 4 adjacent feature maps; the next 3 take 4 non-adjacent feature maps; the last one takes all 6 feature maps of S2 as input. As shown in the figure:
    [Figure 4: the S2-to-C3 connection table]

    In this way, the C3 layer has (25 * 3 + 1) * 6 + (25 * 4 + 1) * 6 + (25 * 4 + 1) * 3 + (25 * 6 + 1) * 1 = 1516 trainable parameters and ((25 * 3 + 1) * 6 + (25 * 4 + 1) * 6 + (25 * 4 + 1) * 3 + (25 * 6 + 1) * 1) * (10 * 10) = 151,600 connections.
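The grouped C3 count can be checked with the same counting rule applied per group; the `groups` table below just encodes the four subsets listed in the text.

```python
# C3 connects its 16 maps to subsets of the 6 S2 maps:
# 6 maps see 3 S2 maps, 6 see 4, 3 see a non-adjacent 4, 1 sees all 6.
groups = [(3, 6), (4, 6), (4, 3), (6, 1)]  # (input maps, number of C3 maps)
params = sum((5 * 5 * in_maps + 1) * count for in_maps, count in groups)
connections = params * 10 * 10  # each C3 map is 10x10
print(params, connections)  # 1516, 151600
```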
    The S4 layer is a down-sampling layer composed of 16 feature maps of size 5 * 5. Each cell in a feature map is connected to the 2 * 2 neighborhood of the corresponding feature map in C3, in the same way C1 connects to S2. The S4 layer has 16 * (1 + 1) = 32 trainable parameters (one coefficient and one bias per feature map) and 5 * 5 * (4 + 1) * 16 = 2000 connections. (If the formulas are unclear at this point, you can read out of order: first all the convolutional layers, then all the down-sampling layers.)
    The C5 layer is a convolutional layer with 120 feature maps. Each unit is connected to a 5 * 5 neighborhood on all 16 feature maps of S4. Since the S4 feature maps are also 5 * 5 (the same size as the filter), each C5 feature map is 1 * 1: this amounts to a full connection between S4 and C5.
    C5 is still labeled a convolutional layer rather than a fully connected layer because, if the input of LeNet-5 were made larger with everything else unchanged, the feature maps would become larger than 1 * 1. The C5 layer has (5 * 5 * 16 + 1) * 120 = 48,120 trainable parameters. Since each C5 feature map is 1 * 1, there are 48,120 * 1 * 1 = 48,120 connections. (Yann's original article says there are 48,120 trainable connections, a slightly different term from the one used above; here we take it to mean 48,120 trainable parameters and 48,120 connections, which is consistent with our calculation.)
    The F6 layer has 84 units (the reason for this number comes from the design of the output layer) and is fully connected to C5. Like a classic neural network layer, F6 computes the dot product between its input vector and its weight vector, adds a bias, and passes the result through the sigmoid function to produce the state of unit i. It has (120 + 1) * 84 = 10,164 trainable parameters and likewise 10,164 connections.
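The counts for S4, C5, and F6 follow the same rules as the earlier layers; here is a quick tally of the numbers quoted above.

```python
# S4: 16 maps, one coefficient and one bias each
s4_params = 16 * (1 + 1)               # 32
s4_connections = 5 * 5 * (4 + 1) * 16  # 2000

# C5: 120 maps, each seeing 5x5 windows on all 16 S4 maps
c5_params = (5 * 5 * 16 + 1) * 120     # 48120
c5_connections = c5_params * 1 * 1     # output maps are 1x1, so 48120

# F6: 84 units fully connected to the 120 C5 outputs, plus biases
f6_params = (120 + 1) * 84             # 10164

print(s4_params, s4_connections, c5_params, c5_connections, f6_params)
```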
    Finally, the output layer is composed of Euclidean Radial Basis Function (RBF) units, one per class, each with 84 inputs. Each output RBF unit computes the Euclidean distance between its input vector and its parameter vector: the farther the input is from the parameter vector, the larger the RBF output. An RBF output can be understood as a penalty measuring how far the input pattern is from the model associated with that RBF. In probabilistic terms, it can be read as the negative log-likelihood of a Gaussian distribution in the configuration space of the F6 layer. Given an input pattern, the loss function should drive the configuration of F6 close enough to the RBF parameter vector of the pattern's desired class. The parameters of these units are chosen by hand and kept fixed (at least initially); their components are set to -1 or +1. Although they could be chosen at random with equal probabilities of -1 and +1, or chosen to form an error-correcting code, they were instead designed as stylized 7 * 12 (= 84) bitmap images of the corresponding character class. Such a representation is not particularly useful for recognizing isolated digits, but it is useful for recognizing strings drawn from the full printable ASCII set.
    Another reason to use this distributed code rather than the more common "1 of N" (one-hot) output code is that non-distributed codes perform poorly when the number of classes is large: their outputs must be 0 most of the time, which is difficult to achieve with sigmoid units. A further reason is that the classifier must not only recognize characters but also reject non-characters. RBFs with distributed codes are better suited to this goal, because unlike sigmoids they are activated within a well-restricted region of the input space, outside of which atypical patterns are more likely to fall.
    The RBF parameter vectors play the role of target vectors for the F6 layer. Their components are +1 or -1, which lies within the range of the F6 sigmoid and thus prevents the sigmoid from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid, which keeps the F6 units operating in their maximally non-linear range. Saturation of the sigmoid must be avoided, as it leads to slow convergence and an ill-conditioned loss function.

    Four: CNNs training process

    The CNNs training algorithm is similar to the traditional BP algorithm. It consists of 4 steps, which are divided into two phases:
    The first stage, the forward propagation stage:
    a) Take a sample (X, Yp) from the sample set and enter X into the network;
    b) Calculate the corresponding actual output Op.
    In this stage, information propagates step by step from the input layer to the output layer. This is also the computation performed when the trained network runs normally: the input is multiplied by each layer's weight matrix and passed through that layer's activation, yielding the final output:
    O_p = F_n( \ldots F_2( F_1( X_p W^{(1)} ) W^{(2)} ) \ldots W^{(n)} )
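The composed forward pass above can be sketched as a fold over the weight matrices. This is a minimal illustration assuming sigmoid activations for every F; the tiny weight matrices are arbitrary example values.

```python
import math
from functools import reduce

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(x, W):
    # x: length-n vector, W: n rows x m columns -> length-m vector
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def forward(x, weights):
    """O_p = F_n(...F_2(F_1(X_p W1) W2)...Wn): multiply by each layer's
    weight matrix, then apply that layer's activation F (sigmoid here)."""
    return reduce(lambda a, W: sigmoid_vec(matvec(a, W)), weights, x)

W1 = [[0.1, -0.2], [0.3, 0.4]]  # 2 inputs -> 2 hidden
W2 = [[0.5], [-0.5]]            # 2 hidden -> 1 output
out = forward([1.0, 2.0], [W1, W2])
print(out)  # a single sigmoid output in (0, 1)
```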
    Second stage, backward propagation stage
    a) Calculate the difference between the actual output Op and the corresponding ideal output Yp;
    b) Backpropagate the adjustment weight matrix by minimizing the error.

    Five: summary

    The CNN algorithm is now widely used in image recognition and processing. In the ImageNet 2014 large-scale visual recognition challenge, CNNs were used throughout, and the best-performing entry, with an error rate of only 6.656%, was also based on CNNs.
    Yann LeCun built LeNet in the 1990s, and today convolutional networks have become the most important technology for visual recognition. This is inseparable from his effort, and equally from his willingness to persist in this direction through the low period of neural network research.