
  • Introduction to neural networks
  • Main content
  • Network structure
  • Forward propagation of neural networks
  • Backpropagation of neural networks
  • Summary
  • References

Introduction to neural networks

Today, neural networks have grown into a very large subject domain [1]. Its contents cannot be summarized by a single "algorithm" or "framework". From the early neuron, to the perceptron, to BP neural networks, and on to deep learning, this is roughly the line of evolution. Although the models differ from era to era, the ideas of propagation, namely forward propagation of values and backward propagation of errors, are the same.

Main content

This blog mainly introduces the forward propagation and error backpropagation processes of neural networks, demonstrating both with a detailed numerical example on a network with a single hidden layer. For each step, the blogger also gives the corresponding TensorFlow [2] implementation code.

Network structure

[Figure 1: a single-hidden-layer neural network]

[equation image]

For a single neuron:

[equation image]
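For reference, with the logistic (sigmoid) activation used in the code below, a single neuron computes a weighted sum of its inputs plus a bias and then applies the activation:

$$z = \sum_i w_i x_i + b,\qquad a = \sigma(z) = \frac{1}{1 + e^{-z}}$$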

TensorFlow code:

def multilayer_perceptron(x, weights, biases):
    # hidden layer: linear transform followed by a sigmoid activation
    layer_1 = tf.add(tf.matmul(x, weights["h1"]), biases["b1"])
    layer_1 = tf.nn.sigmoid(layer_1)
    # output layer: linear transform followed by a sigmoid activation
    out_layer = tf.add(tf.matmul(layer_1, weights["out"]), biases["out"])
    out_layer = tf.nn.sigmoid(out_layer)
    return out_layer

Forward propagation of neural networks

(1) Determine the input data and the labels

X = [[1, 2], [3, 4]]
Y = [[0, 1], [0, 1]]

Clearly, batch_size = 2. The first sample is
X1 = 1
X2 = 2

(2) Initialize the weights and biases

As Figure 1 shows, there are 8 weights in total:

weights = {
    'h1': tf.Variable([[0.15, 0.16], [0.17, 0.18]], name="h1"),
    'out': tf.Variable([[0.15, 0.16], [0.17, 0.18]], name="out")
}
biases = {
    'b1': tf.Variable([0.1, 0.1], name="b1"),
    'out': tf.Variable([0.1, 0.1], name="out")
}

(3) A forward propagation example

Taking the first hidden neuron as an example, we have

[equation image]

and

[equation image]

Finally, the output of the hidden layer is

M = [[0.64336514, 0.65021855]]
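Written out with the first sample (X1 = 1, X2 = 2), the weights, and the bias 0.1, the numbers check out as:

$$net_{h1} = 0.15 \times 1 + 0.17 \times 2 + 0.1 = 0.59,\qquad out_{h1} = \sigma(0.59) \approx 0.64337$$

$$net_{h2} = 0.16 \times 1 + 0.18 \times 2 + 0.1 = 0.62,\qquad out_{h2} = \sigma(0.62) \approx 0.65022$$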

TensorFlow code for forward propagation:

import tensorflow as tf
import numpy as np

x = [1, 2]
weights = [[0.15, 0.16], [0.17, 0.18]]
b = [0.1, 0.1]

X = tf.placeholder("float", [None, 2])
W = tf.placeholder("float", [2, 2])
bias = tf.placeholder("float", [None, 2])
# linear part: X * W + bias
mid_value = tf.add(tf.matmul(X, W), bias)
# sigmoid activation on top of the linear part
result = tf.nn.sigmoid(mid_value)

with tf.Session() as sess:
    x = np.array(x).reshape(1, 2)
    b = np.array(b).reshape(1, 2)
    result, mid_value = sess.run([result, mid_value], feed_dict={X: x, W: weights, bias: b})
    print(mid_value)
    print(result)

In the same way, we obtain the prediction of the output layer:

pred = [[0.57616305, 0.57931882]]
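Spelled out with the hidden-layer outputs and the output-layer weights above:

$$net_{o1} = 0.15 \times 0.64337 + 0.17 \times 0.65022 + 0.1 \approx 0.30704,\qquad out_{o1} = \sigma(0.30704) \approx 0.57616$$

$$net_{o2} = 0.16 \times 0.64337 + 0.18 \times 0.65022 + 0.1 \approx 0.31998,\qquad out_{o2} = \sigma(0.31998) \approx 0.57932$$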

(4) Error calculation

Many error functions are available; this example uses the mean squared error (MSE):

[equation image]

Note the difference between the mean squared error and the sum-squared error:

[equation image]

The error computed with the mean squared error is:

[equation image]
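Plugging in the prediction above and the labels of the first sample, Y = [0, 1], this gives the value quoted again in the summary:

$$E = \frac{1}{2}\Big[(0 - 0.57616)^2 + (1 - 0.57932)^2\Big] \approx 0.254468$$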

At this point, the forward propagation process is complete.

Backpropagation of neural networks

Given the actual output and the expected output, we know the error, and we need to adjust the parameters of the neural network according to that error. The BP (backpropagation) algorithm is the core of neural network training; almost all neural network models are optimized with this algorithm or an improved variant of it. The BP algorithm is based on the gradient descent strategy, which adjusts the parameters in the negative direction of the gradient of the error.

Given a learning rate of 0.5, each update is

[equation image]
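In other words, every parameter θ (weight or bias) is moved against its gradient with step size η = 0.5:

$$\theta \leftarrow \theta - \eta\,\frac{\partial E}{\partial \theta},\qquad \eta = 0.5$$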

(1) First, update the weights of the output layer

By the chain rule,

[equation image]

The first term is the derivative of the mean squared error function:

[equation image]

The second term is the derivative of the activation function:

[equation image]

The third term:

[equation image]

Therefore,

[equation image]

Update:

[equation image]

In other words,

[equation image]
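As a rough numeric check for the weight connecting h1 to o1 (initialized to 0.15), the three factors above evaluate approximately to the following (assuming the 1/2-scaled squared error, so the 2 cancels in the first factor; small rounding differences are expected):

$$\frac{\partial E}{\partial w_{h1 \to o1}} = (out_{o1} - y_1)\cdot out_{o1}(1 - out_{o1})\cdot out_{h1} \approx 0.57616 \times 0.24420 \times 0.64337 \approx 0.0905$$

$$w_{h1 \to o1}^{new} = 0.15 - 0.5 \times 0.0905 \approx 0.1047$$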

(2) Update the bias of the output layer

By the chain rule,

[equation image]

Update:

[equation image]

Similarly,

[equation image]
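For the first output bias, the last chain-rule factor ∂net_{o1}/∂b is 1, so approximately:

$$\frac{\partial E}{\partial b_{o1}} = (out_{o1} - y_1)\cdot out_{o1}(1 - out_{o1}) \approx 0.1407,\qquad b_{o1}^{new} = 0.1 - 0.5 \times 0.1407 \approx 0.0296$$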

(3) Next, update the weights of the hidden layer

[Figure 2: error flow feedback diagram]

Thus,

[equation image]

For the total error,

[equation image]

we have

[equation image]

First, find cost1:

[equation image]

(Some of these terms have already been computed above, so we can use them directly.)

Then find cost2:

[equation image]

In total:

[equation image]

Next, calculate the second term:

[equation image]

And the third term:

[equation image]

Combining them:

[equation image]

Update:

[equation image]

Similarly, for the other hidden-layer weights:

[equation image]

[equation image]

[equation image]
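As a rough check for the weight connecting X1 to h1 (initialized to 0.15), write $\delta_{ok} = (out_{ok} - y_k)\, out_{ok}(1 - out_{ok})$, so that $\delta_{o1} \approx 0.1407$ and $\delta_{o2} \approx -0.1025$; the numbers then work out approximately to:

$$\frac{\partial E}{\partial w_{x1 \to h1}} = \big(\delta_{o1} w_{h1 \to o1} + \delta_{o2} w_{h1 \to o2}\big)\cdot out_{h1}(1 - out_{h1})\cdot x_1 \approx (0.1407 \times 0.15 - 0.1025 \times 0.16) \times 0.2294 \times 1 \approx 0.0011$$

$$w_{x1 \to h1}^{new} = 0.15 - 0.5 \times 0.0011 \approx 0.1495$$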

(4) Update the biases of the hidden layer

Again, by the chain rule,

[equation image]

Update:

[equation image]

Similarly,

[equation image]
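For the first hidden bias, the last factor is again 1, so approximately:

$$\frac{\partial E}{\partial b_{h1}} = \big(\delta_{o1} w_{h1 \to o1} + \delta_{o2} w_{h1 \to o2}\big)\cdot out_{h1}(1 - out_{h1}) \approx 0.0011,\qquad b_{h1}^{new} = 0.1 - 0.5 \times 0.0011 \approx 0.0995$$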

Summary

(1) At this point, all of the parameters have been updated. Then, for the next batch [3, 4], forward propagation with the new parameters gives an error of

Loss = 0.238827

This is smaller than the initial 0.254468, which also illustrates the effectiveness of gradient descent.

(2) The results of my hand calculation and of the code differ by about 0.01; I suspect the discrepancy comes from rounding in one of my numerical steps.

If you find the mistake, please let me know, thank you.

(3) Code details

The optimizer used in TensorFlow:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

Cost definition

cost = tf.reduce_mean(tf.pow(pred - y, 2))
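Putting the prediction, cost, and optimizer together, a minimal training loop might look like the sketch below (TensorFlow 1.x style). The placeholder names x and y and the two single-sample batches are assumptions for illustration; also note that the hand calculation above uses plain gradient descent with a learning rate of 0.5, whereas this sketch uses the Adam optimizer shown above, so the resulting loss values will differ slightly.

# Minimal sketch: assumes multilayer_perceptron, weights and biases as defined earlier.
x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 2])

pred = multilayer_perceptron(x, weights, biases)
cost = tf.reduce_mean(tf.pow(pred - y, 2))
optimizer = tf.train.AdamOptimizer(learning_rate=0.5).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # feed the two samples one at a time, mirroring the walkthrough above
    for batch_x, batch_y in [([[1, 2]], [[0, 1]]), ([[3, 4]], [[0, 1]])]:
        _, loss = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
        print(loss)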

(4) The vanishing gradient problem

The vanishing gradient problem means that, during training, the gradients become very small or even zero, so the parameter updates become extremely slow. Vanishing gradients are closely related to the activation function; we usually use ReLU or ReLU variants (such as PReLU [3]) to reduce the effect. In our example we used the logistic function, whose derivative is always less than 1 (at most 1/4). From the calculations above we can see that a problem appears during backpropagation:

[equation image]

Because each factor's derivative is less than 1, the propagated error becomes smaller and smaller as it flows backward through the layers. One way to alleviate the problem is therefore to use ReLU (whose derivative is 1 for positive inputs) instead of logistic-type functions.
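To see why, recall that the derivative of the logistic (sigmoid) function is bounded:

$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big) \le \frac{1}{4}$$

During backpropagation, the gradient reaching an early layer contains one such factor for every layer it passes through, so with many layers the product of these small factors shrinks rapidly toward zero.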



