Advanced Control Techniques—Assignment

Introduction

Artificial neural networks (ANNs) are computational models that are loosely inspired by their biological counterparts. Artificial neurons, the elementary units of an ANN, appear to have been first introduced by McCulloch and Pitts in 1943 to model biological neurons using algorithms and mathematics. An ANN is based on a set of connected artificial neurons. Each connection, like a synapse in a biological brain, can transmit a signal from one artificial neuron to another. Since the early work of Hebb, who proposed the cell assembly theory in an attempt to explain synaptic plasticity, considerable effort has been devoted to ANNs. Although the perceptron was created by Rosenblatt for pattern recognition by training an adaptive McCulloch-Pitts neuron model, it is incapable of processing the exclusive-or circuit. A key trigger for renewed interest in neural networks and learning was Werbos's back-propagation algorithm, which makes the training of multi-layer networks feasible and efficient.

Methods

The Multi-Layer Perceptron (MLP) is a feedforward neural network and has been among the most widely used neural network architectures. The error back-propagation (BP) algorithm is implemented as the training method for MLPs.

Multi-Layered Perceptron

An MLP is a fully connected feedforward neural network that uses a supervised learning method. It consists of an input layer, an output layer, and at least one hidden layer. The fundamental structure of an MLP is shown schematically in Fig. 1.

In the input layer, the neurons simply pass the information through, i.e., the activation functions of this layer are identity functions. Each neuron of the input layer is fully connected to the first hidden layer, so each of these neurons has multiple outgoing connections. The connection between input-layer neuron $i$ and hidden-layer neuron $j$ is assigned a weight $w^{(1)}_{ji}$. A bias term is added to the total weighted sum of inputs and serves as a threshold that shifts the activation function. The propagation function, which computes the hidden-layer input to neuron $j$ from the outputs of its predecessor neurons, has the form
\[
s^{(1)}_j = \sum_i w^{(1)}_{ji}\, x^{(0)}_i + \theta^{(1)}_j,
\]
where $\theta^{(1)}_j$ denotes the bias term.

The activation functions, which define the output of a node, were chosen as the hyperbolic tangent and the softsign in this case. Their equations, plots, and derivatives are given in TABLE 1.
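For reference, the following minimal MATLAB sketch evaluates both activation functions and their derivatives using the standard definitions $\tanh(s)$ and $s/(1+|s|)$; the variable names here are illustrative and not part of the report's code:

\begin{lstlisting}
% Hyperbolic tangent and its derivative
f_tanh  = @(s) tanh(s);
df_tanh = @(s) 1 - tanh(s).^2;

% Softsign and its derivative
f_soft  = @(s) s./(1 + abs(s));
df_soft = @(s) 1./(1 + abs(s)).^2;

% Evaluate on a few sample points
s = linspace(-4, 4, 9);
disp([s; f_tanh(s); df_tanh(s); f_soft(s); df_soft(s)]);
\end{lstlisting}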

The MLP scheme essentially has three features:

  • There are no connections between neurons within the same layer, and neurons have no connections with themselves.
  • Full connections exist only between adjacent layers.
  • It consists of two aspects: the feed-forward transmission of information and the feed-back transmission of error (see the forward-pass sketch after this list).
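To make the fully connected feed-forward computation concrete, the following MATLAB sketch (a minimal illustration with an arbitrarily chosen hidden-layer size and input, not taken from the experiments below) propagates one scalar input through a single tanh hidden layer and a linear output neuron:

\begin{lstlisting}
% Minimal feed-forward pass of a 1-n-1 MLP (illustrative sizes and values)
n  = 5;                     % number of hidden neurons
w1 = 0.1*randn(1, n);       % input-to-hidden weights
b1 = 0.1*randn(1, n);       % hidden-layer biases
w2 = 0.1*randn(1, n);       % hidden-to-output weights
b2 = 0.1*randn(1, 1);       % output bias

x  = 0.5;                   % scalar input (identity activation in the input layer)
s1 = w1*x + b1;             % propagation function of the hidden layer
a1 = tanh(s1);              % hidden-layer outputs
y  = sum(w2.*a1) + b2;      % network output
\end{lstlisting}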

Error Back Propagation

Back propagation is a method used in ANNs to calculate the gradient needed to adjust the weights between layers, and it is commonly used together with the gradient descent optimization algorithm. Calculating the adjustments of the weights in Fig. 1 is chosen as an example to introduce the BP algorithm. We denote a parameter associated with the $l$-th layer by the superscript $(l)$.

For calculating the weights between the hidden layer and the output layer, the loss function for training sample $(x_q, y_q)$ is defined as
\[
E_q = \frac{1}{2}\left(\widehat{y} - y^{(2)}\right)^2,
\]
where $\widehat{y}$ is the expected output and $y^{(2)}$ is the actual output. Based on the gradient descent algorithm with learning rate $\mu$, which controls how much the algorithm adjusts the weights of the network with respect to the loss gradient, and the activation-function input $s^{(2)}_j = \sum_i w^{(2)}_{ji} x^{(1)}_i + \theta^{(2)}_j$, we have
\[
w^{(2)}_{ji}[k+1] = w^{(2)}_{ji}[k] - \mu \frac{\partial E_q}{\partial w^{(2)}_{ji}},
\]
with
\[
\frac{\partial E_q}{\partial w^{(2)}_{ji}}
= \frac{\partial E_q}{\partial s^{(2)}_j}\,\frac{\partial s^{(2)}_j}{\partial w^{(2)}_{ji}}
= -\delta^{(2)}_j\, x^{(1)}_i,
\]
and we have
\[
\delta^{(2)}_j = \left(\widehat{y}_j - y^{(2)}_j\right) f'\!\left(s^{(2)}_j\right),
\]
where $\delta^{(2)}_j$ denotes the error of the output layer; thus, the adjustment of $w$ shall be
\[
w^{(2)}_{ji}[k+1] = w^{(2)}_{ji}[k] + \mu\,\delta^{(2)}_j\, x^{(1)}_i.
\]

Consider calculating the weights between the input layer and the hidden layer:
\[
w^{(1)}_{ji}[k+1] = w^{(1)}_{ji}[k] - \mu \frac{\partial E_q}{\partial w^{(1)}_{ji}},
\]
with
\[
\frac{\partial E_q}{\partial w^{(1)}_{ji}}
= \frac{\partial E_q}{\partial s^{(1)}_j}\,\frac{\partial s^{(1)}_j}{\partial w^{(1)}_{ji}}
= -\delta^{(1)}_j\, x^{(0)}_i,
\]
and
\[
\delta^{(1)}_j = f'\!\left(s^{(1)}_j\right)\sum_{k=1}^{n_{2}}\delta^{(2)}_k\, w^{(2)}_{kj},
\]
where $n_{2}$ is the number of neurons in the output layer; thus
\[
w^{(1)}_{ji}[k+1] = w^{(1)}_{ji}[k] + \mu\,\delta^{(1)}_j\, x^{(0)}_i.
\]

In summary, the equation describing the adjustment of the weights is
\[
w^{(l)}_{ji}[k+1] = w^{(l)}_{ji}[k] + \mu\,\delta^{(l)}_j\, x^{(l-1)}_i,
\]
where $\delta^{(l)}_j$ for the output layer is
\[
\delta^{(l)}_j = \left(\widehat{y}_j - y^{(l)}_j\right) f'\!\left(s^{(l)}_j\right),
\]
and for the other layers is
\[
\delta^{(l)}_j = f'\!\left(s^{(l)}_j\right)\sum_{k=1}^{n_{l+1}}\delta^{(l+1)}_k\, w^{(l+1)}_{kj},
\]
respectively.
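As a minimal illustration of these update equations, the following MATLAB sketch performs one back-propagation step for a 1-hidden-layer network with a linear output neuron (so $f'(s)=1$ at the output); the hidden size, learning rate, and training sample are illustrative assumptions, not values taken from the experiments:

\begin{lstlisting}
% One illustrative gradient-descent step of error back propagation
n  = 5;  mu = 0.005;                 % hidden size and learning rate (illustrative)
w1 = 0.1*randn(1, n);  b1 = 0.1*randn(1, n);
w2 = 0.1*randn(1, n);  b2 = 0.1*randn(1, 1);
xq = 0.5;  yq_hat = sin(xq);         % one training sample (x_q, y_q)

s1 = w1*xq + b1;  a1 = tanh(s1);     % forward pass through the hidden layer
y  = sum(w2.*a1) + b2;               % linear output neuron

e       = yq_hat - y;                % output error
delta_2 = e;                         % output-layer delta (f'(s) = 1 for linear output)
delta_1 = (1 - a1.^2).*(delta_2*w2); % hidden-layer delta via tanh'(s1) = 1 - a1.^2

w2 = w2 + mu*delta_2.*a1;            % adjust weights and biases
b2 = b2 + mu*delta_2;
w1 = w1 + mu*delta_1*xq;
b1 = b1 + mu*delta_1;
\end{lstlisting}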

The process of training an MLP with the BP algorithm is shown in Algorithm 1.

\begin{algorithm}
\caption{Error Back Propagation Algorithm}
\begin{algorithmic}[1]
\STATE SET initial weights $w^{l}_{ji}$, bias $\theta^{l}_{j}$;
\STATE SET training sample $(x_q, y_q)$;
\FOR{each $iter \in [1, maxEpoch]$}
\STATE $x^{(l)} = f(s^{(l)}) = f(W^{(l)}x^{(l-1)})$
\STATE $\delta^{(l)}_j = (\widehat{y} - y^{(l)})f'(s^{(l)}_j)$ for output layer
\STATE $\delta^{(l)}_j = f'(s^{(l)}_j)\sum_{k = 1}^{n_{l+1}}\delta^{(l+1)}_k w^{(l+1)}_{kj}$ for other layers
\STATE $w^{(l)}_{ji}[k + 1] = w^{(l)}_{ji}[k] + \mu \delta_j^{(l)}x_i^{(l-1)}$
\STATE $\theta^{(l)}_j[k + 1] = \theta^{(l)}_j[k] + \mu \delta_j^{(l)}$
\STATE Calculate $sumLoss = sumLoss + E_q$
\IF{$sumLoss/iter < setLossTolerant$}
\STATE \textbf{break};
\ENDIF
\ENDFOR
\label{code:ag1}
\end{algorithmic}
\end{algorithm}

Results

In this section, we design different model structures to address three curve-fitting problems.

Network Design

We designed two different structures, which are shown in Fig. 2.

The structure with one hidden layer was applied to fitting $y=\sin(x)$ and $z=\frac{\sin(x)}{x}\cdot\frac{\sin(y)}{y}$ (with 600 and 500 hidden nodes, respectively). For $y=|\sin(x)|$, we chose the structure with two hidden layers to reach a better regression performance.

We trained the network with 9 samples, using $E_q=\frac{1}{2}(\widehat{y}_q-y_q)^2$ as the loss function. The training process stops when the average loss drops below 0.002. The results with 361 test samples are shown in Fig. 3:

The runtime loss $E$ and the average loss are shown in Fig. 4, where the average loss evaluated on the test samples is .

We notice that the loss drops drastically within the first  epochs; after that, decreasing the average loss from  to  requires more than  epochs.

The absolute error is shown in Fig. 5:

Apparently, the maximum error between the actual value and the predicted value is less than 0.05, which shows the excellent performance of the network when fitting $y=\sin(x)$.

When it comes to a more complex nonlinear curve, $y=|\sin(x)|$, the fitting error of the elementary MLP structure, which has only one hidden layer, is far from satisfactory. A hidden layer with the softsign activation function was therefore added to the basic MLP structure. By training this structure with 9 samples, we arrive at the following results:

Fig. 7 shows how the loss decreases over the training epochs:

From the figure above, we can see that although the runtime loss chatters within 500,000 epochs, the average loss descends at a smooth, nearly constant rate. Although the training loss, which is less than  after  epochs, yields guaranteed performance on the training samples, as observed in Fig. 6 the test error (more than 0.2) is unacceptable for some test samples. We assumed that the lack of training samples had caused the network to overfit, so we expanded the training set to 29 independent $(x, y)$ samples to reach better performance.

When the average loss is less than 0.005, we obtain the following result, as shown in Fig. 8:

Apparently, $y=|\sin(x)|$ has a non-differentiable point at $x=\pi$. From Fig. 9, this non-differentiable point has the maximal error compared with the other points, i.e., providing more training samples cannot strengthen the nonlinear fitting ability of a given MLP model.

Still, adding training samples improves the prediction performance on the test set. From Fig. 9, although one test point has an error of more than 0.1, the absolute errors of the other test points are less than 0.02, i.e., this approach provides a more robust solution than 9 training samples with the same loss tolerance. The training loss of this approach is shown in Fig. 10.

In this section, let us consider a function with two inputs, $z=\frac{\sin(x)}{x}\cdot\frac{\sin(y)}{y}$. A $20\times 20$ grid of samples is given for training with the basic MLP structure. The hidden layer has 500 nodes, and training stops when the average loss is less than 0.002. Fig. 11 shows the actual surface and the predicted surface, respectively.

We plot the runtime loss $E$ and the average loss against the training epochs in Fig. 12, from which we observe that the average loss decreases rapidly from 1.1 to 0.1 within 1000 epochs. Thus, the training process shows the efficiency and robustness of the back-propagation algorithm.

The test error is shown in Fig. 13, from which we observe that the maximal error is less than 0.15.

Conclusions

In this report, we implemented a multilayer perceptron to fit three specific nonlinear functions. The simulation results illustrate the accuracy of artificial neural networks and the efficiency of the back-propagation algorithm.

Source Code

\begin{lstlisting}
%% For y = sin(x)
clear
sample_num = 9; N1 = 600; Lr = 0.005;
train_x = zeros(1,sample_num);
for i = 1:sample_num
train_x(i) = (i-1)*2*pi/(sample_num-1);
end
train_y = sin(train_x);

w1 = 0.1*randn(1,N1);
b1 = 0.1*randn(1,N1);
b2 = 0.1*randn(1,1);
w2 = 0.1*randn(1,N1);
sum_e = 0; er = []; i = 0; ei = [];
while(1)
i = i + 1;
j = unidrnd(sample_num);                    % pick one training sample at random
in = train_x(j);
L1 = w1*in+b1;                              % hidden-layer input
L2 = tanh(L1);                              % hidden-layer output
L3 = sum(w2.*L2)+b2;                        % network output (linear output neuron)
e = train_y(j)-L3;                          % output error
sum_e = sum_e + 0.5*e^2;                    % accumulate loss E_q = 0.5*e^2
er = [er sum_e/i];
ei = [ei e];
sum_e/i                                     % display running average loss
if sum_e/i < 0.002                          % stop when average loss is small enough
break
end
delta_2 = e;                                % output-layer delta
delta_1 = (1-tanh(L1).^2).*(delta_2*w2);    % hidden-layer delta
w2 = w2+Lr*delta_2.*L2;                     % adjust weights and biases
w1 = w1+Lr*delta_1*in;
b1 = b1+Lr*delta_1;
b2 = b2+Lr*e;
end

test_x = zeros(1,1000);
test_y = [];
true_y = [];
for i = 1:1000
test_x(i) = (i-1)*2*pi/(1000-1);
end
for i = test_x
test_y = [test_y sum(w2.*tanh(w1*i+b1))+b2];
true_y = [true_y sin(i)];
end
plot(test_x, test_y, test_x, true_y);
\end{lstlisting}
\begin{lstlisting}
%% For y = abs(sin(x))
clear
sample_num = 29; N1 = 500; Lr = 0.002;
train_x = zeros(1,sample_num);
for i = 1:sample_num
train_x(i) = (i-1)*2*pi/(sample_num-1);
end
train_y = abs(sin(train_x));

w1 = 0.1*randn(1,N1);
b1 = 0.1*randn(1,N1);
b2 = 0.1*randn(1,N1);
b3 = 0.1*randn(1,1);
w2 = 0.1*randn(N1,N1);
w3 = 0.1*randn(1,N1);
L3 = zeros(1,N1);
delta_1 = zeros(1,N1);
sum_e = 0; er = []; ei = []; i = 0;
while(1)
i = i + 1;
s = unidrnd(sample_num);
in = train_x(s);
L1 = w1*in+b1;                       % first hidden layer input
L2 = softsign(L1);                   % first hidden layer output
for j = 1:N1
L3(j) = sum(w2(j,:).*L2)+b2(j);      % second hidden layer input
end
L4 = tanh(L3);                       % second hidden layer output
L5 = sum(w3.*L4)+b3;                 % network output (linear output neuron)
e = train_y(s)-L5;
sum_e = sum_e + 0.5*e^2;
sum_e/i
if sum_e/i < 0.005
break
end
er = [er sum_e/i];
ei = [ei e];
delta_3 = e;                               % output-layer delta (linear output)
delta_2 = (1-L4.^2).*(delta_3*w3);         % tanh'(L3) = 1-tanh(L3).^2 = 1-L4.^2
temp = dsoftsign(L1);
delta_1 = temp.*(delta_2*w2);              % delta_1(j) = f'(L1(j))*sum_k delta_2(k)*w2(k,j)
w3 = w3+Lr*delta_3.*L4;
w2 = w2+Lr*(delta_2.'*L2);                 % outer product: delta_2(j)*L2(i)
w1 = w1+Lr*delta_1*in;
b1 = b1+Lr*delta_1;
b2 = b2+Lr*delta_2;
b3 = b3+Lr*delta_3;
end
test_x = zeros(1,1000); t = zeros(1,N1);
test_y = []; true_y = [];
for i = 1:1000
test_x(i) = (i-1)*2*pi/(1000-1);
end
for i = test_x
var = softsign(w1*i+b1);
for j = 1:N1
t(j) = sum(w2(j,:).*var)+b2(j);
end
test_y = [test_y sum(w3.*tanh(t))+b3];
true_y = [true_y abs(sin(i))];
end
plot(test_x, test_y, test_x, true_y);
\end{lstlisting}
\begin{lstlisting}
%% For z = sin(x)/x*sin(y)/y
clear
sample_num = 20; N1 = 500; Lr = 0.006;
x1 = linspace(-10,10,sample_num);
x2 = linspace(-10,10,sample_num);
y = zeros(sample_num,sample_num);
for i = 1:sample_num
for j = 1:sample_num
y(i,j) = value(x1(i),x2(j));
end
end
w1 = 0.1*randn(2,N1);
b1 = 0.1*randn(1,N1);
b2 = 0.1*randn(1,1);
w2 = 0.1*randn(1,N1);
er = []; sum_e = 0; i = 0; ei = [];

while(1)
i = i + 1;
batch1 = unidrnd(sample_num);
batch2 = unidrnd(sample_num);
in1 = x1(batch1);
in2 = x2(batch2);
L1 = w1(1,:)*in1+w1(2,:)*in2+b1;
L2 = tanh(L1);
L3 = sum(w2.*L2)+b2;
e = y(batch1, batch2)-L3;
sum_e = sum_e + 0.5*e^2;
sum_e/i
er = [er sum_e/i];
ei = [ei e];
if sum_e/i < 0.002
break
end
delta_2 = e;
delta_1 = (1-tanh(L1).^2).*(delta_2*w2);
w2 = w2+Lr*delta_2.*L2;
w1(1,:) = w1(1,:)+Lr*delta_1*in1;
w1(2,:) = w1(2,:)+Lr*delta_1*in2;
b1 = b1+Lr*delta_1;
b2 = b2+Lr*e;
end
[X,Y] = meshgrid(-10:0.3:10);
Z_ = (sin(X)./X).*(sin(Y)./Y);
Z = zeros(length(X));
for i = 1:length(X)
for j = 1:length(X)
temp = tanh(w1(1, :)*X(i, j)...
+w1(2,:)*Y(i, j)+b1);
Z(i, j) = sum(w2.*temp)+b2;
end
end
mesh(X,Y,Z)
hold on
mesh(X,Y,Z_)
\end{lstlisting}
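The script above calls a helper function value(x1, x2) that is not included in the listings. A minimal sketch consistent with the target function $z=\frac{\sin(x)}{x}\cdot\frac{\sin(y)}{y}$ could be the following; its exact definition is an assumption, and the limit value 1 is substituted at zero to avoid division by zero:

\begin{lstlisting}
function z = value(x, y)
% Assumed helper: z = (sin(x)/x)*(sin(y)/y), with the limit value 1 at 0
if x == 0
    sx = 1;
else
    sx = sin(x)/x;
end
if y == 0
    sy = 1;
else
    sy = sin(y)/y;
end
z = sx*sy;
end
\end{lstlisting}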
\begin{lstlisting}
function y = softsign(x)
% Softsign activation: x/(1+|x|)
y = x./(1+abs(x));
end
\end{lstlisting}
\begin{lstlisting}
function y = dsoftsign(x)
% Derivative of softsign: d/dx [x/(1+|x|)] = 1/(1+|x|)^2
y = 1./(1+abs(x)).^2;
end
\end{lstlisting}