
Why must a non-linear activation function be used in a backpropagation neural network?

hot-time 2020. 7. 13. 07:58



I have been reading some things on neural networks and I understand the general principle of a single-layer neural network. I understand the need for additional layers, but why are non-linear activation functions used?

This question is related to the following one: what is the derivative of the activation function used for in backpropagation?


The purpose of the activation function is to introduce non-linearity into the network.

In turn, this allows you to model a response variable (a.k.a. target variable, class label, or score) that varies non-linearly with its explanatory variables.

Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as an output that renders to a straight line; the word for that is affine).

Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it has, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function (see the definition above).

>>> import numpy as NP
>>> in_vec = NP.random.rand(10)
>>> in_vec
  array([ 0.94,  0.61,  0.65,  0.  ,  0.77,  0.99,  0.35,  0.81,  0.46,  0.59])

>>> # common activation function, hyperbolic tangent
>>> out_vec = NP.tanh(in_vec)
>>> out_vec
 array([ 0.74,  0.54,  0.57,  0.  ,  0.65,  0.76,  0.34,  0.67,  0.43,  0.53])

A common activation function used in backprop (hyperbolic tangent), evaluated from -2 to 2:

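A minimal sketch to reproduce that curve, assuming NumPy and matplotlib are available:

import numpy as NP
import matplotlib.pyplot as plt

x = NP.linspace(-2, 2, 200)        # evaluation points from -2 to 2
plt.plot(x, NP.tanh(x))            # tanh saturates towards -1 and +1
plt.title("tanh activation evaluated from -2 to 2")
plt.show()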


A linear activation function can be used, but only on very limited occasions. In fact, to understand activation functions better it is important to look at ordinary least squares, or simply linear regression. Linear regression aims at finding the optimal weights that, when combined with the input, result in a minimal vertical error between the fitted line and the target variable. In short, if the expected output reflects the linear regression shown in the first figure below, then a linear activation function can be used (top figure). But, as in the second figure, a linear function will not produce the desired results (middle figure), whereas a non-linear function, as shown in the third figure, would (bottom figure).

[figures: a linear fit that matches linear data; a linear fit failing on non-linear data; a non-linear fit capturing the same data]
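As a rough sketch of that contrast (not the original figures, just an illustrative NumPy example): a straight-line least-squares fit matches linear data almost perfectly but leaves large residuals on quadratic data.

import numpy as NP

x = NP.linspace(-3, 3, 50)

# Case 1: the target really is linear in x, so a linear fit is enough
y_lin = 2.0 * x + 1.0
slope, intercept = NP.polyfit(x, y_lin, 1)
print(NP.abs(y_lin - (slope * x + intercept)).max())   # ~0, essentially a perfect fit

# Case 2: the target is non-linear (quadratic), so the best straight line is a poor model
y_quad = x ** 2
slope, intercept = NP.polyfit(x, y_quad, 1)
print(NP.abs(y_quad - (slope * x + intercept)).max())  # large residuals remain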

Activation functions cannot be linear because a neural network with a linear activation function is effectively only one layer deep, regardless of how complex its architecture is. Input to a network is usually a linear transformation (input * weight), but the real world and its problems are non-linear. To make the incoming data non-linear, we use a non-linear mapping called an activation function. An activation function is a decision-making function that determines the presence of a particular neural feature. It is mapped between 0 and 1, where zero means absence of the feature and one means its presence.

Unfortunately, if the activation could only take the values 0 or 1, small changes in the weights would not be reflected in the activation values, so the non-linear function should also be continuous and differentiable over this range. A neural network must be able to take any input from -infinity to +infinity, but it should map it to an output that ranges between {0,1}, or in some cases between {-1,1} - hence the need for an activation function. Non-linearity is needed in activation functions because their aim in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weights and inputs.
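A minimal sketch of that squashing behaviour, assuming plain NumPy: arbitrarily large inputs are mapped into (0, 1) by a sigmoid and into (-1, 1) by tanh.

import numpy as NP

def sigmoid(z):
    # maps any real input into the open interval (0, 1)
    return 1.0 / (1.0 + NP.exp(-z))

z = NP.array([-50.0, -5.0, 0.0, 5.0, 50.0])
print(sigmoid(z))    # [~0, 0.0067, 0.5, 0.9933, ~1]
print(NP.tanh(z))    # [-1, -0.9999, 0, 0.9999, 1]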


If we only allow linear activation functions in a neural network, the output will just be a linear transformation of the input, which is not enough to form a universal function approximator. Such a network can just be represented as a matrix multiplication, and you would not be able to obtain very interesting behaviors from such a network.

The same thing goes for the case where all neurons have affine activation functions (i.e. an activation function of the form f(x) = a*x + c, where a and c are constants, which is a generalization of linear activation functions), which will just result in an affine transformation from input to output; that is not very exciting either.
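A quick numeric check of that claim, as a hypothetical scalar example: composing two affine functions gives back another affine function.

# two affine "activations" f1 and f2
a1, c1 = 2.0, 3.0
a2, c2 = -0.5, 1.0
f1 = lambda x: a1 * x + c1
f2 = lambda x: a2 * x + c2

# their composition is again affine: f2(f1(x)) = (a2*a1)*x + (a2*c1 + c2)
a, c = a2 * a1, a2 * c1 + c2
for x in (-1.0, 0.0, 2.5):
    assert abs(f2(f1(x)) - (a * x + c)) < 1e-12
print("composition of affine maps is affine: a =", a, ", c =", c)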

A neural network may very well contain neurons with linear activation functions, such as in the output layer, but these require the company of neurons with a non-linear activation function in other parts of the network.

Note: An interesting exception is DeepMind's synthetic gradients, for which they use a small neural network to predict the gradient in the backpropagation pass given the activation values, and they find that they can get away with using a neural network with no hidden layers and with only linear activations.


"The present paper makes use of the Stone-Weierstrass Theorem and the cosine squasher of Gallant and White to establish that standard multilayer feedforward network architectures using abritrary squashing functions can approximate virtually any function of interest to any desired degree of accuracy, provided sufficently many hidden units are available." (Hornik et al., 1989, Neural Networks)

A squashing function is for example a nonlinear activation function that maps to [0,1] like the sigmoid activation function.


There are times when a purely linear network can give useful results. Say we have a network of three layers with shapes (3,2,3). By limiting the middle layer to only two dimensions, we get a result that is the "plane of best fit" in the original three dimensional space.

But there are easier ways to find linear transformations of this form, such as NMF, PCA etc. However, this is a case where a multi-layered network does NOT behave the same way as a single layer perceptron.
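To make the (3,2,3) idea concrete, here is a rough sketch of the "plane of best fit" via PCA/SVD in plain NumPy, rather than an actually trained linear network:

import numpy as NP

NP.random.seed(0)
basis = NP.random.rand(2, 3)                          # a random 2-D subspace of 3-D space
coords = NP.random.randn(100, 2)
X = coords @ basis + 0.05 * NP.random.randn(100, 3)   # points near a plane, plus noise

Xc = X - X.mean(axis=0)                               # centre the data
U, S, Vt = NP.linalg.svd(Xc, full_matrices=False)
plane = Vt[:2]                                        # shape (2, 3): a 3 -> 2 linear bottleneck

# project onto the plane of best fit and reconstruct in 3-D
X_rec = Xc @ plane.T @ plane + X.mean(axis=0)
print(NP.abs(X - X_rec).mean())                       # small: only the off-plane noise is lost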


A feed-forward neural network with linear activation and any number of hidden layers is equivalent to just a linear neural network with no hidden layer. For example, let's consider the neural network in the figure below, with two hidden layers and no activation.

[figure: x -> h1 -> h2 -> y, with weights W1, W2, W3 and biases b1, b2, b3]

y = h2 * W3 + b3 
  = (h1 * W2 + b2) * W3 + b3
  = h1 * W2 * W3 + b2 * W3 + b3 
  = (x * W1 + b1) * W2 * W3 + b2 * W3 + b3 
  = x * W1 * W2 * W3 + b1 * W2 * W3 + b2 * W3 + b3 
  = x * W' + b'

We can do the last step because a combination of several linear transformations can be replaced with one transformation, and a combination of several bias terms is just a single bias. The outcome is the same even if we add some linear activation.
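A quick numeric check of the derivation above, with arbitrarily chosen shapes and random weights (just a sketch):

import numpy as NP

NP.random.seed(1)
x = NP.random.rand(1, 4)
W1, b1 = NP.random.rand(4, 5), NP.random.rand(5)
W2, b2 = NP.random.rand(5, 3), NP.random.rand(3)
W3, b3 = NP.random.rand(3, 2), NP.random.rand(2)

# layer-by-layer forward pass with no (i.e. identity) activation
h1 = x @ W1 + b1
h2 = h1 @ W2 + b2
y  = h2 @ W3 + b3

# collapsed single-layer equivalent: y = x @ W' + b'
W_prime = W1 @ W2 @ W3
b_prime = b1 @ W2 @ W3 + b2 @ W3 + b3
print(NP.allclose(y, x @ W_prime + b_prime))   # True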

So we could replace this neural net with a single-layer neural net. This can be extended to n layers, which indicates that adding layers doesn't increase the approximation power of a linear neural net at all. We need non-linear activation functions to approximate non-linear functions, and most real-world problems are highly complex and non-linear. In fact, when the activation function is non-linear, a two-layer neural network with a sufficiently large number of hidden units can be proven to be a universal function approximator.


To understand the logic behind non-linear activation functions, first you should understand why activation functions are used at all. In general, real-world problems require non-linear solutions, which are not trivial. So we need some functions to generate that non-linearity. Basically, what an activation function does is generate this non-linearity while mapping input values into a desired range.

However, linear activation functions could be used in a very limited set of cases where you do not need hidden layers, such as linear regression. Usually, it is pointless to build a neural network for this kind of problem because, independent of the number of hidden layers, the network will generate a linear combination of the inputs, which can be done in just one step. In other words, it behaves like a single layer.

There are also a few more desirable properties for activation functions, such as continuous differentiability. Since we are using backpropagation, the function we use must be differentiable at any point. I strongly advise you to check the Wikipedia page on activation functions to get a better understanding of the topic.


As I remember, sigmoid functions are used because their derivative, which fits nicely into the backpropagation algorithm, is easy to calculate: something simple like f(x)(1 - f(x)). I don't remember the math exactly. Actually, any function with a derivative can be used.
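A small sanity check of that identity for the logistic sigmoid, comparing the analytic form s*(1 - s) against a numerical finite difference (just a sketch):

import numpy as NP

def sigmoid(z):
    return 1.0 / (1.0 + NP.exp(-z))

z = NP.linspace(-4, 4, 9)
s = sigmoid(z)
analytic = s * (1.0 - s)                                  # f(x) * (1 - f(x))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(NP.allclose(analytic, numeric, atol=1e-6))          # True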


A layered NN of several neurons can be used to learn linearly inseparable problems. For example, the XOR function can be obtained with two layers using a step activation function.
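For instance, a hand-wired sketch in NumPy; the weights below are just one choice that happens to work, not learned values:

import numpy as NP

step = lambda z: (z > 0).astype(int)     # step activation

# hidden layer: first unit fires for OR, second unit fires for AND
W1 = NP.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = NP.array([-0.5, -1.5])

# output layer: OR AND (NOT AND) == XOR
W2 = NP.array([[1.0], [-1.0]])
b2 = NP.array([-0.5])

X = NP.array([[0, 0], [0, 1], [1, 0], [1, 1]])
hidden = step(X @ W1 + b1)
output = step(hidden @ W2 + b2)
print(output.ravel())                    # [0 1 1 0], i.e. XOR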


Let me explain it to you as simply as possible:

Neural networks are used in pattern recognition, correct? And pattern finding is a very non-linear technique.

Suppose, for the sake of argument, we use a linear activation function y = w*X + b for every single neuron and set something like: if y > 0 -> class 1, else class 0.

Now we can compute our loss using a squared-error loss and backpropagate it so that the model learns well, correct?

WRONG.

  • For the last hidden layer, the updated value will be w{l} = w{l} - (alpha)*X.

  • For the second last hidden layer, the updated value will be w{l-1} = w{l-1} - (alpha)*w{l}*X.

  • For the ith last hidden layer, the updated value will be w{i} = w{i} - (alpha)*w{l}...*w{i+1}*X.

This results in us multiplying all the weight matrices together, which leaves the following possibilities:

  • A) w{i} barely changes, due to the vanishing gradient;

  • B) w{i} changes dramatically and inaccurately, due to the exploding gradient;

  • C) w{i} changes well enough to give us a good fit score.

If case C happens, that means our classification/prediction problem was most probably a simple linear/logistic-regression-based one and never required a neural network in the first place!

No matter how robust or well hyper-tuned your NN is, if you use a linear activation function you will never be able to tackle pattern-recognition problems that require non-linearity.
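A rough illustration of possibilities A and B with purely hypothetical random weight matrices (nothing trained): repeatedly multiplying them shrinks or blows up the signal depending on their scale.

import numpy as NP

NP.random.seed(0)

def product_norm(scale, depth=50, size=10):
    # multiply `depth` random weight matrices of the given scale together
    M = NP.eye(size)
    for _ in range(depth):
        M = M @ (scale * NP.random.randn(size, size) / NP.sqrt(size))
    return NP.linalg.norm(M)

print(product_norm(scale=0.5))   # tiny  -> the vanishing-gradient regime
print(product_norm(scale=1.5))   # huge  -> the exploding-gradient regime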


Several good answers have been given here. It is also worth pointing to the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop, which gives deeper insight into several ML-related concepts. Excerpt from page 229 (Section 5.1):

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2, we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.


It's not at all a requirement. In fact, the rectified linear activation function is very useful in large neural networks. Computing the gradient is much faster, and it induces sparsity by setting a minimum bound at 0.

See the following for more details: https://www.academia.edu/7826776/Mathematical_Intuition_for_Performance_of_Rectified_Linear_Unit_in_Deep_Neural_Networks
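A minimal sketch of the sparsity point, using plain NumPy rather than any particular framework:

import numpy as NP

NP.random.seed(0)
pre_activations = NP.random.randn(1000)        # simulated pre-activation values

relu_out = NP.maximum(0.0, pre_activations)    # rectified linear unit
print((relu_out == 0).mean())                  # roughly half the units are exactly 0

# the gradient is cheap as well: 1 where the input was positive, 0 elsewhere
relu_grad = (pre_activations > 0).astype(float)
print(relu_grad[:10])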


Edit:

There has been some discussion over whether the rectified linear activation function can be called a linear function.

Yes, it is technically a non-linear function because it is not linear at the point x = 0; however, it is still correct to say that it is linear at all other points, so I don't think it's that useful to nitpick here.

I could have chosen the identity function and the point would still hold, but I chose ReLU as the example because of its recent popularity.

Reference URL: https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net
