
[Neural Network and Deep Learning] Activation functions

김 정 환 2020. 3. 18. 12:29

This note is based on the Coursera course by Andrew Ng.

(This is just a study note for me. Some sentences may be copied or awkward because I am not a native speaker, but I want to learn Deep Learning in English, so everything will get better and better :))


When we build a neural network, one of the choices we get to make is which activation function to use in the hidden layers, as well as for the output unit of the neural network. So far, we have only been using the sigmoid activation function, but sometimes other choices can work much better.


For hidden units, the tanh function almost always works better than the sigmoid function. The one exception is the output layer: if y is either 0 or 1, then it makes sense for y_hat to be a number between 0 and 1 rather than between -1 and 1. So the one place where we would still use the sigmoid activation function is the output layer when we are doing binary classification.
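
For reference, here is a minimal NumPy sketch of these two activation functions (the function names and test values are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: maps z into the range (-1, 1), centered at 0."""
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # [0.119..., 0.5, 0.880...]
print(tanh(z))     # [-0.964..., 0.0, 0.964...]
```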


One of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient (the slope) becomes very small, and this can slow down gradient descent. Another choice that is very popular in machine learning is the rectified linear unit (ReLU). Its derivative is 1 as long as z is positive, and 0 when z is negative. One disadvantage of the ReLU is that the derivative is equal to zero when z is negative. In practice this works fine, but there is another version called the leaky ReLU, which keeps a small non-zero slope when z is negative. Even though the slope of the ReLU is 0 for half of the range of z, in practice enough of our hidden units will have z greater than 0, so learning can still be quite fast for most training examples.
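
A minimal NumPy sketch of the ReLU and leaky ReLU along with their derivatives (the 0.01 slope for negative z is just a common default, not a value fixed by the course):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z). Derivative is 1 for z > 0 and 0 for z < 0."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: keeps a small slope (alpha) for negative z, so the gradient never dies completely."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z), relu_derivative(z))              # [0. 0. 0.5 3.]        [0. 0. 1. 1.]
print(leaky_relu(z), leaky_relu_derivative(z))  # [-0.03 -0.005 0.5 3.] [0.01 0.01 1. 1.]
```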


Here are some rules of thumb for choosing activation functions. 

● If you are doing binary classification, then the sigmoid activation function is a very natural choice for the output layer. 

● ReLU is increasingly the default choice of activation function. So if you are not sure what to use for your hidden layers, just use the ReLU activation function. 


We often have a lot of choices in how we build our neural network, ranging from the number of hidden units, to the choice of activation function, to how we initialize the weights. It is sometimes difficult to get good guidelines for exactly what will work best for our problem. So a common piece of advice is: if you are not sure which of these activation functions works best, try them all and evaluate them on a validation set, or a development set. 
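
As a rough illustration of that advice (this sketch uses scikit-learn purely as a convenience; it is not part of the course, and the dataset and hyperparameters are just placeholders), one could loop over candidate activation functions and compare accuracy on a development set:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A small synthetic binary classification problem
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

# Try each hidden-layer activation and evaluate on the development set
# ("logistic" is scikit-learn's name for the sigmoid function)
for activation in ["logistic", "tanh", "relu"]:
    model = MLPClassifier(hidden_layer_sizes=(10,), activation=activation,
                          max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    print(activation, "dev accuracy:", model.score(X_dev, y_dev))
```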


Why does a neural network need a non-linear activation function? Let's see what happens if we just get rid of the function g and set a1 = z1. It turns out that if we do this, then the model is just computing y, or rather y_hat, as a linear function of our input features x. 
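
To make that concrete, here is the algebra for a two-layer network with identity activations (written in the course's bracket notation; W' and b' are just labels I introduce for the collapsed parameters):

```latex
\begin{aligned}
a^{[1]} &= z^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[2]} &= z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\
        &= W^{[2]} \left( W^{[1]} x + b^{[1]} \right) + b^{[2]} \\
        &= \underbrace{W^{[2]} W^{[1]}}_{W'} \, x
           + \underbrace{W^{[2]} b^{[1]} + b^{[2]}}_{b'}
         = W' x + b'
\end{aligned}
```

So the whole forward pass collapses into a single linear function of x.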


If we were to use a linear activation function (an identity activation function), then the neural network would just be outputting a linear function of the input. No matter how many layers our neural network has, all it is doing is computing a linear function of the input. So a linear hidden layer is more or less useless, because the composition of two linear functions is itself a linear function. 
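
As a quick numerical sanity check of that claim (the shapes and random values below are chosen arbitrarily, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                                # input features
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))  # "hidden" layer
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))  # output layer

# Two layers with identity ("linear") activations: a1 = z1, a2 = z2
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# One equivalent linear layer: W_prime = W2 W1, b_prime = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
print(np.allclose(a2, W_prime @ x + b_prime))  # True
```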

There is just one place where we might use a linear activation function: usually the output layer. If we are predicting housing prices, it might be okay to use a linear activation function in the output layer, because the predicted price y_hat should be a real number rather than a value between 0 and 1 (and since prices are all non-negative, a ReLU output could also work).
