
[Improving: Hyper-parameter tuning, Regularization and Optimization] Multi-class classification - softmax classifier

김 정 환 2020. 4. 23. 12:53

This note is based on the Coursera course by Andrew Ng.

(This is just a study note for me. Some sentences may be copied or awkward because I am not a native speaker, but I want to learn Deep Learning in English, so everything will get better and better :))

 

 

 

INTRO

The name Softmax comes from contrasting it with a Hardmax, which would take the vector Z and map it to a vector like [1 0 0 0]. The Hardmax function looks at the elements of Z and just puts a 1 in the position of the biggest element of Z (the most likely class) and 0s everywhere else.
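
A minimal numpy sketch of the contrast between the two; the vector Z below is just an illustrative choice of mine:

```python
import numpy as np

def hardmax(z):
    """Put a 1 in the position of the biggest element of z, 0s everywhere else."""
    out = np.zeros_like(z)
    out[np.argmax(z)] = 1.0
    return out

def softmax(z):
    """Map z to a gentler vector of probabilities that are positive and sum to 1."""
    t = np.exp(z)
    return t / np.sum(t)

Z = np.array([3.0, 1.0, -2.0, 0.5])
print(hardmax(Z))   # [1. 0. 0. 0.]
print(softmax(Z))   # roughly [0.82, 0.11, 0.01, 0.07]
```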

 

So far, the classification examples we have talked about have used binary classification. There is a generalization of logistic regression called Softmax regression that lets us make predictions where we are trying to recognize one of C (multiple) classes, rather than just two classes.

 

 

 

MAIN

WHAT

Recognizing cats, dogs, and baby chicks (plus an "other" class).

We use C to denote the number of classes. Here, we have 4 classes, so we are going to build a network where the output layer has 4 units. What we want is for each unit in the output layer to tell us the probability of one of these 4 classes.

 

HOW

The standard model for getting your network to do this uses a Softmax output layer. Let's see what the Softmax layer is doing.
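
In the final layer, we first compute z[L] = W[L] a[L-1] + b[L] as usual. The Softmax activation then computes t = e^(z[L]) element-wise, and a[L]_j = t_j / (sum of t_k over k = 1..C). So the 4 outputs are all positive and sum to 1, and we can read them as probabilities.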

 

Let's go through a specific example that will make this equation clear. 
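
A minimal numpy sketch of that computation; the vector z below is an illustrative choice of pre-activations for C = 4 classes:

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])   # z[L]: pre-activations of the output layer

t = np.exp(z)          # element-wise exponentiation: t = e^(z[L])
a = t / np.sum(t)      # normalize so the 4 entries sum to 1

print(t)          # roughly [148.4, 7.4, 0.4, 20.1]
print(a)          # roughly [0.842, 0.042, 0.002, 0.114] -- the class probabilities
print(a.sum())    # 1.0
```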

 

One of the things the Softmax classifier can represent is linear decision boundaries: the decision boundary between any two classes will be linear.

 

 

Train

How do we train the Softmax model?

 

Looking at the example, the neural network is not doing very well, because the cat class is assigned only a 20% probability.

 

 

So, what loss function do we want to use to train this neural network? In Softmax classification, the loss function is as follows:
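
L(y_hat, y) = - sum over j = 1..C of ( y_j * log(y_hat_j) )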

 

In this example the true label is y = [0, 1, 0, 0], so the only term left in the sum is -log(y2_hat). We want our algorithm to reduce the loss on the training set, so we will make -log(y2_hat) small, which means making y2_hat as big as possible.

 

 

This loss function was for a single example. What about the cost J for the entire training set?
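
The cost is the average of the losses over the m training examples: J = (1/m) * sum over i = 1..m of L(y_hat(i), y(i)).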

 

Finally, let's take a look at how we would implement gradient descent when we have a Softmax output layer. The key equation we need to initialize back prop is this expression: dJ/dz[L] = y_hat - y. With this, we can compute dz[L] and start off the back prop.
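
A minimal numpy sketch of the forward pass and this key back-prop step; the array names, shapes, and random data are my own illustrative choices, not from the course:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z, axis=0, keepdims=True))  # shift for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)

np.random.seed(0)
C, m = 4, 3                                   # 4 classes, 3 training examples
ZL = np.random.randn(C, m)                    # z[L]: pre-activations of the output layer
Y = np.eye(C)[:, np.random.randint(0, C, m)]  # one-hot labels, shape (C, m)

AL = softmax(ZL)                 # y_hat = a[L], one probability column per example
J = -np.sum(Y * np.log(AL)) / m  # cost: average cross-entropy loss

dZL = AL - Y   # key equation to start back prop: dJ/dz[L] = y_hat - y
               # (the 1/m factor is applied later, when computing dW[L] and db[L])
```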

 
