
[Neural Network and Deep Learning] Logistic Regression Cost Function and Gradient Descent

김 정 환 2020. 3. 4. 15:17

This note is based on the Coursera course by Andrew Ng.

(This is just a study note for me. Some sentences may be copied or awkward, because I am not a native speaker. But I want to learn Deep Learning in English, so everything will get better and better :))


To train the parameters w and b of the logistic regression model, I need to define a cost function. Let's take a look at it.


Given the training set, I want the predictions, which I write as y_hat(i), to be close to the ground-truth labels y(i) in the training set. The loss function helps me measure how well my algorithm is doing. One thing I could do is define the loss, when my algorithm outputs y_hat and the true label is y, as the squared error, or one half the squared error: L(y_hat, y) = (1/2)(y_hat - y)^2.


But in logistic regression, people do not usually do this, because when you come to learn the parameters, you find that the optimization problem becomes non-convex: it has multiple local optima, so gradient descent may not find the global optimum.
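
To see this concretely, here is a small sketch (my own illustration, not from the lecture) that evaluates the squared-error cost of a one-parameter model y_hat = sigmoid(w * x) on a tiny made-up dataset and checks its second differences; a convex function would never have a negative one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny made-up dataset: one feature per example, binary labels.
x = np.array([1.0, 2.0, -1.5])
y = np.array([1.0, 1.0, 0.0])

def squared_error_cost(w):
    # Mean of one-half squared errors for the model y_hat = sigmoid(w * x).
    y_hat = sigmoid(w * x)
    return 0.5 * np.mean((y_hat - y) ** 2)

# Sample the cost on a grid of w values and inspect second differences.
ws = np.linspace(-10.0, 10.0, 401)
costs = np.array([squared_error_cost(w) for w in ws])
second_diffs = np.diff(costs, n=2)

# A convex function has non-negative second differences everywhere.
# Here some are negative, so this cost surface is not convex.
print("min second difference:", second_diffs.min())
```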


So, in logistic regression we will actually define a different loss function that plays a similar role as squared error:

L(y_hat, y) = -( y * log(y_hat) + (1 - y) * log(1 - y_hat) )

Here is some intuition for why this loss function makes sense. If we were using squared error, we would want the squared error to be as small as possible; with this logistic regression loss function, we also want it to be as small as possible. To understand why this makes sense, let's look at the two cases.


In the first case, let's say y is equal to one. Then the loss L(y_hat, y) is just the first term, -log(y_hat). So you want -log(y_hat) to be as small as possible, which means you want y_hat to be large. Because y_hat comes from the sigmoid function, it can never be bigger than one, so you want y_hat to be close to one. The other case is y equal to zero. Then the loss is the second term, -log(1 - y_hat), and by the same reasoning you want log(1 - y_hat) to be large. So your loss function will push the parameters to make y_hat as close to zero as possible.
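
As a quick numerical check of both cases, here is a small sketch of my own (the function name is mine):

```python
import numpy as np

def logistic_loss(y_hat, y):
    # L(y_hat, y) = -( y * log(y_hat) + (1 - y) * log(1 - y_hat) )
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Case y = 1: the loss shrinks as y_hat gets closer to 1.
for y_hat in [0.1, 0.5, 0.9, 0.99]:
    print("y=1, y_hat=%.2f -> loss %.4f" % (y_hat, logistic_loss(y_hat, 1)))

# Case y = 0: the loss shrinks as y_hat gets closer to 0.
for y_hat in [0.9, 0.5, 0.1, 0.01]:
    print("y=0, y_hat=%.2f -> loss %.4f" % (y_hat, logistic_loss(y_hat, 0)))
```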


Finally, the loss function was defined with respect to a single training example; it measures how well you are doing on a single example. There is also something called the cost function, which measures how well you are doing on the entire training set. The cost function J, applied to your parameters w and b, is the average, one over m, of the sum of the loss function applied to each of the training examples in turn:

J(w, b) = (1/m) * sum over i = 1..m of L(y_hat(i), y(i))
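
In code, the cost is just the average of the per-example loss. A minimal vectorized sketch, with variable names of my own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    # X is (n_features, m) holding m training examples; y is (m,) labels.
    y_hat = sigmoid(w @ X + b)        # predictions for all m examples
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.mean(losses)            # J(w, b) = (1/m) * sum of losses

# At w = 0, b = 0 every prediction is 0.5, so J = -log(0.5) ~ 0.6931.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.0, 2.0]])      # 2 features, 3 examples
y = np.array([1.0, 1.0, 0.0])
print(cost(np.zeros(2), 0.0, X, y))
```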


So, the cost function measures how well your parameters w and b are doing on the training set. In order to learn the parameters, it seems natural to find the w and b that make the cost function J(w, b) as small as possible. What we want to do is really find the value of w and b that corresponds to the minimum of the cost function J. To find a good value for the parameters, we will initialize w and b to some initial value, maybe denoted by that little red dot at the top. Because this cost function J is convex, no matter where you initialize, you should get to the same point, or roughly the same point. What gradient descent does is start at that initial point and then take a step in the steepest downhill direction. So after one step of gradient descent you might end up at the second red dot, because it is trying to take a step downhill in the direction of steepest descent, as quickly downhill as possible. That is one iteration of gradient descent. In the end, you converge to the global optimum or get to something close to the global optimum.


Let's write out a bit more of the details. To make it easier, I am going to ignore b for now, just to make this a one-dimensional plot instead of a higher-dimensional one. We are going to repeatedly carry out the following update, taking the value of w and updating it as

w := w - alpha * (dJ(w)/dw)

and we will repeatedly do that until the algorithm converges. Alpha is the learning rate, and it controls how big a step we take on each iteration of gradient descent.
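
Here is a minimal sketch of that update loop on a toy convex cost J(w) = (w - 3)^2 (my own example, using a numerical derivative for simplicity):

```python
def J(w):
    return (w - 3.0) ** 2          # toy convex cost, minimum at w = 3

def dJ_dw(w, eps=1e-6):
    # Numerical derivative for the sketch; normally the analytic gradient is used.
    return (J(w + eps) - J(w - eps)) / (2 * eps)

w = 10.0                           # initial value of the parameter
alpha = 0.1                        # learning rate: step size per iteration
for _ in range(100):
    w = w - alpha * dJ_dw(w)       # w := w - alpha * dJ(w)/dw

print(w)                           # ends up very close to 3.0
```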

Let's just make sure that this gradient descent update makes sense. Say we are at the blue dot near the top of the curve, heading down to the left. Remember that the derivative of a function at a point is its slope there, and here the derivative is positive. So gradient descent will slowly decrease the parameter if you started off with a large value of w. If w started on the other side (the left), the derivative would be negative, so the update would instead increase the parameter.


In logistic regression, your cost function is a function of both w and b. So in that case, the inner loop of gradient descent has to repeat the following updates (the derivatives are now partial derivatives):

w := w - alpha * (dJ(w,b)/dw)
b := b - alpha * (dJ(w,b)/db)
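
Putting it together, here is a minimal sketch of that inner loop for logistic regression, using the standard logistic regression gradients dJ/dw = (1/m) X (y_hat - y) and dJ/db = mean(y_hat - y); the code itself is my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, num_iters=1000):
    # X is (n_features, m); y is (m,) binary labels.
    n, m = X.shape
    w, b = np.zeros(n), 0.0           # zero initialization works here
    for _ in range(num_iters):
        y_hat = sigmoid(w @ X + b)    # predictions on the whole training set
        dw = X @ (y_hat - y) / m      # dJ/dw
        db = np.mean(y_hat - y)       # dJ/db
        w = w - alpha * dw            # w := w - alpha * dJ(w,b)/dw
        b = b - alpha * db            # b := b - alpha * dJ(w,b)/db
    return w, b
```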

