
[Improving: Hyper-parameter tuning, Regularization and Optimization] Programming - Gradient Checking

김 정 환 2020. 4. 15. 12:22

This note is based on the Coursera course by Andrew Ng.

(This is just a study note for me. Some sentences may be copied or awkward because I am not a native speaker, but I want to learn Deep Learning in English, so everything will get better and better. :))

 

 

INTRO

To verify that our backward propagation is correct, we are going to use gradient checking.

 

Backpropagation computes the gradients ∂J/∂θ, where θ denotes the parameters of the model and J is the cost computed using forward propagation and the loss function.

 

Definition of derivative:

∂J/∂θ = lim(ε→0) [ J(θ + ε) − J(θ − ε) ] / (2ε)

The following figure describes the forward and backward propagation.
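As a quick illustration (my own toy example, not part of the assignment), the two-sided difference above can be checked on a 1-D function such as J(θ) = θ², whose true derivative is 2θ:

def J(theta):
    # toy 1-D cost: J(theta) = theta^2, so dJ/dtheta = 2*theta
    return theta ** 2

theta = 3.0
epsilon = 1e-7

# two-sided (centered) difference approximation of the derivative
gradapprox = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)
grad = 2 * theta   # analytic derivative

print(grad, gradapprox)   # both are approximately 6.0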

 

MAIN

Let's look at forward propagation and backward propagation.

import numpy as np  # relu, sigmoid and the dictionary/vector helpers used below come from the assignment's utility module

def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 3.
    
    Arguments:
    X -- training set for m examples
    Y -- labels for m examples 
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    
    Returns:
    cost -- the cost function (logistic cost averaged over the m examples)
    cache -- tuple of intermediate values needed by backward propagation
    """
    
    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
 
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
 
    # Cost
    logprobs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1. / m * np.sum(logprobs)
    
    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    
    return cost, cache
 
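The relu and sigmoid functions (and the dictionary/vector helpers used later) are imported from the assignment's utility module. In case you run the code outside the notebook, minimal NumPy versions could look like this (my sketch, not the assignment's exact code):

import numpy as np

def sigmoid(x):
    # element-wise logistic function
    return 1. / (1. + np.exp(-x))

def relu(x):
    # element-wise rectified linear unit
    return np.maximum(0, x)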

 

def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.
    
    Arguments:
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    cache -- cache output from forward_propagation_n()
    
    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) * 2     # planted bug: extra factor of 2 (found by gradient checking below)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)     # planted bug: should be 1./m (found by gradient checking below)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients
 

 

Here θ is not a scalar. It is the dictionary 'parameters'.

We use a function 'dictionary_to_vector()' that converts the 'parameters' dictionary into a single column vector, and its inverse 'vector_to_dictionary()', which converts the vector back into the 'parameters' dictionary.
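These two helpers (and gradients_to_vector, used in the code below) are provided by the assignment's utility module. A minimal sketch, assuming the parameter names and shapes from forward_propagation_n and a fixed key order, could look like this:

import numpy as np

# shapes used in this example, in a fixed order
SHAPES = [("W1", (5, 4)), ("b1", (5, 1)), ("W2", (3, 5)),
          ("b2", (3, 1)), ("W3", (1, 3)), ("b3", (1, 1))]

def dictionary_to_vector(parameters):
    # flatten each parameter to a column and stack them into one long vector
    vectors = [parameters[name].reshape(-1, 1) for name, _ in SHAPES]
    return np.concatenate(vectors, axis=0), [name for name, _ in SHAPES]

def vector_to_dictionary(theta):
    # cut the long vector back into pieces and reshape each to its original shape
    parameters, start = {}, 0
    for name, shape in SHAPES:
        size = shape[0] * shape[1]
        parameters[name] = theta[start:start + size].reshape(shape)
        start += size
    return parameters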

 

Finally, compute the relative difference between 'gradapprox' and 'grad' using the following formula:

difference = ||grad − gradapprox||_2 / (||grad||_2 + ||gradapprox||_2)

We will need 3 steps to compute this formula (a tiny standalone example follows the list):

  • compute the numerator using np.linalg.norm(...)
  • compute the denominator. You will need to call np.linalg.norm(...) twice.
  • divide them.
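On toy numbers of my own (not the assignment's data), these three steps reduce to:

import numpy as np

grad = np.array([[1.0], [2.0], [3.0]])              # pretend backprop gradient
gradapprox = np.array([[1.0], [2.0], [3.0000001]])  # pretend numerical approximation

numerator = np.linalg.norm(grad - gradapprox)                     # step 1
denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)   # step 2
difference = numerator / denominator                              # step 3
print(difference)   # about 1.3e-08, well below the 2e-7 threshold used later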

 

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
    X -- input data, of shape (input size, number of examples)
    Y -- true "labels"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values) # Step 1 : copy parameters_values
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                   # Step 2 : add epsilon to the i-th entry only
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))     # Step 3 : compute cost
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values) # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                                 # Step 2        
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))    # Step 3
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2*epsilon)
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(gradapprox - grad)                                           # Step 1'
    denominator = np.linalg.norm(gradapprox) + np.linalg.norm(grad) # Step 2'
    difference = numerator / denominator                                           # Step 3'
    ### END CODE HERE ###
 
    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference
 
 
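Putting it together, the check is run roughly like this. I use toy random data here in place of the notebook's test case, and assume the functions above plus the assignment's gradients_to_vector helper are available, so the exact difference value will differ:

import numpy as np

np.random.seed(1)
X = np.random.randn(4, 3)                 # 4 input features, 3 examples (toy data)
Y = np.array([[1, 1, 0]])
parameters = {"W1": np.random.randn(5, 4), "b1": np.zeros((5, 1)),
              "W2": np.random.randn(3, 5), "b2": np.zeros((3, 1)),
              "W3": np.random.randn(1, 3), "b3": np.zeros((1, 1))}

cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
# with the planted bugs still in backward_propagation_n, this reports a mistake
difference = gradient_check_n(parameters, gradients, X, Y)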

 

What 'parameters' and 'parameters_values' look like (an entry count follows the list):

  • parameters : {'W1': array([[-0.3224172 , -0.38405435, 1.13376944, -1.09989127], [-0.17242821, -0.87785842, 0.04221375, 0.58281521], [-1.10061918, 1.14472371, 0.90159072, 0.50249434], [ 0.90085595, -0.68372786, -0.12289023, -0.93576943], [-0.26788808, 0.53035547, -0.69166075, -0.39675353]]), 'b1': array([[-0.6871727 ], [-0.84520564], [-0.67124613], [-0.0126646 ], [-1.11731035]]), 'W2': array([[ 0.2344157 , 1.65980218, 0.74204416, -0.19183555, -0.88762896], [-0.74715829, 1.6924546 , 0.05080775, -0.63699565, 0.19091548], [ 2.10025514, 0.12015895, 0.61720311, 0.30017032, -0.35224985]]), 'b2': array([[-1.1425182 ], [-0.34934272], [-0.20889423]]), 'W3': array([[ 0.58662319, 0.83898341, 0.93110208]]), 'b3': array([[ 0.28558733]])}
  • parameters_values: [[-0.3224172 ] [-0.38405435] [ 1.13376944] [-1.09989127] [-0.17242821] [-0.87785842] [ 0.04221375] [ 0.58281521] [-1.10061918] [ 1.14472371] [ 0.90159072] [ 0.50249434] [ 0.90085595] [-0.68372786] [-0.12289023] [-0.93576943] [-0.26788808] [ 0.53035547] [-0.69166075] [-0.39675353] [-0.6871727 ] [-0.84520564] [-0.67124613] [-0.0126646 ] [-1.11731035] [ 0.2344157 ] [ 1.65980218] [ 0.74204416] [-0.19183555] [-0.88762896] [-0.74715829] [ 1.6924546 ] [ 0.05080775] [-0.63699565] [ 0.19091548] [ 2.10025514] [ 0.12015895] [ 0.61720311] [ 0.30017032] [-0.35224985] [-1.1425182 ] [-0.34934272] [-0.20889423] [ 0.58662319] [ 0.83898341] [ 0.93110208] [ 0.28558733]]
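For reference, the flattened vector stacks all six parameters, so it has 5·4 + 5 + 3·5 + 3 + 1·3 + 1 = 47 entries. 'parameters_values' therefore has shape (47, 1), and the gradient-checking loop runs 47 times, calling forward propagation twice per iteration.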

 

If we run gradient checking, it outputs 0.285093156781, which means there is a mistake in the backward propagation. Checking the backward_propagation_n function, we find that dW2 is multiplied by an extra factor of 2 and db1 uses 4./m instead of 1./m. After changing both back to 1./m and running gradient checking again, we get a difference of 1.18855520355e-07.
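Concretely, the two corrected lines in backward_propagation_n become:

    dW2 = 1./m * np.dot(dZ2, A1.T)                       # extra * 2 removed
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)    # 4./m changed to 1./m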

 

 

 

 

CONCLUSION

Reminders

  • Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient.
  • Gradient checking is slow because approximating the gradient requires two forward passes per parameter, so we run it only occasionally to make sure the backpropagation gradients are correct, not at every training iteration.
  • Gradient checking doesn't work with dropout. So we run gradient checking with dropout turned off to make sure backpropagation is correct.