This note is based on the Coursera course by Andrew Ng.
(This is just a personal study note, so some sentences may be copied or awkward since I am not a native speaker. But I want to learn Deep Learning in English, so everything will get better and better :))
INTRO
How do we find a good setting for these hyperparameters?
MAIN
Tuning process
One of the painful things about training deep nets is the sheer number of hyperparameters we have to deal with. If we try to tune some set of hyperparameters, how do we select the values to explore?
1. Try random values
It is difficult to know in advance which hyperparameters will be the most important for our problem. So we simply choose hyperparameter values at random. If the candidate values for each hyperparameter are listed ([]), we can select values from each list at random.
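A minimal sketch of this idea, using hypothetical candidate lists (the hyperparameter names and values here are just illustrative, not from the course):

```python
import random

# Hypothetical candidate lists for a few hyperparameters.
candidates = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [50, 100, 200],
    "mini_batch_size": [32, 64, 128],
}

def sample_setting(candidates):
    """Pick one value at random from each hyperparameter's list."""
    return {name: random.choice(values) for name, values in candidates.items()}

setting = sample_setting(candidates)
```

Each call to `sample_setting` gives one random combination to train and evaluate.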
2. Coarse to fine
We might find that two points work best. Then, in the coarse-to-fine scheme, we zoom in to a smaller region of the hyperparameter space and sample more densely within it.
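The coarse-to-fine scheme can be sketched like this (the ranges 0.0–1.0 and 0.2–0.4 are made-up examples, assuming the best coarse results fell in the smaller region):

```python
import random

def sample_region(low, high, n):
    """Sample n values uniformly at random from [low, high]."""
    return [random.uniform(low, high) for _ in range(n)]

# Coarse pass over the full range of some hyperparameter...
coarse = sample_region(0.0, 1.0, 10)
# ...then, supposing the best results fell between 0.2 and 0.4,
# zoom in and sample more densely inside that smaller region.
fine = sample_region(0.2, 0.4, 25)
```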
But sampling at random doesn't mean sampling uniformly at random over the range of valid values. Instead, it is important to pick the appropriate scale on which to explore the hyperparameters.
Pick the appropriate scale for hyperparameters
If we sample uniformly at random from 1 to 100, values like 2, 3, 40, and 50 are all equally likely, and for some hyperparameters that is reasonable. But it is not true for all hyperparameters.
Suppose we are searching for the hyperparameter α, the learning rate. If we draw the number line from 0.0001 to 1 and sample values uniformly at random over it, 90% of the values will land between 0.1 and 1, and only 10% of our resources will go to searching between 0.0001 and 0.1.
Instead of using a linear scale (the process above), it seems more reasonable to search for hyperparameters on a log scale, sampling uniformly at random on this logarithmic scale.
In Python, we can implement this as follows:
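A minimal sketch using the standard library, assuming α should range from 0.0001 to 1 as above:

```python
import random

# Draw the exponent uniformly at random, so alpha is log-uniform
# over [10^-4, 10^0], i.e. [0.0001, 1].
r = random.uniform(-4, 0)
alpha = 10 ** r
```

Because `r` is uniform, each decade (0.0001–0.001, 0.001–0.01, 0.01–0.1, 0.1–1) receives an equal share of the samples.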
One other case is sampling the hyperparameter β used for computing exponentially weighted averages. Here too, it doesn't make sense to sample uniformly on the linear scale between 0.9 and 0.999.
So the best way to think about this is that we want to explore the range of values for 1 - β.
However, why is it such a bad idea to sample on a linear scale? The reason is that when β is close to 1, the results become very sensitive to even small changes in β. For example:
If β goes from 0.9 to 0.9005, it is no big deal. This is hardly any change in our results: both values average over roughly the last 10 examples, since 1/(1-β) ≈ 10 in both cases.
If β goes from 0.999 to 0.9995, this has a huge impact on exactly what our algorithm is doing: it goes from averaging over the last 1/(1-0.999) = 1000 examples to the last 1/(1-0.9995) = 2000 examples.
So what this sampling process does is cause us to sample more densely in the region where β is close to 1, because changes in β have a much larger impact there.
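The idea above can be sketched by sampling 1 - β on a log scale (assuming the range 0.9 to 0.999 from the text, so 1 - β lies between 10^-3 and 10^-1):

```python
import random

# beta in [0.9, 0.999] means 1 - beta in [0.001, 0.1],
# i.e. between 10^-3 and 10^-1. Sample the exponent uniformly
# and convert back to beta.
r = random.uniform(-3, -1)
beta = 1 - 10 ** r
```

This spends as many samples exploring 0.99–0.999 as it does exploring 0.9–0.99.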
Organize the hyperparameter search process
1. Babysitting one model
If we have a huge data set but not a lot of computational resources (CPUs, GPUs), we can basically afford to train only one model, or a small number of models, at a time. In this case, we might gradually babysit that model even as it is training: we watch its performance and patiently nudge the parameters.
We gradually watch the learning curve, which might be the cost function J, the dev-set error, or something else. Then, at day 3, we might try increasing the learning rate a little bit and see how it does. At day 5, we might fiddle with the momentum term a bit or decrease the learning rate a bit. At day 8, we find the learning curve suddenly surging, so we might go back to the previous day's model.
2. Training many models in parallel
We take some setting of the hyperparameters and let it run by itself for a day or even for multiple days. At the same time, we start up different models with different settings of the hyperparameters and compare their learning curves at the end.
CONCLUSION