Hyperparameter tuning for Keras models (Part 1)

June 8, 2019

UPDATE: Since this guide came out, the Keras team has released an official Python package for hyperparameter tuning. The code here still works, but you should check out that package as well.

When encountering a new dataset, you use your own (helpful) inductive biases and come up with an initial model structure that you think will fit the data well. That’s only the first step to finding a good model since now you’re left with a bunch of hyperparameters and often only a vague idea of how to set them. Thorough tuning of those parameters can greatly increase the performance of your model.

Here I compiled two ways to do this for your Keras model without running into memory problems or having to rely on single-threaded execution. This guide is mostly aimed at smaller models that train well on CPU, since that's what I've been working with recently in the context of density estimation. The first option uses scikit-learn; the second is more advanced and flexible, using Python multiprocessing directly.

Hyperparameter tuning using scikit-learn

Benefits: Quick to set up, little extra code necessary, flexible

Drawbacks: Possible memory issues for big models / parameter grids. Especially when using the TensorFlow backend, sklearn is sloppy about cleaning up memory after each fit. For a memory-safe alternative approach for big models, see Part 2.

The Keras library provides wrappers that make its models adhere to the sklearn API for regressors and classifiers. That way a Keras model can be used directly by the GridSearchCV and RandomizedSearchCV classes. The notebook with the full code for this post can be found here.

Step 1: Getting some data

This is the data we’ll use: a sinusoid with some heteroscedastic noise. Heteroscedastic is a ten-dollar word for saying that the inherent noise in the data varies over the input space; here we have more noise for x < -3.

[Figure: the training data, a sinusoid with heteroscedastic noise]
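The data itself isn’t generated anywhere in this post (it lives in the notebook); here is a minimal sketch of how a comparable dataset could be produced. The exact noise levels and the cutoff at x < -3 are assumptions, not the notebook’s values:

import numpy as np

# hypothetical data generation: a sinusoid with more noise for x < -3
rng = np.random.RandomState(0)
x_train = rng.uniform(-5, 5, size=(400, 1)).astype('float32')
noise_std = np.where(x_train < -3, 1.0, 0.3)  # heteroscedastic noise level
y_train = (np.sin(x_train) + noise_std * rng.randn(400, 1)).astype('float32')

x_test = np.linspace(-5, 5, 200, dtype='float32').reshape(-1, 1)
y_test = (np.sin(x_test) + np.where(x_test < -3, 1.0, 0.3) * rng.randn(200, 1)).astype('float32')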

Step 2: Defining a model and a build function

To model the heteroscedastic noise, we’ll use a simple dense neural network that outputs a normal distribution. The build function takes the hyperparameters and returns a trainable model. If you’ve never seen the t[..., 0:1] syntax: the ... is an Ellipsis, syntactic sugar that inserts as many : as necessary to fill up the dimensions of the array.

import tensorflow as tf
import tensorflow_probability as tfp


class DensityModel(tf.keras.Sequential):
    def __init__(self, hidden_sizes=(32,32), learning_rate=0.003):
        # hidden layers
        layers = [tf.keras.layers.Dense(size, activation='relu') 
                  for size in hidden_sizes]

        # output layers
        layers += [tf.keras.layers.Dense(2, activation='linear')]

        # transforms the output into a normal distribution
        layers += [tfp.layers.DistributionLambda(
                   lambda t: tfp.distributions.Normal(
                       loc=t[..., 0:1], 
                       scale=tf.nn.softplus(t[..., 1:2])))]

        super().__init__(layers)

        # for the loss we use the negative log likelihood of the data
        self.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                     loss=lambda y, p_y: -p_y.log_prob(y))

    @staticmethod
    def build_fn(hidden_sizes=(10,10), learning_rate=0.03):
        # !important! destroys the current tf graph and clears the memory
        tf.keras.backend.clear_session()
        return DensityModel(hidden_sizes, learning_rate)

IMPORTANT: If you don’t destroy the current TF graph before you create a model, your memory will overflow.
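As a quick sanity check (not part of the original notebook), the model can be built and queried directly. Assuming eager execution (TF 2.x) and x_train of shape (n, 1), calling the model returns a tfp distribution rather than a plain tensor:

# hypothetical sanity check: the model's output is a distribution over y
model = DensityModel.build_fn(hidden_sizes=(16, 16), learning_rate=0.003)
dist = model(x_train[:5])
print(dist.mean().numpy().ravel())    # predicted means for the first 5 inputs
print(dist.stddev().numpy().ravel())  # predicted standard deviations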

Step 3: Initial fit

We take a guess at the hyperparameters and fit the model. Our learning rate might be slightly too low, so we don’t fit the data too well. This is a very simple example; with real data and a more complex model it won’t be as obvious which hyperparameters should be changed and how.

initial_model = DensityModel(hidden_sizes=(16, 16), learning_rate=0.0003)
initial_model.fit(x_train, y_train, epochs=200, verbose=0)

[Figure: fit of the initial model]

Step 4: Tuning the hyperparameters

We use randomized search: sample 20 different parameter settings at random and evaluate their goodness of fit via 5-fold cross-validation. By setting n_jobs=-1, sklearn will spawn as many jobs as there are processors on your system.

import scipy.stats
from sklearn.model_selection import RandomizedSearchCV

# we take a few vague guesses at where our optimal hyperparameters might lie
param_grid = {
    'learning_rate' : scipy.stats.uniform(loc=0.0001, scale=0.05),
    'hidden_sizes': [(x, x) for x in range(5, 30, 5)] 
                    + [(x, ) for x in range(5, 30, 5)]
}
# here we use the Keras wrapper to make our model adhere to the sklearn API
cv = RandomizedSearchCV(
    estimator=tf.keras.wrappers.scikit_learn.KerasRegressor(build_fn=DensityModel.build_fn),
    param_distributions=param_grid,
    cv=5,
    n_jobs=-1,
    n_iter=20)
cv.fit(x_train, y_train, verbose=0, epochs=200)

print(cv.best_params_)

>> {'hidden_sizes': (25, 25), 'learning_rate': 0.024464187877681353}

According to our cross-validation, a bigger model and a higher learning rate lead to better performance.
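If you want to see more than just the single best setting, sklearn stores the full per-candidate results in cv.cv_results_. A small sketch of inspecting them, assuming pandas is available:

import pandas as pd

# full per-candidate cross-validation results from the randomized search
results = pd.DataFrame(cv.cv_results_)
print(results.sort_values('rank_test_score')[
    ['param_hidden_sizes', 'param_learning_rate',
     'mean_test_score', 'std_test_score']].head())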

Step 5: Testing the tuned model

tuned_model = DensityModel(**cv.best_params_)
tuned_model.fit(x_train, y_train, epochs=200, verbose=0)
plot_dist(x_train, y_train, x_test, tuned_model)

print('Initial model test loss: ' + str(initial_model.evaluate(x_test, y_test)))
print('Tuned model test loss: ' + str(tuned_model.evaluate(x_test, y_test)))

>> Initial model test loss: 2.2928287410736083
>> Tuned model test loss: 1.290901689529419

[Figure: fit of the tuned model]

The tuned model does a much better job of fitting the data and even captures the heteroscedasticity.
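plot_dist is a small helper defined in the linked notebook. A rough sketch of what such a function could look like (an assumption, not the notebook’s exact implementation), plotting the training data together with the predicted mean and a ±2 standard deviation band:

import matplotlib.pyplot as plt
import numpy as np

def plot_dist(x_train, y_train, x_test, model):
    # the model's output is a tfp distribution, so we can query mean and stddev
    dist = model(x_test)
    mean = dist.mean().numpy().ravel()
    std = dist.stddev().numpy().ravel()

    order = np.argsort(x_test.ravel())
    plt.scatter(x_train, y_train, s=5, alpha=0.5, label='training data')
    plt.plot(x_test.ravel()[order], mean[order], color='black', label='predicted mean')
    plt.fill_between(x_test.ravel()[order], (mean - 2 * std)[order],
                     (mean + 2 * std)[order], alpha=0.3, label='±2 std')
    plt.legend()
    plt.show()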

Further Notes

In Part 2 we’ll have a look at writing our own, more flexible grid search and using it to fit a mixture density network. If you have feedback about any of my posts, reach out to me via Twitter or email me at [firstname].[lastname]@mailbox.org!
