# Hyperparameter tuning for Keras models (Part 2)

June 8, 2019

In the first part of this post we used scikit-learn and its implementation of grid search and random search to find the optimal settings for our hyperparameters. While that was quick to implement, it lacked some flexibility, for example in which metrics could be recorded, and it still wasn't completely safe against memory overflow issues. To improve on this, here I provide a second approach using Python's multiprocessing library and my own implementation of grid search. To illustrate, we'll do some regression and fit a Mixture Density Network to some bimodal data.

# Hyperparameter tuning using a process pool

We'll be using multiprocessing and a process pool. This solves many possible memory problems: each process is killed and restarted after completing one step of the cross-validation, which frees the associated memory.

The notebook with the full code for this post can be found here.

## Step 1: Getting some data

To make things slightly more difficult this time, we’ll use data generated by a bimodal conditional distribution.
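The notebook isn't reproduced here, so the exact generator is an assumption; one simple way to get such data is to flip a coin per sample and draw y from one of two branches, which makes the conditional distribution p(y | x) bimodal:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x_data = rng.uniform(-1.0, 1.0, size=(n, 1)).astype("float32")
# each sample comes from one of two branches with equal probability,
# so p(y | x) has two modes
branch = rng.integers(0, 2, size=(n, 1))
noise = rng.normal(0.0, 0.1, size=(n, 1))
y_data = (np.where(branch == 0,
                   np.sin(np.pi * x_data),
                   -np.sin(np.pi * x_data)) + noise).astype("float32")
```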

## Step 2: Defining the model

To deal with the bimodality of our data, we'll be using a Mixture Density Network. This model was first described by Christopher Bishop in this paper. It's an elegant way to model data characteristics like multimodality and heteroscedasticity: a dense neural network that parametrizes a mixture distribution, in our case a mixture of three Normal distributions.

```
import tensorflow as tf
import tensorflow_probability as tfp

class MixtureDensityNetwork(tf.keras.Sequential):
    def __init__(self, hidden_sizes=(10, 10), learning_rate=0.001):
        layers = [tf.keras.layers.Dense(size, activation="relu")
                  for size in hidden_sizes]
        # the output layer that will parametrize the mixture distribution
        layers += [tf.keras.layers.Dense(9, activation="linear")]
        # a distribution layer, transforming the output of the net into a tf distribution
        layers += [
            tfp.layers.DistributionLambda(
                lambda t: tfp.distributions.MixtureSameFamily(
                    mixture_distribution=tfp.distributions.Categorical(
                        probs=tf.expand_dims(tf.nn.softmax(t[..., 0:3]), axis=1)),
                    components_distribution=tfp.distributions.Normal(
                        loc=tf.expand_dims(t[..., 3:6], axis=1),
                        scale=tf.expand_dims(tf.nn.softplus(t[..., 6:9]), axis=1))
                )
            )
        ]
        super().__init__(layers)
        # for the loss we use the negative log likelihood of the data
        self.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate),
            loss=lambda y, p_y: -p_y.log_prob(y)
        )
```
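To make the loss concrete, here is a small NumPy sketch (my own, not from the notebook) of the same negative log likelihood for a single 9-dimensional output vector t, using the same slicing as the DistributionLambda above: mixture logits in t[0:3], component means in t[3:6], and softplus-transformed scales in t[6:9].

```python
import numpy as np

def mixture_nll(t, y):
    """Negative log likelihood of y under the 3-component Gaussian mixture
    parametrized by the 9-dimensional network output t."""
    logits, locs, raw_scales = t[0:3], t[3:6], t[6:9]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over the mixture weights
    scales = np.log1p(np.exp(raw_scales))  # softplus keeps the scales positive
    comp_densities = (np.exp(-0.5 * ((y - locs) / scales) ** 2)
                      / (scales * np.sqrt(2.0 * np.pi)))
    # marginal density of y is the probability-weighted sum over components
    return -np.log(np.sum(probs * comp_densities))
```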

## Step 3: Getting an initial fit

Let’s set the hyperparameters using a vague guess and see what we get.

```
initial_model = MixtureDensityNetwork()
initial_model.fit(x_data, y_data, verbose=0, epochs=200)
```

To get an intuition, this is a plot of the output distribution over the input space. As we can see, it’s already not too far off.

## Step 4: Defining the parameter grid

We'll implement a grid search. To do that, let's first implement a Python iterator that yields the cartesian product of all the parameter values we want to try out. Be careful about which parameters to tune over, as the cartesian product grows exponentially with the number of parameters included in the grid search.

```
import itertools

class ParamProductIterator:
    def __init__(self, param_grid):
        self._keys = sorted(param_grid.keys())
        self._iterator = itertools.product(*[param_grid[key] for key in self._keys])

    def __iter__(self):
        return self

    def __next__(self):
        return {key: value for key, value in
                zip(self._keys, next(self._iterator))}
```
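As a quick sanity check of the logic, here is the same construction done eagerly on a hypothetical two-by-two grid; the iterator yields one dict per point on the grid:

```python
import itertools

param_grid = {"hidden_sizes": [(10, 10), (20, 20)],
              "learning_rate": [0.1, 0.01]}
keys = sorted(param_grid)
# one dict per grid point, 2 * 2 = 4 in total
combos = [dict(zip(keys, values))
          for values in itertools.product(*[param_grid[k] for k in keys])]
print(combos[0])  # {'hidden_sizes': (10, 10), 'learning_rate': 0.1}
```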

## Step 5: Running the hyperparameter tuning

Now we can define a function that, given a set of parameters, performs a cross-validation over the training data and computes the mean loss.

```
import sklearn.model_selection

fold = sklearn.model_selection.KFold(n_splits=3)

def eval_model(param_dict):
    results = []
    for train_index, test_index in fold.split(x_data, y_data):
        tf.keras.backend.clear_session()
        model = MixtureDensityNetwork(
            hidden_sizes=param_dict['hidden_sizes'],
            learning_rate=param_dict['learning_rate'])
        model.fit(x_data[train_index], y_data[train_index], epochs=200, verbose=0)
        results.append(model.evaluate(x_data[test_index], y_data[test_index]))
    return sum(results) / float(len(results))
```
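To see the fold structure without training anything, here is KFold on a toy index array (a sketch, assuming scikit-learn is available): without shuffling, it splits the indices into three consecutive test folds, each serving once as the held-out set.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.arange(9)
fold = KFold(n_splits=3)
for train_index, test_index in fold.split(x):
    # each fold's test indices are a consecutive third of the data
    print(test_index)
```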

We spawn a pool of processes and use them to evaluate the outcome of our scoring function for every parameter setting returned by the iterator.

```
from multiprocessing import Pool

param_grid = {"hidden_sizes": [(10, 10), (20, 20), (30, 30)],
              "learning_rate": [0.1, 0.01, 0.001]}
param_iterator = ParamProductIterator(param_grid)
# Spawn a pool of four processes, each one being killed and respawned
# after every task to deal with memory overflow issues
with Pool(processes=4, maxtasksperchild=1) as p:
    result = p.map(eval_model, param_iterator)
print(list(ParamProductIterator(param_grid))[result.index(min(result))])
>> {'hidden_sizes': (30, 30), 'learning_rate': 0.001}
```

## The final result

Let's compare our tuned model with our initial guess.

```
print('Initial model test loss: ' + str(initial_model.evaluate(x_test, y_test)))
print('Tuned model test loss: ' + str(tuned_model.evaluate(x_test, y_test)))
>> Initial model test loss: 0.9130306644439697
>> Tuned model test loss: 0.4483817667961121
```

The loss improved significantly through tuning, which is also noticeable in the plot.

## Summary

- Structured hyperparameter tuning is a necessary step in optimizing the performance of your model. ("Grad student descent" is the scientific term for unstructured hyperparameter tuning.)
- The sklearn GridSearchCV and RandomizedSearchCV modules are fast to implement and often sufficient.
- If you want more flexibility or are struggling with memory overflow issues, using a process pool is a clean and "pythonic" way of implementing grid search.

If you have feedback about any of my posts, reach out to me via Twitter or email me at [firstname].[lastname]@mailbox.org!