Keras memory leak

Keras memory usage keeps increasing

I was having fun attempting some deep learning on a 2M-line dataset (nothing my computer can’t handle: xgboost was running with roughly 15% of my RAM) when suddenly, as I added neural networks to my fancy stacked models, the script kept failing and the memory usage went to the moon.

What did I do wrong? Did I introduce a memory leak somewhere between my model stacking and my neural network factory code? I would be surprised: it worked fine with every other model. Besides, a neural network is more or less a simple vector of floats (in my case, only hundreds of parameters), so there is no reason for it to take up that much space.

The only thing I was attempting to do was to cross-validate different neural networks with different architectures.

So, after some quick research, I found this Stack Overflow question, as well as people mentioning a weird behavior coming from model.predict(). Another GitHub issue is simply called Memory leak. There is even an article titled Dealing with memory leak issue in Keras model training, which is itself mentioned on Twitter.

What I ended up suspecting is that there are actually many memory leaks from different methods in the code. So I gathered the list of workarounds I could find.

Workarounds

Beware: none of them actually fixes the problem. Some just alleviate the pain, but most likely the memory usage will keep increasing. Anyway, the good news is that, by combining several of these tricks, I managed to get my models to run ;)

Garbage collecting

Generally, when you see the lines below in a codebase, it means that the person who wrote them was desperate to make the script run, closely monitoring its memory usage and combining tricks to make everything fit into memory. Usually, performing tasks in dedicated functions and trusting the garbage collector to do its job at the right time is enough. But sometimes you meet these random del / garbage collector invocations.

import gc
from tensorflow.keras import backend as K

del model          # drop the last reference to the model
gc.collect()       # force a garbage collection pass
K.clear_session()  # reset the global Keras/TensorFlow state

I put these lines after every model.fit() call I could find. They did not help at all in my case.
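
To give an idea of where these lines go, here is a minimal sketch of a cross-validation loop with the cleanup after every fit; the build_model factory is a placeholder for whatever builds your compiled model, and scikit-learn's KFold is just one possible splitter:

import gc

from sklearn.model_selection import KFold
from tensorflow.keras import backend as K


def cross_validate(build_model, X, y, epochs=8, n_splits=5):
    """Fit a fresh model per fold and clean up Keras state after every fit."""
    scores = []
    for train_idx, valid_idx in KFold(n_splits=n_splits).split(X):
        model = build_model(X.shape[1])  # build_model: any factory returning a compiled Keras model
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        scores.append(model.evaluate(X[valid_idx], y[valid_idx], verbose=0))

        # the cleanup incantation, after every fit
        del model
        gc.collect()
        K.clear_session()
    return scores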

Force eager evaluation

This one kind of worked for me. It slows down the training (3 times slower in my case), and the memory still keeps increasing for no reason, but much less. Just add the following argument to the model.compile() call:

model.compile([...],  # your existing arguments
              run_eagerly=True)
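
In context, it looks like this; the toy two-layer architecture is only there to make the snippet self-contained:

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# a small, made-up binary classifier, compiled with eager execution forced
model = Sequential([Dense(8, input_dim=20, activation="relu"),
                    Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"],
              run_eagerly=True)  # skip graph compilation; slower, but leaks less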

model(x) instead of model.predict(x)

Some people mentioned it. It did not change a thing for me, but I kept writing it that way. Be careful though: model(x) returns a TensorFlow tensor while model.predict(x) returns a NumPy array.
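
Here is a tiny, self-contained illustration of the difference, with the .numpy() call you can use to convert back when the rest of your pipeline expects arrays:

import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential([Dense(1, input_dim=3, activation="sigmoid")])
x = np.random.rand(4, 3).astype("float32")

preds_tensor = model(x)           # tf.Tensor (EagerTensor)
preds_array = model.predict(x)    # numpy.ndarray

back_to_numpy = preds_tensor.numpy()  # convert when arrays are needed downstream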

Run it in a dedicated script

Yes, it is kind of ugly. It does not solve the issue, but if you run each cross-validation in its own Python script, invoked from the terminal, you can pass parameters using JSON files and hope that each run won’t hit your memory limit.

In my case, I wrote the following class:

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
import tensorflow as tf
import gc
import numpy as np

class NNModel:

    def __init__(self, architecture, epochs, loss="binary_crossentropy", optimizer="adam"):
        self._epochs = epochs
        self._loss = loss
        self._optimizer = optimizer
        self._architecture = architecture
        self._model = None

    def fit(self, X, y):

        # release anything left over from a previous fit before building a new model
        gc.collect()
        tf.keras.backend.clear_session()

        self._model = self._model_factory(X.shape[1])

        X_tf = tf.convert_to_tensor(X, dtype=tf.float32)
        y_tf = tf.convert_to_tensor(y, dtype=tf.float32)

        self._model.fit(X_tf, y_tf, epochs=self._epochs)
        return self

    def _model_factory(self, input_dim):

        model = Sequential()

        architecture = self._architecture.copy()
        first_layer = architecture.pop(0)

        model.add(Dense(first_layer[0], input_dim=input_dim, activation=first_layer[1]))
        for layer in architecture:
            model.add(Dense(layer[0], activation=layer[1]))

        model.compile(loss=self._loss, 
                      optimizer=self._optimizer, 
                      run_eagerly=True,
                      metrics=['accuracy'])

        return model

    def predict(self, X):
        raise NotImplementedError

    def predict_proba(self, X):
        X_tf = tf.convert_to_tensor(X, dtype=tf.float32)
        res = self._model(X_tf)  # model(x) instead of model.predict(x), see above
        res = np.hstack((1 - res, res))  # sklearn-style (n_samples, 2) probabilities
        return res

I can configure it using a JSON file that contains the arguments of the class constructor:

{
  "epochs": 8,
  "architecture": [[ 12, "relu" ], [ 8, "relu" ], [ 1, "sigmoid" ]]
}

And then I invoke them with:

find ../models/ -name \*.json | xargs --max-args=1 python run_nn.py

This way I can run my different models and be sure that the memory is fully released between two script executions.
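
The actual run_nn.py is not reproduced here; below is a rough sketch of what such a driver could look like (the nn_model import and the random data are placeholders for my real module and dataset):

# run_nn.py (sketch): load one JSON config, train one model, then exit so
# the OS reclaims every byte when the process dies.
import json
import sys

import numpy as np

from nn_model import NNModel  # hypothetical module holding the class above


def main(config_path):
    with open(config_path) as f:
        params = json.load(f)

    # placeholder data loading; replace with your actual dataset
    X = np.random.rand(1000, 20).astype("float32")
    y = np.random.randint(0, 2, size=1000).astype("float32")

    model = NNModel(**params)
    model.fit(X, y)
    print(config_path, model.predict_proba(X).mean())


if __name__ == "__main__":
    main(sys.argv[1])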

model.predict_on_batch

Quoting MProx from a GitHub issue:

I have managed to get around this error by using model.predict_on_batch() instead of model.predict(). This returns an object of type <class ‘tensorflow.python.framework.ops.EagerTensor’> - not a numpy array as claimed in the docs - but it can be cast by calling np.array(model.predict_on_batch(input_data)) to get the output I want.

Side note: I also noticed a similar memory leak problem with calling model.fit() in a loop, albeit with a slower memory accumulation, but this can be fixed in a similar way using model.train_on_batch().

I did not try this one, as segregating different models in different scripts and setting run_eagerly did the job.
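
For reference, the cast from the quote boils down to something like this (untested on my side, and the toy model is just for illustration):

import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential([Dense(1, input_dim=3, activation="sigmoid")])
model.compile(loss="binary_crossentropy", optimizer="adam")
batch = np.random.rand(32, 3).astype("float32")

# predict_on_batch may hand back an EagerTensor rather than a numpy array,
# so cast it explicitly, as MProx suggests
preds = np.array(model.predict_on_batch(batch))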

Use tf-nightly

So, tf-nightly is built more or less every day, with the latest features and fewer tests. Many people claimed that the leak disappeared when using it. But there are many nightly versions, potentially with other bugs.
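
If you want to give it a try, the nightly build installs like any other package; do it in a fresh virtual environment, since it replaces the stable tensorflow package:

pip uninstall tensorflow
pip install tf-nightly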

Reinstall the 1.14 version

This bug has been around for a while: some tickets mention it as far back as October 2019, and it is still present in version 2.4.
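
For completeness, downgrading is a single pip command, but keep in mind that 1.x exposes the old session-based API (so TF2-style code like the class above will not run unmodified) and that 1.14 wheels only exist for older Python versions:

pip install tensorflow==1.14.0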

Conclusion

I look forward to this issue being solved.
