Python – How to normalize data when using Keras fit_generator


How to normalize data when using Keras fit_generator

I have a very large dataset and I’m using Keras’ fit_generator to train a Keras model (TensorFlow backend). My data needs to be normalized across the entire dataset, but fit_generator only gives me access to relatively small batches, and normalizing within each small batch is not the same as normalizing over the whole dataset. The impact is considerable (I tested it, and the model’s accuracy dropped significantly).

My question is: what is the right way to normalize the data over the entire dataset when using Keras’ fit_generator? A final note: my data is a mix of text and numeric data, not images, so I can’t use the features of Keras’ image data generator that might solve this for image data.

I’ve looked at normalizing the entire dataset before training (a “brute force” approach I guess), but I’m wondering if there’s a more elegant way to do this.

Solution

The generator does allow you to process the data on the fly, but preprocessing the data before training is the preferred method:

  1. Preprocessing and saving the data avoids reprocessing it every epoch; during training you should only perform lightweight operations that can be applied per batch. For example, one-hot encoding and tokenizing sentences are common transformations that can be done offline.
  2. When you adjust and fine-tune your model, you don’t want the overhead of re-normalizing the data, and you want to ensure that every model variant is trained on the same normalized data.
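As a minimal sketch of point 1, here is offline preprocessing of mixed categorical and numeric data with pandas (the column names and values are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical raw dataset: one categorical column, two numeric columns.
df = pd.DataFrame({
    "category": ["red", "blue", "red", "green"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
})

# One-hot encode the categorical column once, offline.
encoded = pd.get_dummies(df, columns=["category"])

# Normalize numeric columns using statistics from the WHOLE dataset,
# not per batch.
for col in ["x1", "x2"]:
    encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std()

# Save the preprocessed data; fit_generator then just reads batches.
# encoded.to_csv("train_preprocessed.csv", index=False)
```

The generator then only has to slice rows out of the saved file, which keeps per-epoch work minimal.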

Therefore, preprocess once offline before training and save the result as your training data. At prediction time, you can apply the same transformations to new samples on the fly.
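One way to get dataset-wide normalization without loading everything into memory is a streaming pass with scikit-learn's `StandardScaler.partial_fit`, then reusing the fitted scaler both in the training generator and at prediction time. A sketch, with random arrays standing in for chunks read from disk:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for a large dataset; in practice you would read these
# chunks from disk rather than hold the full array in memory.
data = np.random.rand(1000, 5)
labels = np.random.randint(0, 2, size=1000)

# One streaming pass over the WHOLE dataset to accumulate the global
# mean and variance, chunk by chunk.
scaler = StandardScaler()
for chunk in np.array_split(data, 10):
    scaler.partial_fit(chunk)

# Generator for fit_generator: every batch is normalized with the
# same global statistics, not with batch-local ones.
def batch_generator(X, y, batch_size=32):
    while True:
        idx = np.random.randint(0, len(X), size=batch_size)
        yield scaler.transform(X[idx]), y[idx]

# At prediction time, persist and reuse the fitted scaler, e.g.:
# joblib.dump(scaler, "scaler.joblib")  # save alongside the model
```

Because `partial_fit` accumulates exact running statistics, the result matches fitting the scaler on the full dataset at once, and the saved scaler guarantees that training and prediction use identical normalization.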
