Python – How does Google Cloud Machine Learning handle large numbers of HDF5 files?

Here is the question, followed by a solution to the problem.

How does Google Cloud Machine Learning handle large numbers of HDF5 files?

I have about 5k raw data input files and 15k raw data test files, several GB in total. Since these are raw data files, I had to iterate over them in MATLAB to extract the features I want to train my actual classifier (a CNN) on, so I generated an HDF5 .mat file for each raw data file. I developed my model locally using Keras and a modified DirectoryIterator, in which I had something like this:

for i, j in enumerate(batch_index_array):
    arr = np.array(h5py.File(os.path.join(self.directory, self.filenames[j]), "r").get(self.variable))
    # process them further

The file structure is:

|--train
|    |--Class1
|    |    |-- 2.5k .mat files
|    |--Class2
|         |-- 2.5k .mat files
|--eval
|    |--Class1
|    |    |-- 2k .mat files
|    |--Class2
|         |-- 13k .mat files

This is the file structure I now have in my Google ML bucket. This worked natively in Python with small models, but now I want to take advantage of Google ML hyperparameter tuning because my model is much larger. The problem is that, as I read on the Internet, HDF5 files cannot be read directly and easily from Google ML storage. I tried to modify my script like this:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='r') as input_f:
        arr = np.array(h5py.File(input_f.read(), "r").get(self.variable))
        # process them further

But this gives me an error like UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte (just with a different hex byte and position 512).
I also tried something like this:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='rb') as input_f:
        arr = np.fromstring(input_f.read())
        # process them further

But it doesn’t work either.

Question
How do I modify my script so that it can read those HDF5 files in Google ML? I am aware of the data pickling approach, but the problem is that loading a pickle created from 15k files (a few GB) into memory doesn't seem very efficient.

Solution

HDF is a very common file format, but unfortunately it is not the best choice in the cloud. For some explanation of why, see this blog post.

Given the inherent complexity of HDF on the cloud, I recommend one of the following:

  1. Convert your data to another file format, such as CSV or tf.Example records in TFRecord files
  2. Copy the data to a local directory such as /tmp

Conversion can be inconvenient at best, and for some datasets gymnastics may be required. A cursory search on the Internet turned up several tutorials on how to do this; here is one you might refer to.
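For example, converting the per-file feature arrays to TFRecord might look roughly like the sketch below (TF 1.x style, to match the file_io import above). The directory layout and the stored variable name come from the question; the function name, the variable name 'my_variable', the label values, and the output paths are made up for illustration:

import os

import h5py
import numpy as np
import tensorflow as tf

def convert_class_dir_to_tfrecords(input_dir, variable, label, output_path):
    # Write one tf.Example per HDF5 .mat file into a single TFRecord file per class.
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for filename in sorted(os.listdir(input_dir)):
            if not filename.endswith('.mat'):
                continue
            with h5py.File(os.path.join(input_dir, filename), 'r') as f:
                features = np.asarray(f.get(variable), dtype=np.float32)
            example = tf.train.Example(features=tf.train.Features(feature={
                'features': tf.train.Feature(
                    float_list=tf.train.FloatList(value=features.ravel().tolist())),
                'shape': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=list(features.shape))),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())

# For example:
# convert_class_dir_to_tfrecords('train/Class1', 'my_variable', 0, 'train_class1.tfrecord')

Once converted, the TFRecord files can be read at training time with a standard tf.data.TFRecordDataset pipeline, which can read directly from gs:// paths.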

Again, there are multiple ways to copy the data to the machines running your job, but note that your job doesn't start any actual training until the data has been copied. Also, if one of the workers dies, it will have to copy all of the data again when it restarts, and if the master dies during distributed training, a lot of work can be lost.

That said, if you feel this is a viable approach for your situation (e.g. you’re not doing distributed training and/or you’re willing to wait for data transfer as described above), just start Python with something like this:

import json
import os
import subprocess

# TF_CONFIG is a JSON string describing the cluster; skip the copy on parameter servers.
if json.loads(os.environ.get('TF_CONFIG', '{}')).get('task', {}).get('type') != 'ps':
  subprocess.check_call(['mkdir', '-p', '/tmp/my_files'])
  subprocess.check_call(['gsutil', '-m', 'cp', '-r', 'gs://my/bucket/my_subdir', '/tmp/my_files'])
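
Once the copy has finished, the h5py loop from the question can read the files from local disk unchanged. A minimal sketch, assuming the gsutil command above placed the data under /tmp/my_files/my_subdir and that each file stores its features under a variable name such as 'my_variable':

import os

import h5py
import numpy as np

local_dir = '/tmp/my_files/my_subdir'  # where the gsutil copy above put the data

# Read every training file of one class from local disk, exactly as in the original loop.
class_dir = os.path.join(local_dir, 'train', 'Class1')
for filename in sorted(os.listdir(class_dir)):
    with h5py.File(os.path.join(class_dir, filename), 'r') as f:
        arr = np.array(f.get('my_variable'))  # 'my_variable' stands in for self.variable
        # process further / feed to the Keras generator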
