Python – Group lists of numpy arrays based on shape. Pandas?


I have some instances of classes that contain numpy arrays.

import numpy as np
import os.path as osp
class Obj:
    def __init__(self, file):
        self.file = file                    # full path to the file
        self.data = np.fromfile(file)       # array loaded from the file
        self.basename = osp.basename(file)  # file name without the folder

I have a list of such objects and I want to group them by shape. I can do this by sorting:

obj_list = [obj1, obj2, ..., objn]
obj_list.sort(key=lambda obj: obj.data.shape)

Now I have a second list, obj_list_2. Its objects are initialized from different files, but the resulting arrays have the same shapes as those in the first list (in a different order), and the basenames are the same.

To clarify: these files are loaded from different folders. Each folder contains the same files, to which I applied different preprocessing.

If I sort both lists using the method shown above, objects with the same shape can end up in a different order in each list.

I want both lists to be sorted by shape and aligned according to their basename.

I want to sort first by shape and then by basename (or a function of it), something like:

obj_list.sort(key=lambda obj: obj.data.shape)
obj_list.sort(key=lambda obj: obj.basename)

However, the second sort can undo the first. The two should somehow be done together.

My ultimate goal is to extract the objects with the same shape and the same basename from both lists.

I've tried using pandas, but I'm not very familiar with it.
First, I align the lists by basename, then build a list of lists and pass it to pandas.

import pandas as pd
obj_list_of_list = [obj_list1, obj_list2]
obj_df = pd.DataFrame.from_records(obj_list_of_list)

What is missing is grouping them by shape and extracting the different groups.

Solution

You can use collections.defaultdict to build a dictionary that maps (basename, shape) to a list of objects:

from collections import defaultdict

d = defaultdict(list)

obj_list = [obj1, obj2, ..., objn]

for obj in obj_list:
    d[(obj.basename, obj.data.shape)].append(obj)
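
To pull out the objects that share both shape and basename across the two lists, one option is to build a dictionary like this for each list and intersect the keys. The following is only a sketch under stated assumptions: `make_obj`, `index_by_key` and `pair_up` are hypothetical helper names, and the `SimpleNamespace` objects are stand-ins for real `Obj` instances (which would hold a numpy array in `.data`). It assumes the key is `(basename, shape)`, since `basename` is the attribute the class in the question defines.

```python
from collections import defaultdict
from types import SimpleNamespace


def make_obj(basename, shape, folder):
    # Hypothetical stand-in for Obj: only .file, .basename and .data.shape are modelled.
    return SimpleNamespace(
        file=f"{folder}/{basename}",
        basename=basename,
        data=SimpleNamespace(shape=shape),
    )


def index_by_key(objs):
    # Map (basename, shape) -> list of objects sharing that key.
    index = defaultdict(list)
    for obj in objs:
        index[(obj.basename, obj.data.shape)].append(obj)
    return index


def pair_up(list_a, list_b):
    # Keep only the keys present in both lists.
    # Each value is (objects from list_a, objects from list_b).
    index_a, index_b = index_by_key(list_a), index_by_key(list_b)
    common = index_a.keys() & index_b.keys()
    return {key: (index_a[key], index_b[key]) for key in common}


obj_list_1 = [make_obj("a.bin", (4, 4), "raw"), make_obj("b.bin", (2, 8), "raw")]
obj_list_2 = [make_obj("b.bin", (2, 8), "filtered"), make_obj("a.bin", (4, 4), "filtered")]

pairs = pair_up(obj_list_1, obj_list_2)
for key in sorted(pairs):
    from_1, from_2 = pairs[key]
    print(key, from_1[0].file, from_2[0].file)
```

Because the lookup goes through the key, the order of the two input lists does not matter.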

Alternatively, you can group by shape alone:

d_shape = defaultdict(list)

for obj in obj_list:
    d_shape[obj.data.shape].append(obj)

You can then get the unique shapes via d_shape.keys() and the list of objects for a given shape via d_shape[some_shape]. The benefit of this approach is that grouping takes O(n) time, whereas sorting takes O(n log n).
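
If a sorted order is still needed, the two sorts from the question can be combined into one by using a tuple as the key. Python compares tuples element by element, so objects are ordered by shape first and by basename within each shape. A small sketch, where `make_obj` and `shape_then_name` are hypothetical names and the `SimpleNamespace` objects stand in for real `Obj` instances:

```python
from types import SimpleNamespace


def make_obj(basename, shape):
    # Hypothetical stand-in for Obj; a real instance would hold a numpy array in .data.
    return SimpleNamespace(basename=basename, data=SimpleNamespace(shape=shape))


def shape_then_name(obj):
    # Tuples compare element by element: shape decides first, basename breaks ties.
    return (obj.data.shape, obj.basename)


obj_list_1 = [make_obj("b.bin", (4, 4)), make_obj("c.bin", (2, 8)), make_obj("a.bin", (4, 4))]
obj_list_2 = [make_obj("a.bin", (4, 4)), make_obj("b.bin", (4, 4)), make_obj("c.bin", (2, 8))]

obj_list_1.sort(key=shape_then_name)
obj_list_2.sort(key=shape_then_name)

# Both lists now follow the same order, so zip() lines up matching objects.
for first, second in zip(obj_list_1, obj_list_2):
    print(first.data.shape, first.basename, second.basename)
```

This only aligns the lists when both contain exactly the same (shape, basename) combinations. If either list may have missing or extra files, the dictionary approach is the safer choice.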
