Python – How to create a list in a column in a pyspark data frame

I have a data frame with the following data:

df.show()

+-----+------+--------+
| id_A| idx_B| B_value|
+-----+------+--------+
|    a|     0|       7|
|    b|     0|       5|
|    b|     2|       2|
+-----+------+--------+

Assuming B has 3 possible indexes (0, 1 and 2), I want to create a table that, for each id_A, combines the values into a single list (or NumPy array) ordered by index, with 0 for any index that has no row:

final_df.show()

+-----+----------+
| id_A|  B_values|
+-----+----------+
|    a| [7, 0, 0]|
|    b| [5, 0, 2]|
+-----+----------+

I’ve managed to get this far:

from pyspark.sql import functions as f

temp_df = df.withColumn('B_tuple', f.struct(df['idx_B'], df['B_value']))\
            .groupBy('id_A').agg(f.collect_list('B_tuple').alias('B_tuples'))
temp_df.show()

+-----+-----------------+
| id_A|         B_tuples|
+-----+-----------------+
|    a|         [[0, 7]]|
|    b| [[0, 5], [2, 2]]|
+-----+-----------------+

But now I can’t work out how to write a UDF that converts temp_df to final_df.

Is there an easier way?

If not, what function should I use to complete the conversion?

Solution

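I found a solution: a UDF that starts from a list of zeros and writes each collected value into the slot given by its index.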

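from pyspark.sql.types import ArrayType, IntegerType

def create_vector(tuples_list, size=3):
    # Start from an all-zero list; size defaults to 3, the number of possible indexes of B.
    my_list = [0] * size
    # Each element is a struct with the fields idx_B and B_value.
    for x in tuples_list:
        my_list[x["idx_B"]] = x["B_value"]
    return my_list

create_vector_udf = f.udf(create_vector, ArrayType(IntegerType()))

final_df = temp_df.withColumn('B_values', create_vector_udf(temp_df['B_tuples'])).select(['id_A', 'B_values'])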

final_df.show()

+-----+----------+
| id_A|  B_values|
+-----+----------+
|    a| [7, 0, 0]|
|    b| [5, 0, 2]|
+-----+----------+
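
The fill logic inside the UDF can be checked without a Spark session. The sketch below is a plain-Python stand-in, not part of the original answer: it uses dicts in place of the struct rows Spark passes to the UDF (both support lookup by field name), and the function name fill_dense and the sample data are assumptions made for illustration.

```python
def fill_dense(pairs, size=3):
    """Return a list of length `size` with each B_value placed at its idx_B."""
    dense = [0] * size  # indexes with no matching pair keep the default 0
    for pair in pairs:
        dense[pair["idx_B"]] = pair["B_value"]
    return dense


# Hypothetical stand-in for the collected B_tuples column, keyed by id_A.
collected = {
    "a": [{"idx_B": 0, "B_value": 7}],
    "b": [{"idx_B": 0, "B_value": 5}, {"idx_B": 2, "B_value": 2}],
}

result = {key: fill_dense(pairs) for key, pairs in collected.items()}
print(result)
```

Because each value is written straight into the slot named by its index, the result does not depend on the order in which collect_list happens to return the tuples.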
