How to create a list in a column in a PySpark data frame
I have a data frame with the following data:
df.show()
+-----+------+--------+
| id_A| idx_B| B_value|
+-----+------+--------+
|    a|     0|       7|
|    b|     0|       5|
|    b|     2|       2|
+-----+------+--------+
Assuming idx_B has 3 possible indexes (0 to 2), I want to create a table with one list (or numpy array) per id_A that holds the B_value at each index, with 0 where an index is missing:
final_df.show()
+-----+----------+
| id_A|  B_values|
+-----+----------+
|    a| [7, 0, 0]|
|    b| [5, 0, 2]|
+-----+----------+
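To make the target concrete, here is a minimal plain-Python sketch of the mapping, run outside Spark on the three example rows. The names `rows`, `SIZE`, and `to_dense` are illustrative and not part of the original question, and the sketch assumes every idx_B lies between 0 and SIZE - 1.

```python
from collections import defaultdict

# Rows of the example table as (id_A, idx_B, B_value).
rows = [("a", 0, 7), ("b", 0, 5), ("b", 2, 2)]
SIZE = 3  # assumed number of possible idx_B values


def to_dense(rows, size):
    """Group rows by id_A and write each value into a zero-filled list."""
    dense = defaultdict(lambda: [0] * size)
    for id_a, idx, value in rows:
        dense[id_a][idx] = value
    return dict(dense)


result = to_dense(rows, SIZE)
print(result)  # {'a': [7, 0, 0], 'b': [5, 0, 2]}
```

The Spark version has to do the same two steps, grouping by id_A and then scattering values by index, but on distributed data.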
I’ve managed to get to this point:
from pyspark.sql import functions as f
temp_df = df.withColumn('B_tuple', f.struct(df['idx_B'], df['B_value']))\
    .groupBy('id_A').agg(f.collect_list('B_tuple').alias('B_tuples'))
temp_df.show()
+-----+-----------------+
| id_A|         B_tuples|
+-----+-----------------+
|    a|         [[0, 7]]|
|    b| [[0, 5], [2, 2]]|
+-----+-----------------+
But now I can’t get the proper UDF to convert temp_df to final_df.
Is there an easier way?
If not, what function should I use to complete the conversion?
Solution
I found a solution: a UDF that starts from a zero-filled list and writes each B_value at its idx_B position.
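from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, IntegerType

def create_vector(tuples_list, size):
    # Start from zeros, then write each B_value at its idx_B position.
    my_list = [0] * size
    for x in tuples_list:
        my_list[x["idx_B"]] = x["B_value"]
    return my_list

create_vector_udf = f.udf(create_vector, ArrayType(IntegerType()))
# UDF arguments must be columns, so the size (3) is passed via f.lit.
final_df = temp_df.withColumn('B_values', create_vector_udf(temp_df['B_tuples'], f.lit(3))).select(['id_A', 'B_values'])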
final_df.show()
+-----+----------+
| id_A|  B_values|
+-----+----------+
|    a| [7, 0, 0]|
|    b| [5, 0, 2]|
+-----+----------+