Python – Unable to access RowMatrix methods in PySpark: columnSimilarities(), computeColumnSummaryStatistics()

I’m trying to use the RowMatrix methods columnSimilarities() and computeColumnSummaryStatistics().

  • In particular, the columnSimilarities() function mentioned in this article:

https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

I’m using a list of sparse vectors from MLlib.

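from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# df is a pandas DataFrame; sc is an existing SparkContext
num_cols = len(df[1].unique())
sparse_vectors = []

# Build one sparse vector per customer (grouping on column 0)
for cust, group in df.groupby(0):
    # Sort the (index, value) pairs so the indices are in increasing order
    i_v = sorted(zip(group[1].values, group[2].values))
    indices = [x[0] for x in i_v]
    values = [x[1] for x in i_v]
    sparse_vectors.append(Vectors.sparse(num_cols, indices, values))

rows = sc.parallelize(sparse_vectors)
mat = RowMatrix(rows)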

Every time I call either method, I get an error:

AttributeError: 'RowMatrix' object has no attribute 'computeColumnSummaryStatistics'

or

AttributeError: 'RowMatrix' object has no attribute 'columnSimilarities'

Is this a limitation of PySpark specifically, as opposed to Scala Spark? I also can’t find a documentation page for RowMatrix by searching.

Thanks

Solution

You can’t access these methods because they are not currently implemented in PySpark (Spark 1.6).
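For the summary statistics there is a workaround that, as far as I know, already exists in PySpark 1.6: Statistics.colStats in pyspark.mllib.stat works directly on an RDD of vectors, so no RowMatrix is needed. A minimal sketch, assuming `rows` is the RDD of sparse vectors from the question; the helper name column_summary is my own, not part of Spark:

```python
def column_summary(rows):
    """Per-column statistics for an RDD of MLlib vectors.

    `rows` is assumed to be the RDD built in the question
    (sc.parallelize(sparse_vectors)). The function name is hypothetical.
    """
    # Imported here so the module still loads where Spark is not installed.
    from pyspark.mllib.stat import Statistics

    # colStats makes one pass over the RDD and returns a
    # MultivariateStatisticalSummary object.
    summary = Statistics.colStats(rows)
    return {
        "mean": summary.mean(),                # per-column mean
        "variance": summary.variance(),        # per-column variance
        "numNonzeros": summary.numNonzeros(),  # non-zero count per column
    }
```

Calling column_summary(rows) should give roughly the information that computeColumnSummaryStatistics() provides on the Scala side.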

IndexedRowMatrix.columnSimilarities (see SPARK-12041) is available in the current master, but to use it you have to build Spark from source.
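On a build that includes SPARK-12041 (to my knowledge, that means Spark 2.0 and later), something along these lines should work. This is only a sketch: it assumes `rows` is the RDD of sparse vectors from the question, and the helper name column_similarities is my own:

```python
def column_similarities(rows):
    """Return (i, j, similarity) triples for the columns of an RDD of vectors.

    `rows` is assumed to be the RDD built in the question.
    The function name is hypothetical, not part of Spark.
    """
    # Imported here so the module still loads where Spark is not installed.
    from pyspark.mllib.linalg.distributed import IndexedRowMatrix

    # IndexedRowMatrix needs an explicit row index; zipWithIndex yields
    # (vector, index) pairs, which are flipped to (index, vector).
    indexed = rows.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    mat = IndexedRowMatrix(indexed)

    # columnSimilarities() returns a CoordinateMatrix holding the
    # upper-triangular cosine similarities between columns.
    sims = mat.columnSimilarities()

    # Collecting is only sensible when the number of columns is small.
    return sims.entries.map(lambda e: (e.i, e.j, e.value)).collect()
```

For a wide matrix you would probably keep sims.entries as an RDD instead of collecting it to the driver.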
