Python – Use lookup tables to average vectors in PySpark

Use lookup tables to average vectors in PySpark

I’m trying to use a pre-trained GloVe model from https://nlp.stanford.edu/projects/glove/ to implement a simple Doc2Vec algorithm in PySpark.

I have two RDDs:

  • A pair RDD named documents of the form (K, [V]), where K is the document ID and [V] is a list of all the words in the document, for example:

    ('testDoc1', 'i am using spark')
    ('testDoc2', 'test spark')

  • A pair RDD named words representing word embeddings, of the form (K, V), where K is a word and V is the vector representing that word, for example:

    ('i', [0.1, 0.1, 0.1])
    ('am', [0.3, 0.3, 0.3])
    ('using', [0.4, 0.4, 0.4])
    ('spark', [0.2, 0.2, 0.2])
    ('test', [0.5, 0.5, 0.5])

What is the correct way to iterate over the words in documents to get the average of the word vectors for each document? In the example above, the end result would look like this:

('testDoc1', [0.25, 0.25, 0.25])
('testDoc2', [0.35, 0.35, 0.35])
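
For concreteness, the two RDDs described above could be built roughly like this (a minimal sketch; sc is assumed to be an existing SparkContext):

    documents = sc.parallelize([
        ('testDoc1', 'i am using spark'),
        ('testDoc2', 'test spark'),
    ])

    words = sc.parallelize([
        ('i', [0.1, 0.1, 0.1]),
        ('am', [0.3, 0.3, 0.3]),
        ('using', [0.4, 0.4, 0.4]),
        ('spark', [0.2, 0.2, 0.2]),
        ('test', [0.5, 0.5, 0.5]),
    ])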

Solution

Suppose you have a function tokenize that converts a string into a list of words. Then you can flatMap documents to get an RDD of (word, document id) tuples:

flattened_docs = documents.flatMap(lambda x: [(word, x[0]) for word in tokenize(x[1])])
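
Since tokenize is left to the reader, a minimal version that matches the example data might simply lowercase the text and split on whitespace (a sketch, not part of the original answer):

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # A real implementation would also handle punctuation, etc.
    return text.lower().split()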

Then joining with words will give you (word, (document id, vector)) tuples, at which point you can drop the word:

doc_vectors = flattened_docs.join(words).values()

Note that this is an inner join, so any word that has no embedding is discarded. Since you presumably want those words to count toward the mean, a left outer join may be more appropriate; you would then have to replace any resulting None with the zero vector (or whatever vector you choose).
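
If you take the left-outer-join route, a sketch of that variant might look like this (ZERO_VECTOR is a placeholder of your choosing; 3-dimensional here to match the example embeddings):

ZERO_VECTOR = [0.0, 0.0, 0.0]  # fallback for words with no embedding

doc_vectors = (
    flattened_docs
    .leftOuterJoin(words)   # (word, (doc_id, vector or None))
    .values()               # (doc_id, vector or None)
    .mapValues(lambda v: v if v is not None else ZERO_VECTOR)
)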

We can then group by document ID to get an RDD of (document id, [list of vectors]) and average those vectors (I assume you have a function called average):

final_vectors = doc_vectors.groupByKey().mapValues(average)
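
The average helper is likewise left undefined; one possible implementation (an assumption here, using NumPy) takes the element-wise mean of the grouped vectors:

import numpy as np

def average(vectors):
    # vectors is an iterable of equal-length lists; return their element-wise mean
    return np.mean(list(vectors), axis=0).tolist()

With the example data above, final_vectors.collect() should then return something like [('testDoc1', [0.25, 0.25, 0.25]), ('testDoc2', [0.35, 0.35, 0.35])] (ordering may vary).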

(Pardon my Scala-influenced Python. It’s been a while since I last used PySpark, and I haven’t checked whether it’s flatMap or flat_map, etc.)
