Using lookup tables to average vectors in PySpark
I’m trying to implement a simple Doc2Vec algorithm in PySpark, using a pre-trained GloVe model from https://nlp.stanford.edu/projects/glove/
I have two RDDs:
- A pair RDD called documents of the form (K, V), where K is the document ID and V is the document's text, for example:
('testDoc1', 'i am using spark')
('testDoc2', 'test spark')
- A pair RDD called words representing word embeddings, of the form (K, V), where K is a word and V is the vector representing that word, for example:
('i', [0.1, 0.1, 0.1])
('spark', [0.2, 0.2, 0.2])
('am', [0.3, 0.3, 0.3])
('test', [0.5, 0.5, 0.5])
('using', [0.4, 0.4, 0.4])
What is the correct way to iterate over the words in each document and compute the average of their vectors? For the example above, the end result would be:
('testDoc1', [0.25, 0.25, 0.25])
('testDoc2', [0.35, 0.35, 0.35])
Solution
Suppose you have a function tokenize that converts a string into a list of words. Then you can flatMap documents to get an RDD of (word, document id) tuples:
flattened_docs = documents.flatMap(lambda x: [(word, x[0]) for word in tokenize(x[1])])
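As a sanity check, here is the same flatMap logic on a plain Python list, with a hypothetical tokenize that just lowercases and splits on whitespace (the answer only assumes such a function exists):

```python
def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

documents = [('testDoc1', 'i am using spark'), ('testDoc2', 'test spark')]

# Plain-Python equivalent of documents.flatMap(...):
flattened_docs = [(word, doc_id)
                  for doc_id, text in documents
                  for word in tokenize(text)]
print(flattened_docs[:2])  # [('i', 'testDoc1'), ('am', 'testDoc1')]
```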
Joining with words then gives you (word, (document id, vector)) tuples, at which point you can drop the word:
doc_vectors = flattened_docs.join(words).values()
Note that this is an inner join, so words without embeddings are discarded. Since you presumably want those words to count toward the average, a left join may be more appropriate; you would then have to replace any resulting None with a zero vector (or whatever vector you choose).
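A sketch of that fallback without Spark, assuming 3-dimensional embeddings: with a left outer join, missing words come back as None, and here a dict lookup with a zero-vector default plays the same role.

```python
ZERO = [0.0, 0.0, 0.0]  # fallback vector for words with no embedding

words = {'i': [0.1, 0.1, 0.1], 'spark': [0.2, 0.2, 0.2]}
flattened_docs = [('i', 'testDoc1'), ('unknown', 'testDoc1'),
                  ('spark', 'testDoc2')]

# Stand-in for a left outer join followed by replacing None with ZERO:
doc_vectors = [(doc_id, words.get(word, ZERO))
               for word, doc_id in flattened_docs]
print(doc_vectors[1])  # ('testDoc1', [0.0, 0.0, 0.0])
```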
We can then group by document ID to get an RDD of (document id, [list of vectors]) and average (I'll assume you have a function called average):
final_vectors = doc_vectors.groupByKey().mapValues(average)
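The average function is only assumed by the answer; an element-wise mean could look like this (groupByKey yields an iterable of vectors per key, so it is materialized first):

```python
def average(vectors):
    # Element-wise mean of an iterable of equal-length vectors.
    vectors = list(vectors)  # groupByKey gives an iterable, not a list
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

print(average([[0.5, 0.5, 0.5], [0.2, 0.2, 0.2]]))  # [0.35, 0.35, 0.35]
```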
(Forgive the Scala-influenced Python. It's been a while since I've used PySpark, and I haven't checked whether it's flatMap or flat_map, etc.)
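Putting it all together without Spark, on the example data from the question (tokenization assumed to be a whitespace split), to confirm the expected averages:

```python
documents = [('testDoc1', 'i am using spark'), ('testDoc2', 'test spark')]
words = {'i': [0.1, 0.1, 0.1], 'spark': [0.2, 0.2, 0.2],
         'am': [0.3, 0.3, 0.3], 'test': [0.5, 0.5, 0.5],
         'using': [0.4, 0.4, 0.4]}

def average(vectors):
    # Element-wise mean of a list of equal-length vectors.
    vectors = list(vectors)
    return [sum(c) / len(vectors) for c in zip(*vectors)]

# flatMap + join + groupByKey + mapValues(average), as plain Python:
final_vectors = {}
for doc_id, text in documents:
    vectors = [words[w] for w in text.split() if w in words]
    final_vectors[doc_id] = average(vectors)

print(final_vectors)
# {'testDoc1': [0.25, 0.25, 0.25], 'testDoc2': [0.35, 0.35, 0.35]}
```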