Python – The Gensim Doc2Vec most_similar() method did not work as expected

The Gensim Doc2Vec most_similar() method did not work as expected… here is a solution to the problem.

The Gensim Doc2Vec most_similar() method did not work as expected

I’m struggling with Doc2Vec and I don’t see what I’m doing wrong.
I have a text file with sentences. I would like to know what is the closest sentence we can find in that file for a given sentence.

Here’s the code to create the model:

sentences = LabeledLineSentence(filename)

model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)

Here is my file for testing purposes:

uduidhud duidihdd
dsfsdf sdf sddfv
dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg

Here is my test:

test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))

Regardless of the training parameters, this should obviously tell me that the most similar sentence is the 4th one (SENT_3 or SENT_4; I don't know how the indexes work, but the sentence labels have this form). But here is the result:

[('SENT_0', 0.15669342875480652),
 ('SENT_2', 0.0008485736325383186),
 ('SENT_4', -0.009077289141714573)]

What am I missing? If I test with a sentence that is actually in the file (i like dogs), I get SENT_2, then 1, then 4… I really don't understand. Why are the similarity scores so low? And when I load and run it several times in a row, I don't get the same results.

Thank you for your help

Solution

Doc2Vec does not work well on toy-sized examples. (Published work uses tens of thousands to millions of texts, and even the tiny unit tests inside Gensim use hundreds of texts, combined with smaller vector sizes and many more training epochs, to get barely reliable results.)

So I wouldn't expect your code to produce consistent or meaningful results. This is especially true when it:

  • Uses a large vector size with tiny data (which leads to severe model overfitting).
  • Uses min_count=1 (words without many varied usage examples don't get good vectors).
  • Keeps min_alpha equal to the larger starting alpha (the stochastic gradient descent learning algorithm relies on a gradual decay of this update rate for its usually beneficial behavior).
  • Uses documents with only a few words (document vectors get training effort in proportion to the number of words they contain).

Finally, even when everything else works, infer_vector() usually benefits from more steps than its default of 5 (tens or hundreds), and sometimes from a starting alpha closer to the training default (0.025) than to its inference default (0.1).

So:

  • Don't change min_count or min_alpha.
  • Get much more data.
  • If you don't have tens of thousands of texts, use a smaller vector size and more epochs (but realize results on small datasets may still be weak).
  • If each text is short, use more epochs (but realize results may still be weaker than with longer texts).
  • Try other infer_vector() arguments, such as steps=50 or more (especially for short texts) and alpha=0.025.
