Gensim Doc2Vec most_similar() does not work as expected
I’m struggling with Doc2Vec and I don’t see what I’m doing wrong.
I have a text file with sentences. I would like to know what is the closest sentence we can find in that file for a given sentence.
Here’s the code to create the model:
```python
from gensim import models

sentences = LabeledLineSentence(filename)
model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5,
                       alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)
```
Here is my file for testing purposes:
```
uduidhud duidihdd
dsfsdf sdf
sddfv dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg
```
Here is my test :
```python
test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))
```
Regardless of what parameters are trained, this should obviously tell me that the most similar sentence is the 4th (SENT_3 or SENT_4, I don’t know how their indexes work, but the sentence label is this form). But here’s the result:
```
[('SENT_0', 0.15669342875480652), ('SENT_2', 0.0008485736325383186), ('SENT_4', -0.009077289141714573)]
```
What am I missing? If I try inferring with the exact same sentence ("i like dogs"), I get SENT_2, then 1, then 4… I really don’t understand. Why are the similarity scores so low? And when I load the model and run the test several times in a row, I don’t get the same results either.
Thank you for your help
Doc2Vec does not tend to work well on toy-sized examples. (Published work uses tens of thousands to millions of texts, and even the tiny unit tests inside Gensim use hundreds of texts, combined with smaller vector sizes and many more training passes, to get just barely reliable results.)
So I wouldn’t expect your code to give consistent or meaningful results. This is especially true when you:
- use a tiny amount of data with large vector sizes (which leads to severe model overfitting)
- use min_count=1 (because words without many varied usage examples don’t get good vectors)
- change min_alpha to stay the same as the larger starting alpha (because the usually beneficial behavior of the stochastic-gradient-descent learning algorithm relies on a gradual decay of this update rate)
- use documents with only a few words (because doc-vector training happens in proportion to the number of words each document contains)
Finally, even when everything else works, infer_vector() usually benefits from more steps than its default of 5 (tens or hundreds), and sometimes from a starting alpha less like its inference default (0.1) and more like the training value (0.025).
So I would suggest:

- Get more data.
- If you don’t have tens of thousands of texts, use smaller vector sizes and more epochs (but realize results from small datasets may still be weak).
- Don’t change min_alpha from its default.
- If each text is small, use more epochs (but realize results may still be weaker than with longer texts).
- Try other infer_vector() arguments, such as steps=50 (or more, especially with small texts) and alpha=0.025.