The Gensim Doc2Vec most_similar() method did not work as expected
I’m struggling with Doc2Vec and I don’t see what I’m doing wrong.
I have a text file with sentences. I would like to know what is the closest sentence we can find in that file for a given sentence.
Here’s the code to create the model:
sentences = LabeledLineSentence(filename)
model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)
Here is my file for testing purposes:
uduidhud duidihdd
dsfsdf sdf sddfv
dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg
Here is my test:
test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))
Regardless of what parameters are trained, this should obviously tell me that the most similar sentence is the 4th one ("i like dogs" — SENT_3 or SENT_4, I don't know how the indexes work, but the labels have that form). But here's the result:
[('SENT_0', 0.15669342875480652),
('SENT_2', 0.0008485736325383186),
('SENT_4', -0.009077289141714573)]
What am I missing? If I test with the exact same sentence ("i like dogs"), I get SENT_2, then 1, then 4… I really don't understand. Why are the similarity numbers so low? And when I load the model and run this several times in a row, I don't get the same results either.
Thank you for your help
Solution
Doc2Vec does not work well on toy-sized examples. (Published work uses tens of thousands to millions of texts, and even Gensim's tiny unit tests use hundreds of texts, combined with smaller vector sizes and many more training epochs, to get barely-reliable results.) So I would not expect your code to give consistent or meaningful results. This is especially true when you:
- use a large vector size with tiny data (which invites severe model overfitting);
- use min_count=1 (words without many varied usage examples don't get good vectors);
- leave min_alpha equal to the larger starting alpha (the often-beneficial behavior of the stochastic gradient descent learning algorithm relies on a gradual decay of this update rate);
- use documents with only a few words (the training of a document's vector is proportional to the number of words it contains).
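The min_alpha point can be seen in a simplified sketch of the linear learning-rate decay that training applies between alpha and min_alpha (an illustration of the idea, not gensim's actual code):

```python
# Simplified sketch (not gensim's actual implementation) of the linear
# learning-rate decay that Doc2Vec applies from `alpha` down to `min_alpha`.
def alpha_schedule(alpha, min_alpha, epochs):
    """Effective learning rate at the start of each training epoch."""
    step = (alpha - min_alpha) / epochs
    return [round(alpha - step * e, 5) for e in range(epochs)]

# With the questioner's settings, the rate never decays:
print(alpha_schedule(alpha=0.025, min_alpha=0.025, epochs=5))
# -> [0.025, 0.025, 0.025, 0.025, 0.025]

# With a small min_alpha, the updates shrink each epoch, which stochastic
# gradient descent relies on to settle into stable vectors:
print(alpha_schedule(alpha=0.025, min_alpha=0.0001, epochs=5))
# -> [0.025, 0.02002, 0.01504, 0.01006, 0.00508]
```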
Finally, even when everything else is working, infer_vector() usually benefits from more steps than its default of 5 (try tens or hundreds), and sometimes from a starting alpha that is less like its inference default (0.1) and more like the training default (0.025).
So:
- don't change min_count or min_alpha;
- get much more data;
- if you don't have tens of thousands of texts, use a smaller vector size and more epochs (but realize that results from small datasets may still be weak);
- if each text is short, use more epochs (but realize that the results may still be weaker than with longer texts);
- try other infer_vector() arguments, such as steps=50 (or more, especially with short texts) and alpha=0.025.