Python – Fine-tune pre-trained word2vec Google News

I’m currently using a Word2Vec model trained on the Google News corpus (from here).
Since it was trained only on news from before 2013, I need to update its vectors based on news from after 2013 and add new words to the vocabulary.

Let’s say I have a new news corpus from after 2013. Can I retrain, fine-tune, or update the Google News Word2Vec model? Can it be done with Gensim? Can it be done with FastText?

Solution

You can take a look at this:
https://github.com/facebookresearch/fastText/pull/423

It does exactly what you want: the pull request adds incremental training to fastText, so you can continue training a classification model or word-vector model on new data instead of starting from scratch. Since it is a pull request rather than part of an official release, you may need to build fastText from that branch. The usage is:

./fasttext [supervised | skipgram | cbow] -input train.data -inputModel trained.model.bin -output re-trained [other options] -incr

-incr stands for incremental training.
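For example, continuing skipgram training of an existing model on a new corpus might look like this (the file names here are hypothetical):

./fasttext skipgram -input news_2014.txt -inputModel news_2013.bin -output news_retrained -incr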

When training word embeddings, you could either retrain from scratch on all of the data each time, or continue training on just the new data. For classification, you could retrain from scratch on all of the data using pre-trained word embeddings, or train only on the new data while leaving the word embeddings unchanged.

Incremental training means training a model on the data we already have, and then updating it with new data as it arrives, instead of starting over from scratch.
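
Since the question also asks about Gensim: the GoogleNews file ships only word vectors, not a full trainable model, so true fine-tuning of it is not directly supported. A common workaround is to build a new Word2Vec model on the new corpus and seed the overlapping words with the pre-trained vectors via intersect_word2vec_format. Below is a minimal sketch, assuming Gensim 4.x and a local copy of GoogleNews-vectors-negative300.bin (the corpus and file path are placeholders):

import numpy as np
from gensim.models import Word2Vec

# Tokenized sentences from the post-2013 news corpus (placeholder data).
new_sentences = [
    ["stocks", "rallied", "after", "the", "announcement"],
    ["the", "election", "results", "surprised", "analysts"],
]

# Build a model whose dimensionality matches the Google News vectors (300).
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(new_sentences)

# Gensim 4.x stores a single shared lock factor by default; expand it so
# each word gets its own, otherwise the imported vectors cannot stay trainable.
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)

# Copy pre-trained vectors for words that appear in the new vocabulary.
# lockf=1.0 leaves them free to keep updating during training.
model.wv.intersect_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0
)

model.train(new_sentences, total_examples=model.corpus_count, epochs=5)

Note that this only covers words that appear in the new corpus; words that exist only in the Google News vocabulary are dropped, so it approximates rather than exactly reproduces fine-tuning the original model.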
