Python – UnicodeDecodeError while loading word2vec

UnicodeDecodeError while loading word2vec… here is a solution to the problem.


Full description

I started using word embeddings and found a lot of information about them. I understand that, so far, I can either train my own word vectors or use pre-trained ones, such as Google's or Wikipedia's, but those are for English and don't work for me because I'm dealing with Brazilian Portuguese. So I kept looking for pre-trained Portuguese word vectors and eventually found Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot.

Short description

I can't load a pre-trained word vector; I'm trying WordVectors (https://github.com/Kyubyong/wordvectors) and Polyglot.

Download

Load attempt

Kyubyong’s WordVectors
First attempt: use Gensim as suggested by Hirosan:

from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)

Error returned:

[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The downloaded zip also contains other files, but they all return similar errors.
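
As an aside, the 0x80 byte at position 0 is itself a clue: the original C word2vec binary format begins with plain ASCII (the vocabulary size and vector dimension), whereas a file written with Gensim's native save() typically begins with a pickle protocol marker such as 0x80. A quick way to check, sketched here with the same placeholder path and not taken from the original question, is to peek at the first few bytes of the file:

# Sketch: inspect the file header to guess which format the .bin file is in.
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
with open(kyu_path, 'rb') as f:
    header = f.read(16)
print(header)
# An ASCII header like b'123456 300\n...' suggests the C word2vec binary format,
# while b'\x80...' (a pickle protocol marker) suggests a Gensim-native save.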

Polyglot
First attempt: follow Al-Rfou's instructions:

import pickle
import numpy
polyglot_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(polyglot_path, 'rb'))

Error returned:

File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
    words, embeddings = pickle.load(open(polyglot_path, "rb"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
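
This kind of 'ascii' codec failure from pickle.load is typical when a pickle written under Python 2 is opened under Python 3. It is not part of the original question, but a workaround worth trying, assuming the .pkl was indeed produced with Python 2, is pickle's encoding argument:

import pickle

polyglot_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'

# encoding='latin-1' tells pickle how to decode byte strings embedded in a
# Python 2 pickle; the default 'ascii' cannot handle bytes such as 0xd4.
with open(polyglot_path, 'rb') as f:
    words, embeddings = pickle.load(f, encoding='latin-1')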

Second attempt: use Polyglot's word embedding load function:

First, we have to install polyglot via pip:

pip install polyglot

Now we can import it:

from polyglot.mapping import Embedding
polyglot_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(polyglot_path)

Error returned:

File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Extra information

I am using Python 3 on macOS High Sierra.

Solution

Kyubyong’s WordVectors
As Aneesh Joshi points out, the correct way to load the Kyubyong model is to call Word2Vec’s native load function.

from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)
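
Once the model is loaded, the vectors can be queried through Gensim's usual KeyedVectors interface. A short usage sketch (the example word is my own and is only assumed to be in the vocabulary):

# Query the loaded model via model.wv (Gensim's KeyedVectors interface).
vector = model.wv['casa']                            # embedding for one word
neighbours = model.wv.most_similar('casa', topn=5)   # nearest words by cosine similarity
print(vector.shape, neighbours)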

Although I am very grateful for Aneesh Joshi's solution, Polyglot seems to be a better fit for working with Portuguese. Any thoughts on this?

Solution

For Kyubyong's pre-trained word2vec .bin file: it may have been saved using Gensim's save feature.

“Load the model using load(). Not load_word2vec_format (this is for C tool compatibility).”

That is, model = Word2Vec.load(fname).

Please let me know if it works.

Quote: Gensim mailing list
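
To make the quoted advice concrete: Gensim's native save() pairs with load(), while save_word2vec_format() and load_word2vec_format() are only for files in the original C tool's format. A hedged sketch with hypothetical filenames:

from gensim.models import Word2Vec, KeyedVectors

# Gensim-native persistence: save() <-> load()
model = Word2Vec.load('pt.bin')

# C word2vec format: save_word2vec_format() <-> load_word2vec_format()
model.wv.save_word2vec_format('pt_c_format.bin', binary=True)
word_vectors = KeyedVectors.load_word2vec_format('pt_c_format.bin', binary=True)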
