Python – From Features to Words (“reverse” bag of words)

I used sklearn to create a BOW in Python with 200 features, which was easy. But how do I reverse it? That is, how do I go from a vector of 200 0s and 1s back to the corresponding words? Since the vocabulary is a dictionary, it has no inherent order, and I’m not sure which word each element in the feature vector corresponds to. Also, if the first element of my 200-dimensional vector corresponds to the first word in the dictionary, how can I extract a word from the dictionary by index?

This is how the BOW is created:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words=sw, strip_accents="unicode", analyzer="word", max_features=200)
features = vec.fit_transform(data.loc[:, "description"]).todense()

So features is an (n, 200) matrix, where n is the number of sentences.

Solution

I’m not quite sure what you’re trying to do, but you seem to just want to figure out which column represents which word. For this, CountVectorizer has a convenient get_feature_names method (renamed get_feature_names_out in scikit-learn 1.0; the old name was removed in 1.2).

Let’s look at the sample corpus provided in the docs:

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?' ]

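# Imports needed for the rest of the example
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Put into a dataframe
data = pd.DataFrame(corpus, columns=['description'])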
# Take a look:
>>> data
                             description
0            This is the first document.
1  This document is the second document.
2             And this is the third one.
3            Is this the first document?

# Initialize CountVectorizer (you can put in your arguments, but for the sake of example, I'm keeping it simple):
vec = CountVectorizer()

# Fit it as you had before:
features = vec.fit_transform(data.loc[:,"description"]).todense()

>>> features
matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

To see which column represents which word, use get_feature_names:

>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
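
On the point that the vocabulary is an unordered dictionary: below is a minimal sketch, assuming a fitted CountVectorizer, of looking up a word by column index. The fitted vocabulary_ attribute maps each term to its column index, so inverting that mapping gives an index-to-word lookup. The names index_to_word and words_in_column_order are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
vec.fit(corpus)

# vocabulary_ maps term -> column index; invert it to get column index -> term
index_to_word = {index: word for word, index in vec.vocabulary_.items()}

# Word behind the first column of the feature matrix
first_word = index_to_word[0]

# All words, listed in column order
words_in_column_order = [index_to_word[i] for i in range(len(index_to_word))]

print(first_word)
print(words_in_column_order)
```

The dictionary itself is unordered, but the indices stored as its values are what fix the column order.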

So your first column is and, the second column is document, and so on. For readability, you can put the matrix in a data frame with those names as column labels:

>>> pd.DataFrame(features, columns = vec.get_feature_names())
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
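
To go from a feature vector back to words, which is the "reverse" step asked about, something like the following should work. It is a sketch under assumptions: it reuses the sample corpus above, and the variable names (words_per_doc, words_in_row) are illustrative only. It shows two routes: CountVectorizer's inverse_transform method, and masking the feature names with a single row. Either way, only the set of words that occurred is recoverable, not their original order in the sentence.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
features = vec.fit_transform(corpus)  # sparse matrix, one row per document

# Route 1: inverse_transform gives, for each row, the terms with nonzero entries
words_per_doc = vec.inverse_transform(features)

# Route 2: use one row as a boolean mask over the feature names
if hasattr(vec, "get_feature_names_out"):
    names = np.asarray(vec.get_feature_names_out())  # scikit-learn >= 1.0
else:
    names = np.asarray(vec.get_feature_names())      # older releases

row = features[2].toarray().ravel()    # third document as a flat count vector
words_in_row = names[row > 0].tolist()

print(sorted(words_per_doc[2]))
print(words_in_row)  # ['and', 'is', 'one', 'the', 'third', 'this']
```

Route 2 also works on a vector that did not come from fit_transform, as long as it has one entry per column of the fitted vocabulary.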
