Python – Get the top word from tf-idf sparse matrix (highest tf-idf value)

I have a list of 208 sentences that looks like this:

all_words = [["this is a sentence ... "] , [" another one hello bob this is alice ... "] , ["..."] ...] 

I want to get the word with the highest tf-idf value.
I created a tf-idf matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenize = lambda doc: doc.split(" ")
sklearn_tfidf = TfidfVectorizer(norm='l2', tokenizer=tokenize, ngram_range=(1,2))
tfidf_matrix = sklearn_tfidf.fit_transform(all_words)
sentences = sklearn_tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

dense_tfidf = tfidf_matrix.todense()

Now I don’t know how to get the word with the highest tf-idf value.

Each column of dense_tfidf represents one word or one two-word phrase (the matrix is 208×5481).

When I summed each column, it didn't really help: I got the same result as a simple list of the most frequent words (I guess because the sum behaves much like a plain word count).

How do I get the word with the highest tf-idf value? Or how can I normalize it wisely?

Solution

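I had a similar problem and found the approach below in a blog post; just change the X and Y inputs to match your data frame. Note that chi2 scores each term against class labels, so this approach needs a label for every document. The code from the blog is as follows. Sklearn's documentation also helped me: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html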

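from sklearn.feature_selection import chi2
import numpy as np

# features, labels, category_to_id and tfidf come from the blog's setup and are not defined here.
# Presumably: features is the tf-idf matrix, labels an array of per-document category ids,
# category_to_id a dict of category name -> id, and tfidf the fitted TfidfVectorizer.
N = 2  # number of top terms to print per category
for Product, category_id in sorted(category_to_id.items()):
    # chi-squared score of every term against "document belongs to this category"
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])  # ascending, so the strongest terms come last
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))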
