Python – How to combine tfidf features with homemade features


For a simple web page classification system, I tried to combine some homemade features (frequency of HTML tags, frequency of certain word collocations) with features obtained after applying tfidf. However, I am facing the following problem and I don’t really know how to start here.

Now I'm trying to put all of this into a data frame, mostly by following the code from a linked example:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = X_train_counts.todense()
denselist = dense.tolist()

tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])

But this doesn't return the index from my original data frame (from 0 to 2464) or the other features. It doesn't seem to produce readable column names either: the headers are numbers rather than the distinct words.

Also, I’m not sure if this is the right way to combine features, as this would result in extremely high-dimensional data frames, which may not be good for classifiers.

Solution

You can use hstack to merge two sparse matrices without having to convert to a dense format.

from scipy.sparse import hstack

hstack([X_train_counts, X_train_custom])
