Python – Categorical (string) features in sklearn for 10-fold CV SVM regression

I have a CSV file that mixes different types of features (categorical strings, 0/1 binary values, and floats). I want to run SVM regression with 10-fold cross-validation. Based on this post, I tried reading the data as follows, but I get an error saying that a string cannot be converted to float:

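import pandas as pd
from sklearn.svm import SVC

df = pd.read_csv("output.csv")
datanumpy = df.to_numpy()  # originally df.as_matrix(), which was removed in pandas 1.0
x = datanumpy[:, 0:143]  # columns 0 through 142 (the features)
y = datanumpy[:, 144]  # column 144 (the labels)

clf = SVC(kernel='linear')

clf.fit(x, y)  # raises the ValueError shown below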

Any idea how I can handle these factors?

The error message is:

ValueError                                Traceback (most recent call last)
<ipython-input-22-731136d5a713> in <module>()
     75 
     76 # # fitting x samples and y classes
---> 77 clf.fit(x, y)
     78 
     79 

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
    149         self._sparse = sparse and not callable(self.kernel)
    150 
--> 151         X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
    152         y = self._validate_targets(y)
    153 

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: '/Users/dorien/AC/Projects/memory/S1 - Stimuli/Exp1-2-Stimuli/MIDI/Stimulus9.mid'

Should I indicate which columns are factors?

Solution

scikit-learn estimators such as SVC expect numeric input, so string columns must be converted to numbers before fitting.
There are two common ways to do this in Python.

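The first uses LabelEncoder(), which maps each distinct string to an integer code.

from sklearn.preprocessing import LabelEncoder

# categorical_columns: list of the names of the string-valued columns.
# Encode only those columns; applying LabelEncoder to every column would
# also replace the float features with arbitrary integer codes.
df[categorical_columns] = df[categorical_columns].apply(LabelEncoder().fit_transform)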
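To see what label encoding actually produces, here is a minimal sketch on a made-up frame. The column names (stimulus, is_major, tempo) and values are assumptions for illustration only, not taken from the question's CSV. Note that the integer codes carry an order (0 < 1 < 2) that an SVM will treat as a real numeric distance, which is usually not what you want for unordered categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: one string column, one binary column, one float column.
toy = pd.DataFrame({
    "stimulus": ["b.mid", "a.mid", "c.mid", "a.mid"],
    "is_major": [1, 0, 1, 0],
    "tempo": [120.0, 96.5, 140.0, 96.5],
})

encoder = LabelEncoder()
# LabelEncoder sorts the distinct values, so "a.mid" -> 0, "b.mid" -> 1, "c.mid" -> 2.
toy["stimulus"] = encoder.fit_transform(toy["stimulus"])

codes = toy["stimulus"].tolist()
classes = list(encoder.classes_)
print(codes)    # [1, 0, 2, 0]
print(classes)  # ['a.mid', 'b.mid', 'c.mid']
```

The binary and float columns are left untouched, and encoder.classes_ lets you map the codes back to the original strings.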

Another approach is to use pandas.get_dummies().

import pandas as pd

df_dummies = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

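In the example above, categorical_columns is a list of the names of the categorical (string) columns, and drop_first=True drops one dummy column per variable because it is redundant given the others.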
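Putting the pieces together, the following is a rough sketch of one-hot encoding followed by SVM regression with 10-fold cross-validation. It is not the asker's actual data: the frame, its column names (stimulus, is_major, tempo, rating), and the scaling step are assumptions made for illustration. It also uses SVR rather than SVC, since SVC is scikit-learn's classifier and SVR is the regression counterpart.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical stand-in for output.csv: a string, a binary and a float feature.
rng = np.random.default_rng(0)
n_rows = 40
df = pd.DataFrame({
    "stimulus": rng.choice(["a.mid", "b.mid", "c.mid"], size=n_rows),
    "is_major": rng.integers(0, 2, size=n_rows),
    "tempo": rng.uniform(60.0, 180.0, size=n_rows),
})
# Made-up numeric target so that there is something to regress on.
df["rating"] = 0.01 * df["tempo"] + df["is_major"] + rng.normal(0.0, 0.1, size=n_rows)

# One-hot encode the string column; numeric columns pass through unchanged.
categorical_columns = ["stimulus"]
features = pd.get_dummies(
    df.drop(columns="rating"), columns=categorical_columns, drop_first=True
)
X = features.to_numpy(dtype=float)
y = df["rating"].to_numpy()

# Scaling is an added assumption here; SVMs are sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVR(kernel="linear"))

# cv=10 on a regressor uses 10-fold cross-validation.
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(scores.shape)  # (10,): one R^2 score per fold
```

Because the scaler sits inside the pipeline, it is refit on the training part of each fold, so the held-out fold does not leak into preprocessing.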
