Categorical (string) features in sklearn for 10cv SVM regression
I have a mix of feature types (categorical strings, 0/1 binary, floats) in one CSV file, and I want to run SVM regression with 10-fold cross-validation. Based on this post, I tried reading the data as follows, but I get an error saying that a string cannot be converted to float:
import pandas as pd
from sklearn.svm import SVC
df = pd.read_csv("output.csv")
datanumpy = df.to_numpy()  # as_matrix() was removed in pandas 1.0
x = datanumpy[:, 0:143]  # columns at index 0 through 142 (the features)
y = datanumpy[:, 144]  # column at index 144 (the labels)
clf = SVC(kernel='linear')
clf.fit(x, y)
Any idea how I can handle these factors?
The error message is:
ValueError Traceback (most recent call last)
<ipython-input-22-731136d5a713> in <module>()
75
76 # # fitting x samples and y classes
---> 77 clf.fit(x, y)
78
79
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
149 self._sparse = sparse and not callable(self.kernel)
150
--> 151 X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
152 y = self._validate_targets(y)
153
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
519 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
520 ensure_2d, allow_nd, ensure_min_samples,
--> 521 ensure_min_features, warn_on_dtype, estimator)
522 if multi_output:
523 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: '/Users/dorien/AC/Projects/memory/S1 - Stimuli/Exp1-2-Stimuli/MIDI/Stimulus9.mid'
Should I indicate which columns are factors?
Solution
scikit-learn estimators such as SVC require numeric input, so string columns must be converted to numbers before fitting.
There are two common ways to do this in Python.
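The first uses LabelEncoder(), which maps each distinct value in a column to an integer. Apply it only to the non-numeric columns; applying it to every column would also replace the float values with integer codes. Note that the integer codes imply an ordering among the categories, which a linear SVM will treat as meaningful.
from sklearn.preprocessing import LabelEncoder
string_columns = df.select_dtypes(exclude="number").columns
df[string_columns] = df[string_columns].apply(LabelEncoder().fit_transform)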
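As a small illustration of what label encoding does to a mixed table, here is a sketch on a made-up DataFrame. The column names and values are hypothetical stand-ins for output.csv, not taken from the question. It encodes only the non-numeric columns, so the binary and float columns keep their original values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for output.csv
df = pd.DataFrame({
    "instrument": ["piano", "violin", "piano", "flute"],  # categorical string
    "is_major": [1, 0, 1, 1],                              # 0-1 binary
    "tempo": [120.5, 98.0, 132.25, 76.0],                  # float
})

# Pick out the non-numeric columns; numeric columns are left alone
string_columns = df.select_dtypes(exclude="number").columns

encoded = df.copy()
for col in string_columns:
    # Each distinct string is mapped to an integer code (classes are sorted)
    encoded[col] = LabelEncoder().fit_transform(df[col])

print(encoded)
```

The resulting codes are arbitrary integers, so this is mainly suitable when the model does not read meaning into their order.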
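Another approach is to use pandas.get_dummies(), which one-hot encodes each categorical column into separate indicator columns, so no artificial ordering is introduced.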
import pandas as pd
df_dummies = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
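In the example above, categorical_columns is a list of the names of the categorical columns, and drop_first=True drops one indicator column per variable to avoid a redundant column.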
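To tie this back to the original goal, the following is a minimal end-to-end sketch: one-hot encode with get_dummies, then run 10-fold cross-validation. The data is synthetic and every column name (instrument, is_major, tempo, rating) is an assumption made for illustration. It also uses SVR rather than the SVC from the question, on the assumption that the target is continuous, since the title asks for regression.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for output.csv (all names and values are hypothetical)
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "instrument": np.tile(["piano", "violin", "flute"], n // 3),  # categorical string
    "is_major": rng.integers(0, 2, size=n),                        # 0-1 binary
    "tempo": rng.uniform(60, 180, size=n),                         # float
})
# Continuous target, so a regressor is appropriate
df["rating"] = 0.01 * df["tempo"] + df["is_major"] + rng.normal(0, 0.1, size=n)

# One-hot encode the categorical columns named explicitly
categorical_columns = ["instrument"]
df_dummies = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Everything is numeric now, so conversion to a float array succeeds
X = df_dummies.drop(columns="rating").to_numpy(dtype=float)
y = df_dummies["rating"].to_numpy(dtype=float)

# cv=10 gives 10-fold cross-validation; the default score for a regressor is R^2
scores = cross_val_score(SVR(kernel="linear"), X, y, cv=10)
print(scores.mean())
```

Selecting features by column name, rather than by position as in the question, also avoids off-by-one slicing mistakes when the encoding changes the number of columns.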