ValueError when using a linear SVM in scikit-learn (Python)

Date: 2022-05-31 18:03:58

I am currently working on large-scale hierarchical text classification of ODP documents. The dataset provided to me is in the libSVM format. I am trying to use the linear-kernel SVM from Python's scikit-learn to build the model. Below is a sample of the training data:


29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3 

33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1 

The following is the code I used to build the linear SVM model:


from sklearn.datasets import load_svmlight_file
from sklearn import svm
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print clf.score(X_test,y_test)

Upon running clf.score(), I get the following error:


ValueError                                Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
      1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
      3 print time.time() - start_time, "seconds"

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    292         """
    293         from .metrics import accuracy_score
--> 294         return accuracy_score(y, self.predict(X))

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    464             Class labels for samples in X.
    465         """
--> 466         y = super(BaseSVC, self).predict(X)
    467         return self.classes_.take(y.astype(np.int))

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    280         y_pred : array, shape (n_samples,)
    281         """
--> 282         X = self._validate_for_predict(X)
    283         predict = self._sparse_predict if self._sparse else self._dense_predict
    284         return predict(X)

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
    402             raise ValueError("X.shape[1] = %d should be equal to %d, "
    403                              "the number of features at training time" %
--> 404                              (n_features, self.shape_fit_[1]))
    405         return X

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

Can someone please tell me what exactly is wrong with either this code or the data I have? Thanks in advance.


Below are the values of X_train, y_train, X_test, and y_test, in that order:




  (0, 9453)         1.0
  (0, 11741)    1.0
  (0, 18883)    14.0
  (0, 26839)    1.0
  (0, 35146)    1.0
  (0, 52781)    1.0
  (0, 72082)    1.0
  (0, 73243)    1.0
  (0, 78944)    1.0
  (0, 79912)    1.0
  (0, 79985)    1.0
  (0, 86709)    3.0
  (0, 117285)   1.0
  (0, 139819)   1.0
  (0, 142457)   1.0
  (0, 146314)   1.0
  (0, 151004)   2.0
  (0, 161453)   3.0
  (0, 172236)   1.0
  (0, 187531)   2.0
  (0, 202462)   1.0
  (0, 210417)   1.0
  (0, 250581)   1.0
  (0, 251689)   1.0
  (0, 296384)   2.0
  : :
  (4462, 735469)    1.0
  (4462, 737059)    15.0
  (4462, 740127)    1.0
  (4462, 743798)    1.0
  (4462, 766063)    1.0
  (4462, 778958)    2.0
  (4462, 784004)    4.0
  (4462, 837264)    2.0
  (4462, 839095)    22.0
  (4462, 844735)    6.0
  (4462, 859721)    2.0
  (4462, 875267)    1.0
  (4462, 910761)    1.0
  (4462, 931244)    1.0
  (4462, 945069)    6.0
  (4462, 948728)    1.0
  (4462, 948850)    2.0
  (4462, 957682)    1.0
  (4462, 975170)    1.0
  (4462, 989192)    1.0
  (4462, 1014294)   1.0
  (4462, 1042424)   1.0
  (4462, 1049027)   1.0
  (4462, 1072931)   1.0
  (4462, 1145790)   1.0



[  2.90000000e+01   3.30000000e+01   3.30000000e+01 ...,   1.65475000e+05
   1.65518000e+05   1.65518000e+05]



  (0, 18573)    1.0
  (0, 23501)    1.0
  (0, 29954)    1.0
  (0, 42112)    1.0
  (0, 46402)    1.0
  (0, 63041)    2.0
  (0, 67942)    2.0
  (0, 83522)    1.0
  (0, 88413)    2.0
  (0, 99454)    1.0
  (0, 126041)   1.0
  (0, 139819)   1.0
  (0, 142678)   1.0
  (0, 151004)   1.0
  (0, 166351)   2.0
  (0, 173794)   1.0
  (0, 192162)   3.0
  (0, 210417)   2.0
  (0, 254468)   1.0
  (0, 263895)   2.0
  (0, 277567)   1.0
  (0, 278419)   2.0
  (0, 279181)   2.0
  (0, 281319)   2.0
  (0, 298898)   1.0
  : :
  (1857, 1100504)   3.0
  (1857, 1103247)   1.0
  (1857, 1105578)   1.0
  (1857, 1108986)   2.0
  (1857, 1118486)   1.0
  (1857, 1120807)   9.0
  (1857, 1129243)   2.0
  (1857, 1131786)   1.0
  (1857, 1134029)   2.0
  (1857, 1134410)   5.0
  (1857, 1134494)   1.0
  (1857, 1139045)   25.0
  (1857, 1142239)   3.0
  (1857, 1142651)   1.0
  (1857, 1144787)   1.0
  (1857, 1151891)   1.0
  (1857, 1152094)   1.0
  (1857, 1157533)   1.0
  (1857, 1159376)   1.0
  (1857, 1178944)   1.0
  (1857, 1181310)   2.0
  (1857, 1182023)   1.0
  (1857, 1187098)   1.0
  (1857, 1194344)   2.0
  (1857, 1195819)   9.0



[  2.90000000e+01   3.30000000e+01   1.56000000e+02 ...,   1.65434000e+05
   1.65475000e+05   1.65518000e+05]

3 Answers



The error message


ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

is self-explanatory: the test data has a different number of features than the data the model was trained on. That is, X_train.shape[1] is not equal to X_test.shape[1].


You should check why they are not equal, as they should be.


One possibility is that both files are loaded as sparse matrices and load_svmlight_file infers the number of features from the largest feature index in each file. If the test data contains feature indices that never appear in the training data, X_test can end up wider than X_train. To avoid this, set the number of features explicitly by passing the n_features argument to load_svmlight_file.
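To illustrate the inference described above, here is a minimal pure-Python sketch, assuming 1-based feature indices as in the sample data. The helper name inferred_n_features and the toy lines are hypothetical; this is not scikit-learn's actual implementation, only an approximation of how the width follows from the largest index in each file.

```python
def inferred_n_features(lines):
    """Return the largest 1-based feature index seen in libSVM-formatted lines.

    Hypothetical helper: it approximates how a loader would infer the
    number of columns when n_features is not given.
    """
    max_index = 0
    for line in lines:
        tokens = line.split()
        # tokens[0] is the class label; the rest are "index:value" pairs.
        for pair in tokens[1:]:
            index = int(pair.split(":", 1)[0])
            max_index = max(max_index, index)
    return max_index


# Toy data (not the ODP files): the test line uses index 9,
# which never occurs in the training lines.
train_lines = ["29 3:1 7:2", "33 2:1 5:4"]
test_lines = ["29 1:1 9:3"]

n_train = inferred_n_features(train_lines)
n_test = inferred_n_features(test_lines)
print(n_train, n_test)  # the two widths differ, as in the error above
```

Because each file is inspected on its own, nothing forces the two widths to agree unless the number of features is fixed explicitly.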




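You can use the n_features option. The value must be at least as large as the number of features found in the file being loaded, otherwise load_svmlight_file raises a ValueError. Here the test file has more features (1199847) than the training file (1199830), so load the test file first and pass its width when loading the training file:


X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
X_train, y_train = load_svmlight_file("/path-to-file/train.txt", n_features=X_test.shape[1])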

This error can also be avoided by loading both files in one call with load_svmlight_files, which gives all returned matrices the same number of features:


from sklearn.datasets import load_svmlight_files
X_train, y_train, X_test, y_test = load_svmlight_files(['/path-to-file/train.txt', '/path-to-file/test.txt'])
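The following is a small sketch for checking both approaches on toy data before running them on the real files. The file contents and variable names are made up for illustration; the only scikit-learn calls used are load_svmlight_file and load_svmlight_files with the n_features argument mentioned above.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file, load_svmlight_files

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, "train.txt")
    test_path = os.path.join(tmp_dir, "test.txt")

    # Toy files: the test file uses feature index 9, unseen in training.
    with open(train_path, "w") as f:
        f.write("29 3:1 7:2\n33 2:1 5:4\n")
    with open(test_path, "w") as f:
        f.write("29 1:1 9:3\n")

    # Loaded separately, each file gets its own inferred width.
    X_train, y_train = load_svmlight_file(train_path)
    X_test, y_test = load_svmlight_file(test_path)
    print("separate:", X_train.shape, X_test.shape)

    # Option 1: reload both with a common, explicit width.
    n_features = max(X_train.shape[1], X_test.shape[1])
    X_train_fixed, _ = load_svmlight_file(train_path, n_features=n_features)
    X_test_fixed, _ = load_svmlight_file(test_path, n_features=n_features)
    print("n_features:", X_train_fixed.shape, X_test_fixed.shape)

    # Option 2: load both files in one call.
    X_a, y_a, X_b, y_b = load_svmlight_files([train_path, test_path])
    print("load_svmlight_files:", X_a.shape, X_b.shape)
```

Taking the maximum of the two inferred widths avoids having to know in advance which file contains the highest feature index.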



Problem found!!

# -*- coding:utf-8 -*-
  1. The file should be encoded in UTF-8.
  2. The data frame object should be reshaped, for example X_train.values[4].reshape(1, -1).
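As a rough illustration of the reshape step, the sketch below uses a plain NumPy array in place of the data frame row, which is an assumption since the question itself loads sparse matrices rather than a data frame. scikit-learn estimators expect 2-D input, so a single sample of shape (n_features,) is turned into one row of shape (1, n_features).

```python
import numpy as np

# One sample as a 1-D array (hypothetical values), shape (3,).
sample = np.array([0.0, 1.0, 3.0])

# reshape(1, -1) yields a single row; -1 lets NumPy infer the column count.
as_row = sample.reshape(1, -1)

print(sample.shape, as_row.shape)
```

This only applies when predicting on a single dense sample; it does not change the number of features, so it would not by itself resolve a train/test width mismatch.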


