ValueError: Expected n_neighbors

Time: 2022-04-04 01:36:30

I'm using scikit-learn's KNN and Levenshtein distance to do some work on strings, much like the example at the bottom of this page: http://scikit-learn.org/stable/faq.html . The difference is that my data is split into training and test sets and is in a DataFrame.

The split is listed here:

train_feature, test_feature, train_class, test_class = train_test_split(features, classes,
                                                    test_size=TEST_SET_SIZE, train_size=TRAINING_SET_SIZE,
                                                    random_state=42)

I have the following:

>>> model = KNeighborsClassifier(metric='pyfunc',func=machine_learning.custom_distance)
>>> model.fit(train_feature['id'], train_class.as_matrix(['gender']))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='pyfunc',
       metric_params={'func': <function custom_distance at 0x7fd0236267b8>},
       n_neighbors=5, p=2, weights='uniform')

Here train_feature has a single column, id ([24000 rows x 1 columns]), and train_class (Name: gender, dtype: object) is a Series of gender labels, each 'M' or 'F'. Each id corresponds to a key in a dict defined elsewhere.

The custom distance function is:

def custom_distance(x, y):
    i, j = int(x[0]), int(y[0])
    return damerau_levenshtein_distance(lookup_dict[i], lookup_dict[j])
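
For readers who want to try the idea without the original data, here is a minimal self-contained sketch of the same pattern: each feature row holds only an integer id, and the distance function looks up the corresponding strings before comparing them. The contents of lookup_dict are made up for illustration, and a plain Levenshtein distance (no transpositions) stands in for damerau_levenshtein_distance, which the question imports from elsewhere.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (stand-in for the Damerau variant)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical lookup table: id -> string. The real dict is not shown in the question.
lookup_dict = {0: "kitten", 1: "sitting", 2: "kitchen"}


def custom_distance(x, y):
    # x and y are single-element rows whose only entry is an id
    i, j = int(x[0]), int(y[0])
    return levenshtein(lookup_dict[i], lookup_dict[j])
```

With this table, custom_distance([0], [1]) compares "kitten" with "sitting".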

When I try to get the accuracy of the model:

 accuracy = model.score(test_feature, test_class)

I receive this error:

 ValueError: Expected n_neighbors <= 1. Got 5

I'm honestly really confused. I've checked the length of each of my datasets and they are fine. Why would it be telling me I only have one data point to plot from? Any help would be greatly appreciated.

3 Solutions

#1


2  

The classifier thinks that your dataset has only a single entry. Probably it interprets the vector of id's as a row vector instead of a column vector.

Try

model.fit(train_feature.as_matrix(['id']), train_class.as_matrix(['gender']))

and see if it helps.
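
To illustrate the row-versus-column distinction, here is a small NumPy sketch. The id values are made up. A 1-D vector of ids read as a single row is one sample with many features, which would be consistent with the "n_neighbors <= 1" message in the question; reshaped to a column it is many samples with one feature each. Note that as_matrix has since been removed from pandas (to_numpy() is the usual replacement), and how a given scikit-learn version handles 1-D input is version dependent, so treat this as a sketch of the shapes only.

```python
import numpy as np

# Made-up ids standing in for train_feature['id']
ids = np.array([101, 102, 103, 104])

# Read as a row: 1 sample with 4 features
as_row = ids.reshape(1, -1)

# Read as a column: 4 samples with 1 feature each, which is what fit() needs here
as_column = ids.reshape(-1, 1)

print(ids.shape, as_row.shape, as_column.shape)
```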

#2


0  

I faced the same error. I have a huge db from which I get the train and test data, but for code-testing purposes I use a much smaller one (~0.5% of the original). In the training procedure, I test a number of different neighbor counts, e.g.:

for neighbor in range(5,19): ...

The ValueError was raised for n_neighbors=19, and only when I used the small db. The reason is that the small db did not contain enough data points to supply 19 neighbors. When I tested with the full db, no such exception was raised.

Setting algorithm='brute' might make the error go away, but it does not solve the underlying problem. What you should do is check the number of observations in both your training and test sets and cap the value of n_neighbors accordingly.
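
One way to apply such a cap when sweeping over neighbor counts might look like the sketch below. The helper name candidate_neighbors and its default bounds are hypothetical, chosen only for illustration; the idea is simply never to request more neighbors than there are training samples.

```python
def candidate_neighbors(n_train_samples, start=5, stop=19):
    """Return the neighbor counts to try, capped at the number of training samples.

    `stop` is inclusive here; both bounds are illustrative defaults.
    """
    upper = min(stop, n_train_samples)
    return list(range(start, upper + 1))


# With a small training set, the sweep stops early instead of raising ValueError
small_sweep = candidate_neighbors(n_train_samples=10)

# With fewer samples than `start`, there is nothing valid to try
empty_sweep = candidate_neighbors(n_train_samples=3)
```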

#3


-1  

I figured it out. I needed to set the algorithm to brute force and pass the distance function as the metric:

model = KNeighborsClassifier(metric=machine_learning.custom_distance,algorithm='brute',n_neighbors=50)
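
As a rough picture of what a brute-force neighbor search with a callable metric does conceptually, here is a pure-Python sketch. It is not scikit-learn's implementation: the function brute_force_knn_predict, the names table, and toy_distance are all hypothetical, and toy_distance is a crude stand-in for a real edit distance.

```python
from collections import Counter


def brute_force_knn_predict(train_rows, train_labels, query_row, distance, k):
    """Classify query_row by majority vote among its k nearest training rows."""
    if k > len(train_rows):
        raise ValueError(f"Expected n_neighbors <= {len(train_rows)}. Got {k}")
    # Brute force: measure the distance from the query to every training row
    scored = sorted(
        (distance(row, query_row), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical id -> string table, playing the role of lookup_dict
names = {0: "anna", 1: "anne", 2: "bob", 3: "rob", 4: "ann"}


def toy_distance(x, y):
    # Rows hold only an id; compare the strings they point to
    a, b = names[int(x[0])], names[int(y[0])]
    mismatches = sum(ca != cb for ca, cb in zip(a, b))
    return mismatches + abs(len(a) - len(b))


train_rows = [[0], [1], [2], [3]]
train_labels = ["F", "F", "M", "M"]
prediction = brute_force_knn_predict(train_rows, train_labels, [4], toy_distance, k=3)
```

Asking for more neighbors than there are training rows fails in the same spirit as the error in the question, which is why n_neighbors still has to fit the data regardless of the algorithm setting.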
