How do I use pre-trained word embeddings when training a model in sklearn?

Date: 2021-06-17 13:54:41

With things like neural networks (NNs) in keras, it is very clear how to use word embeddings within the training of the NN; you can simply do something like:

embeddings = ...
model = Sequential([Embedding(...),
                    layer1,
                    layer2, ...])

But I'm unsure of how to do this with algorithms in sklearn such as SVMs, Naive Bayes, and logistic regression. I understand that there is a Pipeline method, which works simply (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) like:

pip = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', Classifier())])
pip.fit(X_train, y_train)
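As a point of reference, here is a minimal runnable version of such a pipeline. It uses TfidfVectorizer (which combines CountVectorizer and TfidfTransformer into one step) and LogisticRegression as the classifier; the tiny corpus is made up purely for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# toy corpus, purely for illustration
X_train = ["good movie", "great film", "bad movie", "terrible film"]
y_train = [1, 1, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),   # raw text -> tf-idf feature matrix
    ("clf", LogisticRegression()),  # classifier on top of the features
])
pipe.fit(X_train, y_train)
preds = pipe.predict(["good film", "terrible movie"])
```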

But how can I include loaded word embeddings in this pipeline? Or should it somehow be included outside the pipeline? I can't find much documentation online about how to do this.

Thanks.

1 solution

#1 (score: 3)

You can use the FunctionTransformer class. If your goal is to have a transformer that takes a matrix of indexes and outputs a 3d tensor with word vectors, then this should suffice:

# this assumes you're using numpy ndarrays
from sklearn.preprocessing import FunctionTransformer

word_vecs_matrix = get_wv_matrix()  # pseudo-code: a (vocab_size, embedding_dim) array

def transform(x):
    # fancy-index into the embedding matrix: each word index becomes its vector
    return word_vecs_matrix[x]

# validate=False keeps sklearn from coercing the integer indices to float
transformer = FunctionTransformer(transform, validate=False)

Be aware that, unlike keras, the word vectors will not be fine-tuned by any kind of gradient descent during training.

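One practical caveat: most sklearn estimators expect a 2-D feature matrix, not a 3-D tensor, so a common workaround is to average the word vectors of each document inside the FunctionTransformer and feed that to a classifier. Below is a minimal sketch under that assumption; the hand-made `embeddings` dictionary is a stand-in for real pre-trained vectors (e.g. loaded from GloVe or word2vec files), which would slot in the same way:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

# tiny stand-in for pre-trained embeddings (word -> vector); in practice
# these would be loaded from GloVe/word2vec files instead
embeddings = {
    "good": np.array([1.0, 0.2]),
    "great": np.array([0.9, 0.1]),
    "bad": np.array([-1.0, 0.3]),
    "awful": np.array([-0.8, 0.2]),
}
dim = 2

def mean_embedding(docs):
    # average the word vectors of each document -> one 2-D row per doc
    out = []
    for doc in docs:
        vecs = [embeddings[w] for w in doc.split() if w in embeddings]
        out.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(out)

pipe = Pipeline([
    ("embed", FunctionTransformer(mean_embedding, validate=False)),
    ("clf", LogisticRegression()),
])
pipe.fit(["good great", "bad awful"], [1, 0])
```

As in the answer above, the vectors are frozen features here; only the classifier's weights are learned during `fit`.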
