Loading a pretrained Word2Vec embedding in Tensorflow

Date: 2022-07-10 20:23:07

I am trying to load a pretrained Word2Vec (or GloVe) embedding in my Tensorflow code, but I am having trouble understanding it, as I cannot find many examples. The question is not about getting and loading the embedding matrix, which I understand, but about looking up the word IDs. Currently I am using the code from https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/. There, first the embedding matrix is loaded (understood). Then, a vocabulary processor is used to convert a sentence x to a list of word IDs:


import numpy as np
from tensorflow.contrib import learn

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove
pretrain = vocab_processor.fit(vocab)
# transform inputs
x = np.array(list(vocab_processor.transform(your_raw_input)))
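
For reference, the loading step I mean reads the GloVe text file into a word list and an embedding matrix, roughly like this (a minimal sketch; the filename is a placeholder, not from the tutorial):

import numpy as np

# Each line of a GloVe text file is: word v1 v2 ... vD
vocab = []
embd = []
with open('glove.6B.50d.txt', encoding='utf-8') as f:  # placeholder filename
    for line in f:
        parts = line.rstrip().split(' ')
        vocab.append(parts[0])                      # the word itself
        embd.append([float(x) for x in parts[1:]])  # its vector
embedding = np.asarray(embd, dtype=np.float32)      # shape: (len(vocab), D)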

The vocabulary processor code works and gives me a list of word IDs, but I do not know if it is correct. What bothers me most is how the vocabulary processor gets the correct word IDs for the embedding I just read (since otherwise the result of the embedding lookup would be wrong). Does the fit step do this?


Or is there another way? How do you do this lookup?


Thanks! Oliver


1 Answer

#1



Yes, the fit step tells the vocab_processor the index of each word (starting from 1) in the vocab array. transform just reverses this lookup: it produces the indices from the words and uses 0 to pad the output to max_document_length.


You can see that in a short example here:


import numpy as np
from tensorflow.contrib import learn

vocab_processor = learn.preprocessing.VocabularyProcessor(5)
vocab = ['a', 'b', 'c', 'd', 'e']
pretrain = vocab_processor.fit(vocab)

pretrain == vocab_processor
# True

np.array(list(pretrain.transform(['a b c', 'b c d', 'a e', 'a b c d e'])))

# array([[1, 2, 3, 0, 0],
#        [2, 3, 4, 0, 0],
#        [1, 5, 0, 0, 0],
#        [1, 2, 3, 4, 5]])
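
Note that because fit assigns IDs starting from 1 and transform uses 0 for padding, the embedding matrix needs an extra row at index 0 before the IDs can be looked up. A minimal sketch, assuming embedding is the matrix loaded from the GloVe file in the same order as vocab (variable names are mine, not from the tutorial):

import numpy as np
import tensorflow as tf

# IDs from VocabularyProcessor start at 1 (0 is the padding ID), so
# prepend a zero row to keep row i of the matrix aligned with word ID i.
padded_embedding = np.vstack(
    [np.zeros((1, embedding.shape[1]), dtype=np.float32), embedding])

word_ids = tf.placeholder(tf.int32, shape=[None, None])  # output of transform
embedding_var = tf.Variable(padded_embedding, trainable=False)
# shape: (batch, max_document_length, D)
embedded_words = tf.nn.embedding_lookup(embedding_var, word_ids)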
