Doc2vec: How to get document vectors

Time: 2022-09-27 01:44:41

How do I get the document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or help me with a tutorial.

I am using the gensim Python library.

doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)

I get AttributeError: 'list' object has no attribute 'words' whenever I run this.

3 solutions

#1


33  

Gensim was updated. LabeledSentence no longer takes labels; it now takes tags. See the documentation for LabeledSentence: https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with:

docvec = model.docvecs[99] 

It will show the value of the 100th vector of the trained model. It works with integers and strings.

#2


31  

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (ids of documents). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)

model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors

model.docvecs[0]
model.docvecs[1]

UPDATE (how to train in epochs): The Doc2Vec function has alpha and min_alpha parameters, but with them the learning rate decays from alpha to min_alpha within a single epoch. To train for several epochs, set the learning rate manually, like this:

from gensim.models import doc2vec
import random

alpha_val = 0.025        # Initial learning rate
min_alpha_val = 1e-4     # Minimum for linear learning rate decay
passes = 15              # Number of passes of one document during training

alpha_delta = (alpha_val - min_alpha_val) / (passes - 1)

model = doc2vec.Doc2Vec(size = 100, window = 300, min_count = 1, workers = 4) # Model initialization

model.build_vocab(docs) # Building vocabulary

for epoch in range(passes):
    # Shuffling gets better results
    random.shuffle(docs)

    # Train
    model.alpha, model.min_alpha = alpha_val, alpha_val
    model.train(docs)

    # Logs
    print('Completed pass %i at alpha %f' % (epoch + 1, alpha_val))

    # Next run alpha
    alpha_val -= alpha_delta
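
The arithmetic behind this manual schedule can be checked on its own. Here is a minimal sketch in plain Python (no gensim) that lists the learning rate used in each pass, with the same values as above; the `alphas` list is introduced only for illustration. Recent gensim versions accept an `epochs` argument and handle the decay from `alpha` to `min_alpha` across all epochs internally, so the manual loop may be unnecessary there.

```python
alpha_val = 0.025      # initial learning rate
min_alpha_val = 1e-4   # learning rate for the last pass
passes = 15            # number of passes over the data

# Step size so that the first pass uses alpha_val and the last uses min_alpha_val
alpha_delta = (alpha_val - min_alpha_val) / (passes - 1)

# Learning rate for each pass, decreasing linearly
alphas = [alpha_val - i * alpha_delta for i in range(passes)]

for epoch, alpha in enumerate(alphas, start=1):
    print('pass %2d uses alpha %.6f' % (epoch, alpha))
```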

#3


24  

doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2Vec() were not in the correct LabeledSentence format. I hope the example below helps you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1']) 
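
Why a plain list fails can be seen without gensim: the error message shows that a `.words` attribute is read from each document, and a list has no such attribute. Here is a small sketch, where `read_words` is a hypothetical stand-in for that attribute access and not a gensim function:

```python
from collections import namedtuple

def read_words(document):
    # Hypothetical stand-in: mimics code that expects a `.words` attribute
    return document.words

plain = "This is a sentence".split()

try:
    read_words(plain)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# An object that carries both fields works
Tagged = namedtuple('Tagged', 'words tags')
tagged = Tagged(words=plain, tags=[0])
print(read_words(tagged))
```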

More details are here: http://rare-technologies.com/doc2vec-tutorial/

However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.

sentences=doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences,size = 100, window = 300, min_count = 10, workers=4)
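
To make the file format concrete, here is a sketch in plain Python that mimics the behaviour described above without using gensim. `read_tagged_lines` is a hypothetical helper written for this illustration, and the temporary file stands in for `file_path`.

```python
import os
import tempfile

def read_tagged_lines(path):
    # Hypothetical helper mimicking the described format:
    # one line = one document, words split on whitespace, tag = line number
    with open(path, encoding='utf-8') as handle:
        for line_no, line in enumerate(handle):
            yield line.split(), [line_no]

# Write a two-document file: one preprocessed document per line
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("this is a sentence\nthis is another sentence\n")
    file_path = tmp.name

pairs = list(read_tagged_lines(file_path))
os.remove(file_path)

for words, tags in pairs:
    print(tags, words)
```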

To get a document vector, you can use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99] 

where 99 is the id of the document whose vector we want. If the labels are integers (the default if you load using TaggedLineDocument()), use the integer id directly, as I did. If the labels are strings, use "SENT_99". This is similar to Word2vec.
