Importing and using word2vec GoogleNews-vectors-negative300.bin.gz in R

Time: 2023-02-15 06:22:02

I am a big fan of the word2vec algorithm. I obtained the binary vector file published by the Google research team, and I would like to run some analyses on it that I have previously run on much smaller datasets.

I am not able to import the file GoogleNews-vectors-negative300.bin.gz into R.

I extracted it and, using rword2vec (found on GitHub), converted it from bin to a txt file. The package includes a search function, but it is very slow.

That is why I am now trying to import the file into R and, if possible, transform it into a data frame with this structure:

name | vec1 | ... | vec300

I tried the built-in readBin (could not obtain the names), readLines on the txt file (did not finish), and read_lines from the readr package (produced a vector of only 12 MB).

Could you please point me in the right direction?

1 Answer

#1


0  

I finally found a way.

With the rword2vec package, you can either convert the file with the function bin_to_txt or use the other functions the package provides to query the binary file directly. For more information, see the package vignette.

library(rword2vec)
# Find the 10 words closest to "king" in the extracted (not gzipped) binary file
dist <- distance(file_name = "GoogleNews-vectors-negative300.bin", search_word = "king", num = 10)
dist
           word              dist
1         kings 0.713804960250854
2         queen 0.651095926761627
3       monarch 0.641319692134857
4  crown_prince 0.620422065258026
5        prince 0.615999639034271
6        sultan 0.586482524871826
7         ruler 0.579756796360016
8       princes 0.564655303955078
9  Prince_Paras 0.543294668197632
10       throne 0.542210519313812
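
For the name | vec1 | ... | vec300 layout asked about in the question, the binary file can also be read with base R alone. The following is only a sketch, not part of rword2vec: it assumes the layout written by the original word2vec tool (an ASCII header with vocabulary size and dimension, then for each word its bytes, a space, and the vector as 4-byte little-endian floats). The names read_word2vec_bin and nearest_words are hypothetical, and the demo runs on a tiny generated file rather than the real one. As far as I know the GoogleNews file holds about 3 million vectors of 300 dimensions, which is roughly 7 GB as R doubles, so the max_words argument is there to load only the first part.

```r
# Hypothetical helper: read a word2vec binary file into a numeric matrix
# with one row per word (row names are the words).
read_word2vec_bin <- function(path, max_words = Inf) {
  con <- file(path, "rb")  # assumption: extracted file; gzfile() for .gz
  on.exit(close(con))

  # Read bytes up to (not including) stop_byte, dropping newline bytes,
  # because each vector may be followed by "\n" before the next word.
  read_until <- function(stop_byte) {
    bytes <- raw(0)
    repeat {
      b <- readBin(con, "raw", n = 1L)
      if (length(b) == 0L || b == stop_byte) break
      if (b != as.raw(0x0a)) bytes <- c(bytes, b)
    }
    rawToChar(bytes)
  }

  # Header line: "<vocab_size> <dim>"
  dims <- as.integer(strsplit(read_until(as.raw(0x0a)), " ", fixed = TRUE)[[1]])
  n <- min(dims[1], max_words)
  d <- dims[2]

  words <- character(n)
  vecs <- matrix(0, nrow = n, ncol = d)
  for (i in seq_len(n)) {
    words[i] <- read_until(as.raw(0x20))  # word ends at the first space
    vecs[i, ] <- readBin(con, "numeric", n = d, size = 4L, endian = "little")
  }
  rownames(vecs) <- words
  vecs
}

# Hypothetical helper: cosine similarity of `word` to every other row,
# computed on the in-memory matrix.
nearest_words <- function(vecs, word, num = 10) {
  unit <- vecs / sqrt(rowSums(vecs^2))      # scale every row to length 1
  sims <- as.vector(unit %*% unit[word, ])  # dot products = cosine similarity
  names(sims) <- rownames(vecs)
  sims <- sims[names(sims) != word]         # drop the query word itself
  head(sort(sims, decreasing = TRUE), num)
}

# Demo on a tiny file written in the assumed layout (3 words, 2 dimensions).
toy <- list(king = c(1, 0.5), queen = c(0.5, 1), apple = c(-1, 0.25))
demo_path <- tempfile(fileext = ".bin")
out <- file(demo_path, "wb")
writeBin(charToRaw("3 2\n"), out)
for (w in names(toy)) {
  writeBin(charToRaw(paste0(w, " ")), out)
  writeBin(toy[[w]], out, size = 4L, endian = "little")
  writeBin(charToRaw("\n"), out)
}
close(out)

vecs <- read_word2vec_bin(demo_path)

# Data frame in the layout from the question: name | vec1 | ... | vecN
df <- data.frame(name = rownames(vecs), stringsAsFactors = FALSE)
df[paste0("vec", seq_len(ncol(vecs)))] <- as.data.frame(unname(vecs))

nearest_words(vecs, "king", num = 2)
```

On the real file, something like read_word2vec_bin("GoogleNews-vectors-negative300.bin", max_words = 100000) would load only the first 100,000 words. Reading one byte at a time is simple but not fast, so expect the load to take a while. Once the matrix is in memory, repeated lookups no longer touch the file.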
