How do I read data from Cassandra in R?

Time: 2022-05-18 14:40:42

I am using R 2.14.1 and Cassandra 1.2.11. A separate program has written data to a single Cassandra table, but I am failing to read it from R.

The Cassandra schema is defined like this:

create table chosen_samples (id bigint , temperature double, primary key(id))

I first tried the RCassandra package (http://www.rforge.net/RCassandra/):

> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")

The connection seems to succeed, but parsing the table into a data frame fails:

> cs
Error in data.frame(..dfd. = c("@\"ffffff", "@(<cc><cc><cc><cc><cc><cd>",  : 
  duplicate row.names: 
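
One observation that may help with diagnosis: the strings shown in the error look like raw 8-byte values rather than decoded numbers. As a minimal sketch (this reading of the error is an assumption, and `decode_double` is a hypothetical helper, not part of RCassandra), interpreting those bytes as big-endian IEEE 754 doubles gives plausible temperatures, which suggests the `double` column is coming back undecoded:

```r
# Bytes of the first string in the error message: '@', '"', then six 'f's.
first <- charToRaw("@\"ffffff")

# Bytes of the second string, written out from the <cc>/<cd> escapes.
second <- as.raw(c(0x40, 0x28, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcd))

# Hypothetical helper: interpret 8 raw bytes as one big-endian double.
decode_double <- function(bytes) {
  readBin(bytes, what = "double", n = 1, size = 8, endian = "big")
}

temps <- c(decode_double(first), decode_double(second))
print(temps)
```

The two values decode to 9.2 and 12.4, so the data itself appears to be intact and the problem is likely in how the columns are typed on the R side.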

I have also tried using the JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive

> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")

But this one fails like this:

Error in .jfindClass(as.character(driverClass)[1]) : class not found

This happens even though the path to the Java driver is correct:

$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar

5 answers

#1


4  

You have to download apache-cassandra-2.0.10-bin.tar.gz, cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.

There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use:

library(RJDBC)
# Pass every jar in the lib directory as the driver classpath
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern = "jar$", full.names = TRUE))

That works on my Unix machine but not on my Windows machine. Hope that helps.
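
The `list.files()` call above builds the classpath from every jar in a directory. Here is a minimal sketch of that step in isolation, using a temporary directory with empty placeholder files so that it runs without Cassandra or Java installed. The helper name `jar_classpath` and the demo directory are illustrative assumptions. The pattern `\\.jar$` is slightly stricter than `jar$`, since it only matches a real `.jar` extension:

```r
# Hypothetical helper: collect the full paths of all .jar files in a directory.
jar_classpath <- function(lib_dir) {
  jars <- list.files(lib_dir, pattern = "\\.jar$", full.names = TRUE)
  if (length(jars) == 0) stop("no .jar files found in ", lib_dir)
  jars
}

# Demo with empty placeholder files standing in for the real jars.
lib_dir <- file.path(tempdir(), "cassandra-lib-demo")
dir.create(lib_dir, showWarnings = FALSE)
file.create(file.path(lib_dir, c("cassandra-jdbc-1.2.5.jar",
                                 "cassandra-all-1.1.0.jar",
                                 "README.txt")))

classpath <- jar_classpath(lib_dir)
print(basename(classpath))
```

The resulting character vector can be passed as the `classPath` argument of `JDBC()`. Failing early when no jars are found makes a wrong directory easier to spot than a later "class not found" error.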

#2


3  

This question is old now, but since it's one of the top hits for R and Cassandra, I thought I'd leave a simple solution here. I found frustratingly little up-to-date support for what I thought would be a fairly common task.

Sparklyr makes this pretty easy to do from scratch now, as it exposes a Java context, so the Spark-Cassandra-Connector can be used directly. I've wrapped up the bindings in a simple package, crassy, but you don't need it to do this.

I mostly made it to demystify the config needed to make sparklyr load the connector, and because the syntax for selecting a subset of columns is a little unwieldy.

Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given that CQL can't be submitted directly to the cluster.

I've not found a way to submit more general CQL queries that doesn't involve writing custom Scala; however, there's an example of how this can work here.
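
For readers who want to try the sparklyr route without crassy, the following is a rough sketch of what the configuration might look like. It is not taken from the crassy package. It assumes sparklyr, a Spark installation and a reachable Cassandra node. The function name `read_cassandra_table` is hypothetical, and the connector's Maven coordinates are left as a caller-supplied argument because the right version depends on the Spark and Scala versions in use:

```r
# Sketch only: nothing here runs until the function is called, and calling it
# requires sparklyr, Spark and a reachable Cassandra node.
read_cassandra_table <- function(host, keyspace, table, connector,
                                 master = "local") {
  config <- sparklyr::spark_config()
  # Maven coordinates of the Spark-Cassandra-Connector (supplied by the caller).
  config$sparklyr.defaultPackages <- connector
  # Tell the connector where the Cassandra node lives.
  config$spark.cassandra.connection.host <- host

  sc <- sparklyr::spark_connect(master = master, config = config)

  # Register the Cassandra table as a Spark DataFrame and return a reference.
  sparklyr::spark_read_source(
    sc,
    name = table,
    source = "org.apache.spark.sql.cassandra",
    options = list(keyspace = keyspace, table = table)
  )
}
```

With the question's setup, a call would pass `"192.168.33.10"`, `"poc1_samples"` and `"chosen_samples"` plus the connector coordinates. The returned table reference can then be filtered with dplyr verbs and brought into R with `collect()`.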

#3


2  

Right, I found an (admittedly ugly) way: calling Python from R, encoding the NAs manually and re-assigning the data frame's column names in R, like this:

# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)
python.exec("import sys")
# adding libraries from virtualenv 
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")
python.exec("import cql")

python.exec("connection=cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples' )")

# coding python None into NA (rPython seem to just return nothing )
python.exec("rep = lambda x : '__NA__' if x is None else x")
python.exec( "def getData(): return [rep(num) for line in cursor for num in line ]" )
data <- python.call("getData")
df <- as.data.frame(matrix(unlist(data), ncol=15, byrow=T))

names(df) <- c("temperature", "maxTemp", "minTemp",
"dewpoint", "elevation", "gust", "latitude", "longitude",
"maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")

# and decoding NAs (vectorised, so it works on whole columns)
parsena <- function(x) { x[x == "__NA__"] <- NA; x }
df <- as.data.frame(lapply(df, parsena))

Does anybody have a better idea?
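
The R side of this workaround (reshaping a flat list into rows and turning the sentinel back into `NA`) can be sketched on its own, without Python or Cassandra. The sample values and the three-column layout below are made up for illustration. Only the `"__NA__"` sentinel and the `byrow` reshaping come from the code above:

```r
# Flat list as it might come back from the Python side: 2 rows x 3 columns,
# with "__NA__" standing in for missing values (made-up sample data).
data <- list(1, 20.5, "__NA__", 2, "__NA__", 3.1)
col_names <- c("id", "temperature", "dewpoint")

flat <- unlist(data)  # mixing numbers and strings coerces all to character
m <- matrix(flat, ncol = length(col_names), byrow = TRUE)
m[m == "__NA__"] <- NA  # vectorised sentinel replacement

df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- col_names
# Convert the character columns back to numeric where possible.
df[] <- lapply(df, type.convert, as.is = TRUE)
print(df)
```

Replacing the sentinel on the matrix before building the data frame avoids a per-column `if`, which only inspects the first element of a vector (and is an error in recent R versions).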

#4


1  

I had the same error message when executing Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver). Also, when run in RStudio, it failed on the first run but worked fine on the second.

What helped was explicitly calling:

library(rJava)
.jinit()
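
As a rough sketch of how this might be combined with the driver setup (assuming rJava, RJDBC and a Java runtime are available; `make_jdbc_driver` is a hypothetical helper, not part of either package), the JVM can be started and the jars added to the classpath before the driver is created:

```r
# Sketch only: the rJava/RJDBC calls run only once the jar check has passed.
make_jdbc_driver <- function(driver_class, jars) {
  # Fail early with a clear message if a jar path is wrong.
  missing_jars <- jars[!file.exists(jars)]
  if (length(missing_jars) > 0) {
    stop("jar(s) not found: ", paste(missing_jars, collapse = ", "))
  }
  rJava::.jinit()              # start the JVM if it is not running yet
  rJava::.jaddClassPath(jars)  # make the driver and its dependencies visible
  RJDBC::JDBC(driver_class, classPath = jars)
}
```

Checking the paths first separates "the file is not there" from "the class is not in the jars", which are otherwise both reported as a class-loading failure.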

#5


0  

It is not enough to just download the driver; you also have to download its dependencies and put them on your Java classpath (macOS: /Library/Java/Extensions), as stated on the project's main page:

Include the Cassandra JDBC dependencies in your classpath: download dependencies

As for the RCassandra package, right now it's still too primitive compared to RJDBC.
