Retrieving time series data with Cassandra

Time: 2021-10-17 23:09:14

I'm researching how to store logs in Cassandra.
The schema for the logs would be something like this.

EDIT: I've changed the schema to clarify a few things.

CREATE TABLE log_date (
  userid bigint,
  time timeuuid,
  reason text,
  item text,
  price int,
  count int,
  PRIMARY KEY ((userid), time)                              -- option #1
  PRIMARY KEY ((userid), time, reason, item, price, count)  -- option #2
);

A new table will be created for each day, so a table contains logs for only one day.

My querying condition is as follows.
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price, and count will not be used as hints or conditions for queries at all.

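For context, that read would just be a full partition scan on that day's table, since userid is the partition key under either option. The table name below is only a placeholder for whatever per-day naming scheme I end up using:

-- hypothetical per-day table name; works with either PK option #1 or #2
SELECT * FROM log_2021_10_17 WHERE userid = 1000;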

My question is which PRIMARY KEY design suits this better.
EDIT: And the key point here is that I want to store the logs in a schematic way.

If I choose #1, many columns would be created per log, and the possibility of having even more values per log is very high. The schema above is just an example; a log can also contain values like subreason, friendid and so on.

If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any useful information about the overhead of composite columns.

Which one should I choose? Please help.

1 Answer

#1


My advice is that neither of your two options seems ideal for your time series, and the fact that you're creating a table per day doesn't seem optimal either.

Instead, I'd recommend creating a single table, partitioning by userid and day, and using a timeuuid as the clustering column for the event. An example of this would look like:

CREATE TABLE log_per_day (
   userid bigint,
   date text,
   time timeuuid,
   value text,
   PRIMARY KEY ((userid, date), time)
);

This will allow you to have all of a day's events in a single row and lets you run your query per user per day.

Declaring time as the clustering column gives you a wide row, where you can insert as many events as you need in a day.

So the row key is a composite key of the userid plus the date as text, e.g.

insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value')

insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2')

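The aTimeUUID placeholders above could be generated on the client side, or directly in CQL with the built-in now() function, which returns a timeuuid for the current time, e.g.

insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',now(),'my value3')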

The two inserts above will end up in the same row, and therefore you will be able to read them back in a single query.

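As a rough sketch, that single read is just a query on the composite partition key, and the results come back ordered by the time clustering column:

select * from log_per_day where userid = 1000 and date = '2015-05-06'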

Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.

Hope it helps,

José Luis
