为基于事件的分析设计数据库模式

时间:2021-10-21 00:41:40

I'm trying to figure out the best way to model the schema for this event-based analytics system I'm writing. My main concern is writing this in a way that makes queries simple and fast. I'm going to be using MySQL as well. I'll go over some of the requirements and present an outline of a possible (but I think poor) schema.

我正在试图找出为我正在编写的基于事件的分析系统建模模型的最佳方法。我主要担心的是以一种简单快速的查询方式编写本文。我也将使用MySQL。我将回顾一些要求,并提出一个可能的(但我认为很差)架构的概述。

Requirements

  • Track events (e.g. track occurrences of the "APP_LAUNCH" event)

    跟踪事件(例如,跟踪“APP_LAUNCH”事件的发生)

  • Define custom events

    定义自定义事件

  • Ability to segment events on >1 custom properties (e.g. get occurrences of "APP_LAUNCH" segmented on the "APP_VERSION" property)

    能够在> 1个自定义属性上对事件进行分段(例如,在“APP_VERSION”属性上分段出现“APP_LAUNCH”)

  • Track sessions

    跟踪会话

  • Perform queries based on timestamp range

    根据时间戳范围执行查询

Possible Modeling

The main problem that I'm having is how to model segmentation and the queries to perform to get the overall counts of an event.

我遇到的主要问题是如何建模分段和执行查询以获取事件的总体计数。

My original idea was to define an EVENTS table with an id, int count, timestamp, property (?), and a foreign key to an EVENTTYPE. An EVENTTYPE has an id, name, and additional information belonging to a generic event type.

我最初的想法是定义一个EVENTS表,其中包含id,int count,timestamp,property(?)和EVENTTYPE的外键。 EVENTTYPE具有属于通用事件类型的id,名称和附加信息。

For example, the "APP_LAUNCH" event would have an entry in the EVENTS table with unique id, count representing the number of times the event happened, the timestamp (unsure about what this is stamped on), and a property or list of properties (e.g. "APP_VERSION", "COUNTRY", etc.) and a foreign key to an EVENTTYPE with name "APP_LAUNCH".

例如,“APP_LAUNCH”事件在EVENTS表中将有一个条目,其中包含唯一的id,count表示事件发生的次数,时间戳(不确定标记的内容)以及属性或属性列表(例如“APP_VERSION”,“COUNTRY”等)和名为“APP_LAUNCH”的EVENTTYPE的外键。

Comments and Questions

I'm pretty sure this isn't a good way to model this for the following reasons. It makes it difficult to do timestamp ranged queries ("Number of APP_LAUNCHES between time x and y"). The EVENTTYPE table doesn't really serve a purpose. Finally, I'm unsure as to how I would even perform queries for different segmentations. The last one is the one I'm most worried about.

由于以下原因,我很确定这不是建模的好方法。这使得难以进行时间戳范围查询(“时间x和y之间的APP_LAUNCHES数”)。 EVENTTYPE表并不真正起作用。最后,我不确定如何对不同的分段执行查询。最后一个是我最担心的那个。

I would appreciate any help in helping to correctly model this or in pointing me to resources that would help.

我将非常感谢帮助正确建模或帮助我找到有用的资源。

A final question (which is probably dumb): Is it bad to insert a row for every event? For example, say my client-side library makes the following call to my API:

最后一个问题(可能是愚蠢的):为每个事件插入一行是不是很糟糕?例如,假设我的客户端库对我的API进行以下调用:

track("APP_LAUNCH", {count: 4, segmentation: {"APP_VERSION": 1.0}})

How would I actually store this in the table (this is closely related to the schema design obviously)? Is it bad to simply insert a row for each one of these calls, of which there may be a significant amount? My gut reaction is that I'm really interested mainly in the overall aggregated counts. I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries. Would an aggregate table or a in-memory cache help to alleviate problems when I want the client to actually get the analytics?

我如何将其实际存储在表中(这显然与架构设计密切相关)?简单地为每个这些呼叫插入一行是不是很糟糕,其中可能有很大一部分?我的直觉反应是,我真的对整体聚合计数感兴趣。我没有足够的SQL经验来了解这些查询如何执行可能数十万这些条目。当我希望客户端实际获得分析时,聚合表或内存缓存是否有助于缓解问题?

I realize there are lots of questions here, but I would really appreciate any and all help. Thanks!

我知道这里有很多问题,但我真的很感激任何帮助。谢谢!

1 个解决方案

#1


17  

I think most of your concerns are unnecessary. Taking one of your questions after another:

我认为你的大部分担忧都是不必要的。接下来提出一个问题:

1) The biggest issue are the custom attributes, different for each event. For this, you have to use EAV (entity-attribute-value) design. The important question is - what types can these attributes have? If more than one - e.g. string and integer, then it is more complicated. There are in general two types of such design:

1)最大的问题是自定义属性,每个事件都有所不同。为此,您必须使用EAV(实体 - 属性 - 值)设计。重要的问题是 - 这些属性有哪些类型?如果不止一个 - 例如字符串和整数,然后它更复杂。一般有两种类型的设计:

  • use one table and one column for values of all type - and convert everything to string (not scalable solution)

    使用一个表和一列来表示所有类型的值 - 并将所有内容转换为字符串(不是可扩展的解决方案)

  • have separate tables for each data type (very scalable, I'd go for this)

    为每种数据类型都有单独的表(非常可扩展,我会这样做)

So, the tables would look like:

所以,表格看起来像:

Events             EventId int,  EventTypeId varchar,   TS timestamp
EventAttrValueInt  EventId int,  AttrName varchar,  Value int
EventAttrValueChar EventId int,  AttrName varchar,  Value varchar

2) What do you mean by segmentation? Querying various parameters of the event? In the EAV design mentioned above, you can do this:

2)你的分词是什么意思?查询事件的各种参数?在上面提到的EAV设计中,您可以这样做:

select * 
from Events 
  join EventAttrValueInt  on Id = EventId and AttrName = 'APPVERSION' and Value > 4
  join EventAttrValueChar on Id = EventId and AttrName = 'APP_NAME' 
                                          and Value like "%Office%"
where EventTypeId = "APP_LAUNCH"

This will select all events of APP_LAUNCH type where APPVERSION is > 4 and APP_NAME contains "Office".

这将选择APP_LACHCH类型的所有事件,其中APPVERSION> 4且APP_NAME包含“Office”。

3) The EVENTTYPE table could serve the purpose of consistency, i.e. you could:

3)EVENTTYPE表可以达到一致性的目的,即您可以:

table EVENTS (.... EVENTTYPE_ID varchar - foreign key to EVENTTYPE ...)
table EVENTTYPE (EVENTTYPE_ID varchar)

Or, you could use ID as number and have event name in the EVENTTYPE table - this saves space and allows for renaming of the events easily, but you will need to join this table in every query (resulting in a bit slower queries). Depends on priority of saving storage space vs lower query time / simplicity.

或者,您可以使用ID作为数字并在EVENTTYPE表中使用事件名称 - 这样可以节省空间并允许轻松重命名事件,但是您需要在每个查询中加入此表(导致查询速度稍慢)。取决于节省存储空间的优先级与较低的查询时间/简单性。

4) timestamp ranged queries are actually very simple in your design:

4)时间戳范围查询在您的设计中实际上非常简单:

select * 
from EVENTS
where EVENTTYPE_ID = "APP_LAUNCH" and TIMESTAMP > '2013-11-1'

5) "Is it bad to insert a row for every event?"

5)“为每个事件插入一行是不是很糟糕?”

This totally depends on you! If you need the timestamp and/or different parameters of every such event, then probably you should have a row for every event. If there is a huge amount of events of the same type and parameters, you can probably do what most loging systems do: aggregate the events which occur in one row. If you have such a gut feeling, then it's probably a way to go.

这完全取决于你!如果您需要每个此类事件的时间戳和/或不同参数,那么您可能应该为每个事件都有一行。如果存在大量具有相同类型和参数的事件,则可以执行大多数日志系统所执行的操作:聚合一行中发生的事件。如果你有这种直觉,那么它可能是一种可行的方式。

6) " I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries"

6)“我没有足够的SQL经验来了解这些查询如何执行可能成千上万的这些条目”

Hundreds or thousands such entries will be handled without problems. When you reach a milion, you will have to think much more on the efficiency.

将毫无问题地处理数百或数千个此类条目。当你达到一个百万,你将不得不考虑更多的效率。

7) "Would an aggregate table or a in-memory cache help to alleviate problems when I want the client to actually get the analytics?"

7)“当我希望客户端实际获得分析时,聚合表或内存缓存是否有助于缓解问题?”

Of course, this is also a solution, if the queries get slow and you need to respond fast. But then you must introduce some mechanism to refresh the cache periodically. It is overly more complicated; maybe better to consider aggregating the events on the input, see 5).

当然,这也是一种解决方案,如果查询变慢并且您需要快速响应。但是,您必须引入一些机制来定期刷新缓存。它过于复杂了;也许最好考虑在输入上聚合事件,见5)。

#1


17  

I think most of your concerns are unnecessary. Taking one of your questions after another:

我认为你的大部分担忧都是不必要的。接下来提出一个问题:

1) The biggest issue are the custom attributes, different for each event. For this, you have to use EAV (entity-attribute-value) design. The important question is - what types can these attributes have? If more than one - e.g. string and integer, then it is more complicated. There are in general two types of such design:

1)最大的问题是自定义属性,每个事件都有所不同。为此,您必须使用EAV(实体 - 属性 - 值)设计。重要的问题是 - 这些属性有哪些类型?如果不止一个 - 例如字符串和整数,然后它更复杂。一般有两种类型的设计:

  • use one table and one column for values of all type - and convert everything to string (not scalable solution)

    使用一个表和一列来表示所有类型的值 - 并将所有内容转换为字符串(不是可扩展的解决方案)

  • have separate tables for each data type (very scalable, I'd go for this)

    为每种数据类型都有单独的表(非常可扩展,我会这样做)

So, the tables would look like:

所以,表格看起来像:

Events             EventId int,  EventTypeId varchar,   TS timestamp
EventAttrValueInt  EventId int,  AttrName varchar,  Value int
EventAttrValueChar EventId int,  AttrName varchar,  Value varchar

2) What do you mean by segmentation? Querying various parameters of the event? In the EAV design mentioned above, you can do this:

2)你的分词是什么意思?查询事件的各种参数?在上面提到的EAV设计中,您可以这样做:

select * 
from Events 
  join EventAttrValueInt  on Id = EventId and AttrName = 'APPVERSION' and Value > 4
  join EventAttrValueChar on Id = EventId and AttrName = 'APP_NAME' 
                                          and Value like "%Office%"
where EventTypeId = "APP_LAUNCH"

This will select all events of APP_LAUNCH type where APPVERSION is > 4 and APP_NAME contains "Office".

这将选择APP_LACHCH类型的所有事件,其中APPVERSION> 4且APP_NAME包含“Office”。

3) The EVENTTYPE table could serve the purpose of consistency, i.e. you could:

3)EVENTTYPE表可以达到一致性的目的,即您可以:

table EVENTS (.... EVENTTYPE_ID varchar - foreign key to EVENTTYPE ...)
table EVENTTYPE (EVENTTYPE_ID varchar)

Or, you could use ID as number and have event name in the EVENTTYPE table - this saves space and allows for renaming of the events easily, but you will need to join this table in every query (resulting in a bit slower queries). Depends on priority of saving storage space vs lower query time / simplicity.

或者,您可以使用ID作为数字并在EVENTTYPE表中使用事件名称 - 这样可以节省空间并允许轻松重命名事件,但是您需要在每个查询中加入此表(导致查询速度稍慢)。取决于节省存储空间的优先级与较低的查询时间/简单性。

4) timestamp ranged queries are actually very simple in your design:

4)时间戳范围查询在您的设计中实际上非常简单:

select * 
from EVENTS
where EVENTTYPE_ID = "APP_LAUNCH" and TIMESTAMP > '2013-11-1'

5) "Is it bad to insert a row for every event?"

5)“为每个事件插入一行是不是很糟糕?”

This totally depends on you! If you need the timestamp and/or different parameters of every such event, then probably you should have a row for every event. If there is a huge amount of events of the same type and parameters, you can probably do what most loging systems do: aggregate the events which occur in one row. If you have such a gut feeling, then it's probably a way to go.

这完全取决于你!如果您需要每个此类事件的时间戳和/或不同参数,那么您可能应该为每个事件都有一行。如果存在大量具有相同类型和参数的事件,则可以执行大多数日志系统所执行的操作:聚合一行中发生的事件。如果你有这种直觉,那么它可能是一种可行的方式。

6) " I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries"

6)“我没有足够的SQL经验来了解这些查询如何执行可能成千上万的这些条目”

Hundreds or thousands such entries will be handled without problems. When you reach a milion, you will have to think much more on the efficiency.

将毫无问题地处理数百或数千个此类条目。当你达到一个百万,你将不得不考虑更多的效率。

7) "Would an aggregate table or a in-memory cache help to alleviate problems when I want the client to actually get the analytics?"

7)“当我希望客户端实际获得分析时,聚合表或内存缓存是否有助于缓解问题?”

Of course, this is also a solution, if the queries get slow and you need to respond fast. But then you must introduce some mechanism to refresh the cache periodically. It is overly more complicated; maybe better to consider aggregating the events on the input, see 5).

当然,这也是一种解决方案,如果查询变慢并且您需要快速响应。但是,您必须引入一些机制来定期刷新缓存。它过于复杂了;也许最好考虑在输入上聚合事件,见5)。