What database/technology should be used to calculate unique visitors in a time range?

Time: 2021-09-29 15:25:13

I have a performance problem with my reporting database (the tables have 50+ million records) when I calculate a distinct count on a column that identifies a unique visitor, let's say some hashkey.

For example: I have these columns: hashkey, name, surname, visit_datetime, site, gender, etc...

I need to get the distinct count over a time span of 1 year in less than 5 seconds:

SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD' 

This query is fast for short time ranges, but if the range is longer than one month, it can take more than 30 seconds.

Is there a better technology to calculate something like this than relational databases?

I'm wondering what Google Analytics uses to calculate its unique visitors on the fly.

3 Solutions

#1


3  

For reporting and analytics, which is the type of thing you're describing, these sorts of statistics tend to be pulled out, aggregated, and stored in a data warehouse or a similar store. They are stored in a layout chosen for query performance rather than in the normalized relational form that is optimized for OLTP (online transaction processing). This pre-aggregated technique is called OLAP (online analytical processing).

#2


0  

You could have another table store the count of unique visitors for each day, updated daily by a cron job or something similar.
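
A minimal sketch of the daily table idea, using Python's built-in sqlite3 module purely for illustration. The table name daily_visitors, the sample rows, and the helper functions are assumptions, not part of the original question. One caveat shapes the design: per-day counts cannot simply be summed over a longer range, because a visitor who returns on several days would be counted once per day. This sketch therefore stores one row per (day, hashkey) pair rather than a single number per day, so a range query can still deduplicate while scanning far fewer rows than the raw visit log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (hashkey TEXT, visit_datetime TEXT, site TEXT);
    -- Hypothetical rollup table: one row per visitor per day.
    CREATE TABLE daily_visitors (
        visit_date TEXT NOT NULL,
        hashkey    TEXT NOT NULL,
        PRIMARY KEY (visit_date, hashkey)
    );
""")

# Made-up sample data: visitor "a" comes back on a second day.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("a", "2021-01-01 09:00:00", "home"),
        ("a", "2021-01-01 17:30:00", "home"),
        ("b", "2021-01-01 10:00:00", "home"),
        ("a", "2021-01-02 08:15:00", "home"),
        ("c", "2021-02-10 12:00:00", "home"),
    ],
)


def refresh_rollup(conn, day):
    """The step a nightly cron job would run for one calendar day."""
    conn.execute(
        """
        INSERT OR IGNORE INTO daily_visitors (visit_date, hashkey)
        SELECT DISTINCT date(visit_datetime), hashkey
        FROM visits
        WHERE visit_datetime >= ? AND visit_datetime < date(?, '+1 day')
        """,
        (day, day),
    )


def unique_visitors(conn, start, end):
    """Distinct visitors in [start, end], read from the rollup only."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT hashkey) FROM daily_visitors "
        "WHERE visit_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]


def sum_of_daily_counts(conn, start, end):
    """Naive alternative: add up per-day counts (overcounts repeat visitors)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM daily_visitors WHERE visit_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]


for day in ("2021-01-01", "2021-01-02", "2021-02-10"):
    refresh_rollup(conn, day)

print(unique_visitors(conn, "2021-01-01", "2021-12-31"))
print(sum_of_daily_counts(conn, "2021-01-01", "2021-12-31"))
```

With the sample rows, the two printed numbers differ because visitor "a" appears on two days. Whether a rollup like this meets the 5-second target depends on how many distinct visitors there are per day, so treat it as a starting point to measure rather than a guaranteed fix.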

#3


0  

Google Analytics uses a first-party cookie, which you can see by logging the request headers with a tool such as LiveHTTPHeaders.

All GA analytics parameters are packed into the Request URL, e.g.,

http://www.google-analytics.com/_utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...

Within that URL is a piece keyed to utmcc; these are the GA cookies. Within utmcc is a string keyed to __utma, which is a string comprised of six fields, each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:

__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1

In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier for each visitor.
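
As a rough illustration of the field layout just described, the following Python sketch pulls the second '.'-delimited field out of the __utma value inside the utmcc parameter. The function name visitor_id_from_request and the shortened sample URL are assumptions made for this example; only the parameter names and the cookie value come from the request shown above.

```python
from urllib.parse import parse_qs, urlsplit


def visitor_id_from_request(url):
    """Return the second field of the __utma cookie packed into utmcc, or None."""
    # parse_qs also percent-decodes, so %3D becomes '=' and %3B becomes ';'.
    query = parse_qs(urlsplit(url).query)
    cookies = query.get("utmcc", [""])[0]
    for cookie in cookies.split(";"):
        name, _, value = cookie.strip().partition("=")
        if name == "__utma":
            fields = value.split(".")
            if len(fields) == 6:
                return fields[1]
    return None


# Shortened version of the request above (most parameters omitted for brevity).
sample_url = (
    "http://www.google-analytics.com/_utm.gif?utmwv=4&utmac=UA-30138-1"
    "&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B"
)

print(visitor_id_from_request(sample_url))
```

A request without a utmcc parameter, or with a __utma value that does not have six fields, yields None rather than raising an error.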

So you can see the flaws of this technique for identifying unique visitors--entering the Site using a different browser, or a different device, or after deleting the cookies, will cause you to appear to GA as a new unique visitor (i.e., it looks for its cookies and doesn't find any, so it sets them).

There is an excellent article by EFF on this topic--i.e., how uniqueness can be established, and with what degree of certainty, and how it can be defeated.

Finally, one technique I have used to determine whether someone has visited our Site before (assuming the hard case, which is that they have deleted their cookies, etc.) is to examine the client request for our favicon. The directories that store favicons are quite often overlooked--whether during a manual sweep or programmatically using a script.
