使用变量/维度过滤时,使用Google Core Reporting API和bigquery的用户数不正确且不一致

时间:2022-03-01 14:55:46

Background: I have app and web data, some of my apps (new iOS versions) use GA dimensions and the rest (Android and web) use GA custom variables.

背景:我有应用和网络数据,我的一些应用程序(新的iOS版本)使用GA维度,其余(Android和Web)使用GA自定义变量。

So firstly, I'm currently trying to replicate this query in BigQuery in the Query Explorer to get simple user counts over a defined date for my web users only:

首先,我正在尝试在查询资源管理器中的BigQuery中复制此查询,以便仅在我的Web用户的定义日期内获得简单的用户计数:

select count(distinct fullvisitorid, 10000000) as users
 from table_date_range([12345678.ga_sessions_],    
 timestamp('2015-02-01'), timestamp('2015-03-01')) 
  where hits.customvariables.customvarvalue like '%web%'

I get around 5.34m users. This corresponds to what I see in Google Analytics. I am confident this figure is correct.

我得到了大约534万用户。这与我在Google Analytics中看到的相对应。我相信这个数字是正确的。

If I go into the Query Explorer and apply no filters (so I include my app and web users) I get 5.70m users. Again, this corresponds to Google Analytics and we're confident this figure is correct, web makes up the majority of our traffic.

如果我进入查询资源管理器并且不应用任何过滤器(因此我包括我的应用程序和Web用户),我将获得5.70m用户。同样,这与Google Analytics相对应,我们相信这个数字是正确的,网络占据了我们的大部分流量。

If I run another query in Query Explorer but this time apply the filter:

如果我在查询资源管理器中运行另一个查询,但这次应用过滤器:

ga:customVarValue1=@web

I get 8.73m users. So I have more users after applying the filter than without... obviously this isn't correct and has something to do with how the Query Explorer is applying the filter post aggregation.

我得到873万用户。因此,在应用过滤器之后我有更多用户...显然这不正确,并且与查询资源管理器如何应用过滤器后聚合有关。

Note: When I run this query in BigQuery:

注意:当我在BigQuery中运行此查询时:

select sum(users)
from (
   select count(distinct fullvisitorid, 1000000) as users,
    hits.customvariables.customvarvalue as platform
   from table_date_range([12345678.ga_sessions_],    
    timestamp('2015-02-01'), timestamp('2015-03-01')) 
   group each by 2)
    where platform like '%web%'

I get 8.73m users. Almost the exact same number as I get when applying the filter in Query Explorer, the difference I get of around 1% can be explained by the sampling. I've tested it on multiple dates so I'm sure this is what's happening. Applying the filter post aggregation instead of pre (as in my first BigQuery query) leads to a higher number of users because we had two web releases in this timeframe. So all users are being counted once for every version of web they used.

我得到873万用户。几乎与我在查询资源管理器中应用过滤器时获得的数字完全相同,我得到的差异大约为1%可以通过抽样来解释。我已经在多个日期测试了它,所以我确信这就是正在发生的事情。应用过滤器后期聚合而不是pre(在我的第一个BigQuery查询中)会导致更多用户,因为我们在此时间段内有两个Web版本。因此,对于他们使用的每个Web版本,所有用户都被计算一次。

To add: One of the developers on my team wrote some Python script back in February which replicated the first BigQuery code written above (a simple user count where the variable=web) but instead hits the Core Reporting API and requests an unsampled report. Until March 5th 2015 the number of users we got using BigQuery versus the Python script were almost identical (difference of 1% due to sampling). Then on March 5th they began to diverge, even for historical user counts, and instead our Python script started producing counts similar to the Query Explorer (filters being applied post aggregation instead of pre).

添加:我的团队中的一位开发人员在二月份写了一些Python脚本,它复制了上面写的第一个BigQuery代码(一个简单的用户计数,其中变量= web),而是点击Core Reporting API并请求一个非抽样的报告。在2015年3月5日之前,我们使用BigQuery和Python脚本的用户数几乎相同(由于采样,差异为1%)。然后在3月5日他们开始分歧,即使对于历史用户计数,而我们的Python脚本开始产生类似于查询资源管理器的计数(过滤器在聚合后而不是预先应用)。

My question(s) are: 1. What changed on March 5th? 2. How do we replicate in Query explorer the first BigQuery code above? Are we applying the variable filter correctly? 3. How do we replicate the BigQuery code in our Python script which hits the Core reporting API?

我的问题是:1。3月5日有什么变化? 2.我们如何在Query explorer中复制上面的第一个BigQuery代码?我们正确应用可变过滤器吗? 3.我们如何在我们的Python脚本中复制BigQuery代码,该脚本会访问Core报告API?

Lastly: When in Query Explorer I ask for user counts over a given date and instead use a dimension filter:

最后:在查询资源管理器中,我要求在给定日期内进行用户计数,而是使用维度过滤器:

ga:dimension2=@ios

I get around 50% LESS than I get in BigQuery when running:

在运行时,我比BigQuery得到的要少50%:

select count(distinct fullvisitorid, 10000000) as users
 from table_date_range([12345678.ga_sessions_],    
 timestamp('2015-02-01'), timestamp('2015-03-01')) 
  where hits.customdimensions.value like '%ios%'

If the filter was being applied post aggregation as it is when filtering using variables then I would get a higher user count, not less. I seriously cannot explain what the Query Explorer is doing in order to give me substantially lower counts when filtering on dimensions.

如果过滤器在聚合后应用,就像使用变量进行过滤那样,那么我会获得更高的用户数,而不是更少。我严重无法解释查询资源管理器正在做什么,以便在过滤维度时给予我大大降低的计数。

Please halp

1 个解决方案

#1


I don't have an answer for you, but since you're using BigQuery, I assume you are a Premium customer? If so, you can open a ticket with their support team - they should get back to you quickly.

我没有给你答案,但既然你正在使用BigQuery,我认为你是一个高级客户?如果是这样,您可以与他们的支持团队打开一张票 - 他们应该尽快给您回复。

#1


I don't have an answer for you, but since you're using BigQuery, I assume you are a Premium customer? If so, you can open a ticket with their support team - they should get back to you quickly.

我没有给你答案,但既然你正在使用BigQuery,我认为你是一个高级客户?如果是这样,您可以与他们的支持团队打开一张票 - 他们应该尽快给您回复。