How to scale pivoting in BigQuery?

Date: 2022-06-15 14:06:52

Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String

I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with each Artist column holding the play count for the respective user.
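
To make the target shape concrete, here is a tiny hypothetical example (GUIDs and counts are made up):

Input (mydataset.stats), one row per play:

UserGUID  ArtistGUID
user-1    artist-1
user-1    artist-1
user-1    artist-2
user-2    artist-2

Desired output, one row per user with a play count per artist column:

UserGUID  Artist1  Artist2  …  Artist8000
user-1    2        1        …  0
user-2    0        1        …  0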

An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale to the numbers in my example.

Can this approach be scaled for my example?

1 solution

#1

I tried the approach below for up to 6,000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.

STEP 1 – Aggregate plays by user / artist

SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
FROM [mydataset.stats] GROUP BY 1, 2

STEP 2 – Normalize uid and aid so that they are consecutive numbers 1, 2, 3, … .
We need this for at least two reasons: a) to make the dynamically created SQL later as compact as possible, and b) to have more usable/friendly column names.

Combined with the first step, it will be:

SELECT u.uid AS uid, a.aid AS aid, plays 
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays 
  FROM [mydataset.stats] 
  GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID 

Let’s write the output to the table mydataset.aggs

STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time. In my particular example, by experimenting, I found that the basic approach works well for between 2,000 and 3,000 features. To be on the safe side, I decided to use 2,000 features at a time.

The script below dynamically generates a query that is then run to create the partitioned tables:

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

The above query produces yet another query, like the one below:

SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
  SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid 

This should be run and the result written to mydataset.pivot_1_2000

Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) gives us two more tables, mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only the features with aid from 2001 to 4000; and so on.
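
For example, the generator for the second batch (aid from 2001 to 4000, with the result written to mydataset.pivot_2001_4000) is the same query with only the HAVING bounds changed:

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)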

STEP 4 – Merge all the partitioned pivot tables into a final pivot table, with all features represented as columns in one table

Same as in the above steps: first we generate the query and then run it. So, initially we will “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then that result with mydataset.pivot_4001_6000.

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)

The output string from above should be run and the result written to mydataset.pivot_1_4000

Then we repeat STEP 4 as below

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

The result should be written to mydataset.pivot_1_6000

The resulting table has the following schema:

uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 
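
As a quick sanity check (just a sketch, assuming the tables were created as described above), you can confirm that the final table has one row per user and spot-check a couple of the pivoted columns:

SELECT COUNT(1) AS users, SUM(a1) AS a1_total_plays, SUM(a6000) AS a6000_total_plays
FROM [mydataset.pivot_1_6000]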

NOTE:
a. I tried this approach only up to 6,000 features and it worked as expected.
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables are relatively small (30-40 MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea.
