Google BigQuery: simulating pandas removeDuplicates() in Google BigQuery SQL

Date: 2022-04-23 00:57:26

Given a Google BigQuery dataset with columns col_1...col_m, how can you use Google BigQuery SQL to return the dataset with duplicates removed over, say, [col1, col3, col7]? That is, when several rows have the same values in [col1, col3, col7], the first of those rows is kept and all the other rows that duplicate those column values are removed.


Example: removeDuplicates([col1, col3])


    col1 col2 col3
    ---- ---- ----
r1: 20   25   30
r2: 20   70   30
r3: 40   70   30

returns


    col1 col2 col3
    ---- ---- ----
r1: 20   25   30
r3: 40   70   30

Doing this with Python pandas is easy: for a dataframe (i.e. a matrix), you call the pandas function removeDuplicates([field1, field2, ...]) (the actual pandas function is drop_duplicates with a subset argument). However, removeDuplicates is not defined within the context of Google BigQuery SQL.


My best guess for how to do this in Google BigQuery is to use the rank() function:


https://cloud.google.com/bigquery/query-reference#rank
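
For illustration, here is a sketch of what that window-function idea could look like in current BigQuery standard SQL (mytable is a placeholder table name, and ROW_NUMBER() is used instead of rank() so that tied rows still get distinct numbers):

-- Number the rows within each (col1, col3, col7) group and keep only the first one.
-- Without an ORDER BY inside the window, which row counts as "first" is not guaranteed.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY col1, col3, col7) AS rn
  FROM mytable
) AS t
WHERE rn = 1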


I am looking for a concise solution if one exists.


1 solution

#1



You can group by all of the columns you want to remove duplicates from, and take FIRST() of the others. That is, removeDuplicates([col1, col3]) would translate to


SELECT col1, FIRST(col2) as col2, col3 
FROM table 
GROUP EACH BY col1, col3

Note that in BigQuery SQL, if you have more than a million distinct values for col1 and col3, you'll need the EACH keyword.

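FIRST() and GROUP EACH BY belong to legacy BigQuery SQL. As a rough sketch of the same idea in current standard SQL (again with a placeholder table name mytable), ANY_VALUE() can stand in for FIRST(); like FIRST() under a GROUP BY, it makes no guarantee about which row's value is returned:

-- Same grouping idea in standard SQL; ANY_VALUE() replaces legacy FIRST(),
-- and plain GROUP BY replaces GROUP EACH BY.
SELECT col1, ANY_VALUE(col2) AS col2, col3
FROM mytable
GROUP BY col1, col3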
