Get the first row of a nested field in BigQuery

Time: 2021-01-24 15:39:14

I have been struggling with a question that seems simple, yet eludes me. I am working with the public BigQuery table on bitcoin and would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field with its first row, as it appears in the table preview. No field identifies that row; the only clue is the order in which it was stored in the table.

I ran the following query:

#StandardSQL
SELECT timestamp,
    block_id,
    FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`

But it processes 492 GB when run and throws the following error:

Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..

It seems so simple that I must be missing something. Do you have any idea how to handle such a task?

2 solutions

#1


2  

#standardSQL
SELECT * EXCEPT(transactions),
  (SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`    
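One caveat, offered as a hedged aside rather than a correction: BigQuery's documentation notes that UNNEST does not by itself guarantee the order of the rows it produces, so LIMIT 1 without an ORDER BY may not always return the element at array position 0. A minimal sketch of a deterministic variant uses WITH OFFSET and orders by the position. The `blocks` CTE, its sample values, and the aliases `t`, `pos`, and `first_transaction` are hypothetical stand-ins so the query runs without scanning the public table.

```sql
#standardSQL
-- Hypothetical inline data standing in for the real blocks table.
WITH blocks AS (
  SELECT 'block_a' AS block_id,
         [STRUCT('tx_1' AS transaction_id),
          STRUCT('tx_2' AS transaction_id)] AS transactions
)
SELECT * EXCEPT(transactions),
  -- WITH OFFSET exposes each element's array index as pos;
  -- ordering by it makes LIMIT 1 pick the element at position 0.
  (SELECT t
   FROM UNNEST(transactions) AS t WITH OFFSET AS pos
   ORDER BY pos
   LIMIT 1) AS first_transaction
FROM blocks
```

To apply the same idea to the real data, replace the CTE with the public table in the FROM clause.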

Recommendation: when experimenting with a large table like this one, I would recommend creating a smaller version of it so that your dev/test work incurs less cost. The query below can help with this: run it in the BigQuery UI with a destination table, which you will then use for development. Make sure you set Allow Large Results and unset Flatten Results so that you preserve the original schema.

#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks@1529518619028]     

The value 1529518619028 is taken from the query below (at the time of running). The reason I went back four days is that I know the table had just 912 rows at that time, versus 528,858 currently.

#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000) 
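For readers who work in standard SQL, the same number (milliseconds since the Unix epoch, four days before now) can be computed as in the sketch below. The alias `snapshot_millis` is an illustrative name, not part of the original answer. Note that the `@` table decorator itself is a legacy SQL feature; in standard SQL the closest counterpart is the `FOR SYSTEM_TIME AS OF` clause, which is subject to the table's time travel window.

```sql
#standardSQL
-- 96 hours mirrors the -24*4 'HOUR' arithmetic in the legacy query above.
SELECT UNIX_MILLIS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 96 HOUR)) AS snapshot_millis
```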

#2


1  

An alternative approach to Mikhail's: just ask for the first row of an array with [OFFSET(0)]:

#StandardSQL
SELECT timestamp,
    block_id,
    transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10

That first row from the array still contains some nested data, which you might also want to flatten to only its first row:

#standardSQL
SELECT timestamp
    , block_id
    , transactions[OFFSET(0)].transaction_id first_transaction_id
    , transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
    , transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000
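A small hedge on the queries above: `[OFFSET(0)]` raises an error when the array is empty, whereas `[SAFE_OFFSET(0)]` returns NULL instead. Whether any array in the bitcoin table is actually empty is not something the answer states, so treat the following as a sketch under that assumption. The `blocks` CTE and its sample values are hypothetical inline data used only to show the behavior.

```sql
#standardSQL
-- Hypothetical inline data: one block with a transaction, one with an empty array.
WITH blocks AS (
  SELECT 'block_a' AS block_id,
         [STRUCT('tx_1' AS transaction_id)] AS transactions
  UNION ALL
  SELECT 'block_b', ARRAY<STRUCT<transaction_id STRING>>[]
)
SELECT block_id,
       -- SAFE_OFFSET yields NULL for block_b, where OFFSET(0) would fail the query.
       transactions[SAFE_OFFSET(0)].transaction_id AS first_transaction_id
FROM blocks
```

The same substitution works for the nested `inputs` and `outputs` arrays.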
