How to cross join unnest a JSON array in Presto

Time: 2022-02-02 21:04:26

Given a table that contains a column of JSON like this:

{"payload":[{"type":"b","value":"9"}, {"type":"a","value":"8"}]}
{"payload":[{"type":"c","value":"7"}, {"type":"b","value":"3"}]}

How can I write a Presto query to give me the average b value across all entries?

So far I think I need to use something like Hive's lateral view explode, whose equivalent is cross join unnest in Presto.

But I'm stuck on how to write the Presto query for cross join unnest.

How can I use cross join unnest to expand all array elements and select them?

3 Answers

#1


3  

As you pointed out, this was finally implemented in Presto 0.79. :)

Here is an example of the cast syntax:

select cast(cast ('[1,2,3]' as json) as array<bigint>);

A special word of advice: unlike Hive, Presto has no 'string' type. If your array contains strings, make sure you use the type 'varchar'; otherwise you get an error message saying 'type array does not exist', which can be misleading.

select cast(cast ('["1","2","3"]' as json) as array<varchar>);
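
To tie the cast back to the original question, here is a minimal sketch that feeds the cast array into UNNEST. It assumes a reasonably recent Presto release, where (as far as I know) casting a varchar to JSON yields a JSON string rather than parsing the text, so json_parse is used instead; the aliases t and item are made up for illustration.

```sql
-- Sketch only: parse the JSON text, cast it to an array of varchar,
-- then expand the array into one row per element.
-- t and item are illustrative aliases, not names from the original answer.
SELECT t.item
FROM UNNEST(
    CAST(json_parse('["1","2","3"]') AS ARRAY(VARCHAR))
) AS t(item);
```

On an older release where cast('...' as json) still parses the text, the inner expression can stay as written in the answer above.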

#2


1  

The problem was that I was running an old version of Presto.

UNNEST was added in version 0.79:

https://github.com/facebook/presto/blob/50081273a9e8c4d7b9d851425211c71bfaf8a34e/presto-docs/src/main/sphinx/release/release-0.79.rst

#3


1  

Here's an example of that:

WITH example(message) AS (
    VALUES
        (JSON '{"payload":[{"type":"b","value":"9"},{"type":"a","value":"8"}]}'),
        (JSON '{"payload":[{"type":"c","value":"7"},{"type":"b","value":"3"}]}')
)

SELECT
    n.type,
    avg(n.value)
FROM example
CROSS JOIN UNNEST(
    CAST(
        JSON_EXTRACT(message, '$.payload')
        AS ARRAY(ROW(type VARCHAR, value INTEGER))
    )
) AS x(n)
WHERE n.type = 'b'
GROUP BY n.type

WITH defines a common table expression (CTE) named example with a column aliased as message.

VALUES returns a verbatim table rowset

UNNEST takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST changes the JSON type into the ARRAY type that UNNEST requires. It could just as easily have been an ARRAY<MAP<, but I find ARRAY(ROW( nicer because you can specify column names and use dot notation in the select clause.

JSON_EXTRACT uses a JSONPath expression to return the array value of the payload key.
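
To make the JSON_EXTRACT step easier to picture, here is a small sketch that runs it on its own against the first sample document from the question. The one-row CTE and the output aliases payload_json and first_value are assumptions for illustration; json_extract_scalar is shown only as a contrast, since it returns a single value as VARCHAR rather than JSON.

```sql
-- Sketch only: inspect what the JSONPath expressions return before any CAST or UNNEST.
WITH example(message) AS (
    VALUES (JSON '{"payload":[{"type":"b","value":"9"},{"type":"a","value":"8"}]}')
)
SELECT
    json_extract(message, '$.payload') AS payload_json,                -- still JSON: the whole array
    json_extract_scalar(message, '$.payload[0].value') AS first_value  -- VARCHAR: one scalar value
FROM example;
```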

avg() and group by should be familiar SQL.
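
For comparison, the ARRAY<MAP< variant mentioned above could look roughly like this. It is a sketch under the same sample data, not a tested alternative: the aliases m and avg_b are made up, and because every map value is a VARCHAR, value has to be cast to a number before averaging.

```sql
-- Sketch only: same query shape, but each array element becomes a MAP instead of a ROW.
WITH example(message) AS (
    VALUES
        (JSON '{"payload":[{"type":"b","value":"9"},{"type":"a","value":"8"}]}'),
        (JSON '{"payload":[{"type":"c","value":"7"},{"type":"b","value":"3"}]}')
)
SELECT
    avg(CAST(m['value'] AS INTEGER)) AS avg_b   -- map values are strings, so cast first
FROM example
CROSS JOIN UNNEST(
    CAST(
        JSON_EXTRACT(message, '$.payload')
        AS ARRAY(MAP(VARCHAR, VARCHAR))
    )
) AS x(m)
WHERE m['type'] = 'b';   -- subscript lookup replaces the dot notation of the ROW version
```

With the two sample rows the b values are 9 and 3, so the average should come out as 6. The trade-off is the one described above: the MAP form needs subscripts and an explicit cast, while the ROW form gives typed, named fields.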
