Google BigQuery Python client library: regular expression error in SQL SELECT

I'm trying to query a Google BigQuery table using the regex from this blog post. Here it is, slightly modified:

pd\.([^”,\.\(\,\`) \’:\[\]\/\\={}]*)

An example of its usage on regex101

However, it does not work in the SQL query I send through the Google BigQuery Python client:

query_results = client.run_sync_query(
"""
SELECT
  REGEXP_EXTRACT(SPLIT(content, '\n'),
                 r'pd\.([^”,\.\(\,\`) \’:\[\]\/\\={}]*)')
FROM
  [fh-bigquery:github_extracts.contents_py]
LIMIT 10
""")

query_results.run()

data = query_results.fetch_data()
data

BadRequest: BadRequest: 400 Failed to parse regular expression "pd.([^”,.(\,`) \’:[]/\={}]*)": invalid escape sequence: \’

1 solution

#1

The problem here is that BigQuery uses the re2 library for its regex operations.

If you try the same regex using the golang flavor, you will see the exact same error (Go's regexp package follows the same RE2 syntax).

So simply removing the backslash in front of the ’ character may already be enough to get it working (it seemed to work properly when I tested it here).
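
As a rough illustration of that fix, here is a small Python sketch that drops the backslash in front of any non-ASCII character and leaves every other escape alone. The helper name `strip_non_ascii_escapes` is made up for this example, and the underlying assumption (that RE2 only accepts a backslash before ASCII punctuation, not before characters such as ’) is inferred from the error message above rather than taken from the BigQuery documentation.

```python
import re


def strip_non_ascii_escapes(pattern):
    """Remove the backslash from escapes like \\’ (hypothetical helper).

    Escapes are consumed pair by pair from left to right, so an escaped
    backslash (two backslashes) is kept intact and never mistaken for
    the start of another escape.
    """
    def fix(match):
        escaped_char = match.group(1)
        # Assumption: only non-ASCII characters need their backslash removed.
        if ord(escaped_char) > 127:
            return escaped_char
        return match.group(0)

    return re.sub(r'\\(.)', fix, pattern, flags=re.DOTALL)


# The pattern from the question, with the escaped curly quote.
original = r'pd\.([^”,\.\(\,\`) \’:\[\]\/\\={}]*)'
print(strip_non_ascii_escapes(original))
```

The cleaned pattern can then be pasted into the query in place of the original one.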

Another issue you might run into is that the result of the SPLIT operation is an ARRAY. BigQuery will then refuse to process your query, reporting that the signature of REGEXP_EXTRACT does not allow ARRAY<STRING> as input. You could use REGEXP_REPLACE instead:

"""
SELECT
  REGEXP_EXTRACT(REGEXP_REPLACE(content, r'.*(\\n)', ''),
                 r'pd\.([^”,\.\(\,\`) ’:\[\]\/\\={}]*)')
FROM
  [fh-bigquery:github_extracts.contents_py]
LIMIT 10
"""

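In this operation every match of the first pattern is replaced by "", and the result is a STRING, which REGEXP_EXTRACT accepts. Note that the pattern .*(\n) matches the text in front of each newline as well as the newline itself.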
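
To make the ARRAY versus STRING point concrete, here is a local Python sketch of what "split, then extract" means: the pattern has to be applied to each line separately, because the split yields a list rather than a single string. It uses Python's `re` module, not RE2, so it only illustrates the logic and does not prove that BigQuery accepts the pattern. The function name `extract_per_line` and the sample text are invented for this example.

```python
import re

# Pattern from the answer (curly quote left unescaped).
PATTERN = re.compile(r'pd\.([^”,\.\(\,\`) ’:\[\]\/\\={}]*)')


def extract_per_line(content):
    """Split on newlines, then apply the pattern to each line on its own."""
    results = []
    for line in content.split("\n"):
        match = PATTERN.search(line)
        if match:
            # Group 1 is whatever follows "pd." up to the first excluded character.
            results.append(match.group(1))
    return results


# Hypothetical file content, standing in for one row of the `content` column.
sample = "import pandas as pd\ndf = pd.DataFrame(rows)\nout = pd.concat([df, df])"
print(extract_per_line(sample))
```

Passing the whole list to a single extract call is the mistake the error about ARRAY<STRING> points at; the loop is the per-element step that was missing.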
