How to extract a string containing symbols after a pattern in a URL string in Google BigQuery

Date: 2022-11-24 15:30:55

I have two possible forms of a URL string:

http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];
http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord; 

I want to extract Y/lyPw== in both examples, that is, everything before ;id_1 (inside the first pair of brackets, when brackets are present).

The value always comes right after the ?pps= part.

What is the best way to approach this? I want to use the BigQuery query language, since that is where my data sits.

3 answers

#1


7  

Here is one way to build a regular expression to do it:

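-- Legacy SQL: the comma between the two subqueries acts as UNION ALL.
-- The pattern expects a ';' right after 'pps=', as in the sample URLs below.
SELECT REGEXP_EXTRACT(url, r'\?pps=;[\[]?([^;]*);') FROM
(SELECT "http://www.abcexample.com/landpage/?pps=;[XYZXYZ;id_1][XYZZZZ;id_2];[5403;ord];"
  AS url),
(SELECT "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;"
  AS url)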
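
Note that this pattern expects a literal ; right after pps=, as in this answer's sample URLs, whereas the URLs in the question have none. As a rough sketch, here is a variant that makes that ; optional. It is checked with Python's re module rather than in BigQuery; the two regex engines differ in general, though not for the constructs used here. The function name extract_pps is made up for illustration.

```python
import re

# Variant of the answer's pattern: ';?' makes the semicolon after 'pps=' optional.
# Assumption: tried with Python's re module, not with BigQuery's own regex engine.
PATTERN = re.compile(r'\?pps=;?\[?([^;]*);')

def extract_pps(url):
    """Return the first value after '?pps=', or None when the pattern does not match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None

urls = [
    # sample URLs from this answer (with ';' after 'pps=')
    "http://www.abcexample.com/landpage/?pps=;[XYZXYZ;id_1][XYZZZZ;id_2];[5403;ord];",
    "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;",
    # URLs from the question (no ';' after 'pps=')
    "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];",
    "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;",
]

for url in urls:
    print(extract_pps(url))  # XYZXYZ, XYZXYZ, Y/lyPw==, Y/lyPw==
```

If this behaves as expected, the same pattern string could be passed to REGEXP_EXTRACT in place of the original one.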

#2


6  

You can use this regex:

pps=\[?([^;]+)


The idea behind this regex is:

pps=    -> look for the literal pps=
\[?     -> optionally followed by a [
([^;]+) -> capture everything up to (not including) the first semicolon
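
To make the difference between the whole match and the captured group concrete, here is a small sketch that runs this regex over the two URL forms from the question. It uses Python's re module purely as a convenient way to try the pattern; that it behaves the same in BigQuery is an assumption, although nothing in the pattern is engine-specific.

```python
import re

# The regex from this answer, tried with Python's re module
# (assumption: BigQuery's regex engine treats these constructs the same way).
pattern = re.compile(r'pps=\[?([^;]+)')

# The two URL forms from the question.
urls = [
    "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];",
    "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;",
]

results = []
for url in urls:
    m = pattern.search(url)
    # group(0) is the whole match, group(1) is the captured value
    results.append((m.group(0), m.group(1)))
    print(m.group(0), "->", m.group(1))

# prints:
# pps=[Y/lyPw== -> Y/lyPw==
# pps=Y/lyPw== -> Y/lyPw==
```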

So, for both of your URLs, this regex matches pps=, the optional [, and everything up to the first semicolon, and it captures the value after the optional [ (Y/lyPw== in the question's URLs).

In BigQuery, use:

REGEXP_EXTRACT('str', 'reg_exp')

Quoting its documentation:

REGEXP_EXTRACT: Returns the portion of str that matches the capturing group within the regular expression.

Use a query like this:

SELECT
   REGEXP_EXTRACT(word,r'pps=\[?([^;]+)') AS fragment
FROM
   ...

For a working example, you can use:

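-- Legacy SQL: the comma between the two subqueries acts as UNION ALL.
-- The sample rows are the two URL forms from the question.
SELECT
   REGEXP_EXTRACT(url,r'pps=\[?([^;]+)') AS fragment
FROM
(SELECT "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];"
  AS url),
(SELECT "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;"
  AS url)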

#3


2  

This regex should work for you:

(\w+);id_1

It will extract XYZXYZ.

It uses a capture group.
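
One caveat: \w only matches letters, digits and underscores, so this pattern finds XYZXYZ but not a value such as Y/lyPw== from the question, which contains / and =. The sketch below shows the difference, along with one possible variant that anchors on pps= and accepts any character other than a semicolon. Python's re module is used here only to try the regexes, and the helper name first_group is made up for illustration.

```python
import re

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# One URL with a plain alphanumeric value, one with the value from the question.
simple = "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;"
base64_like = "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;"

original = r'(\w+);id_1'
print(first_group(original, simple))       # XYZXYZ
print(first_group(original, base64_like))  # None: '=' and '/' are not matched by \w

# Hypothetical variant: anchor on 'pps=', allow an optional ';' and '[',
# then capture everything up to the ';' that precedes 'id_1'.
variant = r'pps=;?\[?([^;]+);id_1'
print(first_group(variant, simple))        # XYZXYZ
print(first_group(variant, base64_like))   # Y/lyPw==
```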

