How to use regexp_substr to extract all hashtags from a string?

Date: 2022-08-27 23:27:18

I need a regex pattern that extracts all hashtags from the tweets in a table. My data looks like this:

select regexp_substr('My twwet #HashTag1 and this is the #SecondHashtag    sample','#\S+')
from dual

It only returns #HashTag1, not #SecondHashtag.

I need output like #HashTag1 #SecondHashtag

Thanks

1 answer

#1

You can use regexp_replace to remove everything that doesn't match your pattern.

with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select 
  regexp_replace(col, '(#\S+\s?)|.', '\1')
from t;

Produces:

#HashTag1 #SecondHashtag #onemorehashtag
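
One side effect of this pattern: when ordinary text follows the last hashtag, the optional \s? keeps the space after that hashtag, so the result ends with a trailing space. A minimal sketch of one way to handle that, wrapping the call in rtrim (the sample string here is made up for illustration):

```sql
-- Hypothetical sample text; the last hashtag is followed by more words.
select
  rtrim(regexp_replace('start #a middle #b end', '(#\S+\s?)|.', '\1')) as hashtags
from dual;
-- regexp_replace alone gives '#a #b ' (note the trailing space);
-- rtrim strips it, leaving '#a #b'.
```

If tweets can contain line breaks, note that . does not match a newline by default in Oracle; passing 'n' as the match parameter (the sixth argument of regexp_replace) should make it do so.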

regexp_substr returns only one match per call. To get every match, you can turn the string into rows using connect by:

with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select 
  regexp_substr(col, '#\S+', 1, level)
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null;

Returns:

#HashTag1
#SecondHashtag
#onemorehashtag
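
The query above works because t holds a single row. Against a real table with many tweets, a bare connect by would also connect rows to each other and multiply the output. A common workaround, sketched here with a hypothetical tweets table that has a unique id column, is to restrict each hierarchy to its own row:

```sql
-- Hypothetical table: tweets(id, col), with id unique per tweet.
with tweets (id, col) as (
  select 1, 'My tweet #HashTag1 and #SecondHashtag' from dual union all
  select 2, 'Another one #onemorehashtag' from dual
)
select
  id,
  regexp_substr(col, '#\S+', 1, level) as hashtag
from tweets
connect by level <= regexp_count(col, '#\S+')  -- one level per match
       and prior id = id                       -- stay within the same tweet
       and prior sys_guid() is not null        -- keeps Oracle from reporting a loop
order by id, level;
```

A tweet with no hashtags should still come back as a single row with a null hashtag, because the level-1 row is always returned; filter on hashtag is not null in an outer query if that matters.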

EDIT:

\S matches any non-whitespace character, so it also picks up punctuation attached to a hashtag. It would be better to use \w, which matches a-z, A-Z, 0-9 and _.
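
To make the difference concrete, here is a small sketch comparing the two classes on a hashtag that is followed directly by punctuation (the sample string is made up):

```sql
select
  regexp_substr('See #HashTag1, then more', '#\S+') as with_nonspace,  -- '#HashTag1,'
  regexp_substr('See #HashTag1, then more', '#\w+') as with_word       -- '#HashTag1'
from dual;
```

\S+ swallows the comma because a comma is not whitespace, while \w+ stops at it.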

As commented by @mathguy and from this site: a hashtag starts with a letter, after which alphanumeric characters or underscores are allowed.

So the pattern #[[:alpha:]]\w* will work better.

with t (col) as (
  select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
  from dual
)
select 
  regexp_substr(col, '#[[:alpha:]]\w*', 1, level)
from t
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null;

Produces:

#HashTag1
#SecondHashtag
#onemorehashtag
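
Since the question asked for the hashtags on a single line, the rows can also be glued back together. A minimal sketch, assuming Oracle 11g Release 2 or later for listagg and regexp_count:

```sql
with t (col) as (
  select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
  from dual
)
select
  -- Extract the level-th hashtag in each generated row, then join them with spaces.
  listagg(regexp_substr(col, '#[[:alpha:]]\w*', 1, level), ' ')
    within group (order by level) as hashtags
from t
connect by level <= regexp_count(col, '#[[:alpha:]]\w*');
```

For the sample string this should yield #HashTag1 #SecondHashtag #onemorehashtag in one row.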
