Extracting a URL from a String (Ruby) (regular expressions and link shortening)

Time: 2022-09-13 11:15:21

I heard that URI::extract() only returns links that contain a ":". Since the tweet I am grabbing has a link without one, I believe I have to use a regular expression. I need to check for a "swoo.sh/whatever" link and store it in a variable. How can I find the first such link while keeping everything that comes after the "/"? For example, if the tweet says:

Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum

How would I grab the swoo.sh link, and all the different things that come directly after the /?

2 solutions

#1

Here is one approach using match:

# Capture the first "host.tld/path" token made of word characters.
match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
  puts match[1]  # prints swoo.sh/12xfsW
else
  puts "no match"
end

If you also need the simultaneous ability to capture full URLs, then my answer would have to be updated. This only answers your immediate question.
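If only swoo.sh links matter, a narrower pattern may be enough. The following is a minimal sketch, not part of the answer above: the constant SWOOSH_LINK and the helper first_swoosh_link are hypothetical names, and the pattern assumes that "everything after the /" means every non-whitespace character up to the next space.

```ruby
# Hypothetical pattern: the literal host "swoo.sh", a slash, then every
# non-whitespace character that follows.
SWOOSH_LINK = %r{\bswoo\.sh/\S+}

# Hypothetical helper: returns the first swoo.sh link in the text, or nil.
def first_swoosh_link(text)
  text[SWOOSH_LINK]
end

tweet = "Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum"
link = first_swoosh_link(tweet)
puts link.inspect  # prints "swoo.sh/12xfsW"
```

Because \S+ takes everything up to the next whitespace, punctuation typed directly after the link (a trailing period, for instance) would be captured too and may need to be stripped afterwards.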

#2

We can use the fact that URIs can't contain spaces, and that Ruby has URI::Generic, which will parse almost anything that looks URI-ish. Then we just need to filter out non-web URIs, which I do by assuming that every web URI has to start with something like foo.bar.

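require 'uri'
require 'pathname'

# `tweet` is the tweet text, a String (see the example below).
# Split on whitespace, parse each token (nil if it does not parse), then keep
# tokens that have a hostname or whose first path segment looks like foo.bar.
tweet.
  split.
  map { |s| URI.parse(s) rescue nil }.
  select { |u| u && (u.hostname || Pathname(u.path).each_filename.first =~ /\w\.\w/) }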

Example output

tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
# the above returns
# [#<URI::Generic google.com>, #<URI::Generic swoosh.sh/blah?q=bar>, #<URI::HTTP http://google.com/bar>]

This can't really work in general because of ambiguity. "car.net" looks like a shortened link, but in context it could be "my neighbor threw a baseball through my window so i yanked the hubcabs off his car.net gain!!!", where it's clearly just a missing space.
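One way to sidestep some of that ambiguity, if the set of shorteners is known in advance, is to accept only hosts from an allow-list and to require a path after the slash. This is a rough sketch under those assumptions: KNOWN_SHORTENERS, its contents, and the helper shortened_links are all hypothetical and not part of the answer above.

```ruby
# Hosts treated as link shorteners. This list is an illustrative assumption.
KNOWN_SHORTENERS = %w[swoo.sh bit.ly t.co].freeze

# Hypothetical helper: keeps only whitespace-separated tokens of the form
# "host/path" (optionally prefixed with http:// or https://) whose host is
# in the allow-list and whose path is non-empty.
def shortened_links(text, hosts = KNOWN_SHORTENERS)
  text.split.select do |token|
    host, path = token.sub(%r{\Ahttps?://}i, "").split("/", 2)
    hosts.include?(host.to_s.downcase) && !path.to_s.empty?
  end
end

p shortened_links("yanked the hubcabs off his car.net gain, see swoo.sh/12xfsW")
# prints ["swoo.sh/12xfsW"]
```

Here "car.net" is dropped because its host is not on the list and it has no path, at the cost of missing any shortener the list does not name.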
