Extracting a string from a string using regular expressions in the Terminal

Date: 2022-09-13 11:41:30

I have a string like first url, second url, third url and would like to extract only the url after the word second in the OS X Terminal (only the first occurrence). How can I do it?

In my favorite editor I used the regex /second (url)/ and $1 to extract it; I just don't know how to do it in the Terminal.

Keep in mind that url is an actual URL; I'll be using one of these expressions to match it: Regex to match URL

4 answers

#1


42  

echo 'first url, second url, third url' | sed 's/.*second//'

Edit: I misunderstood. Better:

echo 'first url, second url, third url' | sed 's/.*second \([^ ]*\).*/\1/'

or:

echo 'first url, second url, third url' | perl -nle 'm/second ([^ ]*)/; print $1'
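
One detail worth checking with the sample string: the class [^ ]* stops only at a space, so the comma that follows the URL ends up in the capture. The sketch below compares the sed command above with a variant that also excludes the comma. It assumes that URLs never contain a comma; the variable names are made up for illustration.

```shell
input='first url, second url, third url'

# [^ ]* stops only at a space, so the comma after the URL is captured too.
with_comma=$(echo "$input" | sed 's/.*second \([^ ]*\).*/\1/')

# Assumption: URLs never contain a comma, so the class excludes it as well.
without_comma=$(echo "$input" | sed 's/.*second \([^ ,]*\).*/\1/')

echo "$with_comma"      # url,
echo "$without_comma"   # url
```

If the real URLs may contain commas, a stricter URL pattern would be needed in place of the character class.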

#2


8  

The other answer still leaves everything after the desired URL in the output, so I propose the following solution.

echo 'first url, second url, third url' | sed 's/.*second \(url\)*.*/\1/'

In sed you group an expression by escaping the parentheses around it (POSIX basic regular expression syntax).
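
As a small illustration of the grouping syntax, the sketch below runs the same substitution twice: once with sed's default basic regular expressions, where the group is written \( ... \), and once with -E, where plain parentheses group. The variable names are only for illustration.

```shell
input='first url, second url, third url'

# Basic regular expressions (sed's default): groups are written \( ... \).
bre=$(echo "$input" | sed 's/.*second \(url\).*/\1/')

# Extended regular expressions (-E): groups are written ( ... ).
ere=$(echo "$input" | sed -E 's/.*second (url).*/\1/')

echo "$bre"   # url
echo "$ere"   # url
```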

#3


6  

Piping to another process (like 'sed' and 'perl' suggested above) might be very expensive, especially when you need to run this operation multiple times. Bash does support regexp:

[[ "string" =~ regex ]]

Similarly to the way you extract matches in your favourite editor by using $1, $2, etc., Bash fills in the $BASH_REMATCH array with all the matches.

In your particular example:

str="first url1, second url2, third url3"
if [[ $str =~ (second )([^,]*) ]]; then echo "match: '${BASH_REMATCH[2]}'"; else echo "no match found"; fi

Output:

match: 'url2'

Specifically, =~ supports extended regular expressions as defined by POSIX, but with platform-specific extensions (which vary in extent and can be incompatible).
On Linux platforms (GNU userland), see man grep; on macOS/BSD platforms, see man re_format.
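
For repeated use, the same idea can be wrapped in a function. This is a minimal sketch, not a definitive implementation: the name extract_after and its parameters are hypothetical, the keyword is interpreted as a regular expression, and the URL is assumed to contain neither commas nor spaces. Keeping the pattern in a variable and expanding it unquoted avoids quoting surprises with =~.

```shell
# Hypothetical helper: print the token that follows the first "<keyword> ".
extract_after() {
  local keyword=$1 text=$2
  # Assumption: the wanted token contains no comma and no space.
  local re="${keyword} ([^, ]+)"
  if [[ $text =~ $re ]]; then
    # BASH_REMATCH[0] is the whole match, [1] is the first group.
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

result=$(extract_after second 'first url1, second url2, third url3')
first_hit=$(extract_after second 'second a, second b')

echo "$result"      # url2
echo "$first_hit"   # a
```

The second call shows that =~ reports the leftmost match, which fits the "only the first occurrence" requirement in the question.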

#4


0  

While trying this, what you probably forgot was the -E argument for sed.

From sed --help:

  -E, -r, --regexp-extended
                 use extended regular expressions in the script
                 (for portability use POSIX -E).

You don't have to change your regex significantly, but you do need to add .* on both sides of it so that the match covers the whole line and the rest of the string is removed.

This works fine for me:

echo "first url, second url, third url" | sed -E 's/.*second (url).*/\1/'

Output:

url

Here the output "url" is the second "url" in the input string. If you already know that each URL is delimited by a comma and a space, and you don't allow those characters in URLs, then the regex [^,]* should be fine.

Optionally:

echo "first http://test.url/1, second ://test.url/with spaces/2, third ftp://test.url/3" \
     | sed -E 's/.*second ([a-zA-Z]*:\/\/[^,]*).*/\1/'

Which correctly outputs:

://test.url/with spaces/2
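
To make the role of the surrounding .* in this answer concrete, the sketch below compares the substitution with and without it, and then shows a side effect: because the leading .* is greedy, the last "second" in the line is the one that gets matched. The second input string and the variable names are made up for illustration.

```shell
input='first url, second url, third url'

# Without the surrounding .*, only the matched part is replaced; the rest stays.
partial=$(echo "$input" | sed -E 's/second (url)/\1/')

# With .* on both sides, the whole line is replaced by the captured group.
whole=$(echo "$input" | sed -E 's/.*second (url).*/\1/')

# The leading .* is greedy, so with several "second"s the last one wins.
last=$(echo 'second a, second b' | sed -E 's/.*second ([^ ,]*).*/\1/')

echo "$partial"   # first url, url, third url
echo "$whole"     # url
echo "$last"      # b
```

If only the first occurrence is wanted and the keyword can appear more than once, the sed form needs extra care, for example a pattern whose leading part cannot itself contain the keyword.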
