Scala regex to extract the domain name from a URL

Date: 2022-06-01 21:57:26

I want to extract bell.com from each of the following inputs using a Scala regex. I have tried a few variations without success.

"www.bell.com"
"bell.com"
"http://www.bell.com"
"https://www.bell.com"
"https://bell.com/about"
"https://www.bell.com?token=123"

This is my code, but it does not work: it always prints "not found!".

val pattern = """(?:([http|https]://)?)(?:(www\.)?)([A-Za-z0-9._%+-]+)[/]?(?:.*)""".r
url match {
  case pattern(domain) =>
    print(domain)
  case _ => print("not found!")
}

EDIT: My regex was wrong. Thanks to @Tabo, this is the correct one:

(?:https?://)?(?:www\.)?([A-Za-z0-9._%+-]+)/?.*

3 answers

#1


5  

You can try:


import java.net.URL
import scala.util.Try

val t = "https://www.bell.com?token=123"

// Some(url) if t parses as a URL, None otherwise
val url = Try { new URL(t) }.toOption
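
Parsing alone does not yet return the domain, and inputs without a scheme, such as "www.bell.com", fail to parse as a URL. Here is a minimal sketch of one way to finish the idea. It assumes it is acceptable to prepend a default scheme and to drop a leading "www."; the names HostExtractor and domainOf are hypothetical, and it uses java.net.URI rather than URL:

```scala
import java.net.URI
import scala.util.Try

object HostExtractor {
  // Hypothetical helper: returns the host without a leading "www.",
  // or None when the input cannot be parsed or has no host.
  def domainOf(input: String): Option[String] = {
    // Assumption: inputs without a scheme are treated as http URLs.
    val withScheme = if (input.contains("://")) input else s"http://$input"
    Try(new URI(withScheme))
      .toOption
      .flatMap(uri => Option(uri.getHost)) // getHost is null when no host is present
      .map(_.stripPrefix("www."))
  }
}
```

With this sketch, HostExtractor.domainOf("https://www.bell.com?token=123") and HostExtractor.domainOf("www.bell.com") both give Some("bell.com"). Returning an Option keeps the "not found" case explicit instead of throwing.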

#2


3  

You can use Java's URL class to get the host, or you can look at the Apache libraries.

new java.net.URL("https://www.bell.com?token=123").getHost  // "www.bell.com", still with the "www." prefix

#3


0  

You should probably use the java.net.URL approach, but...

For future reference, there are a couple of issues in your regex. Square brackets define a character class, which matches a single character, so [http|https] is the same as [htps|] (meaning one of 'h', 't', 'p', 's', or '|'). I think you meant http|https, or simply https?.
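
To make the character-class point concrete, here is a small sketch. The object and method names are hypothetical, and String.matches is used because it tests the whole string against the pattern:

```scala
object CharClassDemo {
  // The character class from the question: one character out of h, t, p, s, |.
  val charClass = "[http|https]"
  // Literal "http" followed by an optional 's'.
  val scheme = "https?"

  // String.matches succeeds only if the entire string matches the pattern.
  def matchesCharClass(s: String): Boolean = s.matches(charClass)
  def matchesScheme(s: String): Boolean = s.matches(scheme)
}
```

Here CharClassDemo.matchesCharClass("h") and CharClassDemo.matchesCharClass("|") are true, while CharClassDemo.matchesCharClass("http") is false. CharClassDemo.matchesScheme accepts both "http" and "https".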

Also, if you only want to extract the domain, you should have exactly one capturing group. Note that (?:blah) denotes a non-capturing group, while (blah) is a capturing group. The three capturing groups in your regex are ([http|https]://), (www\.), and ([A-Za-z0-9._%+-]+). You really only want the last one.
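
The group count matters in Scala because a regex used as a pattern only matches when the number of binders equals the number of capturing groups. The sketch below illustrates this with the regex from the question; the object and method names are hypothetical:

```scala
object GroupCountDemo {
  // The regex from the question, which has three capturing groups.
  val original =
    """(?:([http|https]://)?)(?:(www\.)?)([A-Za-z0-9._%+-]+)[/]?(?:.*)""".r

  // One binder against three groups: this case can never match.
  def oneBinder(url: String): String = url match {
    case original(domain) => domain
    case _                => "not found!"
  }

  // Three binders against three groups: the arity now lines up.
  def threeBinders(url: String): String = url match {
    case original(_, _, domain) => domain
    case _                      => "not found!"
  }
}
```

GroupCountDemo.oneBinder("bell.com") gives "not found!", while GroupCountDemo.threeBinders("bell.com") gives "bell.com". Fixing the arity alone is not enough, though: because of the character-class problem, GroupCountDemo.threeBinders("http://www.bell.com") gives "http".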

Try:

(?:https?://)?(?:www\.)?([A-Za-z0-9._%+-]+)/?.*

Test it here - https://regex101.com/r/xW4iY7/2
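
Plugged back into the pattern match from the question, the corrected regex could be used roughly as follows. This is a sketch: the DomainRegex object and its extract helper are hypothetical names, and returning an Option instead of printing is a design choice made here.

```scala
object DomainRegex {
  // Corrected pattern from this answer: a single capturing group for the domain.
  private val pattern = """(?:https?://)?(?:www\.)?([A-Za-z0-9._%+-]+)/?.*""".r

  // Hypothetical helper: Some(domain) when the whole input matches, otherwise None.
  def extract(url: String): Option[String] = url match {
    case pattern(domain) => Some(domain)
    case _               => None
  }
}
```

All six inputs from the question give Some("bell.com"). Note that the pattern only strips a leading "www.", so any other subdomain stays in the captured value.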

