How to extract the top-level domain (TLD) from a URL

Date: 2022-12-04 08:20:01

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:


'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (since they change)?
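The failure mode is easy to reproduce (using Python 3's urllib.parse; the question's urlparse module is its Python 2 name): any registry that sells names under a second-level label makes the two-label heuristic drop the real name.

```python
from urllib.parse import urlparse

def naive_domain(url):
    # Keep only the last two dot-separated labels of the host.
    return '.'.join(urlparse(url).netloc.split('.')[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com  -- as intended
print(naive_domain("http://www.foo.com.au"))  # com.au   -- "foo" is lost
```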

Thanks.

7 answers

#1


41  

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).


You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!).
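To make that concrete, here is a toy version of such an auxiliary table, with a hand-picked suffix set just large enough to tell the two examples apart. A real table would be the full Public Suffix List, and this sketch ignores its wildcard and exception rules:

```python
from urllib.parse import urlparse

# Toy auxiliary table: a hand-picked sample of public suffixes.
# Italy's "it" is listed but "co.it" is not (co.it is an ordinary
# registrable name), while the UK's "co.uk" is a suffix in its own right.
PUBLIC_SUFFIXES = {"it", "uk", "co.uk", "au", "com.au"}

def registered_domain(url):
    """The registered domain is the longest matching public suffix
    plus one more label to its left."""
    labels = urlparse(url).netloc.split(".")
    # Smaller i means a longer suffix, so the first hit is the longest match.
    for i in range(len(labels)):
        if i > 0 and ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    raise ValueError("no known public suffix in %r" % url)

print(registered_domain("http://zap.co.it"))  # co.it  (zap is a subdomain)
print(registered_domain("http://zap.co.uk"))  # zap.co.uk
```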

#2


40  

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:


tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

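The list's format is simple enough to handle by hand as well: lines starting with // are comments, a leading *. marks a wildcard rule, and a leading ! marks an exception. A minimal stdlib-only parse of a few illustrative rules (the sample below is a tiny excerpt, not the full list):

```python
# A tiny excerpt of Public-Suffix-List-style rules (illustrative only).
SAMPLE = """\
// this is a comment
uk
co.uk
*.ck
!www.ck
"""

rules = {"plain": set(), "wildcard": set(), "exception": set()}
for line in SAMPLE.splitlines():
    line = line.strip()
    if not line or line.startswith("//"):
        continue  # skip blanks and comments
    elif line.startswith("!"):
        rules["exception"].add(line[1:])  # e.g. www.ck is registrable
    elif line.startswith("*."):
        rules["wildcard"].add(line[2:])   # e.g. anything.ck is a suffix
    else:
        rules["plain"].add(line)

print(rules)
```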

#3


40  

Using this file of effective TLDs, which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load TLDs, ignoring comment and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url).netloc.split('.')
    # url_elements = ["abcde", "co", "uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde", "co", "uk"]
        #    i=-2: ["co", "uk"]
        #    i=-1: ["uk"] etc.

        candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match against the rules:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i - 1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:


abcde.co.uk

I'd appreciate it if someone could let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?
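By way of reply to that question, one possible tightening (a sketch, not authoritative): keep the rules in a set, which turns each membership test from an O(n) list scan into an O(1) lookup, and walk positive indices with enumerate instead of negative ones. ValueError still seems a reasonable choice for the no-match case.

```python
from urllib.parse import urlparse

def get_domain(url, tlds):
    """tlds: a set of suffix rules (plain, '*.'-wildcard, '!'-exception)."""
    labels = urlparse(url).netloc.split('.')
    for i, _ in enumerate(labels):  # longest suffix first
        candidate = '.'.join(labels[i:])
        wildcard = '.'.join(['*'] + labels[i + 1:])
        if '!' + candidate in tlds:
            return candidate  # exception rule: the name itself is registrable
        if candidate in tlds or wildcard in tlds:
            if i == 0:
                raise ValueError("URL is a bare public suffix")
            return '.'.join(labels[i - 1:])
    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", {"uk", "co.uk"}))  # abcde.co.uk
```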

#4


19  

Using the Python tld package:

https://pypi.python.org/pypi/tld


Install

pip install tld

Get the TLD name as a string from the given URL:

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk


or without a protocol:

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk


Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL:

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

#5


2  

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt


Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains


Here's another list:

http://www.iana.org/domains/root/db/


#6


0  

Here's how I handle it:


import re
import sys
from urllib.parse import urlparse

if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url).netloc
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)

#7


0  

Until get_tld is updated for all the new TLDs, I pull the TLD from the error message. Sure, it's bad code, but it works.

import re
from tld import get_tld

def tld_from_url(self):
    # renamed: a method called get_tld would shadow tld's own get_tld
    try:
        return get_tld(self.content_url)
    except Exception as e:
        match = re.search(r"Domain ([^ ]+) didn't match any existing TLD name!", str(e))
        if match:
            return match.group(1)
        raise
