Scrape the title by downloading only the relevant part of the webpage

Time: 2022-11-19 20:56:31

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving just the title of a webpage in python, but all of the ones I've found download the entire page before retrieving the title, which seems highly inefficient as most often the title is contained within the first few lines of HTML.

Is it possible to download only the parts of the webpage until the title has been found?

I've tried the following, but page.readline() downloads the entire page.

import urllib2
print("Looking up {}".format(link))
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(link, headers=hdr)
page = urllib2.urlopen(req, timeout=10)
content = ''
while '</title>' not in content:
    content = content + page.readline()

-- Edit --

Note that my current solution makes use of BeautifulSoup constrained to only process the title so the only place I can optimize is likely to not read in the entire page.

from bs4 import BeautifulSoup, SoupStrainer

title_selector = SoupStrainer('title')
soup = BeautifulSoup(page, "lxml", parse_only=title_selector)
title = soup.title.string.strip()

-- Edit 2 --

I've found that BeautifulSoup itself splits the content into multiple strings in the self.current_data variable (see this function in bs4), but I'm unsure how to modify the code to basically stop reading all remaining content after the title has been found. One issue could be that redirects should still work.

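(Editor's note, not part of the original question: a minimal sketch of one way the "stop reading early" idea could work without patching bs4 internals, assuming the requests library and the stdlib html.parser. The response is streamed chunk by chunk and parsing is aborted as soon as the closing </title> tag is seen; redirects are still followed by requests. The helper name quick_title and the chunk size are illustrative assumptions only.)

import requests
from contextlib import closing
from html.parser import HTMLParser

class _TitleFound(Exception):
    """Raised to abort parsing once the title element is complete."""

class _TitleGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_data(self, data):
        if self.in_title:
            self.title += data
    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            raise _TitleFound()

def quick_title(url):
    parser = _TitleGrabber()
    with closing(requests.get(url, stream=True, timeout=10)) as r:
        try:
            for chunk in r.iter_content(chunk_size=1024):
                # decode defensively; a multi-byte character may be cut at a chunk boundary
                parser.feed(chunk.decode('utf-8', errors='ignore'))
        except _TitleFound:
            pass  # title captured, remaining chunks are never requested
    return parser.title.strip()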

-- Edit 3 --

So here's an example. I have a link www.xyz.com/abc and I have to follow this through any redirects (almost all of my links use a bit.ly kind of link shortening). I'm interested in both the title and domain that occurs after any redirections.

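(Editor's note, a minimal illustration rather than part of the original question: assuming the requests library, the redirect chain of a shortened link can be resolved without downloading any body by issuing a HEAD request with allow_redirects=True; the final URL is then on response.url and the domain can be taken from it. The helper name final_domain and the example link are hypothetical.)

import requests
from urllib.parse import urlparse

def final_domain(short_link):
    # follow the whole redirect chain; HEAD transfers headers only, no body
    response = requests.head(short_link, allow_redirects=True, timeout=10)
    return urlparse(response.url).netloc

# hypothetical usage: final_domain('http://bit.ly/xxxxx') -> 'example.com'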

-- Edit 4 --

Thanks a lot for all of your assistance! The answer by Kul-Tigin works very well and has been accepted. I'll keep the bounty until it runs out though to see if a better answer comes up (as shown by e.g. a time measurement comparison).

-- Edit 5 --

For anyone interested: I've timed the accepted answer to be roughly twice as fast as my existing solution using BeautifulSoup4.

6 Answers

#1


12  

You can defer downloading the entire response body by enabling stream mode of requests.

Requests 2.14.2 documentation - Advanced Usage

By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:

...

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)

So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since the redirects will be handled by the library you'll be ready to go.

Here's an error-prone example, tested with Python 2.7.10 and 3.6.0:

try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser

import requests, re
from contextlib import closing

CHUNKSIZE = 1024
retitle = re.compile("<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://example.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = retitle.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)))
            break
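
A small follow-up (editor's sketch, not part of the original answer): because requests has already followed any redirects by the time the body is streamed, the final URL is available as res.url inside the with block above, so the post-redirect domain the question's Edit 3 asks for can be read from it. This reuses the requests and closing imports from the snippet above.

try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3

with closing(requests.get("http://example.com/abc", stream=True)) as res:
    # requests has already followed any redirects at this point,
    # so res.url is the final URL and its netloc is the post-redirect domain
    print(urlparse(res.url).netloc)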

#2


4  

Question: ... the only place I can optimize is likely to not read in the entire page.

This does not read the entire page.

Note: Unicode .decode() will raise an exception if you cut a Unicode sequence in the middle. Using .decode(errors='ignore') removes those sequences.

For instance:

import re
try:
    # PY3
    from urllib import request
except:
    import urllib2 as request

for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
    f = request.urlopen(url)
    re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)',re.DOTALL)
    Found = False
    data = ''
    while True:
        b_data = f.read(4096)
        if not b_data: break

        data += b_data.decode(errors='ignore')
        match = re_obj.match(data)
        if match:
            Found = True
            title = match.groups()[1]
            print('title={}'.format(title))
            break

    f.close()

Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform

Tested with Python: 3.4.2 and 2.7.9

#3


2  

You're scraping webpages using standard REST requests and I'm not aware of any request that only returns the title, so I don't think it's possible.

I know this doesn't necessarily help get the title only, but I usually use BeautifulSoup for any web scraping. It's much easier. Here's an example.

Code:

import requests
from bs4 import BeautifulSoup

urls = ["http://www.google.com", "http://www.msn.com"]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    print "Title with tags: %s" % soup.title
    print "Title: %s" % soup.title.text
    print

Output:

Title with tags: <title>Google</title>
Title: Google

Title with tags: <title>MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos &amp; Videos</title>
Title: MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos

#4


2  

the kind of thing you want i don't think can be done, since the way the web is set up: you get the response for a request before anything is parsed. there isn't usually a streaming "if you encounter <title> then stop giving me data" flag. if there is, i'd love to see it, but there is something that may be able to help you. keep in mind, not all sites respect this, so some sites will force you to download the entire page source before you can act on it, but a lot of them will allow you to specify a Range header. so in a requests example:

import requests

targeturl = "http://www.urbandictionary.com/define.php?term=Blarg&page=2"
rangeheader = {"Range": "bytes=0-150"}

response = requests.get(targeturl, headers=rangeheader)

response.text

and you get

'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#'

now of course here are the problems with this: what if you specify a range that is too short to get the title of the page? what's a good range to aim for? (a combination of speed and assurance of accuracy) what happens if the page doesn't respect Range? (most of the time you just get the whole response you would have gotten without it.)

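a rough sketch of one way to deal with those caveats (an editor's addition, assuming requests; the step and cap sizes are arbitrary): keep asking for the next contiguous byte range until the closing title tag shows up, and treat a non-206 status as "the server ignored Range, the whole page came back anyway".

import re
import requests

def title_via_ranges(url, step=512, max_bytes=8192):
    buf = ''
    start = 0
    while start < max_bytes:
        headers = {'Range': 'bytes={}-{}'.format(start, start + step - 1)}
        r = requests.get(url, headers=headers, timeout=10)
        buf += r.text
        match = re.search(r'<title[^>]*>(.*?)</title>', buf, re.I | re.S)
        if match:
            return match.group(1).strip()
        if r.status_code != 206:
            # server ignored Range (or the range was unsatisfiable): no point asking for more
            return None
        start += step
    return None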

i don't know if this might help you? i hope so. but i've done similar things to only get file headers for download checking.

EDIT4:

so i thought of another kind of hacky thing that might help. nearly every site has a 404 "page not found" page. we might be able to use this to our advantage: instead of requesting the regular page, request something like this.

http://www.urbandictionary.com/nothing.php

the general page will have tons of information, links, data. but the 404 page is nothing more than a message, and (in this case) a video. and usually there is no video. just some text.

but you also notice that the title still appears here. so perhaps we can just request something we know does not exist on any page, like:

X5ijsuUJSoisjHJFk948.php

and get a 404 for each page. that way you only download a very small and minimalistic page. nothing more. which will significantly reduce the amount of information you download. thus increasing speed and efficiency.

here's the problem with this method: you need to check somehow whether the site supplies its own version of the 404 page. most sites have one, because it looks good with the site and it's standard practice to include one, but not all of them do. make sure to handle this case.

but i think that could be something worth trying out. over the course of thousands of sites, it would save many ms of download time for each html.

EDIT5:

so as we talked about, since you are interested in urls that redirect, we might make use of an http HEAD request, which won't get the site content, just the headers. so in this case:

response = requests.head('http://myshortenedurl.com/5b2su2')

replace myshortenedurl with a tinyurl link to follow along.

>>>response
<Response [301]>

nice so we know this redirects to something.

>>>response.headers['Location']
'http://*.com'

now we know where the url redirects to without actually following it or downloading any page source. now we can apply any of the other techniques previously discussed.

Here's an example, using the requests and lxml modules and the 404 page idea. (be aware, i have to replace bit.ly with bit'ly so stack overflow doesn't get mad.)

#!/usr/bin/python3

import requests
from lxml.html import fromstring

links = ["http://bit'ly/MW2qgH",
         "http://bit'ly/1x0885j",
         "http://bit'ly/IFHzvO",
         "http://bit'ly/1PwR9xM"]

for link in links:

    # follow the redirect chain manually with HEAD requests
    # (requests.head() does not follow redirects by default)
    redirect = link
    while True:
        response = requests.head(redirect)
        if response.is_redirect:
            redirect = response.headers['Location']
        else:
            break

    fakepage = redirect + 'X5ijsuUJSoisjHJFk948.php'

    scrapetarget = requests.get(fakepage)
    tree = fromstring(scrapetarget.text)
    print(tree.findtext('.//title'))

so here we get the 404 pages, and it will follow any number of redirects. now here's the output from this:

Urban Dictionary error
Page Not Found - Stack Overflow
Error 404 (Not Found)!!1
Kijiji: Page Not Found

so as you can see we did indeed get our titles. but we see some problems with the method, namely that some titles add things, and some just don't have a good title at all. and that's the issue with that method. we could however try the range method too. the benefit of that would be that the title would be correct, but sometimes we might miss it, and sometimes we have to download the whole page source to get it, increasing the required time.

Also credit to alecxe for this part of my quick and dirty script

tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))

for an example with the range method: in the for link in links: loop, change the code after the try/except statement to this:

rangeheader = {"Range": "bytes=0-500"}

scrapetargetsection = requests.get(redirect, headers=rangeheader)
tree = fromstring(scrapetargetsection.text)
print(tree.findtext('.//title'))

output is:

None
Stack Overflow
Google
Kijiji: Free Classifieds in...

here we see urban dictionary has no title, or i've missed it in the bytes returned. in any of these methods there are tradeoffs. the only way to get close to total accuracy would be to download the entire source for each page, i think.

在这里,我们看到都市词典没有标题或我错过了返回的字节。在任何这些方法中都存在权衡。接近总体准确度的唯一方法是下载我认为的每个页面的整个源代码。

#5


2  

using urllib you can set the Range header to request a certain range of bytes, but there are some consequences:

  • it depends on the server to honor the request

  • you assume that data you're looking for is within desired range (however you can make another request using different range header to get next bytes - i.e. download first 300 bytes and get another 300 only if you can't find title within first result - 2 requests of 300 bytes are still much cheaper than whole document)

  • (edit) - to avoid situations where the title tag splits between two ranged requests, make your ranges overlap; see the 'range_header_overlapped' function in my example code

    import urllib.request

    req = urllib.request.Request('http://www.python.org/')
    req.headers['Range'] = 'bytes=%s-%s' % (0, 300)
    f = urllib.request.urlopen(req)

    # just to verify if the server accepted our range:
    content_range = f.headers.get('Content-Range')
    print(content_range)
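
    A small addition (editor's sketch, not from the original answer): if the server ignored the Range header it answers 200 with the full body instead of 206 Partial Content, so checking the status tells you how much data is actually coming. The variable name head is hypothetical.

    if f.status == 206 and content_range:
        # server honored the range: only the requested 300 bytes are on the wire
        head = f.read().decode(errors='ignore')
    else:
        # server ignored Range: the whole document is being sent,
        # so read only as much as needed to look for the title
        head = f.read(4096).decode(errors='ignore')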

#6


0  

my code also handles cases when the title tag is split between chunks.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue May 30 04:21:26 2017
====================
@author: s
"""

import requests
from html.parser import HTMLParser

#proxies = { 'http': 'http://127.0.0.1:8080' }
urls = ['http://opencvexamples.blogspot.com/p/learning-opencv-functions-step-by-step.html',
        'http://www.robindavid.fr/opencv-tutorial/chapter2-filters-and-arithmetic.html',
        'http://blog.iank.org/playing-capitals-with-opencv-and-python.html',
        'http://docs.opencv.org/3.2.0/df/d9d/tutorial_py_colorspaces.html',
        'http://scikit-image.org/docs/dev/api/skimage.exposure.html',
        'http://apprize.info/programming/opencv/8.html',
        'http://opencvexamples.blogspot.com/2013/09/find-contour.html',
        'http://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html',
        'https://github.com/ArunJayan/OpenCV-Python/blob/master/resize.py']

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''
    def handle_starttag(self, tag, attributes):
        self.match = True if tag == 'title' else False
    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

def valid_content( url, proxies=None ):
    valid = [ 'text/html; charset=utf-8',
              'text/html',
              'application/xhtml+xml',
              'application/xhtml',
              'application/xml',
              'text/xml' ]
    r = requests.head(url, proxies=proxies)
    our_type = r.headers.get('Content-Type', '').lower()
    if not our_type in valid:
        print('unknown content-type: {} at URL:{}'.format(our_type, url))
        return False
    return our_type in valid

def range_header_overlapped( chunksize, seg_num=0, overlap=50 ):
    """
    generate overlapping ranges
    (to solve cases when title tag splits between them)

    seg_num: segment number we want, 0 based
    overlap: number of overlaping bytes, defaults to 50
    """
    start = chunksize * seg_num
    end = chunksize * (seg_num + 1)
    if seg_num:
        overlap = overlap * seg_num
        start -= overlap
        end -= overlap
    return {'Range': 'bytes={}-{}'.format( start, end )}

def get_title_from_url(url, proxies=None, chunksize=300, max_chunks=5):
    if not valid_content(url, proxies=proxies):
        return False
    current_chunk = 0
    myparser = TitleParser()
    while current_chunk <= max_chunks:
        headers = range_header_overlapped( chunksize, current_chunk )
        headers['Accept-Encoding'] = 'deflate'
        # quick fix, as my locally hosted Apache/2.4.25 kept raising
        # ContentDecodingError when using "Content-Encoding: gzip"
        # ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', 
        #                  error('Error -3 while decompressing: incorrect header check',))
        r = requests.get( url, headers=headers, proxies=proxies )
        myparser.feed(r.text)
        if myparser.title:
            return myparser.title
        current_chunk += 1
    print('title tag not found within {} chunks ({}b each) at {}'.format(current_chunk-1, chunksize, url))
    return False
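
A possible way to drive it (editor's addition, not in the original answer), using the urls list defined at the top:

for url in urls:
    title = get_title_from_url(url)
    if title:
        print('{} -> {}'.format(url, title.strip()))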
