Get the size of a PDF without downloading it

Date: 2023-02-09 15:15:02

Is it possible to find out the size of a PDF, e.g. http://example.com/ABC.pdf, using the requests module in Python without actually downloading it? I am writing an application that postpones the download if the internet connection is slow and the PDF is large.

2 solutions

#1


8  

Use an HTTP HEAD request

The response headers give more details about the file without fetching the whole file.

>>> import requests
>>> url = "http://www.pdf995.com/samples/pdf.pdf"
>>> req = requests.head(url)
>>> req.content
''
>>> req.headers["content-length"]
'433994'
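The header value is a string and may be missing altogether, so it helps to parse it defensively before acting on it. The following is a minimal sketch, not a definitive implementation: the names `parse_content_length`, `should_postpone`, and `remote_size`, the `max_seconds` threshold, and the choice not to postpone when the size is unknown are all assumptions made for illustration. `allow_redirects=True` is passed because `requests.head` does not follow redirects by default.

```python
def parse_content_length(headers):
    """Return Content-Length as an int, or None if it is missing or invalid."""
    for name, value in headers.items():
        if name.lower() == "content-length":
            try:
                size = int(value)
            except (TypeError, ValueError):
                return None
            return size if size >= 0 else None
    return None


def should_postpone(size_bytes, speed_bytes_per_sec, max_seconds=60):
    """Decide whether to postpone, based on the estimated download time.

    Assumption: an unknown size does not postpone the download.
    """
    if size_bytes is None:
        return False
    if speed_bytes_per_sec <= 0:
        return True
    return size_bytes / speed_bytes_per_sec > max_seconds


def remote_size(url, timeout=10):
    """Ask the server for the size with a HEAD request (no body is transferred)."""
    # Third-party dependency, imported here so the helpers above work without it.
    import requests

    response = requests.head(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()
    return parse_content_length(response.headers)
```

For example, `should_postpone(remote_size(url), measured_speed)` would combine the two, where `measured_speed` is a hypothetical bytes-per-second figure your application obtains elsewhere.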

Or try a streaming read:

>>> req = requests.get(url, stream=True)
>>> res = req.iter_content(30)
>>> res
<generator object generate at 0x7f9ad3270320>
>>> next(res)
'%PDF-1.3\n%\xc7\xec\x8f\xa2\n30 0 obj\n<</Len'
>>> next(res)
'gth 31 0 R/Filter /FlateDecode'
>>> next(res)
'>>\nstream\nx\x9c\xed}\xdd\x93%\xb7m\xef\xfb\xfc\x15S\xf7%NU\xf6\xb8'

You can then try to determine the PDF size from the initial bytes of the file and decide whether to continue.
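A related way to use a streaming read is to stop as soon as more bytes have arrived than you are willing to download. This is a minimal sketch of that idea rather than the answer's own method; `read_up_to` and `max_bytes` are hypothetical names, and the function works on any iterable of byte chunks.

```python
def read_up_to(chunks, max_bytes):
    """Collect byte chunks, giving up once the total exceeds max_bytes.

    Returns the joined bytes, or None if the limit was exceeded.
    """
    parts = []
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            return None  # too large: stop reading further chunks
        parts.append(chunk)
    return b"".join(parts)


# Possible usage with requests (assumes the requests package is installed):
#
#     with requests.get(url, stream=True, timeout=10) as response:
#         data = read_up_to(response.iter_content(8192), max_bytes=1000000)
#     if data is None:
#         ...  # postpone the download
```

With `stream=True` the response headers are also available before the body is read, so checking `Content-Length` first can avoid reading any chunks at all.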

Use the Range request header

HTTP allows requesting only a range of bytes.

If your server supports this, you can use a trick: ask for a range of bytes that exists only if the file is too large. If you get some bytes back (and the status indicates success), you know the file is too large.

If you get the exception ChunkedEncodingError: IncompleteRead(0 bytes read), you know the file is smaller.

Call it like this:

>>> headers = {"Range": "bytes=999500-999600"}
>>> req = requests.get(url, headers=headers)

This works only if your server allows serving partial content.
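When a server does honour a Range request, it normally answers with status 206 and a `Content-Range` header such as `bytes 0-0/433994`, where the last field is the total size. So asking for a single byte can reveal the size directly. The sketch below parses that header; `total_size_from_content_range` is a hypothetical helper name, and whether a particular server sends the header is an assumption to verify.

```python
import re

# Matches "bytes 0-0/433994", "bytes */433994" and "bytes 0-0/*".
_CONTENT_RANGE = re.compile(r"^bytes\s+(?:\d+-\d+|\*)/(\d+|\*)$")


def total_size_from_content_range(value):
    """Return the total size from a Content-Range header value, or None."""
    if not value:
        return None
    match = _CONTENT_RANGE.match(value.strip())
    if match is None or match.group(1) == "*":
        return None  # malformed header, or the server did not state the total
    return int(match.group(1))


# Possible usage with requests (assumes the requests package is installed):
#
#     response = requests.get(url, headers={"Range": "bytes=0-0"}, timeout=10)
#     if response.status_code == 206:
#         size = total_size_from_content_range(
#             response.headers.get("Content-Range"))
```

A status of 200 instead of 206 means the server ignored the Range header and is sending the whole file, so check the status code before trusting the result.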

#2


7  

Like this:

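# Python 3; in Python 2 this module was called urllib2
from urllib.request import urlopen
response = urlopen('http://example.com/ABC.pdf')
size_of_pdf = response.headers['Content-Length']  # a string, or None if the header is absent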

The headers are available as soon as urlopen() returns; the body is not read until response.read() is called.
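Note that `urlopen` still issues a GET request, so the server starts sending the body even if you never read it. A HEAD request avoids that with the standard library alone. This is a minimal sketch for Python 3; `head_content_length` is a hypothetical helper name.

```python
from urllib.request import Request, urlopen


def head_content_length(url, timeout=10):
    """Return the Content-Length reported for url via a HEAD request, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    # The header may be absent (e.g. chunked responses) or malformed.
    if value is None or not value.strip().isdigit():
        return None
    return int(value)
```

Calling `head_content_length('http://example.com/ABC.pdf')` would return the size in bytes if the server reports one.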

Take a look at the list of response headers on Wikipedia:

...
Content-Length | The length of the response body in octets (8-bit bytes) | Example: Content-Length: 348 | Status: Permanent
...

The OP asked for a solution using requests, so @JanVlcinsky's answer is more appropriate.
