UTF-8 HTML and CSS files with a BOM (and how to remove the BOM with Python)

Date: 2023-01-06 00:27:13

First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.

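To see why Chrome complains, note that decoding BOM-prefixed bytes with the plain 'utf-8' codec leaves U+FEFF at the front of the string, so the BOM is sent to the browser as content. A minimal sketch (in Python 3 syntax for brevity; the question targets Python 2.5, but the codec behaviour is the same):

```python
# Simulate a template stored with a UTF-8 BOM, as described above.
raw = b'\xef\xbb\xbf<html><body>Hello</body></html>'

# Plain 'utf-8' decoding keeps the BOM as a U+FEFF character...
decoded = raw.decode('utf-8')
assert decoded[0] == '\ufeff'  # this character ends up in the HTTP body

# ...while 'utf-8-sig' strips it.
clean = raw.decode('utf-8-sig')
assert clean.startswith('<html>')
```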
So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!

4 Solutions

#1 (23 votes)

Since you state:

All of my (text) files are currently stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

It automatically removes the expected BOM, and works correctly if the BOM is not present as well.

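That last point is worth demonstrating: 'utf-8-sig' is safe to use unconditionally, whether or not a given template actually starts with a BOM (a quick check, in Python 3 syntax):

```python
with_bom = b'\xef\xbb\xbfhello'
without_bom = b'hello'

# 'utf-8-sig' strips the BOM when present and is a no-op otherwise,
# so it can be applied to every template without checking first.
assert with_bom.decode('utf-8-sig') == 'hello'
assert without_bom.decode('utf-8-sig') == 'hello'
```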
#2 (10 votes)

Check the first character after decoding to see if it's the BOM:

if u.startswith(u'\ufeff'):
  u = u[1:]
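For example, applied to bytes decoded with the plain 'utf-8' codec (Python 3 syntax; the original snippet uses Python 2 u'' literals, but the logic is identical):

```python
# Hypothetical template bytes with a UTF-8 BOM, for illustration.
u = b'\xef\xbb\xbftext'.decode('utf-8')

# After decoding, the BOM is the single character U+FEFF at index 0.
if u.startswith('\ufeff'):
    u = u[1:]

assert u == 'text'
```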

#3 (1 vote)

The previously-accepted answer is WRONG.

u'\ufffe' is not a character. If you get it in a unicode string, somebody has stuffed up mightily.

The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff':

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

Here's a correct and typo/braino-resistant answer:

Decode your input into unicode_str. Then do this:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]

Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.

Update: Unfortunately there seems to be no suitable named constant in the standard Python library.

Alas, the codecs module provides only "a snare and a delusion":

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>

Update 2: If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there was only one way each, then you wouldn't need a BOM, would you?

Here, verbatim and unprettified from my own code, is my solution:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).

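A usage sketch of the function above, ported to Python 3 bytes literals (the sniffing logic is unchanged; note the longest signatures are checked first so UTF-32LE is not mistaken for UTF-16LE):

```python
def check_for_bom(s):
    # Longest signatures first: b'\xFF\xFE\x00\x00' (UTF-32LE) must be
    # tested before b'\xFF\xFE' (UTF-16LE), which is its prefix.
    bom_info = (
        (b'\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        (b'\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        (b'\xEF\xBB\xBF',     3, 'UTF-8'),
        (b'\xFF\xFE',         2, 'UTF-16LE'),
        (b'\xFE\xFF',         2, 'UTF-16BE'),
    )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

# Python's 'utf-16' codec always writes a BOM in native byte order.
data = 'héllo'.encode('utf-16')
enc, siglen = check_for_bom(data)
assert enc in ('UTF-16LE', 'UTF-16BE')

# Decode everything after the BOM with the detected encoding.
text = data[siglen:].decode(enc)
assert text == 'héllo'
```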
If you are paranoid, you could allow for two more (non-standard) UTF-32 byte orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.

#4 (0 votes)

You can use something like this to remove the BOM:

import os, codecs

def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open the file
        f = open(filename, 'rb')

        # read the first 4 bytes
        header = f.read(4)

        # check if we have a BOM...
        bom_len = 0
        encodings = [(codecs.BOM_UTF32, 4),
                     (codecs.BOM_UTF16, 2),
                     (codecs.BOM_UTF8, 3)]

        # ... and skip the appropriate number of bytes
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        f.seek(bom_len)

        # copy the rest of the file
        contents = f.read()
        f.close()
        nf = open(newfilename, 'wb')  # must be opened for writing in binary mode
        nf.write(contents)
        nf.close()
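For on-disk files there is also a simpler alternative, sketched here in Python 3 syntax: read them through the 'utf-8-sig' codec (available since Python 2.5 via codecs.open), which strips the BOM transparently instead of trimming bytes by hand:

```python
import codecs
import os
import tempfile

# Write a hypothetical CSS file with a UTF-8 BOM, for illustration.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'bom.css')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'body { color: red; }')

# Reading with 'utf-8-sig' discards the BOM if present.
with codecs.open(path, 'r', encoding='utf-8-sig') as f:
    text = f.read()

assert text == 'body { color: red; }'
```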
