How to avoid tripping over the UTF-8 BOM when reading files

Date: 2023-01-06 11:11:40

I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.

I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly, whether a BOM is present or not?

3 Answers

#1


52  

With Ruby 1.9.2 and later you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8"){|file|
  text_without_bom = file.read
}

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether a BOM is present in the file or not.

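To convince yourself, here is a quick sketch (the file names are made up for the demo): write the same text once with and once without a BOM, then read both back in bom|utf-8 mode.

File.binwrite('with_bom.txt', "\xEF\xBB\xBFhello")
File.binwrite('without_bom.txt', 'hello')
a = File.read('with_bom.txt', mode: 'r:bom|utf-8')
b = File.read('without_bom.txt', mode: 'r:bom|utf-8')
p a == b  # => true; the BOM is stripped when present and a no-op when absent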

You may also use the encoding option with other commands:

text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(You get an array with all lines).

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8'){|csv|
  csv.each{ |row| p row }
}

#2


10  

I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8. I prefer to deal with it before trying to decode the stream, because BOM handling is so inconsistent from one language/tool/framework to the next.

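A minimal sketch of that manual check (the read_without_bom helper is my own name, not a library method): peek at the first three bytes and skip them only when they really are the UTF-8 BOM.

def read_without_bom(path)
  File.open(path, 'rb') do |f|
    head = f.read(3)
    f.rewind unless head == "\xEF\xBB\xBF".b  # no BOM: start over at byte 0
    f.read.force_encoding(Encoding::UTF_8)
  end
end

text = read_without_bom('feed.txt')  # 'feed.txt' is just an example name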

In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order, it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.

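For the UTF-16 case that sniffing looks roughly like this (a sketch, with path standing in for your file): pick the decoder based on the first two bytes.

head = File.binread(path, 2)
enc  = case head
       when "\xFE\xFF".b then Encoding::UTF_16BE
       when "\xFF\xFE".b then Encoding::UTF_16LE
       else Encoding::UTF_8  # no UTF-16 BOM; fall back to your default
       end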

#3


0  

I wouldn't "trust" a file to be UTF-8 encoded just because a 0xEF 0xBB 0xBF BOM is present; you might fail. Usually a file that starts with a UTF-8 BOM really is UTF-8 encoded, of course. But if, for example, someone has simply prepended the UTF-8 BOM to an ISO-8859-1 file, decoding it as UTF-8 will fail badly whenever the file contains bytes above 0x7F. You can trust the file if it contains only bytes up to 0x7F, because in that case it is a plain ASCII file, which is at the same time a valid UTF-8 file.

If the file does contain bytes above 0x7F after the BOM, then to be sure it is properly UTF-8 encoded you'll have to check that every multi-byte sequence is valid, and, even when all sequences are valid, also check that each codepoint uses the shortest possible sequence and that no codepoint falls in the high- or low-surrogate range. Also check that no sequence is longer than 4 bytes and that the highest codepoint is 0x10FFFF. That upper limit also caps the start byte's payload bits at 0x4 and the first continuation byte's payload at 0xF. If all of these checks pass, your UTF-8 BOM tells the truth.

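In practice Ruby can run this whole battery of checks for you: String#valid_encoding? rejects malformed sequences, overlong (non-shortest) forms, surrogate codepoints, and anything above 0x10FFFF. A minimal sketch of the "does the BOM tell the truth?" test:

bytes = File.binread('file.txt')
bytes = bytes.byteslice(3..) if bytes.start_with?("\xEF\xBB\xBF".b)  # drop the BOM
utf8  = bytes.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding? ? 'the BOM tells the truth' : 'not actually UTF-8'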
