Time: 2023-01-06 00:27:19

Is there a way to remove the BOM from a UTF-8 encoded file?

I know that all of my JSON files are encoded in UTF-8, but the data entry person who edited the JSON files saved them as UTF-8 with a BOM.

When I run my Ruby scripts to parse the JSON, they fail with an error. I don't want to manually open 58+ JSON files and convert each one to UTF-8 without the BOM.

4 solutions

#1


27  

With Ruby >= 1.9.2 you can use the mode r:bom|utf-8.

This should work (I haven't tested it in combination with JSON):

require 'json'

json = nil # define the variable outside the block to keep the data
File.open('file.txt', 'r:bom|utf-8') do |file|
  json = JSON.parse(file.read)
end

It doesn't matter whether the file actually contains a BOM or not.
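If the goal is to remove the BOM from the files themselves rather than skip it on every read, the same encoding option can be used to read each file and write it back. The following is only a sketch: strip_bom! and strip_bom_in_dir! are hypothetical names, and the *.json pattern and the recursive search are assumptions about how the files are laid out.

```ruby
# Hypothetical helper: rewrite one file in place without a leading UTF-8 BOM.
def strip_bom!(path)
  # 'bom|utf-8' drops a leading BOM while reading, if there is one.
  content = File.read(path, encoding: 'bom|utf-8')
  # Writing as plain UTF-8 does not put a BOM back.
  File.write(path, content, encoding: 'utf-8')
end

# Hypothetical batch wrapper: the *.json pattern is an assumption.
def strip_bom_in_dir!(dir)
  Dir.glob(File.join(dir, '**', '*.json')).each { |path| strip_bom!(path) }
end
```

Every matching file is rewritten, including files that never had a BOM, so it is worth trying this on a copy of the data first.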


Andrew remarked that File#rewind can't be used together with the BOM mode.

If you need to rewind, you must remember the position right after opening the file and use pos= instead of rewind:

# Prepare a test file
File.open('file.txt', 'w:utf-8') do |f|
  f << "\xEF\xBB\xBF" # add BOM
  f << 'some content'
end

# Read the file and skip the BOM if there is one
File.open('file.txt', 'r:bom|utf-8') do |f|
  pos = f.pos          # position just after the BOM (0 if there is none)
  p content = f.read   # read and print the file content
  f.pos = pos          # f.rewind would go back to position 0, before the BOM
  p content = f.read   # re-read and print the file content
end

#2


18  

So the solution was to search for the BOM and replace it via gsub!. I forced the encoding of the string to UTF-8 and also forced the pattern to be encoded in UTF-8.

I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string

require 'json'

def read_json_file(file_name, index) # index is not used in this method
  # File.read closes the file again, unlike a bare File.open
  content = File.read("#{file_name}\\game.json").force_encoding('UTF-8')

  # Remove the BOM only at the very start of the content
  content.gsub!(/\A\uFEFF/, '')

  json = JSON.parse(content)

  print json
end

#3


6  

You can also specify the encoding with the File.read and CSV.read methods. In that case you pass only the encoding, not the read mode.

File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')
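Applied to the JSON case from the question, this could look roughly as follows. It is a minimal sketch, and load_json is a hypothetical helper name, not something from the original answers.

```ruby
require 'json'

# Hypothetical helper: parse a JSON file that may or may not start with a BOM.
def load_json(path)
  # The 'bom|utf-8' encoding skips a leading BOM before the text reaches the parser.
  JSON.parse(File.read(path, encoding: 'bom|utf-8'))
end
```

Because File.read opens and closes the file in one call, there is no file handle to rewind afterwards.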

#4


3  

The "bom|UTF-8" encoding works well if you only read the file once, but fails if you ever call File#rewind, as I was doing in my code. To address this, I did the following:

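def ignore_bom
  # Assumes @file was opened with a UTF-8 external encoding, so getc returns the BOM as one character
  return unless @file.pos == 0
  first = @file.getc
  @file.ungetc(first) unless first.nil? || first == "\uFEFF" # push back anything that is not a BOM
end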

This seems to work well. I'm not sure whether there are other similar characters to look out for, but they could easily be built into this method, which can be called any time you rewind or open the file.
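The same idea can also be written at the byte level, so that it does not depend on how the BOM is decoded as a character. This is only a sketch under the assumption that the file was opened without the bom| prefix (for example with "r:utf-8"); rewind_past_bom is a hypothetical name.

```ruby
# Hypothetical helper: rewind io, then step over a leading UTF-8 BOM if present.
def rewind_past_bom(io)
  bom = "\xEF\xBB\xBF".b   # the BOM as raw bytes
  io.rewind
  head = io.read(3)        # read(n) returns raw bytes, or nil at end of file
  io.rewind unless head == bom # no BOM: go back to the very start
  io
end
```

Calling rewind_past_bom(file) in place of file.rewind leaves the read position at the first real character, whether or not the file has a BOM.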
