String#encode does not fix the "invalid byte sequence in UTF-8" error

Time: 2023-01-05 22:31:08

I know there are multiple similar questions about this error, and I've tried many of the suggested fixes without luck. The problem I'm having involves the byte \xA1, which throws

ArgumentError: invalid byte sequence in UTF-8

I've tried the following with no success:

"\xA1".encode('UTF-8', :undef => :replace, :invalid => :replace,
    :replace => "").sub('', '')
"\xA1".encode('UTF-8', :undef => :replace, :invalid => :replace,
    :replace => "").force_encoding('UTF-8').sub('', '')
"\xA1".encode('UTF-8', :undef => :replace, :invalid => :replace,
    :replace => "").encode('UTF-8').sub('', '')

Each line throws the error for me. What am I doing wrong?

UPDATE:

The above lines fail only in IRB. However, I modified my application to encode lines of a CSV file using the same String#encode method and arguments, and I get the same error when reading the line from a file (note: it works if you perform the operations on the same string without using IO).

require 'tempfile'
require 'csv'

bad_line = "col1\tcol2\tbad\xa1"

bad_line.sub('', '') # does NOT fail
puts bad_line # => col1 col2    bad?

tmp = Tempfile.new 'foo' # write the line to a file to emulate real problem
tmp.puts bad_line
tmp.close

tmp2 = Tempfile.new 'bar'

begin
  IO.foreach tmp.path do |line|
    line.encode!('UTF-8', :undef => :replace, :invalid => :replace, :replace => "")
    line.sub('', '') # fail: invalid byte sequence in UTF-8
    tmp2.puts line
  end
  tmp2.close

  # this would fail if the above error didn't halt execution
  CSV.foreach(tmp2.path) do |row|
    puts row.inspect # fail: invalid byte sequence in UTF-8
  end
ensure
  tmp.unlink
  tmp2.close
  tmp2.unlink
end

2 Answers

#1


30  

It would seem that Ruby thinks the string's encoding is already UTF-8, so when you do

line.encode!('UTF-8', :undef => :replace, :invalid => :replace, :replace => "")

it doesn't actually do anything because the destination encoding is the same as the current encoding (at least that's my interpretation of the code in transcode.c)

The real question here is whether your starting data is valid in some encoding that isn't utf-8 or whether this is data that is supposed to be utf-8 but has a few warts in it that you want to discard.
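
One rough way to tell the two cases apart is to ask Ruby whether the raw bytes are valid under each candidate encoding. The sketch below is only an illustration: valid_as? is a hypothetical helper that is not part of the original answer, and ISO-8859-1 is just an assumed candidate.

```ruby
# Hypothetical helper (not from the original answer): reports whether the
# bytes of `str` form a valid string when interpreted as `encoding_name`.
def valid_as?(str, encoding_name)
  str.dup.force_encoding(encoding_name).valid_encoding?
end

raw = "col1\tcol2\tbad\xA1".b # raw bytes, as they might come off disk

puts valid_as?(raw, 'UTF-8')      # false: a lone 0xA1 byte is not valid UTF-8
puts valid_as?(raw, 'ISO-8859-1') # true: every byte is valid ISO-8859-1
```

Keep in mind that any byte sequence is valid ISO-8859-1, so a true result there does not prove that the data was actually written in that encoding.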

In the first case, the correct thing to do is tell ruby what this encoding is. You can do this when you open the file

File.open('somefile', 'r:iso-8859-1')

will open the file, interpreting its contents as iso-8859-1

You can even get ruby to transcode for you

File.open('somefile', 'r:iso-8859-1:utf-8')

will open the file as iso-8859-1, but when you read data from it the bytes will be converted to utf-8 for you.
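
As a minimal sketch of that transcoding mode, assuming the file really is ISO-8859-1: the helper name read_latin1_as_utf8 and the temp file are made up for the demonstration.

```ruby
require 'tempfile'

# Hypothetical helper: read every line of `path`, treating the bytes on disk
# as ISO-8859-1 and transcoding them to UTF-8 as they are read.
def read_latin1_as_utf8(path)
  File.open(path, 'r:iso-8859-1:utf-8') { |io| io.readlines(chomp: true) }
end

Tempfile.create('latin1_demo') do |tmp|
  tmp.binmode                            # write the raw bytes untouched
  tmp.write("col1\tcol2\tbad\xA1\n".b)
  tmp.flush

  line = read_latin1_as_utf8(tmp.path).first
  puts line.encoding        # UTF-8
  puts line.valid_encoding? # true
  line.sub('', '')          # no longer raises ArgumentError
end
```

The byte 0xA1 comes back as the character U+00A1, so string methods such as sub and the CSV parser see valid UTF-8.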

You can also call force_encoding to tell ruby what a string's encoding is (this doesn't modify the bytes at all, it just tells ruby how to interpret them).
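
To make the difference between relabelling and transcoding concrete, here is a small sketch. The helper name relabel is hypothetical, and ISO-8859-1 is again only an assumed source encoding.

```ruby
# Hypothetical helper: return a copy of `str` tagged with `encoding_name`.
# force_encoding only changes the label; the bytes stay the same.
def relabel(str, encoding_name)
  str.dup.force_encoding(encoding_name)
end

raw        = "bad\xA1".b                  # 4 raw bytes
relabeled  = relabel(raw, 'ISO-8859-1')   # same 4 bytes, new label
transcoded = relabeled.encode('UTF-8')    # bytes are rewritten here

puts raw.bytes == relabeled.bytes # true
puts transcoded.bytes.length      # 5: 0xA1 became the two bytes 0xC2 0xA1
puts transcoded.valid_encoding?   # true
```

Once the string carries the right label, encode has a real source encoding to convert from, which is why the second step succeeds.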

In the second case, where you just want to dump whatever nasty stuff has got into your UTF-8, you can't just call encode! as you have, because that's a no-op. In Ruby 2.1 and higher you can use String#scrub; in earlier versions you can do this:

line.encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "")
line.encode!('UTF-8')

We first convert to utf-16. Since this is a different encoding, ruby will actually replace our invalid sequences. We can then convert back to utf-8. This won't lose us any extra data because utf-8 and utf-16 are just two different ways of encoding the same underlying character set.
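
Both approaches can be put side by side in a short sketch. The helper names are hypothetical, and the replacement string "" is taken from the question; pass something like "?" if you would rather see where bytes were dropped.

```ruby
# Hypothetical helper: drop invalid bytes with String#scrub (Ruby 2.1+).
def drop_invalid_with_scrub(str)
  str.scrub('')
end

# Hypothetical helper: the round trip through UTF-16 described above,
# for Rubies that do not have String#scrub.
def drop_invalid_via_utf16(str)
  utf16 = str.encode('UTF-16', :invalid => :replace, :undef => :replace, :replace => '')
  utf16.encode('UTF-8')
end

bad_line = "col1\tcol2\tbad\xA1"

clean_a = drop_invalid_with_scrub(bad_line)
clean_b = drop_invalid_via_utf16(bad_line)

puts clean_a.valid_encoding? # true
puts clean_b.valid_encoding? # true
puts clean_a == clean_b      # true
clean_a.sub('', '')          # no longer raises ArgumentError
```

Neither helper modifies its argument; use scrub! or encode! if you want to change the line in place, as the loop in the question does.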

#2


2  

Maybe you are running this code in IRB. I have had a lot of encoding issues with IRB. In this case, try saving this code as a .rb file and running it from the command line.
