Converting non-ASCII characters from ASCII-8BIT to UTF-8

Date: 2022-01-22 13:18:17

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.\n

With the bytes of the copyright symbol escaped, it looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n

Ruby tells me that the string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:

incompatible character encodings: ASCII-8BIT and UTF-8

I can strip the copyright code out using this regex, which replaces every byte outside the ASCII range:

str.gsub(/[^\x00-\x7F]/n,'?')

to produce this

Cancer Res; 71(3); 1-11. ??2011 AACR.\n

But how can I get a copyright symbol (and various other symbols, such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding but this does not work:

str.force_encoding('utf-8').encode

I realize many other people have similar issues, but I have yet to see a solution that works.

3 Answers

#1


54  

This works for me:

#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
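
Since force_encoding only changes the label on the string and leaves the bytes alone, it can be worth checking that the bytes really are well-formed UTF-8 after relabelling. The following is a small sketch of that check, not part of the original answer; the variable names raw and text are made up for illustration.

```ruby
# Bytes as they might arrive from a remote source, tagged as binary.
raw = "\xC2\xA92011 AACR".force_encoding('ASCII-8BIT')

# dup first so the binary original is left untouched;
# force_encoding changes only the encoding label, never the bytes.
text = raw.dup.force_encoding('UTF-8')

puts text.valid_encoding?               # true when the bytes are well-formed UTF-8
puts raw.bytes.to_a == text.bytes.to_a  # true: the underlying bytes are identical
```

If valid_encoding? returns false, the data is probably in some other encoding and needs transcoding rather than relabelling.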

#2


24  

There are two possibilities:

  1. The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In that case you just need to tell Ruby that the data is already UTF-8 by calling force_encoding.

    For example "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data. And "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.

  2. The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you'd have to tell Ruby what the current encoding is (ASCII-8BIT is Ruby-speak for binary; it isn't a real encoding), then tell Ruby to transcode it.

    For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data: "\xA9".force_encoding('ISO-8859-1'). And this would demonstrate that you can get Ruby to transcode it to UTF-8: "\xA9".force_encoding('ISO-8859-1').encode('UTF-8')
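
The two cases above can be folded into one helper. This is a minimal sketch rather than a definitive implementation: the method name to_utf8 and the ISO-8859-1 fallback are assumptions chosen for illustration, and the validity check is only a heuristic, because bytes from another encoding can occasionally also happen to be valid UTF-8.

```ruby
# Hypothetical helper: returns a UTF-8 copy of +bytes+.
# +fallback+ is the encoding assumed when the bytes are not valid UTF-8.
def to_utf8(bytes, fallback = 'ISO-8859-1')
  candidate = bytes.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?  # case 1: relabel only

  # case 2: label with the assumed source encoding, then transcode
  bytes.dup.force_encoding(fallback).encode('UTF-8')
end

utf8_input   = "\xC2\xA9".force_encoding('ASCII-8BIT')  # copyright sign in UTF-8
latin1_input = "\xA9".force_encoding('ASCII-8BIT')      # copyright sign in ISO-8859-1

p to_utf8(utf8_input)   == "\u00A9"  # true
p to_utf8(latin1_input) == "\u00A9"  # true
```

Working on a dup keeps the caller's string unchanged, which matters because force_encoding modifies its receiver in place.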

#3


6  

I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:

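require 'open-uri'
require 'iconv'
require 'hpricot'
doc = open(DATA_URL)  # DATA_URL: address of the page to scrape
doc.rewind
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.readlines.join("\n")))  # convert to UTF-8, then parse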

I believe that was Ruby 1.8.7; I'm not sure how things are with Ruby 1.9.
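
On Ruby 1.9 the same conversion can be done without Iconv, using the String methods discussed in the other answers. The sketch below is an assumption about how that script might be ported, not the original author's code; the sample bytes are hard-coded so the example runs without a network fetch.

```ruby
# "\xE1\xE2\xE3" is "αβγ" in Windows-1253; a real script would read
# these bytes from the fetched page instead of hard-coding them.
greek_bytes = "\xE1\xE2\xE3".force_encoding('ASCII-8BIT')

# Label the bytes with their real encoding, then transcode to UTF-8.
# The :invalid and :undef options swap unconvertible bytes for a
# replacement character instead of raising an exception.
utf8 = greek_bytes.dup.force_encoding('Windows-1253').
         encode('UTF-8', :invalid => :replace, :undef => :replace)

puts utf8.encoding                   # UTF-8
puts utf8 == "\u03B1\u03B2\u03B3"    # true
```

The resulting string can then be handed to whichever HTML parser the script uses.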
