Ruby, `match': invalid byte sequence in UTF-8

Time: 2023-01-06 14:33:17

I have a problem with UTF-8 encoding. I have read some posts here, but it still does not work properly.

This is my code:

#!/bin/env ruby
#encoding: utf-8

def determine
  file=File.open("/home/lala.txt")          
  file.each do |line|           
    puts(line)
    type = line.match(/DOG/)
    puts('aaaaa')

    if type != nil 
      puts(type[0])
      break
    end        

  end
end

These are the first 3 lines of my file:

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
text/lalalalala1.0.0.1515
text/lalalala�DOG

When I run this code, it raises an error exactly when it reads the third line of the file (the one containing the word DOG):

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
aaaaa

text/lalalalala1.0.0.1515
aaaaa

text/lalalala�DOG
/home/kik/Desktop/determine2.rb:16:in `match': invalid byte sequence in UTF-8 (ArgumentError)

BUT: if I run just a determine function with the following content:

#!/bin/env ruby
#encoding: utf-8

def determine
  type = "text/lalalala�DOG".match(/DOG/)
  puts(type)
end

it works perfectly.

What is going wrong there? Thanks in advance!

EDIT: The third line in the file is:

text/lalalal»DOG

BUT when I print the third line of the file in Ruby, it shows up like:

text/lalalala�DOG

EDIT2:

This format was also developed to support localization. Strings stored within the file are stored as 2-byte Unicode characters. The file is a binary file with data stored in network byte order (big-endian format).

3 solutions

#1


3  

I believe @Amadan is close, but has it backwards. I'd do this:

File.open("/home/lala.txt", "r:ASCII-8BIT")

The character is not valid UTF-8, but for your purposes, it looks like 8-bit ASCII will work fine. My understanding is that Ruby is using that encoding by default when you just use the string, which is why that works.
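
To make the difference between the two modes concrete, here is a minimal sketch. It assumes the stray character is the single byte 0xBB ("»" in ISO-8859-1), which is a guess based on the question's edit; the temporary file and the first_dog helper are hypothetical and only stand in for /home/lala.txt and the original determine method.

```ruby
require "tempfile"

# Hypothetical helper: scan a file line by line and return the first "DOG"
# match, opening the file with the given mode string.
def first_dog(path, mode)
  File.open(path, mode) do |file|
    file.each_line do |line|
      found = line.match(/DOG/)
      return found[0] if found
    end
  end
  nil
end

Tempfile.create("lala") do |tmp|
  tmp.binmode
  # Assumption: 0xBB is the offending byte. On its own it is not valid UTF-8.
  tmp.write("text/lalalalala1.0.0.1515\ntext/lalalala\xBBDOG\n".b)
  tmp.flush

  begin
    first_dog(tmp.path, "r:UTF-8")
  rescue ArgumentError => e
    # Regexp matching refuses strings whose bytes are invalid in their encoding.
    puts "r:UTF-8      -> #{e.message}"
  end

  # Tagged as binary, every byte is acceptable, so the ASCII pattern matches.
  puts "r:ASCII-8BIT -> #{first_dog(tmp.path, 'r:ASCII-8BIT')}"
end
```

Reading as ASCII-8BIT is enough when the pattern is plain ASCII, as /DOG/ is here; the non-ASCII bytes are simply carried along unchanged.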

Update: Based on your most recent comment, it sounds like this is what you need:

File.open("/home/lala.txt", "rb:UTF-16BE")
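
A string tagged UTF-16BE cannot be matched directly against an ASCII pattern such as /DOG/, so a conversion step is likely needed after reading. The sketch below assumes the whole sample is UTF-16BE text, which may not hold for a binary file that only embeds such strings; the helper name read_utf16be_as_utf8 and the temporary file are hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: read a file whose entire content is UTF-16BE text
# and return it transcoded to UTF-8.
def read_utf16be_as_utf8(path)
  text = File.open(path, "rb:UTF-16BE") { |file| file.read }
  text.encode(Encoding::UTF_8)
end

Tempfile.create("lala16") do |tmp|
  tmp.binmode
  # Write the sample as raw UTF-16BE bytes (\u00BB is the "»" character).
  tmp.write("text/lalalala\u00BBDOG".encode(Encoding::UTF_16BE).b)
  tmp.flush

  utf8 = read_utf16be_as_utf8(tmp.path)
  puts utf8.encoding          # UTF-8
  puts utf8.match(/DOG/)[0]   # DOG
end
```

The "b" in the mode string matters because UTF-16 is not ASCII-compatible, and Ruby requires binary mode for such encodings.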

#2


1  

Try using this:

File.open("/home/lala.txt", "r:UTF-8")

It seems the wrong encoding is being used at some stage. The #encoding: utf-8 comment specifies only the encoding of the source file, which affects how string literals are interpreted; it has no effect on the encoding that File.open uses.
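
The distinction between a literal and file data can be checked directly. This is a rough sketch under two assumptions: that the "�" pasted into the source is the Unicode replacement character U+FFFD (itself valid UTF-8), and that the byte on disk is 0xBB. The first_line helper and the temporary file are hypothetical.

```ruby
require "tempfile"

# A literal takes its encoding from the source file (or from a \u escape).
# U+FFFD is the replacement character, which is valid UTF-8.
literal = "text/lalalala\uFFFDDOG"
puts literal.encoding          # UTF-8
puts literal.valid_encoding?   # true

# Hypothetical helper: return the first line of a file read with the given mode.
def first_line(path, mode)
  File.open(path, mode) { |file| file.gets }
end

Tempfile.create("lala") do |tmp|
  tmp.binmode
  tmp.write("text/lalalala\xBBDOG\n".b)   # assumed offending byte: 0xBB
  tmp.flush

  line = first_line(tmp.path, "r:UTF-8")
  puts line.encoding           # UTF-8: the tag comes from the mode string
  puts line.valid_encoding?    # false: the bytes on disk are not valid UTF-8

  # String#scrub replaces invalid byte sequences so the line can be matched.
  cleaned = line.scrub("?")
  puts cleaned.match(/DOG/)[0]
end
```

Opening with "r:UTF-8" only labels the data as UTF-8; it does not repair bytes that are invalid in that encoding, which is why checking valid_encoding? or scrubbing can be useful before matching.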

#3


-1  

A simple solution when there are only a few files:

@Katja, open the file in a text editor, choose Save As, change the format to UTF-8, and click OK. A prompt will ask whether to replace the file or create a new one. Replace the existing file and you are done.
